Computation and the Transformation of Practically Everything: Life Sciences
GIFFORD: Today, this morning, we're going to talk about computation and transformation of the life sciences. And it's transforming many things here at MIT. One thing is our educational system. For example, we recently announced a joint program at the undergraduate level between computer science and biology, [? which was ?] extraordinarily exciting.
And our speakers are going to provide different insights into the aspects of how computation is really influencing life sciences here at MIT, broadly construed. So I'm very pleased to introduce Eric Lander, who's a professor, here, of biology at MIT, Professor of Systems Biology at Harvard, and the founding director of the Broad Institute. Please welcome Eric.
LANDER: Thanks, Dave. Can you hear me? Excellent. I'm just delighted to be doing this. 150 years of MIT, wow. I have been at MIT since 1985 in one form or another-- so about 25 years or so, or, rather terrifyingly, 1/6 of the span of MIT. Somehow that concerns me.
I want to talk about, as part of your transformation of practically everything, the transformation of genetics-- the transformation, in a very deep and fundamental way, of the life sciences. Here's a picture to start off with. These are chromosomes viewed in the microscope.
Biologists reveled in looking down the microscope at things, even if they didn't know what the things were. These are chromosomes, which, since nobody had any idea what they were when it was discovered that they stained with a dye, were called "colored things"-- that's what "chromosome" means. Biologists just love details.
When I was in high school, I hated biology, because it just seemed like it was tons of details. There were no general principles. Indeed, as a high school student, and as a college student, and getting my own PhD, I was a mathematician. I got my PhD in mathematics and algebraic combinatorics and never imagined that I would be interested in biology, because I just didn't go in for all of these fuzzy details.
When a biologist wanted to study a disease-- oh, heart disease, brain degeneration, bowel disorders, autoimmune diseases-- you had to be an expert in the heart, the brain, the bowel, or the immune system. There was nothing general, nothing generic.
Of course, we know the great power of mathematics and computation is the general-- being able to see all things as essentially the same thing from the right perspective. How did that change in biology? That is the story of what has changed in biology over the course of the last 30 years or so: biology has become, in a very deep way, about information. And that has changed everything about biology, including building extraordinary ties between the biological sciences at MIT and the computational sciences.
It's nowhere better illustrated than by looking at the Whitehead Institute, built in the early 1980s by one of the greatest biologists of the second half of the 20th century, David Baltimore. When he designed the Whitehead Institute in around 1983, there was absolutely no provision whatsoever for computers. At the last minute, they scrambled to put in a little mainframe room on the second floor, but they never imagined that computers would matter.
Next to it stands the Broad Institute, built some time later-- about 25 years later-- which I direct, and roughly half of what the Broad Institute does is computational. No serious, card-carrying genomicist today lacks reasonable training in the computational sciences. It's a total sea change. Now, how did this come about?
Well, it came about with the understanding that there were generic approaches you could take. Actually, there was an old generic approach, dating back to 1913 in fruit fly genetics, that made it possible to trace the inheritance of a disease in a family. In a family of fruit flies in 1913, you would trace one trait and another trait, and if they showed co-inheritance, you could say that the two genes that caused them must be nearby on the same chromosome. It was a mathematical inference you could draw.
But you couldn't really do this for human beings until a little bit of DNA sequencing technology came along, because you couldn't make crosses between humans, and you couldn't follow traits like curly wings and funny bristles and all that. But you could follow DNA spelling differences. And if you found a DNA spelling difference-- such as is shown here, where dad is passing it on to four of his eight children-- and it tracked along with a disease that was also passing in the family, and you saw that for enough families, you could eventually conclude that whatever caused the disease was in fact nearby, on one of the chromosomes, to one of those spelling differences.
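The co-inheritance logic can be sketched in a few lines. This is an illustrative toy only-- real linkage analysis computes LOD scores over many families-- and the names and data below are made up:

```python
# Toy sketch of the linkage idea: if a DNA spelling difference (a marker) is
# co-inherited with a disease across meioses, the disease gene is probably
# nearby on the same chromosome. Hypothetical family data, not a real method.

def concordance(marker_carried, affected):
    """Fraction of children whose marker status matches their disease status."""
    matches = sum(1 for m, a in zip(marker_carried, affected) if m == a)
    return matches / len(marker_carried)

# Dad passes the marker to 4 of his 8 children; the same 4 are affected.
children_marker = [True, True, True, True, False, False, False, False]
children_affected = [True, True, True, True, False, False, False, False]

print(concordance(children_marker, children_affected))  # 1.0: perfect co-inheritance
```

A concordance near 1.0 across enough families suggests the marker and the disease gene sit close together; real analyses quantify that probabilistically rather than with a raw fraction.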
That was an abstract inference that you could make regardless of whether it was a heart disease, a brain degeneration, or whatever. You therefore knew you were close by. You could then use that piece of DNA, get the next piece of DNA, the next piece, next piece, next piece, next piece of DNA-- in an unbelievably boring process called chromosomal walking.
I come from Brooklyn, and I call it chromosomal schlepping. It took five years-- five years-- to go from a linked genetic marker that showed 99% correlation to actually isolating the gene for cystic fibrosis. Ah, but when you got there, you were rewarded with this-- lots of letters.
Well, those three letters-- CTT-- are deleted on the vast majority of chromosomes with cystic fibrosis. This gene is the cause of cystic fibrosis, and this particular mutation is the most common cause. And you could test people very quickly, by amplification reactions, to see who's a carrier.
Not only that, you could learn what this gene does without having any prior knowledge by computing. You could take the sequence of the gene, throw it into a computer, and ask the computer to do a kind of string comparison. Is the string of amino acids encoded by this gene similar to-- by some definition-- any other gene you'd seen before? And bingo-- it was similar to dozens of other genes that encoded proteins that sat in the membrane and transported things back and forth. Congratulations, you've probably found a transporter gene.
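That "throw it into a computer" step can be caricatured like so. This is a sketch only-- the real tools use alignment with substitution matrices, not raw identity-- and the sequences and database entries here are invented:

```python
# Toy version of the functional-inference-by-similarity idea: compare a new
# gene's amino-acid string to every known protein and report the closest hit.
# The database and sequences below are hypothetical.

def identity(a, b):
    """Fraction of matching positions between two strings (crude similarity)."""
    return sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))

known_proteins = {
    "transporter_1": "MKTLLVAGGF",   # a membrane transporter (hypothetical)
    "kinase_7":      "MAQWYRDEST",   # an unrelated enzyme (hypothetical)
}
query = "MKTLLVSGGF"                 # the newly isolated gene's product

best = max(known_proteins, key=lambda name: identity(query, known_proteins[name]))
print(best)  # transporter_1 -- a hypothesis about function, handed to you
```

The best hit is a hypothesis generator, exactly as described: the query resembles known transporters, so you go test whether it transports something.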
You're handed a hypothesis on a silver platter. Whereas the official biological path was to wait five or six years to somehow stumble into a hypothesis randomly, you now had a hypothesis generator of enormous power. Well, of course, the power of that hypothesis generator depended on how much data there was in there.
There was a network effect. The more that was in there, the more you could learn, the more you could learn about other things. Now, the problem with the story I've just told you was this took five years. It took tens of millions of dollars, hundreds of people to get one gene-- cystic fibrosis.
That's too much work. That was the purpose of the Human Genome Project. It was to begin to generate data on a much more efficient, rapid, large scale so we could collect all these data, and then to increase the tools-- to improve the tools-- for being able to do this kind of work, to make genetic maps of those spelling differences that you could use to trace inheritance, physical maps of all the DNA so you wouldn't have to schlep up and down the chromosomes yourself, all the sequence maps so you could just double-click on a region and see what was there, gene lists so you would already know how to parse up where the genes were, and to make sure all the information was totally, freely available to everybody without restriction.
The Human Genome Project was successful. Around 2001, a draft sequence was published. It took about 11 years to get to that point, about 13, 14 years to get to a finished sequence that had virtually the whole thing laid out with very few errors.
3 billion nucleotides, we had to cover them about seven-fold, on average, to get enough coverage-- about 20-odd billion nucleotides of data collected. Whole world working together, coming down a learning curve. We were so pleased with ourselves.
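The arithmetic behind those numbers is simple enough to check directly:

```python
# Coverage arithmetic from the talk: covering ~3 billion bases about
# seven-fold on average means collecting on the order of 20 billion bases.

genome_size = 3_000_000_000   # ~3 billion nucleotides in the human genome
mean_coverage = 7             # seven-fold average coverage

bases_collected = genome_size * mean_coverage
print(bases_collected)        # 21000000000, i.e. "about 20-odd billion"
```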
But then what's happened in the last five or six years is really breathtaking. Technology has shifted in a way that even you guys who work in computer science would find impressive. All sorts of new machines have come along to give us a new world of DNA sequencing, and generating data at an amazing clip.
Machines that do massively parallel sequencing-- instead of looking at DNA sequence on a line through some little capillary and observing it, you array it out on a slide. You can simultaneously follow many, many, many different DNA sequencing reactions all going on in parallel by optical methods. And here's what it felt like.
Back in 1999, we were so proud of ourselves, at the Broad Institute, to have done 1 billion bases of DNA sequence-- and to get up to 20 billion bases by the time of the height of the Human Genome Project, and to 70 billion bases by 2006 per year. Then the new technologies came along. The next two years looked like this-- 1,700 billion bases. The next year, 19,000 billion bases. Last year, 125,000 billion bases. This year, probably about 10-fold higher.
If I put this on a cost curve of the sort you guys would recognize, Moore's law is shown here on the log scale. Sequencing costs are shown there in red. They decreased, over the course of a decade, by 100,000-fold, running twice as fast as Moore's law, and it ain't done yet. Now, it will reach some limits.
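A rough sanity check of that comparison, assuming the common doubling-every-18-months formulation of Moore's law:

```python
# Back-of-the-envelope comparison: Moore's law (assumed here as a doubling
# every 18 months) gives roughly a 100-fold improvement per decade, while
# sequencing costs fell about 100,000-fold over the same span.

months = 10 * 12                      # one decade
moore_fold = 2 ** (months / 18)       # ~102-fold in ten years
sequencing_fold = 100_000             # the figure quoted in the talk

print(round(moore_fold))              # ~102
```

On a log scale, 100,000-fold per decade is more than double Moore's roughly 100-fold pace, which is the sense in which sequencing ran "twice as fast."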
This actually is not going to go on forever. This represents the jumping from one S curve to another S curve, and I have to confess, to an audience like this, it involves exploiting technologies that were there already for the exploitation. But nonetheless, just imagine the feeling to us when you can begin to say, what experiments could I do now that I couldn't do 10 years ago given that I can collect data a million times faster or cheaper? Boy, a lot-- a lot of amazing experiments.
What difference is it making? I just want to give you a little sense of what difference it's making. The idea of building these maps, of laying out those spelling differences along the chromosome and filling in the sequence and all that-- it turns out to be a foundation for so much more, because once you had a first map, you could get fragmentary information from any other experiment. Any experiment that could give you bits of sequence, you could start laying it on the map.
Imagine Google Maps. We can begin populating it with satellite imagery, with imagery about good donut stores, about all sorts of things layered on top, and on top, and on top of the map. And the map gets richer-- you know, where your friends are-- it gets richer and richer and richer. We can lay, on top of this map, the locations of genes that come from some experiments, evolutionary conservation across species, chromatin state-- whatever that is-- inherited variation maps, disease association maps, evolutionary selection maps, cancer gene maps. And that's what's going on.
This pile of information is piling up from so many different sources, and the returns are coming from how you integrate that information. How do you make inferences across modalities of information? How do you generate hypotheses-- like we did with cystic fibrosis-- that you never would have ever imagined, but now not just from one string comparison, but from comparisons across thousands of experiments being done across the world?
Well, what have we learned? We've learned a lot about how to sequence DNA. I'm going to just briefly sketch-- very, very briefly-- what we've learned about the functional elements encoded in the genome in this way, and every single thing I'm going to tell you about relies crucially on computation. You couldn't even begin to work on the problem without major computation-- the functional elements, the evolution of the genome, the basis of inherited disease, the basis of cancer, human history.
I'll just rip through some stories quickly. The details don't matter, but the general sense does, because I want to be sure to leave time for questions. Discovering the functional elements encoded in the genome-- how are we going to read the sequence and see what's important? Genomes are pretty boring. This is genome sequence.
I mean, it's as exciting as the 1's and 0's on your hard disk, right? Without some way to parse it-- file headers of some sort, ends of files, et cetera-- it's very hard. In principle, what we'd like to do is look at every single letter and see if it matters by changing it and seeing if it has an effect. Change some A to a T, and grow a new human being that has that sequence, and see if there's any difference. That's considered unethical, and for the most part, impractical.
But in fact, that experiment has been done. It's been run by evolution.
The common ancestor of human and mouse had a certain sequence, and various changes have been tried and tested along the way. You can ask, what sequences matter? What's been retained across that time?
And by comparing the sequence of the mouse to the sequence of the human-- this is just conceptual. The sequence is not written in the English alphabet-- you can run code, and code will tell you, a-ha! I see hidden messages in the two sequences. And you get into interesting questions here.
Well, how exact do the matches have to be? What if I have to compare k strings with a certain degree of mismatch between them? What's the optimal algorithms for doing this and that, et cetera-- for sensitivity, for speed, et cetera? Lots of rich problems there.
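One of those rich problems in its simplest toy form-- naive approximate matching with at most k mismatches. Genome-scale tools use far cleverer indexing, and the sequences here are made up:

```python
# Toy approximate string matching: report every position where a short motif
# occurs in a sequence with at most k mismatches. This naive scan is O(n*m);
# the interesting algorithmic questions are about doing it fast and sensitively.

def approx_matches(text, motif, k):
    """Positions i where text[i:i+len(motif)] differs from motif in <= k places."""
    hits = []
    for i in range(len(text) - len(motif) + 1):
        mismatches = sum(a != b for a, b in zip(text[i:i + len(motif)], motif))
        if mismatches <= k:
            hits.append(i)
    return hits

print(approx_matches("ACGTTACGAACGT", "ACGA", 1))  # [0, 5, 9]
```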
So it meant we wanted much more than the human. We wanted the mouse, so we got the mouse. We wanted the rat and the dog, so we got the mouse and the rat and the dog. In fact, we wanted dozens of mammals, and so we now have 40 different-- 30 different mammalian genomes. And we've got fishes, and we've got all sorts of things running around, and they're all freely available on the web.
Lots and lots of information-- what have we learned from looking at all this information? From careful computational analysis of all this information, we learned that we were so far off on how many genes there are in the genome. I teach MIT students Introductory Biology. Indeed, over the last two decades, I have taught about 55% of all MIT students, I calculate, in Introductory Biology.
And I taught at least a decade of them that there were 100,000 genes in the human genome, and that's totally wrong, and I'm still wracked with guilt about having misinformed them about this fact, because in fact, by doing evolutionary comparison, computational analysis, we can show there's only 21,000 genes in the genome. I can't give you the argument, but I can tell you that that's about right.
But we also find that, to our surprise, while there are fewer genes, there's much more important information. Evolution has conserved-- again, a computation-- about 5% or 6% of the human genome, but only 1.2% encodes proteins. So there's tons of noncoding regulatory information reversing our picture, which before had a little bit of regulation and a lot of protein coding. It's now a little bit of protein coding and a lot of regulation. And so genes involved in early development-- like that red guy down there at the bottom, that's a protein-coding gene involved in early development-- it looks like all those purple things are involved in ensuring that that gene gets turned on at the right time and the right place, et cetera.
And you can go beyond stuff like this. You can ask, biologically, how is the DNA controlled by proteins sitting on top of it? You might think, oh, I've got to do lots of biochemistry. Well, you do have to do biochemistry. The DNA's wrapped around proteins that are modified with methyl groups, and acetyl groups, and other things.
How am I going to find where they all are? Well, again, large-scale parallel approaches-- get an antibody that recognizes any particular modification, pull down the DNA that's attached to that modification, sequence it, lay it on the map, and you build yourself maps. Maps of this flavor of modification, that flavor of modification, that flavor-- and all the places that have that modification get pulled down, sequenced, and laid out on the map.
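That "lay it on the map" step is, at heart, binning read coordinates. A minimal sketch with hypothetical positions-- real pipelines first align each read to the reference, then call statistically significant peaks:

```python
# Toy version of building a modification map: take the genomic positions of
# sequenced pull-down fragments and bin them along the chromosome, so that
# piles of reads mark regions carrying that flavor of modification.

from collections import Counter

def coverage_map(read_positions, bin_size):
    """Count pulled-down reads falling into each bin along the chromosome."""
    return Counter(pos // bin_size for pos in read_positions)

reads = [105, 130, 160, 980, 1010, 1015, 1020]   # hypothetical read starts
print(coverage_map(reads, bin_size=100))          # bins 1 and 10 pile up
```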
Then you can begin to see things like genes that are turned on have distinctive signatures-- green bits and blue bits. Now, here we knew what to look for-- genes being turned on. But in fact, recent work done at CSAIL-- in fact, by my colleague, Manolis Kellis-- has taken lots of these chromatin signatures and, in an unbiased way, asked: are there combinations of signatures that denote things we never guessed in biology? And there are.
When you look at these things, you can use them to scan along the genome. Say, see? I see three genes right now. I compare it to my catalog and say, ah, that's in the catalog. That is. That's not a protein-coding gene, [INAUDIBLE].
4,000 new genes that don't encode proteins pop out at you. These are the kinds of returns that come from computation. So a computational graduate student in my lab discovered 4,000 new genes by computing. Pretty cool.
Well, anyway, lots more stuff like this. On the basis of disease, I told you about cystic fibrosis-- a simple Mendelian disease you trace simply in a family. We're doing pretty well with that. Almost 3,000 of those have been found so far.
The real challenge has been to find the basis of common diseases-- polygenic diseases, heart disease, Alzheimer's-- things that aren't simple, one-gene diseases. And even as recently as the year 2000, we had only about 25 examples where we really knew the genes. But again, a generic approach, a computational approach, arose in the mid '90s.
Several of us proposed, hey, you know, there's only 10 or 20 million genetic variants that are common in the human population. Let's find them all. Let's simply test every one of them in thousands of people with the disease and see which ones are enriched in people with the disease and not in people without the disease. Simple idea, just kind of nuts. You need to be able to find and assay 10 million variants in thousands of people. That's 10 billion genotypes. And it's usually done one at a time, and the students tend to object to doing 10 billion of them.
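The enrichment test itself can be as simple as a 2x2 chi-square on allele counts. The counts below are hypothetical, and a real genome-wide study must correct for testing millions of variants:

```python
# Toy association test: for one variant, compare carrier counts in people with
# the disease versus without, via the standard 2x2 chi-square statistic.
# All counts are made up for illustration.

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# carriers / non-carriers among 1,000 cases and 1,000 controls (hypothetical)
stat = chi_square_2x2(300, 700, 200, 800)
print(round(stat, 1))  # 26.7 -- far beyond chance for a single test
```

Repeating this across millions of variants in thousands of people is exactly the "10 billion genotypes" problem; the statistic is trivial, the scale is not.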
So the solution? Make it easier. Databases were created of the variants. Genotyping expanded enormously in speed. And today, we have the 20 million variants. We can genotype a million of them at a time, and we can do all this. And here's what it felt like in terms of the discovery of genes underlying common diseases.
In each of the years 2000 to 2004, one gene was discovered related to common diseases. 2005, though, four. 2006, seven. 2007, 2008, '09, '10, today-- 1,100 loci for more than 165 different common diseases, et cetera. Similar things are going on for cancer today. I won't go into the details, but by taking tumors and normals and sequencing them, you can find the mutations that occurred in the tumor but weren't present in the matched normal tissue, and begin to collect them. In a recent paper that just came out from MIT here and from the Broad: six new biological pathways in the blood cancer multiple myeloma. Lots of examples are coming out computationally like that.
Then finally, human history. There is so much historical information in the human genome that is so fun to find. We had an old story about how humans came out of Africa-- that's correct-- migrated out, and split and split and split into different populations around the world. And we knew that selection played some role in all this, but we didn't know what the role was.
Well, with the data that's come along and with computational tools and stochastic models and all sorts of rich computation I won't talk about, we know so much more today. We know that it's a story not of just splitting, but of tremendous mixing-- of India being a mixture of two different populations coming together, of Europe being lots of mixtures. And my favorite mixture-- we know that humans are 4% Neanderthal. We can now read out the Neanderthal mixture in humans by looking at Neanderthal sequence and comparing it to other sequences and doing the right stochastic models, and it's very clear we're part Neanderthal.
We can read out positive selection in the human genome, sweeps when something became advantageous and swept with it a whole region of the chromosome around it. And more than 300 such regions have been identified. And we can go back and ask, what kind of selections happened in the speciation of humans from our common ancestor with chimp? And identify genes-- dozens of genes-- that underwent strong positive selection.
Well, what have we learned? We've learned a lot. We've learned specifics about biology, specifics about disease, specifics about evolution, specifics about all sorts of things. But what we've really learned is a lot about the structure of biology, and about scientific community.
We've learned that biology is inherently about information. It's most powerful and richest when you view it as a form of information, because you can do things generically and generally. And we learn that to get that information, we have to work together to build common infrastructure, standards, ways of talking to each other, and we have to share data freely.
In short, we've learned a whole lot of lessons that the world of computer science already knows. And it's a good thing we have, because while biology will always still be wet, and it'll always be looking down the microscope and all those sorts of things, and it'll care about its details, biology is now inexorably, inevitably, and permanently about information. Thanks so much.
GIFFORD: Well, thanks, Eric, for such a wonderful talk.
GIFFORD: We have a question, here.
AUDIENCE: I was talking, last night, to a CCL graduate patent attorney at dinner, and he was defending the use of intellectual property laws because there are-- unlike, as he conceded, their inapplicability to computer science and that realm, areas where it's very important-- you couldn't do pharma products without that system. How does this intersect with what you're talking about? Because it sounds to me like what you have is a very open process, and it's dependent on that.
LANDER: Exactly. The Human Genome Project made a commitment that all the data would be open, available without patents, without any restrictions, and we felt it was incredibly important. On the other hand, I totally defend the need to have patents on small molecules that go in a bottle and are subjected to $100 million of clinical testing, because if we don't have patents on that, someone else is going to come in and sell the product, therefore, nobody will be incented to make the investment.
Essentially, it's about, what is the social bargain? The social bargain, to me, of granting a monopoly on something where somebody's going to invest $100 million for a clinical trial makes great sense. The social bargain for information that is cheap and easy to obtain, anybody can get it-- why do I want to be handing out monopolies to somebody who's merely running a sequencer there?
So we fought a battle. There was a great battle in the Human Genome Project between a public and a private version, and the model for the private version was this would be privatized. It would be private databases where you'd pay to access. And I think, in computer science, we see the value of integrating data across so many sources.
I think biology has to figure out exactly where to draw those lines. The right way to draw it would be to say, what's the right bargain to strike societally? The law doesn't do that. The law says, what's obvious and inventive? And we create these fictions about what's obvious and inventive, and decide that certain things where we don't want to make a bargain aren't really inventive, or are obvious. And I think we're edging toward trying to get that right.
GIFFORD: Thank you very much. Question over here.
AUDIENCE: You spoke quite eloquently of all of the things that we're now learning because of our ability to mine this information in ways that we haven't been able to before. And I know you just spoke a little bit about some of the social issues of integration and other things. I was wondering if you could speak a moment or two about the computational, and perhaps other, issues that are still on the table, and what the challenges are for the next generation? And maybe that's a simple hard problem, but--
LANDER: It's a fabulous question. So this past generation managed to get a lot of mileage out of linear string comparisons-- maybe n strings being compared together or something. The next problem is to go up one level. It's to infer circuitry.
We have to reverse-engineer circuitry-- and circuitry that wasn't even designed by some collection of engineers over the course of decades, but in fact, evolved in a totally opportunistic way over the course of billions of years. How do we take information that we can read out of the cell, we can read out what genes are turned on and off, how the chromatin is changed in various ways, at increasing resolution?
We can go in, and we can cut wires. We can interfere with a gene using RNA interference and other things. We can introduce new genes. How do we take those tools and do reverse-engineering and infer circuitry? How do we exploit evolutionary comparisons because we know that circuits aren't invented that often?
And the new generation-- faculty like here at MIT, Aviv Regev in the Biology Department, who's a card-carrying mathematician and computer scientist-- she's thinking deeply about how do you reverse-engineer circuitry? Those are hard problems-- problems for which there aren't even great answers in computer science and statistics to serve as starting points. It's a kind of machine learning, but machine learning about a very unusual kind of machine. So it's a great set of problems.
Anybody who's done with whatever they're doing in computer science and has a couple spare cycles and wants to think about these problems, we still-- the ratio of computer science needed to computer science available within biology is still way off the optimum that we have to achieve. I know it's time.
GIFFORD: David, thank you.
I'm very pleased to introduce Collin Stultz, who's the [? KEK ?] professor here at MIT. Welcome, Collin.
STULTZ: Well, first-- thanks-- I want to thank the organizers for inviting me here to talk about some aspects of medicine and computation now. I guess when I first heard about this symposium, I thought giving a talk about computation and the transformation of medicine would be a good topic, but that would be a six-hour endeavor.
My area of expertise is cardiovascular medicine, so I thought it'd be best to focus on that. But to put this into some perspective, I'll start by talking about big things, big problems-- problems involving things that you can see with the naked eye-- and go down to things that are much more relevant for the future.
So I want to begin by talking about medicine at the turn of the century. This picture was taken in about 1903, I'd say. And parenthetically, I'll say I showed this picture to my wife, who's a psychiatrist-- that's an important piece of information. And watching her eyes, she looked at the gentleman's face. She looked at his clothes, and only afterwards looked at all of these contraptions and the buckets filled with salt water, and said, well, this is clearly somebody who's about to be electrocuted for some nefarious crime.
When in actuality, this is one of the first ECG recordings. Now, this entire contraption could probably fill a very expensive but small condominium in Back Bay, but it was state-of-the-art at the time, where we could record electrochemical signals that originate from the heart and get information about how these signals are propagated-- and actually, about cardiovascular health. And you can contrast this to-- any of you who've been to your physician as of late-- what an electrocardiogram looks like today, to see the difference. All because of advances in both hardware and software.
Now, not only was the electrocardiogram the state of the art at the turn of the century, but it was possible to visualize the cardiovascular system as well. The problem is that that happened at autopsy-- you needed to die for us to be able to understand the structures within the heart and whether there was disease. And we learned a lot of important information from these types of studies.
We learned about the blood flow through the heart: blood returns to what's called the right atrium, the right heart pumps it out to the lungs, it returns to the left heart-- the left atrium, the left ventricle-- and goes out the aorta to the rest of the body. So very simple design principles arose from those studies, but those were all what we call ex vivo-- post-death.
Now, there are many other things that I can talk about-- leeches and other sorts of voodoo and so forth that happened at that time-- but it's much more interesting to talk about where we are today. Here, this is the electrocardiogram that I showed you before, and this is probably the much more familiar picture today. All of those boxes and tubs of salt water and so forth are replaced by this small device, and we have leads, which have replaced the buckets, with which we can get much more information about how the electrical signals are transmitted through the heart-- and very precise information about what areas of the heart are damaged, from looking at the standard 12-lead electrocardiogram. Again, all because of advances on both the software and the hardware side.
We can also acquire this information from people who are just walking around. Many of you here may have actually worn a Holter monitor. So this contraption, which is here by the side of the patient, is now reduced to something that's here in the pocket of this patient. And by recording these impulses over time, we can get information about heart rhythms and diagnose what we would call arrhythmias, and incidences of ischemia, where the heart is not getting enough blood-- usually due to atherosclerosis.
But the major advances, I should say, come in cardiac imaging, and they can happen, now, pre-autopsy. So this is a type of study called an echocardiogram-- a cardiac ultrasound-- and it's a special type of cardiac ultrasound called a transesophageal study. And here we get information about the left atrium, the valves, which control the flow of blood from the left atrium to the left ventricle, the left heart, the right heart. And we can look at the structures and diagnose disease in this manner.
And this involved overcoming a lot of signal-to-noise problems in terms of the software, and also many different hardware problems. In addition, by applying the simple principles of the Doppler shift, which originated from a lot of studies that came from MIT, we can look at blood flow across the valves. So here, this red flow represents blood that's going at a relatively high velocity from the left ventricle to the left atrium.
Normally, blood goes the opposite way-- from the left atrium to the left ventricle. So we can diagnose valvular heart disease. This is a leaky valve. And we can do all this non-invasively, by standard methods and principles of ultrasound.
We can look at specific valvular structures to ensure that they are normal. This is the aortic valve, and the aortic valve normally has what we refer to as three leaflets-- again, all non-invasive studies. Parenthetically, I will say that this sign-- when the valve is closed-- we originally referred to as a Mercedes-Benz sign. But as the income of physicians has devolved, it's now referred to as a peace sign, which I thought was sort of an interesting piece of information.
We can take these two-dimensional images that are acquired via cardiac ultrasound and merge them together to get a three-dimensional image of the heart. So we're advancing now. We started out with electrocardiography, and these are all things that are available routinely today in the care of patients with cardiovascular disease.
We can take cardiac ultrasound with these two-dimensional images, combine them together to get a three-dimensional image of the heart. We can slice it sagittally-- that's from the tip here to the base. And now we can slice it in an orthogonal position to get a look at specific valvular structures, where this is a look at what's called the mitral valve.
Some of the more impressive changes have come with cardiac MRI. And here, unlike the cardiac ultrasound studies, we get very detailed pictures where the signal-to-noise quality's actually a lot better. Here, we start from what's called the base of the heart, and we can see the blood flow from the left ventricle out into the aorta, and we can go down from the base all the way to the apex. And we can look at various regions of the left ventricle and the right ventricle in all of these pictures, and we can get very precise information about cardiac function, valvular function.
And this now corresponds to the standard of care. There are many other things that I can talk about, but the advances that have happened over the course of the last century-- even in the course of the last 50 years-- are pretty profound. There are many studies that we can do that allow us to diagnose things that we could never have dreamed of diagnosing at the turn of the century. But I think the most interesting discussion is, where are we going?
What are the challenges for the future? Where does computation come into play when we talk about looking at patients and improving patient care in the future? And just to put this into perspective, cardiovascular disease, despite all of these advances, is still a major problem.
In 2009, there was a heart attack in the United States every 30 seconds. 1 in 4 deaths were due to cardiovascular disease. It's an enormous cause of morbidity and mortality, and we're still not very good at it. If you look at all of the patients who've died from cardiovascular disease, we're able to predict a relatively small number of them. So the goal of cardiovascular medicine is really about risk stratification-- that is, identifying high-risk patients so that you can intervene in a timely fashion to prevent adverse outcomes.
So if you asked someone how to do this, you'd get different answers. How do we improve our ability to risk-stratify patients? And some would say, well, we need higher-resolution MRI scans. Some would say we need non-invasive visualization of the coronary arteries, which can occur via very good MRI scans, or CT scans, or other things that actually cost a whole lot of money.
Now, it's interesting that what I'm standing on is a little higher than the floor, because I will view this as my soapbox and use this as an opportunity to express some of my own biases, and my own reflections on the role computation can play in the future of cardiovascular care. So there's a lot of low-cost, easily-obtained data-- like the electrocardiogram-- that can be acquired over an extended period of time and that has prognostic value.
When you go into your physician and you get an electrocardiogram or you get an MRI scan, it's called a point of care test. So you walk in. You may not be feeling well at that time, or your physician, for some reason or another, is concerned by something that you've told him or her. And he says, well, you know, let's go get that test. And then you walk out, and you live your life as you do.
And it could be that at the time that the test is taken, you're fine, but later on down the line, it could be, after you've left the test, that you have problems. And the hypothesis here is that if you were able to observe patients, looking at easily-obtained data-- heart rate changes, electrocardiographic data, and so forth-- you'd be able to better prognosticate.
And there are two parts to this. You've got to obtain data easily, and you've got to be able to analyze it. So you've got an information problem. So there are lots of data that we could obtain from the patient.
So here you see a mature gentleman clutching his chest, about to have a heart attack. Typical sign called the Levine sign in patients-- pressure across the chest, possibly shortness of breath, and other associated symptoms. Now, as an outpatient, we can obtain electrocardiographic data. We can obtain blood pressure data.
We can obtain data about one's breathing. If this patient were in the hospital, we could obtain continuous blood pressure monitoring. We could obtain measurements of pulmonary artery pressures. Lots of information that we can obtain. In fact, so much information that it's hard for the physician to make sense of it just by looking at that data by him or herself.
So I refer to this as a macroscopic challenge. We have to obtain and interpret large amounts of easily-obtained data, physiologic data, and use it for risk stratification. And I think this is a place where computation can play a major role in the future.
So we're addressing the macroscopic challenge. And with respect to obtaining the data, here I'm going to focus on work that is currently going on at the Institute. So there are devices that can be designed that are relatively small that we can actually place on the patient that can record data, we hope, over the span of a couple of weeks.
We can record motion data with accelerometers. We can record electrocardiographic data, [? record ?] information about the O2 saturations. And that is to be contrasted with the typical Holter monitor.
Now, you can imagine wearing this contraption for a day or two is not too much of a hardship. But wearing this stuff for a week, two weeks, three weeks-- who would want to do that? If you put this on me, I'd take it off after the first three days.
So the smaller it is, and the more comfortable it is, the more likely that you'll be able to obtain the data over an extended period of time that has value for risk stratification. We can even get smaller. You can imagine a device-- and this is, again, work that's going on at the Institute-- that fits behind the ear like a hearing aid that records, again, electrocardiographic information, information about O2 saturation, and uses those data to make very rough estimates of what's called the cardiac output-- how much blood is expelled from the heart per unit time.
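The cardiac output quantity mentioned here is, at bottom, a simple product of heart rate and stroke volume. A minimal sketch with typical resting values as hypothetical inputs (the device's actual estimation method is not described in the talk):

```python
def cardiac_output_l_per_min(heart_rate_bpm, stroke_volume_ml):
    """Cardiac output = heart rate x stroke volume, converted to L/min."""
    return heart_rate_bpm * stroke_volume_ml / 1000.0

# Typical resting values: 70 beats/min, 70 mL per beat -> about 4.9 L/min.
co = cardiac_output_l_per_min(70, 70)
```

The hard part for a wearable is, of course, estimating the stroke volume term indirectly from signals like the ECG and O2 saturation, which is where the "very rough estimates" come in.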
And this is where I think things are going. And I'm not talking about 50 years down the line. I'm talking about just a few years down the line. So there's gathering the data, and I think that that's largely, in some sense, a hardware problem, but there's interpreting macroscopic data. And there are many ways that this can be done.
So here I'm going to give you an example. If I just looked at electrocardiographic data-- this is a normal electrocardiogram, and you usually see this if you turn on the popular television shows that involve doctors. I think ER was the popular show when I was in medical school. I don't know what it is now.
But a normal electrocardiogram is actually very informative for several reasons. You have to have a normal conduction system, and you also have to have a normal interaction between the heart and the nervous system, because these two things interact with one another. So if you're watching the Yankees and Red Sox, depending on your disposition, and depending on which game you watched, your heart rate may increase a great deal. That is to say, there are interactions between these two systems. And a normal electrocardiogram, with a normal heart rate and a normal morphology, actually tells you quite a bit about each of these different factors.
So in using these ECG data for prognostication-- and some of this work is in collaboration with [INAUDIBLE] [? Sayid, ?] who's a professor at Michigan, and John Guttag here at CSAIL, and myself-- the hypothesis is that subtle beat-to-beat changes in the morphology-- that's the shape of the ECG signal over time-- which can't be detected by eye because they are so subtle, may have prognostic significance.
Now, people are not good at detecting these changes. Computers are. And by using sophisticated computer algorithms, you can quantify these beat-to-beat changes in morphology and assess their prognostic value. And so one of these measures, which looks at these beat-to-beat changes and quantifies them in a very sophisticated way, is called morphologic variability.
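As a rough illustration of the kind of computation involved-- the published approach is considerably more sophisticated, aligning beats with dynamic time warping before comparing them-- here is a toy sketch that scores consecutive beats by the energy of their sample-wise difference. The function names and the mean-energy summary statistic are illustrative inventions, not the actual morphologic variability metric.

```python
def beat_to_beat_distances(beats):
    """Score each consecutive pair of equal-length ECG beats by the
    energy of their sample-wise difference. (The published method
    instead aligns beats with dynamic time warping before comparing.)"""
    return [sum((a - b) ** 2 for a, b in zip(prev, curr))
            for prev, curr in zip(beats, beats[1:])]

def morphologic_variability(beats):
    """Hypothetical summary statistic: the mean difference energy
    across all consecutive beat pairs."""
    d = beat_to_beat_distances(beats)
    return sum(d) / len(d)

# Three toy "beats": the first two are identical, the third differs
# slightly in amplitude -- the kind of change the eye would miss.
beats = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]
mv = morphologic_variability(beats)
```

The point is simply that once beats are digitized, tiny morphology differences become ordinary arithmetic that a computer can track over hundreds of thousands of beats.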
And sure enough, if you take a group of patients who've had a recent cardiovascular event-- which you call an acute coronary syndrome, or a myocardial infarction, or a heart attack-- and you look at those patients that have a high morphological variability and you look at their rate of death as a function of time versus those that have a low morphologic variability, high morphological variability by itself predicts almost an eight or nine-fold increased risk of death, with much of that occurring within the first 30 days post the event.
So just the electrocardiographic data by itself, looking at subtle changes using sophisticated algorithms, gives you quite a bit of information for risk stratification. And that holds true in many different patient subgroups. So if you are mature, or if you are young, it still holds true-- female, smoker, non-smoker, diabetics. In the majority of these cases, in this plot, the thing to get across is that points over on this side represent statistically significant increases in your risk.
You can also combine morphologic variability, or these types of measures, with established measures of risk to improve the performance. So for example, physicians typically look at what's called the left ventricular ejection fraction. So that's the amount of blood that's ejected from the heart with each beat.
If the left ventricular ejection fraction is less than 40%, we say that patients fall into a high-risk subgroup and are at an increased risk of adverse events. Now, if you're greater than 40%, physicians generally say, OK, well, you're all right. I'm not going to intervene too much, because you fall into a low-risk category. But although the risk is lower for patients that have an ejection fraction greater than 40%, most of the deaths actually occur in this low-risk subgroup.
So it's not sufficient to say, well, OK, well, your EF-- your Ejection Fraction-- was greater than some number, and I can leave you alone. And it turns out that when you combine morphological variability measures-- again, just by looking at electrocardiographic data using these subtle algorithms-- you can improve the performance further. So if you look at patients that fall into this high-risk subgroup-- patients that have an ejection fraction less than 40%-- they have higher rates of death. But if your ejection fraction is less than 40 and your morphologic variability's high, there's almost a twofold increased risk of death.
More importantly, if you're put into this low-risk subgroup, a physician would normally say, OK, well, I shouldn't do too much, because your risk is relatively low to begin with. But if your morphologic variability is high, you're at almost a two or three-fold increased risk of death. So I think the point here is that you can combine this very low-cost, easily-obtained information, apply subtle algorithms founded in principles of signal processing and artificial intelligence, and extract information that is quite powerful for making statements about the future cardiovascular health of the patient.
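The combination described above can be made concrete as a toy stratification rule. The 40% ejection fraction cutoff comes from the talk; the group labels and the function itself are illustrative only, not clinical guidance.

```python
def risk_group(ejection_fraction_pct, high_morphologic_variability):
    """Toy combined stratification. EF < 40% is the classic high-risk
    cutoff mentioned in the talk; high morphologic variability upgrades
    the risk estimate within either EF group."""
    if ejection_fraction_pct < 40:
        return "very high" if high_morphologic_variability else "high"
    return "elevated" if high_morphologic_variability else "low"
```

So a patient whom the EF cutoff alone would call low-risk-- say, an EF of 55%-- gets flagged as elevated risk when morphologic variability is high, which is exactly the subgroup where most deaths occur.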
So now I'm going to turn to what's called the microscopic challenge. And the microscopic challenge is not looking at information that's obtained from the patient as a whole, but looking at the pathophysiologic processes that are involved in heart attacks. So what happens when you have a major heart attack?
So the person who's walking down the street, and he or she has chest pain and falls over and drops dead-- what's the process that happens there? Well, here's our very mature gentleman here who's grasping his chest because he has discomfort. And if you look at the heart, at this point in time, there's a blockage within a vessel that feeds the heart. That blockage actually breaks open. It's called plaque rupture.
A clot forms in the vessel, and then blood cannot flow through that vessel anymore, and the heart dies. And so the microscopic challenge here, I would say, is to visualize and understand the early events that precipitate heart attacks. And there's some very interesting work happening with my colleague, Brett Bouma, at Mass General Hospital, and he's an HST, MIT-affiliated faculty member.
We can get very precise and pristine images of coronary arteries early on in the process. And you can see places where there are lipid-laden areas filled with cholesterol, places with calcium, and so forth. So that's one part of the challenge that I think will be part of the future.
These types of studies, I think, will become routine in the not-so-distant future. Let's look at even a smaller level of detail. So we can look at the vessel. We can look at the atherosclerotic lesions within the vessel. But this process, as I said, involves these plaques that are within the vessel, and they rupture.
So this is actually a specimen taken from a patient who died from a catastrophic heart attack. And this is an atherosclerotic plaque here. And covering it is this lining of collagen, which ruptures. And when it ruptures, the blood within the lumen of the vessel comes into contact with all of this bad stuff, and the clot forms.
And you can see this clot takes up a lot of room, so that before this clot was there, you had blood flowing through this entire region, and after the clot is formed, you have blood which is now limited to a smaller area. When you blow up this area where the collagen has been degraded, there are lots of inflammatory cells. These inflammatory cells secrete enzymes, which degrade the collagen.
Now, collagen, when you look at it-- if you remember from your biochemistry textbooks-- is a very rigid molecule, for those of you who took biochemistry, I should say. But the principle that I want to articulate is that collagen is like a rod of steel, a very strong metal.
So saying that enzymes are able to degrade collagen-- at least looking at the structures of collagen as they appear in structural databases-- doesn't make sense. It's like a man with a hammer trying to break down this data center. But what happens is that-- so the structures of collagen-- oh, I've lost my batteries.
So the structures of collagen that we look at in these databases were obtained under conditions that are far from physiologic. So the question is, how can we obtain accurate pictures of this structure under physiologic conditions? And this is where computation can play a role. And here, a lot of these methods, these molecular dynamics simulations-- and again, I should say a lot of these long-timescale simulations have been pioneered by David Shaw, who will be the next speaker-- you can get an estimate of how the molecule moves just by solving Newton's equations of motion for a particular molecule.
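The core numerical idea-- stepping Newton's equations of motion forward in time-- can be sketched with a velocity Verlet integrator on a single one-dimensional harmonic "bond". Real collagen simulations involve thousands of atoms and empirical force fields; this toy, with made-up parameters, only shows the integration scheme.

```python
def velocity_verlet(x, v, k=1.0, m=1.0, dt=0.01, steps=1000):
    """Integrate Newton's equation m*a = -k*x (a 1-D harmonic 'bond')
    with the velocity Verlet scheme used throughout MD codes."""
    a = -k * x / m
    for _ in range(steps):
        x += v * dt + 0.5 * a * dt * dt   # position update
        a_new = -k * x / m                # force at the new position
        v += 0.5 * (a + a_new) * dt       # velocity update
        a = a_new
    return x, v

# Start displaced by 1.0 at rest; because the scheme is symplectic,
# the total energy 0.5*k*x^2 + 0.5*m*v^2 should stay very close to
# its initial value of 0.5 over the whole trajectory.
x, v = velocity_verlet(1.0, 0.0)
```

The appeal of this scheme in MD is exactly that energy-conservation property: it lets simulations run for very long times, which is what reveals slow conformational motions like collagen's flexibility.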
And when you do that for collagen, what you find is that collagen actually-- it really isn't like this very stiff rod that you see in crystal structures. It's very floppy. At least, that's what the simulations would say. And this floppy molecule is a dominant structure at physiologic temperatures. Most people in the Northeast are at 37 degrees Celsius. And at that temperature, you'll find that most of the-- when I go to Indiana, people boo me when I give that joke, but people usually laugh in Massachusetts.
In any case, so it's very floppy, and this is the stuff that the enzymes can degrade very readily. So there really is no paradox. The simulations have revealed that plaque rupture should happen, because collagen itself is so floppy. Almost done. [INAUDIBLE] [? time. ?]
So this gives us a new paradigm for drug design. So if you think about collagen as flipping between two different states-- a state that's very rigid, hard to degrade, and a state that's floppy that can be degraded easily, and you look at the enzyme which degrades collagen, the degradation process proceeds as follows.
The enzyme binds near this floppy region, it attaches, and it cleaves it. Drug companies have tried to develop molecules that prevent this process. But here, here's a new avenue. How can you prevent these transitions of collagen from its very rigid structure, which is not able to be cleaved, to this relatively floppy state? And so I think that computation has a role in helping us understand these things at very, very high resolution.
I know that I'm out of time, so I'll just close very quickly. Computation, I think, has improved our ability to diagnose heart disease and to prognosticate. There are a myriad of avenues where computation can improve our ability to analyze large data sets that have prognostic value, and I think there are a lot of avenues here in the future for computation. And lastly, I think novel techniques, including [? molecular ?] simulations to look at the movement of small molecules under conditions that are near physiologic conditions, can provide new insights into the pathogenesis of cardiovascular disease. And I won't take up any more time. I just wanted to acknowledge some of the people who've contributed to some of this work on the MIT side, and the names you can see. Thank you.
GIFFORD: So I had a question for you. You know, some people now are believing that they should be publishing their genomes, and that that actually is a normative thing to do. And I'm wondering whether or not you think, in the future, as these devices get smaller and smaller, people are going to be motivated to publish a lot of their biometric data? Because of course, right now, we're primarily gathering those data from people that are at risk.
STULTZ: Yeah, so I think the same issues that surround publishing the genome surround publishing biometric data as well. If your insurance company gets some information that you are at increased risk of, let's say, cancer based on your genetic profile, or based on these biometric data that you're at increased risk of having a heart attack, that may have ramifications for employment, as well as for insurance. Now, things, I think, are changing with this new health care law that's 2,000 pages long. I'm not quite sure I understand. But nevertheless, those are my concerns.
GIFFORD: Actually we're out of time, so let's thank Collin again.
STULTZ: Thank you.