Eric Lander, "The Human Genome Project and Beyond"
SHARP: Welcome. I'm Philip Sharp. And it is a real pleasure to be able to introduce my colleague Eric Lander to present a talk on the historic event of the sequencing of the human genome.
When historians look back on these decades of biological science that bracket the change of the century and the millennium, the most noted advance will be the sequencing of the human genome. This sequence will shape all biological and medical science that is done postgenomic. And, in fact, it is the definition of that word, "postgenomic." The consequences of having the sequence will not be fully realized until decades have passed.
There are many stories for MIT in this event. First there's the story of Eric Lander, educated at Princeton as an undergraduate, PhD in Mathematics at Oxford, then a faculty member at Harvard, in the Business School. His first relationship was with MIT as a fellow in the Whitehead Institute, studying human genetics. Shortly thereafter, he became a professor in the Department of Biology and a member of the Whitehead Institute. He began the Human Genome Project at MIT and is director of the Human Genome Center, which is affiliated with the Whitehead Institute.
The second story is the relationship between the Whitehead Institute and MIT. This relationship, which began in a highly debated experiment some 15 years ago, has been an unqualified success. The Whitehead Institute has produced great science, some of which we celebrate today, has produced great leaders in science, and has played a full role in education at MIT, as typified by Eric's outstanding lecturing in the Core Biology course here in 10-250, which he does in the fall.
Finally, there is the story of the human genome sequence, what it tells us about our past and future, how the sequence was determined in an international effort, and what it will mean for humanity. This story will be told by Eric, the scientist who led the team that contributed the largest portion of the sequence to the public effort and who has been its best spokesman. MIT can be proud that one of its faculty has accomplished an historical first-- the sequencing of the human genome. Eric.
[APPLAUSE]
LANDER: Let me start by-- oh-- I have no voice, by the way. One of the casualties of this week, which has been a rather intense week, is that my voice seems to have been lost somewhere between here and Washington. So those of you who are used to my lecturing in 7.01 may detect a somewhat more painful version of it today.
But, in any case, let me correct one thing Phil said right at the end, because I think it's very important. One of its faculty members has completed the sequencing of the genome. Of course-- and I know Phil means it-- what happened is, about 250 people here at MIT and the Whitehead Institute have completed this remarkable project of sequencing the human genome.
Because I think one of the really most important aspects of this project is that nobody can do this alone. This was the first example of a truly, truly collaborative effort of enormous proportions in biology. And I don't think it will be the last of those efforts. I think we've learned that there are projects that we need to get done to provide the infrastructure for our field, to provide the tools that the single, smart graduate student and postdoc can use for her or his experiments. And that we need to work together, to make those tools.
And I think, before I go anywhere, I want to say how tremendously, tremendously honored and pleased and proud I am to have gotten to work with so many people here, together, in this community, to get this job done. And any one of us had a small part, but what I can say definitely is, if we here at MIT had not pulled together to do this it wouldn't have gotten done. And that's a pretty good thing. If not any one of us was essential to it, this community was absolutely essential to it. So this is very much a shared talk, but I'm very honored to talk about it on behalf of all the people here who did it.
I'm looking out upon the community. I see members of the genome center here, which is great. I see 7.01 students here, some of my 7.01 students, because I lecture here in the fall and I'm glad to see you back again. And many of my colleagues in the Biology department. And then there probably are a whole lot of people who are not biologists at all who just came because they wanted to find out what in the world was going on with the Human Genome Project. And I'm going to try to be able to speak appropriately to everybody and just talk a little bit about what went on, what we learned from it, and where it's all going. Let me just dive right in.
The Human Genome Project and Beyond. The Human Genome Project was an absolutely ludicrous notion, proposed in about 1985-- maybe '84, '85-- that we could determine the entire sequence of the human genome and that this would provide incredibly useful information for all of biology. Both of those propositions were open to tremendous challenge-- the first, that we could do this.
At the time, DNA sequencing was less than a decade old. DNA sequencing was practiced with radioactive slab gels held up against film, to detect sequences. It was stunningly inefficient. It was not that long before that point that people got whole graduate theses for sequencing of a couple of thousand letters in some plasmid. And here the field was talking about a million times more sequencing than that.
In addition, it was hotly debated whether this was worth anything at all. Everyone-- not "every"-- many people said, well, what would you do with that information, anyway? Who needs it.
And the truth was, with regard to both of those propositions, the critics were right. We had no clue how in the world to do this. If anybody had gotten up in front of Congress and said, we know how to sequence the human genome, they would have been guilty of perjury. Thank god nobody had to say that under oath to get this project started. And if anyone said, we know exactly what we're going to do with the information, they would have been lying to you, too.
But what happened was a marvelous process of scientific progress, in the latter part of the '80s, when a number of different committees and a number of different countries and a number of different discussions got together and refined this project from what it was first, some megaproject, into a number of steps that would produce intermediate results that would immediately have feedback to science and would involve dozens and dozens of groups around the world. This is just a timeline of some of the events-- with a mammalian-centric point of view, because we here at MIT and Whitehead have taken a mammalian-centric point of view-- of building the genetic maps that let you trace inheritance. That was the first thing-- just landmarks. Not so many of them, a few thousand that would let you trace inheritance and families, making those maps better and better through three different generations, shown here in colors.
Then, once you found by tracing inheritance that there's a gene in a particular region, you would get the piece of DNA by a physical map from that region. And you would study it, to try to find the disease gene. But then, of course, what you'd really like to do is sequence all those pieces of DNA, so that, instead of having to do lots of laboratory work, you would just double-click with your computer and up would pop what you wanted. That, of course, being the dream of biology experiments.
In 1985-- sorry, 1996-- began a pilot project for the sequencing effort, when a number of groups around the world got together and said we'd try to work out the methodologies for doing large-scale sequencing, being able to do the process at high quality, produce finished sequence at the end, to automate it. And this was designated as about a three-year pilot project. And that is indeed what happened.
In the spring of 1999, there was a dramatic scale-up, because we entered the production phase. And, over the course of the next year or so, the vast majority of the sequence got covered. It involved a tremendous change in technology.
When we began, in borrowed space in the cancer center, we were using single-channel pipetters. The great advance following that was eight-channel pipetters.
[LAUGHTER]
Some favored the 12-channel pipetter. But, over the course of the decade or so, we advanced and advanced to the point where now the genome center over at the Whitehead Institute is filled with robots of this sort. There are picking robots that go around with cameras, picking bacterial clones. The head there picks 96 colonies, dunks them in to grow, then picks 96 more. And, in the course of a day, we pick about 120,000 bacterial colonies.
We then have to grow them up overnight, and we have to purify them. Many of you in the audience have done plasmid preps before, and then set up sequencing reactions. It's a tedious operation.
This is how we do them now. This is the production floor for doing plasmid preps and sequencing reactions. Onto those conveyor belts-- which use a new kind of biochemistry, a solid-phase reversible immobilization that was developed at the genome center. Onto these conveyor belts go Microtiter plates of bacterial growths. They get cracked open, the DNA gets precipitated, it gets washed and cleaned in all different ways, reactions get set up, and it chugs along at a pulse rate of one Microtiter plate, about 96 clones per minute, through the system, processing 120,000 clones in the course of a day.
You see a tremendous number of machines in this picture. You see only a handful of people. And that is correct. The remarkable development team over at the genome center that built this robotics, in conjunction with the company, in fact is significantly larger than the number of people it takes to run the robotics, which is really quite delightful.
Once all the sequencing reactions come off, we have to detect them. And they go back to a room of commercial detectors, back in the back room. We have a farm of about 145 capillary-based sequencers, with a throughput of about 65 million letters of DNA per day. That's a lot of sequence per day. And then all of that goes upstairs to the informatics department, where folks process this and assemble this and extract something from all the sequence.
It was a very interesting year, from May of 1999 to May of 2000, when the amount of the genome that had been covered in draft form-- not perfect form, but draft form-- grew from about 10% of the genome to about 90% of the genome, in the course of one quite remarkable year. This is roughly what the year looked like, here.
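As a rough check on the production figures quoted in this part of the talk, the arithmetic can be sketched in a few lines of Python. The constant names are made up for illustration, the 3.1 billion figure is the genome size given later in the talk, and the "days for one pass" number ignores redundancy, failed reads, and assembly, so it is only an order-of-magnitude illustration.

```python
# Figures quoted in the talk (names are illustrative, not from any real system).
CLONES_PER_PLATE = 96            # one Microtiter plate
PLATES_PER_MINUTE = 1            # quoted pulse rate of the conveyor
CLONES_PER_DAY_QUOTED = 120_000  # quoted daily clone throughput
SEQUENCERS = 145                 # capillary sequencers in the farm
LETTERS_PER_DAY = 65_000_000     # quoted detector throughput
GENOME_SIZE = 3.1e9              # approximate genome size quoted later in the talk

clones_per_hour = CLONES_PER_PLATE * PLATES_PER_MINUTE * 60

# Ceiling if the line ran all 24 hours without stopping.
max_clones_per_day = clones_per_hour * 24

# Hours of running needed to reach the quoted 120,000 clones.
hours_for_quoted_rate = CLONES_PER_DAY_QUOTED / clones_per_hour

# Average output of a single sequencer per day.
letters_per_sequencer = LETTERS_PER_DAY / SEQUENCERS

# Days of raw output equal to one genome's worth of letters (1x, no redundancy).
days_for_one_pass = GENOME_SIZE / LETTERS_PER_DAY

print(f"ceiling: {max_clones_per_day} clones/day")
print(f"hours needed for 120,000 clones: {hours_for_quoted_rate:.1f}")
print(f"letters per sequencer per day: {letters_per_sequencer:,.0f}")
print(f"days of output per 1x genome: {days_for_one_pass:.1f}")
```

On these numbers the quoted 120,000 clones a day corresponds to the line running roughly 21 hours out of 24.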
It was month by month, the genome. And I could just look at this again and again and again. It just--
[LAUGHTER]
Really, I mean, let's look at that again. That's really--
[LAUGHTER]
That's really fun. There you go! That's all there is to sequencing a genome. No, this involved a tremendous amount of coordination of 20 different groups around the world, distributing clones, groups working in all different sizes. Our group at the Whitehead Institute was indeed the largest single contributor to the effort, but in fact there were major, major contributions from the Sanger Centre in England, from Washington University in St. Louis, from the Department of Energy, from Baylor in Texas.
And then a number of small groups, all of whom played, I think, incredibly important roles in keeping this an international project. In fact, the project was so open that, halfway through its scale-up phase, the group in Beijing, China, said they wanted to join and we said, sure. And Beijing brought in 1 and 1/2% of the genome, done there in China.
The groups were all very different in how they approached this, but they all shared one absolute commitment, which was that the data, however it was produced, would be freely released on the web every 24 hours, without restrictions of any kind. That was the only condition for signing up to join the International Human Genome Project.
It was a pleasure to work with all these people, and it was a pleasure to do all this work. And I've got to say, there's just absolutely nothing like just being able to be left alone and focus on your science. And (EMPHASIS) this was absolutely nothing like being able to be left alone and focus on your science!
[LAUGHTER]
We spent the entire year, while we're trying to sequence the genome, waking up to newspaper headlines all the time. Because apparently, I understand, some company decided that we were being slow about it.
[LAUGHTER]
And decided to create a business plan, that they were going to sequence the human genome so much faster than anybody else. And they would therefore have a proprietary database that they could do with as they chose. To my mind, there was a little miscalculation, there, in that I think they mistook what was the designated pilot phase, where we were attempting to figure out how to do this thing carefully and right, for an inability to scale up when we were ready.
And so, in fact-- I think somewhat to their surprise-- they found that we were zooming along just fine, doing this, and that the business plan which involved having a three-year window where the sequence was exclusively available through this company really seemed to, um, well, it didn't look so good, after a while. We'll come back to what actually happened, because, scientifically, there was a lot of very interesting stuff there. There were some claims that they were going to do this by a much more powerful method, the much-vaunted whole-genome-shotgun method, that, instead of dividing the genome up into pieces, they were going to toss the whole genome into your blender, press, you know, Puree, and be able to sequence lots of pieces and put it together. And this was so much faster than the silly old, you know, know where you are and get the whole thing organized kind of way.
[LAUGHTER]
We're going to come back to that point, because, for the first time, we know the answer of what actually happened. But, given all the controversy of what was going on, we decided that the best thing to do-- I believe Senator Aiken really gets credit for this strategy-- was simply to declare victory. Right? This was the old Vietnam strategy, right?
[LAUGHTER]
In his case, it was declare victory and withdraw. We just declared victory and went back to our labs to get to work, you know, finishing the genome. Back on June 26, there was a big announcement at the White House that the human genome had been sequenced. Um-- at least, if you read closely, that a draft version of the human genome had been sequenced. And, if you read really closely, that the groups claim that a draft sequence of the--
[LAUGHTER]
Obviously, in our case, you could check the web yourself and see that there was data there, or that you would have to analyze it yourself. The case of the company, you know, go figure-- how would you know, anyway? But it didn't stop the announcements, and it was basically right. Because there had been an amazing milestone achieved, which was that it was no longer the case that the lack of sequence was a rate-limiting step for biology, that, for the vast majority of regions of the genome, you could get your hands on the sequence.
But what we hadn't done is analyze that sequence, put it together, neatly seen what's in it, really read through the book. And, to my mind-- although I realize it seems sort of pokey and old-fashioned-- instead of science by press release, to me the real milestones come with peer-reviewed papers. And that's what in fact happened this week.
That's what was exciting, this week, was that a paper appeared in the journal Nature. In fact, a remarkable collection of about 20 papers appeared in the journal Nature, one of which [? cub-reports ?] the sequence of the human genome in an initial analysis. Some others report other related topics I'll talk about. It appeared officially, covered it yesterday. We held a press conference on Monday, to announce this.
And this picture here, by the way, is one we designed here at the Whitehead Institute. You may not be able to see it closely, but this is a picture of DNA that has been decomposed through the photo-mosaic technique into 1,500 human faces from around the world. And so, when you get your issue of Nature, look up close.
We did this here, and-- in fact, we had this idea of doing this here, and then we searched the web to find out where in the world was that company that made these photo mosaics. And it was absolutely delightful to discover that this company that I'd heard of but never really paid enough attention to was the product of an MIT graduate and was, in fact, two blocks down, just by the Necco factory.
[APPLAUSE]
And so MIT pops up all over the place. You know, it was great. Anyway, for fun, go try to find the picture of Crick and Watson in this picture.
[LAUGHTER]
There is a Where's Waldo to this. So--
[LAUGHTER]
It's there. All right, so, a little bit about assembling the sequence of the human genome. I realize many of you aren't aficionados for this stuff. I think I'm scheduled to give a proper seminar in April. This, as you can tell, is not a proper seminar. This is kind of fun to celebrate this week.
Assembling the sequence of the human genome. Very briefly, we had to produce-- and we did produce-- this is largely the work of Bob Waterston's group in St. Louis-- overlapping clone maps, across the genome, of large-insert clones, each about 200,000 letters in length, and then to select from that a reasonable tiling path of these large, we called them BACs-- Bacterial Artificial Chromosomes-- these large, 200,000-base-pair clones, break up each of these things into small, shotgun fragments, sequence them, and reassemble them into a draft form that still had some gaps but a modest number of gaps-- perhaps eight, nine gaps, something like that. And then, with computer programs, merge them all together into overlapping sequences, based on where we knew these clones were.
And there's a very nice program, written by a graduate student at UC Santa Cruz, Jim Kent, called GigAssembler, that put all these fragments together in some nice way. And the short version of this is that there's a fair amount of continuity. You start initially with about 22-kilobase contigs of sequence. Merging them, you get about 82-kilobase contigs. Connecting them by links, you're up to about a quarter of a million bases.
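The "break it into shotgun fragments and reassemble" step can be illustrated with a toy greedy overlap merger in Python. This is only a sketch of the general idea under strong assumptions (error-free reads, one strand, no repeats); it is not the software the consortium used, and the function names and example reads are invented for illustration.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap.

    Stops when no pair overlaps by at least `min_len`; whatever is left
    is the set of contigs, and the breaks between them are the gaps.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are wholly contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping "shotgun reads" from a made-up 15-letter clone.
clone = "ATGGCGTACGTTAGC"
reads = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]
assembled = greedy_assemble(reads)
print(assembled)
```

Reads that share no sufficient overlap stay as separate contigs, which is the toy analogue of the gaps left in a draft clone.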
They live within larger clone contigs that we have localized all these fragments into, that extend for more than 2 million bases and connect it to about 5 million bases and then are joined by fingerprints into about 8 million bases. And so, in fact, there's a hierarchical arrangement of each of these fragments connected in. There are still gaps throughout this map, but we know where the vast majority of those gaps are. In fact, almost all of those gaps, 99% of the sequence, is well localized to the human chromosomes by this process.
It ain't perfect. It ain't finished. But, for every letter in the sequence, we've attached an error rate. 91% of the letters in the sequence are accurate to 99.99%. 95% are accurate to 99.9%.
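Per-letter accuracy figures like these are commonly expressed on a logarithmic Phred-style quality scale, Q = -10 log10(error probability). The talk does not name the scale, so treating the quoted accuracies this way is an assumption; the short Python sketch below just shows the conversion in both directions.

```python
import math


def phred_from_error(p_error: float) -> float:
    """Phred-style quality score for a given per-base error probability."""
    return -10.0 * math.log10(p_error)


def error_from_phred(q: float) -> float:
    """Per-base error probability implied by a Phred-style quality score."""
    return 10.0 ** (-q / 10.0)


# Accuracies quoted in the talk, converted to error probabilities.
for accuracy in (0.9999, 0.999):
    p = 1.0 - accuracy
    print(f"accuracy {accuracy:.2%} -> error {p:.0e} -> Q{phred_from_error(p):.0f}")
```

On that scale, 99.99% accuracy is one expected error per 10,000 letters and 99.9% is one per 1,000.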
The order? Well, we've checked it against all the previous maps available. It agrees pretty darn well, although there are the occasional differences in the order. Where we've checked those differences, so far they appear to be problems with the previous maps. In fact, the sequence seems to do a pretty good job of cleaning up the previous maps, rather than vice versa.
But by no means will we assert that any of this is perfect. There's still probably about a year and a half's more work to go to bring this to what we call "finished state." But, already, there's an awful lot of genome there.
The connected pieces of sequence, the sequences are connected into scaffolds. And together they span about 2.92 billion bases, which appears to be about the entire euchromatic portion of the genome. The genome's probably about 3.1 billion bases, the rest of it being heterochromatin around centromeres and other regions.
Now, of the 2.92, we seem to have about 2.69 in our assembly. That's about 92% of the genome seems to be represented in our assembled bases. There are gaps. There's about 149,000 gaps in the sequence. But all but 2,000 of them we have immediately spanned by pieces, because they live well within clones, and they are all going back through a finishing process and getting closed up.
Because of the wonderful mapping approach that was taken, all these pieces happen to live within a mere 2,000 components that have to be put onto the map. And the number of physical gaps, where we don't actually own a piece of DNA covering it right now, is under 1,000. Those are being worked through.
So that's a rough description. In any case, this is all just to say it's not done. But it's pretty far along. It's pretty far along.
Then there were our friends at the company.
[LAUGHTER]
So I keep reading, in the New York Times, how they beat us at this thing. It was really, really fun this week. We got to read the paper.
So they really did try their whole-genome-shotgun thing. They generated a whole lot of data in-house, about fivefold coverage. They report no assembly of it whatsoever. I assume, from the fact that nowhere in the paper have they attempted to assemble their data, that it didn't work at all.
Instead what they did, if you read the paper-- and I invite my colleagues now to read the paper closely-- was they took their coverage, and they took the whole public database, based on seven-and-a-half-fold coverage. They didn't even break it up randomly or anything. They got out the screwdrivers, and they carefully unscrewed it into a perfectly overlapping, tiling path, with exactly equal, big overlaps, tossed it into their computer, and then added their data and attempted to assemble a genome.
60% of the data that went into the computer was ours. It was not experimental data, it was simulated data-- simulated, not even as you do in a laboratory, but with perfect overlaps. Wish I knew how to generate data like that in a laboratory.
Nonetheless, nonetheless, the whole genome shotgun still failed. Despite all that, it covers about 75% of the genome, in contigs that you can really localize. If you ask, how much does it cover, total, here's what it looks like.
They call it-- the word "faux" is their word, in the paper. Their faux whole genome assembly actually has about 88% of the sequence in it, although it is in 119,000 separate pieces that have to be localized, far beyond which you can localize. If you take the biggest ones, you can get about 75% of the genome realistically localized. And about 25% of the genome is in, I guess the technical term is, "tossed genome salad."
[LAUGHTER]
So the assembly that actually got reported by our friends-- although this makes no difference to the New York Times-- was not that whole genome assembly. What they did instead, when they realized they had trouble with this, was they simply said, all right, we'll take all our data, we'll lay it on top of the public clones, in a clone-based assembly, and we'll just assemble the local bits. So, instead of a whole genome assembly, they did assemblies on the order of about half a bacterial log.
That, however, did work. Of course, what it produced was about 2.65 billion bases of DNA. The input to that, of course, was our 2.69 billion bases of DNA.
[LAUGHTER]
And it produced about the same coverage of the genome. So I mean, I don't mean to make a big deal out of this difference. This is basically the same. So I think in fact what they did was they rearranged our sentences within our paragraphs, or something like that. I must say, after putting up with three years of science by press conference, it is a pleasure to read a scientific paper and find out what really happened.
And, I got to say, there were some pretty dark times in the course of the Human Genome Project, when we felt, should we just throw in the towel? Was this really possible to put up with a steady drumbeat of press telling us that we're being silly for being slow and wasteful and all these things? In the end, there would be no Celera paper but for the Human Genome Project. And you know what? We're very proud to have put our data on the web and let them have our data and all that. The only thing I ask is that they probably ought to do a better job of acknowledging, you know, why they have an assembly.
But they're welcome to the assembly! The data's there for everybody, even for those who would like to say that they're competing with us or beating us. That is the one great thing about making your data freely available, is because you have no chance of winning, because you give everything to your competitors, you absolutely are guaranteed to have no chance of losing. And that's important, because we have to win by giving this stuff away.
All right. Inter--
[APPLAUSE]
It's a good lesson. Interpreting the sequence of the genome. So I want to put aside all that stuff and say, well, what's in your genome, anyway? Well, a lot of people around the world combined their efforts and put on all the different annotations we know. There is a browser, at the University of California, Santa Cruz, where we display all the different RNAs that are known, all the ESTs that are known, all the similarities to the pufferfish, Tetraodon, and mouse is going on in the next month or two, all the repeat sequences, and anything else you want to know about the genome. Either the folks at Santa Cruz--
Here's the Santa Cruz browser, if you visit it. You can zoom in or zoom out, according to your preference. You can scroll along the chromosome. You can set it for which things you want to see, which things you don't want to see. And, if you have any suggestions for what else should be there, let us know and it'll get added.
In addition, because we believe in diversity, the European Bioinformatics Institute has its own version of this browser. Same sequence is used, here, but they put it up differently. They have different approaches to how they're annotating it. The National Center for Biotechnology Information also is bringing up a browser. And there'll be a lot of diverse ways to look at this sequence.
There is also, when you get your copy of Nature, this way to look at the sequence. We have a foldout-- in fact, two roll-folds-- that come out of the journal, with the human genome on it, showing the key features like the density of Gs and Cs and SINE elements and LINE elements and all sorts of things. It is, I must say, not at a terribly usable scale. It's at a scale of 3.3 million bases per centimeter, which is--
[LAUGHTER]
--beyond my visual acuity. But, in any case-- but it looks pretty to put up on your wall.
But you notice certain things from it. In fact, going back to it, you can see, if your eye is really good, that the GC content, which is this black line, goes up and down a fair amount. The genes, there are some dense urban neighborhoods of genes, some rural neighborhoods, here, of genes.
If you plot it, the genome is really lumpy. We're very lumpy, compared to other species. There are regions where your GC content's 41%. There are regions where it takes an excursion out by 10% or 15%, for periods of 10 million bases, before it comes back down again. And, you know, even a little bit of statistics will tell you this can't possibly happen by chance. In fact, there are very distinct neighborhoods.
The GC-rich regions appear to be gene-rich. In fact, we have now great data about how gene-rich they are. They also, by the way, correlate to many other things such as chromosome banding pattern. There's a very strong correlation, where the GC-rich regions correspond to the light bands in chromosomes. And, in one of the companion papers to come out in yesterday's issue of Nature, one of the consortium groups put about 8,000 of the BAC clones that had been sequenced onto the cytogenetic map and drew these correspondences, which is useful both for the general statement but also for, say, cancer genetics, where one can take cytogenetic rearrangements and correlate them now to sequence, because you have 8,000 tie points.
Now, you may know that most of your genome is not genes. In fact, only about 1% to 1 and 1/2% of your genome is genes. Most of your genome is repeat elements. They're repeat sequences. They are transposable elements that have inhabited our genome and hopped around, over the course of hundreds of millions of years. From their point of view, that's the purpose of our genome, is to carry them around, because they certainly are the majority occupant. They consider the genes a necessary detail in order to propagate the transposable elements and all that.
But, when you go back to 15 years ago, people said, ah, you know, these repeats, they're boring. I might have actually thought that, too, even five years ago, but, I've got to say, they are unbelievably interesting, these repeats.
Very briefly, almost all repeats in your genome fall into four flavors. One are called LINE elements, one are called SINE elements, one are called LTR retrovirus-like elements, one are called DNA transposons. For example, the LINE elements are one of the great inventions in the history of eukaryotes.
They are a transposable element. They make an RNA that goes off to the cytoplasm and [INAUDIBLE] reading [? frames ?] it gets translated into protein. The protein grabs the RNA that made it, brings it back to the nucleus, cuts the chromosome at a site, and reverse-transcribes it back into the chromosome. It's a completely self-contained package for making an RNA, getting the machinery translated, coming back to the nucleus, and copying yourself. You'd think, the ultimate parasite.
SINE elements, interestingly, are parasites upon the parasites. They encode nothing. All they do is they make an RNA that doesn't get translated. There's no open reading frame. They go off to the cytoplasm. The only thing this RNA's supposedly really good at is grabbing the LINE machinery away from a LINE RNA and getting it to stick it in the chromosome instead.
[LAUGHTER]
It's just-- it's just-- they have a lot of millions of years to work this stuff out. And it's very clever.
Retrovirus-like elements, these things also move by RNA intermediates by a different mechanism. They involve a gag and a pol but not an env gene. But otherwise they look an awful lot like the mechanism of retroviruses. And, indeed, it is thought that these were the precursors to retroviruses, that they learned first how to do their jumping jacks within a cell. And then, after having worked out how to do all this, get reverse-transcribed, whatever, pick up a cellular envelope gene, and depart.
Then there's DNA transposons. They're very different to the other three, because they don't make RNAs. What they do is they make a message that goes-- sorry-- they don't transpose through RNA intermediates. They make a message, it goes off, makes a transposase, the transposase comes back to the nucleus, cuts the chromosome and moves the DNA element elsewhere.
These guys have a very different life than the other guys, because this is actually a pretty crummy lifestyle. If you're a transposase, well, the problem is that your transposase gets made out there in the cytoplasm, comes back to the nucleus, and tries to find who made it. It would like to find the element that made it and move that element preferentially.
Remember, the LINE is smart enough that the proteins grab the RNA that made it. That way, you ensure that you're moving an active element. The bloody transposase comes back, doesn't know who to get! It grabs anybody and moves it.
[LAUGHTER]
And therefore DNA transposons are self-limiting infections. Because, as they begin to build up extra defective copies, they become less and less and less efficient, and their lives are short. OK? So that's what you've got.
Now, a few comments. We can put them all into a family tree. Because every element, when it hops, is a copy of something that was an active element, initially. Its siblings who hopped about the same time all started with the same sequence, and then it began to degenerate.
But we can, by evolutionary, by phylogenetic reconstruction, figure out that these 52 SINE elements are siblings and they arose by virtue of the number of changes that are in them, about, oh, I don't know, 5 million years or 10 million years. And every single element in the genome can be put into families, and those families can be dated.
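As an illustration of the dating idea (a minimal sketch, not the method used in the paper): if the copies of one family are assumed to have started from the same sequence and to pick up changes at a roughly constant neutral rate, then their average divergence from the family consensus, divided by that rate, gives a rough age. The function names, the toy sequences, and the rate of 2e-9 substitutions per site per year are all assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical neutral substitution rate (per site per year); an assumption.
ASSUMED_RATE = 2e-9


def consensus(copies):
    """Majority base at each column of equal-length, pre-aligned copies."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*copies))


def mean_divergence(copies):
    """Average fraction of sites at which a copy differs from the consensus."""
    reference = consensus(copies)
    mismatches = sum(
        sum(base != ref for base, ref in zip(copy, reference)) for copy in copies
    )
    return mismatches / (len(copies) * len(reference))


def estimated_age_years(copies, rate=ASSUMED_RATE):
    """Rough family age under a simple molecular clock: divergence / rate."""
    return mean_divergence(copies) / rate


# Toy, made-up family of four aligned copies (not real SINE sequence).
toy_family = ["ACGTACGTAC", "ACGTACGTAA", "ACGAACGTAC", "ACGTACGTAC"]
print(consensus(toy_family))
print(mean_divergence(toy_family))
print(estimated_age_years(toy_family))
```

With the toy family, 2 mismatches over 40 aligned bases give 5% divergence, which at the assumed rate corresponds to roughly 25 million years. Real analyses would also have to correct for multiple hits at the same site and for rate differences between sequence contexts.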
So, in fact, we've done that for all 3 million elements of the genome. And those dates are really fascinating. In particular, you can find out that most of our repeats are very, very, very old, much older than you see in flies or worms or mustard weeds. This tells, by measure, by percentage of interspersed repeats. Another way to put it, though, is for the human genome the vast majority of repeats in your genome hopped before mammals, hopped before the mammalian radiation. There are fossils pre mammal.
And so they're quite old things. But here's a shocker. Despite the fact that they've been hopping for a billion years, despite the fact that this is, you know, the thing that's caused your genome to form, at the size it is, the rate of transposition is plummeting dramatically. In fact, in olden days-- is this working? Yeah.
In old days, here-- this is old, moving to now-- the rate of transposition was here. It went up. But lately, the last 40 million years, it has been in a plummet. In fact, DNA transposons are extinct, completely extinct. These retrovirus-like guys are almost extinct, if not extinct. There is only a single family left that has hopped since our divergence from chimpanzees.
And we can only detect three possible active elements that have open reading frames that are full-length. If those guys have [? missensed ?] mutations that inactivate them, that's it. They're dead. That's the end of the line for LTR retrotransposition.
So there is a serious ecological crisis, with regard to the, you know, endangered species of transposons in our genome right now.
[LAUGHTER]
And we don't know why. But we do know that this is not happening in the mouse. Interestingly, the mouse does not show the same pattern at all. Curiously, its rate of transposition is not-- and there are 11 happy families of LTR retrotransposons, hopping around the mouse genome, every day. What's gone on?
We don't know. We have some interesting speculations having to do with the fact that it may be that our genome, that our population, the [? hominid ?] lineage, has a very small population size, and we're much more subject to the effects of genetic drift that could let you lose the small number of active elements. So that's my own guess, but we have no evidence of this. Rodents tend to have much larger population sizes and are much less subject to genetic drift. But we're going to have to look at a bunch more species to know what's gone on.
Here's another weird fact, just to show you what you can learn by staring at a genome. You think genome sequences are boring? Wow. Here's something really weird.
If you were a transposon, where would you go? Would you hop into gene-rich, GC-rich DNA? Or would you hop into gene-poor, AT-rich DNA?
Well, if you were a smart transposon, trying to get along with your organism, it'd be sensible to hop into gene-poor, AT-rich DNA, where you'd do the least damage. And, in fact, that's exactly what the LINE element does. LINE's endonuclease cuts the chromosome at AAAAT. And, in fact, that accounts for the tremendous enrichment of LINE elements in AT-rich, gene-poor regions. Seems a very sensible sort of detente between the organism and the genes, the organism and the transposons.
SINE elements, however, show exactly the opposite pattern. They are highly enriched near genes. Now, A, that doesn't make a lot of sense. Well, why should they do that? And B, far more importantly, from the point of view of biochemical mechanism, how do they accomplish that? Because I thought I told you they use the LINE endonuclease to move. Why don't they get stuck in at AAAAT?
Two hypotheses. Number 1, somehow that RNA is so smart that it could reprogram the endonuclease to put it elsewhere. I don't know how to do that. That'd be a fancy bit of protein machinery, protein engineering, but maybe.
Alternative hypothesis-- they actually go in in AT-rich regions, but evolution favors the retention of those that are near genes in GC-rich regions. But the latter hypothesis that I've just stated would imply a function. Well, we now know the answer, because we can look at the genome and the genome will tell us the answer.
How do we know? We can take all those SINE transposons and sort them into the newborn ones, the adolescent ones, the middle-aged ones, and the old ones. The newborn ones have a distribution that looks just like the LINEs in AT-rich DNA. As they get older and older and older, there's a 13-fold enrichment for those near genes.
While there are some alternative possible interpretations for that, we strongly favor the interpretation that there's actually selection going on for this. And there was a hypothesis proposed two years ago, by [? Schmid, ?] that's really fascinating that proposes the mechanism. Turns out that these SINE RNAs bind in a sequence-specific fashion to a specific protein kinase involved in the regulation of protein translation. Different organisms have different SINEs, but they all bind to the cognate protein kinase. Mutate them slightly, they don't.
If you wanted to regulate protein translation, how would you do it? Would you use a protein-- oh, sorry, one more fact. SINEs, unlike other transposable elements, are expressed in somatic cells under stress; others, just in the germline. Turns out, if you wanted to regulate protein translation under stress, would it be smart to use a protein? Nah, because you'd have to translate it. Right? And that'd be kind of dumb.
You'd use an RNA. And you'd want to have a lot of it, if you wanted to work quickly under stress. And you'd want it to be located somewhere in open chromatin, where you could express it easily.
In other words-- of course, this is backward reasoning, but it might be quite reasonable to use a high-copy-repeat RNA, located near genes, if you wanted to regulate something as a stress response. In any case, this was [? Schmid's ?] hypothesis. He had made it two years ago. And the evidence we have now, that there is a rapid, within 25 million years, reshaping of the distribution of SINEs I think provides a very interesting bit of potentially corroborating evidence for that.
Here's another interesting fact. You can take transposons that were born at the same time and use them for epidemiology. You can use them as if they were a birth cohort and say, if a transposon landed on the Y chromosome versus the X chromosome, how does it do? Is that right? Yeah, that's right. How does it do?
Well, if you look at it, transposons born at the same time that land on the Y have lots more mutations than those that land on the X. And from this you can infer-- and, in fact, back out-- that the rate of mutation in sperm-- because Y chromosomes always go through male meiosis-- is twice as high as in eggs. Because X chromosomes spend two thirds of their time in female meiosis.
So you can, in fact, work out the differential mutation rates, by comparing things on the X and the Y chromosome. In fact, in this particular case, we have the pleasure of confirming work of David Page, about six months ago. There had been a bunch of previous studies, trying to do this, that got lots of wrong answers. And David made a really nice estimate of this, about six months ago, where he said, we really think it is a twofold difference, based on one particular region, about four million years ago. And we can confirm this now, based on about 100 times as many data, spread out over 40 million years, that David's suggested ratio of male-to-female mutation is spot-on.
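The "backing out" step can be sketched as simple arithmetic. This is an illustrative model, not the analysis from the paper: it assumes a Y chromosome spends all of its time in males, an X chromosome spends one third of its time in males and two thirds in females, and mutations accumulate in proportion to time spent in each sex. The function names and the symbol alpha (male-to-female mutation ratio) are introduced here for the example.

```python
from fractions import Fraction


def y_over_x(alpha):
    """Expected Y/X mutation ratio for a given male-to-female ratio alpha.

    Y rate is proportional to alpha; X rate to (1/3)*alpha + (2/3)*1.
    """
    return 3 * alpha / (2 + alpha)


def alpha_from_ratio(ratio):
    """Invert the relation: recover alpha from an observed Y/X ratio."""
    if ratio >= 3:
        raise ValueError("under this model the Y/X ratio must be below 3")
    return 2 * ratio / (3 - ratio)


# A twofold male excess predicts 1.5 times as many changes on Y as on X.
print(y_over_x(Fraction(2)))
# And an observed Y/X ratio of 3/2 backs out to alpha = 2.
print(alpha_from_ratio(Fraction(3, 2)))
```

Note that under these assumptions the Y-to-X ratio can never exceed 3, however large the male excess, which is one reason the estimate needs a lot of data to pin down.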
There are zillions of more things I could tell you about transposons. I won't. You know, I'd tell you about the fact that we get genes from transposons. At least 50 genes in the genome, we can recognize as gifts from transposons. We can also recognize the most recent ones to hop, the ones that are still variable in our population and could be used to trace the history of our population.
But let me switch to genes, because our genome does contain genes, and they are of some interest.
[LAUGHTER]
We have been attempting, by looking at the sequence, to identify the gene content of the human genome. Golly. That turns out to be a lot harder than bacteria or yeast or worms or flies, because there's a signal-to-noise problem. Our genes are only 1 and 1/2% of our genome, whereas in flies and worms and yeast they're a big proportion of a [? genetic ?] locus is the coding regions. 50%, typically. Whereas in us it's a trivial fraction.
And so our ability to detect these is nowhere near as good, and the gene set currently generated and described in the paper is nowhere near as good. But, nonetheless, we have a pretty good handle on how many genes there are, and we certainly do have good chunks of most of them.
And the answer is, there's really only about 30,000, 35,000 genes. I'll say 35,000 genes, as a rough guess. We've run it through all sorts of estimation techniques, statistical estimation techniques. We've also done direct counting programs, including one written by Chris Burge here called GenomeScan, which is a successor to his famous GENSCAN program, for finding genes within genomes, as well as a bunch of others.
And all the evidence we get points to the same conclusion-- about 35,000 genes. So I owe an apology, and I wish to make that apology right here and now, to about seven years' worth of students in my 7.01 course, because I have been reporting the official number, that there are supposed to be 100,000 genes in the genome. And so I don't know that you're entitled to your money back, but, nonetheless--
[LAUGHTER]
Check with the registrar on that. But, nonetheless, I now wish to correct that misimpression. Actually, why did we ever think there were 100,000 genes? That turns out to be very interesting. We had to track back to what was the origin of this round, nice 100,000-gene number.
The origin turns out, very clearly, to be down the road. It's Wally Gilbert. Wally did a beautifully simple calculation, in the early 1980s, that a typical human gene looked like it was about 30,000 letters. The genome was about 3 billion letters. Divide one into the other, you get 100,000.
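The back-of-the-envelope division can be written out, along with how far it lands from the roughly 35,000 genes quoted earlier in the talk. This is only a restatement of the arithmetic; the constant and function names are made up for the example.

```python
import math

GENOME_LENGTH_BP = 3_000_000_000   # "about 3 billion letters"
TYPICAL_GENE_LENGTH_BP = 30_000    # "about 30,000 letters"
CURRENT_ESTIMATE = 35_000          # gene count quoted earlier in the talk


def back_of_envelope_gene_count(genome_bp, gene_bp):
    """Divide genome length by typical gene length."""
    return genome_bp // gene_bp


old_estimate = back_of_envelope_gene_count(GENOME_LENGTH_BP, TYPICAL_GENE_LENGTH_BP)
ratio = old_estimate / CURRENT_ESTIMATE
orders_of_magnitude = math.log10(ratio)

print(old_estimate)
print(ratio)
print(orders_of_magnitude)
```

The old figure comes out a little under three times the current estimate, which is less than half an order of magnitude on a log scale.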
And it's so round, it made it into the textbooks and just refused to go. And, of course, it was only meant to be a back-of-the-envelope estimate. If you tell this to Wally, the reaction is, hey, you know, we were basically right. Because-- Wally's trained as a physicist, and, to a physicist, a back-of-the-envelope estimate that's within a half an order of magnitude is just fine. The problem is--
[LAUGHTER]
--that biologists think that a factor of three means something. Right?
[LAUGHTER]
And it really doesn't, not by these estimation techniques, anyway. But there are a lot fewer genes. We've taken the genes. We've sorted them into functional categories, and things like that. We can say a number of things about them.
The exons tend to be smaller, the introns-- the intervening spaces-- bigger. There seems to be more alternative splicing, about twofold more alternative splicing, in human genes, it appears, than flies or worms. So that, in fact, although we don't--
The depressing fact is that the total number of genes we have, at 35,000, is only about twice as many as a nematode worm, C. elegans, or only about twice as many as the fruit fly Drosophila melanogaster. And, on the whole, we like to think that we're a lot more complicated than that. And so this has provoked tremendous concern, particularly in the past week, about people worrying about how we can be so complex with such an economy of genes.
One answer is, well, we make more proteins out of them. We do more alternative splicing. That clearly does seem to be right. Oh, another fact to mention is that there is, indeed-- because now we can tell-- a huge, huge bias to the GC-rich regions, in case you needed any confirmation about this. About a tenfold enrichment, there.
But, when we look at our genes, we ask, how are they new? How are they different? What explains what makes vertebrate genes?
Well, first off, we are not that innovative. Most of the domains in our proteins, they're old. They're in invertebrates. Only about 7% of all the domains that we can find in vertebrate proteins appear to be new to the vertebrate lineages. That is a surprise.
There's only one enzyme that vertebrates seem to have invented. That's pancreatic ribonuclease. But what we did do, to get more genes, was we created many more architectures. We put together the old pieces in many new ways. And we have considerably more architectures, any given domain appears with many more genes in a human than it does typically in fly or worm.
There's a particular emphasis of this in extracellular molecules. And I expect this has something to do with the greater complexity of our circuits, is that we connect the parts in many different ways.
In addition, another bit of our rather derivative innovation is that, when we like something, we just expand it dramatically. Olfactory receptors. The human genome shows about 1,000 olfactory-receptor loci-- 1,000. We actually count 906, at the moment, that we've got well in hand. And then there are parts missing, and all that, but it's about right.
Shows that vertebrates show tremendous interest in smell. It's one of the crowning achievements of vertebrates, is the spectacular elaboration of smell receptors. However, I'm sorry to report that, having done this, the hominid lineage decided that it didn't really care about smell receptors. And two thirds of our smell receptors are pseudogenes, we can clearly see from the sequence, whereas mouse keeps its smell receptors in pretty fine working order, by comparison. We seem, after all this investment in the nose, to have decided to settle on the eyes, instead, as a more interesting sense. And you can see it in what we tidy up in our genome.
Immunoglobulin domains, obviously a big part of the genome. We discuss some of that. Intermediate filaments. There are 127 intermediate filament proteins, including 111 keratins. If you want to describe a vertebrate, you'd say it's something that really likes epithelial surfaces, all sorts of specialized epithelia, for these intermediate filaments.
Growth factors. We're into growth factors. Big expansions [INAUDIBLE] growth-factor families, like TGF beta. So, vertebrates are things that like to smell, have immune systems, specialized epithelia, and communicate and develop with lots of growth factors. That [INAUDIBLE] be us.
Also, we appear not to be able to take credit for that much-- for all of our genes. Actually, about 223, is our count, genes, that we find in the human, there is no homologue, outside of vertebrates, in the rest of the eukaryotic world, but there are homologues in bacteria. That is, we have a close relative in bacteria but nothing outside of vertebrata.
The likely interpretation of this is that these represent horizontal gene transfers that occurred early in the vertebrate lineage. Bacteria can, in fact, mate with nonbacterial cells and transfer DNA to them. It's usually unproductive. But, in fact, possibly back when we were a fish, some bacteria got with some gamete and passed some DNA. And this isn't a frequent occurrence, but we count about 223 of these.
It's, of course, conceivable that some of these are hiding somewhere in invertebrates we haven't seen yet. And we may have to rethink it. But, right now, the most likely interpretation is that these are indeed horizontal transfers. They're all enzymes, so we didn't invent them. They're not characteristic of what gets invented in vertebrates. One of them, for example, is monoamine oxidase, important in psychiatric disease.
Now, to extract all of the information out of the genome, we're going to need to line up with lots and lots more species. And so already the international consortium is at work on sequencing mouse, rat, zebrafish, pufferfish, and various other species. In a partnership between Whitehead, Wash. U, and Sanger, we've been generating random coverage of the mouse genome, to be able to pick out exons by cross-species homology. The goal is to get about threefold coverage. We're already at twofold coverage, so I think the goal of threefold coverage by the spring will be met, and we'll be able to line up and, I think, identify many of the exons from the mouse rather directly.
But I think we're going to be able to get a lot more than just the exons. Because, in fact, I think we'll be able to get much of the regulation this way. I'll give you just one neat example that emerges from looking at the genome sequence.
We have a very important set of genes involved in development, called the HOX clusters. These HOX genes occur in four big clusters-- HOX A, B, C, and D-- that all are homologues to things that arose in insects. We've sequenced them now, not just in the human, but we've done them to completion in the mouse, the rat, and the baboon.
When you line them up-- and this is work of [? Ken Doer ?] at the genome center-- they line up spectacularly. I don't just mean "good," which is what you'd expect. But, spectacularly. That is, over 100 million years of evolution, they fall perfectly on a diagonal. There's been no insertion or deletion, to speak of, across these loci.
And, when you look closely, you see why. Remember, I told you half of your genome, 50% of your genome, is repetitive? At the HOX cluster, it's 1%. The HOX cluster has essentially no repeats. It admits no mutation to repeat, either in the human or the rodent lineages.
There's no target for changing. That means that spacing and content are very tightly conserved. And that suggests that this thing is chockablock full of regulation.
And, sure enough, when we line up the different species, we can pick out more than 100 apparently conserved regulatory features there. I don't think this is particularly special to HOX. I think HOX is very special in the way it's all tightly conserved, and spacing, and all that. But I think, in fact, if we can get the correct alignments throughout the rest of the genomes, we will be able to pick up lots and lots of regulation that occurs.
Finally, I'll mention very briefly that there are a lot of other applications of this that are happening and that also are represented in the papers that appeared yesterday. To apply the sequence to humans in medicine, we really need to understand not just all human sequence but all its variation. We need to understand what's the nature of variation in the human species, and what's the nature of those variants that cause disease.
Well, in fact, over the past couple of years, particularly at the genome center, we've been characterizing that. We've had a project to not just sequence the genome but begin to collect variants. We did a project, maybe three years ago, which at that time we called, foolishly, "Large-scale survey of variance in the human genome," where we reported about 4,000 single nucleotide differences and could report that the frequency was about 1 in 1,000-- actually, [? about ?] 1,300-- that typical genes had only two or three or four common variants in the population, and that, in general, we looked-- despite being 6 billion people-- like we were really a population of about 10,000.
And that's right, because we are a population of 10,000. We all trace back to a small founding population in Africa, a mere 5,000 generations ago. And we show it from our DNA variation.
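One standard way to connect a variation rate of about 1 in 1,300 to a population of about 10,000 is the neutral-model relation that nucleotide diversity is approximately 4 times the effective population size times the per-site mutation rate. The talk does not say how the figure was derived, so the sketch below is an assumption about the reasoning, and the mutation rate of 2e-8 per site per generation is a hypothetical round value chosen for illustration.

```python
def effective_population_size(diversity, mutation_rate):
    """Solve diversity = 4 * N_e * mu for N_e (neutral diploid model)."""
    if mutation_rate <= 0:
        raise ValueError("mutation_rate must be positive")
    return diversity / (4.0 * mutation_rate)


OBSERVED_DIVERSITY = 1 / 1300    # about one difference per 1,300 bases
ASSUMED_MUTATION_RATE = 2e-8     # per site per generation; an assumption

n_e = effective_population_size(OBSERVED_DIVERSITY, ASSUMED_MUTATION_RATE)
print(round(n_e))
```

Under these assumptions the answer lands a little under 10,000. A different assumed mutation rate would shift the estimate proportionally, so the result is an order-of-magnitude statement rather than a precise count.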
This is, of course, very important, because, if we have a very little amount of variation, we stand a chance of characterizing every variant in the genome, maybe three or four variants for each of 30,000 genes, and testing each one for correlation with disease. [? Finding ?] doing human genetics may become a simple matrix problem. Write down all the variants along the top, all the diseases along the side, and fill in the matrix, and just see. That would be a lot of fun.
Well, if so, in order to do that, we need various techniques. I won't fuss over these. There are techniques called "association studies" where you directly test variants. Linkage disequilibrium studies, where you indirectly test them by ancestral haplotypes.
But what you need for all of that is to have zillions and zillions of variants. The goal is to have such a dense map of single nucleotide polymorphisms-- we call them SNPs-- that you can comprehensively test for disease association anywhere in the genome, by looking for the signatures of ancestral haplotypes enriched in people with disease. But you need a lot of them! 4,000, that we published three years ago, is not much.
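The "fill in the matrix" picture of direct association testing can be sketched as follows. This is a toy illustration only: the counts are invented, the disease and variant names are placeholders, and the statistic is a plain Pearson chi-square on a 2x2 table of carriers versus non-carriers in cases and controls, with no continuity correction and no correction for testing many pairs at once.

```python
def chi_square_2x2(case_carriers, control_carriers, case_noncarriers, control_noncarriers):
    """Pearson chi-square statistic for a 2x2 carrier-by-status table."""
    a, b, c, d = case_carriers, control_carriers, case_noncarriers, control_noncarriers
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # an empty row or column carries no information
    return n * (a * d - b * c) ** 2 / denominator


def association_matrix(counts):
    """Diseases down the side, variants along the top, one statistic per cell."""
    return {
        disease: {variant: chi_square_2x2(*table) for variant, table in row.items()}
        for disease, row in counts.items()
    }


# Invented counts: (case carriers, control carriers, case non-carriers, control non-carriers)
toy_counts = {
    "disease_1": {"variant_A": (30, 10, 20, 40), "variant_B": (25, 25, 25, 25)},
    "disease_2": {"variant_A": (24, 26, 26, 24), "variant_B": (35, 15, 15, 35)},
}

for disease, row in association_matrix(toy_counts).items():
    print(disease, row)
```

In this toy matrix, variant_A stands out for disease_1 and variant_B for disease_2, while the other cells sit at or near zero. With tens of thousands of genes and several variants each, a real version of this matrix would need a stringent significance threshold to account for the number of cells tested.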
So, happily, a great example of how industry and academia can work together, the SNP Consortium was founded. A bunch of companies and a bunch of academic centers got together and said they would pool their resources and find 300,000 variants across the genome and put them all freely in the public domain without cost or any kind of patent protection. This got going about two years ago, and it has been wildly successful. It's a great example of how industry and academia can work together to produce things that are helpful to all concerned.
In addition, a bunch of variants come from just taking the sequence and taking our clone-based map and looking at the overlaps. So the number of SNPs, the number of variants in the genome, has grown from when we reported, right over here, 4,000, and we were very impressed with ourselves, to basically the work of three centers-- Wash U, Whitehead, and the Sanger Centre-- vvvoomp! And appearing yesterday in Nature was the companion paper reporting 1.42 million variants in the human genome.
We now have a variant, on average, every 2 kilobases in the genome. We will push forward, and we'd like to get about 4 million variants or so. But already we are just up to our ears in variants across the genome. And we stand a very good chance of finding all of the ancestral segments that occurred in the genome and using them to trace disease.
I won't tell you about-- I'll skip over a couple slides, here-- that we've also measured how big those segments are, and that varies with populations. But in northern Europe, they're quite big-- on average, about 100 kilobases in total, maybe 50 or 60 in either direction, from a point. In certain African tribes, they're much smaller because the ancestry is different in terms of the age of the family of that population. But we're going to be able to understand those and use these for disease studies. I'll skip over those factoids.
In addition, and I'm not going to talk anything about it, this work on the genome is having a huge effect on cancer studies, as well. It's driving us in cancer studies, and many other people, to look at all the genes simultaneously in the cell. And I think the reporting now of the full list of genes means that we're going to be able to make really good microarrays, to study the pictures of gene expression in cells, finally, with, for the first time, comprehensive arrays.
All right. Well, the goal has been to try to create for biology what the chemists have had for the whole last century-- a periodic table. To have the parts list. [INAUDIBLE] folks call this stuff the "holy grail" and stuff. I'm not convinced by any of that.
This is just a parts list. But a parts list is a very useful thing. We are now closer than I had thought we might ever get to such a parts list.
We don't have fully all the nucleotides, but we will. Give it another year or so. We don't have fully all the genes, but we have the vast majority covered, to, the vast majority, their lengths. And give it a year or two, and that will be there. And we also have, by the way, the isotopes, the variants on the elements-- namely, those SNPs.
I think it's probably fair to say that you could already start planning your experiments, counting on the fact, now, safely, that all of that will be there, that the periodic table will be up on your wall in the next year or two. Won't be perfect yet. It'll take--
Actually, the periodic table wasn't perfect, when it first went up on the wall. They spent 20 years filling in all the holes. Right? It didn't cost them quite as much money, back then, but they spent a lot of time, filling in all the holes, right?
[LAUGHTER]
And so none of these things are ever perfect, but the conceptual framework there is there. I think it'll have a big effect on us.
Before I close, though, I have to reemphasize what I said at the beginning. This was different than any other project in biology. When you stand up usually, and you give a talk in biology, you talk about your work or your work and a couple of graduate students or some postdocs or something like that.
This is a project where there's nobody who can take personal credit for this. This is something that happened because of the way everyone agreed to pull together and share the work and share the credit. The group that did the most to bring this to fruition is shown here, on the snow. This is the Whitehead Genome Center, over in the park, in Charles Street.
This was the only place we could find we could actually take a picture. There was no indoor location we could actually manage to get the whole genome center and still have a camera stand back far enough to get them. But this is the folks.
And one could not ask for a better set of colleagues and friends to have worked with, over such an extraordinarily exciting, at times stressful, but always exhilarating time. And so I am in tremendous debt to my colleagues, here. I will not name the names of all 250 people associated with this.
I will in particular single out Lauren Linton, who did an amazing job at mounting a scale-up for the center, in the course of this project. And I do particularly want to thank Chad Nusbaum, Bruce Birren, and Mike Zody, who have been the assistant directors of the center, who have managed all different parts of this effort. But there are heroes, throughout this entire picture.
In addition, there are groups around the world that contributed tremendously. This is a picture of the first Bermuda meeting that occurred. It was in Bermuda, really, truly, amongst the different international centers. And I owe great debts to colleagues in different countries, and we all are still working together.
But the only really appropriate credit slide for this talk is that. This is the first talk where you really just have to put up the whole globe and say, we done it together. And it was really, really satisfying, to go to Washington this week, with so many of our friends, and realize that, for something that we had started-- in my case, many cases-- 15 years ago, we actually had done it-- not all quite finished, but we'd done it well enough that we knew there was now no doubt it was all going to get done.
And I confess I tend to be pretty inured to these sorts of things. I'm not, you know, one to get soppy about this stuff. But, by the end of day on Monday, I had realized that this is kind of the surprise, in a way, is that only in retrospect do you realize how history ends up happening. It happens sort of just by doing your work every day, on something worth working on. And, 15 years later, we all who have been working together, here in this community, on something, just working on the right next step and the right next step, realize that, over 15 years, collectively, we all did manage to make quite a difference.
And so it's really fun to come, this week in particular, and just share it with the community. Because it's all of ours. Thanks very much.
[APPLAUSE]
[INTERPOSING VOICES]
SHARP: Do you have many of your colleagues here?
LANDER: There are some, at least, right over here.
[INTERPOSING VOICES]
LANDER: Yeah, I don't know how many [? of them, but ?] yes.
[APPLAUSE]
SHARP: In the tone of Eric's comments, I'd like everybody who participated in the genome-center work who's in the room to stand up. And let's thank them, too. Come on!
[APPLAUSE]
LANDER: [INAUDIBLE]!
SHARP: Eric would be pleased to take a few questions. This is a big crowd, but we can take a few questions.
LANDER: Not from people who I work with.
[LAUGHTER]
SHARP: No, not from people who--
LANDER: You can ask me tomorrow!
[LAUGHTER]
SHARP: A few-- please.
AUDIENCE: [? How many ?] [? of the genes are ?] involved in intelligence?
LANDER: The fraction of genes involved in intelligence.
[LAUGHTER]
It depends on the definition.
SHARP: [INAUDIBLE]
LANDER: If you mean the fraction of genes which, when knocked out, disrupt your intelligence, it's probably like we've always taught-- half, or more. If you include being dead as being an insult to your intelligence--
[LAUGHTER]
--then it's a very large fraction. If you want to know what fraction of the variation in intelligence in the population is attributable to variation in genes, we haven't a clue. And my own guess is-- I mean, this is the usual question of, how much genetic determinism of intelligence there is-- there's nothing in our sequence that tells us much about that.
And what I do think I find most interesting in our sequence, because I think it's a point worth making, is that the very fact that we are so extraordinarily similar, across the species, that we're 99.9% identical, only one difference in 1,000-- this holds across the whole planet-- and that we come from such a small founding group, tells me that the population genetics have not changed in 5,000 generations but we ourselves have changed opportunity [? and ?] society and what's possible and what we think men or women or people from different groups can do, so dramatically. So I must say, realizing how the gene pool has not undergone significant reshaping, in that huge time, and yet society can come out so different-- that is to say, what we do with our brains and what we do with our intelligence has so many degrees of freedom, there, that I'm not too worried about what the genome is ever going to tell us about limits on what we can possibly do, because we have tremendous evidence that, within whatever limits there are, we've been able to have huge variations, [? and ?] [? a lot's ?] within our control.
SHARP: Over here? Oh, here, first.
LANDER: Sorry, uh, yeah.
AUDIENCE: [INAUDIBLE] 30,000, 40,000 genes. What about all those companies that are [INAUDIBLE] marketing, you know, we have a database [INAUDIBLE].
SHARP: Repeat the question.
LANDER: So, given that there appear to be only 30,000, 40,000 genes, the question is that there are, for example, ads in several journals that say, subscribe to the X database, which has 120,000 genes.
[LAUGHTER]
So, let's see-- the markets have closed. I could say, well, one response potentially would be to sell. Right?
[LAUGHTER]
Another response would be that the New York Times actually has begun to probe Incyte Pharmaceuticals, now called Incyte Genomics, on this question. Because the ad I just quoted is from Incyte. And, according to the New York Times, the chairman of Incyte-- I think it was the chairman-- the president of Incyte said to the New York Times this week that, oh-oh, we actually meant 120,000 distinct transcripts, but they actually come from 40,000 genes. So, in fact, that one, if I take correctly what was in the New York Times, appears to have just been retracted. And it was just the detail, that they meant transcripts not genes.
But Bill Haseltine at Human Genome Sciences is rock-solid sure that he's got 120,000 genes and that he's not giving any ground on that. And so I propose the following really fun challenge. We get to specify 1% of the genome, and anybody who wants to disagree produces the genes they claim they have in the 1% that we picked. We'll check them together and see if they're right. If they have three times as many genes in that 1%, then we'll believe that they've got gold in the other 99%. But if nobody wants to take us up on it, it's tough to deal with the claims. Yes.
AUDIENCE: [INAUDIBLE]
LANDER: Future priorities for other species. Well, what this is telling us is that sequencing is a tremendously powerful way to get information you never imagined before. Like, all these repeats, all this stuff. Clearly, experiment is never going to go away. Experiment is our primary driving force. But it's a great way to form hypotheses.
So my sense is, already mouse is underway, rat is underway, zebrafish is underway, two different pufferfish are underway. I believe that several of the organisms along the lineage leading to vertebrates, like [? Ciona ?] and such, will be sequenced in the next couple of years, in order to figure out what were the key innovations. Chimp is a very interesting debate. I can tell you already, there's going to be 50 million differences with chimp, because we know what the percent difference is. I just don't know which ones are going to be important.
And there's an active debate in the community as to whether we should race to sequence chimp or whether we should figure out what in the world we'd do if we had the 50 million differences. And I think what'll happen is there'll probably be light coverage of the chimp, to see if we can interpret those differences. Most of them are probably just noise. Right? And, if we can figure out, is it 10 that mattered, or 10,000 that mattered, how are we going to know before we go overboard?
I have a feeling that, as the cost of sequencing comes down-- already, yeasts! And it's no big deal to do bacteria. Right? And I think it's going to soon be no big deal to do fungi. And I can't see why we wouldn't just continue to sequence lots of organisms, as the cost comes down, because it's telling us what is the greatest lab notebook in history. Evolution, it turns out, has been taking notes for 3 and 1/2 billion years, in its lab notebook.
Now, it did not have a good course on research practices, because--
[LAUGHTER]
--because it only takes notes on the successful experiments. It throws away the unsuccessful experiments.
[LAUGHTER]
But we got the notes on the successful experiments. I mean, obviously, we don't want to criticize it, because it got started before these procedures were required of graduate students, et cetera.
[LAUGHTER]
But it's a pretty good lab notebook.
SHARP: Over here, there's--
LANDER: Yeah.
SHARP: --and [? then-- ?] question. We will end in a couple more questions.
AUDIENCE: Yeah, I may have missed a quote, when you explained how you were going to apply, once you have the sequenced genome, how do you use that to-- how would you apply that knowledge? Like, once you have the sequenced genome, how do you use it? I'm not sure I--
LANDER: Okay. So, what I didn't really say, and I apologize: in many different ways. One way, make an expression array, where you put down a piece of DNA complementary to every gene, take the messengers from a cell, wash them over, and see which genes are on and off. By having a comprehensive list of all genes, you know you have a description.
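As a rough illustration of the expression-array idea, here is a minimal Python sketch. The gene names, signal values, and threshold are all made up for the example, and real array analysis involves normalization and statistics that are left out here.

```python
# Toy model of reading an expression array: each spot holds DNA
# complementary to one gene, and the signal reflects how much
# messenger RNA from the cell bound to that spot.
# All names and numbers below are hypothetical.

def call_expression(signals: dict[str, float], threshold: float) -> dict[str, str]:
    """Label each gene "on" if its signal meets the threshold, else "off"."""
    return {gene: ("on" if value >= threshold else "off")
            for gene, value in signals.items()}

spot_signals = {"GENE_A": 950.0, "GENE_B": 12.0, "GENE_C": 430.0}
calls = call_expression(spot_signals, threshold=100.0)
genes_on = sorted(gene for gene, state in calls.items() if state == "on")
print(genes_on)
```

Because the array carries a spot for every gene on the list, a gene missing from the "on" set can be read as off rather than simply unmeasured.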
Here's another way. Produce full-length cDNAs for every gene, and actually express the proteins in appropriate places. If you're a pharmaceutical company, test a small molecule that has an interesting biological action, but you don't know what its target is, for binding against all 30,000 genes. And, if it has a side effect, test it against all 30,000 genes to find out what other things it may bind to.
Look at all the interactions between those things. I mean, in some sense, it's asking, why have a periodic table? After all, it doesn't define the experiment. The genome sequence doesn't define the experiment.
What it does, though, is it converts the search for every answer into a search in a finite list. It means, with the periodic table, I can look it up on the mass spec. If I know what I saw go by, and its mass and a few properties, I know what it was.
Increasingly, in biology, we're going to be in a world where, when we see something happen, we can know what it was that just happened, what particle went by, what protein it just was. Whereas, in the past, almost any phenomenon we were studying, we had to spend months or years figuring out what just did that. And I think what it is is the great boon to the graduate student or the postdoc, because usually the graduate students and postdocs spend most of their time tracking down what the part was. Which is kind of boring, compared to the phenomenon.
Now they can study a phenomenon. And the tools aren't all there, but four or five years from now, maybe, it'll be much easier to describe that phenomenon [? in ?] working parts and get on to real experiments. That's what it's for.
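The periodic-table analogy, turning an open-ended search into a lookup in a finite list, can be sketched as follows. The masses are standard atomic weights for a handful of elements; the function name and the tolerance are assumptions made for the example.

```python
# Finite-list lookup in the spirit of the periodic-table analogy:
# given an observed mass, return every entry within a tolerance.
# The tolerance and function name are hypothetical choices.

ATOMIC_MASS = {
    "H": 1.008,
    "C": 12.011,
    "N": 14.007,
    "O": 15.999,
    "Na": 22.990,
    "Fe": 55.845,
}

def identify(observed_mass: float,
             table: dict[str, float] = ATOMIC_MASS,
             tolerance: float = 0.05) -> list[str]:
    """Return the names of all entries whose mass is within the tolerance."""
    return sorted(name for name, mass in table.items()
                  if abs(mass - observed_mass) <= tolerance)

print(identify(15.99))   # matches oxygen in this small table
print(identify(100.0))   # nothing in this small table matches
```

The same pattern would apply to a complete list of genes or proteins: once the list is finite and known, identifying what was observed becomes a lookup rather than months of work.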
SHARP: [INAUDIBLE]
AUDIENCE: [INAUDIBLE] what you think it's going to happen now, how many--
[INTERPOSING VOICES]
LANDER: We sort of-- one gene, one protein. What happens to that? Well, of course, "one gene, one protein" is the official doctrine in 7.01. But, even in 7.01, we did acknowledge that that wasn't always the case and that there was alternative splicing. I mean, viruses have been good at this for a long time, of being able to put together other transcripts and such. And we've known about this for a long time.
What it means is that there will be a more intensive study to figure out, when you make three proteins from the same gene, how different are they, really? They're not radically different proteins. They're variations on a theme, typically, with some extra domain.
And I think some [? serious ?] work will have to go to that question, of what this domain's-- you know, the extra domains do accomplish there. But I think, as with all of biology, I mean, the central dogma isn't strictly right, and "one gene, one protein" isn't strictly right, but, you know, it's not bad. It's not bad.
SHARP: Okay, this is the last question.
LANDER: Last question.
SHARP: Last question.
AUDIENCE: How well were you able to sequence over the centromeric and telomeric regions? Were you able to [? define ?] [INAUDIBLE]?
[LAUGHTER]
LANDER: So the BACs that were sequenced were prioritized not to be those in highly repeated regions. We do seem to recover a decent number of clones for those regions. But, because they tend to be strongly repeated, they got put at the bottom of the list.
And so we poke our way, our head, into the telomeric and the centromeric regions, but we haven't dealt with it. And I'm not sure how well we're ever going to do it, really, cleaning up two megabases of centromeric repeat, or whether it matters.
SHARP: Last question [INAUDIBLE].
AUDIENCE: So, in many of the genomes that have been sequenced, above 50% of the genes have no known function. Is this the same in the human genome? And what steps can be taken to identify the functions of those genes?
LANDER: So that's a tremendously important question. In most genomes that have been sequenced, 50% of the genes have no known function. Is that true for the human? And what steps are we taking?
Ah-- depends what you mean by "no known function." If you say, what fraction of the genes have no recognizable domain that we could-- you know, being a protein kinase, I suppose, is a function. But, then again, for what purpose is it a protein kinase?
So, if you're willing to say, I know it's a protein kinase, or it does-- you know, it's a GPCR-- that counts enough-- 60% of the genes we can attach functions to. Some bit of biochemistry associated with the gene. If you're asking, what's their real role in the body, in some meaningful sense, oh, golly, it might be, you know, 12,000 or so, might be a third of the genes, that we have pretty good ideas of what they're doing in the body.
And, of course, even there we're often surprised. Right? You know, apolipoprotein E that was carrying around, you know, lipoproteins in our blood system turned out to be in our brain and predispose to Alzheimer's disease. So it's pretty primitive. We don't know what most of these genes really do. That is the work, I think, of the next decade, is to attempt to find as many ways to characterize these genes by where they're localized in the body, under what circumstances their transcripts are turned on, you know, where they are subcellularly, what they interact with through partners, what happens when I make a small molecule that binds to each of them, to figure out what function I disrupt.
By no means is this a comprehensive description of life. This is really-- I mean, I'm fond of calling it-- it's the parts list to a Boeing 777. You know all the screws, you know all the seat cushions, you've got the fuel gauge, you've got all the pieces, and they're all spread out on the floor. Right? We don't yet know how to screw it together, and we don't know why it flies, but--
[LAUGHTER]
--it's sure better than if we didn't have the parts list. I shall close, with that.
SHARP: Thank you.
[APPLAUSE]