MIT/Brown Vannevar Bush Symposium - Fifty Years After 'As We May Think' (Part 3/5)
VAN DAM: Our last speaker today is Michael Lesk. And to tell you a bit about why he is here, as you know, memex was a personal information organization and retrieval mechanism. And whereas Michael is probably known to you as one of the original Unix hackers-- for example, Lex and Yacc and UUCP-- the real reason for his being here is that he's a deep scholar equally at home in the humanities and in computer science where he's been specializing in information retrieval.
He's the chief research scientist at Bellcore. He's also visiting professor at the University College in London where he simultaneously holds appointments in computer science and librarianship. And he's had an abiding interest in books and libraries. Michael.
LESK: Thank you very much. Of all the undeserved praise, I will in particular correct that I did not write Yacc. I'm very flattered to be here. And I thank you. The title of this talk is the Seven Ages of Information Retrieval. If we could have the first slide, Shakespeare wrote about the seven ages of man. And information retrieval, we think of it as being born in 1945.
I think it's going to work out very well to view it as a human life. That is, by the time that it would be, say, 70 in 2015, we will be done. We will have achieved what Bush set out to. We will have a library of a million books fitting into your desk, at least virtually. To show you how far along we are on this, everyone in the audience, think: the last time you needed to know something that you had to look up, you didn't remember it and the guy in the next office didn't know, did you look it up on a screen or on paper?
How many on a screen? Raise your hand. And how many on paper? Raise your hand. Last time, okay. It was about 75% to 80% on screens. Hands down. That is how far we have come. And I can easily believe that in the next 20 years, we'll get the rest of the way. So I'm going to talk about IR as a life with its seven stages. This works particularly well for me because I was born in 1945.
And so I think of this, right, you know, when IR was 20 years old and should have been in, you know, sort of college student, change-the-universe mode, well, you know, I was working for Gerry Salton. And we were publishing the first paper about SMART in CACM, trying to introduce free-text indexing. So it works well enough.
The other comment I use as a thread through the talk is there's been a tension throughout this life. We've heard a lot about Bush. Now, in addition to Professor Bush, there was another MIT man named Warren Weaver who was active at the same time. Weaver wrote an article in 1949 recommending machine translation. Bush, remember, is alluding in his article to the work of the physicists, the people who built the nuclear bomb and the microwave radio.
Weaver, who was another MIT man, another propeller head going out from Cambridge to rule the world, was thinking in terms of the cryptographers, the people who had broken codes with computers during the war. He thought of, well, we could use that technology on language. He said, quote, "it is very tempting to say that a book written in Chinese is simply a book written in English, which was coded into the Chinese code. If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?"
So Bush started hypertext and information retrieval. Weaver started machine translation. But there's also a tension here. Bush thought of information organization as, people will make trails analogous to the manual indexers who had been working in libraries for generations. Weaver is talking about statistics. And throughout the life of information retrieval, there has been this tension between, are we going to do intellectual analysis-- whether it be manual or automatic-- or are we going to sort of do word counting?
It's exactly the same tension that exists in chess between the people who say, we'll try all legal moves as far ahead as we can, and those who say no, no, no, there must be a role for intelligent selection of which moves to evaluate. And in the same way, we've had this tension in IR for its entire history. So, okay, we'll go through this. As I say, in 1945 the field starts. Shakespeare's next stage is the schoolboy, which would have been the '60s, then adulthood, experience.
Shakespeare actually goes on to things like soldier and justice and comic character on the stage. I won't do those, so we've changed it a little bit. Bush's predictions are rather interesting. If we could have the next slide, please. This is a list of some of the things Bush said we were going to have. And it will not be unfamiliar to everybody in the business that most of the technology predictions were achieved at some point.
We have instant photography. In fact, we had it very soon after he wrote. We have motor-driven cameras and we have automatic exposure cameras. We have computers that have card and film control and select their own data. What don't we have? Well, we haven't actually achieved his storage goals. We can't put a million volumes in a desk yet, although Mead Data Central has the equivalent of a million volumes.
They have 2.6 terabytes. But that's a very selected set of stuff. We don't have speech recognition. We don't have really working OCR. And some of the things that were done turned out not to be worth doing. Stereo photography has been achieved, but it's sort of a toy to sell to tourists. And the ultramicrofiche that was held up earlier today, again was achieved.
But if you're a librarian and let's say out of each $100 you budget, you spend $25 on the building and somebody tells you that regular microfiche would reduce that cost to $5 and you don't do it, it's unlikely that telling you that ultra microfiche will reduce the cost to $2 will make enough of a difference. If reducing 125 down to 105 doesn't cut it, down to 102 isn't going to cut it either. So again, we've done most of the hardware stuff. We're still waiting on some of the software stuff.
We haven't got automatic typing from dictation. We haven't got OCR. We also, by the way, haven't got the stuff about plugging your nerves directly into electronic systems. Although, I did see a story on that in the newspaper fairly recently. What I found more interesting was that Bush talked a lot about individualized systems. The way Bush envisioned his interface, each person would have their own personalized information space.
For most of the history of information retrieval, that isn't what's been done. What's been done is to have systems that look the same to everybody who uses them and which therefore provide the best access to the information that is sort of impersonal, the journal articles. But again, that's something we'll come back to later. So let's go on. We still have this infancy stage.
Now, in 1957 the Soviet Union put up the first artificial earth satellite. And that produced a wide array of fears that the United States was falling behind in science and a realization that the United States didn't even know much about what was going on in Russian science. So there was some funding of Russian language studies and machine translation. And there was a lot of funding of information transfer, to say, well, the answer will be that we'll improve our knowledge by building systems to distribute information.
There were widespread urban legends of some company that had spent either $100,000 or $250,000 to reproduce a result that had been readily available in literature, but they had not found. I once eventually saw a paper in which someone claimed that they had run the story into the ground, but it was false. But it was widely believed at the time, so people set to work.
The first thing they did was they built quick indexes. I don't know how many of you remember keyword-in-context indexes, from a man named H. P. Luhn. If you think that's primitive, you should remember what the competition was. How many people here remember something called edge notch punch cards? A reasonable number. You know, I thought about bringing one. And then I said, where am I going to find an edge notch card today? So I made some.
And the idea for those of you who haven't seen them is you have these cards. And there's a row of holes and you can hold them up with something through the holes. And let's say this is a bibliography, so this card has the citation for Moby Dick. And this card has the citation for Principles of Computer Graphics. And you take each hole and you assign it.
So let's say that we decide that this hole is going to be books written by Andy van Dam. So for this card, we tear out that notch. And let's say the next hole is books written by Al Aho. We haven't got one of those. And let's say the next one is going to be books written by Jim Foley. So again we'll make a notch. And let's say that somewhere we get to Herman Melville and we make another.
And now you see we have this batch of cards. And again, you can put something through the slots. And suppose you want all books, all your references for van Dam, you pick his hole. You put something through. Now traditionally, you used a darning needle. But these holes are so big in the demonstration that I'm going to emphasize that this is a digital process by using my finger. And you shake the cards and the right ones fall out.
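As an illustration of the card mechanism just described (not part of the talk), here is a minimal Python sketch. The hole assignments and card contents follow the demonstration; the names `HOLES`, `Card`, `notch`, and `needle_sort` are made up for this example.

```python
from dataclasses import dataclass, field

# Hypothetical hole assignments, one hole position per author,
# following the example in the demonstration.
HOLES = ["van Dam", "Aho", "Foley", "Melville"]


@dataclass
class Card:
    citation: str
    # Positions of holes that have been torn open to the card's edge.
    notched: set = field(default_factory=set)


def notch(card, author):
    """Tear out the notch for the hole assigned to this author."""
    card.notched.add(HOLES.index(author))


def needle_sort(deck, author):
    """Put a needle through one hole and shake the deck.

    Cards notched at that hole are not held by the needle and fall out;
    the rest stay on the needle.
    """
    hole = HOLES.index(author)
    dropped = [c for c in deck if hole in c.notched]
    kept = [c for c in deck if hole not in c.notched]
    return dropped, kept


deck = [Card("Moby Dick"), Card("Principles of Computer Graphics")]
notch(deck[0], "Melville")
notch(deck[1], "van Dam")
notch(deck[1], "Foley")

dropped, kept = needle_sort(deck, "van Dam")
print([c.citation for c in dropped])  # ['Principles of Computer Graphics']
```

Each hole acts as a one-bit index term, so a single pass of the needle answers a single-term query over the whole deck.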
This is the retrieval technology of the 1950s. So this is what got replaced. Now as I said, because of things like Sputnik, it was a boom time for retrieval in the '50s and '60s. More people attended SIGIR-- Special Interest Group on Information Retrieval Conferences-- in the early '60s with the predecessor conferences, actually, than attended them a couple of years ago. It's really remarkable.
The first experiments were systems that used indexing. Because things were being keyed in and they were painful to do. What happened in the '60s when we've gone into the experimental schoolboy stage was people started doing experiments on free-text retrieving, most particularly Gerry Salton, who died recently, unfortunately, of cancer a few weeks ago. Gerry was the first person to try to do large scale experiments to say, yes, free-text indexing can be compared and really work.
Gerry worked a lot with a man named Cyril Cleverdon. Cyril Cleverdon was the inventor of the recall and precision measures. For the first time, we had an attempt to make this part of computer science an experimental science in which people would run experiments. This is not typical in computer science. You know, people do not do evaluated experiments. When I wrote UUCP, we didn't say, oh, well, now let's take 20 MIT undergraduates and tell 10 of them to get a message to Brown with UUCP and 10 of them to try Amtrak or Greyhound and see which ones get there first.
We just sort of write it and put it out and say here. Well, Gerry and Cyril had the idea that, no, information retrieval was going to be evaluated experiments. Because there was this history of there is a lot of stuff out there. And we need an effective way to find it. There had been a long tradition of, well, the way to do this is with standardized nomenclature and whatnot. But nobody knew whether they would really work.
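For readers unfamiliar with the recall and precision measures mentioned above, here is a minimal Python sketch using the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The document identifiers are invented for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: documents the system returned
    relevant:  documents judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four documents returned, three judged relevant.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(p, r)  # precision 0.5, recall about 0.667
```

Two of the four returned documents are relevant, so precision is 0.5; two of the three relevant documents were found, so recall is two thirds.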
And so a lot of experiments were done. The test collections, in fact, that Gerry and Cyril prepared in the 1960s were basically still in use until about 1992 as the standard retrieval techniques. Now, this stuff was all basic keyword retrieval. As this work was going on, some new techniques that were also sort of statistical went on like relevance feedback-- the idea that, well, we'll find one relevant document and we'll use that as a set of search terms to retrieve more relevant documents.
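The relevance feedback idea described above, using one relevant document as a set of search terms, can be sketched in a few lines of Python. This is a deliberately simplified illustration based on raw term overlap, not the weighting scheme used in SMART or any other real system; the collection and the function names are assumptions made for the example.

```python
from collections import Counter


def terms(text):
    """Split text into lowercase words and count them."""
    return Counter(text.lower().split())


def feedback_search(relevant_doc, collection):
    """Rank documents by term occurrences shared with a known relevant document.

    Documents sharing no terms are dropped; ties are broken by document id.
    """
    query = terms(relevant_doc)
    scored = []
    for doc_id, text in collection.items():
        # Counter & Counter keeps the minimum count of each shared term.
        overlap = sum((query & terms(text)).values())
        if overlap > 0:
            scored.append((overlap, doc_id))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in scored]


# Hypothetical three-document collection.
collection = {
    "a": "wing flutter at supersonic speed",
    "b": "boundary layer flow over a wing",
    "c": "catalog cards for libraries",
}
ranked = feedback_search("supersonic wing flutter experiments", collection)
print(ranked)  # ['a', 'b']
```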
But all of this was, again, statistical. Now, at the same time, I said there's this other thread. The other thread is intellectual analysis. And in the IR context, that was AI. That was artificial intelligence. And what we had was that people like Terry Winograd and Daniel began looking at programs that would do linguistic analysis of queries, and perhaps some documents, and attempt to match them and attempt to retrieve answers on that basis.
Now, these two groups didn't get along very well. The AI people were in the computer science tradition of we're going to build something that will make enough examples to fill the back of a thesis, and that's it. And the IR people were down this trend of, we're going to run the same 1,400-document test collection over and over again until the experimenters memorized all the articles on aeronautics. And all this was going on, but the tension is still there.
The tension fortunately is only at the research level. Because at this point, we're still in the schoolboy phase and not that much is actually getting done. Now, then we get into the 1970s. In the 1970s, we're now an adult. The field is now, you know, in its 20s. And what's happening? The main thing that's happening is computer typesetting and word processing.
You know, I appreciated Ted Nelson's videotape of real cut-and-paste. I remember that. I haven't seen anybody do that for 20 years. Once we had computer typesetting, and I hadn't realized that Bush had worked-- I mean, one of the other things I've learned from Andy's talk was that Bush had worked on typesetting and that he had worked on management theory. So that he was not only the progenitor of information retrieval, but he was also sort of the antecedent of Monotype and Linotype modern typesetting machines and of Dilbert.
But we now had, because of computer typesetting and word processing, we had large volumes of material in free text. We also had online time sharing. All of the experiments Gerry Salton did in the '60s were done in a batch [INAUDIBLE]. You wrote down your queries and sent them into a system, you got it back later. And there was a whole lot of talk about selective dissemination of information. People would write down lists of queries and leave them in the background to be run.
And this all was blown away when time sharing came in. All of a sudden, you could put up these real systems. You also had some early examples of real cooperation. OCLC, for example, this was an organization now called the-- it was originally called the Ohio College Library Center, that being an insufficiently expansive name. It is now the Online Computer Library Center.
But OCLC was founded to distribute catalog cards in libraries. And there was a cooperative effort. If a library got a book which hadn't been cataloged in the file, they would catalog it and enter it in the file and other people would then retrieve that record. And if you entered a record, you got the next 10 records free or something as an incentive to do that. These were all very limited search systems. But they were very popular. A lot of promises were made. Why don't we look at the next slide.
People believe that, you know, there was going to be a change in libraries, last chance before everything goes on microfilm. And now if we could look at the top half of the next slide please. There are earlier statements about microfilm that will remind you of some of the statements that have been made over the years by the more extreme proponents of artificial intelligence and hypertext in which people say that microfilm, micro photography is one of the most important developments in the transmission of the printed word since Gutenberg. And so what I have to say from this is that hypertext did not only not invent text as they would have you believe, they didn't even invent hype.
So what happened in the '70s in the research arena? Well, again, we've still got this tension. We still have the people doing statistical processing. And now, in fact, they get a new weapon. Keith van Rijsbergen shows up with probabilistic information retrieval and introduces even more statistics to an area. I should confess that I'm not terribly sympathetic to things like statistics, you know. I was taught as a chemist that statistics are what people do when they can't get their apparatus in good enough adjustment to get convincing answers.
But we went on. And the AI people were getting into trouble now. They had made lots of promises about what they were going to do with machine translation and computational linguistics. And to get away from these, they had started making promises about speech recognition. And the thing they did that got into information retrieval was they invented expert systems. Now, the easiest way to think about expert systems in the late '70s and early '80s is that they occupied the same buzzword niche that intelligent agents do today.
And people wrote, for example, "the 1980s are very probably going to be the era of the expert system. By the end of the decade, it is possible that each of us will telephone an expert system whenever we need to obtain advice and information on a range of technical, legal, or medical topics." In fact, when I went out to get quotes like that for some talk a few years ago, it was shooting fish in a barrel. They were all over the place.
People have wildly different views about the importance of this. You know, my name is Lesk. It's a very rare name, and it's not an English word or a word in any other language that I know. When I do a search on "Lesk" in the database, if I don't find myself, it's because I found my brother or my second cousin. That's it. Stu Card is in the room. I suspect Stu Card does not feel the same way about dumb searching of ordinary strings.
There was a problem, though, that none of the AI systems really seemed to generalize very well. Roger Schank was perhaps the best proponent of AI systems in the '70s, that we're going to introduce higher-level language processing. And their idea was that every document could be mapped into standard frameworks. For example, there's a very large number of medical articles that boil down to, we have a batch of rats. They're all suffering from such and such a disease.
We gave them such and such. And some of them got better and some of them died. This is how many in each category. And Schank's group would try to construct such schemas for many areas. And then they tried to fill them out. They produced a lot of argument about whether this was for real or not. There were other systems like Bill Woods' LUNAR system or Stan Petrick's transformational question-answering system that were evaluated, but they tended not to work off ordinary text.
So we weren't getting very far on this. But we still had this tension between people who said, all you need is statistics, and those who said, no, intellectual analysis will help. But everyone was going away from manual indexing. They all said, well, we can't afford to do manual indexing. So if we're going to do intellectual analysis, it's going to have to be done by machines. And we've got to get machines that are smart enough to do that.
So now we get on to the 1980s. And a couple of things go on in the 1980s. We have a steady increase in word processing to the extent that it becomes impossible to buy hot-lead typesetting machines. And they only exist as sort of craft devices. And the price of disk space goes down. So everybody starts to think of, ah, these are going to be the new alternatives. There's a little data on the next few slides. Go to the next slide, please. [INAUDIBLE] any data.
This is a Berke Breathed "Outland" cartoon in which one of the characters comes over and says, you know, could I borrow-- Oh, no, he said, we're on the brink of a gleaming, digital upheaval. 500 cable channels, so many sitcoms, so little time. TVs, telephones, computers merged in one, our lives awash with instant visual input. And well, what do you want? Well, I actually wanted to curl up with your copy of Winnie the Pooh. And you can see in the last frame that the guy is here staring at a compact disk. And that's what--
[? NELSON: ?] Hey, Lesk.
[? NELSON: ?] These are the slides I was gonna close with.
LESK: Sorry, well, we'll go onto the next one then, the next slide. And the next slide is why the libraries are starting to chase this so hard, why there's a business here. This is what university library budgets look like. Your typical university spends 3% to 4% on its library. Unless it's the place up the river that you all are afraid of which spends about 6% or 7% on its library. Actually, there was a nasty comment about Harvard made earlier. I'll make a nasty one back.
All right, I'm talking about the seven ages of man in Shakespeare. Everybody who knows which play it comes from, raise your hand? Got you. The answer is As You Like It. But it reminds me of the story about the time the bus fare in Cambridge went from $0.50 to $0.60 and some kid gets on and he pays the old fare and walks past without realizing. The bus driver calls out and says, hey, you, are you from MIT and you can't read or from Harvard and you can't count?
Anyway. Most universities-- 3% to 4% on the library. And it's not going up. Where does the library budget go? About a third of it goes for purchases, about half of it goes for salaries, and the rest goes for other. This is not quite fair. Most universities don't monetize space. About another third would have to be added if the library were charged the fair rental value of the buildings it's cluttering up.
Of what's spent on buying a book, where does that go? The right-hand column is the breakdown of the prices of books. Retail markup is like 40%. Distributor gets 15%. Printer gets 15%. Publishing office gets 20%. Author gets 10%.
What that means is, if you say, suppose we blew away this system, suppose, you know, what Ted Nelson wants happened. We were getting stuff directly from the author to the reader. And we didn't have to bother with paying for the printer or paying for the librarian to put it on the shelf. You know, relatively little of that money is needed. So there's a lot of potential economic gain in this system if we can do it.
Can we do it? The next slide is the price of disk space over time. And this is a chart of how much disk space-- And you see it's going down very nicely. And now I'll tell you what the people in the back can't see, which is it's a log scale. Between 1971 and now, we've had a 100,000-fold decrease in the price of disk space. So that's why these things are becoming possible.
Next slide is, as a result, the increase in the number of databases. But an interesting thing, the top green is the number of online commercial databases. The red line near the bottom that's going up much faster is CD-ROM. And then the mag tape is the blue line that's actually going down. So what we're seeing is a switch to CD-ROMs, which are becoming increasingly popular. CD-ROM is one of the big inventions of the '80s.
Let's go on-- well, let me just talk for a minute about the '80s. Okay, by the '80s, we're up to the 40 year mark, right? So the world is now getting mature. And what that means is, yes, these systems are commercially available. Lots of people have personal accounts on DIALOG and CompuServe and they're looking things up. And people are actually getting used to OPACs. Ordinary people walk into libraries now and they get met with computer terminals, not with card catalogs.
Some of them write nasty articles for The New Yorker magazine and make trouble for us. But basically, most people are pretty happy. What is annoying people is that the research community has been cranking along all these years, right? And they've invented probabilistic retrieval and relevancy [INAUDIBLE]. And none of it's in use. All these commercial services are doing dumb free-text searching. They don't think they need either intellectual analysis or even good statistical analysis.
The AI community, meanwhile, is down on this expert system and knowledge representation language stuff that imagines translations into unbelievably sophisticated and complicated languages. I mean, Feigenbaum got the Turing Award last year. I think he should have gotten a special award from the Department of Commerce for sending the Japanese down the fifth-generation rat hole. And the enthusiasm, in the early part of the '80s, this stuff was riding high.
In the middle of the '80s, we got what is called AI winter. And we have now gone into a world in which instead of people believing that it was possible-- I should say, the idea that you can take natural language descriptions of subjects and turn them into a single artificial symbolism predates artificial intelligence. There's probably one other person in the room who remembers the name Jason Farradane, put your hand up.
Dick? Where's Dick Marcus? You must know Farradane. He's the only one. Farradane was this British information retrieval guy who came up with this amazing nine operator language with special symbols for each operator, you know, and he believed that if he encoded something in this language, he would get 100% precision and recall. Everything would be perfect.
Well, it's the same thing as translating into [INAUDIBLE] or some language like that. Most people now are sort of almost back. There was a famous linguist named Benjamin Lee Whorf. Whorf had a theory that, in fact, language constrains thought. What you will think depends on the language you use to express it. And many of us are going back to believing things like that. Okay, so now we've made it through the '80s, things are going fine.
Basically, the intellectual analysis people are in full retreat. The statistical people are riding high. All the systems that are running are dumb. And we get into the 1990s. And the world is now, you know, in its late 40s, and it's time for the midlife crisis. What is happening? Well, what happens is the internet. Even one year ago, there were people at Bellcore saying to me, only 15% of the computers in people's homes even have modems. Who cares about the internet?
Nobody says that anymore. I now see charts suggesting that by 2003, the entire population of the world will be on the net. Now, what is remarkable is that everybody is providing it and everybody is classifying it themselves. You know, it's the revenge of the Bush people over the Weaver people. The Weaver people have been winning for 40 years. They've almost won. And now all of a sudden the Bush people make a comeback. And there's all these people, you know, organizing their own stuff.
Now, admittedly, there is a lot of problem with quality on this. You know, I used to be taught as an example of probability, the statement that if you took a million monkeys and sat them at a million typewriters and gave them long enough, they'd write all the works of Shakespeare. The internet has proven that this is not true.
Sorry? There's this vast amount. And the question is, how do you sort through this? I mean, the Lycos people now think there are 10 million pages on the net or more. You know, everybody now thinks, you know, in the '80s, I first met with the expectation that everyone is supposed to have a fax machine. And now you meet with the expectation everyone is supposed to have a home page.
The world's hard disk industry will ship over a terabyte of disks this year-- two megabytes for every person in the world. I can remember not too long ago when the people at Bell Labs, not a stingy institution, were bugging me to keep my disk space below 200 kilobytes. The next thing that we'll talk about is, well, what are we going to do to get some of the quality valued information into the libraries.
And one of the answers is scanning. One of the big things that happened in the '90s is the rise of scanning. So the next slide shows an example. I figure I have to sooner or later get to my own research. This is a sample of the CORE information retrieval system which we built at Cornell. This is based on scanned chemical journals. This page is an image of a real page. This is also an image.
You're looking at a picture of the page. Somebody keyed the page in in Columbus, Ohio, took the computer typesetting tape, printed it on paper. I took the paper, fed it back in through a scanning machine to get that. You may say, that's dumb. The next slide is the alternative. It is an ASCII thing, in which in this case, it's a different Bellcore retrieval system.
But in this case, this stuff is regenerated from ASCII. We did some experiments-- or actually "we" is Dennis Egan, who's sitting in the back somewhere-- and people can read both of these about equally quickly, so they both work. And if you have to search for something, they both work a lot better than paper. So you now see a lot of things coming out in image form, a lot of libraries doing things.
Some more examples of how you get things in. The next slide shows another alternative of how you might get things into the digital library. Mozart writing a digital version of Symphony number 38. This is not what you do. Let's see, what's the next? Put up the next slide. The next slide is an example. I was helping some people who wanted to digitize Charles Sanders Peirce.
So he wrote 100,000 pages of manuscripts such as that at the bottom. And you could scan it in, and the scholars could read it. I should say the sample here is the page when he was applying for money to the Carnegie Foundation to publish his works in 1900. They turned him down. In 1992 about, the people who wanted to scan all his manuscripts applied to the Carnegie Foundation for money, and they were turned down again. So scholars never learn.
Let's skip the next slide, go on to the one after that, which is an example. Once you get into images, you can do lots of other things. This is a British Library 1771 map of New England. The black is the original map. The red is an overlay of where the boundaries really are, assuming that the geological survey today knows what it's doing. And if you look at this, you'll see that the latitude is reasonably good, but the longitude is somewhat mucked up, which is what you'd expect.
The next slide is another scanned map. And this slide is one of the reasons why you're not seeing these, you know, from some-- I'm sorry, rotate it 90 degrees, please. Right. This is Manhattan Island in 1775. And the reason that I'm not really doing this on a SPARCbook-- the image from which this comes is 50 megabytes. And it takes a little bit long to put up. And the viewgraph goes up very quickly.
Simply, as a matter of attractiveness, we'll put up the next slide, which is another map, Lord Jeffery Amherst's map of New York in 1770. The next slide is to move on to still other media. There's lots of talk about other media. I like to listen to the radio. So I've got this system where I've got a radio plugged into my workstation permanently tuned to the public radio station in New York.
And every day it digitizes "All Things Considered" and "Morning Edition." And I can listen to them at my convenience later. And I can actually listen to them faster than normal. I can also clip out interesting things and save them. And I can do things like segment by looking for silences.
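Segmenting audio by looking for silences, as mentioned above, can be sketched as a scan for runs of low-amplitude samples. The talk does not say how the actual system worked, so this is only an illustration: the function name, the threshold, and the minimum silence length are all assumptions, and real audio would need thresholds tuned to the recording.

```python
def split_on_silence(samples, threshold=0.02, min_silence=3):
    """Return (start, end) index pairs of non-silent segments (end exclusive).

    A run of at least `min_silence` consecutive samples whose absolute
    value is below `threshold` is treated as a segment boundary; shorter
    quiet runs are kept inside the surrounding segment.
    """
    segments = []
    start = None  # index where the current segment began, if any
    quiet = 0     # length of the current run of quiet samples
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            quiet += 1
            if start is not None and quiet >= min_silence:
                # Close the segment at the first sample of the quiet run.
                segments.append((start, i - quiet + 1))
                start = None
        else:
            if start is None:
                start = i
            quiet = 0
    if start is not None:
        # Trim any trailing quiet samples from the final segment.
        segments.append((start, len(samples) - quiet))
    return segments


# Toy signal: two bursts of sound separated by three silent samples.
signal = [0.5, 0.4, 0.0, 0.0, 0.0, 0.3, 0.6, 0.0]
print(split_on_silence(signal))  # [(0, 2), (5, 7)]
```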
The top bit is that I also like to listen to the BBC. Well, you know, they don't really broadcast a strong enough signal. So a colleague in London has a radio plugged into her workstation. And it's tuned to BBC Radio 4. And if you ever wonder why the response on the net isn't even good at 4:30 AM, it's because there are people like me dragging over the BBC from London so they can listen to it when they get to work late.
Let's see, other things. Bush talked about OCR. The next slide is some random attempt to evaluate OCR. And the top is a newspaper story which would be perfectly readable to you if you could see it. It got 80% of the words right. Better printing you get 90%. But if you're counting by words, we still don't have that. We need to get there.
The next slide is another example of why you would like to have images. And the problem, "cat," C-A-T, three bytes. That little FrameMaker sketch was 1,000 bytes in FrameMaker. The picture on the right is 12,000 bytes. So a picture is worth a thousand words, but it costs it. I said the price of disk space had gone down about a factor of 100,000. The problem is the difference between video and ASCII is about 50,000. So nearly all of the progress has been given back.
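The arithmetic behind "nearly all of the progress has been given back" can be worked through directly. This small Python sketch uses only the figures quoted above; the variable and function names are made up for the example.

```python
def net_gain(price_drop_factor, size_growth_factor):
    """How many times more items a fixed budget stores after both changes."""
    return price_drop_factor / size_growth_factor


PRICE_DROP = 100_000      # fall in the price of disk space since 1971
VIDEO_VS_ASCII = 50_000   # size of video relative to ASCII text

# Sizes quoted for three representations of a cat, in bytes.
SIZES_BYTES = {"ASCII word": 3, "FrameMaker sketch": 1_000, "picture": 12_000}

gain = net_gain(PRICE_DROP, VIDEO_VS_ASCII)
print(gain)  # 2.0

# Each representation's size relative to the three-byte word.
ratios = {name: size // SIZES_BYTES["ASCII word"] for name, size in SIZES_BYTES.items()}
print(ratios)  # {'ASCII word': 1, 'FrameMaker sketch': 333, 'picture': 4000}
```

On these figures, a budget that once held a given amount of ASCII text now holds only about twice that many items once the items are video.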
What does that mean? In the 1960s and early '70s, all sorts of scholars had keypunched one thing. They had their copy of Paradise Lost or something that they had keyed in that they were doing research on. And right now it's the same way. I meet people who have their 45 seconds of digital video that they're experimenting with. So, all right, based on past history, another five to 10 years we're going to have really large digital collections which will be video collections, which we will have no idea how to search.
So, all right, what is happening with that? Well, research. Interesting thing with research-- there is suddenly a text retrieval industry. All of a sudden-- well, actually, let me back off. I want to talk a little bit about the libraries again. And they're getting all this stuff. And next slide please. Here's another quote for the future, you know. "Libraries for books will cease to exist." To be true in 1984, predicted in 1964.
No, you know, it didn't happen. Today, it's another story. What are the relative costs? Scanning a book-- well, let's put up the next slide, please. Scanning a book costs about $30, between $30 and $40. And you need another $10 to pay for the disk drive to hold it. To build a space to put it on the shelf, Cornell-- $20 for a book in the newest book stack they've built. Berkeley is building a stack at $30 per book.
The top building there, the British Library in London, will come in at $75 per book. And the French National Library is $100 per space for a book on the shelf. So the libraries are now in a situation where within a few years, it will be cheaper for them to scan and not put up the space to put the book. And since many of them are under strong pressure not to put up any more buildings for other reasons, you're going to see an awful lot more of conversion.
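Taking the figures just quoted at face value, the comparison can be sketched in a few lines of Python. Only the dollar amounts come from the talk; the function names and the choice to test both ends of the scanning estimate are illustrative assumptions.

```python
# Per-book figures as quoted in the talk (US dollars).
SCAN_COST_RANGE = (30, 40)   # scanning one book, low and high estimate
DISK_COST = 10               # disk space to hold the scanned book
SHELF_COST = {               # building shelf space for one book
    "Cornell": 20,
    "Berkeley": 30,
    "British Library": 75,
    "French National Library": 100,
}


def digital_cost(scan_cost):
    """Cost of scanning one book plus the disk to store it."""
    return scan_cost + DISK_COST


def cheaper_to_scan(shelf_cost, scan_cost):
    """True when scanning plus disk costs less than new shelf space."""
    return digital_cost(scan_cost) < shelf_cost


if __name__ == "__main__":
    low, high = SCAN_COST_RANGE
    for library, shelf in SHELF_COST.items():
        print(library, cheaper_to_scan(shelf, low), cheaper_to_scan(shelf, high))
```

On these numbers alone, scanning already undercuts the British and French stacks but not the Cornell or Berkeley ones, so the crossover for those depends on scanning and disk prices continuing to fall.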
It's already economical if any libraries could get together. But they can't do it right. Let's see, we also have a text retrieval industry. Next slide is the online search industry. We're at a couple of billion dollars a year now among the different vendors in which LexisNexis is biggest and Dialog is next. The next slide is the software sales to support that industry.
This is from 1990 to '96. The rate of growth in selling programs like Personal Librarian and DynaText and things like that. So this is a thriving business now. And by god, all of the research technologies that have been ignored for so long are now in use. Bruce Croft's Center for Intelligent Information Retrieval, which has developed a lot of intelligent text processing algorithms, provided the software for the THOMAS system at the Library of Congress. Gerry Salton's software was used to make a CD-ROM encyclopedia.
The WAIS system has relevance feedback in it. So we have an active industry. And by god, it's finally using the research results. Politically, we have a big thing. Vice President Gore, as people probably know, decided that since his father had started the interstate highway bill through Congress, he would do the internet and the national information infrastructure as the analogous thing. And he constantly talks about a schoolchild in his hometown of Carthage, Tennessee being able to turn on her computer and plug into the Library of Congress.
And so the feds have started a major digital library program. Do me a favor, skip the next two slides and put up the one after that. The black spots here are the federally funded DLI groups at Stanford, Santa Barbara, Berkeley, Michigan. Some other projects are in green. For example, the JSTOR project in Michigan is the Mellon Foundation's attempt to see whether it really is true that libraries can scan as a replacement for shelf space.
And what Mellon is doing is paying to scan 10 major economics and history journals back to the beginning of time, of their time, and seeing whether these libraries can use these instead. We also got a return to evaluation. I said that for 30 years, everybody had been using the same collections for evaluation. And Donna Harman started running something called the TREC Conference in which people sent in-- hundreds of queries were run against a gigabyte of text. And you suddenly got some realistic numbers.
And what we learned, not surprisingly, was that there is still enormous scatter in the performance of these systems. The reason we can't agree on what a good system is is that it depends very much on the query and the user. You could do enormously better with best hindsight. If you could say for each query, on this query, we'll use that search system, on the other query, we'll use the other search system. We don't know how to do that yet, but there's clearly gains to be made.
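The "best hindsight" point can be made concrete with a small sketch. The scores below are invented purely for illustration (they are not TREC results), and the system and query names are hypothetical. The idea is to compare each system's average score with the average obtained by picking, after the fact, whichever system did best on each query.

```python
# Hypothetical per-query effectiveness scores (higher is better).
scores = {
    "system_a": {"q1": 0.8, "q2": 0.2, "q3": 0.5},
    "system_b": {"q1": 0.3, "q2": 0.7, "q3": 0.4},
}


def mean_score(per_query):
    """Average score of one system over all of its queries."""
    return sum(per_query.values()) / len(per_query)


def hindsight_mean(all_scores):
    """Average score if the best system were chosen separately for each query."""
    queries = next(iter(all_scores.values())).keys()
    best = [max(system[q] for system in all_scores.values()) for q in queries]
    return sum(best) / len(best)


if __name__ == "__main__":
    for name, per_query in scores.items():
        print(name, round(mean_score(per_query), 3))
    print("best hindsight", round(hindsight_mean(scores), 3))
```

The hindsight average can never be lower than the best single system's average; the open problem named in the talk is predicting the right system before the query is run.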
So as I said, the midlife crisis is, all of a sudden this is succeeding, but it's succeeding with manual content analysis rather than just the statistical retrieval. They're actually both working together, sort of the Bush disciples have made a comeback. All right, so what happens next? The next decade is the 2000s. IR is going to be 55 to 65. So this is fulfillment, right? We should be [INAUDIBLE] away the money for our retirement.
And our problem is going to be, how are we going to do the image? We can get all these images, video, it's bad enough indexing the text. At least we have free text retrieval. What are we going to do with all the images and the video? We've got to have manual analysis for that at the moment. I don't know how we do it. That's going to be the big issue.
But it's going to have to rely on people doing it, at the moment. And that's going to require us to learn how to manage manual indexing after many years in which librarians were sort of viewed as, you know, well, librarians are sort of the equivalent of quill pen makers in the typesetting era. Suddenly, we need these people. We need people who know how to organize information and make use of it. Okay, but I think we can get through that.
Finally-- 2010. This is-- I've labeled it retirement on my view graph. Put back the original view graph, please. Shakespeare had it as senility. I prefer to be more optimistic. At this point, I think that, basically, the conversion job has been done, that by 30 years, you know-- well, by this time, we will have had enough of the scanning and enough of the keying available and we will have a market in which people can buy this stuff that we will basically have the material we need online.
We will need to know how to find it. But we will have it there. And we can all go off to study biotech or something. The printed books remain sort of in warehouses somewhere. The library buildings on campuses have been turned over to more bureaucrats and administrators and something else.
What might go wrong with this? I'm very optimistic, but what are the problems going to be? One of them is internationalism. We're doing just fine now on the internet on the assumption that everybody in the world speaks English. At some point, that's going to break down. And it's going to be political as well as technical. There's going to need to be more new kinds of research.
I am a little worried that 10 years from now, there will be library departments giving PhDs in arcane details of probabilistic indexing, carrying out the arguments that Keith van Rijsbergen had with Gerry Salton in the 1970s. You know, we've got to get onto some of these things. There will be a few of us old fogies who even remember that we were promised automatic language parsing and question answering. We're still waiting for it.
More seriously, I'm worried about some of the social effects. CB radio, anybody remember CB radio, right? You know, lots of trash. How do we keep the internet from being overwhelmed with that stuff? How do we get a role for the people who will sort out what is reasonable and what is not and let us find that stuff? That is going to require some form of charging. Bob Kahn already said, we need to go to service charging.
We need some leadership in how to do this right. My problem is that, you know, it's hard to look to economics for leadership. Even the Economist magazine once said that an economist was somebody who was good with numbers but lacks the personal charisma to be an accountant. And there is a problem of where do we get the leadership to build a system, such as the one that Ted Nelson has envisioned, where people actually do get paid and it works.
At the moment, the copyright law is a real problem. You know, somebody made some comment about, well, we needed more legal stuff. I mean, I have been going to computer conferences since 1969. So I remember the days before there were lawyers in computing. I never heard at one of those conferences anybody suggest that we had a problem in computing because there weren't enough lawyers paying attention to it. There is a story that--
Okay, oh, wait, no, that's right. IBM did prepare a CD-ROM to commemorate the 500th anniversary of the Columbus voyage to America. There is a story that they paid over a million dollars to clear the rights for the material used on that CD-ROM. That's okay, but only $10,000 of it was paid to the rights holders. All the rest went into the administrative costs. We have to get somewhere.
There is a real possibility that the world of the future is everything published after about 1990 is available because the publishers have it and they're making it available. Everything before 1920 which is out of copyright is available, and everything in between is falling into a black hole in which every once in a while, someone tries hard to find the people who have control and get the data.
Finally, as I alleged in an earlier question, I'm worried about political problems. There are technologies which looked as if they were going to be successful and have been stopped by political or legal liability issues. As I say, nuclear power, childhood vaccination. We need to try to design our world so that doesn't happen to us. All right, but I don't want to end on an unhappy note. You know, it is possible that we will all, you know, be drowned in information water or something.
But I think that it will actually work. I think given the rates of progress, Bush's dream will be achieved in one lifetime, in the lifetime of this profession. Now Bush actually talked about the information organization. He said there will be a new profession. There will be trailblazers, people who make trails on request. Now, part of the problem is that today people think of a librarian as someone who alphabetizes books and puts them on a shelf. The function that is known in a library as "mark and park." You label the book and stuff it on the shelf.
We need to get sort of a higher status for this, whether it's called information trailblazer. I'm not as good as some people at making up new words for things. I'm optimistic about that. Once upon a time, accountants had to be good in arithmetic. Then computers came along and made skill in doing arithmetic totally irrelevant to the real world.
Did that mean that accountants became uninteresting minimum wage people? No, accountants took over all the major corporations. Bean counters now run the world. So if computers go in there and say, all right, alphabetizing is no longer interesting, what happens to librarians?
And I hope that, you know, information becomes more valuable. Somebody earlier talked about information as a sea. All right. The purpose of librarians and trailblazers, whatever, the purpose of those people in the future is now going to be to navigate. It is not going to be to provide the water. That's it. Thank you.
VAN DAM: [INAUDIBLE]
LESK: Thank you.
VAN DAM: Great. Let's have some questions for Michael.
NELSON: How do you see--
NELSON: I can yell so everybody--
LESK: No, no but then you won't be recorded, Ted. Please.
NELSON: How do you see the copyright issue as resolving?
LESK: First of all, the question was by Ted Nelson. I don't-- I would hope, frankly, that some payment scheme is adopted. My personal guess is that it will be a much simpler scheme than yours. When I talk to the journal publishers and say, how do you want to charge for your stuff online, they don't say a hundredth of a cent a byte or hundredth of a cent a minute.
They say $25 a year because that's what they're used to. When I talk to a book publisher and say, what do you want to charge for online access? $50 a person. So my hope is that there will be larger and bigger units charged so that I don't get into the huge overhead of $0.02 here and $0.03 there.
NELSON: You think that's a simpler system.
LESK: Yeah, I do. You know what I would really like? I'd really like the German solution, the tax on blank tape. You know, the Germans do this. They want to deal with over-the-air taping. And they put a tax on blank tape and give the money to the German society of composers. And I wish some mechanism like that would work. I've never understood why in the US political context a tax on blank tape, which is all made in the Far East for the benefit of recording artists who are all Americans doesn't go through Congress. But it never does. Anyway.
AUDIENCE: Michael, you said there are only about a terabyte--
VAN DAM: [INAUDIBLE] please.
AUDIENCE: Raj Reddy-- only a terabyte of sales of disk. I don't know where you got the number.
LESK: No, I'm sorry. 2 to the-- it's the 10 to the 15th. I'm sorry.
LESK: Yeah, I'm wrong. It's a petabyte.
AUDIENCE: Yeah, at least a petabyte. Last year, 50 million pieces were sold. Even if each of them had only 100 megabytes, it would be more than a petabyte. It would be at least five petabytes. It is probably more like 20 or 30.
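The arithmetic behind that estimate is easy to check. The sketch below uses decimal units (a megabyte as 10^6 bytes, a petabyte as 10^15), which is an assumption; the drive count and per-drive capacity are the figures just quoted.

```python
MEGABYTE = 10**6    # decimal units assumed
TERABYTE = 10**12
PETABYTE = 10**15

drives_sold = 50_000_000          # "50 million pieces"
bytes_per_drive = 100 * MEGABYTE  # "only 100 megabytes" each

total_bytes = drives_sold * bytes_per_drive
total_petabytes = total_bytes / PETABYTE

print(total_petabytes)  # 5.0
```

So even the deliberately low per-drive figure gives five petabytes, several thousand times the terabyte originally stated.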
LESK: The number that I got came from Jim Gray giving a talk at VLDB last fall.
AUDIENCE: It's changing every day.
LESK: I mean, that may be the answer. That was last September.
AUDIENCE: Last year we only bought one gigabyte disks. This year we're buying 9 gigabyte disks for the same price. The second issue I think is more interesting, you said OCR doesn't work.
LESK: That's right.
AUDIENCE: One company, Caere, has a $50 million a year product business.
AUDIENCE: Obviously a lot of people are buying it and using it. So none of the technologies that you talked about and we are all working on will ever be perfect, including information retrieval. The precision and recall is still pretty lousy. And it will never get that much better. It will never be perfect.
AUDIENCE: The question is, how do you define what works and what does not work?
LESK: When my friends stopped sending things to be key punched in the Philippines because OCR is good enough that they don't need to do that anymore.
AUDIENCE: And lots of people are still doing it, I gather.
LESK: To tell you a really horrible story, admittedly, from a few years ago, a friend of mine runs the--
AUDIENCE: A few years is a long time in this industry.
LESK: OCR isn't changing that much.
AUDIENCE: I'm surprised.
LESK: Well, a friend of mine was involved in having the publications of the AIAA put into one of the online services. She not only asked the online service, do you want the pages to OCR; she offered them the typesetting tapes. The service said, nah, we'll send it to Korea. We got lots of people we can hire overseas to keystroke. We can't even find people to decode your typesetting format, let alone to patch up the OCR. So I still see too much stuff being sent out for keying.
AUDIENCE: So the definition is when most people use-- I'm sorry. I agree with your definition, when most people routinely use scanners for use of information rather than sending it out to be keypunched.
LESK: When people use OCR rather than keying. In the same way, speech recognition I will recognize as really practical when the business of typing from dictation disappears.
AUDIENCE: So the issue is, when you type with a word processor, you make mistakes also.
AUDIENCE: And you learn to live with it, you fix it when you make a mistake. So the issue is when you make a mistake with voice input, you'll make mistakes. There's no such thing as-- it'll never be perfect. Pen input and voice input, including keyboard input for word processing, will never be perfect.
LESK: I agree it's not perfect. The librarians have a funny attitude towards this. They read too many books on total quality management long before there were any such books. A friend of mine made the mistake of saying in public that the keystroking of the British Library catalog had introduced 50,000 lost books by errors in the press marks. He got fired for it.
A more interesting story-- many people have thought of the idea of let's scan something, OCR it, and because the OCR isn't perfect, we'll use the OCR only for searching behind the scenes. And we'll display the scanned image. This is the way the Elsevier TULIP project works. I first heard this idea 20 years ago at the National Agricultural Library. And I said, well, what happened to that project? Answer-- they started it.
But they had the OCR. It seemed ridiculous to have this and not send it out to the people who were buying the scanned disks. But it wasn't accurate. It was embarrassing to send it out this way. So they started trying to correct it. Then they couldn't afford to put out the product. So the whole thing died. And I said, you couldn't argue this one out and just distribute knowingly inaccurate stuff, but saying for its purpose it's good enough? No, that's not our world view. We want to be proud of what we're doing. Oh, forget it.
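The scheme described here, searching uncorrected OCR behind the scenes while showing the reader only the scanned page, can be sketched as below. This is an illustration under stated assumptions, not the TULIP design: the page texts, file names, and function names are all hypothetical, and the index is a plain in-memory dictionary.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"


def build_index(ocr_pages):
    """Map each lowercased OCR word to the set of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in ocr_pages.items():
        for word in text.lower().split():
            index[word.strip(PUNCTUATION)].add(page_id)
    return index


def search(index, query, image_for_page):
    """Return scanned-image names for pages whose OCR text has every query word."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    # The reader is shown the page image, never the raw OCR text.
    return [image_for_page[p] for p in sorted(hits)]


if __name__ == "__main__":
    # Hypothetical OCR output; page 1 contains a recognition error ("whcat").
    ocr_pages = {
        1: "The price of whcat rose sharply.",
        2: "Wheat and barley prices.",
    }
    images = {1: "page-001.tif", 2: "page-002.tif"}
    index = build_index(ocr_pages)
    print(search(index, "wheat", images))
```

The misrecognized word on page 1 means that page is missed for the query "wheat", which is the accuracy trade-off the story turns on: the errors cost some recall but never appear in front of the reader.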
AUDIENCE: Adobe is selling a product just like the one you mentioned. You can buy it in the market--
AUDIENCE: --for a couple of hundred dollars right now.
AUDIENCE: Samuel Epstein with sensemedia.net. And one of the things that you mentioned was that I guess by the year 2000 or some date like that, that all the information that we need will be digitized and up on the net ready to be retrieved. One of my concerns, and I'm starting to see it now with a lot of the smaller kids that we do some work with, is the idea that, well, anything that we need to find we can find by going into Infoseek or Lycos and doing it.
Regarding the parochialism of data and the fact that even in a few years or 20 years, that there's going to be a lot of stuff that is not on the net and stuff that even never will be on the net, whether it's coming out of a rainforest or whoever, how do we as designers of these systems impart to our users that this is not the end to all ends of information retrieval and as seductive as it sounds the reality is sometimes you gotta go crawl around a jungle?
LESK: I know. And the unfortunate answer, I think, is if you go to Western European libraries, there are many manuscripts in them that have never been printed. The average manuscript that survives in an old library has not been transcribed to printing in 500 years. So what's happened to this stuff is that no ordinary person ever finds it. A few devoted scholars, shrinking with university budgets every year, devote their lives to crawling through this.
So if somebody says, does there happen to be any estimate of the cost of having a horse shod in Germany in 1300, one of them pipes up and says, yeah, I saw that 20 years ago. It's in a library in Dresden or something. The same thing, I am afraid, is going to happen to the data that is not easily available in electronic form. And this is unfortunate. And I don't really know what to do about it. Because I'm afraid that the cost of converting the entire past is too high. Enough will be converted to serve most people most of the time. And then we're going to be stuck. And I don't know how to get around that.
AUDIENCE: Just as a follow up, do you see, possibly-- now, we're starting to see advertisement for professional paid net surfers to go out there and search the web. Do you see in the future a potential for a career as a professional analog information surfer?
LESK: Yeah, I think there will be. My problem is that I think that the career for somebody who wants to dig around in the non-electronic stuff will turn out to be a few specialists, as with the people who deal with manuscripts. I mean, there's this whole profession of archivists. And, you know, you cannot get a job in it. So I just don't know. I wish people would be more interested in diversity and getting more kinds of information.
AUDIENCE: John Smith, University of North Carolina. I'd like to ask you two related questions, if I could. First a factual question and then kind of an opinion question. Your projection about this or your vision is kind of based on, in part, on this amortization of cost for storing and providing access to books and the sort of declining curve of cost for disk space. But what you didn't mention is what is the cost trend for moving bits around the internet?
Is that going up, going down? Are we seeing any kind of a logarithmic decrease in that? Because it seems to me that this vision really is predicated on lower and lower cost of that. And the second question is kind of a follow on to that, and that is I think a lot of us think that internet is a kind of God-given right. But as it expands in use, I really worry about how it's going to be financed to bring it up to that level of service. And so I wonder what kind of economic model you see in the future that would make the internet universally accessible.
LESK: It's irresistible. Bob is the next person at the microphone, and he can answer both of these questions better than I can.
AUDIENCE: [INAUDIBLE] answer my question.
LESK: And the reality is, costs of transmission are going way down. I mean, for example, when you were growing up, long distance calls cost a lot more than local calls. Today, most large purchases of long distance service pay irrespective of distance. You know, LL Bean will pay something like $0.06 a minute, flat rate, anywhere in the United States. So these costs are coming down. And all the costs are coming down.
You've given me the option. Ted gave out this paper. And one of the things he says in this 1965 paper is, "the costs are now down considerably. A small computer with mass memory and video-type display now cost $37,000." "Several people could use it around the clock."
All of these costs are crashing. Now, how did we advertise or how did we arrange that people like me who are perhaps using vast quantities of internet service for frivolous purposes get charged more than people who are only using a little? I don't know how that will play out.
My gut feeling is that the bandwidth regulation model we have now, which is that Bellcore pays more because it has a T1 link than people who have a 9,600 baud link, is not entirely unreasonable, some sort of peak bandwidth limitation pricing. But I don't-- the economists are going to have to sort that one out.
KAHN: Bob Kahn. Michael, I'd like to get you to speculate about retirement in 2010. In particular, given that this is a birth-to-retirement process, the presumption I make is that you think that the information retrieval problem will have been solved by then. I'm afraid that that may turn out to be more hype than it is reality in that people's expectations of what the information retrieval capabilities will be will be far greater than the reality. So can you give us some sense of the scope of our information retrieval capabilities at retirement?
LESK: Oh. What I believe will have happened is that there will be vast numbers of people, for reasons I probably won't understand, who have put together enough trails on enough subjects that when I sit down and I say, I want to find a photograph of Bob Kahn, there will be a way to do it. Now, I don't understand all this. There are some subjects on the net, you know, if I want to know the names assigned to the locomotives of any class of British railways, I assure you I can post that question, and 30 people will fall over themselves to show off that they can answer it.
If I were to say, what fraction of current computer programs are written in Pascal? I am also quite confident I would not receive a single competent answer, although I would receive many idiotic opinions. I don't know how to arrange that for every subject, there will be some bibliography equivalent. But I believe that is what it will be.
What I'm saying is, the Bush people are eventually going to win over the Weaver people. We're not going to be able to do the image and sound and video analysis. I know, Raj, you know, I know there's a nice demo at CMU of how we're going to solve all these problems, but I'm not convinced. Whereas, I am amazed by the number of people on the web willing to spend their time collecting every locomotive picture, every picture of a pet rabbit, and making lists of them.
Someone objects to the pet rabbits? Okay.
VAN DAM: Terrific.
VAN DAM: Great. Great way to finish.
Okay, we're going to go for half an hour max. We're going to take some local questions, and then we're going to take some questions via the MBone. All right, who will be first? Okay, general or to anyone particular?
AUDIENCE: Perhaps both. But I think I'll direct the question first to Robert Kahn. My name is Ricki Goldman-Segall, and I deal a lot with video annotation tools. I was concerned about one of the things you talked about, which was this notion of malleable content. And I guess what worries me and what I think about is that what happens when the content that we have, let's say the video images are images of people in our own research, that we've taken in our own research of children and adults doing all kinds of things.
Now, we have ethical approval to use those. We might even have ethical approval to use them on the internet. But what happens when someone else uses that and takes it out of context? It's not just a matter of authorship. Because let's say I'm the author, I'm the researcher. Okay, I have the authorship. It's not just an issue of property rights. It's an issue of ethics. And no one in the panel today, no one in any of the-- none of the speakers today addressed the issue of the ethics of malleable content. So could you talk about that, please?
KAHN: Well, I mean, malleable content is not exactly a household word in most quarters. And even today in the world of what I would call hard copyright-- I don't mean hardcore copyright, I just mean hard copyright-- there are moral rights that do attach to that material. You might think that after copyright runs out you can do anything you want. But under the Berne Convention, there are still moral rights retained by authors.
So for example, if you were to try and do syncopated whoever and the estate of that musician did not want you to do that because they thought it impinged upon the moral rights of the musical works writer, they could prevent you from doing that legally. Similar kinds of constraints occur in other related areas. So moral rights even pertain when you have hard copyright where you think it's now in the public domain, but it was originally protected.
My sense of what's needed to happen is that people need to be able to state what it is they're willing to have happen to material that they have rights in. And if somebody really does not want somebody else to manipulate material that they have rights in, that is their prerogative. And therefore all the trappings of the law that would normally pertain in whatever part of the world that you're in ought to apply to that.
On the other hand, I can see many people providing material where they say you can make changes. You know, for example, suppose that I had created a film called, I don't know, Gone with the Wind, pick your favorite. But I put it out into the world of cyberspace in a form where the storyline was out there, the emotions of the characters was out there. And the only thing that was not really fixed was the actual choice of characters or the location.
But all the rest of it was sort of pinned down in the storyline. And let's say my guideline was fix the storyline, can't change that, but you can change the party. So you can watch Gone with the Wind starring you as Scarlett, for example. And that might be--
KAHN: You like that? That would be okay with me, okay? But maybe I put out another one where I sort of had the start of a good story and it was really more like interactive fiction and you could change the storyline by virtue of actions that you took. And that was okay with me because you aren't changing the core that is manipulating the changes of the storyline. And that was what I put out there. There may be somebody else who says, well, I'm going to put out a blank slate and you can put anything you want on there and they consider that a work although somebody else might not.
So I see a whole range of possibilities. I just would like to make it possible that we can deal with derivative works out there or additions to works. Here's another case in point. Nobody can stop you today from doing a visual overlay on a wall. So if you are doing a simultaneous showing of A and B, two different audio visual works, then what you see is the composite of those two.
And somebody may object for legal reasons that you're affecting the original, but you're actually not changing it directly, you're just changing the visual experience that somebody has. So I could easily see somebody giving you a template that if somebody else created one that would do a projection that would overlay on it, in fact, that might be okay, too.
NELSON: Can I piggy-back on that? I'd just like to, once again, explain how the transcopyright seems, to some people, to solve this issue. You see, what's been at the end of the trail here has been the problem that all copyright owners have under this Berne Convention of being able to stop you from taking things out of context. And meanwhile, people are quote "downloading" things on the network and inventing for themselves all sorts of schemes in their minds that make it reasonable and correct and with total ignorance of the law, basically.
So the transcopyright proposal is essentially that anyone can reuse this arbitrarily in a new context, with pre-permission for that new context, with the understanding that all the pieces will be bought from the originator. Now, what about altering the bytes? That's where it gets into trouble. So the only way the transcopyright thing works is if the bytes are always obtained from the originator. So if you want it to be stretched and morphed or something like that, what the map then contains is the directions for how to stretch and morph these things once you download them. And that gives the desired result.
So malleability is possible within that framework on a strict basis. The reason that some people want to talk about the moral issue and others of us want to talk about the legalistic issue is that moralisms can blow away with the wind, whereas if you can set down some guidelines that can be implemented as a workable solution that people can live with that could have a long term effect.
BERNERS-LEE: I could offer-- I can offer an alternative. Hello? I can offer an alternative. The alternative point of view to suggesting that transcopyright should be mandated, basically should be a flag that you turn on or turn off is to say-- I'll make two observations. First of all, to observe that a system which tries to constrain how people behave doesn't fly. So if you, for example, make a documentation which requires everybody to write in a given word processor it doesn't work.
If you try to make a system which changes the relationships that people have, which forces them into some mold, even if the CEO of the company of 50,000 people mandates that everybody use it, they will use it under coercion, and it won't really work. So the reason that hypertext is neat is that it's very unconstraining. It allows a lot of flexibility.
Now, the second observation is that when you look at the agreement under which information is passed from one person to another or anything else, for that matter, goods, Corn Flakes. But when you buy Corn Flakes you think at the first level that there is an exchange whereby money goes in one direction and Corn Flakes go in the other direction. But in fact what you're getting is you're getting the Corn Flakes, and on the packet there's a UPC bar code, which, if you send it in, it will get you 100 American Airlines free miles.
You can send the coupon in with $10 and you can get yourself a free camera and 10 free rolls of film. And if you combine it with the UPC from a particular brand of washing powder, you can get a free ticket to the theater. So there are an incredible number of different-- This is a very complex agreement involved. There are an incredible number of different kinds of legal tender and illegal tender which have crept in here.
So that if you're going to allow this sort of behavior to go on, which marketing people seem to need to do, they seem to need to have 16 different types of train ticket and 18 different types of airplane ticket so as to extract the most money from the populace, then if you're going to represent that in a system, if you're going to represent the commerce and the agreement on the network, the network has to be extremely flexible so that when you pass information from one person to another, basically you have to be able to pass an arbitrarily complicated expression of the license terms.
And, yeah, I mean, even when you buy it, think of a video. What is it, really? What do you get the right to do when you take a video home from the video store? It doesn't have a complex license on it. But you get the understanding that you can show it under reasonable terms as long as you're not on a coach or an oil rig to people who are, well, in your family, well, say, in your house. The number of people who can reasonably cluster around the television. And things like that. And it's reasonable to watch it twice as well as long as you get it back by the date.
If you try to write that sort of thing down in Lisp, it gets frightfully complicated. If you don't write it down in Lisp, and you send the thing over a network and it sits there on a proxy server and it's cached. And then somebody else asks for the thing. This is video. This is serious network traffic. So the proxy server is very interested to know whether it can give you a copy.
If it doesn't understand the license agreement because it can't read it, it can't work out whether it can A, give it to you for free or, B, give it to you and charge you a certain amount and pass the cash back. So in fact, this is a really big, hairy problem. In fact in solving that problem, if you can find a sufficiently powerful solution, you may solve a whole lot of other problems accidentally.
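The proxy's dilemma can be illustrated with a deliberately tiny sketch. It assumes a toy machine-readable license with just two hypothetical fields, `redistribution_allowed` and `fee_cents`; real terms would be far richer, which is exactly the difficulty being described. Falling back to the origin server when the license is missing or forbids redistribution is also an assumption of the example, not something stated in the discussion.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    SERVE_FREE = "give the cached copy for free"
    SERVE_AND_CHARGE = "give the cached copy, charge, and pass the cash back"
    ASK_ORIGIN = "cannot decide locally; go back to the origin server"


@dataclass
class License:
    """Toy license terms (hypothetical fields, for illustration only)."""
    redistribution_allowed: bool
    fee_cents: int  # 0 means the copy may be given away


def decide(license_terms: Optional[License]) -> Action:
    """Decide what a caching proxy may do with a copy it already holds."""
    if license_terms is None:
        # The proxy could not read the license, so it cannot choose A or B.
        return Action.ASK_ORIGIN
    if not license_terms.redistribution_allowed:
        return Action.ASK_ORIGIN
    if license_terms.fee_cents == 0:
        return Action.SERVE_FREE
    return Action.SERVE_AND_CHARGE


print(decide(License(redistribution_allowed=True, fee_cents=250)).value)
```

Even this two-field version needs three outcomes; terms like "not on a coach or an oil rig" or "back by the date" would each add conditions the proxy has to be able to evaluate.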
VAN DAM: Given the fact that this is very complicated, I think we should get off the topic and go to another one. Otherwise, I'm afraid we'll be stuck with this very interesting one for the remainder of the period. Roger.
AUDIENCE: I'm Roger Bloomberg. And this is a question for each of you, if you'd like to answer. As you know, in Bush's essay, he cites the disappearance of Mendel's paper as a catastrophe that, in reforming modes of communication and transmission, one must avoid. So I'd like to ask each of you to speculate about the future of reforms in modes of communication and transmission. What catastrophes should we avoid?
NELSON: I've got the microphone, so I'll say it very briefly. Basically, I see ours as an age of information loss. And as we digitize more and more, the formats get crazier and crazier. I've heard that NASA actually has a job designation called data archeologist. And the great problem of being able to see the same document twice years apart is it's very important to be able to see it the same way. I'll tell this very quickly.
When I was editing Creative Computing magazine, I saw a great article on Smalltalk that I wanted to run from an obscure English journal. And it was so much better than that piece that Alan Kay had written in The Scientific American, which really ticked me off. And I was very irritated with Xerox PARC and their attitudes, anyway, though I liked Alan. So we got permission to reprint this article from this obscure English journal.
And guess what? It was plagiarized. It was Alan's article, in which they had changed every occurrence of the word "we" to "they," thinking that this somehow guarded against a copyright violation. Now, that was the same article I had read several years before. And I was a different person when I read the article. The article had not changed except for "we" to "they." And what this shows is the importance of being able to know that you're seeing the same thing twice, even though you have changed. And if you change and the documents change, nothing can be kept track of.
LESK: I agree that saving all data is a problem. I believe that there are several task forces working on this. I think that one can be solved. The problem that worries me much more is the potential for the loss of diversity in information sources. It is not clear whether the new technology with very low cost for making many copies of one thing will drive us in the direction of there is one extremely glitzy multi-media college physics textbook instead of 20 easier to prepare, ordinary, written textbooks.
And I don't want to see the technology change in such a way that we lose diversity of information preparation. And that means we have to work on making multimedia authoring easier. And we have to work on, you know, preserving symmetric access to networks rather than let us say direct broadcast satellite for everything.
NELSON: And ways of freely merging material.
LESK: Well, merging-- not necessarily freely merging, merging with low administrative cost.
NELSON: Unfettered merging.
LESK: Unfettered-- I'll go with unfettered.
AUDIENCE: I would just like to--
VAN DAM: Name?
AUDIENCE: --bring us back to--
VAN DAM: David?
VAN DAM: Your name please?
AUDIENCE: I'm sorry, I'm David [INAUDIBLE] from Brown University. I would just like to bring us back to Vannevar Bush. He was, as you know, an extraordinary manager. And one of his concerns was directing the efforts that he was in charge of toward essential and specialized tasks. To what end do you see the World Wide Web, the internet, electronic text, electronic libraries, if you care, becoming essential services in our society? And if they are becoming essential services, how do we justify our taxation of them? Or how do we go about funding them, in that case, which is something you're already addressing, of course?
KAHN: Well, having the mic, let me just pick a piece of that to address, rather than deal with it totally generically. Because I think every instance of capability in society can be, in some ways, addressed separately or should be addressed separately. In the case of the internet, the thing that is really the most viable attribute there is the connectivity that it provides in an open architecture framework at the moment. That may change with more functionality, but that, I would say, is the basic element that's there. I don't see how that is going to disappear.
But the form and shape in which it's provided could change fairly dramatically over time. This is a marketplace right now for services. And every instance where a capability can be provided as a marketplace service, then the market, it seems to me, will deal with those issues as time moves on. Things may evolve. Other parties may come in. Some may drop out. Prices may go up. Prices may go down. Parts of it could be subsidized. Parts of it could be paid for in straightforward or indirect ways.
But it seems to me that at least at that level what you have is a basic service that's out there in the marketplace. Now there are areas where oversight has turned out to be important. This was nurtured from really very little through federal government, US government efforts initially. Today, the US government still plays a significant role in the oversight process.
But more and more, the participation in that process has become broader. It's involved folks from the commercial sector. The universities have been involved from day one. And involves international participation. And I see that continuing. But the participation is at the level of ensuring that the process by which things work is maintained effectively, not in the provision of individual services. People may want to talk about your other points separately as well.
VAN DAM: Why don't we take two more questions, and then we'll switch to MBone. Eric?
AUDIENCE: My name is Eric [? Nelson. ?] And what I'd like to ask is if there is another McCarthy-era-style witch hunt sometime in the future, will it be made more vicious by today's new tools? Specifically, two issues, one is does one leave a trail when one uses the network? And one is if information that is about you can be processed faster by parties unknown, do you necessarily want that?
BERNERS-LEE: Doug, you haven't said anything yet.
ENGELBART: Well, the criterion for my answering this is that I haven't said anything yet.
NELSON: You've said more than enough, but please go on.
ENGELBART: That's the kind of thing that interests me a lot is the long-term impact of how our social processes will change. And some of them can get bad. And so the only thing I look at is saying if you get there first with the good stuff that'll help you solve the bad stuff or not. And so I've really been focusing all the time on how can people that are collectively trying to really cope with a problem in a straightforward way do it.
And it's all been focused on domains in which you assume that people are agreeing to collaborate on it. So the copyright issues aren't there in a salient way. So therefore, I've said it. And I'm not much good at answering your question, I think. He'll always say something.
NELSON: I came out of-- in high school I read Fahrenheit 451 and 1984 and The Space Merchants and all those terrible government conspiracy kinds of things. And so I've always thought the government was a conspiracy. And when Lee Harvey Oswald was shot, I predicted the shooting of Lee Harvey Oswald, Debbie, didn't I, three minutes before it happened? And so for years during the '60s, I was an assassination buff and just sure that the entire structure of American society was controlled by some terrible group.
It was so obvious. I mean, there's so much evidence everywhere. And gradually, this hypothesis has fallen of its own weight because the number of people involved in the conspiracy at a witting level had to be in the hundreds of thousands. But it is certainly the case that there are new tools out there that people are figuring out how to use [? adversarially ?] in new ways every day.
The creation of any new tool creates a new adversarial method, perhaps, just as the creation of any new capability creates a deprived class and a privileged class. People talk about how are we going to make sure everyone has equal access to information. Give me a break. When did anyone ever have equal access to information and how can it ever be possible? Those of us who spend 48 hours a day trying to get as much information as we can are obviously going to be in a different category from couch potatoes. Okay, so anyway here.
BERNERS-LEE: Good answer. Good question. This is really-- I think it concerns a lot of people. It certainly concerns me. This is largely the import of my talk. Civilization is walking a road between the mountains of despotic dictatorship and the swamps of terrorism. And every now and again, somebody feels that we're veering in one direction and drives hard in the other direction.
So how can we ensure that, in fact, we stay in the green pastures in between? Perhaps one of the answers is making a sort of [? fractal ?] society. The interesting thing for me that Ted said there is that when the required number of people to have been in the conspiracy approaches, what is it, 10,000? 100,000? At that point, you realize that they couldn't do that without you knowing one of them.
BERNERS-LEE: So in other words--
NELSON: Or being one of them and not knowing it, which is worse.
BERNERS-LEE: Right, who knows?
So it's a question of the number of links in a way. If society is structured right, then you will know somebody sufficiently well who knows somebody sufficiently well who knows somebody who knows the president that you trust the guy or you don't. Or if it's too many, if there are too many levels, then you and all the people that you know and all the people that they know can all have the same conspiracy theory, can all be meeting in the woods doing nasty, mean things with explosives and seriously think that they're right.
VAN DAM: One more question.
AUDIENCE: I'm Rosemary [? Simpson. ?] And this question is for Ted. You mentioned the problem of spaghetti webs and the issue of information programming and alluded to Dijkstra in your talk. And I wonder if you could address that issue. What solutions you would propose for enabling us to create better-structured hypertext, given the tools that we have now?
NELSON: Well, that last codicil, that's, again, like what kind of ketchup do you want on your cow patty? And the important thing is to improve the tools. Right now what Tim has given us in the way of addressability of links has been fabulous. We have to go on, make these bi-followable, again, not bi-directional. Because links can be directional and followable from both ends.
And secondly-- transclusion meaning that you can look at the same thing in some of its many contexts. And transparallel visualization in browsers. So transparallel visualization could be added to Netscape or any other browser. You say, just, well, keep the thing you're jumping from on the left and add the thing you're jumping to on the right and sort of scroll leftward as you keep jumping through things.
Isn't it amazing that hypertext, which has non-linear structure, is stored by Netscape and Mosaic as a sequence? You say back and forward in the structure. What does that mean? You're talking about back and forward in time in your history of the structure. Anyway, so that transparallel visualization and transclusion are my answers, and it doesn't matter what the question is.
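The distinction drawn here between bi-directional and bi-followable links can be made concrete with a small sketch: each link keeps its direction (a source and a target), but an index lets a reader discover and follow it from either end. The `Link` and `LinkIndex` names and the in-memory index are assumptions for illustration; they do not describe how any particular browser or the Xanadu system stores links.

```python
from collections import defaultdict
from typing import DefaultDict, List, NamedTuple


class Link(NamedTuple):
    """A directional link: it points from source to target."""
    source: str
    target: str


class LinkIndex:
    """Indexes each link under both ends so it can be followed from either."""

    def __init__(self) -> None:
        self._outgoing: DefaultDict[str, List[Link]] = defaultdict(list)
        self._incoming: DefaultDict[str, List[Link]] = defaultdict(list)

    def add(self, source: str, target: str) -> Link:
        link = Link(source, target)
        self._outgoing[source].append(link)
        self._incoming[target].append(link)
        return link

    def links_from(self, doc: str) -> List[Link]:
        """Links that start at this document."""
        return list(self._outgoing.get(doc, []))

    def links_to(self, doc: str) -> List[Link]:
        """Links that point at this document, found from the target end."""
        return list(self._incoming.get(doc, []))


index = LinkIndex()
index.add("essay", "source-text")
index.add("review", "source-text")
print(index.links_to("source-text"))
```

A one-way link stored only inside the source page gives the target no way to answer `links_to`; the extra index is what makes the same directional link followable from both ends.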
ENGELBART: This is the kind of thing, I think, that Ted and I have waved at each other across the frontier for many years, that I think there's more to it than that. Eloquently spoken, but there are a lot of tool changes in the things you can do in there to make a different way for moving around, for visualizing what's going on, for constructing views that are interesting and supportive, and for getting conventions in the way you structure and tag things so that they're more useful, et cetera.
So there's a lot to explore. And I think it's not going to happen unless there is a purposeful push just to try to make a better constructive way in which knowledge is dynamically developed by a collective group. And so-- but thank you very much for the ideas.
VAN DAM: Let's switch to MBone. Let's lower the lights a little so we can see the question.
CREW: Not yet.
VAN DAM: Not yet? Oh, I was under the impression we had some things cued up. Okay, in that case, somebody else from the audience. Hey, Raj.
AUDIENCE: Raj Reddy, question for Michael. Do you think we'll get to the stage where all the libraries will stop funding the current 3% and go all electronic? And when should they do that?
LESK: You're saying when should lib--
AUDIENCE: --we stop funding the current library systems and go to all electronic digital libraries?
LESK: This obviously is going to be a progressive thing. There are pharmaceutical companies today where 80% of their acquisitions budget goes to purchasing electronic materials. There are publishers today where more than half their-- publishers like the American Chemical Society that have a paper tradition where more than half of their revenue comes from bytes rather than pages. So there is a transition happening.
What I am saying is, I believe that by 2010, most of the information that is needed by ordinary students and ordinary faculty members will be coming electronically. And that will be the major acquisition item in a library budget. They probably won't be buying objects. They'll probably be buying the right to access the object, in some way.
AUDIENCE: The issue is, what is the transition path? Namely, those of us in universities who have a $3 or $5 million budget have to worry about how to do it without impacting a lot of people and without impacting students. And what is the right transition path?
LESK: Actually, my experience of a number of universities is I know more library directors who are anxious to move in this direction being held up by some faculty members demanding that the library keep buying books than I do the reverse, places where the faculty is beating up the library because it keeps buying books. Every university librarian will have a story of how they have had to cut back journal purchases.
Because one of the odd statistics I tried to calculate recently is if you plot the statistic, books purchased by academic research libraries in the United States divided by number of books published in the United States year by year, it's been going down steeply for 20 years. It's now back to where it was in 1920. So every librarian has stories of cutting back journal subscriptions. There are screams from the faculty when they try it. After it's done, nobody notices.
So the answer is, unfortunately, fairly clear. Librarians are going to cut back the sorts of journals that exist because a few people on the faculty have to get tenure and for no other reason, journals that are write-only and never read. And the journals will go away. And no one will notice. And it will free up a bit of money for the libraries to do new service. And what they really need is a way to do this without the faculty members who have been seduced into being editorial board members of those journals complaining.
VAN DAM: Last last question from Stu. And then Paul will tell us where we go next.
AUDIENCE: Stu Card. So as I was sitting back listening, it finally struck me what was odd about all your presentations, in a way, is that you all assume a world in which things go, eventually, some sooner, some later, into a completely-- we go into a completely electronic world and leave the paper world behind, as opposed to a world in which we have a role for paper or augmented realities in which physical realities are interleaved with electronic realities, pieces of paper, Ted's cut and paste, for example, that have digitally-embedded URLs in them so that, in fact, he carries around linkages to things around the world or paper-tronic systems in which the computer data structures are embedded perhaps in the half tone images of the things that are printed out or other forms that are mixtures between the physical and electronic world. So I was wondering if someone might make a comment on that.
KAHN: Holding the microphone, I would say that if you got that out of my talk, you've got a problem with your perceiver. Because I actually hold the view that paper is, in fact, going to continue to be probably an increasingly present component of this system. I remember all of those articles about the paperless office. And I think that's as likely to happen as the paperless bathroom. I mean, just not really in the cards, so to speak.
And I just don't believe that people are going to be without paper. I think there will be augmentation technologies, not of the kind that Doug has worked on, but that give you the equivalent of that. If you want to carry it around, and you don't like the equivalent of a laptop or something like that. And it may even be embedded in the clothing that you wear, if that's the case. But I think paper is going to be with us for a long time.
ENGELBART: How long?
KAHN: As long as I'm around to perceive it.
ENGELBART: Gee, goodbye. Well, I think there's some things you look at and you could say, it's inevitable. And you don't have to set the time exactly. But if it's inevitable, you start kind of counting on it in the future. And I think it's inevitable that we won't have paper, that technology will let you have, if it's because it's convenient and small, it'll still let you have all kinds of views you can toss in on it or something like that.
So I think its days-- it's doomed. And, you know, I used to tell people I like books, too. You know, but practically speaking, I can hold in my hand something that has access to all kinds of books, it won't weigh as much maybe as a book does, it will have more information, accessibility to it, and flexible usage of it. So I think it's inevitable.
NELSON: I love books. I have warehouses full of them. And I don't get to read them very often. And I would like to have them all right there in a virtual surround. Because they are there in my heart. And everybody's information environment is essentially their spiritual environment. I mean, your books, the magazines you read, the TV you watch, this is your sense of identity.
But for me, for the last 35 years, it's just been clear as a divine revelation, as Doug has said, that obviously that's what we want to get away from. And I have been designing a paperless world for 35 years. And I'm very comfortable with it. And I promise you that when you actually see it, it'll be great. But the other part of that is-- and as I've said to other people who bring up the bathroom metaphor, confusing an office and a bathroom, I think, is a rather deep question.
KAHN: Don't invite them to your office.
NELSON: Right, I wouldn't want to go to your office. And--
And I just want to say this one thing. I am very, very happy this moment to be between two of the men I admire most in the world.
BERNERS-LEE: Thank you. I love books, too. And when it comes to pieces of paper to scribble on, just looking at the hardware technology, come off it, I mean-- in my office, I have two of the biggest high resolution screens I could get stuck together as close as I could get them so that you can drag the windows across. And I have very little paper. And I must confess, at home, I have stacks and stacks of paper. I feel quite at home in either environment.
But for scribbling on, for doodling, I think that whereas the reference books you're going to have to have in cyberspace, most of the pieces of paper, it's going to be so difficult to find hardware, which you can layer, pin on the wall quite as easily. My feeling is that you're not going to get rid of paper unless you can make virtual paper, in other words, unless you can with your hand gestures, with your virtual reality glasses, you can basically do the equivalent thing.
You can actually zoom with perhaps accelerated mouse hands so that you can just flick 100 miles to pick up a really interesting piece of paper and flick it back. Scribble on it, then screw it into a ball, and throw it into the metaphorical wastebasket. When you can do that sort of thing, then maybe you will be getting to the point where you can get rid of-- [VOCALIZING] Bye, folks. Bye.
PENFIELD: I think we should give a wonderful hand to the panel here and to Andy.