Computation and the Transformation of Practically Everything: Current Research (Session 3)


JOHN GUTTAG: You've just heard many deep philosophical comments from what we'll call senior statespeople in the field, all of whom professed an unwillingness to attend MIT's 200th Anniversary Celebration.

Now we're going to do something completely different. We're going to hear, not from senior statespeople, but from some young Turks. And I think there's a good probability that at least one of the younger people you will hear from today, or heard from yesterday, will actually be on the Turing Award panel 50 years from now, which I expect to chair.

[AUDIENCE LAUGHTER]

So our first speaker is Polina Golland, from EECS CSAIL. Polina?

POLINA GOLLAND: Good morning everybody. My group's research focuses on computational modeling of biological shape and function. Specifically, we build computational models of how biological shape varies across individuals, how the function is derived from the shape and affected by it, and furthermore, how the two are intertwined.

In application to neuro-imaging, which is what I'm going to talk about today, this leads to lots of interesting questions, such as: how do diseases affect the brain? How does the normal brain develop and age? Can we actually understand something about the functional organization of the brain by carefully studying its anatomical organization, as well as the relationship between anatomy and function?

We have a wide variety of projects. I want to give you one example today, where building such computational models in a non-conventional way really provides insight into the organization of function and anatomy in the human brain.

And specifically, in traditional population studies, clinical studies of diseases, the approach that's typically taken is to say, there is an average brain that represents every one of us fairly accurately, and all of us can be thought of as slightly noisy instantiations of that brain.

And then, for example, if there is a particular disease such as Alzheimer's or schizophrenia that creates an alteration in that brain-- so we can think of an average schizophrenic brain as well-- then we're going to throw all of our resources into understanding what the differences are between these two average brains.

This approach has been somewhat successful but surprisingly limited, mainly because if we try it, we find that the variability in anatomy within the normal population is so large compared to the subtle differences induced by these severe diseases that you rarely find anything very specific.

So what we've done is turn the problem upside down and say, let's find the structure in the space of anatomical images. Let's characterize what that space looks like, and then understand how external factors-- such as whether somebody has a disease or not, or what their age is-- relate to that structure. And what you find is definite structure in the images that is sometimes correlated with the demographics and clinical variables in the population, but sometimes is not.

And so we get a glimpse into both types of variability. So let me show you an example. We applied this type of analysis to about 400 subjects from a longitudinal data set from Harvard University. And the goal of the study was to collect enough data to actually study the development and the aging of the brain. So the ages of the subjects varied from 18 to 96 years. And some of them were diagnosed with mild cognitive impairment, which is sort of a precursor to Alzheimer's disease.

I should say that our computational model, for those of you who are computationally inclined, can be thought of as a mixture model. So we have several templates in the population that represent anatomically homogeneous sub-populations, and everybody is assigned to one of those.

In contrast to what happens in standard signal processing, the variability around each template is not captured by a very simple model. In signal space, oftentimes you put something like a Gaussian distribution to model your noise. In our case, the noise consists of the geometric warps that deform my brain into yours, effectively. So we had to develop a set of algorithms that allow us to actually estimate this geometric noise, and they are quite substantially different from standard mixture modeling in the field.
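For the computationally inclined, here is a minimal sketch of what a template-based mixture model with hard assignments might look like, assuming a hypothetical non-rigid registration routine `register(image, template)` standing in for the geometric warps. This is an illustration of the general idea only, not the group's actual algorithm.

```python
import numpy as np

def fit_template_mixture(images, num_templates, register, num_iters=10):
    """Sketch of a template-based mixture model over anatomical images.

    `images` is a list of 2D/3D arrays; `register(image, template)` is a
    hypothetical non-rigid registration routine that returns the image
    warped into the template's coordinate frame.
    """
    # Initialize templates from a random subset of the images.
    rng = np.random.default_rng(0)
    idx = rng.choice(len(images), num_templates, replace=False)
    templates = [images[i].astype(float) for i in idx]

    for _ in range(num_iters):
        # E-step (hard assignment): each image goes to the template it
        # matches best after warping.
        assignments, warped = [], []
        for img in images:
            candidates = [register(img, t) for t in templates]
            errors = [np.mean((w - t) ** 2)
                      for w, t in zip(candidates, templates)]
            k = int(np.argmin(errors))
            assignments.append(k)
            warped.append(candidates[k])

        # M-step: each template becomes the mean of its assigned, warped images.
        for k in range(num_templates):
            members = [w for w, a in zip(warped, assignments) if a == k]
            if members:
                templates[k] = np.mean(members, axis=0)

    return templates, assignments
```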

Anyway, when we apply this to this 400-subject data set, we find these three templates. What I'm showing you here is a cross-section through the brain, sort of going like this. These black cavities are the ventricles, which are fluid-filled cavities inside the brain. The gray matter is on the outside of the brain. And this bright white matter is believed to be the axonal connections that run between different neurons in the brain.

Now, the algorithm finds these three templates in a completely unsupervised way, in the sense that we didn't tell it anything except, these are the images that we need to summarize. These templates collectively represent three relatively homogeneous sub-populations in this group of subjects.

Now if you look at the images themselves, you can actually see the differences between them, right? The ventricles are getting bigger as I go from this template, to this template, to this template. A more subtle difference is, if you look in this area, it gets darker. This means there is a decrease in gray matter and an increase in fluid, because in these images, fluid shows up darker than the gray matter.

So overall, this picture is a picture of degeneration. Basically, the brain matter is decreasing, and there is more and more fluid.

Now, I put these labels on not because the algorithm produced them-- in fact, it's blind to the ages. All it was given was the images themselves. But if you look at the distribution of ages across these three templates-- so, in this image on the right-- this is the age distribution of the subjects who looked most like the leftmost template. And then there are two bumps for the age distributions of the two other templates. So my students decided to call them the young, the middle-aged, and the old populations.

Now for comparison, I'll show you what happens if you tell the algorithm, find me just two templates. Then very clearly, you get the distinction between the young and the old. For those of you who are starting to draw dramatic conclusions for yourselves: actually, there is quite a bit of a bump around the fairly advanced ages. And the question is, how do these people succeed in keeping their brains looking like the young template, right? So this is the clinical question that our collaborators are working on.

And more interestingly, if we allow more richness and ask for three templates, the model doesn't distribute them uniformly. In fact, it focuses on the advanced ages. And for the two templates here, even though we call them the middle-aged and the old, if you look at the average age, it's really not that different.

Now, on top of the structure that we discovered in the images without any outside information, you can start asking questions. Okay, what about Alzheimer's disease? Well, it turns out that you are twice as likely to have Alzheimer's if you're in the red cluster as if you are in the yellow cluster. So suddenly, we're starting to see how the demographics and the clinical variables in the population correlate with the structure in the images. And we could do that by, basically, reducing the variability within every one of the sub-populations.

Now, the young subjects are separated from the old subjects. And by young, I really mean young-looking, rather than young in age, because you can see that there is a substantial bump at the advanced ages. The young-looking subjects look different, but now they're removed from the analysis of the clinical variable. And as a result, we have a much clearer picture of what's going on when, for example, Alzheimer's attacks your brain.

So we're pursuing this research in collaboration with our Harvard colleagues. And now they're interested in looking carefully at the people who fall at the boundary between the red and the orange groups here, understanding what makes those two different, and getting insight into the nature of the disease.

So to summarize, [INAUDIBLE] showed you how, by changing the problem somewhat and constructing a computational model of variability in the images, we could get a much better handle on the structure in the images overall, as well as the structure that is correlated with clinical factors of interest. And specifically, what I showed you is the potential for doing clinical studies in a completely different way. But this methodology also has implications for image analysis for surgical guidance, for understanding how function relates to structure, and for many other areas of clinical science. Thank you.

[APPLAUSE]

JOHN GUTTAG: Well, that was a talk well-given by someone on the young side of that graph. The next speaker is Dina Katabi, who will talk to us about a rather different subject. Dina?

DINA KATABI: So this talk is about how we can get mobile video that does not glitch or stall. Mobile video is the next killer application, according to all predictions. We are all very excited about getting TV programs on our hand-held devices, and watching live concerts and sporting events on our tablets or PDAs.

However, the question is, how can we do this, given that mobile video today suffers from glitches and stalls all the time? So one reason mobile video suffers from glitches and stalls is that-- the button is not moving. Okay. But there is a more important technical reason, which is that mobility causes unpredictable variations in channel quality between the transmitter and the receiver.

So for example, if this is your transmitter and this is your receiver, and you look at the channel between them as a function of time, you will see that the channel goes up and down all the time. In fact, even if you just jitter your hand, you get crazy variations. And wireless video cannot deal with such variations.

In fact, wireless video suffers from a problem called the performance cliff. That is, if you plot the video quality as a function of the channel quality, you are going to see a cliff graph. You're going to see that the video works really well at a particular channel quality. However, as you improve the channel quality beyond that point, the video does not improve with the improved channel quality.

And if the channel quality degrades, your video becomes unwatchable. And this is why you get these glitches and stalls. And you might say, oh, maybe we can change the modulation, we can change the error-correcting codes, and get different performance. That's true, you can. But actually, you're going to get a different cliff point.

And here are all the possible cliff points for the codings and modulations available in systems like WiFi and WiMax. Ideally, however, you want this performance. That is, you want a video quality that changes smoothly with changes in the wireless channel quality between your transmitter and receiver. And if you can get this, then there will be no glitches and no stalls, because the degradation will be very smooth as the channel varies. Not only this, but at every single instantaneous channel quality, you're going to get the best possible video performance for your channel.

So before I tell you how we try to approach such performance, let us see why current video suffers from stalls and glitches-- in fact, why current video suffers from a cliff effect. Current wireless video has two components: the video codec, which compresses the video, and the channel code, which protects the video as it goes over the wireless channel.

And the problem with these two components is that both compression and error protection transform real-number pixel values into sequences of bits. But these bits have no numerical relation whatsoever to the original pixel values.

So if you have two sequences of bits that differ by only one bit flip, say because of some error on the channel, they might refer to completely different pixel values. So for example, if you transmitted 1, 1, 1, 0, and you received 1, 1, 1, 1, which is just one bit flip, it might change the luminance from 5 to 149.
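To make that concrete, here is a toy illustration of the point, not an actual video codec: the codebook below is entirely made up, but it captures how an entropy-coded bitstream can map numerically close codewords to very different pixel values.

```python
# Hypothetical codebook for illustration only: in an entropy-coded stream,
# codewords are assigned by symbol frequency, so two codewords that differ
# by one bit can decode to wildly different luminance values.
codebook = {"1110": 5, "1111": 149}

transmitted = "1110"
received = "1111"             # a single bit flip on the channel

print(codebook[transmitted])  # 5
print(codebook[received])     # 149 -> a huge jump in luminance
```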

Indeed, the problem is way, way worse than this, because all video codecs use something called entropy coding, in which a single bit flip causes a loss of synchronization between the transmitter and the receiver, and you get arbitrary errors in the video.

So with mobile video today, if you can deliver all the bits perfectly correctly to the receiver, you get perfect video. But if you have one residual error, then you get arbitrary errors at the receiver. You get the cliff effect, and glitches, and stalls.

So how do we solve this problem? Here is our solution. It's called SoftCast. Instead of using two components-- one that compresses the video and one that protects it from errors, both of which transform the real-number pixels into bits-- we're going to have one code that both compresses the video and protects it from errors, without transforming it into bits.

Actually, it operates over real numbers. And the key property of this code is that it is linear. As a result of this linearity, small perturbations on the channel that affect the transmitted signal have a linear impact on the pixel values, and therefore produce small perturbations in the pixel luminance, and small perturbations in your video. So, no cliff effect. This on its own is not sufficient, though. You still need to compress the video, obviously, and protect it from errors.
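To see why linearity matters, here is a tiny illustrative experiment. The random matrix below is just a stand-in for some linear code, not SoftCast's actual code: the point is only that a small channel perturbation of a linearly coded signal comes back as a proportionally small perturbation of the pixels.

```python
import numpy as np

rng = np.random.default_rng(0)
pixels = rng.uniform(0, 255, size=8)       # a block of pixel luminances
A = rng.normal(size=(8, 8))                # an arbitrary linear code (illustrative)

coded = A @ pixels                                # encode
noisy = coded + rng.normal(scale=0.01, size=8)    # small channel perturbation
recovered = np.linalg.solve(A, noisy)             # decode by inverting the code

# The reconstruction error scales linearly with the channel noise,
# so a small perturbation stays a small perturbation -- no cliff.
print(np.max(np.abs(recovered - pixels)))
```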

Okay, so the challenge in compressing the video is that existing compression schemes are not linear. So we need a linear scheme. Here is what we are going to do: we're going to compress the video by dropping three-dimensional frequencies.

So let me tell you what I mean by this. Say that this is your mobile video, and these are the frames. Now, we all know that pixel luminance changes gradually, both in space and in time, in a video. So if you transform your video to the frequency domain, the high frequencies-- both spatial and temporal-- are going to be essentially zero.

This is what we are going to leverage. In SoftCast, our scheme, we transform the video into the frequency domain using a transform called a 3D DCT. And this is what you get.

Basically, all these black regions are zero-valued frequencies. And zero-valued frequencies carry zero information, so you can compress the video by sending the non-zero frequencies and dropping the zero frequencies. In fact, you can compress the video to any level of compression that you want, by dropping all the frequencies above a certain threshold.

So this form of compression, the 3D DCT, compresses both within a frame and across frames. And it stays linear, so we don't have the cliff effect.
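Here is a minimal sketch of that compression step, assuming SciPy's `dctn`/`idctn` for the 3D transform. It keeps only the largest-magnitude coefficients and zeroes out the rest; it is a simplified stand-in for SoftCast's compression, not the actual implementation.

```python
import numpy as np
from scipy.fft import dctn, idctn

def softcast_style_compress(video, keep_fraction=0.1):
    """Compress a (frames, height, width) float array by dropping
    small 3D-DCT coefficients. Illustrative sketch only."""
    coeffs = dctn(video, norm="ortho")        # 3D DCT over time and space
    flat = np.abs(coeffs).ravel()
    k = max(1, int(keep_fraction * flat.size))
    threshold = np.partition(flat, -k)[-k]    # magnitude of the k-th largest
    mask = np.abs(coeffs) >= threshold
    return coeffs * mask                      # dropped coefficients become 0

def decompress(coeffs):
    """Invert the 3D DCT to recover an approximation of the video."""
    return idctn(coeffs, norm="ortho")
```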

Now, we still need to protect the video from errors. For lack of time, I'm just going to tell you the key point: the real challenge in protecting from errors is that existing error-protection codes transform the real numbers into bits, and that kills linearity. So we can't use them.

So at a very high level, the solution that we adopt is to formalize this as an optimization problem, and find the optimal linear code that minimizes the video error at the receiver.
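The talk doesn't spell out the resulting code, but as an illustrative assumption, here is a sketch of one simple linear protection scheme: scale each group of DCT coefficients by a gain derived from its variance, under a total transmit-power budget, so that important coefficients get relatively more protection. The inverse-fourth-root rule below is a common closed-form answer for minimizing mean squared error in this kind of setup; treat it as a sketch, not the exact SoftCast derivation.

```python
import numpy as np

def linear_protect(chunks, total_power):
    """Toy linear error protection by power scaling.

    `chunks` is a list of 1D arrays of DCT coefficients grouped by
    frequency band. Each chunk is multiplied by a real-valued gain;
    the receiver simply divides by the same gains.
    """
    variances = np.array([np.var(c) + 1e-12 for c in chunks])
    lengths = np.array([len(c) for c in chunks])

    gains = variances ** (-0.25)              # g_i proportional to var_i^(-1/4)

    # Normalize so the expected transmitted power meets the budget.
    power = np.sum(gains ** 2 * variances * lengths)
    gains *= np.sqrt(total_power / power)

    protected = [g * c for g, c in zip(gains, chunks)]
    return protected, gains
```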

So let me go quickly and show you some results. We implemented SoftCast, and we compared it empirically with the existing approach, which uses MPEG-4 Part 10-- what people refer to as H.264-- over OFDM channels, which are what modern wireless systems use.

And I'm going to show you the video quality as a function of the channel quality. And we have seen, already, that the existing approach has all of these different cliff graphs, depending on the modulation and the error-correcting code that you use. And here is SoftCast.

So as you can see, it's pretty close to the ideal that we wanted to start with. And the video quality changes very smoothly with the channel quality.

Now I also want to show you a demo that visually conveys what this smooth change means for your mobile video, and for the stalls that you might otherwise see.

So here is the demo. In this demo, we have a mobile receiver, which is a cell phone, up there. This mobile receiver is going to move away from the transmitter. And as the mobile receiver moves away, obviously, the channel quality is going to degrade. And we're going to see the impact on two videos: SoftCast, there, and on this side, MPEG-4 Part 10, which is the existing approach. So could you please play it?

So as you can see, and as you should probably expect, the existing approach-- MPEG-4 Part 10, H.264-- has a cliff effect and is going to stall with variations in your channel, whereas the new approach just changes smoothly with the channel quality. And you can continue watching your game.

So with this, I'll end my talk and leave it to the next speaker.

[APPLAUSE]

JOHN GUTTAG: So I'll remind people that, though we don't have time for questions in the session, if you want to come down, we'll ask the speakers to hang around afterwards-- or maybe at lunch-- so you'll be able to ask them some questions.

The last speaker of the session is Rob Miller, again, another one of the young Turks from EECS and CSAIL. Rob?

ROB MILLER: Thanks, John. Can I have my slides? Great. So I want to talk to you about some of the work that my group has been doing lately in crowd computing, which is an intentional analog to the notion of cloud computing that's popular now.

We now live in a world where elastic, highly-available human resources are out there on the web, for us to tap and use in systems, to solve problems that are really too hard for us to know how to do with software alone.

Here's a simple example of one of those problems. This is a handwriting transcription problem, and we just have no way to solve it using automatic methods. In fact, any one person would have trouble reading this as well. So what we want to do is orchestrate a group of people-- whom we're going to get from out on the web-- in a process where each of them makes only a small contribution, and where we're not necessarily trusting any one person's contribution, in order to solve a problem that we don't know how to do with software, and maybe one that even the end user, the single person who wants this done, doesn't know how to do.

There are lots of places today where you can get these crowds. Wikipedia has a volunteer crowd of millions of people making small contributions to it. There are your Facebook friends-- so Ron and Corby could, maybe, log onto their Facebook accounts more often if they could take advantage of the people that are following them.

In my group, for prototyping, for developing and testing these systems, we hire our [INAUDIBLE]. And there's this neat service that Amazon has created called Mechanical Turk, which is another cute pun, based on the famous 18th-century prank of a mechanical chess player. Mechanical Turk is a labor marketplace where people are constantly there, ready to do tiny tasks, on the order of minutes, for tiny amounts of money, on the order of cents.

So we're actually going to break these problems up into tiny pieces that we can do on Mechanical Turk. The challenge is that we need to do this in a smarter way, and here's an illustration of why. I recently put up a task on Mechanical Turk asking people to just flip a coin for me and tell me whether it was heads or tails, and I'd pay them a cent. And these are the results that I got back, which either mean that the coins in circulation are horribly unfair, or that there wasn't a lot of actual coin-flipping going on.

Also, I asked 100 people, and you may notice this doesn't add up to 100. That's because the last coin was, apparently, a binary coin that came up either 0 or 1. The upshot is that people are not necessarily going to do what you ask them to do. So you have to organize them into some kind of system that tolerates the noise that you're going to get from this human computation.

So coming back to this simple example, the process that we experimented with here is iterative improvement: getting people to gradually do the task by making small contributions. We may have a partial transcription here; we give it to one person and have them improve it, by trying to figure out a few more words using some of the context they already have. Then we take the work that they did, compare it with their input, pass both to another group of people, and simply ask them which is the better partial transcription of this handwriting.

And we do that iteratively, until we get close to the result. After nine iterations of this particular task-- which involved somewhere on the order of 50 people, but only for tiny amounts of money, so less than $1-- we end up almost getting the right answer.
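Here is a hedged sketch of that iterative-improvement pattern. The functions `ask_one_worker_to_improve` and `ask_workers_to_vote` are hypothetical stand-ins for posting small tasks to a crowd platform such as Mechanical Turk; the structure, not the plumbing, is the point.

```python
def iterative_improvement(task_image, initial_guess,
                          ask_one_worker_to_improve, ask_workers_to_vote,
                          num_iterations=9, num_voters=5):
    """Alternate 'improve' and 'vote' rounds until the answer converges."""
    best = initial_guess
    for _ in range(num_iterations):
        # One worker tries to improve the current partial transcription.
        candidate = ask_one_worker_to_improve(task_image, best)

        # A separate group votes on which version is better; the candidate
        # is kept only if a majority prefers it, so no single worker's
        # output is ever trusted on its own.
        votes = ask_workers_to_vote(task_image, best, candidate, n=num_voters)
        if votes.count(candidate) > num_voters / 2:
            best = candidate
    return best
```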

Here's another example. This is actually printed text that we've applied a Gaussian blur to. And after eight iterations, somewhat magically, the crowd is able to figure it out. So these are simple experiments. To be honest, they are not so important to me.

This is a problem that is important to me as an academic. When I write a conference paper, there's some arbitrary limit on that paper that I have to squeeze it into, and it's usually half an hour before the deadline that I worry about this. And I'm not quite down to the limit. I've tried all the tricks, like shrinking the font and all that kind of stuff. And I still have extra left over.

What I want to do is throw a crowd at this problem-- to look at my text and find places where it can be shortened-- and then take all of those suggestions, not just the places but ways to say them more briefly, and bring them back into a user interface that allows me, as the end user, to basically drag the length of my text down to exactly where I want it, automatically selecting among all the suggestions the crowd has made in order to get exactly the length I want. It should also give me a good user interface, so I can look at what changes were actually made and make sure they don't change what I meant to say.

So this whole area of work combines user interfaces for the end user on the desktop or on a mobile device, user interfaces for the crowd, algorithms to structure the crowd, and, as well, an algorithm that makes this selection from all these choices in a greedy way in order to get the final answer. So it's really a hybrid interaction between HCI, crowds, and traditional computation.
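As a rough sketch of that greedy selection step, here is one way to pick among crowd-suggested rewrites to hit a target length. The `(original_span, shorter_replacement)` data structure and the sample edits are assumptions for illustration, not Soylent's actual internals.

```python
def shorten_to_target(text, suggestions, target_length):
    """Greedily apply crowd-suggested rewrites until the text fits."""
    # Rank suggestions by how many characters each one saves.
    ranked = sorted(suggestions,
                    key=lambda s: len(s[0]) - len(s[1]),
                    reverse=True)
    for original, replacement in ranked:
        if len(text) <= target_length:
            break
        # Apply the next most effective rewrite, if its span is still present.
        if original in text:
            text = text.replace(original, replacement, 1)
    return text

# Hypothetical usage: a paragraph plus three crowd-suggested rewrites.
paragraph = "In order to obtain the results we made use of a large number of volunteers."
crowd_edits = [("In order to obtain", "To obtain"),
               ("made use of", "used"),
               ("a large number of", "many")]
print(shorten_to_target(paragraph, crowd_edits, target_length=55))
```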

So that's one feature of this Soylent system. You may know Soylent-- it's made of people. So is this system. That feature is shortening. Another thing that Soylent does is straightforward proofreading. There is already a built-in spelling and grammar checker in Microsoft Word, but it's incomplete: it has false positives, it has false negatives, it doesn't catch everything.

You can take your text and put it on Mechanical Turk, using a process that I'm about to describe, and get identification of additional grammar errors and suggestions about how to fix them. And we bring those back and put them in the Word interface, using much the same user interface that you'd see for the automatic grammar and spelling check.

And then there's a more general programming notion, where you're asking the crowd to do something for you, do some editing task like changing from past tense to present tense. And that's a challenging thing for us to program, in general, right now.

Now, I said that algorithms are a challenge here-- workflow is a challenge. We don't trust any particular member of this crowd that is doing our proofreading or our shortening, so we orchestrate them into a process where they're going to check each other and correct each other.

The first step of that is just to find places to shorten, or grammar errors to fix. And we look for independent agreement on those error locations before we decide that something is actually worth changing. Then we pass those locations on to another group of people, who suggest alternatives. And then we take those suggested alternatives and pass them through a filter that basically throws away the worst ones-- the ones that introduce grammar errors of their own, or that change the meaning of the sentence.
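Here is a hedged sketch of that find / fix / verify workflow (the Soylent paper calls this pattern Find-Fix-Verify). The `crowd_find`, `crowd_fix`, and `crowd_verify` functions are hypothetical wrappers around posting small tasks to a platform like Mechanical Turk; the control flow is what matters.

```python
from collections import Counter

def find_fix_verify(paragraph, crowd_find, crowd_fix, crowd_verify,
                    num_finders=10, min_agreement=2, num_verifiers=5):
    # FIND: several workers independently mark spans that need work;
    # keep only spans that at least `min_agreement` workers agree on.
    marked = Counter()
    for _ in range(num_finders):
        for span in crowd_find(paragraph):
            marked[span] += 1
    agreed_spans = [s for s, n in marked.items() if n >= min_agreement]

    # FIX: another group proposes rewrites for each agreed-upon span.
    fixes = {span: crowd_fix(paragraph, span) for span in agreed_spans}

    # VERIFY: a third group filters out rewrites that introduce new errors
    # or change the meaning; only approved candidates survive.
    verified = {}
    for span, candidates in fixes.items():
        approved = [c for c in candidates
                    if crowd_verify(paragraph, span, c, n=num_verifiers)]
        if approved:
            verified[span] = approved
    return verified
```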

So here are some quick statistics about what this does. Roughly, every pass through the shortening system cuts about 15% off of your text. And actually, you can run it through multiple times, and each time it seems to cut off another 15%. But we think that eventually this will start to actually strip out the signal that's in your text.
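As a quick sanity check on that figure (the 15% per pass is from the talk; the compounding is just arithmetic), repeated passes multiply:

```python
# Each pass keeps ~85% of the text, so n passes leave roughly 0.85**n of it.
for n in range(1, 5):
    print(n, f"{0.85 ** n:.0%} of the original length remains")
# 1 -> 85%, 2 -> 72%, 3 -> 61%, 4 -> 52%
```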

The cost is actually comparable to hiring a professional proofreader-- a single person who will look at your text and shorten it, or find grammar errors and rephrase, and things like that. But this is much faster: it can happen within minutes, essentially, whereas the turnaround time for a professional proofreader tends to be much longer. And you're not depending on one person, who might flake out on you. You're using a crowd instead.

We wrote a paper about this system. Of course, you can't write a paper about a system like this without actually feeding the paper into the system. When we did that, we discovered a grammatical error that Word didn't catch, that none of the authors of the paper caught-- and there were a lot of authors-- and that none of the reviewers of the paper caught. It's this one down here, the word "introduce," which is actually supposed to be parallel with the word "changing" here. So it should be "introducing."

And none of those people caught it, I think, because they were all tired by the time they got to this part of the paper-- it was on page eight. Whereas for the crowd that was reading it, this was the first thing they saw. They were fresh eyeballs, and all they were looking for in this paragraph was grammatical errors. So there's really a lot of value here in drawing on human beings who are different from yourself.

I'm going to give one more example of an application we've done, which also shows the advantage of people other than yourself. There's a system called VizWiz that runs on a smartphone and allows a blind user to take a picture of a question they have-- a visual question they have in the world, like, which of these two doors should I go through-- speak that question, have it uploaded along with the picture to Mechanical Turk, and get a crowd of people to give answers.

And we get multiple answers back, again because we don't trust any individual person. The amazing thing about this system-- well, first of all, one amazing thing is that blind people actually can take pictures. But they don't always do it perfectly. So here's an example of a question: I want to know what's inside this can. And they haven't taken a picture of the right side of the can.

But human beings can look at this and figure it out. That's just the high ceiling of human intelligence and human problem solving: looking at, for instance, the edge of the label here, recognizing the kind of food that it is, and reading the ingredients and seeing that it's chickpeas, they can figure it out.

So one of the reasons we're excited about this area is that it takes the notion of Wizard of Oz that has traditionally been used in AI, and lets us think about wiring it directly into a deployed system-- a system that we're actually going to use in practice.

I'm going to skip over some of this design-space material for lack of time, and just leave you with some takeaways. We now live in a world where we can actually have human resources built into a system, wired into that system alongside the computation that has been so successful over the last 50 years. That will allow us to deploy systems that we don't really know how to build right now, and find out whether they're useful and how they can be used, by having human beings behind them, helping them work. But to do that really requires new algorithms and design patterns, to control for these strange behaviors of the crowd.

So thank you very much. I especially want to thank the students and collaborators who helped with this work, and the roughly 10,000 people out on the web who have supported these experiments and these systems.

[APPLAUSE]