Leslie Kaelbling: 6.01 Lecture 08


PROFESSOR KAELBLING: OK, we might as well start. It's a little sparse today, but I guess it's the holiday weekend phenomenon. Just to kind of explain where we are, and what we're doing at the moment, the design lab of this week is going to be the last of the circuit labs. And it's going to be about connecting the robot head that you guys just built up to the robot, and getting it to do some things. So we're going to re-integrate what we've been talking about in circuits with some state machine stuff.

What we're going to start doing today, in some sense the theme for the rest of the class, starting in the software lab this week with Homework Four, and then continuing to the rest of the software and design labs, is going to be a different way of thinking about modeling. I want to think about the trajectory of what we've been using models for in this class.

So when we started out, we were just kind of writing computer programs, and we said, oh well, maybe we can write a computer program to get the robot to do something intelligent. Maybe we can write a program to get it to follow the walls, or to stop a certain distance from the wall in front of it, or something like that. And that was OK, and as long as the programs we were writing weren't too complicated, we could just write the program, try it out, see if it worked, tweak the numbers a little bit, run it again, see if it worked, and so on.

But then what we did is we transitioned into a regime where we said, let's look at a particular special case of computer programs to control robots. That is to say, the ones that can be described as Linear Time Invariant Systems. What we did was we said, we're going to look at a simple class of programs that we can describe as LTI systems.

But we're also going to try to describe the way the rest of the world works. The plant, and all this stuff that's external to the controller that we're writing, we're also going to model that as an LTI system. And the advantage of making an LTI system program to control the robot, and an LTI system model of the world outside was that we could then compose those two models, and make a model of how the whole joint system would work, of the robot and the world outside. And then, we could use these mathematical tools of looking at the system function of the combined system, looking at the poles of the system function, and so on, to analyze how well that system would work.

So the whole of our theme in the LTI systems portion of this class was that if we could make a mathematical model of the robot and the world around it, then we can prove some things about how it would work. And we could use that model to give us insight into how to design a good system.

So now, what we're going to do in this last section of the class is take that modeling idea one step further. So in LTI systems, the model was in our head, or in our notebook, or in our computer program. The model wasn't in the robot. When we were thinking about circuits, in fact, last week, when I was going around design lab, I saw somebody and they had their circuit schematic. And they had a box with wires connected up to it that was labeled "Motor Model."

But you can't hook a resistor up to a model of the motor. You can hook a resistor up to the motor, that's OK, the motor's a physical thing in the world. That's in a network, that's actually in a circuit. But the model is something that exists in the world of mathematics. So there's the circuit, or the actual program, or something that exists in the world. The model exists in the world of mathematics. So far that's how things have been going. The model was in our head, or in our notebook. The thing we were modeling was the physical thing.

Now what we're going to do is try to make a class of computer programs that's very rich in the kind of behavior it can generate. And also very good, it turns out, at dealing with the uncertainty of the world, by taking the models from our notebooks, or from our computer program, from our analysis, and putting them inside the head of the robot. So the new maneuver is, we're going to make a model of how the world works, we're going to take that model of how the world works, and put it in the robot's head.

I would argue that you have all sorts of models of how the world works in your head. You have a model of what will happen if you knock over your coffee cup. You have a model of what will happen if you fail to study for the test. You have models of all kinds of consequences of the actions that you take. And you have models of how the sensors that you have that give you information about the world, how they're connected up to the actual state of the world.

So you have, in your head, models of how the world works, and models of how your sensors work. And you can use those to infer things. You can use your models to infer what might really be going on when you only get some particular pieces of information about a process. You can use those models in your head to infer what actions you ought to take.

You might say, if I want to get a good grade in this class, I should go to lectures. So clearly, you guys have made that inference, that's good. Or maybe you come here for the sheer joy, or because the chairs are comfortable, I don't know. But for some reason, you inferred that it would be a good idea to come here.

So that's what we're going to do. We're going to take our robots and we're going to put models in their heads. And we're going to let them use those models to infer something about what's going on in the world around them. And to infer what actions they ought to take in order to achieve particular goals. So that's our theme here.

So what we're going to do for the next two weeks is concentrate on probabilistic reasoning, which will let us try to infer something about what's going on in a process that we don't totally understand. For instance, I hear some crinkly, crackly plastic over there, and I think, oh, probably somebody's having their breakfast. I don't know that for sure. There's all sorts of other things that could be going on over there, but I hear this thing, and from that I'm going to infer something about what's going on in the world. So how can we get the robots to do that?

So this is what we're up to. We're going to get the robot to make inferences about its world from its noisy and partial sonar measurements. And we're going to get it to plan clever routes through complicated spaces this way.

Probability. So probably, you've all been exposed to probability theory in one form or another. A lot of the time, people who have only seen a little bit of probability theory have only ever heard endlessly about coins and die rolls. And that's a very particular set-up for doing reasoning about probability. And sometimes, you get the coin and die roll model so firmly in your head it's hard to get it back out again. So we'll try not to spend too much time there.

But we'll start there anyway. Probability is about considering a set of possible ways something could be, and assigning weights to those possible ways something could be. The something might be how many heads come up when you flip a coin 10 times. The something could be whether somebody has a disease. The something could be where the robot's located in the world.

So there's all kinds of different universes that you might want to reason about that you have uncertainty about. And probability gives us a way to attach likelihoods, or probabilities, numbers between zero and one, to those different ways things could be.

So to set up a probabilistic model of the world, of some world, first of all, you have to enumerate what are called the atomic events. These are the most specific things that you could say about what's going on in the world.

Let me say also that we're only going to talk about discrete probability spaces right now. We're going to assume our spaces are discrete and finite. So there's some set of primitive things that could happen, and we can make a list of them. And that list of the most specific things that we can describe, that's going to be our space of atomic events, the primitive things that could happen. And atomic events are mutually exclusive. That means only one of them can ever actually happen. And they're collectively exhaustive, which means at least one of them has to happen. So those two things together mean exactly one of these atomic events is going to be true.

So, terminology. OK, so here are the axioms of probability theory. Probability is cool, because there's a very small number of axioms, of assumptions we have to make, about how it works as a mathematical structure. And then there's all sorts of nice consequences that you can derive from them, and usually the derivations are pretty simple.

So we have the notion of an event. Before, we talked about primitive events, so primitive events, or atomic events, these are some particular things like this. Maybe you've got A, B, C, D, E, and F. So those are atomic events. We'll talk about events all the time. Events are just some sets of atomic events. So an event in general could be a set of atomic events.

So when I write the probability of something, the something always, always, always-- when you write "probability of (blank) equals some number," what goes there is an event. It's a set of possible things that could happen in the world.

So what has to be true for all the probability statements that we can make? The first is that the probability of any event is greater than or equal to zero, so there's no negative probabilities. The second is that the probability of U is one. U is the universe, the set of all the possible atomic events, sometimes called the sample space. The probability of the universe is one. So one of those things is definitely going to happen.

And then there's only one more thing that you need to know. Let's look at it in this form, because this is something that people are maybe more used to. And I'll write it down here, because it's sort of important to us. The probability of A union B is equal to the probability of A plus the probability of B minus the probability of A intersect B. So standard picture, here's the universe, here's A, here's B.

And if you think about this, what this says is we're interested in the probability that something in the set A or the set B happens. And so that's the probability that something in the set A happens plus the probability that something in the set B happens. But if we've added these things up, we've counted this middle region in the intersection twice. So we have to subtract it back out again.

From just those three formal mathematical statements, you can derive everything in discrete probability theory. That's it. We could just stop now. But it's like saying, oh, circuits. There's KVL and KCL, and you can describe how the components work, and that's all.

But it takes some practice, and thinking, and so on to turn those formal, fundamental things into intuition, and answers, and common cases, and so on. So we're going to do a fair amount of that. But in some sense, that's really all there is to discrete probability theory.
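To make the set language concrete, here's a tiny sketch in Python (the names are mine, just for illustration): the atomic events for a die are the integers one through six, an event is a set of them, and the probability of an event is the sum of the weights of its atomic events.

```python
# A sketch of events as sets of atomic events, using a fair die as the universe.
U = {1, 2, 3, 4, 5, 6}                  # the universe of atomic events
p = {outcome: 1/6 for outcome in U}     # equal weight on each atomic event

def prob(event):
    """Probability of an event: sum the weights of its atomic events."""
    return sum(p[e] for e in event)

A = {1, 2, 3}
B = {3, 4}
# The third axiom: P(A u B) = P(A) + P(B) - P(A n B)
assert abs(prob(A | B) - (prob(A) + prob(B) - prob(A & B))) < 1e-12
print(prob(A | B))   # 4/6: the intersection {3} only gets counted once
```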

So let's do a little simple check yourself here. So we're going to talk about rolling one die. If we roll a die, there's six possible outcomes: one, two, three, four, five, six. It's a cube. So what's the probability that the result is odd and greater than three? Put up a finger. OK, we got mostly ones, good, right. So odd and greater than three.

The reason that we have this example here is so that you can think about different events. Here's all the possible results of the die: one, two, three, four, five, six. And we assume that it's a fair die. Fair means that it's going to come out equi-probable. So the probabilities are one out of six.

So we look at all the cases that are odd. There's three of those. We look at the cases that are greater than three. There's also three of those. But there's only one where it's odd and greater than three at the same time. That's five. And so the probability of that is one out of six. This is the sort of stuff that probably everybody's done too much of before.

Our most favorite, most important, most crucial way of talking about events and probabilities is going to be conditional probability. Here's the definition of conditional probability. It doesn't necessarily mean anything by itself, but I'll explain it, and then we'll see what it's good for. So it's written like this-- the probability of A given B, so that vertical bar is "given." And mathematically, it's defined to be the probability of A intersect B-- that is, the probability that we're in the set A and in the set B-- divided by the probability that we're in set B.

So the way to think about this, the way I think about it is that B is the new U. Normally, when we're talking about the probability of some set, we think about how likely it is that we're going to end up with an atomic event that's in that set, divided by the likelihood that we're going to end up in an atomic event that's in the universe, U. What's the probability of the universe? One. We can always write the probability of B is equal to the probability of B over the probability of U. It's always fair to divide by the probability of the universe.

So in conditional probability, what we're going to do is shrink the universe. It's all about shrinking the universe. We're going to make an assumption. We're going to say, let us just assume that B is true, that B really happened, that B is definitely going to happen, we're definitely going to be in that set B. And now, under the assumption that we're definitely going to be in set B, how likely is it that we're in A?

So that's the intuition for conditional probability. Before, we had this whole big universe. And if we wanted to know, for instance, the probability of A, that would be sort of the amount of probability in the set A over one.

Now, B is the new universe. We forget about everything outside of B, and we ask, what's the probability of being in this set-- A intersect B, because those are the only As left-- divided by the probability of being in B. Does that make sense? Conditioning is a restriction of the universe-- that's the best way I know to think about it.

Here's a conditional probability question. What's the conditional probability of getting a roll greater than three, given that it's odd? We have quite a few twos in the air. Let's see how that goes. That's good.

Just as before, we can draw out the sample space, so we're sure we understand these events that we're talking about. So here are all the ones that are odd. Here are all the ones that are greater than three. Now what we're going to do is restrict our attention to the possible outcomes where the result is odd.

So we just erase all the rest of the ways the world could be. We say, OK, there's really only three possible ways the world could be, because I've assumed that the outcome is odd. Maybe we're playing a sneaky dice game, I roll the die, somebody looks at it, and tells you whether it's odd or even, and then you get to place another bet. You can imagine such a dice game.

So somebody's told you whether it's odd, and now you're trying to decide whether or not it's greater than three. So now, in the new universe-- the new universe only has three items. And now, of those, only one of them is greater than three. We assume that they were equi-probable. And so, if we do the math out, we get one sixth over one half-- the one half being the sum of these-- which is one third. So most of you guys got it. Good. Conditional probability is our most important idea.
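Here's that shrink-the-universe computation as a Python sketch, building on the die setup from before (again, the names are made up for illustration):

```python
U = {1, 2, 3, 4, 5, 6}
p = {outcome: 1/6 for outcome in U}

def prob(event):
    return sum(p[e] for e in event)

def cond_prob(A, B):
    """P(A | B) = P(A n B) / P(B): B becomes the new universe."""
    return prob(A & B) / prob(B)

odd = {1, 3, 5}
greater_than_3 = {4, 5, 6}
print(cond_prob(greater_than_3, odd))   # (1/6) / (1/2) = 1/3
```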

It's important to know that conditioning can go either way. You might start with some probability of A, and then condition it on B. Somebody gives you some information, B. The information can make your probability estimate about A go up or down. So the probability of A given B could be greater than the probability of A, or the probability of A given B could be less than the probability of A. It could go either way. The information that B is true can change your mind.

So in this case, if they barely overlap, I tell you that B is true. You used to think A was pretty likely, and then I say B is true, and now you think maybe A is not that likely after all. Or it could be that you thought A was pretty likely, I told you that B is true, and now you think A is really very, very likely to be true. So it could go either way.

All right, so all this talk about enumerated events-- going to give a list of the sample space, and then we're going to talk about particular subsets and so on-- that's good when you only have two coins or one die roll, or something like that. But as the problems that you start to talk about get bigger and more complicated, we need more interesting mathematical structure to the probability spaces that we're going to talk about.

In order to give some underlying internal structure to our probability spaces, we're going to introduce this notion of a random variable. One way to think about it is that our random variable is the probabilistic analog of a variable. So it doesn't actually have a value. What it has associated with it is a distribution over values.

What I really like to say is that a random variable is a function that takes a value and gives you the probability that it's going to take that value on. A simple random variable that we're all used to is the roll of a particular die. We might say X now is a random variable. It can take on one of six values. And for each of those values, we can assign a probability that it takes that value.

There is a mouse, how interesting. Like an actual mammal. It's right over there. OK, good. It wants your breakfast. Oh no, everybody pick their feet up. So, where was I? Probability. So what's the probability that a mouse is going to like run over somebody's toes? We can ask that question. That would be a question we could ask. Small. I think it's trying to exit.

This is the distribution on a random variable. It's uniform. We'll call this uniform, it means all the probabilities are the same. It doesn't need to be like that, that's just an example. Once we have the notion of random variables, we can talk about events like this. We could say the probability that X equals 3. So that's an event. That's the event that the random variable X has the value three.

Now we'll start talking about worlds where we have lots of different random variables. Maybe we have six random variables all going at the same time. If we say the probability that X is equal to 3, that picks out all the particular assignments where that one variable has the value three.

Let's talk about joint probability distributions, and that will help us really get comfortable with the notion of a random variable. Imagine that I have two random variables, V and W. And now my atomic events are going to be pairs of values, a value for V and a value for W. So if I have two random variables, the atomic events are pairs of values.

And the event is that big V has value little v and big W has value little w. We'll use this notation a lot, with the big variable standing for random variables, and the little one standing for particular values that they could take on. We'll also write things with a comma in them, just for convenience, because we get tired of writing "and". But comma means "and". Probability of little v, little w is shorthand for the probability that the random variable V takes on value little v, and random variable W takes on value little w.

Things have been too abstract. So let's make them concrete. And we'll make them concrete using some actual data from an actual study, just to focus our attention a little bit. And this is a very common case of probabilistic reasoning.

Let's imagine that we're interested in the efficacy of an AIDS test. So somebody went and did a bunch of experiments. Imagine that they came up with this data, and right now, we just have the proportions. So we're going to have two random variables in this problem we're going to look at. One random variable is the random variable AIDS, so that's whether the person indeed has AIDS or not. It could be true or false. I know that this is a very naive model of disease, and not at all nuanced. But let's right now imagine that either you do or you don't have it. And there's the question of whether the test comes out positive or negative.

So we have two random variables, each of which can take on two possible values. The test can be positive or negative, AIDS can be true or false. Random variables don't have to be binary, but this is a simple case that fits on a slide nicely.

Somebody went and gathered a bunch of data, and they found that these were the relative proportions of these cases. So that means that, of all the people that participated in their study, let's say 0.003, so 0.3%, had the disease, and tested positive. So this is a joint distribution.

This describes-- there goes the mouse again. They're picking up their feet. Go out. Shall I try to chase it out? No, it's entertaining. I'll just ignore it. I won't tell you about it anymore. The front row is not happy.

This is a joint distribution. It tells us the probability of all the assignments to the random variables. Since each random variable can have two values, there's four possible assignments. So this says there's probability 0.003 of having a positive test and having the disease. So this is an and. There's probability 0.00005 of having the disease and having a negative test. Probability 0.02 of not having the disease and having a positive test. And probability 0.973 of not having the disease and having a negative test.

These four numbers add up to one. Because in a joint distribution, the cells in this distribution are the atomic events. We know exactly one of them is going to happen, and the probability of the universe has to be one. This is a joint probability distribution.

So what can we do with a joint probability distribution? We can answer a whole bunch of questions. I'm going to do these on the board. I'm not going to ask you to try to think them through, because we're not totally ready to do that. A question that we can ask is-- the mouse is really distracting. It's over here somewhere. Everyone is picking up their bags. Don't bring it to lab with you. Good.

Let's say we want to ask the question, what's the probability of a positive test, given that somebody has the disease? So this is the sort of thing that's written down in the leaflet. When you go and you buy the test procedure from the biotech company, you get a leaflet. And the leaflet tells you things like, what's the probability that the test is going to be positive, given that somebody has the disease. That's a statistic you want to know when you go and buy this test.

Let's see if we can figure that out, and I'm going to do it on the board. Here's our table. So we have AIDS, and that could be true or false. We have the test, that can be positive or negative. And we had these numbers here. So 0.003648, 0.000052, 0.022915, and 0.973385.

First of all, what I want to do is illustrate some other things that you can do with this joint distribution. And we're going to need these results in order to compute this thing we're going to compute in a minute anyway. Given this table, I might, first of all, want to ask the question, what proportion of the people in this particular study had AIDS? What's the probability that AIDS is true? How would I compute that from this table?

I would add up this column. This is the proportion of people who had AIDS and a positive test. This is the proportion of people who had AIDS and a negative test. If we add the entries in this column up, we get the proportion of people who had AIDS. So this is 0.0037.

And if we add up this column, what had we better get? 1 minus 0.0037. Otherwise we did something wrong. 0.9963.

This is known as the marginal distribution. In case you ever wondered why it's called the marginal distribution, it's because it's in the margin. It's the probability, in this case, it's the probability distribution over the variable AIDS. You can write the probability of AIDS is equal to little a. So a could be true or false, and that's the distribution over AIDS.

Over here, there's another marginal distribution we can compute. We can ask, what proportion of the people in the study had a positive test? How do we compute that? Add the row. The proportion of people with a positive test is this plus this. This is 0.026563, and this is 0.973437. This is another marginal distribution, there are two.

STUDENT: The false columns are different from the slides.

PROFESSOR KAELBLING: Oh, really? You mean I just read it wrong up here?

STUDENT: Yeah. Just the positive column.

PROFESSOR KAELBLING: Oh, this one. I see. Thank you. Is that good now?

STUDENT: No.

PROFESSOR KAELBLING: No?

STUDENT: False negative.

PROFESSOR KAELBLING: 9733. I got it. I just got the whole column wrong. That didn't add up to one before. Now it does, I hope. So this is another marginal distribution. This is the distribution of the probability that the test has some value. So that's distribution on the probability of test.

That's a start to computing what we need to compute. What we want to compute is the probability that the test is positive, given AIDS is true. And so to do that, we'll use the definition of conditional probability. We'll do the definition of conditional probability over here. So it's probability that the test is positive and AIDS is true, divided by the probability that AIDS is true.

Remember that conditional probability is about restricting the universe. To answer this question, we're only going to consider cases where AIDS is true, and we're going to ask, of those cases, what proportion had a positive test? This number, we can read off the table. It's one of the entries in the joint distribution. The probability that the test is positive and AIDS is true is 0.003648. The probability that AIDS is true is 0.0037, and so we get 0.986.

Another way to see what we did is that we really just ignored this column. When we did conditioning, we conditioned on the event that AIDS was true. So that's like saying, in this joint distribution, I'm just going to focus on this column. I'm going to forget totally that that other column exists, and pretend this column is my universe. If I'm going to pretend this is my universe, and I'm interested in knowing what proportion of these guys are positive versus negative, then I just take this number divided by the sum.

Does that make sense to everybody? Any questions about that? We're going to be doing a whole bunch of this.

Let's do an example-- and then we're going to do this another way later on-- going in the opposite direction. So let's say I wanted to know a different conditional probability. I want to know the probability AIDS is true, given that the test is equal to positive. That's the reverse of the one we just computed. I'm going to let you guys do this one. So what's the probability that AIDS is true, given that the test is positive? At least get it down to a fraction, and then put up some fingers.

I see some twos, and some threes, and some fours, and a lot of sleeping. Let's do it. We're going to have the same numerator. So this is the probability the test is positive and AIDS is true, but now the universe is different. We have the same numerator, different denominator. The denominator in this case is the probability that the test is positive.

So we're interested in focusing now in the population where the test was positive, and asking what proportion had AIDS. If we do that, we get the same numerator, 0.003648, divided by 0.026563. There's a number. What's interesting about that is that it's pretty small. It's 0.137.

So why is this number so small? It seems disconcerting, right? This says, at least in the population that this study was done in, if you went and had a positive test, the probability that you have AIDS is actually not that high. Why is that? The prevalence of AIDS in this population is not very high. We saw that the probability that somebody has AIDS without any information, in this population, is 0.0037.

If you get a positive test, now it goes up to 0.137, but that's still relatively low. So this is an issue that comes up in all kinds of medical testing, which is the false positives and how much trouble they cause. The answer here was less than 50%.
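Here's the whole board computation as a Python sketch, with the joint distribution stored as a dictionary keyed by (aids, test) pairs. The names are mine, and the entries are the corrected column values, so everything sums to one:

```python
# Joint distribution over (aids, test) pairs, as in the corrected table on the board.
joint = {(True, '+'):  0.003648,
         (True, '-'):  0.000052,
         (False, '+'): 0.022915,
         (False, '-'): 0.973385}

def marginal_aids(a):
    """Sum a column: P(AIDS = a)."""
    return sum(p for (aids, test), p in joint.items() if aids == a)

def marginal_test(t):
    """Sum a row: P(test = t)."""
    return sum(p for (aids, test), p in joint.items() if test == t)

# P(test = + | AIDS = true): restrict the universe to the AIDS-true column.
print(joint[(True, '+')] / marginal_aids(True))    # ~0.986

# P(AIDS = true | test = +): restrict the universe to the positive-test row.
print(joint[(True, '+')] / marginal_test('+'))     # ~0.137
```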

As we go along, I'm going to talk to you about how we manipulate and represent probability distributions in Python. The software lab today, and the homework assignment-- let me just insert a word about the homework assignment. We're handing out homework for today. It's technically due in two weeks. But you can do it in one week, and you should do it in one week, because the next week after that is the exam, and it's better not to have the homework hanging over your head. It's short-ish and it's doable after this lecture, so you should just do it.

We're going to have a Python class that we use to represent distributions. It's called DDist, for discrete distribution. It uses a Python dictionary to store the distributional information. And it has one method that we use all the time, which is prob. If you make a DDist and you ask the probability of an element, it'll give you the probability that this distribution assigns to that element. If the element isn't even in the dictionary, it returns zero. So let's just see how that goes.

Here's an example. Here's a discrete distribution. It happens to be 50/50 on head and tail. A dictionary in Python-- probably everybody's done something with a dictionary; if you haven't, you should go read about dictionaries, or do some practice problems. But the dictionary syntax is this. You have a curly brace, then a key, which can be pretty much any object in the world you want here, except for a list-- the string 'head', colon, a value. And then another string, 'tail', colon, a value. And so now we've made a distribution. If we say toss.prob of 'head', it tells us 0.5; toss.prob of 'tail', 0.5. You can ask it the probability of any other thing in the world that you want to, and it'll tell you zero.
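The real class comes with the course software, but here's a minimal sketch of how a class like DDist might be implemented, just to pin down the behavior described above (this is my sketch, not the official 6.01 code):

```python
class DDist:
    """A minimal sketch of a discrete probability distribution."""
    def __init__(self, dictionary):
        self.d = dictionary          # maps elements to their probabilities

    def prob(self, elt):
        # Elements not in the dictionary have probability zero.
        return self.d.get(elt, 0)

toss = DDist({'head': 0.5, 'tail': 0.5})
print(toss.prob('head'))    # 0.5
print(toss.prob('edge'))    # 0
```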

So that's a pretty straightforward way of representing a probability distribution. Conditional probability distributions are a confusing thing, because they aren't distributions, really. We call them conditional probability distributions, but a conditional probability distribution isn't a distribution.

So let's think about this. I could talk about this object-- the probability of test given AIDS. Now is test an event? What's test? It's a random variable. AIDS is also not an event. It's a random variable. So this is basically a structure, this conditional distribution is a structure that lets you answer specific probability questions. But it is itself not a distribution, nor is it a probability. It's a more complicated thing.

The way we're going to think about it is that it's a function which, given a value for the random variable on the right-hand side of the bar-- so, given a value for AIDS-- gives you a distribution on test. So given a value of the thing that we're assuming, on the right-hand side, I tell you AIDS is true. If AIDS is true, then we get this one distribution on the test results. If AIDS is false, then we get a different distribution on the test results.

So that's a conditional probability distribution. It's a function that takes a possible value of the variable on the right-hand side of the conditioning bar, and it gives you a distribution on the variable on the left-hand side. So here's how we would use it. Imagine that we define this test given AIDS.

Then if we apply this procedure to the value true, what type of thing do we get out? A distribution, right? We get a distribution on test. Here's our distribution on test. If we want to know, what's the probability of a negative test given that AIDS is true, we would write this expression here. Does that make sense to everybody?

It takes a little bit of getting used to, but it's really important to understand that a conditional distribution is not itself a distribution. It's a function from values to distributions.
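In code, that's literally a function from a value to a DDist. Here's a sketch using the DDist sketch from above, with the test probabilities from the example (the function name is mine, not the course library's):

```python
def PTestGivenAIDS(a):
    """A conditional distribution: given a value for AIDS, return a DDist on test."""
    if a:   # AIDS is true
        return DDist({'+': 0.986, '-': 0.014})
    else:   # AIDS is false
        return DDist({'+': 0.023, '-': 0.977})

# P(test = '-' | AIDS = true): apply to the value, then ask the resulting DDist.
print(PTestGivenAIDS(True).prob('-'))   # 0.014
```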

What we're going to do is now look at different ways of manipulating distributions, and then talk about some patterns of reasoning. What we looked at before was a situation where somehow, we were given the joint distribution, and we were interested in computing the marginals, and some conditionals, and so on.

Sometimes you'll find yourself in the opposite situation, where you are given a marginal distribution, and a conditional distribution that conditions on the variable you have the marginal for. So maybe somebody gives you a distribution on V, and also a conditional distribution on W given V.

Then the question is, can you use this to compute the joint distribution? And the answer is yes. We can do it because we know that the entry in the joint distribution-- the probability that big V has value little v and big W has value little w-- is the probability that V has value little v times the probability W has a little w, given that V has this value that we specified for it.

And this is always true. It's always true. Some people, if you've done too many coin flips and die rolls and so on, have gotten into the habit of thinking that the probability of V having value little v and W having value little w is just the product of two independent probabilities. But I said the word independent.

When is it true that the probability that A equals little a and B is equal to little b is equal to the probability that A is equal to little a times the probability that B is equal to little b. When is that true? When the variables are independent. In fact, this is the definition of what it means for random variables to be independent. It means that what happens with this guy, with A, can't influence anything, has no bearing on what's going to happen with B, or vice versa.

So this is a super-special case, only if independence. I put some exclamation marks there, because I'll have this conversation with lots of people over the next couple weeks. This particular thing is only true if they're independent, but the more general thing that's written up there on the slides is always true.

And you can do it in either direction. You can say it's the probability of V times the probability of W given V, or you can say it's the probability of W times the probability of V given W. Either way is OK. They come out the same. They better come out the same. If you do it both ways and they don't come out the same, then you made an arithmetic error.

STUDENT: [INAUDIBLE] probability that [INAUDIBLE] want to find opposites of probability of V given W, it is just one minus?

PROFESSOR KAELBLING: No, excellent question. Look at this case. This is the probability of T given a, and the probability of A given t. Those are not one minus each other. They can be arbitrarily, in a complicated way, related to one another.

Here's the one minus thing. Imagine that I have an event, E. It's something that could happen. And now, I want to know the probability that not E-- the probability that E doesn't happen. That's equal to 1 minus the probability of E. That's the only place where that 1 minus thing comes up. It can happen in conditional distributions too.

It's also true-- let me just write this out a little bit more generally. The probability of not E given F is equal to 1 minus the probability of E given F-- because, again, conditioned on F, a conditional distribution is a distribution, so it has to sum to one. So conditioned on F, the events have to add up in the right way, as usual. So conditioned on anything, that has to be true. But that's where the 1 minus comes from. Good question. Other questions?

In our Python code, we have a way of creating a joint distribution from a marginal and a conditional. Remember that back here, a couple slides ago, we defined this conditional distribution, the probability of test given AIDS.

What's the probability that the test comes out positive or negative, given that you have the disease? And what's the probability it comes out positive or negative, given to you don't have the disease? So that was the conditional distribution.

Now imagine that I write down this marginal distribution, which is just the prevalence of AIDS in this particular population. From those two things, I can go back and recreate the joint distribution. In our Python class, we have something called dist.jdist. It stands for joint distribution. It takes a marginal and a conditional distribution which conditions on that marginal, and it gives you back a joint distribution.

And notice, the joint distribution, the entries, the atomic events in this joint distribution are pairs of values. The first value is a value for the variable AIDS, and the second value is a value for the variable test. Does that make sense?
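Here's a sketch of how a joint-distribution constructor like that could work, built from the DDist and PTestGivenAIDS sketches above (the course's actual dist.jdist may differ in detail):

```python
def JDist(marginal, conditional):
    """Sketch: build a joint DDist over (v, w) pairs from P(V) and P(W | V)."""
    d = {}
    for v, pv in marginal.d.items():
        for w, pw_given_v in conditional(v).d.items():
            d[(v, w)] = pv * pw_given_v     # P(v, w) = P(v) * P(w | v)
    return DDist(d)

PAIDS = DDist({True: 0.0037, False: 0.9963})
joint = JDist(PAIDS, PTestGivenAIDS)
print(joint.prob((True, '+')))    # 0.0037 * 0.986, about 0.003648
```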

I'm going to skip this example, not least because it's actually wrong in the notes. I'm going to do an example again in the AIDS case, to illustrate Bayesian reasoning. We're just going to go free-form here on the board. It's also in a mini-lecture, and also I'll write it down again somewhere.

There's a pattern of reasoning which we've already done here implicitly. But we use it so often that it's called a theorem. We'll use it a lot, and it's worth getting it cemented in your brain. Bayes' Rule, or Bayes' Theorem, named after Reverend Thomas Bayes, goes like this. It says that the probability of B given A-- if we're just talking about events now-- is equal to the probability of A given B times the probability of B divided by the probability of A. That's Bayes' Rule.

For the first n years of my doing Bayesian reasoning stuff, I could never remember which way this last bit worked, and I can never remember anything like that. But all you, in fact, have to remember, is the definition of conditional probability. So I'm going to do a proof of this here on the board.

What's the definition of the probability of A given B? The probability of A and B over the probability of B-- let me write intersections-- the probability of A intersect B over the probability of B. So substitute that into the right-hand side. We have that times the probability of B over the probability of A, and the probability of B cancels. And we get the probability of A intersect B over the probability of A, which is the probability of B given A.
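Written out symbolically, the board derivation is just the definition of conditional probability applied twice:

```latex
P(B \mid A)
  \;=\; \frac{P(A \mid B)\, P(B)}{P(A)}
  \;=\; \frac{\dfrac{P(A \cap B)}{P(B)}\; P(B)}{P(A)}
  \;=\; \frac{P(A \cap B)}{P(A)}
```

and that last expression is exactly the definition of the probability of B given A.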

So that was a proof. It's not such a hard proof. You could always do it in an exam if you need it. My strategy is to remember just the definition of conditional probability; from that, you can figure this out.

So why do we care about this? There's a bazillion little theorems in the theory of probability that we could write down. Why is this one so important? What does it do? It lets us take a conditional probability statement that's written one way, and flip it around to get the conditional probability statement that's written the other way. That's an operation we often want to do.

And the setting where we want to do it is the following. So let's actually do the AIDS example. We'll do it with different numbers to illustrate a particular point. Imagine that we want to know, again, probability AIDS is true, given test is positive.

First of all, let's talk about these two things and what they represent. Generally speaking, when we're doing a Bayesian reasoning of this kind, there's some kind of hidden state in the world, some kind of a thing that we wished we knew, but we can't directly access it or observe it.

It might be what disease does this patient have? It might be where in the room is the robot located? So these are things that we might wish we knew, but we just don't have direct access. This is some sort of an observation, or a piece of evidence that we can get about the hidden state, which we hope will give us some new insight, give us more information about what's going on.

So this is a very common pattern of things that we might want to know. Something we might want to do with probabilistic reasoning. Why do we want to treat this as flipping the conditional? Let's write the Bayes' rule out and think about what the pieces stand for.

This is the probability that the test is positive given AIDS is true times the probability AIDS is true divided by the probability that the test is positive. Let's look at this-- the probability that the test is positive, given AIDS is true.

This is something that we think is a reasonably constant sort of number. I said that that's something that you would get written on the leaflet when you went to buy the test kit from a pharmaceutical company. It's not going to change depending on what population you apply the test to, it's something that's a reasonably eternal fact about how your observations are related to the hidden state. So this, you think, is going to stay true for a long time. It's a fundamental thing about the test that you have. We're going to call this your observation model.

What's this? The probability that AIDS is equal to true. This is what you used to believe before you got the evidence. This is going to change, depending on what population I'm doing this reasoning about, or what I already know about the patient who walked in the door. I'll start out with a different belief, a different probability about whether that person has AIDS.

So this is something that's not fixed and eternal. It's about the case that I have at hand. And the question I want to ask is, given this case that I have at hand, and this observation I'm making, now how likely is it that I think that patient has AIDS?

So this is what we'll call a prior probability, or just the prior. That's my prior belief that this event is true. We've got observation model, prior, and this is going to be some kind of normalizer, which means we can finesse it with a bit of algebra in a minute. What we call this thing here-- I didn't leave space for my label-- is a posterior. That doesn't mean what you think it means, it means later. Before, I had this belief, and after, I had that belief. That's the posterior.

Let's write down, also, the companion question that goes with this. I can also say, what's the probability that AIDS is false, given the test is positive? And that is equal to the probability that the test is positive given AIDS is false, times the probability AIDS is false, divided by the probability the test is positive.

Similarly, this is still the same observation. It's the same observation. This is also the hidden state, it's just that we're considering the two possible different values the hidden state could take on. This is another aspect of the observation model-- what's the probability that I get a positive test given the disease is false? This is the other part of my prior. This part of my prior had better be one minus that, we do know that. And the normalizer is the same. So that's a way to think about what's going on here.

What it, I hope, makes clear, is that if you put in different values for the prior, which is a thing you might really want to do frequently, because you might want to go and apply this reasoning in a bunch of different places, then you'll get out different values for the posterior. And Bayes' rule tells us how to use the observation model and our prior to get a new posterior distribution.

Well, it almost does. There's one piece of this that I sort of swept under the rug. What haven't I talked about enough yet? Is there a number up there you don't really know the value of? How about the normalizer? The probability the test is equal to positive? You know you could compute that if you had to, because you could make the joint distribution and add up the relevant row or column, but it's not obvious by looking at this. So we'll see how to make this work out.

STUDENT: How do you know the value of the prior since it's hidden?

PROFESSOR KAELBLING: Good. How do you know the value of the prior? Beautiful question. I have this little piece of philosophy that I sometimes give in this lecture, but now I only do it when somebody asks me a leading question, so now I have to do it.

If you learned probability in high school, probability was always about repeated events. You were told that the probability the coin comes up heads is 0.54, that's because, if we flip it a whole bunch of times, infinitely many times, then in the limit 0.54 of those would be heads, and the rest would be tails. So there's a view of probability, which is that it's about long-term frequencies.

We're going to use a different view of probability. The math is the same. We're just going to use it for a somewhat different purpose. And the purpose that we're going to use probability for is describing our own, or the robot's own, uncertainty about the world around it.

So here's this patient. This particular patient isn't going to have the disease or not have the disease 100 million times. We can't run that experiment a whole bunch of times and see what happens on this patient, that doesn't work. But I can say, look, here's this patient.

And for whatever reason, when that patient walks in the door-- maybe because I know something about the prevalence of AIDS in this place that I'm working, or because I know some other symptoms that the patient has, or something-- I will have my own belief about the likelihood that they have AIDS. That's just my belief. Just my belief. It's not a long-term frequency, but I'll have my belief, it's some belief.

And then, given this evidence, this is how I'm going to update my belief. I'm going to have a new belief about how likely it is that they have AIDS. So here, we're doing what's called subjective probability. And it's using probability to model what I believe about the state of the world around me.

Now what's kind of interesting and fun is that the math all works out the same, whether you take the long-term frequency view, or the subjective belief view. But what we're really using it for is to model what we believe is going on in the world around us.

There's some nice bit of math that also shows that, if your beliefs about uncertain events don't satisfy the axioms of probability, then I can set up a set of bets, which you would take from me, but which I can guarantee to win money on. So if you want to be an optimal gambler, your beliefs about lotteries should at least agree with the laws of probability. So there's that. That was a great question-- it let me give my little rant. Any more questions like that, or not?

I'm going to do this reasoning now. I'm going to do it in a stylized graphical form that we're going to use. I think it helps to think about this stuff. And we're going to do the AIDS problem, but we're going to do it with a different base rate.

We're going to do it in pictures. I have a prior distribution. Here's my variable AIDS, and it can be true or false. Let's now imagine we're in a population which has a high incidence. In fact, 10% of the population has the disease. Unfortunately, there are such populations.

So imagine you're in this kind of population, you're working there. This is your belief about a random patient who walks into your office. And now we say, all right, we give this person a test, and it comes out positive. So we're interested in knowing what's the probability of AIDS given a positive test. That's the thing that we're going to try to compute.

This is a whole distribution. So this is the probability of AIDS, all by itself. Now what we're going to do is compute some new numbers. We're going to, here, in my little picture, consider the probability that test is equal to positive given the AIDS variable. So the probability the test is positive, given that AIDS is true, this is going to be the same testing probabilities that we had before. So this is 0.986. And the probability that the test is positive given that AIDS is false was 0.023.

Here's our prior, here's the observation model. Now we can see, in the computation over there, what we do is we multiply the prior by the observation probability. So we're going to do that, and we're going to multiply these two things, and get 0.0986. We're going to multiply these two things, and get 0.0207.

So there are some numbers. What do they mean? What do they stand for? What you can think of them as is part of a joint distribution. They're actually the probability that the test is equal to positive and A equals some little a-- in this case true, and in that case false. That is not yet a distribution. These two numbers represent this product and this product.

What do we know about these guys? There's something important about these two numbers. They don't add up to one. But conditioned on the test being positive, either the patient does or does not have the disease, and so the two conditional probabilities we're after have to add up to one. And they're going to be in the same proportions as these two numbers, because there's the same divisor over there, the same normalizer.

That means to go from here to a probability distribution, we have to normalize, which means divide by the sum. And the sum of those is actually the probability that the test is positive. If we divide through by the sum, we get 0.826 and 0.174.

Unlike the example that we did the first time with some different numbers, in particular, a different base rate, now if we already have this prior belief, somebody comes in and has a positive test, now we think our posterior-- our belief at the end-- so this is now the probability of AIDS given the test is positive. The probability of AIDS has gone up significantly.

This is a pattern of Bayesian reasoning. We'll do a lot of it. It needn't have only two values. If there are 10 possible values that your random variable could take on, then each row here would have 10 boxes. The numbers in those 10 boxes would sum to one. You could still talk about how does that distribution change as you get information, and so on. We're using a model of how our observations are related to the underlying states of the world to update our beliefs about what's going on in the world around us.
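That whole picture-- multiply the prior by the observation probabilities, then normalize-- is only a few lines of code. Here's a sketch using the DDist and PTestGivenAIDS sketches from earlier, reproducing the numbers on the board:

```python
def bayes_update(prior, obs_model, obs):
    """Sketch of a Bayesian update: P(s | obs) is proportional to P(obs | s) * P(s)."""
    unnormalized = {s: prior.prob(s) * obs_model(s).prob(obs) for s in prior.d}
    z = sum(unnormalized.values())          # the normalizer: P(obs)
    return DDist({s: p / z for s, p in unnormalized.items()})

prior = DDist({True: 0.1, False: 0.9})      # high-incidence population
posterior = bayes_update(prior, PTestGivenAIDS, '+')
print(posterior.prob(True))     # ~0.826
print(posterior.prob(False))    # ~0.174
```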

I'm going to introduce an idea. So unfortunately, we don't have a lecture next week, because it's a holiday. So that's fortunate. I'm going to record a bunch of mini-lectures to cover next week. But I'm going to give you a quick sneak preview of those ideas in the next 10 minutes. That should be doable. It's OK, because actually we have most of what we need already.

Let me just ask you, though, if you have any questions about that Bayesian reasoning thing I did there. Because this is really important, and you should have this cemented in your head. Questions about that?

STUDENT: That normalizer of probability of test equals positive. Can we take it from that table?

PROFESSOR KAELBLING: No, it has to come from here, because that depends on the population. The only thing that I took from this example was these probabilities that characterize the effectiveness of the test. That's the thing that's constant across the two-- I'm a doctor and I work in two different places. What's constant is the efficacy of the test, but all the rest of it is not. Great question. Other questions?

This Bayesian reasoning idea-- so let me ask a question. Here's a question. What if we did another test? Imagine that we just took another test. This patient, we drew their blood again, and ran the test again. And we applied that. Would we get more information? Yeah. In this model, we would.

The view is that it's like drawing from a distribution-- every time we do that test again, we'll still have a certain chance of it being positive, and a chance of it being negative, that depends on the underlying state of the person. So we could do another test. And if it came out positive, or if it came out negative, we could apply this exact same reasoning, starting now with this as our prior.

So now we have this patient and now, starting today, our belief about whether they have AIDS is this. We could do another test, get another result, apply this same reasoning. If the test was positive, we would expect that this number will go up. If the test was negative, we'll expect that it will go back down. That's a pattern that we'll have to get used to. But we can gather evidence over time, and it will combine to give us, in the limit, more and more information about what's going on with this underlying hidden state.
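In code, gathering evidence over time is just repeated application of the same update, with each posterior becoming the next prior. A sketch, continuing with the functions above:

```python
# Two positive tests in a row: yesterday's posterior becomes today's prior.
belief = DDist({True: 0.1, False: 0.9})
for result in ['+', '+']:
    belief = bayes_update(belief, PTestGivenAIDS, result)
print(belief.prob(True))    # ~0.995: a second positive test pushes the belief way up

# A negative test would push the belief back down instead.
print(bayes_update(belief, PTestGivenAIDS, '-').prob(True))
```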

That would be true-- I could keep updating my belief in this way-- if I didn't think the underlying state of the patient was changing. But now, you can imagine a situation where I get five negative tests-- one per year-- and then I get a positive test. It might be that my reasoning should not just be the same as if I did five negative tests and a positive test all in the same day. Because you think that their disease state probably hasn't changed in the course of a day, but you might think it's changed over the course of five years.

So the next step that we're going to take in probabilistic reasoning, the step for next week, is thinking about how gathering information about some kind of a hidden state works when that hidden state can change over time. So that's tricky business. Because you're getting some evidence, and the thing is changing. You're getting more evidence, and the thing is changing.

My favorite analogy is like when you're lost in Boston and don't have a GPS. So you have some idea about where you are, and you look out the window, and you see a sign. That's some evidence. Maybe you can update your distribution over where you might be. But then, you drive forward two intersections, but you're not really sure whether that was north or east, because you weren't really sure where you were pointed.

So now you have a bigger uncertainty about where you are. And then you make another observation, you see another street sign, and then maybe your uncertainty shrinks. And then you go again for awhile, and you get more confused again. And then you see something, and your uncertainty goes down.

So there's this pattern that's similar to this. It's a Bayesian reasoning pattern also, but where, due to the passage of time, you may actually lose information. Of course, due to the passage of time, you might also gain information. What if I have a big funnel, and I drop a marble in it? At first, I'm not really sure where it is. But over time, I'm pretty sure where it's going to be.

Some dynamical systems, as they move through space, move through their state space, actually get more concentrated. They give you more information as time passes. And other ones get more confusing as time passes. We're going to think about the combination of how this hidden state changes as time passes, along with how the observations that we get give us information about the state.

And so the combination of those two things is something that people model with a structure called a Hidden Markov model, which is sort of a fancy name, but it's not that big a deal. The idea is that it's something that goes in discrete time steps. We're used to that. In fact, it's like a probabilistic finite state machine.

It has some hidden states, and it has observations. At each time, it's in some state. We're not sure what state it's in. And at each time, we get to make an observation which is related to the state, just like the test result was related to the disease state.

And what we're going to be interested in knowing is at some time, like at time five, if we've had this sequence of observations, what state do we think the system is in? That's going to be the kind of question. So if I've been driving through Boston, and I've seen this sequence of observations, where do I think I am?

So in order to describe one of these models, we have to say, where do we think we were when we started? So we have to give a probability distribution over the state at the very beginning. And we have to give this transition model-- it's got a lot of subscripts, but it's not so complicated. It's a conditional probability distribution. It says, if I was right in front of MIT, at 77 Mass Ave, and I went forward a block, what's the distribution on where I would be after that? So if the world was in this particular state, what's the distribution on where I would be on the next step?

For our finite state machines, this distribution was always deterministic. It said, if I was in this state before-- I'm ignoring the input for a minute-- then I'm sure to be in this state next. Our previous state machines were like that. This is a probabilistic state machine-- if I was in this state before, what's the distribution over the states I'll be in next?

Similarly, we have an observation model, which is really just like the observation model we said there. If the world is in this state, what's the distribution over what I might see? If the person has the disease, what's the distribution of the test results? If I'm standing in front of 77 Mass Ave, what's the distribution over images I might see, or sonar readings that my car might make, or something?

And so given this description of how the model works, what we're going to be interested in doing is figuring out, given some actual sequence of observations-- I saw these things in a row-- what state do I think my system is in now?

So that's going to be the problem that we try to solve, and we're going to use it in the robot for things like letting the robot try to infer where it is based on noisy sonar readings. And let the robot try to infer the map that it's living in-- where are the obstacles in its environment, given some noisy sequence of readings that it's gotten from its sensors.

I got three minutes, we'll do the leftovers problem, and then we'll basically have done all the pieces. So imagine-- I don't know if you've ever been in this situation, but you open the refrigerator and there's a Tupperware container in the back. And you're hungry. And so now, you have to think about whether it's edible. And there's some cases where it's so bad you don't even want to open the container. There's some that I've just taken straight from the fridge and put in the trash. But other times I'll open it up and eat it for dinner.

How can we model the reasoning that we do about the leftovers? Well I might say, all right, I have to define a state space. My state space, the possible states of my leftovers, I want to discretize it. I'm going to say they could be tasty, smelly, or furry. That could be the state of my leftovers. I'm pretty sure initially, at time zero, when I put them in there, they were tasty. So probability of tasty at the first time step is one. Smelly and furry, no. So that's what I think about my leftovers when I put them in the fridge.

Now I have to describe a transition model. The transition model is a conditional probability distribution-- so you remember that a conditional probability distribution is not a distribution, it's a function from values to distributions. So this says, if they were tasty at time t, then at time t plus 1, here is the distribution over what they might be like.

0.8, they're still tasty; 0.2, they have gone to smelly. But they can't go straight from tasty to furry. If it's blue cheese, it could actually go from smelly to tasty-- maybe those could actually exist at the same time. So if they were smelly before, there's some chance that they're tasty now. Maybe they stay smelly, maybe they get furry. And if they were furry before, that's it. If they were ever furry, they stay furry. Maybe there's a final state, like mummified. That's not here.

So now the question is, if I don't make any observations-- this process is happening in the hidden Tupperware-- I make no observations, how likely is it they're tasty on Step One, after one step in the fridge? 0.8. Good. Because I say, well, they were tasty before, what's the probability they're tasty next. I look it up in this row. I get 0.8.
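Here's the no-observations part of that reasoning as a sketch, using the DDist sketch from earlier: push the state distribution through the transition model one step at a time. The tasty row matches the lecture; the smelly-row numbers are invented, since they weren't read out:

```python
def leftovers_transition(s):
    """Transition model: a conditional distribution from state at time t to a DDist at t+1."""
    if s == 'tasty':
        return DDist({'tasty': 0.8, 'smelly': 0.2})
    elif s == 'smelly':
        return DDist({'tasty': 0.1, 'smelly': 0.7, 'furry': 0.2})  # made-up numbers
    else:                                   # furry is absorbing: stays furry forever
        return DDist({'furry': 1.0})

def step(belief, transition):
    """One time step with no observation: total probability over predecessor states."""
    d = {}
    for s, ps in belief.d.items():
        for s2, p in transition(s).d.items():
            d[s2] = d.get(s2, 0) + ps * p
    return DDist(d)

belief = DDist({'tasty': 1.0})      # at time zero, they're definitely tasty
for t in range(10):
    belief = step(belief, leftovers_transition)
print(belief.prob('tasty'))         # probability they're still tasty after 10 steps
```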

Another way to think about that-- oh, you're losing patience, aren't you. Never mind. I'll do this in a mini-lecture. It's very cool. You can see it in the slides, too. But we can basically do cascaded reasoning, and ask the question, what's the probability they're tasty after 10 steps? OK, good. Have fun, do the software lab. We'll see you in lecture in two weeks.