Algorithms for Big Data. This is May 11th. When we stopped last time, we were talking about the paging problem. If you remember, you have a cache which can only store a few items, and one by one you get requests for different items. At some point, you have to decide who to kick out. So the paging problem was how to decide who to kick out. We saw a bunch of heuristics: least recently used, least frequently used, first in first out. And I told you that if you know what the sequence is going to be, then the best you can do is farthest in future. Right? Which is, whenever you have to decide who to kick out from your cache, you look at the sequence, you look in the future, and you evict the element in your cache that you will need farthest in the future. So that's the optimal algorithm that knows the future. But when you talk about online algorithms, we don't know the whole sequence, right? We only know the requests we have seen so far. And for that, we have algorithms like least recently used or first in first out, right? So least recently used, as it says here, you evict the item that was used least recently. Now, what do we know about these two algorithms? We know least recently used and first in first out are k-competitive. So what does k-competitive mean in simple English? What does this result mean, that an algorithm is k-competitive? Like, if I say an algorithm for the paging problem is k-competitive, or is 2-competitive, what is that? The number of mistakes is at most k times that of the optimal algorithm. Right. And when you say number of mistakes, you mean number of cache misses, right? Yes. Right. So it's the number of times there is a request for an item that is not in cache. That's a cache miss. And that's what we are trying to minimize. So that's what the guarantee means here. And obviously, k is the size of the cache.
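To make the three eviction rules concrete, here is a minimal sketch that counts cache misses for least recently used, first in first out, and farthest in future on the same request sequence. The function names and the tiny request sequence are illustrative assumptions, not something from the lecture, and the farthest-in-future routine scans ahead naively rather than efficiently.

```python
from collections import OrderedDict, deque


def lru_misses(requests, k):
    """Count cache misses for least recently used with cache size k."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for item in requests:
        if item in cache:
            cache.move_to_end(item)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= k:
                cache.popitem(last=False)  # evict least recently used
            cache[item] = True
    return misses


def fifo_misses(requests, k):
    """Count cache misses for first in first out with cache size k."""
    cache = set()
    arrival_order = deque()
    misses = 0
    for item in requests:
        if item not in cache:
            misses += 1
            if len(cache) >= k:
                cache.discard(arrival_order.popleft())  # evict oldest arrival
            cache.add(item)
            arrival_order.append(item)
    return misses


def farthest_in_future_misses(requests, k):
    """Count misses for the offline rule that sees the whole sequence."""
    cache = set()
    misses = 0
    for t, item in enumerate(requests):
        if item in cache:
            continue
        misses += 1
        if len(cache) >= k:
            def next_use(x):
                # Position of the next request for x, or infinity if none.
                for s in range(t + 1, len(requests)):
                    if requests[s] == x:
                        return s
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(item)
    return misses


# Hypothetical example: three items cycling through a cache of size 2.
requests = [1, 2, 3, 1, 2, 3]
k = 2
print(lru_misses(requests, k), fifo_misses(requests, k),
      farthest_in_future_misses(requests, k))
```

On this cyclic sequence both online rules miss on every request while the offline rule does better, which is the kind of gap the competitive factor measures.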
And so a factor k in front of a guarantee is a terrible guarantee, right? I mean, if k is whatever, two gigabytes, then this is saying that the number of times we have a cache miss is at most, I don't know, 2,000 times the optimum. That's a pretty bad guarantee. But it turns out you can't really improve on that. But you can improve if you cheat a little bit. That is what I was calling resource augmentation. Think of it as, well, clearly, optimum is given too much power here because it is allowed to see the future. So our hands are tied, in the sense that we are trying to match optimum with the same amount of memory. And that's why we can only get lame results like k-competitive. However, given that optimum is allowed to look in the future, can we somehow make up for this by increasing our memory? So now what we will do is we will allow OPT to have a smaller cache than us. So we have a cache of size k. But optimum, we will only allow it a cache of size k prime, which is smaller than k. So in other words, we have extra resources. And the question is, now can we somehow come close to OPT? And the result that was proven is that least recently used and first in first out are basically k over k minus k prime competitive. In other words, let's say k prime is roughly half of k. Then what is k over k minus k prime? Well, if k prime is half of k, then k minus k prime is also half of k. And k over half of k is 2. So now you get that these algorithms are 2-competitive, if you allow the algorithm to have roughly double the cache of the optimum. So another way to state this result is that you can almost match OPT's ability to see into the future by making your cache twice the size of OPT's. Does this result make sense? Questions? All right. So I will not prove this result. And one other result I want to state is... Actually, let's... I mean, it's... You see, compare it to this one, where we had proved that LRU and first in first out are k-competitive.
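The arithmetic behind the factor of 2 can be written out as a short worked equation. This uses the lecture's simplified ratio; the symbols $k$ and $k'$ are the two cache sizes as above. As a side note that is not from the lecture, the bound as it is usually stated in the literature has $k - k' + 1$ in the denominator, which is where the word "basically" comes in.

```latex
\[
  \frac{k}{k - k'} \;\Bigg|_{\,k' = k/2}
  \;=\; \frac{k}{k - \tfrac{k}{2}}
  \;=\; \frac{k}{\tfrac{k}{2}}
  \;=\; 2 .
\]
```

With $k' = k$ the simplified ratio blows up, which matches the fact that without extra cache nothing better than a factor of $k$ is possible.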
Is the algorithm LRU randomized or is it not randomized? Look at the algorithm LRU. Does it use randomness? Does it use randomness? Or first in first out? Do people understand what the algorithm is? Wait. You guys can hear me, right? Yes, we can hear you. Okay, okay. Just making sure I don't like lose connection or something at my hotel. So... So look at LRU. Does it toss any coins or no? No. No, there's no randomness. No, right? Because you just... Whenever you have to evict someone, you look at whoever is in your cache. You look at the last time they were used. You know the last time they were used because you have all the sequence so far. And based on that you decide. So there's no randomness in least recently used or first in first out, right? First in first out is also pretty obvious, right? Whoever came first in, that's who you kick out. So now you can ask, well, this only gave us k-competitive. If we allow randomness, what can we do? Can we do better? It turns out actually you can do much better. So if randomness is allowed, then we have a randomized algorithm. That is roughly log-k-competitive. So you see, instead of a k-competitive, you get a much better log-k-competitive factor. So many times there are problems where, you know, if you use deterministic, if you don't allow randomization, you are doomed to a bad competitive factor. But if you allow randomization, then you can get what we call exponentially better, right? Because from a k, it drops to a log-k. So that's a big improvement in the competitive factor. And there's many such problems that with randomization, we can show that actually it's... it drops down the competitive ratio. Do these results make sense? The statement of the results, at least? What does G2 stand for? Like... I was just trying to say there were bad results. So I think B1 and B2 were bad results. Like bad news. And G1 and G2 was good news. So G is for good. Okay. Thank you. So bad news was that these algorithms are k-competitive. 
And that no algorithm can be better than k-competitive. If you look at this B1 and B2. But the good news is, if you allow us to have double OPT's cache, then we can be 2-competitive. Or if you allow us to have randomization, then we can be log-k-competitive. And both of these are much smaller than k-competitive. Sorry, I have a question. Sure, sure. So you said that, like, we could basically cheat using the fact that we have more cache than OPT, right? By setting, like, k-prime as small as we want. So what's preventing us from doing that? Well, I mean, it's... At some point, the result stops making sense also, right? So what this result says is that... Because OPT is allowed to look in the future, how much resources do you need in order to sort of be, you know, comparable to it? Now, obviously, if you allow yourself an infinite cache, then there is no problem to be solved, right? Then you'll never have a cache miss. Right? You can just keep everything in memory all the time. So what this result is saying is, how much do you really need to expand your cache in order to match up to a reasonable extent? I see. Okay. Thank you. All right. So that roughly brings to an end whatever I wanted to tell you about online algorithms. And there is this other topic about splay trees, which I will not have time to get into, because I want to tell you guys about something that is perhaps more useful in today's world. Professor, are there some people waiting to be admitted? Oops. Let's see. Yes. You were right. There was one person. So by the way, I cannot monitor chat. So don't say anything in the chat, because I only have one machine. If someone sees anything in the chat, just speak up. Thanks. Okay, let's see. The next topic I wanted to mention... So the two topics I want to cover today are dimension reduction and nearest neighbor search. And these will hopefully be the last topics that we will cover in this course.
And... The point is they are very useful in machine learning, in high dimensional data statistics. So let's first talk about dimension reduction. Right? I mean, the name should somehow make you... Who doesn't want to do dimension reduction? Right? If you have very high dimensional data, and I can somehow reduce the dimension and still keep most aspects of the data, that will make my algorithms faster. That's sort of the general idea behind dimension reduction. So first, before I go into dimension reduction, a quick reminder about maybe some high school geometry. In two dimensions, which I will write as R2, if I give you two points, P1 and P2... And so a point in R2 is, you know, I give you both its coordinates. The x-coordinate and the y-coordinate for P1, I'm calling them x1, y1. And the x and y-coordinates for P2, I'm calling them x2, y2. Then, people know the distance between P1 and P2 is given by this formula, the square root of x1 minus x2 squared plus y1 minus y2 squared? Yes? Everyone knows this. This is called the L2 distance. Okay? Because you're squaring, adding, and then taking the square root, it's called the L2 distance. You could do anything weird. You could pick any other P, say P greater than 2. You could subtract, raise the absolute differences to the Pth power, add them, and then take the 1 over P power. For example, you could cube, sum, and then take the cube root. That's called the Lp distance. For now, let's just look at the L2 distance. Hopefully, everyone is comfortable with this L2 distance. Okay? And now, let me just move on. So now, suppose the input is... So, imagine some high dimensional machine learning problem. So, you have some data set of n points in Rd. Okay. Maybe I should talk about Rd. So, these are n points. Let me call them P1, P2, up to Pn. So, what does a point P in Rd look like? How do I express a point P in Rd? By the way, Rd is D-dimensional Euclidean space. Just like... Would it be a tuple with D elements? Correct. So, it would be a tuple with D elements, yes.
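The distance formulas above can be sketched in a few lines. This is a minimal illustration, with the function name `lp_distance` and its `power` parameter chosen here for convenience; points are plain tuples with one entry per coordinate, and `power=2` gives the L2 distance.

```python
def lp_distance(p, q, power=2.0):
    """Lp distance between two points given as equal-length tuples.

    Takes coordinate-wise differences, raises their absolute values to
    `power`, sums them, and takes the 1/power root.
    """
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    if power < 1:
        raise ValueError("power must be at least 1")
    total = sum(abs(a - b) ** power for a, b in zip(p, q))
    return total ** (1.0 / power)


# Two points in R^2: the familiar 3-4-5 right triangle.
print(lp_distance((0.0, 0.0), (3.0, 4.0)))        # L2 distance
print(lp_distance((0.0, 0.0), (3.0, 4.0), 3.0))   # L3: cube, sum, cube root
```

The same function works unchanged for points in Rd, since it only loops over however many coordinates the tuples have.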
So, it will be like... This is the first coordinate, that's the second coordinate, and that's the Dth coordinate. That's a point in Rd. If I give you two points in Rd, P and Q, I specify their D coordinates. Then, what is the distance between P and Q? Just the previous formula that you saw. How people have... Did you guys ever compute this? The square root of the summation of P i minus A i? Squared. Oh, Q i, yeah. So, P i minus Q i squared sum over i from 1 to D. Right? Okay, good. Now, suppose... This is your Rd. And I give you this point set, right? This P1... And by the way, this is... So, when I use subscript, I'm using it for the different points. And when I do superscript, I'm using it for the different coordinates. Right? So, that's P1, P2, Pn. Is the difference between the subscript and the superscript clear? I'm sorry, can you repeat it? I have subscripts here. And superscripts here. So, I'm giving you N points in D dimensions, where every point has D coordinates. Right? That's all. That's all I've done. So, the different points are subscripted. And for one point, its D coordinates are superscripted. Right? That's what I want to say. So, now, imagine you have these N points in D dimensions. And this is some, you know, high dimensional data set. Now, the problem is, every time you work with this data set. So, for example, there is something called the curse of dimensionality. Which kind of says that all the algorithms... Okay. So, for this input data set, what are the two parameters? That describe this data set. Like, when we say we want a fast algorithm, it should be fast in terms of which two variables? N and D? N and D. Correct. However, most algorithms that we will have, right? They will maybe be polynomial in N, but they will be exponential in D. So, they'll be like, I don't know, N squared times 2 to the D. Which is pretty good if the dimension is small, right? In 2D, who cares about 2 to the 2, right? Or in 3D, who cares about 2 to the D? 
So, in small enough dimension, when D is small, this is fine. And that is why, until the advent of big data, computer scientists were okay with such algorithms, right? Because the data was typically in small dimensions, and N wasn't too big either. However, now, like, you know, D can be really, really large. Like, what's an example where D is large? Is it 50? 10 or 50? No, no, like, what's a real world example? When D would be large. Yes, 10 or 50 would already be too large for this. But you guys know about image data, right? So, every image, if you have a 16 by 16 pixel image, people would represent it as a vector of length 256. Right? The intensity of every pixel, I can put it in a row. If I have a 16 by 16 pixel image, which is like nothing, right? That's not even a high resolution image. I can represent it as a vector of length 256. So, just some image data can be represented as points in 256 dimensions. Does this make sense? 256 is 16 times 16, if I'm not mistaken. Right? I mean, instead of, all I'm saying is, instead of representing an image as a matrix, you just represent it as a long row. So, dimension easily goes into the hundreds, even if you're talking about bad quality image data. Does this make sense? I mean, a 16 by 16 image, has like these intensities, right? And I'm just converting it into a 256 dimensional vector. And so, I have a used data set where n is large, in, you know, very high dimensions. And now this algorithm is going to be, is not going to finish in my lifetime. So, what do we do? Do people understand the sort of the importance of this, the difficulty here? Do people understand why d can be in the hundreds very easily? Okay. Questions? Okay. So, in comes a technique called dimension reduction. And what does dimension reduction do? It basically reduces your dimension from d to something which is much smaller than d. And I mean, what will be the property of this? So, think of the dimension reduction as a map f. 
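The flattening described above, a 16 by 16 image written out row by row as one vector of length 256, can be sketched as follows. The pixel values are made-up placeholders and `flatten_image` is a name chosen for this illustration.

```python
def flatten_image(image):
    """Write a 2D grid of pixel intensities row by row as one long list."""
    return [pixel for row in image for pixel in row]


# Hypothetical 16 x 16 image with placeholder intensities in 0..255.
image = [[(r * 16 + c) % 256 for c in range(16)] for r in range(16)]
vector = flatten_image(image)

# The image is now a single point in 256-dimensional space.
print(len(vector))
```

A collection of n such images then becomes n points in 256 dimensions, which is exactly the kind of input the dimension reduction map takes.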
And what will be the property of this map f? That I can apply f to every point. Right? So, first of all, these points will live in a much smaller dimension. They will live in some dimension d prime, which is much smaller than d. And when I apply this function f to my original data set, p1, p2, pn, I will get n points in this lower dimension: f of p1, f of p2, and so on until f of pn. And what do I really want from this transformed point set? I want it to roughly preserve distances. So, the theorem, which I don't know if I have here. Yes. So, here is the theorem. And it's called the Johnson-Lindenstrauss lemma. But, I mean, it's become much more than a lemma by now. It's used everywhere. So, let's read this theorem. So, for any epsilon between 0 and 1 and any n greater than 1, let d prime be such that d prime is at least log n over epsilon squared. Okay. So, this d prime that I was telling you about, the low dimension, it will roughly be log n divided by epsilon squared, where you choose epsilon. But, you see, log n is a much smaller number, right? Log n is typically very small. So, if you choose this dimension, then for any set S of n points in d dimensions, there is a function f, which takes points in the higher d dimensional space and spits out points in the lower d prime dimensional space, such that if you look at any two points in your data set, x and y, then if you look at the distance between the images divided by the distance between the input points, or in other words, the distortion, it's between 1 minus epsilon and 1 plus epsilon. In other words, maybe this last line will make more sense. So, look at the last line. What is the thing in the middle of the last line? What does this mean? I mean, if you want, you can think of it as the L2 distance between fx and fy. So, does the very last line make sense? I'm sorry, can you repeat it again, please?
Does the very last line that I have written here make sense? So, stare at the very last line in red, this one. f is a function that maps a high dimensional data set to a low dimensional data set in such a way that for any two points x and y in your high dimensional data set, if you look at the distance between their images under f, it's sandwiched between 1 minus epsilon times the distance between the original points and 1 plus epsilon times the distance between the original points. So, what have I done? I have reduced the dimensionality of my data set, while still roughly preserving all distances between my input points in the data set. Sorry, Professor, I have a kind of unrelated question, if that's okay. Yes, yes. Does this lemma also apply to different norms? Because I know here we're using the two norm. Does it also apply to different... No. So, yeah. So, unfortunately, it turns out that for other norms, there is no analog of the Johnson-Lindenstrauss lemma. In fact, you can show that there is no such map. So, people have studied this question for norms that are not the L2 norm. And basically, this is kind of the only norm for which you can do dimension reduction. I see. Thank you. But is the statement of the theorem clear to people? Yes, that makes sense. Okay. So, to everyone, if the original dimension of my data set is D, what dimension does this function F map my data set into? D prime. What is D prime? And how is D prime defined in terms of the original point set? So, I have given you n points in D dimensions. How would you apply the Johnson-Lindenstrauss lemma? What is the lower dimension now? Log n over epsilon squared. Right. And does log n over epsilon squared depend on D? No. No. So, what has happened is, no matter how large a dimension your input points are in, you have mapped them into a dimension that does not depend on the original dimension D. It depends on n, the number of points. But on n, it depends very gently, in that you are taking a log of it.
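For reference, the statement as read out in the lecture can be written compactly. This is a transcription of the spoken version, so constant factors in the bound on $d'$ are suppressed, and the symbols $S$, $f$, $x$, $y$ are the ones used above.

```latex
\[
  0 < \varepsilon < 1, \qquad n > 1, \qquad
  d' \;\ge\; \frac{\log n}{\varepsilon^{2}}
\]
\[
  \text{For every set } S \subset \mathbb{R}^{d} \text{ of } n \text{ points there is a map }
  f : \mathbb{R}^{d} \to \mathbb{R}^{d'} \text{ such that for all } x, y \in S:
\]
\[
  (1 - \varepsilon)\,\lVert x - y \rVert_{2}
  \;\le\; \lVert f(x) - f(y) \rVert_{2}
  \;\le\; (1 + \varepsilon)\,\lVert x - y \rVert_{2} .
\]
```

Note that $d$ appears nowhere in the bound on $d'$: the target dimension depends only on the number of points and the chosen $\varepsilon$.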
When you take log of a number, it makes that number very small. And it depends on epsilon, but you get to choose the epsilon. And the way you choose the epsilon will decide this guarantee in the last line. So, you map the input points from very high D dimensions to something that doesn't depend on D, like log n over epsilon square dimensions. So, that now your transformed data set... So, you originally had n points, so you will still have n points. It's not that your number of points has reduced. Is that the dimension of every point has reduced? But the dimension hasn't just reduced like in a stupid... I mean, there is one way to reduce the dimension, right? Map every point to zero. Right? Then, great. You have reduced the dimension to nothing, but you have lost all the information about your original point set. With this map, you are still keeping almost all distances up to a 1 plus minus epsilon factor. That's what the last line is saying. That if you look at the, now the distances... So, the middle term is the distance between fx and fy, right? So, that's the... The new image of x and the new image of y in the smaller dimension. So, in the small dimension, when I calculate the distance between two images, they are pretty close to the original distance between those two points in the data set. Does the statement of the lemma become clear to people? Any questions about the statement of the lemma? Or what it means? Or like why it could be useful at all? Can I ask about dimensions, not lemma? Yes. Why can we get up to 100 dimensions? Say that again? Why can't we? You said we can get up to 100 dimensions. Why? I said we can or we cannot. We can. Yes. The question is, why we can? Because... When you take a photo from your smartphone, how many pixels is it? Like the... Currently, I don't even know. The cameras are what? How many megapixel are the cameras? I don't know. 8, 9? Does that number make sense? How many megapixel is your smartphone camera? 
I don't know what mine is. 50 megapixel. 50 megapixel. And I'm guessing a megapixel sounds like a thousand pixels. Yeah? I mean, maybe even if it's 100 pixels. So, I mean, your phone is full of images, each of which is an image of, I don't know, 50,000 pixels here and 50,000 pixels here. And each pixel has some intensity value. That's what an image is. It's a bunch of pixels. Each pixel has an intensity value. Is this making sense or no? Yes. And now the way to represent one image is just write down these things row by row in one long row. So that will be 50,000 times 50,000. Right? The size of this matrix is 50,000 squared. Right? So I can represent this matrix as a long vector of length 50,000 squared. So now, my phone gallery is a bunch of vectors, each living in this dimension. Does that answer your question or no? No. So we call this... Where these vectors live, we call this dimension? Repeat your question. I didn't hear you. I think I just misunderstood what dimension is. Thank you. Dimension is how many coordinates you need to represent your data. So if your data is a bunch of vectors, then the length of those vectors is the dimension of your data. And the number of the vectors is n. Okay. I got it. Thank you. Right? So think of n as the number of images and d as 50,000 times 50,000. So now does the statement of the Johnson-Lindenstrauss lemma make sense? You are reducing the dimension by a lot while still approximately preserving distances. And the smaller the epsilon you choose, the larger d prime will be, but the better your guarantee will be. So it depends on what approximation epsilon you're willing to live with. I mean, if you're okay with epsilon equal to one, meaning you're okay with distances being at most doubled. Or, you know, at least halved. But then you can get away with log n over 1, that is, log n dimensions. Does this make sense? If I'm okay with having my distances at most doubled, what value of epsilon am I okay with?
Can you repeat the question? I'm sorry. If I'm okay with my distances being doubled, then what does that mean in terms of epsilon? What value of epsilon am I okay with? Half? No, right? Look at the right-hand side. One? Yes. Why? Because one plus epsilon becomes one plus one. But one minus epsilon will be zero. One minus epsilon will be zero, yes. So all I will guarantee is there's no lower bound. But I will guarantee that my distance is never more than double. They could shrink arbitrarily. But they don't more than double. So yes, the shrinking arbitrarily maybe is a bad thing. So then maybe a good thing to use is epsilon equal to half, let's see. So what does epsilon equal to half mean? So epsilon equal to half would mean that my distances between the new points is at least half of the original distance and at most three halves of the original distance. right? Just like a 50% error. So if I'm okay with epsilon equal to half, then my dimension goes down from D to basically 4 log N. Which could be much smaller than D, right? Again, because log is an... log just brings down the number exponentially. So log of N is the number of digits in N. Right? So epsilon, it's always between 0 and 1. It's never gonna be... Sometime it can be 1 or 0. No, so I guess epsilon is always between 0 and 1. Strictly between 0 and 1. Epsilon equal to 1 you can get easily because then you don't care about distances, you can just map everything to 0. So epsilon equal to 0 and epsilon equal to 1 are trivial maps. Thank you. But does everyone see how you get this D prime? Right? So in the exam, if you are told how many dimensions you can afford, then you can be asked to recalculate what's the best distortion that you can get. Right? So I can say I have some input that's in D dimensions where D is, I don't know, a million and N is, I don't know, 5 billion. And I have an algorithm that only works in 200 dimensions. What is the best distortion possible for this dimension reduction? 
I'm sorry, can you repeat your question? I'm sorry. I will give you N, D, D prime, calculate epsilon from that. Calculate the best epsilon that you can get from values of N, D and D prime. But again, if you understand the theorem, then there is nothing really mysterious in the question. So again, just stare at this theorem and tell me if there's any questions about what this theorem is doing, what it is saying. Would you give us the formula on the test or no? Which formula? The one in purple. No, because the formula is... No. Because the formula, remembering the formula means that you're already doing the wrong thing. The formula, the last line is the... is the... is the guarantee that you should... that you need to know. But the guarantee almost follows from the purpose of this theorem. So the formula is meaningless, actually. Remembering the formula is meaningless. So I think you guys are getting lost in the formula. And not realizing what this theorem is doing, perhaps. It's a way to reduce the dimension of a bunch of points in... in very high dimensions. It's a way to reduce the dimension while still preserving distances roughly equally. And once you know you're preserving distances roughly equally, the last formula is. .. it just trivially follows from that. Professor, in the last line, in the middle, should it be d prime? Or am I getting it wrong? No, no, no, no. d is not the dimension here. Or maybe that's your... So when... when I say this d, I mean the distance. Distance in... in u... when we already put it through the function, yeah? In the middle, yes. On the sides is the original distance. Did that answer your question? Yes, thank you. So in the exam, do we choose a good value for epsilon? No, no, no, no. I think... I think you guys are too far away from thinking about the exam about this. First, I need you to understand the theorem. Forget about the exam. First, tell me what you guys understand about the theorem. 
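The calculation asked for here, going from n and d prime back to epsilon, is just the bound d prime at least log n over epsilon squared solved for epsilon. Below is a minimal sketch under two loudly stated assumptions: the logarithm is the natural log, and the hidden constant is taken to be 1 (the parameter `c`, which is not from the lecture). Real statements of the lemma carry a larger constant, so treat the numbers as illustrative only.

```python
import math


def target_dimension(n, eps, c=1.0):
    """Smallest integer d' with d' >= c * ln(n) / eps^2.

    c = 1 mirrors the lecture's simplified bound; it is an assumption.
    """
    return math.ceil(c * math.log(n) / eps ** 2)


def best_epsilon(n, d_prime, c=1.0):
    """Smallest eps allowed by d' >= c * ln(n) / eps^2, i.e. sqrt(c ln n / d')."""
    return math.sqrt(c * math.log(n) / d_prime)


# The lecture's example: d = one million, n = five billion, d' = 200.
# The original dimension d plays no role in either formula.
n = 5_000_000_000
print(best_epsilon(n, 200))       # roughly 0.33 under these assumptions
print(target_dimension(n, 0.5))   # epsilon = 1/2 gives about 4 * ln(n)
```

Under these assumptions the 200-dimensional budget buys a distortion of roughly one third, and halving epsilon would cost four times as many dimensions.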
So who can explain what this theorem is saying? In simple English to the rest of the class. Anyone want to give it a shot? The goal is to take high-dimensional data and make it smaller while still keeping the important distances. Correct, yes. That's the rough goal, yes. Now let's go into one more level of detail. What do you mean by that? So that was the rough goal, yes. Now let's go into one more level of detail. What is the data? What is the original data? And what is the transformed data? And what is the guarantee? So let's say, what is the original data that we have in this problem? Like, what is the input to the Johnson-Lindenstrauss lemma? To this function f? N. Is it just the number N? Is it just the number N? Yes... It's N points from the high dimension? Yes, exactly. Yes. So in other words, N vectors in D dimensions, right? And so you give it these N vectors in D dimensions. And what will it return to you? This map f? When you apply this map f, what will you get? You will get... N points in lower dimension... You will get what? N points in lower dimension D prime. Correct. So again, N points meaning N vectors, right? In lower dimension D prime. Where D prime will be what? Roughly? Log N over epsilon squared. Correct. So everyone sees that the dimension has been reduced because log N over epsilon squared does not depend on D, the original dimension. So the original dimension D could have been 50 million. But log N over epsilon squared does not depend on D. It does depend on N, but there's a log. So hopefully much smaller. And now, what do I know about these N points in D prime dimensions, in the smaller dimension? So I've transformed my data set into a much smaller dimension, right? And what is the guarantee now? What is the guarantee this map f provides? The distance between the two points is between 1 plus epsilon times the distance and 1 minus epsilon times the distance. So whose distance? The original distance.
The distance between x and y. Yes. So original distances, when mapped into the smaller dimensional space, don't get distorted by more than a 1 plus epsilon or 1 minus epsilon factor. And you get to choose the epsilon and that shows up in the lower dimension. Right? So everyone understands this now? Sorry, I have one more question. So the theorem says that there exists a function. Do we know what that function is? Very good. Yes, yes. So the next thing is, sure, this theorem could be all good that there is a function, but I need to give you an algorithm, right, for that function. Otherwise, who the hell cares? I mean, mathematicians would be happy with just showing existence sometimes. But for computer scientists, you need to know, right? You need to compute what the function is in order to apply it. Okay, thank you. But the function is actually pretty simple. So let me tell you what the function is. So no more questions about the theorem, hopefully. Okay. So here is the function. And the function basically says... Before I go into the function, have people heard of the word hyperplane? So in three dimensions, this is a hyperplane, the XY plane. In two dimensions, a hyperplane is a line. Is this making sense at all or no? It's one dimension lower than whatever you live in. So on the right, this is two dimensional plane, right? So if I take a line, that's a one dimensional thing. So in two dimensions, in R2, a line is a hyperplane. If I'm in three dimensions, what is one dimension lower than three dimensions? It's two dimension. So for example, in three dimensions, if I look at this XY plane, then that's a hyperplane. Is the definition of a hyperplane clear or no? Yes. Yes. Yes. Okay. So a hyperplane, the way I've defined it as one dimensional lower, but you can define hyperplanes of any dimension between one and three, for example. So the XY plane is what we also call a two hyperplane, because its dimension is two. And in three dimensions, what is a one hyperplane? 
Well, that would be a line. Right? Because that's a one dimensional thing. So is it clear what a one hyperplane is, what a two hyperplane is in three dimensions? Hopefully it is somewhat clear, right? You choose a dimension and now you just live within that dimension. So now what is this function F? So not only does F exist, but it can actually be found in randomized polynomial time, meaning there is a polynomial time algorithm that with a good probability will return you the F. And actually the runtime is also pretty... It's this. So that's the time taken to find the map of all the points. And F is nothing but the following. So you are in D dimensions, right? Let me first make a picture. So we are living in D dimensions. It's hard. I cannot draw D dimensions on this. So let me just take the example of three dimensions for now. Basically what you will do is you will choose a random hyperplane going through the origin. And you will take all your point set and just project it onto the hyperplane. That is the map F. So in other words, we have the data set in D dimensions. We choose a random D prime hyperplane. And we project. So we call this D prime hyperplane H. We project the n points onto H. And this is the map F. This projection from the D dimensions to this hyperplane H. This is the map F. So in two dimensions, what would I do? So if my D is two and someone says they want D prime to be one, then I will choose a random line going through the origin in two dimensions. Right? I will stand at the origin in two dimensions, look around me and shoot a ball in a random direction. That gives me a line. And now I will project all of my points basically onto this line. And it always has to go through the origin? Yes. Yes. The hyperplane always has to pass through the origin. Yes. So you can ask, how do you find a random line? People have heard of the Gaussian random variable, the normal zero one random variable.
In the probability class, hopefully everyone saw the normal or the Gaussian random variable. So what do you do? You take a vector, Ri, which is just a D dimensional vector. And each coordinate is a normal zero one random variable. Okay? And then you, you normalize it to be norm one. That's a random direction in D dimensions. So take a D dimensional vector, each of whose coordinates are normal zero one, and normalize it. Meaning, divided by its norm so that now the total norm is one. That's a direction in D dimension. And now what you do is you just pick any D prime of such guys. Q1, Q2, QD prime. And you look at the vector space that is spanned by these vectors. So when I say things like vector space spanned by vectors, does that make sense to people or these are... this is Greek. Do people understand what I mean by a vector space spanned by a bunch of vectors? Was this taught in a linear algebra course ever or no? Yes? No? I don't know what you guys went through. So tell me. Can you explain a little bit more? Okay. So... If I give you two... three vectors. These are three vectors in four dimensions. Right? What is the vector space spanned by these vectors? It's the set of all vectors that... So if I call these vectors V1, V2, V3, I can take all linear combinations of these vectors, and I'll get another vector. Right? It doesn't even have to be positive. Anything. You can take three vectors and you can look at all possible linear combinations of those three vectors. That will give you a whole set of vectors. This is called the vector space spanned by V1, V2, V3. Any vector that you can get by multiplying V1 by a constant, V2 by a constant, V3 by a constant, and then adding them. Okay? So for example, is the blue vector in the vector space spanned by the three red vectors? Yes. Yes. Because it is just the sum. Right? And you can take other... Is this blue vector in the vector space spanned by the three vectors? Okay? So we'll leave it at the next... No? 
It is, I think. If I choose lambda 1 to be 1, lambda 2 to be 1, and lambda 3 to be 2, I think I get this. At least that's how I tried to make it. Maybe I added or subtracted incorrectly. But that's the vector space spanned by these vectors. Right? So now, what is our map? Let me... Imagine this is your high dimensional space, Rd. Right? What you do is in this high dimensional space, imagine the unit sphere. Right? So this is the ball. This is the unit ball living in high dimensions. On the surface of this ball, you pick D' random points. Does this picture make sense? I'm living in D dimensions. I take the unit ball in D dimensions. Unit ball meaning its radius is 1. In D dimensions. And on the surface of this ball, I pick D' random points. Q1, Q2, Q3 up to QD'. Is it clear what these D' points are? They're just random points on the ball in D dimensions. And now, if I give you... Okay, so... In two dimensions, how many points define a line? Two. Two. Great. In three dimensions, how many points define a plane? Three. Yes. So now, I am in D dimensions. So, how many points will define a D' hyperplane? Basically, D' of them, together with the origin. So now, in this high dimensional thing, you take the plane that contains all of the Q's and the origin. That's the vector space spanned by these guys, Q1 up to QD'. So I choose D' random points in the high dimensions. And I just take the hyperplane that contains them. And what is my map F? It is the projection onto this hyperplane. So now, if I had a point P1 here, or a point P2 here, what is my map F? I just project it. Projection means, find the closest point on the plane to the point outside the plane. So if P1 is outside this hyperplane, I find the closest point on the hyperplane to P1. That's F of P1. If some point was already on the hyperplane, then its projection is the point itself. But if some point is outside the hyperplane, then I find the closest point in the hyperplane to that point. And that's F of P2. Does this make sense now?
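The construction just described (Gaussian coordinates, normalize, take the span of D' such vectors, project) can be sketched in a few lines of Python. This is only a rough illustration under some assumptions: the function names `random_unit_vector`, `random_subspace_basis` and `project` are made up here, Gram-Schmidt is used as one convenient way to get a basis of the span, and the projected point is returned in the original D coordinates rather than in D' coordinates.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def random_unit_vector(d, rng):
    """Random direction in d dimensions: N(0,1) coordinates, scaled to norm 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def random_subspace_basis(d, d_prime, rng):
    """Orthonormal basis for the span of d_prime random unit vectors.

    Gram-Schmidt is an assumption of this sketch; it does not change the
    span, it only makes the projection formula below a simple sum.
    """
    basis = []
    while len(basis) < d_prime:
        q = random_unit_vector(d, rng)
        for b in basis:
            c = dot(q, b)
            q = [x - c * y for x, y in zip(q, b)]
        norm = math.sqrt(dot(q, q))
        if norm > 1e-12:  # redraw in the (probability zero) degenerate case
            basis.append([x / norm for x in q])
    return basis

def project(p, basis):
    """Closest point to p inside the subspace spanned by the basis."""
    out = [0.0] * len(p)
    for b in basis:
        c = dot(p, b)
        out = [o + c * y for o, y in zip(out, b)]
    return out

rng = random.Random(0)
H = random_subspace_basis(3, 2, rng)   # a random plane through the origin in 3D
print(project([1.0, 2.0, 3.0], H))     # the point squished onto that plane
```

The D' numbers `dot(p, b)` for `b` in the basis are the coordinates of the projected point inside the hyperplane, which is the lower dimensional representation you would actually store.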
Is it clear what this map F is in the Johnson-Lindenstrauss lemma? It's projection onto a random hyperplane. The dimension of the hyperplane, we know what it's going to be. It's going to be D prime. And what is a random D prime dimensional hyperplane? It's nothing but you take the unit ball and you pick D prime many random vectors on the unit ball. And you take the vector space spanned by them. In other words, the hyperplane containing those points. And basically the Johnson-Lindenstrauss lemma says that if you pick a random such hyperplane, then with a good probability the guarantee of that theorem will hold. And if it does not hold, you repeat it. And as you have been seeing in this course, there will be a probability of failure and therefore a probability of success. So if you repeat it enough times, you will succeed. So maybe the first random hyperplane you take, you will not have that distance preserving guarantee. Maybe some distances will be changed by more than a one plus epsilon or one minus epsilon factor. But if you repeat this, basically you don't need to repeat this more than like order n times. But you can do it much faster. But is the map clear now? What this function f is? Roughly, yes. Questions? Okay, maybe in 3D it's better to show. If I'm in three dimensions, I take the unit ball, this is actually a ball, like a soccer ball. I'll pick two points. Let's say q1 and q2. And then I know my hyperplane has to pass through the origin. So now I have three points in the picture. 0, q1, q2. These are in 3D. But these three points define a triangle, right? And there's a unique hyperplane that contains this triangle. The unique two dimensional thing that contains this thing. That is what I will project my 3D points onto. In order to lower my dimension from three to two. Is someone still in the waiting room? Whoops. Alright, so questions about this projection business? Wait, so with the ball, you want to find a hyperplane where 0, q1 and q2 make a triangle, right?
That's pretty much it? In 3D, yes. If you want to project points in 3D to a random 2D thing, that's how you do it. You take the ball in 3D, you choose two random points on the surface of the ball, q1 and q2, and you take the two dimensional plane that passes through q1, q2 and 0. That's a random two dimensional plane. And how does that help in finding an f? Well, f is nothing but projection onto the plane. You have the points living in the full dimension, right? So you have points living in 3D. How are we reducing the dimension of the point set? By projecting them onto a lower dimensional thing. Sort of flattening them. You're squishing the points, every point is getting squished onto the hyperplane. So my original point set may have high dimensions, but I'll choose a much lower dimensional hyperplane and squish all the points onto this hyperplane. That's my map f to reduce the dimension. And we do it multiple times? Until you succeed. If you get lucky, as soon as you succeed, you stop. And you know when you succeed, because you can compute the distances, right? In the original setting and after you've squished them. So you know when you succeed. And if you don't, you just choose a different random hyperplane. And the chance that you succeed on any one try is at least one over a polynomial. So in polynomial time, you will find an f that will work. Okay, if this makes sense to you, what is the point set that is the worst for a given hyperplane? Let's say I told you guys that I was going to project my point set onto this hyperplane. This one. What is the original point set that will be the worst for this hyperplane? What is a bad point set for a hyperplane? Bad meaning for which the guarantee will definitely not be true. Would it be they all get mapped to the same point or something like that? Right.
So if you took a direction perpendicular to the hyperplane, and all your points were somehow here, along that direction, then the original distances between these points are not zero, right? They're different points. But they will all get mapped to this one point. And then there is no distance, right? Does it make sense? But that can't happen. Or that is very unlikely because we choose it randomly, right? Exactly. Yeah, yeah. So first of all, we don't get data sets like this anyway. But even if we did, if you choose a random direction, then, you know, this is very unlikely. Right. So that was the Johnson-Lindenstrauss lemma. And I wanted to show you an application, a quick application of it. So it's not so much an application, but let me introduce the problem. And then I think in next class, we'll be done with this in like 10 minutes or so. So this problem is nearest neighbor search. So far you have seen membership, right? We started this course with the membership problem. I give you a data set, you preprocess it and store it. So that when I give you a query, you tell me if the query is in the data set or not. Right. That was the membership problem. Is Q in S. But one of the more useful versions is, is the query similar to some key in the data set. Right. Maybe the exact query is not in the data set, but maybe it is pretty close to someone in the data set. So in other words, the input here will be n points X1, X2, X3, up to Xn in D dimensions. Store your data set so that when I give you a query, you return to me the closest point in your data set to the query. So if this is my query Q and the closest point is this X5, then I want your algorithm to return X5. This is nearest neighbor search. Given a data set, which is a point set in high dimensions, preprocess it so that when a query comes, and a query is also a point in high dimensions, you can quickly return the nearest neighbor of this query. Does this make sense?
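To pin down what the problem asks for, here is a minimal Python sketch of the obvious baseline: no preprocessing at all, just scan every point and keep the closest one. This is an assumption-laden illustration, not the data structure the lecture is building toward; the name `nearest_neighbor` is made up here, and closeness is taken to be ordinary Euclidean distance. Each query costs about n times D operations, which is exactly what a good preprocessing step is supposed to beat.

```python
import math

def nearest_neighbor(points, q):
    """Return (index, point) of the data point closest to q.

    Brute-force scan using Euclidean distance (math.dist, Python 3.8+).
    """
    best = min(range(len(points)), key=lambda i: math.dist(points[i], q))
    return best, points[best]

data = [(0.0, 0.0), (5.0, 5.0), (2.0, 1.0)]
print(nearest_neighbor(data, (2.0, 2.0)))  # (2, (2.0, 1.0))
```

Since the cost of each distance computation grows with D, projecting the data set and the query down to D' dimensions first makes every comparison cheaper, at the price of distances being only approximately preserved.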
Why would such a thing be useful in machine learning, for example? Can someone see why nearest neighbor search would be useful in today's world? So sometimes when you guys try to log in, I think it's called a CAPTCHA, they show you a bunch of images and they say, Oh, click all the ones that contain a motorcycle or that contain a bridge. People have seen that? Yes. Yes. Why do you think they're making you do that? Is it for pattern recognition? Yes. So what they're actually doing is they're training a model. They're using you to train a model. So remember, every image in your phone, I said, is a very long vector, right? So you have some data set in high dimensions. And right now the computer doesn't know what the images contain. So it asks you users to label the images. So for now, let's say some of the images were cat images and some of the images were dog images. So you start labeling some images as cat images. And some images you label as dog images. This is what it's making you do when it's asking you to do stuff. It's using you as a label maker. And now what will the machine learning model do? It will preprocess this data set into a data structure for nearest neighbor search, which I haven't told you about yet. So that now when a new image appears, the computer will quickly find who is the nearest neighbor to this new image. And if the nearest neighbor is a dog labeled image, then the computer will guess that this new image contains a dog. And if the new image was very close to a cat image, then the computer will guess that this new image contains a cat. This is called a nearest neighbor classifier. And it's widely used in machine learning. Does it make sense why the nearest neighbor problem would be useful now? Now, each point is an image, because it's a vector in very high dimensions. And, you know, given a query, it may not be exactly one of the images, right?
I mean, if I take the same cat, and I don't know, take a picture of it from another angle or something, it's not the same image. But in this mapping, it will be pretty close to the original image of the cat. And so if you have a nearest neighbor algorithm, you immediately find out that the nearest neighbor in the data set to this image is a cat image. And so you can guess that this image contains a cat. So you're classifying unknown images based on known images. That's what training the model is, right? You give it data to train on. So that now on unseen data, it can make some educated guess. Does the problem make sense? The nearest neighbor search problem? Yes? No? Well, we're out of time. So maybe you can come next class with any questions about the nearest neighbor search. Stop the recording. And we already have scribes for this lecture, right? The scribes for this lecture, are they here? We had decided in the last class? Yes. Yeah, Professor, I believe you chose me and Alicia. Yeah, yes. Okay. So I'll send you guys the recording after the lecture. Just send me an email, so that I have your email. And then we meet on Wednesday, same time. If there are any questions, I'm here for a couple of minutes. Otherwise, class is over. Professor, sorry. I was a little bit late. Just to make sure, for the exam, for the final, it's everything after the midterm and also the probability that was before the midterm? Exactly. Yes. Okay. Thank you. Did you say everything after the midterm? Sorry? Everything after the midterm. Yes. After the midterm. Okay. But I mean, I say everything after the midterm, but I have to put the disclaimer that probability was taught before the midterm. So that's obviously included, right? You can't expect... I mean, don't complain if probability shows up in the final, because everything after the midterm is randomized, so probability will show up. Will you upload an example of an old final? Yes. As soon as I'm back, which should be tomorrow night.
So if by tomorrow, 9pm, you guys don't see it, then just send me an email. My flight lands at 5 and then I can go to my desktop. Is there going to be a review session or no? I don't see much of the point of a review session, but I think next class I'll be done after the first 15 minutes. And then we can spend the remainder of next class as a review. But a review would basically be me listing the topics that we have covered so far and giving you a quick snapshot of that.