You guys see the PDF? Yes. Yes. So the first question is about Count-Min Sketch. The second question is membership, Bloom filter stuff. The third question is approximate median, so this is from the streaming algorithms part. The fourth is again about Bloom filters. Oh, I had two questions about Bloom filters. Oh, yeah, this one is about Bloom filter probabilities. Question five is about online algorithms. And question six is frequency moments, so streaming. So you said the exam is going to be more than 100 points, that you'll add another question to give us a chance to improve or choose. So look here. This exam was 120 points, right? So you can, for example, skip question two, and you can still score 102. Because question two... Yes? Membership and Bloom filters were both before the midterm for us. Yes. Yes. I believe the Bloom filter was before the midterm for you guys. So I think the Bloom filter will then not be part of the final. Same thing for the membership problem, right? Number two. No, the membership problem is too basic for me to say it's not covered. I mean, I won't ask you about hashing with chaining or stuff like that, right? The algorithms. But what the membership problem is, hopefully you will never forget. So something like, what is the membership problem? It's as if I were teaching an algorithms course and I said the final covers only the material after the midterm, but then somewhere the word "graph" came up. So, yeah, the algorithms for the membership problem, no, and Bloom filters, no. Yeah, go ahead. Do you have any examples of questions for topics we covered after the midterm? Because in this final, I see only questions that we will not see. No, number three, number five, number six. These are all after the midterm, right? So I don't know if you saw this final, but three, five, six are after the midterm. Okay.
It is the only final that we will see, like, as an example? Yes, yes. I tried to find others, I couldn't find any. Because I last taught this, I think 2000... No, so the one I have, I think I taught it at the Graduate Center. But that was a PhD-level course, so those questions are harder, and I didn't want to give you guys those questions. Could you give us some examples for topics that we do not see in this example? I cannot give you a question for every topic that we have seen, but at least for some. So which topic would you like to see a question for? Multiplicative weight updates. Multiplicative weight updates. So I have not made up a question yet for that. Making questions requires effort. For multiplicative weight updates, I would say, understand the algorithm. I won't ask you details about the analysis. Remind me, I proved the multiplicative weight updates method, right? Yes. I proved it. But again, the first thing would be just understanding what the problem is and what the algorithm is. As long as you understand that, you'll figure the question out. Just keep to the basics. So I think this is the wrong way of studying. How you guys are preparing is, in my mind, the wrong way, which is exam-question focused rather than basics focused. If the basics are solid, then the exam is more of an application. So if you understand what the multiplicative weight updates setting is and what the algorithm is, that's more than what you need to know. You don't need to understand the whole proof. If you understand what the problem is and what the algorithm is, that's enough. And I cannot give you a question because I don't have a question. You're not going to ask us to prove something in the final? Ask you to prove something? So, I mean, here, for example, you have to give an algorithm.
And you have to show that the probability of something being something is this, and the probability of something being too large is this. So that is asking you to prove a guarantee of your algorithm. In some sense, it is proving something about the algorithm. I think it's an unfortunate consequence of the way we teach math here that proof is somehow a very scary thing. Whatever I ask you to prove... so for example, this is an application of Chernoff, and you were given Chernoff here. So you really just have to put things in the right place and then apply it. If I ask you to prove something, I'll also give you hints throughout the proof. I won't just say, here is a statement, go ahead and prove it. I'll give you enough hints to continue the proof. The most important thing is whether you understand the question or not. So do not look for keywords and then write whatever you would like to write. If you understand the question, you are guaranteed to get at least a quarter of the points in the question, because at least your attempt will be meaningful somehow, and I give partial credit. But if you don't understand the question and you write something orthogonal, then I cannot even give you partial credit. Right? So just make sure you at least understand the question. For example, in this question for 30 points, the first part is: give an algorithm for this problem. And then the other parts are: prove the guarantees of your algorithm. So if your algorithm is correct, or it makes some sense, then you can get at least 10 out of the 30 points. But if your algorithm is not even in the streaming model, if you're doing some weird stuff, then I cannot even give you partial credit for that. Right? So yes, for the topics that you don't see here, the best way to study is just to try to understand what we've covered in class.
Because again, these are topics that are not really in a textbook. So there's nothing quite like this. I think I made up this question. These questions are all made up. They're not from a textbook or something. Okay, so just understand what we've covered in class and you'll be fine. Anything else? Okay. Then let us quickly talk about nearest neighbor search. All right. Let us record to the cloud. All right, welcome everyone. This is the last class for Algorithms for Big Data, May 13th. And today we will cover nearest neighbor search, which is a very important problem in databases and now in machine learning, especially in high dimensions. So what is nearest neighbor search? Recall the membership problem, the original membership problem: we were given N keys and a query, and we were asked, is the query Q in the set of the N keys? Nearest neighbor search is not about exact membership. It's about similarity. So here it's asking, is Q similar to some key in your set S? Think of the data set as points in some high-dimensional space. And when I say points in high-dimensional space, really, you should think of them as vectors, right? With some number of coordinates that is large. And now a query will come that is also a point in high-dimensional space, a vector. And you have to return the nearest neighbor of this query in your database. In other words, you will look at the distance. One thing to mention about notation in this lecture: when I write D followed by two points in parentheses, this is the distance between the two points. So this is the distance between the query and Xi, which is the ith element in your database. That's the distance from the query to the ith element, and I'm taking the minimum over all elements in the database. So when I have D with parentheses, this is the distance.
But I will also use D for the dimension; it will be clear which D is which. If I have parentheses, then it's the distance. Is that clear, this thing about the notation? So right here, D is not the dimension, because I have D and then two points in the parentheses, so this is the distance between the two points. The input points will be in some D-dimensional Euclidean space. The query is also a point. So for example, if this was your data set and this is a query, then you need to return X5. You need to say that X5 is the closest point to the query. Is the problem statement clear? Given a data set in high dimensions, preprocess it and make a data structure so that, given a query, you can quickly answer with the nearest neighbor of that query. So Xi and Q have to be in the same R^D? Yes, yes. Each of these points lives in the same space. D could be one, D could be two, but typically we think of D as high. Okay. So if you understand the question, then, without any fancy methods, what worst-case query time can you guarantee? What's the stupid way to solve it? Go ahead. O of N. N, but, so you're saying O of N by doing what? Just check every single one of the points given and see which one is closest. Good. But how long does this distance computation take? There's another parameter. Gotcha. Yeah. So if I give you two points in D dimensions, you have to subtract their coordinates, square, add, and take the square root, right? Yes. So that's order D time, because there are D coordinates that you have to subtract and square and then add up. So what does the running time become? N times D. Exponential in D? No, no, no. It'll just be N times D, right? Because for each Xi, you will compute the distance from the query to Xi. And all I'm saying is this distance computation takes D time. Because Q has D coordinates. Xi has D coordinates.
And so when you subtract and so on, you spend order D time to compute one distance. And there are N distances to compute. And then you take the smallest one out of them. So the stupid way to solve this problem takes N D query time. And the problem there is, well, N is pretty large, right? D could be large, but N is the number of points, and that's typically much larger than the dimension in a database. So is it clear that the trivial solution gives order N D query time? Everyone's clear what the N D solution is: brute force, linear search. The D is there because you have to compute the distance between two points, which takes D time. And so now you can ask, okay, can I improve the dependence on N? Can I maybe get something like square root of N times D? Or something like log N times D? And now, one question I have is: this trivial solution, will it give me the exact answer? Will it give me the exact nearest neighbor, or will it make any error? Exact. It will be exact. Correct. So now, the unfortunate news, proven I would say in 2011 or 2012, was this paper by Ryan Williams and Josh Alman. Ryan Williams, by the way... so there's this area of computer science called complexity theory, and he's one of the leaders in complexity theory. So he and, I think, a student at that point proved the following. If you want any exact algorithm for nearest neighbor search, and let's say you had an exact algorithm that ran in time even slightly better than N, so the exponent of N, which is currently one, even if you could reduce it to something like 0.99, then you would violate a hypothesis in complexity theory. Meaning this would be a big result in complexity theory. It's called the strong exponential time hypothesis. It's about SAT formulae.
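For reference, the brute-force linear scan described above can be sketched in a few lines of Python. This is only an illustration: the function name `brute_force_nearest` and the sample points are made up here, not taken from the lecture. Each distance costs O(d) and there are n of them, which is where the O(n d) query time comes from.

```python
import math


def brute_force_nearest(points, query):
    """Return (index, distance) of the point closest to `query`.

    Scans all n points and spends O(d) on each distance, so O(n * d) overall.
    """
    best_index, best_distance = -1, math.inf
    for i, x in enumerate(points):
        # Subtract coordinates, square, add up, then take the square root.
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, query)))
        if distance < best_distance:
            best_index, best_distance = i, distance
    return best_index, best_distance


# Hypothetical toy data set in d = 2 dimensions.
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
index, distance = brute_force_nearest(points, (0.9, 1.2))
print(index, round(distance, 4))
```

The answer is exact because every point is examined; the cost of that exactness is the full scan.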
So think of this result as saying that if you want an exact algorithm, then really you can't even improve on the stupid one that we just saw. This is pretty much the best you can hope for if you want an exact answer. Does this result make sense? The exponent of N, which is one here, cannot be reduced even a tiny bit, or is very unlikely to be reducible even a tiny bit. If you can do it, you have solved a major problem in complexity theory. So given that bad news, what can we hope for now? So this is not possible now. What should we do with this bad news? Should we give up? Should we give up on a faster query time and just say the trivial solution is the best? Yeah, given the topic of this class, I think giving up sounds correct, professor. Okay. So you could give up, right? That's always an option. But that's not the interesting option. David? Try something randomized? This is good. But this result holds even for randomized algorithms. Even a randomized exact algorithm cannot beat this running time. So how would you bypass this result? Reduce dimensions. No, I mean, even if you reduce the dimension, you would maybe improve this factor of D. But as I said, usually N is much larger than the dimension. So what is the keyword here that you don't necessarily need in an application? Exact. Right. In many applications, you don't really need the exact nearest neighbor. And so whenever people talk about nearest neighbor search, they rarely talk about exact nearest neighbor search, because that is hopeless. What we do is approximate nearest neighbor search. So what is approximate nearest neighbor search? You'll have an approximation factor C greater than one. And now, instead of returning the nearest point to the query, you have to return a point xj whose distance to the query is at most C times the distance between the query and its nearest point.
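Written out in symbols, the requirement just stated looks as follows. The notation d(·,·), q, x_i and c is the lecture's; the symbol x^* for the true nearest neighbor is introduced here only for convenience.

```latex
\[
x^{*} = \arg\min_{1 \le i \le n} d(q, x_i),
\qquad
\text{return any } x_j \text{ with } d(q, x_j) \le c \cdot d(q, x^{*}),
\quad c > 1 .
\]
```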
Okay, so this was the distance from the query to its nearest neighbor. And we're saying, okay, fine, you don't have to give me the nearest neighbor, but give me someone who is not too far away, where C is this approximation factor. Does this statement make sense? So it's saying that it might not be the nearest neighbor, but it'll be within some bound of nearness, of closeness? Exactly. So if my nearest neighbor is a distance R away, then the answer will be at most C times R away. Right? Say C is 1.5. If my nearest neighbor is at distance R, then, okay, fine, don't give me the nearest one, who's within R, but give me someone who's at most 1.5 R away. And for many applications, this is good enough. Right? Finding someone nearby is good. And it turns out that this became a whole field. Relaxing your requirement from the exact nearest point to someone who is reasonably close, this became a whole field. There are professors who only publish in this, and most of their life's work is on approximate nearest neighbor search. And one of the main techniques... So, by the way, right now, the way I have described the problem, for this distance you guys are thinking: subtract the coordinates, square, add, and take the square root. But you can take any Lp norm, right? You could subtract the coordinates, take absolute values, cube them, add the cubes, and then take the cube root. Right? That's the Lp norm when p is 3. And there are various other measures of distance that you can take. Do people know what the Hamming distance is? Hopefully we know what the Hamming distance is between two bit vectors. If I give you two bit vectors, A and B, what is their Hamming distance? How many bits are different when you line them up? Exactly.
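The two distance measures just mentioned can be sketched as below. The function names `hamming_distance` and `lp_distance` are illustrative choices, not something from the lecture; the Lp version takes absolute values of the coordinate differences before raising them to the power p.

```python
def hamming_distance(a, b):
    """Number of positions where two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def lp_distance(x, y, p):
    """l_p distance: (sum_i |x_i - y_i|^p)^(1/p), for p >= 1."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


bits_a = [1, 0, 1, 1, 0]
bits_b = [1, 0, 0, 1, 1]
print(hamming_distance(bits_a, bits_b))   # the vectors differ in two positions
print(lp_distance((0, 0), (3, 4), 2))     # Euclidean distance (p = 2)
print(lp_distance((0, 0), (3, 4), 3))     # the p = 3 case from the discussion
```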
So you can ask the same question about bit vectors: I give you a database of bit vectors, store them so that when a query comes, you can find me the bit vector that is closest to it in Hamming distance. Right? It's kind of like Google search when you put the stars in sometimes, right? You just find the closest one. So this is a generalization of search, and it's one of the most meaningful generalizations. Your distances can come from various applications. So what can we do with approximate nearest neighbor search? Approximate nearest neighbor search is a problem that I've also worked on quite a bit. How can you beat this previous answer, this previous bad result, the bad news that these people had proven? What happens if we allow approximation? It turns out that there is a very famous technique that I'll just briefly describe, called LSH, which is locality-sensitive hashing. So, you know hashing, and this is locality-sensitive hashing. What you can do is create a data structure where the preprocessing time, meaning the time taken to create the data structure, is n to the one plus rho, times D. I'll tell you in a moment what rho is. And the query time becomes n to the rho, times D. For example, if you're in Hamming distance, then your rho is one over C, where C was your approximation factor. So, for example, if C is equal to two, what does my query time become? For Hamming distance, if C is equal to two, I'm okay with returning someone who's twice as far away, but not more. It becomes square root of n, times D. And you beat the n times D from before, right? Recall the previous result said that if you want to lower n even to n to the 0.99 with an exact algorithm, it was hopeless. But if you allow me approximation, then I can get you a much faster query time. So, rho is one over the approximation factor if you're talking about the Hamming distance.
And it's one over the approximation factor squared for the Euclidean distance. So, this is even... which is better, this one or this one? If C is equal to two, what do you get for Euclidean query time? The fourth root of n. Yes, which is better than the square root of n, right? So, you get fourth root of n times D. So, that's much faster than n times D. Is the main result about locality-sensitive hashing clear? Any questions about this page? Okay. So, now, you have seen hashing before. That's how we started off this course. What could locality-sensitive hashing possibly mean? Look at this picture, right? And I want to reduce things. What does it mean? So, I could arbitrarily hash these n points, right? To some table. So, if you think of the membership problem, the dictionary problem, what were we doing? We were taking these keys, we were hashing them somewhere. And then our query would come along, we would hash it, and we would find if there was someone in the bucket, right? Where the query hashed to. This was all that was happening in the membership problem. Why does that not work here? Because we don't know the distance? We don't know the distance. Or in other words, it could be that this query was never inserted. Right? And so, its bucket will be empty. And the membership problem would just say no, which is correct, because this query is not in the database. But it could be very close to a point in the database, right? And then your ideal answer should have been this very close point. But because your membership was just hashing into buckets, and then it was giving a yes or no, you will miss the fact that this query is very close to someone in the database. So, do you see what's wrong with the hash function approach here? What sort of a hash function would you want for this problem? 
In my opinion, I think if we hash the original points so that the far points end up even farther apart and the close points end up even closer together, then we can clearly pick out the points near the query, because they stay close. Exactly. So, if you could somehow hash so that close-by points end up in the same bucket, in the same hash cell, whereas far away points don't end up in the same hash cell, then when I get a query, I can just hash it. And I know all the points near the query have hopefully hashed to that bucket. And then I can just restrict my search to that bucket. And that's exactly what a locality-sensitive hash family is. It's sensitive to locality. It's a hash function that respects distances: close-by points get hashed to the same bucket with a good probability, and far away points, with a good probability, do not hash to the same bucket. So, before I tell you the locality-sensitive hash, here is another example of a distance. I told you about bit vectors, right? For bit vectors, there is the Hamming distance. And sometimes you want to compare two sets. So, if I have two sets A and B, a measure of how close they are is their Jaccard similarity, which is simply the size of the intersection divided by the size of the union. So, if my two sets are identical, if A is equal to B, what is their Jaccard similarity value? One. One. And if they're disjoint, if they have nothing in common? Zero. Zero. So, Jaccard similarity is a way of measuring similarity between sets. Right? And this is another version of the distance function. So, you can apply nearest neighbor search to bit vectors, you can apply it to points in high dimensions, and you can also apply it to sets.
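A minimal sketch of the Jaccard similarity just described: intersection size divided by union size. The function name and the convention of returning 1.0 for two empty sets are assumptions made here for illustration; the lecture does not discuss the empty case.

```python
def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| for two finite sets."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


print(jaccard_similarity({1, 2, 3}, {1, 2, 3}))  # identical sets
print(jaccard_similarity({1, 2, 3}, {4, 5}))     # disjoint sets
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # two shared out of four total
```

Note that a higher value means more similar, so the "nearest" set to a query is the one with the highest Jaccard similarity.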
So, in other words, I give you a database of sets; preprocess them and store them so that when I give you a query set, you can quickly return the set in the database which has high Jaccard similarity to the query set. Okay. So now, what is an LSH, a locality-sensitive hash family? So, H is a family of hash functions. It could have many hash functions, and you pick a particular h_i from this family. A locality-sensitive hash family really has four parameters. The four parameters are R; C, which you have already seen, C is the approximation factor; and then two probabilities, P1 and P2. But really, it's exactly what we spoke about. So how do these four parameters come in? A family of hash functions is called (R, CR, P1, P2)-sensitive if the following holds for any two points X and Y in your domain. So think of them as two points in the domain that I will apply the hash to. I take two points and I take a random hash function from this hash family. If the distance between those two points is smaller than or equal to R, meaning they're close, then their hashes are the same with probability at least P1. And if the distance between the two points is more than C times R, remember, C was the approximation factor, so these two points are now very far away, then their hashes are the same with a very small probability, with probability at most P2. Then this is a locality-sensitive hash family. Okay? If two points are within R, then they should hash to the same bucket with a good probability. And if two points are more than CR apart, then they should hash to the same bucket with a very small probability. So, obviously, a family is interesting if P2 is less than P1, right? You want P2 to be small: you want far away points to hash to the same bucket with a small probability. And you want P1 to be reasonably large.
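In symbols, the two conditions of the definition read as follows, with the probability taken over a hash function h drawn at random from the family H and with x, y ranging over the whole domain:

```latex
\[
\begin{aligned}
d(x, y) \le R \;&\Longrightarrow\; \Pr_{h \in H}\bigl[\,h(x) = h(y)\,\bigr] \ge P_1,\\
d(x, y) > cR \;&\Longrightarrow\; \Pr_{h \in H}\bigl[\,h(x) = h(y)\,\bigr] \le P_2,
\end{aligned}
\qquad \text{with } P_2 < P_1 .
\]
```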
You want close enough points to hash to the same bucket with a large probability. So, if you have a family of hash functions that satisfies this property, then that family is called (R, CR, P1, P2)-sensitive. Does this definition make sense? So, basically, you're combining the Ys that are close into one bucket. Yes. But I'm not forcing them to be in the same bucket. It's a probabilistic statement. Right? Because if you force it, then you'll be forced to put everything in the same bucket, right? If I have points on the diagonal, then if you start from one corner, you say, oh, these have to be in the same bucket. But then you move a bit and you say, oh, these have to be in the same bucket too. If your points have reasonable overlap, then if you force condition one, you will end up with all the points in one bucket. And that's bad, right? Because then you've lost all information. So, it's a probabilistic statement. But yes, the idea is that close-by points go to the same bucket with a good probability, and far away points, with a good probability, do not go to the same bucket. So, finding a hash family that meets that requirement of being (R, CR, P1, P2)-sensitive, that's like a preprocessing thing? Exactly. So, if the distance for which you want to solve nearest neighbor search admits such a hash family, then you can get this result that I told you. And if it doesn't admit one, then we don't know. All the known results are for distances that admit it. So, Hamming distance admits an LSH family, which we will see in a moment, because of which you get a better query time. So does Euclidean distance. But there are some distances, like L-infinity, which don't admit it. And so for that, we have nothing better than this brute-force search. "Don't admit" means that you can't use locality-sensitive hashing with it? Exactly. There is... we cannot find...
no one has been able to find hash functions like this for that measure of distance. Right? This is just the definition, meaning if a family of hash functions satisfies these two properties, then we call that family blah-blah-blah-sensitive. But it could be that there is no such family for that distance, no LSH family of hash functions for it, and then we cannot solve nearest neighbor search fast. Did that answer your question? Yes, thank you. Right, so the whole game is: given a distance or a similarity measure, can we cook up LSH families for it? Oh, I'm sorry, one more clarifying thing. So being able to find an LSH family depends completely on how you define distance, and it has nothing to do with the points you're given? Yes. If it works for some points, it has to work for others, as long as you define distance the same way? Exactly, because the definition doesn't actually depend on the points in the database. This has to hold for any two points in the entire domain, whichever space your points are coming from. This U is the whole space. It's not just the data set. Okay, thank you. But actually, that brings us to a very interesting point. Maybe it'll be a bit of a tangent. But you see these runtimes? Oops. These ones that I showed you. You can ask, well, are these the best you can get for Hamming distance? Or is this the best you can get for Euclidean distance? Right? Valid question. It turns out that if you use a hash family which satisfies these properties over the whole domain, then you cannot beat these running times. But there is this guy, Alexandr Andoni. I think he's, like, maybe the chair of the computer science department at Columbia. Anyway, he's a professor at Columbia. And what he showed was, if you allow your hash family, maybe it doesn't satisfy these properties over the whole domain, but it only satisfies these properties over your data set.
Then you can actually beat these running times slightly. You can improve on them slightly. Okay. So that's a good question. Maybe I'll write it here: related, see Alexandr Andoni's work. And that's called, as you can guess, data-dependent LSH. So there's still work going on in this area, improving the running times from classical LSH to now having hash families that only really work for your data. But traditionally, we always wanted the hash family to work for the whole space, for the original metric space. Does this make sense? Data-dependent LSH, the name? Yes. Good. And he's a great researcher. I've invited him for talks. He came to Queens College last semester also and gave a very nice talk. So do look up his work. Okay. So now, quickly, what could be a locality-sensitive hash family? Let's say for the Hamming distance. Right? So suppose I take a really stupid hash function, h_i. And let's talk about Hamming space. So what is the Hamming space in D dimensions? It's just bit vectors of length D. Right? So 0/1 bit vectors of length D. And here's a very stupid hash function. What it does is it just samples the ith bit. Okay? So this is a hash function that goes from the D-dimensional space to just 0 or 1. It only has two outputs. Right? Is this hash function clear? It takes a vector, it just samples one bit, and that's it. So let's look at the two conditions. Let's say we have two points, X and Y, whose distance is at most R. Meaning I have two bit vectors X and Y whose Hamming distance is at most R. Okay? What is the probability that their hashes are the same? Is it the length of the vector minus R, over the length of the vector? Yes. Right? Because if you sample any position where they don't differ, then you've done the right job, right? So it's D minus R over D. The length of the vector here is D. That's one minus R over D.
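The bit-sampling family and the collision probability just computed can be checked with a small sketch. All names here (`sample_bit_hash`, `exact_collision_probability`, `estimated_collision_probability`) and the two example vectors are illustrative assumptions, not part of the lecture. Drawing a random member of the family means picking a random coordinate i; two vectors collide exactly when they agree at that coordinate, so the collision probability is the fraction of agreeing positions.

```python
import random


def sample_bit_hash(d, rng):
    """Draw one member of the family: pick a coordinate i, return h_i(x) = x[i]."""
    i = rng.randrange(d)
    return lambda x: x[i]


def exact_collision_probability(x, y):
    """Fraction of coordinates where x and y agree, i.e. 1 - hamming(x, y) / d."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def estimated_collision_probability(x, y, trials, rng):
    """Empirical collision rate over `trials` independently drawn hash functions."""
    hits = 0
    for _ in range(trials):
        h = sample_bit_hash(len(x), rng)
        hits += h(x) == h(y)
    return hits / trials


x = [0, 1, 1, 0, 1, 0, 0, 1]
y = [0, 1, 0, 0, 1, 1, 0, 1]  # differs from x in 2 of the 8 positions
exact = exact_collision_probability(x, y)
estimate = estimated_collision_probability(x, y, 20000, random.Random(0))
print(exact, estimate)
```

With Hamming distance 2 and d = 8, the exact value is 1 - 2/8, and the empirical estimate should land close to it.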
So that's kind of our P1, if you remember, right? P1 was the probability that they hash to the same bucket if they're close. Does this calculation make sense to everyone? If I have two bit vectors whose Hamming distance is at most R, that means they differ in at most R positions. That means they agree in at least D minus R positions. So if I sample any one of those positions, their hashes will be... oh, this should be Y. Their hashes will be the same. And what is the chance of sampling a particular coordinate? It's one over D, right? So if you sample any of these positions, then you'll hash them to the same bucket. Otherwise, there are at most R positions where they differ, and there the hashes will not match. Calculation makes sense to everyone? Can you briefly talk about the D minus R again? Okay. X and Y are now vectors like this, right? Bit vectors. And saying that their distance is less than or equal to R, and this is the Hamming distance, means X and Y differ in at most R positions. Yes? Okay. That means they do not differ in at least D minus R positions. In those D minus R positions, they have the same bit. If one has a zero, then the other has a zero. If one has a one, then the other has a one. They match in at least D minus R positions. Does that make sense? Okay. So if the bit that you're sampling is from one of those D minus R positions, then you will sample the same bit, right? Their hashes will be the same. Either both will be zero or both will be one. Did that answer your question? Yes, thank you. Good. So now, as you can expect, what would the other guarantee be? Now, let's say I have two bit vectors that are far away, that are at least CR apart. What is the probability that their hashes are the same now? I'm going to say it's not too large. This should be easy. CR minus D over D. CR minus D over D. Why CR? Let me see. So their Hamming distance is at least CR.
No, so then it should be D minus CR. They differ in at least CR coordinates, right? That means they agree in at most D minus CR coordinates. And so if you sample any one of the dimensions where they agree, you will get the same hash. Does this make sense or no? Their Hamming distance is at least CR. That means they differ in at least CR positions. That means they agree in at most D minus CR positions. And so with that probability, at most D minus CR over D, you will sample one of those positions and they will have the same hash. If you sample any of the remaining CR positions, then they will differ there and they will not have the same hash. So here you have seen a very simple hash family, and this is a locality-sensitive hash family. This family, bit sampling, is (R, CR, 1 minus R over D, 1 minus CR over D)-sensitive. And this is the hash family for the Hamming distance. Does this make sense? Any questions about this one? Okay. So that's a hash family for Hamming distance. What I will not prove... you can ask, well, what about a hash family for the other distance that we are interested in, Euclidean distance, in d dimensions? So what do you think would be a good hash family there? Guess a locality-sensitive hash for the Euclidean distance. So what's a hash family that takes points in d-dimensional Euclidean space and, you know, hashes them into smaller things? So this was sampling one bit, right? What is some geometric analog of sampling? The hint is, you have seen this before. Sample a number from 0 to 1. No, no. So here you are given some point, right? You have some point p in R^d whose hash you want to compute. If you sample something from 0 to 1, how would you relate it to this point p? What is some way you have seen where you take high-dimensional points and map them into smaller things? Do you divide... Hyperplane.
No, the hyperplane, the Johnson-Lindenstrauss. So instead of selecting a bit at random, which made sense for Hamming distance, for bit vectors, for points in Euclidean space you select a random projection. You select a random hyperplane, and you project points onto the hyperplane. So the random projection onto, let's say, a D' dimensional hyperplane. And this gives a locality-sensitive hash family, whose parameters will depend on D', obviously. So the moral of the story is, if you're trying to preserve locality while... I mean, it's basically the same story for both, right? In the first part of this class, you saw hash functions that didn't care about locality, right? The original hash functions that we had for the membership problem, I mean, two input keys could be very close to each other, but their hashes could be very far from each other, right? We didn't care about close things going to the same bucket or anything like that. But now if you care about locality, if you want close-by things going to the same location, then in the Hamming world, you sort of sample bits and you get this. And in the Euclidean world, you project your points onto a random D' dimensional hyperplane. And this gives you... so these two hash families give you these two results that I mentioned. Just simply... go ahead. So for Euclidean distance, the bucket is what point on the hyperplane you land on? Exactly. Yes. Isn't that like... aren't there like a lot of points on the hyperplane? Isn't that like a really... I don't... There are a lot of points on the... so... but I mean, once the hyperplane has a small enough dimension... Yeah, yeah. Then... so... I mean, I did tell you the LSH families, but the P1 and the P2 that really you see here... So there's just one step missing in how to use these LSH families to get to the result on the previous page. And...
so let's say I find a locality-sensitive hash family, but I am not too happy with the value of P1 and P2. Meaning... let's say that the gap between P1 and P2 is not large enough. How would I get a better hash family with a larger gap between P1 and P2? Okay, so let me put it another way. Here, if you took this sampling hash family, this was its value of P1 and this was its value of P2, right? And sure, P2, if you look at it, is smaller than P1, right? Because P2 is 1 minus CR over D, whereas P1 is 1 minus R over D. But now let's say from this, I want to build another hash family where the gap between P1 and P2 is larger. Meaning, I want now the same property, but with a higher value of P1 and a smaller value of P2. What should I do? What is the trick? So now I want close-by points to hash to the same bucket with an even larger probability and far-away points to hash to the same bucket with an even smaller probability. Where have you seen things like boosting... this is somehow boosting the good probability, right? How do we boost good probabilities in this course? Do it multiple times? Do it multiple times. And that's the missing... that's the only ingredient. So here I told you one hash function. In practice, what people will build is they'll take a hash function G. But G is nothing but the concatenation of a bunch of these sampling hash functions. So each h_i is from the above LSH family. And if you do this, then you can boost your good probabilities and decrease your bad probabilities. And now the question is, how many times do you do it, right? Like this K is how many times I'm concatenating my hash functions in order to boost the good probabilities and decrease the bad probabilities. And obviously K will depend on P1 and P2, right? What is the quality of the starting hash family before this boosting step?
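A minimal sketch of the bit-sampling family and the concatenated hash G, assuming bit vectors stored as Python lists. The names `sample_g`, `collision_prob`, and `estimate_collision` are made up for illustration and do not come from the lecture. One caveat that the code makes visible: under plain concatenation two points collide only if all K sampled bits agree, so both collision probabilities shrink, to P1^K and P2^K, and it is the ratio between them that grows with K. The usual way to bring the close-pair probability back up is to repeat with several independent G's, which is a step not spelled out in the lecture.

```python
import random

def hamming(x, y):
    """Number of positions where the bit vectors x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def sample_g(d, k, rng):
    """Draw k coordinates uniformly at random (with replacement).
    The returned g reads those k bits, i.e. g = (h_1, ..., h_k)."""
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda x: tuple(x[i] for i in coords)

def collision_prob(x, y, k):
    """Exact probability that g(x) == g(y) for a random g of k sampled bits."""
    return (1 - hamming(x, y) / len(x)) ** k

def estimate_collision(x, y, k, trials, seed=0):
    """Empirical collision frequency over independently drawn g's."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        g = sample_g(len(x), k, rng)
        hits += g(x) == g(y)
    return hits / trials

d = 100
x = [0] * d
near = [1] * 10 + [0] * 90  # Hamming distance 10 from x (think R = 10)
far = [1] * 40 + [0] * 60   # Hamming distance 40 from x (think CR = 40)

for k in (1, 5):
    p_near = collision_prob(x, near, k)
    p_far = collision_prob(x, far, k)
    print(k, p_near, p_far, p_near / p_far)
```

With these example vectors, K = 1 gives collision probabilities 1 - 10/100 and 1 - 40/100, and raising K widens the ratio between the close pair and the far pair.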
And it turns out this K is basically going to be one over C or one over C squared for the things that we are interested in. So when we see a new distance, we try to find a locality-sensitive hash family for it. Once we have found the locality-sensitive hash family, we look at the P1 and P2. And then we do this boosting. So to answer your question of what defines the bucket, eventually, the bucket of a point, right? So now we will take our input set and we will apply such a hash function G to a point in the input set. And we will not get one bit, but a bit vector, where each of these coordinates is a bit. So G will actually map my points in D dimensions to points in K dimensions. And that K dimensional point is the bucket of an original point. So that's the boosting step. When combined with the locality-sensitive hash family, it gives us these properties. So I'm not going to prove how we get this. But hopefully the power of allowing approximations is clear. Because with exact, we could do nothing. But with approximate, we can actually get much faster query times using this LSH family. And the last thing... so maybe here is a quick question for you. So here we wanted a C approximate nearest neighbor, right? And I have points in D dimensions. And let's say I was allowing you a slightly more than C approximation. A C times one plus epsilon approximation. What can you do? Do I still need to work in the full dimension? So think of it this way. This problem only depends on distances, right? It only cares about distances between points. And whenever I have a problem where I only care about distances between points... What do I do to reduce the dimension? What was the name of the lemma that we saw before? Do people remember this? The Johnson-Lindenstrauss lemma, right? It said if you have points in a high dimensional space...
By projecting them into a lower dimensional space, right, you can basically fudge up the distances, but only by a one plus epsilon or one minus epsilon factor. So in particular, for the Euclidean version of nearest neighbor search, I might as well assume that my dimension is like log N over epsilon squared. I don't need to build a data structure for a dimension higher than this. Because if the dimension is higher than this, I'll just apply the Johnson-Lindenstrauss lemma, fudge up my distances by a one plus or minus epsilon factor, and then work in this lower dimensional space. Right? So whenever you read a paper about nearest neighbor search in Euclidean spaces, they will just assume that the dimension is at most that. Because you're anyway allowing approximations, right? So what's the point of working in the full high dimensional space when I can just lose a small factor, one plus epsilon, and get my dimension down. So in Euclidean nearest neighbor search, the highest dimension you will see is log N over epsilon squared. Right? So... let's see... this is all I wanted to say about locality-sensitive hashing. Yeah. Oh... okay. So, one side point. We had the membership problem. Right? What was the advantage that a Bloom filter had over the algorithms for the membership problem? Why did we use a Bloom filter in the first place? There were so many elements to look for. You just kind of wanted to have a pretty good guess of whether or not... You wanted to know if you had to look back, or know if it wasn't there at all. It might save you some time. Was it to save time? Was the Bloom filter there to save time? Or there to save space? Time. But time? I mean, all the... like, cuckoo hashing was constant time query, right? Well, I'm sorry. Maybe I'm... Maybe I'm confusing something really elementary here.
But, like, I thought the example was, like, you have books in a library. And, like, the library is massive. And, like, you don't want to have to waste so much time checking if a book is there or not. Use a Bloom filter. You know for certain it's not there. You don't waste your time looking for it. But you still have the library hash. Like, the library still exists. You don't save... Yeah, yeah. So, it was actually about the space. Because, remember, the keys were from a universe U. And to save a key exactly, you would need about log U bits, so the space depends on the universe. But the Bloom filter had space that did not depend on the universe. It just depends on the number of keys that you have and a log one over epsilon. Does that make sense? You don't need to store the whole book. You just hash it into a small, like, a bit vector. So, the purpose of the Bloom filter was to save space. Because, query-wise, FKS hashing or cuckoo hashing were constant time. And a Bloom filter is constant time. You look at some bits in your bit vector. So, the point of a Bloom filter, if you remember back, was to save space, from n log U to n log one over epsilon, where there is no dependence on U. And the same question kind of makes sense here, right? So, now, if I have a database, it could be a large database. And now, when a query comes, instead of actually finding the nearest neighbor in a database where I have all the points stored, if I could have a Bloom filter which could quickly answer, yes, there is someone close to the query, then I could go to my database and fetch that person. Otherwise, if my query is in the middle of nowhere, I don't really have to bother fetching anyone. So, can the original Bloom filter tell me whether there is someone close to the query? If you look at the original Bloom filter, and when I come up with the query, is there a way to find if someone is close to the query? No. No.
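A minimal sketch of the space point above, assuming the standard Bloom filter sizing of about 1.44 n log2(1/epsilon) bits with about log2(1/epsilon) hash functions. The class name `BloomFilter` and the use of salted SHA-256 digests as hash functions are illustrative choices, not something from the lecture. Note that the bit-array size is computed from n and epsilon only, never from the size of the universe, and that the filter answers exact membership only, so a key that is merely close to a stored key is treated like any other non-member.

```python
import hashlib
import math

class BloomFilter:
    """Toy Bloom filter sized for n keys and false positive rate eps."""

    def __init__(self, n, eps):
        # Standard sizing: m = n ln(1/eps) / (ln 2)^2 bits, k = (m/n) ln 2 hashes.
        self.m = max(1, math.ceil(-n * math.log(eps) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # Salted SHA-256 stands in for k independent hash functions (an assumption).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(key))

n, eps = 1000, 0.01
bf = BloomFilter(n, eps)
for i in range(n):
    bf.add(f"book-{i}")

print("bits used by the filter:", bf.m)
print("bits to store n 64-bit keys exactly:", n * 64)
```

Every inserted key is reported as present, while a non-member is reported as present only with small probability, which is what lets the filter skip the expensive lookup for most absent keys.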
So, then you can ask, is there a distance-sensitive Bloom filter? Is there a Bloom filter which, when I give it a query, just says yes or no, quickly, with less space, whether there is someone close to the query or not? And this question was given to me by this guy who invented cuckoo hashing. He had asked me this question. And then, with him, we built the Bloom filter for nearest neighbor search. So, the paper is called distance-sensitive Bloom filters. It's a Bloom filter, but with these distance guarantees. So, you can build Bloom filters for nearest neighbor search too. Good. Let's see. We are done. Let us quickly talk about the syllabus now. So, here's the syllabus. So, this will be basic probability. Let me spell it out. Markov, Chebyshev. Also, I mean, very basic stuff like calculating expectation and variance. If you don't know variance, how are you going to apply Chebyshev? And how are you going to apply Chernoff? So, that's what I mean by basic probability. Then, in the streaming algorithms, I think we covered uniform sampling. We covered counting, including this exercise 4.9 about boosting. And I saw Hanan in her midterm solutions has also given the solution to exercise 4.9. So, you know how to boost the Morris counters. Then, we have approximate median, which we studied in the streaming algorithms. You have seen a question on the mock final about the approximate median. We studied the Count-Min sketch, which was used to solve the heavy hitters problem by estimating the frequency. We saw frequency moment estimation, right? Sum of the squares of the frequencies. How do you estimate that? And we saw the distinct elements problem, right? Counting how many distinct elements there are, where we were keeping track of the minimum hash value seen so far. So, I believe these are all the streaming algorithms that we have seen. Unless I'm missing something. Hopefully not. Then, we had the online algorithms part of the course, where you saw a bunch of toy problems.
So, by toy problems, I mean the ski rental problem. The pizza finding problem. Things like that. Then, you had the list update problem. Remember? Where when you search for an element, you're allowed to place it anywhere before its position for free. And how do you actually, you know, get something that is competitive with the optimal algorithm? We saw the experts theorem. Also, I gave the multiplicative weight updates theorem. For caching or paging, I only told you about the guarantees. So, this was remembered about the... By the way, this, see? Last thing. So, in previous semesters, I would cover more. So, we did not cover this, this time. So, for caching and paging, I mostly showed you the guarantees about LRU and first in first out. If you ask them to have the same memory as the optimal algorithm, then you can only get a k-competitive result, where k is the size of your cache, which is a bad result. But then there was the resource augmentation type of result, where if you allow my memory to be more than opt's memory, then even though opt can look in the future, I can somehow reasonably be competitive to opt. Then, in the last couple of classes, we have seen the Johnson-Lindenstrauss lemma and the nearest neighbor search. And look, in previous semesters, I would also cover more. So, we didn't cover the external memory, multi-way merge sort. So, really, for those who are saying it is a lot, it was at least 20-30% more in the previous semesters. So, I guess I was a bit slow. You said it was a graduate level. No, no, no, no. No, this is not graduate level. Do you see the title of my... This is still big data QC. So, this was Queens College. Graduate level is even more. We cover more. But I believe you said that in the first half of the... For the first... Everything from the midterm, we covered more than you usually do. Yeah. So, I guess Bloom filter. Yeah. So, you see in this part, there was Bloom filter.
So, I covered bloom filter in the... Before the midterm. So, yeah. That's a weird thing. Maybe the semester was paced. I think also this time, I had the midterm like a week later than I normally do. So, that could have been it. Well, it's a weird schedule. Do you know that for Tuesday, Thursday classes, they're having class in the middle of finals weeks? We have our final on the 18th. We have regular class on the 19th. Yeah. Same. What do you mean? So, we have two more classes? No. Because we're a Monday. No. Because we're a Monday, Wednesday class. No. Oh, yes. Yes. Oh, yeah. We have Monday, Wednesday. Oh, so my Tuesday, Thursday class is next week too. Yeah. So, Tuesday, you do have a class for Tuesday. Of course, you wouldn't be told that by the college in like an explicit manner. You just kind of have to derive it from the schedule. Wait. So, I have class next Tuesday, also next Thursday or no? Just Tuesday, I think. Ah, okay. Okay. All right. Yes. So, this is your syllabus. I'll upload it on Brightspace. And then on Friday, we can do office hours. We can do Friday a couple of hours. And also, maybe some people would prefer Sunday. Because if you're working late, whatever. Last night thing. So, maybe I'll keep a couple of hours on Friday and a couple of hours on Sunday. And we'll do it the same way where the link that I had sent you for the midterm office hours. You just put your name in and book a slot. Write the question that you would want to be discussed. So that I don't have to repeat the same answer. And you don't have to wait. So that everyone knows what is being discussed. And then you can show up at the slot time. All right. I'm going to stop the recording here. We have the scribes for today's lecture. Did we have scribes for today? Or not? Me enough, yo. Okay. Yeah. So then... So that's that. All right. Yeah. So, questions? Or I'll just see you on the final then, next week. Professor, can you post the recordings? Post the what? Oh, yeah. 
Recordings after May 4th. Because I don't see any recordings after May 4th. You don't see any what? Recordings? Recordings after May 4th. Yeah. Recordings after what? I can't hear you. May 4th. May 4th. May 4th. May 4th. Well, you mean then the uni class was Mondays, right? May 4th was last Wednesday. So, do you mean the only recording... May 4th. May 4th was Monday. So, May 4th, you uploaded only May 4th. Oh, I didn't upload May 6th. Yeah. Because I don't see recording for May 6th and May 11th, Monday. All right. So, anyway, yeah. So, in half an hour, when all the recordings will be uploaded now, in the next half an hour or so. Also, I think there's one recording missing, which is lecture 20. I don't know the exact date yet, but this one... So, on that day, we know about our grades. So, on that day, you did not upload the recording for that. I didn't see on the announcement. Oh, so that means... So, that was when I was in Denmark, I remember. So, I can figure out the dates. Okay. I'll see if there's any missing recording. Normally, I do upload the recordings, but okay. So, one recording missing and the last, including today, the last three recordings. Right? So, four in total. Yes. All right. Yep. I'll try to see once this one is set up. So, mock final is on the... Overleaf. And... Yeah. That is that. So, it was... It was very nice. And hopefully, you guys will... Don't get stressed. As I said, the basic fundamentals is what we... Is what are the most important. So, just understand... First thing, understand the problem statement. Right? You've... Like, from the midterm, if there's one thing that you have learned, And I have learned is... There were these bonus... Like, free points. That you guys didn't get stupid questions. Like, what's the membership problem? And if you can't describe what the problem is, Then... Why would anyone ask you about the solution? So, those are free points. Like, what's the problem? What is it used to solve? Or what's the algorithm? 
The analysis comes later. Right? So, that's like the first BFS type. Right? Do a BFS traversal of these topics. Right? If... If you only had two minutes to explain it to someone, How would you explain it? And then, if you had five minutes to explain it to someone. And then, if you had half an hour to explain it to someone. But do the two minute and five minute things before going into the half an hour thing. Does that make sense? Yes, professor. All right. Then, class is dismissed. See you guys on Monday. I'll be here for a couple of minutes if people have any questions.