You guys see the PDF?

Yes.

Yes.

So the first question is about Count-Min
Sketch.

Second question is membership,
Bloom filter stuff.

Third question is approximate median.

So this is from the streaming algorithms
part.

Fourth is again about Bloom filter.

Oh, I had two questions about Bloom filter.

Oh, yeah.

This is about Bloom filter probabilities.

Question five is about online algorithms.

And question six is frequency moment.

So streaming.

So you said you're going to add an... like
the exam is going to be more than 100.

You'll add another question for us for a
chance to improve or choose.

So look here.

So this exam was 120 points.

Right?

So you can, for example, skip question
two.

And you can still score 102.

Because question two.

Yes?

Membership and Bloom filter both were
before midterm for us.

Yes.

Yes.

I believe the Bloom filter was before
midterm for you guys.

Yes.

So I think the Bloom filter will not then
be part of the final.

Same thing for the membership problem,
right?

Number two.

No, the membership problem is too basic
for me to, like, say it's not...

The membership problem
is... I mean, so I won't ask

you about hashing with
chaining or stuff like that.

Right?

The algorithms.

But what the membership problem is,
hopefully you will never forget.

So that's something you...

So something like, what is the membership
problem?

That is...

I mean, it's like asking, if I was
teaching an algorithms course and I said,

the final will cover what came after
the midterm, but

then somewhere
the word graph came in.

So that's...

So, yeah, the algorithms for the membership
problem, no, and Bloom filter, no.

But sure, if you...
Yeah, go ahead.

Do you have any examples for questions for
topics we covered after the midterm?

Because in this final, I see all the
questions that we will not see.

No, number three, number five,
number six.

These are all after midterm, right?

So I don't know if you saw this final,
but three, five, six are after midterm.

Okay.

It is the only final that we will see,
like the examples.

Yes, yes.

I tried to find, I couldn't find any
others.

Because I last taught this, I think
2000...

No, so the one I have, I think I taught it
at the Graduate Center.

But that was a PhD level course.

So those questions are harder.

So I didn't want to give you guys those
questions.

Could you give us some examples for topics
that we do not see in this example?

So I'm not sure what...

I cannot give you a question for every
topic that we have seen.

But at least for some.

So which topic would you like to see a
question for?

Multiplicative weight updates.

Multiplicative weight updates.

So I have not made up a question yet for
that.

So making questions requires effort.

And so...

I think for multiplicative weight updates,
so I would say, understand the algorithm.

I won't ask you details about the
analysis.

Remind me, did I prove the...

I proved the multiplicative weight updates
method, right?

Yes.

I proved it.

But again, the first thing would be just
understanding what the algorithm is.

What the problem is and what the algorithm
is.

As long as you understand that,
the question, you'll figure it out.

Just keep the basic.

So I think this is the wrong way of
studying.

How you guys are preparing is,
in my mind, the wrong way, which is

question in exam focused, rather than
basics focused.

So if the basics are solid, then the exam
is more of an application.

So if you understand what the
multiplicative weight updates setting is,

and what the algorithm is, that's more
than what you need to know.

You don't need to understand the whole
proof.

That is not what you would need to do.

But if you understand what the problem is,
and what the algorithm is, that's enough.

And I cannot give you a question because I
don't have a question.

You're not going to ask us to prove
something in the final?

Ask you to prove something?

So, I mean, here...

For example, here, this is...
You have to give an algorithm.

And you have to show that the probability
of something being something is this.

And something being too large is this.

So that's...

That is asking to prove a guarantee of
your algorithm.

So in some sense, it is...

Proving something about the algorithm.

Or...

I mean, here there is...
I don't know why this...

I think it's a...

It's an unfortunate
consequence of the way we teach

math here that proof is
somehow a very scary thing.

So...

So...

Whatever I ask you to prove,
I mean, so for example, this is

an application of Chernoff and
you were given Chernoff here.

So you really have to just put things in
the right place and then apply something.

So if I ask you to prove something also,
I'll give you hints throughout the proof.

I won't just say here is a statement,
go ahead and prove it.

So I'll give you...

Enough hints to...
To continue the proof.

The... The most important thing is
whether you understand the question or not.

So do not look for keywords and then write
whatever you would like to write.

If you understand the
question, you are guaranteed to

get at least a quarter of
the points in the question.

Because at least your attempt will be
meaningful somehow.

And I give partial credit.

But if you don't understand the question
and you write something orthogonal,

then I cannot even give you partial
credit.

Right?

So just make sure you at least understand
the question.

For example, in this
question for 30 points, the

first part is: give an
algorithm for this problem.

And then the other parts are to prove the
guarantees of your algorithm.

So at least if your algorithm is correct,
or it makes some sense, then you can get,

you know, at least 10 out of the 30
points.

But if your algorithm is not even in the
streaming model, if you're doing some

weird stuff, then I cannot even give you
partial credit for that.

Right?

So yes, so the topics that you don't see
here, the best way to study for them is

just try to understand what we've covered
in class.

Because again, these are topics that are
not really in a textbook.

So there's no question quite like this.

I think I made up this question.

I made up these, these questions are all
made up.

They're not like from a textbook or
something.

Okay, so just understand what we've
covered in class and you'll be fine.

Anything else?

Okay.

Then let us quickly talk about nearest
neighbor search.

All right.

Let us record to the cloud.

All right, welcome everyone.

This is the last class for Algorithms for
Big Data, May 13th.

And today we will cover nearest neighbor
search, which is a very important problem

in databases and now in machine learning,
especially in high dimensions.

So what is nearest neighbor search?

Recall the membership problem, the original
membership problem.

We were given N keys.

And a query.

And we were asked is the query Q in the
set of the N keys.

Nearest neighbor search is not about exact
membership.

It's about similarity.

So here it's asking is Q similar to some
key in your set S.

So in other words, the nearest neighbor
search query is given a query Q.

So think of Q as a point.

So think of the data set as points in some
high dimensional space.

And when I say points in
high dimensional space,

really, you should
think of them as vectors.

Right?

With some number of coordinates that is
large.

And now a query will come.

That is also a point in high dimensional
space, a vector.

And you have to return the nearest
neighbor of this query in your database.

In other words, you will look at the
distance.

So from now on.

Okay.

So, so two things to mention in this
lecture.

When I have D and in the parentheses,
I have two points.

This is the distance between two points.

Right?

So this is the distance
between the query and the

Xi, which is the ith
element in your database.

So that's the distance from the query to
the ith element.

I'm taking the minimum over all elements
in the database.

So this, when I have D with parentheses,
this is the distance.

But I will also use D
for the dimension, but

it will be clear which
D is the dimension.

And if I have a parenthesis, then that
means it's the distance.

Is that clear?

This thing about the notation.

So right here, D is not the dimension.

Because I have D and then, you know,
in the parenthesis, I have two points.

So this is the distance between the two
points.

So the input points will be in some D
dimensional Euclidean space.

Query is also a point.

So for example, if
this was your data set,

this is a query, then
you need to return X5.

You need to say that X5 is the closest
point to the query.

Is the problem statement clear?

Given a data set in high dimensions,
preprocess it and make a data structure so

that given a query, you can quickly answer
the nearest neighbor of that query.

So Xi and Q have to be in the same,
have to be in the same R^D?

Yes, yes.

Each of these points lives in the same
space.

Yep.

So D could be one, D could be two,
but typically we think of D as high.

Okay.

So if you understand the question,
then, without any fancy methods,

what worst case query time can you
guarantee?

What's the stupid way to solve?

Huh?

Go ahead.

O of N.

N, but, so you're saying O of N by
saying...

Just check every single of the points
given and see, see which one.

Good.

But how long does this distance
computation take?

There's another parameter.

Gotcha.

Yeah.

No, no.

So, so if I give you two points in D
dimensions, you have to subtract their

coordinates and square and add and take the
square root, right?

Yes.

So that takes order D time,
right?

Because there are D
coordinates you have to like,

subtract and square and
then add them up, right?

So what does the running time become?

N times D.

Exponential in D.

No, no, no, no.

It'll just be N times D, right?

Because for each XI, you will compute the
distance from the query to XI.

And all I'm saying is this distance
computation takes D time.

Because Q has D coordinates.

XI has D coordinates.

And so when you subtract and so on, you
spend order D time to compute one distance.

And there are N distances to compute.

And then you can take the smallest one out
of them.

So the stupid way to solve this problem
takes N D query time.

And the problem there is, well,
N is pretty large, right?

So, I mean, D could be large, but N is the
number of points.

That's some, that's
always much larger than

the dimension, typically
in a database, right?

So, so is it clear what the order N D,
so the trivial solution

gives order N D query time.

Everyone's clear what the N D solution is,
the brute force, like linear search.

But the D is there because
you have to compute the

distance, which takes D
time between two points.
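
A minimal sketch of the brute-force linear search just described, assuming points are stored as tuples of floats. The function names `euclidean` and `brute_force_nearest` are made up for illustration.

```python
import math

def euclidean(p, q):
    # One distance costs O(D): subtract, square, add up, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def brute_force_nearest(points, query):
    # N distance computations of O(D) each, so O(N * D) per query.
    best_index, best_dist = -1, math.inf
    for i, x in enumerate(points):
        d = euclidean(x, query)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
nn_index, nn_dist = brute_force_nearest(points, (0.9, 1.2))
```

The answer is exact, but every query touches all N points.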

And so now you can ask, well, okay,
can I improve the dependence on N, right?

Can I maybe make something like,
I don't know, square root N times D?

Or something like log N times D,
right?

And so now, one question
I have is, this trivial

solution, will it give
me the exact answer?

Will it give me the exact nearest
neighbor?

Or will it make any error?

Exact.

It will be exact.

Correct.

So now, the unfortunate news that was
proven, I would say 2011 or 2012,

was this paper by Ryan Williams and Josh
Alman.

Ryan Williams, by the way, is...

So there's this area of computer science
called complexity theory.

And he's one of the leaders in complexity
theory.

So him and I think it was
a student at that point,

they proved that if you
want an exact algorithm.

So if you want any exact algorithm for
nearest neighbor search, and let's say you

had an exact algorithm that ran in time,
even slightly better than N.

So the exponent of N, which is currently
one, even if you could reduce it to like a

0.99, then you would violate some
hypothesis in complexity theory.

Meaning this would be a big result in
complexity theory.

So it's called the strong exponential time
hypothesis.

It's about SAT formulae.

So think of this result as saying that if
you want an exact algorithm, then really

you can't even improve the stupid one that
we just saw.

This is pretty much the best you can hope
for if you want an exact answer.

Does this result make sense?

The exponent in N, which is one here,
cannot be reduced even a tiny bit,

or is very unlikely to be reduced even a
tiny bit.

If you can do it, you have solved a major
problem in complexity theory.

So given that bad news, what can we hope
for now?

So this is not possible now.

So what should we do with this bad news?

Should we give up?

Should we give up on a
faster query time and just say,

this trivial
solution is the best?

Yeah, given the topic of this class, I
think giving up sounds correct, professor.

Okay.

So you could give up, right?

That's always an option.

But that's not the interesting option.

So... David, try
something randomized?

This is good.

So, but this result even holds for
randomized algorithms.

Even a randomized exact algorithm cannot
beat this running time.

With this T.

So what would you... How
would you bypass this result?

Reduce dimensions.

No, I mean, this is saying that...

So even if you reduce the dimension,
right?

You would maybe improve this factor.

But as I said, usually N is much larger
than the dimension.

So what is the keyword here that you don't
necessarily need in an application?

Exactly.

Right.

In many applications, you don't really
need the exact nearest neighbor.

And so what people have studied,
whenever people talk about nearest

neighbor search, they
rarely talk about exact

nearest neighbor search
because that is hopeless.

And so what we do is approximate nearest
neighbor search.

So what is approximate nearest neighbor
search?

You'll have an approximation factor C
greater than one.

And now instead of returning the nearest
point to the query, you have to return a

point, xj, whose distance to
the query is at most C times

the distance between the
query and its nearest point.

Okay, so this was the distance between
query...

distance from the query to its nearest
neighbor.

And we're saying, okay, fine.

You don't have to give me
the nearest neighbor, but

give me someone who is
not too much farther away.

Where C is this approximation factor.

Does this statement make sense?

So it's saying that even if it's not,
it might not be the nearest neighbor,

but it'll be nearest with... it'll be
within some bound of nearness, closeness.

Exactly.

So if my nearest neighbor is a distance R
away, then it'll be at most C times R away.

Right?

So C is 1.5.

If my nearest neighbor is
distance R, then... Okay, fine.

Don't give me the nearest one who's within
R, but give me someone who's at most 1.5 R away.

And for many applications, this is good
enough.
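
A small sketch of what "C-approximate" means as a check on a returned answer, assuming Euclidean distance. The helper name `is_c_approximate` is hypothetical, and it uses a full scan only to verify the guarantee.

```python
import math

def is_c_approximate(points, query, j, c):
    # points[j] is acceptable if d(q, x_j) <= c * min_i d(q, x_i).
    nearest = min(math.dist(query, x) for x in points)
    return math.dist(query, points[j]) <= c * nearest

points = [(1.0, 0.0), (1.4, 0.0), (5.0, 0.0)]
query = (0.0, 0.0)
ok_exact = is_c_approximate(points, query, 0, 1.5)  # distance 1.0, the true nearest
ok_close = is_c_approximate(points, query, 1, 1.5)  # distance 1.4 <= 1.5 * 1.0
too_far = is_c_approximate(points, query, 2, 1.5)   # distance 5.0 > 1.5 * 1.0
```

With C = 1.5, either of the first two points is a valid answer; the third is not.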

Right?

Finding someone nearby is good.

And it turns out that this is a
whole... this became a whole field.

So approximating and relaxing your
requirement to not having an exact

nearest, but someone who is reasonably
close, this became a whole field.

And there are professors who only do this,
who only publish in this.

And most of their life's work is on
approximate nearest neighbor search.

And one of the main techniques...

So, by the way, right now, the way I have
described the problem, this distance,

you guys are thinking subtract the
coordinates, square, add, and take square root.

But you can take any Lp norm,
right?

You could subtract the
coordinates, cube them,

add the cubes, and
then take the cube root.

Right?

That's the Lp norm when p is 3.
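
A short sketch of that family of distances. The name `lp_distance` is made up here, and the absolute value is added so that odd p (like 3) behaves as a distance; that detail is an assumption beyond what was said aloud.

```python
def lp_distance(x, y, p):
    # Subtract coordinates, raise the absolute differences to the p-th power,
    # add them up, and take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

d2 = lp_distance((0, 0), (3, 4), 2)  # the usual Euclidean case
d3 = lp_distance((0, 0), (1, 2), 3)  # cube, add the cubes, take the cube root
```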

And there are various other measures of
distances that you can take.

Do people know what the Hamming distance
is?

Hopefully we know what the Hamming
distance is between two bit vectors.

If I give you two bit vectors,
A and B, what is their Hamming distance?

How many bits are different when you line
them up?

Exactly.
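
As a quick sketch, assuming bit vectors are given as equal-length lists of 0s and 1s (the function name is illustrative):

```python
def hamming_distance(a, b):
    # Line the two bit vectors up and count the positions where they differ.
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

ham = hamming_distance([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
```

Here the vectors differ in the second and fourth positions.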

So you can ask the same question about,
you know, I give you a database of bit

vectors, store them so that
when a query comes, you can find

me the bit vector that is the
closest in Hamming distance to it.

Right?

It's kind of like Google search when you
put the stars in sometimes, right?

You can... You just
find the closest one.

So this is a generalization of search.

And it's a more... It's one of the
most meaningful generalizations.

Your distances can come from various
applications.

So what can we do with approximate nearest
neighbor search?

So... So here, approximate
nearest neighbor search

is a problem that I've
also worked in quite a bit.

And... So how... What... How can
you beat this previous answer, right?

This previous... Bad result.

The bad news that these people had proven.

What happens if we can do approximate?

So now it turns out that using this
technique, which is a very famous

technique that I'll
just briefly describe

called LSH, which is
locality sensitive hashing.

So, you know, hashing, and this is
locality sensitive hashing.

What you can do is you can create a data
structure where the preprocessing time,

meaning the time taken to create the data
structure, is n to the one plus rho, times D.

I'll tell you in a moment what rho is.

And the query time becomes n to the rho
times D.

Where rho, for example, if you're in
Hamming distance, then your rho is one

over C, where C was your approximation
factor.

So, for example, if C is equal to two,
what does my query time become?

For Hamming distance, if C
is equal to two, I'm okay with

returning someone who's
twice as far away, but not more.

It becomes square root of n times D.

And you beat this n times D from before,
right?

Or even this previous result that said,
if you want to lower n to n to the

0.99 with an exact algorithm, it was
hopeless.

But if you allow me approximation, then
I can get you a much faster query time.

So, rho is one over
the approximation factor

if you're talking about
the Hamming distance.

And it's one over the approximation factor
squared for the Euclidean distance.

So, this is even... which is better,
this one or this one?

If C is equal to two, what do you get for
Euclidean query time?

The fourth root of n.

Yes, which is better than the square root
of n, right?

So, you get fourth root of n times D.

So, that's much faster than n times D.
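
To put numbers on these bounds, here is a rough sketch that plugs in the exponents stated above (rho = 1/C for Hamming, 1/C squared for Euclidean). The function name and the sample values of n and D are assumptions, and constants and log factors are ignored.

```python
def lsh_query_time(n, d, c, metric):
    # rho = 1/c for Hamming distance, 1/c^2 for Euclidean distance.
    if metric == "hamming":
        rho = 1.0 / c
    elif metric == "euclidean":
        rho = 1.0 / (c * c)
    else:
        raise ValueError("unknown metric")
    return (n ** rho) * d

n, d = 10**6, 100
brute = n * d                                   # linear scan
ham_time = lsh_query_time(n, d, 2, "hamming")   # sqrt(n) * d
euc_time = lsh_query_time(n, d, 2, "euclidean") # fourth root of n, times d
```

For a million points, square root of n is 1000 and the fourth root is about 32, against a million for the linear scan.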

Is the main result about
locality-sensitive hashing clear?

Any questions about this page?

Okay.

So, now, you have seen hashing before.

That's how we started off this course.

What could locality-sensitive hashing
possibly mean?

Look at this picture, right?

And I want to reduce things.

What does it mean?

So, I could arbitrarily hash these n
points, right?

To some table.

So, if you think of the membership
problem, the dictionary problem,

what were we doing?

We were taking these keys, we were hashing
them somewhere.

And then our query would come along,
we would hash it, and we would find if

there was someone in the bucket,
right?

Where the query hashed to.

This was all that was happening in the
membership problem.

Why does that not work here?

Because we don't know the distance?

We don't know the distance.

Or in other words, it could be that this
query was never inserted.

Right?

And so, its bucket will be empty.

And the membership problem
would just say no, which

is correct, because this
query is not in the database.

But it could be very close to a point in
the database, right?

And then your ideal answer should have
been this very close point.

But because your membership was just
hashing into buckets, and then it was

giving a yes or no, you
will miss the fact that

this query is very close to
someone in the database.
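
A tiny illustration of that failure, using Python's built-in hash set as a stand-in for the membership data structure (the example data is made up):

```python
database = {(1, 0, 1, 1), (0, 0, 0, 1)}
query = (1, 0, 1, 0)  # differs from (1, 0, 1, 1) in a single bit

# Exact membership says "no", which is correct but unhelpful here.
is_member = query in database

# A full scan shows there was a very close point all along.
closest = min(database, key=lambda x: sum(a != b for a, b in zip(x, query)))
```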

So, do you see what's wrong with the hash
function approach here?

What sort of a hash function would you
want for this problem?

In my opinion, I think if we hash the
original points, and we can make the

farther, the far points to be more
farther, and the close point to be more

close, that way we can, like, get the
points clearly in the boundary,

because it's closer, it's closer.

Exactly.

So, if you could somehow
hash, so that the close

by points end up in
the same bucket, right?

In the same hash cell.

Whereas far away points don't end up in
the same hash cell.

Then when I get a query, I can just hash
it.

And I know all the points nearby the query
have hopefully hashed to that bucket.

And then I can just restrict my search to
that bucket.
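
A rough sketch of that idea: hash every point into a bucket, then search only the query's bucket. The names `build_index`, `query_index`, and `toy_hash` are hypothetical, and the toy hash (the first two bits) is only a placeholder for a real locality-sensitive hash.

```python
from collections import defaultdict

def build_index(points, hash_fn):
    # Preprocessing: put each point in the bucket its hash names.
    buckets = defaultdict(list)
    for x in points:
        buckets[hash_fn(x)].append(x)
    return buckets

def query_index(buckets, hash_fn, q, distance):
    # Query: hash q and scan only the points that landed in the same bucket.
    candidates = buckets.get(hash_fn(q), [])
    if not candidates:
        return None
    return min(candidates, key=lambda x: distance(x, q))

def toy_hash(x):
    return x[:2]  # placeholder hash, not a real LSH family

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

points = [(0, 0, 1, 1, 1), (0, 0, 0, 0, 0), (1, 1, 0, 1, 0)]
index = build_index(points, toy_hash)
found = query_index(index, toy_hash, (0, 0, 1, 1, 0), hamming)
missed = query_index(index, toy_hash, (0, 1, 0, 0, 0), hamming)
```

The second query lands in an empty bucket even though a point at distance 1 exists, so a single hash function like this can miss.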

And that's exactly what a locality
sensitive hash family is.

It's sensitive to locality.

So, it's a hash function
that respects distances.

That close by points get hashed to the
same bucket with a good probability.

And far away points do not hash to the
same bucket with a good probability.

So, here...

Before I tell you the locality sensitive
hash, here is another example of a distance.

I told you about bit vectors, right?

So, for bit vectors, there is the Hamming
distance.

And sometimes you want to compare two
sets.

So, if I have two sets A and B,
right?

A measure of how... a measure of distance
between them is called their Jaccard

similarity, which is
simply the size of the

intersection divided
by the size of the union.

So, if my two sets are
identical, if A is equal

to B, what is their
Jaccard similarity value?

One, one, one.

One.

And if they're disjoint, if they have
nothing in common?

Zero, zero.

Zero.

So, Jaccard similarity is a way of
measuring similarities between sets.
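
A minimal sketch of this measure. The function name is illustrative, and returning 1.0 for two empty sets is a convention chosen here, not something from the lecture.

```python
def jaccard_similarity(a, b):
    # Size of the intersection divided by the size of the union.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)

same = jaccard_similarity({1, 2, 3}, {1, 2, 3})     # identical sets
disjoint = jaccard_similarity({1, 2}, {3, 4})       # nothing in common
partial = jaccard_similarity({1, 2, 3}, {2, 3, 4})  # 2 shared out of 4 total
```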

Right?

And this is another version of the
distance function, right?

So, you can apply it to bit vectors,
you can apply it to points in high

dimensions, and you can also apply it to
sets.

So, in other words, I give you a database
of sets, preprocess them and store them,

so that when I give you a query set,
you can quickly return the input set in

the database which has
high Jaccard similarity

to the query set.

Okay.

So, now what is an LSH, the locality
sensitive hash family?

So, a family H of hash functions.

So, H is a family of hash functions.

Could have many hash functions.

And if you pick any particular HI from
this family, right?

So, if you have a family of hash
functions.

So, a locality sensitive hash family
really has four parameters.

The four parameters are R.

C, you have already seen.

C is the approximation factor.

And then you have two probabilities.

But really, it's exactly what we spoke
about.

So, what are... how are these four
parameters?

So, a family of hash functions is called
(R, cR, P1, P2)-sensitive.

If for any two points, X and Y in your
domain, right?

So, think of it as two points in my domain
that I will apply the hash to.

So, if I take two points and I take a
random hash function from this hash family.

If the distance between those two points
is smaller than or equal to R,

meaning they're close, then their hashes
are the same with probability at least P1.

And if the distance between two points is
more than C times R, right?

Remember, C was the approximation factor.

So, these two points are now very far
away.

Then, their hashes
are the same with a very

small probability, with
probability at most P2.

So, then this is a locality sensitive hash
family.

Okay?

If two points are within
R, then they should

hash to the same bucket
with a good probability.

And if two points are greater
than C R apart, then they

should hash to the same bucket
with a very small probability.

So, obviously, a family is interesting if
P2 is less than P1, right?

You want P2 to be small, right?

You want far away points to hash to the
same bucket with a small probability.

And you want P1 to be reasonably large.

You want close enough points to hash to
the same bucket with a large probability.

So, if you have a family of
hash functions that satisfy this

property, then that family is
called (R, cR, P1, P2)-sensitive.
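
Written out, the definition stated above reads as follows. The symbols U for the domain and a calligraphic H for the family are notation chosen here; h is a random hash function drawn from the family.

```latex
\text{For all } x, y \in U \text{ and a random } h \in \mathcal{H}:
\qquad
\begin{aligned}
d(x, y) \le R \;&\implies\; \Pr[h(x) = h(y)] \ge P_1,\\
d(x, y) > cR \;&\implies\; \Pr[h(x) = h(y)] \le P_2.
\end{aligned}
```

The family is only useful when P2 is smaller than P1.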

Does this definition make sense?

So, basically, you're combining the Ys
that are close into one bucket.

Yes.

But, I mean, not...

I'm not forcing them to be in the same
bucket.

It's a probabilistic statement.

Right?

Because if you force
it, then you'll be forced

to put everything in
the same bucket, right?

If I have points on the diagonal,
then if you start from one corner,

you say, oh, they have to be in the same
bucket.

But you move a bit, you say, oh,
they have to be...

If your points have reasonable overlap,
then if you force the condition one,

you will end up with all the points in one
bucket.

And that's not good.

That's bad, right?

Because then you've lost all information.

So, it's a probabilistic statement.

But yes, the idea is close by points go to
the same bucket with a good probability,

and faraway points do not go to the same
bucket with a good probability.

So, finding a hash family that meets that
requirement of being (R, cR, P1, P2)-sensitive,

that's like a pre-processing
thing that takes... That's... that's exact.

So, if your original problem with whatever
distance you want to solve nearest

neighbor search for admits
an LSH hash... such a hash

family, then you can get
this result that I told you.

And if it doesn't admit it, then we don't
know.

All the known results are for distances
that admit it.

So, Hamming distance admits an LSH family,
which we will see in a moment,

because of which you get a better query
time.

So does Euclidean distance.

But there are some distances, like L infinity,
which don't admit it.

And so for that, we have nothing better
than linear search.

Don't admit means that
you can't use this... you

can't use the locality
sensitive hashing with it?

Exactly.

There is... we cannot find...
no one has been able to find

hash functions like this
for that measure of distance.

Right?

This is just the definition meaning if a
family of hash functions satisfies these

two properties, then we call that family
blah blah blah blah sensitive.

But it could be that there is no such
family, in which case, you know,

there is no such family for that distance,
there is no LSH family of hash functions

for that distance, and then we cannot
solve nearest neighbor fast.

Did that answer your question?

Yes, thank you.

Right, so the whole game is given a...

given a distance or a similarity measure,
can we cook up LSH families for that?

Oh, I'm sorry, one more clarifying thing.

So... so being able to find an LSH family
is completely dependent on how you define

distance and it has nothing to do with the
points you're given.

Yes.

If it works for some points, it has to work for
others if you define distance the same way.

Exactly, because
the definition doesn't

actually depend on the
points in the database.

This has to be for any
two points in the entire

domain, whichever space
your points are coming from.

This U is like the whole space.

It's not just the data set.

Okay, thank you.

But actually, that brings us to a very
interesting point.

Maybe it'll be a bit of a tangent.

But you see these runtimes?

Oops.

These ones that I showed
you, you can ask, well,

are these the best you can
get for Hamming distance?

Or is this the best you can get for
Euclidean distance?

Right?

Valid question.

Turns out, if you use a hash family,
which satisfies these properties over the

whole domain, then you cannot beat the
running time.

But there is this guy, Alexandr Andoni.

I think he's like, maybe the chair of the
computer science department at Columbia.

Anyway, he's a professor at Columbia.

And what he showed was, if you allow your
hash family, maybe it doesn't satisfy

these properties over
the whole domain, but it

only satisfies these
properties over your data set.

Then you can actually beat these running
times slightly.

You can improve on them slightly.

Okay.

And that is called...

So that's a good question.

Maybe here.

Related: see Alexandr Andoni's work.

And that's called, as you can guess,
data-dependent LSH.

So there's still work going on in this
area, improving the running times from

classical LSH to now having hash families
that only really work for your data.

But traditionally, we always wanted the
hash family to work for the whole space,

for the metric, the original metric space.

Does this make sense?

Data dependent LSH, the name?

Yes.

Good.

And he's a great researcher.

I've invited him for talks.

He came to Queens College last semester
also.

He gave a very nice talk.

So do look up his work.

Okay.

So now quickly, what could be a locality
sensitive hash family?

Let's say for the Hamming distance.

Right?

So let's look at...

So suppose I take a really stupid hash
function, hi.

And let's talk about Hamming space.

So if I'm talking about Hamming space in D
dimensions.

So what is the Hamming space in D
dimensions?

It's just bit vectors of length D.

Right?

So 0-1 bit vectors of length D.

And here's a very stupid hash function.

What it takes is it... It
just samples the ith bit.

Okay?

So this is a hash function that goes from
the D dimensional space.

Just to...

0 or 1.

It only has two answers, two outputs.

Right?

Is this hash function clear?

It takes a vector.

It just samples
one bit and that's it.

So let's look at the two conditions.

Let's say we have two points.

X and Y.

Whose distance is at most R.

Meaning I have two bit vectors X and Y.

Whose Hamming distance is at most R.

Okay?

What is the probability that their hashes
are the same?

Is it the length of the vector minus R
over the length of the vector?

Yes.

Right?

Because if you sample
any position where they

don't differ, then you've
done the right job, right?

So it's D minus R over D.

Length of the vector here is D.

That's one minus R over D.

So that's kind of our P1, if you remember,
right?

P1 was the probability that they hash to
the same bucket if they're close.

Does this calculation make sense to
everyone?

If I have two bit vectors
whose Hamming distance is at most

R, that means they
differ in at most R positions.

That means they agree in at least D minus
R positions.

So if I sample any one of those positions,
their hashes will be...

Oh, this should be Y.

Their hashes will be the same.

And what is the chance of sampling a
particular coordinate?

It's one over D, right?

So if you sample any of these positions,
then you'll hash them to the same bucket.

Otherwise, there are R positions at most
where they differ.

There you'll get a... there they will not
match.

The hashes will not match.

Calculation makes sense to everyone?

Can you briefly talk about the D minus R
again?

Okay, so... This distance
between the two...

X and Y are now vectors like this,
right?

Bit vectors.

And saying that their distance is less
than or equal to R...

This is the Hamming distance.

It means X and Y differ in at most R
positions.

Yes?

Okay.

That means they do not differ in at least
D minus R positions.

In D minus R positions, they have the
same.

If one has a zero, then the other has a
zero.

If one has a one, then the other has a
one.

They match in D minus R positions at
least.

Does that make sense?

Okay.

So if you sample from...

If the bit that you're sampling is from
one of those D minus R positions,

Then you will sample the same bit,
right?

Their hashes will be the same.

Either both will be zero or both will be
one.

Did that answer your question?

Yes, thank you.

Good.

So now, as you can expect, what would this
guarantee be?

Now, let's say I have two bit vectors that
are far away, that are at least CR apart.

What is the probability that their hashes
are the same now?

I'm going to say it's not too large.

This should be easy.

CR minus D over D.

CR minus D over D.

Why CR?

Let me see.

So their Hamming distance is at least CR.

No, so then they should be D minus CR.

No.

They differ in at least CR coordinates,
right?

That means they agree in at most D minus
CR coordinates.

And so if you sample any one of the points
where they, any one of the dimensions

where they agree in, you will get the same
hash.

Does this make sense or no?

Their Hamming distance is at least CR.

That means they differ in at least CR
positions.

That means they agree in at most D minus
CR positions.

And so with that probability, you will
sample one of those at most D minus CR

positions and they will have the same
hash.

If you sample any of the
remaining CR positions, then

they will differ there and
they will not have the same hash.
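
The two calculations above can be written down in one place. This is only a summary of what was just said, assuming i is chosen uniformly from the D coordinates and writing dist_H for the Hamming distance (that symbol is notation introduced here, not from the board):

```latex
\begin{align*}
\Pr_{i}\bigl[h_i(x) = h_i(y)\bigr] &= 1 - \frac{\mathrm{dist}_H(x, y)}{D} \\
\mathrm{dist}_H(x, y) \le R \;&\Longrightarrow\; \Pr_{i}\bigl[h_i(x) = h_i(y)\bigr] \ge 1 - \frac{R}{D} = P_1 \\
\mathrm{dist}_H(x, y) \ge cR \;&\Longrightarrow\; \Pr_{i}\bigl[h_i(x) = h_i(y)\bigr] \le 1 - \frac{cR}{D} = P_2
\end{align*}
```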

So here you have seen a very simple hash
family.

So this is a locality sensitive hash
family.

So this sampling family is (R, CR, 1 minus
R over D, 1 minus CR over D)-sensitive.

And this is the hash family for the
Hamming distance.

Does this make sense?

Any questions about this one?

Okay.

So that's a hash family for Hamming
distance.
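
As a rough illustration of the sampling family, here is a minimal Python sketch. The vectors `x`, `near`, and `far` and all function names are made up for this example. It computes the collision probability exactly by trying every coordinate i, so it should agree with the "one minus distance over D" calculation above.

```python
def bit_sample_hash(i):
    """Return h_i, the hash that reads coordinate i of a bit vector."""
    return lambda x: x[i]

def hamming(x, y):
    """Number of coordinates where x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def collision_probability(x, y):
    """Exact Pr[h_i(x) == h_i(y)] over a uniformly random coordinate i."""
    d = len(x)
    agree = sum(bit_sample_hash(i)(x) == bit_sample_hash(i)(y) for i in range(d))
    return agree / d

# Illustrative bit vectors with D = 8 (assumed, not from the lecture).
x    = [0, 1, 1, 0, 1, 0, 0, 1]
near = [0, 1, 1, 0, 1, 0, 1, 1]   # differs from x in 1 position
far  = [1, 0, 0, 1, 1, 0, 0, 1]   # differs from x in 4 positions

print(hamming(x, near), collision_probability(x, near))   # 1 0.875
print(hamming(x, far), collision_probability(x, far))     # 4 0.5
```

The closer pair collides with probability 7/8 and the farther pair with probability 4/8, which is the gap between P1 and P2 that the definition asks for.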

What I will not prove, you can ask,
well, what about the hash family?

The other distance that we are interested
in is Euclidean distance, right?

In d dimensions, right?

And so what do you think would
be a good hash family for d... for...

Guess a locality-sensitive
hash for the Euclidean distance.

So what's a hash family that
takes points in d-dimensional

Euclidean space and, you know,
hashes them into smaller things?

What is the analog of sampling in... So
this was sampling one bit, right?

What is some other geometric analog of
sampling?

So the hint is you have seen this before.

Sample the number from 0 to 1.

No, no.

So here you are given some point,
right?

So you have some point p in R^d,
whose hash you want to compute.

So if you sample something from 0 to 1,
how would you relate it to this point p?

What is some way you
have seen where you take

high-dimensional points and
map them into smaller things?

Do you divide... Hyperplane.

No, the hyperplane, the
Johnson-Lindenstrauss.

So instead of selecting a bit at random,
which made sense for Hamming for bit

vectors, for points in Euclidean space,
you select a random projection,

you select a random hyperplane, and
you project points onto the hyperplane.

So the random projection

onto a, let's say a D' dimensional
hyperplane.

And this gives a locality-sensitive hash
family, whose parameters will depend on D'

obviously.

So the moral of the story is, if you're
trying to preserve locality while...

I mean, it's basically the same story for
both, right?

In the first part of this
class, you saw hash

functions that didn't
care about locality, right?

The original hash functions that we had
for the membership problem, I mean,

two input keys could be
very close to each other, but

their hashes could be very
far from each other, right?

We didn't care about close things going to
the same bucket or anything like that.

But now if you care about locality,
if you want close by things going to the

same location, then in the Hamming world,
you sort of sample bits and you get this.

And the Euclidean world, you project your
points onto a random D' dimensional

hyperplane, a random D' dimensional
hyperplane.

And this gives you... so
these two hash families

give you these two
results that I mentioned.

Just simply... go ahead.

So for Euclidean distance, the bucket is
what point on the hyperplane you land on?

Exactly.

Yes.

Isn't that like... aren't there like a lot
of points on the hyperplane?

Isn't that like a really... I don't...

There are a lot of points
on the... so... but I mean,

once the hyperplane has a
small enough dimension...

Yeah, yeah.

Then... so... I mean, I did
tell you the LSH families,

but the P1 and the P2
that really you see here...

So there's just one step
missing in how to use these LSH

families to get to the result
from the... on the previous page.

And... so let's say I find
a locality sensitive hash

family, but I am not too happy
with the value of P1 and P2.

Meaning... let's say that... the gap
between P1 and P2 is not large enough.

How would I get a better hash family with
a larger gap between P1 and P2?

Okay, so... so let me put it another way.

Here, if you took this
sampling hash family, this was

its value of P1 and this
was its value of P2, right?

And sure, P2, if you look at it,
it is smaller than P1, right?

Because P2 is 1 minus C R over D,
whereas P1 is 1 minus R over D.

But now let's say from
this, I want to build another

hash family where the gap
between P1 and P2 is larger.

Meaning, I want now the
same property, but with

a higher value of P1
and a smaller value of P2.

What should I do?

What is the trick?

So now I want close by points to hash to
the same bucket with an even larger

probability and far away
points to not hash to

the same bucket with an
even smaller probability.

Where have you seen
things like boosting... this

is somehow boosting
the good probability, right?

How do we boost good probabilities in this
course?

Do it multiple times?

Do it multiple times.

And that's the missing... that's the only
thing you... that's the only ingredient.

So here I told you one hash function.

In practice, what people will build is
they'll take a hash function G.

But G is nothing but the concatenation of
a bunch of these sampling hash functions.

So each h_i is, you know, from the
above LSH family.

And if you do this, then
you can boost your good

probabilities and decrease
your bad probabilities.

And now the question is, how many times do
you do it, right?

Like this K is how many times I'm
concatenating my hash functions in order

to boost my probability... the good
probabilities and decrease the bad probabilities.

And obviously K will depend on P1 and P2,
right?

How much is this...
what is the quality of the

starting hash family
before this boosting step?

And it turns out this K is
basically going to be one over C

or one over C squared for the
things that we are interested in.

So that's a... so when
we see a new distance, we

try to find a locality
sensitive hash family for it.

Once we have found the locality sensitive
hash family, we look at the P1 and P2.

And then we do this boosting.

So to answer your
question of what defines the

bucket, eventually the
bucket of a point, right?

So now we will take our
input set and we will apply

such a hash function G
to a point in the input set.

And we will not get one bit, but a bit
vector, where each of these coordinates is a bit.

So G will actually map my points in D
dimensions to points in K dimensions.

And that... that... the K dimensional
point is the bucket of an original point.

So that's the... that's a boosting step.

When combined with the locality
sensitive hash family, it gives us this...

these properties.
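
Here is a minimal sketch of the concatenation step for the bit-sampling family, with illustrative parameters (d = 100, R = 10, C = 2, K = 10) that are assumptions and not values from the lecture. One thing worth seeing in the numbers: since g(x) = g(y) only when all K sampled bits agree, both collision probabilities go down, to P1^K and P2^K, but the far-pair probability shrinks much faster, so the ratio between them grows. As far as I know, the standard construction then repeats this with several independent g's (several hash tables) to bring the close-pair success probability back up; that repetition is not shown here.

```python
import random

def make_g(d, k, rng):
    """Draw k coordinates uniformly (with replacement); g(x) is the tuple of those bits."""
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda x: tuple(x[i] for i in coords)

def collision_probability(dist, d, k=1):
    """Pr[g(x) == g(y)] when x and y differ in exactly `dist` of d coordinates."""
    return (1 - dist / d) ** k

# Illustrative parameters (assumptions, not values from the lecture).
d, r, c, k = 100, 10, 2, 10

p1, p2 = collision_probability(r, d), collision_probability(c * r, d)
p1_k, p2_k = collision_probability(r, d, k), collision_probability(c * r, d, k)

print(f"one h_i : P1={p1:.3f}  P2={p2:.3f}  ratio={p1 / p2:.2f}")
print(f"g, k={k}: P1={p1_k:.3f}  P2={p2_k:.3f}  ratio={p1_k / p2_k:.2f}")

g = make_g(d, k, random.Random(0))
x = [0] * d
bucket = g(x)   # the bucket label of x is a tuple of k bits
```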

So I'm not going to prove it, how we get
this.

But hopefully the power of allowing...
allowing approximations is clear.

Because with exact, we could do nothing.

But with approximate, we can actually get
much faster query times using this LSH family.

And the last thing... so maybe here is a
quick question for you.

So here we wanted
a C approximate hash

family... C approximate
nearest neighbor, right?

And I have points in D dimensions.

And let's say I was allowing you a
slightly more than C approximation.

C times one plus epsilon approximation.

What can you do?

Do I still need to work in the full
dimension?

So think of it this way.

Does this... does this problem... And this
problem only depends on distances, right?

It only cares about distances between
points.

And whenever I have a problem where I only
care about distances between points...

What do I do to reduce the dimension?

What was the name of the lemma that we saw
before?

Do people remember this?

The Johnson-Lindenstrauss lemma,
right?

It said if you have points in a high
dimensional space...

By projecting them into a lower
dimensional space...

Right?

You can basically fudge
up the distances, but only

by one plus epsilon or
one minus epsilon factor.

So in particular... For the Euclidean
version of nearest neighbor search...

I might as well assume that my
dimension...

Is like... Log N over
epsilon squared.

I don't need to go... I don't
need to build a data structure...

For a dimension higher than this...

Because if the dimension is higher than
this...

I'll just apply the Johnson-Lindenstrauss
lemma...

Fudge up my distances by one plus minus
epsilon factor...

And then work in this lower dimensional
space.

Right?

So... Whenever you read a paper about...
Nearest neighbor search in Euclidean...

Spaces... They will just assume
that the dimension is at most that.

Because you're anyway allowing
approximations, right?

So what's the point of working in the full
high dimensional space...

Where I can just lose a small
factor... One plus epsilon...

And get my dimension down.

So in Euclidean spaces...

Nearest neighbor search... The
highest dimension you will see is...

Log N over epsilon squared.

Right?
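
To make the dimension-reduction step concrete, here is a small pure-Python sketch of one common way to do a Johnson-Lindenstrauss style projection: multiply by a random Gaussian matrix scaled by 1/sqrt(k). The sizes d = 1000 and k = 200, the seed, and the function names are illustrative assumptions; the lemma itself only says that a target dimension on the order of log N over epsilon squared is enough.

```python
import math
import random

def random_projection(d, k, rng):
    """k x d matrix with N(0, 1/k) entries, so squared lengths are preserved in expectation."""
    scale = 1 / math.sqrt(k)
    return [[rng.gauss(0, 1) * scale for _ in range(d)] for _ in range(k)]

def project(matrix, p):
    """Map a d-dimensional point p to k dimensions."""
    return [sum(m * v for m, v in zip(row, p)) for row in matrix]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

rng = random.Random(0)
d, k = 1000, 200                        # illustrative sizes
p = [rng.random() for _ in range(d)]
q = [rng.random() for _ in range(d)]

A = random_projection(d, k, rng)
ratio = euclidean(project(A, p), project(A, q)) / euclidean(p, q)
print(ratio)                            # expected to be close to 1
```

The printed ratio is the factor by which the distance between p and q got "fudged"; a nearest neighbor data structure would then be built on the projected points instead of the originals.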

So... Let's see... This is...

This is all I wanted to say
about... Locality-sensitive hashing?

Yeah.

Oh... Okay.

So... So... One side point...
We had the membership problem.

Right?

What was the advantage that
a bloom filter had... Over the...

Algorithms for the membership problem?

Why did we use a bloom filter in the first
place?

There were so many elements to look for.

You just kind of wanted to have a
pretty good guess... Of whether or not...

You wanted to know if you had to look
back...

Or know if it wasn't there at all.

It might save you some time.

Was it save time?

Was the bloom filter there to save time?

Or there to save space?

Time.

But time?

I mean, all the... Like, cuckoo
hashing was constant time query, right?

Well, I'm sorry.

Maybe I'm... Maybe I'm confusing
something really elementary here.

But, like, I thought the example was,
like, you have books and library.

And, like, the library is massive.

And, like, you don't
want to have to waste so

much time checking if
a book is there or not.

Use a bloom filter.

You know for certain it's not there.

You don't waste your time looking for it.

But you still have the library hash.

Like, the library still exists.

You don't save... Yeah, yeah.

So, it was actually about the space.

Because, remember, the keys were from a
universe U.

And to store a key exactly,
you would need as many

bits as, you know, proportional
to the log of the universe size.

But the bloom filter had space that did
not depend on the universe.

It just depends on the number of keys that
you have and a log one over epsilon.

Does that make sense?

You don't need to store the whole book.

You just hash it into a small,
like, a bit vector.

So, the purpose of bloom filter was to
save space.

Because, query-wise, FKS hashing or cuckoo
hashing were constant time.

And a bloom filter is constant time.

You look at some bits in your bit vector.

So, the point of a bloom filter,
if you remember back, was to save space

from n log U to n log one over epsilon,
where there is no dependence on U.
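
A quick back-of-the-envelope version of that comparison, as a sketch. The numbers (one million keys, a universe of 64-bit keys, epsilon of 1%) are made up for illustration. The log2(e) factor, roughly 1.44, is the constant for a standard Bloom filter with the optimal number of hash functions; the lecture only states the n log(1/epsilon) order.

```python
import math

def exact_bits(n, universe_size):
    """Bits to store n keys verbatim: about n * log2(U)."""
    return n * math.ceil(math.log2(universe_size))

def bloom_bits(n, eps):
    """Bits for an optimally tuned Bloom filter: n * log2(1/eps) * log2(e)."""
    return math.ceil(n * math.log2(1 / eps) * math.log2(math.e))

n = 1_000_000                      # illustrative number of keys
print(exact_bits(n, 2**64))        # 64 bits per key
print(bloom_bits(n, 0.01))         # under 10 bits per key, independent of U
```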

And the same question kind of makes sense
here, right?

So, now, if I have a database,
it could be a large database.

And now, I could...

When a query comes, instead of actually
finding the nearest neighbor in a database

where I have all the points stored,
if I could have a bloom filter,

which could quickly answer, yes,
there is someone close to the query.

Then, I could then go to my database and
fetch that person.

Otherwise, if my query is
in the middle of nowhere,

I don't really have to
bother fetching anyone.

So, can the original
bloom filter fetch me, tell

me whether there is
someone close to the query?

If you look at the original bloom filter,
and when I come up with the query,

is there a way to find if someone is close
to the query?

No.

No.

So, then you can ask, is there a
distance-sensitive bloom filter?

Is there a bloom filter which,
when I give it a query, just says yes or

no, quickly, with less space, that there
is someone close to the query or not?

And this question was given to me
by the guy who invented cuckoo hashing.

He had asked me this question.

And then, with him, we built the bloom
filter for nearest neighbor search.

So, the paper is called distance-sensitive
bloom filters.

It's a bloom filter, but with these
distance guarantees.

So, you can build bloom filters for
nearest neighbor search too.

Good.

Let's see.

We are done.

Let us quickly talk about the syllabus
now.

So, here's the syllabus.

So, this will be basic probability.

Let me spell it out.

Markov, Chebyshev.

Also, I mean, very basic stuff like
calculating expectation and variance.

If you don't know variance, how are you
going to apply Chebyshev?

And how are you going to apply Chernoff?

So, that's what I mean by basic
probability.

Then, in the streaming algorithms,
I think we covered uniform sampling.

We covered counting, including this
exercise 4.9 about boosting.

And I saw Hanan in her midterm solutions
has also given the solution to exercise 4.9.

So, you know how to boost the Morris
counters.

Then, we have approximate median we
studied in the streaming algorithms.

You have seen a question on the mock final
about the approximate median.

We studied the count min
sketch, which was used to solve

the heavy hitters problem
by estimating the frequency.

We saw frequency moment estimation,
right?

Sum of the squares of the frequencies.

How do you estimate that?

And we saw the distinct elements problem,
right?

Counting how many distinct
elements there are, where we

were keeping track of the
minimum hash value seen so far.

So, I believe these are all the streaming
algorithms that we have seen.

Unless I'm missing something.

Hopefully not.

Then, we had the online algorithms.

Part of the course.

Where you saw a bunch of toy problems.

So, by toy problems, I mean the ski rental
problem.

The pizza finding problem.

Things like that.

Then, you had the list update problem.

Remember?

Where when you search
for an element, you're allowed

to place it anywhere
before its position for free.

And what... How do you
actually, you know, get something

that is competitive with
the optimal algorithm?

We saw the experts theorem.

Also, I gave the multiplicative weight
updates theorem.

This caching or paging, I only told you
about the guarantees.

So, this was remembered
about the... By the way, this, see?

Last thing.

So, in previous semesters, I would cover
more.

So, we did not cover this, this time.

So, caching paging, I
mostly showed you the

guarantees about the
LRU and the first in first out.

How they...

If you ask them to match, to have the same
memory as the optimal algorithm,

then you can only get a k-competitive
result.

Where k is the size of your cache,
which is a bad result.

But then there was the resource
augmentation type of result, where if you

allow my memory to be more than opt's
memory, then even though opt can look in

the future, I can somehow reasonably be
competitive to opt.

Then we saw, in the last couple of
classes, we have seen the

Johnson-Lindenstrauss lemma, and the
nearest neighbor search.

And look, in previous semesters,
I would also cover more.

So, we didn't cover the external memory,
multi-way merge sort.

So, really, for those who are saying it is
a lot, it was...

It was at least 20-30% more in the
previous semesters.

So, I guess I was a bit slow.

You said it was a graduate level.

No, no, no, no.

No, this is not graduate level.

Do you see the title of
my... This is still big data QC.

So, this was Queens College.

Graduate level is even more.

We cover more.

But I believe you said that in
the first half of the... For the first...

Everything from the midterm, we covered
more than you usually do.

Yeah.

So, I guess bloom filter.

Yeah.

So, you see in this part, there was bloom
filter.

So, I covered bloom filter
in the... Before the midterm.

So, yeah.

That's a weird thing.

Maybe the semester was paced.

I think also this time, I had the midterm
like a week later than I normally do.

So, that could have been it.

Well, it's a weird schedule.

Do you know that for
Tuesday, Thursday classes,

they're having class in
the middle of finals weeks?

We have our final on the 18th.

We have regular class on the 19th.

Yeah.

Same.

What do you mean?

So, we have two more classes?

No.

Because we're a Monday.

No.

Because we're a Monday, Wednesday class.

No.

Oh, yes.

Yes.

Oh, yeah.

We have Monday, Wednesday.

Oh, so my Tuesday, Thursday class is next
week too.

Yeah.

So, Tuesday, you do have a class for
Tuesday.

Of course, you wouldn't be told that by
the college in like an explicit manner.

You just kind of have to derive it from
the schedule.

Wait.

So, I have class next Tuesday,
also next Thursday or no?

Just Tuesday, I think.

Ah, okay.

Okay.

All right.

Yes.

So, this is your syllabus.

I'll upload it on Brightspace.

And then on Friday, we can do office
hours.

We can do Friday a couple of hours.

And also, maybe some people would prefer
Sunday.

Because if you're working late,
whatever.

Last-minute thing.

So, maybe I'll keep a couple of hours on
Friday and a couple of hours on Sunday.

And we'll do it the same
way where the link that

I had sent you for the
midterm office hours.

You just put your name in and book a slot.

Write the question that you would want to
be discussed.

So that I don't have to repeat the same
answer.

And you don't have to wait.

So that everyone knows what is being
discussed.

And then you can show up at the slot time.

All right.

I'm going to stop the recording here.

We have the scribes for today's lecture.

Did we have scribes for today?

Or not?

Me enough, yo.

Okay.

Yeah.

So then...

So that's that.

All right.

Yeah.

So, questions?

Or I'll just see you on the final then,
next week.

Professor, can you post the recordings?

Post the what?

Oh, yeah.

Recordings after May 4th.

Because I don't see any recordings after
May 4th.

You don't see any what?

Recordings?

Recordings after May 4th.

Yeah.

Recordings after what?

I can't hear you.

May 4th.

May 4th.

May 4th.

May 4th.

Well, you mean then the uni class was
Mondays, right?

May 4th was last Wednesday.

So, do you mean the
only recording... May 4th.

May 4th was Monday.

So, May 4th, you uploaded only May 4th.

Oh, I didn't upload May 6th.

Yeah.

Because I don't see recording for May 6th
and May 11th, Monday.

All right.

So, anyway, yeah.

So, all the recordings will be uploaded
in the next half an hour or so.

Also, I think there's one recording
missing, which is lecture 20.

I don't know the exact date yet,
but this one...

So, on that day, we know about our grades.

So, on that day, you did not upload the
recording for that.

I didn't see on the announcement.

Oh, so that means... So, that was
when I was in Denmark, I remember.

So, I can figure out the dates.

Okay.

I'll see if there's any missing recording.

Normally, I do upload the recordings,
but okay.

So, one recording missing and the last,
including today, the last three recordings.

Right?

So, four in total.

Yes.

All right.

Yep.

I'll try to see once this one is set up.

So, mock final is on the...

Overleaf.

And... Yeah.

That is that.

So, it was...

It was very nice.

And hopefully, you guys will...

Don't get stressed.

As I said, the basic fundamentals
are what are most important.

So, just understand...

First thing, understand the problem
statement.

Right?

You've...

Like, from the midterm, if there's one
thing that you have learned, And I have

learned is... There
were these bonus...

Like, free points.

That you guys didn't get, on stupid questions.

Like, what's the membership problem?

And if you can't describe what the problem
is, Then...

Why would anyone ask you about the
solution?

So, those are free points.

Like, what's the problem?

What is it used to solve?

Or what's the algorithm?

The analysis comes later.

Right?

So, that's like the first BFS type.

Right?

Do a BFS traversal of these topics.

Right?

If... If you only had two minutes to explain
it to someone, How would you explain it?

And then, if you had five minutes to
explain it to someone.

And then, if you had half an hour to
explain it to someone.

But do the two minute
and five minute things

before going into
the half an hour thing.

Does that make sense?

Yes, professor.

All right.

Then, class is dismissed.

See you guys on Monday.

I'll be here for a couple of minutes if
people have any questions.