Algorithms for Big Data.

This is May 11th.

We had stopped last time.

We were talking about this, the paging
problem.

If you remember, you have a cache which
can only store a few items.

And then, but you get one by one,
you get requests for different items.

And at some point, you have to decide who
to kick.

And so the paging problem was how to
decide who to kick out.

We saw a bunch of heuristics.

We had the least recently used, the
least frequently used, first in first out.

And I told you that if you know what the
sequence is going to be, then the best you

can do is this farthest in future.

Right?

Which is whenever you have to decide who
to kick out from your cache.

You look at the sequence.

You look in the future.

And you look at the element in your cache
that you will need farthest in the future.

So that's the optimal algorithm that knows
the future.
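
The farthest-in-future rule described above can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the function name `farthest_in_future_misses` and the list-of-requests input format are assumptions made for the example.

```python
def farthest_in_future_misses(requests, k):
    """Count cache misses of the offline farthest-in-future rule (cache size k)."""
    cache = set()
    misses = 0
    for t, item in enumerate(requests):
        if item in cache:
            continue  # cache hit, nothing to do
        misses += 1
        if len(cache) >= k:
            # Next time a cached item is requested (infinity if never again).
            def next_use(x):
                for j in range(t + 1, len(requests)):
                    if requests[j] == x:
                        return j
                return float("inf")

            # Kick out the item whose next request is farthest in the future.
            victim = max(cache, key=next_use)
            cache.remove(victim)
        cache.add(item)
    return misses


example_requests = [1, 2, 3, 1, 2, 4, 1, 2]
print(farthest_in_future_misses(example_requests, k=3))
```

The inner scan makes this quadratic in the sequence length, which is fine for illustration but not for long request sequences.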

But when you talk about online algorithms,
we don't know the whole sequence, right?

We only know what we have done so far.

And for that, we have algorithms like
least recently used or first in first out,

right?

So least recently used, as it says here, you evict
the item that was used least recently.

Now, what do we know about these two
algorithms?

We know least recently used and first in
first out are k-competitive.

So what does k-competitive mean in simple
English?

What does this result mean?

That an algorithm is k-competitive?

What does that mean here?

Like if I say an algorithm for the paging problem is
k-competitive or is 2-competitive, what is that?

The number of mistakes is at most k times
that of the optimal algorithm.

Right.

And when you say number of mistakes,
you mean number of cache misses, right?

Yes.

Right.

So it's the number of times there is a
request for an item that is not in cache.

That's a cache miss.

And that's what we are trying to minimize.

So that's what the guarantee means here.
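
To make the k-competitive guarantee concrete, here is a small sketch that counts cache misses for least recently used and for the farthest-in-future optimum on the same request sequence. The helper names and the cyclic test sequence over k + 1 items are assumptions chosen for illustration; that sequence is a standard bad case for LRU.

```python
from collections import OrderedDict


def lru_misses(requests, k):
    """Count cache misses of least recently used with cache size k."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for item in requests:
        if item in cache:
            cache.move_to_end(item)  # item is now the most recently used
            continue
        misses += 1
        if len(cache) >= k:
            cache.popitem(last=False)  # evict the least recently used item
        cache[item] = True
    return misses


def opt_misses(requests, k):
    """Count cache misses of the offline farthest-in-future rule."""
    cache = set()
    misses = 0
    for t, item in enumerate(requests):
        if item in cache:
            continue
        misses += 1
        if len(cache) >= k:
            def next_use(x):
                for j in range(t + 1, len(requests)):
                    if requests[j] == x:
                        return j
                return float("inf")

            cache.remove(max(cache, key=next_use))
        cache.add(item)
    return misses


k = 4
requests = [i % (k + 1) for i in range(100)]  # cycle through k + 1 items
lru = lru_misses(requests, k)
opt = opt_misses(requests, k)
print(lru, opt, lru <= k * opt)
```

On this sequence LRU misses on every request, while the optimum misses only about once every k requests, so the ratio approaches k.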

And obviously, k is the size of the cache.

And so a factor k in front of a guarantee
is a terrible guarantee, right?

I mean, if k is whatever, two gigabytes,
then this is saying that the number of

times we have a cache miss is at most,
I don't know, 2,000 times the optimum.

That's a pretty bad guarantee.

But it turns out you can't really improve
on that.

But you can improve if you cheat a little
bit.

So that is what I was calling this
resource augmentation.

And think of it as, well,
clearly, optimum is given too

much power here because
it is allowed to see the future.

So our hands are tied in
the sense, we are trying

to match optimum with the
same amount of memory.

And that's why we can only get
lame results like k-competitive.

However, given that optimum is allowed to
look in the future.

Can we somehow catch up to this fact by
increasing our memory?

So now what we will do is we will allow
opt to have a smaller cache than us.

So we have a cache of size k.

But optimum, we will only allow it a cache
of size k prime, which is smaller than k.

So in other words, we have extra
resources.

And the question is, now can we somehow
come close to opt?

And the result that was proven is that
least recently used and first in first out
are k over basically k minus k prime
competitive.

In other words, let's say k prime is
roughly half of k.

Then what is k over k minus k prime?

Well, if k prime is half of k,
then k minus k prime is also half of k.

And k over half of k is 2.
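
The arithmetic just described, written out. The symbols k and k' are the lecture's; the remark about the exact bound is an addition: the bound usually stated in the literature (Sleator and Tarjan) is k/(k - k' + 1), which the "basically" above appears to round to k/(k - k').

```latex
\[
  \frac{k}{k - k'} \;\Big|_{\,k' = k/2}
  \;=\; \frac{k}{k - \tfrac{k}{2}}
  \;=\; \frac{k}{\tfrac{k}{2}}
  \;=\; 2 .
\]
```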

So now you get that these algorithms are
2-competitive if you allow the algorithm to have
roughly double the cache of the optimum.

So another way to state this result is
that you can almost match optimum's ability
to see into the future by making your
cache twice the size of optimum's.

Does this result make sense?

Questions?

All right.

So I will not prove this result.

And one other result I want to state is...

Actually, let's...
I mean, it's...

You see, compared to this
one, where we had proved

that LRU and first in
first out are k competitive.

Is the algorithm LRU randomized or is it
not randomized?

Look at the algorithm LRU.

Does it use randomness?

Does it use randomness?

Or first in first out?

Do people understand what the algorithm
is?

Wait.

You guys can hear me, right?

Yes, we can hear you.

Okay, okay.

Just making sure I don't like lose
connection or something at my hotel.

So...

So look at LRU.

Does it toss any coins or no?

No.

No, there's no randomness.

No, right?

Because you just...

Whenever you have to evict someone,
you look at whoever is in your cache.

You look at the last time they were used.

You know the last time they were used
because you have all the sequence so far.

And based on that you decide.

So there's no randomness in least recently
used or first in first out, right?

First in first out is also pretty obvious,
right?

Whoever came first in, that's who you kick
out.

So now you can ask, well, this only gave
us k-competitive.

If we allow randomness, what can we do?

Can we do better?

It turns out actually you can do much
better.

So if randomness is allowed, then we have
a randomized algorithm.

That is roughly log-k-competitive.

So you see, instead of a k-competitive, you
get a much better log-k-competitive factor.

So many times there are problems where,
you know, if you use deterministic,

if you don't allow randomization, you
are doomed to a bad competitive factor.

But if you allow
randomization, then you can

get what we call
exponentially better, right?

Because from a k, it drops to a log-k.

So that's a big improvement in the
competitive factor.
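
The lecture does not say which randomized algorithm it has in mind. One well-known algorithm with an O(log k) competitive ratio is the randomized marking algorithm, sketched below as an illustration only. The function name, the `rng` parameter, and the requirement that items be sortable (so that a fixed seed gives reproducible runs) are assumptions of this sketch.

```python
import random


def marking_misses(requests, k, rng):
    """Count cache misses of the randomized marking algorithm (cache size k)."""
    cache = set()
    marked = set()  # always a subset of cache
    misses = 0
    for item in requests:
        if item not in cache:
            misses += 1
            if len(cache) >= k:
                unmarked = cache - marked
                if not unmarked:
                    # Every cached item is marked: a new phase begins.
                    marked.clear()
                    unmarked = set(cache)
                # Evict a uniformly random unmarked item (this is the coin toss).
                victim = rng.choice(sorted(unmarked))
                cache.remove(victim)
            cache.add(item)
        marked.add(item)  # a requested item is marked until the phase ends
    return misses


k = 4
requests = [i % (k + 1) for i in range(100)]
print(marking_misses(requests, k, random.Random(0)))
```

Unlike LRU, the eviction choice here depends on random coin tosses, so the number of misses varies from run to run unless the seed is fixed.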

And there's many
such problems that with

randomization, we can
show that actually it's...

it drops down the competitive ratio.

Do these results make sense?

The statement of the results, at least?

What does G2 stand for?

Like...

I was just trying to say there were bad
results.

So I think B1 and B2 were bad results.

Like bad news.

And G1 and G2 was good news.

So G is for good.

Okay.

Thank you.

So bad news was that these algorithms are
k-competitive.

And that no algorithm can be better than
k-competitive.

If you look at this B1 and B2.

But the good news is, if you allow us to have
double opt's cache, then we can be 2-competitive.

Or if you allow us to have randomization,
then we can be 2 log-k-competitive.

And both of these are much smaller than
k-competitive.

Sorry, I have a question.

Sure, sure.

So you said that, like,
we could basically cheat

using the fact that we
have on the top, right?

By setting, like, k-prime as small as we
want.

So what's preventing us from doing that?

Well, I mean, it's...

At some point, you're just... I mean,
the result stops making sense also, right?

So what this result says is that...

Because opt is allowed to look in the
future, how much resources do you need in

order to sort of be, you know,
comparable to it?

Now, obviously, if you
allow yourself an infinite

cache, then there is no
problem to be solved, right?

Then you'll never have a cache miss.

Right?

You can just keep everything in memory all
the time.

So what this result is saying is,
how much do you really need to expand your

cache in order to match up to a reasonable
extent?

I see.

Okay.

Thank you.

All right.

So...

So that roughly brings to an end whatever
I wanted to tell you about online algorithms.

And there is this other topic about splay
trees, which I will not have time to get

into, because I want to tell
you guys about something

that is perhaps more
useful in today's world.

Professor, is there some people waiting to
be admitted?

Oops.

Let's see.

Yes.

You were right.

There was one person.

So by the way, I cannot monitor chat.

So...

Don't say anything in the chat,
because I only have one machine.

So...

If someone sees anything in the chat,
just speak up.

Thanks.

Okay, let's see.

The next topic I wanted to mention...

So the two topics I
want to cover today are

dimension reduction and
nearest neighbor search.

And these will hopefully be the...

These will be the last topics that we will
cover in this course.

And...

The point is they are
very useful in machine

learning, in high
dimensional data statistics.

So...

Let's first talk about dimension
reduction.

Right?

I mean, the name should...

Should somehow make you...

Who doesn't want to do dimension
reduction?

Right?

If you have very high dimensional data,
and I can somehow reduce the dimension,

and still keep most aspects of the data,
that will make my algorithms faster.

That's sort of the...

The general idea behind...
Dimension reduction.

So first, before I go into dimension
reduction, a quick... Reminder about...

Maybe some high school geometry.

In two dimensions, which I will write as
R2.

If I give you two points, P1 and P2.

And so a point in R2 is...

You know, I give you both its coordinates.

The x-coordinate and the y-coordinate for
P1.

So I'm calling that x1, y1.

And the x and y-coordinates for P2,
I'm calling them x2, y2.

Then, people know the distance between P1
and P2 is given by this formula?
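
The formula on the slide is not captured in the transcript. Presumably it is the standard Euclidean distance in the plane, written here with the lecture's names for the coordinates:

```latex
\[
  d(P_1, P_2) \;=\; \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
\]
```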

Yes?

Everyone knows this.

This is called the L2 distance.

Okay?

This is...

Because you're squaring, adding,
and then taking the square root.

It's called the L2 distance.

You could do anything weird.

You could pick any P greater than 2.

You could subtract and raise them to the
Pth power, and then take the 1 over P power.

For example, you could cube, sum,
and then take the cube root.

That's called the Lp distance.
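
A minimal sketch of the Lp distance just described. The function name `lp_distance` is made up for this example, and the absolute value is an addition that the spoken description leaves implicit: it is needed so that odd powers such as the cube do not produce negative terms.

```python
def lp_distance(p, q, power):
    """Lp distance between two points given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Subtract coordinate-wise, raise to the given power, sum, take the root.
    total = sum(abs(a - b) ** power for a, b in zip(p, q))
    return total ** (1.0 / power)


print(lp_distance((0, 0), (3, 4), 2))  # the L2 distance
print(lp_distance((0, 0), (3, 4), 3))  # cube, sum, cube root
```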

For now, let's just look at the L2
distance.

Hopefully, everyone is comfortable with
this L2 distance.

Okay?

And now, let me just make a move.

So now, suppose the input is...

So, imagine some high dimensional machine
learning problem.

So, you have some data set of n points in
Rd.

Okay.

Maybe I should talk about Rd.

So, these are n points.

Let me call them P1, P2, up to Pn.

So, what does a point P in Rd look like?

How do I express a point P in Rd?

By the way, Rd is D-dimensional Euclidean
space.

Just like... Would it be a
tuple with D elements?

Correct.

So, it would be a tuple with D elements,
yes.

So, it will be like...

This is the first coordinate,
that's the second

coordinate, and that's
the Dth coordinate.

That's a point in Rd.

If I give you two points in Rd,
P and Q, I specify their D coordinates.

Then, what is the distance between P and
Q?

Just the previous formula that you saw.

How people have...

Did you guys ever compute this?

The square root of the summation of P i
minus A i?

Squared.

Oh, Q i, yeah.

So, P i minus Q i squared sum over i from
1 to D.
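
Written out, the formula being recited is the following. The parenthesized superscript for the i-th coordinate is a notational choice for this note, made to avoid confusion with powers.

```latex
\[
  \lVert p - q \rVert_2 \;=\; \sqrt{\sum_{i=1}^{d} \bigl(p^{(i)} - q^{(i)}\bigr)^2}
\]
```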

Right?

Okay, good.

Now, suppose...

This is your Rd.

And I give you this point set,
right?

This P1...

And by the way, this is...

So, when I use subscript, I'm using it for
the different points.

And when I do superscript, I'm using it
for the different coordinates.

Right?

So, that's P1, P2, Pn.

Is the difference between the subscript
and the superscript clear?

I'm sorry, can you repeat it?

I have subscripts here.

And superscripts here.

So, I'm giving you N points in D dimensions,
where every point has D coordinates.

Right?

That's all.

That's all I've done.

So, the different points are subscripted.

And for one point, its D coordinates are
superscripted.

Right?

That's what I want to say.

So, now, imagine you have these N points
in D dimensions.

And this is some, you know, high
dimensional data set.

Now, the problem is, every time you work
with this data set.

So, for example, there is something called
the curse of dimensionality.

Which kind of says that all the
algorithms...

Okay.

So, for this input data set, what are the
two parameters?

That describe this data set.

Like, when we say we
want a fast algorithm, it

should be fast in terms
of which two variables?

N and D?

N and D.

Correct.

However, most algorithms that we will
have, right?

They will maybe be polynomial in N,
but they will be exponential in D.

So, they'll be like, I don't know,
N squared times 2 to the D.

Which is pretty good if the dimension is
small, right?

In 2D, who cares about 2 to the 2,
right?

Or in 3D, who cares about 2 to the 3?

So, in small enough dimension,
when D is small, this is fine.
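
To see how quickly the exponential term takes over, here is a small sketch that evaluates the lecture's example cost of N squared times 2 to the D. The function name `hypothetical_steps` and the sample values of N and D are assumptions for illustration; no specific algorithm is being modeled.

```python
def hypothetical_steps(n, d):
    """Step count of a made-up algorithm costing n^2 * 2^d operations."""
    return n ** 2 * 2 ** d


n = 1000
for d in (2, 3, 50, 256):
    steps = hypothetical_steps(n, d)
    # Print the number of decimal digits, since the values get very long.
    print(f"d = {d:3d}: about 10^{len(str(steps)) - 1} steps")
```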

And that is why, until the advent of big
data, computer scientists were okay with

such algorithms, right?

Because the data was typically in small
dimensions, and N wasn't too big either.

However, now, like, you know, D can be
really, really large.

Like, what's an example where D is large?

Is it 50?

10 or 50?

No, no, like, what's a real world example?

When D would be large.

Yes, 10 or 50 would already be too large
for this.

But you guys know about image data,
right?

So, every image, if you
have a 16 by 16 pixel image,

people would represent it
as a vector of length 256.

Right?

The intensity of every pixel, I can put it
in a row.

If I have a 16 by 16 pixel image,
which is like nothing, right?

That's not even a high resolution image.

I can represent it as a vector of length
256.

So, just some image data can be
represented as points in 256 dimensions.

Does this make sense?

256 is 16 times 16, if I'm not mistaken.

Right?

I mean, instead of, all I'm saying is,
instead of representing an image as a

matrix, you just represent it as a long
row.
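
A minimal sketch of the flattening just described, with an image stored as a list of rows of pixel intensities. The function name `flatten_image` and the synthetic 16 by 16 image are assumptions for the example.

```python
def flatten_image(image):
    """Write the rows of a pixel matrix one after another as one long vector."""
    return [pixel for row in image for pixel in row]


# A synthetic 16 x 16 image with intensities between 0 and 255.
image = [[(r * 16 + c) % 256 for c in range(16)] for r in range(16)]
vector = flatten_image(image)
print(len(vector))
```

The pixel at row r and column c ends up at position r * 16 + c of the vector.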

So, dimension easily goes into the
hundreds, even if you're talking about bad

quality image data.

Does this make sense?

I mean, a 16 by 16 image, has like these
intensities, right?

And I'm just converting it into a 256
dimensional vector.

And so, I have a huge data set where n is
large, in, you know, very high dimensions.

And now this algorithm is going to be,
is not going to finish in my lifetime.

So, what do we do?

Do people understand the sort of the
importance of this, the difficulty here?

Do people understand why d can be in the
hundreds very easily?

Okay.

Questions?

Okay.

So, in comes a technique called dimension
reduction.

And what does dimension reduction do?

It basically reduces your dimension from d
to something which is much smaller than d.

And I mean, what will be the property of
this?

So, think of the dimension reduction as a
map f.

And what will be the property of this map
f?

That I can apply f to every point.

Right?

So, I can...

So, first of all, these points will live
in a much smaller dimension.

They will live in some dimension d prime,
which is much smaller than d.

And when I apply this function f to my
original data set, p1, p2, pn,

I will get n points in this lower
dimension.

f of p1, f of p2, and so on until f of pn.

And what do I really want from this
transformed point set?

I want it to preserve distances roughly
equally.

So, the theorem, which I don't know if I
have here.

Yes.

So, here is the theorem.

And it's called the Johnson-Lindenstrauss
lemma.

But, I mean, it's become much more than a
lemma by now.

It's used everywhere.

So, let's read this theorem.

So, for any epsilon between 0 and 1 and
any n greater than 1, let d prime be such

that d prime is at least log n over
epsilon squared.

Okay.

So, this d prime that I was telling you
the low dimension, it will roughly be log

n divided by epsilon squared, where you
choose epsilon.

But, you see, log n is a much smaller
number, right?

Log n is typically very small.

So, if you choose this dimension,
then for any set S of n points in d

dimensions, there is a function f,
which takes points in the higher d

dimensional space, and spits out points in
the lower d prime dimensional space,

such that if you look at any two points in
your data set, x and y, then if you look

at how much the images, the distance
between the images, divided by the

distance between the input points,
or in other words, the distortion,

it's between 1 minus epsilon
and 1 plus epsilon.

In other words, maybe this last line will
make more sense.

So, look at the last line.

What is the thing in the middle of the
last line?

What does this mean?

I mean, if you want, you can think of it
as the distance between f, the L2 distance

between fx and fy.

So, does the very last line make sense?

I'm sorry, can you repeat it again,
please?

Does the very last line that I have
written here make sense?

So, stare at the very last line in red,
this one.

f is a function that maps high dimensional
data set to a low dimensional data set in

such a way, so that for any two points x
and y in your high dimensional data set,

if you look at the distance between the
mappings under f, it's sandwiched between

1 minus epsilon times the distance between
the original points, and 1 plus epsilon

times the distance between the original
points.
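
The slide with the theorem is not captured in the transcript. Based on the spoken description, the statement presumably reads as follows; S is the set of n input points. The usual textbook statement carries a constant factor in front of the bound on d', which the lecture omits.

```latex
\[
  f : \mathbb{R}^{d} \to \mathbb{R}^{d'},
  \qquad d' \;\ge\; \frac{\log n}{\varepsilon^{2}},
\]
\[
  (1 - \varepsilon)\,\lVert x - y \rVert_2
  \;\le\; \lVert f(x) - f(y) \rVert_2
  \;\le\; (1 + \varepsilon)\,\lVert x - y \rVert_2
  \qquad \text{for all } x, y \in S .
\]
```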

So, what have I done?

I have reduced the dimensionality of my
data set, while still roughly preserving

all distances between my input points in
the data set.

Sorry, Professor, I have a kind of
unrelated question, if that's okay.

Yes, yes.

Does this lemma also apply to different
norms?

Because I know here we're using the two
norm.

Does it also apply
to different... No.

So, yeah.

So, unfortunately, it turns out that for other norms,
there is no analog of the Johnson-Lindenstrauss lemma.

In fact, you can show that there is no
such map.

So, people have studied this question for
norms that are not the L2 norm.

And basically, this is
kind of the only norm

for which you can do
dimension reduction.

I see.

Thank you.

But is the statement of the theorem clear
to people?

Yes, that makes sense.

Okay.

So, to everyone, if the original dimension
is D of my data set, what dimension does

this function F map my data set into?

D prime.

What is D prime?

And how is D prime defined in terms of the
original point set?

So, I have given you n points in D
dimensions.

How would you apply the Johnson-Lindenstrauss
lemma?

What is the lower dimension now?

Log n over epsilon square.

Right.

And does log n over epsilon square depend
on D?

No.

No.

So, what has happened is, no matter how
large a dimension your input points are

in, you have mapped
them into a dimension that

does not depend on
the original dimension D.

It depends on n, the number of points.

But on n, it depends very gently,
in that you are taking a log of it.

When you take log of a number,
it makes that number very small.

And it depends on epsilon, but you get to
choose the epsilon.

And the way you choose the epsilon will
decide this guarantee in the last line.

So, you map the input points from very
high D dimensions to something that

doesn't depend on D, like log n over
epsilon square dimensions.

So, that now your transformed data set...

So, you originally had n points,
so you will still have n points.

It's not that your number of points has
reduced.

It's that the dimension of every point has
reduced.

But the dimension hasn't just reduced like
in a stupid...

I mean, there is one way to reduce the
dimension, right?

Map every point to zero.

Right?

Then, great.

You have reduced the
dimension to nothing, but you have

lost all the information
about your original point set.

With this map, you are
still keeping almost all

distances up to a 1
plus minus epsilon factor.

That's what the last line is saying.

That if you look at the, now the
distances...

So, the middle term is the distance
between fx and fy, right?

So, that's the...

The new image of x and the new image of y
in the smaller dimension.

So, in the small dimension, when I
calculate the distance between two images,

they are pretty close
to the original distance

between those two
points in the data set.

Does the statement of the lemma become
clear to people?

Any questions about the statement of the
lemma?

Or what it means?

Or like why it could be useful at all?

Can I ask about dimensions, not lemma?

Yes.

Why can we get up to 100 dimensions?

Say that again?

Why can't we?

You said we can get up to 100 dimensions.

Why?

I said we can or we cannot.

We can.

Yes.

The question is, why we can?

Because...

When you take a photo from your
smartphone, how many pixels is it?

Like the...

Currently, I don't even know.

The cameras are what?

How many megapixel are the cameras?

I don't know.

8, 9?

Does that number make sense?

How many megapixel is your smartphone
camera?

I don't know what mine is.

50 megapixel.

50 megapixel.

And I'm guessing a megapixel sounds like a
thousand pixels.

Yeah?

I mean, maybe even if it's 100 pixels.

So, I mean, your phone is full of images,
each of which is an image of, I don't

know, 50,000 pixels here and 50,000 pixels
here.

And each pixel has some intensity value.

That's what an image is.

It's a bunch of pixels.

Each pixel has an intensity value.

Is this making sense or no?

Yes.

And now the way to
represent one image is just

write down these things
row by row in one long row.

So that will be 50,000 times 50,000.

Right?

The size of this matrix is 50,000 squared.

Right?

So to represent this matrix, I can
represent this matrix in a long vector of

length 50,000 square.

So now, my phone gallery is a bunch of
vectors, each living in this dimension.

Does that answer your question or

no?

No.

So we call this... Where
are these vectors to live?

We call this dimension.

Repeat your question.

I didn't hear you.

I think I just misunderstood what
dimension is.

Thank you.

Dimension is how many coordinates you need
to represent your data.

So if your data is a
bunch of vectors, then the

length of those vectors is
the dimension of your data.

And the number of the vectors is n.

Okay.

I got it.

Thank you.

Right?

So think of n as the number of images and
d as 50,000 times 50,000.

So now does the statement of the
Johnson-Lindenstrauss lemma make sense?

You are reducing the
dimension by a lot while

still approximately
preserving distance.

And the smaller the epsilon
you choose, the larger d

prime will be, but the
better your guarantee will be.

So it depends on what approximation
epsilon you're willing to live with.

I mean, if you're okay with epsilon equal
to one, meaning you're okay with distances

being at most doubled.

Or, you know, at least halved.

But then you can get away with roughly
log n dimensions.

Does this make sense?

If I'm okay with having
my distances at most

doubled, what value of
epsilon am I okay with?

Can you repeat the question?

I'm sorry.

If I'm okay with my
distances being doubled,

then what does that
mean in terms of epsilon?

What value of epsilon am I okay with?

Half?

No, right?

Look at the right-hand side.

One?

Yes.

Why?

Because one plus epsilon becomes one plus
one.

But one minus epsilon will be zero.

One minus epsilon will be zero,
yes.

So all I will guarantee is there's no
lower bound.

But I will guarantee that my distance is
never more than double.

They could shrink arbitrarily.

But they don't more than double.

So yes, the shrinking arbitrarily maybe is
a bad thing.

So then maybe a good thing to use is
epsilon equal to half, let's see.

So what does epsilon equal to half mean?

So epsilon equal to half would mean that
my distances

between the new points is at least half of
the original distance and at most three

halves of the original distance.

right?

Just like a 50% error.

So if I'm okay with epsilon
equal to half, then my

dimension goes down
from D to basically 4 log N.
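
The calculation of the target dimension can be sketched as below. It uses the lecture's simplified bound of log N over epsilon squared; the function name `jl_target_dimension`, the choice of natural logarithm, and the constant factor of 1 are assumptions of this sketch, since the lecture does not fix the base or the constant.

```python
import math


def jl_target_dimension(n, epsilon):
    """Smallest integer d' with d' >= ln(n) / epsilon^2 (simplified bound)."""
    if n <= 1:
        raise ValueError("need more than one point")
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must be strictly between 0 and 1")
    return math.ceil(math.log(n) / epsilon ** 2)


# With epsilon = 1/2 the bound is 4 * ln(n), whatever the original dimension.
for n in (1_000, 1_000_000, 1_000_000_000):
    print(n, jl_target_dimension(n, 0.5))
```

Note that the original dimension D is not an argument of the function at all.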

Which could be much smaller than D,
right?

Again, because log is an...

log just brings down the number
exponentially.

So log of N is the number of digits in N.

Right?

So epsilon, it's always between 0 and 1.

It's never gonna be...

Sometime it can be 1 or 0.

No, so I guess epsilon is always between 0
and 1.

Strictly between 0 and 1.

Epsilon equal to 1 you can get easily
because then you don't care about

distances, you can just map everything to
0.

So epsilon equal to 0 and epsilon equal to
1 are trivial maps.

Thank you.

But does everyone see how you get this D
prime?

Right?

So in the exam, if you are told how many
dimensions you can afford, then you can be

asked to recalculate what's the best
distortion that you can get.

Right?

So I can say I have some input that's in D
dimensions where D is, I don't know,

a million and N is, I don't know,
5 billion.

And I have an algorithm that only works in
200 dimensions.

What is the best distortion possible for
this dimension reduction?

I'm sorry, can you repeat your question?

I'm sorry.

I will give you N, D, D prime,
calculate epsilon from that.

Calculate the best epsilon that you can
get from values of N, D and D prime.
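
Going the other way, from an affordable dimension back to the best epsilon, amounts to solving d' = log N / epsilon squared for epsilon. The sketch below does this under the same assumptions as the simplified bound (natural logarithm, constant factor 1); the function name `best_epsilon` is made up for the example, and the numbers are the ones mentioned above.

```python
import math


def best_epsilon(n, d_prime):
    """Solve d' = ln(n) / epsilon^2 for epsilon (simplified bound)."""
    return math.sqrt(math.log(n) / d_prime)


n = 5_000_000_000   # 5 billion points
d_prime = 200       # the algorithm only works in 200 dimensions
print(best_epsilon(n, d_prime))
```

The original dimension D (a million in the example) plays no role in the calculation.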

But again, if you
understand the theorem, then

there is nothing really
mysterious in the question.

So again, just stare at this theorem and
tell me if there's any questions about

what this theorem is doing, what it is
saying.

Would you give us the formula on the test
or no?

Which formula?

The one in purple.

No, because the formula is...

No.

Because the formula,
remembering the formula

means that you're already
doing the wrong thing.

The formula, the last line is the...

is the... is the guarantee that you
should... that you need to know.

But the guarantee almost follows from the
purpose of this theorem.

So the formula is meaningless,
actually.

Remembering the formula is meaningless.

So I think you guys are getting lost in
the formula.

And not realizing what this theorem is
doing, perhaps.

It's a way to reduce the
dimension of a bunch of points in...

in very high dimensions.

It's a way to reduce the dimension while
still preserving distances roughly equally.

And once you know you're preserving distances
roughly equally, the last formula...

it just trivially follows from that.

Professor, in the last line, in the
middle, should it be d prime?

Or am I getting it wrong?

No, no, no, no.

d is not the dimension here.

Or maybe that's your...

So when...

when I say this d, I mean the distance.

Distance in... in u...

when we already put it through the
function, yeah?

In the middle, yes.

On the sides is the original distance.

Did that answer your question?

Yes, thank you.

So in the exam, do we choose a good value
for epsilon?

No, no, no, no.

I think... I think you guys are too far away
from thinking about the exam about this.

First, I need you to understand the
theorem.

Forget about the exam.

First, tell me what you guys understand
about the theorem.

So who can explain what this theorem is
saying?

In simple English to the rest of the
class.

Anyone want to give it a shot?

The goal is to take high-dimensional data
and make it smaller while still keeping

the important distance.

Correct, yes.

That's the rough goal, yes.

Now let's go into one more level of
detail.

What do you mean by that?

So that was the rough goal, yes.

Now let's go into one more level of
detail.

How many... What is the data?

What is the original data?

And what is the transformed data?

And what is the guarantee?

So let's say, what is the original data
that we have in this problem?

Like, what is the input to...

What is the input to the
Johnson-Lindenstrauss lemma?

To this function f?

N?

Is it just the number N?

Yes.

It's N points from the high dimension?

Yes, exactly.

Yes.

So in other words, N vectors in D
dimensions, right?

And so you give it these N vectors in D
dimensions.

And what will it return to you?

This map f?

When you apply this map f, what will you
get?

You will get...

N points in lower dimension...

You will get what?

N points in lower dimension D prime.

Correct.

So again, N points meaning N vectors,
right?

In lower dimension D prime.

Where D prime will be what?

Roughly?

Log N over epsilon square.

Correct.

So everyone sees that the dimension has
been reduced because log N over epsilon

square does not depend on D.

The original dimension D.

So the original dimension D could have
been 50 million.

But log N over epsilon square does not
depend on D.

It does depend on N, but there's a log.

So hopefully much smaller.

And now, what do I know
about these N points in

D prime dimensions, in
the smaller dimension?

So I've transformed my data set into a
much smaller dimension, right?

And what is the guarantee now?

What is the guarantee this map f provides?

The distance between the two
points is between 1 plus epsilon

times the distance and 1
minus epsilon times the distance.

So whose distance?

The original distance.

The distance between x and y.

Yes.

So original distances, when mapped into
the smaller dimensional space,

don't get distorted by more than a 1 plus
epsilon or 1 minus epsilon factor.

And you get to choose the epsilon and that
shows up in the lower dimension.

Right?

So everyone understands this now?

Sorry, I have one more question.

So the theorem says that there exists a
function.

Do we know what that function is?

Very good.

Yes, yes.

So the next thing is, sure, this theorem
could be all good that there is a

function, but I need to give you an
algorithm, right, for that function.

Otherwise, who the hell cares?

I mean, mathematicians would be happy with
just showing existence sometimes.

But for computer scientists, you need to
know, right?

You need to compute what the function is
in order to apply it.

Okay, thank you.

But the function is actually pretty
simple.

So let me tell you what the function is.

So no more questions about the theorem,
hopefully.

Okay.

So here is the function.

And the function basically says...

Before I go into the function,
have people heard of the word hyperplane?

So in three dimensions, this is a
hyperplane, the XY plane.

In two dimensions, a hyperplane is a line.

Is this making sense at all or no?

It's one dimension lower than whatever you
live in.

So on the right, this is two dimensional
plane, right?

So if I take a line, that's a one
dimensional thing.

So in two dimensions, in R2, a line is a
hyperplane.

If I'm in three dimensions, what is one
dimension lower than three dimensions?

It's two dimension.

So for example, in
three dimensions, if I look

at this XY plane, then
that's a hyperplane.

Is the definition of a hyperplane clear or
no?

Yes.

Yes.

Yes.

Okay.

So a hyperplane, the way I've defined it,
is one dimension lower, but you can

define hyperplanes of any dimension
between one and three, for example.

So the XY plane is what we also call a two
hyperplane, because its dimension is two.

And in three dimensions, what is a one
hyperplane?

Well, that would be a line.

Right?

Because that's a one dimensional thing.

So is it clear what a one hyperplane is, a
two hyperplane is in the three dimensions?

Hopefully it is somewhat clear,
right?

You choose a dimension and now you just
live within that dimension.

So now what is this function F?

So not only does F exist, but it can
actually be found in randomized polynomial

time, meaning there is a
polynomial time algorithm

that with a good probability
will return you the F.

And actually the runtime is also pretty...

It's this.

So that's the time taken to find the map
of all the points.

And F is nothing but the following.

So you are in D dimensions, right?

Let me first make a picture.

So we are living in D dimensions.

It's hard.

I cannot draw D dimensions on this.

So let me just take the example of three
dimensions for now.

Basically what you
will do is you will choose

a random hyperplane
going through the origin.

And you will take all your point set

and we will just project it onto the
hyperplane.

That is the map F.

So in other words, we have the data set in
D dimensions.

We choose a random D prime hyperplane.

And we project.

So we call this D prime hyperplane H.

We project the n points onto H.

And this is the map F.

This projection

from the D dimensions to this hyperplane
H.

This is the map F.

So in three dimensions, what would I do?

So if my D is three and someone says they
want D prime to be one, then I will choose

a random line going through the origin in
three dimensions.

Right?

I will stand at the origin in three
dimensions, look around me and shoot a

ball in a random direction.

That gives me a line.

And now I will project all of my points
basically onto this line.

And it always has to go through the
origin.

Yes.

Yes.

The hyperplane always has to pass through
the origin.

Yes.

So...

You can ask, how do you find a random
line?

People have heard of
the Gaussian random

variable, the normal
zero one random variable.

In the probability class,
hopefully everyone

saw the normal or the
Gaussian random variable.

So what do you do?

You take a vector, Qi, which is just a D
dimensional vector.

And each coordinate is a normal zero one
random variable.

Okay?

And then you normalize it to have norm
one.

That's a random direction in D dimensions.

So take a D dimensional
vector, each of whose

coordinates are normal
zero one, and normalize it.

Meaning, divided by its norm so that now
the total norm is one.

That's a direction in D dimensions.
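The recipe just described can be sketched in a few lines of Python. This is a minimal sketch, not code from the lecture: the function name `random_direction` and the optional `rng` parameter are illustrative choices, and it ignores the probability-zero case where every coordinate comes out exactly zero.

```python
import math
import random

def random_direction(d, rng=random):
    """Return a random unit vector in d dimensions."""
    # Each coordinate is an independent normal(0, 1) sample.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # Divide by the Euclidean norm so that the result has norm one.
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

q = random_direction(5)
print(q)
print(math.sqrt(sum(x * x for x in q)))  # norm is 1 up to rounding error
```

Passing a seeded `random.Random` object as `rng` makes the draw reproducible.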

And now what you do is you just pick any D
prime of such guys.

Q1, Q2, up to QD prime.

And you look at the vector space that is
spanned by these vectors.

So when I say things like vector space
spanned by vectors, does that make sense

to people, or is
this Greek?

Do people understand what I mean by a
vector space spanned by a bunch of vectors?

Was this taught in a linear algebra course
ever or no?

Yes?

No?

I don't know what you guys went through.

So tell me.

Can you explain a little bit more?

Okay.

So... If I give you
three vectors.

These are three vectors in four
dimensions.

Right?

What is the vector space spanned by these
vectors?

It's the set of all vectors that...

So if I call these vectors V1,
V2, V3, I can take all linear combinations

of these vectors, and I'll get another
vector.

Right?

It doesn't even have to be positive.

Anything.

You can take three vectors
and you can look at all

possible linear combinations
of those three vectors.

That will give you a whole set of vectors.

This is called the vector space spanned by
V1, V2, V3.

Any vector that you can get by multiplying
V1 by a constant, V2 by a constant,

V3 by a constant, and then adding them.

Okay?
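To make "multiply each vector by a constant and add" concrete, here is a small Python sketch. The three vectors below are hypothetical stand-ins in four dimensions, not the ones drawn on the slide, and `linear_combination` is an illustrative name.

```python
def linear_combination(coeffs, vectors):
    """Return coeffs[0]*vectors[0] + coeffs[1]*vectors[1] + ... ."""
    dim = len(vectors[0])
    result = [0.0] * dim
    for c, v in zip(coeffs, vectors):
        for i in range(dim):
            result[i] += c * v[i]
    return result

# Three hypothetical vectors in four dimensions.
v1 = [1, 0, 0, 0]
v2 = [0, 1, 0, 0]
v3 = [0, 0, 1, 1]

# Any choice of constants gives a vector in the span.
print(linear_combination([1, 1, 2], [v1, v2, v3]))     # [1.0, 1.0, 2.0, 2.0]
print(linear_combination([-3, 0.5, 0], [v1, v2, v3]))  # [-3.0, 0.5, 0.0, 0.0]
```

With these particular vectors, every vector in the span has equal third and fourth coordinates, so something like [0, 0, 0, 1] is not in the span.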

So for example,

is the blue vector in the vector space
spanned by the three red vectors?

Yes.

Yes.

Because it is just the sum.

Right?

And you can take other...

Is this blue vector in the vector space
spanned by the three vectors?

Okay?

So we'll leave it
at the next... No?

It is, I think.

If I choose lambda 1
to be 1, lambda 2 to be

1, and lambda 3 to
be 2, I think I get this.

At least that's how I tried to make it.

Maybe I added or subtracted incorrectly.

But that's the vector space spanned by
these vectors.

Right?

So now, what is our map?

Let me... Imagine this is
your high dimensional space Rd.

Right?

What you do is in this high dimensions,
imagine the unit sphere.

Right?

So this is the ball.

This is the unit ball living in high
dimensions.

On the surface of this ball, you pick D'
random points.

Does this picture make sense?

I'm living in D dimensions.

I take the unit ball in D dimensions.

Unit ball meaning its radius is 1.

In D dimensions.

And on the surface of this ball,
I pick D' random points.

Q1, Q2, Q3 up to QD'.

Is it clear what these D' points are?

They're just random points on the ball in
D dimensions.

And now, if I give
you... Okay, so...

In two dimensions, how many points define
the line?

Two.

Two.

Great.

In three dimensions, how many points
define a plane?

Three.

Yes.

So now, I am in D dimensions.

So, how many points will define a D'
hyperplane?

Basically, D' plus one.

So now, in this high
dimensional thing, you take a

plane that contains all
of the Q's and the origin.

That's the vector space spanned by these
guys, Q1 up to QD'.

So I choose D' random points in the high
dimensions.

And I just take the hyperplane that
contains them.

And what is my map F?

It is the projection onto this hyperplane.

So now, if I had a point P1 here,
or a point P2 here, what is my map F?

I just project it.

Projection means, find the closest point on
the plane to the point outside the plane.

So if P1 is outside this hyperplane, I find
the closest point on the hyperplane to P1.

That's F of P1.

If some point was already on the
hyperplane, then nothing changes.

It is its own projection.

But if some point is outside
the hyperplane, then I

find the closest point in
the hyperplane to that point.

And that's F of P2.

Does this make sense now?

What this map F is in the
Johnson-Lindenstrauss Lemma?

It's projection onto a random hyperplane.

The dimension of the hyperplane,
we know what it's going to be.

It's going to be D prime.

And what is a random D prime dimensional
hyperplane?

It's nothing but you
take the unit ball and you

pick D prime many random
vectors on the unit ball.

And you take the vector space spanned by
them.

In other words, the hyperplane containing
those points.
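Putting the pieces together, here is a hedged Python sketch of the map: pick D prime random directions, take the space they span, and send each point to the closest point in that space. The lecture does not say how to compute the projection. Gram-Schmidt orthonormalization is one standard way and is an assumption of this sketch, as are all the function names.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def random_direction(d, rng):
    # Normal(0, 1) coordinates, then normalize to norm one.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(g, g))
    return [x / norm for x in g]

def orthonormal_basis(vectors):
    """Gram-Schmidt: an orthonormal basis for the space spanned by `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for e in basis:
            c = dot(w, e)
            w = [wi - c * ei for wi, ei in zip(w, e)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-12:  # skip a vector that adds no new direction
            basis.append([wi / norm for wi in w])
    return basis

def project(p, basis):
    """Closest point to p in the subspace with orthonormal basis `basis`."""
    result = [0.0] * len(p)
    for e in basis:
        c = dot(p, e)
        result = [r + c * ei for r, ei in zip(result, e)]
    return result

rng = random.Random(0)
d, d_prime = 5, 2
qs = [random_direction(d, rng) for _ in range(d_prime)]  # Q1, ..., QD'
H = orthonormal_basis(qs)                                # the hyperplane
p1 = [1.0, 2.0, 3.0, 4.0, 5.0]
f_p1 = project(p1, H)
print(f_p1)
```

Here `f_p1` is still written with D coordinates. The D prime numbers `dot(p1, e)`, one per basis vector `e`, are its coordinates inside the hyperplane, and distances between projected points are the same in either representation.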

And basically the Johnson-Lindenstrauss Lemma
says that if you pick a random such

hyperplane, then with a good probability
that theorem will be true.

And if it is not true, you repeat it.

And as you have been seeing in this
course, there will be a probability of

failure and therefore a probability of
success.

So if you repeat it enough times,
you will succeed.

So maybe the first random
hyperplane you take, you

will not have that distance
preserving guarantee.

Maybe some distances
will be changed by more than

a one plus epsilon or
one minus epsilon factor.

But if you repeat this,
basically you don't need

to repeat this more
than like order n times.

But you can do it much faster.

But is the map clear now?

What this function f is?

Roughly, yes.

Questions?

Okay, maybe in 3D it's better to show.

If I'm in three dimensions, I take the
unit ball, this is actually a ball,

like a soccer ball.

I'll pick two points.

Let's say q1 and q2.

And then I know my hyperplane has to pass
through the origin.

So now I have three points in the picture.

0, q1, q2.

These are in 3D.

But these three points define a triangle,
right?

And there's a unique hyperplane that
contains this triangle.

The unique two dimensional thing that
contains this thing.

That is what I will project my 3D points
onto.

In order to lower my dimension from three
to two.

Is someone still in the waiting room?

Whoops.

Alright, so questions about this
projection business?

Wait, so with the
ball, you want to find a

hyperplane where 0, q1
and q2 make a triangle, right?

That's pretty much it?

In 3D, yes.

If you want to project points in 3D to a
random 2D thing, that's how you do.

You take the ball in 3D, you choose two
random points on the surface of the ball,

q1 and q2, and you take the two dimensional
plane that passes through q1, q2 and 0.

That's a random two dimensional plane.

And how does that help in finding an f?

Well, f is nothing but projection onto the
plane.

You have the points living in the full
dimension, right?

So you have points living in 3D.

How are we reducing the dimension of the
point set?

By projecting them onto a lower
dimensional thing.

Sort of flattening them.

You're squishing the points, or every point
is getting squished onto the hyperplane.

So my original point set may have high
dimensions, but I'll choose a much lower

dimensional hyperplane and squish all the
points onto this hyperplane.

That's my map f to reduce the dimension.

And we do it multiple times.

Until you succeed.

If you get lucky, as soon as you succeed,
you stop.

And you know when you succeed, because
you can compute the distances, right?

In the original setting
and after

you've squished
them.

So you know when you succeed.

And if you don't, you just choose a
different random hyperplane.

And the chance that a single try succeeds
is at worst polynomially small.

So in polynomial time, you will find an f.

That will work.
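The check-and-retry loop described here can be sketched as follows. This is a sketch under stated assumptions: `draw_map` is a hypothetical callable that picks a fresh random hyperplane and returns the corresponding map, and `max_tries` is an illustrative cap. One more assumption worth flagging: the standard statement of the lemma also rescales the projection by sqrt(D / D prime), which is not spelled out in this part of the lecture, so `draw_map` is assumed to include that scaling.

```python
import math

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def preserves_distances(points, mapped, eps):
    """True if every pairwise distance changes by at most a (1 +/- eps) factor."""
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            original = dist(points[i], points[j])
            new = dist(mapped[i], mapped[j])
            if not ((1 - eps) * original <= new <= (1 + eps) * original):
                return False
    return True

def repeat_until_success(points, eps, draw_map, max_tries):
    """Draw candidate maps until one passes the check; return (map, tries)."""
    for attempt in range(1, max_tries + 1):
        f = draw_map()  # hypothetical: a fresh random projection each time
        if preserves_distances(points, [f(p) for p in points], eps):
            return f, attempt
    return None, max_tries

# Tiny demo with a map that trivially works (the identity).
pts = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
f, tries = repeat_until_success(pts, 0.1, lambda: (lambda p: list(p)), 5)
print(tries)  # 1
```

The check compares all pairs, so each try costs on the order of n squared distance computations on top of the projection itself.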

Okay, if this makes
sense to you, what is the

point set that is the worst
for a given hyperplane?

Let's say I told you
guys that I was going

to project my point set
onto this hyperplane.

This one.

What is the original point set that will
be the worst for this hyperplane?

What is a bad point set for a hyperplane?

Bad meaning for which the guarantee will
definitely not be true.

Would it be they all get mapped to the
same point or something like that?

Right.

So if you took a direction
perpendicular to the

hyperplane, and all your
points were somehow here,

then the original distances between these
points are not zero, right?

They're different points.

But they will all get mapped to this
point.

And then there is no distance,
right?

Does it make sense?
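Here is a tiny Python illustration of that bad case, using the XY plane in three dimensions as the fixed hyperplane. The points are made up for the example.

```python
def project_onto_xy_plane(p):
    """Projection of a 3D point onto the XY plane: zero out the z coordinate."""
    x, y, z = p
    return (x, y, 0.0)

# Distinct points stacked along the z axis, perpendicular to the XY plane.
points = [(0.0, 0.0, 1.0), (0.0, 0.0, 2.0), (0.0, 0.0, 5.0)]
images = [project_onto_xy_plane(p) for p in points]
print(images)  # [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
```

The original distances are 1, 3 and 4, but after the projection every distance is 0, so no (1 minus epsilon) guarantee can hold for this hyperplane.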

But that can't happen.

Or that was very unlikely because we
choose it randomly, right?

Exactly.

Yeah, yeah.

So the chance that...

So first of all, we don't get data sets
like this also.

But even if it was, if
you choose a random

direction, then, you
know, this is very unlikely.

Right.

So that was the Johnson-Lindenstrauss
lemma.

And I wanted to show you an application,
a quick application of it.

So it's not so much as an application,
but let me introduce the problem.

And then I think in next class, we'll be
done with this in like 10 minutes or so.

So this problem is nearest neighbor
search.

So, so far you have seen membership,
right?

We started this course with the membership
problem.

I give you a data set, preprocess it and
store it.

So that when I give you a query, you tell
me if the query is in the data set or no.

Right.

That was the membership problem.

Is Q in S?

But one of the more
useful versions is:

is the query similar to
some key in the data set?

Right.

Maybe the exact query is not in the data
set, but maybe it is pretty close to

someone in the data set.

So in other words, the input here will be
n points X1, X2, X3, up to Xn in D dimensions.

Store your data set.

So that when I give you a query,
I want you to return to me the closest

point in your data set to the query.

So if this is my query Q and the closest
point is this X5, then I want your

algorithm to return X5.

This is nearest neighbor search.

Given a data set, which is a point set in
high dimensions, preprocess it so that

when a query comes, a query is also a
point in high dimensions.

You can quickly return the nearest
neighbor of this query.
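As a baseline for comparison, the problem can always be solved with no preprocessing at all by scanning every point. The sketch below is that naive linear scan, not the data structure this lecture is building toward. The name `nearest_neighbor` and the sample data are illustrative.

```python
import math

def nearest_neighbor(points, q):
    """Return the point in `points` closest to the query q (linear scan)."""
    return min(points, key=lambda x: math.dist(x, q))

data = [(0.0, 0.0), (5.0, 5.0), (1.0, 2.0), (9.0, 1.0)]
print(nearest_neighbor(data, (1.5, 1.5)))  # (1.0, 2.0)
```

Each query here looks at all n points and all D coordinates, so it costs on the order of n times D, which is exactly what preprocessing is supposed to beat.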

Does this make sense?

Why would such a thing be useful in
machine learning, for example?

Can someone see why nearest neighbor
search would be useful in today's world?

So sometimes when you guys try to log in,
I don't know, it's called CAPTCHA,

when they show you a bunch of images
and they say, Oh, click all the ones that

contain a motorcycle or that contain a
bridge.

People have seen that.

Yes.

Yes.

Why do you think they're making you do
that?

Is it for pattern recognition?

Yes.

So what they're actually doing is they're
training a model.

They're using you to train a model.

So remember, every image in

your phone, I said, is a
very long vector, right?

So you have some data set in high
dimensions.

And right now the computer doesn't
know what the image contains.

So it asks you users to label the images.

So for now, let's say some of the images
were cat images and some of the images

were dog images.

So you start labeling cat images.

And some images you label as dog images.

This is what it's making you do when it's
asking you to do stuff.

It's using you as a label maker.

And now what will the machine learning
model do?

It will preprocess
this data into a data structure for

nearest neighbor search,
which I haven't told you about yet.

So that now when a new image appears,

the computer will quickly find who is the
nearest neighbor to this new image.

And if the nearest neighbor
is a dog labeled image, then

the computer will guess that
this new image contains a dog.

And if the new image was
very close to a cat image, then

the computer will guess that
this new image contains a cat.

This is called nearest neighbor
classifier.

And it's widely used in machine learning.
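A minimal Python sketch of such a classifier, assuming the labeled images have already been turned into vectors. The two-dimensional points and the function name `classify` are made up for illustration, and the lookup is a plain linear scan rather than a real nearest neighbor data structure.

```python
import math

def classify(labeled_points, q):
    """Guess the label of q as the label of its nearest labeled point."""
    _, label = min(labeled_points, key=lambda pair: math.dist(pair[0], q))
    return label

# Hypothetical labeled data: (vector, label) pairs.
training = [((0.0, 0.1), "cat"), ((0.2, 0.0), "cat"),
            ((5.0, 5.1), "dog"), ((4.8, 5.3), "dog")]

print(classify(training, (0.3, 0.2)))  # cat
print(classify(training, (4.9, 4.9)))  # dog
```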

Does it make sense why the nearest
neighbor problem would be useful now?

Now, each point in the data set is an image, because
it's a vector in very high dimensions.

And, you know, given a query, it may not
be exactly one of the images, right?

I mean, if I take the same cat,
and I don't know, take a picture of it

from another angle or something,
it's not the same image.

But in this mapping, it will be pretty
close to the original image of the cat.

And so if you have a nearest neighbor
algorithm, you immediately find out that

the nearest neighbor in the data set to
this image is a cat image.

And so you can guess that this image
contains a cat.

So you're classifying unknown images based
on known images.

That's what the training model is,
right?

You give it data to train.

So that now on unseen data, it can make
some educated guess.

Does the problem make sense?

The nearest neighbor search problem?

Yes?

No?

Well, we're out of time.

So maybe you can
come next class with any

questions about the
nearest neighbor search.

Stop the recording.

And we already have scribes for this
lecture, right?

The scribes for this lecture, are they
here?

We had decided in the last class?

Yes.

Yeah, Professor, I believe you chose me
and Alicia.

Yeah, yes.

Okay.

So I'll send you guys the recording after
the lecture.

Just send me an email, so that I have your
email.

And then we meet on Wednesday,
same time.

If there's any questions, I'm here for a
couple of minutes.

Otherwise, class is over.

Professor, sorry.

I was a little bit late.

Just to make sure, for the exam,
for the final, it's everything after the

midterm and probability that was before
the midterm also.

Exactly.

Yes.

Okay.

Thank you.

Did you say everything after the midterm?

Sorry?

Everything after the midterm.

Yes.

After the midterm.

Okay.

But I mean, I say everything after the
midterm, but I have to put the disclaimer

that probability was taught before the
midterm.

So that's obviously included, right?

You can't expect... I mean, don't complain
if there's... if probability shows up in

the final, because
everything after the midterm

is randomized, so
probability will show up.

Will you upload an example of an old final?

Yes.

As soon as I'm back, which should be tomorrow
night.

So if by tomorrow, 9pm, you guys don't see
it, then just send me an email.

My flight lands at 5 and then I can go to
my desktop.

Is there going to be a review session or
no?

I don't see much of the
point of a review session, but I

think next class I'll be done
after the first 15 minutes.

And then we can spend the remainder of
next class as a review.

But a review would basically
be me listing the topics that we

have covered so far and giving
you a quick snapshot of that.