---
Video Title: Regular Expressions
Description: Theory of Computation
https://uvatoc.github.io/week6

13.2: Regular Expressions

Nathan Brunelle and David Evans
University of Virginia
---

So we've shown that we can do any finite
function with finite state automata now.

We've shown that we can
do some infinite functions, but

we know that we're not going
to be able to do everything.

One of the first things we proved is that
no matter what our model of computing is,

there's going to be things that we can't
compute with it.

So we know we're not going to be able to
do everything.

We're better than we were
with NAND circuits, but we

know we're not going to be
able to compute everything.

We might have this natural question of
what are those things we can't compute.

To continue on this trend of how to
discern which things we can and cannot

compute, we're going to do something kind
of like what we saw with NAND circuits,

where we're going to have sort of a
machine-looking model of what we're doing,

so like a hardware-looking model for what
we're doing.

We're going to have
like a language or a

software-looking model
for what we're doing as well.

For NAND circuits, we had straight-line
programs as that software-looking model.

The things you can do with finite state
automata, it turns out those are

equivalent So finite state
automata are a way for us to

determine which things are
and are not in a language.

Regular expressions are
essentially this programming

language that allows us to
describe a bunch of strings.

So it allows us to describe a set of
strings.

Regular expressions allow us to describe a
set of strings.

And it turns out any set of strings I can
describe with regular expressions.

I can also compute using a finite state
automaton and vice versa.

So that's what we're going to be showing
next.

Regular expressions are
essentially going to be some

way of describing a pattern
that your string might have.

When we look at a string,
we're basically going to be asking,

does the string that I'm
looking at match this pattern?

So if we're trying to think
of this as a function, we

could say that I'm going
to take a string as an input.

And then if that string does match the
pattern, that function will return 1.

And if it doesn't match the pattern,
we'll return 0.

Or we could think about the
language of that regular expression

as being a set of all strings
that match the pattern.

A regular expression is
just denoting a set of strings,

which is all of the strings
that match the pattern.

So we can use regular
expressions to sort of talk

about decision problems as
computing functions as well.

These are two examples of regular
expressions.

There are a few main components with these
regular expressions we're going to see.

Literal characters.

So here, A, B, and C, those are literal
characters.

We're going to see that we can concatenate
literal characters together, and then we

can apply them together with this vertical
bar thing, which we can think of

intuitively as an or, or with this star,
which is the same as the Kleene star that

we saw before, that says zero or more
copies of.

So let's look at all of the pieces we
might see of a regular expression.

So a regular expression is basically
anything that I can build using some

combination of all of the following
pieces.

So the first thing that
we might want to use as a

piece of our regular
expression is the empty string.

If I wanted to talk about
specifically the empty

string, we can do that
with regular expressions.

There's two symbols that are the most common
symbols to represent the empty string.

One of them is double-double quotes,
so empty quotes.

So this is what you might
see in Java or Python

to represent the string
with no characters.

You might also see this epsilon symbol
here, a backwards three.

If you wanted to do this in LaTeX,
that is backslash var epsilon.

If you do just regular
epsilon, by the way, in.

LaTeX, then it looks more
like the contained symbol.

That's regular epsilon, var epsilon is the
backwards three.

So if you want to talk about the empty
string in a regular expression,

that's how you would talk about the empty
string.

You can also talk about any character in
your alphabet.

You can write down a literal for any
character in your alphabet.

So for instance, if I had the regular
expression that was just A, that is the

literal character A if my alphabet,
say, had the letter A in it.

And that matches one string.

That string is the string containing just
A.

So, the language of
the regular expression

for A is just the string
A, that one string.

This is how we start to build these things
up.

We're gonna start with super, super simple
things, simple components, string

literals, and we're gonna be combining
those with the things that you're gonna

see after that in order to
be able to describe more

and more strings that are
increasingly complicated.

The next thing that we're going to see is
something that we typically are We're

going to call alternation or union,
or you can think about this as OR as well.

What this does is you can take any two
regular expressions that you might have

and combine them together with this
alternation or union or OR, and it's going

to match any string that matched either of
those two other things.

It's going to match any string that
matched either of those two other things.

So, for instance, if I had the string
literal A, and I had the string literal B,

so if I had those two regular expressions,
this one on the left matches just the

string A, the one on the right matches
just the string B.

If I combine those together
with this vertical bar for

alternation or union or OR,
then that'll now match two strings.

It'll match the string that's just an A,
and it'll match the string that's just a B.

So the language of that new regular
expression has two things in it.

That set has two strings in it.

Question.

It would not also match the string AB.

So it has to match either
the regular expression

on the left or the regular
expression on the right.

Or potentially both.

There could be overlap between.

So it would not match AB.

It matches A or it matches B.

The reason that this is called alternation
is it's basically saying here are two

alternatives that you could use for what
to match against.

The reason it's called union is because
the strings that this regular expression

matches is the union
of the sets of strings that

the original two regular
expressions matched.

By saying you can match this one or you
can match that one, this new regular

expression, it matches the strings that
belong to the union of those two.

And then or, because
you're basically saying

you can match the first
one or the second one.

This is the only one that's got like
several names.

The next one we can do is we can do
concatenation.

So I can take two regular expressions and
I can concatenate them together,

which basically says that I want to now
match any string where I can divide it

into two pieces, where the first piece
matches that first regular expression,

the second piece matches that second
regular expression.

So for instance, I've taken that regular
expression that we had before,

a or b, and then I've concatenated that
with the string literal c.

This regular expression is going to match
any regular expression where the first

chunk of it is going to match the regular
expression a or b, and the second chunk of

it is going to match the regular
expression c.

So in this case, there
are, again, exactly two

strings that match
that regular expression.

The first one being AC.

Because the A matches A or B.

And the C is the only thing that matches
C.

The B matches A or B.

And the C is, again, the only thing that
matches C.

Yeah.

If I had the string BCAC.

The way that we use regular expressions
here is gonna be slightly different from

how you might have used them in, say, Python
or your programming language of choice.

So with Python or your
programming language of choice, the

way you use them is you
said, here's a big chunk of text.

Find all of the substrings that match that
regular expression.

What goes on underneath
the hood is your programming

language says, Does this chunk
match the regular expression?

Does this chunk match?

And it just marches through your whole
file doing that.

So really, a regular expression matches a
string or it doesn't.

So when you're using this regular
expression in a programming language,

you'll say, find all of the substrings
which do match it.

Its use in practice is a little bit
different, but the way that we're

describing it here is actually how the
implementation works When people are

actually implementing regular expression
matching in Python.

You look at a regular expression,
does the entire string match or not?

Is the question that we're asking.

How many strings does this regular
expression match?

Discuss that with your neighbors.

How many strings does that match?

How many strings?

How many strings?

I'm seeing four.

So there are four strings.

Who can name all four strings?

I know it's so many.

A, B, D.

A, B, E, F.

C, D.

C, E, F.

So for this one, the way that we know that
these four strings are able to match it is

we can say, well, A, B, D, we can split
that into two chunks.

So here this is saying that we need to be
able to split our string into two chunks.

The first chunk matching the left half,
the second chunk matching the right half.

So in this case, we could sort of split it
between the B and the D.

And now A, B matches the first regular
expression here because it said we could

match either the string A, B or else the
string C.

So A, B, we can choose to match there.

And then D will match this half of the
alternation.

We can match either the string E,
F or the string D.

So D is going to match the string D there.

The final thing that
we're going to look at is

this clean star or this
clean star or clean star.

If you say any of those, everyone will
know what you're talking about.

But some people are going to claim that
one is the actual correct pronunciation

and shame you for using the one that they
don't prefer.

If I do the star of some regular
expression, that is going to match any

string that is zero or more copies of
things that match that regular expression.

And it doesn't have to be the same thing
over and over again.

As long as all of the ways
you've broken it up, each of

them matches that regular
expression that you mentioned.

In this case, we could say, for instance,
A or B, C star.

This will match the
string A by saying, I can

match either A or B
and then match C star.

So it matches A because this says,
I chose zero copies of the string C.

The thing after the A matches C star because
the empty string is zero copies of C.

So this is A concatenated with the empty
string.

We can also do AC because we chose A and
then one copy of C.

Or we could do ACC or A C C C C C C C C C
C C C C C C C C and so forth.

So we can do either A or B or A or B
followed by any number of Cs.

How many strings is that?

Yeah.

Countably infinite.

There's an infinite number of strings that
match this regular expression.

So if we combine all the
other ones without ever using.

A star, then we could only
get finitely many things.

We could get a lot of strings in that
language, but we could only get finitely many.

The star is what allows us to now start
talking about infinite languages.

Let's play with this star a little bit.

Who can give me one string that matches
this regular expression?

Yes.

The empty string.

Very good.

So the empty string matches this regular
expression.

What else?

A, B, B, B, B, A.

I forgot how many Bs you
said, but... A, B, B, B, B, B, A.

That matches this regular expression.

So A or B star, C star.

Who can give me a string that matches
this?

Yes.

The empty string.

Very good.

So that is one thing that matches this.

Who can give me another thing?

Yes.

A matches this.

Very good.

I heard a B and I heard a C.

And what else?

A, B, B, A, C, C.

Yeah, let's look at this one a little bit.

Let's figure out why this one matches.

So this one matches
because, well, the first

thing I have is I have a
concatenation right here.

So I need to be able to break it up into
two pieces where the first one matches A

or B star and the second piece matches C
star.

So here, this piece here matches A or B
star.

This piece here matches C star.

So are you convinced that this is still
countable looking at it?

So you could think about this like when we
did that example where we said,

can you represent all integer pairs using
binary strings?

Since you could do that, you could
represent it as binary strings.

You knew that there are only going to be
countably many of them.

So you can think about this as being a
pairs sort of thing going on here.

Yeah, very good.

That was going to be my next example.

So thank you for asking.

So let's say that I had A, B star or C,
D star.

Who can give me strings that are going to
match this regular expression?

Yeah, you got the easy one this time,
the empty string.

All right, what else?

Yeah.

A, C, D, A, C, D?

Yeah, so that is a good question.

So first of all, we might
have this question, what is

sort of the order of operations
for regular expressions?

Star happens first,
concatenation happens next,

alternation happens after
that, unless there's parentheses.

So in this case, the star is applying to
only the B.

So if I wanted star to apply
to both of the A and the B,

I would need to put parentheses
around the A and the B.

So does this string match?

Does A, C, D, A, C, D match?

How many say yes?

How many say no?

Okay, so we have a split vote.

So how would we figure this out?

How would we answer this question?

What do we need to do?

Yes.

The first thing we need to do is,
the outermost operation is star.

The outermost operation is star,
so I need to figure out, can I take this

string, and can I break it
up into a bunch of pieces,

so that all of the pieces
match what's inside the star?

That's the first thing I need to consider.

And then we have an alternation.

So for each of those pieces, what do I
need to figure out on that?

Yeah, I have to say, does it match the
left or the right?

So can we split this into a bunch of
pieces, so that every single piece matches

either the left or the right, or at least
one of the left or the right?

Okay, so you're saying that you wanted to
do A is chunk one, and then CD is chunk

two, and then A is chunk three,
and then CD is chunk four.

Okay?

So now let's see, do each of these chunks
match the inside of the star?

So we have AB star.

Does A match AB star?

Yeah, because we have our

we need to be able to break
it into two pieces, the first

piece matching the A, and the
second piece matching B star.

So what we're going to do is we're going
to say the first piece is A, A matches A,

and then the second piece is the empty
string, the empty string matches B star.

And then does CD match?

So CD matches because CD matches CD,
and that was one of our options.

And then we have the same thing for the A
and then the CD again.

If you've used regular expressions inside
some sort of programming language in the

past, You probably
had more things that you

could do with regular
expressions than this.

For instance, maybe you had character
classes.

Maybe you had ranges of counts of things.

These are going to be
the only pieces of regular

expression we're going
to consider in this course.

Any of those other pieces that that
programming language might have included,

those are purely syntactic sugar.

So basically those are
things that sort of help make it

easier for you to write regular
expressions for certain tasks.

But they actually aren't any more
expressive than these regular expressions.

Anything that you could do with one of
those fancy schmancy regular expressions

in your programming language
could build an equivalent

regular expression using
just these components.

You can still do it just with more effort
because that was syntactic sugar.

We're only ever going to be
able to produce a countable

number of strings with
these regular expressions.

Why is that?

How do we know that?

Exactly.

So we can only create strings of finite
length.

So when we do a star, we're saying zero or
more copies of, but it's still only going

to be a finite number of copies of that
thing.

So we're saying zero copies of,
or one copy of, or two copies of,

or three copies of, but we never actually
get to infinite copies of.

The set of strings that are
going to match this regular

expression is still going
to be sets of finite strings.

And if we're only
looking at sets of finite

strings, we can only have
countably many of them.

Because there's countably many finite
strings.

Yeah.

Yeah, in general for the rest of the
class, if we're trying to compute things,

the things in that set have to be of
finite size.

So we're not actually
going to ever be talking

about trying to compute
an uncountable set.

Because that

meh... is problematic.

Why lives there are literally words of
finite strings?

It's not hallucinating imaginario about
this assignment.

Okay.

Took about deepない strings,