--- Video Title: Regular Expressions Description: Theory of Computation https://uvatoc.github.io/week6 13.2: Regular Expressions Nathan Brunelle and David Evans University of Virginia --- So we've shown that we can do any finite function with finite state automata now. We've shown that we can do some infinite functions, but we know that we're not going to be able to do everything. One of the first things we proved is that no matter what our model of computing is, there's going to be things that we can't compute with it. So we know we're not going to be able to do everything. We're better than we were with NAND circuits, but we know we're not going to be able to compute everything. We might have this natural question of what are those things we can't compute. To continue on this trend of how to discern which things we can and cannot compute, we're going to do something kind of like what we saw with NAND circuits, where we're going to have sort of a machine-looking model of what we're doing, so like a hardware-looking model for what we're doing. We're going to have like a language or a software-looking model for what we're doing as well. For NAND circuits, we had straight-line programs as that software-looking model. The things you can do with finite state automata, it turns out those are equivalent So finite state automata are a way for us to determine which things are and are not in a language. Regular expressions are essentially this programming language that allows us to describe a bunch of strings. So it allows us to describe a set of strings. Regular expressions allow us to describe a set of strings. And it turns out any set of strings I can describe with regular expressions. I can also compute using a finite state automaton and vice versa. So that's what we're going to be showing next. Regular expressions are essentially going to be some way of describing a pattern that your string might have. When we look at a string, we're basically going to be asking, does the string that I'm looking at match this pattern? So if we're trying to think of this as a function, we could say that I'm going to take a string as an input. And then if that string does match the pattern, that function will return 1. And if it doesn't match the pattern, we'll return 0. Or we could think about the language of that regular expression as being a set of all strings that match the pattern. A regular expression is just denoting a set of strings, which is all of the strings that match the pattern. So we can use regular expressions to sort of talk about decision problems as computing functions as well. These are two examples of regular expressions. There are a few main components with these regular expressions we're going to see. Literal characters. So here, A, B, and C, those are literal characters. We're going to see that we can concatenate literal characters together, and then we can apply them together with this vertical bar thing, which we can think of intuitively as an or, or with this star, which is the same as the Kleene star that we saw before, that says zero or more copies of. So let's look at all of the pieces we might see of a regular expression. So a regular expression is basically anything that I can build using some combination of all of the following pieces. So the first thing that we might want to use as a piece of our regular expression is the empty string. If I wanted to talk about specifically the empty string, we can do that with regular expressions. There's two symbols that are the most common symbols to represent the empty string. One of them is double-double quotes, so empty quotes. So this is what you might see in Java or Python to represent the string with no characters. You might also see this epsilon symbol here, a backwards three. If you wanted to do this in LaTeX, that is backslash var epsilon. If you do just regular epsilon, by the way, in. LaTeX, then it looks more like the contained symbol. That's regular epsilon, var epsilon is the backwards three. So if you want to talk about the empty string in a regular expression, that's how you would talk about the empty string. You can also talk about any character in your alphabet. You can write down a literal for any character in your alphabet. So for instance, if I had the regular expression that was just A, that is the literal character A if my alphabet, say, had the letter A in it. And that matches one string. That string is the string containing just A. So, the language of the regular expression for A is just the string A, that one string. This is how we start to build these things up. We're gonna start with super, super simple things, simple components, string literals, and we're gonna be combining those with the things that you're gonna see after that in order to be able to describe more and more strings that are increasingly complicated. The next thing that we're going to see is something that we typically are We're going to call alternation or union, or you can think about this as OR as well. What this does is you can take any two regular expressions that you might have and combine them together with this alternation or union or OR, and it's going to match any string that matched either of those two other things. It's going to match any string that matched either of those two other things. So, for instance, if I had the string literal A, and I had the string literal B, so if I had those two regular expressions, this one on the left matches just the string A, the one on the right matches just the string B. If I combine those together with this vertical bar for alternation or union or OR, then that'll now match two strings. It'll match the string that's just an A, and it'll match the string that's just a B. So the language of that new regular expression has two things in it. That set has two strings in it. Question. It would not also match the string AB. So it has to match either the regular expression on the left or the regular expression on the right. Or potentially both. There could be overlap between. So it would not match AB. It matches A or it matches B. The reason that this is called alternation is it's basically saying here are two alternatives that you could use for what to match against. The reason it's called union is because the strings that this regular expression matches is the union of the sets of strings that the original two regular expressions matched. By saying you can match this one or you can match that one, this new regular expression, it matches the strings that belong to the union of those two. And then or, because you're basically saying you can match the first one or the second one. This is the only one that's got like several names. The next one we can do is we can do concatenation. So I can take two regular expressions and I can concatenate them together, which basically says that I want to now match any string where I can divide it into two pieces, where the first piece matches that first regular expression, the second piece matches that second regular expression. So for instance, I've taken that regular expression that we had before, a or b, and then I've concatenated that with the string literal c. This regular expression is going to match any regular expression where the first chunk of it is going to match the regular expression a or b, and the second chunk of it is going to match the regular expression c. So in this case, there are, again, exactly two strings that match that regular expression. The first one being AC. Because the A matches A or B. And the C is the only thing that matches C. The B matches A or B. And the C is, again, the only thing that matches C. Yeah. If I had the string BCAC. The way that we use regular expressions here is gonna be slightly different from how you might have used them in, say, Python or your programming language of choice. So with Python or your programming language of choice, the way you use them is you said, here's a big chunk of text. Find all of the substrings that match that regular expression. What goes on underneath the hood is your programming language says, Does this chunk match the regular expression? Does this chunk match? And it just marches through your whole file doing that. So really, a regular expression matches a string or it doesn't. So when you're using this regular expression in a programming language, you'll say, find all of the substrings which do match it. Its use in practice is a little bit different, but the way that we're describing it here is actually how the implementation works When people are actually implementing regular expression matching in Python. You look at a regular expression, does the entire string match or not? Is the question that we're asking. How many strings does this regular expression match? Discuss that with your neighbors. How many strings does that match? How many strings? How many strings? I'm seeing four. So there are four strings. Who can name all four strings? I know it's so many. A, B, D. A, B, E, F. C, D. C, E, F. So for this one, the way that we know that these four strings are able to match it is we can say, well, A, B, D, we can split that into two chunks. So here this is saying that we need to be able to split our string into two chunks. The first chunk matching the left half, the second chunk matching the right half. So in this case, we could sort of split it between the B and the D. And now A, B matches the first regular expression here because it said we could match either the string A, B or else the string C. So A, B, we can choose to match there. And then D will match this half of the alternation. We can match either the string E, F or the string D. So D is going to match the string D there. The final thing that we're going to look at is this clean star or this clean star or clean star. If you say any of those, everyone will know what you're talking about. But some people are going to claim that one is the actual correct pronunciation and shame you for using the one that they don't prefer. If I do the star of some regular expression, that is going to match any string that is zero or more copies of things that match that regular expression. And it doesn't have to be the same thing over and over again. As long as all of the ways you've broken it up, each of them matches that regular expression that you mentioned. In this case, we could say, for instance, A or B, C star. This will match the string A by saying, I can match either A or B and then match C star. So it matches A because this says, I chose zero copies of the string C. The thing after the A matches C star because the empty string is zero copies of C. So this is A concatenated with the empty string. We can also do AC because we chose A and then one copy of C. Or we could do ACC or A C C C C C C C C C C C C C C C C C and so forth. So we can do either A or B or A or B followed by any number of Cs. How many strings is that? Yeah. Countably infinite. There's an infinite number of strings that match this regular expression. So if we combine all the other ones without ever using. A star, then we could only get finitely many things. We could get a lot of strings in that language, but we could only get finitely many. The star is what allows us to now start talking about infinite languages. Let's play with this star a little bit. Who can give me one string that matches this regular expression? Yes. The empty string. Very good. So the empty string matches this regular expression. What else? A, B, B, B, B, A. I forgot how many Bs you said, but... A, B, B, B, B, B, A. That matches this regular expression. So A or B star, C star. Who can give me a string that matches this? Yes. The empty string. Very good. So that is one thing that matches this. Who can give me another thing? Yes. A matches this. Very good. I heard a B and I heard a C. And what else? A, B, B, A, C, C. Yeah, let's look at this one a little bit. Let's figure out why this one matches. So this one matches because, well, the first thing I have is I have a concatenation right here. So I need to be able to break it up into two pieces where the first one matches A or B star and the second piece matches C star. So here, this piece here matches A or B star. This piece here matches C star. So are you convinced that this is still countable looking at it? So you could think about this like when we did that example where we said, can you represent all integer pairs using binary strings? Since you could do that, you could represent it as binary strings. You knew that there are only going to be countably many of them. So you can think about this as being a pairs sort of thing going on here. Yeah, very good. That was going to be my next example. So thank you for asking. So let's say that I had A, B star or C, D star. Who can give me strings that are going to match this regular expression? Yeah, you got the easy one this time, the empty string. All right, what else? Yeah. A, C, D, A, C, D? Yeah, so that is a good question. So first of all, we might have this question, what is sort of the order of operations for regular expressions? Star happens first, concatenation happens next, alternation happens after that, unless there's parentheses. So in this case, the star is applying to only the B. So if I wanted star to apply to both of the A and the B, I would need to put parentheses around the A and the B. So does this string match? Does A, C, D, A, C, D match? How many say yes? How many say no? Okay, so we have a split vote. So how would we figure this out? How would we answer this question? What do we need to do? Yes. The first thing we need to do is, the outermost operation is star. The outermost operation is star, so I need to figure out, can I take this string, and can I break it up into a bunch of pieces, so that all of the pieces match what's inside the star? That's the first thing I need to consider. And then we have an alternation. So for each of those pieces, what do I need to figure out on that? Yeah, I have to say, does it match the left or the right? So can we split this into a bunch of pieces, so that every single piece matches either the left or the right, or at least one of the left or the right? Okay, so you're saying that you wanted to do A is chunk one, and then CD is chunk two, and then A is chunk three, and then CD is chunk four. Okay? So now let's see, do each of these chunks match the inside of the star? So we have AB star. Does A match AB star? Yeah, because we have our we need to be able to break it into two pieces, the first piece matching the A, and the second piece matching B star. So what we're going to do is we're going to say the first piece is A, A matches A, and then the second piece is the empty string, the empty string matches B star. And then does CD match? So CD matches because CD matches CD, and that was one of our options. And then we have the same thing for the A and then the CD again. If you've used regular expressions inside some sort of programming language in the past, You probably had more things that you could do with regular expressions than this. For instance, maybe you had character classes. Maybe you had ranges of counts of things. These are going to be the only pieces of regular expression we're going to consider in this course. Any of those other pieces that that programming language might have included, those are purely syntactic sugar. So basically those are things that sort of help make it easier for you to write regular expressions for certain tasks. But they actually aren't any more expressive than these regular expressions. Anything that you could do with one of those fancy schmancy regular expressions in your programming language could build an equivalent regular expression using just these components. You can still do it just with more effort because that was syntactic sugar. We're only ever going to be able to produce a countable number of strings with these regular expressions. Why is that? How do we know that? Exactly. So we can only create strings of finite length. So when we do a star, we're saying zero or more copies of, but it's still only going to be a finite number of copies of that thing. So we're saying zero copies of, or one copy of, or two copies of, or three copies of, but we never actually get to infinite copies of. The set of strings that are going to match this regular expression is still going to be sets of finite strings. And if we're only looking at sets of finite strings, we can only have countably many of them. Because there's countably many finite strings. Yeah. Yeah, in general for the rest of the class, if we're trying to compute things, the things in that set have to be of finite size. So we're not actually going to ever be talking about trying to compute an uncountable set. Because that meh... is problematic. Why lives there are literally words of finite strings? It's not hallucinating imaginario about this assignment. Okay. Took about deepない strings,