An In-Depth Look at Regular Expressions Part 2: Regular Languages

by | Jun 21, 2016

So we’ve discussed what a language is from a computer science theory viewpoint in part 1 of this series, and we have 2 ways of representing languages:
Set notation: {λ, a, aa, aaa, ...} or {a^n | 0 <= n} for short,
and grammar production rules: {S -> aS | λ}

As I stated in part 1, a regular expression is a representation of not just any language, but a regular language specifically.

Regular languages are just languages with certain restrictions. To avoid getting too confusing and mathy, we’re going to skip over the formal definition of a regular language and jump straight to regular grammars; recall that a grammar is simply a way of representing a language, so a regular (or linear) grammar is a representation of a regular language (just like a regular expression!). So one example of a regular grammar is one in which every production rule is of the form: A -> aB, A -> a, or A -> λ, where A and B are non-terminals, and a is a terminal (this is called a left linear or left regular grammar).

For example:

P = {
S -> aB
B -> bC
C -> c
}

Each production rule in this grammar fits one of the forms above, so it is a regular grammar, and therefore the language represented by this grammar is a regular language.
Note: There are other grammars that represent this same language, such as:

P1 = {
S -> abc
}

or

P2 = {
S -> Bc
B -> Ab
A -> a
}

(For the record, P2 is an example of a right linear grammar)

Depending on the language or library you’re using, regular expressions have a lot of extra features and shortcuts, but a standard regex has only 3 basic operations:
1) Concatination (ab)
2) Alternation (a|b)
3) Kleene Star (a*)

Any other operation can be made from these operations.
The + operator, for example, can be defined as: a+ is equivalent to a(a*)
Another useful operator is the ? operator, which can be defined as: a? is equivalent to (a|)

Recent Posts