So we’ve discussed what a language is from a computer science theory viewpoint in part 1 of this series, and we have 2 ways of representing languages:

Set notation: `{λ, a, aa, aaa, ...}`

or `{a^n | 0 <= n}`

for short,

and grammar production rules: `{S -> aS | λ}`

As I stated in part 1, a regular expression is a representation of not just any language, but a regular language specifically.

Regular languages are just languages with certain restrictions. To avoid getting too confusing and mathy, we’re going to skip over the formal definition of a regular language and jump straight to regular grammars; recall that a grammar is simply a way of representing a language, so a regular (or linear) grammar is a representation of a regular language (just like a regular expression!). So one example of a regular grammar is one in which every production rule is of the form: A -> aB, A -> a, or A -> λ, where A and B are non-terminals, and a is a terminal (this is called a left linear or left regular grammar).

For example:

`P = {`

S -> aB

B -> bC

C -> c

}

Each production rule in this grammar fits one of the forms above, so it is a regular grammar, and therefore the language represented by this grammar is a regular language.

Note: There are other grammars that represent this same language, such as:

`P1 = {`

S -> abc

}

or

`P2 = {`

S -> Bc

B -> Ab

A -> a

}

(For the record, P2 is an example of a right linear grammar)

Depending on the language or library you’re using, regular expressions have a lot of extra features and shortcuts, but a standard regex has only 3 basic operations:

1) Concatination `(ab)`

2) Alternation `(a|b)`

3) Kleene Star `(a*)`

Any other operation can be made from these operations.

The + operator, for example, can be defined as: a+ is equivalent to `a(a*)`

Another useful operator is the ? operator, which can be defined as: a? is equivalent to `(a|)`