Finite Automata and Regular Expressions

The course slowly builds up different models of computation of increasing complexity, starting with finite state automata (FSAs). This chapter introduces FSAs and regular expressions, which turn out to be equivalent in terms of expressiveness.

Finite State Automata

Finite state automata (FSAs) are an abstraction for a kind of machine with finite memory. An FSA reads a sequence of symbols and either accepts or rejects it. We begin by defining FSAs formally.

Definition 1 (Finite State Automaton; FSA)

A finite state automaton \(M\) is a 5-tuple \((Q, \Sigma, \delta, q_0, F)\) where

\(Q\) is a finite set, whose elements we call the states,
\(\Sigma\) is a finite set, whose elements we call the alphabet symbols,
\(\delta : Q \times \Sigma \to Q\) is a function called the transition function,
\(q_0 \in Q\) is the start state,
\(F \subseteq Q\) is a set of accept states.

Note

By its definition, at each state of the FSA there exists precisely one transition for each symbol in the alphabet. Having two or more transitions with the same symbol is not allowed, and having no transition for a given symbol is also not allowed.

To get a better idea of what an FSA does, we can draw it in the form of a state diagram. In a state diagram, we use circles for the states, labeled arrows for the transitions, an unlabeled incoming arrow to mark the initial state, and double circles to mark the accept (final) states. Below is an example of an FSA with three states.

../../_images/d0157d6d07baffe755ae6b89b1bf6fdb41d731ca366696f94ea5d39ffa8506f9.svg

Now we define strings, which are sequences of symbols, and languages, which are sets of strings.

Definition 2 (Strings and Languages)

A string is a finite sequence of symbols from a finite set \(\Sigma\). A language is a (finite or infinite) set of strings. The empty string \(\epsilon\) is the string of length \(0\). The empty language \(\emptyset\) is the set containing no strings; note that it differs from \(\{\epsilon\}\), the language containing only the empty string.

Now we can formally define what it means for an FSA to accept or reject a string, or to recognise a language.

Definition 3 (FSA accepts a string / recognises a language)

We say that the FSA \(M = (Q, \Sigma, \delta, q_0, F)\) accepts the string \(w = w_1 w_2\dots w_N\), where \(w_n \in \Sigma\) for all \(n = 1, \dots, N\), if there is a sequence of states \(r_0, r_1, \dots, r_N \in Q\) such that

\(r_0 = q_0\),
\(r_n = \delta(r_{n-1}, w_n)\) for \(1 \leq n \leq N\),
\(r_N \in F\).

We say that \(M\) recognises the language \(L\) if \(M\) accepts exactly those strings in \(L\). We write \(L(M)\) to denote the language recognised by \(M\), that is \(L(M)= \{w | M \text{ accepts } w\}\). We say that \(L(M)\) is the language of \(M\).

Note that, by this definition, \(M\) recognises \(L\) only if it accepts every string in \(L\) and rejects every string that is not in \(L\). Accepting all strings in \(L\) is not enough on its own.

Example 1

The FSA \(M\) shown above accepts exactly those binary strings which contain \(11\) as a substring.
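Definition 3 translates almost line by line into code. The following is a minimal sketch, assuming a three-state machine for the language of Example 1; the state names `q0`, `q1`, `q2` and the dictionary encoding of \(\delta\) are assumptions made for illustration and may not match the labels in the figure.

```python
from typing import Dict, Set, Tuple

# Hypothetical state names; the figure may label its states differently.
STATES: Set[str] = {"q0", "q1", "q2"}
ALPHABET: Set[str] = {"0", "1"}
START: str = "q0"
ACCEPT: Set[str] = {"q2"}

# delta is total: exactly one entry for every (state, symbol) pair.
DELTA: Dict[Tuple[str, str], str] = {
    ("q0", "0"): "q0", ("q0", "1"): "q1",  # q0: no progress towards 11
    ("q1", "0"): "q0", ("q1", "1"): "q2",  # q1: the last symbol read was 1
    ("q2", "0"): "q2", ("q2", "1"): "q2",  # q2: 11 has already been seen
}


def accepts(w: str) -> bool:
    """Follow r_0 = q_0, r_n = delta(r_{n-1}, w_n), then test r_N in F."""
    state = START
    for symbol in w:
        if symbol not in ALPHABET:
            raise ValueError(f"symbol {symbol!r} is not in the alphabet")
        state = DELTA[(state, symbol)]
    return state in ACCEPT
```

Because \(\delta\) is a function, the sequence of states is uniquely determined by the input, so a single loop variable is enough to track it.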

Now we come to an important definition, that of regular languages.

Definition 4 (Regular language)

We say that a language is regular if there exists an FSA that recognises it.

Example 2

The set of all finite strings which have an even number of \(1\)s is a regular language. An FSA that recognises this language is shown below.

../../_images/04d8d4403303ba6c960b4d931077afcba6b4100b7bf1ce4e2b489aa99f41011f.svg
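As a worked instance of Definition 3 for this language, consider the following trace. The state names \(q_{\text{even}}\) and \(q_{\text{odd}}\) are assumptions introduced here and may differ from the labels in the figure. We assume that \(q_{\text{even}}\) is both the start state and the only accept state, that reading \(0\) keeps the current state, and that reading \(1\) switches it. On the input \(w = 1011\), which contains three \(1\)s, the sequence of states is

\[r_0 = q_{\text{even}}, \quad r_1 = \delta(r_0, 1) = q_{\text{odd}}, \quad r_2 = \delta(r_1, 0) = q_{\text{odd}}, \quad r_3 = \delta(r_2, 1) = q_{\text{even}}, \quad r_4 = \delta(r_3, 1) = q_{\text{odd}}.\]

Since \(r_4 = q_{\text{odd}} \notin F\), the string \(1011\) is rejected, as expected for a string with an odd number of \(1\)s.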

Example 3

The set of all finite strings which have an equal number of \(0\)s and \(1\)s is not a regular language.

As this result suggests, FSAs are very limited in terms of the languages they can recognise and, more generally, are a limited model of computation.

Regular operations

Given existing languages, we can construct new ones. In particular, we will define two binary operations, namely union and concatenation, and a unary operation, namely star, which create new languages from existing ones. We give these operations the special name regular operations because, as we will see later, the class of regular languages is closed under them: applying them to regular languages always yields a regular language.

Definition 5 (Regular operations)

Let \(A, B\) be languages. We define the union, concatenation and star operations as:

Union: \(A \cup B = \{w | w \in A \text{ or } w \in B\}.\)
Concatenation: \(A \circ B = AB = \{xy | x \in A \text{ and } y \in B\}.\)
Star: \(A^* = \{x_1 \dots x_N | N \geq 0 \text{ and } x_n \in A \text{ for all } n = 1, \dots, N\}.\)

Note that the language obtained via the star operation always contains the empty string.
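For finite languages, the three operations can be tried out directly on Python sets of strings. This is only an illustrative sketch: since \(A^*\) is infinite whenever \(A\) contains a non-empty string, the hypothetical helper `star_up_to` enumerates only the strings built from at most `max_pieces` factors.

```python
from itertools import product
from typing import Set


def union(a: Set[str], b: Set[str]) -> Set[str]:
    """A | B: strings that are in A or in B."""
    return a | b


def concat(a: Set[str], b: Set[str]) -> Set[str]:
    """AB: every string of A followed by every string of B."""
    return {x + y for x, y in product(a, b)}


def star_up_to(a: Set[str], max_pieces: int) -> Set[str]:
    """Finite part of A*: concatenations of at most max_pieces strings of A."""
    result = {""}  # N = 0 factors gives the empty string
    layer = {""}
    for _ in range(max_pieces):
        layer = concat(layer, a)  # strings made of exactly one more factor
        result |= layer
    return result
```

For example, `star_up_to(set(), 3)` returns `{""}`, which matches the note above: even \(\emptyset^*\) contains the empty string.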

We can use these regular operations to form new languages. Regular expressions are a useful bit of notation that facilitates this.

Definition 6 (Regular expressions)

Let \(\Sigma\) be an alphabet. A regular expression \(R\) on \(\Sigma\) is one of the following:

\(a\) for some \(a \in \Sigma\),
\(\epsilon\),
\(\emptyset\),
\((R_1 \cup R_2)\), where \(R_1\) and \(R_2\) are regular expressions,
\((R_1 \circ R_2)\), where \(R_1\) and \(R_2\) are regular expressions,
\((R_1^*)\), where \(R_1\) is a regular expression.

Example 4 (Some regular expressions)

Let \(\Sigma = \{0, 1\}\). The following are examples of regular expressions:

\((0 \cup 1)^* = \Sigma^*\) describes the set of all strings over \(\Sigma\).
\(\Sigma^*1\) describes the set of all strings that end in \(1\).
\(\Sigma^*11\Sigma^*\) describes the set of all strings that contain \(11\) as a substring.

When writing regular expressions, we may use the shorthand \(a\) instead of the singleton set \(\{a\}\), as was done in the examples above.
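The expressions in Example 4 can be tried out with Python's standard `re` module, writing `|` for \(\cup\), juxtaposition for \(\circ\) and `*` for star. This is a sketch under the assumption that only this fragment of the `re` syntax is used; Python patterns also offer features, such as backreferences, that go beyond the regular expressions of Definition 6. The helper name `in_language` is hypothetical.

```python
import re

# Example 4 rewritten in Python's pattern syntax, with Sigma = {0, 1}.
ALL_STRINGS = r"(0|1)*"
ENDS_IN_1 = r"(0|1)*1"
CONTAINS_11 = r"(0|1)*11(0|1)*"


def in_language(pattern: str, w: str) -> bool:
    """True if the whole string w matches the pattern."""
    # fullmatch is used because membership concerns the entire string,
    # not just some substring of it.
    return re.fullmatch(pattern, w) is not None
```

Using `re.search` instead of `re.fullmatch` would test whether some substring matches, which is a different question from membership in the language.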

Closure properties: union

We continue by proving three closure properties of regular languages, namely closure under the union, concatenation and star operations. We will first prove closure under union.

Theorem 1 (Closure under union)

If \(A_1, A_2\) are regular languages over an alphabet \(\Sigma\), so is \(A_1 \cup A_2\).
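One standard way to prove this theorem directly with FSAs is the product construction: build a machine whose states are pairs of states, one from each original machine, and step both components on every input symbol. The sketch below assumes this is the intended argument, which may differ in detail from the proof given in the course. The tuple encoding of an FSA and the two example machines are assumptions made for illustration.

```python
from itertools import product

# An FSA is encoded as a tuple (Q, Sigma, delta, q0, F), where delta is a
# dict keyed by (state, symbol). This encoding is an assumption of the sketch.


def union_dfa(m1, m2):
    """Product automaton recognising the union of L(m1) and L(m2)."""
    q1, sigma, d1, s1, f1 = m1
    q2, sigma2, d2, s2, f2 = m2
    if sigma != sigma2:
        raise ValueError("both machines must share the same alphabet")
    states = set(product(q1, q2))
    # Each component moves according to its own transition function.
    delta = {
        ((a, b), c): (d1[(a, c)], d2[(b, c)])
        for (a, b) in states
        for c in sigma
    }
    # Accept when at least one component is in an accept state.
    accept = {(a, b) for (a, b) in states if a in f1 or b in f2}
    return states, sigma, delta, (s1, s2), accept


def accepts(m, w):
    _, _, delta, state, accept = m
    for c in w:
        state = delta[(state, c)]
    return state in accept


# Hypothetical example machines over {0, 1}.
EVEN_ONES = (
    {"e", "o"}, {"0", "1"},
    {("e", "0"): "e", ("e", "1"): "o", ("o", "0"): "o", ("o", "1"): "e"},
    "e", {"e"},
)
ENDS_IN_0 = (
    {"n", "z"}, {"0", "1"},
    {("n", "0"): "z", ("n", "1"): "n", ("z", "0"): "z", ("z", "1"): "n"},
    "n", {"z"},
)
EITHER = union_dfa(EVEN_ONES, ENDS_IN_0)
```

Here `EITHER` has \(2 \times 2 = 4\) states and accepts a string if it has an even number of \(1\)s or ends in \(0\).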

It is more challenging to prove closure under concatenation and star by constructing an FSA directly. Instead, introducing a new concept, nondeterminism, will make these proofs much easier. It will also simplify the proof for closure under union and make it more intuitive. More generally, nondeterminism is a recurring theme across the theory of computation.

Nondeterminism

We introduce the nondeterministic finite automaton (NFA), which extends the finite automata we have been looking at up to now. From here on, we refer to the latter as deterministic finite automata (DFAs). The example NFA shown below differs from a DFA in two important ways:

From each state in the NFA, there may be none, one or more outgoing transitions for each symbol.
A transition may be labeled with the symbol \(\epsilon\), which means that it reads no symbol from the input and just changes the state of the NFA.

The NFA accepts a string if there is at least one valid path, labeled by that string, from the initial state to some final state.

../../_images/981c7354b67d177c2307e2cf886be16f13d02d8f71b77c22739bc478bd812bd0.svg

We can also look at a few example input strings, and whether this NFA accepts or rejects them.

NFA accepts input ab
NFA rejects input aa
NFA accepts input aba
NFA rejects input abb
NFA accepts input aab

Now we define NFAs formally. The definition is the same as for DFAs except for the transition function \(\delta\).

Definition 7 (Nondeterministic Finite Automaton)

A nondeterministic finite automaton (NFA) \(N\) is a \(5\)-tuple \((Q, \Sigma, \delta, q_0, F)\) where

\(Q\) is a finite set of states,
\(\Sigma\) is a finite set of symbols,
\(\delta : Q \times \Sigma_{\epsilon} \to \mathcal{P}(Q) = \{R | R \subseteq Q\}\) is the transition function, where \(\Sigma_{\epsilon} = \Sigma \cup \{\epsilon\}\),
\(q_0 \in Q\) is the initial state,
\(F \subseteq Q\) is a set of accept states.
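Acceptance by an NFA can be checked by tracking the set of all states the machine could currently be in. The following is a minimal sketch, assuming a small hypothetical NFA (not the one in the figure) that recognises the language \(0^*1^*\) using a single \(\epsilon\)-transition. The empty Python string stands for the label \(\epsilon\), and missing dictionary entries stand for the empty set of successor states.

```python
from typing import Dict, Set, Tuple

EPS = ""  # stands for the epsilon label (an encoding chosen for this sketch)

# Hypothetical NFA: loop on 0 in q0, move silently to q1, loop on 1 in q1.
DELTA: Dict[Tuple[str, str], Set[str]] = {
    ("q0", "0"): {"q0"},
    ("q0", EPS): {"q1"},
    ("q1", "1"): {"q1"},
}
START = "q0"
ACCEPT = {"q1"}


def eps_closure(states: Set[str]) -> Set[str]:
    """All states reachable from `states` using only epsilon-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for r in DELTA.get((q, EPS), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure


def accepts(w: str) -> bool:
    """True if some path labeled by w leads from START to an accept state."""
    current = eps_closure({START})
    for symbol in w:
        moved: Set[str] = set()
        for q in current:
            moved |= DELTA.get((q, symbol), set())
        current = eps_closure(moved)
    return bool(current & ACCEPT)
```

If the set of possible states becomes empty, every path has died and the input is rejected, which is how the sketch handles states with no outgoing transition for a symbol.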

Even though NFAs appear to be more powerful than DFAs, they are in fact equivalent to them. Every DFA is an NFA and, perhaps more surprisingly, every NFA has an equivalent DFA.

Theorem 2 (NFA recognises \(A\) \(\implies\) \(A\) is regular)

If an NFA recognises \(A\), then \(A\) is regular, so there exists a DFA that recognises it.

We can use the procedure in the proof of this theorem to convert any NFA to an equivalent DFA. For example, we can convert the NFA from the previous example into the following equivalent DFA.

../../_images/0b7c5fcc5c5a89b04778c6ca42aaea2f45743a5cde561b064695ff0848eeb7da.svg
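A common way to carry out this conversion is the subset construction, in which each DFA state is a set of NFA states. The sketch below assumes that this is the procedure meant above and builds only the subsets reachable from the start state, rather than all of \(\mathcal{P}(Q)\). The function names and the example NFA, which recognises \(0^*1^*\), are hypothetical.

```python
EPS = ""  # stands for the epsilon label (an encoding chosen for this sketch)


def eps_closure(delta, states):
    """All states reachable from `states` using only epsilon-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure


def nfa_to_dfa(sigma, delta, start, accept):
    """Subset construction restricted to reachable subsets."""
    start_set = frozenset(eps_closure(delta, {start}))
    dfa_delta = {}
    seen = {start_set}
    todo = [start_set]
    while todo:
        current = todo.pop()
        for symbol in sigma:
            moved = set()
            for q in current:
                moved |= delta.get((q, symbol), set())
            target = frozenset(eps_closure(delta, moved))
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A subset is accepting if it contains at least one NFA accept state.
    dfa_accept = {s for s in seen if s & accept}
    return seen, dfa_delta, start_set, dfa_accept


def dfa_accepts(dfa_delta, start_set, dfa_accept, w):
    state = start_set
    for symbol in w:
        state = dfa_delta[(state, symbol)]
    return state in dfa_accept


# Hypothetical NFA for 0*1*: loop on 0, epsilon-move, loop on 1.
NFA_DELTA = {("q0", "0"): {"q0"}, ("q0", EPS): {"q1"}, ("q1", "1"): {"q1"}}
D_STATES, D_DELTA, D_START, D_ACCEPT = nfa_to_dfa(
    {"0", "1"}, NFA_DELTA, "q0", {"q1"}
)
```

The empty subset appears as an ordinary DFA state here. It plays the role of a dead state, which keeps the resulting transition function total as Definition 1 requires.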

Closure properties

We can now use nondeterminism to prove closure under union, concatenation and star much more easily. For union, the proof is based on the intuitive idea that we can build a larger NFA whose computation amounts to running both machines in parallel.

Theorem 3 (Closure under union)

If \(A_1, A_2\) are regular languages over an alphabet \(\Sigma\), so is \(A_1 \cup A_2\).

Similarly, we can prove closure under concatenation, by joining two DFAs into a single larger NFA.

Theorem 4 (Closure under concatenation)

If \(A_1, A_2\) are regular languages over an alphabet \(\Sigma\), so is \(A_1 \circ A_2\).

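Last, we can prove closure under the star operation by adding a new initial state, which is also an accept state so that the empty string is accepted, together with an \(\epsilon\)-transition from it to the old initial state, as well as \(\epsilon\)-transitions from the final states of the machine back to the old initial state.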

Theorem 5 (Closure under star)

If \(A\) is a regular language, then so is \(A^*\).
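The three constructions behind Theorems 3 to 5 can be sketched on a simple NFA encoding. This is an illustration under stated assumptions rather than the proofs themselves: an NFA is a tuple `(states, delta, start, accept)`, the two input machines have disjoint state names, `new_start` is a fresh name supplied by the caller, and the empty string stands for \(\epsilon\). All function names are hypothetical.

```python
EPS = ""  # stands for the epsilon label (an encoding chosen for this sketch)


def merge(d1, d2):
    """Combine two transition tables without mutating either one."""
    merged = {k: set(v) for k, v in d1.items()}
    for k, v in d2.items():
        merged.setdefault(k, set()).update(v)
    return merged


def nfa_union(n1, n2, new_start):
    (q1, d1, s1, f1), (q2, d2, s2, f2) = n1, n2
    delta = merge(d1, d2)
    delta[(new_start, EPS)] = {s1, s2}  # guess which machine to run
    return q1 | q2 | {new_start}, delta, new_start, f1 | f2


def nfa_concat(n1, n2):
    (q1, d1, s1, f1), (q2, d2, s2, f2) = n1, n2
    delta = merge(d1, d2)
    for f in f1:  # after finishing in n1, optionally continue in n2
        delta.setdefault((f, EPS), set()).add(s2)
    return q1 | q2, delta, s1, set(f2)


def nfa_star(n, new_start):
    q, d, s, f = n
    delta = merge(d, {})
    delta[(new_start, EPS)] = {s}
    for a in f:  # after one factor, optionally start another
        delta.setdefault((a, EPS), set()).add(s)
    # The new start state accepts, so the empty string is in the language.
    return q | {new_start}, delta, new_start, f | {new_start}


def accepts(nfa, w):
    _, delta, start, accept = nfa

    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            for r in delta.get((stack.pop(), EPS), set()):
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    current = closure({start})
    for c in w:
        moved = set()
        for q in current:
            moved |= delta.get((q, c), set())
        current = closure(moved)
    return bool(current & accept)


# Hypothetical machines recognising {"a"} and {"b"}.
N_A = ({"a0", "a1"}, {("a0", "a"): {"a1"}}, "a0", {"a1"})
N_B = ({"b0", "b1"}, {("b0", "b"): {"b1"}}, "b0", {"b1"})
```

Composing the helpers mirrors composing regular operations: `nfa_star(nfa_concat(N_A, N_B), "s")` recognises \((ab)^*\).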

Regular languages \(\equiv\) Regular expressions

It turns out that NFAs are equivalent to regular expressions, in the sense that every regular expression describes a regular language and every regular language can be described by a regular expression. We break this result up into these two parts. The first part shows that the language described by a regular expression is regular.

Lemma 1 (Regular expressions yield regular languages)

If \(R\) is a regular expression and \(A = L(R)\), then \(A\) is a regular language.

The second part shows that if a language is regular, it can be written as a regular expression. Before showing this result we introduce the notion of generalised nondeterministic finite automata, which we will make use of in the proof.

Definition 8 (Generalised nondeterministic finite automaton)

A generalised nondeterministic finite automaton (GNFA) is a \(5\)-tuple \((Q, \Sigma, \delta, q_s, q_a)\), where

\(Q\) is a finite set of states,
\(\Sigma\) is a finite set of symbols,
\(\delta : (Q - \{q_a\}) \times (Q - \{q_s\}) \to \mathcal{R}\) is the transition function, where \(\mathcal{R}\) is the set of regular expressions over \(\Sigma\),
\(q_s \in Q\) is the initial state,
\(q_a \in Q\) is the accept state.

We say that the GNFA accepts a string \(w \in \Sigma^*\) if \(w = w_1 \dots w_k\), where each \(w_i \in \Sigma^*\), and there exists a sequence of states \(q_0, q_1, \dots, q_k\) such that \(q_0 = q_s\) is the start state, \(q_k = q_a\) is the accept state, and for each \(i = 1, \dots, k\) we have \(w_i \in L(R_i)\), where \(R_i = \delta(q_{i-1}, q_i)\).

In order to show that a regular language can be written as a regular expression, the main idea is to start from a GNFA that is equivalent to a DFA that recognises the language, and incrementally remove states from this GNFA, while introducing transitions, appropriately annotated by regular expressions, to make up for the removed states. This procedure is repeated until a single transition between two states is left in the GNFA. The regular expression of this transition has the same language as the DFA we started with.

Lemma 2 (Regular languages can be written as regular expressions)

If \(A\) is a regular language, then there exists a regular expression \(R\) such that \(A = L(R)\).
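The state-removal step described above can be written out explicitly. This is a sketch of the standard construction and not necessarily the exact form used in the proof; the name \(q_{\text{rip}}\) for the removed state and the labels \(R_1, \dots, R_4\) are notation introduced here. When a state \(q_{\text{rip}} \notin \{q_s, q_a\}\) is removed, every remaining pair of states \(q_i \neq q_a\) and \(q_j \neq q_s\) receives the new label

\[\delta'(q_i, q_j) = (R_1 \circ R_2^* \circ R_3) \cup R_4,\]

where \(R_1 = \delta(q_i, q_{\text{rip}})\), \(R_2 = \delta(q_{\text{rip}}, q_{\text{rip}})\), \(R_3 = \delta(q_{\text{rip}}, q_j)\) and \(R_4 = \delta(q_i, q_j)\). The first term accounts for every path from \(q_i\) to \(q_j\) that passes through \(q_{\text{rip}}\), looping on it any number of times, and \(R_4\) accounts for the direct transition. A transition that is absent from the original machine is labeled \(\emptyset\), so that it contributes no strings.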

Putting the above lemmas together, we obtain the result that regular languages are equivalent to regular expressions.

Corollary 1 (Regular languages \(\equiv\) regular expressions)

A language is regular if and only if it can be written as a regular expression.

Regular pumping lemma

So far we have a few results that help show that a language is regular. Now, we turn to a criterion that helps us show that a language is not regular, namely the regular pumping lemma. This result effectively shows that the expressive power of DFAs and NFAs is limited, and that if such a machine recognises an infinite language, then the strings of the language have to have repetitions of a particular kind.

Lemma 3 (Regular Pumping Lemma)

For any DFA \(M\), there exists a positive integer \(p\) such that any string \(s \in L(M)\) with \(|s| \geq p\) can be written as \(s = xyz\), where the following properties hold

\(|y| \geq 1\),
\(|xy| \leq p\),
\(xy^nz \in L(M)\) for all \(n \geq 0\).

We can use the regular pumping lemma to show that certain languages are not regular, such as the example considered earlier.

Example 5

Let \(A\) be the set of strings over \(\{0, 1\}\) which contain an equal number of zeros and ones. This language is not regular.
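The usual argument runs as follows. Suppose \(A\) were regular with pumping length \(p\), and take \(s = 0^p1^p \in A\). In any split \(s = xyz\) with \(|xy| \leq p\) and \(|y| \geq 1\), the part \(y\) consists only of \(0\)s, so \(xy^2z\) has more \(0\)s than \(1\)s and is not in \(A\), contradicting the lemma. The sketch below checks this claim by brute force for small values of \(p\). It illustrates the argument and does not replace the proof, and the function names are hypothetical.

```python
def in_a(w: str) -> bool:
    """Membership in A: equal numbers of 0s and 1s."""
    return w.count("0") == w.count("1")


def splits(s: str, p: int):
    """All (x, y, z) with s = xyz, |y| >= 1 and |xy| <= p."""
    for i in range(0, p):  # i = |x|
        for j in range(i + 1, p + 1):  # j = |xy|
            yield s[:i], s[i:j], s[j:]


def some_split_pumps(s: str, p: int, max_n: int = 3) -> bool:
    """True if some allowed split keeps x y^n z in A for n = 0..max_n."""
    return any(
        all(in_a(x + y * n + z) for n in range(max_n + 1))
        for x, y, z in splits(s, p)
    )
```

For \(s = 0^p1^p\) no split survives pumping, whereas a string such as \(0101\) with \(p = 2\) can be pumped using \(y = 01\). This is why the choice of \(s\) matters in a pumping argument.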