Math and science::Theory of Computation

Regular expressions

Regular expressions can be seen as a parallel to arithmetic expressions.

\( (5 + 3) \times 4 \) is an expression whose symbols represent numbers (e.g. natural numbers) and as a whole represents another number, 32.
\( (0 \cup 1)0^\star \) is an expression involving languages (sets of strings) and as a whole represents another language, the set of all strings which start with a 0 or 1 and is followed by any number of 0s.

A regular expression \( R \) is a sequence of symbols that is one of:

a symbol \( a \) for some \( a \) in an alphabet \( \Sigma \)
\( \varepsilon \)
\( \emptyset \)
\( (R_1 \cup R_2) \), where \( R_1 \) and \( R_2 \) are regular expressions
\( (R_1 \circ R_2) \), where \( R_1 \) and \( R_2 \) are regular expressions
\( (R_1^*) \), where \( R_1 \) is a regular expression

A regular expression is said to represent a language according to the rules below. We write \( L(R) \) to mean the language of the regular expression \( R \).

From item 1, the regular expression \( a \) represents the language \( \{a\} \).
From item 2, the regular expression \( \varepsilon \) represents the language \( \{ \varepsilon \} \).
From item 3, the regular expression \( \emptyset \) represents the empty language, \( \{\} \).
From item 4 & 5, the regular expression \( ( R_1 \cup R_2) \) and \( (R_1 \circ R_2) \) represent the languages obtained by taking the union and concatenation of the languages represented by \( R_1 \) and \( R_2 \).
From item 6, the regular expression \( (R_1^\star) \) represents the star of the language represented by \( R_1 \).

We say that two regular expressions \( R_1 \) and \( R_2 \) are equal if they represent the same language, and we write \( R_1 = R_2 \).

This definition is lacking in how it describes 'represents'. Maybe it is a function. I think it's coverage of equality is just sufficient, as it delegates the well defined formulation of set equality.

\( \varepsilon \) vs \( \emptyset \)

The regular expression \( \varepsilon \) represents the language containing a single string, the empty string, \( \{\varepsilon\} \). The regular expression \( \emptyset \) represents the empty language, \( \{\} \).

Precedence order

Star has highest precedence, followed by concatenation, with union last.

\( R^+ \)Shorthand

\( R^+ \) is used as convenient shorthand for \( RR^* \); thus, \( R^* = R^+ \cup \epsilon \).

Two identities

If \( R \) is any regular expression then:

\[ R \cup \emptyset = R \]

\[ R \circ \varepsilon = R \]

[Note: Shouldn't there be an extra rule for creating a regular expression from a set? For example, how is \( \Sigma \) allowed in the regular expressions given as an examples below?]

Example

The regular expressions here represent the languages below.

\( 0^*10^* \)
\( \Sigma^*1\Sigma^* \)
\( \Sigma^*001\Sigma^* \)
\( 1^*(01^+)^*\)
\( (\Sigma \Sigma)^* \)
\( (\Sigma \Sigma \Sigma)^* \)
\( 01 \cup 10 = \{01, 10\} \)
\( 0\Sigma^*0 \cup 1\Sigma^*1 \cup 0 \cup 1 \)
\( (0 \cup \varepsilon)1^* = 01^* \cup 1^* \)
\( (0 \cup \varepsilon)(1 \cup \varepsilon) \)
\( 1^*\emptyset \)
\( \emptyset^* \)

\( \{w | w \text{ contains a single 1} \} \)
\( \{w | w \text{ has at least one 1} \} \)
\( \{w | w \text{ contains the string 001 as a subsequence} \} \)
\( \{w | \text{ every 0 in } w \text{ is followed by at least one 1} \} \)
\( \{w | w \text{ is a string of even length} \} \)
\( \{w | \text{ the length of } w \text{ is a multiple of 3} \} \)
\( \{01, 10 \} \)
\( \{w | w \text{ starts and ends with the same symbol} \} \)
\( 01^* \cup 1^* \)
\( \{\varepsilon, 0, 1, 01 \} \)
\( \emptyset \)
\( \{\varepsilon\} \)

Source

Introduction to the Theory of Computation, Sipser

p64-65