Making NFAs smaller

So, you might be wondering what I’m working on now. Basically, I’m looking at algorithms motivated by the problem of reducing nondeterministic finite automata (NFAs).

The problem of NFA reduction is basically to take an NFA and make an equivalent NFA with fewer states. The natural question to ask is whether we can find an equivalent NFA with the minimal number of states, which is the NFA minimization problem. There are some good reasons that we focus on the easier problem. The first is that unlike DFAs, there is no unique minimal NFA. The second reason is that NFA minimization is super hard, as in PSPACE-complete (if you know how NP is “hard”, well, PSPACE is “harder” or “really friggin hard”).

Well, okay, but what if we’re not so worried about getting it as small as possible and we just want to get the damned thing smaller? If we can’t guarantee that they’re as small as we can make them, is there some kind of measurement to tell us how far away we are? Or are there special cases of NFAs where we can get the smallest one? And it turns out the answer is probably almost always no. Of course, that won’t stop us, because reducing the number of states in an NFA is pretty useful.

There are a number of ways to do this and it’s not known which way is the best; otherwise, we could work on approximation or minimization instead. A lot of solutions take the approach of turning the NFA into a DFA, minimizing it, and then turning it back into an NFA and seeing what we get. This is nice, because we know there’s a smallest DFA and it’s not too hard to find. The problem with this is that the NFA-to-DFA transformation can blow up the number of states exponentially in the worst case. We’d like to avoid this.
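To make that blow-up concrete, here’s a minimal sketch of the subset construction, the usual NFA-to-DFA step (the minimization and conversion back to an NFA are left out). The representation of the NFA as a dictionary keyed by (state, letter) pairs is just my own choice for illustration.

```python
from itertools import chain

def determinize(start, alphabet, delta, finals):
    """Subset construction: each DFA state is a frozenset of NFA states.
    In the worst case this produces 2^n DFA states for an n-state NFA."""
    start_set = frozenset([start])
    seen = {start_set}
    work = [start_set]
    dfa_delta = {}
    while work:
        S = work.pop()
        for a in alphabet:
            # The DFA successor of S on a is the union of NFA successors.
            T = frozenset(chain.from_iterable(delta.get((q, a), set()) for q in S))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                work.append(T)
    dfa_finals = {S for S in seen if S & finals}
    return start_set, seen, dfa_delta, dfa_finals
```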

The idea is to try finding states that are superfluous. What do we mean by superfluous? Well, if I have a word $ababa$ and I can end up in either a state $p$ or a state $q$, then we probably don’t need two different states. Similarly, if we start in a state $p$ or a state $q$ and use up the word $aaaba$ to get to a final state, we might not have needed two different states for that. This is essentially the kind of thing we do in DFA minimization, but in that case it’s a lot easier because the state transitions are deterministic.

What we’d do formally is define a relation that captures this idea. For an NFA $N=(Q,\Sigma,\delta,q_0,F)$, we’d define some relation $R$ that satisfies the following conditions:

  1. $R\cap(F\times(Q-F))=\emptyset$
  2. $\forall p,q\in Q, \forall a\in \Sigma, (pRq \implies \forall q'\in\delta(q,a), \exists p'\in \delta(p,a), q'Rp')$.

Condition 1 rules out final states being related with non-final states. Condition 2 tells us that if two states $p$ and $q$ are related, then every state that $q$ has a transition to will be related to some state that $p$ transitions to on the same letter. As long as these two conditions hold, we can define any crazy old relation to work with and use it to gather related states together.
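As a rough illustration, here’s one way you might compute the largest relation satisfying the two conditions: start with every pair allowed by condition 1 and keep throwing out pairs that violate condition 2 until nothing changes. The code and the NFA representation are my own sketch, not any particular paper’s algorithm.

```python
def largest_relation(states, alphabet, delta, finals):
    """Greatest relation R avoiding F x (Q - F) such that whenever p R q,
    every a-successor q' of q satisfies q' R p' for some a-successor p' of p."""
    # Condition 1: never relate a final state to a non-final state.
    R = {(p, q) for p in states for q in states
         if not (p in finals and q not in finals)}
    changed = True
    while changed:
        changed = False
        for (p, q) in list(R):
            # Condition 2 check for the pair (p, q).
            ok = all(
                any((q2, p2) in R for p2 in delta.get((p, a), set()))
                for a in alphabet
                for q2 in delta.get((q, a), set())
            )
            if not ok:
                R.discard((p, q))
                changed = True
    return R

# Hypothetical three-state NFA over {a, b}, just to exercise the function.
delta = {(0, "a"): {1, 2}, (1, "b"): {2}, (2, "b"): {2}}
print(largest_relation({0, 1, 2}, {"a", "b"}, delta, finals={2}))
```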

This is essentially what we do when we want to minimize DFAs. The difference is that it’s a lot easier to define this relation because transitions only have one outcome in the deterministic case. In fact, there’s a specific relation (defined on words rather than states), called the Myhill-Nerode equivalence relation, which has the property that $L$ is a regular language if and only if the Myhill-Nerode relation for $L$ has a finite number of equivalence classes. And these equivalence classes turn out to correspond to the states in the minimal DFA for $L$.
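Just to make the comparison concrete, here’s a rough sketch of the deterministic version of that computation: partition refinement that starts from the split into final and non-final states and keeps splitting blocks until no letter tells two states in the same block apart. The representation (a complete transition function stored as a dict, all states reachable) is just my own choice.

```python
def minimize_blocks(states, alphabet, delta, finals):
    """Moore-style partition refinement. The final blocks correspond to the
    states of the minimal DFA, assuming every state is reachable and
    delta[(state, letter)] is defined for every state and letter."""
    partition = [B for B in (finals, states - finals) if B]

    def signature(q):
        # Which block does each letter send q to?
        return tuple(
            next(i for i, B in enumerate(partition) if delta[(q, a)] in B)
            for a in sorted(alphabet)
        )

    while True:
        refined = []
        for B in partition:
            groups = {}
            for q in B:
                groups.setdefault(signature(q), set()).add(q)
            refined.extend(groups.values())
        if len(refined) == len(partition):
            return refined
        partition = refined
```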

However, as mentioned before, we have no such luck with NFAs. We have a lot of relations we can choose from, and it’s not clear which one of them is best, or which combination of them, or how to iterate them, to get the best result. This is where the hard part is: figuring out which of these things gets us the smallest NFA in the most efficient way possible.

State complexity of regular languages

A few months into the start of my graduate studies, my thesis advisor, Dr. Sheng Yu, suddenly and unexpectedly passed away. It was shocking more than anything, since it was only a few days before that I attended one of his classes and we had agreed on a date for me to give the presentation to complete my reading course that I’d been dragging my feet on. The reading course was basically a bunch of papers that formed the basis for what my research was going to be on.

What I’d intended to work on was finding new results in state complexity. Sheng did a ton of work in state complexity and having him as my advisor was the reason I came to Western. The papers that I read for the reading course were a bunch of his papers on state complexity, ranging from his 1994 paper with Zhuang and Salomaa up to one of his latest ones, on state complexity approximation with Gao in 2009.

So what is state complexity? Well, it’s a descriptional complexity measure for regular languages. Essentially, it’s defined as the number of states in the minimal deterministic finite automaton that accepts the language.

Let’s start from the beginning. In formal language theory, we’re concerned with words and languages. We make words out of an alphabet. An alphabet is just a set of symbols that we can use. So we can have our alphabet be $\{a,b,c\}$ or $\{0,1\}$ or even $\{bla, blar, blargaagh\}$. A word is just any string made up of symbols from your alphabet. So $abcbaca$ is a word from our first alphabet, $010101111$ is a word from our second, and $blablablarblablargaaghbla$ is a word from our third.

Languages are just subsets of the words that we can make out of an alphabet. So if we have our alphabet $\Sigma=\{0,1\}$, maybe we want our language to be the set of all words that have an even number of $1$s. Or maybe we want a language where we have an equal number of $0$s and $1$s or where we always have twice as many $0$s as $1$s. Or maybe we just want our language to be $\{0,1,100,0001011\}$.

So now that we have these languages, we want to know which words are in our language. That’s pretty easy for something like $\{0,1,100,0001011\}$, since we can just check it against every word in our language. But what about something more complicated, like requiring an even number of $1$s?

Here’s where we come up with theoretical machines that do this. These theoretical machines are essentially the theoretical models that eventually led to real computers. You may have heard of Turing machines. Well, these aren’t real machines (not that that stopped this guy from building one), but are just mathematical structures that we build out of sets and functions.

Anyhow, the particular machine we’re concerned with is the deterministic finite automaton. The idea behind this machine is that it reads in a word that you give it, one letter at a time. Depending on which letter it sees and which state the machine is in, it’ll go to a different state. It keeps doing this until it’s read the entire word. If the machine is in an accepting state when it’s finished with the word, then the word is in the language that’s recognized by the machine.

These machines are defined mathematically as follows: a deterministic finite automaton is a 5-tuple $(Q,\Sigma,\delta,q_0,F)$, where $Q$ is a finite set of states, $\Sigma$ is an alphabet, $\delta$ is a transition function $\delta:Q\times\Sigma\to Q$ that moves us to another state depending on the current state and letter that’s read, $q_0$ is the start state, and $F$ is a subset of states from $Q$ that denote the accepting states.
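To connect this back to the earlier “even number of $1$s” example, here’s a tiny sketch of that DFA and a function that runs it on a word; the two-state encoding and the names are my own.

```python
def accepts(delta, start, finals, word):
    """Run a DFA: follow delta one letter at a time and check whether we
    end in an accepting state."""
    q = start
    for letter in word:
        q = delta[(q, letter)]
    return q in finals

# DFA for "an even number of 1s" over {0, 1}: two states, "even" accepting.
delta = {("even", "0"): "even", ("even", "1"): "odd",
         ("odd", "0"): "odd", ("odd", "1"): "even"}

print(accepts(delta, "even", {"even"}, "010101111"))  # True: six 1s, which is even
```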

DFAs aren’t the only kind of machine out there; there are tons of them. But DFAs have a special property, which is that they accept exactly the regular languages. Regular languages are a special class of languages that are generated by regular expressions (of course). I won’t get into those, but the most common usage for regular expressions is for pattern matching. This use is one of the main reasons why we’re still concerned with DFAs even though they’re the simplest and least powerful of our theoretical computation models.

Anyhow, this regular language and DFA correspondence is why we talk about state complexity of regular languages and then go on to talk about DFAs. Every regular language has a DFA that’ll accept it and every DFA accepts a regular language. Of course, when we talk about state complexity we’d like to talk about the DFA with the least number of states. That’s not just because we’d like a lower bound on the number of states. It turns out that every regular language has an infinite number of DFAs that can accept it. However, each regular language has only one, unique minimal DFA (up to renaming the states).

So why does state complexity matter? That’s a pretty good question, because for the first few decades of automata and formal language research, it wasn’t something that concerned computer scientists very much. In fact, it was Sheng (with Zhuang and Salomaa) who kicked off modern state complexity research in 1994 with the paper The state complexities of some basic operations on regular languages. That paper focuses on operational state complexity.

When we talk about the state complexity of an operation, we talk about the state complexity of the language that’s created from the operation. Since languages are just sets, we can do the usual set operations on them and all sorts of other operations. So we express the state complexity of the operation in terms of the state complexity of the languages that we started out with before the operation.
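For a concrete example, take intersection: if one language needs an $m$-state DFA and the other an $n$-state DFA, the product construction sketched below accepts their intersection with at most $mn$ states, and that $mn$ bound is the kind of operational state complexity result this line of work pins down (it turns out to be tight in general). The tuple layout of the DFAs here is just my own convention.

```python
def product_dfa(dfa1, dfa2):
    """Product construction for intersection: states are pairs of states,
    so an m-state DFA and an n-state DFA give at most m*n states.
    Each DFA is (states, alphabet, delta, start, finals) over a shared alphabet."""
    states1, alphabet, delta1, start1, finals1 = dfa1
    states2, _, delta2, start2, finals2 = dfa2
    states = {(p, q) for p in states1 for q in states2}
    delta = {((p, q), a): (delta1[(p, a)], delta2[(q, a)])
             for (p, q) in states for a in alphabet}
    finals = {(p, q) for (p, q) in states if p in finals1 and q in finals2}
    return states, alphabet, delta, (start1, start2), finals
```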

Anyhow, the reason it took a few decades to come up again was basically a lack of motivation in terms of practical applications. The uses for finite automata in the early decades of computer science research were for things like pattern matching and lexical analysis in compilers. Large and complicated finite automata weren’t something that caused a lot of worry and even if they did exist, they couldn’t be used simply because there wasn’t the computing power for it. This also made it hard to prove some state complexity bounds, since some of these operations could cause the number of states to grow exponentially.

All of these problems disappeared with more available computing power. With more computing power, we started to see more applications for finite automata that depended on huge and complex automata in areas like artificial intelligence or computational linguistics. More computing power also led to the development of software tools for manipulating automata. This was a huge improvement over writing and checking things by hand.

Since then, there’s been a ton of state complexity research. Almost any operation you can think of has a state complexity bound proved for it. There’s been a ton of research in operations on restricted classes of regular languages, like finite languages or regular languages over one-letter alphabets. There’s also been research into nondeterministic state complexity and other similar descriptional complexity measures for NFAs.

If you want some (better) summaries of state complexity research, there’s State Complexity: Recent Results and Open Problems and State Complexity Research and Approximation.