Some hon. members: Oh, oh!

A while ago, someone said something on twitter and replied with “Some hon. members: Oh, oh!” For some reason (probably because I am a gigantic nerd, I thought this was hilarious and looked up what other interesting tidbits or convention were transcribed into Hansard.

If interjections give rise to a call for order by the Speaker, they are reported as “Some hon. members: Oh, oh!”
Hansard — Wikipedia

For my computational linguistics project I wanted to play around with Hansard as a corpus. I used the database, which has all the statements made in the House that are available online (so since 1993) and has some convenient metadata. So I proceeded to dump the statements in the database, categorized by stuff like party or whether it was the Speaker speaking. Each statement had some metadata indicating who spoke and has a ton of information about each Member of Parliament that’s scraped from PARLINFO.

While I was doing this, I remembered the “Some hon. members” stuff and wondered whether they had an id so I could dump all of those statements out. It turns out that statements by “Some hon. members” or “An hon. member” aren’t linked to a particular member or politician, even a placeholder one. That’s okay, since it was possible to grab all of that stuff with a query on the name instead of an indexed id.

Now I have all of these statements sitting around without context, so the obvious thing to do is to make a twitter bot.

How it actually works isn’t complicated at all. Everything just sits in a giant text file and a script pulls a line from the file at random once every hour. Since the vast majority of things that Some hon. members say are things like voting (yea or nay or agreed) or interjections (oh, oh or hear, hear), that’s what’ll show up most of the time.

I’ve also included things that An hon. member will say, so occasionally, there will be random heckling showing up. Because, you know, non sequiturs on twitter are hilarious. These are sometimes longer, so I made it randomly pull out a chunk of the statement, which has questionable results.

To be honest, I wanted to do something for Toronto City Council at first, which was why I asked around #TOpoli for something Hansard-like for council. Unfortunately, that doesn’t exist, so unfortunately, all of the amazing possibilities for council bots will go unrealized. On the other hand, there are a few more ideas I have for all of this Hansard stuff. And of course, there’s my actual project to hopefully look forward to as well.

Anyway, I’m glad people are enjoying it.

About that Mighty Middle

Back in fall, I had to do a project for my social network theory course. Not too long before that, I saw a neat post on CalgaryGrit that talked about how to identify similarly aligned groups and councillors on Toronto City Council based on their voting records. With the help of the guy who actually did that analysis and his scripts, I was able to compile my own data from Toronto’s council voting records which were conveniently offered as part of the city’s open data.

And here’s what came out:

This is a graph of Toronto City Council, where edges between two councillors means that they have a high degree of similarity in their voting records. The period considered is from the beginning of the term, December 1, 2010 up until September 30, 2011, which was the last day of voting data that I had available before I had to start work on the project. An edge means that the two councillors voted the same way at least 90%. An edge is coloured blue if that goes up to 92.5% and it’s coloured green if it’s at least 95%. Or you can think of those numbers as 10%, 7.5%, and 5% different. Same thing.

It’s important to note that the edges aren’t weighted. This means that the physical distance of the councillors doesn’t actually mean anything. That Rob Ford is drawn at the top right corner is a coincidence and it doesn’t mean that Denzil Minnan-Wong is more similar to the mayor than Doug Ford. Similarly, it doesn’t mean that Paula Fletcher is necessarily the councillor least similar to the mayor just because she’s the furthest away in the drawing.

Of course, the drawing is oriented such that you can make less fine-grained observations. It’s pretty easy to see which councillors are the right or left wing and which ones reside in the middle. We can also see that the right wing of council votes together a lot more than the left wing. We can even kind of identify which councillors are most likely to break with their respective groups.

In order to get more detailed analysis, we need to look at actual numbers. There are a bunch of graph theoretic metrics that we can use together with social network theory to talk about how certain groups or councillors might act. I get into that sort of stuff in my writeup. Of course, a lot of that is handwaving since I don’t really get hardcore into stats and a lot of it is political analysis, which I’m kind of an amateur at.

Interestingly, I concluded my writeup by saying something like how, based on all of the analysis and numbers, the Mighty Middle didn’t really show any signs of life. And even if it did, it wouldn’t get very far since the number of votes Ford had on his side was pretty stable and was at the point where he’d only need one or two votes.

Well, council sure showed me. One of the first things that happened in 2012 was the Mighty Middle that I said didn’t really exist covertly orchestrated enough votes to reverse a bunch of the mayor’s cuts. So what does this mean?

Well, the most obvious lesson is that you can’t predict what’ll happen in volatile political scenarios based just on data. Obviously, the situation seemed stable, both from the perspective of the data and people watching city hall. I mean, the whole budget thing surprised everyone.

Of course, the mistake there was to assume we were in stable situation when we weren’t. Yeah, the mayor had a lot of safe votes, but he was still missing one or two. Everyone’s been treating this as the mayor having an unfortunate stranglehold on council even though in a parliamentary setting, missing one or two votes is still enough to throw everything into uncertainty (see Queen’s Park). Being a non-partisan legislature, this volatility should’ve been more expected. But that tiny gap managed to fool everyone, even though all it’d take is one or two councillors to ruin everything for the mayor.

Interestingly enough, the Mighty Middle councillors did do what I said they’d be good at. They were able to use their position to broker a compromise that managed to peel off enough votes from the mayor. When they did combine their powers, they were really effective. But also notice that the coalition they forged was still really fragile. Even on the day of the vote, it wasn’t completely slam dunk for the opposition until the vote was done. All it would’ve taken to ruin everything was one or two councillors, which is something I did note in my writeup. Of course, like I said earlier, this caveat and margin of error holds for the mayor’s side as well.

And here’s where numbers aren’t enough to capture everything. Once someone decides to rock the boat, as the Mighty Middle did in late January, it becomes easier and easier. All of a sudden, the mayor’s grip on council doesn’t seem that bad and in February, we have the TTC chair and (former) Ford ally Karen Stintz turn rally council to overturn the mayor’s transit plan.

This is obviously a huge vote. Unfortunately, my model doesn’t weigh votes according to their importance, so this is treated the same as any other banal vote like voting on extensions of speaking time or something. I did throw in some stuff about possible ways of classifying votes, but that’s already hard to do in an objective way. Figuring out how to weigh the importance of votes is probably even harder to do automatically.

None of that changes that I was way off in my conclusion. Of course, I’m thrilled at what’s been happening at city hall, so I guess I should be happy that all of this happened right after I finalized my writeup and got marked on it.

If you’re interested in reading about how wrong I was, here’s the writeup.

State complexity of regular languages

A few months into the start of my graduate studies, my thesis advisor, Dr. Sheng Yu, suddenly and unexpectedly passed away. It was shocking more than anything, since it was only a few days before that I attended one of his classes and we had agreed on a date for me to give the presentation to complete my reading course that I’d been dragging my feet on. The reading course was basically a bunch of papers that formed the basis for what my research was going to be on.

What I’d intended to work on was finding new results in state complexity. Sheng did a ton of work in state complexity and having him as my advisor was the reason I came to Western. The papers that I read for the reading course were a bunch of his papers on state complexity, ranging from his 1994 paper with Zhuang and Salomaa and covering up to one of his latest ones on state complexity approximation with Gao in 2009.

So what is state complexity? Well, it’s a descriptional complexity measure for regular languages. Essentially, it’s defined as the number of states in the minimal deterministic finite automata that accepts that language.

Let’s start from the beginning. In formal language theory, we’re concerned with words and languages. We make words out of an alphabet. An alphabet is just a set of symbols that we can use. So we can have our alphabet be $\{a,b,c\}$ or $\{0,1\}$ or even $\{bla, blar, blargaagh\}$. A word is just any string made up of symbols from your alphabet. So $abcbaca$ is a word from our first alphabet, $010101111$ is a word from our second, and $blablablarblablargaaghbla$ is a word from our third.

Languages are just subsets of the words that we can make out of an alphabet. So if we have our alphabet $\Sigma=\{0,1\}$, maybe we want our language to be the set of all words that have an even number of $1$s. Or maybe we want a language where we have an equal number of $0$s and $1$s or where we always have twice as many $0$s as $1$s. Or maybe we just want our language to be $\{0,1,100,0001011\}$.

So now that we have these languages, we want to know which words are in our language. That’s pretty easy for something like $\{0,1,100,0001011\}$, since we can just check it against every word in our language. But what about something more complicated, like requiring an even number of $1$s?

Here’s where we come up with theoretical machines that do this. These theoretical machines are essentially the theoretical models that eventually led to real computers. You may have heard of Turing machines. Well, these aren’t real machines (not that that stopped this guy from building one), but are just mathematical structures that we build out of sets and functions.

Anyhow, the particular machine we’re concerned with is the deterministic finite automata. The idea behind this machine is that it reads in a word that you give it, one letter at a time. Depending on which letter it sees and which state the machine is in, it’ll go to a different state. It keeps doing this until it’s read the entire word. If the machine is in an accepting state when it’s finished with the word, then the word is in the language that’s recognized by the machine.

These machines are defined mathematically as follows: a deterministic finite automata is a 5-tuple $(Q,\Sigma,\delta,q_0,F)$, where $Q$ is a finite set of states, $\Sigma$ is an alphabet, $\delta$ is a transition function $\delta:Q\times\Sigma\to Q$ that moves us to another state depending on the current state and letter that’s read, $q_0$ is the start state, and $F$ is a subset of states from $Q$ that denote the accepting states.

DFAs aren’t the only kind of machine out there, there are tons of them. But DFAs have a special property, which is that they only accept regular languages. Regular languages are a special class of languages that are generated by regular expressions (of course). I won’t get into those, but the most common usage for regular expressions is for pattern matching. This use is one of the main reasons why we’re still concerned with DFAs even though they’re the simplest and least powerful of our theoretical computation models.

Anyhow, this regular language and DFA correspondence is why we talk about state complexity of regular languages and go on to talk about DFAs. Every regular language has a DFA that’ll accept it and every DFA accepts a regular language. Of course, when we talk about state complexity we’d like to talk about the DFA with the least number of states. That’s not just because we’d like a lower bound on the number of states. It turns out that every regular language has an infinite number of DFAs that can accept it. However, each regular language only has one, unique minimal DFA.

So why does state complexity matter? That’s a pretty good question, because for the first few decades of automata and formal language research, it wasn’t something that concerned computer scientists very much. In fact, it was Sheng (with Zhuang and Salomaa) who kicked off modern state complexity research in 1994 with the paper The state complexities of some basic operations on regular languages. That paper focuses on operational state complexity.

When we talk about the state complexity of an operation, we talk about the state complexity of the language that’s created from the operation. Since languages are just sets, we can do the usual set operations on them and all sorts of other operations. So we express the state complexity of the operation in terms of the state complexity of the languages that we started out with before the operation.

Anyhow, the reason it took a few decades to come up again was basically a lack of motivation in terms of practical applications. The uses for finite automata in the early decades of computer science research were for things like pattern matching and lexical analysis in compilers. Large and complicated finite automata weren’t something that caused a lot of worry and even if they did exist, they couldn’t be used simply because there wasn’t the computing power for it. This also made it hard to prove some state complexity bounds, since some of these operations could cause the number of states to grow exponentially.

All of these problems disappeared with more available computing power. With more computing power, we started to see more applications for finite automata that depended on huge and complex automata in areas like artificial intelligence or computational linguistics. More computing power also led to the development of software tools for manipulating automata. This was a huge improvement over writing and checking things by hand.

Since then, there’s been a ton of state complexity research. Almost any operation you can think of has a state complexity bound proved for it. There’s been a ton of research in operations on restricted classes of regular languages, like finite languages or regular languages over one-letter alphabets. There’s also been research into nondeterministic state complexity and other similar descriptional complexity measures for NFAs.

If you want some (better) summaries of state complexity research, there’s State Complexity: Recent Results and Open Problems and State Compexity Research and Approximation.


One of the things that I’ve always tried to do is maintain a single identity on the internet. Ever since I settled on my long term internet name, I’d been adamant about keeping everything together, even to the point of associating my real name to it. My old blog, contains everything I’ve written ever since I started the whole blogging thing on LiveJournal all the way back in 2004. It’s fascinating to go through all of that crap and see what I wrote when I was a dumb high school kid, moving on to university, and now, in its current iteration where I throw a few hundred words on whatever interests me at the moment.

But over the last year or two, there’ve been things that I’ve wanted to express that I’d hesitated to throw up there. Even though I told myself otherwise, it really felt like the blog had become a single-topic blog. It didn’t really feel right to intrude with more serious stuff anymore. I mean, I did occasionally throw up a post about Canadian politics or something, but that seemed okay because it seemed like people were at least kind of interested in what I had to say.

The problem was that I now had an audience that I was writing for and I had another audience that I also wanted to write for.

It’s kind of the same problem that Google+ tries to solve with its circles. Unfortunately for Google+, I’d already partitioned my online social interactions: Facebook was for people I know in real life and twitter was for people I didn’t. Easy. But the thing is that if real life friends wanted to follow me on twitter, I didn’t stop them. Maybe I should have, but I feel like if they decide to follow me, then they should get the same online experience from me that everyone else does.

And I think this is how I’ll go about with this blogging thing. My old blog has too much cruft in it, but I don’t want to completely throw it away, because as much as some parts might make me want to disavow it, it’s still me. It’s like there’s a very tiny wall. On this side, is this pristine new blog that I’ll get to cultivate. On the other side is everything I’ve gathered over the last eight years. I’ll keep working on that stuff too. What’s important is that you’ll know which side of the wall you’re on. But since the wall’s not that tall, you can hop over if curiosity ever gets the best of you.