Twitter bots, twitter bots, twitter bots, folks

After the incredible success of @SomeHonMembers, I decided to create @HonSpeakerBot, which was not nearly as popular, but whatever.

The lack of any transcripts or any data made a bot for #TOpoli difficult. But now, there are two. You may have seen them around as I was testing them.

Mayor Robot Ford hates streetcars

This idea is based on a popular Japanese twitter bot by the name of @akari_daisuki. Akari takes a random Wikipedia article title or something from her timeline. Let’s call this thing $x$. Every fifteen minutes, she tweets the Japanese equivalent of “Yay $x$, Akari loves $x$”.

The idea behind @MayorRobotFord is similar. He takes a random Wikipedia article title or something from the #TOpoli stream and tweets, in his familiar way, “$x$, $x$, $x$, folks. We’re getting $x$.” or “People want $x$, folks. $x$, $x$.”, depending on how long $x$ is.
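Here's a rough sketch of how picking between the two forms could go. It isn't the bot's actual code: the function name, the 140-character cutoff as the deciding factor, and the choice to try the four-$x$ form first are all assumptions.

```python
MAX_TWEET_LEN = 140  # assumed cutoff: the tweet length limit at the time

# The two templates quoted above; "{x}" marks where the phrase goes.
LONG_FORM = "{x}, {x}, {x}, folks. We're getting {x}."
SHORT_FORM = "People want {x}, folks. {x}, {x}."


def make_tweet(x, max_len=MAX_TWEET_LEN):
    """Return the first template that fits within max_len, or None."""
    # Assumption: prefer the form that repeats x four times and fall back
    # to the three-repeat form when x is too long for it.
    for template in (LONG_FORM, SHORT_FORM):
        tweet = template.format(x=x)
        if len(tweet) <= max_len:
            return tweet
    return None  # x is too long for either template
```

With these templates, anything up to 28 characters gets the first form, anything up to 38 gets the second, and longer phrases are rejected.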

How does it work? Well, the Wikipedia portion is easy enough. We just get access to the API to grab a list of ten random pages. The more complicated part is pulling stuff off of #TOpoli.
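For the curious, grabbing random titles might look something like the sketch below. It uses the MediaWiki `list=random` query; the function names and the `User-Agent` string are made up for illustration, and the bot itself may well have done this differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"
USER_AGENT = "example-topoli-bot/0.1"  # hypothetical; use your own contact info


def random_titles_url(limit=10):
    """Build a MediaWiki query URL for `limit` random article titles."""
    params = {
        "action": "query",
        "list": "random",
        "rnnamespace": 0,  # main (article) namespace only
        "rnlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def parse_titles(payload):
    """Pull the titles out of a decoded list=random response."""
    return [page["title"] for page in payload["query"]["random"]]


def fetch_random_titles(limit=10):
    """Call the API over the network and return the list of titles."""
    request = Request(random_titles_url(limit), headers={"User-Agent": USER_AGENT})
    with urlopen(request) as response:
        return parse_titles(json.load(response))
```

Keeping the parsing separate from the network call makes it easy to check the parsing against a saved response without hitting Wikipedia.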

Since this is a twitter app, it’s not too hard to get a bunch of tweets off of #TOpoli. We just use the API and we’ve got the fifteen latest #TOpoli tweets ready to be used. The difficult part is extracting noun phrases, or NPs, which is where that graduate class on computational linguistics comes in handy.

So how do we do this? Well, first of all we have to identify the parts of speech in a given tweet. So we tokenize it first, splitting it up into words and punctuation. Then, we use a part-of-speech tagger to go through and figure out what each little piece is. The POS tagger that I used was the default one that came with the Natural Language Toolkit. Normally, you’d need to train a tagger on a corpus. This default one was trained on the Brown corpus, which is a body of text that was hand-tagged for training purposes.

So now our tweets are all tagged and we assume that they’re properly tagged. There are obviously going to be some slight errors here and there, but whatever, we want to make a twitter bot, so it’s not that important. But we only have our parts of speech. We want to be able to relate the different parts of speech into phrases. So we need some kind of parsing or chunking to put these pieces together into larger phrases that make sense.

For this, I used a bigram chunker trained on the conll2000 corpus. Like the Brown corpus for tagging, the conll2000 corpus is manually parsed and chunked for training purposes. A bigram chunker analyses every consecutive pair of words in a sentence to come up with a statistical model. It uses this to come up with the most likely NPs to arise from the sentence. We can then just pluck out all of the NPs the chunker identifies.
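To give a feel for what "plucking out NPs" means, here's a much simpler stand-in. It is not a bigram chunker and it isn't trained on anything: it just walks over already-tagged words and collects runs that look like "optional determiner, some adjectives, one or more nouns". The tag names are Penn Treebank style, which is an assumption, and the function name is made up.

```python
# Tags allowed inside a simple noun phrase (assumed Penn Treebank style).
DETERMINERS = {"DT", "PRP$"}
ADJECTIVES = {"JJ", "JJR", "JJS"}
NOUNS = {"NN", "NNS", "NNP", "NNPS"}


def noun_phrases(tagged):
    """Extract NPs from a list of (word, tag) pairs using a fixed tag pattern."""
    phrases, current, has_noun = [], [], False

    def flush():
        # Keep the candidate only if it actually contains a noun.
        nonlocal current, has_noun
        if has_noun:
            phrases.append(" ".join(current))
        current, has_noun = [], False

    for word, tag in tagged:
        if tag in NOUNS:
            current.append(word)
            has_noun = True
        elif tag in ADJECTIVES and not has_noun:
            current.append(word)
        elif tag in DETERMINERS:
            flush()
            current.append(word)  # a determiner starts a new candidate
        else:
            flush()
            if tag in ADJECTIVES:
                current.append(word)  # adjective after a noun starts the next one
    flush()
    return phrases
```

For example, `[("The", "DT"), ("mayor", "NN"), ("hates", "VBZ"), ("new", "JJ"), ("streetcars", "NNS")]` yields `["The mayor", "new streetcars"]`. A trained chunker handles far messier input than a fixed pattern like this can.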

Once we have all of our NPs, we stick them in a list with our Wikipedia titles and randomly select one to use in our tweet. The Wikipedia API has a limit of 10 titles per call and the twitter API grabs 15 tweets per call. Thus, with one NP per tweet, the chance of getting a Wikipedia title is at best 10 out of 25, or 2/5. However, that’s not taking into account removing entries that are too large. That quick calculation also assumes that there’s only one NP per tweet when there could be many, so in reality, the chance of grabbing something from #TOpoli is much higher, which might be for the best if you want weird accidental metacommentary.
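The back-of-the-envelope number is easy to play with. This little sketch assumes a uniform pick from the combined pool and the same number of NPs in every tweet, and it ignores the filtering of over-long entries; the function name is just for illustration.

```python
from fractions import Fraction


def wikipedia_share(n_titles=10, n_tweets=15, nps_per_tweet=1):
    """Chance that a uniformly random pick from the pool is a Wikipedia title."""
    n_phrases = n_tweets * nps_per_tweet
    return Fraction(n_titles, n_titles + n_phrases)
```

With the defaults this gives 2/5, and with three NPs per tweet it drops to 2/11.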

The Core Service Review

One day, I decided to look through the City of Toronto’s open data catalogue and happened upon an interesting entry called the Core Service Review Qualitative data.

Lo and behold, it was exactly that.

After some fiddling around with an Excel module for Python and figuring out how to split tweets that are larger than 140 characters, I let it go.
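Splitting a long response into tweet-sized pieces can be done in a few lines. This is one possible approach using the standard library's `textwrap`, not necessarily what the bot does; the function name and the choice to break on word boundaries are assumptions.

```python
import textwrap


def split_for_twitter(text, limit=140):
    """Break text into chunks of at most `limit` characters, on word boundaries."""
    # textwrap.wrap collapses runs of whitespace and, by default, breaks any
    # single word longer than `limit`, so no chunk exceeds the limit.
    return textwrap.wrap(text, width=limit)
```

Short answers like “transit” come back as a single chunk, and a lengthy response comes back as a list to be tweeted in order.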

@TOCoreService will tweet one entry, randomly chosen, from the 12000 submissions, or close to 58000 answers. These range from short answers like “transit” or “taxes” to fairly lengthy responses.

So what’s the point of this bot? Well, the data is up there for anyone to read, which is nice for transparency and engagement. Of course, whether anyone who’s not city staff would want to read 13000 responses is another matter. But here, we have a pretty decent collection of opinions on what our priorities should be from real citizens. It’d be a shame if the only people who read them were city staff.

Some hon. members: Oh, oh!

A while ago, someone said something on twitter and replied with “Some hon. members: Oh, oh!” For some reason (probably because I am a gigantic nerd), I thought this was hilarious and looked up what other interesting tidbits or conventions were transcribed into Hansard.

If interjections give rise to a call for order by the Speaker, they are reported as “Some hon. members: Oh, oh!”
Hansard — Wikipedia

For my computational linguistics project I wanted to play around with Hansard as a corpus. I used the database, which has all the statements made in the House that are available online (so since 1993) and has some convenient metadata. So I proceeded to dump the statements in the database, categorized by stuff like party or whether it was the Speaker speaking. Each statement has some metadata indicating who spoke, and the database has a ton of information about each Member of Parliament that’s scraped from PARLINFO.

While I was doing this, I remembered the “Some hon. members” stuff and wondered whether they had an id so I could dump all of those statements out. It turns out that statements by “Some hon. members” or “An hon. member” aren’t linked to a particular member or politician, even a placeholder one. That’s okay, since it was possible to grab all of that stuff with a query on the name instead of an indexed id.

Now I have all of these statements sitting around without context, so the obvious thing to do is to make a twitter bot.

How it actually works isn’t complicated at all. Everything just sits in a giant text file and a script pulls a line from the file at random once every hour. Since the vast majority of things that Some hon. members say are things like voting (yea or nay or agreed) or interjections (oh, oh or hear, hear), that’s what’ll show up most of the time.
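In sketch form, the whole thing is about this much code. The function name and the one-statement-per-line file layout are assumptions, and the hourly scheduling (something like cron would do) is left out.

```python
import random


def pick_line(path, rng=random):
    """Return one non-empty line chosen uniformly at random from the file."""
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    return rng.choice(lines)
```

Because every line is equally likely, the frequent statements (the votes and the interjections) dominate simply by appearing in the file more often.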

I’ve also included things that An hon. member will say, so occasionally, there will be random heckling showing up. Because, you know, non sequiturs on twitter are hilarious. These are sometimes longer, so I made it randomly pull out a chunk of the statement, which has questionable results.
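One way to pull a random chunk out of a longer statement is to pick a random starting word and keep adding words until the tweet is full. This is a guess at the general idea rather than the bot's actual logic; the function name and the 140-character limit are assumptions.

```python
import random


def random_chunk(statement, limit=140, rng=random):
    """Return a random run of consecutive words that fits within `limit`."""
    words = statement.split()
    if not words:
        return ""
    start = rng.randrange(len(words))
    chunk = words[start][:limit]  # truncate a single over-long word
    for word in words[start + 1:]:
        if len(chunk) + 1 + len(word) > limit:
            break
        chunk += " " + word
    return chunk
```

Since the start is an arbitrary word rather than a sentence boundary, the output often begins mid-thought, which fits the "questionable results" description.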

To be honest, I wanted to do something for Toronto City Council at first, which was why I asked around #TOpoli for something Hansard-like for council. Unfortunately, that doesn’t exist, so all of the amazing possibilities for council bots will go unrealized. On the other hand, there are a few more ideas I have for all of this Hansard stuff. And of course, there’s my actual project to hopefully look forward to as well.

Anyway, I’m glad people are enjoying it.