Twitter bots, twitter bots, twitter bots, folks

After the incredible success of @SomeHonMembers, I decided to create @HonSpeakerBot, which was not nearly as popular, but whatever.

The lack of any transcripts or any data made a bot for #TOpoli difficult. But now, there are two. You may have seen them around as I was testing them.

Mayor Robot Ford hates streetcars

This idea is based on a popular Japanese twitter bot by the name of @akari_daisuki. Akari takes a random Wikipedia article title or something from her timeline. Let’s call this thing $x$. Every fifteen minutes, she tweets the Japanese equivalent of “Yay $x$, Akari loves $x$”.

The idea behind @MayorRobotFord is similar. He takes a random Wikipedia article title or something from the #TOpoli stream and tweets, in his familiar way, “$x$, $x$, $x$, folks. We’re getting $x$.” or “People want $x$, folks. $x$, $x$.”, depending on how long $x$ is.
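Here's a rough sketch of how picking between the two templates could go. The rule itself is my guess for illustration: try the four-$x$ version first and fall back to the three-$x$ version if the result would blow past 140 characters. The name `ford_tweet` and the fallback order are hypothetical, not necessarily what the bot does.

```python
TWEET_LIMIT = 140  # twitter's character limit


def ford_tweet(x, limit=TWEET_LIMIT):
    """Return a Mayor Robot Ford tweet about x, or None if x is too long."""
    # Template that repeats x four times.
    four_times = "{0}, {0}, {0}, folks. We're getting {0}.".format(x)
    # Template that repeats x three times, so it leaves more room for a long x.
    three_times = "People want {0}, folks. {0}, {0}.".format(x)
    # Assumed rule: use the first template that fits in one tweet.
    for candidate in (four_times, three_times):
        if len(candidate) <= limit:
            return candidate
    return None


if __name__ == "__main__":
    print(ford_tweet("subways"))
```

Anything that comes back as `None` is too long for either template and can simply be thrown out.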

How does it work? Well, the Wikipedia portion is easy enough. We just get access to the API to grab a list of ten random pages. The more complicated part is pulling stuff off of #TOpoli.
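For the curious, grabbing ten random titles might look something like this. It's a sketch that assumes the English Wikipedia endpoint and the MediaWiki `list=random` query restricted to the article namespace. The function names and the User-Agent string are made up for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"  # assumed: English Wikipedia


def random_titles_url(limit=10):
    """Build the query URL for `limit` random article titles."""
    params = {
        "action": "query",
        "list": "random",
        "rnnamespace": 0,  # namespace 0 is ordinary articles
        "rnlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def parse_titles(payload):
    """Pull the titles out of a decoded list=random response."""
    return [page["title"] for page in payload["query"]["random"]]


def fetch_random_titles(limit=10):
    """Call the API over the network and return the list of titles."""
    request = Request(random_titles_url(limit),
                      headers={"User-Agent": "example-bot/0.1"})  # hypothetical
    with urlopen(request) as response:
        return parse_titles(json.load(response))
```

Keeping the parsing separate from the network call means you can test it on a canned response without bothering Wikipedia.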

Since this is a twitter app, it’s not too hard to get a bunch of tweets off of #TOpoli. We just use the API and we’ve got the fifteen latest #TOpoli tweets ready to be used. The difficult part is extracting noun phrases, or NPs, which is where that graduate class on computational linguistics comes in handy.

So how do we do this? Well, first of all we have to identify the parts of speech in a given tweet. So we tokenize it first, splitting it up into words and punctuation. Then, we use a part-of-speech tagger to go through and figure out what each little piece is. The POS tagger that I used was the default one that came with the Natural Language Toolkit. Normally, you’d need to train a tagger on a corpus. This default one was trained on the Brown corpus, which is a body of text that was hand-tagged for training purposes.
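To give a feel for the tokenizing step, here's a tiny regex tokenizer. It's a stand-in for illustration only, not the NLTK tokenizer, and the pattern is my own assumption about what matters in a tweet: it keeps URLs, hashtags and mentions in one piece and splits off everything else as words or single punctuation marks.

```python
import re

# Order matters: URLs first, then hashtags and mentions, then words
# (allowing internal apostrophes), then any single punctuation mark.
TOKEN_RE = re.compile(r"https?://\S+|[#@]\w+|\w+(?:['’]\w+)*|[^\w\s]")


def tokenize(tweet):
    """Split a tweet into word and punctuation tokens."""
    return TOKEN_RE.findall(tweet)


if __name__ == "__main__":
    print(tokenize("Mayor Ford hates streetcars, folks! #TOpoli"))
```

The resulting list of tokens is what gets handed to the tagger, which pairs each token with a part-of-speech tag.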

So now our tweets are all tagged and we assume that they’re properly tagged. There are obviously going to be some slight errors here and there, but whatever, we want to make a twitter bot, so it’s not that important. But we only have our parts of speech. We want to be able to relate the different parts of speech into phrases. So we need some kind of parsing or chunking to put these pieces together into larger phrases that make sense.

For this, I used a bigram chunker trained on the conll2000 corpus. Like the Brown corpus for tagging, the conll2000 corpus is manually parsed and chunked for training purposes. A bigram chunker analyses every consecutive pair of words in the training sentences to come up with a statistical model. It then uses this model to come up with the most likely NPs to arise from a new sentence. We can then just pluck out all of the NPs the chunker identifies.
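The plucking step is simple enough to sketch without the chunker itself. This assumes the chunker's output has been flattened into (word, tag) pairs using the IOB tags found in the conll2000 corpus, where `B-NP` starts a noun phrase, `I-NP` continues one, and anything else is outside an NP. The function name is hypothetical.

```python
def extract_noun_phrases(tagged):
    """Collect NP strings from a list of (word, iob_tag) pairs."""
    phrases = []
    current = []  # words of the NP currently being built
    for word, tag in tagged:
        if tag == "B-NP":
            # A new NP begins, so close off any NP already in progress.
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif tag == "I-NP":
            # Continue the current NP (or leniently start one).
            current.append(word)
        else:
            # Any other tag ends the NP in progress.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


if __name__ == "__main__":
    sentence = [("Mayor", "B-NP"), ("Ford", "I-NP"), ("hates", "B-VP"),
                ("the", "B-NP"), ("new", "I-NP"), ("streetcars", "I-NP"),
                (".", "O")]
    print(extract_noun_phrases(sentence))
```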

Once we have all of our NPs, we stick them in a list with our Wikipedia titles and randomly select one to use in our tweet. The Wikipedia API has a limit of 10 titles per call and the twitter API grabs 15 tweets per call. Thus, with one NP per tweet, the chance of getting a Wikipedia title is at best 10 out of 25, or 2/5. However, that’s not taking into account removing entries that are too long. That quick calculation also assumes that there’s only one NP per tweet when there could be many, so in reality, the chance of grabbing something from #TOpoli is much higher, which might be for the best if you want weird accidental metacommentary.
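The back-of-the-envelope numbers can be checked with a few lines. The 10 titles and 15 tweets come from above. The NPs-per-tweet figure, the `max_len` cutoff and both function names are assumptions for the sake of the example.

```python
import random
from fractions import Fraction


def wikipedia_share(n_titles=10, n_tweets=15, nps_per_tweet=1):
    """Chance that a uniformly chosen entry is a Wikipedia title."""
    return Fraction(n_titles, n_titles + n_tweets * nps_per_tweet)


def pick_subject(titles, noun_phrases, max_len=30, rng=random):
    """Pool both lists, drop entries that are too long, pick one at random."""
    pool = [x for x in titles + noun_phrases if len(x) <= max_len]
    return rng.choice(pool) if pool else None


if __name__ == "__main__":
    print(wikipedia_share())                 # one NP per tweet
    print(wikipedia_share(nps_per_tweet=3))  # three NPs per tweet
```

With one NP per tweet the share is 2/5, and with three NPs per tweet it drops to 2/11, which is why #TOpoli ends up dominating.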

The Core Service Review

One day, I decided to look through the City of Toronto’s open data catalogue and happened upon an interesting entry called the Core Service Review Qualitative data.

Lo and behold, it was exactly that.

After some fiddling around with an Excel module for Python and figuring out how to split tweets that are larger than 140 characters, I let it go.
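Splitting a long answer into tweet-sized pieces could be done along these lines. This is one possible approach using the standard library's `textwrap`, which breaks on whitespace where it can. It isn't necessarily how the bot does it, and `split_tweet` is a name I made up.

```python
import textwrap


def split_tweet(text, limit=140):
    """Break text into chunks of at most `limit` characters, on word boundaries."""
    # textwrap.wrap only breaks inside a word if that word alone exceeds the limit.
    return textwrap.wrap(text, width=limit)


if __name__ == "__main__":
    answer = " ".join(["transit"] * 40)
    for part in split_tweet(answer):
        print(len(part), part)
```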

@TOCoreService will tweet one entry, randomly chosen, from the 12000 submissions, or close to 58000 answers. These range from short answers like “transit” or “taxes” to fairly lengthy responses.

So what’s the point of this bot? Well, the data is up there for anyone to read, which is nice for transparency and engagement. Of course, whether anyone who’s not city staff would want to read 13000 responses is another matter. But here, we have a pretty decent collection of opinions on what our priorities should be from real citizens. It’d be a shame if the only people who read them were city staff.