Patrick Russo

Current Research

Primary Debates: Live Transcripts

I have recently been building a speaker classifier to detect who is speaking at a given moment during a debate. This prediction is joined with the content prediction from the Google Speech API (so that I can predict what was said, and who said it) to create a live transcript during debates. This is still a work in progress; at this point I'm just posting it for fun.

04/14/16 Democratic Debate can be found here.

03/10/16 Republican Debate can be found here. The classifier worked well on test data (from totally different debates than the training data), but for some reason it didn't like Kasich (it refused to recognize his voice)! It also was prone to classifying Trump as Rubio or Cruz.

03/09/16 Democratic Debate can be found here. In an effort to improve accuracy I added additional training data minutes before the debate. The classifier then decided it would only predict Clinton, so I switched back to the original training set--so please pardon the first 100 or so Clintons!

03/06/16 Democratic Debate can be found here. Note that I changed classifiers halfway through because the initial one was failing. I've scored most of the second half of the debate (about 450 audio clips / lines in the transcript) and got a speaker recognition accuracy of 79% if we ignore clips where the moderator is speaking or where there is background noise (e.g., applause), or 70% if we include all clips¹.

03/03/16 Republican Debate can be found here.

Description:

Classifiers were built using random forests (or AdaBoost, depending on which version I'm running), trained on previous debates. Audio clips from previous debates were manually labeled by candidate. Random forests and AdaBoost were run on the mel-frequency cepstral coefficients of 6ms Hamming windows of these clips, so each clip-level prediction is an aggregate over these smaller windows. To get a prediction of the speaker for a full clip, we simply take the candidate with the highest average predicted probability across the clip's windows.
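The aggregation step can be sketched in a few lines. This is a minimal illustration, not the actual code behind the transcripts: it assumes each window already has one predicted probability per candidate (for example from a classifier's `predict_proba`-style output; the library is not named above), and the function name and sample numbers are made up.

```python
from statistics import mean


def predict_clip_speaker(window_probs, speakers):
    """Average per-window class probabilities and return the top speaker.

    window_probs: one row per window, one probability per speaker
                  (hypothetical input; columns follow the order of `speakers`).
    """
    if not window_probs:
        raise ValueError("clip has no windows")
    # Column-wise mean: average probability of each speaker over all windows.
    avg = [mean(column) for column in zip(*window_probs)]
    best = max(range(len(speakers)), key=avg.__getitem__)
    return speakers[best], dict(zip(speakers, avg))


# Made-up probabilities for a clip split into three windows.
speakers = ["Clinton", "Sanders"]
window_probs = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]

speaker, avg_probs = predict_clip_speaker(window_probs, speakers)
print(speaker, avg_probs)
```

Averaging probabilities rather than taking a majority vote lets a few confident windows outweigh many uncertain ones, which is one plausible reason to aggregate this way.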

To predict the actual words that a candidate said, I use the Google Speech API, via the speech_recognition Python package. The words are then mapped to tweets downloaded via Twitter's API (using the Twitter API Python package) and displayed using JavaScript tooltips.
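For readers unfamiliar with the speech_recognition package, a call to the Google recognizer looks roughly like the sketch below. The helper names, the clip path argument, and the "speaker: text" line format are hypothetical; only `Recognizer`, `AudioFile`, `record`, `recognize_google`, and `UnknownValueError` come from the package itself.

```python
def transcribe_clip(wav_path):
    """Return the recognized text for one audio clip, or "" if unintelligible."""
    # Third-party package (pip install SpeechRecognition); imported lazily so
    # the rest of this sketch can be used without it.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the whole file
    try:
        # Sends the audio to Google's web speech service; needs a network
        # connection and raises sr.RequestError if the request fails.
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""


def transcript_line(speaker, text):
    """Join the speaker prediction and the recognized words (hypothetical format)."""
    return f"{speaker}: {text}"


print(transcript_line("Sanders", "thank you"))
```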

Some additional information:

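Mel-frequency cepstral coefficients are produced by running a bank of triangular filters, spaced on the roughly logarithmic mel scale, over the spectrum of the audio, then taking the log of the filter energies and a discrete cosine transform. Because of the mel spacing, the low-frequency filters cover a narrower frequency band than the high-frequency ones².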
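To make the band-width point concrete, here is a small sketch using one common mel formula, mel = 2595 * log10(1 + f / 700). The formula choice, the number of filters, and the 0-8000 Hz range are assumptions for illustration, not necessarily what the classifier used.

```python
import math


def hz_to_mel(f_hz):
    """Convert Hz to mel using one common formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_filters, f_min_hz, f_max_hz):
    """Edges (in Hz) of n_filters triangular filters equally spaced in mel."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_filters + 2)]


# Filter i spans edges[i] .. edges[i + 2], peaking at edges[i + 1].
edges = mel_band_edges(n_filters=10, f_min_hz=0.0, f_max_hz=8000.0)
widths = [edges[i + 2] - edges[i] for i in range(len(edges) - 2)]
for i, width in enumerate(widths):
    print(f"filter {i}: {width:.0f} Hz wide")

# The same 5 Hz step is a bigger jump in mel at low frequencies.
low_gap = hz_to_mel(75.0) - hz_to_mel(70.0)
high_gap = hz_to_mel(3005.0) - hz_to_mel(3000.0)
print(f"70->75 Hz: {low_gap:.2f} mel, 3000->3005 Hz: {high_gap:.2f} mel")
```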

Using Hamming windows reduces spectral leakage when running an FFT on the audio clip (the leakage is caused by the audio being cut into 6ms windows that start and end abruptly).
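The effect of the window can be checked numerically. The sketch below is illustrative only: it assumes a 16 kHz sample rate (so a 6ms window is 96 samples), uses a synthetic tone that falls between two DFT bins, and measures how much energy lands far from the peak with and without a Hamming window. The `leakage_fraction` helper and its 5-bin guard are hypothetical choices.

```python
import cmath
import math


def hamming(n_samples):
    """Symmetric Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def dft_power(samples):
    """Power spectrum for bins 0..N/2, via a direct (slow) DFT."""
    n = len(samples)
    power = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        power.append(abs(acc) ** 2)
    return power


def leakage_fraction(samples, guard_bins=5):
    """Fraction of energy more than guard_bins away from the peak bin."""
    power = dft_power(samples)
    peak = max(range(len(power)), key=power.__getitem__)
    far = sum(p for k, p in enumerate(power) if abs(k - peak) > guard_bins)
    return far / sum(power)


SAMPLE_RATE = 16000                    # assumed sample rate in Hz
N = SAMPLE_RATE * 6 // 1000            # samples in a 6ms window
TONE_HZ = 10.5 * SAMPLE_RATE / N       # halfway between two DFT bins

tone = [math.sin(2.0 * math.pi * TONE_HZ * i / SAMPLE_RATE) for i in range(N)]
windowed = [x * w for x, w in zip(tone, hamming(N))]

rect_leak = leakage_fraction(tone)      # abrupt start/end (no window)
ham_leak = leakage_fraction(windowed)   # tapered by the Hamming window
print(f"no window: {rect_leak:.4%}  Hamming: {ham_leak:.4%}")
```

The window does not remove leakage entirely; it trades a slightly wider main peak for much lower energy far from it.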

Decisions I made:

I do not use n-fold cross-validation because I am concerned about learning background noise; for the test sample, I use audio clips from different debate(s) than the training data. (I initially made the mistake of using speeches by each candidate as training audio³ and used n-fold cross-validation, which yielded great test results, until I realized I was only learning the background noise!)

I used cross-validation to settle on a window size of 6ms, as well as to set other parameters such as tree depth (and learning rate for AdaBoost). Cross-validation seems to suggest a tree depth of around 26-46 for the Republicans, but a depth of 1 for the Democrats; this is a little concerning and I am still investigating it.
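The hold-out scheme described above (testing on different debates than the training data) amounts to splitting clips by debate rather than at random. A minimal sketch, with hypothetical debate identifiers, file names, and record layout:

```python
def split_by_debate(clips, held_out_debates):
    """Put every clip from the held-out debates in the test set.

    Splitting by debate keeps room acoustics and crowd noise from one event
    out of both sets, so a good test score cannot come from memorizing them.
    """
    held_out = set(held_out_debates)
    train = [c for c in clips if c["debate"] not in held_out]
    test = [c for c in clips if c["debate"] in held_out]
    return train, test


# Hypothetical labeled clips; names and paths are placeholders.
clips = [
    {"debate": "debate_A", "speaker": "Clinton", "path": "a_001.wav"},
    {"debate": "debate_A", "speaker": "Sanders", "path": "a_002.wav"},
    {"debate": "debate_B", "speaker": "Clinton", "path": "b_001.wav"},
    {"debate": "debate_B", "speaker": "Sanders", "path": "b_002.wav"},
    {"debate": "debate_C", "speaker": "Sanders", "path": "c_001.wav"},
]

train, test = split_by_debate(clips, held_out_debates=["debate_C"])
print(len(train), "training clips,", len(test), "test clips")
```

A random n-fold split of the same clips would put pieces of every debate in every fold, which is the situation that produced the misleadingly good scores mentioned above.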

To-do List:

Either train on the moderators, or just output "Other" as the speaker when the probability is below some threshold--the latter is probably the easiest.

Get more training data!

Try other learning algorithms such as boosted trees (xgboost).

Include articles or speeches by the candidates, not just tweets.

That being said, the debates are almost done for the primary season, so this may go on hiatus until the general election.
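The first to-do item, outputting "Other" when the classifier is not confident, could be sketched as follows. The function name, the 0.5 threshold, and the sample probabilities are placeholders; a real threshold would need tuning on scored clips.

```python
def label_with_fallback(avg_probs, threshold=0.5, fallback="Other"):
    """Return the most probable speaker, or `fallback` if below the threshold.

    avg_probs: mapping of speaker name -> average predicted probability
               for one clip (hypothetical input).
    """
    speaker, prob = max(avg_probs.items(), key=lambda item: item[1])
    return speaker if prob >= threshold else fallback


# Made-up probabilities: one confident clip, one ambiguous clip.
confident = label_with_fallback({"Clinton": 0.8, "Sanders": 0.2})
ambiguous = label_with_fallback({"Trump": 0.40, "Cruz": 0.35, "Rubio": 0.25})
print(confident, ambiguous)
```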

¹ Since the classifier was never trained on the moderator, any clip where the moderator is the speaker is by definition incorrect in the set of all clips.

² The log scale is used because it's more similar to the way humans hear--it's easier to distinguish the difference between 70 Hz and 75 Hz than the difference between 3000 Hz and 3005 Hz!

³ I was lazy and didn't want to manually classify clips from debates--recordings of individual speeches are already labeled by speaker!