Current Research
Primary Debates: Live Transcripts
I have recently been working on building a speaker classifier to detect who is speaking at each moment during a debate. This prediction is joined with the content prediction from the Google Speech API (so that I can predict what was said, and who said it) to create a live transcript during debates. This is still in progress; at this point I'm just posting it for fun, and it still needs more work.
04/14/16 Democratic Debate can be found here.
03/10/16 Republican Debate can be found here.
The classifier worked well on test data (from totally different debates than the training data),
but for some reason it didn't like Kasich (it refused to recognize his voice)! It also was prone to
classifying Trump as Rubio or Cruz.
03/09/16 Democratic Debate can be found here.
In an effort to improve accuracy I added additional training data minutes before the debate. The classifier then decided it would only predict Clinton, so I switched back to the original training set--please pardon the first 100 or so Clintons!
03/06/16 Democratic Debate can be found here. Note that I changed classifiers halfway through because the initial one was failing. I've scored most of the second half of the debate (about 450 audio clips / lines in the transcript) and got a speaker recognition accuracy between 70% and 80%: 79% if we ignore clips of the moderator speaking or clips with background noise (e.g., applause), or 70% if we include all clips[1].
03/03/16 Republican Debate can be found here.
Description:
Classifiers were built using random forests (or AdaBoost, depending on which version I'm running), trained on previous debates. Audio clips from previous debates were manually classified by candidate. Random forests and AdaBoost were run on the mel-frequency cepstral coefficients of 6ms Hamming windows of these clips, so predictions were based on an aggregate of these smaller windows. To get a prediction of the speaker for a full clip, we simply take the speaker with the highest average predicted probability across the clip's windows.
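As an illustration of this window-then-aggregate approach, here is a minimal sketch, assuming librosa for MFCC extraction and scikit-learn's RandomForestClassifier; the file names and parameter values are placeholders, not the exact pipeline used for the transcripts:

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def clip_mfcc_windows(path, sr=16000, win_ms=6):
    """Return one 13-dim MFCC vector per ~6ms Hamming window of an audio clip."""
    y, sr = librosa.load(path, sr=sr)
    win = int(sr * win_ms / 1000)                      # ~6ms of samples (96 at 16 kHz)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=26,
                                n_fft=256, win_length=win, hop_length=win,
                                window="hamming")
    return mfcc.T                                      # shape: (n_windows, 13)

# Placeholder training clips, manually labeled by candidate.
train = [("clinton_01.wav", "Clinton"), ("sanders_01.wav", "Sanders")]
feats = [clip_mfcc_windows(path) for path, _ in train]
X = np.vstack(feats)                                   # one row per window
y = np.concatenate([[cand] * len(f) for (_, cand), f in zip(train, feats)])

clf = RandomForestClassifier(n_estimators=200).fit(X, y)

def predict_speaker(path):
    """Average per-window probabilities across the clip, then pick the top speaker."""
    probs = clf.predict_proba(clip_mfcc_windows(path))
    return clf.classes_[int(np.argmax(probs.mean(axis=0)))]

print(predict_speaker("unknown_clip.wav"))
```

Averaging the per-window probabilities before taking the argmax means a few noisy windows won't flip the prediction for the whole clip.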
To predict the actual words that a candidate said, I use the Google Speech API via the speech_recognition Python package. The words are then mapped to Twitter messages downloaded via Twitter's API (using the Twitter API Python package) and displayed using JavaScript tooltips.
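For reference, the transcription step with the speech_recognition package looks roughly like this (the file name is a placeholder and error handling is minimal):

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
# "clip.wav" is a placeholder for one of the short debate clips.
with sr.AudioFile("clip.wav") as source:
    audio = recognizer.record(source)

try:
    print(recognizer.recognize_google(audio))   # send the clip to the Google Speech API
except sr.UnknownValueError:
    print("Google Speech API could not understand the audio")
```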
Some additional information:
Mel-frequency cepstral coefficients are produced by running a bank of triangular filters, spaced on a roughly logarithmic (mel) frequency scale, over the audio's spectrum, so a low-frequency filter covers a narrower band of frequencies than a higher-frequency one[2].
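To make that spacing concrete, here is a tiny numpy sketch using the common HTK-style mel formula (one standard definition of the mel scale, assumed here purely for illustration):

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale: roughly linear below ~1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Ten equally spaced points on the mel axis from 0 to 8 kHz...
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 10))
# ...map back to Hz bands that are narrow at low frequencies and wide at high ones.
print(np.diff(edges_hz))
```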
Using Hamming windows reduces spectral leakage when running an FFT on each audio window (leakage caused by the audio being cut into 6ms frames that start and end abruptly).
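A quick numpy sketch of that windowing step (frame length chosen to match ~6ms at an assumed 16 kHz sample rate):

```python
import numpy as np

sr = 16000                                   # assumed sample rate
frame = np.random.randn(int(0.006 * sr))     # one ~6ms frame (96 samples) of audio

# The Hamming window tapers the frame's edges toward zero, which reduces the
# spectral leakage caused by the frame starting and ending abruptly.
windowed = frame * np.hamming(len(frame))
spectrum = np.abs(np.fft.rfft(windowed))
```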
Decisions I made:
I do not use n-fold cross validation because I am concerned about learning background noise; for the testing sample, I use audio clips from different debate(s) than the training data. (I initially made the mistake of using speeches by each candidate for audio training[3] and used n-fold cross validation, which yielded great test results, until I realized I was only learning the background noise!)
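The same hold-out-a-whole-debate idea can be written with scikit-learn's grouped splitters; this sketch assumes each feature row carries a debate label (GroupKFold is my suggestion here, not necessarily what the project does, and the data below is synthetic):

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins: MFCC rows, speaker labels, and which debate each row came from.
X = np.random.randn(300, 13)
y = np.random.choice(["Clinton", "Sanders"], size=300)
debates = np.repeat(["debate_A", "debate_B", "debate_C"], 100)

# Every test fold is a debate the model never saw during training, so a classifier
# that only memorizes a debate's background noise can't score well here.
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=debates):
    clf = RandomForestClassifier(n_estimators=100).fit(X[train_idx], y[train_idx])
    print(clf.score(X[test_idx], y[test_idx]))
```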
I used cross-validation to settle on a window size of 6ms, as well as to set other parameters such as tree depth (and learning rate for AdaBoost). Cross-validation seems to suggest a tree depth of around 26-46 for Republicans, but a depth of 1 for Democrats; this seems a little concerning and I am still investigating it.
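That kind of parameter search can be done with something like scikit-learn's GridSearchCV; the grid below is illustrative and the data is synthetic, not the actual debate features:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for the MFCC features and speaker labels.
X = np.random.randn(200, 13)
y = np.random.choice(["Trump", "Rubio", "Cruz"], size=200)

# Example grid over tree depth; an AdaBoost version would also grid the learning rate.
param_grid = {"max_depth": [1, 5, 10, 26, 46], "n_estimators": [100, 200]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```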
To-do List:
Either train on the moderators, or just output "Other" as the speaker when the predicted probability is below some threshold--the latter is probably easier (see the sketch at the end of this list).
Get more training data!
Try other learning algorithms such as boosted trees (xgboost).
Include articles or speeches by the candidates, not just tweets.
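As mentioned in the first to-do item, the thresholding option is just a small wrapper around the averaged probabilities; a sketch, with an arbitrary and untuned cutoff:

```python
import numpy as np

def predict_with_other(clf, window_features, threshold=0.5):
    """Return the most likely speaker, or "Other" when the model isn't confident.

    window_features: per-window MFCC vectors for one clip (n_windows x n_features).
    threshold: placeholder cutoff; it would need tuning on held-out debates.
    """
    probs = clf.predict_proba(window_features).mean(axis=0)  # average over windows
    if probs.max() < threshold:
        return "Other"  # likely a moderator, applause, or crosstalk
    return clf.classes_[int(np.argmax(probs))]
```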
That being said, the debates are almost done for the primary season, so this may go on hiatus until the
general election.
[1] Since we didn't train on the moderators, any clip where a moderator is speaking is by definition incorrect when scoring the set of all clips.
[2] The log scale is used because it's closer to the way humans hear--it's easier to tell 70 Hz from 75 Hz than 3000 Hz from 3005 Hz!
[3] I was lazy and didn't want to manually label clips from the debates--recordings of individual speeches are effectively labeled already!