Twitter - k-means clustering

Description:

This project clusters the day's tweets from 60+ news organizations, pundits, and politicians into around 13 clusters (how the number of clusters is chosen is detailed below). The tweets in each cluster tend to share the same words and are therefore typically about the same topic, event, or theme. The clusters are sorted by the number of tweets in the last two hours, so that the first one is the most "trending" of the clusters. Unfortunately, there tend to be a handful of clusters that are (almost) a catch-all for tweets that don't fit into other categories. I have therefore modified the order so that very diffuse clusters appear at the end of the list below, even if they are "trending" by my two-hour definition.
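The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code behind the live version: the dictionary keys "times" and "spread" and the max_spread threshold are hypothetical names and values chosen for the example.

```python
from datetime import datetime, timedelta

def rank_clusters(clusters, now, window=timedelta(hours=2), max_spread=1.0):
    """Order clusters by recent activity, pushing diffuse ones to the end.

    `clusters` is a list of dicts with hypothetical keys:
      "times"  - list of datetime objects, one per tweet
      "spread" - average distance of the cluster's tweets from its center
    """
    cutoff = now - window

    def sort_key(cluster):
        recent = sum(1 for t in cluster["times"] if t >= cutoff)
        is_diffuse = cluster["spread"] > max_spread
        # False sorts before True, so tight clusters come first;
        # within each group, more recent tweets means an earlier position.
        return (is_diffuse, -recent)

    return sorted(clusters, key=sort_key)

# Made-up example data.
now = datetime(2015, 1, 1, 12, 0)
clusters = [
    {"name": "catch-all", "spread": 1.8,
     "times": [now - timedelta(minutes=m) for m in (5, 10, 20, 30)]},
    {"name": "iran deal", "spread": 0.6,
     "times": [now - timedelta(minutes=m) for m in (15, 45, 300)]},
    {"name": "markets", "spread": 0.7,
     "times": [now - timedelta(minutes=m) for m in (90, 400, 500)]},
]
ranked = [c["name"] for c in rank_clusters(clusters, now)]
```

Here the catch-all cluster has the most recent tweets but is still listed last because its spread exceeds the assumed threshold.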

The live version can be found here.

Details:

Every half-hour, a day's worth of tweets is downloaded for 60+ Twitter accounts. Some are global, U.S., or business news sources; others are prominent individuals such as Barack Obama, John Boehner, or Anderson Cooper.

The words of the tweets are then stemmed (endings such as -ing, -ed, and -s are removed), all words are put into lowercase, and symbols are removed. Common words such as "the," "is," "their," and "then" are also removed.
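As a rough illustration of this cleaning step, here is a minimal sketch. The stop-word list is a tiny made-up sample, and crude_stem is a deliberately simple stand-in for a real stemmer (such as the Porter stemmer); neither is taken from the actual pipeline.

```python
import re

# Small illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "is", "their", "then", "a", "an", "and", "of", "to", "in", "on"}

# Crude suffix stripping as a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(tweet):
    text = tweet.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace symbols with spaces
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

tokens = preprocess("The Senate is debating the Iran deal, then voting!")
```

Note that stop words are dropped before stemming here; the order of those two steps is an assumption of the sketch.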

Tweets are then vectorized with tf-idf weighting over n-grams of one to four words.
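To make the vectorization concrete, the following sketch builds 1- to 4-word n-grams from already-cleaned tokens and weights them with tf-idf. The smoothed idf formula and the length normalization are one common variant, assumed here for illustration; the exact weighting used on the live site may differ.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=4):
    """All contiguous n-grams of n_min..n_max tokens, joined by spaces."""
    return [
        " ".join(tokens[i : i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {ngram: weight} dict."""
    term_counts = [Counter(ngrams(doc)) for doc in docs]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())  # count each term once per document
    n_docs = len(docs)
    vectors = []
    for counts in term_counts:
        vec = {
            # Smoothed idf: terms found in fewer documents get larger weights.
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        # Scale each vector to unit length so tweet length matters less.
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors

# Made-up, already-cleaned example tweets.
docs = [
    ["iran", "deal", "vote"],
    ["iran", "deal", "talks"],
    ["market", "rally"],
]
vectors = tfidf_vectors(docs)
```

In the first vector, "vote" receives a larger weight than "iran" because it appears in only one of the three example tweets.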

k-means clustering is then applied to that day's tweets. I use 13 clusters because this tends to yield a low "RSS" (residual sum of squares, i.e. the total squared distance of tweets from their cluster centers) while remaining few enough that it is practical to browse the clusters. (In the future, once I have the time, I would like to automate the choice of the number of clusters with something such as the gap metric.) Clusters are chosen so as to minimize the distance between the tweets within each cluster and its center. This distance is built from the differences between each word's tf-idf value in a tweet and the average tf-idf value of that word in the cluster. (For example, suppose many tweets in a cluster talk about Russia. The cluster center would then tend to have a high weight for the word "Russia," and another tweet containing "Russia" would lie a small distance from it. Likewise, on some days there may be a cluster in which the words "Iran," "deal," "Iran deal," "Kerry," "Rouhani," "Israel," etc. are frequently referenced.)
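For readers who want to see the mechanics, here is a minimal sketch of standard k-means (Lloyd's algorithm) on small dense vectors, returning the RSS alongside the clusters. It illustrates the general technique rather than the code running on the live site; the toy columns are assumed to be tf-idf weights for "russia", "iran", and "deal".

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (centers, labels, rss)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k distinct data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = mean(members)
    rss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, rss

# Toy vectors; columns are assumed weights for ("russia", "iran", "deal").
points = [
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.7, 0.7],
    [0.1, 0.6, 0.8],
    [0.0, 0.8, 0.6],
]
centers, labels, rss = kmeans(points, k=2)
```

On this toy data the first three tweets (heavy on "russia") end up in one cluster and the last three (heavy on "iran" and "deal") in the other. Running kmeans for several values of k and comparing the returned rss is one simple way to see how the RSS falls as clusters are added.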

The number of clusters is chosen using a method based upon the "gap measure" (the gap statistic) of Tibshirani, Walther, & Hastie (2001). (More on this to come.)
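For reference, the gap statistic as defined in the Tibshirani, Walther, & Hastie paper can be written as below. This is the standard published definition, not necessarily the exact variant used here, since the method on this page is only described as "based upon" it. The symbols are: C_r is cluster r with n_r members, d_{ii'} is the (squared Euclidean) distance between tweets i and i', E*_n is the expectation under a reference distribution with no cluster structure, B is the number of reference samples, and sd_k is the standard deviation of log W_k across those samples.

```latex
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} \sum_{i, i' \in C_r} d_{ii'}

\mathrm{Gap}_n(k) = E_n^{*}\left[\log W_k\right] - \log W_k

s_k = \mathrm{sd}_k \sqrt{1 + 1/B}

\hat{k} = \min \left\{ k : \mathrm{Gap}_n(k) \ge \mathrm{Gap}_n(k+1) - s_{k+1} \right\}
```

With squared Euclidean distances, W_k is the pooled within-cluster sum of squares around the cluster means, which is the same "RSS" quantity that k-means minimizes.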

Between the half-hourly re-estimations, tweets are continuously streamed and classified, so just refresh the browser to see updates!
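Classifying a newly streamed tweet can be sketched as assigning it to the nearest existing cluster center. This is a minimal illustration under the assumption that the new tweet has already been cleaned and vectorized with the same tf-idf weights as the last half-hourly fit; the centers and weights below are made up.

```python
def squared_distance(vec, center):
    """Squared Euclidean distance between two sparse {term: weight} dicts."""
    terms = set(vec) | set(center)
    return sum((vec.get(t, 0.0) - center.get(t, 0.0)) ** 2 for t in terms)

def classify(vec, centers):
    """Return the index of the center closest to `vec`."""
    return min(range(len(centers)), key=lambda j: squared_distance(vec, centers[j]))

# Hypothetical cluster centers from the most recent re-estimation.
centers = [
    {"russia": 0.9, "putin": 0.4},
    {"iran": 0.6, "deal": 0.6, "iran deal": 0.5},
]
new_tweet = {"iran": 0.7, "deal": 0.7}
cluster_index = classify(new_tweet, centers)
```

The centers themselves stay fixed between re-estimations in this sketch; only the assignment of incoming tweets changes.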