Out-of-the-box analysis and reporting tools for twitter
While there are some (very neat) R packages focused on twitter (namely twitteR and streamR), twitterreport is centered on providing analysis and reporting tools for twitter data. The package's current version features:
- Access to twitter API
- Extracting mentions/hashtags/urls from text (tweets)
- Gender tagging by matching user names with gender datasets included in the package (es and en)
- Creating (mentions) networks and visualizing them using D3js
- Sentiment analysis (basic, but useful) using lexicons included in the package (again, es and en)
- Creating time series charts of hashtags/users/etc. and visualizing them using D3js
- Creating wordclouds (after removing stop words and processing the text)
- Map visualization using the leaflet package
- Topic identification through the Jaccard coefficient (word similarity)
You can take a look at a live example at http://www.its.caltech.edu/~gvegayon/twitter/report_example.html, and at the source code of that example at https://github.com/gvegayon/twitterreport/blob/master/vignettes/report_example.Rmd
Some of the functions here were first developed in the project nodoschile.cl (no longer running). You can visit the project's testimonial website at http://nodos.modularity.cl and the website (part of nodoschile) that motivated twitterreport at http://modularity.cl/presidenciales.
While the package is still in development, you can always use devtools to install the most recent version:
devtools::install_github('gvegayon/twitterreport')
# First, load the package!
library(twitterreport)
# List of twitter accounts
users <- c('MarsRovers', 'senatormenendez', 'sciencemagazine')
# Getting the tweets (first generate the token)
key <- tw_gen_token('myapp','key', 'secret')
tweets <- lapply(users, tw_api_get_statuses_user_timeline, twitter_token=key)
# Processing the data (and taking a look)
tweets <- do.call(rbind, tweets)
head(tweets)
# Loading data
data("senators")
data("senators_profile")
data("senate_tweets")
tweets_components <- tw_extract(senate_tweets$text)
groups <- data.frame(
  name      = senators_profile$tw_screen_name,
  group     = factor(senators$party),
  real_name = senators$Name,
  stringsAsFactors = FALSE)
groups$name <- tolower(groups$name)
senate_network <- tw_network(
  tolower(senate_tweets$screen_name),
  lapply(tweets_components$mention, unique), only.from = TRUE,
  group = groups, min.interact = 3)
plot(senate_network, nodelabel='real_name')
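To make the idea behind a mentions network more concrete, here is a minimal base R sketch that turns tweets into a weighted edge list (who mentions whom, and how often). It is not how tw_extract or tw_network are implemented; the toy data frame, the regular expression, and the object names (toy, edges, edge_counts) are all assumptions made for illustration.

```r
# Toy data (hypothetical; not taken from the package datasets)
toy <- data.frame(
  screen_name = c("alice", "bob", "alice"),
  text = c("Thanks @Bob and @carol!", "Agree with @alice", "@bob again, @bob"),
  stringsAsFactors = FALSE
)

# Extract @mentions from each tweet with a simple regular expression
mentions <- regmatches(toy$text, gregexpr("@[A-Za-z0-9_]+", toy$text))

# Lower-case, drop the "@" and keep unique mentions per tweet
mentions <- lapply(mentions, function(m) unique(tolower(sub("^@", "", m))))

# One row per (source, target) pair
edges <- data.frame(
  source = rep(tolower(toy$screen_name), lengths(mentions)),
  target = unlist(mentions),
  stringsAsFactors = FALSE
)

# Count how many times each pair interacts
edge_counts <- aggregate(n ~ source + target,
                         data = cbind(edges, n = 1), FUN = sum)
edge_counts
```

Filtering edge_counts by a minimum count would play a role similar to the min.interact argument used above, although the exact semantics of that argument are not documented here.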
In the following examples we will use data on US senators extracted from twitter using the REST API (the datasets are included in the package).
The function tw_words takes a character vector (of tweets, for example), removes stop words and symbols, and returns the remaining words of each element. The plot method for its output creates a wordcloud:
data(senate_tweets)
tab <- tw_words(senate_tweets$text)
# What did it do?
senate_tweets$text[1:2];tab[1:2]
## [1] "“I am saddened by the news that four Marines lost their lives today in the service of our country.” #Chattanooga"
## [2] ".@SenAlexander statement on today’s “tragic and senseless” murder of four Marines in #Chattanooga: http://t.co/H9zWdJPbiE"
## [[1]]
## [1] "saddened" "news" "four" "marines" "lost"
## [6] "lives" "today" "service" "country" "chattanooga"
##
## [[2]]
## [1] "senalexander" "statement" "todays" "tragic"
## [5] "senseless" "murder" "four" "marines"
## [9] "chattanooga"
# Plot
set.seed(123) # (so the wordcloud looks the same always)
plot(tab, max.n.words = 40)
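As a rough illustration of this kind of text processing, the following base R sketch lower-cases the text, strips URLs and symbols, and drops stop words. It is only an approximation of the behavior shown above, not the code of tw_words; the function clean_words and the very short stop_words vector are hypothetical.

```r
# Hypothetical, very short stop-word list (the package ships fuller ones)
stop_words <- c("i", "am", "by", "the", "that", "in", "of", "our",
                "on", "and", "their")

clean_words <- function(txt, stopwords) {
  txt <- tolower(txt)
  txt <- gsub("https?://\\S+", " ", txt)         # drop URLs
  txt <- gsub("[^a-z0-9[:space:]]", "", txt)     # drop punctuation and symbols
  words <- strsplit(txt, "\\s+")
  # Keep non-empty tokens that are not stop words
  lapply(words, function(w) w[nzchar(w) & !(w %in% stopwords)])
}

res <- clean_words("I am saddened by the news in #Chattanooga", stop_words)
res
```

The result is a list with one character vector per input element, which mirrors the shape of the tw_words output printed above.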
Using English and Spanish names, the tw_gender function matches the character argument (which can be a vector) with either a male or female name (or unidentified).
data(senators_profile)
# Getting the names
sen <- tolower(senators_profile$tw_name)
sen <- gsub('\\bsen(ator|\\.)\\s+','',sen)
sen <- gsub('\\s+.+','',sen)
tab <- table(tw_gender(sen))
barplot(tab)
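The matching step can be pictured as a simple lookup against name lists. The sketch below is an assumption about the general approach, not the implementation of tw_gender; the function tag_gender and the tiny name vectors are made up for the example.

```r
# Hypothetical mini name lists (the package includes real en/es datasets)
female_names <- c("mary", "susan", "maria")
male_names   <- c("john", "robert", "juan")

tag_gender <- function(x) {
  x <- tolower(x)
  out <- rep("unidentified", length(x))   # default when no list matches
  out[x %in% female_names] <- "female"
  out[x %in% male_names]   <- "male"
  factor(out, levels = c("female", "male", "unidentified"))
}

tagged <- tag_gender(c("Mary", "juan", "Alex"))
table(tagged)
```

Returning a factor with fixed levels keeps empty categories visible in the table, which is convenient when the result is passed to barplot.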
Here we have an example classifying senate tweets on the #irandeal.
irandeal <- subset(senate_tweets, grepl('irandeal',text, ignore.case = TRUE))
irandeal$sentiment <- tw_sentiment(irandeal$text, normalize = TRUE)
hist(irandeal$sentiment, col = 'lightblue',
     xlab = 'Valence (strength of sentiment)')
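Lexicon-based sentiment scoring, in its simplest form, looks up each word in a table of valences and combines the matches. The sketch below illustrates that idea under stated assumptions: the lexicon, the function score_sentiment, and its average argument are hypothetical, and averaging is just one possible way to normalize (it is not necessarily what normalize = TRUE does in tw_sentiment).

```r
# Hypothetical lexicon: word -> valence
lexicon <- c(good = 1, great = 2, bad = -1, terrible = -2)

score_sentiment <- function(txt, lexicon, average = TRUE) {
  # Lower-case, strip non-letters, split into words
  words <- strsplit(gsub("[^a-z[:space:]]", "", tolower(txt)), "\\s+")
  vapply(words, function(w) {
    hits <- lexicon[w[w %in% names(lexicon)]]
    if (length(hits) == 0) return(0)       # no lexicon word found
    if (average) mean(hits) else sum(hits)
  }, numeric(1))
}

scores <- score_sentiment(
  c("A great deal, good for all", "A terrible, bad deal", "No comment"),
  lexicon
)
scores
```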
The function tw_leaflet provides a nice wrapper for the function leaflet of the package of the same name. With it, we can visualize the number of tweets grouped geographically, as the following example shows:
tw_leaflet(senate_tweets,~coordinates, nclusters=4,radii = ~sqrt(n)*3e5)
Note that in this case only 14 tweets have a non-empty coordinates column, and they come from 4 different senators. When the nclusters option is set, tw_leaflet groups the data using the hclust function of the stats package, so the user doesn't need to worry about aggregating the data.
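For readers curious about what grouping with hclust could look like, here is a small sketch using only the stats package. It clusters hypothetical coordinates and aggregates them into one row per cluster. The data, the use of plain Euclidean distance on degrees, and the object names are assumptions; the actual aggregation inside tw_leaflet may differ.

```r
# Hypothetical tweet coordinates (longitude, latitude)
pts <- data.frame(
  lon = c(-77.0, -77.1, -118.2, -118.3, -87.6),
  lat = c(38.9, 38.8, 34.0, 34.1, 41.9)
)

# Hierarchical clustering on the coordinates, cut into 3 groups
hc <- hclust(dist(pts))
pts$cluster <- cutree(hc, k = 3)

# One row per cluster: centroid and number of tweets
agg <- aggregate(cbind(lon, lat) ~ cluster, data = pts, FUN = mean)
agg$n <- as.vector(table(pts$cluster))
agg
```

A per-cluster count like agg$n is the kind of quantity that a radius formula such as ~sqrt(n)*3e5 could then scale.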
An interesting question is how words are related to each other. Using the Jaccard coefficient we can estimate a measure of similarity between two words. The jaccard_coef function implements this measure, and it allows us to get a better understanding of topics, as the following example shows:
# Computing the jaccard coefficient
jaccard <- jaccard_coef(senate_tweets$text,max.size = 1000)
# See what words are related to veterans
words_closeness('veterans',jaccard,.025)
## word coef
## 1 veterans 318.00000000
## 2 va 0.08982036
## 3 care 0.08510638
## 4 honor 0.04389313
## 5 access 0.04201681
## 6 deserve 0.04176334
## 7 health 0.04022989
## 8 benefits 0.03827751
## 9 mental 0.03733333
## 10 honored 0.03505155
## 11 home 0.03440860
## 12 service 0.03266788
## 13 july 0.03108808
## 14 combat 0.02964960
## 15 services 0.02857143
## 16 choice 0.02549575
## 17 thank 0.02529960
We can also do this using the output from tw_extract, that is, by passing a list of character vectors (this is much faster):
hashtags <- tw_extract(senate_tweets$text, obj = 'hashtag')$hashtag
# Again, but using a list
jaccard <- jaccard_coef(hashtags,max.size = 15000)
jaccard
## Jaccard index Matrix (Sparse) of 3283x3283 elements
## Contains the following words (access via $freq):
## wrd n
## 1 irandeal 202
## 2 iran 179
## 3 scotus 141
## 4 tpa 132
## 5 netde 119
## 6 mepolitics 117
# See what hashtags are related to veterans
words_closeness('veterans',jaccard,.025)
## word coef
## 1 veterans 78.00000000
## 2 honorflight 0.06382979
## 3 va 0.05154639
## 4 miasalutes 0.05000000
## 5 4profit 0.04166667
## 6 choiceact 0.03658537
## 7 40mileissue 0.02564103
## 8 hepc 0.02531646
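For reference, the Jaccard coefficient between two words can be read as the number of tweets containing both words divided by the number of tweets containing at least one of them. The sketch below computes it for a single pair of words on hypothetical, already tokenized tweets. The function jaccard_pair is made up for illustration, and it is far less efficient than a sparse-matrix approach such as the one the output of jaccard_coef suggests.

```r
# Hypothetical tokenized tweets (one character vector per tweet)
docs <- list(
  c("veterans", "va", "care"),
  c("veterans", "honor"),
  c("va", "care"),
  c("veterans", "va")
)

jaccard_pair <- function(a, b, docs) {
  has_a <- vapply(docs, function(d) a %in% d, logical(1))
  has_b <- vapply(docs, function(d) b %in% d, logical(1))
  # |tweets with both| / |tweets with either|
  sum(has_a & has_b) / sum(has_a | has_b)
}

j <- jaccard_pair("veterans", "va", docs)
j
```

Here "veterans" and "va" appear together in 2 tweets and at least one of them appears in all 4, so the coefficient is 0.5.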
George G. Vega Yon
g vegayon at caltech