I recently finished my PhD, and my supervisor, Patrick James, always described me as a “data monster” in reference to how much I enjoyed playing with data. He was a massive influence throughout my PhD, so I felt it was only appropriate to get him a data-related gift when I finished. To this effect, I made him a wordcloud of all his tweet history!
This blog post explains how we can interact with Twitter data in R using the rtweet (Kearney 2018) package, and convert this raw data into pretty visualisations using the wordcloud2 (Lang 2018) package. Hopefully it is of use to others who may want to replicate the analysis themselves.
Overview
There are three key stages to the process of making the wordcloud:
- Access the data from Twitter: this is done via the rtweet (Kearney 2018) package.
- Clean and extract the word data: removing all additional characters, hyperlinks, etc.
- Format the wordcloud: we need to stylise the appearance of the wordcloud.
The packages used in the analysis are listed as follows:
library(rtweet) # Used for extracting the tweets
library(tm) # Text mining cleaning
library(stringr) # Removing characters
library(qdapRegex) # Removing URLs
library(wordcloud2) # Creating the wordcloud
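If you do not have these packages installed already, they are all available on CRAN and can be installed in one go (this only needs doing once):

# Install the required packages from CRAN
install.packages(c("rtweet", "tm", "stringr", "qdapRegex", "wordcloud2"))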
Extracting Tweets
The Twitter API makes it very easy to download the tweet history for a user, and the rtweet (Kearney 2018) package provides an interface to it from R. You will need to sign up for a developer account to be able to access the API. In my experience the process was not overly difficult, but there was almost a three-week wait before my application was approved. Once you have an account, you will need to authenticate it with R as explained here.
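As a rough sketch, authenticating rtweet looks something like the following; the app name and keys are placeholders that you would replace with the values from your own developer account:

# Authenticate with the Twitter API (all values below are placeholders)
token <- create_token(
  app = "my_twitter_app",
  consumer_key = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token = "YOUR_ACCESS_TOKEN",
  access_secret = "YOUR_ACCESS_SECRET"
)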
Having set up the package, the tweet history for a user can be extracted using the get_timelines function. This extracts up to 3200 of a user's most recent tweets and provides lots of metadata for each tweet (date, time, text, links, location etc.). This is shown below:
# scrape the tweets
tweets_pab <- get_timelines(c("pab_james"), n = 3200)
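Before cleaning anything, it is worth a quick sanity check of what came back; created_at, screen_name and text are some of the metadata columns that rtweet returns:

# Quick check of the downloaded data
dim(tweets_pab) # number of tweets and metadata columns
head(tweets_pab[, c("created_at", "screen_name", "text")])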
Cleaning the Data
Once the tweet history has been extracted, it must be formatted and cleaned for the plot. Firstly, the text column is collapsed into a single character vector:
# Clean the data
text <- str_c(tweets_pab$text, collapse = "")
We now need to clean the text in this string. The str_remove_all function is used to remove line breaks, hashtags and @ mentions, while the rm_twitter_url and rm_url functions from the qdapRegex package (Rinker 2017) strip out any URLs. We are also not interested in keeping basic words such as “a”, “the”, “and” etc., so these are removed using the removeWords and stopwords functions from the tm (Feinerer and Hornik 2018) package:
# continue cleaning the text
text <-
text %>%
str_remove("\\n") %>% # remove linebreaks
rm_twitter_url() %>% # Remove URLS
rm_url() %>%
str_remove_all("#\\S+") %>% # Remove any hashtags
str_remove_all("@\\S+") %>% # Remove any @ mentions
removeWords(stopwords("english")) %>% # Remove common words (a, the, it etc.)
removeNumbers() %>%
stripWhitespace() %>%
removeWords(c("amp")) # Final cleanup of other small changes
Having cleaned the data, we can format the table. The TermDocumentMatrix function from tm is used to construct a frequency table of the words in the text string above. This table is sorted by frequency to make it easier to inspect. A quick summary of the most common words is shown in Table 1.
# Convert the data into a summary table
textCorpus <-
Corpus(VectorSource(text)) %>%
TermDocumentMatrix() %>%
as.matrix()
textCorpus <- sort(rowSums(textCorpus), decreasing=TRUE)
textCorpus <- data.frame(word = names(textCorpus), freq=textCorpus, row.names = NULL)
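For reference, the summary shown in Table 1 can be reproduced with something along these lines (using knitr::kable is an assumption about how the table was rendered for this post):

# Show the six most frequent words (Table 1)
knitr::kable(head(textCorpus, 6))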
| word | freq |
|---|---|
| energy | 410 |
| today | 224 |
| new | 169 |
| students | 111 |
| now | 108 |
| southampton | 107 |

Table 1: The most frequent words in the extracted tweet history.
Building the Wordcloud
Finally, we can build the wordcloud. There are two main packages which can be used for this: wordcloud or wordcloud2. For this example, I have used the wordcloud2 package (Lang 2018), as it offers a few more options for customising the output. Below, we use the frequency table developed above to create the wordcloud, as shown in Figure 1.
# build wordcloud
wordcloud <- wordcloud2(data = textCorpus, minRotation = 0, maxRotation = 0, ellipticity = 0.6)
wordcloud
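The result is an HTML widget rather than a static image. If you want to save it to a file, one option (not part of the workflow above, so treat it as a sketch) is to combine the htmlwidgets and webshot packages:

# Save the widget as HTML, then capture a static PNG
htmlwidgets::saveWidget(wordcloud, "wordcloud.html", selfcontained = FALSE)
webshot::webshot("wordcloud.html", "wordcloud.png", vwidth = 1000, vheight = 800, delay = 5)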
We can play around with this basic setup, and I would recommend checking out the package documentation to see some of the things that can be done. For example, we can provide our own image as a mask to customise the shape of the wordcloud.
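As a sketch of that masking option, wordcloud2 accepts a figPath argument pointing to an image file; the file name below is a placeholder for your own silhouette image:

# Use a silhouette image as a mask for the cloud shape ("mask.png" is a placeholder)
wordcloud2(data = textCorpus, figPath = "mask.png", size = 1.5)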
Wrapping It All Up
If we want to create wordclouds for multiple users, we can wrap the above code up into a function. Below is the TweetsToWordcloud function:
TweetsToWordcloud <- function(username){
tweets <- get_timelines(username, n = 3200)
# Clean the data
text <- str_c(tweets$text, collapse = "") %>%
str_remove("\\n") %>% # remove linebreaks
rm_twitter_url() %>% # Remove URLS
rm_url() %>%
str_remove_all("#\\S+") %>% # Remove any hashtags
str_remove_all("@\\S+") %>% # Remove any @ mentions
removeWords(stopwords("english")) %>% # Remove common words (a, the, it etc.)
removeNumbers() %>%
stripWhitespace() %>%
removeWords(c("amp")) # Final cleanup of other small changes
# Convert the data into a summary table
textCorpus <-
Corpus(VectorSource(text)) %>%
TermDocumentMatrix() %>%
as.matrix()
textCorpus <- sort(rowSums(textCorpus), decreasing=TRUE)
textCorpus <- data.frame(word = names(textCorpus), freq=textCorpus, row.names = NULL)
wordcloud <- wordcloud2(data = textCorpus, minRotation = 0, maxRotation = 0, ellipticity = 0.6)
return(wordcloud)
}
We can then use this function on the tweet history of another of my academic supervisors:
TweetsToWordcloud(username = "dataknut")
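Because the function only needs a username, it is straightforward to loop over several accounts in one go; lapply is just one way of doing this:

# Generate a wordcloud for each user in a vector of handles
usernames <- c("pab_james", "dataknut")
wordclouds <- lapply(usernames, TweetsToWordcloud)
wordclouds[[1]] # view the first wordcloud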
Conclusion
This post highlights how we can extract tweets from Twitter and use them to build data visualisations such as wordclouds. I certainly feel there is a lot more that can be done with this data, so keep an eye out for more posts on this in the future!
References
Feinerer, Ingo, and Kurt Hornik. 2018. Tm: Text Mining Package. https://CRAN.R-project.org/package=tm.
Kearney, Michael W. 2018. Rtweet: Collecting Twitter Data. https://CRAN.R-project.org/package=rtweet.
Lang, Dawei. 2018. Wordcloud2: Create Word Cloud by htmlWidget. https://github.com/lchiffon/wordcloud2.
Rinker, Tyler. 2017. QdapRegex: Regular Expression Removal, Extraction, and Replacement Tools. https://CRAN.R-project.org/package=qdapRegex.