Text Mining

When it comes down to it, R does a really good job handling structured data like matrices and data frames. However, its ability to work with unstructured data is still a work in progress. It can and does handle text mining, but the documentation is incomplete and the capabilities still don't compare to other programs such as MALLET or Mahout.

So, while the formal documentation is still lacking, and though this is not an example on real data, it does provide the basic tools for text mining and, in particular, latent Dirichlet allocation.

There are three R libraries that are useful for text mining: tm, RTextTools, and topicmodels. The tm library is the core of text mining capabilities in R.

Unstructured text files can come in many different formats. I often find that I must get my own data, and consequently the data generally originates as plain text (.txt) files. However, those who want to analyze Twitter feeds can use the twitteR library, which is useful for analyzing social media topics in real time. This example will incorporate the CNN Twitter feed.

In order for R to interpret and analyze these text files, they must ultimately be converted into a document term matrix. But first a corpus must be created. A corpus is simply a collection of documents, where each document is treated as a single unit of text.
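
As a minimal, hypothetical sketch (toy strings rather than the data used below), the basic flow from raw text to a corpus to a document term matrix looks like this:

library(tm);

#two toy documents just to illustrate the corpus-to-matrix flow
toy_docs = c("text mining in R", "topic models for unstructured text");
toy_corpus = Corpus(VectorSource(toy_docs));
toy_dtm = DocumentTermMatrix(toy_corpus);
inspect(toy_dtm);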

When reading text documents directly from local files, the following R code can be used.

Data Preparation using Local Text Files


#These files can be just raw text. For example, the text could simply be copied and pasted from a Web site.
dir = "C:\\Documents and Settings\\clints\\My Documents\\LDA-S";
filenames = list.files(path=dir, pattern="\\.txt");
setwd(dir);

docs = NULL;
titles = NULL;

for (filename in filenames){
  #titles.txt contains the titles of the documents, one per line
  if (filename == "titles.txt"){
    titles = readLines(filename);
  } else {
    #collapse each document's lines into a single string
    docs = c(docs, paste(readLines(filename), collapse="\n"));
  }
}

To pull the text from a Twitter feed rather than from text files, the following lines of code can be used.

Data Preparation using Twitter


library(tm);
library(RTextTools);
library(topicmodels);
library(twitteR);

twitter_feed <- searchTwitter('@cnn', n=150);

### Optional twitter feed retrieval
##twitter_feed <- userTimeline("rdatamining", n=150);
###

df <- do.call("rbind", lapply(twitter_feed, as.data.frame));
myCorpus <- Corpus(VectorSource(df$text));

#number of documents (when using the local text files)
k = length(docs);

#when using the local text files, build the corpus from docs;
#skip this line if the corpus was already built from the Twitter feed above
myCorpus = Corpus(VectorSource(docs));

myCorpus = tm_map(myCorpus, tolower);
myCorpus = tm_map(myCorpus, removePunctuation);
myCorpus = tm_map(myCorpus, removeNumbers);

myStopwords = c(stopwords('english'), "available", "via");
#keep "r" as a term: drop it from the stopword list if it is present
myStopwords = setdiff(myStopwords, "r");
myCorpus = tm_map(myCorpus, removeWords, myStopwords);

#keep an unstemmed copy of the corpus to use as the stemming dictionary
dictCorpus = myCorpus;

myCorpus = tm_map(myCorpus, stemDocument);

#complete the stems back to full words using the unstemmed corpus
myCorpus = tm_map(myCorpus, stemCompletion, dictionary=dictCorpus);

myDtm = DocumentTermMatrix(myCorpus, control = list(minWordLength = 3));

#terms that appear at least 50 times
findFreqTerms(myDtm, lowfreq=50);
#terms correlated (at 0.5 or higher) with a given term; replace 'find_a_word' with a term of interest
findAssocs(myDtm, 'find_a_word', 0.5);


Word Cloud



library(wordcloud);
m = as.matrix(myDtm);
v = sort(colSums(m), decreasing=TRUE);
myNames = names(v);
#manual fix-up for this example: relabel the stemmed term "miners" as "mining"
k = which(names(v) == "miners");
myNames[k] = "mining";
d = data.frame(word=myNames, freq=v);
wordcloud(d$word, d$freq, min.freq=20, random.color=FALSE, colors=c(3,4));

Latent Dirichlet Allocation


#fit the model four ways: VEM, VEM with alpha held fixed, Gibbs sampling, and a correlated topic model
k = 2;
SEED = 1234;
my_TM =
  list(VEM = LDA(myDtm, k = k, control = list(seed = SEED)),
       VEM_fixed = LDA(myDtm, k = k,
                       control = list(estimate.alpha = FALSE, seed = SEED)),
       Gibbs = LDA(myDtm, k = k, method = "Gibbs",
                   control = list(seed = SEED, burnin = 1000,
                                  thin = 100, iter = 1000)),
       CTM = CTM(myDtm, k = k,
                 control = list(seed = SEED,
                                var = list(tol = 10^-4), em = list(tol = 10^-3))));

Topic = topics(my_TM[["VEM"]], 1);

#top 5 terms for each topic in LDA
Terms = terms(my_TM[["VEM"]], 5);
Terms;

(my_topics = topics(my_TM[["VEM"]]));

most_frequent = which.max(tabulate(my_topics));

terms(my_TM[["VEM"]], 10)[, most_frequent];

Here, a model is fit with the number of unobserved latent topics set equal to two (k = 2). We can then identify the most frequently occurring topic and list the top five terms used for each topic. In this example, these are the top five terms when the number of groups is set equal to two.

Topic 1     Topic 2
"amp"       "cnn"
"cnn"       "tweet"
"jobs"      "abc"
"romney"    "bainport"
"sensata"   "cbs"

Topic Modeling the October 2012 LDS General Conference

Twice a year, once in April and once in October, there is the obligatory discussion on the theme of the semi-annual LDS general conference.  The value of a conference of this type is that each person can take away key principles that will help improve their own life, and that becomes the theme for them.

But I wanted to look at the theme of the conference in a more objective way.  What were the overarching theme and topics of the general conference?  There are many topics discussed, but there is one that appears more often than the rest.  After a little bit of text mining, the most likely topic terms of the October 2012 LDS General Conference are:

Christ, God, Jesus, Lord, Father, Lives

It is not the overwhelming theme of the conference, but it is often referenced.  Other topics that frequently appear include missionaries and children.  I took every word from all talks during the conference (excluding the statistical report, Priesthood, and Relief Society sessions) and analyzed the data.  I took two approaches.  The first was a simple, straightforward summary of word frequency, which ends up becoming a word cloud.  Though, personally, it's not my favorite way to present data, the word cloud tends to be a fan favorite and, if anything, it is a fun way to visualize data.

[Figure: Word Cloud for October 2012 General Conference]

The second approach is solidly statistical and uses Latent Dirichlet Allocation.  This is effectively a way to cluster the text spoken during the conference into a bunch of bags of words (that is a technical term).  Those bags contain specified words, each with a probability.  For those familiar with the Dirichlet distribution, you will note that it is the conjugate prior of the categorical distribution.  This means that I can group the words spoken into topics (themes) and each of the groups will contain specified terms (words).
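
To make the "words with a probability" idea concrete, here is a minimal sketch (assuming an LDA model fit as in the my_TM list above): the posterior() function in the topicmodels package returns the estimated probability matrices directly.

library(topicmodels);

#posterior() returns the fitted probabilities of an LDA model:
#$terms - one row per topic, giving word probabilities across the vocabulary
#$topics - one row per document, giving topic probabilities
post = posterior(my_TM[["VEM"]]);

#the five most probable words in topic 1, with their probabilities
round(sort(post$terms[1, ], decreasing=TRUE)[1:5], 3);

#how strongly each document loads on each topic
head(round(post$topics, 3));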

This approach shows that there is a wide distribution of topics spanning the conference, indicating that there wasn't one dominating theme but rather many separate themes.  However, if I were to group the conference into two themes, they would have to be

1) Faith in Jesus Christ

2) Time with Children.

I arrived at this conclusion by clustering the text into two groups and then, based on the Latent Dirichlet Allocation model, finding that terms relating to Christ were in one group and terms relating to children were in the other.  Specifically, the following table shows the top five terms for each topic.

Topic 1   Topic 2
Christ    Children
Faith     Time
Jesus     Help
God       Lives
Love      Love

So there it is.  This is my quick analysis of the October 2012 General Conference and the theme of the conference.  However, I think the one thing that people will remember from this conference is missionary work.

*Note:  For those that care, all analyses used R.