Text Mining

R does a good job handling structured data such as matrices and data frames, but its ability to work with unstructured data is still a work in progress. It can and does handle text mining, but the documentation is incomplete and the capabilities still don't compare to dedicated tools such as MALLET or Mahout.

Although the formal documentation is still lacking, this example provides the basic tools of text mining in R and, in particular, latent Dirichlet allocation (LDA).

There are three R libraries that are useful for text mining: tm, RTextTools, and topicmodels. The tm library is the core of text mining capabilities in R.

Unstructured text can come in many different formats. I often find that I must gather my own data, and consequently the data generally originates as plain text (.txt) files. However, those who want to analyze Twitter feeds can use the twitteR library, which is useful for analyzing social media topics in near real time. This example will incorporate the CNN Twitter feed.

In order for R to interpret and analyze these text files, they must ultimately be converted into a document-term matrix. But first a corpus must be created. A corpus is simply a collection of documents, where each document is treated as a single unit of text.
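Before touching real files or feeds, the corpus-to-matrix pipeline can be sketched end to end on a few in-line strings. This is only an illustration; the texts and object names below are made up:

```r
library(tm);

# three tiny in-line "documents"
texts = c("Text mining in R uses the tm package",
          "A corpus is a collection of documents",
          "The document term matrix counts terms per document");

# VectorSource treats each string as one document; Corpus wraps them
corpus = Corpus(VectorSource(texts));

# one row per document, one column per distinct term, cells hold counts
dtm = DocumentTermMatrix(corpus);

dim(dtm);       # 3 documents by however many distinct terms survive tokenization
inspect(dtm);   # view the counts themselves
```

The same two calls, `Corpus(VectorSource(...))` and `DocumentTermMatrix(...)`, are the backbone of everything that follows.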

When reading text documents directly from local files, the following R code can be used.

Data Preparation using Local Text Files


# These files can be just raw text; for example, text copied and pasted from a Web site.
dir = "C:\\Documents and Settings\\clints\\My Documents\\LDA-S";
filenames = list.files(path=dir, pattern="\\.txt$");
setwd(dir);

docs = NULL;
titles = NULL;

for (filename in filenames){
    # titles.txt contains the titles of the documents, one per line
    if (filename == "titles.txt"){
        titles = readLines(filename);
    } else {
        # collapse each file into a single string: one document per file
        docs = c(docs, list(paste(readLines(filename), collapse="\n")));
    }
}

To pull the text from a Twitter feed rather than from local text files, the following lines of code can be used.

Data Preparation using Twitter


library(tm);
library(RTextTools);
library(topicmodels);
library(twitteR);

# note: current versions of twitteR require authenticating first via setup_twitter_oauth()
twitter_feed <- searchTwitter('@cnn', n=150);

### Optional twitter feed retrieval
##twitter_feed <- userTimeline("rdatamining", n=150);
###

df <- do.call("rbind", lapply(twitter_feed, as.data.frame));
myCorpus <- Corpus(VectorSource(df$text));

# if the documents came from local text files, build the corpus from docs instead
k = length(docs);
myCorpus = Corpus(VectorSource(docs));

# basic cleaning: lower-case, then strip punctuation and numbers
# (content_transformer keeps the result a valid corpus in recent versions of tm)
myCorpus = tm_map(myCorpus, content_transformer(tolower));
myCorpus = tm_map(myCorpus, removePunctuation);
myCorpus = tm_map(myCorpus, removeNumbers);

# drop common English stopwords plus a few corpus-specific ones,
# but keep "r" since it is a term of interest here
myStopwords = c(stopwords('english'), "available", "via");
myStopwords = setdiff(myStopwords, "r");
myCorpus = tm_map(myCorpus, removeWords, myStopwords);

# keep an unstemmed copy of the corpus as a dictionary for stem completion
dictCorpus = myCorpus;

# stem each word, then complete the stems back to full words using the dictionary
myCorpus = tm_map(myCorpus, stemDocument);
myCorpus = tm_map(myCorpus, stemCompletion, dictionary=dictCorpus);

myDtm = DocumentTermMatrix(myCorpus, control = list(minWordLength = 3));

# terms appearing at least 50 times across the corpus
findFreqTerms(myDtm, lowfreq=50);
# terms whose occurrence correlates with a given word at 0.5 or above
findAssocs(myDtm, 'find_a_word', 0.5);

Word Cloud


library(wordcloud);

m = as.matrix(myDtm);
# total frequency of each term across all documents, most frequent first
v = sort(colSums(m), decreasing=TRUE);
myNames = names(v);
# manually correct a stem-completion artifact
k = which(names(v)=="miners");
myNames[k] = "mining";
d = data.frame(word=myNames, freq=v);
wordcloud(d$word, d$freq, min.freq=20, colors=c(3,4), random.color=FALSE);

Latent Dirichlet Allocation


k = 2;
SEED = 1234;
# fit four topic models on the same document-term matrix:
# LDA by variational EM (with and without estimating alpha),
# LDA by Gibbs sampling, and a correlated topic model (CTM)
my_TM = list(
    VEM = LDA(myDtm, k = k, control = list(seed = SEED)),
    VEM_fixed = LDA(myDtm, k = k,
                    control = list(estimate.alpha = FALSE, seed = SEED)),
    Gibbs = LDA(myDtm, k = k, method = "Gibbs",
                control = list(seed = SEED, burnin = 1000,
                               thin = 100, iter = 1000)),
    CTM = CTM(myDtm, k = k,
              control = list(seed = SEED,
                             var = list(tol = 10^-4), em = list(tol = 10^-3))));

# most likely topic for each document
Topic = topics(my_TM[["VEM"]], 1);

# top 5 terms for each topic in the VEM fit
Terms = terms(my_TM[["VEM"]], 5);
Terms;

(my_topics = topics(my_TM[["VEM"]]));

# topic assigned to the largest number of documents
most_frequent = which.max(tabulate(my_topics));

terms(my_TM[["VEM"]], 10)[, most_frequent];

Here, a model is fit with the number of unobserved latent topics set to two (k = 2). We can then identify the most frequently occurring topic and list the top terms used in each topic. In this example, these are the top five terms when the number of topics is set to two.

Topic 1     Topic 2
"amp"       "cnn"
"cnn"       "tweet"
"jobs"      "abc"
"romney"    "bainport"
"sensata"   "cbs"