## Topic Modeling the October 2012 LDS General Conference

Twice a year, once in May, and once in October, there is the obligatory discussion on the theme of the semi-annual LDS general conference.  The importance of a conference of this type is that each person can take away key principles that will help improve their lives and that is the theme for them.

But I wanted to look at it in a more objective way for the theme of the conference.  What was the overarching theme and topics of the general conference.  There are many topics discussed but there is one that  appears more often.  After a little bit of text mining the most likely topic-terms of the October 2012 LDS General Conference is:

Christ, God, Jesus, Lord, Father, Lives

It  is not the overwhelming theme of the conference but it is often referenced.  Other topics that frequently appear include missionaries and children.  I took every word from all talks during the conference (excluding the statistical report, Priesthood, and Relief Society sessions) and analyzed the data.  I took two approaches.  The first was a simple straight forward summary of word frequency.  This ends up becoming a word cloud.  Though, personally, it’s not my favorite graph to present data the word cloud  tends to be fan-favorite and, if anything, it is a fun way to visualize data.

The second approach is solidly statistical based and uses Latent Dirichlet Allocation.  This is effectively a way to cluster the text spoken during the conference into a bunch of bag of words (that is a technical term).  Those bags contain specified words with a probability.  For those familiar with the Dirichlet distribution you will note that it is the conjugate prior of the categorical distribution.  This means that I can group the words spoken into topics (themes) and each of the groups will contain specified terms (words).

This approach shows that there is a wide distribution of topics that spanned the conference indicating that there wasn’t one dominating theme but a lot of separate themes.  However, if I were to group conference into two themes it would have to be

1) Faith in Jesus Christ

2) Time with Children.

I arrived at this conclusion by clustering the text into two groups and then based on on Latent Dirichlet Allocation found that based on that model terms relating to Christ were in one group and terms relating to children were in the other group.   Specifically the following table shows the top five terms for each topic

 Topic 1 Topic 2 Christ Children Faith Time Jesus Help God Lives Love Love

So there it is.  This is my quick analysis of the October 2012 General Conference and the theme of the conference.  However, I think the one thing that people will remember from this conference is missionary work.

*Note:  For those that care all analyses used R.

## Association Rule Learning and the Apriori Algorithm

Association Rule Learning (also called Association Rule Mining) is a common technique used to find associations between many variables. It is often used by grocery stores, retailers, and anyone with a large transactional databases. It’s the same way that Target knows your pregnant or when you’re buying an item on Amazon.com they know what else you want to buy. The same idea extends to Pandora.com knowing what song you want to listen to next. All of these incorporate, at some level, data mining concepts and association rule algorithms.

Michael Hahsler, et al. has authored and maintains two very useful R packages relating to association rule mining:  the arules package and the arulesViz package.  Furthermore, Hahsler has provided two very good example articles providing details on how to use these packages in Introduction to arules and Visualizing Association Rules.

Often Association Rule Learning is used to analyze the “market-basket” for retailers.  Traditionally, this simply looks at whether a person has purchased an item or not and can be seen as a binary matrix.

Association rules use the R arules library.  The arulesViz add additional features for graphing and plotting the rules.


library("arules");

library("arulesViz");



For testing purposes there is a convenient way to generate random data where patterns can be mined.  The random data is generated in such a way where there is correlation and has correlated items.


patterns = random.patterns(nItems = 1000);

summary(patterns);

trans = random.transactions(nItems = 1000, nTrans = 1000, method = "agrawal",  patterns = patterns);

image(trans);



However, a transaction dataset will usually be available using the approach described in “Data Frames and Transactions“.  The rules can then be created using the apriori function on the transaction dataset.



rules;



Once the rules have been created a researcher can then review and filter the rules down to a manageable subset.  This can be done in a variety of ways using both graphs and by simply inspecting the rules.



plot(rules);



Having 317848 association rules is far too many for a human to deal with.  So we’re going to trim down the rules to the ones that are important.


sel = plot(rules, measure=c("support","lift"), shading="confidence", interactive=TRUE);

subrules = rules[quality(rules)$confidence > 0.8]; subrules  Once again we can now subset the rules to get a visual. In these graphs we can see the two parts to an association rule: the antecedent (IF) and the consequent (THEN). These patterns are found by determining frequent patterns in the data and these are identified by the support and confidence. The support indicates how frequently the items appear in the dataset. The confidence indicates the number of times the IF/THEN statement on the data are true. These IF/THEN statements can be visualized by the following graph: This code will produce many different ways to look at the graphs and can even produce 3-D graphs.  plot(subrules, method="matrix", measure="lift"); plot(subrules, method="matrix", measure="lift", control=list(reorder=TRUE)); plot(subrules, method="matrix3D", measure="lift"); plot(subrules, method="matrix3D", measure="lift", control = list(reorder=TRUE)); plot(subrules, method="matrix", measure=c("lift", "confidence")); plot(subrules, method="matrix", measure=c("lift","confidence"), control = list(reorder=TRUE)); plot(rules, method="grouped"); plot(rules, method="grouped", control=list(k=50)); sel = plot(rules, method="grouped", interactive=TRUE);  We can then subset the rules to the top 30 most important rules and then inspect the smaller set of rules individually to determine where there are meaningful associations.  subrules2 = head(sort(rules, by="lift"), 30); plot(subrules2, method="graph"); plot(subrules2, method="graph", control=list(type="items")); plot(subrules2, method="paracoord"); plot(subrules2, method="paracoord", control=list(reorder=TRUE)); oneRule = sample(rules, 1); inspect(oneRule);  Here we can look at the frequent itemsets and we can use the eclat algorithm rather than the apriori algorithm.  itemFrequencyPlot(Adult, support = 0.1, cex.names=0.8); fsets = eclat(trans, parameter = list(support = 0.05), control = list(verbose=FALSE)); singleItems = fsets[size(items(fsets)) == 1]; singleSupport = quality(singleItems)$support;

names(singleSupport) = unlist(LIST(items(singleItems), decode = FALSE));

itemsetList = LIST(items(fsets), decode = FALSE);

allConfidence = quality(fsets)$support / sapply(itemsetList, function(x) max(singleSupport[as.character(x)])); quality(fsets) = cbind(quality(fsets), allConfidence); summary(fsets);  Using these approaches a researcher can narrow down and determine association rules and determine what leads to frequent items. This is highly useful when working with extremely large datasets. ## Data Frames and Transactions Transactions are a very useful tool when dealing with data mining. It provides a way to mine itemsets or rules on datasets. In R the data must be in transactions form. If the data is only available in a data.frame then to create (or coerce) the data frame to transaction the researcher may use the following code. This example shows the “Adult” dataset available in the arules package. It originates from the “Census Income” database. These data, AdultUCI, can be coerced to transactions using the following commands:  library("arules"); data("AdultUCI"); Adult = as(AdultUCI, "transactions");  The dataframe can be in either a normalized (single) form or a flat file (basket) form. When the file is in basket form it means that each record represents a transaction where the items in the basket are represented by columns. When the dataset is in ‘single’ form it means that each record represents one single item and each item contains a transaction id. The following snippet of code shows the read.transaction() function and how the data is set up.  my_data = paste("1,2","1","2,3", sep="\n"); write(my_data, file = "my_basket"); trans = read.transactions("my_basket", format = "basket", sep=","); inspect(trans);  Once the data has been coerced to transactions the data is ready for mining itemsets or rules. Association Rule Learning uses the transaction data files available in R. A very popular algorithm for association rules is the apriori algorithm. I have discussed approaches on the use of Association Rule Learning and the Apriori Algorithm. ## Chi Square Probability Distribution Code Using PHP This example is a Web application piece of code that I wrote to add a (approximate) p-value to some dynamically generated crosstabs. This will allow a researcher to provide a way to deliver data over the web and will allow the researcher a way to calculate the p-value from a $\chi^{2}$ distribution based on input data. In order to use this function the researcher must calculate the $\chi^{2}$ value first by using the standard $\chi^{2}$ equation: $\sum \frac{ \left( O - E \right)^{2} }{E}$. In addition to the $\chi^{2}$ statistic the researcher needs to provide the degrees of freedom. function getChiSquare($x, $n) { if ( ($n==1) && ($x > 1000) ) { return 0; } if ( ($x>1000) || ($n>1000) ) {$q = getChiSquare(($x-$n)*($x-$n)/(2*$n),1) / 2; if($x > $n) { return$q;
} else {
return 1 - $q; } }$p = exp(-0.5 * $x); if(($n % 2) == 1) {
$p =$p * sqrt(2*$x/pi()); }$k = $n; while($k >= 2) {
$p =$p * ($x/$k);
$k =$k - 2;
}
$t =$p;
$a =$n;
while($t > 0.0000000001 *$p) {
$a =$a + 2;
$t =$t * ($x /$a);
$p =$p + $t; }$retval = 1-$p; return$retval;
}



## N-Way ANOVA

N-Way ANOVA example

Two-way analysis of variance is where the rubber hits the road, so to speak. This extends the concepts of ANOVA with only one factor to two factors. When there are two factors this means that there can be an interaction between the two factors that should be tested. As one might expect this concept can be extended beyond just two factors to an N number of factors.

This example will provide a fairly complete example of a factorial design where there are three factors and a (continuous) response value.  It also provides the tests for ANOVA diagnostics including homogeneity of variance and tests for normality.

Normally, these examples are only one or two pages and are brief. However, analysis of variance can be complex and this example is quite a bit little longer but provides the necessary tools for a sufficient analysis.

## One-Way ANOVA

One-Way ANOVA

Analysis of variance is a tool used for a variety of purposes. Applications range from a common one-way ANOVA, to experimental blocking, to more complex nested designs. This first ANOVA example provides the necessary tools to analyze data using this technique. This example will show a basic one-way ANOVA. I will save the theory and the more detailed model topics behind Analysis of Variance for a separate discussion.

## Using R to connect to a SQL Server and MySQL Database using MS Windows

Connecting to MySQL and Microsoft SQL Server

Connecting to a MySQL database or MS SQL Server from the R environment can be extremely useful.  It allows a researcher direct access to the data without have to first export it from a database and then import it from a csv file or entering it directly into R.

This example shows the process to set up Windows to allow R to connect to MySQL and SQL Server (other databases such as Oracle and PostgreSQL would follow a similar pattern).  These are step-by-step instruction on the setup and provides the R code on how to query the database once connectivity is established.

## Kendall’s Tau

Kendall’s Tau

This is an example of Kendall’s Tau rank correlation.  This is similar to Spearman’s Rho in that it is a non-parametric measure of correlation on ranks.  It is an appropriate measure for ordinal data and is fairly straight forward when there are no ties in the ranks. When ties do exist then variations of Kendall’s Tau can be used. These are tau-a and tau-b.

This example uses the same data from the previous Spearman’s Rho example . This shows a simple example of how one would calculate Kendall’s Tau as well as providing the R commands.

## Spearman’s Rho

Spearman’s Rho Rank Correlation

There are generally three types of correlation that a researcher may encounter: Pearson’s r, Kendall’s Tau, and Spearman’s Rho.  They each have their own uses and applications depending on the data and what you’re trying to achieve.  This example shows how Spearman’s Rho rank correlation is calculated.