Association Rule Learning and the Apriori Algorithm

Association Rule Learning (also called Association Rule Mining) is a common technique used to find associations between many variables.  It is often used by grocery stores, retailers, and anyone with a large transactional database.  It’s how Target knows you’re pregnant, how Amazon.com knows what else you might want to buy when you purchase an item, and how Pandora.com knows what song you want to listen to next.  All of these incorporate, at some level, data mining concepts and association rule algorithms.

Michael Hahsler, et al. have authored and maintain two very useful R packages relating to association rule mining: the arules package and the arulesViz package.  Furthermore, Hahsler has provided two very good example articles detailing how to use these packages in Introduction to arules and Visualizing Association Rules.

Often Association Rule Learning is used to analyze the “market-basket” for retailers.  Traditionally, this simply looks at whether a person has purchased an item or not, so the data can be seen as a binary matrix (a toy example of such a matrix appears below).

Association rule mining in R uses the arules package.  The arulesViz package adds additional features for graphing and plotting the rules.


library("arules");

library("arulesViz");

For testing purposes there is a convenient way to generate random data in which patterns can be mined.  The random data are generated in such a way that the items are correlated and contain patterns to discover.


# Generate random item patterns and simulate transactions that contain them
patterns = random.patterns(nItems = 1000);

summary(patterns);

trans = random.transactions(nItems = 1000, nTrans = 1000, method = "agrawal",  patterns = patterns);

# Visualize the resulting binary transaction matrix
image(trans);

However, a transaction dataset will usually be available using the approach described in “Data Frames and Transactions” below.  The rules can then be created using the apriori function on the transaction dataset.


data("AdultUCI");

Adult = as(AdultUCI, "transactions");

rules = apriori(Adult, parameter=list(support=0.01, confidence=0.5));

rules;

Once the rules have been created a researcher can then review and filter them down to a manageable subset.  This can be done in a variety of ways, using graphs or by simply inspecting the rules.


inspect(head(sort(rules, by="lift"),3));

plot(rules);

head(quality(rules));

plot(rules, measure=c("support","lift"), shading="confidence");

plot(rules, shading="order", control=list(main ="Two-key plot"));

Plot of Association Rules Shaded by Order

All Association Rules

Having 317,848 association rules is far too many for a human to deal with, so we’re going to trim the rules down to the ones that are important.


sel = plot(rules, measure=c("support","lift"), shading="confidence", interactive=TRUE);

subrules = rules[quality(rules)$confidence > 0.8];

subrules

Once again we can subset the rules to get a visual.  In these graphs we can see the two parts of an association rule: the antecedent (IF) and the consequent (THEN).  These rules are found by identifying frequent patterns in the data, and they are measured by their support and confidence.  The support indicates how frequently the items appear in the dataset.  The confidence indicates how often the IF/THEN statement holds true in the data.  These IF/THEN statements can be visualized by the following graph:

Association Rules with Consequent and Antecedent.
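
As a quick reference, using the standard definitions (nothing specific to the arules implementation), for a rule A \Rightarrow B over a set of transactions T:

\text{support}(A \Rightarrow B) = \frac{ \left| \{ t \in T : A \cup B \subseteq t \} \right| }{ \left| T \right| }

\text{confidence}(A \Rightarrow B) = \frac{ \text{support}(A \cup B) }{ \text{support}(A) }

\text{lift}(A \Rightarrow B) = \frac{ \text{confidence}(A \Rightarrow B) }{ \text{support}(B) }

A lift greater than 1 indicates that the antecedent and consequent occur together more often than would be expected if they were independent.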

This code produces many different ways to look at the rules graphically and can even produce 3-D graphs.


plot(subrules, method="matrix", measure="lift");

plot(subrules, method="matrix", measure="lift", control=list(reorder=TRUE));

plot(subrules, method="matrix3D", measure="lift");

plot(subrules, method="matrix3D", measure="lift", control = list(reorder=TRUE));

plot(subrules, method="matrix", measure=c("lift", "confidence"));

plot(subrules, method="matrix", measure=c("lift","confidence"), control = list(reorder=TRUE));

plot(rules, method="grouped");

plot(rules, method="grouped", control=list(k=50));

sel = plot(rules, method="grouped", interactive=TRUE);

We can then subset the rules to the top 30 by lift and inspect this smaller set individually to determine where there are meaningful associations.


subrules2 = head(sort(rules, by="lift"), 30);

plot(subrules2, method="graph");

plot(subrules2, method="graph", control=list(type="items"));

plot(subrules2, method="paracoord");

plot(subrules2, method="paracoord", control=list(reorder=TRUE));

oneRule = sample(rules, 1);

inspect(oneRule);

Frequent Itemsets

Here we can look at the frequent itemsets directly, and we can use the eclat algorithm rather than the apriori algorithm.  The code below also computes the all-confidence measure for each of the frequent itemsets.


itemFrequencyPlot(Adult, support = 0.1, cex.names=0.8);

fsets = eclat(Adult, parameter = list(support = 0.05), control = list(verbose=FALSE));

singleItems = fsets[size(items(fsets)) == 1];

singleSupport = quality(singleItems)$support;

names(singleSupport) = unlist(LIST(items(singleItems), decode = FALSE));

head(singleSupport, n = 5);

itemsetList = LIST(items(fsets), decode = FALSE);

allConfidence = quality(fsets)$support / sapply(itemsetList, function(x) max(singleSupport[as.character(x)]));

quality(fsets) = cbind(quality(fsets), allConfidence);

summary(fsets);

Using these approaches a researcher can narrow down the association rules and determine which items frequently occur together.  This is highly useful when working with extremely large datasets.

Data Frames and Transactions

Transactions are a very useful tool when dealing with data mining.  They provide a way to mine itemsets or rules on datasets.

In R the data must be in transaction form.  If the data are only available in a data.frame, then the researcher may use the following code to create (or coerce) the data frame to transactions.  This example shows the “Adult” dataset available in the arules package, which originates from the “Census Income” database.  These data, AdultUCI, can be coerced to transactions using the following commands:


library("arules");

data("AdultUCI");

Adult = as(AdultUCI, "transactions");

The data frame can be in either a normalized (single) form or a flat-file (basket) form.  When the file is in basket form, each record represents one transaction and lists the items in that basket.  When the dataset is in ‘single’ form, each record represents a single item and includes the id of the transaction it belongs to.  The following snippet of code shows the read.transactions() function and how the data is set up.


my_data = paste("1,2","1","2,3", sep="\n");

write(my_data, file = "my_basket");

trans = read.transactions("my_basket", format = "basket", sep=",");

inspect(trans);
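
For data in ‘single’ form, a similar sketch would specify which columns hold the transaction id and the item.  The file name my_single and the transaction/item labels below are made up for illustration:


my_data2 = paste("t1,bread", "t1,milk", "t2,milk", "t2,eggs", sep = "\n");

write(my_data2, file = "my_single");

trans2 = read.transactions("my_single", format = "single", sep = ",", cols = c(1, 2));

inspect(trans2);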

Once the data has been coerced to transactions, the data is ready for mining itemsets or rules.  Association Rule Learning uses these transaction data files in R.  A very popular algorithm for association rules is the apriori algorithm, and I have discussed approaches to using Association Rule Learning and the Apriori Algorithm above.

 

Chi Square Probability Distribution Code Using PHP

This example is a piece of Web application code that I wrote to add an (approximate) p-value to some dynamically generated crosstabs.  It gives a researcher a way to deliver data over the web and to calculate the p-value from a \chi^{2} distribution based on the input data.  In order to use this function the researcher must first calculate the \chi^{2} value using the standard \chi^{2} equation: \sum \frac{ \left( O - E \right)^{2} }{E}.  In addition to the \chi^{2} statistic the researcher needs to provide the degrees of freedom.

function getChiSquare($x, $n) {
    // Guard against overflow for very large chi-square values with 1 df
    if ( ($n == 1) && ($x > 1000) ) {
        return 0;
    }

    // For large values, fall back to an approximation based on 1 df
    if ( ($x > 1000) || ($n > 1000) ) {
        $q = getChiSquare(($x - $n) * ($x - $n) / (2 * $n), 1) / 2;
        if ($x > $n) {
            return $q;
        } else {
            return 1 - $q;
        }
    }

    // Series expansion for the chi-square cumulative distribution function
    $p = exp(-0.5 * $x);
    if (($n % 2) == 1) {
        $p = $p * sqrt(2 * $x / pi());
    }
    $k = $n;
    while ($k >= 2) {
        $p = $p * ($x / $k);
        $k = $k - 2;
    }
    $t = $p;
    $a = $n;
    while ($t > 0.0000000001 * $p) {
        $a = $a + 2;
        $t = $t * ($x / $a);
        $p = $p + $t;
    }

    // Return the upper-tail probability (the p-value)
    $retval = 1 - $p;
    return $retval;
}
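
As a quick sanity check (not part of the PHP code above), the same upper-tail probability can be computed in R with the built-in pchisq function; a \chi^{2} value of 3.84 with 1 degree of freedom should return a p-value of roughly 0.05:


1 - pchisq(3.84, df = 1);                    # upper-tail p-value, approximately 0.05

pchisq(3.84, df = 1, lower.tail = FALSE);    # equivalent form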

N-Way ANOVA

N-Way ANOVA example

Two-way analysis of variance is where the rubber hits the road, so to speak.  It extends the concepts of ANOVA with only one factor to two factors.  When there are two factors there can be an interaction between them that should be tested.  As one might expect, this concept can be extended beyond just two factors to an N number of factors.

This example will provide a fairly complete example of a factorial design with three factors and a (continuous) response value.  It also provides ANOVA diagnostics, including tests for homogeneity of variance and tests for normality.

Normally, these examples are only one or two pages and are brief.  However, analysis of variance can be complex, and this example is quite a bit longer but provides the necessary tools for a sufficient analysis.
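
As a minimal sketch of the kind of model described here (the data frame dat, the response y, and the factors A, B, and C are placeholder names), a three-factor factorial ANOVA with basic diagnostics might look like:


fit = aov(y ~ A * B * C, data = dat);                   # all main effects and interactions

summary(fit);

bartlett.test(y ~ interaction(A, B, C), data = dat);    # homogeneity of variance across cells

shapiro.test(residuals(fit));                           # normality of the residuals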

One-Way ANOVA

Analysis of variance is a tool used for a variety of purposes.  Applications range from a common one-way ANOVA, to experimental blocking, to more complex nested designs.  This first ANOVA example provides the necessary tools to analyze data using this technique and shows a basic one-way ANOVA.  I will save the theory and the more detailed model topics behind Analysis of Variance for a separate discussion.
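
As a minimal sketch (again assuming a placeholder data frame dat with a response y and a grouping factor group):


fit1 = aov(y ~ group, data = dat);

summary(fit1);

TukeyHSD(fit1);    # pairwise comparisons among the group means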

Using R to connect to a SQL Server and MySQL Database using MS Windows

Connecting to MySQL and Microsoft SQL Server

Connecting to a MySQL database or MS SQL Server from the R environment can be extremely useful.  It allows a researcher direct access to the data without having to first export it from a database and then import it from a csv file or enter it directly into R.

This example shows the process to set up Windows to allow R to connect to MySQL and SQL Server (other databases such as Oracle and PostgreSQL would follow a similar pattern).  It gives step-by-step instructions on the setup and provides the R code for querying the database once connectivity is established.
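
As a rough sketch of the querying step once connectivity is set up (the DSN name, credentials, and table name below are placeholders, and this assumes the RODBC package with an ODBC data source already configured in Windows):


library("RODBC");

channel = odbcConnect("myDSN", uid = "myUser", pwd = "myPassword");   # open the connection

mydata = sqlQuery(channel, "SELECT * FROM mytable");                  # results come back as a data frame

odbcClose(channel);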

Kendall’s Tau

This is an example of Kendall’s Tau rank correlation.  It is similar to Spearman’s Rho in that it is a non-parametric measure of correlation on ranks.  It is an appropriate measure for ordinal data and is fairly straightforward when there are no ties in the ranks.  When ties do exist, variations of Kendall’s Tau, known as tau-a and tau-b, can be used.

This example uses the same data from the previous Spearman’s Rho example.  It shows a simple example of how one would calculate Kendall’s Tau as well as providing the R commands.
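
A minimal sketch with two small made-up vectors of ranks:


x = c(1, 2, 3, 4, 5);
y = c(2, 1, 4, 3, 5);

cor(x, y, method = "kendall");        # Kendall's tau coefficient

cor.test(x, y, method = "kendall");   # tau with a significance test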

Generating Random Correlated Data Part 1

Generating Randomly Correlated Data with Cholesky Decomposition

This example shows how to transform randomly generated continuous data into correlated data with a specified correlation.  It uses the Cholesky decomposition to make the transformation into correlated data.  The second part on generating random data will discuss the process to generate associated random ordinal data.
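
As a rough sketch of the Cholesky approach (the target correlation of 0.7 and the sample size are arbitrary choices for illustration):


set.seed(123);

n = 1000;
rho = 0.7;                                   # target correlation
sigma = matrix(c(1, rho, rho, 1), 2, 2);     # desired correlation matrix

z = matrix(rnorm(n * 2), ncol = 2);          # independent standard normal draws
x = z %*% chol(sigma);                       # impose the correlation structure

cor(x);                                      # should be close to the target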