Association Rule Mining with Apriori Algorithm
Before moving to the association rule mining, we consider what data mining is. According to Wikipedia that defined as “Data mining is a process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems”. Initially, two main methods are there in data mining “Predicting Methods” and “Description Methods”. Under those two methods, we can see mainly a few of the tasks regarding data mining.
- Classification — Predictive
- Clustering — Descriptive
- Association Rule Mining — Descriptive
- Sequential Pattern Mining — Descriptive
- Regression — Predictive
- Deviation Detection — Predictive
Now, we move into the topic of this article.
Using association rule mining we can find the interesting association and relationships among a large set of data. Found association rules show how frequently and similarly occurred the items in the data set.
Applications of Association Rule Mining
- This technique is very popular in market basket analysis.
- Medical diagnosis
- Census data
- Protein sequences
How can we use association rule mining?
When we talking about association rule mining, we can see there are some special terms and measures.
- Item set
- Support count (σ)
- Support
- Frequent itemset
- Association rule
- Rule evaluation metrics
Let’s consider these terms deeply one by one with the below transactions table.
- Itemset
A collection of one or more items are known as an itemset.
Ex: {printer, tablet, monitor}
Very often used another term is regarding with the itemset is k-itemset, which means about an item set with k number of items.
- Support count(σ)
The frequency of an itemset is called the support count.
Example: σ({printer, monitor, headset}) = 1
- Support
There is a clear relationship between the support and support count. Support is the fraction of transactions that contain an itemset.
Example: s({printer, monitor, headset}) = 1/10
- Frequent itemset
An itemset whose support is greater than or equal to the minsup (minimum support) threshold value.
- Association rules
An implication expression of form X →Y, where X and Y are itemsets.
Examples: {laptop, headset}→{tablet}
- Rule evaluation metrics
These evaluation metrics are used to determine the association rules according to our transaction data. Important metrics are described below.
Support
The fraction of transactions that contain both X and Y itemsets.
Ex:
A.R. — {laptop, headset}→{tablet}
s = σ(laptop,headset,tablet) / |T| = 4/10 = 0.4
Confidence
Measure how often items in Y appear in transactions that contain X.
Ex:
A.R. — {laptop, headset}→{tablet}
c = σ(laptop,headset,tablet) / σ(laptop,headset) = 4/4 = 1
For determining the association rules, there are some approaches. Among them, the Brute Force approach is the simplest way to find association rules. But, in that method, we have to calculate support and confidence for all the transactions and after that, we have to confirm the criteria. To be an association rule, the transaction should be filled with the support ≥ minsup and confidence ≥ minconf criteria. Calculate and checking this for each transaction is a very high computational thing. To overcome this problem, the Apriori Algorithm is used to find association rules.
Apriori Algorithm
Apriori Principle
If an itemset is frequent, then all of its subsets must also be frequent. This principle is shown in mathematical notation as below.
∀ X,Y : (X⊆Y) ⟶ s(X) ≥ s(Y)
Support of an itemset never exceeds the support of its subsets, which is called the anti-monotone property of support.
Pseudo code of Apriori Algorithm
Next, let’s consider an example for Apriori Algorithm.
Minimum Support = 50%
Minimum Confidence = 50%
According to transactions we can define the frequent items.
For rule A -> C:
support = support({A, C}) = 50%
confidence = support({A, C})/support({A}) = 66.6%
According to the Apriori Algorithm, any subset of frequent itemset must be frequent.