e.g. $D_4$ is neutral and balanced, while $D_5$ and $D_6$ are neutral but imbalanced
real example: analysis of DBLP co-author relationships
an advisor-advisee relationship is more likely when the co-author correlation is imbalanced
the advisee publishes almost exclusively with the advisor, while the advisor also publishes with many other co-authors
Example showing Kulczynski and IR (sketched below)
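A minimal sketch (mine, not from the lecture) of how Kulczynski and the Imbalance Ratio flag such a pattern; the co-authorship supports are hypothetical:

```python
def kulczynski(sup_a, sup_b, sup_ab):
    """Kulc(A, B) = (P(A|B) + P(B|A)) / 2; null-invariant, in [0, 1]."""
    return 0.5 * (sup_ab / sup_b + sup_ab / sup_a)

def imbalance_ratio(sup_a, sup_b, sup_ab):
    """IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B))."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

# Hypothetical supports: the advisor co-authors many papers, while the
# advisee's papers are almost all joint work with the advisor.
sup_advisor, sup_advisee, sup_joint = 0.20, 0.02, 0.018
print(kulczynski(sup_advisor, sup_advisee, sup_joint))       # ≈ 0.495 → neutral
print(imbalance_ratio(sup_advisor, sup_advisee, sup_joint))  # ≈ 0.89  → imbalanced
```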
Mining Diverse Patterns
Mining Multi-Level Associations
Items often form hierarchies
How to set min-support thresholds?
Uniform min-support across multiple levels
Level-reduced min-support: items at lower levels are expected to have lower support, so each level gets a smaller threshold than the one above
Efficient mining: shared multi-level mining
use the lowest min-support to generate one shared candidate set and pass it down the hierarchy
later apply each level's higher min-support to filter when analysing that level (see the sketch below)
Multi-level min-support thresholds
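A minimal sketch of the shared pass-down idea, assuming a toy two-level hierarchy and hypothetical thresholds: candidates are counted once with the lowest min-support, then each level filters with its own threshold.

```python
from collections import Counter

# Hypothetical two-level hierarchy and per-level thresholds.
parent = {"2% milk": "milk", "skim milk": "milk",
          "wheat bread": "bread", "white bread": "bread"}
min_sup = {1: 0.50, 2: 0.25}    # level 1 (categories) > level 2 (items)

transactions = [{"2% milk", "wheat bread"}, {"skim milk"},
                {"2% milk", "white bread"}, {"wheat bread"}]
n = len(transactions)

item_count = Counter(i for t in transactions for i in t)
cat_count  = Counter(c for t in transactions for c in {parent[i] for i in t})

# One shared candidate set, generated with the lowest threshold ...
candidates = {i for i, c in item_count.items() if c / n >= min(min_sup.values())}
# ... then filtered per level with that level's own (higher) threshold.
frequent_l2 = {i for i in candidates if item_count[i] / n >= min_sup[2]}
frequent_l1 = {c for c, k in cat_count.items() if k / n >= min_sup[1]}
print(frequent_l1, frequent_l2)
```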
Redundancy filtering when mining multi-level associations
A rule is redundant if its support is close to the value expected from its ancestor rule (the corresponding rule at a higher concept level) and its confidence is similar to the ancestor's
Such rules should be pruned (a sketch of the test follows)
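A sketch of that pruning test with hypothetical numbers: the expected support of the specialised rule is the ancestor's support scaled by the item's share within its parent category.

```python
def is_redundant(anc_sup, anc_conf, share, sup, conf, tol=0.10):
    """Prune if support ≈ share * ancestor support and confidence ≈ ancestor's."""
    expected_sup = anc_sup * share
    return (abs(sup - expected_sup) <= tol * expected_sup
            and abs(conf - anc_conf) <= tol * anc_conf)

# "milk ⇒ wheat bread" (sup 8%, conf 70%); "2% milk" is ~25% of milk sold.
# "2% milk ⇒ wheat bread" (sup 2%, conf 72%) matches expectation → prune it.
print(is_redundant(0.08, 0.70, 0.25, 0.02, 0.72))   # True
```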
Customised min-support for different kinds of items
Items differ in value and therefore in frequency
so customised min-support settings for different kinds of items are necessary (sketch below)
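A tiny sketch of group-based thresholds (the item groups and values are hypothetical):

```python
# Each item group gets its own min-support; rare-but-valuable items
# use a much smaller threshold than everyday items.
group_min_sup = {"luxury": 0.0001, "grocery": 0.01}
item_group = {"Rolex watch": "luxury", "milk": "grocery"}

def is_frequent(item, support):
    return support >= group_min_sup[item_group[item]]

print(is_frequent("Rolex watch", 0.0005))  # True: rare yet above its threshold
print(is_frequent("milk", 0.0005))         # False: too rare for groceries
```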
Mining Multi-Dimensional Associations
Single-dimensional rules (a single predicate)
e.g. buys(X, “milk”) => buys(X, “bread”)
Multi-dimensional rules (two or more predicates; see the sketch after this list)
inter-dimension association rules (no repeated predicates)
e.g. age(X, “18-25”) ^ occupation(X, “student”) => buys(X, “coke”)
hybrid-dimension association rules (repeated predicates)
e.g. age(X, “18-25”) ^ buys(X, “popcorn”) => buys(X, “coke”)
Attributes can be categorical or numerical
categorical: e.g. “bread”, “milk”
numerical: e.g. 15, 20
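One common way to mine such rules (a sketch, not necessarily the course's exact method) is to flatten each relational tuple into (attribute, value) items, after which any single-dimensional frequent-itemset miner applies:

```python
rows = [
    {"age": "18-25", "occupation": "student", "buys": "coke"},
    {"age": "18-25", "occupation": "student", "buys": "coke"},
    {"age": "26-35", "occupation": "engineer", "buys": "milk"},
]

# Each tuple becomes a transaction of (predicate, value) pairs, so an
# itemset {("age","18-25"), ("occupation","student"), ("buys","coke")}
# corresponds to the inter-dimension rule shown above.
transactions = [frozenset(row.items()) for row in rows]
print(transactions[0])
```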
Mining Quantitative Associations
methods
Static discretization based on pre-defined concept hierarchies
discretization: partitioning numerical attributes into intervals, e.g. age: 18-20, 21-25, etc. (see the sketch after this list)
data-cube-based aggregation
Dynamic discretization based on data distribution
Clustering: Distance-based association
first one-dimensional clustering, then association
Deviation analysis:
e.g. gender == female => wage: $7/hr, while the overall average is $9/hr; such a deviation may be interesting
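A sketch of static discretization using predefined age intervals (the bin edges are hypothetical):

```python
import bisect

edges  = [18, 21, 26, 31, 41]                     # predefined boundaries
labels = ["<18", "18-20", "21-25", "26-30", "31-40", "41+"]

def discretize_age(age):
    """Map a numeric age onto its concept-hierarchy interval."""
    return labels[bisect.bisect_right(edges, age)]

print(discretize_age(19))   # "18-20"
print(discretize_age(25))   # "21-25"
```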
Mining Extraordinary Phenomena in Quantitative Association Mining
e.g. gender == female => wage: $7/hr (while overall: $9/hr)
LHS: a subset of the population
RHS: an extraordinary behaviour of this subset
The rule is accepted only if a statistical test (e.g. a Z-test) confirms the inference with high confidence (see the sketch after this list)
Subrule: highlights extraordinary behaviour of a subset of the super rule's population
e.g. (gender == female) ^ (south == yes) => wage: $7/hr (while overall: $9/hr)
Rule condition may be categorical or numerical
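A sketch of the acceptance test as a one-sample Z-test, with hypothetical wage figures; a rule like the one above is kept only when |z| exceeds the critical value.

```python
import math

def z_score(subset_mean, subset_n, pop_mean, pop_std):
    """Z-statistic for the subset mean deviating from the population mean."""
    return (subset_mean - pop_mean) / (pop_std / math.sqrt(subset_n))

# gender == female: mean wage $7/hr over 400 people;
# population: mean $9/hr, std dev $5/hr (all numbers hypothetical).
z = z_score(7.0, 400, 9.0, 5.0)
print(z, abs(z) > 1.96)     # -8.0 True → significant at the 95% level
```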
Mining Negative Correlations
Rare Patterns vs. Negative Patterns
Rare patterns
very low support, but interesting (e.g. buying Rolex watches)
need to set customised min-support thresholds for different groups of items
with very small values for these rare ones
Negative patterns
negatively correlated: unlikely to occur together
Defining Negatively Correlated Patterns
A support-based definition
if A and B are both frequent but rarely occur together, i.e. $sup(A \cup B) \ll sup(A) \times sup(B)$
then they are negatively correlated
equivalently: $lift(A, B) = \frac{sup(A \cup B)}{sup(A) \times sup(B)} \ll 1$
This definition is problematic when there are many null-transactions, since lift is not null-invariant
A good definition should address the null-invariance problem
A Kulczynski-measure-based definition
if items A and B are frequent but $\frac{1}{2}(P(A \mid B) + P(B \mid A)) < \epsilon$,
where $\epsilon$ is a negative-pattern threshold, then A and B are negatively correlated
the Kulczynski measure ranges from 0 to 1; the smaller it is, the stronger the evidence of negative correlation (see the sketch below)
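A sketch contrasting the two definitions on hypothetical counts: adding null-transactions flips lift from ≪ 1 to ≫ 1, while the Kulczynski value stays put.

```python
def lift(n, n_a, n_b, n_ab):
    return (n_ab / n) / ((n_a / n) * (n_b / n))

def kulc(n_a, n_b, n_ab):
    return 0.5 * (n_ab / n_a + n_ab / n_b)

# 100 transactions contain A, 100 contain B, only 1 contains both.
for null_tx in (0, 100_000):
    n = 199 + null_tx                  # |T(A) ∪ T(B)| = 199, plus nulls
    print(lift(n, 100, 100, 1), kulc(100, 100, 1))
# lift: ~0.02 then ~10.02 (not null-invariant)
# Kulc: 0.01 both times → below a small ε, so A and B are negative
```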
Mining Compressed Patterns
Why: mining often yields too many scattered patterns that are individually not very meaningful
For the example below:
using closed patterns, the output is still large (effectively no compression)
using max-patterns, the output is only $P_3$, but support information is lost (see the sketch below)
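A toy sketch (invented data, not the example referenced above) of why the two compressions behave differently: closed patterns preserve all support information, while max-patterns keep only the longest frequent sets.

```python
from itertools import combinations
from collections import Counter

transactions = [frozenset("abc"), frozenset("abc"), frozenset("ab")]

sup = Counter()                        # support of every occurring itemset
for t in transactions:
    for r in range(1, len(t) + 1):
        for s in combinations(sorted(t), r):
            sup[frozenset(s)] += 1

freq = {s for s, c in sup.items() if c >= 2}               # min_sup = 2
closed  = {s for s in freq if not any(s < t and sup[t] == sup[s] for t in freq)}
maximal = {s for s in freq if not any(s < t for t in freq)}
print(closed)    # {'a','b'} and {'a','b','c'}: supports 3 and 2 both kept
print(maximal)   # only {'a','b','c'}: the support of {'a','b'} is lost
```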
Comments