Kingfisher - an efficient tool for searching for statistical dependency rules

Download newest version (1.2b)

Contents

  1. What is Kingfisher?
  2. Who can use Kingfisher?
  3. How to install Kingfisher?
  4. How to use Kingfisher?
  5. What if my data is not binary?
  6. How to make your own modifications?
  7. Where to report bugs?

What is Kingfisher?

Kingfisher is a data mining tool, which searches for the best non-redundant dependency rules, which express statistically significant dependencies in data. The special feature of Kingfisher is that it scales up well without any minimum frequency thresholds or other suboptimal heuristics. This means that one can find globally optimal dependencies from data. Another unique feature is that it supports Fisher's p-value as a goodness measure, which produces very reliable results (the discovered dependencies hold in future data which is not always true with other measures).

The discovered rules are of form X->A or X->~A, where X is a set of binary attributes and consequence is any single attribute or its negation. No negations can occur in the antecedent of rules (you can circumvent this restriction with suitable preprocessing). Notice that with many measures (like Fisher's p and chi2), the positive dependence between X and A is the same as positive dependence between ~X and ~A or negative dependence between X and ~A or ~X and A. This means that you can find both positive and negative dependencies with these measures.

The current version evaluates the statistical significance by three measures: Fisher's pF (Fisher's exact test), the chi^2-measure and mutual information (MI). It is also possible to add your own measure functions (see 5).

All discovered rules are required to be non-redundant, which means that they do not contain any redundant attributes, which do not improve the rule. Rule X->A is considered non-redundant, if it is more significant than more general rules with the same consequence. For example, if measure M is increasing by goodness (a high value indicates a good rule) and M(A1 A2 A3 -> A5)<=M(A1 -> A5), then A1 A2 A3 -> A5 is considered redundant in respect of A1 -> A5.

The details are described in

Hämäläinen, W.: Efficient discovery of the top-K optimal dependency rules with Fisher's exact test of significance. Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), pp. 196-205 IEEE Computer Society 2010. (See here)

and (more detailed)

Hämäläinen, W.: Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures. Knowledge and Information Systems: An International Journal (KAIS) 32(2):383-414, 2012. (See here)

Who can use Kingfisher?

Kingfisher has the following copyright, which is also included in the source file:
------------------------------------------------------------------------
 (C) Copyright 2010 Wilhelmiina Hämäläinen.


      The code can be freely used for academic/research purposes. Direct
      or indirect use for commercial advantage is not allowed without
      written permission from the author.

      The code can be modified and redistributed if this note is left
      intact.
-------------------------------------------------------------------------
If you use it for your own publications, use one of the references above.

How to install Kingfisher?

If you are using Linux/Unix with gcc, installing is easy and you shouldn't have any problems. The program may work well in other environments and/or with other compilers, but you have to find it out yourself.

First, load the source code package here. Unpack it in a selected directory. Then simply run command "make". (The package contains a README file with these same instructions.)

Notice that handling really high-dimensional data sets requires relatively much memory. For the classical benchmark data sets (Mushroom, Chess, T10I4D100K, T40I10D100K, Accidents, Pumsb, and Retail) 1 GB memory is sufficient with measure pF, but the chi2-measure may already require more.

Notice also that the default makefile compiles the program by default into 64-bit code. If you want to save memory (which can also speed up the search), you can use the 32-bit mode. Just modify the makefile (add and remove corresponding comment marks). Still, note that some really difficult data sets (especially with the chi2-measure) may require so large address space that 32 bits isn't sufficient. In this case the program may just halt without warning (a well-known malloc bug).

How to use Kingfisher?

The basic syntax is

kingfisher -i<inputfile> -k<maximum attribute number> [-w<measure>] -M<threshold for the measure> [-q<number of best rules to search for> -t<type of the rule>] [extra options]

where