Kingfisher is a data mining tool that searches for the best non-redundant dependency rules, i.e. rules expressing the statistically most significant dependencies in the data. The special feature of Kingfisher is that it scales up well without any minimum frequency thresholds or other suboptimal heuristics, which means that it can find globally optimal dependencies in the data. Another unique feature is its support for Fisher's p-value as a goodness measure, which produces very reliable results (the discovered dependencies hold in future data, which is not always true with other measures).
The discovered rules are of the form X->A or X->~A, where X is a set of binary attributes and the consequence is a single attribute or its negation. No negations can occur in the antecedent of a rule (you can circumvent this restriction with suitable preprocessing). Notice that with many measures (like Fisher's pF and chi2), the positive dependence between X and A is the same as the positive dependence between ~X and ~A, the negative dependence between X and ~A, and the negative dependence between ~X and A. This means that you can find both positive and negative dependencies with these measures.
The current version evaluates statistical significance with three measures: Fisher's pF (Fisher's exact test), the chi2-measure and mutual information (MI). It is also possible to add your own measure functions (see 5).
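As a concrete illustration of the pF measure, the following sketch (not Kingfisher's own code; the function name and example counts are made up) computes Fisher's one-sided p-value for a rule X->A from the row count and the frequencies of X, A, and XA, using only the Python standard library:

```python
from math import comb, log

def fisher_p(n, m_x, m_a, m_xa):
    """Upper-tail hypergeometric probability: the chance of observing
    at least m_xa co-occurrences of X and A among n rows, given the
    marginal counts m_x and m_a. Small values indicate a significant
    positive dependency X->A."""
    total = comb(n, m_x)
    p = 0.0
    for k in range(m_xa, min(m_x, m_a) + 1):
        p += comb(m_a, k) * comb(n - m_a, m_x - k) / total
    return p

# Example: 10 rows, X and A each occur 5 times and always together.
p = fisher_p(10, 5, 5, 5)
print(p, log(p))   # pF = 1/252, and ln(pF) is what Kingfisher reports
```

Since pF can be extremely small in practice, working with ln(pF) (as Kingfisher does by default) avoids floating-point underflow.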
All discovered rules are required to be non-redundant, i.e. they contain no redundant attributes that do not improve the rule. A rule X->A is considered non-redundant if it is more significant than all of its more general rules with the same consequence. For example, if measure M is increasing by goodness (a high value indicates a good rule) and M(A1 A2 A3 -> A5) <= M(A1 -> A5), then A1 A2 A3 -> A5 is considered redundant with respect to A1 -> A5.
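To make the definition concrete, here is a small sketch (purely illustrative, not Kingfisher's internal code) that checks non-redundancy by comparing a rule against all of its generalizations, for any measure that is increasing by goodness:

```python
from itertools import combinations

def is_non_redundant(antecedent, consequent, measure):
    """True if X->A is strictly better than every more general rule
    X'->A, where X' is a proper subset of X. `measure` is increasing
    by goodness (a high value indicates a good rule)."""
    m = measure(antecedent, consequent)
    for r in range(len(antecedent)):
        for sub in combinations(sorted(antecedent), r):
            if measure(frozenset(sub), consequent) >= m:
                return False  # some generalization is at least as good
    return True

# Toy measure: hand-picked goodness values for rules with consequence A5.
scores = {frozenset(): 0.0, frozenset({'A1'}): 3.0,
          frozenset({'A2'}): 2.0, frozenset({'A1', 'A2'}): 5.0}
measure = lambda X, a: scores[frozenset(X)]
print(is_non_redundant(frozenset({'A1', 'A2'}), 'A5', measure))  # True
```

Kingfisher does not enumerate generalizations this naively, of course; the check only illustrates what the non-redundancy requirement means.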
The details are described in
Hämäläinen, W.: Efficient discovery of the top-K optimal dependency rules with Fisher's exact test of significance. In: Proceedings of the 10th IEEE International Conference on Data Mining (ICDM 2010), pp. 196-205. IEEE Computer Society, 2010.
and (in more detail)
Hämäläinen, W.: Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures. Knowledge and Information Systems: An International Journal (KAIS) 32(2):383-414, 2012.
------------------------------------------------------------------------
(C) Copyright 2010 Wilhelmiina Hämäläinen. The code can be freely used for academic/research purposes. Direct or indirect use for commercial advantage is not allowed without written permission from the author. The code can be modified and redistributed if this note is left intact.
------------------------------------------------------------------------
If you use the program for your own publications, cite one of the references above.
If you are using Linux/Unix with gcc, installation is easy and you shouldn't have any problems. The program may well work in other environments and/or with other compilers, but you will have to find that out yourself.
First, download the source code package and unpack it in a directory of your choice. Then simply run the command "make". (The package contains a README file with these same instructions.)
Notice that handling really high-dimensional data sets requires a relatively large amount of memory. For the classical benchmark data sets (Mushroom, Chess, T10I4D100K, T40I10D100K, Accidents, Pumsb, and Retail), 1 GB of memory is sufficient with measure pF, but the chi2-measure may already require more.
Notice also that the makefile compiles the program into 64-bit code by default. If you want to save memory (which can also speed up the search), you can use the 32-bit mode; just modify the makefile (add and remove the corresponding comment marks). Still, note that some really difficult data sets (especially with the chi2-measure) may require such a large address space that 32 bits is not sufficient. In this case the program may just halt without warning (a well-known malloc bug).
The basic syntax is
kingfisher -i<inputfile> -k<maximum attribute number> [-w<measure>] -M<threshold for the measure> [-q<number of best rules to search for> -t<type of the rule>] [extra options]
Since Kingfisher searches for the Q best rules, it updates the threshold automatically when new good rules are found. However, the first levels are processed faster if you give a sufficiently demanding threshold in the beginning. At each level, the program outputs the currently used threshold value. So, for example, if the program gets stuck at a level where the updated threshold is x, you can stop the search (ctrl-c) and restart with x (or a more demanding value) as the threshold. Alternatively, you can search for a smaller number of best rules (option -q).
Notice also that if you have created new negation (complement) attributes for all attributes in the preprocessing phase, then you should not search for negative dependency rules; otherwise you would get a lot of trivial rules of the form A->~negA, where negA is the new attribute for ~A. In this case there is no need for negative dependency rules anyway, because you will get them in any case (the positive rule X->negA is the same as X->~A).
kingfisher -i mushroom.dat -k124 -M-1000
searches for the 100 best positive dependency rules from mushroom.dat. The measure is by default ln(pF) and the initial threshold is -1000.
kingfisher -imushroom.dat -k124 -M-1000 -t3 -q50
searches for the 50 best positive and negative rules from mushroom.dat. Otherwise like the previous one.
kingfisher -i mushroom.dat -k124 -w3 -M8000
searches for the 100 best positive rules using the chi2-measure (-w3), whose initial threshold is 8000.
Kingfisher, like most rule discovery tools, was originally designed for occurrence data (like market basket data or species or gene occurrence data). However, it is easy to preprocess other types of data into a suitable format. Here we give some hints for preprocessing. Kingfisher itself supports several kinds of constraints which facilitate different preprocessing schemes.
For any nominal data, you can always create a binary attribute for each nominal value. If the original variable A can take values a, b and c, just create binary attributes B1, B2 and B3 for A=a, A=b and A=c. You may also want to create attributes B4, B5 and B6 for A!=a, A!=b and A!=c; this corresponds to creating attributes for the negations ~B1, ~B2 and ~B3. Now it is important to prevent trivial rules which would be just a byproduct of your binarization. Otherwise, you could easily get rules like B1->~B4 (A=a -> ~(A!=a)) or B4 B5 -> B6 (A!=a, A!=b -> A=c). You can define such forbidden combinations in a separate constraint file which you give to Kingfisher with option -e <constraint file>. Each row of the constraint file should list attributes (separated by spaces or commas) no two of which may occur together. If you created new attributes for all negations, you may also want to use option -t1 - see above.
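The scheme above can be sketched in a few lines of Python. Everything here is a hypothetical illustration: variable A with values a, b, c becomes attributes 1-3 (A=a, A=b, A=c) and 4-6 (A!=a, A!=b, A!=c); the transaction-style output (each row lists the numbers of the attributes that hold) and the file names are assumptions, not Kingfisher's documented format.

```python
values = ['a', 'b', 'c']
pos = {v: i + 1 for i, v in enumerate(values)}                # 1..3: A=v
neg = {v: i + 1 + len(values) for i, v in enumerate(values)}  # 4..6: A!=v

def binarize(v):
    """Attribute numbers that hold for one record with A=v."""
    return sorted([pos[v]] + [neg[w] for w in values if w != v])

data_rows = [' '.join(map(str, binarize(v))) for v in ['a', 'c', 'b']]

# One plausible -e constraint file: the positive values exclude each
# other, each value excludes its own negation, and the three negations
# cannot combine (together they would yield trivial rules like B4 B5 -> B6).
constraint_rows = [' '.join(str(pos[v]) for v in values)]
constraint_rows += [f'{pos[v]} {neg[v]}' for v in values]
constraint_rows.append(' '.join(str(neg[v]) for v in values))

with open('data.dat', 'w') as f:
    f.write('\n'.join(data_rows) + '\n')
with open('constraints.txt', 'w') as f:
    f.write('\n'.join(constraint_rows) + '\n')
```

The resulting constraints.txt would then be passed to Kingfisher with -e constraints.txt.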
Numerical variables always require some discretization. In practice, you can either use a static discretization or simulate a dynamic discretization, which lets Kingfisher define the discretization on-line. In static discretization you predefine mutually exclusive intervals and create new attributes for each of them. As in the nominal case, it is advisable to add constraints which prevent trivial rules.
For the dynamic discretization, you can try the following technique: given a numeric variable A, decide on a set of split points a_1,...,a_m (potential end points of an interval). Then create binary attributes for all inequalities A<=a_i and A>=a_i. Add constraints (with option -e) which forbid the attributes for A<=a_i from occurring together, and similarly for the A>=a_i attributes. In addition, you probably want to prevent rules of the form A>=2->A<=5. This can be done with option -b <extra constraint file>, where each row of the extra constraint file lists first a consequence attribute and then its forbidden condition attributes.
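The same encoding can be sketched as follows. Again this is a hypothetical illustration: split points 2, 5, 8 give attributes 1-3 for A<=2, A<=5, A<=8 and 4-6 for A>=2, A>=5, A>=8; the file names and the exact constraint layout are illustrative assumptions.

```python
splits = [2, 5, 8]
m = len(splits)
le = {a: i + 1 for i, a in enumerate(splits)}      # A<=a_i -> 1..m
ge = {a: i + 1 + m for i, a in enumerate(splits)}  # A>=a_i -> m+1..2m

def encode(x):
    """Attribute numbers that hold for a record with numeric value x."""
    return sorted([le[a] for a in splits if x <= a] +
                  [ge[a] for a in splits if x >= a])

# -e file: no two A<=a_i attributes in one rule; likewise for A>=a_i.
exclusive_rows = [' '.join(str(le[a]) for a in splits),
                  ' '.join(str(ge[a]) for a in splits)]

# -b file: forbid rules like A>=2 -> A<=5. Each row lists a consequence
# attribute followed by its forbidden condition attributes.
forbidden_rows = []
for a in splits:
    forbidden_rows.append(' '.join([str(le[a])] + [str(ge[b]) for b in splits]))
    forbidden_rows.append(' '.join([str(ge[a])] + [str(le[b]) for b in splits]))

with open('exclusive.txt', 'w') as f:
    f.write('\n'.join(exclusive_rows) + '\n')
with open('forbidden.txt', 'w') as f:
    f.write('\n'.join(forbidden_rows) + '\n')

print(encode(5))   # the attributes for A<=5, A<=8, A>=2, A>=5
```

The two files would then be passed to Kingfisher with -e exclusive.txt -b forbidden.txt, letting the search choose interval end points on-line.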
Notice that when nominal variables are transformed into binary ones, no information is lost, and you can still find globally optimal dependency rules. With numerical variables this is no longer the case, because discretization can lose information (while it can also clean dependency information from noise). If the optimality of the results is important to you, try to simulate some dynamic discretization scheme in your preprocessing, so that the program itself can set optimal discretization intervals on-line. Depending on the number of numerical variables and their domains, you may be able to find globally optimal rules even when numerical variables are involved. However, if the number of numerical variables is large and/or they have many distinct values, this is no longer feasible, but even then you can try a coarse-level dynamic discretization.
Adding your own measure functions is easy. The only places where you have to make changes are measures.h and measures.c. Short instructions are given there.
For any goodness measure M, it is required that you can define appropriate upper or lower bounds for its values (these are needed for pruning the search space). If measure M is well-behaving, you can easily define the required upper or lower bounds.
If you find any bugs, please report them to Wilhelmiina Hämäläinen.