# CCIT

Classifier Conditional Independence Test: a conditional independence (CI) test that reduces CI testing to binary classification with XGBoost.

This is an implementation of the paper: https://arxiv.org/abs/1709.06138

Please cite the above paper if this package is used in any publication.

## Usage for pip install

1. `pip install CCIT==0.4` or `sudo -H pip install CCIT==0.4`.

2(a). Now, in your Python script:

```python
from CCIT import CCIT
from CCIT import DataGen

pvalue = CCIT.CCIT(X, Y, Z)  # without bootstrap

# with 30 bootstrap iterations and 20 threads in parallel
pvalue = CCIT.CCIT(X, Y, Z, num_iter=30, bootstrap=True, nthread=20)
```

2(b). If you want to try the test on synthetic data from the included DataGen module:

```python
from CCIT import CCIT
from CCIT import DataGen

# non-CI dataset, so the p-value should be low
data = DataGen.generate_samples_cos(dx=1, dy=1, dz=20, sType='NI')

X = data[:, 0:1]
Y = data[:, 1:2]
Z = data[:, 2:]

pvalue = CCIT.CCIT(X, Y, Z)  # without bootstrap

# with 30 bootstrap iterations and 20 threads in parallel
pvalue = CCIT.CCIT(X, Y, Z, num_iter=30, bootstrap=True, nthread=20)
```

For best performance, we suggest normalizing each column of the data, either by rescaling each column by its standard deviation (standard normalization) or by bringing all values in each column into the range [0, 1].

Note that when `Z` is `None`, `CCIT` produces a p-value for an (unconditional) independence test between `X` and `Y`.
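As a minimal sketch of the column-wise rescaling (plain NumPy; `rescale_columns` and `minmax_columns` are hypothetical helpers, not part of the CCIT API):

```python
import numpy as np
from CCIT import CCIT

def rescale_columns(A):
    # Hypothetical helper: divide each column by its standard deviation.
    return A / A.std(axis=0, keepdims=True)

def minmax_columns(A):
    # Hypothetical helper: map each column into the range [0, 1].
    mn, mx = A.min(axis=0), A.max(axis=0)
    return (A - mn) / (mx - mn)

X, Y, Z = rescale_columns(X), rescale_columns(Y), rescale_columns(Z)
pvalue = CCIT.CCIT(X, Y, Z)

# With Z = None, CCIT tests (unconditional) independence of X and Y:
pvalue_indep = CCIT.CCIT(X, Y, None)
```

Both helpers assume every column has nonzero variance (or spread); guard against constant columns before dividing.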

## Usage for pip install from the GitHub repo

1. Clone the repo.

2. `cd CCIT`

3. `pip install .`

4. (Optional) From the root directory of the package, run `nosetests`. This is a comprehensive test and may take some time to run.

5. Now, in your Python script:

```python
from CCIT import CCIT

pvalue = CCIT.CCIT(X, Y, Z)
```

There may be some trouble installing the xgboost dependency. In that case, it is recommended to first follow the steps in https://github.com/dmlc/xgboost/blob/master/python-package/build_trouble_shooting.md to install xgboost, and then install CCIT from pip.

## CI Tester

Functions:

1. `CCIT()`

   Main function that generates the p-value of the CI test. If the p-value is low, CI is rejected; if it is high, we fail to reject CI.

   Arguments:
   - `X`: input X table
   - `Y`: input Y table
   - `Z`: input Z table. If `None`, the test reverts to an independence test between X and Y.

   Optional arguments:
   - `max_depths`: e.g. `[6, 10, 13]`; list of tree-depth candidates for XGBoost tuning
   - `n_estimators`: e.g. `[100, 200, 300]`; list of candidates for the number of XGBoost estimators
   - `colsample_bytrees`: e.g. `[0.8]` (recommended); list of `colsample_bytree` candidates for XGBoost tuning
   - `nfold`: number of folds for cross-validation
   - `feature_selection`: default `0` recommended
   - `train_samp`: `-1` recommended; number of examples (out of the total) to be used for training
   - `threshold`: default recommended
   - `num_iter`: number of bootstrap iterations; default 20, recommended 30
   - `nthread`: number of parallel threads for running XGBoost; the number of CPU cores is recommended; default 8
   - `bootstrap`: `True` or `False`. If `False`, `num_iter` is set to 1 and a single deterministic p-value is output without averaging. If `True`, results are averaged over `num_iter` bootstraps and can have randomness; `num_iter` must then be >= 20.

   Output: p-value of the test.
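As a hedged sketch of a fully specified call (the grid values, `nfold=5`, and the 5% significance threshold are illustrative choices, not package defaults):

```python
from CCIT import CCIT

# Tuning grids and runtime knobs below are illustrative, not prescriptive.
pvalue = CCIT.CCIT(
    X, Y, Z,
    max_depths=[6, 10, 13],        # candidate tree depths for XGBoost
    n_estimators=[100, 200, 300],  # candidate numbers of boosting rounds
    colsample_bytrees=[0.8],       # recommended column subsampling
    nfold=5,                       # 5-fold cross-validation during tuning
    feature_selection=0,           # recommended
    train_samp=-1,                 # recommended default train/test split
    num_iter=30,                   # must be >= 20 when bootstrap=True
    bootstrap=True,
    nthread=8,                     # match the number of CPU cores
)
print('CI rejected at the 5% level:', pvalue < 0.05)
```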

## tl;dr version

If the dimensions of X, Y, and Z are 1, 1, and 2 respectively, and the first three i.i.d. samples are as follows:

| X   | Y   | Z        |
|-----|-----|----------|
| 1.0 | 1.0 | 1.5, 2.5 |
| 0.5 | 1.2 | 0.5, 0.6 |
| 0.1 | 4.5 | 1.2, 3.6 |

then the input is:

```python
import numpy as np
from CCIT import CCIT

X = np.array([[1.0], [0.5], [0.1]])
Y = np.array([[1.0], [1.2], [4.5]])
Z = np.array([[1.5, 2.5], [0.5, 0.6], [1.2, 3.6]])
pval = CCIT.CCIT(X, Y, Z)
```
2. `CI_sampler_conditional_kNN()`

   Generates the train and test sets that convert CI testing into binary classification.

   Arguments:
   - `X_in`: samples of r.v. X (np.array)
   - `Y_in`: samples of r.v. Y (np.array)
   - `Z_in`: samples of r.v. Z (np.array)
   - `train_len`: length of the training set; must be less than the number of samples
   - `k`: the k in k-nearest neighbors; always set `k = 1`

   Output:
   - `Xtrain`: features for training the classifier
   - `Ytrain`: training labels
   - `Xtest`: features for the test set
   - `Ytest`: test labels
   - `CI_data`: developer use only
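A hedged usage sketch; the import path is an assumption (the README does not say which module exposes the sampler), and the synthetic data is illustrative:

```python
import numpy as np
from CCIT import CCIT  # assumed: the sampler is exposed alongside CCIT()

n = 1000
X = np.random.randn(n, 1)  # samples of r.v. X
Y = np.random.randn(n, 1)  # samples of r.v. Y
Z = np.random.randn(n, 2)  # samples of r.v. Z

# Convert CI testing into binary classification: half the rows train the
# classifier, the held-out half form the test set. Always use k = 1.
Xtrain, Ytrain, Xtest, Ytest, CI_data = CCIT.CI_sampler_conditional_kNN(
    X, Y, Z, train_len=n // 2, k=1)
```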

## DataGen Module

Functions:

1. `generate_samples_cos()`

   Generates CI, I, or NI post-nonlinear samples:

   1. Z is independent Gaussian.
   2. In the CI case, `X = cos(<a, Z> + b + noise)` and `Y = cos(<c, Z> + d + noise)`.

   Arguments:
   - `size`: number of samples
   - `sType`: `'CI'`, `'I'`, or `'NI'`
   - `dx`: dimension of X
   - `dy`: dimension of Y
   - `dz`: dimension of Z
   - `nstd`: noise standard deviation
   - `freq`: frequency of the cosine function

   Output:
   - `allsamples`: the complete dataset. Each row is one i.i.d. sample; X occupies the first `dx` coordinates, Y the coordinates `[dx : dx+dy]`, and Z the coordinates `[dx+dy : dx+dy+dz]`.
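For example, slicing `allsamples` into X, Y, and Z according to that layout:

```python
from CCIT import DataGen

dx, dy, dz = 1, 1, 20
allsamples = DataGen.generate_samples_cos(dx=dx, dy=dy, dz=dz, sType='CI')

X = allsamples[:, 0:dx]                  # first dx coordinates
Y = allsamples[:, dx:dx + dy]            # next dy coordinates
Z = allsamples[:, dx + dy:dx + dy + dz]  # last dz coordinates
```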
2. `parallel_cos_sample_gen()`

   Creates many datasets of post-nonlinear cosine transforms, half of which are CI and half NI, along with the correct labels. The datasets are stored under a given folder path, which must already exist (for example, create the folder `../data/dim20` first).

   Arguments:
   - `nsamples`: number of i.i.d. samples in each dataset
   - `dx`, `dy`, `dz`: dimensions of X, Y, Z
   - `nstd`: noise standard deviation
   - `freq`: frequency of the cosine function
   - `filetype`: path prefix for the output files. If `filetype = '../data/dim20/datafile'`, the files are stored in `.npy` format in the folder `../data/dim20` and named `datafile0_20.npy` ... `datafile50_20.npy`
   - `num_data`: number of data files
   - `num_proc`: number of processes to run in parallel

   Output:
   `num_data` data files stored in the given folder, plus a `datafile.npy` file containing an array of the correct labels. If the first label is `1`, then `datafile0_20.npy` contains a CI dataset.
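A hedged usage sketch (the keyword names follow the argument list above; all values shown are illustrative, not defaults):

```python
from CCIT import DataGen

# Create the output folder first, e.g. ../data/dim20, then:
DataGen.parallel_cos_sample_gen(
    nsamples=1000,                      # i.i.d. samples per dataset
    dx=1, dy=1, dz=20,                  # dimensions of X, Y, Z
    nstd=0.5,                           # illustrative noise standard deviation
    freq=1.0,                           # illustrative cosine frequency
    filetype='../data/dim20/datafile',  # path prefix; folder must exist
    num_data=50,                        # number of datasets (half CI, half NI)
    num_proc=4,                         # parallel worker processes
)
```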