Questions about sample code? #57

megancooper · 2019-01-21T17:34:07Z

Hello, I am new to Python and machine learning but need to use the library for a project. I read the website and the sample code but am still confused about how to retrieve the features that have been (selected?) by each of the Relief algorithms.

Apologies if the site covers this, but I didn't see any information on it. I had a couple of questions:

1. How do we get back the features selected by each algorithm?
2. The sample code below for the ReliefF algorithm prints a number at the end of the run. Is this number relevant to feature selection?

import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from skrebate import ReliefF
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

genetic_data = pd.read_csv('https://github.com/EpistasisLab/scikit-rebate/raw/master/data/'
                           'GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1.tsv.gz',
                           sep='\t', compression='gzip')

features, labels = genetic_data.drop('class', axis=1).values, genetic_data['class'].values

clf = make_pipeline(ReliefF(n_features_to_select=2, n_neighbors=100),
                    RandomForestClassifier(n_estimators=100))

print(np.mean(cross_val_score(clf, features, labels)))
>>> 0.795

Thanks for any help. I've been trying to figure out this code using the internet for a couple of weeks now but have not really gotten anywhere.


ryanurbs · 2019-01-21T17:44:35Z

Hi Megan, go to the user guide link on github: https://epistasislab.github.io/scikit-rebate/using/ and scroll down to acquiring feature importance scores.

Ryan


megancooper · 2019-01-22T14:56:55Z

Hi Ryan,

Thanks for the response, this helps a lot!

Just to clarify: if I received the following result, would I pick the highest-scoring features (those closest to 1?) from my dataset?

>>>N0    -0.0000166666666667
>>>N1    -0.006175
>>>N2    -0.0079
>>>N3    -0.006275
>>>N4    -0.00684166666667
>>>N5    -0.0104416666667
>>>N6    -0.010275
>>>N7    -0.00785
>>>N8    -0.00824166666667
>>>N9    -0.00515
>>>N10   -0.000216666666667
>>>N11   -0.0039
>>>N12   -0.00291666666667
>>>N13   -0.00345833333333
>>>N14   -0.00324166666667
>>>N15   -0.00886666666667
>>>N16   -0.00611666666667
>>>N17   -0.007325
>>>P1    0.108966666667
>>>P2    0.111
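
For readers following along: in Relief-style scoring a higher score generally indicates a more relevant feature, so "picking the best" amounts to sorting by descending score and keeping the first few names. The sketch below is plain Python with the scores copied from the output above; the helper name `top_k` is made up for illustration and is not part of skrebate.

```python
# Scores copied from the output above (feature name -> score).
scores = {
    "N0": -0.0000166666666667, "N1": -0.006175, "N2": -0.0079,
    "N3": -0.006275, "N4": -0.00684166666667, "N5": -0.0104416666667,
    "N6": -0.010275, "N7": -0.00785, "N8": -0.00824166666667,
    "N9": -0.00515, "N10": -0.000216666666667, "N11": -0.0039,
    "N12": -0.00291666666667, "N13": -0.00345833333333,
    "N14": -0.00324166666667, "N15": -0.00886666666667,
    "N16": -0.00611666666667, "N17": -0.007325,
    "P1": 0.108966666667, "P2": 0.111,
}


def top_k(score_by_name, k):
    """Return the k feature names with the largest scores, best first."""
    return sorted(score_by_name, key=score_by_name.get, reverse=True)[:k]


print(top_k(scores, 2))  # ['P2', 'P1']
```

Here only P1 and P2 have clearly positive scores; the N features sit at or slightly below zero, which suggests they carry little signal in this dataset.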

megancooper · 2019-01-22T15:06:08Z

Actually, my bad, this was covered on the linked page as well:

To sort features by decreasing score along with their names, and simultaneously indicate which features have been assigned a token TuRF feature score (since they were removed from consideration at some point) then add the following...

scored_features = len(fs.top_features_)
sorted_names = sorted(scoreDict, key=lambda x: scoreDict[x], reverse=True)
n = 1
for k in sorted_names:
    if n < scored_features + 1:
        print(k, '\t', scoreDict[k], '\t', n)
    else:
        print(k, '\t', scoreDict[k], '\t', '*')
    n += 1

>>>P1    0.20529375      1
>>>P2    0.17374375      2
>>>N0    -0.00103125     3
>>>N10   -0.00118125     4
>>>N13   -0.0086125      5
>>>N1    -0.0107515625   *
>>>N14   -0.0107515625   *
>>>N16   -0.0107515625   *
>>>N8    -0.0107515625   *
>>>N12   -0.0107515625   *
>>>N3    -0.012890625    *
>>>N2    -0.012890625    *
>>>N7    -0.012890625    *
>>>N17   -0.012890625    *
>>>N5    -0.012890625    *
>>>N15   -0.012890625    *
>>>N11   -0.012890625    *
>>>N4    -0.012890625    *
>>>N9    -0.012890625    *
>>>N6    -0.012890625    *

megancooper · 2019-01-23T20:40:31Z

@ryanurbs I had another question about allowable datatypes. Are strings not supported by this library? I noticed most of the sample data in this repo contains numbers for each feature and no strings. I am currently trying to use data that has strings, and I receive the following error:

TypeError: unsupported operand type(s) for /: 'str' and 'int'

My data looks something like this:

feature1	feature2	feature3	feature4
red	on	large	open
blue	off	small	open

My current code:

import pandas as pd
from sklearn.model_selection import train_test_split
from skrebate import ReliefF

# feature_value_pairs is my own data (built elsewhere)
feature_pairs = pd.DataFrame(feature_value_pairs)

# Separate the features from the label(s) (bug name(s))
features, labels = feature_pairs.drop('class', axis=1).values, feature_pairs['class'].values

# Make sure to compute the feature importance scores from only your training set
X_train, X_test, y_train, y_test = train_test_split(features, labels)

fs = ReliefF()
fs.fit(X_train, y_train)  # This is where the TypeError occurs

ryanurbs · 2019-01-23T20:52:18Z

It was supposed to have been set up to handle strings as well, but I'll have to take a closer look, and I'm not sure when I will be able to get to that. In the meantime, I'd suggest encoding your variables as integers to avoid the error.

Thanks,
Ryan
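
As an illustration of the "encode as integers" workaround, here is a minimal plain-Python sketch that maps each distinct string in a column to an integer code. The rows mirror the sample table from the question, and the helper name `encode_columns` is hypothetical. In a real project, pandas or scikit-learn's `OrdinalEncoder` would likely be a more convenient way to do the same thing.

```python
# Toy rows mirroring the table in the question (all values are strings).
rows = [
    ["red", "on", "large", "open"],
    ["blue", "off", "small", "open"],
]


def encode_columns(rows):
    """Replace every string with an integer code, numbered per column.

    Returns the encoded rows plus one {string: code} mapping per column,
    so the codes can be translated back later.
    """
    n_cols = len(rows[0])
    mappings = [{} for _ in range(n_cols)]
    encoded = []
    for row in rows:
        encoded.append([
            # A value seen for the first time gets the next free code.
            mappings[j].setdefault(value, len(mappings[j]))
            for j, value in enumerate(row)
        ])
    return encoded, mappings


encoded, mappings = encode_columns(rows)
print(encoded)      # [[0, 0, 0, 0], [1, 1, 1, 0]]
print(mappings[0])  # {'red': 0, 'blue': 1}
```

The codes are arbitrary labels in order of first appearance; they do not imply any ranking between the original categories.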


megancooper · 2019-01-24T00:31:47Z

Apologies for the many questions, but I tried encoding my data and I still get the same error. It happens every time on line 140 of relieff.py:

self._labels_std = np.std(self._y, ddof=1)

Here is the full traceback:


Traceback (most recent call last):
line 36, in <module> fs.fit(X_train, y_train)
line 140, in fit self._labels_std = np.std(self._y, ddof=1)
line 3038, in std **kwargs)
line 140, in _std keepdims=keepdims)
line 110, in _var arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
TypeError: unsupported operand type(s) for /: 'str' and 'int'

I checked my labels array and it seems to be formatted correctly. Are there any rules surrounding how these should be formatted? I am directly creating a DataFrame instead of using a tsv file like some of the examples show. Perhaps this is related?

Could it have something to do with the data types of my data frame's columns? I noticed it is a numpy error that happens when np.std() is called, and that each of my columns is of type object.
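
The traceback points at `np.std(self._y, ddof=1)`, which operates on the labels rather than the features, so one possible cause is that the label column still holds strings in an object-dtype array. The sketch below uses made-up labels (`bugA`, `bugB`) to reproduce that kind of failure with NumPy alone and then converts the labels to integer codes with `np.unique`. It is an illustration of the suspected cause, not a confirmed diagnosis of the data in this thread.

```python
import numpy as np

# Hypothetical labels, as they might come out of an object-dtype DataFrame column.
labels = np.array(["bugA", "bugB", "bugA", "bugB"], dtype=object)

# A standard deviation over strings is not defined, so NumPy raises TypeError.
try:
    np.std(labels, ddof=1)
    std_failed = False
except TypeError:
    std_failed = True

# return_inverse=True yields one integer code per element,
# indexing into the sorted array of distinct labels.
classes, codes = np.unique(labels, return_inverse=True)

print(std_failed)             # True
print(list(classes))          # ['bugA', 'bugB']
print(codes.tolist())         # [0, 1, 0, 1]
print(np.std(codes, ddof=1))  # works now that the labels are numeric
```

Passing `codes` instead of the raw string labels to `fit` would avoid this particular error; `classes[codes]` recovers the original names afterwards.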

txsing · 2021-03-12T09:28:28Z


I ran into the same problem. It is a little difficult to find clear instructions on how to get the ReliefF object out of the pipeline object and find the final selected features. I keep getting the error "AttributeError: 'ReliefF' object has no attribute 'feature_importances_'" when calling print(clf['relieff'].feature_importances_).

It would be great if the developers could provide a simpler version of the example code that shows the intermediate steps without using a pipeline.
