Using my own csv loader #521

Closed
mrgloom opened this issue Jul 4, 2017 · 5 comments

mrgloom commented Jul 4, 2017

I'm trying to reproduce the MNIST result using my own CSV loader. However, it eats too much memory (>100 GB), and I stopped the process while it was still at Optimization Progress: 0%.

Here is the code:

from tpot import TPOTClassifier
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd
import time

model_name="tpot_rf_default"

def load_train_data():
	train_data = np.genfromtxt('MNIST/train.csv', delimiter=',', skip_header=1)
	X_train= train_data[:,1:]
	y_train= train_data[:,0]
	
	print ('X_train.shape', X_train.shape)
	print ('y_train.shape', y_train.shape)
	
	return X_train, y_train

def load_test_data():
	test_data = np.genfromtxt('MNIST/test.csv', delimiter=',', skip_header=1)
	X_test= test_data
	
	print ('X_test.shape', X_test.shape)
	
	return X_test

def train_model():
	X, y= load_train_data()
	X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.20, random_state=42)
	
	tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, n_jobs=-1)
	tpot.fit(X_train, y_train)
	print(tpot.score(X_test, y_test))
	tpot.export('tpot_mnist_pipeline.py')
	
train_model()

I'm using the default exported tpot_mnist_pipeline.py:

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),
                     tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = KNeighborsClassifier(n_neighbors=6, weights="distance")

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)
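As an aside on the NOTE in the template above: if the label column has another name, it has to be renamed to 'class' before the exported pipeline can find it. A minimal sketch (the column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical data: the label column is called 'label', but the exported
# TPOT template expects it to be named 'class', so we rename it.
df = pd.DataFrame({'label': [0, 1, 0], 'pixel0': [0.0, 0.1, 0.2]})
df = df.rename(columns={'label': 'class'})
print(list(df.columns))  # ['class', 'pixel0']
```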

What can cause this problem?

Running the default example seems OK:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, n_jobs=-1)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')
Generation 1 - Current best internal CV score: 0.957677927281037                                  
Generation 2 - Current best internal CV score: 0.9605635588143144                                 
Generation 3 - Current best internal CV score: 0.9621501163811281                                 
Generation 4 - Current best internal CV score: 0.9687953565796746                                 
Generation 5 - Current best internal CV score: 0.9687953565796746                                 
                                                                                                  
Best pipeline: GradientBoostingClassifier(input_matrix, GradientBoostingClassifier__learning_rate=0.01, GradientBoostingClassifier__max_depth=10, GradientBoostingClassifier__max_features=0.1, GradientBoostingClassifier__min_samples_leaf=6, GradientBoostingClassifier__min_samples_split=11, GradientBoostingClassifier__n_estimators=100, GradientBoostingClassifier__subsample=0.6)
0.957777777778

real	6m21.765s
user	37m52.342s
sys	1m37.300s
@weixuanfu
Contributor

Hmm, I am not sure what is wrong with the np.genfromtxt code. I tried load_train_data with the MNIST download from this link, and no error occurred.

Maybe your issue is related to #492. Could you please try the pandas module instead of numpy for reading the input data? Below is a demo:

import pandas as pd
from sklearn.model_selection import train_test_split

tpot_data = pd.read_csv('/home/andrewcz/tpot/tutorials/data/titanic_train.csv')
features = tpot_data.drop('class', axis=1).values
training_features, testing_features, training_classes, testing_classes = \
                        train_test_split(features, tpot_data['class'].values, random_state=42)
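Not from the thread, but since the original report was about memory: one way to roughly halve the footprint of a large numeric CSV like MNIST is to read it as float32 instead of the default float64. A minimal sketch with a tiny in-memory CSV standing in for the real file:

```python
import io
import numpy as np
import pandas as pd

# Tiny stand-in for a large numeric CSV such as MNIST/train.csv.
csv_text = "class,pixel0,pixel1\n0.0,0.0,255.0\n1.0,128.0,64.0\n"

# Default read: numeric columns come back as float64 (8 bytes per value).
df64 = pd.read_csv(io.StringIO(csv_text))

# Forcing float32 halves the in-memory size of the numeric data.
df32 = pd.read_csv(io.StringIO(csv_text), dtype=np.float32)

print(df64.values.nbytes, df32.values.nbytes)  # float32 uses half the bytes
```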

@rhiever
Contributor

rhiever commented Jul 18, 2017

@mrgloom, did that solution work for you?

@mrgloom
Author

mrgloom commented Jul 23, 2017

I have tested the speed of the numpy loader vs. the pandas loader:

numpy
real 1m17.487s
user 1m14.974s
sys 0m4.410s

pandas
real 0m9.028s
user 0m8.662s
sys 0m4.043s

Code:

import numpy as np
import pandas as pd

def load_train_data_np():
	train_data = np.genfromtxt('MNIST/train.csv', delimiter=',', skip_header=1)
	X_train= train_data[:,1:]
	y_train= train_data[:,0]
	
	print ('X_train.shape', X_train.shape)
	print ('y_train.shape', y_train.shape)
	
	return X_train, y_train

def load_test_data_np():
	test_data = np.genfromtxt('MNIST/test.csv', delimiter=',', skip_header=1)
	X_test= test_data
	
	print ('X_test.shape', X_test.shape)
	
	return X_test

	
def load_train_data_pd():
	# read_csv consumes the header row by default, so skiprows is not needed
	# (skiprows=1 would drop the header and then treat the first data row as the header)
	train_data = pd.read_csv('MNIST/train.csv').values
	X_train= train_data[:,1:]
	y_train= train_data[:,0]
	
	print ('X_train.shape', X_train.shape)
	print ('y_train.shape', y_train.shape)
	
	return X_train, y_train

def load_test_data_pd():
	test_data = pd.read_csv('MNIST/test.csv').values
	X_test= test_data
	
	print ('X_test.shape', X_test.shape)
	
	return X_test

def test_np_loader():
	load_train_data_np()
	load_test_data_np()
	
def test_pd_loader():
	load_train_data_pd()
	load_test_data_pd()
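A self-contained version of this comparison can be sketched without the MNIST files by timing both loaders on the same generated CSV (the sizes and values here are arbitrary, just to make the two results comparable):

```python
import io
import time
import numpy as np
import pandas as pd

# Generate a small numeric CSV in memory: a header row plus 1000 data rows.
rows = "\n".join(",".join(str(i * j % 7) for j in range(10)) for i in range(1000))
csv_text = "h0,h1,h2,h3,h4,h5,h6,h7,h8,h9\n" + rows

# Time np.genfromtxt (skipping the header explicitly).
t0 = time.perf_counter()
a = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)
t_np = time.perf_counter() - t0

# Time pd.read_csv (the header is consumed automatically).
t0 = time.perf_counter()
b = pd.read_csv(io.StringIO(csv_text)).values
t_pd = time.perf_counter() - t0

# Both loaders should produce the same 1000x10 array.
print(a.shape, b.shape, t_np, t_pd)
```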

@weixuanfu
Contributor

Thank you for the feedback. We have already added pandas as the default CSV reader in PR #519, which was merged into the dev branch.

@weixuanfu
Contributor

Closing this issue for now. Please feel free to re-open if you have any more questions or comments.
