Using my own csv loader #521

Closed
mrgloom opened this issue Jul 4, 2017 · 5 comments

mrgloom commented Jul 4, 2017

I'm trying to reproduce the MNIST result using my own CSV loader. However, it eats too much memory (>100 GB), and I stopped the process while it was still at Optimization Progress: 0%.

Here is the code:

from tpot import TPOTClassifier
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd
import time

model_name="tpot_rf_default"

def load_train_data():
	train_data = np.genfromtxt('MNIST/train.csv', delimiter=',', skip_header=1)
	X_train= train_data[:,1:]
	y_train= train_data[:,0]
	
	print ('X_train.shape', X_train.shape)
	print ('y_train.shape', y_train.shape)
	
	return X_train, y_train

def load_test_data():
	test_data = np.genfromtxt('MNIST/test.csv', delimiter=',', skip_header=1)
	X_test= test_data
	
	print ('X_test.shape', X_test.shape)
	
	return X_test

def train_model():
	X, y= load_train_data()
	X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.20, random_state=42)
	
	tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, n_jobs=-1)
	tpot.fit(X_train, y_train)
	print(tpot.score(X_test, y_test))
	tpot.export('tpot_mnist_pipeline.py')
	
train_model()

I'm using the default exported tpot_mnist_pipeline.py:

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1),
                     tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = KNeighborsClassifier(n_neighbors=6, weights="distance")

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)
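As an aside on the NOTE in the template above: if the label column has another name, it has to be renamed to 'class' before the exported pipeline can find it. A minimal sketch (the column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical data: the label column is called 'label', but the exported
# TPOT template expects it to be named 'class', so we rename it.
df = pd.DataFrame({'label': [0, 1, 0], 'pixel0': [0.0, 0.1, 0.2]})
df = df.rename(columns={'label': 'class'})
print(list(df.columns))  # ['class', 'pixel0']
```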

What can cause this problem?

Running the default example seems OK:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, n_jobs=-1)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')
Generation 1 - Current best internal CV score: 0.957677927281037                                  
Generation 2 - Current best internal CV score: 0.9605635588143144                                 
Generation 3 - Current best internal CV score: 0.9621501163811281                                 
Generation 4 - Current best internal CV score: 0.9687953565796746                                 
Generation 5 - Current best internal CV score: 0.9687953565796746                                 
                                                                                                  
Best pipeline: GradientBoostingClassifier(input_matrix, GradientBoostingClassifier__learning_rate=0.01, GradientBoostingClassifier__max_depth=10, GradientBoostingClassifier__max_features=0.1, GradientBoostingClassifier__min_samples_leaf=6, GradientBoostingClassifier__min_samples_split=11, GradientBoostingClassifier__n_estimators=100, GradientBoostingClassifier__subsample=0.6)
0.957777777778

real	6m21.765s
user	37m52.342s
sys	1m37.300s
@weixuanfu
Contributor

Hmm, I am not sure what is wrong with the np.genfromtxt code. I tried load_train_data with the MNIST download from this link, and no error occurred.

Maybe your issue is related to #492. Could you please try the pandas module instead of numpy for reading the input data? Below is a demo:

import pandas as pd
from sklearn.model_selection import train_test_split

tpot_data = pd.read_csv('/home/andrewcz/tpot/tutorials/data/titanic_train.csv')
features = tpot_data.drop('class', axis=1).values
training_features, testing_features, training_classes, testing_classes = \
                        train_test_split(features, tpot_data['class'].values, random_state=42)
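Not from the thread, but since the original report was about memory: one way to roughly halve the footprint of a large numeric CSV like MNIST is to read it as float32 instead of the default float64. A minimal sketch with a tiny in-memory CSV standing in for the real file:

```python
import io
import numpy as np
import pandas as pd

# Tiny stand-in for a large numeric CSV such as MNIST/train.csv.
csv_text = "class,pixel0,pixel1\n0.0,0.0,255.0\n1.0,128.0,64.0\n"

# Default read: numeric columns come back as float64 (8 bytes per value).
df64 = pd.read_csv(io.StringIO(csv_text))

# Forcing float32 halves the in-memory size of the numeric data.
df32 = pd.read_csv(io.StringIO(csv_text), dtype=np.float32)

print(df64.values.nbytes, df32.values.nbytes)  # float32 uses half the bytes
```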

@rhiever
Contributor

rhiever commented Jul 18, 2017

@mrgloom, did that solution work for you?

@mrgloom
Author

mrgloom commented Jul 23, 2017

I have tested the speed of the numpy loader vs. the pandas loader:

numpy
real 1m17.487s
user 1m14.974s
sys 0m4.410s

pandas
real 0m9.028s
user 0m8.662s
sys 0m4.043s

Code:

import numpy as np
import pandas as pd

def load_train_data_np():
	train_data = np.genfromtxt('MNIST/train.csv', delimiter=',', skip_header=1)
	X_train= train_data[:,1:]
	y_train= train_data[:,0]
	
	print ('X_train.shape', X_train.shape)
	print ('y_train.shape', y_train.shape)
	
	return X_train, y_train

def load_test_data_np():
	test_data = np.genfromtxt('MNIST/test.csv', delimiter=',', skip_header=1)
	X_test= test_data
	
	print ('X_test.shape', X_test.shape)
	
	return X_test

	
def load_train_data_pd():
	# read_csv consumes the header row by default, so skiprows is not needed
	# (skiprows=1 would drop the header and then treat the first data row as the header)
	train_data = pd.read_csv('MNIST/train.csv').values
	X_train= train_data[:,1:]
	y_train= train_data[:,0]
	
	print ('X_train.shape', X_train.shape)
	print ('y_train.shape', y_train.shape)
	
	return X_train, y_train

def load_test_data_pd():
	test_data = pd.read_csv('MNIST/test.csv').values
	X_test= test_data
	
	print ('X_test.shape', X_test.shape)
	
	return X_test

def test_np_loader():
	load_train_data_np()
	load_test_data_np()
	
def test_pd_loader():
	load_train_data_pd()
	load_test_data_pd()
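A self-contained version of this comparison can be sketched without the MNIST files by timing both loaders on the same generated CSV (the sizes and values here are arbitrary, just to make the two results comparable):

```python
import io
import time
import numpy as np
import pandas as pd

# Generate a small numeric CSV in memory: a header row plus 1000 data rows.
rows = "\n".join(",".join(str(i * j % 7) for j in range(10)) for i in range(1000))
csv_text = "h0,h1,h2,h3,h4,h5,h6,h7,h8,h9\n" + rows

# Time np.genfromtxt (skipping the header explicitly).
t0 = time.perf_counter()
a = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)
t_np = time.perf_counter() - t0

# Time pd.read_csv (the header is consumed automatically).
t0 = time.perf_counter()
b = pd.read_csv(io.StringIO(csv_text)).values
t_pd = time.perf_counter() - t0

# Both loaders should produce the same 1000x10 array.
print(a.shape, b.shape, t_np, t_pd)
```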

@weixuanfu
Contributor

Thank you for the feedback. We have already added pandas as the default CSV reader in PR #519, which was merged into the dev branch.

@weixuanfu
Contributor

Closing this issue for now. Please feel free to re-open if you have any more questions or comments.
