Titanic example -problem with 2nd last cell. #492

Closed
AIAdventures opened this issue Jun 5, 2017 · 14 comments

@AIAdventures

Hi all!
I want to enter the AutoML competition, so I'm trying out the Titanic example to get familiar with the software.
I'm running into trouble with the second-to-last cell.
I'm using Python 3.6 on a Linux machine.

[Screenshot: screenshot_2017-06-05_10-59-23]

@rhiever
Contributor

rhiever commented Jun 5, 2017

Can you please copy and paste the full stack trace from the ValueError? It looks like it's having issues reading the Titanic training data.

@AIAdventures
Author

AIAdventures commented Jun 6, 2017


ValueError Traceback (most recent call last)
in ()
6
7 # NOTE: Make sure that the class is labeled 'class' in the data file
----> 8 tpot_data = np.recfromcsv('/home/andrewcz/tpot/tutorials/data/titanic_train.csv', delimiter=',', dtype=np.float64)
9 features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
10 training_features, testing_features, training_classes, testing_classes = train_test_split(features, tpot_data['class'], random_state=42)

/home/andrewcz/miniconda3/lib/python3.5/site-packages/numpy/lib/npyio.py in recfromcsv(fname, **kwargs)
2044 kwargs.setdefault("delimiter", ",")
2045 kwargs.setdefault("dtype", None)
-> 2046 output = genfromtxt(fname, **kwargs)
2047
2048 usemask = kwargs.get("usemask", False)

/home/andrewcz/miniconda3/lib/python3.5/site-packages/numpy/lib/npyio.py in genfromtxt(fname, dtype, comments, delimiter, skip_header, skip_footer, converters, missing_values, filling_values, usecols, names, excludelist, deletechars, replace_space, autostrip, case_sensitive, defaultfmt, unpack, usemask, loose, invalid_raise, max_rows)
1826 # Raise an exception ?
1827 if invalid_raise:
-> 1828 raise ValueError(errmsg)
1829 # Issue a warning ?
1830 else:

ValueError: Some errors were detected !
Line 2 (got 13 columns instead of 12)
...
Line 892 (got 13 columns instead of 12)

@AIAdventures
Author

AIAdventures commented Jun 6, 2017

Cheers Randy, the above is the full error.
I want to try tpot on the numerai dataset.
Many thanks,
best,
Andrew

@AIAdventures
Author

great piece of software :)!
Best,
Andrew

@rhiever
Contributor

rhiever commented Jun 6, 2017

It does indeed look like it's an issue reading the dataset. Specifically, numpy's np.recfromcsv function is detecting that there are 12 columns in the Titanic dataset (correct) but thinks there are 13 columns in several of the rows. Are you working on a copy of the Titanic dataset directly from our tutorials directory?

@AIAdventures
Author

Yes, I am using the data in the example.
It might be that my numpy is just not up to date?
I will update numpy and rerun the example.

@AIAdventures
Author

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('/home/andrewcz/tpot/tutorials/data/titanic_train.csv', delimiter=',', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = train_test_split(features, tpot_data['class'], random_state=42)

@AIAdventures
Author

Is the tpot data file correct?

@rhiever
Contributor

rhiever commented Jun 7, 2017

I see the problem now. We're using numpy's recfromcsv to read the file in, and telling it that the delimiter is a comma (,). The problem arises when the data contains name strings, because the names themselves contain commas. Thus, recfromcsv thinks there are 13 columns in those rows when there are in fact 12.
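
The column mismatch can be reproduced without numpy. The sketch below uses a made-up one-row sample in the Titanic layout (the `SAMPLE` string is an assumption for illustration, not the tutorial file) and compares splitting on every comma with Python's standard `csv` module, which respects double quotes:

```python
import csv
import io

# Hypothetical sample in the Titanic layout: 12 columns, with a quoted
# name field that contains a comma.
SAMPLE = (
    "PassengerId,class,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
)

header_line, data_line = SAMPLE.splitlines()

# Splitting on every comma ignores the quotes, so the name becomes two fields.
naive_fields = data_line.split(",")

# csv.reader treats the quoted name as a single field.
quoted_fields = next(csv.reader(io.StringIO(data_line)))

print(len(header_line.split(",")), len(naive_fields), len(quoted_fields))
# prints: 12 13 12
```

The naive split yields one field too many for the data row, which matches the "got 13 columns instead of 12" message in the traceback.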

pandas.read_csv is smart enough to handle this situation, but apparently recfromcsv isn't. @weixuanfu2016 / @teaearlgraycold, maybe we should go back to using pandas to read the files in again? I don't think that pandas is that heavy of a dependency, and apparently the numpy data file reading functions are pretty inflexible.
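
As a rough check of that claim, here is a minimal sketch that feeds pandas.read_csv a small in-memory sample. The rows in `SAMPLE` are made up for illustration and are not the tutorial file; the call relies on read_csv's default handling of double-quoted fields:

```python
import io

import pandas as pd

# Hypothetical sample in the Titanic layout: 12 columns, quoted names
# containing commas.
SAMPLE = (
    "PassengerId,class,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,PC 17599,71.2833,C85,C\n'
)

# read_csv keeps each quoted name as a single value, comma included.
frame = pd.read_csv(io.StringIO(SAMPLE))

print(frame.shape)           # (2, 12)
print(frame.loc[0, "Name"])  # Braund, Mr. Owen Harris
```

Both rows come back with 12 columns, and the comma stays inside the Name value.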

@rhiever
Contributor

rhiever commented Jun 7, 2017

In the meantime, @AIAdventures, you can change that code to use pandas:

import pandas as pd
from sklearn.model_selection import train_test_split

tpot_data = pd.read_csv('/home/andrewcz/tpot/tutorials/data/titanic_train.csv')
features = tpot_data.drop('class', axis=1).values
training_features, testing_features, training_classes, testing_classes = train_test_split(
    features, tpot_data['class'].values, random_state=42)

@weixuanfu
Contributor

weixuanfu commented Jun 7, 2017

@rhiever I think we could go back to using pandas. If we use TFLearn in a future version of TPOT, tflearn.data_utils.load_csv could be a good alternative.

@AIAdventures
Author

Yes, in my experience pandas data frames are more reliable than numpy arrays.
It's more of a focused product.
Thank you for your help. I am now going to try the tool with the numerai dataset.
Many thanks,
Best,
Andrew

@rhiever
Contributor

rhiever commented Jun 9, 2017

Great, please feel free to reopen the issue if you have any other questions!
