Titanic example -problem with 2nd last cell. #492

AIAdventures · 2017-06-05T01:00:33Z

Hi all!
I want to enter the AutoML competition.
I'm trying out the Titanic example to get some familiarity with the software.
I'm running into some trouble with the second-to-last cell.
I'm using Python 3.6 on a Linux machine.

AIAdventures · 2017-06-05T01:01:33Z

rhiever · 2017-06-05T15:21:35Z

Can you please copy and paste the full stack trace from the ValueError? It looks like it's having issues reading the Titanic training data.

AIAdventures · 2017-06-06T11:09:19Z

ValueError Traceback (most recent call last)
in ()
6
7 # NOTE: Make sure that the class is labeled 'class' in the data file
----> 8 tpot_data = np.recfromcsv('/home/andrewcz/tpot/tutorials/data/titanic_train.csv', delimiter=',', dtype=np.float64)
9 features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
10 training_features, testing_features, training_classes, testing_classes = train_test_split(features, tpot_data['class'], random_state=42)

/home/andrewcz/miniconda3/lib/python3.5/site-packages/numpy/lib/npyio.py in recfromcsv(fname, **kwargs)
2044 kwargs.setdefault("delimiter", ",")
2045 kwargs.setdefault("dtype", None)
-> 2046 output = genfromtxt(fname, **kwargs)
2047
2048 usemask = kwargs.get("usemask", False)

/home/andrewcz/miniconda3/lib/python3.5/site-packages/numpy/lib/npyio.py in genfromtxt(fname, dtype, comments, delimiter, skip_header, skip_footer, converters, missing_values, filling_values, usecols, names, excludelist, deletechars, replace_space, autostrip, case_sensitive, defaultfmt, unpack, usemask, loose, invalid_raise, max_rows)
1826 # Raise an exception ?
1827 if invalid_raise:
-> 1828 raise ValueError(errmsg)
1829 # Issue a warning ?
1830 else:

ValueError: Some errors were detected !
Line 2 (got 13 columns instead of 12)
...
Line 892 (got 13 columns instead of 12)

AIAdventures · 2017-06-06T11:10:39Z

Cheers Randy, the above is the full error.
I want to try tpot on the numerai dataset.
Many thanks,
best,
Andrew

AIAdventures · 2017-06-06T11:11:14Z

great piece of software :)!
Best,
Andrew

rhiever · 2017-06-06T19:49:26Z

It does indeed look like it's an issue reading the dataset. Specifically, numpy's np.recfromcsv function is detecting that there are 12 columns in the Titanic dataset (correct) but thinks there are 13 columns in several of the rows. Are you working on a copy of the Titanic dataset directly from our tutorials directory?

AIAdventures · 2017-06-07T00:42:59Z

Yeah, I am using the data in the example.
It might be that my numpy is just not up to date?
I will update numpy and rerun the example.

AIAdventures · 2017-06-07T00:51:11Z

# NOTE: Make sure that the class is labeled 'class' in the data file

tpot_data = np.recfromcsv('/home/andrewcz/tpot/tutorials/data/titanic_train.csv', delimiter=',', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = train_test_split(features, tpot_data['class'], random_state=42)

AIAdventures · 2017-06-07T00:51:35Z

Is the TPOT data file correct?

rhiever · 2017-06-07T14:09:46Z

I see the problem now. We're using numpy's recfromcsv to read the file in, and telling it that the delimiter is a comma. The problem arises when the data contains strings with names, as the names have commas in them. Thus, recfromcsv sees 13 columns in those rows when there are in fact 12.

pandas.read_csv is smart enough to handle this situation, but apparently recfromcsv isn't. @weixuanfu2016 / @teaearlgraycold, maybe we should go back to using pandas to read the files in again? I don't think that pandas is that heavy of a dependency, and apparently the numpy data file reading functions are pretty inflexible.
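To illustrate the difference described above, here is a minimal sketch using only the Python standard library. It is not numpy's or pandas' actual parsing code: a plain `str.split(',')` stands in for a reader that splits on every delimiter, and `csv.reader` stands in for a quote-aware reader. The three-column sample row is a hypothetical stand-in for a Titanic record, not the real file.

```python
import csv
import io

# Hypothetical sample in the style of the Titanic file: the name field is
# quoted and contains a comma.
sample = 'PassengerId,Name,Sex\n1,"Braund, Mr. Owen Harris",male\n'

header_line, data_line = sample.splitlines()

# Splitting on every comma ignores the quotes, so the name is cut in two
# and the row appears to have one column more than the header.
naive_fields = data_line.split(',')

# A quote-aware reader keeps the quoted name together as a single field.
rows = list(csv.reader(io.StringIO(sample)))
quoted_fields = rows[1]

print(len(header_line.split(',')), len(naive_fields), len(quoted_fields))
# prints: 3 4 3
```

The extra field in the naive split is the same kind of off-by-one mismatch reported in the traceback ("got 13 columns instead of 12").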

rhiever · 2017-06-07T14:11:25Z

In the meantime, @AIAdventures, you can change that code to use pandas:

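import pandas as pd
from sklearn.model_selection import train_test_split

# pandas.read_csv respects quoted fields, so commas inside names are not treated as delimiters
tpot_data = pd.read_csv('/home/andrewcz/tpot/tutorials/data/titanic_train.csv')
features = tpot_data.drop('class', axis=1).values
training_features, testing_features, training_classes, testing_classes = train_test_split(
    features, tpot_data['class'].values, random_state=42)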

weixuanfu · 2017-06-07T15:21:51Z

@rhiever I think we could go back to using pandas. If we use TFlearn in a future version of TPOT, tflearn.data_utils.load_csv could be a good alternative.

AIAdventures · 2017-06-08T23:33:28Z

Yeah, from my experience pandas data frames are more reliable than numpy arrays.
It's more of a focused product.
Thank you for your help, I am now going to try the tool with the numerai dataset.
Many thanks,
Best,
Andrew

rhiever · 2017-06-09T16:31:46Z

Great, please feel free to reopen the issue if you have any other questions!

rhiever added the question label Jun 5, 2017

rhiever closed this as completed Jun 9, 2017

This was referenced Jun 29, 2017:

Use pandas.read_csv to replace numpy.recfromcsv for reading dataset #519 (Merged)

Using my own csv loader #521 (Closed)
