
Titanic example broken, np.isnan(titanic_new) generates error and training results in CV of 0.5 #201

Closed
MichaelMarkieta opened this issue Jul 28, 2016 · 8 comments

MichaelMarkieta commented Jul 28, 2016

No luck with the Titanic tutorial on the latest versions of TPOT and its dependencies.

Python 3.5.1 (default, Dec 26 2015, 18:08:53) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tpot import TPOT
>>> from sklearn.cross_validation import train_test_split
>>> import pandas as pd 
>>> import numpy as np
>>> titanic = pd.read_csv('train.csv')
>>> titanic.head(5)
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
>>> titanic.groupby('Sex').Survived.value_counts()
Sex     Survived
female  1           233
        0            81
male    0           468
        1           109
Name: Survived, dtype: int64
>>> titanic.groupby(['Pclass','Sex']).Survived.value_counts()
Pclass  Sex     Survived
1       female  1            91
                0             3
        male    0            77
                1            45
2       female  1            70
                0             6
        male    0            91
                1            17
3       female  0            72
                1            72
        male    0           300
                1            47
Name: Survived, dtype: int64
>>> id = pd.crosstab([titanic.Pclass, titanic.Sex], titanic.Survived.astype(float))
>>> id.div(id.sum(1).astype(float), 0)
Survived            0.0       1.0
Pclass Sex                       
1      female  0.031915  0.968085
       male    0.631148  0.368852
2      female  0.078947  0.921053
       male    0.842593  0.157407
3      female  0.500000  0.500000
       male    0.864553  0.135447
>>> titanic.rename(columns={'Survived': 'class'}, inplace=True)
>>> titanic.dtypes
PassengerId      int64
class            int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
>>> for cat in ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']:
...     print("Number of levels in category '{0}': \b {1:2.2f} ".format(cat, titanic[cat].unique().size))
... 
Number of levels in category 'Name': 891.00 
Number of levels in category 'Sex': 2.00 
Number of levels in category 'Ticket': 681.00 
Number of levels in category 'Cabin': 148.00 
Number of levels in category 'Embarked': 4.00 
>>> for cat in ['Sex', 'Embarked']:
...     print("Levels for category '{0}': {1}".format(cat, titanic[cat].unique()))
... 
Levels for category 'Sex': ['male' 'female']
Levels for category 'Embarked': ['S' 'C' 'Q' nan]
>>> titanic = titanic.fillna(-999)
>>> pd.isnull(titanic).any()
PassengerId    False
class          False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin          False
Embarked       False
dtype: bool
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> mlb = MultiLabelBinarizer()
>>> CabinTrans = mlb.fit_transform([{str(val)} for val in titanic['Cabin'].values])
>>> CabinTrans
array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ..., 
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])
>>> titanic_new = titanic.drop(['Name','Ticket','Cabin','class'], axis=1)
>>> assert (len(titanic['Cabin'].unique()) == len(mlb.classes_)), "Not Equal"
>>> titanic_new = np.hstack((titanic_new.values,CabinTrans))
>>> np.isnan(titanic_new).any()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
>>> titanic_new[0].size
156
>>> titanic_class = titanic['class'].values
>>> training_indices, validation_indices = train_test_split(titanic.index, stratify=titanic_class, train_size=0.75, test_size=0.25)
>>> training_indices.size, validation_indices.size
(668, 223)
>>> tpot = TPOT(generations=5, verbosity=2)
>>> tpot.fit(titanic_new[training_indices], titanic_class[training_indices])
Generation 1 - Current best internal CV score: 0.50000                                                                                                                                                                                      
Generation 2 - Current best internal CV score: 0.50000                                                                                                                                                                                      
Generation 3 - Current best internal CV score: 0.50000                                                                                                                                                                                      
Generation 4 - Current best internal CV score: 0.50000                                                                                                                                                                                      
Generation 5 - Current best internal CV score: 0.50000

Best pipeline: _variance_threshold(input_df, 0.10000000000000001)
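A plausible explanation for the TypeError in the session above (an assumption on my part, not something confirmed in the thread): per the `titanic.dtypes` listing, the object-dtype columns 'Sex' and 'Embarked' are never encoded and are still present when `np.hstack` runs, so `titanic_new` ends up as an object array, which `np.isnan` cannot handle. A minimal sketch of the failure and one way around it, using a toy two-row frame:

```python
import numpy as np
import pandas as pd

# Toy frame mixing a numeric column with an object (string) column,
# mirroring the un-encoded 'Sex' column left in titanic_new above.
df = pd.DataFrame({'Age': [22.0, 38.0], 'Sex': ['male', 'female']})

# hstack-ing mixed-dtype values produces an object array...
arr = np.hstack((df.values, [[1, 0], [0, 1]]))
print(arr.dtype)  # object

# ...and np.isnan rejects object arrays with the TypeError seen above.
try:
    np.isnan(arr)
except TypeError as exc:
    print("np.isnan failed:", type(exc).__name__)

# Encoding the string column numerically first keeps the stacked array float-typed.
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
arr = np.hstack((df.values, [[1, 0], [0, 1]])).astype(np.float64)
print(np.isnan(arr).any())  # False
```

The same check applies to the session: encoding 'Sex' and 'Embarked' (or dropping them) before the `np.hstack` call should make `np.isnan(titanic_new)` work.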
Contributor

rhiever commented Jul 28, 2016

Can't reproduce this bug on an Anaconda install w/ Python 3. Can you please report the versions of Python & packages that you're using?

Author

MichaelMarkieta commented Aug 1, 2016

I am not using Anaconda. Is that a problem?

Python 3.5.1
deap (1.0.2)
numpy (1.11.1)
pandas (0.18.1)
pip (8.1.2)
python-dateutil (2.5.3)
pytz (2016.6.1)
requests (2.10.0)
scikit-learn (0.17.1)
scipy (0.18.0)
setuptools (18.3.1)
six (1.10.0)
TPOT (0.4.1)
tqdm (4.8.1)
update-checker (0.12)
wheel (0.26.0)

Contributor

rhiever commented Aug 1, 2016

Anaconda shouldn't be necessary, and your package versions seem to check out. This is very strange!

What happens if you run the MNIST example? https://github.com/rhiever/tpot#example

Contributor

rhiever commented Aug 13, 2016

Going to close this issue. Please re-open if the problem persists on other data sets.

@snorreralund

I had the same problem. This was caused by using a sparse matrix as input. np.isnan() does not work on sparse matrices. Is there a way around this other than converting to dense?
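To illustrate the behavior described here, a small sketch: `np.isnan` does not accept a scipy sparse matrix, but since NaN can only occur among a sparse matrix's explicitly stored values, checking the `.data` attribute is one workaround that avoids converting to dense (the toy matrix below is my own example, not from the thread):

```python
import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))

# Passing the sparse matrix to np.isnan directly fails, as reported above.
try:
    np.isnan(m)
except TypeError as exc:
    print("np.isnan failed:", type(exc).__name__)

# Implicit zeros are never NaN, so checking the stored values is sufficient.
print(np.isnan(m.data).any())  # False
```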

@weixuanfu
Contributor

Please try scipy.sparse.csr_matrix.todense
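A quick sketch of the suggested conversion (toy matrix of my own): note that `todense()` returns an `np.matrix`, while `toarray()` returns a plain `ndarray`, which is usually what downstream scikit-learn code expects.

```python
import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))

# toarray() yields a plain ndarray that np.isnan handles without issue.
dense = m.toarray()
print(np.isnan(dense).any())  # False
print(dense.shape)  # (2, 2)
```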

@snorreralund

Yes, that works. I meant to ask whether it is possible to use TPOT with a sparse matrix directly.

@weixuanfu
Contributor

Oh, sorry for the misunderstanding. We are working on #523 and #462 (the one-hot encoder can deal with sparse matrices) to add this support in a future version of TPOT.
