
Titanic example broken, np.isnan(titanic_new) generates error and training results in CV of 0.5 #201

Closed
MichaelMarkieta opened this issue Jul 28, 2016 · 8 comments

MichaelMarkieta commented Jul 28, 2016

No luck with the Titanic tutorial on the latest versions of TPOT and its dependencies.

Python 3.5.1 (default, Dec 26 2015, 18:08:53) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from tpot import TPOT
>>> from sklearn.cross_validation import train_test_split
>>> import pandas as pd 
>>> import numpy as np
>>> titanic = pd.read_csv('train.csv')
>>> titanic.head(5)
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
>>> titanic.groupby('Sex').Survived.value_counts()
Sex     Survived
female  1           233
        0            81
male    0           468
        1           109
Name: Survived, dtype: int64
>>> titanic.groupby(['Pclass','Sex']).Survived.value_counts()
Pclass  Sex     Survived
1       female  1            91
                0             3
        male    0            77
                1            45
2       female  1            70
                0             6
        male    0            91
                1            17
3       female  0            72
                1            72
        male    0           300
                1            47
Name: Survived, dtype: int64
>>> id = pd.crosstab([titanic.Pclass, titanic.Sex], titanic.Survived.astype(float))
>>> id.div(id.sum(1).astype(float), 0)
Survived            0.0       1.0
Pclass Sex                       
1      female  0.031915  0.968085
       male    0.631148  0.368852
2      female  0.078947  0.921053
       male    0.842593  0.157407
3      female  0.500000  0.500000
       male    0.864553  0.135447
>>> titanic.rename(columns={'Survived': 'class'}, inplace=True)
>>> titanic.dtypes
PassengerId      int64
class            int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
>>> for cat in ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']:
...     print("Number of levels in category '{0}': \b {1:2.2f} ".format(cat, titanic[cat].unique().size))
... 
Number of levels in category 'Name': 891.00 
Number of levels in category 'Sex': 2.00 
Number of levels in category 'Ticket': 681.00 
Number of levels in category 'Cabin': 148.00 
Number of levels in category 'Embarked': 4.00 
>>> for cat in ['Sex', 'Embarked']:
...     print("Levels for category '{0}': {1}".format(cat, titanic[cat].unique()))
... 
Levels for category 'Sex': ['male' 'female']
Levels for category 'Embarked': ['S' 'C' 'Q' nan]
>>> titanic = titanic.fillna(-999)
>>> pd.isnull(titanic).any()
PassengerId    False
class          False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin          False
Embarked       False
dtype: bool
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> mlb = MultiLabelBinarizer()
>>> CabinTrans = mlb.fit_transform([{str(val)} for val in titanic['Cabin'].values])
>>> CabinTrans
array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ..., 
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])
>>> titanic_new = titanic.drop(['Name','Ticket','Cabin','class'], axis=1)
>>> assert (len(titanic['Cabin'].unique()) == len(mlb.classes_)), "Not Equal"
>>> titanic_new = np.hstack((titanic_new.values,CabinTrans))
>>> np.isnan(titanic_new).any()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
>>> titanic_new[0].size
156
>>> titanic_class = titanic['class'].values
>>> training_indices, validation_indices = train_test_split(titanic.index, stratify=titanic_class, train_size=0.75, test_size=0.25)
>>> training_indices.size, validation_indices.size
(668, 223)
>>> tpot = TPOT(generations=5, verbosity=2)
>>> tpot.fit(titanic_new[training_indices], titanic_class[training_indices])
Generation 1 - Current best internal CV score: 0.50000                                                                                                                                                                                      
Generation 2 - Current best internal CV score: 0.50000                                                                                                                                                                                      
Generation 3 - Current best internal CV score: 0.50000                                                                                                                                                                                      
Generation 4 - Current best internal CV score: 0.50000                                                                                                                                                                                      
Generation 5 - Current best internal CV score: 0.50000

Best pipeline: _variance_threshold(input_df, 0.10000000000000001)
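A plausible explanation for the TypeError in the session above (an assumption on my part, not something confirmed in the thread): per the `titanic.dtypes` listing, the object-dtype columns 'Sex' and 'Embarked' are never encoded and are still present when `np.hstack` runs, so `titanic_new` ends up as an object array, which `np.isnan` cannot handle. A minimal sketch of the failure and one way around it, using a toy two-row frame:

```python
import numpy as np
import pandas as pd

# Toy frame mixing a numeric column with an object (string) column,
# mirroring the un-encoded 'Sex' column left in titanic_new above.
df = pd.DataFrame({'Age': [22.0, 38.0], 'Sex': ['male', 'female']})

# hstack-ing mixed-dtype values produces an object array...
arr = np.hstack((df.values, [[1, 0], [0, 1]]))
print(arr.dtype)  # object

# ...and np.isnan rejects object arrays with the TypeError seen above.
try:
    np.isnan(arr)
except TypeError as exc:
    print("np.isnan failed:", type(exc).__name__)

# Encoding the string column numerically first keeps the stacked array float-typed.
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
arr = np.hstack((df.values, [[1, 0], [0, 1]])).astype(np.float64)
print(np.isnan(arr).any())  # False
```

The same check applies to the session: encoding 'Sex' and 'Embarked' (or dropping them) before the `np.hstack` call should make `np.isnan(titanic_new)` work.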
Contributor

rhiever commented Jul 28, 2016

Can't reproduce this bug on an Anaconda install w/ Python 3. Can you please report the versions of Python & packages that you're using?

Author

MichaelMarkieta commented Aug 1, 2016

I am not using Anaconda. Is that a problem?

Python 3.5.1
deap (1.0.2)
numpy (1.11.1)
pandas (0.18.1)
pip (8.1.2)
python-dateutil (2.5.3)
pytz (2016.6.1)
requests (2.10.0)
scikit-learn (0.17.1)
scipy (0.18.0)
setuptools (18.3.1)
six (1.10.0)
TPOT (0.4.1)
tqdm (4.8.1)
update-checker (0.12)
wheel (0.26.0)

Contributor

rhiever commented Aug 1, 2016

Anaconda shouldn't be necessary, and your package versions seem to check out. This is very strange!

What happens if you run the MNIST example? https://github.com/rhiever/tpot#example

Contributor

rhiever commented Aug 13, 2016

Going to close this issue. Please re-open if the problem persists on other data sets.

@snorreralund

I had the same problem. This was caused by using a sparse matrix as input. np.isnan() does not work on sparse matrices. Is there a way around this other than converting to dense?
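To illustrate the behavior described here, a small sketch: `np.isnan` does not accept a scipy sparse matrix, but since NaN can only occur among a sparse matrix's explicitly stored values, checking the `.data` attribute is one workaround that avoids converting to dense (the toy matrix below is my own example, not from the thread):

```python
import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))

# Passing the sparse matrix to np.isnan directly fails, as reported above.
try:
    np.isnan(m)
except TypeError as exc:
    print("np.isnan failed:", type(exc).__name__)

# Implicit zeros are never NaN, so checking the stored values is sufficient.
print(np.isnan(m.data).any())  # False
```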

@weixuanfu
Contributor

Please try scipy.sparse.csr_matrix.todense
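A quick sketch of the suggested conversion (toy matrix of my own): note that `todense()` returns an `np.matrix`, while `toarray()` returns a plain `ndarray`, which is usually what downstream scikit-learn code expects.

```python
import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))

# toarray() yields a plain ndarray that np.isnan handles without issue.
dense = m.toarray()
print(np.isnan(dense).any())  # False
print(dense.shape)  # (2, 2)
```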

@snorreralund

Yes, that works. I meant to ask whether it is possible to use TPOT with a sparse matrix directly.

@weixuanfu
Contributor

Oh, sorry for the misunderstanding. We are working on #523 and #462 (the one-hot encoder can deal with sparse matrices) to add this support in a future version of TPOT.
