Fix tests + add titanic database #2

gfournier · 2019-05-02T14:55:05Z

No description provided.

gfournier · 2019-06-06T22:31:12Z

@LionelMassoulard committed Titanic dataset, source is here: https://github.com/gfournier/aikit-datasets/releases/tag/titanic-1.0.0

gfournier · 2019-06-07T16:45:36Z

aikit/tools/helper_functions.py

@@ -357,7 +357,7 @@ def clean_column(s):

    r = s.strip().lower()
    r = re.sub(r"[?\(\)/\[\]\\]", "", r)
-    r = re.sub("[:' \-\.\n]", "_", r)
+    r = re.sub(r"[:' \-\.\n]", "_", r)


Add test for this method
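A test along these lines could cover the method. This is only a sketch: `clean_column` below reproduces just the three lines visible in the hunk above, so the real function in aikit/tools/helper_functions.py may apply further steps. The test names and inputs are hypothetical. The raw-string prefix matters because `\-` and `\.` are invalid escape sequences in a plain string literal, which recent Python versions warn about.

```python
import re


def clean_column(s):
    """Stand-in limited to the lines shown in the diff; the real function may do more."""
    r = s.strip().lower()
    # Drop question marks, parentheses, brackets, slashes and backslashes.
    r = re.sub(r"[?\(\)/\[\]\\]", "", r)
    # Replace colons, quotes, spaces, dashes, dots and newlines with "_".
    r = re.sub(r"[:' \-\.\n]", "_", r)
    return r


def test_clean_column_replaces_separators():
    assert clean_column("  Home.Dest ") == "home_dest"
    assert clean_column("a-b c:d") == "a_b_c_d"


def test_clean_column_drops_brackets():
    assert clean_column("Price (USD)?") == "price_usd"
    assert clean_column("x[0]/y") == "x0y"
```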

gfournier · 2019-06-07T16:52:07Z

Add longtest marker to test_TargetEncoderClassifierEntropy1

LionelMassoulard · 2019-06-07T19:39:51Z

tests/datasets/test_datasets.py


-    res = load_dataset(name)
+    res = load_dataset(name, cache_dir=tempdir)



I think pytest has a fixture for tmpdir:
https://docs.pytest.org/en/latest/tmpdir.html
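For illustration, the fixture could be used roughly as follows. This is a sketch under assumptions: the `load_dataset` defined here is a hypothetical stand-in that only writes a dummy file, not the aikit loader, and the test name is invented. Only the `cache_dir` keyword comes from the diff above.

```python
import os


def load_dataset(name, cache_dir):
    """Hypothetical stand-in for the real loader: writes a dummy CSV into cache_dir."""
    path = os.path.join(str(cache_dir), name + ".csv")
    if not os.path.exists(path):
        with open(path, "w") as f:
            f.write("col\n1\n")
    return path


def test_load_dataset_uses_cache_dir(tmpdir):
    # pytest injects `tmpdir`: a fresh temporary directory unique to this test,
    # so no manual setup or cleanup of the cache directory is needed.
    res = load_dataset("titanic", cache_dir=tmpdir)
    assert os.path.dirname(res) == str(tmpdir)
    assert os.path.exists(res)
```

Because the test receives the directory as an argument, nothing is written to a shared location and repeated runs do not interfere with each other.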

LionelMassoulard · 2019-06-07T20:31:06Z

> @LionelMassoulard committed Titanic dataset, source is here: https://github.com/gfournier/aikit-datasets/releases/tag/titanic-1.0.0

There is a weird "._titanic.csv" file in the archive.
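Files whose names start with "._" are typically AppleDouble metadata entries that macOS adds when an archive is built there. One way a loader might ignore them is sketched below. It assumes a gzipped tar archive (the thread does not state the format), and `data_members` and `build_demo_archive` are hypothetical helpers, not part of aikit.

```python
import io
import os
import tarfile


def data_members(archive):
    """Return archive members, skipping macOS AppleDouble entries ("._name")."""
    return [
        m for m in archive.getmembers()
        if not os.path.basename(m.name).startswith("._")
    ]


def build_demo_archive():
    """Build an in-memory tar.gz holding a data file and its "._" companion."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in ("titanic.csv", "._titanic.csv"):
            payload = b"demo"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=build_demo_archive(), mode="r:gz") as tar:
    names = [m.name for m in data_members(tar)]
```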

gfournier · 2019-06-07T21:54:18Z

Green tests :)
With LightGBM, Gensim, NLTK, Graphviz

LionelMassoulard · 2019-06-08T09:00:43Z

> Green tests :)
> With LightGBM, Gensim, NLTK, Graphviz

It works on my PC as well. I think the last thing is to update the notebooks, since the column names have changed and there are new columns.
(But I'm not sure that having a column name with a dot is a good idea: "home**.**test")

* Fix pytest launch
* Add Titanic dataset
* Fix failing tests
* Add NLTK stopwords to Travis config
* Doc + use tmpdir fixture from pytest
* Update notebooks

gfournier and others added 2 commits June 7, 2019 00:25

Fix pytest launch

1fe9a32

Add Titanic dataset

5833f65

gfournier commented Jun 7, 2019

gfournier mentioned this pull request Jun 7, 2019

Issue on sparse dataframe with pandas >= 0.24.0 #4

Closed

gfournier force-pushed the wip_fix_travis_pytest_launch branch from fb2fd8b to 5833f65 Compare June 7, 2019 19:21

LionelMassoulard reviewed Jun 7, 2019

aikit/datasets/datasets.py (outdated)

LionelMassoulard reviewed Jun 7, 2019

tests/datasets/test_datasets.py (outdated)

LionelMassoulard reviewed Jun 7, 2019

aikit/datasets/datasets.py (outdated)

LionelMassoulard reviewed Jun 7, 2019

tests/datasets/test_datasets.py (outdated)

Fix failing tests

2a32dcb

gfournier changed the title ~~(wip) Fix pytest launch~~ Fix test + add titanic database Jun 7, 2019

gfournier changed the title ~~Fix test + add titanic database~~ Fix tests + add titanic database Jun 7, 2019

gfournier added 2 commits June 7, 2019 23:40

Add NLTK stopwords to Travis config

e3c8468

Doc + use tmpdir fixture from pytest

e2729f3

LionelMassoulard closed this Jun 8, 2019

gfournier reopened this Jun 8, 2019

Update notebooks

113b5b5

gfournier merged commit 5044b8e into societe-generale:master Jun 14, 2019
