Releases: UrbsLab/STREAMLINE

Beta 0.3.4

28 Sep 21:14

Minor Updates

  • Improved PDF report formatting to display the first page more clearly and to accommodate a larger number of datasets analyzed at once.
  • Fixed edge case bug for running multiple separate replication and replication report phases in legacy mode.

Beta 0.3.3

23 Sep 01:45

Major Updates

  • Added a new data cleaning element: removal of invariant features. During the C2 cleaning phase of data processing, features with only one value, only NaNs, or a mix of one value and NaNs are removed from the dataset. The replication phase has been updated to match, removing the same features that were removed during the original Phase 1 data cleaning.
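
The invariant-feature rule above can be sketched with pandas as follows. This is a minimal illustration, not the STREAMLINE implementation: the function name `find_invariant_features` and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd


def find_invariant_features(df: pd.DataFrame) -> list:
    """Return columns holding at most one distinct non-missing value."""
    # nunique(dropna=True) ignores NaN, so an all-NaN column counts 0
    # and a "one value plus NaN" column counts 1.
    counts = df.nunique(dropna=True)
    return counts[counts <= 1].index.tolist()


train = pd.DataFrame({
    "constant": [1, 1, 1, 1],
    "all_missing": [np.nan] * 4,
    "one_value_or_missing": [5.0, np.nan, 5.0, np.nan],
    "informative": [0, 1, 0, 1],
})

removed = find_invariant_features(train)
train_clean = train.drop(columns=removed)

# The replication data drops the list recorded from the original data,
# even when a column happens to vary in the new data.
replication = pd.DataFrame({
    "constant": [1, 2, 1],
    "all_missing": [np.nan, 3.0, np.nan],
    "one_value_or_missing": [5.0, 5.0, np.nan],
    "informative": [1, 1, 0],
})
replication_clean = replication.drop(columns=removed, errors="ignore")
```

Recording the removed list once and reusing it keeps the replication dataset aligned with the features the models were trained on.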

Minor Updates

  • Fixed algorithm ordering in figures within the Jupyter notebook and Google Colab notebook run modes.
  • Updated the replication phase PDF report to simplify the data processing report.
  • Fixed handling of previously unseen values in binary categorical variables during the replication phase. These values are now converted to NaNs, since a new feature cannot be introduced at this point (it was not included in modeling).
  • Fixed an issue with the naming of engineered missingness features.
  • Fixed an issue with running STREAMLINE on a cluster in legacy mode without specifying files for categorical or quantitative features.
  • Updated the text size on the first page of the PDF report.
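
As a rough illustration of the unseen-value fix described above, the sketch below masks any replication value that was not observed in training. The helper name `mask_unseen_values` and the example values are hypothetical; the actual STREAMLINE code may differ.

```python
import pandas as pd


def mask_unseen_values(values: pd.Series, seen: set) -> pd.Series:
    """Keep values observed during training; replace all others with NaN."""
    # Series.where keeps entries where the condition is True and fills
    # the remaining entries with NaN by default.
    return values.where(values.isin(list(seen)))


seen_in_training = {"yes", "no"}  # hypothetical binary feature values
replication_column = pd.Series(["yes", "no", "maybe", "yes"])
masked = mask_unseen_values(replication_column, seen_in_training)
```

Once masked, the value can be handled by whatever missing-value strategy the pipeline already applies.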

Beta 0.3.2

13 Sep 23:48

Minor fixes and updates:

  • Fixed command-line argument passing for the legacy run mode of STREAMLINE.
  • Updated the legacy run mode of STREAMLINE to submit jobs and then exit, instead of waiting for those jobs to complete.
  • Updated STREAMLINE schematic
  • Updated naming of PDF summary file
  • Added description of 'checking job status' in documentation.

Beta 0.3.1

07 Sep 22:27

Minor Updates

  • Updated the replication phase to handle a special case where no missing value imputation was conducted for a feature in the training data, but one or more missing values were present for that feature in the replication dataset. When this occurs, a simple imputation strategy is now applied to estimate the missing replication values: mean imputation for quantitative features and mode imputation for categorical features. The imputation operations use the pandas mean() and median() functions within model_replicate.py.

  • The legends in all plots, including the Composite Feature Importance plots, are now ordered alphabetically by the full name of the models.
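
The fallback imputation described in the first item above might look roughly like the sketch below. It follows the stated strategy (mean for quantitative, mode for categorical); the note also names median(), which is not used here. The function name `fallback_impute` is hypothetical, and computing the statistic from the column passed in is an assumption, since the note does not say which data supplies it.

```python
import numpy as np
import pandas as pd


def fallback_impute(column: pd.Series, is_categorical: bool) -> pd.Series:
    """Fill missing values with the mode (categorical) or mean (quantitative)."""
    observed = column.dropna()
    # Nothing to fill, or nothing to estimate a fill value from.
    if observed.empty or not column.isna().any():
        return column
    if is_categorical:
        # mode() can return several ties; take the first one.
        fill_value = observed.mode().iloc[0]
    else:
        fill_value = observed.mean()
    return column.fillna(fill_value)


quantitative = pd.Series([1.0, np.nan, 3.0])
categorical = pd.Series([0, 1, 1, np.nan])

filled_quantitative = fallback_impute(quantitative, is_categorical=False)
filled_categorical = fallback_impute(categorical, is_categorical=True)
```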

Beta 0.3.0

06 Aug 06:15

The current version of STREAMLINE is based on our initial STREAMLINE project release, Beta 0.2.5, and has since undergone a major refactoring of its codebase. Many functionalities have been reorganized and extended.

Major Updates

  • Extended to be able to run in parallel on 7 different types of HPC clusters using dask_jobqueue
  • Extended Phase 1 (previously EDA) to include numerical data encoding, automated data cleaning, feature engineering, and a second round of EDA:
    • Added numerical encoding for any binary, text-valued features, with a map file Numerical_Encoding_Map.csv output to document this numerical mapping of original text-values
    • Added quantitative_feature_path parameter in addition to categorical_feature_path allowing users to indicate which features to treat as categorical vs. quantitative (or specify one list and all other features will be treated as the other type). New .csv output files are also generated to identify what features were treated as one feature type or the other after data processing.
    • Added automated feature engineering of 'missingness' features to evaluate missingness as being predictive (assuming MNAR), along with the featureeng_missingness parameter to control this function. Missingness_Engineered_Features.csv is output to document which features were added to the processed dataset as a result.
    • Added automated cleaning of features with high 'missingness', with the cleaning_missingness parameter added to control this function. Missingness_Feature_Cleaning.csv is output to document which features were removed from the processed dataset as a result.
    • Added automated cleaning of instances with high 'missingness', with the cleaning_missingness parameter added to control this function.
    • Added automated one-hot-encoding of all numerical and text-valued categorical features (with 3 or more values) so that they will be treated as such throughout all STREAMLINE phases.
    • Added automated cleaning of highly correlated features (one feature is randomly removed from each highly correlated feature pair), with the correlation_removal_threshold parameter added to control this function. correlation_feature_cleaning.csv is output to document which features were removed in this way.
    • Added DataProcessSummary.csv output file to document changes in feature, feature type, instance, class, and missing value counts during each new cleaning/engineering step.
    • Added a secondary EDA applied to the processed dataset, saved with separate output files to the 'initial' EDA.
  • Adapted the 'replication' phase of STREAMLINE to process the replication data in the same way as the initial 'target dataset' ensuring that the same features are present. This accounts for any new 'as-of-yet' unseen values for categorical features that had previously been one-hot-encoded.
  • Added ability to run the whole pipeline as a single command in the different command line run modes (i.e. from the command line locally or on an HPC). This includes the addition of a variety of new command-line specific run parameters.
  • Added support for running STREAMLINE from the command line using a configuration file (in addition to command-line parameters).
  • Modularized all ML modeling algorithms within classes, which lets users (relatively easily) add other scikit-learn-compatible classification algorithms to the STREAMLINE codebase by creating a Python file in streamline/models/ based on the base model template. This allows code-savvy users to add algorithms we have not yet included, including their own.
  • As a demonstration of the ability to add new ML algorithms in this way, we've added Elastic Net (EN) as the 16th ML algorithm included within STREAMLINE.
  • Extended Google Colab Notebook to (1) automatically download the latest version of STREAMLINE, (2) offer separate 'Easy' and 'Manual' run modes for users to apply the notebook to their own data, where 'Easy' mode uses a prompt to gather essential run parameter information including a file navigation window to select the target dataset folder, (3) automatically download the output experiment folder and open the PDF summary reports on their screen (with user permission).
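
Two of the Phase 1 steps listed above, missingness feature engineering and correlated-feature cleaning, can be sketched as follows. This is an illustrative approximation only: the function names, the `Miss_` prefix, and the toy data are hypothetical, and it assumes that featureeng_missingness and correlation_removal_threshold act as simple cutoffs on the missing fraction and the absolute correlation, respectively.

```python
import numpy as np
import pandas as pd


def add_missingness_features(df, featureeng_missingness):
    """Add a 0/1 indicator column for each feature with enough missing values."""
    out = df.copy()
    added = []
    for name in df.columns:
        # isna().mean() is the fraction of missing entries in the column.
        if df[name].isna().mean() >= featureeng_missingness:
            indicator = "Miss_" + name  # hypothetical naming scheme
            out[indicator] = df[name].isna().astype(int)
            added.append(indicator)
    return out, added


def drop_correlated_features(df, correlation_removal_threshold, seed=42):
    """Drop one randomly chosen feature from each highly correlated pair."""
    rng = np.random.default_rng(seed)
    corr = df.corr().abs()
    names = list(corr.columns)
    dropped = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            # Skip pairs where one member has already been removed.
            if first in dropped or second in dropped:
                continue
            if corr.loc[first, second] >= correlation_removal_threshold:
                dropped.append(str(rng.choice([first, second])))
    return df.drop(columns=dropped), dropped


data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],  # perfectly correlated with "a"
    "c": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "d": [np.nan, 1.0, np.nan, 2.0, 2.0, 1.0],  # one third missing
})

engineered, added = add_missingness_features(data, 0.3)
cleaned, dropped = drop_correlated_features(engineered, 0.9)
```

Returning the lists of added and dropped features alongside the data mirrors the idea of documenting each step in an output file, so the same changes can be replayed on replication data.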

Minor Updates

  • Reverted to using the mean (rather than the median) to present and sort model feature importances in plots (the median was introduced in Beta 0.2.4). This prevents confusion when running the notebook demos on the demonstration datasets, where 3-fold CV yields median = 0 for all decision tree feature importance scores; this hampers picking and sorting the top features for plotting and eliminates decision trees from the composite feature importance plots. A hard-coded option to revert to median ranking is available in the fi_stats() function within statistics.py.
  • Updated the repository folder hierarchy, filenames, and some output file names.
  • Updated STREAMLINE phase groupings/numberings.
  • Updated the STREAMLINE schematic figure to reflect all major changes and new phase grouping.
  • Updated the feature correlation heatmap outputs: (1) changed the color scheme (for clarity), (2) show the non-redundant triangle rather than the full square, and (3) scale the feature names to avoid overlap, omitting names entirely when there are too many features for them to be readable.
  • Feature correlation results are now also documented within FeatureCorrelations.csv.
  • Reformatted the PDF output summary files to (1) add and re-organize all run parameters on the first page, (2) indicate the STREAMLINE version on the bottom of the page, and (3) include the new data processing/counts summary.
  • Univariate analysis output files now include the test run and test score in addition to p-values.
  • Updated the STREAMLINE Jupyter Notebook and other 'Useful Notebooks' to function with this new code framework.
  • Created a new hcc_data_custom.csv dataset for the demo that adds simulated features and instances to hcc_data.csv to explicitly test (and demonstrate the functionality of) the new automatic data cleaning and engineering steps in STREAMLINE Phase 1. Similarly created a replication dataset, hcc_data_custom_rep.csv, which adds some noise to hcc_data_custom.csv along with other custom additions to demonstrate replication functionality. The code to generate these 'custom' datasets from hcc_data.csv is included in the data folder as the notebook Generate_expanded_HCC_Dataset.
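
The mean-versus-median issue noted in the first item of this list is easy to reproduce on made-up numbers. In the sketch below, the per-fold importance scores are hypothetical; they mimic a decision tree that uses each feature in at most one of three CV folds.

```python
from statistics import mean, median

# Hypothetical decision tree feature importances across 3 CV folds.
fold_importances = {
    "feature_a": [0.30, 0.0, 0.0],
    "feature_b": [0.0, 0.0, 0.12],
    "feature_c": [0.0, 0.0, 0.0],
}

by_median = {name: median(scores) for name, scores in fold_importances.items()}
by_mean = {name: mean(scores) for name, scores in fold_importances.items()}

# With medians every score ties at zero, so ranking carries no information.
# With means the features can still be ordered.
ranked_by_mean = sorted(by_mean, key=by_mean.get, reverse=True)
```

With only three folds, a feature must be used in at least two of them for its median to be nonzero, which is why sparse models such as decision trees are the ones affected.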

Beta 0.2.5

24 Jun 22:24

  • Added a minor additional catch to prevent statistical comparison results from failing under specific situations (in StatsJob.py and DataCompareJob.py).
  • Cleaned up commented-out old code.

Beta 0.2.4

15 Jun 20:40
  • Fix: in the special case where the data had no missing values but imputation was set to 'True', the apply-model step raised an error when looking for a nonexistent imputation file. The import of the imputation file is now wrapped in a try/except block to prevent this failure. Also updated the apply-model step so that both .csv and .txt replication data can be loaded.
  • At the recommendation of a collaborator, switched from mean to median scores for feature importance figures. STREAMLINE now also outputs a median algorithm performance summary and adds median performance to the PDF summary. Median values are now also presented in the statistical significance output, since these pair more appropriately with non-parametric statistics than mean and standard deviation.
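
The try/except guard and the extension-based loading described in the first item above could be sketched like this. The helper names are hypothetical, and two details are assumptions not stated in the note: that the imputation file is a pickle, and that .txt replication files are tab-delimited.

```python
import pickle
from pathlib import Path


def load_imputer_if_present(path):
    """Return the unpickled imputation object, or None when no file exists."""
    try:
        with open(path, "rb") as handle:
            return pickle.load(handle)
    except FileNotFoundError:
        # No imputation was needed for this dataset, so nothing was saved.
        return None


def delimiter_for(path):
    """Pick a delimiter from the file extension (tab for .txt is an assumption)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return ","
    if suffix == ".txt":
        return "\t"
    raise ValueError(f"Unsupported replication data file type: {suffix}")
```

Catching only FileNotFoundError, rather than every exception, keeps genuine problems such as a corrupted file visible.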

Beta 0.2.3

20 May 01:08

  • Added fixes for (and confirmed functionality of) code to run STREAMLINE serially via the command line (Linux only; the Windows command line is not supported).
  • This release is considered stable and fully functional based on all tests and user feedback since the alpha release. We will make additional updates as needed for any other reported special case bugs/issues, as well as expand STREAMLINE further in future releases.

Beta 0.2.2

19 May 23:01

This latest Beta update addresses key functionality issues for running STREAMLINE serially from the command line, as well as a number of other minor functionality fixes and improvements.

  • Composite FI no longer fails when only one algorithm is used.
  • Composite FI plots now support weighting with both balanced accuracy and roc_auc.
  • Fixed major issues preventing certain phases of STREAMLINE from running serially from the command line.
  • Removed the 'None' option for max features in feature selection.
  • Fixed a formatting issue on page 1 of the PDF summary.
  • Updated Optuna optimization for LR to avoid invalid hyperparameter combinations.
  • Enforced use of Optuna 2.0.0 for generating hyperparameter optimization figures, and added try/except handling to all algorithms so that STREAMLINE does not fail completely when lingering Optuna version issues affect figure generation.
  • Updated notebooks accordingly.

Beta 0.2.1

17 May 07:17
  • Moved codebase into 'streamline' folder and updated code accordingly
  • Updated default run parameters for Optuna
  • Identified that STREAMLINE does not guarantee complete replicability (due to Optuna) when parallelized.
  • Ensured replicability of cv data following scaling by rounding scaled data to 7 decimal places to avoid float rounding errors beyond the control of random seed fixing.
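
The rounding step in the last item above can be illustrated with a small NumPy sketch. A hand-written standardization stands in for whatever scaler the pipeline uses, and the function name `standardize_and_round` is hypothetical; only the 7-decimal rounding comes from the note.

```python
import numpy as np


def standardize_and_round(values, decimals=7):
    """Standardize each column, then round to suppress float noise."""
    values = np.asarray(values, dtype=float)
    scaled = (values - values.mean(axis=0)) / values.std(axis=0)
    return np.round(scaled, decimals)


raw = np.array([[1.0], [2.0], [4.0]])
scaled = standardize_and_round(raw)

# Simulate tiny platform-dependent float differences in the scaled data:
# after rounding, the noisy and clean versions agree.
unrounded = (raw - raw.mean(axis=0)) / raw.std(axis=0)
noisy = unrounded + 1e-12
same_after_rounding = np.array_equal(np.round(noisy, 7), scaled)
```

Rounding trades a negligible loss of precision for CV partitions that stay identical across runs and machines.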