Releases: pasteur-dev/pasteur
Improved Template Project
This release improves the template project: it adopts the new format preferred by pip-compile and switches from a setup.py file to pyproject.toml. The new requirements.in file no longer suggests installing Pasteur's docs dependencies, due to sphinx 7 compatibility issues.
The transformers in extras are also updated to remove the categorical dtype check deprecation warning.
Pipeline Tweaks, New Commands, and packaging fixes
This pasteur release tweaks pipeline generation to better segment ingestion and synthesis.
It introduces the new commands ingest_dataset (or id) and ingest_view (or iv), which only perform the dataset and view ingest steps. This makes it easier to iterate on creating new datasets and new views by only re-running their ingest code.
By default, pipe no longer performs the view ingestion steps, which may be cumbersome for out-of-core datasets, and instead begins from filtering onward (pipe --all will still run the whole pipeline).
A new view option, fit_global, is introduced. It fits the transformers and encoders on the whole view (at the cost of increased overhead), which fixes issues with rare categorical values not being recognized because they are missing from the work set.
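The rare-value problem that fit_global addresses can be illustrated with a toy sketch. This is not Pasteur's code: fit_categories, encode, and the slice standing in for the work set are hypothetical names chosen for illustration only.

```python
# Toy illustration only; these names are hypothetical and not Pasteur's API.

def fit_categories(rows):
    """Collect the sorted set of category values seen while fitting."""
    return sorted(set(rows))

def encode(value, categories):
    """Map a value to its integer code, or -1 if it was never seen during fitting."""
    try:
        return categories.index(value)
    except ValueError:
        return -1

# A whole view in which "rare" appears only once, at the very end.
view = ["common_a", "common_b"] * 50 + ["rare"]

# Assumed stand-in for the work set: a slice that happens to miss "rare".
work_set = view[:20]

subset_fit = fit_categories(work_set)  # fitted on the work set only
global_fit = fit_categories(view)      # fitted on the whole view

missing = encode("rare", subset_fit)   # -1: unknown to the subset fit
found = encode("rare", global_fit)     # valid code: known to the whole-view fit
```

Fitting on the whole view costs a full pass over the data, which matches the overhead trade-off described above.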
Two bugs were also fixed: TabularDataset required pandas without importing it, and the mlflow default style was not included in the PyPI package.
Out-of-core overhaul and new event data support
This new release overhauls and standardizes Pasteur's API to prepare it for multi-modal data synthesis. In addition, it smooths some rough edges by making the fitting of Encodings, Transformations, and Metrics out-of-core through a map-reduce architecture.
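As a rough sketch of the general map-reduce idea (not Pasteur's actual API), fitting can be split into a per-chunk summary step and a merge step, so that no step needs the full dataset in memory. The functions map_chunk, reduce_stats, and iter_chunks below are hypothetical, and the in-memory list stands in for partitions read from disk.

```python
from functools import reduce

def map_chunk(chunk):
    """Map step: summarize one chunk without looking at any other chunk."""
    return {"min": min(chunk), "max": max(chunk), "count": len(chunk)}

def reduce_stats(a, b):
    """Reduce step: merge two partial summaries into one."""
    return {
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
        "count": a["count"] + b["count"],
    }

def iter_chunks(values, size):
    """Yield fixed-size chunks, standing in for partitions read from disk."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

values = [7, 3, 9, 1, 12, 5, 8, 2, 10]

# Each chunk is summarized independently, then the summaries are merged.
partials = map(map_chunk, iter_chunks(values, 4))
fitted = reduce(reduce_stats, partials)
```

Because the merge step is associative, the partial summaries could be combined in any grouping, which is what makes this pattern suitable for data that does not fit in memory.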
For transforming event data, a new type of Transformer, the Seq(uence) Transformer, is added. This transformer is multi-table aware and can, for example, encode inter-row references (such as the date in row #3 for patient X depending on the date in row #2). A built-in implementation, named SeqTransformerWrapper (accessed through the name seq), contains the joining logic needed to wrap existing reference transformers so that they support this format.
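To make the idea of an inter-row reference concrete, here is a minimal sketch that encodes each event date as the gap in days from the same patient's previous row. It is an illustration under assumed data, not the SeqTransformerWrapper implementation; encode_relative, decode_relative, and the sample rows are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical event rows: (patient_id, event_date), already ordered per patient.
events = [
    ("X", date(2020, 1, 1)),
    ("X", date(2020, 1, 4)),
    ("X", date(2020, 1, 10)),
    ("Y", date(2021, 6, 1)),
    ("Y", date(2021, 6, 2)),
]

def encode_relative(rows):
    """Replace each date with the day gap from the same patient's previous row."""
    last_seen = {}
    encoded = []
    for patient, day in rows:
        previous = last_seen.get(patient)
        # The first row of a patient has no predecessor, so its gap is None.
        gap = None if previous is None else (day - previous).days
        encoded.append((patient, gap))
        last_seen[patient] = day
    return encoded

def decode_relative(encoded, first_dates):
    """Rebuild absolute dates from gaps, given each patient's first date."""
    last_seen = {}
    rows = []
    for patient, gap in encoded:
        if gap is None:
            day = first_dates[patient]
        else:
            day = last_seen[patient] + timedelta(days=gap)
        rows.append((patient, day))
        last_seen[patient] = day
    return rows

encoded = encode_relative(events)
restored = decode_relative(encoded, {"X": date(2020, 1, 1), "Y": date(2021, 6, 1)})
```

In this sketch, row #3 for patient X is stored as a gap relative to row #2, so it can only be decoded once row #2 is known.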
The new mimic_core view in extras, which contains the three core tables of MIMIC (patients, admissions, and transfers), is provided as a proof of concept for this new transformation format.
0.1.1
Initial Release
Initial release for pasteur. Pasteur can now be installed with pip and offers a working template for data synthesis.