Releases: gretelai/gretel-synthetics
Validation loss splitting
Aw/core 107 validate (#93): add a `validation_split` parameter so a portion of the training data can be held out for computing validation loss.
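A minimal sketch of how this might be used, assuming `validation_split` is set on the config object; the notes do not spell out whether the parameter takes a flag or a fraction, so the value below is an assumption, and the paths are illustrative.

```python
from gretel_synthetics.config import TensorFlowConfig

# Minimal sketch: hold out part of the training data to compute
# validation loss. Whether validation_split takes a boolean flag or
# a fraction is an assumption here; paths are illustrative.
config = TensorFlowConfig(
    input_data_path="training_data.csv",
    checkpoint_dir="./checkpoints",
    validation_split=True,  # assumed: enables a validation holdout
)
```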
Auto-select Tokenizer
Automatically select character-based tokenization over SentencePiece if `vocab_size` is set to zero.
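As a minimal sketch of the behavior described above (the config class is from the library; the paths are illustrative):

```python
from gretel_synthetics.config import LocalConfig

# Setting vocab_size to zero auto-selects character-based
# tokenization instead of training a SentencePiece vocabulary.
config = LocalConfig(
    input_data_path="training_data.csv",  # illustrative path
    checkpoint_dir="./checkpoints",
    vocab_size=0,  # 0 -> char-by-char tokenizer is selected
)
```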
Misc updates
- Added a new Record Generator object to DataFrame mode that generates entire records with custom validation
- Added a custom `RuntimeError` that is raised when not enough training data is ingested
- Added the ability for custom callbacks to capture epoch training details
Batch DF Updates
Data-generation routines now return summary objects that expose more detail about the generated data.
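A minimal sketch of inspecting these summaries, assuming batch mode returns a mapping of batch index to summary object; the exact attributes on each summary are not listed in these notes, so nothing beyond printing is shown.

```python
from gretel_synthetics.batch import DataFrameBatch

# Sketch: load previously trained batch models and inspect the
# summary objects returned by the generation routine. The shape of
# the return value (batch index -> summary) is an assumption.
batcher = DataFrameBatch(mode="read", checkpoint_dir="./checkpoints")
summaries = batcher.generate_all_batch_lines()
for batch_idx, summary in summaries.items():
    print(batch_idx, summary)  # per-batch generation details
```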
Smart seeding bugfix
Bugfix to ensure model weights are reset when a list of seed values is provided to the generator
Seeding and DP updates
⚙️ Smart seeding now supports a list of seeds, yielding a 1:1 mapping of seeds to generated lines. This is useful for synthesizing partial data tables (see the sketch after this list).
⚙️ When using DataFrame Batch mode, we now will write out the original Training DF header order to the model directory. When a model is loaded from disk, the resulting generated DataFrame will have the columns ordered the way they were in the training data.
🐛 When using DP mode, we (temporarily) patch TensorFlow 2.4.x to use the new Keras LSTM code paths. The patch applies globally to Keras within the running Python interpreter and provides a drastic speedup when training a DP model.
📖 Doc updates for new seeding features.
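A minimal sketch tying the seed-list and column-ordering changes together, assuming batch mode accepts a list of per-record seed dicts via `seed_fields`; the field names and values below are illustrative.

```python
from gretel_synthetics.batch import DataFrameBatch

# Sketch: generate one record per seed (the 1:1 mapping described
# above). The seed_fields list-of-dicts shape and the field names
# are illustrative assumptions.
batcher = DataFrameBatch(mode="read", checkpoint_dir="./checkpoints")
seeds = [
    {"age": 34, "state": "CA"},
    {"age": 52, "state": "NY"},
]
batcher.generate_all_batch_lines(seed_fields=seeds)
# Columns come back in the original training-DF header order
df = batcher.batches_to_df()
```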
Modular refactor, tokenizers, and differential privacy, oh my!
Major changes:
- Totally refactored modules and package structure. This will enable future contributions to use other underlying ML libraries as the core engine. Configurations are now specific to the underlying engine: `LocalConfig` can be replaced with `TensorFlowConfig`, although the former is still supported for backwards compatibility.
- With TensorFlow 2.4.x, TensorFlow Privacy can be used to provide differentially private training with modified Keras DP optimizers.
- Added a new tokenizer module that can be used independently of the underlying model training. By default, we continue to use SentencePiece as the tokenizer; we have also added a char-by-char tokenizer that can be useful when using differential privacy.
- Misc bug fixes and optimizations.
- Changes in this release are backwards compatible with previous versions.
Please see our updated README and examples directory.
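As a rough sketch of how the refactored pieces fit together, assuming `train` accepts an optional tokenizer trainer and that the DP knobs shown are exposed on `TensorFlowConfig`; the values are illustrative, not tuned for real privacy guarantees.

```python
from gretel_synthetics.config import TensorFlowConfig
from gretel_synthetics.tokenizers import CharTokenizerTrainer
from gretel_synthetics.train import train

# Sketch of the refactored API. DP parameter names and values are
# assumptions for illustration only.
config = TensorFlowConfig(
    input_data_path="training_data.csv",  # illustrative path
    checkpoint_dir="./checkpoints",
    dp=True,                   # differentially private training
    dp_noise_multiplier=0.01,  # assumed DP knob
    dp_l2_norm_clip=1.5,       # assumed DP knob
)
# Char-by-char tokenization pairs well with DP, per the notes above
train(config, CharTokenizerTrainer(config=config))
```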
RC0 0.15.0
v0.15.0.rc0: Update README.md
Smart seeding
Enable "Smart Seeding" which allows a prefix to be provided during line generation. The generator will complete the line based on the provided seed. When training on structured data (DataFrames) this enables the first N column values to be pre-provided and then remaining columns will be generated based on the initial values.
v0.14.0: Jm/syn 21 (#58)
- Introduce Keras early stopping and save-best-model features. The default number of epochs is now 100, which should allow most training runs to stop automatically without over-fitting.
- Provide better tracking of which epoch's model was used as the best one in the model history table.
- Temporarily disable DP mode.
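A minimal sketch of how these defaults play out, assuming early stopping engages automatically during training; paths are illustrative.

```python
from gretel_synthetics.config import LocalConfig
from gretel_synthetics.train import train_rnn

# Sketch: with the new defaults, training runs for at most 100
# epochs, and early stopping usually halts sooner, keeping the
# best-performing epoch's weights.
config = LocalConfig(
    input_data_path="training_data.csv",  # illustrative path
    checkpoint_dir="./checkpoints",
    epochs=100,  # new default in this release
)
train_rnn(config)
```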