- Modify validation logic to raise `SCHEMA_MISSING_COLUMN` anomaly when observing a feature with no stats (was still broken, now fixed).
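
The check in the note above can be pictured with a minimal, library-independent sketch. It is not TFDV's implementation: `find_missing_columns`, the list-of-names schema, and the plain-dict statistics are hypothetical stand-ins, and only the anomaly name `SCHEMA_MISSING_COLUMN` comes from the note itself.

```python
def find_missing_columns(schema_features, stats_by_feature):
    """Flag every schema feature for which no statistics were observed.

    Hypothetical stand-ins: `schema_features` is a list of feature names and
    `stats_by_feature` maps a feature name to a dict of statistics.
    """
    anomalies = {}
    for name in schema_features:
        stats = stats_by_feature.get(name)
        # Treat both an absent entry and an empty entry as "no stats".
        if not stats:
            anomalies[name] = "SCHEMA_MISSING_COLUMN"
    return anomalies


schema = ["age", "country", "income"]
stats = {"age": {"num_non_missing": 120}, "country": {}}
result = find_missing_columns(schema, stats)
```

Here `country` has an empty statistics entry and `income` has none at all, so both are flagged while `age` is not.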
- Use joblib to exploit multiprocessing when computing statistics over a pandas dataframe.
- Add support for semantic domain related statistics (natural language, image), enabled by `StatsOptions.enable_semantic_domain_stats`.
- Python 3.5 is supported.
- Expand unit test coverage.
- Modify validation logic to raise `SCHEMA_MISSING_COLUMN` anomaly when observing a feature with no stats.
- Add utility functions `write_stats_text` and `load_stats_text` to write and load DatasetFeatureStatisticsList protos.
- Avoid using multiprocessing by default when generating statistics over a dataframe.
- Depends on `joblib>=0.12,<1`.
- Depends on `tensorflow-transform>=0.13,<0.14`.
- Depends on `tensorflow-metadata>=0.12.1,<0.14`.
- Requires pre-installed `tensorflow>=1.13.1,<2`.
- Depends on `apache-beam[gcp]>=2.11,<3`.
- Depends on `absl>=0.1.6,<1`.
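
Each pin above combines a lower bound with an exclusive upper bound. As a rough illustration of how such a range reads, the sketch below checks a plain dotted version against a specifier string. The helper `satisfies` is hypothetical and deliberately simplified: it handles only numeric versions and the operators `>=`, `<=`, `==`, `>`, `<`, and it is not how pip resolves requirements.

```python
import re

# One comparison clause, e.g. ">=0.12" or "<1".
_CLAUSE = re.compile(r"(>=|<=|==|>|<)\s*(\d+(?:\.\d+)*)")


def _as_tuple(version, width):
    """Turn '0.13' into (0, 13, 0, ...) padded with zeros to `width` parts."""
    parts = [int(part) for part in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies(version, spec):
    """Return True if a numeric dotted `version` meets every clause in `spec`.

    Simplification: pre-release tags such as 'rc1' are not supported and
    raise ValueError.
    """
    for clause in spec.split(","):
        match = _CLAUSE.fullmatch(clause.strip())
        if match is None:
            raise ValueError("unsupported clause: %r" % clause)
        op, bound = match.groups()
        width = max(version.count("."), bound.count(".")) + 1
        left, right = _as_tuple(version, width), _as_tuple(bound, width)
        holds = {
            ">=": left >= right,
            "<=": left <= right,
            "==": left == right,
            ">": left > right,
            "<": left < right,
        }[op]
        if not holds:
            return False
    return True
```

For example, `satisfies("0.13.2", ">=0.12,<1")` holds, while `satisfies("0.14.0", ">=0.13,<0.14")` does not, because the upper bound is exclusive.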
- Add support for computing statistics over slices of data.
- Performance improvement due to optimizing inner loops.
- Add support for generating statistics from a pandas dataframe.
- Performance improvement due to pre-allocating tf.Example in TFExampleDecoder.
- Performance improvement due to merging the common stats generator, numeric stats generator and string stats generator into a single basic stats generator.
- Performance improvement due to merging top-k and uniques generators.
- Add a `validate_instance` function, which checks a single example for anomalies.
- Add a utility method `get_statistics_html`, which returns HTML that can be used for Facets visualization outside of a notebook.
- Add support for schema inference of semantic domains.
- Performance improvement on statistics computation over a pandas dataframe.
- Use constant 'BYTES_VALUE' in the statistics proto to represent a bytes value which cannot be decoded as a utf-8 string.
- Introduce CombinerFeatureStatsGenerator, a specialized interface for combiners that do not require cross-feature computations.
- Expand unit test coverage.
- Add optional frequency threshold that allows keeping only the most frequent values that are present in a minimum number of examples.
- Add optional desired batch size that allows specification of the number of examples to include in each batch.
- Depends on `numpy>=1.14.5,<2`.
- Depends on `protobuf>=3.6.1,<4`.
- Depends on `apache-beam[gcp]>=2.10,<3`.
- Depends on `tensorflow-metadata>=0.12.1,<0.13`.
- Depends on `scikit-learn>=0.18,<1`.
- Depends on `IPython>=5.0`.
- Requires pre-installed `tensorflow>=1.12,<2`.
- Revise example notebook and update it to be able to run in Colab and Jupyter.
- Represent batch as a list of ndarrays instead of ndarrays of ndarrays.
- Modify decoders to return ndarrays of type numpy.float32 for FLOAT features.
- Add option to infer feature types from schema when generating statistics over CSV data.
- Add utility method `set_domain` to set the domain of a feature in the schema.
- Add option to compute weighted statistics by providing a weight feature.
- Add a PTransform for decoding TF examples.
- Add utility methods `write_schema_text` and `load_schema_text` to write and load the schema protocol buffer.
- Add option to compute statistics over a sample.
- Optimize performance of statistics computation (~2x improvement on benchmark datasets).
- Depends on `apache-beam[gcp]>=2.8,<3`.
- Depends on `tensorflow-transform>=0.11,<0.12`.
- Depends on `tensorflow-metadata>=0.9,<0.10`.
- Fix bug in clearing oneof domain_info field in Feature proto.
- Fix overflow error for large integers by casting them to STRING type.
- Add API docs.
- Requires pre-installed `tensorflow>=1.11,<2`.
- Make the tf.Example decoder represent a feature with no value list as a missing value (None).
- Make StatsOptions a class.
- Initial release of TensorFlow Data Validation.
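
The decoder change noted above (a feature with no value list becomes `None`) separates a missing feature from one that is present but empty. The sketch below illustrates that distinction with plain dictionaries standing in for tf.Example. `decode_example` and `count_missing` are hypothetical helpers written for this illustration, not TFDV functions.

```python
def decode_example(example, feature_names):
    """Decode one example into {feature_name: list of values, or None}.

    Hypothetical stand-in: `example` is a dict mapping a feature name to its
    value list. A feature with no value list is absent or mapped to None.
    """
    decoded = {}
    for name in feature_names:
        values = example.get(name)
        # No value list at all -> missing (None); an empty list stays empty.
        decoded[name] = None if values is None else list(values)
    return decoded


def count_missing(batch, name):
    """Count decoded examples in which feature `name` is missing (None)."""
    return sum(1 for example in batch if example[name] is None)


features = ["age", "tags", "income"]
batch = [
    decode_example({"age": [31], "tags": []}, features),
    decode_example({"age": [47], "tags": ["a", "b"], "income": [52000.0]}, features),
]
```

In the first decoded example, `tags` is an empty list while `income` is `None`, so only `income` counts as missing.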