# Release 0.14.0

## Major Features and Improvements
- Performance improvement due to optimizing inner loops.
- Add support for time semantic domain related statistics.
- Performance improvement due to batching accumulators before merging.
- Add utility method `validate_examples_in_tfrecord`, which identifies anomalous
  examples in TFRecord files containing TFExamples and generates statistics for
  those anomalous examples.
- Add utility method `validate_examples_in_csv`, which identifies anomalous
  examples in CSV files and generates statistics for those anomalous examples.
- Add fast TF example decoder written in C++.
- Make `BasicStatsGenerator` take an Arrow table as input. Example batches are
  converted to Apache Arrow tables internally, which allows vectorized numpy
  functions to be used. Improved performance of `BasicStatsGenerator` by ~40x.
- Make `TopKUniquesStatsGenerator` and `TopKUniquesCombinerStatsGenerator` take
  an Arrow table as input.
- Add `update_schema` API which updates the schema to conform to statistics.
- Add support for validating changes in the number of examples between the
  current and previous spans of data (using the existing `validate_statistics`
  function).
- Support building a manylinux2010 compliant wheel in docker.
- Add support for cross feature statistics.
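The ~40x speedup in `BasicStatsGenerator` comes from processing batches column-wise instead of example-by-example. A minimal, library-free sketch of that idea (the real implementation uses Apache Arrow tables and vectorized numpy; the function names and data below are illustrative only):

```python
# Illustrative sketch: pivot row-oriented examples into a columnar batch,
# then compute per-feature stats over whole columns at once. The real
# BasicStatsGenerator does this with Apache Arrow tables and numpy.

def to_columnar(examples):
    """Pivot a list of {feature: [values]} dicts into {feature: flat value list}."""
    columns = {}
    for example in examples:
        for name, values in example.items():
            columns.setdefault(name, []).extend(values)
    return columns

def basic_stats(columns):
    """Compute min/max/mean per feature over the whole column in one pass."""
    stats = {}
    for name, values in columns.items():
        stats[name] = {
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
    return stats

examples = [
    {"age": [34], "scores": [1.0, 2.0]},
    {"age": [28], "scores": [3.0]},
]
columns = to_columnar(examples)
stats = basic_stats(columns)
```

With columnar data, each statistic touches one contiguous column rather than revisiting every example per feature, which is what makes the vectorized numpy path in the real generator pay off.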
## Bug Fixes and Other Changes
- Expand unit test coverage.
- Update natural language stats generator to generate stats if the actual ratio
  equals `match_ratio`.
- Use `__slots__` in accumulators.
- Fix overflow warning when generating numeric stats for large integers.
- Set max value count in schema when the feature has the same valency, thereby
  inferring shape for multivalent required features.
- Fix divide by zero error in natural language stats generator.
- Add `load_anomalies_text` and `write_anomalies_text` utility functions.
- Define ReasonFeatureNeeded proto.
- Add support for Windows OS.
- Make semantic domain stats generators take an Arrow column as input.
- Fix error in the computation of the number of missing examples and the total
  number of examples.
- Make FeaturesNeeded serializable.
- Fix memory leak in fast example decoder.
- Add `semantic_domain_stats_sample_rate` option to compute semantic domain
  statistics over a sample.
- Increment refcount of None in fast example decoder.
- Add `compression_type` option to `generate_statistics_from_*` methods.
- Add link to SysML paper describing some technical details behind TFDV.
- Add Python types to the source code.
- Make `GenerateStatistics` generate a DatasetFeatureStatisticsList containing a
  dataset with num_examples == 0 instead of an empty proto if there are no
  examples in the input.
- Depends on `absl-py>=0.7,<1`.
- Depends on `apache-beam[gcp]>=2.14,<3`.
- Depends on `numpy>=1.16,<2`.
- Depends on `pandas>=0.24,<1`.
- Depends on `pyarrow>=0.14.0,<0.15.0`.
- Depends on `scikit-learn>=0.18,<0.21`.
- Depends on `tensorflow-metadata>=0.14,<0.15`.
- Depends on `tensorflow-transform>=0.14,<0.15`.
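The `__slots__` change matters because combiner accumulators are created in very large numbers while statistics are merged. A minimal sketch of what `__slots__` buys (a hypothetical accumulator for illustration, not TFDV's actual class):

```python
# Hypothetical accumulator illustrating the __slots__ change: declaring
# __slots__ drops the per-instance __dict__, shrinking memory for the
# many short-lived accumulator objects created during merging.

class CountMeanAccumulator:
    __slots__ = ("count", "total")  # fixed attribute set, no __dict__

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value

    def merge(self, other):
        """Fold another accumulator into this one (the Beam merge step)."""
        self.count += other.count
        self.total += other.total
        return self

acc1 = CountMeanAccumulator()
acc2 = CountMeanAccumulator()
for v in (1.0, 2.0):
    acc1.update(v)
acc2.update(6.0)
merged = acc1.merge(acc2)

# Slotted instances carry no per-instance __dict__ at all.
has_dict = hasattr(merged, "__dict__")
```

The trade-off is that slotted classes cannot grow ad-hoc attributes, which is exactly the discipline an accumulator wants anyway.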
## Breaking Changes
- Change `examples_threshold` to `values_threshold` and update documentation to
  clarify that counts are of values in semantic domain stats generators.
- Refactor IdentifyAnomalousExamples to remove sampling and output
  (anomaly reason, example) tuples.
- Rename `anomaly_proto` parameter in anomalies utilities to `anomalies` to
  make it more consistent with proto and schema utilities.
- `FeatureNameStatistics` produced by `GenerateStatistics` is now identified by
  its `.path` field instead of the `.name` field. For example,
  `feature { name: "my_feature" }` becomes
  `feature { path { step: "my_feature" } }`.
- Change `validate_instance` API to accept an Arrow table instead of a Dict.
- Change `GenerateStatistics` API to accept Arrow tables as input.
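For the `.name` to `.path` change, any caller that looks statistics up by feature name needs a small migration. A library-free sketch of a tolerant lookup (stand-in objects mimic the proto shapes; the real `FeatureNameStatistics` and `Path` messages live in tensorflow_metadata):

```python
# Stand-in objects mimicking the shape of FeatureNameStatistics before
# and after 0.14.0; the real messages are tensorflow_metadata protos.
from types import SimpleNamespace

def feature_key(feature_stats):
    """Return the identifying name, handling both old and new layouts."""
    path = getattr(feature_stats, "path", None)
    if path is not None and path.step:
        # New layout: feature { path { step: "my_feature" } }
        return "/".join(path.step)
    # Old layout: feature { name: "my_feature" }
    return feature_stats.name

old_style = SimpleNamespace(name="my_feature", path=None)
new_style = SimpleNamespace(name="", path=SimpleNamespace(step=["my_feature"]))
```

Joining the `step` components also keeps the key stable for nested features, where a path can have more than one step.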