# Release 0.14.0

## Major Features and Improvements
- Performance improvement due to optimizing inner loops.
- Add support for time semantic domain related statistics.
- Performance improvement due to batching accumulators before merging.
- Add utility method `validate_examples_in_tfrecord`, which identifies anomalous
  examples in TFRecord files containing TFExamples and generates statistics for
  those anomalous examples.
- Add utility method `validate_examples_in_csv`, which identifies anomalous
  examples in CSV files and generates statistics for those anomalous examples.
- Add fast TF example decoder written in C++.
- Make `BasicStatsGenerator` take an Arrow table as input. Example batches are
  converted to Apache Arrow tables internally, which allows vectorized numpy
  functions to be used. Improved performance of `BasicStatsGenerator` by ~40x.
- Make `TopKUniquesStatsGenerator` and `TopKUniquesCombinerStatsGenerator` take
  an Arrow table as input.
- Add `update_schema` API which updates the schema to conform to statistics.
- Add support for validating changes in the number of examples between the
  current and previous spans of data (using the existing `validate_statistics`
  function).
- Support building a manylinux2010 compliant wheel in docker.
- Add support for cross feature statistics.
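The ~40x speedup in `BasicStatsGenerator` comes from processing batches column-wise instead of example-by-example. A minimal, library-free sketch of that idea (the real implementation uses Apache Arrow tables and vectorized numpy; the function names and data below are illustrative only):

```python
# Illustrative sketch: pivot row-oriented examples into a columnar batch,
# then compute per-feature stats over whole columns at once. The real
# BasicStatsGenerator does this with Apache Arrow tables and numpy.

def to_columnar(examples):
    """Pivot a list of {feature: [values]} dicts into {feature: flat value list}."""
    columns = {}
    for example in examples:
        for name, values in example.items():
            columns.setdefault(name, []).extend(values)
    return columns

def basic_stats(columns):
    """Compute min/max/mean per feature over the whole column in one pass."""
    stats = {}
    for name, values in columns.items():
        stats[name] = {
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
    return stats

examples = [
    {"age": [34], "scores": [1.0, 2.0]},
    {"age": [28], "scores": [3.0]},
]
columns = to_columnar(examples)
stats = basic_stats(columns)
```

With columnar data, each statistic touches one contiguous column rather than revisiting every example per feature, which is what makes the vectorized numpy path in the real generator pay off.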
## Bug Fixes and Other Changes
- Expand unit test coverage.
- Update natural language stats generator to generate stats if the actual ratio
  equals `match_ratio`.
- Use `__slots__` in accumulators.
- Fix overflow warning when generating numeric stats for large integers.
- Set max value count in schema when the feature has the same valency, thereby
  inferring shape for multivalent required features.
- Fix divide by zero error in natural language stats generator.
- Add `load_anomalies_text` and `write_anomalies_text` utility functions.
- Define ReasonFeatureNeeded proto.
- Add support for Windows OS.
- Make semantic domain stats generators take an Arrow column as input.
- Fix error in the computation of the number of missing examples and the total
  number of examples.
- Make FeaturesNeeded serializable.
- Fix memory leak in fast example decoder.
- Add `semantic_domain_stats_sample_rate` option to compute semantic domain
  statistics over a sample.
- Increment refcount of None in fast example decoder.
- Add `compression_type` option to `generate_statistics_from_*` methods.
- Add link to SysML paper describing some technical details behind TFDV.
- Add Python types to the source code.
- Make `GenerateStatistics` generate a DatasetFeatureStatisticsList containing a
  dataset with num_examples == 0 instead of an empty proto if there are no
  examples in the input.
- Depends on `absl-py>=0.7,<1`.
- Depends on `apache-beam[gcp]>=2.14,<3`.
- Depends on `numpy>=1.16,<2`.
- Depends on `pandas>=0.24,<1`.
- Depends on `pyarrow>=0.14.0,<0.15.0`.
- Depends on `scikit-learn>=0.18,<0.21`.
- Depends on `tensorflow-metadata>=0.14,<0.15`.
- Depends on `tensorflow-transform>=0.14,<0.15`.
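The `__slots__` change matters because combiner accumulators are created in very large numbers while statistics are merged. A minimal sketch of what `__slots__` buys (a hypothetical accumulator for illustration, not TFDV's actual class):

```python
# Hypothetical accumulator illustrating the __slots__ change: declaring
# __slots__ drops the per-instance __dict__, shrinking memory for the
# many short-lived accumulator objects created during merging.

class CountMeanAccumulator:
    __slots__ = ("count", "total")  # fixed attribute set, no __dict__

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value

    def merge(self, other):
        """Fold another accumulator into this one (the Beam merge step)."""
        self.count += other.count
        self.total += other.total
        return self

acc1 = CountMeanAccumulator()
acc2 = CountMeanAccumulator()
for v in (1.0, 2.0):
    acc1.update(v)
acc2.update(6.0)
merged = acc1.merge(acc2)

# Slotted instances carry no per-instance __dict__ at all.
has_dict = hasattr(merged, "__dict__")
```

The trade-off is that slotted classes cannot grow ad-hoc attributes, which is exactly the discipline an accumulator wants anyway.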
## Breaking Changes
- Change `examples_threshold` to `values_threshold` and update documentation to
  clarify that counts are of values in semantic domain stats generators.
- Refactor IdentifyAnomalousExamples to remove sampling and output
  (anomaly reason, example) tuples.
- Rename `anomaly_proto` parameter in anomalies utilities to `anomalies` to
  make it more consistent with proto and schema utilities.
- `FeatureNameStatistics` produced by `GenerateStatistics` is now identified by
  its `.path` field instead of the `.name` field. For example,
  `feature { name: "my_feature" }` becomes
  `feature { path { step: "my_feature" } }`.
- Change `validate_instance` API to accept an Arrow table instead of a Dict.
- Change `GenerateStatistics` API to accept Arrow tables as input.
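For the `.name` to `.path` change, any caller that looks statistics up by feature name needs a small migration. A library-free sketch of a tolerant lookup (stand-in objects mimic the proto shapes; the real `FeatureNameStatistics` and `Path` messages live in tensorflow_metadata):

```python
# Stand-in objects mimicking the shape of FeatureNameStatistics before
# and after 0.14.0; the real messages are tensorflow_metadata protos.
from types import SimpleNamespace

def feature_key(feature_stats):
    """Return the identifying name, handling both old and new layouts."""
    path = getattr(feature_stats, "path", None)
    if path is not None and path.step:
        # New layout: feature { path { step: "my_feature" } }
        return "/".join(path.step)
    # Old layout: feature { name: "my_feature" }
    return feature_stats.name

old_style = SimpleNamespace(name="my_feature", path=None)
new_style = SimpleNamespace(name="", path=SimpleNamespace(step=["my_feature"]))
```

Joining the `step` components also keeps the key stable for nested features, where a path can have more than one step.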