
Release 0.14.0


Major Features and Improvements

  • Performance improvement due to optimizing inner loops.
  • Add support for time semantic domain related statistics.
  • Performance improvement due to batching accumulators before merging.
  • Add utility method validate_examples_in_tfrecord, which identifies anomalous
    examples in TFRecord files containing TFExamples and generates statistics for
    those anomalous examples (see the usage sketch after this list).
  • Add utility method validate_examples_in_csv, which identifies anomalous
    examples in CSV files and generates statistics for those anomalous examples.
  • Add fast TF example decoder written in C++.
  • Make BasicStatsGenerator take an Arrow table as input. Example batches are
    converted to Apache Arrow tables internally, which allows the use of
    vectorized NumPy functions and improves the performance of
    BasicStatsGenerator by ~40x.
  • Make TopKUniquesStatsGenerator and TopKUniquesCombinerStatsGenerator take
    an Arrow table as input.
  • Add update_schema API, which updates a schema to conform to statistics
    (demonstrated in the sketch after this list).
  • Add support for validating changes in the number of examples between the
    current and previous spans of data, using the existing validate_statistics
    function (also shown in the sketch after this list).
  • Support building a manylinux2010 compliant wheel in docker.
  • Add support for cross feature statistics.
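
A minimal usage sketch of the new validation and schema utilities mentioned
above. This is an assumption-laden illustration, not official documentation:
the file paths are placeholders, and the keyword names shown here should be
checked against the API docs.

    # Hedged sketch: paths and keyword names below are illustrative assumptions.
    import tensorflow_data_validation as tfdv

    # Compute statistics over a baseline span and infer a schema from them.
    train_stats = tfdv.generate_statistics_from_tfrecord(
        data_location='train.tfrecord')
    schema = tfdv.infer_schema(train_stats)

    # update_schema: update the schema so it conforms to new statistics.
    new_stats = tfdv.generate_statistics_from_tfrecord(
        data_location='new_span.tfrecord')
    schema = tfdv.update_schema(schema, new_stats)

    # validate_examples_in_tfrecord: identify anomalous examples and compute
    # statistics over just those examples.
    options = tfdv.StatsOptions(schema=schema)
    anomalous_example_stats = tfdv.validate_examples_in_tfrecord(
        data_location='new_span.tfrecord', stats_options=options)

    # Span-level check: pass the previous span's statistics so that changes in
    # the number of examples can be validated by validate_statistics.
    anomalies = tfdv.validate_statistics(
        statistics=new_stats, schema=schema, previous_statistics=train_stats)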

Bug Fixes and Other Changes

  • Expand unit test coverage.
  • Update natural language stats generator to generate stats if the actual
    ratio equals match_ratio.
  • Use __slots__ in accumulators.
  • Fix overflow warning when generating numeric stats for large integers.
  • Set the max value count in the schema when the feature always has the same
    valency, thereby inferring the shape of multivalent required features.
  • Fix divide by zero error in natural language stats generator.
  • Add load_anomalies_text and write_anomalies_text utility functions.
  • Define ReasonFeatureNeeded proto.
  • Add support for Windows OS.
  • Make semantic domain stats generators take an Arrow column as input.
  • Fix error in number of missing examples and total number of examples
    computation.
  • Make FeaturesNeeded serializable.
  • Fix memory leak in fast example decoder.
  • Add semantic_domain_stats_sample_rate option to compute semantic domain
    statistics over a sample.
  • Increment refcount of None in fast example decoder.
  • Add compression_type option to generate_statistics_from_* methods (see the
    sketch after this list).
  • Add link to SysML paper describing some technical details behind TFDV.
  • Add Python types to the source code.
  • Make GenerateStatistics generate a DatasetFeatureStatisticsList containing a
    dataset with num_examples == 0, instead of an empty proto, if there are no
    examples in the input.
  • Depends on absl-py>=0.7,<1.
  • Depends on apache-beam[gcp]>=2.14,<3.
  • Depends on numpy>=1.16,<2.
  • Depends on pandas>=0.24,<1.
  • Depends on pyarrow>=0.14.0,<0.15.0.
  • Depends on scikit-learn>=0.18,<0.21.
  • Depends on tensorflow-metadata>=0.14,<0.15.
  • Depends on tensorflow-transform>=0.14,<0.15.
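
A hedged sketch of the new compression_type option on the
generate_statistics_from_* methods. The path is a placeholder, and the
accepted value is assumed to be one of Beam's CompressionTypes.

    # Hedged sketch: assumes compression_type accepts Beam's CompressionTypes.
    import tensorflow_data_validation as tfdv
    from apache_beam.io.filesystem import CompressionTypes

    stats = tfdv.generate_statistics_from_csv(
        data_location='data.csv.gz',
        compression_type=CompressionTypes.GZIP)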

Breaking Changes

  • Change examples_threshold to values_threshold in semantic domain stats
    generators, and update the documentation to clarify that the counts are of
    values.

  • Refactor IdentifyAnomalousExamples to remove sampling and to output
    (anomaly reason, example) tuples.

  • Rename anomaly_proto parameter in anomalies utilities to anomalies to
    make it more consistent with proto and schema utilities.

  • Each FeatureNameStatistics produced by GenerateStatistics is now identified
    by its .path field instead of its .name field. For example:

    feature {
      name: "my_feature"
    }
    

    becomes:

    feature {
      path {
        step: "my_feature"
      }
    }
    
  • Change validate_instance API to accept an Arrow table instead of a Dict
    (see the sketch below).

  • Change GenerateStatistics API to accept Arrow tables as input.
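
Since validate_instance now takes an Arrow table rather than a Dict, callers
need to build a one-row table before validating. A minimal sketch follows,
assuming validate_instance is exposed at the top-level tfdv namespace and
using a hand-built schema; the feature name and value are illustrative.

    # Hedged sketch of the new Arrow-based calling convention for
    # validate_instance; the schema and feature below are illustrative.
    import pyarrow as pa
    import tensorflow_data_validation as tfdv
    from tensorflow_metadata.proto.v0 import schema_pb2

    # A minimal schema with one required single-valued string feature.
    schema = schema_pb2.Schema()
    feature = schema.feature.add()
    feature.name = 'my_feature'
    feature.type = schema_pb2.BYTES
    feature.presence.min_fraction = 1.0
    feature.value_count.min = 1
    feature.value_count.max = 1

    options = tfdv.StatsOptions(schema=schema)

    # One example as a single-row Arrow table: a list-valued column replaces
    # the old Dict of feature name -> value array.
    instance = pa.Table.from_arrays(
        [pa.array([['some_value']])], ['my_feature'])

    anomalies = tfdv.validate_instance(instance, options)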

Deprecations