
Data Transform Process



Normalize Table Schema: Wide Table

Transform the table schema to be wide (i.e., one table column per feature) if the original schema is not. We implement this step as a batch processing job, such as a MaxCompute job.
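For illustration, a pivot of this shape (hypothetical table and column names) could be submitted as the MaxCompute job that turns a key-value table into a wide one:

```python
# Hypothetical source table: kv_table(user_id, feature_name, feature_value).
# Each distinct feature_name becomes one column of the wide table.
PIVOT_SQL = """
CREATE TABLE wide_table AS
SELECT
  user_id,
  MAX(CASE WHEN feature_name = 'age'    THEN feature_value END) AS age,
  MAX(CASE WHEN feature_name = 'income' THEN feature_value END) AS income
FROM kv_table
GROUP BY user_id;
"""
```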

Do Statistics Using SQL

Calculate the statistics that the following transform code generation (code_gen) step needs, such as min/max, mean, standard deviation, and the distinct values of categorical columns.
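For example, the code generator might issue a statistics query of this shape (hypothetical column names; the exact aggregates depend on the transforms requested):

```python
# Hypothetical wide table: wide_table(age, income, occupation).
# These values parameterize the generated transform code below.
STATS_SQL = """
SELECT
  MIN(age)                   AS age_min,
  MAX(age)                   AS age_max,
  AVG(age)                   AS age_mean,
  STDDEV(age)                AS age_stddev,
  COUNT(DISTINCT occupation) AS occupation_vocab_size
FROM wide_table;
"""
```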

Generate the Code for the Data Transform Stage from SQLFlow

We can use Keras layers + feature columns to do the data transformation. Please see the Google Cloud sample.
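A minimal sketch of this approach, assuming a numeric column 'age' whose mean and standard deviation come from the statistics step:

```python
import tensorflow as tf

# Hypothetical column 'age'; mean and stddev come from the statistics step.
age = tf.feature_column.numeric_column(
    "age", normalizer_fn=lambda x: (x - 38.6) / 13.6)

model = tf.keras.Sequential([
    tf.keras.layers.DenseFeatures([age]),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit consumes a dict of feature tensors, e.g. {"age": ...}, plus labels.
```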

Feature Transform Library Based on TensorFlow OP

Build the common transform function set using TensorFlow ops. Each function can be fed into tf.keras.layers.Lambda or the normalizer_fn of numeric_column.
Because the transform function set is built upon TensorFlow ops, we can ensure consistency between training and inference.
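For instance, one function in such a library might look like this sketch (the name standardize and the statistics values are illustrative):

```python
import tensorflow as tf

def standardize(mean, stddev):
    """Build a TF-op based transform: x -> (x - mean) / stddev."""
    def _transform(x):
        return (tf.cast(x, tf.float32) - mean) / stddev
    return _transform

# Used as a Keras Lambda layer:
standardize_layer = tf.keras.layers.Lambda(standardize(38.6, 13.6))

# Or as the normalizer_fn of a numeric_column:
age = tf.feature_column.numeric_column(
    "age", normalizer_fn=standardize(38.6, 13.6))
```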

Key point: express the transform functions using COLUMN expressions. How do we design the SQLFlow syntax to express these functions elegantly?
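As a straw man rather than a settled design, a transform function might appear in the COLUMN clause like this; STANDARDIZE is hypothetical, while EMBEDDING and CATEGORY_ID follow SQLFlow's existing COLUMN expressions:

```python
# Straw-man SQLFlow statement; table, column, and model names are illustrative.
TRAIN_STMT = """
SELECT * FROM wide_table
TO TRAIN DNNClassifier
COLUMN STANDARDIZE(age), EMBEDDING(CATEGORY_ID(occupation, 100), 8)
LABEL label
INTO my_dnn_model;
"""
```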

Transform Code Structure

We want to settle on the pattern of the model definition. In this way, we can generate the code according to this pattern.

Transform Layers => Feature Columns + DenseFeatures => Neural Network Structure
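A minimal sketch of this three-stage pattern, with hypothetical columns 'age' and 'hour' and illustrative statistics; the building blocks are broken down in the list below:

```python
import tensorflow as tf

inputs = {
    "age": tf.keras.Input(shape=(1,), name="age"),
    "hour": tf.keras.Input(shape=(1,), name="hour"),
}

# Stage 1: transform layers built from TensorFlow ops.
standardized_age = tf.keras.layers.Lambda(
    lambda x: (x - 38.6) / 13.6)(inputs["age"])
# A multiple-column transform combines several inputs in one Lambda.
age_x_hour = tf.keras.layers.Lambda(
    lambda t: t[0] * t[1])([standardized_age, inputs["hour"]])

# Stage 2: feature columns + DenseFeatures.
hour_bucket = tf.feature_column.indicator_column(
    tf.feature_column.bucketized_column(
        tf.feature_column.numeric_column("hour"), boundaries=[6, 12, 18]))
bucketized_hour = tf.keras.layers.DenseFeatures([hour_bucket])(inputs)

# Stage 3: the neural network structure.
x = tf.keras.layers.Concatenate()(
    [standardized_age, age_x_hour, bucketized_hour])
output = tf.keras.layers.Dense(1, activation="sigmoid")(
    tf.keras.layers.Dense(64, activation="relu")(x))
model = tf.keras.Model(inputs=inputs, outputs=output)
```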

The building blocks of this pattern:

- Transform work: tf.keras.layers.Lambda
- Multiple-column transform: tf.keras.layers.Lambda
- Feature column: categorical mapper
- Embedding:
  1. Dense embedding -> tf.keras.layers.Embedding
  2. Sparse embedding -> both embedding_column and tf.keras.layers.Embedding + a Keras combiner layer work. We can switch to SparseEmbedding if Keras provides a native SparseEmbedding layer in the future.
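A sketch of the two embedding options (hypothetical vocabulary sizes and ids; the sparse case uses embedding_column with a combiner):

```python
import tensorflow as tf

# Dense embedding: fixed-length integer ids -> tf.keras.layers.Embedding.
item_ids = tf.constant([[3], [42]], dtype=tf.int64)
dense_emb = tf.keras.layers.Embedding(input_dim=10000, output_dim=64)(item_ids)

# Sparse embedding: variable-length ids -> embedding_column, whose combiner
# (here "mean") reduces the looked-up vectors to one vector per example.
tags_col = tf.feature_column.categorical_column_with_identity(
    "tags", num_buckets=500)
tags_emb_col = tf.feature_column.embedding_column(
    tags_col, dimension=32, combiner="mean")
tags = tf.sparse.SparseTensor(
    indices=[[0, 0], [0, 1], [1, 0]],
    values=tf.constant([7, 9, 11], dtype=tf.int64),
    dense_shape=[2, 2])
sparse_emb = tf.keras.layers.DenseFeatures([tags_emb_col])({"tags": tags})
```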