Concatenate IDs of multiple category features for embedding. #1723

workingloong · 2020-02-11T07:38:14Z

Why do we need to concatenate IDs of multiple category features for embedding?

As described in #1721, we need to convert categorical feature values to integer IDs if we want to export the model as a TensorFlow SavedModel for TF Serving.

age	education	marital-status
34	Master	Divorced
54	Doctor	Never-married
42	Bachelor	Never-married

To

age	education	marital-status
34	0	0
54	1	1
42	2	1

After converting category values to IDs, we generally use those IDs to perform a lookup in the embedding matrix.
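As a rough illustration of the conversion and lookup described above, here is a pure-Python sketch that assigns IDs in order of first appearance and then looks up one feature's vectors. The helper name `build_vocab` and the embedding values are hypothetical; the methods in #1721 (hashing, for example) may assign different IDs, so this only reproduces the small table above.

```python
def build_vocab(values):
    """Map each distinct value to an integer ID in order of first appearance."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab


education = ["Master", "Doctor", "Bachelor"]
marital_status = ["Divorced", "Never-married", "Never-married"]

education_vocab = build_vocab(education)
marital_vocab = build_vocab(marital_status)

# Convert the raw category values to integer IDs.
education_ids = [education_vocab[v] for v in education]
marital_ids = [marital_vocab[v] for v in marital_status]

# A made-up embedding table for "education": one 2-dimensional row per ID.
education_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# An embedding lookup is just row selection by ID.
education_vectors = [education_table[i] for i in education_ids]
```

With a separate table per feature, "marital-status" would need its own `marital_table`, which is the per-feature variable cost discussed next.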

The problem: a dataset often has many categorical features. If we create an embedding for each categorical feature separately, we need to create many embedding table variables. Besides the weights stored in the variables, creating each variable carries its own overhead. As a result, the model may become very large and embedding lookups may be inefficient.

To reduce the number of embedding table variables, we can concatenate the categorical feature ID tensors into one big tensor and merge the embedding tables. However, identical IDs return identical embedding vectors when looked up in the merged embedding table. In the following figure, we can see that the embedding vectors of "marital-status" are the same as those of "education".

So, we need to add an offset to the IDs of "marital-status" so that the "marital-status" feature gets its own embedding vectors when looked up in the merged embedding table.
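The collision and the offset fix can be sketched in pure Python. The table values, the `lookup` helper, and the row layout (education rows first, then marital-status rows) are assumptions for illustration only, not the layout of any particular implementation.

```python
# Merged table: rows 0-2 are assumed to belong to "education",
# rows 3-4 to "marital-status". The vector values are made up.
merged_table = [
    [0.1, 0.2],  # education ID 0
    [0.3, 0.4],  # education ID 1
    [0.5, 0.6],  # education ID 2
    [0.7, 0.8],  # marital-status ID 0 (row 3 after the offset)
    [0.9, 1.0],  # marital-status ID 1 (row 4 after the offset)
]


def lookup(table, ids):
    """Select one table row per ID."""
    return [table[i] for i in ids]


education_ids = [0, 1, 2]
marital_ids = [0, 1, 1]
num_education_ids = 3  # size of the "education" ID space

# Without an offset, marital-status IDs hit the education rows.
collided = lookup(merged_table, marital_ids)

# With the offset, marital-status IDs land on their own rows.
marital_ids_with_offset = [i + num_education_ids for i in marital_ids]
shifted = lookup(merged_table, marital_ids_with_offset)
```

Here `collided[0]` is the same row as education ID 0, while `shifted[0]` is row 3, which only "marital-status" uses.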

Solution proposals to concatenate the IDs using different TensorFlow APIs.

#1721 lists 3 methods to convert category values to IDs. Each conversion method needs a different way to concatenate the resulting IDs.

1. Concatenate the IDs generated by categorical columns in `tf.feature_column` such as `tf.feature_column.categorical_column_with_hash_bucket`.

This is the 1st case shown in #1721. If we use categorical columns to convert category values to IDs, we must use embedding_column to create embeddings for those IDs, because the output of a categorical column is a sparse tensor that cannot be used directly in DenseFeatures. So, we need to concatenate the outputs of the categorical columns before embedding_column to reduce the number of embedding variables.

import tensorflow as tf

education_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    "education", hash_bucket_size=3
)  # the id is in [0,3)
marital_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    "marital-status", hash_bucket_size=5
)  # the id is in [0,5)

# concat_column is not part of tf.feature_column; it is the custom column from PR #1719.
edu_marital_concat = concat_column([education_hash_column, marital_hash_column])

# embedding_column takes the embedding size as `dimension`.
education_embedded_column = tf.feature_column.embedding_column(
    edu_marital_concat, dimension=2
)

The concat_column concatenates the outputs of education_hash_column and marital_hash_column and adds an offset of 3 to the IDs of marital_hash_column.

In this case, we need to implement a custom concat_column, as shown in PR #1719.
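The concat_column from PR #1719 is not reproduced here. As a sketch of the offset arithmetic it would need, the hypothetical helper `compute_offsets` below derives each feature's starting row from the bucket sizes, assuming the features are laid out in the merged table in the order given.

```python
def compute_offsets(bucket_sizes):
    """Return (offsets, merged_size) for features laid out back to back.

    offsets[k] is the first row of feature k in the merged table;
    merged_size is the total number of rows the merged table needs.
    """
    offsets = []
    total = 0
    for size in bucket_sizes:
        offsets.append(total)
        total += size
    return offsets, total


# "education" has 3 buckets and "marital-status" has 5, as in the example above.
offsets, merged_size = compute_offsets([3, 5])
```

For the two columns above this gives an offset of 0 for "education", 3 for "marital-status", and a merged table of 8 rows.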

2. Concatenate the IDs generated by `numeric_column` with a custom transform_fn.

This is the 2nd method in #1721. The output of numeric_column is a dense tensor of IDs, so we can use it directly in DenseFeatures.

def generate_hash_bucket_column(name, hash_bucket_size):
    def hash_bucket_id(x):
        # Hash non-string inputs by their string representation.
        if x.dtype != tf.string:
            x = tf.strings.as_string(x)
        return tf.strings.to_hash_bucket_fast(x, hash_bucket_size)

    # The hashing function runs as the column's normalizer_fn.
    return tf.feature_column.numeric_column(
        name, dtype=tf.int32, normalizer_fn=hash_bucket_id
    )

# DenseFeatures expects a mapping from feature name to tensor.
input_layers = {
    "education": tf.keras.layers.Input(name="education", shape=(1,), dtype=tf.string),
    "marital-status": tf.keras.layers.Input(
        name="marital-status", shape=(1,), dtype=tf.string
    ),
}
education_hash = generate_hash_bucket_column(
    name="education", hash_bucket_size=3
)  # the id is in [0,3)
marital_hash = generate_hash_bucket_column(
    name="marital-status", hash_bucket_size=5
)  # the id is in [0,5)

education_hash_ids = tf.keras.layers.DenseFeatures([education_hash])(input_layers)
marital_hash_ids = tf.keras.layers.DenseFeatures([marital_hash])(input_layers)

Then, we can add offset for the IDs tensor of "marital-status" and concatenate it with "education" IDs tensor like:

marital_ids_with_offset = marital_hash_ids + 3  # 3 is the number of education IDs
edu_marital_concat = tf.keras.layers.Concatenate()([education_hash_ids, marital_ids_with_offset])

In this case, we need to write a custom transform_fn for numeric_column, but we don't need a custom concat_column.
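The same add-offset-then-concatenate step generalizes to more than two features. The sketch below works on plain Python lists rather than tensors, and the helper name `concat_ids_with_offsets` is hypothetical; it assumes every feature supplies one ID per example and that IDs lie in `[0, bucket_size)`.

```python
def concat_ids_with_offsets(id_columns, bucket_sizes):
    """Shift each feature's IDs by its offset and concatenate per example.

    id_columns: one list of IDs per feature, all of the same length.
    bucket_sizes: the ID-space size of each feature, in the same order.
    Returns one row per example holding the shifted IDs of every feature.
    """
    # Each feature starts where the previous features' ID spaces end.
    offsets = []
    total = 0
    for size in bucket_sizes:
        offsets.append(total)
        total += size

    rows = []
    for example_ids in zip(*id_columns):
        row = []
        for feature_id, offset, size in zip(example_ids, offsets, bucket_sizes):
            if not 0 <= feature_id < size:
                raise ValueError(f"ID {feature_id} is outside [0, {size})")
            row.append(feature_id + offset)
        rows.append(row)
    return rows


education_ids = [0, 1, 2]  # IDs in [0, 3)
marital_ids = [0, 1, 1]    # IDs in [0, 5)
concat_ids = concat_ids_with_offsets([education_ids, marital_ids], [3, 5])
```

Each row of `concat_ids` can then be looked up in a single merged table with `sum(bucket_sizes)` rows.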

3. Concatenate the IDs generated by custom transformation layers.

This is the 3rd method in #1721. The output of the custom layer HashBucket is the same as that of the numeric_column in the 2nd method, so we can add the offset to the "marital-status" IDs and concatenate in the same way.

class HashBucket(tf.keras.layers.Layer):
    def __init__(self, hash_bucket_size):
        super(HashBucket, self).__init__()
        self.hash_bucket_size = hash_bucket_size

    def call(self, inputs):
        # Hash non-string inputs by their string representation.
        if inputs.dtype != tf.string:
            inputs = tf.strings.as_string(inputs)
        bucket_id = tf.strings.to_hash_bucket_fast(
            inputs, self.hash_bucket_size
        )
        return tf.cast(bucket_id, tf.int64)

education_input = tf.keras.layers.Input(name="education", shape=(1,), dtype=tf.string)
marital_input = tf.keras.layers.Input(name="marital-status", shape=(1,), dtype=tf.string)

education_hash_ids = HashBucket(hash_bucket_size=3)(education_input) # the id is in [0,3)
marital_hash_ids = HashBucket(hash_bucket_size=5)(marital_input) # the id is in [0,5)

marital_ids_with_offset = marital_hash_ids + 3  # 3 is the number of education IDs

edu_marital_concat = tf.keras.layers.Concatenate()([education_hash_ids, marital_ids_with_offset])

workingloong added data transform discussion labels Feb 11, 2020

brightcoder01 mentioned this issue Feb 12, 2020

Add the transform function Api design #1725

Merged

brightcoder01 assigned workingloong Feb 18, 2020

workingloong mentioned this issue Mar 17, 2020

Support concatenating the tensor adding offset. #1846

Merged

brightcoder01 closed this as completed in #1846 Mar 25, 2020
