Concatenate IDs of multiple category features for embedding. #1723

Closed
workingloong opened this issue Feb 11, 2020 · 0 comments · Fixed by #1846
workingloong commented Feb 11, 2020

Why do we need to concatenate IDs of multiple category features for embedding?

#1721 explains that we need to convert categorical feature values to integer IDs if we want to export the model as a TensorFlow SavedModel for TF Serving.

age education marital-status
34 Master Divorced
54 Doctor Never-married
42 Bachelor Never-married

These values are converted to:

age education marital-status
34 0 0
54 1 1
42 2 1

After converting category values to IDs, we generally use those IDs to perform a lookup in the embedding matrix.

The problem: a dataset often has many categorical features. If we create an embedding for each categorical feature separately, we need to create many embedding table variables. Besides the weights stored in the variables, creating each variable carries its own overhead. As a result, the model may become very large and the embedding lookups may be inefficient.
image

To reduce the number of embedding table variables, we can concatenate the categorical feature ID tensors into one big tensor and merge the embedding tables. However, identical IDs from different features then return the same embedding vector from the merged embedding table. In the following figure, we can see that the embedding vectors of "marital-status" are the same as those of "education".
image

So we need to add an offset to the IDs of "marital-status" so that the "marital-status" feature gets its own embedding vectors from the merged embedding table.
image
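The collision and the effect of the offset can be sketched in plain Python with the IDs from the table above. This is only an illustration: the embedding values and the names `merged_table`, `lookup`, and `EDUCATION_VOCAB_SIZE` are made up for the example and do not come from the project.

```python
# Toy merged embedding table (values are arbitrary, for illustration only).
# Rows 0-2 are meant for "education", rows 3-4 for "marital-status".
merged_table = [
    [0.1, 0.2],  # education ID 0
    [0.3, 0.4],  # education ID 1
    [0.5, 0.6],  # education ID 2
    [0.7, 0.8],  # marital-status ID 0 (after offset)
    [0.9, 1.0],  # marital-status ID 1 (after offset)
]
EDUCATION_VOCAB_SIZE = 3  # number of distinct education IDs


def lookup(ids):
    """Return the embedding vector for each ID in the merged table."""
    return [merged_table[i] for i in ids]


education_ids = [0, 1, 2]
marital_ids = [0, 1, 1]

# Without an offset, marital-status ID 0 hits the same row as education ID 0.
collided = lookup(marital_ids)

# With the offset, marital-status IDs land in their own rows.
shifted_ids = [i + EDUCATION_VOCAB_SIZE for i in marital_ids]
separated = lookup(shifted_ids)
```

Here `collided[0]` is the same vector as the one for education ID 0, while `separated[0]` comes from row 3, which only "marital-status" uses.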

Solution proposals to concatenate the IDs using different TensorFlow APIs

#1721 lists 3 methods to convert category values to IDs. How we concatenate the IDs depends on which of those methods produced them.

1. Concatenate the IDs generated by categorical columns in tf.feature_column such as tf.feature_column.categorical_column_with_hash_bucket.

The example is the 1st case shown in #1721. If we use categorical columns to convert category values to IDs, we must use embedding_column to embed those IDs, because the output of a categorical column is a sparse tensor, which cannot be used directly in DenseFeatures. So we need to concatenate the outputs of the categorical columns before embedding_column to reduce the number of embedding variables.

import tensorflow as tf

education_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="education", hash_bucket_size=3
)  # the ID is in [0, 3)
marital_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="marital-status", hash_bucket_size=5
)  # the ID is in [0, 5)

# concat_column is not part of TensorFlow; it is the custom column from PR #1719.
edu_marital_concat = concat_column([education_hash_column, marital_hash_column])

# A single embedding table now serves both features.
education_embedded_column = tf.feature_column.embedding_column(
    edu_marital_concat, dimension=2
)

The concat_column concatenates the outputs of education_hash_column and marital_hash_column and adds an offset of 3 to the IDs of marital_hash_column.

In this case, we need to implement a custom concat_column, as shown in PR #1719.
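What such a concatenation has to do with sparse outputs can be sketched in plain Python. This is not the implementation from PR #1719; it is a minimal illustration that assumes each sparse output is given as a hypothetical `(indices, values, num_cols)` triple, and the marital-status hash IDs below are made up.

```python
def concat_sparse_ids(columns, bucket_sizes):
    """Concatenate sparse ID batches along the column axis.

    Each column is (indices, values, num_cols), where indices is a list of
    (row, col) pairs. IDs of later columns are shifted by the total bucket
    size of the columns before them, so ID ranges never overlap.
    """
    entries = []
    col_shift = 0  # where this feature's columns start in the merged tensor
    id_offset = 0  # where this feature's IDs start in the merged table
    for (indices, values, num_cols), size in zip(columns, bucket_sizes):
        for (row, col), value in zip(indices, values):
            entries.append(((row, col + col_shift), value + id_offset))
        col_shift += num_cols
        id_offset += size
    entries.sort(key=lambda entry: entry[0])  # keep row-major order
    merged_indices = [index for index, _ in entries]
    merged_values = [value for _, value in entries]
    return merged_indices, merged_values, col_shift


# One ID per example for each feature; bucket sizes are 3 and 5 as above.
education = ([(0, 0), (1, 0), (2, 0)], [0, 1, 2], 1)
marital = ([(0, 0), (1, 0), (2, 0)], [0, 4, 4], 1)  # made-up hash IDs

merged_indices, merged_values, merged_cols = concat_sparse_ids(
    [education, marital], bucket_sizes=[3, 5]
)
```

After the merge, every example has two IDs: the education ID in [0, 3) and the marital-status ID shifted into [3, 8), so one embedding table with 8 rows covers both features.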

2. Concatenate the IDs generated by numeric_column with a custom transform_fn.

The example is the 2nd case in #1721. The output of numeric_column is a dense tensor of IDs, which we can use directly in DenseFeatures.

import tensorflow as tf

def generate_hash_bucket_column(name, hash_bucket_size):
    def hash_bucket_id(x):
        # Convert non-string inputs to strings, then hash into [0, hash_bucket_size).
        if x.dtype != tf.string:
            x = tf.strings.as_string(x)
        return tf.strings.to_hash_bucket_fast(x, hash_bucket_size)

    return tf.feature_column.numeric_column(
        name, dtype=tf.int32, normalizer_fn=hash_bucket_id
    )

# DenseFeatures expects a dict that maps feature names to tensors.
input_layers = {
    "education": tf.keras.layers.Input(name="education", shape=(1,), dtype=tf.string),
    "marital-status": tf.keras.layers.Input(name="marital-status", shape=(1,), dtype=tf.string),
}
education_hash = generate_hash_bucket_column(
    name="education", hash_bucket_size=3
)  # the ID is in [0, 3)
marital_hash = generate_hash_bucket_column(
    name="marital-status", hash_bucket_size=5
)  # the ID is in [0, 5)

education_hash_ids = tf.keras.layers.DenseFeatures([education_hash])(input_layers)
marital_hash_ids = tf.keras.layers.DenseFeatures([marital_hash])(input_layers)

Then we can add an offset to the "marital-status" IDs tensor and concatenate it with the "education" IDs tensor:

marital_ids_with_offset = marital_hash_ids + 3  # 3 is the number of education IDs
edu_marital_concat = tf.keras.layers.Concatenate()([education_hash_ids, marital_ids_with_offset])

In this case, we need a custom transform_fn for numeric_column but do not need a custom concat_column.

3. Concatenate the IDs generated by custom transformation layers.

The example is the 3rd method in #1721. The output of the custom HashBucket layer is the same as that of numeric_column in the 2nd method, so we can add the offset to the "marital-status" IDs and concatenate them in the same way.

import tensorflow as tf

class HashBucket(tf.keras.layers.Layer):
    def __init__(self, hash_bucket_size):
        super(HashBucket, self).__init__()
        self.hash_bucket_size = hash_bucket_size

    def call(self, inputs):
        # Convert non-string inputs to strings before hashing.
        if inputs.dtype != tf.string:
            inputs = tf.strings.as_string(inputs)
        bucket_id = tf.strings.to_hash_bucket_fast(
            inputs, self.hash_bucket_size
        )
        return tf.cast(bucket_id, tf.int64)

education_input = tf.keras.layers.Input(name="education", shape=(1,), dtype=tf.string)
marital_input = tf.keras.layers.Input(name="marital-status", shape=(1,), dtype=tf.string)

education_hash_ids = HashBucket(hash_bucket_size=3)(education_input) # the id is in [0,3)
marital_hash_ids = HashBucket(hash_bucket_size=5)(marital_input) # the id is in [0,5)

marital_ids_with_offset = marital_hash_ids + 3 #3 is the number of education IDs

edu_marital_concat = tf.keras.layers.Concatenate()([education_hash_ids, marital_ids_with_offset])
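
With more than two features, the hard-coded offset of 3 would become a running sum of the bucket sizes of all preceding features. A minimal plain-Python sketch of that bookkeeping follows; the helper name `concat_ids_with_offsets`, the third feature "occupation", and all ID values are hypothetical and used only for illustration.

```python
from itertools import accumulate


def concat_ids_with_offsets(id_columns, bucket_sizes):
    """Shift each feature's IDs by the total size of the features before it,
    then concatenate the shifted IDs row by row."""
    # Offsets are the exclusive prefix sums of the bucket sizes.
    offsets = [0] + list(accumulate(bucket_sizes))[:-1]
    num_rows = len(id_columns[0])
    return [
        [column[row] + offset for column, offset in zip(id_columns, offsets)]
        for row in range(num_rows)
    ]


education_ids = [0, 1, 2]   # IDs in [0, 3)
marital_ids = [0, 4, 4]     # IDs in [0, 5), made-up hash values
occupation_ids = [1, 0, 3]  # hypothetical third feature, IDs in [0, 4)
bucket_sizes = [3, 5, 4]

concat_ids = concat_ids_with_offsets(
    [education_ids, marital_ids, occupation_ids], bucket_sizes
)
merged_table_size = sum(bucket_sizes)  # rows needed in the merged embedding table
```

The merged embedding table then needs `sum(bucket_sizes)` rows, and each feature owns a contiguous, non-overlapping range of them.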