Why do we need to concatenate IDs of multiple category features for embedding?
In #1721, we introduced the need to convert categorical feature values to integer IDs if we want to export the model as a TensorFlow SavedModel for TF Serving. For example, we convert
| age | education | marital-status |
|-----|-----------|----------------|
| 34  | Master    | Divorced       |
| 54  | Doctor    | Never-married  |
| 42  | Bachelor  | Never-married  |

to

| age | education | marital-status |
|-----|-----------|----------------|
| 34  | 0         | 0              |
| 54  | 1         | 1              |
| 42  | 2         | 1              |
After converting categorical values to IDs, we generally use those IDs to perform a lookup in the embedding matrix.

The problem: a dataset sometimes has many categorical features. If we embed each categorical feature separately, we need to create many embedding table variables. Besides the weights they hold, variables carry creation and bookkeeping overhead. So the model may become very large and the embedding lookups may be inefficient.

To reduce the number of embedding table variables, we can concatenate the categorical feature ID tensors into one big tensor and merge the embedding tables. However, the same IDs would then return the same embedding vectors when looked up in the merged embedding table. In the following figure, the embedding vectors of "marital-status" are the same as those of "education".

So, we need to add an offset to the IDs of "marital-status" so that the "marital-status" feature gets its own embedding vectors from the merged embedding table.
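As a minimal sketch of the offset idea (the bucket sizes 3 and 5, the merged table size 8, and the embedding dimension 2 below are illustrative assumptions):

```python
import tensorflow as tf

# Assumed ID spaces: "education" IDs in [0, 3), "marital-status" IDs in [0, 5).
education_ids = tf.constant([0, 1, 2])
marital_ids = tf.constant([0, 1, 1])

# One merged embedding table with 3 + 5 = 8 rows and embedding dimension 2.
merged_table = tf.Variable(tf.random.normal([8, 2]))

# Without an offset, marital ID 1 would collide with education ID 1 and
# return the same embedding vector.
marital_ids_with_offset = marital_ids + 3  # shift by the size of the education ID space

education_vectors = tf.nn.embedding_lookup(merged_table, education_ids)
marital_vectors = tf.nn.embedding_lookup(merged_table, marital_ids_with_offset)
```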
Solution proposals to concatenate the IDs using different TensorFlow APIs

#1721 lists 3 methods to convert categorical values to IDs. Each of them requires a different way to concatenate the IDs.

1. Concatenate the IDs generated by categorical columns in `tf.feature_column`, such as `tf.feature_column.categorical_column_with_hash_bucket`.

The example is the 1st case shown in #1721. If we use categorical columns to convert categorical values to IDs, we must use `embedding_column` to embed those IDs, because the output of categorical columns is a sparse tensor that cannot be used directly in `DenseFeatures`. So, we need to concatenate the outputs of the categorical columns before `embedding_column` to reduce the number of embedding variables.
```python
import tensorflow as tf

education_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="education", hash_bucket_size=3
)  # the id is in [0, 3)
marital_hash_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="marital-status", hash_bucket_size=5
)  # the id is in [0, 5)
edu_marital_concat = concat_column([education_hash_column, marital_hash_column])
edu_marital_embedded_column = tf.feature_column.embedding_column(
    edu_marital_concat, dimension=2
)
```
The `concat_column` concatenates the outputs of `education_hash_column` and `marital_hash_column` and adds an offset of 3 to the IDs of `marital_hash_column`.

In this case, we need a custom `concat_column`, as shown in PR #1719.
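A full `concat_column` has to implement the `tf.feature_column` categorical-column interface, which is not reproduced here (see PR #1719). The core transformation it performs can be sketched on the sparse ID tensors directly; `concat_sparse_ids_with_offsets` below is a hypothetical helper for illustration, not the API from the PR:

```python
import tensorflow as tf

def concat_sparse_ids_with_offsets(sparse_ids_list, bucket_sizes):
    """Shift each feature's sparse IDs by the total size of the preceding
    ID spaces, then concatenate them along the last dimension."""
    offset = 0
    shifted = []
    for ids, size in zip(sparse_ids_list, bucket_sizes):
        # Rebuild the SparseTensor with the offset added to every ID value.
        shifted.append(
            tf.SparseTensor(ids.indices, ids.values + offset, ids.dense_shape)
        )
        offset += size
    return tf.sparse.concat(axis=-1, sp_inputs=shifted)
```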
2. Concatenate the IDs generated by `numeric_column` with a custom `transform_fn`.

The example is the 2nd case shown in #1721. The output of `numeric_column` is a dense tensor of IDs, which can be used directly in `DenseFeatures`.
```python
import tensorflow as tf

def generate_hash_bucket_column(name, hash_bucket_size):
    def hash_bucket_id(x, hash_bucket_size):
        if x.dtype != tf.string:
            x = tf.strings.as_string(x)
        return tf.strings.to_hash_bucket_fast(x, hash_bucket_size)

    transform_fn = lambda x, hash_bucket_size=hash_bucket_size: (
        hash_bucket_id(x, hash_bucket_size)
    )
    return tf.feature_column.numeric_column(
        name, dtype=tf.int32, normalizer_fn=transform_fn
    )

# DenseFeatures expects a dict that maps feature names to input tensors.
input_layers = {
    "education": tf.keras.layers.Input(name="education", shape=(1,), dtype=tf.string),
    "marital-status": tf.keras.layers.Input(
        name="marital-status", shape=(1,), dtype=tf.string
    ),
}
education_hash = generate_hash_bucket_column(
    name="education", hash_bucket_size=3
)  # the id is in [0, 3)
marital_hash = generate_hash_bucket_column(
    name="marital-status", hash_bucket_size=5
)  # the id is in [0, 5)
education_hash_ids = tf.keras.layers.DenseFeatures([education_hash])(input_layers)
marital_hash_ids = tf.keras.layers.DenseFeatures([marital_hash])(input_layers)
```
Then, we can add an offset to the "marital-status" ID tensor and concatenate it with the "education" ID tensor:
```python
marital_ids_with_offset = marital_hash_ids + 3  # 3 is the number of education IDs
edu_marital_concat = tf.keras.layers.Concatenate()(
    [education_hash_ids, marital_ids_with_offset]
)
```
In this case, we need to customize a `transform_fn` for `numeric_column`, but we don't need a custom `concat_column`.
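Continuing the snippet above, a minimal sketch of how the concatenated ID tensor could feed a single merged embedding table (the `Embedding` layer and its 3 + 5 = 8 rows are assumptions for illustration, not part of #1721):

```python
# DenseFeatures outputs float32, so cast back to integer IDs before the lookup.
edu_marital_ids = tf.cast(edu_marital_concat, tf.int64)

# One merged embedding table with 3 + 5 = 8 rows instead of two separate tables.
merged_embedding = tf.keras.layers.Embedding(input_dim=8, output_dim=2)
edu_marital_vectors = merged_embedding(edu_marital_ids)  # shape: (batch, 2, 2)
```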
3. Concatenate the IDs generated by custom transformation layers.
The example is the 3rd method shown in #1721. The output of the custom layer `HashBucket` has the same form as the `numeric_column` output in the 2nd method, so we can add the offset to the "marital-status" IDs and concatenate in the same way.
```python
import tensorflow as tf

class HashBucket(tf.keras.layers.Layer):
    def __init__(self, hash_bucket_size):
        super(HashBucket, self).__init__()
        self.hash_bucket_size = hash_bucket_size

    def call(self, inputs):
        if inputs.dtype != tf.string:
            inputs = tf.strings.as_string(inputs)
        bucket_id = tf.strings.to_hash_bucket_fast(
            inputs, self.hash_bucket_size
        )
        return tf.cast(bucket_id, tf.int64)

education_input = tf.keras.layers.Input(name="education", shape=(1,), dtype=tf.string)
marital_input = tf.keras.layers.Input(name="marital-status", shape=(1,), dtype=tf.string)
education_hash_ids = HashBucket(hash_bucket_size=3)(education_input)  # the id is in [0, 3)
marital_hash_ids = HashBucket(hash_bucket_size=5)(marital_input)  # the id is in [0, 5)
marital_ids_with_offset = marital_hash_ids + 3  # 3 is the number of education IDs
edu_marital_concat = tf.keras.layers.Concatenate()(
    [education_hash_ids, marital_ids_with_offset]
)
```
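As a sketch of how this could be assembled into a complete Keras model (the `Embedding`, `Flatten`, and `Dense` layers below are illustrative assumptions, not part of #1721):

```python
# One merged embedding table with 3 + 5 = 8 rows serves both features.
merged_embedding = tf.keras.layers.Embedding(input_dim=8, output_dim=2)
edu_marital_vectors = merged_embedding(edu_marital_concat)

flattened = tf.keras.layers.Flatten()(edu_marital_vectors)
output = tf.keras.layers.Dense(1, activation="sigmoid")(flattened)

model = tf.keras.Model(inputs=[education_input, marital_input], outputs=output)
```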