78 | 78 | # Embeddings are trained in RecSys through the following process:
79 | 79 | #
80 | 80 | # * **Input/lookup indices are fed into the model, as unique IDs**. IDs are
81 |    | -# hashed to the total size of the embedding table to prevent issues when
82 |    | -# the ID > number of rows
   | 81 | +# hashed to the total size of the embedding table to prevent issues when
   | 82 | +# the ID > number of rows
83 | 83 | #
84 | 84 | # * Embeddings are then retrieved and **pooled, such as taking the sum or
85 | 85 | # mean of the embeddings**. This is required as there can be a variable number of
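
The lookup-and-pool flow described in the hunk above can be sketched with plain ``torch.nn.EmbeddingBag``. This is a minimal illustration only: the table size, raw IDs, and offsets are invented, and the modulo operation stands in for whatever hashing scheme a real pipeline uses.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
num_embeddings, embedding_dim = 1000, 64

# EmbeddingBag performs the lookup and the pooling (here: mean) in one step.
embedding_bag = nn.EmbeddingBag(num_embeddings, embedding_dim, mode="mean")

# Raw IDs may exceed the number of rows, so hash (modulo) them into range first.
raw_ids = torch.tensor([7, 1000045, 23, 900000123], dtype=torch.long)
hashed_ids = raw_ids % num_embeddings

# A variable number of IDs per example is expressed with offsets:
# example 0 gets the first 3 IDs, example 1 gets the remaining 1.
offsets = torch.tensor([0, 3], dtype=torch.long)

pooled = embedding_bag(hashed_ids, offsets)
print(pooled.shape)  # torch.Size([2, 64]) -- one pooled vector per example
```
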
220 | 220 | # ------------------------------
221 | 221 | #
222 | 222 | # This section goes over TorchRec Modules and data types including such
223 |     | -# entities as ``EmbeddingCollection``and ``EmbeddingBagCollection``,
    | 223 | +# entities as ``EmbeddingCollection`` and ``EmbeddingBagCollection``,
224 | 224 | # ``JaggedTensor``, ``KeyedJaggedTensor``, ``KeyedTensor`` and more.
225 | 225 | #
226 | 226 | # From ``EmbeddingBag`` to ``EmbeddingBagCollection``
@@ -918,17 +918,18 @@ def _wait_impl(self) -> torch.Tensor:
918 | 918 | # very sensitive to **performance and size of the model**. Running just
919 | 919 | # the trained model in a Python environment is incredibly inefficient.
920 | 920 | # There are two key differences between inference and training
921 |     | -# environments: \* **Quantization**: Inference models are typically
922 |     | -# quantized, where model parameters lose precision for lower latency in
923 |     | -# predictions and reduced model size. For example FP32 (4 bytes) in
924 |     | -# trained model to INT8 (1 byte) for each embedding weight. This is also
925 |     | -# necessary given the vast scale of embedding tables, as we want to use as
926 |     | -# few devices as possible for inference to minimize latency.
    | 921 | +# environments:
    | 922 | +# * **Quantization**: Inference models are typically
    | 923 | +# quantized, where model parameters lose precision for lower latency in
    | 924 | +# predictions and reduced model size. For example FP32 (4 bytes) in
    | 925 | +# trained model to INT8 (1 byte) for each embedding weight. This is also
    | 926 | +# necessary given the vast scale of embedding tables, as we want to use as
    | 927 | +# few devices as possible for inference to minimize latency.
927 | 928 | #
928 | 929 | # * **C++ environment**: Inference latency is very important, so in order to ensure
929 |     | -# ample performance, the model is typically ran in a C++ environment,
930 |     | -# along with the situations where we don't have a Python runtime, like on
931 |     | -# device.
    | 930 | +# ample performance, the model is typically ran in a C++ environment,
    | 931 | +# along with the situations where we don't have a Python runtime, like on
    | 932 | +# device.
932 | 933 | #
933 | 934 | # TorchRec provides primitives for converting a TorchRec model into being
934 | 935 | # inference ready with:
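
To make the FP32-to-INT8 arithmetic in the quantization bullet concrete, here is a small sketch using plain PyTorch per-tensor quantization. The table size, scale, and zero point are invented for illustration, and TorchRec's actual inference path relies on its own quantized embedding modules rather than this exact call.

```python
import torch

# A trained embedding table in FP32: 1,000,000 rows x 64 dims (made-up sizes).
weights_fp32 = torch.randn(1_000_000, 64)

# Quantize to INT8 with a per-tensor scale and zero point (values chosen for
# the example only; real pipelines calibrate these or quantize per row).
scale = weights_fp32.abs().max().item() / 127
weights_int8 = torch.quantize_per_tensor(
    weights_fp32, scale=scale, zero_point=0, dtype=torch.qint8
)

fp32_bytes = weights_fp32.numel() * 4  # 4 bytes per FP32 weight
int8_bytes = weights_int8.numel() * 1  # 1 byte per INT8 weight
print(f"{fp32_bytes / 1e6:.0f} MB -> {int8_bytes / 1e6:.0f} MB")  # ~256 MB -> ~64 MB
```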