```
> Words that are not in the vocabulary are represented by `[UNK]`, whose id is 100.
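
A quick way to check this id (a minimal sketch; `bert-base-chinese` is an assumed checkpoint, not taken from this repo):

```python
# Sketch: looking up the out-of-vocabulary token id directly.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed vocab
print(tokenizer.unk_token)                       # [UNK]
print(tokenizer.convert_tokens_to_ids("[UNK]"))  # 100
```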

3. Truncation strategy for over-long sequences
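
The trimming itself happens inside `PreTrainedTokenizer.truncate_sequences`, whose default strategy is `longest_first`: tokens are removed one at a time from whichever sequence is currently longer, so a sentence pair shrinks evenly until it fits. A minimal standalone sketch of that idea (the function below is hypothetical, not the library's implementation):

```python
# Hypothetical sketch of the "longest_first" truncation strategy.
def truncate_longest_first(ids_a, ids_b, num_tokens_to_remove):
    for _ in range(num_tokens_to_remove):
        if ids_b is None or len(ids_a) > len(ids_b):
            ids_a = ids_a[:-1]  # drop from the (longer) first sequence
        else:
            ids_b = ids_b[:-1]  # drop from the second sequence
    return ids_a, ids_b
```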

4. Add the special tokens

![Adding special tokens to the original sequence](图解BERT/1.png)

```json
[101, 5500, 4873, 704, 4638, 4960, 4788, 2501, 2578, 102]
```

> Special token ids in BertTokenizer:
> - `[CLS]`: 101
> - `[SEP]`: 102
> - `[MASK]`: 103
> - `[UNK]`: 100
> - `[PAD]`: 0
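
These ids need not be hard-coded; the tokenizer exposes them as attributes (sketch, same assumed `bert-base-chinese` vocab as above):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed vocab
print(tokenizer.cls_token_id)   # 101
print(tokenizer.sep_token_id)   # 102
print(tokenizer.mask_token_id)  # 103
print(tokenizer.unk_token_id)   # 100
print(tokenizer.pad_token_id)   # 0
```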

```python
# BertTokenizer
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    # single sequence: [CLS] X [SEP]
    if token_ids_1 is None:
        return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
    # pair of sequences: [CLS] A [SEP] B [SEP]
    cls = [self.cls_token_id]
    sep = [self.sep_token_id]
    return cls + token_ids_0 + sep + token_ids_1 + sep
```
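
Applied to the eight example ids above, the method reproduces the sequence just shown (sketch; the tokenizer construction is an assumption):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed vocab
token_ids = [5500, 4873, 704, 4638, 4960, 4788, 2501, 2578]      # example ids from above
print(tokenizer.build_inputs_with_special_tokens(token_ids))
# [101, 5500, 4873, 704, 4638, 4960, 4788, 2501, 2578, 102]
```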

5. Create the sentence-id list (`token_type_ids`) to tell different sentences apart

![How token_type_ids works](图解BERT/2.png)

```json
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

```python
# BertTokenizer
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
    """
    Creates a mask from the two sequences passed to be used in a sequence-pair classification task.
    """
    sep = [self.sep_token_id]
    cls = [self.cls_token_id]
    # single sequence: all 0s; pair: 0s for "[CLS] A [SEP]", 1s for "B [SEP]"
    if token_ids_1 is None:
        return len(cls + token_ids_0 + sep) * [0]
    return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1]
```
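
For a single sentence this reproduces the all-zero list above; with a (hypothetical) second sentence the `1` segment appears (sketch):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed vocab
ids_a = [5500, 4873, 704, 4638, 4960, 4788, 2501, 2578]         # example ids from above
ids_b = [1, 2, 3]                                               # hypothetical second sentence
print(tokenizer.create_token_type_ids_from_sequences(ids_a))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```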

6. Create the mask list that marks the special-token positions

![How special_tokens_mask works](图解BERT/3.png)

```python
# BertTokenizer
def get_special_tokens_mask(self, token_ids_0, token_ids_1=None, already_has_special_tokens=False):
    # 1 flags a special token ([CLS]/[SEP]), 0 flags an ordinary sequence token
    if token_ids_1 is not None:
        return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
    return [1] + ([0] * len(token_ids_0)) + [1]
```
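
On the running example the mask flags exactly the `[CLS]` and `[SEP]` positions (sketch):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed vocab
ids_a = [5500, 4873, 704, 4638, 4960, 4788, 2501, 2578]         # example ids from above
print(tokenizer.get_special_tokens_mask(ids_a))
# [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]  -> 1 at the [CLS] and [SEP] positions
```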

7. Over-length truncation

```python
# PreTrainedTokenizer
if max_length and len(encoded_inputs["input_ids"]) > max_length:
    encoded_inputs["input_ids"] = encoded_inputs["input_ids"][:max_length]
    encoded_inputs["token_type_ids"] = encoded_inputs["token_type_ids"][:max_length]
    encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"][:max_length]
```
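
Note that a plain slice taken after the special tokens were added can cut off the trailing `[SEP]`. A quick sketch with the running example:

```python
# Sketch: hard truncation of already-built inputs can drop the closing [SEP] (102).
input_ids = [101, 5500, 4873, 704, 4638, 4960, 4788, 2501, 2578, 102]
max_length = 6
print(input_ids[:max_length])
# [101, 5500, 4873, 704, 4638, 4960]  -> no trailing [SEP] left
```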