Commit 87af712

updates
1 parent b1be05e commit 87af712

File tree

4 files changed: +37 -3 lines changed

src/LLM/图解BERT.md

Lines changed: 37 additions & 3 deletions
@@ -122,9 +122,6 @@ Read_Bert_Code/bert_read_step_to_step/chineseGLUEdatasets/tnews
![Illustration of what token_type_ids does](图解BERT/2.png)

-```json
-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
-```
```python
# BertTokenizer
def create_token_type_ids_from_sequences(self, token_ids_0, token_ids_1=None):
```
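The hunk keeps only the signature of create_token_type_ids_from_sequences as context. For reference, here is a minimal standalone sketch of the segment-id logic this method implements, not the repo's verbatim code: a single sentence gets all-zero segment ids (consistent with the all-zero list shown in the removed example), while a sentence pair gets 0 for `[CLS] A [SEP]` and 1 for `B [SEP]`.

```python
# Minimal sketch, not the verbatim BertTokenizer code: segment ids are 0 for
# "[CLS] sentence_A [SEP]" and 1 for "sentence_B [SEP]" when a pair is given.
def create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None):
    if token_ids_1 is None:
        return [0] * (len(token_ids_0) + 2)          # [CLS] + A + [SEP]
    return [0] * (len(token_ids_0) + 2) + [1] * (len(token_ids_1) + 1)  # + B + [SEP]

print(create_token_type_ids_from_sequences([7, 8, 9]))         # [0, 0, 0, 0, 0]
print(create_token_type_ids_from_sequences([7, 8], [10, 11]))  # [0, 0, 0, 0, 1, 1, 1]
```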
@@ -163,3 +160,40 @@ Read_Bert_Code/bert_read_step_to_step/chineseGLUEdatasets/tnews
```python
encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"][:max_length]
```

8. Generate the mask list for the padding portion

![Illustration of what attention_mask does](图解BERT/4.png)

```python
# Build the attention mask: real tokens get 1, padding tokens get 0
attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
```
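A toy check of the mask line above, assuming mask_padding_with_zero=True (the usual setting, so real tokens get 1 and the padding appended later gets 0); the token ids are made up for illustration:

```python
mask_padding_with_zero = True
input_ids = [101, 2769, 4263, 872, 102]   # hypothetical ids: [CLS] x x x [SEP]
attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
print(attention_mask)                      # [1, 1, 1, 1, 1]
```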

9. Pad every sequence to max_length; any sequence that is too short is filled up with padding

![Illustration of the padding process](图解BERT/5.png)

```python
# Record the original input length
input_len = len(input_ids)
# Compute the padding length: afterwards all input sequences are the same length, max_length
padding_length = max_length - len(input_ids)
# Pad on the right
input_ids = input_ids + ([pad_token] * padding_length)
attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
token_type_ids = token_type_ids + ([pad_token_segment_id] * padding_length)
```
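Putting steps 8 and 9 together, a self-contained toy run of the right-padding logic. The variable names come from the snippet above; the concrete values (max_length=8, pad_token=0, pad_token_segment_id=0, and the token ids) are assumptions for illustration:

```python
max_length = 8
pad_token = 0                     # assumed: BERT's [PAD] id is 0
pad_token_segment_id = 0
mask_padding_with_zero = True

input_ids = [101, 2769, 4263, 872, 102]   # hypothetical ids, real length 5
attention_mask = [1] * len(input_ids)
token_type_ids = [0] * len(input_ids)

input_len = len(input_ids)                      # 5, the length before padding
padding_length = max_length - len(input_ids)    # 3
input_ids = input_ids + ([pad_token] * padding_length)
attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)
token_type_ids = token_type_ids + ([pad_token_segment_id] * padding_length)

print(input_ids)       # [101, 2769, 4263, 872, 102, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
print(token_type_ids)  # [0, 0, 0, 0, 0, 0, 0, 0]
```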

10. Every sample in the dataset is ultimately parsed into an InputFeatures

![Composition of InputFeatures](图解BERT/6.png)

```python
features.append(
    InputFeatures(input_ids=input_ids,
                  attention_mask=attention_mask,
                  token_type_ids=token_type_ids,
                  label=label,
                  input_len=input_len))
```

> label is the class label of the current text
> input_len is the actual sequence length (including special tokens)
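The diff never shows the definition of InputFeatures itself. A minimal sketch of the container assumed by the snippet above; the repo's actual class may differ in details:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class InputFeatures:
    """One fully processed sample, as assembled in step 10 (sketch, not the repo's exact class)."""
    input_ids: List[int]        # token ids, right-padded to max_length
    attention_mask: List[int]   # 1 for real tokens, 0 for padding
    token_type_ids: List[int]   # segment ids, padded with pad_token_segment_id
    label: int                  # class label of the current text
    input_len: int              # real length before padding (incl. special tokens)
```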

src/LLM/图解BERT/4.png (188 KB)

src/LLM/图解BERT/5.png (233 KB)

src/LLM/图解BERT/6.png (142 KB)

0 commit comments