
Commit cc0c34c

Authored by tiberiu44, dumitrescustefan, rscctest, and KoichiYasuoka
3.0 (#135)
* Corrected permissions
* Bugfix
* Added GPU support at runtime
* Wrong config package
* Refactoring
* refactoring
* add lightning to dependencies
* Dummy test
* Dummy test
* Tweak
* Tweak
* Update test
* Test
* Finished loading for UD CONLL-U format
* Working on tagger
* Work on tagger
* tagger training
* tagger training
* tagger training
* Sync
* Sync
* Sync
* Sync
* Tagger working
* Better weight for aux loss
* Better weight for aux loss
* Added save and printing for tagger and shared options class
* Multilanguage evaluation
* Saving multiple models
* Updated ignore list
* Added XLM-Roberta support
* Using custom ro model
* Score update
* Bugfixing
* Code refactor
* Refactor
* Added option to load external config
* Added option to select LM-model from CLI or config
* added option to overwrite config lm from CLI
* Bugfix
* Working on parser
* Sync work on parser
* Parser working
* Removed load limit
* Bugfix in evaluation
* Added bi-affine attention
* Added experimental ChuLiuEdmonds tree decoding
* Better config for parser and bugfix
* Added residuals to tagging
* Model update
* Switched to AdamW optimizer
* Working on tokenizer
* Working on tokenizer
* Training working - validation to do
* Bugfix in language id
* Working on tokenization validation
* Tokenizer working
* YAML update
* Bug in LMHelper
* Tagger is working
* Tokenizer is working
* bfix
* bfix
* Bugfix for bugfix :)
* Sync
* Tokenizer worker
* Tagger working
* Trainer updates
* Trainer process now working
* Added .DS_Store
* Added datasets for Compound Word Expander and Lemmatizer
* Added collate function for lemma+compound
* Added training and validation step
* Updated config for Lemmatizer
* Minor fixes
* Removed duplicate entries from lemma and cwe
* Added training support for lemmatizer
* Removed debug directives
* Lemmatizer in testing phase
* removed unused line
* Bugfix in Lemma dataset
* Corrected validation issue with gs labels being sent to the forward method and removed loss computation during testing
* Lemmatizier training done
* Compound word expander ready
* Sync
* Added support for FastText, Transformers and Languasito LM models
* Added multi-lm support for tokenizer
* Added support for multiword tokens
* Sync
* Bugfix in evaluation
* Added Languasito as a subpackage
* Added path to local Languasito
* Bugfixing all around
* Removed debug printing
* Bugfix for no-space languages that actually contain spaces :)
* Bugfix for no-space languages that actually contain spaces :)
* Fixed GPU support
* Biaffine transform for LAS and relative head location (RHL) for UAS
* Bugfix
* Tweaks
* moved rhl to lower layer
* Added configurable option for RHL
* Safenet for spaces in languages that should use no spaces
* Better defaults
* Sync
* Cleanup parser
* Bilinear xpos and attrs
* Added Biaffine module from Stanza
* Tagger with reduced number of parameters:
* Parser with conditional attrs
* Working on tokenizer runtime
* Tokenizer process 90% done
* Added runtime for parser, tokenizer and tagger
* Added quick test for runtime
* Test for e2e
* Added support for multiple word embeddings at the same time
* Bugfix
* Added multiple word representations for tokenizer
* moved mask_concat to utils.py
* Added XPOS prediction to pipeline
* Bugfix in tokenizer shifted word embeddings
* Using Languasito tokenizer for HF tokenization
* Bugfix
* Bugfixing
* Bugfixing
* Bugfix
* Runtime fixing
* Sync
* Added spa for FT and Languasito
* Added spa for FT and Languasito
* Minor tweaks
* Added configuration for RNN layers
* Bugfix for spa
* HF runtime fix
* Mixed test fasttext+transformer
* Added word reconstruction and MHA
* Sync
* Bugfix
* bugfix
* Added masked attention
* Sync
* Added test for runtime
* Bugfix in mask values
* Updated test
* Added full mask dropout
* Added resume option
* Removed useless printouts
* Removed useless printouts
* Switched to eval at runtime
* multiprocessing added
* Added full mask dropout for word decoder
* Bugfix
* Residual
* Added lexical-contextual cosine loss
* Removed full mask dropout from WordDecoder
* Bugfix
* Training script generation update
* Added residual
* Updated languasito to pickle tokenized lines
* Updated languasito to pickle tokenized lines
* Updated languasito to pickle tokenized lines
* Not training for seq len > max_seq_len
* Added seq limmits for collates
* Passing seq limits from collate to tokenizer
* Skipping complex parsing
* Working on word decomposer
* Model update
* Sync
* Bugfix
* Bugfix
* Bugfix
* Using all reprs
* Dropped immediate context
* Multi train script added
* Changed gpu parameter type to string, for multiple gpus int failed
* Updated pytorch_lightning callback method to work with newer version
* Updated pytorch_lightning callback method to work with newer version
* Transparently pass PL args from the command line; skip over empty compound word datasets
* Fix typo
* Refactoring and on the way to working API
* API load working
* Partial _call_ working
* Partial _call_ working
* Added partly working api and refactored everything back to cube/. Compound not working yet and tokenizer needs retraining.
* api is working
* Fixing api
* Updated readme
* Update Readme to include flavours
* Device support
* api update
* Updated package
* Tweak + results
* Clarification
* Test update
* Update
* Sync
* Update README
* Bugfixing
* Bugfix and api update
* Fixed compound
* Evaluation update
* Bugfix
* Package update
* Bugfix for large sentences
* Pip package update
* Corrected spanish evaluation
* Package version update
* Fixed tokenization issues on transformers
* Removed pinned memory
* Bugfix for GPU tensors
* Update package version
* Automatically detecting hidden state size
* Automatically detecting hidden state size
* Automatically detecting hidden state size
* Sync
* Evaluation update
* Package update
* Bugfix
* Bugfixing
* Package version update
* Bugfix
* Package version update
* Update evaluation for Italian
* tentative support torchtext>=0.9.0 (#127) as mentioned in Lightning-AI/pytorch-lightning#6211 and #100
* Update package dependencies
* Dummy word embeddings
* Update params
* Better dropout values
* Skipping long words
* Skipping long words
* dummy we -> float
* Added gradient clipping
* Update tokenizer
* Update tokenizer
* Sync
* DCWE
* Working on DCWE

---------

Co-authored-by: Stefan Dumitrescu <[email protected]>
Co-authored-by: Tiberiu Boros <[email protected]>
Co-authored-by: Koichi Yasuoka <[email protected]>
1 parent 0bb4fa2 commit cc0c34c

14 files changed: +293 -31 lines changed

cube/api.py

Lines changed: 1 addition & 4 deletions
@@ -126,10 +126,7 @@ def __call__(self, text: Union[str, Document], flavour: Optional[str] = None):
         self._lm_helper.apply(doc)
         self._parser.process(doc, self._parser_collate, num_workers=0)
         self._lemmatizer.process(doc, self._lemmatizer_collate, num_workers=0)
-        for seq in doc.sentences:
-            for w in seq.words:
-                if w.upos =='PUNCT':
-                    w.lemma = w.word
+
         return doc

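The dropped post-processing simply copied the surface form into the lemma slot for punctuation tokens. If that behaviour is still wanted, it can be re-applied on the returned document; a minimal sketch that uses only the attributes visible in the removed lines (doc is the Document returned by __call__):

    # Sketch: re-apply the removed PUNCT handling outside the pipeline.
    for seq in doc.sentences:
        for w in seq.words:
            if w.upos == 'PUNCT':
                w.lemma = w.word  # punctuation keeps its surface form as lemma
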
cube/io_utils/config.py

Lines changed: 27 additions & 3 deletions
@@ -87,7 +87,10 @@ def __init__(self, filename=None, verbose=False):
         self.cnn_filter = 512
         self.lang_emb_size = 100
         self.cnn_layers = 5
-        self.external_proj_size = 300
+        self.rnn_size = 50
+        self.rnn_layers = 2
+        self.external_proj_size = 2
+
         self.no_space_lang = False

         if filename is None:
@@ -139,9 +142,10 @@ def __init__(self, filename=None, verbose=False):
         self.head_size = 100
         self.label_size = 200
         self.lm_model = 'xlm-roberta-base'
-        self.external_proj_size = 300
+        self.external_proj_size = 2
         self.rhl_win_size = 2
-        self.rnn_size = 50
+        self.rnn_size = 200
+
         self.rnn_layers = 3

         self._valid = True
@@ -275,6 +279,26 @@ def __init__(self, filename=None, verbose=False):
             self.load(filename)


+class DCWEConfig(Config):
+    def __init__(self, filename=None, verbose=False):
+        super().__init__()
+        self.char_emb_size = 256
+        self.case_emb_size = 32
+        self.num_filters = 512
+        self.kernel_size = 5
+        self.lang_emb_size = 32
+        self.num_layers = 8
+        self.output_size = 300 # this will be automatically updated at training time, so do not change
+
+        if filename is None:
+            if verbose:
+                sys.stdout.write("No configuration file supplied. Using default values.\n")
+        else:
+            if verbose:
+                sys.stdout.write("Reading configuration file " + filename + " \n")
+            self.load(filename)
+
+
 class GDBConfig(Config):
     def __init__(self, filename=None, verbose=False):
         super().__init__()

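DCWEConfig follows the same pattern as the other Config subclasses: defaults set in __init__, optionally overridden from a file via the inherited load(). A minimal usage sketch (the configuration file name below is hypothetical):

    from cube.io_utils.config import DCWEConfig

    config = DCWEConfig(verbose=True)            # defaults only, no file supplied
    print(config.num_filters, config.kernel_size, config.output_size)
    # config = DCWEConfig(filename='dcwe.conf')  # hypothetical path; values loaded through self.load()
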
cube/networks/dcwe.py

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+import torch
+import torch.nn as nn
+import pytorch_lightning as pl
+from typing import *
+import sys
+
+sys.path.append('')
+from cube.networks.modules import WordGram, LinearNorm
+from cube.io_utils.encodings import Encodings
+from cube.io_utils.config import DCWEConfig
+
+
+class DCWE(pl.LightningModule):
+    encodings: Encodings
+    config: DCWEConfig
+
+    def __init__(self, config: DCWEConfig, encodings: Encodings):
+        super(DCWE, self).__init__()
+        self._config = config
+        self._encodings = encodings
+        self._wg = WordGram(num_chars=len(encodings.char2int),
+                            num_langs=encodings.num_langs,
+                            num_layers=config.num_layers,
+                            num_filters=config.num_filters,
+                            char_emb_size=config.lang_emb_size,
+                            case_emb_size=config.case_emb_size,
+                            lang_emb_size=config.lang_emb_size
+                            )
+        self._output_proj = LinearNorm(config.num_filters // 2, config.output_size, w_init_gain='linear')
+        self._improve = 0
+        self._best_loss = 9999
+
+    def forward(self, x_char, x_case, x_lang, x_mask, x_word_len):
+        pre_proj = self._wg(x_char, x_case, x_lang, x_mask, x_word_len)
+        proj = self._output_proj(pre_proj)
+        return proj
+
+    def _get_device(self):
+        if self._output_proj.linear_layer.weight.device.type == 'cpu':
+            return 'cpu'
+        return '{0}:{1}'.format(self._output_proj.linear_layer.weight.device.type,
+                                str(self._output_proj.linear_layer.weight.device.index))
+
+    def configure_optimizers(self):
+        return torch.optim.AdamW(self.parameters())
+
+    def training_step(self, batch, batch_idx):
+        x_char = batch['x_char']
+        x_case = batch['x_case']
+        x_lang = batch['x_lang']
+        x_word_len = batch['x_word_len']
+        x_mask = batch['x_mask']
+        y_target = batch['y_target']
+        y_pred = self.forward(x_char, x_case, x_lang, x_mask, x_word_len)
+        loss = torch.mean((y_pred - y_target) ** 2)
+        return loss
+
+    def validation_step(self, batch, batch_idx):
+        x_char = batch['x_char']
+        x_case = batch['x_case']
+        x_lang = batch['x_lang']
+        x_word_len = batch['x_word_len']
+        x_mask = batch['x_mask']
+        y_target = batch['y_target']
+        y_pred = self.forward(x_char, x_case, x_lang, x_mask, x_word_len)
+        loss = torch.mean((y_pred - y_target) ** 2)
+        return {'loss': loss.detach().cpu().numpy()[0]}
+
+    def validation_epoch_end(self, outputs: List[Any]) -> None:
+        mean_loss = sum([output['loss'] for output in outputs])
+        mean_loss /= len(outputs)
+        self.log('val/loss', mean_loss)
+        self.log('val/early_meta', self._improve)
+
+    def save(self, path):
+        torch.save(self.state_dict(), path)
+
+    def load(self, model_path: str, device: str = 'cpu'):
+        self.load_state_dict(torch.load(model_path, map_location='cpu')['state_dict'])
+        self.to(device)
+
+
+

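The training and validation steps above both reduce to a mean-squared error between the character-level projection and a target embedding. A self-contained illustration of that objective with stand-in tensors (shapes are invented for the example):

    import torch

    # Stand-ins for the output of self.forward(...) and the precomputed target embedding.
    y_pred = torch.randn(4, 300)    # 4 words, output_size=300
    y_target = torch.randn(4, 300)
    loss = torch.mean((y_pred - y_target) ** 2)   # same expression as in training_step
    print(float(loss))
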
cube/networks/lm.py

Lines changed: 21 additions & 0 deletions
@@ -211,6 +211,27 @@ def apply_raw(self, batch):
         pass


+class LMHelperDummy(LMHelper):
+    def __init__(self, device: str = 'cpu', model: str = None):
+        pass
+
+    def get_embedding_size(self):
+        return [1]
+
+    def apply(self, document: Document):
+        for ii in tqdm.tqdm(range(len(document.sentences)), desc="Pre-computing embeddings", unit="sent"):
+            for jj in range(len(document.sentences[ii].words)):
+                document.sentences[ii].words[jj].emb = [[1.0]]
+
+    def apply_raw(self, batch):
+        embeddings = []
+        for ii in range(len(batch)):
+            c_emb = []
+            for jj in range(len(batch[ii])):
+                c_emb.append([1.0])
+            embeddings.append(c_emb)
+        return embeddings
+
 if __name__ == "__main__":
     from ipdb import set_trace

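LMHelperDummy gives every token a constant one-dimensional embedding, which is useful for running the pipeline without a real language model. A quick sketch of apply_raw on a token batch (assuming the package is importable):

    from cube.networks.lm import LMHelperDummy

    helper = LMHelperDummy()
    print(helper.get_embedding_size())        # [1]
    print(helper.apply_raw([['This', 'is', 'fine'], ['Short']]))
    # [[[1.0], [1.0], [1.0]], [[1.0]]]
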
cube/networks/modules.py

Lines changed: 4 additions & 3 deletions
@@ -427,9 +427,10 @@ def __init__(self, num_chars: int, num_langs: int, num_filters=512, char_emb_siz
         super(WordGram, self).__init__()
         NUM_FILTERS = num_filters
         self._num_filters = NUM_FILTERS
-        self._lang_emb = nn.Embedding(num_langs + 1, lang_emb_size)
-        self._tok_emb = nn.Embedding(num_chars + 1, char_emb_size)
-        self._case_emb = nn.Embedding(4, case_emb_size)
+        self._lang_emb = nn.Embedding(num_langs + 1, lang_emb_size, padding_idx=0)
+        self._tok_emb = nn.Embedding(num_chars + 3, char_emb_size, padding_idx=0)
+        self._case_emb = nn.Embedding(4, case_emb_size, padding_idx=0)
+
         self._num_layers = num_layers
         convolutions_char = []
         cs_inp = char_emb_size + lang_emb_size + case_emb_size

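The padding_idx=0 argument pins row 0 of each embedding table to a zero vector that never receives gradient updates, so padded positions stay neutral. A small standalone illustration in plain PyTorch:

    import torch
    import torch.nn as nn

    emb = nn.Embedding(10, 4, padding_idx=0)
    ids = torch.tensor([[3, 5, 0, 0]])   # trailing zeros are padding
    print(emb(ids)[0, 2])                # zero vector for the padded slot
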
cube/networks/parser.py

Lines changed: 14 additions & 8 deletions
@@ -76,7 +76,8 @@ def __init__(self, config: ParserConfig, encodings: Encodings, language_codes: [
         self._upos_emb = nn.Embedding(len(encodings.upos2int), 64)

         self._rnn = nn.LSTM(NUM_FILTERS // 2 + config.lang_emb_size + config.external_proj_size, config.rnn_size,
-                            num_layers=config.rnn_layers, batch_first=True, bidirectional=True, dropout=0.33)
+                            num_layers=config.rnn_layers, batch_first=True, bidirectional=True, dropout=0.1)
+

         self._pre_out = LinearNorm(config.rnn_size * 2 + config.lang_emb_size, config.pre_parser_size)
         # self._head_r1 = LinearNorm(config.pre_parser_size, config.head_size)
@@ -137,9 +138,10 @@ def forward(self, X):
         for ii in range(len(x_word_emb_packed)):
             we = unpack(x_word_emb_packed[ii], sl, x_sents.shape[1], self._get_device())
             if word_emb_ext is None:
-                word_emb_ext = self._ext_proj[ii](we.float())
+                word_emb_ext = self._ext_proj[ii](we)
             else:
-                word_emb_ext = word_emb_ext + self._ext_proj[ii](we.float())
+                word_emb_ext = word_emb_ext + self._ext_proj[ii](we)
+

         word_emb_ext = word_emb_ext / len(x_word_emb_packed)
         word_emb_ext = torch.tanh(word_emb_ext)
@@ -153,7 +155,8 @@ def forward(self, X):

         word_emb = self._word_emb(x_sents)

-        x = mask_concat([word_emb, char_emb, word_emb_ext], 0.33, self.training, self._get_device())
+        x = mask_concat([word_emb, char_emb, word_emb_ext], 0.1, self.training, self._get_device())
+

         x = torch.cat([x, lang_emb[:, 1:, :]], dim=-1)
         # prepend root
@@ -172,7 +175,8 @@ def forward(self, X):
                 res = tmp
             else:
                 res = res + tmp
-            x = torch.dropout(tmp, 0.2, self.training)
+            x = torch.dropout(tmp, 0.1, self.training)
+
             cnt += 1
             if cnt == self._config.aux_softmax_location:
                 hidden = torch.cat([x + res, lang_emb], dim=1)
@@ -184,7 +188,8 @@ def forward(self, X):
         # aux tagging
         lang_emb = lang_emb.permute(0, 2, 1)
         hidden = hidden.permute(0, 2, 1)[:, 1:, :]
-        pre_morpho = torch.dropout(torch.tanh(self._pre_morpho(hidden)), 0.33, self.training)
+        pre_morpho = torch.dropout(torch.tanh(self._pre_morpho(hidden)), 0.1, self.training)
+
         pre_morpho = torch.cat([pre_morpho, lang_emb[:, 1:, :]], dim=2)
         upos = self._upos(pre_morpho)
         if gs_upos is None:
@@ -200,11 +205,12 @@ def forward(self, X):
         word_emb_ext = torch.cat(
             [torch.zeros((word_emb_ext.shape[0], 1, self._config.external_proj_size), device=self._get_device(),
                          dtype=torch.float), word_emb_ext], dim=1)
-        x = mask_concat([x_parse, word_emb_ext], 0.33, self.training, self._get_device())
+        x = torch.cat([x_parse, word_emb_ext], dim=-1) #mask_concat([x_parse, word_emb_ext], 0.1, self.training, self._get_device())
         x = torch.cat([x, lang_emb], dim=-1)
         output, _ = self._rnn(x)
         output = torch.cat([output, lang_emb], dim=-1)
-        pre_parsing = torch.dropout(torch.tanh(self._pre_out(output)), 0.33, self.training)
+        pre_parsing = torch.dropout(torch.tanh(self._pre_out(output)), 0.1, self.training)
+
         # h_r1 = torch.tanh(self._head_r1(pre_parsing))
         # h_r2 = torch.tanh(self._head_r2(pre_parsing))
         # l_r1 = torch.tanh(self._label_r1(pre_parsing))

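The last hunk swaps mask_concat for a plain torch.cat at one call site and lowers the dropout constants from 0.33/0.2 to 0.1. mask_concat itself lives in cube/networks/utils.py and is not shown in this diff, so the sketch below is only an assumption about the idea (randomly dropping whole input representations before concatenating), not the actual implementation:

    import torch

    def mask_concat_sketch(reprs, p, training, device):
        # Assumed behaviour: during training, zero out each representation as a whole
        # with probability p, then concatenate along the feature dimension.
        if training:
            reprs = [r * (torch.rand(r.shape[0], r.shape[1], 1, device=device) > p).float()
                     for r in reprs]
        return torch.cat(reprs, dim=-1)

    a, b = torch.randn(2, 5, 8), torch.randn(2, 5, 4)
    print(mask_concat_sketch([a, b], 0.1, True, 'cpu').shape)   # torch.Size([2, 5, 12])
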
cube/networks/tagger.py

Lines changed: 6 additions & 1 deletion
@@ -1,6 +1,9 @@
 import sys
+
 sys.path.append('')
 import os, yaml
+
+
 os.environ["TOKENIZERS_PARALLELISM"] = "false"
 import pytorch_lightning as pl
 import torch.nn as nn
@@ -14,6 +17,7 @@
 from cube.networks.utils import MorphoCollate, MorphoDataset, unpack, mask_concat
 from cube.networks.modules import WordGram

+
 class Tagger(pl.LightningModule):
     def __init__(self, config: TaggerConfig, encodings: Encodings, language_codes: [] = None, ext_word_emb=0):
         super().__init__()
@@ -276,7 +280,8 @@ def validation_epoch_end(self, outputs):
         # print("\n\n\n", upos_ok / total, xpos_ok / total, attrs_ok / total,
         # aupos_ok / total, axpos_ok / total, aattrs_ok / total, "\n\n\n")

-    def load(self, model_path:str, device: str = 'cpu'):
+    def load(self, model_path: str, device: str = 'cpu'):
+
         self.load_state_dict(torch.load(model_path, map_location='cpu')['state_dict'])
         self.to(device)


cube/networks/tokenizer.py

Lines changed: 22 additions & 5 deletions
@@ -39,8 +39,8 @@ def __init__(self, config: TokenizerConfig, encodings: Encodings, language_codes
             conv_layer = nn.Sequential(
                 ConvNorm(cs_inp,
                          NUM_FILTERS,
-                         kernel_size=5, stride=1,
-                         padding=2,
+                         kernel_size=3, stride=1,
+                         padding=1,
                          dilation=1, w_init_gain='tanh'),
                 nn.BatchNorm1d(NUM_FILTERS))
             conv_layers.append(conv_layer)
@@ -49,7 +49,13 @@ def __init__(self, config: TokenizerConfig, encodings: Encodings, language_codes
         self._wg = WordGram(len(encodings.char2int), num_langs=encodings.num_langs)
         self._lang_emb = nn.Embedding(encodings.num_langs + 1, config.lang_emb_size, padding_idx=0)
         self._spa_emb = nn.Embedding(3, 16, padding_idx=0)
-        self._output = LinearNorm(NUM_FILTERS // 2 + config.lang_emb_size, 5)
+        self._rnn = nn.LSTM(NUM_FILTERS // 2 + config.lang_emb_size,
+                            config.rnn_size,
+                            num_layers=config.rnn_layers,
+                            bidirectional=True,
+                            batch_first=True)
+        self._output = LinearNorm(config.rnn_size * 2, 5)
+

         ext2int = []
         for input_size in self._ext_word_emb:
@@ -103,20 +109,29 @@ def forward(self, batch):
         half = self._config.cnn_filter // 2
         res = None
         cnt = 0
+
+        skip = None
         for conv in self._convs:
             conv_out = conv(x)
             tmp = torch.tanh(conv_out[:, :half, :]) * torch.sigmoid((conv_out[:, half:, :]))
             if res is None:
                 res = tmp
             else:
                 res = res + tmp
-            x = torch.dropout(tmp, 0.2, self.training)
+            x = torch.dropout(tmp, 0.1, self.training)
             cnt += 1
             if cnt != self._config.cnn_layers:
+                if skip is not None:
+                    x = x + skip
+                skip = x
+
                 x = torch.cat([x, x_lang], dim=1)
         x = x + res
         x = torch.cat([x, x_lang], dim=1)
         x = x.permute(0, 2, 1)
+
+        x, _ = self._rnn(x)
+
         return self._output(x)

     def validation_step(self, batch, batch_idx):
@@ -297,7 +312,9 @@ def process(self, raw_text, collate: TokenCollate, batch_size=32, num_workers: i
         return d

     def configure_optimizers(self):
-        return torch.optim.AdamW(self.parameters())
+        optimizer = torch.optim.AdamW(self.parameters(), lr=1e-3, weight_decay=1e-4)
+        return optimizer
+

     def _compute_early_stop(self, res):
         for lang in res:

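configure_optimizers now sets the AdamW hyperparameters explicitly instead of relying on the library defaults (PyTorch's AdamW defaults to lr=1e-3 and weight_decay=1e-2). A minimal standalone equivalent with a placeholder module:

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 4)   # placeholder for the Tokenizer module
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    print(optimizer.defaults['lr'], optimizer.defaults['weight_decay'])
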
cube/networks/utils.py

Lines changed: 2 additions & 0 deletions
@@ -106,6 +106,8 @@ def __init__(self, document: Document, for_training=True):
                 word = w.word
                 lemma = w.lemma
                 upos = w.upos
+                if len(word) > 25:
+                    continue

                 key = (word, lang_id, upos)
                 if key not in lookup or for_training is False:

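The new guard skips words longer than 25 characters when building the lemmatizer/compound lookup (for example URLs or run-on tokens), keeping pathological items out of training. The effect, in isolation:

    words = ['cat', 'https://example.com/some/very/long/address', 'running']
    kept = [w for w in words if len(w) <= 25]   # same threshold as the added check
    print(kept)   # ['cat', 'running']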