Commit c759633

Authored by: tiberiu44, dumitrescustefan, rscctest, Tiberiu Boros, Koichi Yasuoka
3.0 (#128)
* Partial update
* Bugfix
* API update
* Bugfixing and API
* Bugfix
* Fix long words OOM by skipping sentences
* Bugfixing and API update
* Added language flavour
* Added early stopping condition
* Corrected naming
* Corrected permissions
* Bugfix
* Added GPU support at runtime
* Wrong config package
* Refactoring
* Refactoring
* Add lightning to dependencies
* Dummy test
* Dummy test
* Tweak
* Tweak
* Update test
* Test
* Finished loading for UD CONLL-U format
* Working on tagger
* Work on tagger
* Tagger training
* Tagger training
* Tagger training
* Sync
* Sync
* Sync
* Sync
* Tagger working
* Better weight for aux loss
* Better weight for aux loss
* Added save and printing for tagger and shared options class
* Multilanguage evaluation
* Saving multiple models
* Updated ignore list
* Added XLM-Roberta support
* Using custom ro model
* Score update
* Bugfixing
* Code refactor
* Refactor
* Added option to load external config
* Added option to select LM-model from CLI or config
* Added option to overwrite config lm from CLI
* Bugfix
* Working on parser
* Sync work on parser
* Parser working
* Removed load limit
* Bugfix in evaluation
* Added bi-affine attention
* Added experimental ChuLiuEdmonds tree decoding
* Better config for parser and bugfix
* Added residuals to tagging
* Model update
* Switched to AdamW optimizer
* Working on tokenizer
* Working on tokenizer
* Training working - validation to do
* Bugfix in language id
* Working on tokenization validation
* Tokenizer working
* YAML update
* Bug in LMHelper
* Tagger is working
* Tokenizer is working
* Bugfix
* Bugfix
* Bugfix for bugfix :)
* Sync
* Tokenizer worker
* Tagger working
* Trainer updates
* Trainer process now working
* Added .DS_Store
* Added datasets for Compound Word Expander and Lemmatizer
* Added collate function for lemma+compound
* Added training and validation step
* Updated config for Lemmatizer
* Minor fixes
* Removed duplicate entries from lemma and cwe
* Added training support for lemmatizer
* Removed debug directives
* Lemmatizer in testing phase
* Removed unused line
* Bugfix in Lemma dataset
* Corrected validation issue with gs labels being sent to the forward method and removed loss computation during testing
* Lemmatizer training done
* Compound word expander ready
* Sync
* Added support for FastText, Transformers and Languasito LM models
* Added multi-LM support for tokenizer
* Added support for multiword tokens
* Sync
* Bugfix in evaluation
* Added Languasito as a subpackage
* Added path to local Languasito
* Bugfixing all around
* Removed debug printing
* Bugfix for no-space languages that actually contain spaces :)
* Bugfix for no-space languages that actually contain spaces :)
* Fixed GPU support
* Biaffine transform for LAS and relative head location (RHL) for UAS
* Bugfix
* Tweaks
* Moved RHL to lower layer
* Added configurable option for RHL
* Safety net for spaces in languages that should use no spaces
* Better defaults
* Sync
* Cleanup parser
* Bilinear xpos and attrs
* Added Biaffine module from Stanza
* Tagger with reduced number of parameters
* Parser with conditional attrs
* Working on tokenizer runtime
* Tokenizer process 90% done
* Added runtime for parser, tokenizer and tagger
* Added quick test for runtime
* Test for e2e
* Added support for multiple word embeddings at the same time
* Bugfix
* Added multiple word representations for tokenizer
* Moved mask_concat to utils.py
* Added XPOS prediction to pipeline
* Bugfix in tokenizer shifted word embeddings
* Using Languasito tokenizer for HF tokenization
* Bugfix
* Bugfixing
* Bugfixing
* Bugfix
* Runtime fixing
* Sync
* Added spa for FT and Languasito
* Added spa for FT and Languasito
* Minor tweaks
* Added configuration for RNN layers
* Bugfix for spa
* HF runtime fix
* Mixed test fasttext+transformer
* Added word reconstruction and MHA
* Sync
* Bugfix
* Bugfix
* Added masked attention
* Sync
* Added test for runtime
* Bugfix in mask values
* Updated test
* Added full mask dropout
* Added resume option
* Removed useless printouts
* Removed useless printouts
* Switched to eval at runtime
* Multiprocessing added
* Added full mask dropout for word decoder
* Bugfix
* Residual
* Added lexical-contextual cosine loss
* Removed full mask dropout from WordDecoder
* Bugfix
* Training script generation update
* Added residual
* Updated Languasito to pickle tokenized lines
* Updated Languasito to pickle tokenized lines
* Updated Languasito to pickle tokenized lines
* Not training for seq len > max_seq_len
* Added seq limits for collates
* Passing seq limits from collate to tokenizer
* Skipping complex parsing
* Working on word decomposer
* Model update
* Sync
* Bugfix
* Bugfix
* Bugfix
* Using all reprs
* Dropped immediate context
* Multi train script added
* Changed gpu parameter type to string, since int failed for multiple GPUs
* Updated pytorch_lightning callback method to work with newer version
* Updated pytorch_lightning callback method to work with newer version
* Transparently pass PL args from the command line; skip over empty compound word datasets
* Fix typo
* Refactoring and on the way to working API
* API load working
* Partial __call__ working
* Partial __call__ working
* Added partly working API and refactored everything back to cube/. Compound not working yet and tokenizer needs retraining.
* API is working
* Fixing API
* Updated README
* Updated README to include flavours
* Device support
* API update
* Updated package
* Tweak + results
* Clarification
* Test update
* Update
* Sync
* Update README
* Bugfixing
* Bugfix and API update
* Fixed compound
* Evaluation update
* Bugfix
* Package update
* Bugfix for large sentences
* Pip package update
* Corrected Spanish evaluation
* Package version update
* Fixed tokenization issues on transformers
* Removed pinned memory
* Bugfix for GPU tensors
* Update package version
* Automatically detecting hidden state size
* Automatically detecting hidden state size
* Automatically detecting hidden state size
* Sync
* Evaluation update
* Package update
* Bugfix
* Bugfixing
* Package version update
* Bugfix
* Package version update
* Update evaluation for Italian
* Tentative support for torchtext>=0.9.0 (#127), as mentioned in Lightning-AI/pytorch-lightning#6211 and #100
* Update package dependencies

Co-authored-by: Stefan Dumitrescu <[email protected]>
Co-authored-by: dumitrescustefan <[email protected]>
Co-authored-by: Tiberiu Boros <[email protected]>
Co-authored-by: Tiberiu Boros <[email protected]>
Co-authored-by: Koichi Yasuoka <[email protected]>
1 parent a16373a commit c759633

File tree

633 files changed (+27805 additions, -5676 deletions)


.circleci/config.yml

Lines changed: 13 additions & 55 deletions
@@ -1,61 +1,19 @@
-version: 2
+version: 2.1
+
+orbs:
+  python: circleci/[email protected]
+
 jobs:
-  test_api_and_main_and_upload:
-    docker:
-      - image: circleci/python
+  build-and-test:
+    executor: python/default
     steps:
       - checkout
-      - run:
-          name: init .pypirc
-          command: |
-            echo -e "[pypi]" >> ~/.pypirc
-      - run:
-          name: install requirements
-          command: |
-            sudo apt-get install -y libblas3 liblapack3
-            sudo apt-get install -y liblapack-dev libblas-dev
-            cd /home/circleci/project/
-            pip3 install --user -r requirements.txt
-      - run:
-          name: test main
-          command: |
-            cd /home/circleci/project/
-            python3 tests/main_tests.py
-      - run:
-          name: test api
-          command: |
-            cd /home/circleci/project/
-            python3 tests/api_tests.py
-      - run:
-          name: create packages
-          command: |
-            python3 setup.py sdist
-            python3 setup.py bdist_wheel
-      - run:
-          name: upload to pypi
-          command: |
-            if [[ "$PYPI_USERNAME" == "" ]]; then
-              echo "Skip upload"
-              exit 0
-            fi
-            python3 -m pip install --user jq
-            if [[ "$CIRCLE_BRANCH" == "master" ]]; then
-              PYPI="pypi.org"
-            else
-              PYPI="test.pypi.org"
-            fi
-            LATEST_VERSION="$(curl -s https://$PYPI/pypi/nlpcube/json | jq -r '.info.version')"
-            THIS_VERSION=`python3 <<< "import pkg_resources;print(pkg_resources.require('nlpcube')[0].version)"`
-            if [[ $THIS_VERSION != $LATEST_VERSION ]]; then
-              echo "\n\nthis: $THIS_VERSION - latest: $LATEST_VERSION => releasing to $PYPI\n\n"
-              python3 -m pip install --user --upgrade twine
-              python3 -m twine upload --repository-url https://$PYPI/legacy/ dist/* -u $PYPI_USERNAME -p $PYPI_PASSWORD || echo "Package already exists"
-            else
-              echo "this: $THIS_VERSION = latest: $LATEST_VERSION => skip release"
-            fi
+      - python/load-cache
+      - python/install-deps
+      - python/save-cache
+      - run: echo "done"
 
 workflows:
-  version: 2
-  test_api_and_main_and_upload:
+  main:
     jobs:
-      - test_api_and_main_and_upload
+      - build-and-test
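For reference, here is the new .circleci/config.yml assembled from the added lines above. The hand-rolled test-and-upload job is replaced by CircleCI's python orb, which restores the dependency cache, installs requirements, and saves the cache; the PyPI upload logic is dropped entirely:

version: 2.1

orbs:
  python: circleci/[email protected]

jobs:
  build-and-test:
    executor: python/default
    steps:
      - checkout
      - python/load-cache
      - python/install-deps
      - python/save-cache
      - run: echo "done"

workflows:
  main:
    jobs:
      - build-and-test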

.gitignore

Lines changed: 15 additions & 1 deletion
@@ -1,3 +1,15 @@
+.DS_Store
+Languasito/data/
+*.txt
+lightning_logs
+*.gz
+*.encodings
+*.npy
+data/*
+nlp-cube-models/*
+corpus/
+models/
+scripts/packer
 *.pyc
 build/
 dist/
@@ -11,12 +23,14 @@ cube/venv/*
 .idea/*
 venv/*
 cube/*.py
+*.json
 
-models/
+scratch/
 tests/scratch/*
 scripts/*.json
 scripts/*.conllu
 scripts/*.md
+scripts/wikiextractor.py
 
 # Jupyter notebooks
 notebooks/.ipynb_checkpoints/*

Languasito/.idea/.gitignore

Lines changed: 8 additions & 0 deletions

cube/.idea/cube.iml renamed to Languasito/.idea/Languasito.iml

Lines changed: 2 additions & 5 deletions

Languasito/.idea/inspectionProfiles/Project_Default.xml

Lines changed: 47 additions & 0 deletions

Languasito/.idea/inspectionProfiles/profiles_settings.xml

Lines changed: 6 additions & 0 deletions

cube/.idea/misc.xml renamed to Languasito/.idea/misc.xml

Lines changed: 1 addition & 1 deletion

cube/.idea/modules.xml renamed to Languasito/.idea/modules.xml

Lines changed: 1 addition & 1 deletion

Languasito/.idea/other.xml

Lines changed: 6 additions & 0 deletions
File renamed without changes.

Languasito/languasito/api.py

Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
1+
import sys
2+
import torch
3+
from typing import *
4+
5+
sys.path.append('')
6+
7+
from languasito.model import Languasito
8+
from languasito.utils import LanguasitoCollate
9+
from languasito.utils import Encodings
10+
11+
12+
class LanguasitoAPI:
13+
14+
def __init__(self, languasito: Languasito, encodings: Encodings):
15+
self._languasito = languasito
16+
self._languasito.eval()
17+
self._encodings = encodings
18+
self._collate = LanguasitoCollate(encodings, live=True)
19+
self._device = 'cpu'
20+
21+
def to(self, device: str):
22+
self._languasito.to(device)
23+
self._device = device
24+
25+
def __call__(self, batch):
26+
with torch.no_grad():
27+
x = self._collate.collate_fn(batch)
28+
for key in x:
29+
if isinstance(x[key], torch.Tensor):
30+
x[key] = x[key].to(self._device)
31+
rez = self._languasito(x)
32+
emb = []
33+
pred_emb = rez['emb'].detach().cpu().numpy()
34+
for ii in range(len(batch)):
35+
c_emb = []
36+
for jj in range(len(batch[ii])):
37+
c_emb.append(pred_emb[ii, jj])
38+
emb.append(c_emb)
39+
return emb
40+
41+
@staticmethod
42+
def load(model_name: str):
43+
from pathlib import Path
44+
home = str(Path.home())
45+
filename = '{0}/.languasito/{1}'.format(home, model_name)
46+
import os
47+
if os.path.exists(filename + '.encodings'):
48+
return LanguasitoAPI.load_local(filename)
49+
else:
50+
print("UserWarning: Model not found and automatic downloading is not yet supported")
51+
return None
52+
53+
@staticmethod
54+
def load_local(model_name: str):
55+
enc = Encodings()
56+
enc.load('{0}.encodings'.format(model_name))
57+
model = Languasito(enc)
58+
tmp = torch.load('{0}.best'.format(model_name), map_location='cpu')
59+
# model.load(tmp['state_dict'])
60+
model.load_state_dict(tmp['state_dict'])
61+
model.eval()
62+
api = LanguasitoAPI(model, enc)
63+
return api
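A minimal usage sketch for this API (not part of the commit): the model name 'xlmr-base' and the token batch below are hypothetical, and load() only resolves names that already exist under ~/.languasito/, since automatic downloading is not yet supported.

import sys
sys.path.append('')

from languasito.api import LanguasitoAPI

# Hypothetical model name; load() looks for ~/.languasito/<name>.encodings
api = LanguasitoAPI.load('xlmr-base')
if api is not None:
    api.to('cuda')  # optional: run inference on GPU
    # A batch is a list of pre-tokenized sentences.
    batch = [['This', 'is', 'a', 'test', '.'], ['Another', 'one', '.']]
    embeddings = api(batch)
    # embeddings[i][j] is the vector for token j of sentence i.
    print(len(embeddings), len(embeddings[0]))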

0 commit comments
