News

[05 August 2021] - We are releasing version 3.0 of NLP-Cube and its models, and introducing FLAVOURS. This is a major update, but we did our best to maintain the same API, so existing code should keep working. The list of supported languages is smaller, but you can open an issue for an unsupported language and we will do our best to add it. Alternatively, you can pin the pip package to version 1.0.8: pip install nlpcube==0.1.0.8.

[15 April 2019] - We are releasing version 1.1 models - check all supported languages below. Both 1.0 and 1.1 models are trained on the same UD 2.2 corpus; however, the 1.1 models do not use vector embeddings, which reduces the disk space and time required to use them. Some languages gain slightly in accuracy, while others lose a little. By default, NLP-Cube will use the latest (at this time) 1.1 models.

To use the older 1.0 models just specify this version in the load call: cube.load("en", 1.0) (en for English, or any other language code). This will download (if not already downloaded) and use this specific model version. Same goes for any language/version you want to use.

If you already have NLP Cube installed and want to use the newer 1.1 models, type either cube.load("en", 1.1) or cube.load("en", "latest") to auto-download them. After this, calling cube.load("en") without version number will automatically use the latest ones from your disk.


NLP-Cube

NLP-Cube is an open-source Natural Language Processing framework with support for the languages included in the UD Treebanks (list of all available languages below). Use NLP-Cube if you need:

  • Sentence segmentation
  • Tokenization
  • POS Tagging (both language independent (UPOSes) and language dependent (XPOSes and ATTRs))
  • Lemmatization
  • Dependency parsing

Example input: "This is a test.", output is:

1       This    this    PRON    DT      Number=Sing|PronType=Dem        4       nsubj   _
2       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4       cop     _
3       a       a       DET     DT      Definite=Ind|PronType=Art       4       det     _
4       test    test    NOUN    NN      Number=Sing     0       root    SpaceAfter=No
5       .       .       PUNCT   .       _       4       punct   SpaceAfter=No
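
If you save output like the above to a file, a few lines of Python are enough to read it back. The following is only a minimal sketch: it assumes whitespace-separated rows with the nine columns shown in the sample, and the column name `misc` for the last field is a hypothetical label, not something NLP-Cube defines.

```python
from typing import Dict, List

# Column names follow the token attributes used elsewhere in this README.
# "misc" is a hypothetical name for the last column of the sample output,
# which shows nine columns rather than the ten of full CoNLL-U.
COLUMNS = ["index", "word", "lemma", "upos", "xpos", "attrs", "head", "label", "misc"]


def parse_rows(block: str) -> List[Dict[str, str]]:
    """Split whitespace-separated annotation rows into dictionaries."""
    rows = []
    for line in block.strip().splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}: {line!r}")
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def parse_attrs(attrs: str) -> Dict[str, str]:
    """Turn 'Number=Sing|PronType=Dem' into a dict; '_' means no attributes."""
    if attrs == "_":
        return {}
    return dict(pair.split("=", 1) for pair in attrs.split("|"))


sample = """
1 This this PRON DT Number=Sing|PronType=Dem 4 nsubj _
2 is be AUX VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 cop _
3 a a DET DT Definite=Ind|PronType=Art 4 det _
4 test test NOUN NN Number=Sing 0 root SpaceAfter=No
5 . . PUNCT . _ 4 punct SpaceAfter=No
"""

rows = parse_rows(sample)
# The root of the sentence is the token whose head is 0.
root = next(row for row in rows if row["head"] == "0")
print(root["word"], parse_attrs(rows[0]["attrs"]))
```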

If you just want to run it, here's how to set it up and use NLP-Cube in a few lines: Quick Start Tutorial.

For advanced users that want to create and train their own models, please see the Advanced Tutorials in examples/, starting with how to locally install NLP-Cube.

Simple (PIP) installation / update version

Install (or update) NLP-Cube with:

pip3 install -U nlpcube

API Usage

To use NLP-Cube programmatically (in Python), follow this tutorial. The summary would be:

from cube.api import Cube       # import the Cube object
cube=Cube(verbose=True)         # initialize it
cube.load("en", device='cpu')   # select the desired language (it will auto-download the model on first run)
text="This is the text I want segmented, tokenized, lemmatized and annotated with POS and dependencies."
document=cube(text)            # call with your own text (string) to obtain the annotations

The document object now contains the annotated text, one sentence at a time. To print the third word's POS (in the first sentence), just run:

print(document.sentences[0][2].upos) # [0] is the first sentence and [2] is the third word

Each token object has the following attributes: index, word, lemma, upos, xpos, attrs, head, label, deps, space_after. For detailed info about each attribute please see the standard CoNLL format.
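
As an illustration of how these attributes fit together, here is a small sketch that walks over one sentence and prints tab-separated rows. The `Token` dataclass is a hypothetical stand-in so the example runs without a downloaded model; it only mirrors the attribute names listed above. It also assumes a sentence can be iterated like a list and that `head` is an integer, neither of which is confirmed here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Hypothetical stand-in that mirrors the attribute names listed above."""
    index: int
    word: str
    lemma: str
    upos: str
    xpos: str
    attrs: str
    head: int
    label: str
    deps: str = "_"
    space_after: str = "_"


FIELDS = ["index", "word", "lemma", "upos", "xpos", "attrs",
          "head", "label", "deps", "space_after"]


def sentence_to_rows(sentence) -> List[str]:
    """Format every token of a sentence as one tab-separated row."""
    return ["\t".join(str(getattr(tok, name)) for name in FIELDS) for tok in sentence]


def dependents_of(sentence, head_index: int) -> List[str]:
    """Return the words whose syntactic head is the token with the given index."""
    return [tok.word for tok in sentence if tok.head == head_index]


# Values copied from the "This is a test." sample output.
sentence = [
    Token(1, "This", "this", "PRON", "DT", "Number=Sing|PronType=Dem", 4, "nsubj"),
    Token(2, "is", "be", "AUX", "VBZ", "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin", 4, "cop"),
    Token(3, "a", "a", "DET", "DT", "Definite=Ind|PronType=Art", 4, "det"),
    Token(4, "test", "test", "NOUN", "NN", "Number=Sing", 0, "root"),
    Token(5, ".", ".", "PUNCT", ".", "_", 4, "punct"),
]

rows = sentence_to_rows(sentence)
print("\n".join(rows))
print(dependents_of(sentence, 4))
```

With a real annotated document, the same functions could be tried on `document.sentences[0]`, provided the assumptions above hold for the installed version.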

Flavours

Previous versions of NLP-Cube were trained on individual treebanks. This means that the same language was supported by multiple models at the same time. For instance, you could parse English (en) text with en_ewt, en_esl, en_lines, etc. The current version of NLP-Cube combines all flavours of a treebank under the same umbrella, by jointly optimizing a conditioned model. You only need to load the base language, for example en, and then select which flavour to apply at runtime:

from cube.api import Cube       # import the Cube object
cube=Cube(verbose=True)         # initialize it
cube.load("en", device='cpu')   # select the desired language (it will auto-download the model on first run)
text="This is the text I want segmented, tokenized, lemmatized and annotated with POS and dependencies."


# Parse using the default flavour (in this case EWT)
document=cube(text)            # call with your own text (string) to obtain the annotations
# or you can specify a flavour
document=cube(text, flavour='en_lines') 
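
To compare how different flavours annotate the same text, you can loop over flavour names and collect the results. This is a rough sketch: `annotate_with_flavours` is a hypothetical helper, `fake_cube` is a stand-in so the example runs without a model, and `"en_ewt"` is assumed to be a valid flavour name (only `'en_lines'` appears in the example above).

```python
from typing import Callable, Dict, Iterable


def annotate_with_flavours(cube: Callable, text: str, flavours: Iterable[str]) -> Dict[str, object]:
    """Run the same text through each flavour and collect the results by name."""
    return {flavour: cube(text, flavour=flavour) for flavour in flavours}


def fake_cube(text: str, flavour: str = None) -> str:
    """Stand-in for a loaded Cube object; it returns a string, not a real document."""
    return f"{flavour}: {len(text.split())} whitespace-separated chunks"


results = annotate_with_flavours(fake_cube, "This is a test.", ["en_ewt", "en_lines"])
for name, result in results.items():
    print(name, "->", result)
```

In real use you would pass the `cube` object created and loaded as shown above instead of `fake_cube`.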

Webserver Usage

The current version has dropped webserver support, since most people preferred to implement their own NLP-Cube as a service.

Cite

If you use NLP-Cube in your research we would be grateful if you would cite the following paper:

  • NLP-Cube: End-to-End Raw Text Processing With Neural Networks, Boroș, Tiberiu and Dumitrescu, Stefan Daniel and Burtica, Ruxandra, Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics. p. 171--179. October 2018

or, in bibtex format:

@InProceedings{boro-dumitrescu-burtica:2018:K18-2,
  author    = {Boroș, Tiberiu  and  Dumitrescu, Stefan Daniel  and  Burtica, Ruxandra},
  title     = {{NLP}-Cube: End-to-End Raw Text Processing With Neural Networks},
  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {171--179},
  abstract  = {We introduce NLP-Cube: an end-to-end Natural Language Processing framework, evaluated in CoNLL's "Multilingual Parsing from Raw Text to Universal Dependencies 2018" Shared Task. It performs sentence splitting, tokenization, compound word expansion, lemmatization, tagging and parsing. Based entirely on recurrent neural networks, written in Python, this ready-to-use open source system is freely available on GitHub. For each task we describe and discuss its specific network architecture, closing with an overview on the results obtained in the competition.},
  url       = {http://www.aclweb.org/anthology/K18-2017}
}

Languages and performance

For comparison, the performance of 3.0 models is reported on the 2.2 UD corpus, but distributed models are obtained from UD 2.7.

Results are reported against the test files for each language (available in the UD 2.2 corpus) using the 2018 conll eval script. Please see more info about what each metric represents here.

Notes:

  • version 1.1 of the models no longer needs the large external vector embedding files. This makes loading the 1.1 models faster and less RAM-intensive.
  • all results reported here are end-to-end (e.g. we test the tagging accuracy on our own segmented text, as this is the real use case; CoNLL results are mostly reported on "gold", or pre-segmented, text, leading to higher accuracy for the tagger/parser/etc.)

| Language | Model | Token | Sentence | UPOS | XPOS | AllTags | Lemmas | UAS | LAS |
|---|---|---|---|---|---|---|---|---|---|
| Chinese | zh-1.0 | 93.03 | 99.10 | 88.22 | 88.15 | 86.91 | 92.74 | 73.43 | 69.52 |
| | zh-1.1 | 92.34 | 99.10 | 86.75 | 86.66 | 85.35 | 92.05 | 71.00 | 67.04 |
| | zh-3.0 | 95.88 | 87.36 | 91.67 | 83.54 | 82.74 | 85.88 | 79.15 | 70.08 |
| English | en-1.0 | 99.25 | 72.8 | 95.34 | 94.83 | 92.48 | 95.62 | 84.7 | 81.93 |
| | en-1.1 | 99.2 | 70.94 | 94.4 | 93.93 | 91.04 | 95.18 | 83.3 | 80.32 |
| | en-3.0 | 98.95 | 75.00 | 96.01 | 95.71 | 93.75 | 96.06 | 87.06 | 84.61 |
| French | fr-1.0 | 99.68 | 94.2 | 92.61 | 95.46 | 90.79 | 93.08 | 84.96 | 80.91 |
| | fr-1.1 | 99.67 | 95.31 | 92.51 | 95.45 | 90.8 | 93.0 | 83.88 | 80.16 |
| | fr-3.0 | 99.71 | 93.92 | 97.33 | 99.56 | 96.61 | 90.79 | 89.81 | 87.24 |
| German | de-1.0 | 99.7 | 81.19 | 91.38 | 94.26 | 80.37 | 75.8 | 79.6 | 74.35 |
| | de-1.1 | 99.77 | 81.99 | 90.47 | 93.82 | 79.79 | 75.46 | 79.3 | 73.87 |
| | de-3.0 | 99.77 | 86.25 | 94.70 | 97.00 | 85.02 | 82.73 | 87.08 | 82.69 |
| Hungarian | hu-1.0 | 99.8 | 94.18 | 94.52 | 99.8 | 86.22 | 91.07 | 81.57 | 75.95 |
| | hu-1.1 | 99.88 | 97.77 | 93.11 | 99.88 | 86.79 | 91.18 | 77.89 | 70.94 |
| | hu-3.0 | 99.75 | 91.64 | 96.43 | 99.75 | 89.89 | 91.31 | 86.34 | 81.29 |
| Italian | it-1.0 | 99.89 | 98.14 | 86.86 | 86.67 | 84.97 | 87.03 | 78.3 | 74.59 |
| | it-1.1 | 99.92 | 99.07 | 86.58 | 86.4 | 84.53 | 86.75 | 76.38 | 72.35 |
| | it-3.0 | 99.92 | 98.13 | 98.26 | 98.15 | 97.34 | 97.76 | 94.07 | 92.66 |
| Romanian (RO-RRT) | ro-1.0 | 99.74 | 95.56 | 97.42 | 96.59 | 95.49 | 96.91 | 90.38 | 85.23 |
| | ro-1.1 | 99.71 | 95.42 | 96.96 | 96.32 | 94.98 | 96.57 | 90.14 | 85.06 |
| | ro-3.0 | 99.80 | 95.64 | 97.67 | 97.11 | 96.76 | 97.55 | 92.06 | 87.67 |
| Spanish | es-1.0 | 99.98 | 98.32 | 98.0 | 98.0 | 96.62 | 98.05 | 90.53 | 88.27 |
| | es-1.1 | 99.98 | 98.40 | 98.01 | 98.00 | 96.6 | 97.99 | 90.51 | 88.16 |
| | es-3.0 | 99.96 | 97.17 | 96.88 | 99.91 | 94.88 | 98.17 | 92.11 | 89.86 |
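
For readers unfamiliar with the last two columns: UAS counts tokens whose predicted head is correct, and LAS additionally requires the dependency label to be correct. The sketch below illustrates this under a simplifying assumption, namely that gold and predicted sentences have identical tokenization. The end-to-end numbers above do not make that assumption, since the official script first aligns system tokens to gold tokens, so this is not a substitute for it. The function name and the example predictions are made up for illustration.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages, assuming identical tokenization."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    # UAS: the head matches; LAS: both head and label match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Gold arcs taken from the "This is a test." sample; predictions are invented.
gold = [(4, "nsubj"), (4, "cop"), (4, "det"), (0, "root"), (4, "punct")]
pred = [(4, "nsubj"), (4, "aux"), (4, "det"), (0, "root"), (3, "punct")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Here one token has the wrong head and another has the right head but the wrong label, so LAS is lower than UAS.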