language-model
Here are 819 public repositories matching this topic...
chooses 15% of tokens
The paper states: "Instead, the training data generator chooses 15% of tokens at random, e.g., in the sentence my dog is hairy it chooses hairy."
This reads as if exactly 15% of the tokens are guaranteed to be chosen.
However, in https://github.com/codertimo/BERT-pytorch/blob/master/bert_pytorch/dataset/dataset.py#L68, each token independently has a 15% chance of going through the follow-up procedure, so the actual fraction of selected tokens varies around 15%.
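A minimal sketch of that per-token procedure (the names and vocabulary here are illustrative, not the repo's actual code): each token is independently selected with probability 0.15, and a selected token becomes [MASK] 80% of the time, a random token 10% of the time, and stays unchanged 10% of the time.

```python
import random

MASK = "[MASK]"
VOCAB = ["my", "dog", "is", "hairy", "cat", "runs"]  # toy vocabulary

def mask_tokens(tokens, p=0.15, seed=None):
    """Per-token masking in the style of BERT-pytorch's dataset.py:
    every token is *independently* selected with probability p
    (not a fixed 15% count per sentence)."""
    rng = random.Random(seed)
    out, labels = [], []
    for tok in tokens:
        if rng.random() < p:              # selected for prediction
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                out.append(MASK)          # 80%: replace with [MASK]
            elif r < 0.9:
                out.append(rng.choice(VOCAB))  # 10%: random token
            else:
                out.append(tok)           # 10%: keep the original token
        else:
            out.append(tok)
            labels.append(None)           # not predicted
    return out, labels
```

Over a long input, the fraction of selected tokens converges to 15%, but for any single short sentence it can be 0%, 15%, or more — which is exactly the difference from "chooses 15% of tokens for sure."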
PositionalEmbedding
_handle_duplicate_documents and _drop_duplicate_documents in the ElasticSearch document store will always report self.index as the index with the conflict, which is obviously incorrect.
Edit: Upon further investigation, this is actually a lot worse. Using multiple indices with the ElasticSearch DocumentStore is completely broken, because self.index is used in `_handle_duplicate_documents`.
Issue to track tutorial requests:
- Deep Learning with PyTorch: A 60 Minute Blitz - #69
- Sentence Classification - #79
Fast Tokenizer for DeBERTA-V3 and mDeBERTa-V3
Motivation
DeBERTa V3 is an improved version of DeBERTa. With the V3 version, the authors also released a multilingual model "mDeBERTa-base" that outperforms XLM-R-base. However, DeBERTa V3 currently lacks a FastTokenizer implementation, which makes it impossible to use with some of the example scripts (they require a FastTokenizer).