language-model

Hello,

The code says that compatibility with Postponed Evaluation of Annotations (PEP 563) will be added once Python 3.9 is released, which already happened on 2020-10-05. Is there any plan to complete this?

https://github.com/huggingface/transformers/blob/2c2a31ffbcfe03339b1721348781aac4fc05bc5e/src/transformers/hf_argparser.py#L85-L90
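For context, here is a minimal sketch of why postponed evaluation matters to a dataclass-based argument parser. It is not the code from `hf_argparser.py`; the `TrainingArgs` class and both helper functions are hypothetical names made up for this illustration. The fields use explicit string annotations, which is the form in which PEP 563 stores every annotation, so the raw field type is a string until it is resolved with `typing.get_type_hints`.

```python
import dataclasses
import typing


@dataclasses.dataclass
class TrainingArgs:
    # Hypothetical example class. The quoted annotations mimic what
    # `from __future__ import annotations` (PEP 563) does to every annotation.
    learning_rate: "float" = 5e-5
    epochs: "int" = 3
    run_name: "str" = "demo"


def raw_field_types(cls):
    """Return the annotations exactly as the dataclass stored them."""
    return {f.name: f.type for f in dataclasses.fields(cls)}


def resolved_field_types(cls):
    """Return the annotations after evaluating the stored strings."""
    hints = typing.get_type_hints(cls)
    return {f.name: hints[f.name] for f in dataclasses.fields(cls)}


raw = raw_field_types(TrainingArgs)
resolved = resolved_field_types(TrainingArgs)
```

A parser that passes the raw field type straight to `argparse` as a converter would receive the string `"float"` rather than the `float` class, which is presumably the kind of incompatibility the linked comment refers to.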

I wonder if it would be useful to have a sequence object for the decoders too.

It seems to me, for example, that if we build a tokenizer with a BPE model that defines an `end_of_word_suffix`, we will need the BPEDecoder decoder to replace the `end_of_word_suffix`, and if we also used a ByteLevel pre-tokenization, we will additionally need the ByteLevel decoder to realign the codes.
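The idea of chaining decoders can be sketched in plain Python as follows. This does not use the tokenizers API: `DecoderSequence`, `suffix_step`, and `toy_byte_step` are hypothetical names, and the byte step is a deliberate simplification that treats each character as one Latin-1 byte instead of using the real ByteLevel byte-to-character table.

```python
from typing import Callable, Iterable, List

# A decoding step maps a list of token strings to a new list of token strings.
DecoderStep = Callable[[List[str]], List[str]]


def suffix_step(suffix: str) -> DecoderStep:
    """Build a step that turns a trailing end-of-word suffix into a space."""
    if not suffix:
        raise ValueError("suffix must be non-empty")

    def step(tokens: List[str]) -> List[str]:
        return [
            tok[: -len(suffix)] + " " if tok.endswith(suffix) else tok
            for tok in tokens
        ]

    return step


def toy_byte_step(tokens: List[str]) -> List[str]:
    """Toy stand-in for byte-level realignment (assumption, not ByteLevel).

    Every character is read as a single Latin-1 byte and the joined byte
    string is decoded as UTF-8, so multi-byte characters are reassembled.
    """
    raw = "".join(tokens).encode("latin-1")
    return [raw.decode("utf-8", errors="replace")]


class DecoderSequence:
    """Apply several decoding steps in order, then join the tokens."""

    def __init__(self, steps: Iterable[DecoderStep]) -> None:
        self.steps = list(steps)

    def decode(self, tokens: List[str]) -> str:
        for step in self.steps:
            tokens = step(tokens)
        return "".join(tokens).rstrip()


# "\u00c3\u00a9" is the two-byte UTF-8 encoding of "é", one character per byte.
tokens = ["caf", "\u00c3\u00a9</w>", "no", "ir</w>"]
decoder = DecoderSequence([suffix_step("</w>"), toy_byte_step])
decoded = decoder.decode(tokens)
```

The order of the steps matters here: the suffix has to be replaced before the byte step joins the tokens, which is one reason an explicit sequence object is convenient.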

At the moment, i

The paper says:

"Instead, the training data generator chooses 15% of tokens at random, e.g., in the sentence my dog is hairy it chooses hairy."

This means that a fixed 15% of the tokens is guaranteed to be chosen.

In https://github.com/codertimo/BERT-pytorch/blob/master/bert_pytorch/dataset/dataset.py#L68, however, every single token independently has a 15% chance of going through the follow-up procedure.
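The difference between the two readings can be illustrated with a small sketch. Neither function is taken from the paper's code or from BERT-pytorch; the function names, the rounding rule, and the minimum of one selected token are assumptions made for this example.

```python
import random
from typing import List


def select_fixed_fraction(num_tokens: int, rate: float, rng: random.Random) -> List[int]:
    """Pick a fixed number of positions, so the selected share is always ~rate."""
    if num_tokens == 0:
        return []
    # Assumption: round to the nearest count and select at least one token.
    k = min(num_tokens, max(1, round(num_tokens * rate)))
    return sorted(rng.sample(range(num_tokens), k))


def select_per_token(num_tokens: int, rate: float, rng: random.Random) -> List[int]:
    """Flip an independent coin per token, so the selected count varies."""
    return [i for i in range(num_tokens) if rng.random() < rate]


rng = random.Random(0)
fixed_counts = [len(select_fixed_fraction(100, 0.15, rng)) for _ in range(5)]
per_token_counts = [len(select_per_token(100, 0.15, rng)) for _ in range(5)]
```

With 100 tokens, `fixed_counts` contains 15 on every draw, while `per_token_counts` only averages 15 and can be lower or higher for any single sequence, including zero for short ones.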

`_handle_duplicate_documents` and `_drop_duplicate_documents` in the Elasticsearch document store always report `self.index` as the index with the conflict, which is incorrect.

Edit: Upon further investigation, this is actually a lot worse. Using multiple indices with the ElasticSearch DocumentStore is completely broken because this is used in `_handle_duplicate_do

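Polyphonic characters are currently handled with pypinyin or g2pM, whose accuracy is limited. I would like to build a polyphone prediction model based on BERT (or ERNIE). Put simply, suppose a language has 100 polyphonic characters and each of them has at most 3 pronunciations. We can then attach 100 three-way classifiers (simple fc layers are enough) on top of BERT, and at prediction time look up the classifier that belongs to the character and use it for classification.
Reference paper:
tencent_polyphone.pdf

The data provided by https://github.com/kakaobrain/g2pM can be used.

Advanced: multi-task BERT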
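The one-classifier-per-character idea described above can be sketched without any deep learning framework. Everything here is an assumption for illustration: the `PolyphoneHeads` class, the toy hidden size, the randomly initialised and untrained weights, and the small pronunciation table. A plain list of floats stands in for the encoder's hidden state at the character's position.

```python
import random
from typing import Dict, List, Sequence

HIDDEN_SIZE = 8          # toy stand-in for the encoder's hidden size
MAX_PRONUNCIATIONS = 3   # every head has this many output slots

# Illustrative table only; a real one would come from the training data.
PRONUNCIATIONS: Dict[str, List[str]] = {
    "行": ["xing2", "hang2"],
    "乐": ["le4", "yue4"],
    "重": ["zhong4", "chong2"],
}


class PolyphoneHeads:
    """One small linear classifier per polyphonic character."""

    def __init__(self, pronunciations: Dict[str, List[str]], hidden_size: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        for char, options in pronunciations.items():
            if not 1 <= len(options) <= MAX_PRONUNCIATIONS:
                raise ValueError(f"unsupported number of pronunciations for {char!r}")
        self.pronunciations = pronunciations
        # Untrained weights: one (MAX_PRONUNCIATIONS x hidden_size) matrix per character.
        self.weights = {
            char: [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
                   for _ in range(MAX_PRONUNCIATIONS)]
            for char in pronunciations
        }
        self.biases = {char: [0.0] * MAX_PRONUNCIATIONS for char in pronunciations}

    def logits(self, char: str, hidden: Sequence[float]) -> List[float]:
        """Apply the linear layer that belongs to `char`."""
        rows, biases = self.weights[char], self.biases[char]
        return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(rows, biases)]

    def predict(self, char: str, hidden: Sequence[float]) -> str:
        """Route to the character's head and return its best pronunciation."""
        options = self.pronunciations[char]
        scores = self.logits(char, hidden)[: len(options)]  # skip unused slots
        best = max(range(len(options)), key=scores.__getitem__)
        return options[best]


heads = PolyphoneHeads(PRONUNCIATIONS, HIDDEN_SIZE)
hidden_state = [0.1] * HIDDEN_SIZE  # placeholder for an encoder output vector
prediction = heads.predict("行", hidden_state)
```

Because the weights are untrained, the prediction is arbitrary. The sketch only shows the routing: the character selects its own head, and output slots beyond that character's pronunciation count are ignored.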
![image](https://user-images.githubusercontent.com/24568452

Issue to track tutorial requests:

Deep Learning with PyTorch: A 60 Minute Blitz - #69
Sentence Classification - #79

Here are 854 public repositories matching this topic...

huggingface / transformers

brightmart / nlp_chinese_corpus

EleutherAI / gpt-neo

huggingface / tokenizers

codertimo / BERT-pytorch

deepset-ai / haystack

speechbrain / speechbrain

PaddlePaddle / PaddleSpeech

CLUEbenchmark / CLUE

tensorflow / lingvo

CyberZHG / keras-bert

zzw922cn / awesome-speech-recognition-speech-synthesis-papers

Separius / awesome-sentence-embedding

chiphuyen / lazynlp

salesforce / awd-lstm-lm

EleutherAI / gpt-neox

NVIDIA / OpenSeq2Seq

huggingface / pytorch-openai-transformer-lm

prabhuomkar / pytorch-cpp

nlpodyssey / spago

ymcui / Chinese-ELECTRA

explosion / spacy-transformers

mihail911 / nlp-library

brightmart / bert_language_understanding

microsoft / DeBERTa

pykaldi / pykaldi

LiyuanLucasLiu / LM-LSTM-CRF

SKTBrain / KoBERT

smilelight / lightNLP

IsaacChanghau / DL-NLP-Readings
