language-model

Consider this code that downloads models and tokenizers to disk and then uses BertTokenizer.from_pretrained to load the tokenizer from disk.

ISSUE: BertTokenizer.from_pretrained() does not seem to be compatible with Python's native pathlib module.

# -*- coding: utf-8 -*-
"""
Created on: 25-04-2020
Author: MacwanJ

ISSUE:
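A possible workaround for the pathlib issue reported above is to convert the `Path` to a plain string before passing it on. The sketch below is only an illustration under that assumption, not a fix confirmed by the library maintainers. The helper name `to_pretrained_arg` and the directory `models/bert-base-uncased` are hypothetical, and the `from_pretrained` call is left as a comment so the snippet runs without the library installed.

```python
import os
from pathlib import Path


def to_pretrained_arg(path):
    """Return a plain str for a local path (hypothetical helper).

    os.fspath accepts both str and pathlib.Path objects and returns
    the underlying string representation of the path.
    """
    return os.fspath(path)


# Hypothetical local directory where the tokenizer files were saved.
model_dir = Path("models") / "bert-base-uncased"
tokenizer_arg = to_pretrained_arg(model_dir)

# Assumed usage, requires the transformers package:
# from transformers import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained(tokenizer_arg)
```

Passing the string form keeps the rest of the code free to use `pathlib` for building and checking paths.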

The position embedding in BERT is not the same as in the original Transformer. Why not use the form used in BERT?
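For context, the original Transformer uses fixed sinusoidal position encodings, while BERT learns its position embeddings as a trainable lookup table. The sketch below contrasts the two in plain Python. The function names, the tiny dimensions, and the initialisation scale of 0.02 are illustrative assumptions, and the learned table shows only its random starting values, not any training.

```python
import math
import random


def sinusoidal_position_encoding(max_len, d_model):
    """Fixed encoding: sine on even dimensions, cosine on odd dimensions."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def learned_position_embedding(max_len, d_model, seed=0):
    """Trainable lookup table; only the random initialisation is shown."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(max_len)]


fixed = sinusoidal_position_encoding(max_len=4, d_model=6)
learned = learned_position_embedding(max_len=4, d_model=6)
```

The fixed table is fully determined by position and dimension, whereas the learned table has one free parameter per entry that a model would update during training.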

spaCy has customizable word-level tokenizers with rules for multiple languages. I think porting that to Rust would be a nice addition to this package. Having customizable, uniform word-level tokenization across platforms (client web, server) and languages would be beneficial. Currently, I don't know of any clean way to write bindings for spaCy's Cython code, or whether it is even possible.

Spacy Tokenizer Code

https:
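To illustrate what customizable, rule-based word-level tokenization can look like, here is a minimal sketch using only Python's `re` module. It is not spaCy's algorithm and not part of any library mentioned here. The rule names and patterns in `TOKEN_RULES` are assumptions chosen for the example, and the order of the rules decides which pattern wins.

```python
import re

# Hypothetical rule set: earlier rules take priority over later ones.
TOKEN_RULES = [
    ("URL", r"https?://\S+"),       # keep whole URLs as one token
    ("WORD", r"\w+(?:'\w+)?"),      # words, including simple contractions
    ("PUNCT", r"[^\w\s]"),          # any single non-word, non-space character
]

TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES)
)


def tokenize(text):
    """Return (rule_name, token) pairs; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


tokens = tokenize("Don't split http://example.com/a-b here!")
```

Because the rules are plain data, the same list could in principle be shared across platforms and languages, which is the kind of uniformity the request above asks for.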

Hi,

When we try to tokenize a sentence that contains a URL with spaCy:

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("I like the link http://www.idph.iowa.gov/ohds/oral-health-center/coordinator")
list(doc)

we get:

[I, like, the, link, http://www.idph.iowa.gov, /, ohds, /, oral, -, health, -, center, /, coordinator]

But if we use the spaCy transformer tokenizer:

I think the filenames in models.sh referred to on lines 4-9 should refer to kaldi-generic-en-tdnn_f-r20190609* which is downloaded on line 3.

Here are 530 public repositories matching this topic...

huggingface / transformers

brightmart / nlp_chinese_corpus

codertimo / BERT-pytorch

huggingface / tokenizers

tensorflow / lingvo

chiphuyen / lazynlp

CyberZHG / keras-bert

salesforce / awd-lstm-lm

zzw922cn / awesome-speech-recognition-speech-synthesis-papers

JohnSnowLabs / spark-nlp

NVIDIA / OpenSeq2Seq

huggingface / pytorch-openai-transformer-lm

CLUEbenchmark / CLUE

mihail911 / nlp-library

brightmart / bert_language_understanding

explosion / spacy-transformers

LiyuanLucasLiu / LM-LSTM-CRF

nlpodyssey / spago

smilelight / lightNLP

codekansas / keras-language-modeling

prabhuomkar / pytorch-cpp

pykaldi / pykaldi

IsaacChanghau / DL-NLP-Readings

brightmart / sentiment_analysis_fine_grain

ymcui / Chinese-ELECTRA

lonePatient / albert_pytorch

githubharald / CTCDecoder

deepset-ai / haystack

SKTBrain / KoBERT

cedrickchee / awesome-bert-nlp
