
transformers

Here are 488 public repositories matching this topic...

tokenizers
david-waterworth commented Feb 27, 2021

The Split class accepts a SplitDelimiterBehavior, which is really useful. Punctuation, however, always uses SplitDelimiterBehavior::Isolated (while Whitespace, on the other hand, behaves like SplitDelimiterBehavior::Removed).

impl PreTokenizer for Punctuation {
    fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()> {
        // The behavior is hard-coded to Isolated rather than being configurable:
        pretokenized.split(|_, s| s.split(is_punc, SplitDelimiterBehavior::Isolated))
    }
}
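The distinction the comment draws can be sketched in plain Rust. This is a minimal, self-contained illustration of the two behaviors involved (the `split_with` helper is hypothetical; only the variant names mirror the tokenizers enum): `Removed` drops the delimiter, as Whitespace effectively does, while `Isolated` keeps it as its own piece, which is what Punctuation is locked into.

```rust
// Sketch of two SplitDelimiterBehavior semantics; `split_with` is a
// hypothetical helper, not the tokenizers crate API.
#[derive(Clone, Copy)]
enum SplitDelimiterBehavior {
    Removed,  // delimiter is dropped from the output (Whitespace-like)
    Isolated, // delimiter becomes its own piece (Punctuation's fixed choice)
}

fn split_with(s: &str, is_delim: fn(char) -> bool, behavior: SplitDelimiterBehavior) -> Vec<String> {
    let mut out = Vec::new();
    let mut cur = String::new();
    for c in s.chars() {
        if is_delim(c) {
            // Flush the piece accumulated so far, if any.
            if !cur.is_empty() {
                out.push(std::mem::take(&mut cur));
            }
            // Isolated keeps the delimiter as a standalone piece; Removed drops it.
            if let SplitDelimiterBehavior::Isolated = behavior {
                out.push(c.to_string());
            }
        } else {
            cur.push(c);
        }
    }
    if !cur.is_empty() {
        out.push(cur);
    }
    out
}

fn main() {
    let is_punc = |c: char| c.is_ascii_punctuation();
    // Isolated: ["hello", ",", " world", "!"]
    println!("{:?}", split_with("hello, world!", is_punc, SplitDelimiterBehavior::Isolated));
    // Removed: ["hello", " world"]
    println!("{:?}", split_with("hello, world!", is_punc, SplitDelimiterBehavior::Removed));
}
```

A configurable Punctuation pre-tokenizer would simply thread a stored behavior through in place of the hard-coded variant, the way Split already does.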
pytorch-original-transformer

My implementation of the original transformer model (Vaswani et al.). I've also included the playground.py file for visualizing otherwise seemingly hard concepts. IWSLT pretrained models are currently included.

  • Updated Dec 27, 2020
  • Jupyter Notebook
sdtblck commented Mar 16, 2021

It would be good to replace the idiosyncratic tokenizer class Megatron has with HF tokenizers, or at least to accept both types of tokenizer. The HF library is intuitive for training and easy to use, and this would allow us to experiment with different tokenization schemes, etc.

The tokenizer code is currently spread across the repo, so this might be a bit of a task.
