
transformers

Here are 488 public repositories matching this topic...

tokenizers
david-waterworth commented Feb 27, 2021

The Split class accepts a SplitDelimiterBehavior, which is really useful. Punctuation, however, always uses SplitDelimiterBehavior::Isolated (while Whitespace, on the other hand, behaves like SplitDelimiterBehavior::Removed).

impl PreTokenizer for Punctuation {
    fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()> {
        // The behavior is hard-coded to Isolated rather than being configurable:
        pretokenized.split(|_, s| s.split(is_punc, SplitDelimiterBehavior::Isolated))
    }
}
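The distinction the comment draws can be sketched in plain Rust. This is a minimal, self-contained illustration of the two behaviors involved (the `split_with` helper is hypothetical; only the variant names mirror the tokenizers enum): `Removed` drops the delimiter, as Whitespace effectively does, while `Isolated` keeps it as its own piece, which is what Punctuation is locked into.

```rust
// Sketch of two SplitDelimiterBehavior semantics; `split_with` is a
// hypothetical helper, not the tokenizers crate API.
#[derive(Clone, Copy)]
enum SplitDelimiterBehavior {
    Removed,  // delimiter is dropped from the output (Whitespace-like)
    Isolated, // delimiter becomes its own piece (Punctuation's fixed choice)
}

fn split_with(s: &str, is_delim: fn(char) -> bool, behavior: SplitDelimiterBehavior) -> Vec<String> {
    let mut out = Vec::new();
    let mut cur = String::new();
    for c in s.chars() {
        if is_delim(c) {
            // Flush the piece accumulated so far, if any.
            if !cur.is_empty() {
                out.push(std::mem::take(&mut cur));
            }
            // Isolated keeps the delimiter as a standalone piece; Removed drops it.
            if let SplitDelimiterBehavior::Isolated = behavior {
                out.push(c.to_string());
            }
        } else {
            cur.push(c);
        }
    }
    if !cur.is_empty() {
        out.push(cur);
    }
    out
}

fn main() {
    let is_punc = |c: char| c.is_ascii_punctuation();
    // Isolated: ["hello", ",", " world", "!"]
    println!("{:?}", split_with("hello, world!", is_punc, SplitDelimiterBehavior::Isolated));
    // Removed: ["hello", " world"]
    println!("{:?}", split_with("hello, world!", is_punc, SplitDelimiterBehavior::Removed));
}
```

A configurable Punctuation pre-tokenizer would simply thread a stored behavior through in place of the hard-coded variant, the way Split already does.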
pytorch-original-transformer

My implementation of the original transformer model (Vaswani et al.). I've also included the playground.py file for visualizing otherwise seemingly hard concepts. IWSLT pretrained models are currently included.

  • Updated Dec 27, 2020
  • Jupyter Notebook
sdtblck commented Mar 16, 2021

It would be good to replace the idiosyncratic tokenizer class Megatron has with HF tokenizers, or at least to accept both types of tokenizer. The HF library is intuitive for training and easy to use, and this would allow us to experiment with different tokenization schemes, etc.

The tokenizer code is currently spread across the repo, so this might be a bit of a task.
