# tokenization
Here are 186 public repositories matching this topic...
Unsupervised text tokenizer focused on computational efficiency
-
Updated
Feb 13, 2020 - C++
PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language
-
Updated
May 2, 2020 - PHP
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from two big corpora (English Wikipedia and Twitter - 330 million English tweets).
nlp
tokenizer
text-processing
semeval
nlp-library
word-segmentation
spelling-correction
tokenization
text-segmentation
spell-corrector
word-normalization
-
Updated
Oct 22, 2019 - Python
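Hashtag splitting of the kind Ekphrasis performs can be sketched as dictionary-based word segmentation with dynamic programming. The frequency table below is a toy stand-in for Ekphrasis's real corpus statistics, and `segment` is a hypothetical helper for illustration, not the library's actual API:

```python
# Toy unigram counts standing in for corpus statistics; the names and
# numbers here are made up purely for illustration.
FREQ = {"good": 30, "morning": 20, "go": 15, "od": 1}
TOTAL = sum(FREQ.values())

def word_prob(word):
    # Smoothed probability: unseen words get a tiny score that shrinks
    # with length, so known dictionary words are strongly preferred.
    return FREQ.get(word, 0) / TOTAL or 1e-9 / 10 ** len(word)

def segment(text):
    # best[i] holds (probability, segmentation) for the prefix text[:i].
    best = [(1.0, [])] + [(0.0, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - 12), i):  # cap candidate word length
            prev_prob, prev_seg = best[j]
            if prev_seg is None:
                continue
            score = prev_prob * word_prob(text[j:i])
            if score > best[i][0]:
                best[i] = (score, prev_seg + [text[j:i]])
    return best[len(text)][1]

print(segment("goodmorning"))  # → ['good', 'morning']
```

Real systems use log-probabilities over millions of unigrams (and often bigrams), but the recurrence is the same.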
ClangKit provides an Objective-C frontend to LibClang. Source tokenization, diagnostics and fix-its are currently implemented.
c
syntax-highlighting
c-plus-plus
parsing
objective-c
code
llvm
static-analysis
clang
source
diagnostics
tokenization
-
Updated
May 9, 2017 - C
Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing
parse
machine-translation
embeddings
information-extraction
dependency-parser
universal-dependencies
part-of-speech-tagger
dependency-parsing
tokenization
lemmatization
sentence-splitting
nlp-cube
language-pipeline
-
Updated
May 5, 2020 - Python
Remagpie
commented
Sep 24, 2019
The Transaction.md file doesn't contain enough details about its actual behavior.
Fast and customizable text tokenization library with BPE and SentencePiece support
python
unicode
natural-language-processing
cpp
icu
tokenizer
machine-translation
tokenization
bpe
sentencepiece
-
Updated
May 22, 2020 - C++
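The BPE support mentioned above refers to byte-pair encoding, which builds a subword vocabulary by repeatedly merging the most frequent adjacent symbol pair. A minimal sketch of that training loop, using the classic toy vocabulary from the original BPE paper rather than this library's actual API:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the space-separated vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq
            for word, freq in vocab.items()}

# Words represented as space-separated symbols with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges = []
for _ in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)
print(merges)  # first merges: ('e', 's'), ('es', 't'), ('est', '</w>'), …
```

The learned merge list is then replayed, in order, to tokenize new text into subwords.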
Rule-based token and sentence segmentation for the Russian language
-
Updated
May 18, 2020 - Python
rth
commented
May 3, 2019
It would be useful to add a sentence splitter. Possible options include:
- the Punkt sentence tokenizer from NLTK (needs a pre-trained model)
- Unicode sentence boundaries from unicode-rs/unicode-segmentation#24 (doesn't need a pre-trained model)
- investigating the spaCy implementation (likely needs a pre-trained model)
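A minimal regex-based splitter illustrates why the pre-trained options above matter: naive punctuation rules mis-split abbreviations. The function name here is hypothetical:

```python
import re

def naive_split(text):
    # Split after ., !, or ? when followed by whitespace and a capital.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(naive_split("Dr. Smith arrived. He sat down."))
# The abbreviation "Dr." triggers a spurious split — exactly the kind
# of case a trained model such as NLTK's Punkt learns to avoid.
```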
Language Modeling and Text Classification in Malayalam Language using ULMFiT
-
Updated
Mar 31, 2020 - Jupyter Notebook
A Japanese morphological analyzer: An unofficial Sudachi clone in Rust 🦀
-
Updated
Dec 20, 2019 - Rust
Collection of Wongnai's datasets
-
Updated
Aug 26, 2019
Natural Language Processing Toolkit in Golang
-
Updated
May 9, 2020 - Go
High performance tokenizers for natural language processing and other related tasks
-
Updated
Feb 20, 2020 - Julia
python
nlp
docker
spacy
named-entity-recognition
sense2vec
part-of-speech-tagger
tokenization
sentence-segmentation
-
Updated
Apr 18, 2020 - Python
POS tagger, lemmatizer and stemmer for the French language, in JavaScript
-
Updated
Sep 13, 2017 - JavaScript
Simple and customizable text tokenization gem.
-
Updated
May 30, 2019 - Ruby
coventry
commented
May 23, 2017
Passing "北京大学生物系主任办公室内部会议" to morphology_han-readings.py prints
{'hanReadings': [['Bei3-jing1-Da4-xue2'], null, ['zhu3-ren4'], ['ban4-gong1-shi4'], ['nei4-bu4'], ['hui4-yi4']]}
The null element of the list should be ['Sheng1-wu4'], i.e., "Biology."
Multilingual tokenizer that automatically tags each token with its type
multilingual
german
tokenizer
tagging
latin
french
hindi
wink
devanagari
marathi
tokenization
konkani
-
Updated
Nov 23, 2019 - JavaScript
Smart Language Model
-
Updated
Jun 5, 2020 - C++
Custom Russian tokenizer for spaCy
-
Updated
May 14, 2019 - Python
The Unicode Cookbook for Linguists
python
unicode
r
transliteration
linguistics
ipa
phonetics
transcription
writing-systems
tokenization
-
Updated
Sep 14, 2018 - TeX
A tokenizer based on Unicode text segmentation (UAX 29), for Go
-
Updated
Jun 1, 2020 - Go
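UAX #29 specifies word boundaries far more carefully than a simple regex can, but a rough approximation shows the flavor: runs of word characters, with an extra alternative so that an apostrophe between letters does not break a word (UAX #29's rules likewise keep such words together). A sketch in Python rather than Go, with a made-up function name:

```python
import re

def rough_words(text):
    # Very rough stand-in for UAX #29 word tokens: runs of word
    # characters, allowing internal apostrophes as in "can't".
    return re.findall(r"\w+(?:'\w+)*", text)

print(rough_words("I can't resist naïve café-talk"))
# → ['I', "can't", 'resist', 'naïve', 'café', 'talk']
```

A conformant implementation also handles numeric sequences, hyphenation scripts, Hebrew quotes, and regional-indicator pairs, which is why dedicated libraries like the one above exist.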
Factom Asset Tokens - Open tokenization standards on Factom
-
Updated
May 20, 2020
A Java version of the Chinese tokenization described in BERT.
-
Updated
Jul 17, 2019 - Java
Use Python and NLTK to build out your own text classifiers and solve common NLP problems
python
nlp
api
natural-language-processing
unsupervised
linear-regression
scikit-learn
markov-chain
pandas
lda
supervised
latent-dirichlet-allocation
tokenization
binary-classifier
-
Updated
Jan 15, 2020 - Jupyter Notebook
Tokenize, encrypt/decrypt, mask your data on the fly with Vaulty proxy
-
Updated
Jun 5, 2020 - Go
Pre-trained models for tokenization, sentence segmentation and so on
machine-learning
natural-language-processing
russian-specific
conditional-random-fields
tokenization
sentence-segmentation
-
Updated
Aug 22, 2017 - Python
The macOS build notes contain the following line:
brew install automake berkeley-db4 libtool boost --c++11 miniupnpc openssl pkg-config protobuf python3 qt libevent
However, boost --c++11 is no longer a valid option, so the line needs updating.
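A likely fix, assuming a current Homebrew where per-formula install options have been removed from homebrew-core and the boost bottle is already C++11-compatible, is simply to drop the flag:

```shell
# Same dependency list with the obsolete --c++11 option removed.
brew install automake berkeley-db4 libtool boost miniupnpc openssl \
     pkg-config protobuf python3 qt libevent
```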