# tokenization
Here are 186 public repositories matching this topic...
Unsupervised text tokenizer focused on computational efficiency
-
Updated
Feb 13, 2020 - C++
PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language
-
Updated
May 2, 2020 - PHP
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from two big corpora (English Wikipedia and Twitter - 330 million English tweets).
nlp
tokenizer
text-processing
semeval
nlp-library
word-segmentation
spelling-correction
tokenization
text-segmentation
spell-corrector
word-normalization
-
Updated
Oct 22, 2019 - Python
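Hashtag splitting of the kind Ekphrasis performs can be sketched as dictionary-based word segmentation with dynamic programming. The frequency table below is a toy stand-in for Ekphrasis's real corpus statistics, and `segment` is a hypothetical helper for illustration, not the library's actual API:

```python
# Toy unigram counts standing in for corpus statistics; the names and
# numbers here are made up purely for illustration.
FREQ = {"good": 30, "morning": 20, "go": 15, "od": 1}
TOTAL = sum(FREQ.values())

def word_prob(word):
    # Smoothed probability: unseen words get a tiny score that shrinks
    # with length, so known dictionary words are strongly preferred.
    return FREQ.get(word, 0) / TOTAL or 1e-9 / 10 ** len(word)

def segment(text):
    # best[i] holds (probability, segmentation) for the prefix text[:i].
    best = [(1.0, [])] + [(0.0, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - 12), i):  # cap candidate word length
            prev_prob, prev_seg = best[j]
            if prev_seg is None:
                continue
            score = prev_prob * word_prob(text[j:i])
            if score > best[i][0]:
                best[i] = (score, prev_seg + [text[j:i]])
    return best[len(text)][1]

print(segment("goodmorning"))  # → ['good', 'morning']
```

Real systems use log-probabilities over millions of unigrams (and often bigrams), but the recurrence is the same.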
ClangKit provides an Objective-C frontend to LibClang. Source tokenization, diagnostics and fix-its are currently implemented.
c
syntax-highlighting
c-plus-plus
parsing
objective-c
code
llvm
static-analysis
clang
source
diagnostics
tokenization
-
Updated
May 9, 2017 - C
Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing
parse
machine-translation
embeddings
information-extraction
dependency-parser
universal-dependencies
part-of-speech-tagger
dependency-parsing
tokenization
lemmatization
sentence-splitting
nlp-cube
language-pipeline
-
Updated
May 5, 2020 - Python
Remagpie
commented
Sep 24, 2019
The Transaction.md file doesn't contain enough details about its actual behavior.
Fast and customizable text tokenization library with BPE and SentencePiece support
python
unicode
natural-language-processing
cpp
icu
tokenizer
machine-translation
tokenization
bpe
sentencepiece
-
Updated
May 22, 2020 - C++
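The BPE support mentioned above refers to byte-pair encoding, which builds a subword vocabulary by repeatedly merging the most frequent adjacent symbol pair. A minimal sketch of that training loop, using the classic toy vocabulary from the original BPE paper rather than this library's actual API:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the space-separated vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq
            for word, freq in vocab.items()}

# Words represented as space-separated symbols with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges = []
for _ in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)
print(merges)  # first merges: ('e', 's'), ('es', 't'), ('est', '</w>'), …
```

The learned merge list is then replayed, in order, to tokenize new text into subwords.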
Rule-based token and sentence segmentation for the Russian language
-
Updated
May 18, 2020 - Python
rth
commented
May 3, 2019
It would be useful to add a sentence splitter. Possible options include:
- the Punkt sentence tokenizer from NLTK (needs a pre-trained model)
- Unicode sentence boundaries from unicode-rs/unicode-segmentation#24 (doesn't need a pre-trained model)
- investigating the spaCy implementation (likely needs a pre-trained model)
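A minimal regex-based splitter illustrates why the pre-trained options above matter: naive punctuation rules mis-split abbreviations. The function name here is hypothetical:

```python
import re

def naive_split(text):
    # Split after ., !, or ? when followed by whitespace and a capital.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(naive_split("Dr. Smith arrived. He sat down."))
# The abbreviation "Dr." triggers a spurious split — exactly the kind
# of case a trained model such as NLTK's Punkt learns to avoid.
```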
Language Modeling and Text Classification in Malayalam Language using ULMFiT
-
Updated
Mar 31, 2020 - Jupyter Notebook
A Japanese morphological analyzer: An unofficial Sudachi clone in Rust 🦀
-
Updated
Dec 20, 2019 - Rust
Collection of Wongnai's datasets
-
Updated
Aug 26, 2019
Natural Language Processing Toolkit in Golang
-
Updated
May 9, 2020 - Go
High performance tokenizers for natural language processing and other related tasks
-
Updated
Feb 20, 2020 - Julia
python
nlp
docker
spacy
named-entity-recognition
sense2vec
part-of-speech-tagger
tokenization
sentence-segmentation
-
Updated
Apr 18, 2020 - Python
POS tagger, lemmatizer and stemmer for the French language, in JavaScript
-
Updated
Sep 13, 2017 - JavaScript
Simple and customizable text tokenization gem.
-
Updated
May 30, 2019 - Ruby
coventry
commented
May 23, 2017
Passing "北京大学生物系主任办公室内部会议" to morphology_han-readings.py prints
{'hanReadings': [['Bei3-jing1-Da4-xue2'], null, ['zhu3-ren4'], ['ban4-gong1-shi4'], ['nei4-bu4'], ['hui4-yi4']]}
The null element of the list should be ['Sheng1-wu4'], i.e., "Biology."
Multilingual tokenizer that automatically tags each token with its type
multilingual
german
tokenizer
tagging
latin
french
hindi
wink
devanagari
marathi
tokenization
konkani
-
Updated
Nov 23, 2019 - JavaScript
Smart Language Model
-
Updated
Jun 5, 2020 - C++
Custom Russian tokenizer for spaCy
-
Updated
May 14, 2019 - Python
The Unicode Cookbook for Linguists
python
unicode
r
transliteration
linguistics
ipa
phonetics
transcription
writing-systems
tokenization
-
Updated
Sep 14, 2018 - TeX
A tokenizer based on Unicode text segmentation (UAX 29), for Go
-
Updated
Jun 1, 2020 - Go
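UAX #29 specifies word boundaries far more carefully than a simple regex can, but a rough approximation shows the flavor: runs of word characters, with an extra alternative so that an apostrophe between letters does not break a word (UAX #29's rules likewise keep such words together). A sketch in Python rather than Go, with a made-up function name:

```python
import re

def rough_words(text):
    # Very rough stand-in for UAX #29 word tokens: runs of word
    # characters, allowing internal apostrophes as in "can't".
    return re.findall(r"\w+(?:'\w+)*", text)

print(rough_words("I can't resist naïve café-talk"))
# → ['I', "can't", 'resist', 'naïve', 'café', 'talk']
```

A conformant implementation also handles numeric sequences, hyphenation scripts, Hebrew quotes, and regional-indicator pairs, which is why dedicated libraries like the one above exist.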
Factom Asset Tokens - Open tokenization standards on Factom
-
Updated
May 20, 2020
A Java version of the Chinese tokenization described in BERT.
-
Updated
Jul 17, 2019 - Java
Use Python and NLTK to build out your own text classifiers and solve common NLP problems
python
nlp
api
natural-language-processing
unsupervised
linear-regression
scikit-learn
markov-chain
pandas
lda
supervised
latent-dirichlet-allocation
tokenization
binary-classifier
-
Updated
Jan 15, 2020 - Jupyter Notebook
Tokenize, encrypt/decrypt, mask your data on the fly with Vaulty proxy
-
Updated
Jun 5, 2020 - Go
Pre-trained models for tokenization, sentence segmentation and so on
machine-learning
natural-language-processing
russian-specific
conditional-random-fields
tokenization
sentence-segmentation
-
Updated
Aug 22, 2017 - Python
The macOS build notes contain the following line:
brew install automake berkeley-db4 libtool boost --c++11 miniupnpc openssl pkg-config protobuf python3 qt libevent
However, boost --c++11 is no longer a valid option, so the line needs updating.
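A likely fix, assuming a current Homebrew where per-formula install options have been removed from homebrew-core and the boost bottle is already C++11-compatible, is simply to drop the flag:

```shell
# Same dependency list with the obsolete --c++11 option removed.
brew install automake berkeley-db4 libtool boost miniupnpc openssl \
     pkg-config protobuf python3 qt libevent
```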