bigrams

Right now the tokenize() function splits whenever a '.' character is found. Most of the time this is a correct way to split a file into sentences, but sometimes an abbreviation such as Dr., Mr., or Mrs. appears in the middle of a sentence, and the sentence is split right there. I want to enhance the regex so that it does not split sentences on abbreviations.
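One possible way to get this behaviour is sketched below. It is not the repository's actual tokenize() implementation, which is not shown here. The function name `split_sentences` and the `ABBREVIATIONS` set are hypothetical, and the abbreviation list is an assumption that would need extending for a real corpus. Instead of one large regex, the sketch finds every candidate boundary and skips those where the word before the period is a known abbreviation.

```python
import re

# Assumed abbreviation list (lowercase, without the trailing period).
# Hypothetical: extend it to match the corpus being tokenized.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof"}

# A candidate sentence boundary: '.', '!' or '?' followed by whitespace.
_BOUNDARY = re.compile(r"([.!?])\s+")


def split_sentences(text):
    """Split text into sentences, ignoring periods that end an abbreviation."""
    sentences = []
    start = 0  # index where the current sentence begins
    for match in _BOUNDARY.finditer(text):
        if match.group(1) == ".":
            # Look at the word immediately before the period.
            words = text[start:match.start()].split()
            last_word = words[-1] if words else ""
            if last_word.lower() in ABBREVIATIONS:
                continue  # e.g. "Dr." or "Mr.": not a sentence end
        # Keep the terminating punctuation with the sentence.
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    # Whatever is left after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    print(split_sentences("Dr. Fahad met Mr. Wayne. They talked."))
```

A limitation of this approach is that an abbreviation that really does end a sentence will not trigger a split, so the list should only contain titles that are normally followed by a name.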

Here are 78 public repositories matching this topic...

ollie283 / language-models

starlordvk / Typing-Assistant

susantabiswas / Word-Prediction-Ngram

rohitthapliyal2000 / Sentiment-Analysis-NLTK

dohliam / hawaiian-corpus

DigitalTools / nltk-book

senya-ashukha / bigram-anchor-words

mochi-co / ngrams

ez-sherlock / N-Gram-Language-Model

The tokenize() function shouldn't split sentences on abbreviations such as Dr. Fahad, Mr. Wayne, etc.

sohailahmedkhan / Sentence-Completion-using-Hidden-Markov-Models

sachin-bisht / YouTube-Sentiment-Analysis

gromag / Data-Science-Specialisation-Predict-Next-Word

sachin-bisht / Sentiment-Analysis-NLTK

ricardobreis / Text-Mining-Acesso-Info-SP

Premchand95 / Sentiment-Analysis-of-Reviews-using-Machine-Learning-algorithms-on-Textual-data

ZNClub-PA-ML-AI / NLP-techniques

motiurinfo / sentiment_classification

Adrianogba / bigrama-trigrama-python

AslanDevbrat / Computational-Linguistic

VaasuDevanS / Natural-Language-Processing-Assignments

faisalsyfl / IndoLangModel

sashakenjeeva / spell-corrector

fikrirazor / bigramindo

gjorm / WordSeg

vgratian / phon_bigrams

jianleisun / NLP-project

bobbingwide / bigram

sienmonika / text_mining_hillary_emails

umangkshah / Machine-Learning-and-NLP

Mini-Stren / TYAPIMT
