nltk

The latest versions of Python are more strict wrt. escape in regex.
For instance with 3.6.8, there are 10+ warnings like this one:

...
lib/python3.6/site-packages/nltk/featstruct.py:2092: DeprecationWarning: invalid escape sequence \d
    RANGE_RE = re.compile('(-?\d+):(-?\d+)')

The regex(es) should be updated to silence these warnings.

Despite the documentation here stating:

You can use other tokenizers, such as those provided by NLTK, by passing them into the TextBlob constructor then accessing the tokens property.

This fails:

from textblob import TextBlob
from nltk.tokenize import TweetTokenizer

blob = TextBlob("I don't work!", tokenizer=T

The procedure for downloading the "en_vectors_web_lg" in spacy. by downloading and unzipping the file, and shifting it to the appropriate directory, as illustrated here is l

This is also on page 356.

from nltk.corpus import sentiwordnet as swn

good = swn.senti_synsets('good', 'n')[0]
Traceback (most recent call last):
File "", line 1, in
TypeError: 'filter' object is not subscriptable

thanks for sharing! here's the rake.py file edited to use spacy instead of nltk. it removes certain verb types in _get_phrase_list_from_words, which i found to improve performance a bit (in small sample size).

# -*- coding: utf-8 -*- """Implementation of Rapid Automatic Keyword Extraction algorithm. As described in the paper Automatic keyword extraction from individual
documents` by Stuart

Just an idea:
I think README would be the best thing to run LDA on, since it contains a pretty good description of the project. Projects without README should be penalised either way. Often times the repository description is too short to describe in detail what the repository is all about.

#840 updated the stop module (cltk.stop.stop) but its docs was not updated.

It seems that a child class of BaseCorpusStoplist must have attributes vectorizer and tfidf_vectorizer like in cltk.stop.latin

It appears the pickled tokenizers are old, and do not contain current code.

https://github.com/nltk/nltk_data/blob/gh-pages/packages/tokenizers/punkt.zip

The .zip that is downloaded is older than the source code:

https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py

There are a few changes in punkt.py since the .zip was created that seem to improve the tokenization of senten

nltk

Here are 1,219 public repositories matching this topic...

nltk / nltk

sloria / TextBlob

dipanjanS / practical-machine-learning-with-python

alfredfrancis / ai-chatbot-framework

dipanjanS / text-analytics-with-python

alexgreene / WikiQuiz

csurfer / rake-nltk

csurfer / gitsuggest

cltk / cltk

shirosaidev / stocksight

nltk / nltk_data

hb20007 / hands-on-nltk-tutorial

javedsha / text-classification

chen0040 / keras-english-resume-parser-and-analyzer

fendouai / Awesome-Text-Classification

NLP-kr / tensorflow-ml-nlp

gionanide / Speech_Signal_Processing_and_Classification

lfcipriani / punkt-segmenter

prabhakar267 / vertikin

biolab / orange3-text

usyiyi / nlp-py-2e-zh

mertkahyaoglu / twitter-sentiment-analysis

arop / ner-re-pt

nagypeterjob / Sentiment-Analysis-NLTK-ML-LSTM

apanimesh061 / VaderSentimentJava

saidziani / Arabic-News-Article-Classification

gyanesh-m / Sentiment-analysis-of-financial-news-data

christabor / namebot

TiesdeKok / Python_NLP_Tutorial

sachin-bisht / YouTube-Sentiment-Analysis

Improve this page

Add this topic to your repo