📚 A practical approach to machine learning to enable everyone to learn, explore and build.
Updated Jan 13, 2020 - Jupyter Notebook
Natural language processing (NLP) is a field of computer science that studies how computers and humans interact. In the 1950s, Alan Turing published an article that proposed a measure of intelligence, now called the Turing test. More modern techniques, such as deep learning, have produced strong results in language modeling, parsing, and many other natural-language tasks.
AiLearning: Machine Learning (ML), Deep Learning (DL), and Natural Language Processing (NLP)
This is a documentation bug. In the TransfoXL documentation, the tokenization example is wrong. The snippet goes:
import torch
from transformers import TransfoXLTokenizer

tokenizer = TransfoXLTokenizer.from_pretrained('transfo-xl-wt103')
...
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
This code outputs
Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification via one unified interface.
I was going through the existing enhancement issues again and thought it'd be nice to collect ideas for spaCy plugins and related projects. There are always people in the community who are looking for new things to build, so here's some inspiration.
If you have questions about the projects I suggested,
Oxford Deep NLP 2017 course
Your new Mentor for Data Science E-Learning.
The usage example in the word2vec.py doc-comment regarding KeyedVectors uses inconsistent paths and thus doesn't work.
If vectors were saved to a tm
:book: A curated list of resources dedicated to Natural Language Processing (NLP)
In the README.md of the stanford-tensorflow-tutorials/assignments/chatbot/ directory, the hyperlink https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/ is no longer working.
The same problem is duplicated in the chatbot.py comments.
When normalizing this text:
"the guest-singer mr. smith who was supposed to show up at at seven thirty didn't."
The output received is
the guest singer Mr. Smith who was supposed to show up at at 7 30 did not
Expected output
the guest singer Mr. Smith who was supposed to show up at at 7 30 did not.
Sample code
var doc = nlp("the guest-singer mr. sm
A comprehensive list of PyTorch-related content on GitHub, such as different models, implementations, helper libraries, tutorials, etc.
The latest versions of Python are stricter about escape sequences in regexes.
For instance with 3.6.8, there are 10+ warnings like this one:
...
lib/python3.6/site-packages/nltk/featstruct.py:2092: DeprecationWarning: invalid escape sequence \d
RANGE_RE = re.compile('(-?\d+):(-?\d+)')
The regex(es) should be updated to silence these warnings.
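The warnings go away once the pattern is written as a raw string; a minimal sketch of the fix for the quoted line (same pattern, only the string literal changes):

```python
import re

# Python 3.6+ flags unrecognized backslash escapes like "\d" in normal
# string literals with a DeprecationWarning. Prefixing the pattern with
# r"" makes it a raw string, which silences the warning without changing
# the compiled regex.
RANGE_RE = re.compile(r'(-?\d+):(-?\d+)')

print(RANGE_RE.match('3:-7').groups())  # → ('3', '-7')
```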
My feature request is to include an option, on a button made from the choice skill, to redirect to an external URL...
Here's a detailed explanation including screenshots.
This option would be really beneficial for choice-skill buttons since, at the moment, you can only add an ext
Train a simple NER tagger for Swedish, for instance over this dataset.
For this task, we need to adapt the NLPTaskDataFetcher to the appropriate Swedish dataset and train a simple model using Swedish word embeddings. How to train a model is [illustrated here](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_TRAI
Question
Right now, the installation section tells you how to install, but not how to upgrade an installed version. We should add that info to the documentation. Inspired by this forum post.
As per the StanfordCoreNLP documentation for CoreLabel, the functions after() and before() should return the whitespace strings between the token and the next/previous token, respectively.
However, they always return an empty string, even when there is some whitespace, when the tokenizer option **normalizeOth
Despite the documentation here stating:
You can use other tokenizers, such as those provided by NLTK, by passing them into the TextBlob constructor then accessing the tokens property.
This fails:
from textblob import TextBlob
from nltk.tokenize import TweetTokenizer
blob = TextBlob("I don't work!", tokenizer=T
bert-as-service
All kinds of text classification models and more with deep learning
Hi, I would like to propose a better implementation for 'test_indices':
We can remove the unneeded np.array cast.
Cleaner/New:
test_indices = list(set(range(len(texts))) - set(train_indices))
Old:
test_indices = np.array(list(set(range(len(texts))) - set(train_indices)))
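A quick self-contained check (with a toy texts list and train split standing in for the real data) confirms the two expressions select the same indices, so the np.array cast can be dropped:

```python
import numpy as np

texts = ["a", "b", "c", "d", "e"]  # toy stand-in for the real corpus
train_indices = [0, 2]             # toy train split

# Old version with the extra cast, and the proposed plain-list version.
old = np.array(list(set(range(len(texts))) - set(train_indices)))
new = list(set(range(len(texts))) - set(train_indices))

# Both select exactly the indices not in the train split.
assert sorted(new) == sorted(old.tolist())
print(sorted(new))  # → [1, 3, 4]
```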
Natural Language Processing Tutorial for Deep Learning Researchers
Hi, can the batchify method only batch one doc per file, not two docs in the same file? Why isn't the EOD flag used to distinguish different docs in data_utils.py?
Large Scale Chinese Corpus for NLP
TensorFlow 2.x tutorials and examples, including CNN, RNN, GAN, auto-encoder, Faster R-CNN, GPT, and BERT examples. Introductory example code and hands-on tutorials for TF 2.0.
Extract Keywords from sentence or Replace keywords in sentences.
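The extract/replace idea can be sketched in plain Python; this is a naive stand-in (real keyword libraries use a trie so matching cost doesn't grow with vocabulary size), and the example phrases and mappings are made up:

```python
# Map each keyword phrase to its canonical replacement.
keywords = {"big apple": "New York", "machine learning": "ML"}

def extract_keywords(text, mapping):
    # Return the canonical form of every phrase found in the text.
    return [repl for phrase, repl in mapping.items() if phrase in text]

def replace_keywords(text, mapping):
    # Replace each phrase with its canonical form.
    for phrase, repl in mapping.items():
        text = text.replace(phrase, repl)
    return text

s = "I love the big apple and machine learning"
print(extract_keywords(s, keywords))  # → ['New York', 'ML']
print(replace_keywords(s, keywords))  # → 'I love the New York and ML'
```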
Google has started using BERT in its search engine. I imagine it creates embeddings for the query, then computes some similarity measure against the candidate websites/pages, and finally ranks them in the search results.
I am curious how they create embeddings for the documents (the candidate websites/pages), if at all. Or am I interpreting it wrong?
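The ranking idea described in the question can be illustrated with a toy sketch, assuming generic dense embeddings compared by cosine similarity (this is not Google's actual pipeline; the vectors and page names below are made up):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for a query embedding and two candidate-page embeddings.
query = np.array([0.9, 0.1, 0.0])
pages = {
    "page_a": np.array([0.8, 0.2, 0.1]),  # close to the query
    "page_b": np.array([0.1, 0.9, 0.3]),  # far from the query
}

# Rank candidate pages by similarity to the query, highest first.
ranked = sorted(pages, key=lambda p: cosine(query, pages[p]), reverse=True)
print(ranked)  # → ['page_a', 'page_b']
```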