
Google Research Datasets

Datasets released by Google Research

Pinned

  1. Natural Questions (NQ) contains real user questions issued to Google Search, with answers found in Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question answering systems.

    Python · 697 stars · 131 forks

  2. Conceptual Captions is a dataset of (image-URL, caption) pairs designed for the training and evaluation of machine-learned image captioning systems.

    Shell · 326 stars · 17 forks

  3. ToTTo

    ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description.

    323 stars · 30 forks

  4. dakshina

    The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes a large collection of native script Wikipedia text.

    141 stars · 15 forks

  5. tydiqa

    TyDi QA contains 200k human-annotated question-answer pairs in 11 typologically diverse languages, written without seeing the answer and without the use of translation, and is designed for the training and evaluation of automatic question answering systems.

    Python · 211 stars · 34 forks

  6. GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia for the evaluation of coreference resolution in practical applications.

    Python · 203 stars · 74 forks
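The GAP entry above describes coreference-labeled (ambiguous pronoun, antecedent name) pairs with character offsets into the source text. As a minimal sketch, assuming the TSV header layout of the released gap-coreference files (ID, Text, Pronoun, Pronoun-offset, candidates A and B with offsets and coref labels), one row can be parsed like this; the row itself is synthetic, not a real dataset entry:

```python
import csv
import io

# Parse one GAP-style TSV row and recover the pronoun span from its
# character offset. Column names are assumed to match the released
# gap-coreference header; the example row below is made up.
tsv = (
    "ID\tText\tPronoun\tPronoun-offset\tA\tA-offset\tA-coref\t"
    "B\tB-offset\tB-coref\tURL\n"
    "dev-1\tAlice met Beth before she left.\tshe\t22\t"
    "Alice\t0\tTRUE\tBeth\t10\tFALSE\thttp://example.com\n"
)
row = next(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
start = int(row["Pronoun-offset"])
span = row["Text"][start:start + len(row["Pronoun"])]
print(span)  # she
```

Checking the recovered span against the Pronoun column is a quick sanity test that the offsets are character-based and zero-indexed.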

Repositories

  • cvss

    CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

    33 stars · CC-BY-4.0 license · 3 forks · 0 issues · 0 PRs · Updated Jan 16, 2022
  • Video-Timeline-Tags-ViTT

    A collection of videos annotated with timelines: each video is divided into segments, and each segment is labelled with a short free-text description.

    12 stars · 1 fork · 1 issue · 0 PRs · Updated Jan 15, 2022
  • clay

    The dataset includes UI object type labels (e.g., BUTTON, IMAGE, CHECKBOX) that describe the semantic type of a UI object on Android app screenshots. It is used for training and evaluating screen layout denoising models (paper will be linked soon).

    0 stars · 0 forks · 0 issues · 0 PRs · Updated Jan 14, 2022
  • poem-sentiment

    Poem Sentiment is a sentiment dataset of poem verses from Project Gutenberg. This dataset can be used for tasks such as sentiment classification or style transfer for poems.

    2 stars · CC-BY-4.0 license · 3 forks · 1 issue · 0 PRs · Updated Jan 12, 2022
  • NewSHead

    The NewSHead dataset is a multi-document headline dataset used in NHNet to train a headline summarization model.

    24 stars · 2 forks · 0 issues · 0 PRs · Updated Jan 7, 2022
  • ToTTo

    ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. We hope it can serve as a useful research benchmark for high-precision conditional text generation.

    323 stars · 30 forks · 5 issues · 1 PR · Updated Jan 7, 2022
  • paws

    This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order for paraphrase identification.

    Python · 405 stars · 46 forks · 9 issues · 1 PR · Updated Jan 4, 2022
  • MAVE

    The dataset contains 3 million attribute-value annotations across 1,257 unique categories on 2.2 million cleaned Amazon product profiles. It is a large, multi-sourced, diverse dataset for studying product attribute extraction.

    Python · 20 stars · 1 fork · 0 issues · 0 PRs · Updated Dec 19, 2021
  • NatGenMT

    This dataset is an evaluation benchmark for gender issues in machine translation. We consider the challenges of modeling and handling gendered language in the context of machine translation and extend previous work that identifies issues using synthetic examples. We focus on the class of issues which surface when a neutral refer…

    0 stars · Apache-2.0 license · 1 fork · 0 issues · 0 PRs · Updated Dec 15, 2021
  • C4_200M-synthetic-dataset-for-grammatical-error-correction

    This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 with a tagged corruption model. The approach and the dataset are described in more detail by Stahlberg and Kumar (2021) (https://www.aclweb.org/anthology/2021.bea-1.4/).

    Python · 71 stars · CC-BY-4.0 license · 19 forks · 0 issues · 0 PRs · Updated Dec 7, 2021
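The ToTTo entry above frames a controlled generation task: given a table and a set of highlighted cells, produce one sentence. A minimal sketch of reading one example follows, assuming the released JSONL layout (a `table` of rows of cell dicts with a `value` field, `highlighted_cells` as [row, column] index pairs, and `sentence_annotations` carrying a `final_sentence`); the tiny two-by-two table is a synthetic stand-in, not a real dataset entry:

```python
import json

# Pull the highlighted cell values out of one ToTTo-style JSONL record.
# Field names are assumed to match the released format; the table is
# a made-up example.
line = json.dumps({
    "table": [
        [{"value": "Year", "is_header": True},
         {"value": "Title", "is_header": True}],
        [{"value": "2020", "is_header": False},
         {"value": "ToTTo", "is_header": False}],
    ],
    "highlighted_cells": [[1, 0], [1, 1]],
    "sentence_annotations": [
        {"final_sentence": "ToTTo was released in 2020."}
    ],
})
example = json.loads(line)
highlighted = [example["table"][r][c]["value"]
               for r, c in example["highlighted_cells"]]
print(highlighted)  # ['2020', 'ToTTo']
```

The highlighted cell values plus the table are the model input; the annotated final sentence is the target for the controlled generation task.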
