
Google Research Datasets

Datasets released by Google Research

Pinned

  1. Natural Questions (NQ) contains real user questions issued to Google Search, with answers found in Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question answering systems.

    Python · 697 stars · 131 forks

  2. Conceptual Captions is a dataset of (image-URL, caption) pairs designed for the training and evaluation of machine-learned image captioning systems.

    Shell · 326 stars · 17 forks

  3. ToTTo

    ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description.

    323 stars · 30 forks

  4. dakshina

    The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes a large collection of native script Wikipedia text.

    141 stars · 15 forks

  5. tydiqa

    TyDi QA contains 200k human-annotated question-answer pairs in 11 typologically diverse languages, written without seeing the answer and without the use of translation, and is designed for the training and evaluation of automatic question answering systems.

    Python · 211 stars · 34 forks

  6. GAP is a gender-balanced dataset containing 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name), sampled from Wikipedia for the evaluation of coreference resolution in practical applications.

    Python · 203 stars · 74 forks
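The GAP entry above describes coreference-labeled (ambiguous pronoun, antecedent name) pairs with character offsets into the source text. As a minimal sketch, assuming the TSV header layout of the released gap-coreference files (ID, Text, Pronoun, Pronoun-offset, candidates A and B with offsets and coref labels), one row can be parsed like this; the row itself is synthetic, not a real dataset entry:

```python
import csv
import io

# Parse one GAP-style TSV row and recover the pronoun span from its
# character offset. Column names are assumed to match the released
# gap-coreference header; the example row below is made up.
tsv = (
    "ID\tText\tPronoun\tPronoun-offset\tA\tA-offset\tA-coref\t"
    "B\tB-offset\tB-coref\tURL\n"
    "dev-1\tAlice met Beth before she left.\tshe\t22\t"
    "Alice\t0\tTRUE\tBeth\t10\tFALSE\thttp://example.com\n"
)
row = next(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
start = int(row["Pronoun-offset"])
span = row["Text"][start:start + len(row["Pronoun"])]
print(span)  # she
```

Checking the recovered span against the Pronoun column is a quick sanity test that the offsets are character-based and zero-indexed.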

Repositories

  • cvss

    CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

    33 stars · CC-BY-4.0 license · 3 forks · 0 issues · 0 PRs · Updated Jan 16, 2022
  • Video-Timeline-Tags-ViTT

    A collection of videos annotated with timelines: each video is divided into segments, and each segment is labelled with a short free-text description.

    12 stars · 1 fork · 1 issue · 0 PRs · Updated Jan 15, 2022
  • clay

    The dataset includes UI object type labels (e.g., BUTTON, IMAGE, CHECKBOX) that describe the semantic type of a UI object on Android app screenshots. It is used for training and evaluating screen layout denoising models (paper will be linked soon).

    0 stars · 0 forks · 0 issues · 0 PRs · Updated Jan 14, 2022
  • poem-sentiment

    Poem Sentiment is a sentiment dataset of poem verses from Project Gutenberg. This dataset can be used for tasks such as sentiment classification or style transfer for poems.

    2 stars · CC-BY-4.0 license · 3 forks · 1 issue · 0 PRs · Updated Jan 12, 2022
  • NewSHead

    The NewSHead dataset is a multi-document headline dataset used in NHNet to train a headline summarization model.

    24 stars · 2 forks · 0 issues · 0 PRs · Updated Jan 7, 2022
  • ToTTo

    ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. We hope it can serve as a useful research benchmark for high-precision conditional text generation.

    323 stars · 30 forks · 5 issues · 1 PR · Updated Jan 7, 2022
  • paws

    This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order for paraphrase identification.

    Python · 405 stars · 46 forks · 9 issues · 1 PR · Updated Jan 4, 2022
  • MAVE

    The dataset contains 3 million attribute-value annotations across 1,257 unique categories on 2.2 million cleaned Amazon product profiles. It is a large, multi-sourced, diverse dataset for studying product attribute extraction.

    Python · 20 stars · 1 fork · 0 issues · 0 PRs · Updated Dec 19, 2021
  • NatGenMT

    This dataset is an evaluation benchmark for gender issues in machine translation. We consider the challenges of modeling and handling gendered language in the context of machine translation and extend previous work that identifies issues using synthetic examples. We focus on the class of issues which surface when a neutral refer…

    0 stars · Apache-2.0 license · 1 fork · 0 issues · 0 PRs · Updated Dec 15, 2021
  • C4_200M-synthetic-dataset-for-grammatical-error-correction

    This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences from C4 with a tagged corruption model. The approach and the dataset are described in more detail by Stahlberg and Kumar (2021) (https://www.aclweb.org/anthology/2021.bea-1.4/).

    Python · 71 stars · CC-BY-4.0 license · 19 forks · 0 issues · 0 PRs · Updated Dec 7, 2021
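The ToTTo entry above frames a controlled generation task: given a table and a set of highlighted cells, produce one sentence. A minimal sketch of reading one example follows, assuming the released JSONL layout (a `table` of rows of cell dicts with a `value` field, `highlighted_cells` as [row, column] index pairs, and `sentence_annotations` carrying a `final_sentence`); the tiny two-by-two table is a synthetic stand-in, not a real dataset entry:

```python
import json

# Pull the highlighted cell values out of one ToTTo-style JSONL record.
# Field names are assumed to match the released format; the table is
# a made-up example.
line = json.dumps({
    "table": [
        [{"value": "Year", "is_header": True},
         {"value": "Title", "is_header": True}],
        [{"value": "2020", "is_header": False},
         {"value": "ToTTo", "is_header": False}],
    ],
    "highlighted_cells": [[1, 0], [1, 1]],
    "sentence_annotations": [
        {"final_sentence": "ToTTo was released in 2020."}
    ],
})
example = json.loads(line)
highlighted = [example["table"][r][c]["value"]
               for r, c in example["highlighted_cells"]]
print(highlighted)  # ['2020', 'ToTTo']
```

The highlighted cell values plus the table are the model input; the annotated final sentence is the target for the controlled generation task.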
