UniLM - Unified Language Model Pre-training / Pre-training for NLP and Beyond
UniLM

We develop pre-trained models for natural language understanding (NLU) and natural language generation (NLG) tasks.

The UniLM family:

UniLM: unified pre-training for language understanding and generation

MiniLM (new): small pre-trained models for language understanding and generation

LayoutLM (new): multimodal (text + layout/format + image) pre-training for document understanding (e.g., scanned documents and PDFs)

s2s-ft (new): sequence-to-sequence fine-tuning toolkit
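The unifying idea behind UniLM-style pre-training is that one Transformer can serve both understanding and generation by switching self-attention masks: for sequence-to-sequence tasks, source tokens attend bidirectionally while target tokens attend causally. The sketch below is an illustrative reconstruction of that mask in numpy, not code from this repository; the function name and convention (`1` = "may attend") are my own.

```python
import numpy as np

def seq2seq_attention_mask(src_len: int, tgt_len: int) -> np.ndarray:
    """Build a UniLM-style sequence-to-sequence self-attention mask.

    Source tokens attend bidirectionally within the source segment;
    target tokens attend to the full source plus themselves and
    earlier target tokens (left-to-right). mask[i, j] == 1 means
    position i may attend to position j.
    """
    n = src_len + tgt_len
    mask = np.zeros((n, n), dtype=np.int64)
    # Every position may see the whole source segment.
    mask[:, :src_len] = 1
    # Target positions additionally see earlier (and own) target positions.
    mask[src_len:, src_len:] = np.tril(
        np.ones((tgt_len, tgt_len), dtype=np.int64)
    )
    return mask

mask = seq2seq_attention_mask(3, 2)
```

With a full-ones mask the same model behaves like a bidirectional encoder (BERT-style), and with a fully lower-triangular mask like a left-to-right LM, which is why the three pre-training objectives can share one set of parameters.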

News

Release

***** New (February 2020): UniLM v2 | MiniLM v1 | LayoutLM v1 | s2s-ft v1 release *****

***** October 1st, 2019: UniLM v1 release *****

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the transformers project.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using UniLM, please submit a GitHub issue.

For other communications related to UniLM, please contact Li Dong (lidong1@microsoft.com), Furu Wei (fuwei@microsoft.com).
