cleanlab Examples
This repo contains code examples that demonstrate how to use cleanlab with specific real-world models/datasets, how its underlying algorithms work, how to get better results via advanced functionality, and how to train certain models used in some cleanlab tutorials.
To quickly learn how to run cleanlab on your own data, first check out the quickstart tutorials before diving into the examples below.
Table of Contents
| | Example | Description |
|---|---|---|
| 1 | find_label_errors_iris | Find label errors introduced into the Iris classification dataset. |
| 2 | classifier_comparison | Use CleanLearning to train 10 different classifiers on 4 dataset distributions with label errors. |
| 3 | hyperparameter_optimization | Hyperparameter optimization to find the best settings of CleanLearning's optional parameters. |
| 4 | simplifying_confident_learning | Straightforward implementation of Confident Learning algorithm with raw numpy code. |
| 5 | visualizing_confident_learning | See how cleanlab estimates parameters of the label error distribution (noise matrix). |
| 6 | find_tabular_errors | Handle mislabeled tabular data to improve an XGBoost classifier. |
| 7 | cnn_mnist | Finding label errors in MNIST image data with a Convolutional Neural Network. |
| 8 | huggingface_keras_imdb | CleanLearning for text classification with a Keras model, a pretrained BERT backbone, and a TensorFlow Dataset. |
| 9 | fasttext_amazon_reviews | Finding label errors in Amazon Reviews text dataset using a cleanlab-compatible FastText model. |
| 10 | multiannotator_cifar10 | Iteratively improve consensus labels and trained classifier from data labeled by multiple annotators. |
| 11 | active_learning_multiannotator | Improve model performance by iteratively collecting additional labels from annotators. This active learning pipeline allows for examples labeled in batches by multiple annotators. |
| 12 | outlier_detection_cifar10 | Train AutoML for image classification and use it to detect out-of-distribution images. |
| 13 | multilabel_classification | Find label errors in an image tagging dataset (CelebA) using a PyTorch model you can easily train for multi-label classification. |
| 14 | entity_recognition | Train Transformer model for Named Entity Recognition and produce out-of-sample pred_probs for cleanlab.token_classification. |
| 15 | transformer_sklearn | How to use KerasWrapperModel to make any Keras model sklearn-compatible, demonstrated here for a BERT Transformer. |
| 16 | cnn_coteaching_cifar10 | Train a Convolutional Neural Network on noisily labeled CIFAR-10 image data using cleanlab with co-teaching. |
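Several of the examples above (notably simplifying_confident_learning and visualizing_confident_learning) revolve around the Confident Learning idea of counting examples whose predicted probability for some class exceeds a per-class threshold. The sketch below illustrates that counting step in plain Python on made-up toy data. It is a simplified illustration, not cleanlab's implementation: the function names `class_thresholds` and `confident_joint` are hypothetical, the labels and probabilities are invented, and cleanlab's own routines apply further calibration and pruning steps that are omitted here.

```python
def class_thresholds(labels, pred_probs):
    """Per-class threshold: mean predicted probability of class k
    over the examples whose given label is k."""
    num_classes = len(pred_probs[0])
    thresholds = []
    for k in range(num_classes):
        probs_k = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        # Assumption of this sketch: a class with no labeled examples
        # gets an unreachable threshold, so nothing is counted for it.
        thresholds.append(sum(probs_k) / len(probs_k) if probs_k else float("inf"))
    return thresholds


def confident_joint(labels, pred_probs):
    """Count (given label, likely true label) pairs and collect the indices
    of examples whose likely true label differs from the given label."""
    num_classes = len(pred_probs[0])
    thresholds = class_thresholds(labels, pred_probs)
    joint = [[0] * num_classes for _ in range(num_classes)]
    flagged = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose predicted probability clears that class's threshold.
        candidates = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if not candidates:
            continue  # the model is not confident about any class here
        likely = max(candidates, key=lambda k: p[k])
        joint[y][likely] += 1
        if likely != y:
            flagged.append(i)
    return joint, flagged


# Invented toy data: 6 examples, 2 classes, out-of-sample predicted probabilities.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.20, 0.80],  # labeled 0, but the model confidently predicts 1
    [0.10, 0.90],
    [0.30, 0.70],
    [0.85, 0.15],  # labeled 1, but the model confidently predicts 0
]

joint, flagged = confident_joint(labels, pred_probs)
print("confident joint:", joint)
print("candidate label issues:", flagged)
```

The off-diagonal entries of `joint` count examples that are confidently assigned to a class other than their given label, which is why those examples are the candidates for review.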
Instructions
To run the latest example notebooks, execute the commands below, which install the required libraries in a virtual environment.
$ python -m pip install virtualenv
$ python -m venv cleanlab-examples # creates a new venv named cleanlab-examples
$ source cleanlab-examples/bin/activate
$ python -m pip install -r requirements.txt
Alternatively, you can install only the dependencies required for a specific example by running pip install -r requirements.txt inside the subfolder for that example (each example's subfolder contains a separate requirements.txt file).
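If the install appears to have gone to the wrong place, a quick check is whether the running interpreter actually belongs to the activated virtual environment. A minimal sketch using only the standard library; the helper name `in_virtual_environment` is hypothetical and not part of this repository.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtual_environment())
```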
It is recommended to run the examples with the latest stable cleanlab release (pip install cleanlab).
However, be aware that notebooks in the master branch of this repository are assumed to correspond to the master branch version of cleanlab, so some very recently added examples may instead require the developer version of cleanlab (pip install git+https://github.com/cleanlab/cleanlab.git).
To see the examples corresponding to a specific version of cleanlab, check out the Tagged Releases of this repository (e.g. the examples for cleanlab v2.1.0 are here).
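To choose the matching tagged release, it helps to know which cleanlab version is installed in the current environment. A minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the only assumption about cleanlab is that it is installed under the distribution name `cleanlab`, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="cleanlab"):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


cleanlab_version = installed_version()
if cleanlab_version is None:
    print("cleanlab is not installed in this environment")
else:
    print("installed cleanlab version:", cleanlab_version)
```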
Running all examples
You may run the notebooks individually or run the bash script below, which executes and saves each notebook (for examples 1-7). Before executing the script for the first time, you will need to create a jupyter kernel named cleanlab-examples. Be sure that you have already created and activated the virtual environment (steps provided above) before running the following command to create the jupyter kernel.
$ python -m ipykernel install --user --name=cleanlab-examples
Bash script to run all notebooks:
$ bash ./run_all_notebooks.sh
Older Examples
To run older versions of cleanlab, look at the Tagged Releases of this repository to see the corresponding older versions of the example notebooks (e.g. the examples for cleanlab v2.0.0 are here).
See the contrib folder for examples from v1 of cleanlab, which may be helpful for reproducing results from the Confident Learning paper.
License
Copyright (c) 2017-2023 Cleanlab Inc.
All files listed above and contained in this folder (https://github.com/cleanlab/examples) are part of cleanlab.
cleanlab is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
cleanlab is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License in LICENSE.