# model-parallelism
Here are 18 public repositories matching this topic...
- **Colossal-AI: A Unified Deep Learning System for Large-Scale Parallel Training**
  Topics: deep-learning, hpc, large-scale, data-parallelism, model-parallelism, distributed-training, pipeline-parallelism
  Updated Mar 10, 2022 - Python
- **A GPipe implementation in PyTorch**
  Updated Sep 18, 2020 - Python
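GPipe-style pipeline parallelism splits a model into sequential stages and streams micro-batches through them so the stages can overlap work. A minimal pure-Python sketch of the idea (the stage functions and micro-batch loop are illustrative stand-ins, not the repository's API):

```python
# Pipeline-parallel sketch: two "stages" process a batch as micro-batches.
# In a real GPipe setup each stage lives on its own GPU; here stages are
# plain functions so only the scheduling idea is visible.

def stage0(x):          # first half of the model (illustrative)
    return x * 2

def stage1(x):          # second half of the model (illustrative)
    return x + 1

def pipeline_forward(batch, n_micro=4):
    """Split `batch` into micro-batches and push each through both stages.

    With real devices, stage0 would start micro-batch i+1 while stage1
    is still busy with micro-batch i, hiding the inter-stage latency.
    """
    size = len(batch) // n_micro
    micro = [batch[i * size:(i + 1) * size] for i in range(n_micro)]
    out = []
    for mb in micro:
        acts = [stage0(x) for x in mb]        # stage 0 (device 0)
        out.extend(stage1(a) for a in acts)   # stage 1 (device 1)
    return out

print(pipeline_forward(list(range(8))))  # [1, 3, 5, 7, 9, 11, 13, 15]
```

The smaller the micro-batches, the better the stages overlap, at the cost of per-micro-batch overhead.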
- **Paddle Distributed Training Examples (PaddlePaddle distributed training examples): ResNet, BERT, GPT, MoE, DataParallel, ModelParallel, PipelineParallel, HybridParallel, AutoParallel, Zero Sharding, Recompute, GradientMerge, Offload, AMP, DGC, LocalSGD, Wide&Deep**
  Topics: benchmark, cloud, lightning, elastic, unsupervised-learning, large-scale, data-parallelism, paddlepaddle, model-parallelism, distributed-algorithm, self-supervised-learning, pipeline-parallelism, pretraining, fleet-api, paddlecloud
  Updated Feb 16, 2022 - Shell
- **Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed giant model training.**
  Updated Mar 9, 2022 - Python
- **LiBai: A Toolbox for Large-Scale Distributed Parallel Training**
  Topics: nlp, deep-learning, transformer, large-scale, data-parallelism, model-parallelism, distributed-training, self-supervised-learning, oneflow, pipeline-parallelism
  Updated Mar 10, 2022 - Python
- **PyTorch implementation of a 3D U-Net with model parallelism across 2 GPUs for large models**
  Updated Aug 9, 2020 - Python
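Splitting a large model's layers across two devices is the core pattern in entries like this one. A hardware-free sketch of the idea (the "devices", layers, and `transfer` helper are placeholders; a real PyTorch version would place submodules with `.to('cuda:0')` / `.to('cuda:1')` and move activations between them):

```python
# Model-parallel sketch: layers are assigned to two simulated devices.
# The only extra step versus single-device execution is the explicit
# transfer of activations at the device boundary.

def transfer(activations, src, dst):
    """Stand-in for a device-to-device copy (e.g. tensor.to('cuda:1'))."""
    return list(activations)  # copy crossing the simulated boundary

# Layer placement: first two layers on "gpu0", last two on "gpu1".
gpu0_layers = [lambda v: [x + 1 for x in v], lambda v: [x * 3 for x in v]]
gpu1_layers = [lambda v: [x - 2 for x in v], lambda v: [x * x for x in v]]

def forward(inputs):
    acts = inputs
    for layer in gpu0_layers:               # runs on device 0
        acts = layer(acts)
    acts = transfer(acts, "gpu0", "gpu1")   # cross the boundary once
    for layer in gpu1_layers:               # runs on device 1
        acts = layer(acts)
    return acts

print(forward([1, 2]))  # ((x+1)*3 - 2) ** 2 -> [16, 49]
```

Because the two halves run sequentially on one input, this plain split trades memory for idle time; pipelining micro-batches (as in GPipe) recovers the lost utilization.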
- **Development of Project HPGO | Hybrid Parallelism Global Orchestration**
  Topics: rust, machine-learning, tensorflow, pytorch, data-parallelism, model-parallelism, distributed-training, pipedream, gpipe, pipeline-parallelism
  Updated Mar 26, 2021
- **Torch Automatic Distributed Neural Network (TorchAD-NN) training library. Built on top of TorchMPI, this module automatically parallelizes neural network training.**
  Topics: machine-learning, neural-network, torch7, openmpi, data-parallelism, model-parallelism, distributed-machine-learning
  Updated Feb 28, 2018 - Lua
- **Distributed TensorFlow (model parallelism) example repository**
  Updated Jul 13, 2019 - Python
- **A decentralized and distributed framework for training DNNs**
  Updated Aug 25, 2019 - Python
- **A simple graph partitioning algorithm written in Go. Designed for partitioning neural networks across multiple devices, where crossing a device boundary incurs an added cost.**
  Updated Jun 17, 2020 - Go
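The boundary-cost idea can be sketched with a tiny greedy partitioner. This is an assumption-laden illustration, not the Go repository's actual algorithm: the graph format, the capacity constraint, and the cost model are all made up for the example.

```python
# Greedy two-way partition of a layer graph that penalizes edges
# crossing the device boundary. Nodes are layers; an edge weight
# approximates the activation volume copied if the edge is cut.

def cut_cost(assign, edges):
    """Total weight of edges whose endpoints sit on different devices."""
    return sum(w for u, v, w in edges if assign[u] != assign[v])

def greedy_partition(nodes, edges, cap=2):
    """Place each node on device 0 or 1, greedily minimizing the cost
    of newly cut edges, with at most `cap` nodes per device."""
    assign, load = {}, {0: 0, 1: 0}
    for n in nodes:
        best = None
        for dev in (0, 1):
            if load[dev] >= cap:
                continue  # device full: respect the balance constraint
            # cost of edges between n and already-placed neighbors
            c = sum(w for u, v, w in edges
                    if (u == n and v in assign and assign[v] != dev)
                    or (v == n and u in assign and assign[u] != dev))
            if best is None or c < best[1]:
                best = (dev, c)
        assign[n] = best[0]
        load[best[0]] += 1
    return assign

nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 5), ("b", "c", 1), ("c", "d", 5)]
assign = greedy_partition(nodes, edges)
print(assign, cut_cost(assign, edges))  # cuts only the cheap b-c edge
```

Greedy placement is a simple baseline; production partitioners typically use multilevel heuristics (e.g. Kernighan-Lin refinement) for the same objective.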
- **An MPI-based distributed model parallelism technique for MLPs**
  Updated Jun 10, 2020 - C
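Intra-layer model parallelism for an MLP typically shards a layer's weight matrix by output neuron: each rank computes its slice of the output, and a gather reassembles the full activation. A pure-Python sketch of the math (no MPI here; the "ranks" are loop iterations and the sizes are invented for the example):

```python
# Shard a linear layer's weights by output neuron across two "ranks".
# Each rank holds half the rows of W and computes half the outputs;
# concatenating the partial outputs stands in for MPI_Allgather.

def matvec(W, x):
    """Dense matrix-vector product over plain lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

# Full 4x3 weight matrix and an input vector (illustrative values).
W = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1],
     [1, 1, 1]]
x = [2, 3, 4]

# Row-shard W across 2 ranks: rank 0 gets rows 0-1, rank 1 gets rows 2-3.
shards = [W[:2], W[2:]]
partial = [matvec(shard, x) for shard in shards]  # each rank's local output
y = partial[0] + partial[1]                       # stand-in for all-gather

assert y == matvec(W, x)  # sharded result matches the unsharded layer
print(y)  # [2, 3, 4, 9]
```

The input vector is replicated on every rank; only the weights (and therefore memory and compute) are divided.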
- **Materials from an internship at ALCF during the summer of 2019**
  Updated Aug 15, 2019 - Python
- **A fully distributed hyperparameter optimization tool for PyTorch DNNs.**
  Updated Jan 12, 2022 - Python
- **Mesh TensorFlow: Model Parallelism Made Easier**
  Updated Dec 18, 2018 - Python
- **Performance test of MNIST handwriting classification using MXNet + TF**
  Topics: python, mxnet, tensorflow, keras, mnist, classification, gluon, multi-gpu, model-parallelism, horovod, multi-gpu-training, mirrored-strategy
  Updated Jan 31, 2020 - Python
Hi,

I have tried both `loss.backward()` and `model_engine.backward(loss)` in my code and observed several subtle differences. For one, `retain_graph=True` does not work with `model_engine.backward(loss)`. This is creating a problem for me, since the buffers are not being retained between runs for some reason.

Please look into this if you can.