3,576 issues in microsoft/DeepSpeed
Fix the bug AttributeError: 'BF16_Optimizer' object has no attribute 'bit16_groups' when using bf16 + ZeRO-1 + pipeline parallelism to train a model and save the bf16 optimizer state
Describe the bug It seems the latest master branch has some problems with the inference kernels when running with multiple GPUs. After testing several models, including OPT/LLaMA/Vicuna, none of them can ...
Describe the Question I just noticed the code in https://github.com/HuangLK/llama-deepspeed/blob/faedea514b11c18c695e1b2a6adb63b102ef001c/models/llama_pipeline_model.py#L174. It appears that the code is utilizing ...
Describe the bug Unable to run the deepspeed comm TestDistInitNoEnv test with 0.9.3 onwards; getting KeyError: Rank. To Reproduce Run the comm tests from the DeepSpeed repo with deepspeed 0.9.3 or higher: pytest ...
Describe the bug Create a model with deepspeed.zero.Init(), then train with ZeRO stage 2. Got the following exception. I debugged a bit and found that, in initialize_gradient_partition(), current_index ...
Describe the bug Lightning version (pytorch-lightning 2.0.2), deepspeed version (0.9.4). Trained chatglm, but got a weight-and-input-on-different-devices exception: the weight is on CPU, while the input is on GPU. ...
https://github.com/microsoft/DeepSpeed/blob/cd911f9ab2213edb0c8781bd5fd604c37c020dfb/deepspeed/runtime/pipe/engine.py#L470-L478 In the else block, output[0] is of type list, so for idx, out in outputs and ...