New Feature
- Expose and increase default sync concurrency (#60)
- Treat invalid Pod caused by network error as PodCreationUnknownError (#61)
Full Commit History Since Previous Release
New Feature
- Enrich PodSpecError to fail the Pod early (#52)
Bug Fix
- Fix invalid JSON in log caused by fmt (MISSING) (#49)
- Be aware of UID changes during Update events and Sync (#51)
Full Commit History Since Previous Release
New Feature
- Support large scale Framework by LargeFrameworkCompression (#44)
- Add PodNodeName to help track failures on node before PodIP is available (#45)
Full Commit History Since Previous Release
New Feature
- Add example to leverage HivedScheduler to achieve GPU Topology-Aware, Multi-Tenant, Priority and Gang Scheduling (#34)
- Support to expose Framework and Pod history snapshots to external systems (#31)
- Support to classify and summarize Pod failures (#41)
- Support to tune Framework Consistency vs Availability (#43)
  This helps to avoid a Pod being stuck in deleting forever, for example if its Node is down forever.
- Support Stop Framework (#24)
  This helps to stop the Framework without deleting it.
- Still sync Task after FrameworkAttempt Completing (#27)
  This helps to make sure all Tasks in the Framework are updated to the right completed status when the whole FrameworkAttempt is completed.
- Support FrameworkCompletedRetainSec (#37)
  This helps to automatically delete the Framework after it has been completed for a long time, to free etcd space.
- Add FrameworkAttemptPreparing State (#12)
  This helps to distinguish whether at least one Task of the current attempt has ever entered the TaskAttemptRunning state. If not, the attempt is FrameworkAttemptPreparing rather than FrameworkAttemptRunning.
- Redefine FrameworkAttemptRunning and Record attempt running start time (#35)
  This helps to measure Framework and Task pure running duration.
- Support Pod Template Placeholders (#21)
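The FrameworkAttemptPreparing and FrameworkCompletedRetainSec notes above each describe a small decision rule. The following is a minimal Python sketch of those two rules as worded in the notes, not the controller's actual Go code. The function names, parameters, and time representation (seconds as floats) are assumptions made for illustration.

```python
from typing import Iterable, Optional

# State names are taken from the release note; representing them as
# strings is an assumption of this sketch.
ATTEMPT_PREPARING = "FrameworkAttemptPreparing"
ATTEMPT_RUNNING = "FrameworkAttemptRunning"


def attempt_state(task_ever_ran: Iterable[bool]) -> str:
    """Return Running once any Task of the current attempt has ever
    entered TaskAttemptRunning; otherwise the attempt is still Preparing."""
    return ATTEMPT_RUNNING if any(task_ever_ran) else ATTEMPT_PREPARING


def should_delete_completed(
    completion_time: Optional[float], now: float, retain_sec: float
) -> bool:
    """Return True when a completed Framework has outlived its retention
    window (hypothetical helper; all times are in seconds)."""
    if completion_time is None:  # not completed yet, so never eligible
        return False
    return now - completion_time >= retain_sec
```

For example, `attempt_state([False, True])` yields `FrameworkAttemptRunning`, while a Framework completed at `t=50` with a 60 second retention is not eligible for deletion at `t=100`.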
Bug Fix
- Fix TaskCompleted may transition to TaskAttemptCompleted (#10)
- Fix fExpectedStatusInfos map race condition (#18)
Misc
- Upgrade to kubernetes-1.14.2 (#16)
- Remove Internal and External CompletionTypeAttribute (#22 #41)
  This is because FrameworkController does not need to be aware of it, which leaves that freedom to the controller wrapper.
- Upgrade to golang 1.12.6 (#29)
- Switch to klog (#30)
Full Commit History Since Previous Release
yqwang-ms released this
Breaking Change
Refine Doc and Example
yqwang-ms released this
Add Distributed TensorFlow Training Example
Feature
- Support both GPU and CPU Distributed Training
- Automatically clean up PS when the whole FrameworkAttempt is completed
- No need to adjust existing TensorFlow image
- No need to set up Kubernetes DNS and Kubernetes Service
- Common Feature
Prerequisite
- Need to set up Kubernetes GPU, if you need GPU Training
- Need to set up Kubernetes Cluster-Level Logging, if you need to persist and expose the log for deleted Pods
Quick Start
Support FrameworkBarrier for GangExecution
Feature
It is usually used as the InitContainer to provide a simple way to:
- Do Gang Execution without resource deadlock
- Start the AppContainers in the Pod only after its PodUID is persisted by FrameworkController
- Inject peer-to-peer service discovery information into the AppContainers
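The barrier behaviour listed above can be illustrated with a small Python sketch. It is only a model of the idea, assuming a simplified view of each Task. The `TaskView` type, the `barrier_passed` and `peer_env` helpers, and the `PEER_<ROLE>_IPS` variable naming are all hypothetical and are not FrameworkBarrier's real interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TaskView:
    # Hypothetical, simplified view of one Task as a barrier might observe it.
    role: str
    pod_uid: Optional[str] = None  # set once the Pod's UID has been persisted
    pod_ip: Optional[str] = None   # set once the Pod has been assigned an IP


def barrier_passed(tasks: List[TaskView]) -> bool:
    """Pass only when every Task has both a persisted PodUID and a PodIP."""
    return bool(tasks) and all(t.pod_uid and t.pod_ip for t in tasks)


def peer_env(tasks: List[TaskView]) -> Dict[str, str]:
    """Group peer IPs by role into environment variables.
    Intended to be called only after barrier_passed(tasks) is True."""
    by_role: Dict[str, List[str]] = {}
    for t in tasks:
        by_role.setdefault(t.role, []).append(t.pod_ip or "")
    return {f"PEER_{role.upper()}_IPS": ",".join(ips) for role, ips in by_role.items()}
```

In this model an init container would poll until `barrier_passed` returns `True`, then hand the result of `peer_env` to the AppContainers, so no application process starts while any peer is still unscheduled.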
Quick Start
yqwang-ms released this
Initial FrameworkController
General-Purpose Kubernetes Pod Controller
https://github.com/Microsoft/frameworkcontroller
FrameworkController is built to orchestrate all kinds of applications on Kubernetes by a single controller.
See more in README.