NVIDIA Resiliency Extension

The NVIDIA Resiliency Extension (NVRx) integrates multiple resiliency-focused solutions for PyTorch-based workloads.

Figure highlighting core NVRx features including automatic restart, hierarchical checkpointing, fault detection and health checks

Core Components and Capabilities

Installation

From sources

  • git clone https://github.com/NVIDIA/nvidia-resiliency-ext
  • cd nvidia-resiliency-ext
  • pip install .

From PyPI wheel

  • pip install nvidia-resiliency-ext
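After either install path, one way to confirm that the package landed in the active environment is to query the installed distribution metadata. The snippet below is a minimal sketch that uses only the Python standard library. The helper name `installed_version` is hypothetical and not part of NVRx; the distribution name is taken from the `pip install` command above.

```python
from importlib import metadata


def installed_version(dist_name="nvidia-resiliency-ext"):
    """Return the installed version string for dist_name, or None if it is absent.

    Hypothetical helper for illustration; it only reads pip's metadata and
    does not import or exercise the package itself.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("nvidia-resiliency-ext is not installed in this environment")
    else:
        print(f"nvidia-resiliency-ext {version} is installed")
```

A `None` result usually means the install went into a different interpreter or virtual environment than the one running the check.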

Platform Support

Category | Supported Versions / Requirements
--- | ---
Architecture | x86_64, arm64
Operating System | Ubuntu 22.04, 24.04
Python Version | >= 3.10, < 3.13
PyTorch Version | >= 2.3.1 (injob & chkpt), >= 2.5.1 (inprocess)
CUDA & CUDA Toolkit | >= 12.5 (12.8 required for GPU health check)
NVML Driver | >= 535 (570 required for GPU health check)
NCCL Version | >= 2.21.5 (injob & chkpt), >= 2.26.2 (inprocess)
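The Python and PyTorch requirements listed above can be checked before installing. The following is a minimal sketch, not an official NVRx utility: the function names and the `MIN_TORCH` mapping are hypothetical, and the version bounds are copied from the requirements above. It uses only the standard library and deliberately ignores pre-release and local version suffixes such as `+cu121`.

```python
import re
import sys
from importlib import metadata

# Minimum PyTorch versions per component, copied from the requirements above.
# The dictionary and its keys are illustrative, not an NVRx API.
MIN_TORCH = {"injob": (2, 3, 1), "chkpt": (2, 3, 1), "inprocess": (2, 5, 1)}


def parse_version(text):
    """Return the leading numeric release of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unparseable version: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def python_supported(version_info=sys.version_info):
    """True when the interpreter satisfies >= 3.10, < 3.13."""
    return (3, 10) <= tuple(version_info[:2]) < (3, 13)


def torch_supported(component, torch_version):
    """True when torch_version meets the minimum listed for the component."""
    return parse_version(torch_version) >= MIN_TORCH[component]


if __name__ == "__main__":
    print("Python OK:", python_supported())
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        print("torch is not installed")
    else:
        for component in MIN_TORCH:
            print(component, "OK:", torch_supported(component, installed))
```

The CUDA, driver, and NCCL requirements are left out of this sketch because checking them depends on how the system exposes those versions.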

Usage

For detailed documentation and usage information about each component, see the documentation at https://nvidia.github.io/nvidia-resiliency-ext/.

About

NVIDIA Resiliency Extension is a Python package that framework developers and users can use to implement fault-tolerant features. It increases effective training time by minimizing downtime caused by failures and interruptions.
