[examples] Add WideEP DP group fault tolerance example #54

Open
jeffreywang-anyscale wants to merge 8 commits into main from wideep-dp-group-ft

@jeffreywang-anyscale

No description provided.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
jeffreywang-anyscale and others added 7 commits April 1, 2026 21:53
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
g6.12xlarge (4x L4 GPUs) has 192 GiB RAM, not 32 GiB.
The incorrect value caused free pod shape validation to fail
on deployment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Nishihara <rkn@anyscale.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Nishihara <rkn@anyscale.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Nishihara <rkn@anyscale.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Nishihara <rkn@anyscale.com>
- Rewrite README to match the pattern of other examples (Install CLI,
  Clone, Deploy, Query, Understanding, Shutdown)
- Split into two clear demos: autoscaling service and fault tolerance job
- Add job.yaml so fault_tolerance_demo.py can be run as an Anyscale job
  from a laptop without needing direct cluster access

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Nishihara <rkn@anyscale.com>
Use the deployed service + console terminal to demonstrate fault
tolerance, which is simpler and more visual than submitting a job.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Robert Nishihara <rkn@anyscale.com>

2 participants