Symmetric Ray Run CLI: A Torchrun-Style Approach to Distributed Computing
Hey guys! Let's dive into a cool way to handle distributed computing with Ray, making it feel more like the familiar torchrun. We're going to break down the challenges, the solution, and why this approach can make your life easier. So, buckle up and let's get started!
The Challenge: Simplifying Ray Cluster Management
Currently, using Ray for distributed computing can feel a bit clunky. You need separate commands for the head node and worker nodes, and then yet another command to actually submit your job. It's like juggling three balls at once! Think about it: with Ray, you might end up doing this:
# head node, terminal 1:
ray start --block
# worker node, terminal 1:
ray start --block --address='ip:6379'
# head node, terminal 2:
vllm serve -tp 16
This contrasts sharply with simpler setups like torchrun or mpirun, where you run a single command on every node and pass per-node parameters such as the rank. The goal here is to collapse Ray's three-step dance (head node, worker nodes, job submission) into one command, which makes the whole workflow easier to reason about and less error-prone.
Consider the elegance of the SGLang approach:
# On all nodes, run this (and set rank to 0 or 1)
# This will setup distributed comms and start the model server
python3 -m sglang.launch_server --tp 16 --dist-init-addr ip:20000 --nnodes 2 --node-rank ?
This style plays nicely with utilities like xpanes, making parallel execution a breeze:
xpanes -c "NCCL_SOCKET_IFNAME=bond0 GLOO_SOCKET_IFNAME=bond0 python3 -m sglang.launch_server --tp 16 --dist-init-addr ip:20000 --nnodes 2 --node-rank {}" 0 1
It's also similar to how torchrun works with SLURM, where rank and other variables are pulled from the SLURM runtime:
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000
GPUS_PER_NODE=8
torchrun --nproc_per_node $GPUS_PER_NODE \
    --nnodes $SLURM_NNODES \
    --node_rank $SLURM_PROCID \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    $YOUR_SCRIPT
In short: juggling separate head, worker, and job-submission commands adds operational overhead and room for error. Aligning Ray's cluster setup with the torchrun model would remove most of that friction.
Core Requirements for a Solution
To make things smoother, we need a solution that ticks these boxes:
- One Command to Rule Them All: We need a single command that starts the cluster and kicks off program execution on every node. It should accept the input each node needs, most importantly the cluster address that every node connects to.
- Lifecycle Harmony: The cluster's lifecycle should be tied to the program, just like ray.init() on a single node: the cluster comes up when the command starts and is torn down when the command exits. Today's manual ray start / ray stop workflow makes it easy to leave orphaned clusters (and their resources) behind. See the single-node sketch right after this list for the behavior we want to reproduce across a whole cluster.
- Environmentally Aware: Environment variables should pass straight through to the executing program, so settings like communication backends (NCCL_SOCKET_IFNAME and friends) don't require separate per-node configuration and can't silently drift out of sync.
- Speed Matters: Cluster startup should be lightning-fast. No one wants to wait around for ages before their program starts running!
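For reference, this is the single-node lifecycle that the Lifecycle Harmony requirement wants to reproduce across a whole cluster. The sketch below only uses the standard Ray Python API (ray.init, ray.remote, ray.shutdown); nothing in it is specific to the proposal.
# single_node.py: the cluster exists exactly as long as the program does.
import ray

ray.init()  # a local Ray instance comes up here, scoped to this process

@ray.remote
def square(x):
    return x * x

print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]

ray.shutdown()  # ...and it is torn down when the script finishes (also happens implicitly at exit)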
Enter Symmetric Ray Run: A Torchrun-Style Solution
The solution? A symmetric_run command that brings the torchrun philosophy to Ray. Imagine this:
# Run this same command on every node; the driver script only executes on the head node
python -m ray.scripts.symmetric_run --address=ip:port -- python script.py --args --args2
This single command does it all: it starts the Ray cluster, connects the nodes, runs your script, and cleans up when the script exits. Because it mirrors how torchrun behaves, the workflow feels immediately familiar to anyone coming from distributed training frameworks.
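To make the idea concrete, here is a rough sketch of what a wrapper like this could do under the hood. To be clear, this is not the actual implementation: the head-election logic, flag handling, and helper names are assumptions for illustration, and the only Ray pieces it relies on are the existing ray start / ray stop CLI commands.
# symmetric_run_sketch.py (hypothetical): run the same command on every node;
# the node whose IP matches --address becomes the head and runs the driver.
import argparse
import socket
import subprocess
import sys

def is_local_address(host: str) -> bool:
    # Best-effort check: does `host` resolve to this machine?
    local_ips = {"127.0.0.1", socket.gethostbyname(socket.gethostname())}
    return socket.gethostbyname(host) in local_ips

def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--address", required=True, help="head_ip:port shared by all nodes")
    parser.add_argument("command", nargs=argparse.REMAINDER, help="driver command after --")
    args = parser.parse_args()
    host, port = args.address.split(":")
    driver = [token for token in args.command if token != "--"]

    if is_local_address(host):
        # Head node: bring up the cluster, run the driver, then tear everything down.
        # (A real implementation would also wait for the expected workers to join.)
        subprocess.run(["ray", "start", "--head", f"--port={port}"], check=True)
        try:
            return subprocess.run(driver).returncode
        finally:
            subprocess.run(["ray", "stop"], check=False)
    else:
        # Worker node: join the cluster and block until the cluster goes away.
        return subprocess.run(["ray", "start", "--block", f"--address={args.address}"]).returncode

if __name__ == "__main__":
    sys.exit(main())
The point of the sketch is the symmetry: every node runs the exact same line, the rendezvous address doubles as the head-election mechanism, and the cluster's lifetime is bounded by the driver process.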
Flexibility with Environment Variables
Need to set some environment variables? No problem:
NCCL_SOCKET_IFNAME=bond0 GLOO_SOCKET_IFNAME=bond0 python -m ray.scripts.symmetric_run --head-ip-address=... -- python script.py --script-arg1 --script-arg2
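One nuance worth calling out: the variables in that invocation are guaranteed to reach the driver process, but whether symmetric_run would also propagate them to every Ray worker is an open design question. As a purely illustrative sketch (script.py and the chosen variable names are assumptions), a driver can always forward them explicitly using Ray's existing runtime_env feature:
# Inside script.py (hypothetical driver): forward selected variables to all workers.
import os
import ray

forwarded = {name: os.environ[name]
             for name in ("NCCL_SOCKET_IFNAME", "GLOO_SOCKET_IFNAME")
             if name in os.environ}

# runtime_env={"env_vars": ...} is an existing Ray capability; symmetric_run
# itself might or might not do this automatically.
ray.init(runtime_env={"env_vars": forwarded})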
Power of Xpanes
And for those who love the power of xpanes:
xpanes -c \
"NCCL_SOCKET_IFNAME=bond0 GLOO_SOCKET_IFNAME=bond0 python -m ray.scripts.symmetric_run --head-ip-address=... \
'vllm serve --model ...'"
Because the environment variables are set right in the invocation, there is no per-node environment configuration to keep in sync, and the single-command design means xpanes can fan the exact same line out to every node. One interface handles both cluster management and job execution, which is precisely what keeps the operational overhead low.
Benefits of Symmetric Ray Run
- Simplified Workflow: Cluster startup and job execution collapse into a single symmetric_run invocation, which removes the error-prone multi-step process and lowers the barrier for anyone new to distributed computing.
- Reduced Overhead: Because the cluster's lifecycle is tied to the program, it starts when the command is invoked and shuts down when the program completes, so there are no orphaned clusters or forgotten resources to clean up by hand.
- Enhanced Flexibility: Environment variables flow straight through to the application, and the single-command design pairs well with tools like xpanes for parallel execution across many nodes.
- Faster Startup: Cluster startup is kept as fast as possible, which matters most during iterative development and experimentation where you launch and relaunch constantly.
In Conclusion
The Symmetric Ray Run CLI brings the simplicity of torchrun to the Ray ecosystem: one command to spin up the cluster, run your code, and clean everything up afterwards. By hitting the core requirements above (a single command, a cluster lifecycle tied to the program, environment passthrough, and fast startup), symmetric_run lets you focus on your application instead of cluster plumbing. So, if you're looking for a more intuitive way to run distributed Ray workloads, it's definitely worth checking out!