diff --git a/.gitignore b/.gitignore
index 7ab517ff20..5398cb7bd1 100644
--- a/.gitignore
+++ b/.gitignore
@@ -98,3 +98,4 @@
 wandb/
 # checkpoints
 checkpoints/
+launcher_record
\ No newline at end of file
diff --git a/README_zh.md b/README_zh.md
index bd8846f4d6..fc2fde08ca 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -176,7 +176,7 @@ pip install -e ".[flash_attn]"
 [`uv`](https://github.com/astral-sh/uv) 是现代的 Python 包管理工具。
 
 ```bash
-uv sync --extra dev --extra flash_attn
+uv sync --extra dev --extra flash_attn -i https://mirrors.aliyun.com/pypi/simple/ --no-build-isolation
 ```
 
 ## 通过 PyPI 安装
@@ -193,6 +193,7 @@
 ```bash
 uv pip install trinity-rft
 uv pip install flash-attn==2.8.1
+uv pip install --verbose flash-attn -i https://mirrors.aliyun.com/pypi/simple/ --no-deps --no-build-isolation
 ```
 
 ## 使用 Docker
diff --git a/deps_incremental.txt b/deps_incremental.txt
new file mode 100644
index 0000000000..373d554376
--- /dev/null
+++ b/deps_incremental.txt
@@ -0,0 +1,2 @@
+loguru
+beast_logger
\ No newline at end of file
diff --git a/examples/agentscope_react/gsm8k.yaml b/examples/agentscope_react/gsm8k.yaml
index c1b79f7016..1ddfd504bc 100644
--- a/examples/agentscope_react/gsm8k.yaml
+++ b/examples/agentscope_react/gsm8k.yaml
@@ -7,7 +7,7 @@ algorithm:
   optimizer:
     lr: 1e-6
 model:
-  model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen3-8B}
+  model_path: '/mnt/data/model_cache/modelscope/hub/Qwen/Qwen/Qwen2___5-1___5B-Instruct'
   max_response_tokens: 16384
   max_model_len: 24576
 cluster:
diff --git a/examples/agentscope_react/gsm8k_agentopia.yaml b/examples/agentscope_react/gsm8k_agentopia.yaml
new file mode 100644
index 0000000000..4bbf1c6d4d
--- /dev/null
+++ b/examples/agentscope_react/gsm8k_agentopia.yaml
@@ -0,0 +1,67 @@
+project: AgentScope-ReAct
+name: GSM8K-Qwen3-8B
+checkpoint_root_dir: ${oc.env:TRINITY_CHECKPOINT_ROOT_DIR,./checkpoints}
+algorithm:
+  algorithm_type: multi_step_grpo
+  repeat_times: 8
+  optimizer:
+    lr: 1e-6
+model:
+  model_path: '/mnt/data/model_cache/modelscope/hub/Qwen/Qwen/Qwen2___5-1___5B-Instruct'
+  max_response_tokens: 16384
+  max_model_len: 24576
+cluster:
+  node_num: 1
+  gpu_per_node: 8
+buffer:
+  total_epochs: 1
+  batch_size: 32
+  train_batch_size: 256
+  explorer_input:
+    taskset:
+      name: gsm8k
+      storage_type: env_service
+      path: 'http://localhost:8080'
+      subset_name: 'appworld'
+      split: 'train'
+      format:
+        prompt_key: 'question'
+        response_key: 'answer'
+      rollout_args:
+        temperature: 1.0
+    default_workflow_type: 'agentopia_workflow'
+    eval_tasksets: []
+  trainer_input:
+    experience_buffer:
+      name: agentscope_gsm8k_buffer
+      storage_type: queue
+explorer:
+  eval_interval: 50
+  runner_per_model: 16
+  max_timeout: 1800
+  rollout_model:
+    engine_num: 4
+    tensor_parallel_size: 1
+    enable_prefix_caching: false
+    enforce_eager: true
+    enable_openai_api: true
+    enable_history: true
+    enable_auto_tool_choice: true
+    tool_call_parser: hermes
+    # reasoning_parser: deepseek_r1
+    enable_thinking: false
+    dtype: bfloat16
+    seed: 42
+synchronizer:
+  sync_style: dynamic_by_explorer
+  sync_method: 'nccl'
+  sync_interval: 2
+  sync_timeout: 1200
+trainer:
+  save_interval: 100
+  grad_clip: 1.0
+  use_dynamic_bsz: true
+  max_token_len_per_gpu: 24576
+  ulysses_sequence_parallel_size: 2
+monitor:
+  monitor_type: tensorboard
\ No newline at end of file
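The new `gsm8k_agentopia.yaml` above is launched like the other example configs. A minimal sketch, assuming the standard `trinity run` entrypoint and that the env service configured at `buffer.explorer_input.taskset.path` is already serving on port 8080:

```bash
# launch training with the new agentopia example config
trinity run --config examples/agentscope_react/gsm8k_agentopia.yaml
```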
diff --git a/examples/grpo_gsm8k/gsm8k.yaml b/examples/grpo_gsm8k/gsm8k.yaml
index b0640f089c..36868231ae 100644
--- a/examples/grpo_gsm8k/gsm8k.yaml
+++ b/examples/grpo_gsm8k/gsm8k.yaml
@@ -7,7 +7,7 @@ algorithm:
   optimizer:
     lr: 1e-5
 model:
-  model_path: ${oc.env:TRINITY_MODEL_PATH,Qwen/Qwen2.5-1.5B-Instruct}
+  model_path: '/mnt/data/model_cache/modelscope/hub/Qwen/Qwen/Qwen2___5-1___5B-Instruct'
   max_response_tokens: 1024
   max_model_len: 2048
 cluster:
@@ -50,6 +50,7 @@ explorer:
     engine_num: 2
     tensor_parallel_size: 1
     enable_prefix_caching: false
+    gpu_memory_utilization: 0.7
     enforce_eager: true
     dtype: bfloat16
     seed: 42
diff --git a/launcher_trinity.py b/launcher_trinity.py
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/note.md b/note.md
new file mode 100644
index 0000000000..f6158ccadc
--- /dev/null
+++ b/note.md
@@ -0,0 +1,12 @@
+
+
+/mnt/data_cpfs/qingxu.fu/trinity/trinity/explorer/workflow_runner.py
+run_task()
+
+-->
+/mnt/data_cpfs/qingxu.fu/trinity/examples/agentscope_react/gsm8k.yaml
+buffer.explorer_input.default_workflow_type
+-->
+
+/mnt/data_cpfs/qingxu.fu/trinity/trinity/common/workflows/agentscope/react/react_workflow.py
+run_async()
diff --git a/tests/utils/monitor_swanlab_test.py b/tests/utils/monitor_swanlab_test.py
new file mode 100644
index 0000000000..6c6259b600
--- /dev/null
+++ b/tests/utils/monitor_swanlab_test.py
@@ -0,0 +1,77 @@
+"""
+Simple smoke test for SwanlabMonitor.
+
+Run:
+    python tests/utils/monitor_swanlab_test.py
+
+What it does:
+- Ensures SWANLAB_API_KEY is read from environment (sets a dummy if missing).
+- Initializes SwanlabMonitor with minimal args.
+- Logs a small metric and closes the run.
+
+Notes:
+- If `swanlab` is not installed, this script will print a helpful message and exit.
+- The dummy API key is used only to exercise the login path; real authentication isn't required for this smoke test.
+"""
+
+import os
+import sys
+
+
+def main() -> int:
+    # Defer imports to keep error handling simple
+    try:
+        from trinity.utils.monitor import SwanlabMonitor
+    except Exception as e:
+        print("Failed to import SwanlabMonitor:", e)
+        return 1
+
+    # Ensure an env-based key path is exercised (uses dummy if not provided)
+    env_keys = ["SWANLAB_API_KEY", "SWANLAB_APIKEY", "SWANLAB_KEY", "SWANLAB_TOKEN"]
+    if not any(os.getenv(k) for k in env_keys):
+        os.environ["SWANLAB_API_KEY"] = "dummy_key_for_smoke_test"
+        print("Set SWANLAB_API_KEY to a dummy value to test env-based login path.")
+
+    # Try creating the monitor; if swanlab isn't installed, __init__ will assert
+    try:
+        mon = SwanlabMonitor(
+            project="trinity-smoke",
+            group="cradle",
+            name="swanlab-env",
+            role="tester",
+            config=None,
+        )
+    except AssertionError as e:
+        print("SwanLab not available or not installed:", e)
+        print("Install swanlab to run this smoke test: pip install swanlab")
+        return 0
+    except Exception as e:
+        print("Unexpected error constructing SwanlabMonitor:", e)
+        return 1
+
+    # Log a minimal metric to verify basic flow
+    try:
+        mon.log({"smoke/metric": 1.0}, step=1)
+        print("Logged a test metric via SwanlabMonitor.")
+    except Exception as e:
+        print("Error during logging:", e)
+        try:
+            mon.close()
+        except Exception:
+            pass
+        return 1
+
+    # Close cleanly
+    try:
+        mon.close()
+        print("SwanlabMonitor closed successfully.")
+    except Exception as e:
+        print("Error closing monitor:", e)
+        return 1
+
+    print("Smoke test completed.")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/trinity/buffer/buffer.py b/trinity/buffer/buffer.py
index 46929f06be..59d744117a 100644
--- a/trinity/buffer/buffer.py
+++ b/trinity/buffer/buffer.py
@@ -24,6 +24,10 @@ def get_buffer_reader(config: BufferStorageConfig) -> BufferReader:
         from trinity.buffer.reader.queue_reader import QueueReader
 
         return QueueReader(storage_config)
+    elif storage_config.storage_type == StorageType.ASTUNE:
+        from trinity.buffer.reader.file_reader import AstuneTaskReader
+
+        return AstuneTaskReader(storage_config)
     elif storage_config.storage_type == StorageType.FILE:
         from trinity.buffer.reader.file_reader import (
             ExperienceFileReader,
diff --git a/trinity/buffer/reader/file_reader.py b/trinity/buffer/reader/file_reader.py
index b6f39979f4..7d25749065 100644
--- a/trinity/buffer/reader/file_reader.py
+++ b/trinity/buffer/reader/file_reader.py
@@ -164,3 +164,61 @@ def read_with_indices(self, indices: List[int]) -> List:
     async def read_with_indices_async(self, indices: List[int]) -> List:
         """Read tasks with indices asynchronously."""
         return self.read_with_indices(indices)
+
+
+import os
+
+
+def read_astune_config(yaml_fp):
+    """Load an astune Hydra config given a path to its YAML file."""
+    from hydra import initialize, compose
+    from omegaconf import DictConfig
+
+    def load_hydra_config(config_path: str, config_name: str) -> DictConfig:
+        with initialize(config_path=config_path, version_base=None):
+            cfg = compose(config_name=config_name, overrides=[])
+        return cfg
+
+    dir_path = os.path.dirname(yaml_fp)
+    file_name = os.path.basename(yaml_fp)
+    return load_hydra_config(config_path=dir_path, config_name=file_name)
+
+
+class AstuneTaskReader(TaskFileReader):
+    """Reads tasks from an astune `TaskReaderRouter` instead of a local file."""
+
+    def __init__(self, config):
+        self.config = config
+        self.read_batch_size = config.batch_size
+        self.split = config.split
+
+        yaml_path = os.environ.get('ASTUNE_CONFIG_REDIRECT', None)
+        if yaml_path is None:
+            raise ValueError("ASTUNE_CONFIG_REDIRECT is not set in environment variables")
+        astune_config = read_astune_config(os.path.relpath(yaml_path, os.path.dirname(__file__)))
+
+        from astune.task_reader import TaskReaderRouter, task_to_standard_dataset
+        task_reader = TaskReaderRouter(astune_config)
+        if 'val' in self.split:
+            dataset = task_to_standard_dataset(task_reader.get_validation_tasks())
+        elif 'train' in self.split:
+            dataset = task_to_standard_dataset(task_reader.get_training_tasks())
+        else:
+            raise ValueError(f"Unsupported split '{self.split}'; expected 'train' or 'val'")
+
+        self.dataset = _HFBatchReader(
+            dataset,  # type: ignore
+            name=self.config.name,
+            default_batch_size=self.read_batch_size,
+            total_epochs=self.config.total_epochs if not self.config.is_eval else 1,
+            offset=self.config.index,
+            drop_last=not self.config.is_eval,
+            total_steps=self.config.total_steps,
+            enable_progress_bar=self.config.enable_progress_bar,
+        )
+        self.formatter = FORMATTER.get("task")(self.config)
+
+    def read(self, batch_size: Optional[int] = None) -> List:
+        batch_size = batch_size or self.read_batch_size
+        samples = self.dataset.read_batch(batch_size)
+        return [self.formatter.format(sample) for sample in samples]
diff --git a/trinity/buffer/task_scheduler.py b/trinity/buffer/task_scheduler.py
index 01b6fa1a47..3262da17ee 100644
--- a/trinity/buffer/task_scheduler.py
+++ b/trinity/buffer/task_scheduler.py
@@ -12,6 +12,8 @@
 from trinity.common.constants import SELECTOR_METRIC
 from trinity.utils.annotations import Experimental
 
+from trinity.buffer.reader.file_reader import AstuneTaskReader
+
 
 @Experimental
 class TasksetScheduler:
@@ -62,7 +64,7 @@ def __init__(self, explorer_state: Dict, config: Config):
         for taskset_config, taskset_state in zip(taskset_configs, taskset_states):
             assert not taskset_config.is_eval  # assume drop last
             taskset = get_buffer_reader(taskset_config)
-            if not isinstance(taskset, TaskFileReader):
+            if not isinstance(taskset, (TaskFileReader, AstuneTaskReader)):
                 raise TypeError(
                     f"Taskset '{taskset_config.name}' has an unsupported type '{type(taskset).__name__}'. "
-                    f"Currently, only 'TaskFileReader' is supported by TasksetScheduler."
+                    f"Currently, only 'TaskFileReader' and 'AstuneTaskReader' are supported by TasksetScheduler."
                 )
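The hunks above route `storage_type: astune` through the new `AstuneTaskReader`, which resolves its Hydra config from an environment variable rather than from the taskset path. A minimal sketch of the wiring; the config path is illustrative, not taken from this diff:

```bash
# consumed by AstuneTaskReader.__init__ (raises ValueError when unset)
export ASTUNE_CONFIG_REDIRECT=/path/to/astune/config.yaml
```

Because the launcher change below loads a repo-root `.env` via python-dotenv, the same line (without `export`) can live in `.env` instead.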
diff --git a/trinity/cli/launcher.py b/trinity/cli/launcher.py
index 468ab2df53..e18ae8d000 100644
--- a/trinity/cli/launcher.py
+++ b/trinity/cli/launcher.py
@@ -163,6 +163,9 @@ def run_stage(config: Config) -> None:
 
 
 def run(config_path: str, dlc: bool = False, plugin_dir: str = None):
+    if os.path.exists(".env"):
+        from dotenv import load_dotenv
+        load_dotenv(".env")
     if plugin_dir:
         os.environ[PLUGIN_DIRS_ENV_VAR] = plugin_dir
     load_plugins()
diff --git a/trinity/common/config.py b/trinity/common/config.py
index c722959b96..2f2f5c8b1d 100644
--- a/trinity/common/config.py
+++ b/trinity/common/config.py
@@ -170,7 +170,7 @@ class StorageConfig:
     default_workflow_type: Optional[str] = None
     default_reward_fn_type: Optional[str] = None
     rollout_args: GenerationConfig = field(default_factory=GenerationConfig)
-    workflow_args: dict = field(default_factory=dict)
+    workflow_args: dict = field(default_factory=dict)  # qingxu: TODO
     reward_fn_args: dict = field(default_factory=dict)
     task_selector: TaskSelectorConfig = field(default_factory=TaskSelectorConfig)
 
@@ -738,6 +738,7 @@ class StageConfig:
     trainer: Optional[TrainerConfig] = None
 
 
+
 @dataclass
 class Config:
     """Global Configuration"""
diff --git a/trinity/common/constants.py b/trinity/common/constants.py
index 183702927b..255e29337a 100644
--- a/trinity/common/constants.py
+++ b/trinity/common/constants.py
@@ -62,6 +62,7 @@ class StorageType(CaseInsensitiveEnum):
     SQL = "sql"
     QUEUE = "queue"
     FILE = "file"
+    ASTUNE = "astune"
 
 
 class SyncMethodEnumMeta(CaseInsensitiveEnumMeta):
diff --git a/trinity/common/workflows/agentscope/react/react_workflow.py b/trinity/common/workflows/agentscope/react/react_workflow.py
index a6dbca28e3..6d2f1fbfc0 100644
--- a/trinity/common/workflows/agentscope/react/react_workflow.py
+++ b/trinity/common/workflows/agentscope/react/react_workflow.py
@@ -2,25 +2,22 @@
 This workflow is a demonstration of how to integrate the AgentScope framework
 within the Trinity-RFT workflow system with minimal modifications.
 """
-
-from typing import Dict, List, Optional, Union
-
+import uuid
 import openai
-
+from typing import Dict, List, Optional, Union
 from trinity.common.experience import Experience
 from trinity.common.models.model import ModelWrapper
 from trinity.common.workflows.workflow import WORKFLOWS, Task, Workflow
-
+from transformers import AutoTokenizer
 from .templates import TEMPLATE_MAP
-
 
 @WORKFLOWS.register_module("as_react_workflow")
 class AgentScopeReActWorkflow(Workflow):
     is_async: bool = True
 
     def __init__(
         self,
-        *,
+        config,
         task: Task,
         model: ModelWrapper,
         auxiliary_models: Optional[List[openai.OpenAI]] = None,
@@ -97,3 +94,4 @@ def construct_experiences(self, reward: Union[float, Dict[str, float]]) -> List[
         if isinstance(reward, dict):
             exp.metrics.update(reward)
         return exps
+
diff --git a/trinity/common/workflows/workflow.py b/trinity/common/workflows/workflow.py
index 91716e1688..254357e09f 100644
--- a/trinity/common/workflows/workflow.py
+++ b/trinity/common/workflows/workflow.py
@@ -40,7 +40,7 @@ class Task(dict):
     index: dict = field(default_factory=dict)
 
     def to_workflow(
-        self, model: Any, auxiliary_models: Optional[List[openai.OpenAI]] = None
+        self, config, model: Any, auxiliary_models: Optional[List[openai.OpenAI]] = None
     ) -> Workflow:
         """Convert the task to a workflow.
@@ -55,6 +55,7 @@ def to_workflow(
             Workflow: The generated workflow object.
         """
         return self.workflow(
+            config=config,
             model=model,
             task=self,
             auxiliary_models=auxiliary_models,
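Together with the `workflow_runner.py` hunk further down, these changes thread the global config into every workflow constructor. A runnable sketch of the new calling convention, using hypothetical stand-in classes (`DummyWorkflow` and `DummyModel` are not part of the diff):

```python
# Stand-ins for trinity.common.workflows.workflow.Workflow and the model wrapper.
class DummyModel:
    pass


class DummyWorkflow:
    # Workflow subclasses must now accept `config`, because Task.to_workflow()
    # forwards it as a keyword argument (see the hunks above).
    def __init__(self, config, task, model, auxiliary_models=None):
        self.config, self.task, self.model = config, task, model
        self.auxiliary_models = auxiliary_models


def to_workflow(config, model, task, auxiliary_models=None):
    # Mirrors Task.to_workflow() after this change.
    return DummyWorkflow(config=config, model=model, task=task, auxiliary_models=auxiliary_models)


wf = to_workflow(config={"project": "demo"}, model=DummyModel(), task={"question": "1+1"})
print(type(wf).__name__)  # -> DummyWorkflow
```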
diff --git a/trinity/explorer/explorer.py b/trinity/explorer/explorer.py
index 038c1dd5f9..4b5e22c5e1 100644
--- a/trinity/explorer/explorer.py
+++ b/trinity/explorer/explorer.py
@@ -34,6 +34,10 @@
 from trinity.utils.plugin_loader import load_plugins
 from trinity.utils.timer import Timer
 
+try:
+    from astune.backbone_trinity import *
+except ImportError:
+    from astune.backbone.trinity_compat_workflow import *
 
 class Explorer:
     """Responsible for exploring the taskset."""
diff --git a/trinity/explorer/scheduler.py b/trinity/explorer/scheduler.py
index ae17649c86..fa2822a23a 100644
--- a/trinity/explorer/scheduler.py
+++ b/trinity/explorer/scheduler.py
@@ -81,7 +81,7 @@ async def run_with_retry(self, task: TaskWrapper) -> Tuple[Status, List, int]:
         for attempt in range(self.retry_times + 1):
             try:
                 status, exps = await asyncio.wait_for(
-                    self.runner.run_task.remote(task.task, task.repeat_times, task.run_id_base),
+                    self.runner.run_task.remote(task.task, task.repeat_times, task.run_id_base),  # dispatches to WorkflowRunner.run_task()
                     self.timeout,
                 )
                 if status.ok:
diff --git a/trinity/explorer/workflow_runner.py b/trinity/explorer/workflow_runner.py
index 5c6a3933d4..363d467d48 100644
--- a/trinity/explorer/workflow_runner.py
+++ b/trinity/explorer/workflow_runner.py
@@ -79,8 +79,9 @@ def _create_workflow_instance(self, task: Task) -> None:
             or not self.workflow_instance.resettable
         ):
             self.workflow_instance = task.to_workflow(
-                self.model_wrapper,
-                (
+                config=self.config,
+                model=self.model_wrapper,
+                auxiliary_models=(
                     self.auxiliary_model_async_clients
                     if task.workflow.is_async
                     else self.auxiliary_model_clients
diff --git a/trinity/utils/monitor.py b/trinity/utils/monitor.py
index 0eee105608..b3c5dfb47a 100644
--- a/trinity/utils/monitor.py
+++ b/trinity/utils/monitor.py
@@ -16,6 +16,10 @@
     import mlflow
 except ImportError:
     mlflow = None
+try:
+    import swanlab
+except ImportError:
+    swanlab = None
 from torch.utils.tensorboard import SummaryWriter
 
 from trinity.common.config import Config
@@ -225,3 +229,126 @@ def default_args(cls) -> Dict:
             "username": None,
             "password": None,
         }
+
+
+@MONITOR.register_module("swanlab")
+class SwanlabMonitor(Monitor):
+    """Monitor with SwanLab.
+    This monitor integrates with SwanLab (https://swanlab.cn/) to track experiments.
+    """
+
+    def __init__(
+        self, project: str, group: str, name: str, role: str, config: Config = None
+    ) -> None:
+        assert (
+            swanlab is not None
+        ), "swanlab is not installed. Please install it to use SwanlabMonitor."
+
+        monitor_args = (
+            (config.monitor.monitor_args or {})
+            if config and getattr(config, "monitor", None)
+            else {}
+        )
+
+        # Read the API key from monitor_args or the environment
+        api_key = monitor_args.get("api_key") or os.environ.get("SWANLAB_API_KEY")
+        if api_key:
+            try:
+                swanlab.login(api_key=api_key, save=True)
+            except Exception as e:
+                # Best-effort login; init may still work if already logged in
+                get_logger(__name__).warning(
+                    f"Swanlab login failed, but continuing initialization: {e}"
+                )
+
+        # Compose tags (ensure list and include role/group markers)
+        tags = monitor_args.get("tags") or []
+        if isinstance(tags, tuple):
+            tags = list(tags)
+        if role and role not in tags:
+            tags.append(role)
+        if group and group not in tags:
+            tags.append(group)
+
+        # Determine experiment name
+        exp_name = monitor_args.get("experiment_name") or f"{name}_{role}"
+
+        # Prepare init kwargs, passing only non-None values to respect library defaults
+        init_kwargs = {
+            "project": project,
+            "experiment_name": exp_name,
+            "description": monitor_args.get("description"),
+            "tags": tags or None,
+            "logdir": monitor_args.get("logdir"),
+            "mode": monitor_args.get("mode") or "cloud",
+            "settings": monitor_args.get("settings"),
+            "id": monitor_args.get("id"),
+            "resume": monitor_args.get("resume"),
+            "reinit": monitor_args.get("reinit"),
+        }
+        # Strip None values to avoid overriding swanlab defaults
+        init_kwargs = {k: v for k, v in init_kwargs.items() if v is not None}
+
+        # Convert config to a plain dict for SwanLab config logging
+        cfg_dict = None
+        if config is not None:
+            if hasattr(config, "flatten"):
+                try:
+                    cfg_dict = config.flatten()
+                except Exception:
+                    # Fallback: try to cast to dict if possible
+                    try:
+                        cfg_dict = dict(config)
+                    except Exception:
+                        cfg_dict = None
+            else:
+                try:
+                    cfg_dict = dict(config)
+                except Exception:
+                    cfg_dict = None
+        if cfg_dict is not None:
+            init_kwargs["config"] = cfg_dict
+
+        self.logger = swanlab.init(**init_kwargs)
+        self.console_logger = get_logger(__name__, in_ray_actor=True)
+
+    def log_table(self, table_name: str, experiences_table: pd.DataFrame, step: int):
+        # Convert pandas DataFrame to a SwanLab ECharts table
+        headers: List[str] = list(experiences_table.columns)
+        # Ensure rows are native Python types
+        rows: List[List[object]] = experiences_table.astype(object).values.tolist()
+        try:
+            tbl = swanlab.echarts.Table()
+            tbl.add(headers, rows)
+            swanlab.log({table_name: tbl}, step=step)
+        except Exception:
+            # Fallback: log as a CSV string if the echarts table is unavailable
+            csv_str = experiences_table.to_csv(index=False)
+            swanlab.log({table_name: csv_str}, step=step)
+
+    def log(self, data: dict, step: int, commit: bool = False) -> None:
+        """Log metrics."""
+        # SwanLab doesn't use the commit flag; keep the signature for compatibility
+        swanlab.log(data, step=step)
+        self.console_logger.info(f"Step {step}: {data}")
+
+    def close(self) -> None:
+        try:
+            # Prefer run.finish() if available
+            if hasattr(self, "logger") and hasattr(self.logger, "finish"):
+                self.logger.finish()
+            elif swanlab:
+                # Fallback to global finish
+                swanlab.finish()
+        except Exception as e:
+            logger = getattr(self, "console_logger", get_logger(__name__))
+            logger.warning(f"Error closing Swanlab monitor: {e}")
+
+    @classmethod
+    def default_args(cls) -> Dict:
+        """Return default arguments for the monitor."""
+        return {
+            "api_key": None,
+            "mode": "cloud",
+            "logdir": None,
+        }
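Selecting the new monitor from a run config would look roughly like this; the `monitor_args` keys mirror `default_args()` above, and `__init__` reads them from `config.monitor.monitor_args`. A sketch, not a config shipped in this diff:

```yaml
monitor:
  monitor_type: swanlab
  monitor_args:
    api_key: null  # or set SWANLAB_API_KEY in the environment / .env
    mode: cloud
    logdir: null
```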
diff --git a/vsdb.py b/vsdb.py
new file mode 100644
index 0000000000..662b77e461
--- /dev/null
+++ b/vsdb.py
@@ -0,0 +1,142 @@
+import os
+
+
+def vscode_conditional_breakpoint(tag=None, rank=-1, once=True):
+    """
+    Set a conditional breakpoint in VSCode based on given tag and rank conditions.
+
+    This function triggers breakpoints during debugging when specific conditions
+    are met. The breakpoint fires only if:
+    1. The environment variable `RAY_DEBUG_POST_MORTEM` is set.
+    2. `tag` is either None, or appears in the environment variable `DEBUG_TAGS`
+       (multiple tags separated by `|`).
+    3. With `once=True`, the breakpoint has not already fired for this tag.
+
+    Parameters:
+    - tag (str, optional): Tag to match against the environment variable `DEBUG_TAGS`.
+      If None, the tag check is skipped and the breakpoint triggers unconditionally
+      (subject to `RAY_DEBUG_POST_MORTEM`).
+    - rank (int, optional): GPU index / world rank. Currently unused; the rank
+      check below is commented out.
+    - once (bool, optional): If True, the breakpoint will only trigger once per tag.
+    """
+    env_tag = f'HIT_BREAKPOINT_REC_{tag}'
+    # if rank < 0: rank = os.getenv("RANK", 0)
+    # if rank != 0: return
+    if not os.getenv('RAY_DEBUG_POST_MORTEM'):
+        return
+    if tag is None:
+        if once:
+            if os.getenv(env_tag, "") != "1":
+                os.environ[env_tag] = "1"
+                breakpoint()
+            return
+        else:
+            breakpoint()
+            return
+    else:
+        debug_tags = os.getenv('DEBUG_TAGS', '').split('|')
+        if tag in debug_tags:
+            if once:
+                if os.getenv(env_tag, "") != "1":
+                    os.environ[env_tag] = "1"
+                    breakpoint()
+                return
+            else:
+                breakpoint()
+                return
+
+
+import pickle
+
+
+def objdump(obj, file="objdump.tmp"):
+    # Pickle an object to disk for quick offline inspection.
+    with open(file, "wb+") as f:
+        pickle.dump(obj, f)
+    return
+
+
+def objload(file="objdump.tmp"):
+    # Load a previously dumped object; returns None if the file does not exist.
+    if not os.path.exists(file):
+        return
+    with open(file, "rb") as f:
+        return pickle.load(f)
+
+
+bp = vscode_conditional_breakpoint
+
+
+"""
+Document:
+
+Ray Distributed Debugger VSCode Extension
+
+1. Starting with Ray 2.39, Anyscale has introduced the `Ray Distributed Debugger `_ VSCode extension. Follow the extension's installation instructions, then add your cluster using the dashboard URL you obtained earlier.
+
+   .. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/debugger.png?raw=true
+      :alt: Ray Distributed Debugger VSCode extension screenshot
+
+2. Prerequisites.
+
+   Ensure the following are installed (see the extension README for more detail):
+
+   - Visual Studio Code
+   - `ray[default]` >= 2.9.1
+   - `debugpy` >= 1.8.0
+
+   .. image:: https://github.com/aoshen524/verl/blob/main/docs/start/c7098b755ff689859837773a916c857.png?raw=true
+      :alt: VSCode with Ray prerequisites
+
+3. Environment Variables.
+
+   To enable post-mortem debugging, set:
+
+   .. code-block:: bash
+
+      export RAY_DEBUG_POST_MORTEM=1
+
+   .. admonition:: Note
+      :class: important
+
+      Be sure to remove any legacy flags before starting Ray:
+
+      - `RAY_DEBUG=legacy`
+      - `--ray-debugger-external`
+
+4. Configuring Breakpoints.
+
+   1. Insert `breakpoint()` calls into your remote functions.
+   2. Submit your job to the cluster.
+
+   The extension will detect active breakpoints and display them in VSCode.
+
+   .. image:: https://github.com/aoshen524/verl/blob/main/docs/start/4ddad74395c79a1402331c0ce73316f.png?raw=true
+      :alt: Detected breakpoint in VSCode
+
+   **Note:** Breakpoints are only supported inside functions decorated with `@ray.remote`.
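+
+   For illustration, a minimal job (the file name `job.py` and the function body
+   are assumptions, not part of this repository):
+
+   .. code-block:: python
+
+      import ray
+
+      @ray.remote
+      def square(x):
+          breakpoint()  # the debugger attaches here
+          return x * x
+
+      ray.init()
+      print(ray.get(square.remote(2)))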
+
+5. Launching the Debugger.
+
+   Run your job directly from the command line (do not use a `launch.json`):
+
+   .. code-block:: bash
+
+      python job.py
+
+6. Attaching to a Breakpoint.
+
+   Once the process hits the first `breakpoint()`, click the Ray Distributed Debugger icon in the VSCode sidebar to attach the debugger.
+
+   .. image:: https://github.com/aoshen524/verl/blob/main/docs/start/4ddad74395c79a1402331c0ce73316f.png?raw=true
+      :alt: Attaching VSCode debugger to Ray process
+
+7. Debugging With Multiple breakpoint() Calls.
+
+   For each subsequent task, first disconnect the current debugger session, then click the extension icon again to attach to the next breakpoint.
+
+   .. image:: https://github.com/aoshen524/verl/blob/main/docs/start/6e83c910a62c82fecb89c6619e001cd.png?raw=true
+      :alt: Disconnecting and reconnecting the debugger
+"""
\ No newline at end of file
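For the new `vsdb.py` helpers, a short usage sketch (the tag names and payload are illustrative; `bp` drops into the debugger when the conditions hold):

```python
import os

from vsdb import bp, objdump, objload

# Conditions: RAY_DEBUG_POST_MORTEM must be set, and the tag must appear in DEBUG_TAGS.
os.environ["RAY_DEBUG_POST_MORTEM"] = "1"
os.environ["DEBUG_TAGS"] = "rollout|reward"

bp("rollout")         # fires breakpoint() once for this tag
bp("rollout")         # no-op: once=True and the tag has already fired
bp("unlisted_tag")    # no-op: tag not present in DEBUG_TAGS

objdump({"step": 1})  # pickle an object to objdump.tmp for offline inspection
print(objload())      # -> {'step': 1}
```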