Modal Setup for Open-R1 Lean Training
Step-by-step guide for cloud-based GRPO training
This document provides a complete walkthrough for setting up Modal cloud infrastructure to train Lean proof models using GRPO. Code is shown but not executed.
1) Overview & Architecture
This guide bridges the gap between the original Open-R1 repository (designed for local GPUs/SLURM clusters) and Modal’s serverless cloud infrastructure. The approach requires zero changes to existing code—all original scripts remain untouched. We create a single wrapper file (train_modal.py) that handles Modal’s specific requirements: volume mounting, secret management, dynamic GPU allocation, and container lifecycle.
Key differences from local training:
- Volumes: Explicit mount points replace traditional filesystems
- Secrets: API keys managed via Modal's secret system (no .env files)
- GPU allocation: Dynamically requested per job
- Containers: Ephemeral instances that vanish after completion
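To make these differences concrete, here is a minimal, illustrative Modal app (all names here are hypothetical and not part of this guide's setup) showing how a function declares its image, GPU, and volume up front instead of inheriting them from the host machine:

import modal

app = modal.App("example-app")
vol = modal.Volume.from_name("example-volume", create_if_missing=True)

@app.function(
    image=modal.Image.debian_slim().pip_install("torch"),  # dependencies live in the image, not on your machine
    gpu="H100",                                             # GPU requested per job, released when the job ends
    volumes={"/data": vol},                                 # durable storage at an explicit mount point
)
def hello():
    print("running inside an ephemeral container")

Running modal run on such a file builds the image remotely, executes the function in a fresh container, and tears everything down afterwards.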
2) Repository Setup
Fork both repositories to your GitHub account for independent development.
# Fork these repositories via GitHub web interface:
# https://github.com/auto-res/open-r1
# https://github.com/[YOUR_GITHUB_USERNAME]/open-r1-lean
The main Open-R1 repo provides the training framework, while the lean variant contains Lean-specific configurations and dataset formats.
3) Modal CLI Installation
Install Modal using the uv package manager for fast, reliable Python tooling.
curl -LsSf https://astral.sh/uv/install.sh | sh
uv tool install modal
modal --version
uv provides portable Python environment management; the modal CLI enables cloud deployment and monitoring.
4) Modal Authentication
Configure Modal with your account credentials and verify the profile.
modal setup
modal profile list
This one-time setup links your local environment to your Modal account and sets default project configurations.
5) Local Clone Setup
Clone your fork locally to prepare the Modal wrapper and configuration files.
git clone https://github.com/[YOUR_GITHUB_USERNAME]/open-r1-lean.git
cd open-r1-lean
Working from your fork allows you to commit changes while pulling upstream updates.
6) Wandb Secret Configuration
Create a Modal secret to securely store your Weights & Biases API key.
# Get your key from: wandb.ai/authorize
modal secret create wandb-secret WANDB_API_KEY="[YOUR_WANDB_API_KEY]"
modal secret list
Modal secrets are encrypted and injected into containers at runtime, never exposed in code or logs.
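As a quick sanity check (a throwaway sketch, not part of the training setup), any Modal function that attaches this secret sees the key as an ordinary environment variable, which wandb picks up automatically:

import modal

app = modal.App("secret-check")

@app.function(secrets=[modal.Secret.from_name("wandb-secret")])
def check_wandb_key():
    import os
    # The secret is injected as an environment variable at runtime;
    # wandb reads WANDB_API_KEY automatically, so no explicit login is needed.
    assert "WANDB_API_KEY" in os.environ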
7) Checkpoint Volume Creation
Create a persistent volume to store model checkpoints across container runs.
modal volume create openr1-checkpoints
modal volume list
Volumes provide durable storage that survives container termination, which is critical for multi-day training runs.
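Writes to a volume become visible to other containers and to the CLI once they are committed; Modal commits automatically on clean container exit, and long-running jobs can also commit explicitly. A minimal sketch, assuming the volume created above (the file path is hypothetical):

import modal

app = modal.App("volume-demo")
vol = modal.Volume.from_name("openr1-checkpoints", create_if_missing=True)

@app.function(volumes={"/checkpoints": vol})
def save_marker():
    # Anything written under the mount point lands on the volume
    with open("/checkpoints/marker.txt", "w") as f:
        f.write("checkpoint data would go here\n")
    vol.commit()  # flush changes mid-run instead of waiting for container exit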
8) Dataset Preparation
Copy your Lean problems dataset into the expected location.
cp /path/to/your/grpo_problem_algebra.jsonl data/grpo_problem_algebra.jsonl
# Verify format:
head -n 1 data/grpo_problem_algebra.jsonl
The dataset should contain Lean conjectures with a problem column for compatibility with the training script.
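Beyond eyeballing the first line, a small script can confirm that every record parses as JSON and carries the problem column the trainer expects; a minimal sketch, assuming the path above:

import json

# Fail fast if any line is malformed or missing the "problem" column
with open("data/grpo_problem_algebra.jsonl") as f:
    for i, line in enumerate(f, start=1):
        record = json.loads(line)
        assert "problem" in record, f"line {i} is missing the 'problem' column"
print("dataset format looks OK")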
9) Minimal Modal Training Script
Create a new file train_modal.py in the root directory:
import datetime

from modal import App, Image, Secret, Volume
EXPERIMENT_NAME = f"Lean-GRPO-V2-7B-Algebra-{datetime.datetime.now().strftime('%Y%m%d-%H%M')}"
# Configuration - easy to edit
GPU_COUNT = 8
WORK_DIR = "/app"
CHECKPOINT_DIR = f"{WORK_DIR}/checkpoints"
# Modal setup
app = App("openr1-lean-training")
volume = Volume.from_name("openr1-checkpoints", create_if_missing=True)
# ✅ Use NVIDIA PyTorch image (has python symlink + CUDA pre-configured)
image = (
    Image.from_registry("nvcr.io/nvidia/pytorch:24.01-py3")
    .apt_install("git", "curl")
    .run_commands(
        "curl -LsSf https://astral.sh/uv/install.sh | sh",
        "pip install modal",
    )
    .dockerfile_commands([f"COPY . {WORK_DIR}"])
    .workdir(WORK_DIR)
    .run_commands(
        "uv venv openr1 --python 3.11",
        ". openr1/bin/activate && uv pip install --upgrade pip",
        "git config --global url.https://github.com/.insteadOf git@github.com:",  # force HTTPS for git dependencies
        ". openr1/bin/activate && uv pip install vllm==0.7.2",
        ". openr1/bin/activate && uv pip install setuptools",
        ". openr1/bin/activate && uv pip install flash-attn --no-build-isolation",
        ". openr1/bin/activate && GIT_LFS_SKIP_SMUDGE=1 uv pip install -e '.[dev]'",  # skip LFS payloads during editable install
    )
)
@app.function(
    image=image,
    gpu=f"H100:{GPU_COUNT}",
    volumes={CHECKPOINT_DIR: volume},
    secrets=[Secret.from_name("wandb-secret")],
    timeout=86400,  # 24 hours
    cpu=8,
    memory=32768,  # 32 GB RAM
    # spot=True,  # enable spot instances for cost savings
)
def train():
    import os
    import subprocess

    # The volume is mounted at CHECKPOINT_DIR; makedirs is a harmless safeguard
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)

    # Activate the venv baked into the image
    activate_cmd = ". /app/openr1/bin/activate && "

    # Run training; one GPU is reserved for vLLM generation, hence GPU_COUNT - 1 processes
    cmd = (
        f"{activate_cmd} ACCELERATE_LOG_LEVEL=info accelerate launch "
        f"--config_file recipes/accelerate_configs/zero2.yaml "
        f"--num_processes={GPU_COUNT - 1} src/open_r1/grpo.py "
        f"--config experiments/[YOUR_CONFIG_PATH]/myconfig.yaml "
        f"--output_dir {CHECKPOINT_DIR}/{EXPERIMENT_NAME}"
    )
    subprocess.run(cmd, shell=True, check=True)
@app.local_entrypoint()
def main():
    train.remote()
Key features:
- Zero code changes to original Open-R1 scripts
- Automatic dependency installation via uv and pip
- CUDA-optimized base image with Flash Attention 2 pre-configured
- Dynamic GPU allocation (H100 in this example)
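One usage note: train.remote() blocks the local terminal until the job finishes. If you would rather submit the job and disconnect, Modal's spawn API returns a handle immediately; a hedged variant of the entrypoint (the function name here is hypothetical):

@app.local_entrypoint()
def main_detached():
    # spawn() submits the job and returns a handle instead of blocking like remote()
    handle = train.spawn()
    print(f"Submitted training job: {handle.object_id}")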
10) Configuration Update
Edit your experiment configuration file:
# Model arguments
model_name_or_path: deepseek-ai/DeepSeek-Prover-V2-7B
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
# Data training arguments
# We edit the DeepSeek chat template to ensure (a) the reasoning block within <think> and </think> is included in the completion and (b) the <think> tag is not part of the prefill so that the format reward works
chat_template: "{% for message in messages %}{% if message['role'] == 'system' %}{{ message['content'] }}\n\n{% endif %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"
dataset_name: ./data/grpo_problem_algebra.jsonl
system_prompt: "Complete the following Lean 4 code:\n\n```lean4\n"
# GRPO trainer config
bf16: true
use_vllm: true
vllm_device: auto
vllm_tensor_parallel_size: 1
vllm_gpu_memory_utilization: 0.7
do_eval: false
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: DeepSeek-Prover-V2-7B
hub_strategy: every_save
learning_rate: 1.0e-06
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr_rate: 0.1
max_prompt_length: 512
max_completion_length: 2048
max_steps: 5 # Set to -1 for full training
num_generations: 16
num_train_epochs: 1
output_dir: /app/checkpoints/Lean-GRPO-V2-7B-Algebra-1
overwrite_output_dir: true
per_device_eval_batch_size: 16
per_device_train_batch_size: 16
push_to_hub: false
report_to: wandb
wandb_entity: [YOUR_WANDB_ENTITY] # Your wandb username
wandb_project: [YOUR_PROJECT_NAME] # Project name
reward_funcs:
- lean
reward_weights:
- 1.0
save_strategy: "epoch"
save_total_limit: 1
seed: 42
temperature: 0.7
warmup_ratio: 0.1
#resume_from_checkpoint: ./data/Lean-GRPO-V2-7B-1/checkpoint-??
Template customization: Replace [YOUR_WANDB_ENTITY] and [YOUR_PROJECT_NAME] with your credentials. Adjust max_steps for full training runs.
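Before scaling past 5 steps, it is worth sanity-checking the batch geometry: TRL's GRPO trainer requires the global generation batch (training processes × per-device batch size) to be divisible by num_generations. A small sketch of that arithmetic, assuming the 8-GPU setup from train_modal.py (one GPU reserved for vLLM):

# Sanity-check GRPO batch geometry for the config above
gpu_count = 8
num_processes = gpu_count - 1          # one GPU reserved for vLLM generation
per_device_train_batch_size = 16
num_generations = 16

global_batch = num_processes * per_device_train_batch_size  # 7 * 16 = 112 completions per step
assert global_batch % num_generations == 0, "global batch must be divisible by num_generations"
print(f"{global_batch // num_generations} unique prompts per step")  # 7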
11) Running the Training
Launch a test run with minimal steps to verify the setup:
modal run train_modal.py
What happens:
- Initial build: ~10-15 minutes (cached for subsequent runs)
- Test execution: 5 training steps from your config
- Checkpointing: Saves to your Modal volume
- Logging: Streams to Wandb
Monitor progress in a separate terminal:
# Get App ID after ~2 minutes:
modal app list
# Stream logs:
modal app logs [APP_ID]
12) Checkpoint Management
List all experiments:
modal volume ls openr1-checkpoints
Download a single checkpoint:
modal volume get openr1-checkpoints [EXPERIMENT_NAME]/checkpoint-final ./downloaded-model
Download entire experiment:
modal volume get openr1-checkpoints [EXPERIMENT_NAME] ./downloaded-experiment
Inspect without downloading:
modal volume ls openr1-checkpoints [EXPERIMENT_NAME]/
Resume from checkpoint: Add this to the cmd in train_modal.py:
resume_path = "/app/checkpoints/[EXPERIMENT_NAME]/checkpoint-[STEP]"
f"--resume_from_checkpoint {resume_path}"
Delete specific experiment:
modal volume rm openr1-checkpoints [EXPERIMENT_NAME] --recursive
Delete entire volume (⚠️ permanent):
modal volume delete openr1-checkpoints
Check storage costs:
modal volume list
# Shows size in GB. Example: 42 GB × $0.10/GB-month = $4.20/month
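If you prefer to estimate costs in code, the arithmetic is a one-liner (the $0.10/GB-month rate mirrors the example above; check Modal's current pricing before relying on it):

# Rough monthly storage cost, using the example rate above
volume_size_gb = 42
rate_per_gb_month = 0.10  # USD; verify against current Modal pricing
print(f"~${volume_size_gb * rate_per_gb_month:.2f}/month")  # ~$4.20/month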