Modal Setup for Open-R1 Lean Training

Step-by-step guide for cloud-based GRPO training

This document provides a complete walkthrough for setting up Modal cloud infrastructure to train Lean proof models using GRPO. Code is shown but not executed.

1) Overview & Architecture

This guide bridges the gap between the original Open-R1 repository (designed for local GPUs/SLURM clusters) and Modal’s serverless cloud infrastructure. The approach requires zero changes to existing code—all original scripts remain untouched. We create a single wrapper file (train_modal.py) that handles Modal’s specific requirements: volume mounting, secret management, dynamic GPU allocation, and container lifecycle.

Key differences from local training: - Volumes: Explicit mount points replace traditional filesystems - Secrets: API keys managed via Modal’s secret system (no .env files) - GPU allocation: Dynamically requested per job - Containers: Ephemeral instances that vanish after completion


2) Repository Setup

Fork both repositories to your GitHub account for independent development.

# Fork these repositories via GitHub web interface:
# https://github.com/auto-res/open-r1
# https://github.com/[YOUR_GITHUB_USERNAME]/open-r1-lean

The main Open-R1 repo provides the training framework, while the lean variant contains Lean-specific configurations and dataset formats.


5) Local Clone Setup

Clone your fork locally to prepare the Modal wrapper and configuration files.

git clone https://github.com/[YOUR_GITHUB_USERNAME]/open-r1-lean.git
cd open-r1-lean

Working from your fork allows you to commit changes while pulling upstream updates.


6) Wandb Secret Configuration

Create a Modal secret to securely store your Weights & Biases API key.

# Get your key from: wandb.ai/authorize
modal secret create wandb-secret WANDB_API_KEY="[YOUR_WANDB_API_KEY]"
modal secret list

Modal secrets are encrypted and injected into containers at runtime, never exposed in code or logs.


7) Checkpoint Volume Creation

Create a persistent volume to store model checkpoints across container runs.

modal volume create openr1-checkpoints
modal volume list

Volumes provide durable storage that survives container termination—critical for multi-day training runs.


8) Dataset Preparation

Copy your Lean problems dataset into the expected location.

cp /path/to/your/grpo_problem_algebra.jsonl data/grpo_problem_algebra.jsonl

# Verify format:
head -n 1 data/grpo_problem_algebra.jsonl

The dataset should contain Lean conjectures with a problem column for compatibility with the training script.


9) Minimal Modal Training Script

Create a new file train_modal.py in the root directory:

from modal import App, Image, Volume, Secret, gpu
import datetime
import modal

EXPERIMENT_NAME = f"Lean-GRPO-V2-7B-Algebra-{datetime.datetime.now().strftime('%Y%m%d-%H%M')}"

# Configuration - easy to edit
GPU_COUNT = 8 
WORK_DIR = "/app"
CHECKPOINT_DIR = f"{WORK_DIR}/checkpoints"

# Modal setup
app = App("openr1-lean-training")
volume = Volume.from_name("openr1-checkpoints", create_if_missing=True)

# ✅ Use NVIDIA PyTorch image (has python symlink + CUDA pre-configured)
image = (
    Image.from_registry("nvcr.io/nvidia/pytorch:24.01-py3")
    .apt_install("git", "curl")
    .run_commands(
        "curl -LsSf https://astral.sh/uv/install.sh  | sh",
        "pip install modal",
    )
    .dockerfile_commands([f"COPY . {WORK_DIR}"])
    .workdir(WORK_DIR)
    .run_commands(
        "uv venv openr1 --python 3.11",
        ". openr1/bin/activate && uv pip install --upgrade pip",
        "git config --global url.https://github.com/.insteadOf  git@github.com:",  # HTTPS fix
        ". openr1/bin/activate && uv pip install vllm==0.7.2",
        ". openr1/bin/activate && uv pip install setuptools",
        ". openr1/bin/activate && uv pip install flash-attn --no-build-isolation",
        "GIT_LFS_SKIP_SMUDGE=1 . openr1/bin/activate && uv pip install -e '.[dev]'"
    )
)

@app.function(
    image=image,
    gpu=f"H100:{GPU_COUNT}",
    volumes={CHECKPOINT_DIR: volume},
    secrets=[Secret.from_name("wandb-secret")],
    timeout=86400,  # 24 hours
    cpu=8,
    memory=32768,  # 32GB RAM
    #spot=True,  # Enable spot instances for cost savings
)
def train():
    import os
    import subprocess
    
    # Mount checkpoints
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    
    # Activate venv
    activate_cmd = ". /app/openr1/bin/activate && "
    
    # Run training  
    cmd = (
    f"{activate_cmd} ACCELERATE_LOG_LEVEL=info accelerate launch "
    f"--config_file recipes/accelerate_configs/zero2.yaml "
    f"--num_processes={GPU_COUNT-1} src/open_r1/grpo.py "
    f"--config experiments/[YOUR_CONFIG_PATH]/myconfig.yaml "
    f"--output_dir {CHECKPOINT_DIR}/{EXPERIMENT_NAME}"
    )
    subprocess.run(cmd, shell=True, check=True)

@app.local_entrypoint()
def main():
    train.remote()

Key features: - Zero code changes to original Open-R1 scripts - Automatic dependency installation via uv and pip - CUDA-optimized base image with Flash Attention 2 pre-configured - Dynamic GPU allocation (H100 in this example)


10) Configuration Update

Edit your experiment configuration file:

# Model arguments
model_name_or_path: deepseek-ai/DeepSeek-Prover-V2-7B
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
# We edit the DeepSeek chat template to ensure (a) the reasoning block within <think> and </think> is included in the completion and (b) the <think> tag is not part of the prefill so that the format reward works
chat_template: "{% for message in messages %}{% if message['role'] == 'system' %}{{ message['content'] }}\n\n{% endif %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"
dataset_name: ./data/grpo_problem_algebra.jsonl
system_prompt: "Complete the following Lean 4 code:\n\n```lean4\n"

# GRPO trainer config
bf16: true
use_vllm: true
vllm_device: auto
vllm_tensor_parallel_size: 1
vllm_gpu_memory_utilization: 0.7
do_eval: false
gradient_accumulation_steps: 4 
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: DeepSeek-Prover-V2-7B
hub_strategy: every_save
learning_rate: 1.0e-06
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr_rate: 0.1
max_prompt_length: 512
max_completion_length: 2048
max_steps: 5  # Set to -1 for full training
num_generations: 16
num_train_epochs: 1
output_dir: /app/checkpoints/Lean-GRPO-V2-7B-Algebra-1
overwrite_output_dir: true
per_device_eval_batch_size: 16
per_device_train_batch_size: 16
push_to_hub: false
report_to: wandb
wandb_entity: [YOUR_WANDB_ENTITY]          # Your wandb username
wandb_project: [YOUR_PROJECT_NAME]         # Project name

reward_funcs:
- lean
reward_weights:
- 1.0
save_strategy: "epoch"
save_total_limit: 1
seed: 42
temperature: 0.7
warmup_ratio: 0.1
#resume_from_checkpoint: ./data/Lean-GRPO-V2-7B-1/checkpoint-??

Template customization: Replace [YOUR_WANDB_ENTITY] and [YOUR_PROJECT_NAME] with your credentials. Adjust max_steps for full training runs.


11) Running the Training

Launch a test run with minimal steps to verify the setup:

modal run train_modal.py

What happens: - Initial build: ~10-15 minutes (cached for subsequent runs) - Test execution: 5 training steps from your config - Checkpointing: Saves to your Modal volume - Logging: Streams to Wandb

Monitor progress in a separate terminal:

# Get App ID after ~2 minutes:
modal app list

# Stream logs:
modal app log [APP_ID]

12) Checkpoint Management

List all experiments:

modal volume ls openr1-checkpoints

Download a single checkpoint:

modal volume get openr1-checkpoints [EXPERIMENT_NAME]/checkpoint-final ./downloaded-model

Download entire experiment:

modal volume get openr1-checkpoints [EXPERIMENT_NAME] ./downloaded-experiment

Inspect without downloading:

modal volume ls openr1-checkpoints [EXPERIMENT_NAME]/

Resume from checkpoint: Add this to the cmd in train_modal.py:

resume_path = "/app/checkpoints/[EXPERIMENT_NAME]/checkpoint-[STEP]"
f"--resume_from_checkpoint {resume_path}"

Delete specific experiment:

modal volume rm openr1-checkpoints [EXPERIMENT_NAME] --recursive

Delete entire volume (⚠️ permanent):

modal volume delete openr1-checkpoints

Check storage costs:

modal volume list
# Shows size in GB. Example: 42GB × $0.10 = $4.20/month