Modal Setup for Open-R1 Lean Training

Step-by-step guide for cloud-based GRPO training

This document provides a complete walkthrough for setting up Modal cloud infrastructure to train Lean proof models using GRPO. Code is shown but not executed.

1) Overview & Architecture

This guide bridges the gap between the original Open-R1 repository (designed for local GPUs/SLURM clusters) and Modal’s serverless cloud infrastructure. The approach requires zero changes to existing code—all original scripts remain untouched. We create a single wrapper file (train_modal.py) that handles Modal’s specific requirements: volume mounting, secret management, dynamic GPU allocation, and container lifecycle.

Key differences from local training: - Volumes: Explicit mount points replace traditional filesystems - Secrets: API keys managed via Modal’s secret system (no .env files) - GPU allocation: Dynamically requested per job - Containers: Ephemeral instances that vanish after completion

2) Repository Setup

Fork both repositories to your GitHub account for independent development.

# Fork these repositories via GitHub web interface:
# https://github.com/auto-res/open-r1
# https://github.com/[YOUR_GITHUB_USERNAME]/open-r1-lean

The main Open-R1 repo provides the training framework, while the lean variant contains Lean-specific configurations and dataset formats.

3) Modal CLI Installation

Install Modal using the uv package manager for fast, reliable Python tooling.

curl -LsSf https://astral.sh/uv/install.sh | sh
uv tool install modal
modal --version

uv provides portable Python environment management
modal CLI enables cloud deployment and monitoring

4) Modal Authentication

Configure Modal with your account credentials and verify the profile.

modal setup
modal profile list

This one-time setup links your local environment to your Modal account and sets default project configurations.

5) Local Clone Setup

Clone your fork locally to prepare the Modal wrapper and configuration files.

git clone https://github.com/[YOUR_GITHUB_USERNAME]/open-r1-lean.git
cd open-r1-lean

Working from your fork allows you to commit changes while pulling upstream updates.

6) Wandb Secret Configuration

Create a Modal secret to securely store your Weights & Biases API key.

# Get your key from: wandb.ai/authorize
modal secret create wandb-secret WANDB_API_KEY="[YOUR_WANDB_API_KEY]"
modal secret list

Modal secrets are encrypted and injected into containers at runtime, never exposed in code or logs.

7) Checkpoint Volume Creation

Create a persistent volume to store model checkpoints across container runs.

modal volume create openr1-checkpoints
modal volume list

Volumes provide durable storage that survives container termination—critical for multi-day training runs.

8) Dataset Preparation

Copy your Lean problems dataset into the expected location.

cp /path/to/your/grpo_problem_algebra.jsonl data/grpo_problem_algebra.jsonl

# Verify format:
head -n 1 data/grpo_problem_algebra.jsonl

The dataset should contain Lean conjectures with a problem column for compatibility with the training script.

9) Minimal Modal Training Script

Create a new file train_modal.py in the root directory:

from modal import App, Image, Volume, Secret, gpu
import datetime
import modal

EXPERIMENT_NAME = f"Lean-GRPO-V2-7B-Algebra-{datetime.datetime.now().strftime('%Y%m%d-%H%M')}"
LEAN_REPL_PATH = "./repl"
# change .gitmodules
# [submodule "repl"]
#   path = repl
#   # url = git@github.com:ondanaoto/repl.git to
#   url = https://github.com/ondanaoto/repl.git

# Configuration - easy to edit
GPU_COUNT = 8 
WORK_DIR = "/app"
CHECKPOINT_DIR = f"{WORK_DIR}/checkpoints"

# Modal setup
app = App("openr1-lean-training")
volume = Volume.from_name("openr1-checkpoints", create_if_missing=True)

# ✅ Use NVIDIA PyTorch image (has python symlink + CUDA pre-configured)
image = (
    Image.from_registry("nvcr.io/nvidia/pytorch:24.01-py3")
    .apt_install("git", "curl", "clang", "make") 
    .run_commands(
        "curl -LsSf https://astral.sh/uv/install.sh | sh",
        "pip install modal",
    )
    .dockerfile_commands([
        f"COPY . {WORK_DIR}",
        # ✅ Copy the submodule directory
        f"COPY {LEAN_REPL_PATH} /tmp/lean-repl"
    ])
    .workdir(WORK_DIR)
    .run_commands(
        "uv venv openr1 --python 3.11",
        ". openr1/bin/activate && uv pip install --upgrade pip",
        "git config --global url.https://github.com/.insteadOf git@github.com:",  # HTTPS fix
        ". openr1/bin/activate && uv pip install vllm==0.7.2",
        ". openr1/bin/activate && uv pip install setuptools",
        ". openr1/bin/activate && uv pip install flash-attn --no-build-isolation",
        "GIT_LFS_SKIP_SMUDGE=1 . openr1/bin/activate && uv pip install -e '.[dev]'"
    )
    .run_commands(
    # Install Lean toolchain
    "curl https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh -sSf | sh -s -- -y --default-toolchain leanprover/lean4:v4.15.0",
    
    # Build the REPL from submodule
    "cd /tmp/lean-repl && ~/.elan/bin/lake build",
    
    # ✅ CORRECT PATH: .lake/build/bin/repl
    "cp /tmp/lean-repl/.lake/build/bin/repl /app/repl",
    "chmod +x /app/repl",
    
    # Verify
    "ls -lh /app/repl && /app/repl --version || echo 'REPL built successfully'"
    )
)


@app.function(
    image=image,
    gpu=gpu.H100(count=GPU_COUNT),
    volumes={CHECKPOINT_DIR: volume},
    secrets=[Secret.from_name("wandb-secret")],
    timeout=86400,  # 24 hours
    cpu=8,
    memory=32768,  # 32GB RAM
    #spot=True,  # Enable spot instances for cost savings
)
def train():
    import os
    import subprocess
    
    # Mount checkpoints
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    
    # Activate venv
    activate_cmd = ". /app/openr1/bin/activate && "
    
    # Run training  
    cmd = (
    f"{activate_cmd} ACCELERATE_LOG_LEVEL=info accelerate launch "
    f"--config_file recipes/accelerate_configs/zero2.yaml "
    f"--num_processes={GPU_COUNT-1} src/open_r1/grpo.py "
    f"--config experiments/[YOUR_CONFIG_PATH]/myconfig.yaml "
    f"--output_dir {CHECKPOINT_DIR}/{EXPERIMENT_NAME}"
    )
    subprocess.run(cmd, shell=True, check=True)

@app.local_entrypoint()
def main():
    train.remote()

Key features: - Zero code changes to original Open-R1 scripts - Automatic dependency installation via uv and pip - CUDA-optimized base image with Flash Attention 2 pre-configured - Dynamic GPU allocation (H100 in this example)

10) Configuration Update

Edit your experiment configuration file:

# Model arguments
model_name_or_path: deepseek-ai/DeepSeek-Prover-V2-7B
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
# We edit the DeepSeek chat template to ensure (a) the reasoning block within <think> and </think> is included in the completion and (b) the <think> tag is not part of the prefill so that the format reward works
chat_template: "{% for message in messages %}{% if message['role'] == 'system' %}{{ message['content'] }}\n\n{% endif %}{% if message['role'] == 'user' %}{{ message['content'] }}{% endif %}{% endfor %}"
dataset_name: ./data/grpo_problem_algebra.jsonl
system_prompt: "Complete the following Lean 4 code:\n\n```lean4\n"

# GRPO trainer config
bf16: true
use_vllm: true
vllm_device: auto
vllm_tensor_parallel_size: 1
vllm_gpu_memory_utilization: 0.7
do_eval: false
gradient_accumulation_steps: 4 
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: DeepSeek-Prover-V2-7B
hub_strategy: every_save
learning_rate: 1.0e-06
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine_with_min_lr
lr_scheduler_kwargs:
  min_lr_rate: 0.1
max_prompt_length: 512
max_completion_length: 2048
max_steps: 5  # Set to -1 for full training
num_generations: 16
num_train_epochs: 1
output_dir: /app/checkpoints/Lean-GRPO-V2-7B-Algebra-1
overwrite_output_dir: true
per_device_eval_batch_size: 16
per_device_train_batch_size: 16
push_to_hub: false
report_to: wandb
wandb_entity: [YOUR_WANDB_ENTITY]          # Your wandb username
wandb_project: [YOUR_PROJECT_NAME]         # Project name

reward_funcs:
- lean
reward_weights:
- 1.0
save_strategy: "epoch"
save_total_limit: 1
seed: 42
temperature: 0.7
warmup_ratio: 0.1
#resume_from_checkpoint: ./data/Lean-GRPO-V2-7B-1/checkpoint-??

Template customization: Replace [YOUR_WANDB_ENTITY] and [YOUR_PROJECT_NAME] with your credentials. Adjust max_steps for full training runs.

11) Running the Training

modal run --detach train_modal.py > lean_training_task.txt

What happens: - Initial build: ~10-15 minutes (cached for subsequent runs) - Test execution: 5 training steps from your config - Checkpointing: Saves to your Modal volume - Logging: Streams to Wandb

Monitor progress in a separate terminal:

# Get App ID after ~2 minutes:
modal app list

# Stream logs:
modal app log [APP_ID]

12) Checkpoint Management

List all experiments:

modal volume ls openr1-checkpoints

Download a single checkpoint:

modal volume get openr1-checkpoints [EXPERIMENT_NAME]/checkpoint-final ./downloaded-model

Download entire experiment:

modal volume get openr1-checkpoints [EXPERIMENT_NAME] ./downloaded-experiment

Inspect without downloading:

modal volume ls openr1-checkpoints [EXPERIMENT_NAME]/

Resume from checkpoint: Add this to the cmd in train_modal.py:

resume_path = "/app/checkpoints/[EXPERIMENT_NAME]/checkpoint-[STEP]"
f"--resume_from_checkpoint {resume_path}"

Delete specific experiment:

modal volume rm openr1-checkpoints [EXPERIMENT_NAME] --recursive

Delete entire volume (⚠️ permanent):

modal volume delete openr1-checkpoints

Check storage costs:

modal volume list
# Shows size in GB. Example: 42GB × $0.10 = $4.20/month