The Reward Scheduler

Author: Laith Zumot

Published: February 2, 2026

I drink a lot of coffee, mostly V60s. Today, somewhere around cup three, the heavily caffeinated crazy squirrels in my brain started yelling that the three-stage pour-over process has a decent analogy to Lean RL training, specifically curriculum and reward scheduling.

For non-coffee people: a V60 is a paper cone dripper. You put grounds in a filter, pour hot water, and gravity does the rest. The way you pour changes the taste because it changes how evenly water moves through the coffee bed. If you are curious, Hario makes the popular one (no affiliation).

Below, I reference the reward “voices” as they exist in my GitHub repo lean-grpo. There are five reward scripts I keep coming back to: base.py, domain.py, difficulty.py, efficiency.py, and composite.py. I am not trying to explain the code line by line. I just want a few tiny snippets as anchors so the metaphor stays tied to something real.


0) A tiny V60 primer so the rest makes sense

When you start a V60, the dry coffee is full of trapped gas. If you dump water too fast, water finds weird paths, the bed gets uneven, and you get a cup that tastes confused. So most people do three things. First, they do a small wetting pour and wait. That is bloom. Second, they pour the main water in controlled pulses and circles so the bed stays even. Third, they stop pouring and watch the drip finish. That final drain time is the drawdown.

You do not taste “bloom flavor” and “pulse flavor” separately. You watch cues while brewing, then you taste at the end. Training is like that too.


1) The Bloom

Bloom exists because stability comes before speed. It is where you wet the grounds and wait. The coffee puffs and bubbles. You are not trying to brew the cup yet. You are trying to wake up the bed so the later pour behaves. In training, bloom is Foundation. This is the “stay Lean valid” phase. The model is learning how to breathe without crashing. In base.py, the shape of that signal is basically: count valid steps, scale it, keep it small.

# base.py
valid_steps = sum(1 for step in trajectory.steps if step.is_valid)
progress = valid_steps / max_steps
return progress * self.config.max_partial_reward

This is why it feels like a tracker. It is a small drip of credit for staying in-bounds. If you try to force speed during bloom, you usually pour too hard or too soon. The bed does not settle. Water cuts channels, which are little shortcuts where water rushes through one path and ignores the rest. That gives uneven extraction. You end up with a cup that tastes thin, sharp, or just oddly hollow.

If you demand efficiency too early, you punish exploration before the model knows any reliable moves. The model then learns to avoid situations that trigger penalties. Depending on the rest of your reward mix, that can look like stopping early, or doing safe low-value loops that keep it “valid” but never finish. Ever try GRPO on hard algebra with a very sparse signal?
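
One way around that is to keep the efficiency voice close to silent during bloom and ramp it in later. To make the scheduling idea concrete, here is a rough sketch; it is not anything in the repo as written, and the phase boundaries and weights are made up for illustration.

# sketch, not in the repo: phase-dependent voice weights
def voice_weights(progress: float) -> dict[str, float]:
    """Map training progress in [0, 1] to a weight per reward voice."""
    if progress < 0.2:
        # bloom: mostly "stay valid", no efficiency pressure yet
        return {"base": 1.0, "domain": 0.2, "difficulty": 0.0, "efficiency": 0.0}
    if progress < 0.7:
        # pulse pour: domain style and difficulty start to matter
        return {"base": 0.5, "domain": 1.0, "difficulty": 0.7, "efficiency": 0.2}
    # drawdown: completion is mostly there, now care about tight proofs
    return {"base": 0.2, "domain": 0.5, "difficulty": 1.0, "efficiency": 1.0}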


2) The Pulse Pour and Bean-Informed Variations

That section title sounds like a neat name for a coffee pub, or an arXiv preprint perhaps. Alas, I digress…

After bloom, you start the real pour. You pour in pulses and circles. This is where you shape the brew. You are trying to keep the bed evenly saturated, avoid channeling, and pull out the good stuff at a steady pace. This is where Domain fits cleanly. Domain is about what the proof feels like. It is style, tool choice, and “normal moves.” It nudges the model toward tactics that usually matter for the kind of math you are doing.

In domain.py, the algebra voice is basically: look at the tactic text, reward certain patterns, and discourage a few bad smells.

# domain.py (algebra-ish)
tactics_text = trajectory.get_tactics_text().lower()
for tactic in self.PREFERRED_TACTICS:
    if tactic in tactics_text:
        reward += self.config.preferred_tactic_bonus

That maps nicely to the pour pattern. You are teaching “how to pour” in this domain.
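
For flavor, the preferred list in the algebra voice is roughly the tactics you would expect in a healthy algebra proof. I am not quoting the repo here; this is just a hypothetical example of what such a list could look like.

# hypothetical example, not the actual constants in domain.py
PREFERRED_TACTICS = ["ring", "ring_nf", "field_simp", "linarith", "nlinarith", "norm_num"]
DISCOURAGED_PATTERNS = ["sorry", "admit"]  # the obvious bad smells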

In difficulty.py, the spirit is: hard problems cost more, so hard wins should pay more. Hard problems fail more often. They take more steps. They create more chances to crash. If the reward does not pay more for hard completion, the model learns a very simple habit: farm easy proofs because they are reliable.
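
The simplest counterweight I know is to scale the completion reward by a difficulty score, so a hard win pays visibly more than an easy one. A minimal sketch, not the actual difficulty.py; the score and the multiplier are invented for illustration.

# sketch, not the actual difficulty.py: pay more for hard completions
def completion_reward(completed: bool, difficulty: float, base_reward: float = 1.0) -> float:
    """difficulty in [0, 1]: 0 is trivial, 1 is the hardest tier."""
    if not completed:
        return 0.0
    # easy proofs pay base_reward, the hardest pay up to 3x
    return base_reward * (1.0 + 2.0 * difficulty)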

Now, difficulty is real, but I think it fits better as the thing that changes what a “good pour” even means. In coffee, some beans are easy and forgiving. Some are dense, light-roasted, stubborn little rocks that punish you if you rush. Dense, light roasts extract slower. If you treat them like an easy dark roast, you under-extract and get sour, thin cups. You need to give them more “budget” in some form: more time, a different grind, more careful pouring, sometimes more total water energy to get sweetness instead of sharpness.


3) The Drawdown

Now you stop pouring and you watch the drip. Too fast and the cup is weak. Too slow and it gets heavy and bitter. There is a range where it tastes clean and balanced. It may feel like a bit of a stretch, but in training I see drawdown as Efficiency. It checks bloat, repetition, and wandering.

In efficiency.py, the voice is basically: stay near an “optimal length,” penalize excessive length, and punish obvious repetition.

# efficiency.py
if num_steps <= optimal + tolerance:
    reward += self.config.conciseness_bonus
else:
    reward += -(excess * self.config.length_penalty_factor)

# efficiency.py
redundancy_ratio = 1 - len(unique_tactics) / len(tactics_list)
if redundancy_ratio > 0.3:
    reward += self.config.repetition_penalty * redundancy_ratio
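
composite.py is the fifth script I listed at the top, and the natural reading of the name is that it is the mixer: each voice scores the trajectory, and a weighted sum is what the policy actually sees. This is the shape of that step, not the real file; it could take its weights from a phase schedule like the one sketched back in the bloom section.

# sketch of the mixing step, not the actual composite.py
def composite_reward(voice_rewards: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of the individual voice rewards."""
    return sum(weights.get(name, 0.0) * r for name, r in voice_rewards.items())

# usage with made-up numbers
rewards = {"base": 0.4, "domain": 0.3, "difficulty": 1.5, "efficiency": -0.2}
weights = {"base": 0.5, "domain": 1.0, "difficulty": 0.7, "efficiency": 0.2}
total = composite_reward(rewards, weights)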

4) V60 cues and training metrics

With a V60, I am not tasting at every stage, although in the morning I really, really want to get some caffeine in me ASAP.

Instead, I am watching a few signals while brewing. Time is the obvious one. I watch bloom time and total brew time. I watch the flow. Is it steady, or did it suddenly rush down one side? I watch the bed. Does it look even, or did it crack and channel? Then I taste at the end and decide what to change next time. It is basically a torture regime.

Training has the same pattern. While training, I’d watch survival and finishing trends, not “final quality” every step. For bloom, I’d watch how often trajectories stay valid for more than a few steps, and how early they crash. For pulse pour, I’d watch whether the model is even willing to play on hard problems, and whether its tactic choices look like the domain instead of random flailing. For drawdown, I’d watch whether proof length and repetition are shrinking without completion collapsing.

Then I’d probably “taste” with a stable eval set. That can be as simple as running a checkpoint on a fixed batch of problems and checking completion rate and a couple of sanity stats. It is the final cup.
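
In practice the taste test can be embarrassingly simple. Something in this spirit; run_prover and the result fields are hypothetical stand-ins, not code from the repo.

# sketch of the "taste" step; run_prover and the result fields are hypothetical
def taste(checkpoint, eval_problems, run_prover):
    """Run a checkpoint on a fixed batch and report a few sanity stats."""
    results = [run_prover(checkpoint, problem) for problem in eval_problems]
    completion_rate = sum(r.completed for r in results) / len(results)
    avg_steps = sum(len(r.steps) for r in results) / len(results)
    avg_valid_steps = sum(r.valid_steps for r in results) / len(results)
    return {"completion_rate": completion_rate,
            "avg_steps": avg_steps,
            "avg_valid_steps": avg_valid_steps}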

The main point is that you do not need perfect metrics. You need a few cues that tell you whether you are still waking up the bed, actually extracting, or just stalling the drip. This is still an art.


Your bAIrista in training
Laith