Solving Mojo GPU Puzzles
Puzzle 1
I began solving the Mojo GPU Puzzles today (I needed a nudge from a friend to build up enough FOMO to do it).
Luckily I spent some time reading the famous PMPP book and attended a few CUDA-Mode lectures, so hopefully I can grok what's going on. This is my attempt to refresh my knowledge and to learn Mojo again after training a small FIM-style coder on the roughly 400 Mojo repos that existed a few years ago (shameless plug for my GitHub for that).
Note that a lot of the info and notes here are from the Mojo docs.

Puzzle 1) - the problem
This puzzle is simple: apply an addition over a vector, adding 10 to every element. A classic map operation, in a way. The code is mostly provided by the puzzle; there is one line you need to add yourself.
from memory import UnsafePointer
from gpu import thread_idx
from gpu.host import DeviceContext
from testing import assert_equal
# ANCHOR: add_10
alias SIZE = 4 # alias sets compile‑time constants.
## runs one block with 4 threads along x. Total threads = 1×4 = 4
alias BLOCKS_PER_GRID = 1 ## how many thread blocks you launch in the grid. Here it’s 1. One value means a 1‑D grid; tuples give 2‑D or 3‑D
alias THREADS_PER_BLOCK = SIZE ## how many GPU threads you launch inside each block. Here it’s 4, one‑dimensional. You can also pass a tuple for (x, y, z).
alias dtype = DType.float32 # element type for buffers and scalars
fn add_10(
    output: UnsafePointer[Scalar[dtype]], a: UnsafePointer[Scalar[dtype]]
    ## `UnsafePointer[...]`: a raw pointer to contiguous memory of the given type. No bounds or lifetime checks. You must index correctly yourself.
    ## Each parameter is a device pointer to a 1‑D array of Scalar[dtype] elements. Scalar[dtype] is the single‑value wrapper used throughout Mojo’s GPU APIs. You load/store these through the pointer.
):
    ## In a single‑block 1‑D launch, each thread handles one index:
    i = thread_idx.x
    output[i] = a[i] + 10.0 # ---> here the solution is a simple elementwise operation. Mojo calls it the raw memory approach, AKA the naive approach
    ## With multiple blocks you also include the block offset.
    ## The indices come from gpu.thread_idx, gpu.block_idx, and gpu.block_dim
# ANCHOR_END: add_10

Steps:
- Step 1: Allocate out and a on the GPU and zero‑fill them.
- Step 2: Map a to the host, write [0, 1, 2, 3], unmap to push the data back to the device.
- Step 3: Launch the kernel with grid_dim=1, block_dim=4. Threads 0..3 each handle one index.
- Step 4: Allocate a pinned host buffer expected, zero it, then synchronize() to ensure the enqueued GPU work is complete.
- Step 5: Fill expected[i] = i + 10 and compare with out mapped to host.
- Step 6: Result should be out = [10, 11, 12, 13].
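The steps above can be sketched end to end as a CPU-only model in plain Python. Lists stand in for the device and pinned host buffers, and the "kernel launch" is just a loop over thread ids; none of this is the Mojo API, just the shape of the flow:

```python
# CPU-only model of the six steps above.
SIZE = 4

out = [0.0] * SIZE                 # Step 1: allocate + zero-fill "device" buffers
a = [0.0] * SIZE
for i in range(SIZE):              # Step 2: write the inputs (map_to_host in Mojo)
    a[i] = float(i)

for i in range(SIZE):              # Step 3: "launch" with 4 threads; thread i handles index i
    out[i] = a[i] + 10.0

expected = [0.0] * SIZE            # Step 4: zeroed host buffer (synchronize is a no-op here)
for i in range(SIZE):              # Step 5: CPU-side reference result
    expected[i] = i + 10.0

assert out == expected             # Step 6
print("out:", out)                 # out: [10.0, 11.0, 12.0, 13.0]
```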
A Note
- When you call a GPU function, you can pass a DeviceBuffer directly; Mojo remaps it to a device pointer for the kernel (UnsafePointer[Scalar[dtype]]) under the hood. So both out.unsafe_ptr() and out work.
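The kernel comments above mention including the block offset once you launch multiple blocks. Here is a small Python sketch of that global-index arithmetic; the 2-block geometry is made up for illustration and is not part of this puzzle:

```python
# Global index in a 1-D multi-block launch:
#   i = block_idx.x * block_dim.x + thread_idx.x
BLOCK_DIM = 4                      # threads per block (block_dim.x)
GRID_DIM = 2                       # blocks per grid (illustrative)

a = [float(v) for v in range(GRID_DIM * BLOCK_DIM)]
out = [0.0] * len(a)

for block_idx in range(GRID_DIM):            # blocks run independently
    for thread_idx in range(BLOCK_DIM):      # threads within a block
        i = block_idx * BLOCK_DIM + thread_idx   # block offset + local index
        out[i] = a[i] + 10.0

print(out)  # [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
```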
Puzzle 1) - what the test does
The test in main is where you can really start to learn how Mojo works. I used an LLM to help me grok the pieces here while I play with the code. Some of the choices have to do with GPU design and with how CPU/GPU communication and memory are handled.
def main():
    with DeviceContext() as ctx: ## Creates a GPU execution context and a stream. All operations below are enqueued on this stream
        out = ctx.enqueue_create_buffer[dtype](SIZE) ## Allocates a device buffer of SIZE elements on the GPU. Asynchronous. Returns DeviceBuffer[dtype]
        out.enqueue_fill(0) ## Asynchronously fills the entire buffer with the value. Non‑blocking; scheduled on the context’s stream
        a = ctx.enqueue_create_buffer[dtype](SIZE)
        a.enqueue_fill(0)
        ## Below maps a for CPU access. Inside the with block you read/write it like a normal array; changes are pushed back to the device when the block exits.
        with a.map_to_host() as a_host:
            for i in range(SIZE):
                a_host[i] = i
        ## Compiles and enqueues the kernel with type‑checking.
        ## You currently pass the same function twice to enable compile‑time signature checks; this redundancy will go away in a future API update.
        ctx.enqueue_function_checked[add_10, add_10](
            out,
            a,
            grid_dim=BLOCKS_PER_GRID,
            block_dim=THREADS_PER_BLOCK,
        )
        ## Allocates a host buffer of SIZE elements. Host buffers are pinned (page‑locked) memory for fast transfers between CPU and GPU. The fill sets it to zero.
        expected = ctx.enqueue_create_host_buffer[dtype](SIZE)
        expected.enqueue_fill(0)
        ctx.synchronize() ## Blocks the CPU until all previously enqueued async work on this context (fills, kernel launch, etc.) has completed. Use this before reading results on the host
        for i in range(SIZE): ## CPU‑side reference result.
            expected[i] = i + 10
        with out.map_to_host() as out_host: ## Maps the device output back to the CPU, prints both, and verifies elementwise equality.
            print("out:", out_host)
            print("expected:", expected)
            for i in range(SIZE):
                assert_equal(out_host[i], expected[i])
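One thing that tripped me up: almost everything in main is enqueued asynchronously, and only ctx.synchronize() guarantees the work has actually run before you read results. A toy Python model of that enqueue-then-synchronize behavior (the Stream class here is hypothetical, purely to mimic the pattern):

```python
# Toy model of a stream: enqueue is non-blocking; work only runs at synchronize().
class Stream:
    def __init__(self):
        self.pending = []

    def enqueue(self, fn):
        self.pending.append(fn)        # just record the work and return immediately

    def synchronize(self):
        for fn in self.pending:        # drain the queue in order
            fn()
        self.pending.clear()

SIZE = 4
a = [float(i) for i in range(SIZE)]
out = [0.0] * SIZE

def kernel():
    for i in range(SIZE):
        out[i] = a[i] + 10.0

stream = Stream()
stream.enqueue(kernel)

assert out == [0.0] * SIZE             # enqueued, but nothing has executed yet
stream.synchronize()                   # like ctx.synchronize() before reading on the host
assert out == [10.0, 11.0, 12.0, 13.0]
```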