Solving Mojo GPU Puzzles
Puzzle 1
I began solving the Mojo GPU Puzzles today (I needed a nudge from a friend to build up enough FOMO to do it).
Luckily I spent some time reading the famous PMPP book and attended a few CUDA-Mode lectures, so hopefully I can grok what's going on. This is my attempt to refresh my knowledge and to learn Mojo again after training a small FIM-style coder on the roughly 400 Mojo repos that existed a few years ago (shameless plug for my GitHub for that).
Note that a lot of the info and notes here are from the Mojo docs.

Puzzle 1) - the problem
This puzzle is simple: apply an addition over a vector, adding 10 to every element. A classic map operation, in a way. The code is mostly provided by the puzzle; there is one line you need to add yourself.
from memory import UnsafePointer
from gpu import thread_idx
from gpu.host import DeviceContext
from testing import assert_equal
# ANCHOR: add_10
alias SIZE = 4 # alias sets compile‑time constants.
## runs one block with 4 threads along x. Total threads = 1×4 = 4
alias BLOCKS_PER_GRID = 1 ## how many thread blocks you launch in the grid. Here it’s 1. One value means a 1‑D grid; tuples give 2‑D or 3‑D
alias THREADS_PER_BLOCK = SIZE ## how many GPU threads you launch inside each block. Here it’s 4, one‑dimensional. You can also pass a tuple for (x, y, z).
alias dtype = DType.float32 # element type for buffers and scalars
fn add_10(
    output: UnsafePointer[Scalar[dtype]], a: UnsafePointer[Scalar[dtype]]
    ## `UnsafePointer[...]`: a raw pointer to contiguous memory of the given type. No bounds or lifetime checks. You must index correctly yourself.
    ## Each parameter is a device pointer to a 1‑D array of Scalar[dtype] elements. Scalar[dtype] is the single‑value wrapper used throughout Mojo’s GPU APIs. You load/store these through the pointer.
):
    ## In a single‑block 1‑D launch, each thread handles one index:
    i = thread_idx.x
    output[i] = a[i] + 10.0 # ---> here the solution is a simple elementwise operation. Mojo calls it the raw memory approach, AKA the naive approach
    ## With multiple blocks you also include the block offset.
    ## The indices come from gpu.thread_idx, gpu.block_idx, and gpu.block_dim
# ANCHOR_END: add_10

Steps:
- Step 1: Allocate out and a on the GPU and zero‑fill them.
- Step 2: Map a to the host, write [0, 1, 2, 3], unmap to push the data back to the device.
- Step 3: Launch the kernel with grid_dim=1, block_dim=4. Threads 0..3 each handle one index.
- Step 4: Allocate a pinned host buffer expected, zero it, then synchronize() to ensure the enqueued GPU work is complete.
- Step 5: Fill expected[i] = i + 10 and compare with out mapped to host.
- Step 6: Result should be out = [10, 11, 12, 13].
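The steps above can be sketched end to end as a CPU-only model in plain Python. Lists stand in for the device and pinned host buffers, and the "kernel launch" is just a loop over thread ids; none of this is the Mojo API, just the shape of the flow:

```python
# CPU-only model of the six steps above.
SIZE = 4

out = [0.0] * SIZE                 # Step 1: allocate + zero-fill "device" buffers
a = [0.0] * SIZE
for i in range(SIZE):              # Step 2: write the inputs (map_to_host in Mojo)
    a[i] = float(i)

for i in range(SIZE):              # Step 3: "launch" with 4 threads; thread i handles index i
    out[i] = a[i] + 10.0

expected = [0.0] * SIZE            # Step 4: zeroed host buffer (synchronize is a no-op here)
for i in range(SIZE):              # Step 5: CPU-side reference result
    expected[i] = i + 10.0

assert out == expected             # Step 6
print("out:", out)                 # out: [10.0, 11.0, 12.0, 13.0]
```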
A Note
- When you call a GPU function, you can pass a DeviceBuffer directly; Mojo remaps it to a device pointer for the kernel (UnsafePointer[Scalar[dtype]]) under the hood. So both out.unsafe_ptr() and out work.
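The kernel comments above mention including the block offset once you launch multiple blocks. Here is a small Python sketch of that global-index arithmetic; the 2-block geometry is made up for illustration and is not part of this puzzle:

```python
# Global index in a 1-D multi-block launch:
#   i = block_idx.x * block_dim.x + thread_idx.x
BLOCK_DIM = 4                      # threads per block (block_dim.x)
GRID_DIM = 2                       # blocks per grid (illustrative)

a = [float(v) for v in range(GRID_DIM * BLOCK_DIM)]
out = [0.0] * len(a)

for block_idx in range(GRID_DIM):            # blocks run independently
    for thread_idx in range(BLOCK_DIM):      # threads within a block
        i = block_idx * BLOCK_DIM + thread_idx   # block offset + local index
        out[i] = a[i] + 10.0

print(out)  # [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
```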
Puzzle 1) - what the test does
The test in main is where you can really start to learn how Mojo works. I used an LLM to help me grok the pieces here while I play with the code. Some of the choices have to do with GPU design and with how CPU/GPU communication and memory are handled.
def main():
    with DeviceContext() as ctx: ## Creates a GPU execution context and a stream. All operations below are enqueued on this stream
        out = ctx.enqueue_create_buffer[dtype](SIZE) ## Allocates a device buffer of SIZE elements on the GPU. Asynchronous. Returns DeviceBuffer[dtype]
        out.enqueue_fill(0) ## Asynchronously fills the entire buffer with the value. Non‑blocking; scheduled on the context’s stream
        a = ctx.enqueue_create_buffer[dtype](SIZE)
        a.enqueue_fill(0)
        ## Below maps a for CPU access. Inside the with block you read/write it like a normal array; changes are pushed back to the device when the block exits.
        with a.map_to_host() as a_host:
            for i in range(SIZE):
                a_host[i] = i
        ## Compiles and enqueues the kernel with type‑checking.
        ## You currently pass the same function twice to enable compile‑time signature checks; this redundancy will go away in a future API update.
        ctx.enqueue_function_checked[add_10, add_10](
            out,
            a,
            grid_dim=BLOCKS_PER_GRID,
            block_dim=THREADS_PER_BLOCK,
        )
        ## Allocates a host buffer of SIZE elements. Host buffers are pinned (page‑locked) memory for fast transfers between CPU and GPU. The fill sets it to zero.
        expected = ctx.enqueue_create_host_buffer[dtype](SIZE)
        expected.enqueue_fill(0)
        ctx.synchronize() ## Blocks the CPU until all previously enqueued async work on this context (fills, kernel launch, etc.) has completed. Use this before reading results on the host
        for i in range(SIZE): ## CPU‑side reference result.
            expected[i] = i + 10
        with out.map_to_host() as out_host: ## Maps the device output back to the CPU, prints both, and verifies elementwise equality.
            print("out:", out_host)
            print("expected:", expected)
            for i in range(SIZE):
                assert_equal(out_host[i], expected[i])
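One thing that tripped me up: almost everything in main is enqueued asynchronously, and only ctx.synchronize() guarantees the work has actually run before you read results. A toy Python model of that enqueue-then-synchronize behavior (the Stream class here is hypothetical, purely to mimic the pattern):

```python
# Toy model of a stream: enqueue is non-blocking; work only runs at synchronize().
class Stream:
    def __init__(self):
        self.pending = []

    def enqueue(self, fn):
        self.pending.append(fn)        # just record the work and return immediately

    def synchronize(self):
        for fn in self.pending:        # drain the queue in order
            fn()
        self.pending.clear()

SIZE = 4
a = [float(i) for i in range(SIZE)]
out = [0.0] * SIZE

def kernel():
    for i in range(SIZE):
        out[i] = a[i] + 10.0

stream = Stream()
stream.enqueue(kernel)

assert out == [0.0] * SIZE             # enqueued, but nothing has executed yet
stream.synchronize()                   # like ctx.synchronize() before reading on the host
assert out == [10.0, 11.0, 12.0, 13.0]
```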