nbdistributed

Notebook friendly distributed training

citation

"""
citation:

@misc{zumot2025nbdistdemo,
  title={NBDistributed walkthrough},
  author={Zumot, Laith},
  howpublished={\url{https://lazyevaluator.com/presentations/dist/nbdistributed.html}},
  date = {2025-09-17},
  note = {GitHub Gist}
}
"""

1) What is nbdistributed?

It is a small, pure-Python IPython extension that turns a single Jupyter notebook into a living distributed cluster. It was created by Zach Mueller (of Hugging Face Accelerate). [https://pypi.org/project/nbdistributed/]

# Installation is simple with uv
uv pip install nbdistributed
# Load it once with
%load_ext nbdistributed
# Spin up workers:
#   --num-processes is the number of GPUs you want to use
#   --gpu-ids lets you pick specific device IDs
%dist_init --num-processes 2 --gpu-ids 0,1
Using GPU IDs: [0, 1]
Starting 2 distributed workers...
βœ“ Successfully started 2 workers
  Rank 0 -> GPU 0
  Rank 1 -> GPU 1
Available commands:
  %%distributed - Execute code on all ranks (explicit)
  %%rank [0,n] - Execute code on specific ranks
  %sync - Synchronize all ranks
  %dist_status - Show worker status
  %dist_mode - Toggle automatic distributed mode
  %dist_shutdown - Shutdown workers

πŸš€ Distributed mode active: All cells will now execute on workers automatically!
   Magic commands (%, %%) will still execute locally as normal.

🐍 Below are auto-imported and special variables auto-generated into the namespace to use
  `torch`
  `dist`: `torch.distributed` import alias
  `rank` (`int`): The local rank
  `world_size` (`int`): The global world size
  `gpu_id` (`int`): The specific GPU ID assigned to this worker
  `device` (`torch.device`): The current PyTorch device object (e.g. `cuda:1`)
# see status
%dist_status
Distributed cluster status (2 processes):
============================================================
Rank 0: βœ“ PID 61856
  β”œβ”€ GPU: 0 (NVIDIA GeForce RTX 3090)
  β”œβ”€ Memory: 0.0GB / 24.0GB (0.0% used)
  └─ Status: Running

Rank 1: βœ“ PID 61857
  β”œβ”€ GPU: 1 (NVIDIA GeForce RTX 3090)
  β”œβ”€ Memory: 0.0GB / 24.0GB (0.0% used)
  └─ Status: Running

Every cell you run can be executed on any subset of ranks, or on all of them, while you keep the interactive prompt in your hand.

# When you are done, shut the workers down with
%dist_shutdown
# and the GPUs are released.
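Under the hood, each worker is a separate Python process with its own namespace, which is why every rank sees its own `rank`, `gpu_id`, and `device`. That setup can be mimicked with plain multiprocessing — a minimal sketch under stated assumptions (the `worker` function and queue-based reporting are illustrative, not nbdistributed's actual implementation):

```python
import multiprocessing as mp

def worker(rank, world_size, gpu_ids, out):
    # Each worker is its own process with its own namespace,
    # analogous to what %dist_init sets up for every rank.
    gpu_id = gpu_ids[rank]        # the GPU pinned to this rank
    device = f"cuda:{gpu_id}"     # nbdistributed wraps this as a torch.device
    out.put((rank, world_size, gpu_id, device))

ctx = mp.get_context("fork")      # fork keeps this sketch a single file (POSIX only)
gpu_ids = [0, 1]                  # what --gpu-ids 0,1 selects
world_size = len(gpu_ids)         # what --num-processes 2 sets
out = ctx.Queue()
procs = [ctx.Process(target=worker, args=(r, world_size, gpu_ids, out))
         for r in range(world_size)]
for p in procs:
    p.start()
results = sorted(out.get() for _ in range(world_size))
for p in procs:
    p.join()
print(results)  # [(0, 2, 0, 'cuda:0'), (1, 2, 1, 'cuda:1')]
```

The real extension additionally wires the processes into a torch.distributed group and routes cell source to them, but the "one process per rank, one GPU per process" mapping is the core idea.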

2) Tinkering cell by cell

  1. Early CUDA use is allowed. You can probe torch.cuda.device_count() or allocate a tensor on cuda:3 before you ever start the workers. The extension spawns the torch.distributed group later, so nothing is locked in advance.

  2. Cell-level targeting. Prefix a cell with %%rank [0,1] or %%distributed and only the chosen ranks run it. You stay in the notebook UI the whole time.

  3. Fault isolation without full restart. If rank 1 throws a NameError, only that process shows the trace. Fix the variable in the next cell and rerunβ€”no need to bring down the whole group.
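The effect of %%rank [0,1] can be pictured as a guard that each worker evaluates against its own rank. A simplified sketch of that idea (the `run_on_ranks` dispatcher below is purely illustrative; the real extension sends cell source to separate worker processes rather than looping in one process):

```python
def run_on_ranks(target_ranks, cell_fn, world_size):
    """Simulate cell-level targeting: only the chosen ranks run the cell."""
    outputs = {}
    for rank in range(world_size):   # each iteration stands in for one worker
        if rank in target_ranks:     # the check that %%rank [0,1] implies
            outputs[rank] = cell_fn(rank)
    return outputs

# A "cell" that only ranks 0 and 1 should execute, in a 4-worker cluster
result = run_on_ranks([0, 1], lambda rank: f"hello from rank {rank}", world_size=4)
print(result)  # {0: 'hello from rank 0', 1: 'hello from rank 1'}
```

Ranks outside the target list simply skip the cell, which is also why a crash on one rank (point 3 above) leaves the others untouched.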

3) Auto-created variables you can use right away

After %dist_init each worker wakes up with these names already in its namespace:

  1. torch – the full PyTorch module

  2. dist – alias for torch.distributed

  3. rank – local rank id (int)

  4. world_size – total number of workers (int)

  5. gpu_id – the exact GPU index assigned to this worker (int)

  6. device – ready-made torch.device(f'cuda:{gpu_id}')
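A typical first use of these names is to shard work by rank, so each worker touches a disjoint slice of the data. A plain-Python sketch (in a real cell you would move each shard onto `device`; the `shard_for_rank` helper is a hypothetical name, not part of nbdistributed):

```python
def shard_for_rank(items, rank, world_size):
    # Round-robin sharding: rank r takes items r, r + world_size, r + 2*world_size, ...
    return items[rank::world_size]

items = list(range(10))
# What each of 2 workers would see, using its own auto-created rank/world_size:
print(shard_for_rank(items, rank=0, world_size=2))  # [0, 2, 4, 6, 8]
print(shard_for_rank(items, rank=1, world_size=2))  # [1, 3, 5, 7, 9]
```

Because `rank` and `world_size` are already in every worker's namespace, the same cell text produces a different shard on each rank with no per-rank edits.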

4) A Classroom Analogy

With certain libraries (actor-based frameworks, for example) you have to declare resources up front, and once an actor starts you cannot move it to another GPU without rebuilding the job.

nbdistributed treats GPUs like seats in a classroom: you tell it β€œtake 2 GPUs” and you can still walk over to seat 3 or seat 4, or even evict one worker mid-session, all from the same notebook kernel.