(Streaming) DiLoCo Simulator

Vibecoded (repo), based on the paper trilogy: DiLoCo, Streaming DiLoCo, and the DiLoCo Scaling Laws

Model & Data

N — model parameters (assumes bfloat16 storage)
D — training tokens (default: 20 × N)
B — batch size (recommended: ... based on Table 10)

DiLoCo Parameters

Number of DiLoCo replicas M (must be ≥ 2)
Inner steps H (steps between outer syncs)
Use Streaming DiLoCo
Overlap Inter-DC comms with computation
Overlapping (staleness) steps S (1..10)
Chunks (fragments) per sync P

Hardware & Network

Q — per-chip throughput (assumes ~50% MFU of bf16 peak)
Intra-Datacenter bandwidth W₀ (Gbps)
Inter-Datacenter bandwidth W₁ (Gbps)
Inter-Datacenter latency ε₁ (ms)

Estimated Wall-Clock Time

Total Steps (T)
Computation Time
Intra-DC Comm (Inner)
Inter-DC Comm (Outer)
Total Duration
Predicted Eval Loss (absolute; lower is better)
Inter-DC Capacity at each DC (Gbps)
Compute Utilization (%)
Underlying Math

Standard DiLoCo (Sequential)

// Computation

T_comp = 6ND / (R × Q)

// Intra-DC Communication (Inner Steps)

T_comm_intra = [ (2N / W₀) × (1 - M/R) + ε₀ ] × T

// Inter-DC Communication (Outer Steps)

T_comm_inter = [ (2N / W₁) × (1 - 1/R) + ε₁ ] × (T / H)

Total = T_comp + T_comm_intra + T_comm_inter
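
For reference, here is a minimal Python sketch of this sequential cost model (the function name and the unit conventions — parameter count for N, bytes/s for W₀/W₁ — are assumptions; the arithmetic is exactly the three formulas above):

```python
def diloco_wall_clock(N, D, T, R, M, H, Q, W0, W1, eps0=0.0, eps1=0.0):
    """Sequential (non-overlapped) DiLoCo wall-clock estimate, in seconds.

    N: parameter count (in bfloat16 a sync payload is ~2N bytes)
    D: training tokens            T: total inner steps
    R: total chips, spread across M datacenters (one DiLoCo replica each)
    H: inner steps per outer sync
    Q: achieved FLOP/s per chip (peak x MFU)
    W0, W1: intra-/inter-DC per-chip bandwidth, bytes/s
    eps0, eps1: intra-/inter-DC latency per collective, seconds
    """
    t_comp = 6 * N * D / (R * Q)                                  # 6ND FLOPs split over R chips
    t_comm_intra = ((2 * N / W0) * (1 - M / R) + eps0) * T        # paid on every inner step
    t_comm_inter = ((2 * N / W1) * (1 - 1 / R) + eps1) * (T / H)  # paid once per outer sync
    return t_comp + t_comm_intra + t_comm_inter
```

Compute utilization, as plotted below, would then presumably be t_comp divided by the total; Streaming DiLoCo changes only the inter-DC term, sending the same 2N bytes in P fragments (see FAQ 5).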

Compute Utilization plot (reproducing Figure 4 of the Streaming DiLoCo paper)

Frequently Asked Questions

1. Source / Citation
Based on Appendix A.2 and A.3 of "Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo". The model assumes bfloat16 precision.
2. Wouldn't a real-world implementation of Streaming DiLoCo need some form of Hierarchical All-Reduce?
I think so, most likely. Real-world implementations typically use hierarchical collectives. However, the Scaling Laws paper explicitly models the time using the standard flat ring lower bound ($T \approx 2N/W$). This implies that if the Inter-DC link is the bottleneck, the total transfer time is governed by that bottleneck bandwidth ($W$) regardless of the reduction topology.
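To make that concrete, here is a hypothetical comparison under the same lower-bound model (both helper functions are illustrative, not from the paper):

```python
def flat_ring(N, R, W_bottleneck):
    # Flat ring over all R chips, paced by the slowest link on the ring.
    return (2 * N / W_bottleneck) * (1 - 1 / R)

def hierarchical(N, R, M, W0, W1):
    # Two-level reduce: within each DC (R/M chips at W0), then across M DCs at W1.
    return (2 * N / W0) * (1 - M / R) + (2 * N / W1) * (1 - 1 / M)

# With a fast intra-DC fabric (W0 >> W1) both expressions approach 2N / W1:
# the bottleneck inter-DC bandwidth governs the transfer time either way.
```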
3. What are Staleness (S) and Fragments (P)?
Fragments (P): Streaming DiLoCo breaks the outer sync payload into $P$ chunks. While this doesn't change the total bytes sent, it allows for finer-grained overlap and lower memory overhead for buffers.
Staleness (S): By allowing the outer optimizer to be $S$ steps stale, the communication of gradients can be spread over $S$ intervals of length $H$. This effectively reduces the bandwidth requirement by a factor of $S$ (or allows using a link $S$ times slower without bottlenecking).
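As a back-of-the-envelope sketch of that factor-of-$S$ claim (a hypothetical helper, assuming the ~2N-byte payload must drain within the window it is allowed to be stale):

```python
def required_inter_dc_bandwidth(N, inner_step_time, H, S=1):
    """Per-chip bandwidth (bytes/s) needed to hide one outer sync of
    ~2N bytes (bfloat16) behind S windows of H inner steps each."""
    window = S * H * inner_step_time   # seconds available before the sync is due
    return 2 * N / window              # doubling S halves the requirement
```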
4. Why is "Bandwidth (Per Chip)" used in the formulas?

The training time calculation uses the bandwidth-optimal All-Reduce cost model from Patarasuk & Yuan (2009), approximated in the paper as T ≈ 2N / W. In this formula, W represents the per-chip link bandwidth.

However, the "Inter-DC Capacity" metric displayed in the results represents the aggregate cross-sectional bandwidth that would be required if every chip in the datacenter saturated its Inter-DC link simultaneously. It is calculated as Bandwidth_Per_Chip × (R / M).
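Restated as a one-liner (the numbers in the usage line are illustrative, not a benchmark):

```python
def inter_dc_capacity_gbps(per_chip_gbps, R, M):
    # R chips split across M datacenters -> R/M chips per DC, each of
    # which could saturate its own inter-DC link simultaneously.
    return per_chip_gbps * (R / M)

inter_dc_capacity_gbps(10, 64, 4)  # 10 Gbps/chip, 64 chips, 4 DCs -> 160 Gbps per DC
```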

5. Why do Vanilla and Streaming DiLoCo lines overlap?
In the simulator's model, Streaming DiLoCo sends the same total amount of data as Vanilla DiLoCo, just in smaller chunks. The only difference is the added latency overhead of sending multiple fragments ($P \times \epsilon$). For large models (e.g., 100B params), the bandwidth transfer time (tens of seconds) massively dominates the latency overhead (milliseconds), making the lines appear identical on the plot. To see them diverge, try increasing the Fragments count or the Inter-DC Latency.
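A quick worked example with illustrative numbers (100B parameters, a 100 Gbps inter-DC link, P = 10 fragments, ε₁ = 10 ms) shows the scale difference:

```python
N = 100e9                      # parameters
W1 = 100e9 / 8                 # 100 Gbps in bytes/s
bandwidth_time = 2 * N / W1    # T ~= 2N/W model: ~16 s per outer sync
latency_overhead = 10 * 0.010  # P x eps1 = 0.1 s, invisible next to ~16 s
```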