(Streaming) DiLoCo Simulator

Vibecoded (repo), based on the paper trilogy: DiLoCo, Streaming DiLoCo, and the DiLoCo Scaling Laws

Model & Data

N — model parameters (assumes bfloat16 storage)
D — training tokens (default: 20 × N)
B — batch size (recommended: ... based on Table 10)

DiLoCo Parameters

Number of DiLoCo replicas M (must be ≥ 2)
Inner steps H (steps between outer syncs)
Use Streaming DiLoCo
Overlap Inter-DC comms with computation
Overlapping (staleness) steps S (1..10)
Chunks (fragments) per sync P

Hardware & Network

Q — per-chip throughput (assumes ~50% MFU of bf16 peak)
Intra-Datacenter bandwidth W₀ (Gbps)
Inter-Datacenter bandwidth W₁ (Gbps)
Inter-Datacenter latency ε₁ (ms)

Estimated Wall-Clock Time

Total Steps (T)
Computation Time
Intra-DC Comm (Inner)
Inter-DC Comm (Outer)
Total Duration
Predicted Eval Loss (absolute; lower is better)
Inter-DC Capacity at each DC (Gbps)
Compute Utilization (%)
Underlying Math

Standard DiLoCo (Sequential)

// Computation

T_comp = 6ND / (R × Q)

// Intra-DC Communication (Inner Steps)

T_comm_intra = [ (2N / W₀) × (1 - M/R) + ε₀ ] × T

// Inter-DC Communication (Outer Steps)

T_comm_inter = [ (2N / W₁) × (1 - 1/R) + ε₁ ] × (T / H)

Total = T_comp + T_comm_intra + T_comm_inter
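
For reference, here is a minimal Python sketch of this sequential cost model (the function name and the unit conventions — parameter count for N, bytes/s for W₀/W₁ — are assumptions; the arithmetic is exactly the three formulas above):

```python
def diloco_wall_clock(N, D, T, R, M, H, Q, W0, W1, eps0=0.0, eps1=0.0):
    """Sequential (non-overlapped) DiLoCo wall-clock estimate, in seconds.

    N: parameter count (in bfloat16 a sync payload is ~2N bytes)
    D: training tokens            T: total inner steps
    R: total chips, spread across M datacenters (one DiLoCo replica each)
    H: inner steps per outer sync
    Q: achieved FLOP/s per chip (peak x MFU)
    W0, W1: intra-/inter-DC per-chip bandwidth, bytes/s
    eps0, eps1: intra-/inter-DC latency per collective, seconds
    """
    t_comp = 6 * N * D / (R * Q)                                  # 6ND FLOPs split over R chips
    t_comm_intra = ((2 * N / W0) * (1 - M / R) + eps0) * T        # paid on every inner step
    t_comm_inter = ((2 * N / W1) * (1 - 1 / R) + eps1) * (T / H)  # paid once per outer sync
    return t_comp + t_comm_intra + t_comm_inter
```

Compute utilization, as plotted below, would then presumably be t_comp divided by the total; Streaming DiLoCo changes only the inter-DC term, sending the same 2N bytes in P fragments (see FAQ 5).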

Compute Utilization plot (reproducing Figure 4 of the Streaming DiLoCo paper)

Frequently Asked Questions

1. Source / Citation
Based on Appendix A.2 and A.3 of "Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo". The model assumes bfloat16 precision.
2. Wouldn't a real-world implementation of Streaming DiLoCo need some form of Hierarchical All-Reduce?
I think so, most likely. Real-world implementations typically use hierarchical collectives. However, the Scaling Laws paper explicitly models the time using the standard flat ring lower bound ($T \approx 2N/W$). This implies that if the Inter-DC link is the bottleneck, the total transfer time is governed by that bottleneck bandwidth ($W$) regardless of the reduction topology.
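To make that concrete, here is a hypothetical comparison under the same lower-bound model (both helper functions are illustrative, not from the paper):

```python
def flat_ring(N, R, W_bottleneck):
    # Flat ring over all R chips, paced by the slowest link on the ring.
    return (2 * N / W_bottleneck) * (1 - 1 / R)

def hierarchical(N, R, M, W0, W1):
    # Two-level reduce: within each DC (R/M chips at W0), then across M DCs at W1.
    return (2 * N / W0) * (1 - M / R) + (2 * N / W1) * (1 - 1 / M)

# With a fast intra-DC fabric (W0 >> W1) both expressions approach 2N / W1:
# the bottleneck inter-DC bandwidth governs the transfer time either way.
```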
3. What are Staleness (S) and Fragments (P)?
Fragments (P): Streaming DiLoCo breaks the outer sync payload into $P$ chunks. While this doesn't change the total bytes sent, it allows for finer-grained overlap and lower memory overhead for buffers.
Staleness (S): By allowing the outer optimizer to be $S$ steps stale, the communication of gradients can be spread over $S$ intervals of length $H$. This effectively reduces the bandwidth requirement by a factor of $S$ (or allows using a link $S$ times slower without bottlenecking).
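As a back-of-the-envelope sketch of that factor-of-$S$ claim (a hypothetical helper, assuming the ~2N-byte payload must drain within the window it is allowed to be stale):

```python
def required_inter_dc_bandwidth(N, inner_step_time, H, S=1):
    """Per-chip bandwidth (bytes/s) needed to hide one outer sync of
    ~2N bytes (bfloat16) behind S windows of H inner steps each."""
    window = S * H * inner_step_time   # seconds available before the sync is due
    return 2 * N / window              # doubling S halves the requirement
```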
4. Why is "Bandwidth (Per Chip)" used in the formulas?

The training time calculation uses the bandwidth-optimal All-Reduce cost model from Patarasuk & Yuan (2009), approximated in the paper as T ≈ 2N / W. In this formula, W represents the per-chip link bandwidth.

However, the "Inter-DC Capacity" metric displayed in the results represents the aggregate cross-sectional bandwidth that would be required if every chip in the datacenter saturated its Inter-DC link simultaneously. It is calculated as Bandwidth_Per_Chip × (R / M).
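Restated as a one-liner (the numbers in the usage line are illustrative, not a benchmark):

```python
def inter_dc_capacity_gbps(per_chip_gbps, R, M):
    # R chips split across M datacenters -> R/M chips per DC, each of
    # which could saturate its own inter-DC link simultaneously.
    return per_chip_gbps * (R / M)

inter_dc_capacity_gbps(10, 64, 4)  # 10 Gbps/chip, 64 chips, 4 DCs -> 160 Gbps per DC
```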

5. Why do Vanilla and Streaming DiLoCo lines overlap?
In the simulator's model, Streaming DiLoCo sends the same total amount of data as Vanilla DiLoCo, just in smaller chunks. The only difference is the added latency overhead of sending multiple fragments ($P \times \epsilon$). For large models (e.g., 100B params), the bandwidth transfer time (tens of seconds) massively dominates the latency overhead (milliseconds), making the lines appear identical on the plot. To see them diverge, try increasing the Fragments count or the Inter-DC Latency.
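A quick worked example with illustrative numbers (100B parameters, a 100 Gbps inter-DC link, P = 10 fragments, ε₁ = 10 ms) shows the scale difference:

```python
N = 100e9                      # parameters
W1 = 100e9 / 8                 # 100 Gbps in bytes/s
bandwidth_time = 2 * N / W1    # T ~= 2N/W model: ~16 s per outer sync
latency_overhead = 10 * 0.010  # P x eps1 = 0.1 s, invisible next to ~16 s
```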