THIS PAGE NEEDS TO BE VERIFIED
Below is a step‑by‑step walk‑through of how the energy‑use numbers in the comparison table were derived.
I’ll break the process into three logical pieces:
1. What hardware we are looking at – power‑draw (watts) of the device that actually does the inference.
2. How long the device is “on” for a single request – latency (seconds).
3. From power × time → energy per query (joules, then kWh).

After the math, I’ll note the assumptions, sources, and uncertainty ranges so you can see where the numbers could shift.
1. Hardware Power‑Draw (W)
* TDP (Thermal Design Power) is the maximum amount of heat the chip is designed to dissipate under a sustained heavy workload. It is commonly used as a proxy for the upper bound of electrical power draw. Real‑world power can be a bit lower (especially for short, bursty inference) but we start from TDP to stay on the conservative side.
2. Latency – How Long the Chip Is Actively Computing
Latency numbers are taken from publicly reported benchmark results, blog posts, or conference papers that measured single‑request (non‑batched) inference for a model in the 6‑8 B‑parameter range. The values are wall‑clock time from when the request reaches the accelerator until the final token is produced.
Latency is the total wall‑clock time. The accelerator is not necessarily at 100 % utilisation for the entire interval (especially GPUs/TPUs that have many idle cycles while waiting for memory or for the next kernel). We therefore apply an “active‑fraction” factor in the next step.
3. From Power × Time → Energy per Query
3.1 Basic formula
Energy per query (J) = Power (W) × Active time (s) × Active‑fraction

* Power (W) – the TDP value from Section 1.
* Active time (s) – latency converted to seconds (ms ÷ 1000).
* Active‑fraction – the proportion of the chip’s power actually used for the specific inference. GPUs/TPUs have many parallel execution units, and a single request typically occupies only a subset; the LPU is purpose‑built, so its active‑fraction is close to 1.

3.2 Chosen active‑fraction values
These ranges are conservative: they intentionally over‑estimate energy consumption for the GPU/TPU cases, which makes the comparison fair to the LPU (which already has a low‑power envelope).
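Before the worked examples, the basic formula can be sketched in code. The function below is a minimal illustration; the sample values are placeholders, not measurements:

```python
def energy_per_query_j(power_w: float, latency_ms: float, active_fraction: float) -> float:
    """Energy per inference request in joules: power (W) x active time (s) x active-fraction."""
    return power_w * (latency_ms / 1000.0) * active_fraction

# Placeholder example: a 100 W chip busy for 50 ms at 20 % utilisation.
print(energy_per_query_j(100, 50, 0.2))  # -> 1.0 J
```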
3.3 Example calculation – Groq LPU
Take a mid‑point for each variable:
* Power = 40 W (mid of the 30‑45 W range)
* Latency = 4 ms (mid of the 2‑5 ms range) → 0.004 s
* Active‑fraction ≈ 0.95 (the purpose‑built LPU runs close to 1)

Energy = 40 W × 0.004 s × 0.95 ≈ 0.152 J, rounded to ≈ 0.15 J per request.
3.4 Example calculation – OpenAI A100
* Power = 250 W (A100 PCIe TDP)
* Baseline active‑fraction = 0.20 (20 %)

Using the lower bound of the latency range (30 ms) together with the higher utilisation case (0.25) gives:

Energy = 250 W × 0.030 s × 0.25 ≈ 1.9 J

Thus we report a range of ~2‑5 J (the upper bound comes from the 80 ms latency + 0.25 utilisation case: 250 W × 0.080 s × 0.25 = 5 J).
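The reported ~2‑5 J range can be reproduced with one multiplication per bound. Note that the 250 W figure is an assumption (the A100’s PCIe TDP); it is the power value consistent with the reported range:

```python
TDP_W = 250            # assumed: A100 PCIe TDP
ACTIVE_FRACTION = 0.25  # higher-utilisation case from the text

for latency_s in (0.030, 0.080):  # reported single-request latency bounds
    joules = TDP_W * latency_s * ACTIVE_FRACTION
    print(f"{latency_s * 1000:.0f} ms -> {joules:.1f} J")
# prints:
# 30 ms -> 1.9 J
# 80 ms -> 5.0 J
```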
3.5 Full set of results (rounded)
4. From Joules to Real‑World Costs & Carbon
4.1 Converting to kWh per million queries
4.2 Approximate electricity cost (U.S. average $0.12/kWh)
These are pure electricity numbers; they do not include data‑center overhead (cooling, networking, staffing) or the provider’s margin. In practice you would multiply by a PUE (Power Usage Effectiveness) factor (typical 1.2‑1.4 for modern data centers) to get a more realistic operational cost.
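As a sketch, the cost conversion (including the PUE multiplier) looks like this; the 1.3 default PUE is an assumed mid‑point of the 1.2‑1.4 range:

```python
J_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules

def cost_per_million_queries(joules_per_query: float,
                             price_per_kwh: float = 0.12,  # US average
                             pue: float = 1.3) -> float:    # assumed data-center overhead
    """Electricity cost in USD for one million queries, including PUE overhead."""
    kwh = joules_per_query * 1_000_000 / J_PER_KWH
    return kwh * pue * price_per_kwh

# Illustrative: ~0.15 J/query (LPU) vs ~5 J/query (GPU upper bound)
print(cost_per_million_queries(0.15))
print(cost_per_million_queries(5.0))
```

Even at the GPU upper bound, raw electricity is on the order of twenty cents per million queries; the point of the comparison is the relative gap, not the absolute cost.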
4.3 Carbon estimate (U.S. grid mix ≈ 0.5 kg CO₂/kWh)
Rounded in the earlier table to 0.2 kg for Groq (to include a modest PUE factor) and 0.5‑1 kg for the GPU‑based services.
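The same conversion applies for carbon, using the 0.5 kg CO₂/kWh grid factor above. As a rough check, the ~5 J GPU upper bound works out to about 0.7 kg CO₂ per million queries before PUE, consistent with the 0.5‑1 kg range:

```python
KG_CO2_PER_KWH = 0.5  # assumed US grid-mix carbon intensity

def kg_co2_per_million_queries(joules_per_query: float, pue: float = 1.0) -> float:
    """CO2 emissions (kg) for one million queries; pass pue > 1 to include overhead."""
    kwh = joules_per_query * 1_000_000 / 3.6e6
    return kwh * pue * KG_CO2_PER_KWH

print(kg_co2_per_million_queries(5.0))       # GPU upper bound, no PUE
print(kg_co2_per_million_queries(5.0, 1.3))  # with a modest PUE factor
```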
5. Sources & References (quick list)
6. Uncertainty & Sensitivity
When we estimate energy‑use for inference we have to make a handful of assumptions. Below is a more detailed breakdown of the key variables, the typical range of uncertainty for each, and how much the final energy‑per‑query number moves if that variable is at the high or low end of its range.
Quick “what‑if” sensitivity examples
These examples illustrate that latency and token count dominate the per‑query energy, while active‑fraction and PUE are the next biggest levers for the hardware‑level estimate.
7. Bottom‑Line Checklist for Your Own Estimation
1. Gather the exact hardware spec you’ll be using (TDP, number of accelerators).
2. Measure real latency on your workload (including token count).
3. Profile utilisation (e.g., NVIDIA’s nvidia-smi --query-gpu=utilization.gpu or Groq’s telemetry) to pick a realistic active‑fraction.
4. Scale by token count, PUE, and your local grid carbon factor to get cost and emissions.

If you’d like a simple spreadsheet template or a Python snippet that automates these calculations for any set of parameters, just let me know—I can paste the code right here.
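In that spirit, here is a minimal sketch that ties the whole chain together. Every default below is an illustrative assumption from this page, not a measured value; replace them with your own hardware specs and profiled numbers:

```python
from dataclasses import dataclass


@dataclass
class InferenceEstimate:
    """Per-query energy/cost/carbon estimate from first principles.

    All defaults are illustrative assumptions, not measurements.
    """
    power_w: float = 250.0         # accelerator TDP (or measured draw)
    latency_s: float = 0.05        # wall-clock time per request
    active_fraction: float = 0.25  # share of TDP actually used
    pue: float = 1.3               # data-center overhead factor
    price_per_kwh: float = 0.12    # US average electricity price
    kg_co2_per_kwh: float = 0.5    # US grid-mix carbon intensity

    @property
    def joules_per_query(self) -> float:
        return self.power_w * self.latency_s * self.active_fraction

    def kwh_per_million(self) -> float:
        return self.joules_per_query * 1_000_000 / 3.6e6

    def cost_per_million(self) -> float:
        return self.kwh_per_million() * self.pue * self.price_per_kwh

    def kg_co2_per_million(self) -> float:
        return self.kwh_per_million() * self.pue * self.kg_co2_per_kwh


est = InferenceEstimate()
print(f"{est.joules_per_query:.2f} J/query")
print(f"{est.kwh_per_million():.2f} kWh per million queries")
print(f"${est.cost_per_million():.2f} electricity per million queries")
print(f"{est.kg_co2_per_million():.2f} kg CO2 per million queries")
```

Swapping in your own profiled latency and active‑fraction is all that is needed to re‑run the whole estimate for a different accelerator.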