149.3 tokens/sec.
Ring-0. CPU-only.

When AI runs at the kernel level — no syscall tax, no OS overhead — the hardware gets to keep its full potential. This is what that looks like on bare metal.

149.3 tok/s: Zero on AMD EPYC 9354P (CPU-only · Ring-0)
4–15× faster than
10–35 tok/s: typical above-kernel CPU inference

The same model. Same hardware class.
Very different results.

All results use Qwen3-1.7B Q4_K_M unless noted. CPU-only entries use no GPU acceleration. The Zero entry runs on server-class x86_64 silicon; the above-kernel CPU entries are desktop-class parts, as labelled in the table.

| System / Runtime | Model | Mode | Tokens/sec |
|---|---|---|---|
| Zero on AMD EPYC 9354P (Ring-0 kernel runtime) | Qwen3-1.7B Q4_K_M | CPU · Ring-0 | 149.3 |
| Intel i7-13700K, llama.cpp Q4 (above-kernel: Linux + glibc) | Qwen3-1.7B | CPU · userland | ~35 |
| Fast desktop CPU, Ollama / llama.cpp class (above-kernel: macOS / Linux) | Qwen3-1.7B Q4_K_M | CPU · userland | 10–15 |
| Consumer GPU, 6 GB VRAM (above-kernel: CUDA + OS driver stack) | Qwen3-1.7B | GPU · userland | ~126 |
| Qwen official SGLang, 1 GPU (above-kernel: CUDA stack) | Qwen3-1.7B BF16 | GPU · serving | 227.8 |

Assumptions: CPU rows use a matching hardware class (server x86_64, EPYC / Xeon) where possible; the i7-13700K and "fast desktop CPU" rows are desktop-class and are labelled as such. GPU rows are included for context only; GPU acceleration is a future Zero roadmap item (Stage 16+). The SGLang figure uses BF16 weights rather than Q4 quantisation, so it is not an apples-to-apples comparison with the Q4 CPU results. The 149.3 result is CPU-only: no GPU, no CUDA, no driver stack.
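To make the headline ratio easy to check, here is a minimal sketch that recomputes the speedup range from the table figures. The numbers are copied from the table above (the "~35" entry is approximate); the function name `speedup_range` is ours, chosen for illustration.

```python
# Throughput figures copied from the comparison table (tokens/sec).
ZERO_TOK_S = 149.3

# (label, low, high). Single values use low == high; "~35" is approximate.
USERLAND_CPU = [
    ("Intel i7-13700K, llama.cpp Q4", 35.0, 35.0),
    ("Fast desktop CPU, Ollama / llama.cpp class", 10.0, 15.0),
]


def speedup_range(zero, low, high):
    """Return (min_speedup, max_speedup) of `zero` over a baseline range."""
    return zero / high, zero / low


for label, low, high in USERLAND_CPU:
    lo_x, hi_x = speedup_range(ZERO_TOK_S, low, high)
    print(f"{label}: {lo_x:.1f}x to {hi_x:.1f}x")

# Per-token time budget at the headline rate, in milliseconds.
ms_per_token = 1000.0 / ZERO_TOK_S
print(f"{ms_per_token:.2f} ms per token")
```

Across the full 10–35 tok/s band this gives roughly 4.3× to 14.9×, which is where the rounded "4–15×" figure comes from.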

Why Ring-0 wins on raw throughput

Every above-kernel runtime — including llama.cpp, Ollama, vLLM, SGLang — pays a constant tax to the OS. Ring-0 eliminates the toll booth.

Zero syscall overhead

Above-kernel runtimes cross the kernel boundary thousands of times per inference call — memory allocation, I/O, threading. At Ring-0, the AI is the kernel. No boundary to cross.
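If you want a feel for what one kernel crossing costs on your own machine, the sketch below times a call that enters the kernel against a call that stays in user space. It is a rough estimate only: Python interpreter overhead is included in both measurements, the helper `per_call_ns` is our own name, and it says nothing about how many crossings a particular runtime actually makes.

```python
import os
import sys
import time


def per_call_ns(fn, calls=200_000):
    """Average wall-clock cost of calling `fn`, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        fn()
    return (time.perf_counter_ns() - start) / calls


# Baseline: a built-in that returns a cached value without entering the kernel.
baseline_ns = per_call_ns(sys.getrecursionlimit)

# os.getppid() asks the kernel for the parent process ID on every call.
syscall_ns = per_call_ns(os.getppid)

print(f"user-space call : {baseline_ns:.0f} ns")
print(f"kernel crossing : {syscall_ns:.0f} ns")
print(f"rough difference: {syscall_ns - baseline_ns:.0f} ns per call")
```

The difference between the two lines is an approximation of the per-crossing cost as seen from a userland process; results vary with CPU, kernel version, and mitigation settings.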

Single address space

Model weights, KV cache, tokeniser buffers, and device memory live in one flat address space with no privilege-level switches. Cache lines stay hot. NUMA-aware placement is trivial.

Direct hardware telemetry

Ring-0 reads CPU performance counters, thermal sensors, and memory-controller stats natively — no abstraction layer. The scheduler can make real-time decisions based on actual silicon state, not OS-mediated approximations.
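For contrast, this is roughly what the OS-mediated path looks like from userland on Linux: temperatures are read as text files that the kernel exposes under sysfs. This is a sketch of the above-kernel approach, not of how Zero reads sensors; the function names are ours, and the code returns an empty result on systems without `/sys/class/thermal`.

```python
from pathlib import Path


def parse_millidegrees(raw):
    """Convert a sysfs thermal reading (millidegrees Celsius) to degrees."""
    return int(raw.strip()) / 1000.0


def read_thermal_zones(root="/sys/class/thermal"):
    """Return {zone_name: temp_C} for readable zones; empty if none exist."""
    base = Path(root)
    if not base.is_dir():
        return {}
    temps = {}
    for temp_file in sorted(base.glob("thermal_zone*/temp")):
        try:
            temps[temp_file.parent.name] = parse_millidegrees(temp_file.read_text())
        except (OSError, ValueError):
            continue  # zone unreadable or not reporting a number
    return temps


print(read_thermal_zones())
```

Every value here passes through a kernel driver, a filesystem read, and a text conversion before the caller sees it.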

Read the full architecture →

How we measured it

Credibility is repeatable. Here's exactly what produced the 149.3 tok/s figure.

Hardware: AMD EPYC 9354P
64 logical CPUs · 128 GB ECC DDR5 · Native NIC · SMP enabled · No GPU · No accelerator

Model: Qwen3-1.7B Q4_K_M
4-bit quantised GGUF format. Same weights used in all CPU comparison entries.

Generation length: 32 output tokens
Short generation window, chosen to isolate token-generation throughput from prefill latency.

Measurement: Wall-clock, 10 runs
Median over 10 consecutive runs after a 2-run warm-up. The reported figure is the median, not the peak.

Runtime: Zero kernel v0.1-alpha
Ring-0 inference path. No Linux kernel. No libc. Bare-metal boot → inference loop.

Repeatability: Self-hostable
The Ring0 Benchmark tool (see below) lets anyone reproduce this on compatible hardware. Results are within ±3% across runs in our lab.
We publish these numbers knowing they invite scrutiny. If you reproduce this benchmark and get a different result, tell us — we'll update this page with your data and methodology.
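The measurement procedure above (2 warm-up runs, 10 timed runs, median of wall-clock throughput) can be sketched in a few lines. This is an illustration of the method, not the Ring0 Benchmark source: `generate` is a hypothetical callable that blocks until the requested number of tokens has been produced, and `fake_generate` is a stand-in so the sketch runs without a model.

```python
import statistics
import time


def benchmark(generate, n_tokens=32, warmup=2, runs=10):
    """Median tokens/sec of `generate(n_tokens)` over `runs` timed runs."""
    for _ in range(warmup):
        generate(n_tokens)  # warm-up runs; timings are discarded

    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(n_tokens)
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)

    median = statistics.median(rates)
    # Largest relative deviation of any single run from the median.
    spread = max(abs(rate - median) / median for rate in rates)
    return median, spread


def fake_generate(n_tokens):
    """Stand-in workload (hypothetical): sleeps 1 ms per token."""
    time.sleep(0.001 * n_tokens)


median_tok_s, spread = benchmark(fake_generate)
print(f"median {median_tok_s:.1f} tok/s, spread ±{spread * 100:.1f}%")
```

Reporting the median rather than the best run keeps a single lucky measurement from setting the headline number, and the spread value gives a direct check against a ±3% repeatability claim.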

Benchmark your own hardware

Don't take our word for it. Run it yourself on any AMD EPYC or compatible server.

Free · Self-hosted

Ring0 Benchmark

Open-source benchmark harness. Boot from USB, run the standard suite, get a signed results file. Compare against our reference numbers or publish your own.

  • Qwen3, Llama, Mistral model support
  • CPU + memory bandwidth profiling
  • Signed, verifiable output format
  • AGPLv3 — fully auditable
Get on GitHub — Free
Paid · Managed

Ring0 Benchmark Cloud

Managed benchmark infrastructure. Submit your hardware spec and we run the suite on your behalf on identical Ring-0 hardware, then return a verified results report.

  • Verified reference environment
  • Side-by-side comparison with Ring0 baseline
  • PDF report with signed hash chain
  • Priority queue for enterprise customers
Contact for pricing →
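For readers unfamiliar with the term, a hash chain links each result record to the one before it, so editing any record invalidates everything after it. The sketch below shows the general idea using SHA-256 from the Python standard library. The record layout, field names, and `GENESIS` value are assumptions made for illustration and do not describe the actual Ring0 Benchmark file format; the final signing step (signing the last hash with a private key) is omitted.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev, record):
    """SHA-256 over the previous hash plus a canonical JSON encoding."""
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()


def chain_records(records):
    """Link each record to its predecessor by digest."""
    chained, prev = [], GENESIS
    for record in records:
        digest = _digest(prev, record)
        chained.append({"record": record, "prev": prev, "hash": digest})
        prev = digest
    return chained


def verify_chain(chained):
    """Recompute every digest; any edited record breaks the chain."""
    prev = GENESIS
    for entry in chained:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


# Placeholder values only, not measured results.
chain = chain_records([{"run": i, "tok_s": float(i)} for i in range(3)])
print(verify_chain(chain))
```

Changing any field in any record and calling `verify_chain` again returns `False`, which is the property that makes a results file tamper-evident.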

The fastest CPU inference is
already open source.

Zero runs on any AMD EPYC server. No GPU required. No cloud dependency. You own the hardware, the model, and the runtime.