Benchmark

149.3 tokens/sec.
Ring-0. CPU-only.

When inference runs at the kernel level, with no syscall tax and no OS overhead, the hardware's full throughput goes to the model. This is what that looks like on bare metal.

149.3 tok/s
Zero on AMD EPYC 9354P
CPU-only · Ring-0

4–15× faster

10–35 tok/s
Typical above-kernel CPU inference

Comparison

The same model. Same hardware class.
Very different results.

All results use Qwen3-1.7B Q4_K_M unless noted. CPU-only entries run on server-class x86_64 silicon — no GPU acceleration.

| System / Runtime | Model | Mode | Tokens/sec |
|---|---|---|---|
| Zero on AMD EPYC 9354P; Ring-0 kernel runtime | Qwen3-1.7B Q4_K_M | CPU · Ring-0 | 149.3 |
| Intel i7-13700K, llama.cpp Q4; above-kernel (Linux + glibc) | Qwen3-1.7B | CPU · userland | ~35 |
| Fast desktop CPU, Ollama / llama.cpp class; above-kernel (macOS / Linux) | Qwen3-1.7B Q4_K_M | CPU · userland | 10–15 |
| Consumer GPU, 6 GB VRAM; above-kernel (CUDA + OS driver stack) | Qwen3-1.7B | GPU · userland | ~126 |
| Qwen official SGLang, 1 GPU; above-kernel (CUDA stack) | Qwen3-1.7B BF16 | GPU · serving | 227.8 |

Assumptions: CPU rows use matching hardware class (server x86_64 EPYC / Xeon) where possible. GPU rows included for context only — GPU acceleration is a future Zero roadmap item (Stage 16+). BF16 vs Q4 quantisation means the SGLang figure is not apples-to-apples with Q4 CPU results. The 149.3 result is CPU-only, no GPU, no CUDA, no driver stack.
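
The 4–15× headline follows directly from the CPU rows in the comparison. A minimal sketch of that arithmetic, using only the figures quoted on this page (the helper name `speedup` is illustrative, not part of any Zero tooling):

```python
# Figures copied from the comparison table on this page.
ZERO_TOK_S = 149.3
USERLAND_CPU_TOK_S = (10.0, 35.0)  # slowest and fastest above-kernel CPU rows


def speedup(ours, theirs):
    """Return how many times faster `ours` is than `theirs`."""
    return ours / theirs


low = speedup(ZERO_TOK_S, max(USERLAND_CPU_TOK_S))   # against ~35 tok/s
high = speedup(ZERO_TOK_S, min(USERLAND_CPU_TOK_S))  # against 10 tok/s
print(f"{low:.1f}x to {high:.1f}x")  # prints "4.3x to 14.9x"
```

The GPU rows are left out of this ratio on purpose, since the page lists them for context only.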

Architecture advantage

Why Ring-0 wins on raw throughput

Every above-kernel runtime — including llama.cpp, Ollama, vLLM, SGLang — pays a constant tax to the OS. Ring-0 eliminates the toll booth.

Zero syscall overhead

Above-kernel runtimes cross the kernel boundary thousands of times per inference call — memory allocation, I/O, threading. At Ring-0, the AI is the kernel. No boundary to cross.
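
For readers who want a rough feel for what one kernel-boundary crossing costs on a conventional OS, the sketch below times a cheap system call from userland. It is an illustration run on an ordinary OS, not part of Zero, and it says nothing about how many syscalls any particular runtime makes. `os.getppid` is assumed to enter the kernel on each call on mainstream platforms. The figure includes Python interpreter overhead, so a no-op call is timed alongside it as a baseline.

```python
import os
import time


def mean_call_cost_ns(fn, calls=200_000):
    """Average wall-clock cost of calling `fn` once, in nanoseconds."""
    start = time.perf_counter_ns()
    for _ in range(calls):
        fn()
    return (time.perf_counter_ns() - start) / calls


def noop():
    """Baseline: a Python call that never leaves userland."""
    return None


syscall_ns = mean_call_cost_ns(os.getppid)
baseline_ns = mean_call_cost_ns(noop)
print(f"getppid: {syscall_ns:.0f} ns/call, no-op baseline: {baseline_ns:.0f} ns/call")
```

The gap between the two numbers is only a loose estimate of a single boundary crossing. Real inference runtimes issue heavier calls (allocation, I/O, thread wake-ups) whose cost differs.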

Single address space

Model weights, KV cache, tokeniser buffers, and device memory live in one flat address space with no privilege-level switches. Cache lines stay hot. NUMA-aware placement is trivial.

Direct hardware telemetry

Ring-0 reads CPU performance counters, thermal sensors, and memory-controller stats natively — no abstraction layer. The scheduler can make real-time decisions based on actual silicon state, not OS-mediated approximations.

Read the full architecture →

Methodology

How we measured it

Credibility is repeatable. Here's exactly what produced the 149.3 tok/s figure.

Hardware

AMD EPYC 9354P

64 logical CPUs · 128 GB ECC DDR5 · Native NIC · SMP enabled · No GPU · No accelerator

Model

Qwen3-1.7B Q4_K_M

4-bit quantised GGUF format. Same weights used in all CPU comparison entries.

Output length

32 output tokens

Short generation window to isolate token-generation throughput from prefill latency.
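
To put the short window in perspective, the sketch below converts the reported throughput into time per run and time per token. It uses only the two numbers quoted on this page (32 tokens, 149.3 tok/s); the function name is illustrative.

```python
def seconds_for(tokens, tok_per_s):
    """Wall-clock seconds needed to emit `tokens` at a given throughput."""
    if tok_per_s <= 0:
        raise ValueError("throughput must be positive")
    return tokens / tok_per_s


run_s = seconds_for(32, 149.3)   # one 32-token generation
per_token_ms = 1000.0 / 149.3    # average time per generated token
print(f"{run_s * 1000:.0f} ms per run, {per_token_ms:.2f} ms per token")
# prints "214 ms per run, 6.70 ms per token"
```

At roughly a fifth of a second per run, any prefill time left inside the timed region would shift the result noticeably, so a reproduction needs to define exactly where its timer starts.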

Measurement

Wall-clock, 10 runs

Median over 10 consecutive runs after a 2-run warm-up. Reported figure is the median, not peak.
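
The procedure above can be sketched as a small timing loop. This is a minimal illustration under stated assumptions, not the Ring0 Benchmark harness: `generate` is a hypothetical callable standing in for whatever runtime is under test, and `fake_generate` exists only so the sketch runs on its own.

```python
import statistics
import time


def measure_tok_s(generate, n_tokens=32, warmup=2, runs=10):
    """Median tokens/sec over `runs` timed calls, after `warmup` untimed calls.

    `generate` is a hypothetical callable that produces up to `n_tokens`
    tokens and returns how many it actually produced.
    """
    for _ in range(warmup):
        generate(n_tokens)  # warm-up runs are executed but not recorded

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(n_tokens)
        elapsed = time.perf_counter() - start
        samples.append(produced / elapsed)
    return statistics.median(samples), samples


def fake_generate(n_tokens):
    """Stand-in for a real runtime: sleeps about 1 ms per token."""
    time.sleep(0.001 * n_tokens)
    return n_tokens


median_tok_s, samples = measure_tok_s(fake_generate)
print(f"median {median_tok_s:.1f} tok/s over {len(samples)} runs")
```

Reporting the median rather than the best run keeps a single lucky sample from setting the headline number.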

Runtime

Zero kernel v0.1-alpha

Ring-0 inference path. No Linux kernel. No libc. Bare-metal boot → inference loop.

Repeatability

Self-hostable

The Ring0 Benchmark tool (see below) lets anyone reproduce this on compatible hardware. Results within ±3% across runs in our lab.
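
When comparing your own runs against the ±3% figure, a check along these lines may help. It assumes the tolerance is measured relative to the median of the runs, which is one reasonable reading of the statement above, not a documented definition. The sample values in the usage lines are made up for illustration.

```python
import statistics


def within_tolerance(samples, tolerance=0.03):
    """True if every sample lies within `tolerance` (as a fraction) of the median."""
    if not samples:
        raise ValueError("need at least one sample")
    mid = statistics.median(samples)
    return all(abs(s - mid) <= tolerance * mid for s in samples)


# Hypothetical tok/s readings, not measured data.
print(within_tolerance([147.0, 149.3, 151.2]))  # prints True
print(within_tolerance([140.0, 149.3, 151.2]))  # prints False
```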

We publish these numbers knowing they invite scrutiny. If you reproduce this benchmark and get a different result, tell us — we'll update this page with your data and methodology.

Ring0 Benchmark

Benchmark your own hardware

Don't take our word for it. Run it yourself on any AMD EPYC or compatible server.

Free · Self-hosted

Ring0 Benchmark

Open-source benchmark harness. Boot from USB, run the standard suite, get a signed results file. Compare against our reference numbers or publish your own.

Qwen3, Llama, Mistral model support
CPU + memory bandwidth profiling
Signed, verifiable output format
AGPLv3 — fully auditable

Get on GitHub — Free

Paid · Managed

Ring0 Benchmark Cloud

Managed benchmark infrastructure. Submit your hardware spec and we run the suite on your behalf on identical Ring-0 hardware, then return a verified results report.

Verified reference environment
Side-by-side comparison with Ring0 baseline
PDF report with signed hash chain
Priority queue for enterprise customers

Contact for pricing →

The fastest CPU inference is
already open source.

Zero runs on any AMD EPYC server. No GPU required. No cloud dependency. You own the hardware, the model, and the runtime.

Get Zero · Free
Star on GitHub
