When AI runs at the kernel level, with no syscall tax and no OS overhead, inference can use the hardware's full capability. This is what that looks like on bare metal.
All results use Qwen3-1.7B Q4_K_M unless noted. CPU-only entries run on server-class x86_64 silicon — no GPU acceleration.
Assumptions: CPU rows use the same hardware class (server x86_64 AMD EPYC / Intel Xeon) where possible. GPU rows are included for context only; GPU acceleration is a future Zero roadmap item (Stage 16+). The SGLang figure uses BF16 weights rather than Q4 quantisation, so it is not an apples-to-apples comparison with the Q4 CPU results. The 149.3 tok/s result is CPU-only: no GPU, no CUDA, no driver stack.
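Why precision matters for the comparison: single-stream decode on a CPU is usually memory-bandwidth bound. When the weights do not fit in on-chip cache, each generated token streams roughly all of them from DRAM once, so a rough ceiling on decode speed is memory bandwidth divided by weight bytes. The sketch below is that back-of-envelope estimate, not how the 149.3 tok/s figure was derived. The parameter count, bits-per-weight and bandwidth values are illustrative assumptions, and the model ignores KV-cache reads, compute limits and cache reuse.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes."""
    return n_params * bits_per_weight / 8


def decode_ceiling_tok_s(n_params: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Bandwidth-bound upper limit on single-stream decode throughput.

    Assumes every generated token reads all weights from memory exactly once
    (ignores KV-cache traffic, compute limits and on-chip cache reuse).
    """
    bytes_per_token = weight_bytes(n_params, bits_per_weight)
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Illustrative values only: ~1.7e9 parameters and a hypothetical
    # 400 GB/s of usable memory bandwidth.
    for label, bpw in [("BF16", 16.0), ("4.5-bit (assumed Q4 average)", 4.5)]:
        ceiling = decode_ceiling_tok_s(1.7e9, bpw, 400)
        print(f"{label:30s} ceiling ~ {ceiling:.1f} tok/s")
```

Because the ceiling scales with 1 / bits-per-weight, a BF16 run moves roughly 16 / 4.5 ≈ 3.6 times as many bytes per token as a 4.5-bit run on the same machine, which is why the SGLang row is context rather than a like-for-like result.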
Every above-kernel runtime — including llama.cpp, Ollama, vLLM, SGLang — pays a constant tax to the OS. Ring-0 eliminates the toll booth.
Above-kernel runtimes cross the kernel boundary thousands of times per inference call — memory allocation, I/O, threading. At Ring-0, the AI is the kernel. No boundary to cross.
Model weights, KV cache, tokeniser buffers, and device memory live in one flat address space with no privilege-level switches. Cache lines stay hot. NUMA-aware placement is trivial.
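In a single flat address space, placement reduces to carving offsets out of one region. The sketch below is a hypothetical bump allocator, not Zero's actual allocator: it lays out weights, KV cache and tokeniser buffers back to back in one arena and aligns each region to an assumed 64-byte cache line so no two regions share a line. The KV-cache size follows the usual formula (keys and values for every layer and position: 2 × layers × KV heads × head size × context length × bytes per element). The region sizes and the layer/head numbers in the usage block are placeholders, not values taken from the Qwen3-1.7B config.

```python
from dataclasses import dataclass, field

CACHE_LINE = 64  # assumed cache-line size in bytes (typical for x86_64)


def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Keys + values for every layer and position (the leading 2 is K and V)."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem


@dataclass
class FlatArena:
    """Hypothetical bump allocator over one contiguous address range."""
    capacity: int
    top: int = 0
    regions: dict = field(default_factory=dict)

    def reserve(self, name: str, size: int, alignment: int = CACHE_LINE) -> int:
        """Place a region at the next aligned offset and return that offset."""
        start = align_up(self.top, alignment)
        if start + size > self.capacity:
            raise MemoryError(f"arena exhausted while placing {name!r}")
        self.regions[name] = (start, size)
        self.top = start + size
        return start


if __name__ == "__main__":
    arena = FlatArena(capacity=4 << 30)                            # one 4 GiB region (illustrative)
    arena.reserve("weights", 1_100_000_000)                        # placeholder size
    arena.reserve("kv_cache", kv_cache_bytes(28, 8, 128, 4096))    # hypothetical config
    arena.reserve("tokeniser", 3_000_001)                          # placeholder size
    for name, (start, size) in arena.regions.items():
        print(f"{name:10s} offset={start:#x} size={size}")
```

Because every region is an offset into the same range, there are no per-allocation calls into an OS and no page-table switches between the model's data structures.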
Ring-0 reads CPU performance counters, thermal sensors, and memory-controller stats natively — no abstraction layer. The scheduler can make real-time decisions based on actual silicon state, not OS-mediated approximations.
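As an illustration of the kind of decision direct sensor access enables, here is a toy core-selection policy. It is entirely hypothetical: the `CoreState` fields, the thresholds and the fallback rule are assumptions, and this page does not describe how Zero reads counters and sensors. The point is only that the inputs are raw per-core readings rather than values filtered through an OS.

```python
from dataclasses import dataclass


@dataclass
class CoreState:
    core_id: int
    temp_c: float           # die temperature reading in degrees Celsius
    mem_stall_ratio: float  # fraction of cycles stalled on memory (0.0-1.0)


def pick_decode_cores(cores, max_temp_c=85.0, max_stall=0.6):
    """Toy policy: run decode threads only on cores that are below the
    thermal limit and not saturated with memory stalls.

    Falls back to the single coolest core so decoding never stops.
    """
    eligible = [c.core_id for c in cores
                if c.temp_c < max_temp_c and c.mem_stall_ratio < max_stall]
    if eligible:
        return eligible
    coolest = min(cores, key=lambda c: c.temp_c)
    return [coolest.core_id]


if __name__ == "__main__":
    snapshot = [CoreState(0, 70.0, 0.2), CoreState(1, 90.0, 0.1), CoreState(2, 60.0, 0.8)]
    print(pick_decode_cores(snapshot))
```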
Credibility comes from repeatability. Here's exactly what produced the 149.3 tok/s figure.
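Tokens per second is simple arithmetic, but the definition decides whether two numbers are comparable: is prompt processing (prefill) counted, and does the clock start at the request or at the first generated token? The sketch below shows two common conventions side by side. It is an assumption for illustration, not a statement of how the 149.3 tok/s figure was measured.

```python
def decode_tok_s(token_timestamps_s):
    """Decode throughput from per-token wall-clock timestamps (seconds).

    Counts the intervals between consecutive generated tokens, so the time
    to the first token (prefill) is excluded.
    """
    if len(token_timestamps_s) < 2:
        raise ValueError("need at least two token timestamps")
    elapsed = token_timestamps_s[-1] - token_timestamps_s[0]
    return (len(token_timestamps_s) - 1) / elapsed


def end_to_end_tok_s(n_generated, request_start_s, last_token_s):
    """Throughput including prefill: tokens generated / total request time."""
    return n_generated / (last_token_s - request_start_s)


if __name__ == "__main__":
    # Request sent at t=0.0 s; first token at 1.0 s, then one every 0.5 s.
    stamps = [1.0, 1.5, 2.0]
    print(decode_tok_s(stamps), end_to_end_tok_s(len(stamps), 0.0, stamps[-1]))
```

The same run yields 2.0 tok/s under the first convention and 1.5 tok/s under the second, so a reproducible result should state which convention it uses along with the prompt and output lengths.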
Don't take our word for it. Run it yourself on any AMD EPYC or compatible server.
Open-source benchmark harness. Boot from USB, run the standard suite, get a signed results file. Compare against our reference numbers or publish your own.
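This page does not say how results files are signed. As a hypothetical sketch of the idea, the code below serialises the results canonically (sorted keys, fixed separators) and attaches an HMAC-SHA256 tag, so any edit to the numbers makes verification fail. A harness meant for third-party verification would more likely use a public-key signature such as Ed25519, because HMAC requires the verifier to hold the same secret key. The field names and key below are placeholders.

```python
import hashlib
import hmac
import json


def canonical_bytes(results: dict) -> bytes:
    """Deterministic serialisation: sorted keys, no insignificant whitespace."""
    return json.dumps(results, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_results(results: dict, key: bytes) -> dict:
    """Wrap the results with an HMAC-SHA256 tag over their canonical form."""
    tag = hmac.new(key, canonical_bytes(results), hashlib.sha256).hexdigest()
    return {"results": results, "hmac_sha256": tag}


def verify_results(signed: dict, key: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = hmac.new(key, canonical_bytes(signed["results"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["hmac_sha256"])


if __name__ == "__main__":
    key = b"demo-key-not-for-production"                       # placeholder secret
    run = {"model": "Qwen3-1.7B Q4_K_M", "tok_s": 149.3, "gpu": False}  # illustrative record
    signed = sign_results(run, key)
    print(json.dumps(signed, indent=2))
    print("valid:", verify_results(signed, key))
```

Canonicalising before signing matters: two JSON files with the same numbers but different key order or whitespace would otherwise produce different tags.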
Managed benchmark infrastructure. Submit your hardware spec, we run the suite on your behalf on identical Ring-0 hardware and return a verified results report.
Zero runs on any AMD EPYC server. No GPU required. No cloud dependency. You own the hardware, the model, and the runtime.