A 4B model at 86% HumanEval.
Neural-Physical Chain of Thought (NPCoT) plus a verifier-gated compounding stack lifts Qwen3.5-4B from 58.5% to 85.98% pass@1, ahead of the Qwen3.5-9B baseline, for $0.39 of GPU time. Every solve persists into a prompt cache so the next matching request returns at zero cost. No extra training. No larger model.
One thesis, five subsystems.
A computer can be built out of learned components — and once the whole execution stack is differentiable, programs stop being things you write and become things you search for by gradient descent. NPCoT is one pillar of nCPU, not the whole project.
1 · The neural computer
Every ALU operation is a trained network — 100% exact 32-bit integer arithmetic, exhaustively verified. Neural multiplication is 12× faster than neural addition.
2 · The GPU computer
A complete UNIX machine on a single GPU: 25-command shell, a self-hosting C compiler, real BusyBox and Alpine Linux v3.20, ~1.9M IPS with zero timing variance.
3 · Differentiable synthesis
Programs discovered by gradient descent through a differentiable CPU: Mog at 315/315 and nSynth at 105/105 benchmark coverage. Try it live, or watch it learn Pong.
4 · The coprocessor
The neural ALU injected into a transformer forward pass. Qwen3.5-2B arithmetic: 14.5% → 71.0%. Measured on real HumanEval, not extrapolated.
5 · JEPA machine dynamics
A predictive world model of the computer itself — latent speculation and anomaly detection over an exact execution substrate with unlimited free ground truth.
The headline above — 86% HumanEval from a 4B model — is what happens when pillar 3 is pointed at code. Pillars 1, 2, and 5 are the computer it runs on.
The neural-computer story →
What actually happens.
Conventional chain-of-thought asks a language model to emit reasoning as tokens and trusts it to follow them. NPCoT compiles reasoning into a discrete program inside the forward pass, then caches the program so the next invocation runs without a single gradient op.
1. Train once
A transformer's hidden state drives a differentiable array-reduction head. Gradient descent finds the program that matches the target.
2. Crystallize
When soft-path and discrete-path outputs agree within threshold, the 5-tuple program is cached in the library with a signed fingerprint.
3. Reuse everywhere
Consult the library on new hidden states: 100% hit rate → ~4 ns consult + execute. Runs on CPU, GPU, WASM, or a 475 KB standalone binary.
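The crystallize step can be illustrated with a minimal sketch. It assumes the soft path is a softmax-weighted mixture over a small set of reduction ops and the discrete path runs only the most probable op. The names `REDUCTIONS`, `soft_output`, `discrete_output`, `crystallize`, and the threshold value are hypothetical, not nCPU's API. The real 5-tuple program format and the fingerprint signing are not described on this page, so a plain hash stands in for both.

```python
import hashlib
import math

# Hypothetical op set; the real 5-tuple program format is not shown on this page.
REDUCTIONS = {
    "sum": sum,
    "max": max,
    "min": min,
}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_output(logits, xs):
    """Soft path: probability-weighted mixture of every op's result."""
    probs = softmax(logits)
    return sum(p * op(xs) for p, op in zip(probs, REDUCTIONS.values()))

def discrete_output(logits, xs):
    """Discrete path: run only the single most probable op."""
    names = list(REDUCTIONS)
    best = names[max(range(len(logits)), key=logits.__getitem__)]
    return best, REDUCTIONS[best](xs)

def crystallize(logits, xs, threshold=1e-3):
    """Return (op_name, fingerprint) if both paths agree, else None."""
    name, hard = discrete_output(logits, xs)
    if abs(soft_output(logits, xs) - hard) > threshold:
        return None  # still too soft; keep training
    # Unsigned stand-in fingerprint (assumption); signing is omitted here.
    fingerprint = hashlib.sha256(name.encode()).hexdigest()[:16]
    return name, fingerprint

xs = [3.0, 1.0, 4.0, 1.0, 5.0]
undecided = crystallize([0.1, 0.0, -0.1], xs)  # near-uniform logits
decided = crystallize([30.0, 0.0, 0.0], xs)    # sharply peaked on "sum"
```

With near-uniform logits the mixture is far from any single op's output, so nothing is cached. Once the distribution is peaked, the two paths agree and the discrete program can be stored.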
Verified on real hardware.
No synthetic benchmarks. Every number below comes from an actual run on an NVIDIA RTX A4000 rented on vast.ai (April 18, 2026). Total cost: $0.16.
Library self-consistency check
200 array-reduction problems — NOT a real LLM comparison
This is a regression test for the library itself. For real LLM numbers see the HumanEval / MBPP runs below (pending vast.ai sweep).
Qwen3.5-4B on HumanEval: +27.5 pt full stack
164 problems, greedy decode, RTX 3090 — same weights, layered wrapper
30 hard-fails rescued by autoresearch at $0.39 GPU. Beats Qwen3.5-9B baseline (71.3%). Every solve persists into a compounding store so the next run short-circuits matching prompts at zero cost.
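A verifier-gated store of this kind might look roughly like the sketch below. It is an illustration under assumptions, not the project's implementation: `CompoundingStore`, the whitespace-normalized exact-match key, and the `generate` / `verify` callables are all hypothetical. The only property taken from the description is that a solution is persisted only after it passes verification, and that a matching prompt then skips the model call.

```python
import hashlib

class CompoundingStore:
    """Hypothetical verifier-gated prompt cache (not nCPU's actual class)."""

    def __init__(self):
        self._solutions = {}

    @staticmethod
    def _key(prompt):
        # Exact match on whitespace-normalized prompt text (an assumption).
        normalized = " ".join(prompt.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def solve(self, prompt, generate, verify):
        """Return (solution, from_cache). Only verified solutions are stored."""
        key = self._key(prompt)
        if key in self._solutions:
            return self._solutions[key], True  # short-circuit: no model call
        candidate = generate(prompt)
        if verify(candidate):
            self._solutions[key] = candidate
        return candidate, False

calls = []

def fake_generate(prompt):
    # Stand-in for a model call; records how often it is invoked.
    calls.append(prompt)
    return "def add(a, b):\n    return a + b\n"

def run_tests(source):
    # Stand-in verifier: execute the candidate and check one test case.
    namespace = {}
    exec(source, namespace)
    return namespace["add"](2, 3) == 5

store = CompoundingStore()
first = store.solve("Write add(a, b).", fake_generate, run_tests)
second = store.solve("Write  add(a, b).", fake_generate, run_tests)
```

The second call differs only in whitespace, so it resolves from the store and `fake_generate` runs once.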
Cross-platform reproducibility
Same library, same MAE
Shipping artifacts
The stack.
Differentiable training
ArrayExecutableThoughtHead learns programs by gradient descent. Coprocessor wraps any HF transformer layer behind a max_gate safety cap.
Discrete library
Cosine-similarity keyed cache of DiscreteArrayProgram 5-tuples. LRU eviction, HMAC signing, (ε,δ)-DP perturbation, fingerprint IDs.
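As a rough sketch of how a cosine-keyed, LRU-evicting, HMAC-signed cache can fit together: the class `SignedProgramCache`, its parameters, and the use of opaque byte strings as program payloads are assumptions for illustration, not the library's actual interface. The DP perturbation step is left out.

```python
import hashlib
import hmac
import math
from collections import OrderedDict

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SignedProgramCache:
    """Hypothetical stand-in for the library; programs are opaque bytes."""

    def __init__(self, secret, capacity=2, min_similarity=0.95):
        self._secret = secret
        self._capacity = capacity
        self._min_similarity = min_similarity
        self._entries = OrderedDict()  # fingerprint -> (key_vector, program, tag)

    def _sign(self, program):
        return hmac.new(self._secret, program, hashlib.sha256).digest()

    def put(self, key_vector, program):
        fingerprint = hashlib.sha256(program).hexdigest()[:16]
        self._entries[fingerprint] = (list(key_vector), program, self._sign(program))
        self._entries.move_to_end(fingerprint)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return fingerprint

    def get(self, query):
        best_fp, best_sim = None, self._min_similarity
        for fp, (key_vector, _, _) in self._entries.items():
            sim = cosine(query, key_vector)
            if sim >= best_sim:
                best_fp, best_sim = fp, sim
        if best_fp is None:
            return None  # miss: no key is similar enough
        _, program, tag = self._entries[best_fp]
        if not hmac.compare_digest(tag, self._sign(program)):
            raise ValueError("signature mismatch for " + best_fp)
        self._entries.move_to_end(best_fp)  # mark as recently used
        return program

cache = SignedProgramCache(secret=b"demo-key")
cache.put([1.0, 0.0], b"sum")
cache.put([0.0, 1.0], b"max")
hit = cache.get([0.99, 0.05])   # close to the first key
cache.put([-1.0, 0.0], b"min")  # over capacity: evicts b"max", the LRU entry
miss = cache.get([0.0, 1.0])
```

The linear scan keeps the sketch short. A library with many entries would more plausibly use an approximate nearest-neighbor index for the lookup.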
Native runtime
Pure-Rust executor, Metal compute shader, 130 KB WASM. Zero Python, zero PyTorch on the inference path.
Compliance pipeline
Static verifier proves termination, division safety, overflow bounds. Compliance report emits safe/warn/high aggregate for regulated deployments.
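One way such checks can work is interval analysis over a straight-line program. The sketch below is hypothetical throughout: the `(op, constant)` step format, the step budget, and the mapping of findings to safe/warn/high are assumptions chosen for illustration, not the verifier's real rules.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1

def verify(program, input_range, max_steps=64):
    """Return 'safe', 'warn', or 'high' for a straight-line (op, constant) program.

    Hypothetical severity mapping: an unbounded or unsafe program is 'high',
    a possible 32-bit overflow is 'warn', everything else is 'safe'.
    """
    if len(program) > max_steps:
        return "high"  # step budget exceeded; termination bound not met
    lo, hi = input_range
    level = "safe"
    for op, c in program:
        if op == "add":
            lo, hi = lo + c, hi + c
        elif op == "mul":
            lo, hi = min(lo * c, hi * c), max(lo * c, hi * c)
        elif op == "div":
            if c == 0:
                return "high"  # division by zero is reachable
            # Floor division is monotone in the numerator for a fixed divisor.
            a, b = lo // c, hi // c
            lo, hi = min(a, b), max(a, b)
        else:
            return "high"  # unknown op: nothing can be proven
        if lo < INT32_MIN or hi > INT32_MAX:
            level = "warn"  # some input in range may overflow 32 bits
    return level

ok = verify([("add", 1), ("mul", 2)], (0, 100))
unsafe = verify([("div", 0)], (0, 1))
overflow = verify([("mul", 2**20), ("mul", 2**20)], (0, 10))
```

Because the program has no loops, termination reduces to a length bound. A format with loops would need an explicit iteration bound per loop.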
Federation
merge_libraries across organizations with conflict resolution. Teacher→student distillation via least-squares projection fit on paired hiddens.
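The least-squares projection fit can be sketched in a few lines. This version assumes the projection maps teacher hidden states into the student's space and solves the normal equations in pure Python. The function names and the toy data are hypothetical, and the direction of the mapping is an assumption.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("teacher hiddens are rank-deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_projection(teacher, student):
    """Least-squares W minimizing ||teacher @ W - student|| via normal equations."""
    tt = transpose(teacher)
    return solve(matmul(tt, teacher), matmul(tt, student))

# Toy paired hiddens: 4 samples, teacher dim 3, student dim 2.
teacher = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
true_w = [[1.0, 0.5], [0.0, -1.0], [2.0, 0.0]]
student = matmul(teacher, true_w)
w = fit_projection(teacher, student)
```

Normal equations are fine for a toy example. At real hidden sizes a QR- or SVD-based solver such as `numpy.linalg.lstsq` is the numerically safer choice.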
Session lifecycle
ProgramLibrarySession handles load/save + snapshot/diff. Every task produces an audit trail of what skills changed.
Run it on your laptop in 30 seconds.
git clone https://github.com/robertcprice/nCPU
cd nCPU
python3 -m pytest tests/self_optimizing/ -q
python3 -m demos.npcot_scale_practicality
458 tests, Apple Silicon native, real benchmarks. Takes about one minute total.