nCPU / the neural computer
The model doesn't run on the computer.
The model is the computer.
A complete computer in which every layer — arithmetic, OS, compiler, display — is either a trained neural network or runs entirely on GPU. Not a simulation of computation: an execution substrate, with every claim below verified by tests you can run.
Self-hosting C compiler — on the GPU
A ~3,500-line C compiler compiles itself on Metal, then compiles and runs other programs. All 73 test programs pass. (Paper §16)
Alpine Linux boots on Metal
Real BusyBox (264 KB, 34 verified applets) and an Alpine Linux v3.20 rootfs run on the GPU via an ELF64 loader and 50+ Linux syscalls. (Paper §17)
Turing-complete through trained networks
MUXLEQ — a two-instruction computer — executes bit-exactly through the neural ALU and boots eForth. Constructive universality proof. (Paper §18)
Fully neural display pipeline
Char → glyph MLP → color embedding → compositor ConvNet. Every output pixel is a neural forward pass, at 305 FPS as a Metal shader. (Paper §19)
The hero demo: a UNIX machine on your GPU.
From demos/HERO_GPU_DEMO_TRANSCRIPT.md — the single highest-signal walkthrough of what makes nCPU unique. Every command below is real and runs today on Apple Silicon.
$ python -m ncpu gpu
# A real UNIX shell prompt running entirely on the Apple Silicon GPU
# via Metal compute shaders — ~1.9M instructions/sec for real C workloads.
# Run ls, cat, cc, compile programs, fork processes, use pipes.
# This is not "a program running on GPU". This is the operating system
# + hardware of a computer, implemented on the GPU.
$ python -m ncpu gpu debug
# The 26-command deterministic toolkit:
gpu-trace gpu-history gpu-replay gpu-diff
gpu-break <a> gpu-watch <a> gpu-step
gpu-profile gpu-stack gpu-heat gpu-coverage
gpu-taint gpu-reverse gpu-sanitize
gpu-const-time # prove AES has zero timing leakage
gpu-timing-proof
# Why this is impossible on a normal CPU:
# - every run is bit-identical (sigma = 0.0 cycle variance)
# - full machine state persists after the program exits
# - breakpoints / watchpoints have zero overhead (checked in-shader)
# - diff two executions instruction-by-instruction, exact cycle counts
# - post-mortem analysis on a process that already terminated
$ python -m ncpu gpu --neural-alu
# Neural weights active inside the Metal kernel:
# ADD/SUB/CMP -> Kogge-Stone neural carry-lookahead (64-thread MLP)
# MUL -> byte-pair lookup tensor
# AND/OR/XOR -> learned truth tables
# Not an approximation: the integer ALU models are exhaustively
# verified to 100% accuracy on all inputs for their bit-width.
# The inversion you can observe: MUL is dramatically faster than ADD.
$ python -m ncpu gpu alpine --demo
# A real Alpine Linux v3.20 environment boots and runs on the GPU:
# BusyBox as the multi-call binary; pipes, scripting, /proc.
# The GPU superpower commands are first-class inside this environment.
# This is the concrete existence proof of the thesis.
$ cd kernels/rust_metal
$ cargo run --bin ncpu_run -- --elf ../../demos/gpu/busybox.elf \
--rootfs --interactive
# Bypasses Python entirely — the purest form of
# "the GPU is the computer."
A self-sufficient computer on a single GPU.
The Rust + Metal kernel executes ~200 ARM64 instructions (integer and floating-point) at roughly 1.9M instructions per second, with zero-copy shared memory and zero cycle-count variance across runs (σ = 0.0 over 270 runs). The CPU is involved only at bootstrap.
Multi-process UNIX OS
fork / pipe / wait / dup2, a 25-command shell, 28 syscalls, up to 15 concurrent processes — compiled C running on Metal shaders. (Paper §14)
Self-hosting compiler chain
cc.c (115,725 bytes of source) lexes itself into 21,388 tokens and emits 90,664 bytes of ARM64 in ~34.6M GPU cycles. Four layers deep: host GCC → GPU compiler₀ → GPU compiler₁ → test program → correct result. (Paper §16.4)
Real Linux userspace
An ELF64 loader runs unmodified aarch64 binaries: BusyBox ls, sort, grep, find and 30 more applets pass on GPU, against an Alpine v3.20 rootfs with 61 directories and 109 files. (Paper §17)
Also running on it: SHA-256, AES-128 (all 6 FIPS 197 vectors pass, T-table timing attacks have nothing to measure), Tetris, Snake, a Brainfuck interpreter, a Forth REPL, a CHIP-8 emulator, an HTTP server, and an MNIST classifier — ~11,300 lines of freestanding C in total. (Paper §13.4, §15)
neurOS: the operating system is trained, not written.
Memory management, scheduling, interrupts, caching, compilation, and monitoring implemented as trained models with 93.7–100% accuracy. In the v3.1 integration, eight neural models run live alongside the Metal kernel at 76K IPS. (Paper §9, §21)
| Model | Architecture | Parameters | Trained accuracy | Role |
|---|---|---|---|---|
| Display | glyph MLP + color embed + ConvNet | 390,916 | 29 dB PSNR | Text-to-pixel rendering |
| Cache | LSTM replacement policy | ~21K | Belady-optimal | Cache line eviction |
| Prefetch | LSTM address predictor | ~8K | 97.8% | Predict memory accesses |
| Scheduler | Transformer encoder | ~12K | 99.2% | Multi-process scheduling |
| Watchdog | LSTM anomaly detector | ~6K | 100% | Execution health monitoring |
| GIC | Neural interrupt controller | ~4K | 93.7% | Syscall priority dispatch |
| Compiler Opt. | Peephole optimizer MLP | ~3K | 95.2% | Optimization suggestions |
| Syscall Pred. | Online bigram model | 0 (online) | 60–76% | Syscall stream prediction |
Three components keep learning at runtime — the TLB, cache, and scheduler take single conservative gradient steps during normal operation. No conventional OS learns from its own scheduling decisions in real time. (Paper §9.11)
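The "online bigram model" row in the table can be illustrated with a minimal sketch. It assumes the predictor simply counts which syscall followed which and predicts the most frequent successor. The class and function names below are hypothetical and are not taken from the nCPU source.

```python
from collections import Counter, defaultdict


class BigramPredictor:
    """Online bigram model: counts which item follows which, no offline training."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # previous item -> Counter of successors
        self.prev = None

    def predict(self):
        """Return the most frequent successor of the last observed item, or None."""
        followers = self.counts.get(self.prev)
        if not followers:
            return None
        return followers.most_common(1)[0][0]

    def observe(self, item):
        """Update the counts with the item that actually occurred."""
        if self.prev is not None:
            self.counts[self.prev][item] += 1
        self.prev = item


def hit_rate(stream):
    """Predict each item before observing it; return the fraction predicted correctly."""
    model = BigramPredictor()
    hits = 0
    for item in stream:
        if model.predict() == item:
            hits += 1
        model.observe(item)
    return hits / len(stream) if stream else 0.0
```

A model like this has no trained weights, which is consistent with the "0 (online)" parameter count in the table: everything it knows comes from the stream it has seen so far.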
100% exact neural arithmetic — exhaustively verified.
The neural ALU reaches 100% accuracy on 32-bit integer arithmetic, verified exhaustively over every possible input. The trick is memorization-by-decomposition: break each operation into sub-problems with finite, enumerable input spaces (8-entry truth tables, 16-entry carry combiners, a 65,536-entry multiplication LUT), train each to 100%, and compose them structurally. Weights are frozen; sigmoid thresholds have >0.4 margins, so accuracy is permanent. (Paper §5–6)
| Instruction | Strategy | Latency |
|---|---|---|
| ADD/SUB/CMP | Kogge-Stone carry-lookahead (8 passes) | 248 µs |
| MUL | Byte-pair LUT (65,536 entries) | 21 µs |
| AND/OR/XOR | Vectorized truth table | 21 µs |
| SHL/SHR | Attention-based bit routing | 434 µs |
| DIV | Restoring division (neural subtraction) | varies |
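The decomposition behind the ADD row can be sketched in plain Python. This is a textbook Kogge-Stone prefix adder in which a 16-entry dictionary stands in for the trained carry combiner; the names `COMBINE` and `kogge_stone_add` are illustrative, and the pass structure is the standard log2(width) one rather than a claim about nCPU's 8-pass layout.

```python
from itertools import product

# 16-entry carry combiner: (g_hi, p_hi, g_lo, p_lo) -> (g, p).
# In nCPU this role is played by a trained network; a plain dict stands in here.
COMBINE = {
    (gh, ph, gl, pl): (gh | (ph & gl), ph & pl)
    for gh, ph, gl, pl in product((0, 1), repeat=4)
}


def kogge_stone_add(a, b, width=32):
    """Add two unsigned integers modulo 2**width with a parallel-prefix carry network."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate bits
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate bits
    half_sum = p[:]  # sum bits before carries are applied

    dist = 1
    while dist < width:  # one prefix pass per doubling of the span
        new_g, new_p = g[:], p[:]
        for i in range(dist, width):
            new_g[i], new_p[i] = COMBINE[(g[i], p[i], g[i - dist], p[i - dist])]
        g, p = new_g, new_p
        dist *= 2

    # After the prefix passes, g[i] is the carry out of bit i.
    result = 0
    for i in range(width):
        carry_in = g[i - 1] if i > 0 else 0
        result |= (half_sum[i] ^ carry_in) << i
    return result
```

Because the combiner has only 16 possible inputs, it can be checked exhaustively, and the adder's correctness then follows from how the passes are wired together rather than from any further training.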
The performance inversion
In silicon, multiplication is the expensive operation. Here the relationship inverts: neural multiplication (21 µs) is about 12× faster than neural addition (248 µs), because addition needs an 8-pass carry chain through an MLP, while multiplication decomposes into parallel byte-pair table lookups: O(1) lookup depth instead of O(log n) passes.
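A rough sketch of the byte-pair decomposition, assuming the 65,536-entry table maps each pair of bytes to its 16-bit product. How nCPU combines the partial products is not described here, so this sketch uses ordinary integer addition for that step; `MUL_LUT` and `lut_mul32` are illustrative names.

```python
# 65,536-entry table: every (byte, byte) pair maps to its 16-bit product.
MUL_LUT = [[x * y for y in range(256)] for x in range(256)]


def lut_mul32(a, b):
    """Multiply two 32-bit unsigned integers modulo 2**32 using byte-pair lookups."""
    a_bytes = [(a >> (8 * i)) & 0xFF for i in range(4)]
    b_bytes = [(b >> (8 * j)) & 0xFF for j in range(4)]
    total = 0
    for i, x in enumerate(a_bytes):
        for j, y in enumerate(b_bytes):
            if i + j < 4:  # higher partial products fall entirely outside 32 bits
                total += MUL_LUT[x][y] << (8 * (i + j))
    return total & 0xFFFFFFFF
```

The lookups do not depend on one another, so all of them can be issued in parallel. That independence is the property the latency comparison above relies on.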
Determinism tests confirm 100 repeated executions produce identical results, across platforms and PyTorch versions.
MUXLEQ: Turing-complete in two instructions.
The strongest possible proof of neural computational universality: a two-instruction computer (SUBLEQ + MUX, 16-bit memory, 65,536 words) where every SUB routes through the Kogge-Stone neural adder (~248 µs/op) and every MUX through neural AND/OR/NOT gates (~63 µs/op) — zero fallbacks, bit-exact results. It loads .dec images and boots eForth; 32 tests verify all instruction cases and neural/compute parity.
If trained networks can execute a two-instruction computer exactly, the same construction extends to any instruction set, because a Turing-complete machine can emulate any other. The proof is constructive: a working implementation, not an existence argument. (Paper §18)
# MUXLEQ: the entire instruction set
SUBLEQ  [A] = [A] - [B]; if [A] <= 0, jump to C
MUX     [A] = ([B] AND [C]) OR (NOT [B] AND [D])
# In neural mode:
# SUB -> arithmetic.pt + carry_combine.pt (~248 µs/op)
# MUX -> logical.pt neural AND/OR/NOT (~63 µs/op)
# Loads .dec images. Boots eForth. 32 tests green.
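The two instructions above can be written out as ordinary functions to make their semantics concrete. This is a sketch, not the nCPU implementation: it uses native Python arithmetic where nCPU routes through the neural ALU, it assumes "<= 0" means a signed 16-bit comparison, and it takes the fall-through address as a parameter because the real instruction encoding is not shown here.

```python
MASK = 0xFFFF  # 16-bit words, as in the description above


def to_signed(word):
    """Interpret a 16-bit word as a two's-complement integer."""
    return word - 0x10000 if word & 0x8000 else word


def subleq(mem, a, b, c, next_pc):
    """[A] = [A] - [B]; return C if the result is <= 0, else the fall-through address."""
    mem[a] = (mem[a] - mem[b]) & MASK
    return c if to_signed(mem[a]) <= 0 else next_pc


def mux(mem, a, b, c, d):
    """[A] = ([B] AND [C]) OR (NOT [B] AND [D]): a bitwise select keyed on [B]."""
    sel = mem[b]
    mem[a] = ((sel & mem[c]) | (~sel & MASK & mem[d])) & MASK
```

For example, with `mem[1] = 0xFF00`, `mem[2] = 0x1234` and `mem[3] = 0xABCD`, calling `mux(mem, 0, 1, 2, 3)` takes the high byte from `mem[2]` and the low byte from `mem[3]`.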
JEPA: the computer predicts itself.
Alongside exact execution, a JEPA-style world model learns machine state transitions in a compressed latent space: latent_state_t + instruction → predictor → latent_state_t+1. A Python demo turns prediction error into a live anomaly signal; a Rust Metal implementation (kernels/rust_metal/src/jepa/, 2,858 lines) observes deterministic GPU execution and steers scheduling through learned bias overrides.
Because the substrate underneath is exact, this world model gets what most lack: unlimited free ground truth (just run more programs), and the ability to mix predicted and exact execution at will — cheap latent speculation when exploring, exact execution when it matters.
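The anomaly signal described above can be sketched as follows. The sketch assumes that latent vectors are plain lists of floats and that the learned predictor is available as a callable; `prediction_error`, `anomaly_flags` and the `threshold` parameter are hypothetical names, and the real encoder and predictor are not reproduced.

```python
def prediction_error(predicted, observed):
    """Mean squared error between a predicted and an observed latent vector."""
    if len(predicted) != len(observed):
        raise ValueError("latent vectors must have the same length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


def anomaly_flags(predictor, latents, instructions, threshold):
    """Flag each step whose next-latent prediction error exceeds the threshold.

    predictor(latent, instruction) -> predicted next latent (hypothetical callable).
    latents[t] is the observed latent before instructions[t] executes, so
    len(latents) == len(instructions) + 1.
    """
    flags = []
    for t, instr in enumerate(instructions):
        err = prediction_error(predictor(latents[t], instr), latents[t + 1])
        flags.append(err > threshold)
    return flags
```

Since exact execution supplies the observed next state at every step, the error needs no labelled data: any step where the prediction and the ground truth diverge sharply is a candidate anomaly.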
Run it yourself.
macOS / Apple Silicon for the Metal path. Everything below is in the repo README.
pip install -e ".[demo,dev]"

# The headline demo: the GPU as a complete computer
python -m ncpu gpu                  # boot it
python -m ncpu gpu --neural-alu     # neural ALU inside the Metal shader
python -m ncpu gpu debug            # 26-command deterministic debugger
python -m ncpu gpu alpine --demo    # Alpine Linux v3.20 on the GPU

# JEPA predictive layer
python3 -m ncpu.jepa_neural_cpu.demo
python -m ncpu.world_model.quickstart

# Rust-native, no Python required
cd kernels/rust_metal
cargo run --bin ncpu_run -- --elf ../../demos/gpu/busybox.elf --rootfs -- echo hello