❯ ncpu // the neural computer

The model doesn't run on the computer. The model is the computer.

A complete computer in which every layer — arithmetic, OS, compiler, display — is either a trained neural network or runs entirely on GPU. Not a simulation of computation: an execution substrate, with every claim below verified by tests you can run.

73/73

Self-hosting C compiler — on the GPU

A ~3,500-line C compiler compiles itself on Metal, then compiles and runs other programs. All 73 test programs pass. (Paper §16)

v3.20

Alpine Linux boots on Metal

Real BusyBox (264 KB, 34 verified applets) and an Alpine Linux v3.20 rootfs run on the GPU via an ELF64 loader and 50+ Linux syscalls. (Paper §17)

2 inst.

Turing-complete through trained networks

MUXLEQ — a two-instruction computer — executes bit-exactly through the neural ALU and boots eForth. Constructive universality proof. (Paper §18)

143K params

Fully neural display pipeline

Char → glyph MLP → color embedding → compositor ConvNet. Every output pixel is a neural forward pass, at 305 FPS as a Metal shader. (Paper §19)

❯ hero demo

A UNIX machine on your GPU.

From demos/HERO_GPU_DEMO_TRANSCRIPT.md — the single highest-signal walkthrough of what makes nCPU unique. Every command below is real and runs today on Apple Silicon.

● ncpu — the GPU is the computer

❯ python -m ncpu gpu
# A real UNIX shell prompt running entirely on the Apple Silicon GPU
# via Metal compute shaders — ~1.9M instructions/sec for real C workloads.
# Run ls, cat, cc, compile programs, fork processes, use pipes.
# This is not "a program running on GPU". This is the operating system
# + hardware of a computer, implemented on the GPU.

❯ python -m ncpu gpu debug
# The 26-command deterministic toolkit:
gpu-trace      gpu-history    gpu-replay     gpu-diff
gpu-break <a>  gpu-watch <a>  gpu-step
gpu-profile    gpu-stack      gpu-heat       gpu-coverage
gpu-taint      gpu-reverse    gpu-sanitize
gpu-const-time     # prove AES has zero timing leakage
gpu-timing-proof
# Why this is impossible on a normal CPU:
#  - every run is bit-identical (sigma = 0.0 cycle variance)
#  - full machine state persists after the program exits
#  - breakpoints / watchpoints have zero overhead (checked in-shader)
#  - diff two executions instruction-by-instruction, exact cycle counts
#  - post-mortem analysis on a process that already terminated

❯ python -m ncpu gpu --neural-alu
# Neural weights active inside the Metal kernel:
#  ADD/SUB/CMP -> Kogge-Stone neural carry-lookahead (64-thread MLP)
#  MUL         -> byte-pair lookup tensor
#  AND/OR/XOR  -> learned truth tables
# Not an approximation: the integer ALU models are exhaustively
# verified to 100% accuracy on all inputs for their bit-width.
# The inversion you can observe: MUL is dramatically faster than ADD.

❯ python -m ncpu gpu alpine --demo
# A real Alpine Linux v3.20 environment boots and runs on the GPU:
#  BusyBox as the multi-call binary; pipes, scripting, /proc.
# The GPU superpower commands are first-class inside this environment.
# This is the concrete existence proof of the thesis.

❯ cd kernels/rust_metal
❯ cargo run --bin ncpu_run -- --elf ../../demos/gpu/busybox.elf \
    --rootfs --interactive
# Bypasses Python entirely — the purest form of
# "the GPU is the computer."

❯ substrate

A self-sufficient computer on a single GPU.

The Rust + Metal kernel executes ~200 ARM64 instructions (integer and floating-point) at roughly 1.9M instructions per second, with zero-copy shared memory and zero cycle-count variance across runs (σ = 0.0 over 270 runs). The CPU is involved only at bootstrap.

Multi-process UNIX OS

fork / pipe / wait / dup2, a 25-command shell, 28 syscalls, up to 15 concurrent processes — compiled C running on Metal shaders. (Paper §14)

Self-hosting compiler chain

cc.c (115,725 bytes of source) lexes itself into 21,388 tokens and emits 90,664 bytes of ARM64 in ~34.6M GPU cycles. Four layers deep: host GCC → GPU compiler₀ → GPU compiler₁ → test program → correct result. (Paper §16.4)

Real Linux userspace

An ELF64 loader runs unmodified aarch64 binaries: BusyBox ls, sort, grep, find and 30 more applets pass on GPU, against an Alpine v3.20 rootfs with 61 directories and 109 files. (Paper §17)

Also running on it: SHA-256, AES-128 (all 6 FIPS 197 vectors pass, T-table timing attacks have nothing to measure), Tetris, Snake, a Brainfuck interpreter, a Forth REPL, a CHIP-8 emulator, an HTTP server, and an MNIST classifier — ~11,300 lines of freestanding C in total. (Paper §13.4, §15)

❯ operating system

neurOS: the OS is trained, not written.

Memory management, scheduling, interrupts, caching, compilation, and monitoring are implemented as trained models with 93.7–100% accuracy. In the v3.1 integration, eight neural models run live alongside the Metal kernel at 76K IPS. (Paper §9, §21)

Model	Architecture	Parameters	Trained accuracy	Role
Display	glyph MLP + color embed + ConvNet	390,916	29 dB PSNR	Text-to-pixel rendering
Cache	LSTM replacement policy	~21K	Belady-optimal	Cache line eviction
Prefetch	LSTM address predictor	~8K	97.8%	Predict memory accesses
Scheduler	Transformer encoder	~12K	99.2%	Multi-process scheduling
Watchdog	LSTM anomaly detector	~6K	100%	Execution health monitoring
GIC	Neural interrupt controller	~4K	93.7%	Syscall priority dispatch
Compiler Opt.	Peephole optimizer MLP	~3K	95.2%	Optimization suggestions
Syscall Pred.	Online bigram model	0 (online)	60–76%	Syscall stream prediction
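To make the "0 (online)" parameter count in the last row concrete, here is a minimal sketch of an online bigram predictor. It is an illustration only, not the project's code: the class name, the `hit_rate` helper, and the syscall names in the usage note are all assumptions.

```python
from collections import Counter, defaultdict


class BigramPredictor:
    """Online bigram model: counts which syscall follows which."""

    def __init__(self):
        # counts[prev][nxt] = how often `nxt` has followed `prev` so far
        self.counts = defaultdict(Counter)
        self.prev = None

    def predict(self):
        """Most frequent successor of the last observed syscall, or None."""
        if self.prev is None or not self.counts[self.prev]:
            return None
        return self.counts[self.prev].most_common(1)[0][0]

    def observe(self, syscall):
        """Update the counts with the syscall that actually happened."""
        if self.prev is not None:
            self.counts[self.prev][syscall] += 1
        self.prev = syscall


def hit_rate(stream):
    """Fraction of syscalls predicted correctly, predicting before each update."""
    model, hits = BigramPredictor(), 0
    for s in stream:
        hits += model.predict() == s
        model.observe(s)
    return hits / len(stream)
```

Nothing is trained ahead of time: the only state is the count table, which fills in as the stream is observed. On a repeating `open`, `read`, `close` stream the model misses until it has seen each transition once and then predicts every call.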

Three components keep learning at runtime — the TLB, cache, and scheduler take single conservative gradient steps during normal operation. No conventional OS learns from its own scheduling decisions in real time. (Paper §9.11)

❯ arithmetic

100% exact neural arithmetic — exhaustively verified.

The neural ALU reaches 100% accuracy on 32-bit integer arithmetic, verified exhaustively over every possible input. The trick is memorization-by-decomposition: break each operation into sub-problems with finite, enumerable input spaces (8-entry truth tables, 16-entry carry combiners, a 65,536-entry multiplication LUT), train each to 100%, and compose them structurally. Weights are frozen; sigmoid thresholds have >0.4 margins, so accuracy is permanent. (Paper §5–6)
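As a rough illustration of decomposition, the sketch below builds a 32-bit adder from an 8-entry sum table and a 16-entry carry combiner. Plain Python dictionaries stand in for the trained networks, and the names (`COMBINE`, `SUM_BIT`, `kogge_stone_add`) are assumptions. This textbook Kogge-Stone layout needs five doubling passes for 32 bits; the project's own pass structure is not reproduced here.

```python
from itertools import product

WIDTH = 32

# 16-entry carry combiner: merge a high (generate, propagate) pair with a low one.
COMBINE = {
    (gh, ph, gl, pl): (gh | (ph & gl), ph & pl)
    for gh, ph, gl, pl in product((0, 1), repeat=4)
}

# 8-entry truth table for the sum bit: a XOR b XOR carry_in.
SUM_BIT = {(a, b, c): a ^ b ^ c for a, b, c in product((0, 1), repeat=3)}


def kogge_stone_add(x, y):
    """Add two WIDTH-bit integers (wrapping) via the two lookup tables."""
    a = [(x >> i) & 1 for i in range(WIDTH)]
    b = [(y >> i) & 1 for i in range(WIDTH)]
    # Per-bit generate and propagate signals.
    g = [ai & bi for ai, bi in zip(a, b)]
    p = [ai ^ bi for ai, bi in zip(a, b)]
    # Parallel-prefix passes: the span each position covers doubles every pass.
    dist = 1
    while dist < WIDTH:
        g_new, p_new = g[:], p[:]
        for i in range(dist, WIDTH):
            g_new[i], p_new[i] = COMBINE[(g[i], p[i], g[i - dist], p[i - dist])]
        g, p = g_new, p_new
        dist *= 2
    # After the prefix, g[i] is the carry out of bit i; the carry into bit 0 is 0.
    carry_in = [0] + g[:-1]
    bits = [SUM_BIT[(a[i], b[i], carry_in[i])] for i in range(WIDTH)]
    return sum(bit << i for i, bit in enumerate(bits))
```

Both tables are small enough to check entry by entry, and the adder's correctness then follows from how they are wired together rather than from sampling 32-bit inputs.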

Instruction	Strategy	Latency
ADD/SUB/CMP	Kogge-Stone carry-lookahead (8 passes)	248 µs
MUL	Byte-pair LUT (65,536 entries)	21 µs
AND/OR/XOR	Vectorized truth table	21 µs
SHL/SHR	Attention-based bit routing	434 µs
DIV	Restoring division (neural subtraction)	varies
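The DIV row names restoring division, which needs only shifts, comparisons, and subtraction. A minimal unsigned sketch follows, with ordinary Python subtraction standing in for the neural subtractor; the function name and the `width` parameter are assumptions.

```python
def restoring_divide(dividend, divisor, width=32):
    """Unsigned restoring division built from shift, compare, and subtract."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient, remainder = 0, 0
    for i in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        # Trial subtraction; on failure the old remainder is kept ("restored").
        trial = remainder - divisor
        if trial >= 0:
            remainder = trial
            quotient |= 1 << i
    return quotient, remainder
```

The loop performs one trial subtraction per quotient bit, so a division costs many subtractions rather than one.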

The performance inversion

In silicon, multiplication is the expensive operation. Here the relationship inverts: neural multiplication (21 µs) is roughly 12× faster than neural addition (248 µs), because addition needs an 8-pass carry chain through an MLP, while multiplication decomposes into parallel byte-pair table lookups. Lookup depth is O(1), whereas the carry chain is O(log n) in the operand width.
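The byte-pair decomposition can be sketched in a few lines. The table below is a plain Python list rather than a learned tensor, ordinary integer addition accumulates the partial products, and the names `BYTE_PRODUCT` and `lut_mul32` are assumptions made for illustration.

```python
# 65,536-entry table: the product of every pair of bytes.
BYTE_PRODUCT = [a * b for a in range(256) for b in range(256)]

MASK32 = 0xFFFFFFFF


def lut_mul32(x, y):
    """32-bit wrapping multiply from byte-pair lookups, shifts, and adds."""
    xs = [(x >> (8 * i)) & 0xFF for i in range(4)]
    ys = [(y >> (8 * j)) & 0xFF for j in range(4)]
    total = 0
    for i, xb in enumerate(xs):
        for j, yb in enumerate(ys):
            if i + j < 4:  # higher partial products lie entirely above bit 31
                total += BYTE_PRODUCT[(xb << 8) | yb] << (8 * (i + j))
    return total & MASK32
```

No lookup depends on the result of another, so all of them can be issued at once. That independence is what the carry chain in addition lacks.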

Determinism tests confirm 100 repeated executions produce identical results, across platforms and PyTorch versions.

❯ universality

MUXLEQ: Turing-complete in two instructions.

The strongest possible proof of neural computational universality: a two-instruction computer (SUBLEQ + MUX, 16-bit memory, 65,536 words) where every SUB routes through the Kogge-Stone neural adder (~248 µs/op) and every MUX through neural AND/OR/NOT gates (~63 µs/op) — zero fallbacks, bit-exact results. It loads .dec images and boots eForth; 32 tests verify all instruction cases and neural/compute parity.

If trained networks execute a two-instruction computer exactly, the construction extends in principle to any instruction set, because a Turing-complete machine can emulate any other. The proof is constructive: a working implementation, not an existence argument. (Paper §18)

● muxleq — the entire instruction set

SUBLEQ  [A] = [A] - [B]; if [A] <= 0, jump to C
MUX     [A] = ([B] AND [C]) OR (NOT [B] AND [D])

# In neural mode:
#   SUB -> arithmetic.pt + carry_combine.pt  (~248 us/op)
#   MUX -> logical.pt neural AND/OR/NOT      (~63 us/op)
# Loads .dec images. Boots eForth. 32 tests green.
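The two operations can be written out on 16-bit words as a minimal sketch. It covers only the semantics listed above: how instructions are encoded in memory is not shown here and is deliberately left out. Plain Python operators stand in for the neural adder and gates, and the function names are assumptions.

```python
MASK16 = 0xFFFF


def is_leq_zero(word):
    """True if a 16-bit word is zero or negative in two's complement."""
    return word == 0 or bool(word & 0x8000)


def subleq(mem, a, b):
    """[A] = [A] - [B] (wrapping); returns True if the branch should be taken."""
    mem[a] = (mem[a] - mem[b]) & MASK16
    return is_leq_zero(mem[a])


def mux(mem, a, b, c, d):
    """[A] = ([B] AND [C]) OR (NOT [B] AND [D]); [B] selects bits from [C] or [D]."""
    sel = mem[b]
    mem[a] = (sel & mem[c]) | (~sel & MASK16 & mem[d])
```

For example, with `[B] = 0xFF00`, `[C] = 0x1234` and `[D] = 0xABCD`, MUX writes `0x12CD`: the high byte comes from `[C]` and the low byte from `[D]`.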

❯ prediction

JEPA: the computer predicts itself.

Alongside exact execution, a JEPA-style world model learns machine state transitions in a compressed latent space: latent_state_t + instruction → predictor → latent_state_t+1. A Python demo turns prediction error into a live anomaly signal; a Rust Metal implementation (kernels/rust_metal/src/jepa/, 2,858 lines) observes deterministic GPU execution and steers scheduling through learned bias overrides.

Because the substrate underneath is exact, this world model gets what most lack: unlimited free ground truth (just run more programs), and the ability to mix predicted and exact execution at will — cheap latent speculation when exploring, exact execution when it matters.
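The prediction-error signal can be illustrated with a toy. In the sketch below, a per-instruction running-mean delta stands in for the learned predictor and a fixed scaling stands in for the learned encoder, so this is not a JEPA implementation, only the shape of the loop: predict the next latent, run the exact step, score the gap. All names are assumptions.

```python
from collections import defaultdict


def encode(state):
    """Toy encoder: scale raw register values into a small latent vector."""
    return [v / 256.0 for v in state]


class DeltaPredictor:
    """Predicts latent_{t+1} as latent_t plus the mean delta seen for that instruction."""

    def __init__(self, dim):
        self.dim = dim
        self.sum_delta = defaultdict(lambda: [0.0] * dim)
        self.count = defaultdict(int)

    def predict(self, latent, instr):
        n = self.count[instr]
        if n == 0:
            return list(latent)  # no data yet: predict "no change"
        return [z + s / n for z, s in zip(latent, self.sum_delta[instr])]

    def update(self, latent, instr, next_latent):
        """Learn from one exact transition supplied by the substrate."""
        acc = self.sum_delta[instr]
        for k in range(self.dim):
            acc[k] += next_latent[k] - latent[k]
        self.count[instr] += 1


def anomaly_score(predicted, observed):
    """Squared L2 distance between predicted and observed latents."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))
```

After the predictor has seen an increment instruction a few times, a normal increment scores near zero while a corrupted register value produces a large score. Every training transition comes from exact execution, which is the free ground truth described above.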

❯ the bridge

A program no human wrote, running on a computer that is a neural network.

The two halves of this project meet in one artifact. The synthesizer was given six input/output pairs for an unnamed function — 1→1, 2→5, 3→14, 4→30, 5→55, 6→91 — and nothing else. No name, no description, no human-written line. In 9.4 seconds it discovered the program by gradient descent, emitted it as Mog, and machine-transpiled it to C.

Then that C was compiled by the self-hosting compiler running on the GPU, wrapped as an ELF, and executed on the same neural-substrate computer. Author, toolchain, and machine are all the system itself.

● synthesized — nobody wrote this

# nsynth discovered this from 6 I/O pairs (synth_gradient, 9.4s)
fn sum_of_squares(a: i64) -> i64 {
    acc: i64 = 0;
    i: i64 = 1;
    while i <= a {
        acc = acc + i * i;
        i = i + 1;
    }
    return acc;
}

● executed — on the GPU computer

❯ compile on rust_metal GPU   171,715 cycles · 1,108-byte binary
❯ wrap as ELF                 5,204 bytes
❯ execute on rust_metal GPU   3,195 cycles · exit 0

inputs   1  2  3  4  5  6  | 7  10  12   20   ← last four UNSEEN
gpu out  1  5 14 30 55 91  |140 385 650 2870
oracle   1  5 14 30 55 91  |140 385 650 2870  ✓ all match

❯ the full path, four independent executors agreeing

nsynth (gradient) → Mog → C → host gcc compiles cc.c only → the self-hosting compiler runs on the rust_metal GPU → minimal-ELF wrap → execute on the rust_metal GPU

The GPU output is cross-checked three ways — against the arithmetic oracle, against host clang compiling the same C, and against the Mog program transpiled to Python. All four agree on all ten inputs, including the four the synthesizer never saw. Reproduce it with python demos/bridge/synthesized_on_gpu.py; the recorded run is artifacts/bridge_demo_result.json.
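The arithmetic side of the cross-check is easy to reproduce by hand. The sketch below is a hand translation of the Mog loop into Python (not the output of the project's transpiler) compared against the closed form n(n+1)(2n+1)/6 on the same ten inputs; the names `oracle` and `INPUTS` are assumptions.

```python
def sum_of_squares(a):
    """Hand translation of the synthesized Mog loop."""
    acc, i = 0, 1
    while i <= a:
        acc = acc + i * i
        i = i + 1
    return acc


def oracle(n):
    """Closed form for 1^2 + 2^2 + ... + n^2."""
    return n * (n + 1) * (2 * n + 1) // 6


# The last four inputs were not among the six training pairs.
INPUTS = [1, 2, 3, 4, 5, 6, 7, 10, 12, 20]
outputs = [sum_of_squares(n) for n in INPUTS]
assert outputs == [oracle(n) for n in INPUTS]
```

The values match the `gpu out` and `oracle` rows in the transcript, including 140, 385, 650, and 2870 for the unseen inputs.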

❯ run it

Run it yourself.

macOS / Apple Silicon for the Metal path. Everything below is in the repo README.

● zsh — ~/nCPU

❯ pip install -e ".[demo,dev]"

# The headline demo: the GPU as a complete computer
❯ python -m ncpu gpu                 # boot it
❯ python -m ncpu gpu --neural-alu    # neural ALU inside the Metal shader
❯ python -m ncpu gpu debug           # 26-command deterministic debugger
❯ python -m ncpu gpu alpine --demo   # Alpine Linux v3.20 on the GPU

# JEPA predictive layer
❯ python3 -m ncpu.jepa_neural_cpu.demo
❯ python -m ncpu.world_model.quickstart

# Rust-native, no Python required
❯ cd kernels/rust_metal
❯ cargo run --bin ncpu_run -- --elf ../../demos/gpu/busybox.elf --rootfs -- echo hello
