Technical Whitepaper

Exploring OS Primitives for Inter-Model Scheduling: An Inference OS for Multi-model AI

Published: March 30, 2026

Abstract

Production AI agent systems orchestrate multiple models per request — generation, classification, retrieval, reranking, speech-to-text, document understanding etc. — each with different memory footprints, compute profiles, inference latencies and invocation patterns. The infrastructure serving these models was not built for this. Each model is treated as an independent service with its own GPU partition, and nothing in the stack reasons about how those models interact, contend for memory, or could share resources across time. The practical consequences are expensive: GPU fleets sized for worst-case simultaneous demand across all models, unpredictable eviction penalties if models are allowed to compete for VRAM, and GPUs sitting idle during the pipeline transitions that dominate wall-clock time in agentic workloads.

Neurafewz is an inference operating system designed to close this gap. An offline planning phase (the Builder) resolves kernel selection, memory layout, quantization, and model staging schedules at build time. A deterministic boot phase commits GPU memory from the Builder’s plan rather than discovering allocations at runtime. A runtime scheduler multiplexes GPU resources across models with preemptive dispatch, overlapping model staging with active inference to hide the cost of context switching. The result in practice: substantially more models served per GPU, with minimal measurable degradation in per-model throughput or latency, at a fraction of the fleet cost that conventional per-model provisioning requires.


1. The Multi-Model Inference Problem

Modern AI agent pipelines are not single-model systems. A production pipeline may invoke a speech recognition model, an intent classifier, an embedding model for retrieval, a large generation model, and a reranker — each conditionally, often repeatedly within a single request, all within a latency budget the user can feel. The binding constraint on end-to-end performance is no longer any single model’s inference efficiency. It is the absence of coordination infrastructure across models: how memory is shared, how compute is scheduled between pipeline stages, and how the system responds to concurrent demand across a heterogeneous fleet.

1.1 The Single-Model Optimization Foundation

The past several years have produced remarkable advances in single-model inference: PagedAttention [1] for KV cache fragmentation, continuous batching [2] for dynamic request interleaving, CUDA graph capture for launch overhead elimination, and quantization-aware kernels pushing bandwidth efficiency toward hardware limits. Frameworks like vLLM [1], TGI, and TensorRT-LLM [3] package these techniques into production systems that extract near-peak performance from a single model on a given GPU.

This work is the foundation on which multi-model infrastructure must be built, but it stops at the boundary of a single model. None of these systems has a concept of an integrated fleet. There is no cross-model memory allocator, no scheduler that reasons about which models should be resident, and no mechanism for prioritizing one model’s execution (or that of a group of parallelizable models) over another’s. The moment you deploy more than one model to a GPU, you are outside the design envelope of every major inference framework.

1.2 Compounding Costs of Model Isolation

The consequences of this single-tenant assumption show up in three places, and they compound.

Over-provisioning. When each model owns its GPU partition, the fleet must be sized for worst-case simultaneous demand. A pipeline with a 30B generation model (~60 GB in BF16), a 7B reranker (~14 GB), and a 1.5B embedding model (~3 GB) requires ~77 GB for weights alone. That leaves roughly 3 GB of a single 80 GB A100 for KV cache and activations, so the pipeline effectively needs a second GPU, regardless of whether all three models are ever simultaneously active. In practice, utilization is highly variable and largely disjoint: the embedding model runs while generation is idle; the reranker executes during post-processing. Cost of inference scales with model count rather than concurrent load.
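The sizing arithmetic above can be reproduced in a few lines. This is a rough sketch that assumes 2 bytes per parameter for BF16 and decimal gigabytes; the function and variable names are illustrative and not part of Neurafewz.

```python
# Weight-footprint arithmetic for the example pipeline.
# Assumptions: BF16 uses 2 bytes per parameter; 1 GB = 1e9 bytes.
BYTES_PER_PARAM_BF16 = 2
GPU_CAPACITY_GB = 80.0  # a single 80 GB A100


def weight_footprint_gb(params_billion: float) -> float:
    """Approximate BF16 weight footprint in GB (billions of params x bytes)."""
    return params_billion * BYTES_PER_PARAM_BF16


models = {"generation": 30.0, "reranker": 7.0, "embedding": 1.5}
footprints = {name: weight_footprint_gb(p) for name, p in models.items()}
total_gb = sum(footprints.values())
headroom_gb = GPU_CAPACITY_GB - total_gb  # left for KV cache and activations

print(f"weights: {total_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
```

The headroom figure is the point: the weights nominally fit, but almost nothing remains for KV cache or activation scratch.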

Unplanned eviction and reload. Without a cross-model memory scheduler, memory pressure is handled reactively: LRU eviction, then reload on next use. PCIe bandwidth from host to VRAM is typically 25–50 GB/s in practice (Gen4 through Gen5), so a 60 GB model requires roughly 1.2 to 2.4 seconds of transfer before producing its first token. But the cost itself is less damaging than its unpredictability: a pipeline completing in 200 ms under light load may take 2,200 ms under moderate load, not because any model got slower, but because an unscheduled eviction occurred between stages. No existing inference server even surfaces this cost in its latency accounting, let alone schedules around it.
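The reload penalty is simple bandwidth arithmetic. The sketch below assumes a transfer that sustains the quoted bandwidth for its whole duration, which is optimistic; the names are illustrative.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized host-to-VRAM transfer time at a sustained bandwidth."""
    return size_gb / bandwidth_gb_per_s


reload_fast_s = transfer_seconds(60.0, 50.0)  # upper end of the quoted range
reload_slow_s = transfer_seconds(60.0, 25.0)  # lower end of the quoted range

# A 200 ms pipeline that hits one unscheduled reload between stages.
base_ms = 200.0
latency_range_ms = (base_ms + reload_fast_s * 1000, base_ms + reload_slow_s * 1000)
print(latency_range_ms)
```

The 2,200 ms figure quoted in the text falls inside this range.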

Structural idle time during pipeline transitions. Agentic pipelines are dominated by sequential operations — one model’s output feeds the next. During transitions, while the host post-processes output and prepares the next input, the GPU sits idle. Where stages could run in parallel (independent retrieval and reranking branches, concurrent document processing), a scheduler could overlap them. Without a system-level view of the pipeline graph, neither the idle time nor the parallel opportunities are exploited. This is not a utilization problem that continuous batching can solve — it exists between models, not within them.

The common thread is straightforward: there is no scheduling layer in the inference stack. No residency policy, no preemption, no cross-model priority.

1.3 How This Is Handled Today

The workarounds in production are largely architectural duct tape. Teams deploy each model as an independent microservice behind a load balancer, wire them together with event-driven state machines or orchestration frameworks, and manage the GPU fleet as a collection of single-tenant partitions. This works — in the sense that it ships — but it inherits all the costs above and adds its own: serialization at every service boundary, network hops between models that could share a GPU, and orchestration logic that reasons about model sequencing without any awareness of the hardware executing it. The orchestrator knows which model to call next; it has no idea whether that model is resident, whether the GPU is idle, or whether two independent stages could run in parallel.

The inference ecosystem reflects this gap. Single-model engines (vLLM, TGI, TensorRT-LLM) have driven per-model throughput close to hardware limits but treat the GPU as exclusively owned by one model; multi-model deployment means separate instances with separate memory partitions. NVIDIA Dynamo [4] (1.0, March 2026) coordinates existing engines across multi-node GPU fleets (disaggregated prefill/decode, KV-cache-aware routing, SLO-driven scaling) but sits above the engine layer: it routes requests to vLLM or TensorRT-LLM instances rather than replacing them, and it doesn’t manage multiple models within a single GPU’s memory space. Triton Inference Server hosts multiple models but runs each as an independent execution context. Ray Serve [5] orchestrates at the application level without GPU awareness. NVIDIA MPS and MIG offer static hardware partitioning, not dynamic scheduling.

What’s missing is the layer beneath all of these: a runtime that treats multiple models as co-scheduled processes on shared GPU hardware, with compile-time planning, a unified memory pool, and preemptive dispatch. Operating systems solved the analogous problem for CPU processes decades ago.


2. Design Philosophy: Inference as an Operating System

Multi-model inference is a resource scheduling problem. Operating system theory already has the vocabulary and the mechanisms for it: process isolation, preemptive scheduling, virtual memory, demand paging. Neurafewz attempts to apply these concepts to GPU inference, not as metaphor but as architecture.

Each model invocation is treated as a process: a unit of work with declared resource demands — VRAM, activation scratch, compute time — that it cannot unilaterally satisfy. A model’s definition — its weights, kernel configurations, memory layout requirements — functions as a driver: the static specification and associated capabilities that the system loads and manages. Each invocation of that model is a process instance: scheduled, dispatched, and potentially preempted by the runtime. No model holds a permanent claim to GPU memory or the decision of when it executes. These are system decisions.

The mechanisms that enforce this are drawn directly from OS design:

  • A unified memory allocator manages VRAM as a shared, pooled resource. Weight buffers and activation scratch regions are allocated from a single address space under system control. This is what makes cross-model memory sharing and co-residency structurally possible.

  • A preemptive scheduler time-slices GPU compute among concurrent model processes according to the Builder’s execution plan. A lower-priority model can be suspended at a safe preemption point to yield resources to a higher-priority model on the critical path.

  • A tiered residency system governs which model weights are resident in VRAM versus staged in pinned host memory or NVMe, functioning as plan-driven demand paging. Software signals mediate the transitions — suspending execution when a model’s resource window closes, dispatching it when onload completes — preserving the separation between scheduling policy and execution mechanism.

In practice, this means models don’t need to know they’re sharing hardware. They execute as if they have dedicated resources; the OS layer enforces sharing transparently.
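To make the unified-pool idea concrete, here is a toy first-fit allocator over a single address range. It is a minimal sketch under stated assumptions: sizes are in GB, one region per owner, and the class and method names are hypothetical rather than the Neurafewz allocator.

```python
from dataclasses import dataclass, field


@dataclass
class VramPool:
    """Toy pooled allocator: every model draws from one shared range."""
    capacity: int
    regions: dict = field(default_factory=dict)  # owner -> (offset, size)

    def allocate(self, owner: str, size: int):
        """First-fit placement; returns the offset, or None if nothing fits."""
        cursor = 0
        for offset, length in sorted(self.regions.values()):
            if offset - cursor >= size:
                break  # the gap before this region is large enough
            cursor = offset + length
        if cursor + size > self.capacity:
            return None
        self.regions[owner] = (cursor, size)
        return cursor

    def release(self, owner: str) -> None:
        """Called by the system, never by the model itself."""
        self.regions.pop(owner, None)


pool = VramPool(capacity=24)
llm_offset = pool.allocate("llm", 14)
asr_offset = pool.allocate("asr", 4)
denied = pool.allocate("reranker", 8)   # does not fit alongside both
pool.release("llm")                     # a system decision, e.g. offload
granted = pool.allocate("reranker", 8)  # reuses the freed range
```

Because the pool, not the model, decides placement and release, co-residency and eviction become policy questions for the scheduler.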

The system divides into two phases — an offline planning phase and a runtime that executes the plan:

Neurafewz Phase | OS Analog
Builder | Linker + static scheduler: resolves symbols, lays out memory, produces a fixed execution plan before any process runs
Runtime | Kernel init + process scheduler + demand pager: commits memory allocations at boot, then dispatches execution, enforces time-slicing, and manages the boundary between fast resident memory and slower backing store

The sections that follow describe each phase in detail.


3. The Builder: Offline Planning

The Builder is the compile-time phase of the Neurafewz inference OS. It takes two inputs (a hardware profile and an inference pipeline specification) and produces a complete deployment plan: resolved kernel binaries, static memory layouts, quantization configuration per model, an activation memory map, and an offload/onload schedule. Nothing is left for the runtime to discover. The Builder’s output is a deterministic plan that the runtime executes without modification and that the scheduler enforces without re-solving.

The design principle is simple: every decision that can be made offline should be made offline. A compile-time decision is paid once while an equivalent runtime decision is paid on every request, under latency pressure, with incomplete information. The Builder’s algorithmic sophistication — the quality of the solver, the depth of the profiling, the breadth of framework integrations — deepens with each iteration of the system.

3.1 Inputs to the Builder

3.1.1 Target Hardware Profile

The Builder ingests a detailed hardware profile that constrains every downstream decision. The primary parameters are: SM architecture generation (sm_80 for A100, sm_89 for L40S/RTX 4090, sm_90 for H100), which determines available instruction families — mma.sync variants, wgmma on SM90, TMA for async tensor loads; L1/L2 cache capacity and associativity, which bound tile sizes for cache reuse without thrashing; peak DRAM bandwidth (~ 2 TB/s HBM2e on A100 80GB, 3.35 TB/s HBM3 on H100 SXM), which sets the ceiling for memory-bandwidth-bound operations; warp scheduler count and maximum occupancy per SM, which constrain tiling and pipelining depth; and pinned host memory capacity, which determines how much and what composition of the VRAM requirement can reside in fast-DMA-accessible memory.

These are hard constraints. A tile configuration feasible on H100’s 256 KB combined L1/shared memory per SM may not fit on A100’s 192 KB; the Builder resolves this at compile time.

3.1.2 The Inference Pipeline Specification

The second input is a declarative specification of the inference pipeline. The specification is designed to be generated by static analysis or transpilation of existing orchestration definitions — a LangGraph workflow, a DSPy program, a custom DAG — rather than authored directly, though the current implementation accepts a Neurafewz-native pipeline definition. The specification captures which models participate, their invocation ordering, conditional branching logic, data dependencies, and task-level constraints on sequence or context length.

Critically, the specification also captures tool calls and non-GPU operations — web requests, database lookups, external API calls. These are not executed on the GPU, but their latency profiles are known or estimable, and they represent scheduled idle windows. A tool call with a 150–300 ms expected round-trip is a planning opportunity: the Builder can schedule preemptive onload of the next model behind the tool call’s wall-clock duration, hiding staging latency entirely.

From the specification the Builder extracts: invocation frequency and criticality of each model (critical-path vs. conditionally invoked); parallelizability structure (data-independent stages that can execute concurrently); resource demand profiles (weight footprint, activation scratch, compute class); and tool-call windows available for latency hiding. Offload and onload operations are lazy, not eager — the plan specifies trigger conditions and ordering, not a fixed clock schedule. The runtime enforces these triggers as execution events arrive.
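Whether a tool-call window fully hides an onload is a one-line comparison. The sketch below assumes an idealized transfer at a fixed bandwidth and uses the lower bound of the expected round-trip as the window; the function name and the example numbers are illustrative.

```python
def exposed_onload_ms(weights_gb: float, bandwidth_gb_per_s: float,
                      tool_call_ms: float) -> float:
    """Onload latency that is NOT hidden behind the tool call (0 if hidden)."""
    onload_ms = weights_gb * 1000.0 / bandwidth_gb_per_s
    return max(0.0, onload_ms - tool_call_ms)


# A 3 GB embedding model staged behind a 150 ms tool call at 25 GB/s.
small_model = exposed_onload_ms(3.0, 25.0, 150.0)
# A 14 GB reranker staged behind a 300 ms tool call at the same bandwidth.
large_model = exposed_onload_ms(14.0, 25.0, 300.0)
print(small_model, large_model)
```

A planner could use the exposed remainder to decide whether to start the onload earlier or keep the model resident.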

3.2 Pipeline Graph Construction and Analysis

The Builder constructs a directed acyclic task graph from the pipeline specification, where individual nodes may encapsulate cyclic sub-workflows (e.g., retry loops or iterative generation). Nodes represent units of execution (models, tools, or custom logic); edges represent execution dependencies and control flow.

The first thing the Builder extracts from this graph is the critical path — the longest dependency chain, which determines minimum achievable end-to-end latency. Models on the critical path receive higher residency priority: evicting one has direct latency impact, while evicting a non-critical-path model may be absorbed by scheduling slack.
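Critical-path extraction is a longest-path computation over the DAG in topological order. The sketch below uses Python's standard `graphlib`; the node names and per-node latencies are hypothetical, and each node is assumed to have a single estimated latency.

```python
from graphlib import TopologicalSorter


def critical_path(latency_ms, deps):
    """Return (path, total_ms) for the longest dependency chain.

    deps maps each node to the set of nodes it depends on.
    """
    finish = {}  # node -> finish time along its longest incoming chain
    parent = {}  # node -> predecessor on that chain
    for node in TopologicalSorter(deps).static_order():
        preds = deps.get(node, ())
        parent[node] = max(preds, key=lambda p: finish[p], default=None)
        start = finish[parent[node]] if parent[node] is not None else 0.0
        finish[node] = start + latency_ms[node]
    node = max(finish, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1], total


deps = {
    "asr": set(), "intent": {"asr"}, "safety": {"asr"},
    "embed": {"intent"}, "rerank": {"embed"}, "generate": {"rerank", "safety"},
}
latency_ms = {"asr": 120, "intent": 10, "safety": 30,
              "embed": 20, "rerank": 40, "generate": 600}
path, total_ms = critical_path(latency_ms, deps)
```

In this example the `safety` node is off the critical path, so evicting its model could be absorbed by slack rather than adding latency.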

Separately, the Builder performs parallelism and batch-sharing analysis to identify node groups with no data dependency between them — candidates for concurrent execution and shared GPU occupancy. Where independent branches have compatible batch dimensions (e.g., an embedding model processing retrieved documents in parallel with a reranker scoring the previous batch), the Builder identifies shared-batch opportunities.
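Two nodes are candidates for concurrent execution when neither is an ancestor of the other. A minimal sketch of that check, with hypothetical node names and the same predecessor-map representation:

```python
from itertools import combinations


def ancestors(node, deps, cache):
    """All transitive predecessors of a node (memoized)."""
    if node not in cache:
        found = set()
        for pred in deps.get(node, ()):
            found |= {pred} | ancestors(pred, deps, cache)
        cache[node] = found
    return cache[node]


def independent_pairs(deps):
    """Pairs of nodes with no data dependency in either direction."""
    cache = {}
    return [
        (a, b) for a, b in combinations(sorted(deps), 2)
        if a not in ancestors(b, deps, cache)
        and b not in ancestors(a, deps, cache)
    ]


deps = {
    "retrieve": set(),
    "embed_docs": {"retrieve"},
    "rerank_prev": {"retrieve"},
    "generate": {"embed_docs", "rerank_prev"},
}
pairs = independent_pairs(deps)
```

Batch-dimension compatibility would be a further filter on these pairs; it is not modeled here.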

3.3 Resource Planning

With the pipeline graph analyzed, the Builder runs a multi-pass resource planning phase that jointly optimizes GPU memory allocation across the fleet.

Multi-pass model negotiation reconciles each model’s ideal resource requirements against the shared GPU memory budget. The process is iterative: models are allocated memory according to the current plan, each model’s configuration (quantization depth, max context/sequence length, KV cache capacity) is adjusted to fit, and the plan is re-evaluated globally — since one model’s constraint may free memory that lets another relax its own. A concrete example: if a Whisper-scale model requires 4 GB of activation scratch during spectrogram processing, the negotiation may determine that a co-resident LLM’s maximum sequence length can be reduced from 8192 to 4096 tokens — accepting a configuration constraint in exchange for shared GPU occupancy rather than offload during Whisper inference. Whether this trade is acceptable depends on the task specification. The Builder makes this determination offline; the runtime does not re-negotiate.
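The iterative character of the negotiation can be sketched as a greedy loop: while the plan exceeds the budget, step one model down to its next acceptable configuration and re-evaluate. This is an illustration only. The greedy rule (step down whichever model frees the most memory) and all memory figures are assumptions; the source does not specify the Builder's solver.

```python
def negotiate(options, budget_gb):
    """options: model -> list of (label, gb), ordered ideal to most constrained.

    Returns the chosen label per model, or None if no combination fits.
    """
    choice = {model: 0 for model in options}

    def total():
        return sum(options[m][i][1] for m, i in choice.items())

    while total() > budget_gb:
        # Memory each model would free by moving one step down its ladder.
        savings = {
            m: options[m][i][1] - options[m][i + 1][1]
            for m, i in choice.items() if i + 1 < len(options[m])
        }
        if not savings:
            return None  # every model is already at its floor
        choice[max(savings, key=savings.get)] += 1
    return {m: options[m][i][0] for m, i in choice.items()}


# Hypothetical figures: Whisper-scale weights plus 4 GB scratch, and an LLM
# whose footprint shrinks when max sequence length drops from 8192 to 4096.
options = {
    "whisper": [("fp16+4GB-scratch", 7.0)],
    "llm": [("ctx8192", 20.0), ("ctx4096", 16.0)],
}
plan = negotiate(options, budget_gb=24.0)
infeasible = negotiate(options, budget_gb=10.0)
```

A `None` result corresponds to the case where co-residency is not achievable and an offload tier has to be planned instead.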

Accuracy-first quantization applies the minimum quantization depth that brings each model’s weight footprint within its budget, preserving full precision wherever possible. Where the model author has published a quantization configuration (per-tensor, per-channel, or block-wise granularity), the Builder inherits it as a hard constraint — the granularity decision reflects accuracy characterization that an automated planner cannot safely reinterpret without equivalent data.

When not all layers require quantization to meet the memory budget, the Builder applies quantization guided by layer sensitivity. The intended approach uses established methods (GPTQ-style reconstruction error [6], AWQ’s salient channel identification [7], and per-layer perplexity delta where calibration data is available) to rank layers by quantization robustness. In practice, quantization is applied preferentially to layers known to be robust (typically middle feed-forward layers in large transformers) while preserving full precision in sensitive layers (typically early attention and final output projections). The granularity of this sensitivity ranking improves as the profiling infrastructure matures.
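Sensitivity-guided selection can be sketched as: sort layers from most robust to most sensitive and quantize only until the budget is met. The sensitivity scores, layer names, and the assumption that quantization halves a layer's footprint (BF16 to an 8-bit format) are all illustrative.

```python
def plan_layer_quantization(layers, budget_gb, size_ratio=0.5):
    """layers: list of (name, bf16_gb, sensitivity); lower score = more robust.

    Returns (names of layers to quantize, resulting footprint in GB).
    """
    total = sum(gb for _, gb, _ in layers)
    quantized = []
    for name, gb, _ in sorted(layers, key=lambda layer: layer[2]):
        if total <= budget_gb:
            break  # budget met: remaining layers keep full precision
        total -= gb * (1.0 - size_ratio)
        quantized.append(name)
    return quantized, total


layers = [
    ("attn_0", 1.0, 0.9),   # early attention: sensitive
    ("ffn_10", 2.0, 0.1),   # middle feed-forward: robust
    ("ffn_11", 2.0, 0.2),
    ("lm_head", 1.0, 0.8),  # output projection: sensitive
]
to_quantize, footprint_gb = plan_layer_quantization(layers, budget_gb=4.5)
```

If the returned footprint still exceeds the budget, the caller knows that layer-wise quantization alone is insufficient.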

The output is a complete allocation table: each model’s VRAM budget, quantization configuration, max context and sequence length, co-residency group, preemption priority, offload tier (VRAM-resident, pinned host, pageable host, or NVMe), and onload latency estimate per tier.

3.4 Kernel Design: Data-Flow-Centric Fusion

Most inference frameworks treat operations as first-class objects applied to tensors, each reading inputs from DRAM and writing results back before the next begins. Fusion is an optimization applied opportunistically to adjacent operation pairs.

Neurafewz inverts this. The primary design question is not “what operations do we apply?” but “how do we move data through the necessary transformations while touching DRAM the minimum number of times?” This IO-aware approach to kernel design, pioneered by FlashAttention [8], treats computation as a memory movement problem. Fusion is the default; separation into distinct kernels requires explicit justification (tile size incompatibility, register pressure, occupancy constraints).

Two examples illustrate the practical impact.

Fused dequantization + GEMM + residual addition eliminates two DRAM round-trips. In an unfused FP8 pipeline: dequantization reads FP8 weights and writes BF16 to DRAM; the GEMM reads BF16 and writes FP32 accumulation to DRAM; residual addition reads the accumulation and residual tensor. A fused kernel performs dequantization in registers as weights load for the GEMM, accumulates in FP32 within the warp’s register file, and folds residual addition into the epilogue before a single DRAM write. At memory-bandwidth-bound operating points — which characterize autoregressive decode for all large models — eliminating two write-read pairs per layer has direct throughput impact.
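A byte-counting model shows why the two eliminated round-trips matter at decode. The sketch assumes FP8 weights (1 byte), BF16 activations, residual, and output (2 bytes), FP32 accumulation (4 bytes), and ignores caches entirely; the shapes are illustrative.

```python
def unfused_bytes(m: int, k: int, n: int) -> int:
    """DRAM traffic when dequant, GEMM, and residual add are separate kernels."""
    dequant = k * n * 1 + k * n * 2            # read FP8 weights, write BF16
    gemm = k * n * 2 + m * k * 2 + m * n * 4   # read BF16 weights + input, write FP32
    residual = m * n * 4 + m * n * 2 + m * n * 2  # read acc + residual, write output
    return dequant + gemm + residual


def fused_bytes(m: int, k: int, n: int) -> int:
    """DRAM traffic for one fused kernel with a single output write."""
    return k * n * 1 + m * k * 2 + m * n * 2 + m * n * 2


# Single-token decode through a 4096 x 4096 projection.
unfused = unfused_bytes(1, 4096, 4096)
fused = fused_bytes(1, 4096, 4096)
print(unfused, fused, unfused / fused)
```

At batch size 1 the weight traffic dominates, so the fused path moves roughly a fifth of the bytes under this model.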

Fused RMSNorm + QKV projection + rotary embedding eliminates materialization of the normalized activation tensor entirely. Without fusion, normalized activations are written to DRAM after RMSNorm and re-read by the QKV projection. With fusion, normalization happens in shared memory or registers, the projection computes on normalized values without a DRAM round-trip, and rotary embedding is applied to Q and K in the same kernel pass. DRAM sees only the final Q, K, V tensors.

3.5 Kernel Selection

Given the fused kernel designs available for a target hardware and model architecture, the Builder selects the specific configuration — tile dimensions, warp count, pipeline depth, quantization variant — that maximizes throughput for each operation on the deployment hardware.

The primary compile-time variables are: TILE_M, TILE_N, TILE_K for GEMM kernels (constrained by shared memory capacity and tensor core granularity — mma.m16n8k16 for BF16/FP16 on SM80+, mma.m16n8k32 for FP8 on SM89+, mma.m16n8k8 for TF32, with wgmma supporting larger persistent tiles on SM90); warp count per threadblock; software pipeline depth for async loads; quantization type (BF16, FP8 E4M3, INT8); and for attention kernels, sequence length bucket, head dimension, and causal masking configuration.

Selection uses a known-good selection table keyed on (Model, Hardware) tuples. For each supported model on each SM generation, the Builder maintains a pre-profiled table of optimal configurations produced by semi-exhaustive search. The search space is pruned analytically (tile configurations exceeding shared memory or violating alignment are eliminated before profiling), and remaining configurations are profiled against representative problem sizes. As the profiling infrastructure matures, selection is being extended to be pipeline-aware: a kernel achieving peak throughput in isolation may degrade when co-scheduled with an adjacent kernel competing for L2 cache, and capturing these interactions is an active area of refinement.
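The analytic pruning step can be illustrated by enumerating tile candidates and discarding those that break alignment or exceed shared memory before any profiling. The candidate grid, the shared-memory formula (one A tile and one B tile per pipeline stage), and the two limits used below are simplifying assumptions, not the Builder's actual search space.

```python
from itertools import product


def smem_bytes(tile_m, tile_n, tile_k, stages, elem_bytes=2):
    """Shared memory for `stages` buffered A (MxK) and B (KxN) tiles."""
    return stages * (tile_m * tile_k + tile_k * tile_n) * elem_bytes


def feasible_configs(smem_limit_bytes, mma_shape=(16, 8, 16)):
    """Candidates that satisfy alignment and fit in shared memory."""
    mma_m, mma_n, mma_k = mma_shape
    survivors = []
    for tm, tn, tk, stages in product((64, 128, 256), (64, 128, 256),
                                      (32, 64), (2, 3, 4)):
        aligned = tm % mma_m == 0 and tn % mma_n == 0 and tk % mma_k == 0
        if aligned and smem_bytes(tm, tn, tk, stages) <= smem_limit_bytes:
            survivors.append((tm, tn, tk, stages))
    return survivors


# Two illustrative per-block shared-memory limits (not vendor figures).
small = feasible_configs(96 * 1024)
large = feasible_configs(160 * 1024)
```

Only the survivors would then be profiled against representative problem sizes to populate the selection table.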

At build time, the Builder emits the selected configuration as a compile-time constant. There is no runtime autotuning, no warm-up period, no suboptimal early selections.

The natural extension is cross-model occupancy-aware selection: when co-resident models execute in overlapping windows, their kernels share SM resources, and the optimal tile configuration for each may differ from its single-model optimum. This interaction between co-resident configurations, occupancy, and L2 pressure is not yet captured by per-model profiling, and is an area of active research.

3.6 Weight Layout and Memory Planning

Kernel selection fully determines weight layout requirements. The optimal layout for a weight tensor is a function of the kernel that will consume it (tile access pattern, vectorized load width, shared memory bank layout, coalesced DRAM access properties, and so on), not the other way around.

The resulting layouts are not just row-major or column-major. For a fused GEMM kernel on a specific architecture, the layout is a complex interleaved permutation: weight blocks stored in the order consumed by the warp’s mma instructions, enabling maximally coalesced 128-byte loads, conflict-free shared memory bank access, and clean mapping to pipelined load instructions (ldmatrix and cp.async on SM80, TMA on SM90, and so on). The Builder resolves these layouts for every weight tensor at build time. Weights are stored on disk in the permuted layout the kernel expects, with no runtime transposition or re-tiling.
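The principle of storing blocks in consumption order can be shown with a much simpler permutation than a real mma-fragment layout: reorder a row-major matrix so that each tile is contiguous. This is a toy sketch that assumes the dimensions divide evenly by the tile size; real layouts interleave at a finer granularity.

```python
def to_tiled(matrix, tile_rows, tile_cols):
    """Flatten a row-major matrix so that each tile is stored contiguously."""
    rows, cols = len(matrix), len(matrix[0])
    flat = []
    for r0 in range(0, rows, tile_rows):
        for c0 in range(0, cols, tile_cols):
            for r in range(r0, r0 + tile_rows):
                flat.extend(matrix[r][c0:c0 + tile_cols])
    return flat


# A 4x4 matrix holding 0..15 in row-major order, re-tiled into 2x2 blocks.
matrix = [[4 * r + c for c in range(4)] for r in range(4)]
tiled = to_tiled(matrix, 2, 2)
```

A kernel that walks tiles left to right, top to bottom then reads this buffer strictly sequentially.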

Disk layout is a further optimization. Sequential reads outperform random access on NVMe (random 4K throughput is typically 10–20× lower than sequential) and benefit from host memory prefetch hardware. The Builder arranges weight blocks on disk in the order they will be read during boot and onload operations.

Where multiple models share weight blocks — base weights between a model and its fine-tuned adapter, or shared embedding tables across models in the same vocabulary family — the Builder assigns shared blocks to a single physical allocation, reducing total memory demand. This deduplication is planned offline with full visibility into all models’ weight inventories.

3.7 Build Output: A Self-Contained Executable

The output of the Builder is a single statically compiled and linked binary for the target hardware. The kernel constants, memory layout, execution plan, resource allocation table, activation memory map, and offload/onload schedule are embedded in the binary as static data structures, not loaded from external configuration at runtime. This eliminates the need for runtime middleware, interpreters, and JIT compilers.

The practical consequence is a smaller binary and a smaller deployment surface. A conventional inference stack (PyTorch, CUDA runtime, cuDNN, cuBLAS, framework dispatch infrastructure, and associated shared libraries) requires several gigabytes of runtime installation. A Neurafewz build statically links the CUDA runtime, cuFFT, and all kernel code into a single binary; the only runtime dependencies are the NVIDIA driver’s user-mode library (libcuda.so) and standard system libraries (libstdc++, libpthread, libdl). No Python interpreter, no cuDNN, no cuBLAS, no framework dispatch layer. The resulting binary for a specific pipeline on a specific hardware target is a few hundred megabytes, a direct measure of how much of the conventional stack exists to handle decisions at runtime that Neurafewz has already resolved at build time.


4. Runtime: The Inference OS

The runtime encompasses everything from initial memory commitment through live request scheduling. The scheduling mechanisms described below (dispatch, preemption, and adaptation) are functional and are being hardened against progressively more demanding workload patterns.

4.1 Boot: Deterministic Resource Commitment and Model Staging

At boot, the runtime reads the Builder’s embedded resource allocation table and commits GPU memory accordingly — weight buffers, activation pools, scratch regions, all at planned sizes and alignments. No runtime measurement or adaptive sizing is involved; the VRAM footprint is known before the binary is deployed.

With memory committed, the runtime stages model weights. Neurafewz supports two boot modes that make the warmup-vs-responsiveness tradeoff explicit.

In eager mode, the system stages all models into their planned memory tiers before admitting requests. This suits long-running services where startup latency is acceptable and minimizing steady-state onload events is the priority.

In on-demand mode, boot follows a critical-path-first, lazy-deferred strategy. Only the model required for the first pipeline stage is loaded into VRAM up front, at the highest priority. The moment it is ready and inference begins, the runtime uses the compute window to begin staging the next model in the plan.

A concrete example: if the pipeline begins with speech recognition, only Whisper’s weights are loaded at boot. As Whisper inference runs on the first audio input, the runtime transfers the next model’s weights (e.g., an intent classifier) from pinned host memory into VRAM over lower priority streams. By the time Whisper produces its transcript, the classifier may already be resident — staging latency hidden entirely behind compute.

This pattern extends through the pipeline. Each model’s onload is initiated at the latest point that allows completion before it is needed. The system reaches full warm state progressively, driven by actual execution. Time-to-first-response is bounded by the load time of a single model plus the first stage’s inference time, not the aggregate load time of the fleet. The saving relative to eager boot therefore grows with the number and size of models in the pipeline.
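The difference between the two boot modes reduces to which load times sit in front of the first response. The sketch below uses hypothetical per-model load and inference times in milliseconds, and assumes that staging of stage i+1 starts when stage i begins inference, with one stage of lookahead.

```python
def time_to_first_response_ms(load_ms, first_infer_ms, mode):
    """Boot-to-first-response for the first pipeline stage."""
    if mode == "eager":
        return sum(load_ms) + first_infer_ms   # every model staged first
    if mode == "on_demand":
        return load_ms[0] + first_infer_ms     # only the first model staged
    raise ValueError(f"unknown boot mode: {mode}")


def exposed_staging_ms(load_ms, infer_ms):
    """Per-transition staging latency not hidden behind the previous stage."""
    return [max(0, load_ms[i + 1] - infer_ms[i])
            for i in range(len(load_ms) - 1)]


load_ms = [400, 100, 2400]   # e.g. ASR, classifier, generation model
infer_ms = [300, 20, 900]
eager = time_to_first_response_ms(load_ms, infer_ms[0], "eager")
on_demand = time_to_first_response_ms(load_ms, infer_ms[0], "on_demand")
exposed = exposed_staging_ms(load_ms, infer_ms)
```

The second entry of `exposed` shows the limit of one-stage lookahead: a large model behind a short stage needs its onload started earlier in the plan.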

4.2 Dispatch: Resource-Readiness-Aware Scheduling

The Builder’s plan provides intended dispatch ordering for each pipeline stage. At runtime, the scheduler uses this plan as its primary guide but weighs it against resource readiness. Strict plan adherence can penalize execution when actual state diverges from assumptions — a model predicted to be warm may not yet have completed onload, or a parallel branch may have finished early.

The scheduler therefore dispatches any model that is both plan-eligible and resource-ready, rather than holding it to a fixed schedule. When two models are candidates for parallel execution, the one whose weights are warm fires immediately; its sibling dispatches as soon as it achieves readiness.
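The dispatch rule is a conjunction of two predicates: plan-eligible (all dependencies completed) and resource-ready (weights resident). A minimal sketch with hypothetical model names; tracking of currently running models is omitted for brevity.

```python
def ready_to_dispatch(plan_order, deps, completed, resident):
    """Models that are both plan-eligible and resource-ready, in plan order."""
    return [
        model for model in plan_order
        if model not in completed
        and deps[model] <= completed   # every dependency has finished
        and model in resident          # weights are already in VRAM
    ]


plan_order = ["retrieve", "embed_docs", "rerank_prev", "generate"]
deps = {
    "retrieve": set(),
    "embed_docs": {"retrieve"},
    "rerank_prev": {"retrieve"},
    "generate": {"embed_docs", "rerank_prev"},
}
completed = {"retrieve"}

# Only the reranker is warm: it fires now, its sibling waits for onload.
first = ready_to_dispatch(plan_order, deps, completed, {"rerank_prev", "generate"})
# Once the embedding model's onload completes, it becomes dispatchable too.
later = ready_to_dispatch(plan_order, deps, completed,
                          {"rerank_prev", "generate", "embed_docs"})
```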

This implements opportunistic slack exploitation at the pipeline level. At fork-join points — where parallel branches must both complete before a downstream model can proceed — the join-point wait is not idle time. The scheduler fills it with data-independent work: ready models from the current pipeline, or invocations from concurrently executing pipelines that can claim available GPU occupancy.

4.3 Concurrent Request Execution

Multiple in-flight requests are not executed as independent sequential pipelines. The scheduler flattens them into a unified dispatch pool. Invocations from different requests targeting the same model at the same stage are batched — inputs concatenated along the batch dimension, the model executing once for all. Invocations targeting different models are candidates for concurrent execution under the Builder’s co-residency plan.

The resulting pattern resembles a single interleaved inference path rather than N independent pipelines. Each request sees a dedicated execution environment while the scheduler sees a shared work pool and keeps the GPU as fully occupied as the plan permits.
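Flattening requests into a unified dispatch pool amounts to grouping pending invocations by (model, stage). The sketch below uses a hypothetical tuple representation for invocations; payloads stand in for input tensors that would be concatenated along the batch dimension.

```python
from collections import defaultdict


def build_batches(invocations):
    """Group (request_id, model, stage, payload) tuples by (model, stage)."""
    groups = defaultdict(list)
    for request_id, model, stage, payload in invocations:
        groups[(model, stage)].append((request_id, payload))
    return dict(groups)


pending = [
    ("req-1", "embedder", 2, "doc A"),
    ("req-2", "embedder", 2, "doc B"),
    ("req-3", "reranker", 3, "candidates C"),
]
batches = build_batches(pending)
```

Each key with more than one entry executes once for all of its requests; distinct keys are candidates for concurrent execution under the co-residency plan.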

4.4 Preemption: Lazy, Layer-Granular, Priority-Stratified

Preemption is always lazy: it is triggered only when the scheduler determines that a pending invocation cannot make forward progress because a currently executing model holds resources it needs, and that holding is not justified by the executing model’s priority or plan position. Preemption is always executed at a layer boundary; an inference pass is never interrupted mid-layer. Layer boundaries are safe preemption points: internal state is fully defined by weight tensors and the layer’s input activation.

The scheduler stratifies preemption into priority levels that trade fairness against the cost of discarding work. At the most urgent end, critical priority suspends at the next decoder step boundary — the finest granularity at which a generative model can be cleanly suspended with its state preserved. Medium priority is less aggressive: the executing model completes its current forward pass before suspending, retaining results for the current batch. At the other end, low priority defers suspension until all in-flight batches complete — the model accepts no new work but fulfills existing commitments, minimizing disruption at the cost of longer preemption latency.

There is also a fairness concern specific to agentic workloads: a long generative sequence should not indefinitely block a short inference path that arrives mid-execution. The scheduler may promote short-path invocations accordingly, preempting the long-running execution so that the short path can complete and release its resources before the long-running task resumes.

4.5 Plan Deviation and Runtime Adaptation

The Builder’s plan is accurate but not omniscient, and the runtime does not re-plan inline when deviation occurs; avoiding that is precisely why the Builder exists. Instead, the scheduler responds through pre-specified fallback behaviors that are themselves products of the planning phase. When a model overruns its planned time window (more decoder steps than predicted, or a larger batch than anticipated), the scheduler absorbs the overrun by consuming scheduling slack in non-critical-path stages.

Memory pressure is handled more aggressively. If actual KV cache consumption exceeds configured maximums, or concurrent pipelines land at memory-intensive stages simultaneously, the scheduler applies low-priority preemption to the lowest-criticality resident model. The preprocessing queue continues accepting work; the inference queue throttles until pressure resolves. The degradation order is deliberate: throughput degrades before latency, and latency before errors.
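The memory-pressure response can be sketched as two decisions: which resident model to preempt, and whether the inference queue must stay throttled afterwards. The tuple layout, criticality scores, and sizes are hypothetical.

```python
def handle_memory_pressure(resident, free_gb, needed_gb):
    """resident: list of (model, criticality, gb); lower criticality goes first.

    Returns the model to preempt and whether inference must keep throttling.
    """
    victim, _, victim_gb = min(resident, key=lambda entry: entry[1])
    free_after = free_gb + victim_gb
    return {
        "preempt": victim,
        "throttle_inference_queue": free_after < needed_gb,
    }


resident = [("generation", 3, 14.0), ("reranker", 1, 4.0), ("embedding", 2, 2.0)]
still_tight = handle_memory_pressure(resident, free_gb=0.5, needed_gb=5.0)
resolved = handle_memory_pressure(resident, free_gb=0.5, needed_gb=4.0)
```

Throttling the inference queue while preprocessing continues matches the stated degradation order: throughput is sacrificed first.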

If a model’s onload hasn’t completed by the time its dispatch window opens, the scheduler holds the dispatch and advances other ready work. The late model dispatches on completion and the pipeline incurs latency at that stage but does not stall globally.


5. Performance Characteristics

Results are presented at three levels: kernel throughput, per-model inference, and end-to-end pipeline. All measurements on RTX 4090 (24 GB, sm_89, multi-server setup with RTX 5080) unless noted. BF16/FP16 peak: 165.2 TFLOPS.

5.1 Kernel Performance

These results isolate the effect of data-flow-centric fusion (Section 3.4) and compile-time kernel selection (Section 3.5) at the individual operation level.

5.1.1 Compute-Bound: Fused RMSNorm + QKV Projection + RoPE + KV Cache Write

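| Implementation | Wall-clock (mean) | Throughput (TFLOPS) | % of Peak | Speedup |
|---|---|---|---|---|
| PyTorch | 0.4657 ms | 110.8 | 67.0% | baseline |
| Triton | 0.4154 ms | 124.2 | 75.2% | 1.12× |
| Neurafewz | 0.3473 ms | 148.5 | 89.9% | 1.34× |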

QKV projection [M=1024, K=4096, N=1024] + RMSNorm + RoPE (on-the-fly) + KV cache write | BF16 | 3 warmup / 100-run mean
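The derived columns in the kernel tables follow from two ratios: throughput over the stated 165.2 TFLOPS peak, and baseline wall-clock over candidate wall-clock. A short check using values from the table above (the function names are illustrative):

```python
PEAK_TFLOPS = 165.2  # BF16/FP16 peak stated for the RTX 4090 in Section 5


def pct_of_peak(tflops, peak=PEAK_TFLOPS):
    """Achieved throughput as a percentage of peak."""
    return 100.0 * tflops / peak


def speedup(baseline_ms, candidate_ms):
    """Wall-clock speedup of the candidate relative to the baseline."""
    return baseline_ms / candidate_ms


print(round(pct_of_peak(148.5), 1))       # 89.9  (Neurafewz)
print(round(pct_of_peak(124.2), 1))       # 75.2  (Triton)
print(round(speedup(0.4657, 0.3473), 2))  # 1.34  (Neurafewz vs. PyTorch)
print(round(speedup(0.4657, 0.4154), 2))  # 1.12  (Triton vs. PyTorch)
```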

5.1.2 Memory-Bound: Fused Conv1d + GELU + Positional Encoding + LayerNorm Stats

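| Implementation | Wall-clock (mean) | Throughput (TFLOPS) | % of Peak | Speedup |
|---|---|---|---|---|
| PyTorch | 1.987 ms | 5.28 | 3.20% | baseline |
| Triton | 1.126 ms | 9.39 | 5.68% | 1.76× |
| Neurafewz | 0.958 ms | 10.97 | 6.64% | 2.07× |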

Conv1d [M=10500, K=384, N=1280] + GELU + positional encoding (on-the-fly) + LayerNorm stats | FP16 (FP32 epilogue) | 3 warmup / 10-run mean. Low % of peak is expected: K=384 limits tensor core wavefront utilization; epilogue is bandwidth-bound.

5.1.3 Unfused Baseline: Attention (Compile-Time Selection Only)

| Implementation | Wall-clock (mean) | Throughput (TFLOPS) | % of Peak | Speedup |
|---|---|---|---|---|
| PyTorch (SDPA) | 0.6754 ms | 138.6 | 83.9% | baseline |
| Neurafewz | 0.6371 ms | 146.9 | 89.0% | 1.06× |

SDPA [batch=8, heads=20, seq=1500, head_dim=64] | FP16 | 3 warmup / 100-run mean. No fusion applied; the gain comes entirely from compile-time kernel selection. PyTorch SDPA dispatches to FlashAttention-2 [9] via runtime algorithm search; Neurafewz resolves the equivalent path statically.

5.1.4 Observations

The largest gains appear in the memory-bound workload (2.07× over PyTorch, 1.18× over Triton), where eliminating intermediate DRAM materializations is directly on the critical path. Triton’s [10] JIT compilation recovers much of PyTorch’s overhead, but the hand-written fused kernel still outperforms it; the remaining gap reflects layout-aware memory access patterns that Triton’s autotuner does not explore. The compute-bound fused workload reaches 89.9% of peak (1.34× over PyTorch), with the remaining gap attributable to non-GEMM operations (RoPE, KV writes) outside tensor core throughput. The unfused attention result (1.06×) isolates compile-time selection: both paths execute the same FlashAttention-2 algorithm, but Neurafewz eliminates the runtime dispatch search. Fusion delivers the largest wins where bandwidth is the bottleneck; compile-time selection provides gains even where fusion is not applied.


5.2 Per-Model Inference Performance

These results measure whether the multi-model runtime degrades single-model performance. Baselines run as exclusive GPU tenants. Neurafewz results were measured with Llama 3.1 8B and Whisper Large-v3 co-resident on a 24 GB GPU. To accommodate both models, the Builder offloads 8 of Llama’s 32 transformer layers to pinned host memory; these are preemptively re-onloaded during Whisper’s de-tokenization phase before Llama inference begins.

5.2.1 Llama 3.1 8B

| System | Prefill (tok/s) | Generation (tok/s) |
|---|---|---|
| vLLM | 10,900 | 57.44 |
| llama.cpp | 8,900 | 55.70 |
| Neurafewz (multi-model) | 9,700 | 57.80 |

Llama 3.1 8B | BF16 (~ 16 GB full, ~ 12 GB resident with 8/32 layers offloaded) | 1024 input, 512 generated | continuous batching | 3 warmup / 10-run mean. vLLM and llama.cpp run as sole GPU tenants with all layers resident; Neurafewz co-resides with Whisper Large-v3 (~ 3 GB) and re-stages offloaded layers before generation begins.

Prefill is 11% lower than vLLM. Generation throughput is at parity. The prefill gap is the cost of the multi-model runtime path and is paid once per request.

5.2.2 Whisper Large-v3

| System | Processing Time | Speed | Batch Size | Notes |
|---|---|---|---|---|
| openai-whisper | 60.30 s | 13× real-time | 1 | OOM at batch > 1 |
| faster-whisper | 12.70 s | 62× real-time | 8 | |
| Neurafewz (multi-model) | 9.71 s | 81× real-time | 8 | |

Whisper Large-v3 | FP16 | Audio: 13.13 minutes | beam_size=5 | Speed = audio_duration / processing_time. faster-whisper: FP16, CUDA 12. openai-whisper OOMs at batch > 1 on 24 GB. Neurafewz co-resides with Llama 3.1 8B (~ 16 GB full, ~ 12 GB resident with 8/32 layers offloaded).

Neurafewz is 1.31× faster than faster-whisper (CTranslate2 backend) [11]. The gains come from the fused encoder kernels (Section 5.1) applied end-to-end.
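The Speed column and the 1.31× figure can be reproduced from the stated definition, Speed = audio_duration / processing_time, using the 13.13-minute clip and the processing times reported above. The function name is illustrative.

```python
def realtime_factor(audio_minutes, processing_s):
    """Seconds of audio processed per second of wall-clock time."""
    return audio_minutes * 60.0 / processing_s


AUDIO_MINUTES = 13.13

print(round(realtime_factor(AUDIO_MINUTES, 60.30)))  # 13  (openai-whisper)
print(round(realtime_factor(AUDIO_MINUTES, 12.70)))  # 62  (faster-whisper)
print(round(realtime_factor(AUDIO_MINUTES, 9.71)))   # 81  (Neurafewz)
print(round(12.70 / 9.71, 2))                        # 1.31 (Neurafewz vs. faster-whisper)
```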


5.3 End-to-End Pipeline Performance

Pipeline: 13 seconds of audio → Whisper Large-v3 (transcription) → Llama 3.1 8B (196 prompt tokens, 116 generated tokens). Half precision throughout.

| Deployment | Hardware | End-to-end Latency |
|---|---|---|
| Conventional (faster-whisper + vLLM) | RTX 5080 + RTX 4090 | 2.1 s |
| Neurafewz | RTX 4090 (single GPU) | 1.7 s |

Conventional setup: faster-whisper on RTX 5080, vLLM on RTX 4090, networked. Neurafewz: both models co-resident on one RTX 4090. A single-GPU sequential baseline (faster-whisper then vLLM on one RTX 4090) was measured at ~ 11s; it is excluded because sequential single-tenant execution does not represent how multi-model pipelines are deployed in practice — the comparison against a multi-GPU deployment reflects the actual infrastructure alternative.

Neurafewz achieves 19% lower latency on one GPU than the conventional deployment does on two, even though the conventional deployment includes an RTX 5080 (newer-generation hardware) for Whisper.

Runtime Behavior During This Pipeline

At steady state, Llama 3.1 8B is the primary resident model with 8 of its 32 transformer layers offloaded to pinned host memory — freeing VRAM for Whisper’s weights and activation scratch. When a request arrives, Whisper loads and begins inference. During beam finalization and de-tokenization — where GPU compute demand drops — the runtime initiates preemptive onload of Llama’s offloaded layers over a low-priority DMA stream. By the time the transcript reaches Llama, most or all offloaded layers are re-staged.

If a second request arrives during Llama’s generation, preemption is lazy: the scheduler offloads only as many Llama tail layers as needed to fit Whisper, not a full model swap. For short sequences where KV cache is small, no offload may be required at all.
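A sketch of the lazy, layer-granular sizing decision: offload only as many tail layers as the memory deficit requires. The per-layer size of about 0.5 GB is inferred from the figures in Section 5.2.1 (about 16 GB full, about 12 GB with 8 of 32 layers offloaded); the function itself and the free-memory values are hypothetical.

```python
import math


def tail_layers_to_offload(free_gb, needed_gb, layer_gb, resident_layers):
    """Smallest number of tail layers to offload so that needed_gb fits in VRAM."""
    deficit = needed_gb - free_gb
    if deficit <= 0:
        return 0  # enough room already, no offload required
    n = math.ceil(deficit / layer_gb)
    if n > resident_layers:
        raise ValueError("offloading every resident layer still does not free enough memory")
    return n


# Whisper needs ~3 GB; assume 1 GB is free and each Llama layer holds ~0.5 GB.
print(tail_layers_to_offload(free_gb=1.0, needed_gb=3.0, layer_gb=0.5, resident_layers=32))  # 4
# With 4 GB already free, no layers need to move.
print(tail_layers_to_offload(free_gb=4.0, needed_gb=3.0, layer_gb=0.5, resident_layers=32))  # 0
```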

This two-model deployment exercises compile-time co-residency planning with partial layer offload, preemptive onload overlapped with active compute, and lazy layer-granular preemption — the same primitives that scale to larger fleets.


Sections 5.1–5.3 validate kernel fusion, per-model parity under co-residency with partial offload, and two-model pipeline scheduling including preemptive layer staging. Concurrent request batching, tool-call window scheduling, and fleet sizes beyond two models are not yet benchmarked.

5.4 What Is Not Yet Measured

The following are not reported because the measurement methodology or workload scale required for reliable results is still being developed:

  • P50 / P95 / P99 latency under sustained concurrent load across multiple simultaneous pipelines.
  • Fleet-scale GPU utilization across concurrent pipelines.
  • SLO compliance rates under varying request arrival distributions.
  • Isolated preemption cost: suspension/resume latency, layer-staging overhead, and scheduling impact across complex multi-model workflows with inter-graph dependencies.
  • Tool-call window exploitation in pipelines with external API calls, database lookups, or retrieval operations — where the Builder schedules model staging behind non-GPU latency.

These will be added as measurement infrastructure matures.


6. Limitations and Scope

The guarantees Neurafewz provides are contingent on workload characteristics that match its design assumptions. This section describes those boundaries precisely, so evaluators can determine fit against their deployment context.

6.1 Model Fleet Homogeneity

Performance gains are proportional to the heterogeneity and temporal disjointness of the model fleet. When the fleet is homogeneous — models of similar size and compute profile — scheduling opportunities shrink accordingly.

The limiting case is two large models of comparable parameter count (e.g., two 70B models requiring ~ 140 GB each in BF16; even at 8-bit quantization each needs ~ 70 GB, so the pair still exceeds a single A100 80GB). Co-residency and memory-sharing mechanisms have near-zero effect: the models cannot share occupancy, invocation overlap produces contention rather than opportunity, and offload/onload cost dominates. Operators in this scenario are better served by dedicated per-model GPUs.
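The footprint arithmetic behind this limiting case is parameters times bytes per parameter. The sketch counts weights only; KV cache and activations would add to these numbers.

```python
def weight_gb(params_billion, bits_per_param):
    """Weight memory in GB (1 GB = 1e9 bytes), excluding KV cache and activations."""
    return params_billion * bits_per_param / 8


A100_GB = 80

print(weight_gb(70, 16))               # 140.0  one 70B model in BF16
print(weight_gb(70, 8))                # 70.0   one 70B model at 8-bit
print(2 * weight_gb(70, 8) > A100_GB)  # True   the pair does not fit on one A100 80GB
```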

6.2 Highly Linear Single-Model-Dominated Pipelines

A pipeline that invokes the same model repeatedly across sequential stages, with minimal branching and no parallel execution opportunities, provides limited scheduling surface. The Builder still resolves kernel configurations and memory layouts, but the runtime has no idle windows to exploit and no co-residency opportunities. A concrete example: an LLM used only to summarize long-form incoming text. In this configuration, scheduling overhead may marginally exceed gains. Operators should evaluate whether a single-model baseline is more appropriate.

6.3 Pipeline Composition Is Fixed at Build Time

The set of models, invocation ordering, branching logic, and resource demands are resolved offline. Any modification — adding a model, removing a stage, changing the graph — requires a full Builder rebuild and redeployment. There is no runtime model registration. This is a deliberate design choice (runtime flexibility and compile-time determinism are in tension and we choose to be deterministic), but a real constraint for teams that iterate on pipeline composition frequently.

6.4 Hardware Target Is Fixed at Build Time

Each binary targets a specific SM architecture. A build for sm_89 does not execute on sm_80 and does not run at peak efficiency on sm_90. Multi-hardware deployments require separate Builder runs and binaries, with corresponding operational overhead. Heterogeneous GPU fleets must maintain per-target builds explicitly.
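For a heterogeneous fleet, the operational consequence is an explicit map from SM target to build artifact, with a hard failure when a target has no build. The artifact names and the `binary_for` helper below are hypothetical and only illustrate the bookkeeping.

```python
# Hypothetical per-target artifacts produced by separate Builder runs.
BUILDS = {
    "sm_80": "pipeline.sm80.bin",
    "sm_89": "pipeline.sm89.bin",
}


def binary_for(device_sm):
    """Return the artifact built for this SM target; never fall back to another target."""
    try:
        return BUILDS[device_sm]
    except KeyError:
        raise RuntimeError(
            f"no build for {device_sm}; run the Builder for this target"
        ) from None


print(binary_for("sm_89"))  # pipeline.sm89.bin
```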

6.5 Known-Good Selection Table Coverage

A model architecture without a profiled entry in the selection table — a novel MoE design, an unusual attention variant, a recently released model family — will not receive an optimally selected kernel configuration. The system falls back to a reasonable default, but the compile-time selection advantage is not realized until the model is formally onboarded.
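The fallback behavior can be pictured as a keyed lookup that reports whether a profiled entry was found. The table layout, the keys, and the configuration fields are assumptions for illustration; the actual selection table format is not described here.

```python
# Hypothetical default used when no profiled entry exists.
DEFAULT_CONFIG = {"tile": (64, 64), "stages": 2}

# Hypothetical selection table keyed on (architecture family, SM target).
SELECTION_TABLE = {
    ("llama", "sm_89"): {"tile": (128, 128), "stages": 4},
}


def select_kernel_config(arch_family, sm):
    """Return (config, profiled). profiled is False when the default was used."""
    config = SELECTION_TABLE.get((arch_family, sm))
    if config is None:
        return DEFAULT_CONFIG, False
    return config, True


print(select_kernel_config("llama", "sm_89"))      # ({'tile': (128, 128), 'stages': 4}, True)
print(select_kernel_config("novel_moe", "sm_89"))  # ({'tile': (64, 64), 'stages': 2}, False)
```

Surfacing the `profiled` flag at build time would let operators see which models still need onboarding.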

6.6 Sub-Millisecond Inference Workloads

Scheduling overhead is negligible relative to large-model inference (a millisecond or less against hundreds of milliseconds). For workloads where per-request inference is in the single-digit millisecond range — very small models, high-throughput microservice patterns — the overhead becomes a measurable fraction of request time. The system is designed for workloads where inference time dominates scheduling time.
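To make the threshold concrete, the sketch below computes scheduling overhead as a fraction of total request time, using the 1 ms upper bound stated above. The two inference times are illustrative picks from the ranges mentioned.

```python
def overhead_fraction(overhead_ms, inference_ms):
    """Scheduling overhead as a fraction of total request time."""
    return overhead_ms / (overhead_ms + inference_ms)


print(f"{overhead_fraction(1.0, 200.0):.1%}")  # 0.5%   large-model inference
print(f"{overhead_fraction(1.0, 5.0):.1%}")    # 16.7%  single-digit-millisecond inference
```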

6.7 Inference Only

Neurafewz is inference-only infrastructure. There is no support for in-place weight updates, online fine-tuning, or adapter swapping without a rebuild. The exception is a multi-adapter setup: adapters known at build time can be swapped dynamically. Workflows requiring continuous learning or frequent fine-tuning cycles need a separate training infrastructure layer, which is beyond the scope of Neurafewz.


7. Roadmap

The current system targets NVIDIA SM80–SM90 GPUs on single-GPU deployments. Three areas represent the natural expansion surface.

7.1 Hardware Coverage: AMD, TPUs, and Emerging Accelerators

The Builder is hardware-parameterized by design — kernel selection, memory layout, and occupancy modeling are keyed on hardware characteristics rather than hardcoded to NVIDIA’s ISA. Extending coverage requires onboarding: characterizing the target’s compute and memory hierarchy, profiling the kernel configuration space, and populating the selection table.

The near-term target is AMD CDNA — specifically the MI300X (192 GB HBM3). The HIP programming model presents a familiar porting surface; kernel designs are expressible in HIP with architecture-appropriate instruction substitutions.

TPUs and custom ASICs present a different challenge. Their programming models are more constrained, and GPU-oriented fusion strategies may not translate directly. The Builder’s compile-time philosophy is arguably more valuable on fixed-function accelerators where runtime flexibility is limited.

7.2 Multi-GPU Orchestration Within a Single Node

The logical extension is the multi-GPU node — 2, 4 or 8 GPUs — applying OS scheduling principles across the node’s GPU fleet.

This is not tensor or pipeline parallelism in the conventional sense. Existing strategies [12] (FSDP, TP, PP) split a single model across GPUs, treating the collective as one logical accelerator. The Neurafewz extension applies the orthogonal principle: the scheduler manages multiple models across multiple GPUs, making residency, co-location, and dispatch decisions at the node level with a unified plan. GPU topology becomes a first-class planning input, changing the economics of cross-GPU staging and weight sharing.

Scope is deliberately bounded to single-box. Cross-node scheduling introduces network latency, fault tolerance, and distributed state management — a different class of problem beyond this roadmap horizon.

7.3 Giga-Kernels and Aggressive Persistent Fusion

The current strategy fuses operations within a single kernel launch. The natural extension is persistent kernel execution: a kernel occupying a fixed set of SMs for the duration of an inference pass, pulling work from a queue without kernel launch overhead between operations.

Taken to its limit, this converges toward a giga-kernel — a single persistent kernel spanning multiple transformer layers, managing its own work scheduling, tile dispatch, and memory movement across the full forward pass. NVIDIA’s wgmma and TMA on SM90 support this pattern: wgmma enables pipelined asynchronous warp-group matrix multiply, and TMA decouples memory movement from compute scheduling.

The potential gains are meaningful. Kernel launch overhead is 5–20 μs per launch; at decode batch sizes where each layer’s compute takes 50–100 μs, this is a non-trivial fraction of total time. More significantly, persistent kernels can maintain partial results in registers across layer boundaries rather than writing to DRAM and re-reading — reducing memory traffic beyond what per-operation fusion achieves.
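The launch-overhead share can be bounded from the ranges quoted above. The sketch assumes one kernel launch per layer-sized unit of compute, which is a simplification; real layers typically issue several launches, which would raise the share.

```python
def launch_fraction(launch_us, compute_us):
    """Share of wall-clock time spent on launch overhead, assuming one launch
    per unit of compute (a simplifying assumption)."""
    return launch_us / (launch_us + compute_us)


best_case = launch_fraction(5.0, 100.0)   # cheapest launch, longest compute
worst_case = launch_fraction(20.0, 50.0)  # costliest launch, shortest compute
print(f"{best_case:.1%} to {worst_case:.1%}")  # 4.8% to 28.6%
```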

This is an area of active exploration. The programming model complexity is substantially higher, and integration with the Builder’s tile selection and layout planning still needs careful design work.


  1. Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention,” SOSP 2023. https://arxiv.org/abs/2309.06180
  2. Yu et al., “Orca: A Distributed Serving System for Transformer-Based Generative Models,” OSDI 2022. https://www.usenix.org/conference/osdi22/presentation/yu
  3. NVIDIA TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM
  4. NVIDIA Dynamo: Distributed Inference Serving Framework. https://developer.nvidia.com/dynamo
  5. Moritz et al., “Ray: A Distributed Framework for Emerging AI Applications,” OSDI 2018. https://arxiv.org/abs/1712.05889
  6. Frantar et al., “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers,” ICLR 2023. https://arxiv.org/abs/2210.17323
  7. Lin et al., “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration,” MLSys 2024. https://arxiv.org/abs/2306.00978
  8. Dao et al., “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness,” NeurIPS 2022. https://arxiv.org/abs/2205.14135
  9. Dao, “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,” ICLR 2024. https://arxiv.org/abs/2307.08691
  10. Tillet et al., “Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations,” MAPL 2019. https://github.com/triton-lang/triton
  11. CTranslate2: Fast inference engine for Transformer models. https://github.com/OpenNMT/CTranslate2
  12. Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism,” 2019. https://arxiv.org/abs/1909.08053