FakeGPU

A CUDA API interception library that simulates GPU devices in non-GPU environments, enabling basic operations for PyTorch and other deep learning frameworks, and supporting single-host multi-process distributed simulation for NCCL-style workloads.

Documentation

This repository ships a documentation site configuration built on MkDocs with the Material for MkDocs theme.

Local preview:

python3 -m pip install -e ".[docs]"
mkdocs serve

GitHub Pages deployment:

The site configuration lives in mkdocs.yml
The published content lives under docs/
.github/workflows/docs.yml builds and deploys the site through GitHub Pages
The workflow auto-deploys on pushes to main
workflow_dispatch remains available, so dev or any other branch can still be published manually from the Actions page

Timeline

Implemented Features

Planned Features

Deeper Protocol Fidelity - Better overlap/ordering realism beyond semantic NCCL simulation
Broader Multi-Host Validation - More real multi-machine coverage beyond current single-host and loopback transport validation
Enhanced Testing - Extend the test suite to cover more languages and runtime environments

Operation Modes

FakeGPU supports three compute modes, controlled by the FAKEGPU_MODE environment variable:

Simulate Mode (Default)

FAKEGPU_MODE=simulate ./fgpu python your_script.py

All CUDA APIs return fake data
No real GPU required
Device memory backed by system RAM
Kernel launches are no-ops

Passthrough Mode

FAKEGPU_MODE=passthrough ./fgpu python your_script.py

Forwards all CUDA calls to real GPU libraries
Results identical to running without FakeGPU
Useful for parity testing and debugging
Requires real GPU and CUDA installation

Hybrid Mode

FAKEGPU_MODE=hybrid FAKEGPU_OOM_POLICY=clamp ./fgpu python your_script.py

Device info is virtualized (can report different GPU specs)
Compute operations use real GPU
OOM safety policies prevent crashes when virtual memory exceeds real GPU capacity

OOM Policies (for Hybrid mode):

clamp (default): Report memory clamped to real GPU capacity
managed: Use cudaMallocManaged for oversubscription (relies on UVM)
mapped_host: Use cudaHostAllocMapped for overflow allocations
spill_cpu: Spill excess allocations to CPU memory

Environment Variables:

FAKEGPU_MODE={simulate,passthrough,hybrid}  # Operation mode
FAKEGPU_OOM_POLICY={clamp,managed,mapped_host,spill_cpu}  # OOM policy for hybrid mode
FAKEGPU_REAL_CUDA_LIB_DIR=/path/to/cuda/lib  # Custom CUDA library path
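
The variables above can also be assembled programmatically before launching a child process. The following is a minimal sketch, not part of FakeGPU: the helper name fakegpu_env is hypothetical, and rejecting an OOM policy outside hybrid mode is an assumption made here for illustration (the README only says the policy applies to hybrid mode).

```python
# Hypothetical helper: builds the FakeGPU compute-mode environment variables.
# The allowed values are copied from this README; the validation rules are
# an illustrative assumption, not FakeGPU's own behavior.
MODES = ("simulate", "passthrough", "hybrid")
OOM_POLICIES = ("clamp", "managed", "mapped_host", "spill_cpu")


def fakegpu_env(mode="simulate", oom_policy=None, real_cuda_lib_dir=None):
    """Return a dict of FakeGPU environment variables after basic checks."""
    if mode not in MODES:
        raise ValueError(f"unknown FAKEGPU_MODE: {mode!r}")
    env = {"FAKEGPU_MODE": mode}
    if oom_policy is not None:
        # Assumption: treat an OOM policy outside hybrid mode as a mistake.
        if mode != "hybrid":
            raise ValueError("FAKEGPU_OOM_POLICY only applies to hybrid mode")
        if oom_policy not in OOM_POLICIES:
            raise ValueError(f"unknown FAKEGPU_OOM_POLICY: {oom_policy!r}")
        env["FAKEGPU_OOM_POLICY"] = oom_policy
    if real_cuda_lib_dir is not None:
        env["FAKEGPU_REAL_CUDA_LIB_DIR"] = real_cuda_lib_dir
    return env


print(fakegpu_env("hybrid", oom_policy="clamp"))
# Possible use with the shortcut runner:
#   subprocess.run(["./fgpu", "python", "your_script.py"],
#                  env={**os.environ, **fakegpu_env("hybrid", "clamp")})
```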

Distributed Communication Modes

Distributed communication is controlled separately by FAKEGPU_DIST_MODE so compute mode and communication mode can be combined.

`FAKEGPU_DIST_MODE`	Meaning
`disabled`	No FakeGPU distributed layer
`simulate`	FakeGPU coordinator executes collectives / p2p using simulated topology
`proxy`	Real NCCL executes collectives while FakeGPU records control-plane and cluster-report data
`passthrough`	Thin forwarding to real NCCL with minimal FakeGPU wrapping

For first-time setup, the recommended mode pair is:

FAKEGPU_MODE=simulate
FAKEGPU_DIST_MODE=simulate

Useful distributed environment variables:

FAKEGPU_DIST_MODE={disabled,simulate,proxy,passthrough}
FAKEGPU_CLUSTER_CONFIG=/abs/path/to/cluster.yaml
FAKEGPU_COORDINATOR_TRANSPORT={unix,tcp}
FAKEGPU_COORDINATOR_ADDR=/tmp/fakegpu.sock        # or 127.0.0.1:29591 for tcp
FAKEGPU_CLUSTER_REPORT_PATH=/path/to/cluster-report.json
FAKEGPU_STAGING_CHUNK_BYTES=1048576
FAKEGPU_STAGING_FORCE_SOCKET=1
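
As a rough illustration of how the distributed variables fit together, the sketch below builds them from Python. The helper fakegpu_dist_env is hypothetical, and the address check (host:port for tcp, a socket path for unix) is an assumption based only on the example values in the comment above.

```python
# Hypothetical helper: builds the FakeGPU distributed environment variables.
# Mode and transport names come from this README; the checks are assumptions.
DIST_MODES = ("disabled", "simulate", "proxy", "passthrough")
TRANSPORTS = ("unix", "tcp")


def fakegpu_dist_env(dist_mode, cluster_config=None, transport="unix",
                     addr=None, report_path=None):
    """Return a dict of distributed-mode variables after basic checks."""
    if dist_mode not in DIST_MODES:
        raise ValueError(f"unknown FAKEGPU_DIST_MODE: {dist_mode!r}")
    env = {"FAKEGPU_DIST_MODE": dist_mode}
    if dist_mode == "disabled":
        return env  # no coordinator settings are needed
    if transport not in TRANSPORTS:
        raise ValueError(f"unknown transport: {transport!r}")
    env["FAKEGPU_COORDINATOR_TRANSPORT"] = transport
    if addr is not None:
        if transport == "tcp":
            # Assumption: tcp addresses look like host:port.
            host, sep, port = addr.rpartition(":")
            if not sep or not host or not port.isdigit():
                raise ValueError(f"tcp address must be host:port, got {addr!r}")
        env["FAKEGPU_COORDINATOR_ADDR"] = addr
    if cluster_config is not None:
        env["FAKEGPU_CLUSTER_CONFIG"] = cluster_config
    if report_path is not None:
        env["FAKEGPU_CLUSTER_REPORT_PATH"] = report_path
    return env


print(fakegpu_dist_env("simulate", addr="/tmp/fakegpu.sock"))
print(fakegpu_dist_env("simulate", transport="tcp", addr="127.0.0.1:29591"))
```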

Report Output (Hybrid mode):

{
  "report_version": 3,
  "mode": "hybrid",
  "oom_policy": "clamp",
  "hybrid_stats": {
    "real_alloc": {"count": 10, "bytes": 1073741824},
    "managed_alloc": {"count": 0, "bytes": 0},
    "spilled_alloc": {"count": 2, "bytes": 134217728}
  },
  "backing_gpus": [
    {"index": 0, "total_memory": 25769803776, "used_memory": 1073741824}
  ],
  ...
}
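
A report with this shape can be summarized in a few lines of Python. This is a sketch under assumptions: summarize_hybrid_report is a hypothetical helper, it reads only the fields shown in the excerpt above, and it assumes the allocation categories under hybrid_stats are disjoint so their byte counts can be added.

```python
import json


def summarize_hybrid_report(report):
    """Summarize allocation bytes from a hybrid-mode report dict."""
    stats = report.get("hybrid_stats", {})
    # Assumption: the categories do not overlap, so their bytes can be summed.
    total = sum(entry.get("bytes", 0) for entry in stats.values())
    spilled = stats.get("spilled_alloc", {}).get("bytes", 0)
    return {
        "mode": report.get("mode"),
        "oom_policy": report.get("oom_policy"),
        "total_alloc_bytes": total,
        "spilled_fraction": spilled / total if total else 0.0,
    }


# Sample data copied from the excerpt above (elided fields left out).
sample = json.loads("""
{
  "report_version": 3,
  "mode": "hybrid",
  "oom_policy": "clamp",
  "hybrid_stats": {
    "real_alloc": {"count": 10, "bytes": 1073741824},
    "managed_alloc": {"count": 0, "bytes": 0},
    "spilled_alloc": {"count": 2, "bytes": 134217728}
  }
}
""")

summary = summarize_hybrid_report(sample)
print(summary["total_alloc_bytes"])  # 1207959552
```

With the sample values, the spilled share is 1/9 of all allocated bytes (128 MiB out of 1152 MiB).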

Quick Start

Build

cmake -S . -B build
cmake --build build

CPU-backed compute for supported cuBLAS/cuBLASLt operators is enabled by default (runs on CPU; no real GPU required).

Optional (disable CPU simulation and fall back to stub/no-op behavior):

cmake -S . -B build -DENABLE_FAKEGPU_CPU_SIMULATION=OFF
cmake --build build

Generated libraries:

Linux:
- build/libcuda.so.1 - CUDA Driver API
- build/libcudart.so.12 - CUDA Runtime API
- build/libcublas.so.12 - cuBLAS/cuBLASLt API
- build/libnvidia-ml.so.1 - NVML API
- build/libnccl.so.2 - Fake NCCL shim for distributed simulation / proxy / passthrough
- build/fakegpu-coordinator - Coordinator daemon for distributed communication
macOS:
- build/libcuda.dylib - CUDA Driver API
- build/libcudart.dylib - CUDA Runtime API
- build/libcublas.dylib - cuBLAS/cuBLASLt API
- build/libnvidia-ml.dylib - NVML API

Test

Standardized test runner (recommended):

./ftest smoke          # C + Python (no torch needed)
./ftest cpu_sim        # CPU simulation correctness (validates cuBLAS ops; runs a PyTorch matmul check if torch is installed)
./ftest python         # PyTorch tests (requires torch)
./ftest llm            # LLM inference smoke test (requires torch + transformers + local model files)
./ftest all            # smoke + python

Comparison test (recommended):

./test/run_comparison.sh

Runs identical tests on both real GPU and FakeGPU to verify correctness.

PyTorch test:

./fgpu python3 test/test_comparison.py --mode fake

Distributed smoke / validation:

./test/run_multinode_sim.sh 2      # 2-rank smoke
./test/run_multinode_sim.sh 4      # 4-rank smoke
./test/run_ddp_multinode.sh 4      # 4-rank DDP main path
./test/run_hybrid_multinode.sh 2   # hybrid compute + simulated communication

These scripts write logs and reports under test/output/, including cluster-level communication reports.

Usage

import torch

# All PyTorch CUDA operations are intercepted by FakeGPU
device = torch.device('cuda:0')
x = torch.randn(100, 100, device=device)
y = torch.randn(100, 100, device=device)
z = x @ y  # Matrix multiplication

# Simple neural network
model = torch.nn.Linear(100, 50).to(device)
output = model(x)

Runtime requires preloading all libraries:

Linux:

LD_LIBRARY_PATH=./build:$LD_LIBRARY_PATH \
LD_PRELOAD=./build/libcublas.so.12:./build/libcudart.so.12:./build/libcuda.so.1:./build/libnvidia-ml.so.1 \
python your_script.py

macOS:

DYLD_LIBRARY_PATH=./build:$DYLD_LIBRARY_PATH \
DYLD_INSERT_LIBRARIES=./build/libcublas.dylib:./build/libcudart.dylib:./build/libcuda.dylib:./build/libnvidia-ml.dylib \
python3 your_script.py
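
The same preload setup can be built from Python when launching a child process. The sketch below is illustrative only: preload_env is a hypothetical helper, and the library names and ordering are copied from the two commands above.

```python
import os

# Library names and order copied from the shell commands in this README.
LINUX_LIBS = ("libcublas.so.12", "libcudart.so.12",
              "libcuda.so.1", "libnvidia-ml.so.1")
MACOS_LIBS = ("libcublas.dylib", "libcudart.dylib",
              "libcuda.dylib", "libnvidia-ml.dylib")


def preload_env(build_dir="./build", platform="linux", base_env=None):
    """Return a copy of the environment with FakeGPU preload variables set."""
    env = dict(os.environ if base_env is None else base_env)
    if platform == "linux":
        libs, path_var, preload_var = LINUX_LIBS, "LD_LIBRARY_PATH", "LD_PRELOAD"
    elif platform == "darwin":
        libs = MACOS_LIBS
        path_var, preload_var = "DYLD_LIBRARY_PATH", "DYLD_INSERT_LIBRARIES"
    else:
        raise ValueError(f"unsupported platform: {platform!r}")
    env[preload_var] = ":".join(f"{build_dir}/{lib}" for lib in libs)
    # Prepend the build directory without leaving a trailing colon.
    existing = env.get(path_var)
    env[path_var] = f"{build_dir}:{existing}" if existing else build_dir
    return env


print(preload_env(base_env={})["LD_PRELOAD"])
# Possible use: subprocess.run(["python", "your_script.py"], env=preload_env())
```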

Python wrapper (no need to start Python with LD_PRELOAD):

import fakegpu

# Call early (before importing torch / CUDA-using libraries)
fakegpu.init()  # default: 8x A100
# Optional: fakegpu.init(profile="t4", device_count=2)
# Optional: fakegpu.init(devices="a100:4,h100:4")

import torch

Shortcut runner:

./fgpu python your_script.py
# Optional: ./fgpu --profile t4 --device-count 2 python your_script.py
# Optional: ./fgpu --devices 't4,h100' python your_script.py
# Optional: FAKEGPU_BUILD_DIR=/path/to/build ./fgpu python your_script.py

Python runner (the Python package installs a fakegpu console script):

fakegpu python your_script.py
# Optional: fakegpu --profile t4 --device-count 2 python your_script.py
# Optional: fakegpu --devices 'a100:4,h100:4' python your_script.py
# or: python -m fakegpu python your_script.py

Distributed runner example (single host, simulated multi-node):

SOCKET_PATH=/tmp/fakegpu-coordinator.sock
CLUSTER_CONFIG=$PWD/verification/data/cluster_valid.yaml

FAKEGPU_DIST_MODE=simulate \
FAKEGPU_CLUSTER_CONFIG="$CLUSTER_CONFIG" \
FAKEGPU_COORDINATOR_TRANSPORT=unix \
FAKEGPU_COORDINATOR_ADDR="$SOCKET_PATH" \
FAKEGPU_CLUSTER_REPORT_PATH=/tmp/fakegpu-cluster-report.json \
./build/fakegpu-coordinator --transport unix --address "$SOCKET_PATH"

In another terminal:

SOCKET_PATH=/tmp/fakegpu-coordinator.sock
CLUSTER_CONFIG=$PWD/verification/data/cluster_valid.yaml

export LD_PRELOAD="$PWD/build/libnccl.so.2${LD_PRELOAD:+:$LD_PRELOAD}"

./fgpu \
  --mode simulate \
  --dist-mode simulate \
  --cluster-config "$CLUSTER_CONFIG" \
  --coordinator-transport unix \
  --coordinator-addr "$SOCKET_PATH" \
  --device-count 4 \
  torchrun \
  --nnodes=1 \
  --nproc_per_node=4 \
  --master_addr 127.0.0.1 \
  --master_port 29500 \
  test/test_ddp_multinode.py \
  --report-dir /tmp/fakegpu-rank-reports \
  --epochs 1

For a more complete walkthrough, see docs/distributed-sim-usage.md.

GPU tools (nvidia-smi)

# FakeGPU-simulated devices via NVML stubs
./fgpu nvidia-smi
# Temperatures may show N/A because the TemperatureV struct is not fully emulated yet.

Reporting

FakeGPU writes fake_gpu_report.json at program exit (also triggered by nvmlShutdown()), including:

Per-device used_memory_peak (peak VRAM requirement)
Per-device IO bytes/calls: H2D / D2H / D2D / peer copies + memset
Per-device compute FLOPs/calls for GEMM/Matmul (cuBLAS / cuBLASLt)

When distributed mode is enabled and FAKEGPU_CLUSTER_REPORT_PATH is set, FakeGPU also writes a cluster-level JSON report with:

Cluster/world-size metadata
Collective counts, bytes, and estimated time
Intra-node and inter-node link statistics
Experimental topology/timing fields used by the distributed validation scripts

Notes:

FLOPs are theoretical estimates (GEMM ≈ 2*m*n*k, complex GEMM uses a larger factor); kernel launches are no-ops and not counted.
host_io.memcpy_* tracks Host↔Host copies (e.g. cudaMemcpyHostToHost).
Optional: set FAKEGPU_REPORT_PATH=/path/to/report.json to change the output location.
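
To make the FLOP estimate concrete, the formula for a real-valued GEMM can be evaluated by hand. The function gemm_flops below is a hypothetical illustration of the 2*m*n*k rule quoted above, not FakeGPU's internal accounting; the larger factor for complex GEMM is not specified here and is left out.

```python
def gemm_flops(m, n, k):
    """Theoretical FLOPs for a real (m x k) @ (k x n) product: 2*m*n*k."""
    return 2 * m * n * k


# The 100x100 @ 100x100 matmul from the Usage example.
print(gemm_flops(100, 100, 100))  # 2000000

# The matmul inside nn.Linear(100, 50) applied to a 100x100 input.
print(gemm_flops(100, 50, 100))   # 1000000
```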

Test Results

Test	Status	Description
Tensor creation	✓	Basic memory allocation
Element-wise ops	✓	Add, multiply, trigonometric
Matrix multiplication	✓	cuBLAS/cuBLASLt GEMM
Linear layer	✓	PyTorch nn.Linear
Neural network	✓	Multi-layer forward pass
Memory transfer	✓	CPU ↔ GPU data copy

Architecture

FakeGPU
├── src/
│   ├── core/          # Global state and device management
│   ├── cuda/          # CUDA Driver/Runtime API stubs
│   ├── cublas/        # cuBLAS/cuBLASLt API stubs
│   ├── distributed/   # Coordinator, topology config, communicator, staging, collective execution
│   ├── nccl/          # Fake NCCL shim plus proxy/passthrough dispatch
│   ├── nvml/          # NVML API stubs
│   └── monitor/       # Resource monitoring and reporting
└── test/              # Test scripts

Core Design:

Uses LD_PRELOAD to intercept CUDA API calls
Device memory backed by system RAM (malloc/free)
By default, supported cuBLAS/cuBLASLt ops are executed on CPU (CPU simulation)
Build with -DENABLE_FAKEGPU_CPU_SIMULATION=OFF to disable CPU simulation
Kernel launches are no-ops (logging only)

GPU Profiles

Default build exposes eight Fake NVIDIA A100-SXM4-80GB devices to mirror common server nodes.
GPU parameters are edited in YAML under profiles/*.yaml; CMake embeds these files at build time so no runtime file lookup is needed. Add or tweak a file, rerun cmake -S . -B build, and the new profiles are compiled in.
Presets cover multiple compute capabilities (Maxwell→Blackwell) and feed the existing helpers (GpuProfile::GTX980/P100/V100/T4/A40/A100/H100/L40S/B100/B200), which now prefer the YAML data and fall back to code defaults if parsing fails.
Select presets at runtime via environment variables:
- FAKEGPU_PROFILE=<id> + FAKEGPU_DEVICE_COUNT=<n> (uniform devices)
- FAKEGPU_PROFILES=<spec> (per-device spec, e.g. a100:4,h100:4 or t4,l40s)
The Python wrapper accepts the same settings and must be called before importing CUDA-using libraries such as torch: fakegpu.init(profile="t4", device_count=2) or fakegpu.init(devices="a100:4,h100:4").
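
The per-device spec format can be read as a comma-separated list of profile ids, each with an optional count. The parser below is a sketch for illustration only: parse_device_spec is hypothetical, and it assumes that an entry without a count means one device and that devices are numbered in the order written. FakeGPU's own parser may differ.

```python
def parse_device_spec(spec):
    """Expand a spec such as 'a100:4,h100:4' into one profile id per device."""
    devices = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            raise ValueError(f"empty entry in device spec: {spec!r}")
        name, sep, count = item.partition(":")
        # Assumption: a bare profile id (no ':count') means a single device.
        n = int(count) if sep else 1
        if not name or n < 1:
            raise ValueError(f"invalid entry in device spec: {item!r}")
        devices.extend([name] * n)
    return devices


print(parse_device_spec("a100:4,h100:4"))
print(parse_device_spec("t4,l40s"))  # ['t4', 'l40s']
```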

Limitations

❌ No real GPU execution (CUDA kernels are no-ops; supported cuBLAS/cuBLASLt ops run on CPU)
❌ Complex models (Transformers) may require additional APIs
⚠️ Distributed support is a semantic simulator, not a protocol-level recreation of NCCL/RDMA/NVLink internals
⚠️ The most validated distributed path is still single-host multi-process simulation; TCP coordinator support exists, but real multi-machine coverage is more limited
⚠️ Some proxy/passthrough and advanced NCCL behaviors remain experimental
⚠️ macOS: Official PyTorch wheels do not include CUDA, so FakeGPU only helps when running CUDA-enabled binaries (typically in Linux via Docker/VM).
⚠️ For testing and development environments only

Use Cases

✅ Running GPU code tests in CI/CD environments
✅ Debugging deep learning code on machines without GPUs
✅ Validating CUDA API call logic
✅ Prototyping and unit testing

Dependencies

CMake 3.14+
C++17 compiler
Python 3.10+ (for package, testing, and docs)
PyTorch 2.x (optional, for testing)

License

MIT License

Documentation

mkdocs.yml - MkDocs site config for local preview and GitHub Pages
Test Guide - Detailed testing instructions
Distributed Usage Guide - How to run single-host simulated multi-node workloads
Multi-Node Design - Distributed design notes, implementation plan, and current boundaries
cuBLASLt Implementation - cuBLASLt support details
