6 changes: 3 additions & 3 deletions .github/workflows/ci.yml
@@ -2,17 +2,17 @@ name: 🧪 CI

on:
push:
branches: [ main, master ]
branches: [ main, master, dev ]
pull_request:
branches: [ main, master ]
branches: [ main, master, dev ]

jobs:
build:
runs-on: ubuntu-latest

strategy:
matrix:
python-version: [ "3.11", "3.12" ]
python-version: [ "3.11", "3.12", "3.13" ]

steps:
- name: 🧰 Checkout repository
2 changes: 2 additions & 0 deletions .gitignore
@@ -5,3 +5,5 @@ __pycache__/
*$py.class
/.pytest_cache

.ruff_cache

96 changes: 74 additions & 22 deletions README.md
@@ -1,17 +1,38 @@
# TinyGPU 🐉⚡

[![PyPI version](https://img.shields.io/badge/version-1.0.0-blue.svg)](https://pypi.org/project/tinygpu)
[![PyPI version](https://img.shields.io/badge/version-2.0.0-blue.svg)](https://pypi.org/project/tinygpu)
[![Python 3.13](https://img.shields.io/badge/Python-3.13-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![CI](https://github.com/deaneeth/tinygpu/actions/workflows/ci.yml/badge.svg)](https://github.com/deaneeth/tinygpu/actions)
[![Code Style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Tests](https://img.shields.io/github/actions/workflow/status/deaneeth/tinygpu/ci.yml?label=tests)](https://github.com/deaneeth/tinygpu/actions)

TinyGPU is a **tiny educational GPU simulator** - inspired by [Tiny8](https://github.com/sql-hkr/tiny8), designed to demonstrate how GPUs execute code in parallel. It models a small **SIMT (Single Instruction, Multiple Threads)** system with per-thread registers, global memory, synchronization barriers, branching, and a minimal GPU-like instruction set.

> 🎓 *Built for learning and visualization - see how threads, registers, and memory interact across cycles!*
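
The SIMT model described above can be sketched in a few lines of plain Python (illustrative only — this is not TinyGPU's actual API): every thread runs the same instruction stream, but each has its own registers and program counter, while memory is shared.

```python
# Minimal SIMT sketch: every thread runs the same program in lockstep,
# each with its own registers and program counter, sharing one memory.
# Illustrative only — not TinyGPU's real implementation.

def run_simt(program, n_threads, memory):
    regs = [{"R0": 0, "TID": t} for t in range(n_threads)]  # per-thread registers
    pcs = [0] * n_threads                                   # per-thread PCs

    while any(pc < len(program) for pc in pcs):
        for t in range(n_threads):          # one cycle: every active thread steps once
            if pcs[t] >= len(program):
                continue
            op, *args = program[pcs[t]]
            if op == "SET":                 # SET Rd, imm
                regs[t][args[0]] = args[1]
            elif op == "ADD":               # ADD Rd, Ra, Rb
                regs[t][args[0]] = regs[t][args[1]] + regs[t][args[2]]
            elif op == "ST":                # ST Rk, Rs (store via address register)
                memory[regs[t][args[0]]] = regs[t][args[1]]
            pcs[t] += 1
    return memory

# Four threads each compute tid + tid and store it at memory[tid]:
prog = [("ADD", "R0", "TID", "TID"), ("ST", "TID", "R0")]
run_simt(prog, 4, [0] * 8)  # → [0, 2, 4, 6, 0, 0, 0, 0]
```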

| Odd-Even Sort | Reduction |
|---------------|------------|
| ![Odd-Even Sort](outputs/run_odd_even_sort/run_odd_even_sort_20251025-205516.gif) | ![Reduction](outputs/run_reduce_sum/run_reduce_sum_20251025-210237.gif) |
| ![Odd-Even Sort](src/outputs/run_odd_even_sort/run_odd_even_sort_20251026-212558.gif) | ![Reduction](src/outputs/run_reduce_sum/run_reduce_sum_20251026-212712.gif) |

---

## 🚀 What's New in v2.0.0

- **Enhanced Instruction Set**:
- Added `SHLD` and `SHST` for robust shared memory operations.
- Improved `SYNC` semantics for better thread coordination.
- **Visualizer Improvements**:
- Export execution as GIFs with enhanced clarity.
- Added support for saving visuals directly from the simulator.
- **Refactored Core**:
- Simplified step semantics for better extensibility.
- Optimized performance for larger thread counts.
- **CI/CD Updates**:
- Integrated linting (`ruff`, `black`) and testing workflows.
- Automated builds and tests on GitHub Actions.
- **Documentation**:
- Expanded examples and added detailed usage instructions.

---

@@ -51,10 +72,11 @@ TinyGPU was built as a **learning-first GPU simulator** - simple enough for begi
> 🧭 TinyGPU aims to make GPU learning *intuitive, visual, and interactive* - from classroom demos to self-guided exploration.

---

## ✨ Highlights

- 🧩 **GPU-like instruction set:**
`SET`, `ADD`, `MUL`, `LD`, `ST`, `JMP`, `BNE`, `BEQ`, `SYNC`, `CSWAP`.
`SET`, `ADD`, `MUL`, `LD`, `ST`, `JMP`, `BNE`, `BEQ`, `SYNC`, `CSWAP`, `SHLD`, `SHST`.
- 🧠 **Per-thread registers & PCs** - each thread executes the same kernel independently.
- 🧱 **Shared global memory** for inter-thread operations.
- 🔄 **Synchronization barriers** (`SYNC`) for parallel coordination.
@@ -69,31 +91,39 @@ TinyGPU was built as a **learning-first GPU simulator** - simple enough for begi

## 🖼️ Example Visuals

> Located in `examples/` — you can generate these GIFs yourself.
> Located in `src/outputs/` — run the example scripts to generate these GIFs (they're saved under `src/outputs/<script_name>/`).

| Odd-Even Sort | Reduction |
|---------------|------------|
| ![Odd-Even Sort](outputs/run_odd_even_sort/run_odd_even_sort_20251025-205516.gif) | ![Reduction](outputs/run_reduce_sum/run_reduce_sum_20251025-210237.gif) |
| Example | Description | GIF Preview |
|---------|-------------|-------------|
| Vector Add | Parallel vector addition (A+B -> C) | ![Vector Add](src/outputs/run_vector_add/run_vector_add_20251026-212734.gif) |
| Block Shared Sum | Per-block shared memory sum example | ![Block Shared Sum](src/outputs/run_block_shared_sum/run_block_shared_sum_20251026-212542.gif) |
| Odd-Even Sort | GPU-style odd-even transposition sort | ![Odd-Even Sort](src/outputs/run_odd_even_sort/run_odd_even_sort_20251026-212558.gif) |
| Parallel Reduction | Sum reduction across an array | ![Reduction](src/outputs/run_reduce_sum/run_reduce_sum_20251026-212712.gif) |
| Sync Test | Synchronization / barrier demonstration | ![Sync Test](src/outputs/run_sync_test/run_sync_test_20251027-000818.gif) |
| Loop Test | Branching and loop behavior demo | ![Test Loop](src/outputs/run_test_loop/run_test_loop_20251026-212814.gif) |
| Compare Test | Comparison and branching example | ![Test CMP](src/outputs/run_test_cmp/run_test_cmp_20251026-212823.gif) |
| Kernel Args Test | Demonstrates passing kernel arguments | ![Kernel Args](src/outputs/run_test_kernel_args/run_test_kernel_args_20251026-212830.gif) |

---

## 🚀 Quickstart

### Clone and install

```bash
git clone https://github.com/deaneeth/tinygpu.git
cd tinygpu
pip install -e .
pip install -r requirements-dev.txt
````
```

### Run an example

```bash
python -m examples.run_odd_even_sort
```

> Produces: `examples/odd_even_sort.gif` — a visual GPU-style sorting process.
> Produces: `src/outputs/run_odd_even_sort/run_odd_even_sort_*.gif` — a visual GPU-style sorting process.

### Other examples

@@ -108,30 +138,50 @@ python -m examples.run_sync_test

## 🧩 Project Layout

```
tinygpu/
```text
.
├─ .github/
│ └─ workflows/
│ └─ ci.yml
├─ docs/
│ └─ index.md
├─ examples/
│ ├─ vector_add.tgpu
│ ├─ odd_even_sort_tmp.tgpu
│ ├─ odd_even_sort.tgpu
│ ├─ reduce_sum.tgpu
│ ├─ run_vector_add.py
│ ├─ run_odd_even_sort.py
│ ├─ run_reduce_sum.py
│ ├─ run_sync_test.py
│ ├─ run_test_loop.py
│ └─ run_sync_test.py
│ ├─ run_vector_add.py
│ ├─ sync_test.tgpu
│ ├─ test_loop.tgpu
│ └─ vector_add.tgpu
├─ src/outputs/
│ ├─ run_block_shared_sum/
│ ├─ run_odd_even_sort/
│ ├─ run_reduce_sum/
│ ├─ run_sync_test/
│ ├─ run_test_cmp/
│ ├─ run_test_kernel_args/
│ ├─ run_test_loop/
│ └─ run_vector_add/
├─ src/
│ └─ tinygpu/
│ ├─ __init__.py
│ ├─ assembler.py
│ ├─ gpu.py
│ ├─ instructions.py
│ ├─ visualizer.py
│ └─ __init__.py
│ └─ visualizer.py
├─ tests/
│ ├─ test_assembler.py
│ ├─ test_gpu_core.py
│ ├─ test_gpu.py
│ └─ test_programs.py
├─ LICENSE
├─ pyproject.toml
├─ requirements-dev.txt
└─ README.md
├─ README.md
└─ requirements-dev.txt
```

---
Expand All @@ -156,6 +206,8 @@ TinyGPU uses a **minimal instruction set** designed for clarity and education -
| `BNE Ra, Rb, target` | Branch if not equal. | Jump to `target` if `Ra != Rb`. |
| `SYNC` | *(no operands)* | Synchronization barrier — all threads must reach this point before continuing. |
| `CSWAP addrA, addrB` | Compare-and-swap memory values. | If `mem[addrA] > mem[addrB]`, swap them. Used for sorting. |
| `SHLD Rd, saddr` | Load from shared memory into register. | `Rd = shared_mem[saddr]` |
| `SHST saddr, Rs` | Store register into shared memory. | `shared_mem[saddr] = Rs` |
| `CMP Rd, Ra, Rb` *(optional)* | Compare and set flag or register. | Used internally for extended examples (e.g., prefix-scan). |
| `NOP` *(optional)* | *(no operands)* | No operation; placeholder instruction. |
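
The `CSWAP` instruction is the building block of the odd-even sort example. The same compare-and-swap pattern can be sketched in plain Python (a sketch of the algorithm, not TinyGPU code):

```python
# Odd-even transposition sort built from compare-and-swap, mirroring
# what a CSWAP-based kernel does. Plain-Python sketch, not TinyGPU code.

def cswap(mem, a, b):
    # Equivalent of CSWAP addrA, addrB: swap if the pair is out of order.
    if mem[a] > mem[b]:
        mem[a], mem[b] = mem[b], mem[a]

def odd_even_sort(mem):
    n = len(mem)
    for phase in range(n):                  # n phases guarantee a sorted result
        start = phase % 2                   # even phases pair (0,1),(2,3)...; odd pair (1,2),(3,4)...
        for i in range(start, n - 1, 2):    # on a GPU these pairs run in parallel
            cswap(mem, i, i + 1)
    return mem

odd_even_sort([5, 1, 4, 2, 3])  # → [1, 2, 3, 4, 5]
```

In the simulator, the inner loop disappears: each thread owns one pair per phase, and a `SYNC` barrier separates phases.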

@@ -267,7 +319,7 @@ MIT - see [LICENSE](LICENSE)

## 🌟 Credits & Inspiration

❤️ Built by [Deaneeth](https://github.com/deaneeth)
❤️ Built by [Deaneeth](https://github.com/deaneeth)

> Inspired by the educational design of [Tiny8 CPU Simulator](https://github.com/sql-hkr/tiny8).

123 changes: 123 additions & 0 deletions docs/index.md
@@ -0,0 +1,123 @@
# TinyGPU 🐉⚡ — v2.0.0

[![Release v2.0.0](https://img.shields.io/badge/release-v2.0.0-blue.svg)](https://github.com/deaneeth/tinygpu/releases/tag/v2.0.0)
[![Python 3.13](https://img.shields.io/badge/Python-3.13-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![CI](https://github.com/deaneeth/tinygpu/actions/workflows/ci.yml/badge.svg)](https://github.com/deaneeth/tinygpu/actions)
[![Tests](https://img.shields.io/github/actions/workflow/status/deaneeth/tinygpu/ci.yml?label=tests)](https://github.com/deaneeth/tinygpu/actions)
[![Code Style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

TinyGPU is a **tiny educational GPU simulator** — a minimal SIMT-style simulator with:

- Per-thread registers & program counters
- Shared global memory and per-block shared memory
- A small GPU-style ISA and assembler
- Visualizer and GIF export for educational animations

> 🎓 *Built for learning and visualization - see how threads, registers, and memory interact across cycles!*

---

## 🚀 What's New in v2.0.0

- **Enhanced Instruction Set**:
- Added `SHLD` and `SHST` for robust shared memory operations.
- Improved `SYNC` semantics for better thread coordination.
- **Visualizer Improvements**:
- Export execution as GIFs with enhanced clarity.
- Added support for saving visuals directly from the simulator.
- **Refactored Core**:
- Simplified step semantics for better extensibility.
- Optimized performance for larger thread counts.
- **CI/CD Updates**:
- Integrated linting (`ruff`, `black`) and testing workflows.
- Automated builds and tests on GitHub Actions.
- **Documentation**:
- Expanded examples and added detailed usage instructions.

---

## Quick Screenshots / Demos

### Odd–Even Transposition Sort

![Odd-Even Sort](../src/outputs/run_odd_even_sort/run_odd_even_sort_20251026-212558.gif)

### Parallel Reduction (Sum)

![Reduce Sum](../src/outputs/run_reduce_sum/run_reduce_sum_20251026-212712.gif)

---

## Getting Started

Clone and install (editable):

```bash
git clone https://github.com/deaneeth/tinygpu.git
cd tinygpu
pip install -e .
pip install -r requirements-dev.txt
```

Run a demo (odd-even sort):

```bash
python -m examples.run_odd_even_sort
```

> Produces: `src/outputs/run_odd_even_sort/run_odd_even_sort_*.gif` — a visual GPU-style sorting process.

---

## Examples & Runners

- `examples/run_vector_add.py` — simple parallel vector add
- `examples/run_vector_add_kernel.py` — vector add with kernel arguments
- `examples/run_test_loop.py` — branch/loop test (sum 1..4)
- `examples/run_test_cmp.py` — comparison and branching test
- `examples/run_test_kernel_args.py` — kernel arguments test
- `examples/run_odd_even_sort.py` — odd-even transposition sort (GIF)
- `examples/run_reduce_sum.py` — parallel reduction (GIF)
- `examples/run_block_shared_sum.py` — per-block shared memory example
- `examples/run_sync_test.py` — synchronization test
- `examples/debug_repl.py` — interactive REPL debugger
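
The reduction runner (`run_reduce_sum.py`) animates a tree-style parallel sum. The core idea can be sketched as follows (a sketch of the general technique; the exact kernel layout is an assumption):

```python
# Tree-style parallel reduction sketch (not the actual TinyGPU kernel).
# Each pass, element i absorbs element i + stride; on the GPU, the adds
# within one pass run in parallel and a SYNC barrier separates passes.

def tree_reduce_sum(values):
    vals = list(values)
    n = len(vals)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):  # parallel adds in one pass
            vals[i] += vals[i + stride]
        stride *= 2                                 # barrier before the next pass
    return vals[0]

tree_reduce_sum(range(8))  # → 28, in log2(8) = 3 passes instead of 7 serial adds
```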

---

## Instruction Set (Quick Reference)

| **Instruction** | **Operands** | **Description** |
|-----------------------------|------------------------------------------|-----------------|
| `SET Rd, imm` | `Rd` = destination register, `imm` = immediate value | Set register `Rd` to an immediate constant. |
| `ADD Rd, Ra, Rb` | `Rd` = destination, `Ra` + `Rb` | Add two registers and store result in `Rd`. |
| `ADD Rd, Ra, imm` | `Rd` = destination, `Ra` + immediate | Add register and immediate value. |
| `MUL Rd, Ra, Rb` | Multiply two registers. | `Rd = Ra * Rb` |
| `MUL Rd, Ra, imm` | Multiply register by immediate. | `Rd = Ra * imm` |
| `LD Rd, addr` | Load from memory address into register. | `Rd = mem[addr]` |
| `LD Rd, Rk` | Load from address in register `Rk`. | `Rd = mem[Rk]` |
| `ST addr, Rs` | Store register into memory address. | `mem[addr] = Rs` |
| `ST Rk, Rs` | Store value from `Rs` into memory at address in register `Rk`. | `mem[Rk] = Rs` |
| `SHLD Rd, saddr` | Load from shared memory into register. | `Rd = shared_mem[saddr]` |
| `SHST saddr, Rs` | Store register into shared memory. | `shared_mem[saddr] = Rs` |
| `CSWAP addrA, addrB` | Compare-and-swap memory values. | If `mem[addrA] > mem[addrB]`, swap them. Used for sorting. |
| `CMP Ra, Rb` | Compare and set flags. | Set Z/N/G flags based on `Ra - Rb`. |
| `BRGT target` | Branch if greater. | Jump to `target` if G flag set. |
| `BRLT target` | Branch if less. | Jump to `target` if N flag set. |
| `BRZ target` | Branch if zero. | Jump to `target` if Z flag set. |
| `JMP target` | Label or immediate. | Unconditional jump — sets PC to `target`. |
| `SYNC` | *(no operands)* | Global synchronization barrier — all threads must reach this point. |
| `SYNCB` | *(no operands)* | Block-level synchronization barrier. |

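The `CMP`/`BRGT`/`BRLT`/`BRZ` rows above describe a flags-based scheme. A tiny emulation (illustrative only, not the simulator's internals) shows how the Z/N/G flags are set and then consumed by branches:

```python
# Flags-based compare-and-branch, matching the CMP/BRGT/BRLT/BRZ rows above.
# Illustrative emulation only — not TinyGPU's actual implementation.

def cmp_flags(a, b):
    # CMP Ra, Rb: set Z/N/G based on the sign of Ra - Rb.
    diff = a - b
    return {"Z": diff == 0, "N": diff < 0, "G": diff > 0}

flags = cmp_flags(3, 7)
take_brlt = flags["N"]   # BRLT taken: 3 - 7 < 0 sets the N flag
take_brgt = flags["G"]   # BRGT not taken
take_brz = flags["Z"]    # BRZ not taken
```
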
---

## Publishing & Contributing

- See `.github/workflows/ci.yml` for CI and packaging
- To propose changes, open a PR. For bug reports, open an issue.

---

## License

MIT — See [LICENSE](../LICENSE).
37 changes: 37 additions & 0 deletions examples/block_shared_sum.tgpu
@@ -0,0 +1,37 @@
; block_shared_sum.tgpu
; R5 = block_id, R6 = thread_in_block, R7 = tid
; R0 -> temp
; R1 -> base (global base index for each block is block_id * block_stride)
; We'll assume runner sets up base_addr per block in memory (or use a simple scheme)

; Each thread loads its input and stores it into shared[thread_in_block]
; Then threads synchronize at block barrier and thread 0 sums the shared
; values and writes the block sum to memory at address (100 + block_id).

; Load own value from memory[tid] (R7 contains tid)
LD R3, R7 ; R3 = memory[tid]
SHST R6, R3 ; shared[thread_in_block] = R3
SYNCB ; wait for block

; Only thread with thread_in_block == 0 performs the reduction
CMP R6, 0
BRGT not_zero ; if R6 > 0 jump to not_zero (i.e., only R6==0 continues)

SET R4, 0 ; R4 = sum
SET R2, 0 ; R2 = loop index
sum_loop:
SHLD R0, R2 ; R0 = shared[R2]
ADD R4, R4, R0 ; R4 += R0
ADD R2, R2, 1
CMP R2, 4 ; compare with TPB (4)
BRLT sum_loop

; write sum to memory at 100 + block_id (R5 holds block_id)
SET R1, 100
ADD R1, R1, R5
ST R1, R4

JMP done_block
not_zero:
done_block:
; end
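
The kernel above can be paraphrased in plain Python to check the expected result. This is a sketch under the assumptions stated in the kernel's own comments: 4 threads per block (TPB = 4), inputs at `memory[tid]`, block sums written at `100 + block_id`.

```python
# Plain-Python paraphrase of block_shared_sum.tgpu (sketch, not the simulator).
# Assumes 4 threads per block and block sums stored at 100 + block_id.

def block_shared_sum(memory, n_blocks, tpb=4):
    for block_id in range(n_blocks):
        shared = [0] * tpb
        # Phase 1: every thread copies its input into shared memory (LD + SHST).
        for thread_in_block in range(tpb):
            tid = block_id * tpb + thread_in_block
            shared[thread_in_block] = memory[tid]
        # SYNCB: block barrier — all copies finish before the reduction starts.
        # Phase 2: only thread 0 of each block sums shared[] (the CMP/BRGT skip).
        total = 0
        for i in range(tpb):
            total += shared[i]              # the SHLD + ADD loop
        memory[100 + block_id] = total      # ST to 100 + block_id
    return memory

mem = list(range(8)) + [0] * 94             # two blocks of four threads
block_shared_sum(mem, n_blocks=2)
# mem[100] == 0+1+2+3 == 6, mem[101] == 4+5+6+7 == 22
```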