6 changes: 3 additions & 3 deletions .github/workflows/ci.yml
@@ -2,17 +2,17 @@ name: 🧪 CI

on:
push:
branches: [ main, master ]
branches: [ main, master, dev ]
pull_request:
branches: [ main, master ]
branches: [ main, master, dev ]

jobs:
build:
runs-on: ubuntu-latest

strategy:
matrix:
python-version: [ "3.11", "3.12" ]
python-version: [ "3.11", "3.12", "3.13" ]

steps:
- name: 🧰 Checkout repository
2 changes: 2 additions & 0 deletions .gitignore
@@ -5,3 +5,5 @@ __pycache__/
*$py.class
/.pytest_cache

.ruff_cache

96 changes: 74 additions & 22 deletions README.md
@@ -1,17 +1,38 @@
# TinyGPU 🐉⚡

[![PyPI version](https://img.shields.io/badge/version-1.0.0-blue.svg)](https://pypi.org/project/tinygpu)
[![PyPI version](https://img.shields.io/badge/version-2.0.0-blue.svg)](https://pypi.org/project/tinygpu)
[![Python 3.13](https://img.shields.io/badge/Python-3.13-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![CI](https://github.com/deaneeth/tinygpu/actions/workflows/ci.yml/badge.svg)](https://github.com/deaneeth/tinygpu/actions)
[![Code Style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Tests](https://img.shields.io/github/actions/workflow/status/deaneeth/tinygpu/ci.yml?label=tests)](https://github.com/deaneeth/tinygpu/actions)

TinyGPU is a **tiny educational GPU simulator** - inspired by [Tiny8](https://github.com/sql-hkr/tiny8), designed to demonstrate how GPUs execute code in parallel. It models a small **SIMT (Single Instruction, Multiple Threads)** system with per-thread registers, global memory, synchronization barriers, branching, and a minimal GPU-like instruction set.

> 🎓 *Built for learning and visualization - see how threads, registers, and memory interact across cycles!*
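
The SIMT model described above can be sketched in a few lines of plain Python (illustrative only — this is not TinyGPU's actual API): every thread runs the same instruction stream, but each has its own registers and program counter, while memory is shared.

```python
# Minimal SIMT sketch: every thread runs the same program in lockstep,
# each with its own registers and program counter, sharing one memory.
# Illustrative only — not TinyGPU's real implementation.

def run_simt(program, n_threads, memory):
    regs = [{"R0": 0, "TID": t} for t in range(n_threads)]  # per-thread registers
    pcs = [0] * n_threads                                   # per-thread PCs

    while any(pc < len(program) for pc in pcs):
        for t in range(n_threads):          # one cycle: every active thread steps once
            if pcs[t] >= len(program):
                continue
            op, *args = program[pcs[t]]
            if op == "SET":                 # SET Rd, imm
                regs[t][args[0]] = args[1]
            elif op == "ADD":               # ADD Rd, Ra, Rb
                regs[t][args[0]] = regs[t][args[1]] + regs[t][args[2]]
            elif op == "ST":                # ST Rk, Rs (store via address register)
                memory[regs[t][args[0]]] = regs[t][args[1]]
            pcs[t] += 1
    return memory

# Four threads each compute tid + tid and store it at memory[tid]:
prog = [("ADD", "R0", "TID", "TID"), ("ST", "TID", "R0")]
run_simt(prog, 4, [0] * 8)  # → [0, 2, 4, 6, 0, 0, 0, 0]
```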

| Odd-Even Sort | Reduction |
|---------------|------------|
| ![Odd-Even Sort](outputs/run_odd_even_sort/run_odd_even_sort_20251025-205516.gif) | ![Reduction](outputs/run_reduce_sum/run_reduce_sum_20251025-210237.gif) |
| ![Odd-Even Sort](src/outputs/run_odd_even_sort/run_odd_even_sort_20251026-212558.gif) | ![Reduction](src/outputs/run_reduce_sum/run_reduce_sum_20251026-212712.gif) |

---

## 🚀 What's New in v2.0.0

- **Enhanced Instruction Set**:
- Added `SHLD` and `SHST` for robust shared memory operations.
- Improved `SYNC` semantics for better thread coordination.
- **Visualizer Improvements**:
- Export execution as GIFs with enhanced clarity.
- Added support for saving visuals directly from the simulator.
- **Refactored Core**:
- Simplified step semantics for better extensibility.
- Optimized performance for larger thread counts.
- **CI/CD Updates**:
- Integrated linting (`ruff`, `black`) and testing workflows.
- Automated builds and tests on GitHub Actions.
- **Documentation**:
- Expanded examples and added detailed usage instructions.

---

@@ -51,10 +72,11 @@ TinyGPU was built as a **learning-first GPU simulator** - simple enough for begi
> 🧭 TinyGPU aims to make GPU learning *intuitive, visual, and interactive* - from classroom demos to self-guided exploration.

---

## ✨ Highlights

- 🧩 **GPU-like instruction set:**
`SET`, `ADD`, `MUL`, `LD`, `ST`, `JMP`, `BNE`, `BEQ`, `SYNC`, `CSWAP`.
`SET`, `ADD`, `MUL`, `LD`, `ST`, `JMP`, `BNE`, `BEQ`, `SYNC`, `CSWAP`, `SHLD`, `SHST`.
- 🧠 **Per-thread registers & PCs** - each thread executes the same kernel independently.
- 🧱 **Shared global memory** for inter-thread operations.
- 🔄 **Synchronization barriers** (`SYNC`) for parallel coordination.
@@ -69,31 +91,39 @@ TinyGPU was built as a **learning-first GPU simulator** - simple enough for begi

## 🖼️ Example Visuals

> Located in `examples/` — you can generate these GIFs yourself.
> Located in `src/outputs/` — run the example scripts to generate these GIFs (they're saved under `src/outputs/<script_name>/`).

| Odd-Even Sort | Reduction |
|---------------|------------|
| ![Odd-Even Sort](outputs/run_odd_even_sort/run_odd_even_sort_20251025-205516.gif) | ![Reduction](outputs/run_reduce_sum/run_reduce_sum_20251025-210237.gif) |
| Example | Description | GIF Preview |
|---------|-------------|-------------|
| Vector Add | Parallel vector addition (A+B -> C) | ![Vector Add](src/outputs/run_vector_add/run_vector_add_20251026-212734.gif) |
| Block Shared Sum | Per-block shared memory sum example | ![Block Shared Sum](src/outputs/run_block_shared_sum/run_block_shared_sum_20251026-212542.gif) |
| Odd-Even Sort | GPU-style odd-even transposition sort | ![Odd-Even Sort](src/outputs/run_odd_even_sort/run_odd_even_sort_20251026-212558.gif) |
| Parallel Reduction | Sum reduction across an array | ![Reduction](src/outputs/run_reduce_sum/run_reduce_sum_20251026-212712.gif) |
| Sync Test | Synchronization / barrier demonstration | ![Sync Test](src/outputs/run_sync_test/run_sync_test_20251027-000818.gif) |
| Loop Test | Branching and loop behavior demo | ![Test Loop](src/outputs/run_test_loop/run_test_loop_20251026-212814.gif) |
| Compare Test | Comparison and branching example | ![Test CMP](src/outputs/run_test_cmp/run_test_cmp_20251026-212823.gif) |
| Kernel Args Test | Demonstrates passing kernel arguments | ![Kernel Args](src/outputs/run_test_kernel_args/run_test_kernel_args_20251026-212830.gif) |

---

## 🚀 Quickstart

### Clone and install

```bash
git clone https://github.com/deaneeth/tinygpu.git
cd tinygpu
pip install -e .
pip install -r requirements-dev.txt
````
```

### Run an example

```bash
python -m examples.run_odd_even_sort
```

> Produces: `examples/odd_even_sort.gif` — a visual GPU-style sorting process.
> Produces: `src/outputs/run_odd_even_sort/run_odd_even_sort_*.gif` — a visual GPU-style sorting process.

### Other examples

@@ -108,30 +138,50 @@ python -m examples.run_sync_test

## 🧩 Project Layout

```
tinygpu/
```text
.
├─ .github/
│ └─ workflows/
│ └─ ci.yml
├─ docs/
│ └─ index.md
├─ examples/
│ ├─ vector_add.tgpu
│ ├─ odd_even_sort_tmp.tgpu
│ ├─ odd_even_sort.tgpu
│ ├─ reduce_sum.tgpu
│ ├─ run_vector_add.py
│ ├─ run_odd_even_sort.py
│ ├─ run_reduce_sum.py
│ ├─ run_sync_test.py
│ ├─ run_test_loop.py
│ └─ run_sync_test.py
│ ├─ run_vector_add.py
│ ├─ sync_test.tgpu
│ ├─ test_loop.tgpu
│ └─ vector_add.tgpu
├─ src/outputs/
│ ├─ run_block_shared_sum/
│ ├─ run_odd_even_sort/
│ ├─ run_reduce_sum/
│ ├─ run_sync_test/
│ ├─ run_test_cmp/
│ ├─ run_test_kernel_args/
│ ├─ run_test_loop/
│ └─ run_vector_add/
├─ src/
│ └─ tinygpu/
│ ├─ __init__.py
│ ├─ assembler.py
│ ├─ gpu.py
│ ├─ instructions.py
│ ├─ visualizer.py
│ └─ __init__.py
│ └─ visualizer.py
├─ tests/
│ ├─ test_assembler.py
│ ├─ test_gpu_core.py
│ ├─ test_gpu.py
│ └─ test_programs.py
├─ LICENSE
├─ pyproject.toml
├─ requirements-dev.txt
└─ README.md
├─ README.md
└─ requirements-dev.txt
```

---
Expand All @@ -156,6 +206,8 @@ TinyGPU uses a **minimal instruction set** designed for clarity and education -
| `BNE Ra, Rb, target` | Branch if not equal. | Jump to `target` if `Ra != Rb`. |
| `SYNC` | *(no operands)* | Synchronization barrier — all threads must reach this point before continuing. |
| `CSWAP addrA, addrB` | Compare-and-swap memory values. | If `mem[addrA] > mem[addrB]`, swap them. Used for sorting. |
| `SHLD Rd, saddr` | Load from shared memory into register. | `Rd = shared_mem[saddr]` |
| `SHST saddr, Rs` | Store register into shared memory. | `shared_mem[saddr] = Rs` |
| `CMP Rd, Ra, Rb` *(optional)* | Compare and set flag or register. | Used internally for extended examples (e.g., prefix-scan). |
| `NOP` *(optional)* | *(no operands)* | No operation; placeholder instruction. |
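
The `CSWAP` instruction is the building block of the odd-even sort example. The same compare-and-swap pattern can be sketched in plain Python (a sketch of the algorithm, not TinyGPU code):

```python
# Odd-even transposition sort built from compare-and-swap, mirroring
# what a CSWAP-based kernel does. Plain-Python sketch, not TinyGPU code.

def cswap(mem, a, b):
    # Equivalent of CSWAP addrA, addrB: swap if the pair is out of order.
    if mem[a] > mem[b]:
        mem[a], mem[b] = mem[b], mem[a]

def odd_even_sort(mem):
    n = len(mem)
    for phase in range(n):                  # n phases guarantee a sorted result
        start = phase % 2                   # even phases pair (0,1),(2,3)...; odd pair (1,2),(3,4)...
        for i in range(start, n - 1, 2):    # on a GPU these pairs run in parallel
            cswap(mem, i, i + 1)
    return mem

odd_even_sort([5, 1, 4, 2, 3])  # → [1, 2, 3, 4, 5]
```

In the simulator, the inner loop disappears: each thread owns one pair per phase, and a `SYNC` barrier separates phases.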

@@ -267,7 +319,7 @@ MIT - see [LICENSE](LICENSE)

## 🌟 Credits & Inspiration

❤️ Built by [Deaneeth](https://github.com/deaneeth)
❤️ Built by [Deaneeth](https://github.com/deaneeth)

> Inspired by the educational design of [Tiny8 CPU Simulator](https://github.com/sql-hkr/tiny8).

123 changes: 123 additions & 0 deletions docs/index.md
@@ -0,0 +1,123 @@
# TinyGPU 🐉⚡ — v2.0.0

[![Release v2.0.0](https://img.shields.io/badge/release-v2.0.0-blue.svg)](https://github.com/deaneeth/tinygpu/releases/tag/v2.0.0)
[![Python 3.13](https://img.shields.io/badge/Python-3.13-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![CI](https://github.com/deaneeth/tinygpu/actions/workflows/ci.yml/badge.svg)](https://github.com/deaneeth/tinygpu/actions)
[![Tests](https://img.shields.io/github/actions/workflow/status/deaneeth/tinygpu/ci.yml?label=tests)](https://github.com/deaneeth/tinygpu/actions)
[![Code Style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

TinyGPU is a **tiny educational GPU simulator** — a minimal SIMT-style simulator with:

- Per-thread registers & program counters
- Shared global memory and per-block shared memory
- A small GPU-style ISA and assembler
- Visualizer and GIF export for educational animations

> 🎓 *Built for learning and visualization - see how threads, registers, and memory interact across cycles!*

---

## 🚀 What's New in v2.0.0

- **Enhanced Instruction Set**:
- Added `SHLD` and `SHST` for robust shared memory operations.
- Improved `SYNC` semantics for better thread coordination.
- **Visualizer Improvements**:
- Export execution as GIFs with enhanced clarity.
- Added support for saving visuals directly from the simulator.
- **Refactored Core**:
- Simplified step semantics for better extensibility.
- Optimized performance for larger thread counts.
- **CI/CD Updates**:
- Integrated linting (`ruff`, `black`) and testing workflows.
- Automated builds and tests on GitHub Actions.
- **Documentation**:
- Expanded examples and added detailed usage instructions.

---

## Quick Screenshots / Demos

### Odd–Even Transposition Sort

![Odd-Even Sort](../src/outputs/run_odd_even_sort/run_odd_even_sort_20251026-212558.gif)

### Parallel Reduction (Sum)

![Reduce Sum](../src/outputs/run_reduce_sum/run_reduce_sum_20251026-212712.gif)

---

## Getting Started

Clone and install (editable):

```bash
git clone https://github.com/deaneeth/tinygpu.git
cd tinygpu
pip install -e .
pip install -r requirements-dev.txt
```

Run a demo (odd-even sort):

```bash
python -m examples.run_odd_even_sort
```

> Produces: `src/outputs/run_odd_even_sort/run_odd_even_sort_*.gif` — a visual GPU-style sorting process.

---

## Examples & Runners

- `examples/run_vector_add.py` — simple parallel vector add
- `examples/run_vector_add_kernel.py` — vector add with kernel arguments
- `examples/run_test_loop.py` — branch/loop test (sum 1..4)
- `examples/run_test_cmp.py` — comparison and branching test
- `examples/run_test_kernel_args.py` — kernel arguments test
- `examples/run_odd_even_sort.py` — odd-even transposition sort (GIF)
- `examples/run_reduce_sum.py` — parallel reduction (GIF)
- `examples/run_block_shared_sum.py` — per-block shared memory example
- `examples/run_sync_test.py` — synchronization test
- `examples/debug_repl.py` — interactive REPL debugger
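
The reduction runner (`run_reduce_sum.py`) animates a tree-style parallel sum. The core idea can be sketched as follows (a sketch of the general technique; the exact kernel layout is an assumption):

```python
# Tree-style parallel reduction sketch (not the actual TinyGPU kernel).
# Each pass, element i absorbs element i + stride; on the GPU, the adds
# within one pass run in parallel and a SYNC barrier separates passes.

def tree_reduce_sum(values):
    vals = list(values)
    n = len(vals)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):  # parallel adds in one pass
            vals[i] += vals[i + stride]
        stride *= 2                                 # barrier before the next pass
    return vals[0]

tree_reduce_sum(range(8))  # → 28, in log2(8) = 3 passes instead of 7 serial adds
```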

---

## Instruction Set (Quick Reference)

| **Instruction** | **Operands** | **Description** |
|-----------------------------|------------------------------------------|-----------------|
| `SET Rd, imm` | `Rd` = destination register, `imm` = immediate value | Set register `Rd` to an immediate constant. |
| `ADD Rd, Ra, Rb` | `Rd` = destination, `Ra` + `Rb` | Add two registers and store result in `Rd`. |
| `ADD Rd, Ra, imm` | `Rd` = destination, `Ra` + immediate | Add register and immediate value. |
| `MUL Rd, Ra, Rb` | Multiply two registers. | `Rd = Ra * Rb` |
| `MUL Rd, Ra, imm` | Multiply register by immediate. | `Rd = Ra * imm` |
| `LD Rd, addr` | Load from memory address into register. | `Rd = mem[addr]` |
| `LD Rd, Rk` | Load from address in register `Rk`. | `Rd = mem[Rk]` |
| `ST addr, Rs` | Store register into memory address. | `mem[addr] = Rs` |
| `ST Rk, Rs` | Store value from `Rs` into memory at address in register `Rk`. | `mem[Rk] = Rs` |
| `SHLD Rd, saddr` | Load from shared memory into register. | `Rd = shared_mem[saddr]` |
| `SHST saddr, Rs` | Store register into shared memory. | `shared_mem[saddr] = Rs` |
| `CSWAP addrA, addrB` | Compare-and-swap memory values. | If `mem[addrA] > mem[addrB]`, swap them. Used for sorting. |
| `CMP Ra, Rb` | Compare and set flags. | Set Z/N/G flags based on `Ra - Rb`. |
| `BRGT target` | Branch if greater. | Jump to `target` if G flag set. |
| `BRLT target` | Branch if less. | Jump to `target` if N flag set. |
| `BRZ target` | Branch if zero. | Jump to `target` if Z flag set. |
| `JMP target` | Label or immediate. | Unconditional jump — sets PC to `target`. |
| `SYNC` | *(no operands)* | Global synchronization barrier — all threads must reach this point. |
| `SYNCB` | *(no operands)* | Block-level synchronization barrier. |

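The `CMP`/`BRGT`/`BRLT`/`BRZ` rows above describe a flags-based scheme. A tiny emulation (illustrative only, not the simulator's internals) shows how the Z/N/G flags are set and then consumed by branches:

```python
# Flags-based compare-and-branch, matching the CMP/BRGT/BRLT/BRZ rows above.
# Illustrative emulation only — not TinyGPU's actual implementation.

def cmp_flags(a, b):
    # CMP Ra, Rb: set Z/N/G based on the sign of Ra - Rb.
    diff = a - b
    return {"Z": diff == 0, "N": diff < 0, "G": diff > 0}

flags = cmp_flags(3, 7)
take_brlt = flags["N"]   # BRLT taken: 3 - 7 < 0 sets the N flag
take_brgt = flags["G"]   # BRGT not taken
take_brz = flags["Z"]    # BRZ not taken
```
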
---

## Publishing & Contributing

- See `.github/workflows/ci.yml` for CI and packaging
- To propose changes, open a PR. For bug reports, open an issue.

---

## License

MIT — See [LICENSE](../LICENSE).
37 changes: 37 additions & 0 deletions examples/block_shared_sum.tgpu
@@ -0,0 +1,37 @@
; block_shared_sum.tgpu
; R5 = block_id, R6 = thread_in_block, R7 = tid
; R0 -> temp
; R1 -> base (global base index for each block is block_id * block_stride)
; We'll assume runner sets up base_addr per block in memory (or use a simple scheme)

; Each thread loads its input and stores it into shared[thread_in_block]
; Then threads synchronize at block barrier and thread 0 sums the shared
; values and writes the block sum to memory at address (100 + block_id).

; Load own value from memory[tid] (R7 contains tid)
LD R3, R7 ; R3 = memory[tid]
SHST R6, R3 ; shared[thread_in_block] = R3
SYNCB ; wait for block

; Only thread with thread_in_block == 0 performs the reduction
CMP R6, 0
BRGT not_zero ; if R6 > 0 jump to not_zero (i.e., only R6==0 continues)

SET R4, 0 ; R4 = sum
SET R2, 0 ; R2 = loop index
sum_loop:
SHLD R0, R2 ; R0 = shared[R2]
ADD R4, R4, R0 ; R4 += R0
ADD R2, R2, 1
CMP R2, 4 ; compare with TPB (4)
BRLT sum_loop

; write sum to memory at 100 + block_id (R5 holds block_id)
SET R1, 100
ADD R1, R1, R5
ST R1, R4

JMP done_block
not_zero:
done_block:
; end
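
The kernel above can be paraphrased in plain Python to check the expected result. This is a sketch under the assumptions stated in the kernel's own comments: 4 threads per block (TPB = 4), inputs at `memory[tid]`, block sums written at `100 + block_id`.

```python
# Plain-Python paraphrase of block_shared_sum.tgpu (sketch, not the simulator).
# Assumes 4 threads per block and block sums stored at 100 + block_id.

def block_shared_sum(memory, n_blocks, tpb=4):
    for block_id in range(n_blocks):
        shared = [0] * tpb
        # Phase 1: every thread copies its input into shared memory (LD + SHST).
        for thread_in_block in range(tpb):
            tid = block_id * tpb + thread_in_block
            shared[thread_in_block] = memory[tid]
        # SYNCB: block barrier — all copies finish before the reduction starts.
        # Phase 2: only thread 0 of each block sums shared[] (the CMP/BRGT skip).
        total = 0
        for i in range(tpb):
            total += shared[i]              # the SHLD + ADD loop
        memory[100 + block_id] = total      # ST to 100 + block_id
    return memory

mem = list(range(8)) + [0] * 94             # two blocks of four threads
block_shared_sum(mem, n_blocks=2)
# mem[100] == 0+1+2+3 == 6, mem[101] == 4+5+6+7 == 22
```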