diff --git a/doc/modules/ROOT/pages/2.cpp20-coroutines/2.intro.adoc b/doc/modules/ROOT/pages/2.cpp20-coroutines/2.intro.adoc
index 3136e56e..f57b96a9 100644
--- a/doc/modules/ROOT/pages/2.cpp20-coroutines/2.intro.adoc
+++ b/doc/modules/ROOT/pages/2.cpp20-coroutines/2.intro.adoc
@@ -15,16 +15,4 @@ C++20 coroutines change the rules. A coroutine can _suspend_ its execution--savi
 This is not a minor syntactic convenience. It is a fundamental shift in how you can structure programs that wait.
 
-== What You Will Learn
-
-This section takes you from zero to a working understanding of C++20 coroutines. You do not need prior experience with coroutines, async programming, or any coroutine library.
-
-* **xref:2a.foundations.adoc[Foundations]** -- How regular functions use the call stack, what happens when a function needs to pause, and how coroutines solve the problem by decoupling a function's lifetime from its stack frame.
-
-* **xref:2b.syntax.adoc[C++20 Syntax]** -- The three coroutine keywords (`co_await`, `co_return`, `co_yield`), what the compiler does when it sees them, and how to write your first coroutine.
-
-* **xref:2c.machinery.adoc[Coroutine Machinery]** -- The promise type, coroutine handles, and the protocols that connect your coroutine to the runtime. This is where you see how the compiler transforms your code and how you can customize that transformation.
-
-* **xref:2d.advanced.adoc[Advanced Topics]** -- Symmetric transfer, heap allocation elision optimization (HALO), and the performance characteristics that make coroutines practical for high-throughput systems.
-
-By the end of this section, you will understand not only _how_ to write coroutines, but _why_ they work the way they do--knowledge that will make everything in the rest of this documentation click into place.
+This section takes you from zero to a working understanding of C++20 coroutines. No prior experience with coroutines or async programming is needed. You will start with the problem that coroutines solve, move through the language syntax and compiler machinery, and finish with the performance characteristics that make coroutines practical for real systems. By the end, you will understand not only _how_ to write coroutines but _why_ they work the way they do--knowledge that will make everything in the rest of this documentation click into place.
 
diff --git a/doc/modules/ROOT/pages/3.concurrency/3.intro.adoc b/doc/modules/ROOT/pages/3.concurrency/3.intro.adoc
index 983bfed8..3663951c 100644
--- a/doc/modules/ROOT/pages/3.concurrency/3.intro.adoc
+++ b/doc/modules/ROOT/pages/3.concurrency/3.intro.adoc
@@ -15,16 +15,4 @@ Yet concurrent programming has a reputation for being treacherous, and that repu
 The good news: these problems are well understood. Decades of research and practice have produced clear patterns, precise vocabulary, and reliable tools. Once you understand the fundamentals--what a data race actually is, why memory ordering matters, how synchronization primitives work--concurrent code becomes something you can reason about with confidence.
 
-== What You Will Learn
-
-This section builds your understanding of concurrency from first principles. You do not need any prior experience with threads or parallel programming.
-
-* **xref:3a.foundations.adoc[Foundations]** -- What threads are, how they share memory, and why running code in parallel introduces problems that sequential programs never face.
-
-* **xref:3b.synchronization.adoc[Synchronization]** -- Mutexes, locks, condition variables, and the mechanisms that let threads coordinate safely. You will learn when each tool is appropriate and what it actually guarantees.
-
-* **xref:3c.advanced.adoc[Advanced Primitives]** -- Atomics, memory ordering, and lock-free techniques. These are the building blocks underneath the higher-level tools, and understanding them gives you the power to make informed performance decisions.
-
-* **xref:3d.patterns.adoc[Communication & Patterns]** -- Producer-consumer queues, thread pools, and the architectural patterns that structure concurrent systems. These patterns appear everywhere, from operating systems to web servers to game engines.
-
-When you finish this section, you will have the vocabulary and mental models to understand how Capy's coroutine-based concurrency works under the hood--and why it eliminates entire categories of the bugs described here.
+This section builds your understanding of concurrency from first principles. No prior experience with threads or parallel programming is needed. You will learn what makes concurrent code hard to reason about, how the standard synchronization tools work, and the architectural patterns that tame that complexity. When you finish, you will have the vocabulary and mental models to understand how Capy's coroutine-based concurrency works under the hood--and why it eliminates entire categories of the bugs described here.
 
diff --git a/doc/modules/ROOT/pages/4.coroutines/4.intro.adoc b/doc/modules/ROOT/pages/4.coroutines/4.intro.adoc
index 674cc483..c265e5f4 100644
--- a/doc/modules/ROOT/pages/4.coroutines/4.intro.adoc
+++ b/doc/modules/ROOT/pages/4.coroutines/4.intro.adoc
@@ -15,20 +15,4 @@ Capy's coroutine model is built around a single principle: asynchronous code sho
 But this is not magic, and it is not a black box. Every piece of Capy's coroutine infrastructure is designed to be transparent. You can see how tasks are scheduled, control where they run, propagate cancellation, compose concurrent operations, and tune memory allocation. Understanding these mechanisms is what separates someone who uses the library from someone who uses it _well_.
 
-== What You Will Learn
-
-* **xref:4a.tasks.adoc[The task Type]** -- Capy's fundamental coroutine type. Lazy execution, symmetric transfer, executor inheritance, and stop token propagation--everything a `task` gives you out of the box.
-
-* **xref:4b.launching.adoc[Launching Coroutines]** -- How to start tasks running: `co_await`, `spawn`, `run_async`, and the differences between them. When to use each, and what happens to exceptions and cancellation.
-
-* **xref:4c.executors.adoc[Executors and Execution Contexts]** -- Where your coroutines run. Thread pools, strands, executor binding, and how Capy ensures your code executes on the right thread.
-
-* **xref:4d.io-awaitable.adoc[The IoAwaitable Protocol]** -- The contract between I/O operations and the coroutine runtime. How `io_result` works, what the compiler sees, and how to write your own awaitable operations.
-
-* **xref:4e.cancellation.adoc[Stop Tokens and Cancellation]** -- Cooperative cancellation that propagates through your entire call tree. How to check for cancellation, respond to it gracefully, and design operations that clean up properly.
-
-* **xref:4f.composition.adoc[Concurrent Composition]** -- Running multiple operations simultaneously with `when_all` and `when_any`. Fan-out/fan-in patterns, timeouts, and racing operations against each other.
-
-* **xref:4g.allocators.adoc[Frame Allocators]** -- Controlling where coroutine frames are allocated. Custom allocators, arena strategies, and the techniques that eliminate allocation overhead in hot paths.
-
-This section is the bridge between theory and practice. By the end, you will be writing real asynchronous programs with Capy.
+This section is the bridge between theory and practice. You will see how Capy turns C++20 coroutines into a complete async programming model--from launching and scheduling tasks, through cancellation and concurrent composition, to fine-grained control over memory allocation. Each topic builds on the last, and by the end you will be writing real asynchronous programs with Capy.
 
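Reviewer note: the "reads like synchronous code" principle in the retained paragraph above is easiest to see in a thumbnail. The sketch below uses only `task` and `co_await` from Capy's vocabulary; `async_read_line` is a hypothetical helper shown purely for shape, not a library API.

    // Sketch: asynchronous code that reads like synchronous code.
    // async_read_line is hypothetical; only task and co_await are real.
    task<std::string> greet(auto& stream)
    {
        auto name = co_await async_read_line(stream); // suspends; no thread blocks
        co_return "hello, " + name;
    }

Control returns to the executor at the `co_await`, yet the logic reads top to bottom exactly as a blocking version would.
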
diff --git a/doc/modules/ROOT/pages/5.buffers/5.intro.adoc b/doc/modules/ROOT/pages/5.buffers/5.intro.adoc
index defd8f3d..34fe1e7c 100644
--- a/doc/modules/ROOT/pages/5.buffers/5.intro.adoc
+++ b/doc/modules/ROOT/pages/5.buffers/5.intro.adoc
@@ -15,18 +15,4 @@ The obvious answer is a pointer and a size. And for a single contiguous buffer,
 Capy's buffer model is designed for this reality. Instead of forcing you to copy data into a single contiguous allocation, Capy uses _buffer sequences_--lightweight, zero-copy abstractions that let you describe any arrangement of memory and pass it directly to the OS. The design is concept-driven, meaning the compiler verifies correctness at compile time with no runtime overhead.
 
-== What You Will Learn
-
-* **xref:5a.overview.adoc[Why Concepts, Not Spans]** -- Why `std::span` falls short for I/O, how scatter/gather operations work, and the design reasoning behind Capy's concept-based approach.
-
-* **xref:5b.types.adoc[Buffer Types]** -- `const_buffer`, `mutable_buffer`, and `make_buffer`--the fundamental building blocks for describing contiguous memory regions.
-
-* **xref:5c.sequences.adoc[Buffer Sequences]** -- How to compose multiple buffers into sequences that I/O operations consume in a single call, without copying.
-
-* **xref:5d.system-io.adoc[System I/O Integration]** -- How buffer sequences map to operating system primitives like `readv` and `writev`, and why this matters for performance.
-
-* **xref:5e.algorithms.adoc[Buffer Algorithms]** -- Operations on buffer sequences: copying, prefix/suffix extraction, and the tools that make working with scattered data practical.
-
-* **xref:5f.dynamic.adoc[Dynamic Buffers]** -- Resizable buffers that grow as data arrives. The `DynamicBuffer` concept and how it integrates with stream operations for protocol parsing and message assembly.
-
-Understanding buffers is essential for everything that follows. Streams, I/O operations, and protocol implementations all build on the abstractions introduced here.
+This section covers everything you need to work with memory in Capy's I/O model. You will learn the fundamental buffer types, how to compose them into sequences for scatter/gather I/O, and how they map to operating system primitives. You will also meet the algorithms that manipulate buffer data and the dynamic buffer abstractions that grow as data arrives. Understanding buffers is essential for everything that follows--streams, I/O operations, and protocol implementations all build on the abstractions introduced here.
 
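Reviewer note: since the retained paragraph leans on buffer sequences doing zero-copy work, a concrete shape may help. This is a sketch: `const_buffer` is named in the pages above, but the `(pointer, size)` constructor and the use of `std::array` as a sequence are assumptions, not confirmed API.

    // Sketch: gather-write a two-part message without coalescing it.
    // const_buffer(ptr, size) construction is assumed here.
    std::string_view header = "Content-Length: 5\r\n\r\n";
    std::string_view body   = "hello";

    std::array<const_buffer, 2> message{
        const_buffer(header.data(), header.size()),
        const_buffer(body.data(), body.size())};

    // A sequence like `message` can be consumed by a single write
    // call, mapping onto OS gather primitives such as writev.

Neither region is copied; the sequence only records where the bytes already live.
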
diff --git a/doc/modules/ROOT/pages/6.streams/6.intro.adoc b/doc/modules/ROOT/pages/6.streams/6.intro.adoc
index 9d376c66..cce486de 100644
--- a/doc/modules/ROOT/pages/6.streams/6.intro.adoc
+++ b/doc/modules/ROOT/pages/6.streams/6.intro.adoc
@@ -17,18 +17,4 @@ A socket might give you 47 bytes when you asked for 1024. That is not an error--
 On top of this, Capy adds _buffer sources_ and _buffer sinks_--concepts that work with dynamic buffers, enabling protocol parsers and message builders to grow their storage as needed without manual bookkeeping.
 
-== What You Will Learn
-
-* **xref:6a.overview.adoc[Overview]** -- The six stream concepts at a glance, how they relate to each other, and which one to reach for in different situations.
-
-* **xref:6b.streams.adoc[Streams (Partial I/O)]** -- `ReadStream` and `WriteStream`--the concepts for operations that transfer _some_ data and return immediately. The building blocks for everything else.
-
-* **xref:6c.sources-sinks.adoc[Sources and Sinks (Complete I/O)]** -- `ReadSource` and `WriteSink`--the concepts for operations that transfer _all_ requested data or report an error. Built on top of streams, with well-defined completion guarantees.
-
-* **xref:6d.buffer-concepts.adoc[Buffer Sources and Sinks]** -- `BufferSource` and `BufferSink`--concepts that pair complete I/O with dynamic buffers for protocol-level operations.
-
-* **xref:6e.algorithms.adoc[Transfer Algorithms]** -- Generic algorithms that move data between streams, sources, and sinks. Composable, efficient, and independent of any particular transport.
-
-* **xref:6f.isolation.adoc[Physical Isolation]** -- How Capy's stream concepts enable you to test, mock, and compose I/O layers without coupling to specific transports. Write your logic once; run it over TCP, TLS, pipes, or in-memory buffers.
-
-These concepts are the vocabulary of Capy's I/O model. Once you understand them, every I/O operation in the library will feel familiar.
+This section introduces the concepts that form Capy's vocabulary for data flow. You will learn the distinction between partial and complete I/O, how the concept pairs relate to each other, and how transfer algorithms and physical isolation let you write I/O logic that is composable, testable, and independent of any particular transport. Once you understand these concepts, every I/O operation in the library will feel familiar.
 
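Reviewer note: the partial/complete distinction the new paragraph compresses is easiest to see as a loop. A sketch, assuming a `ReadStream`-shaped `read_some` that can be awaited for the number of bytes transferred; signatures are simplified and EOF/error handling is omitted.

    // Sketch: building a complete read (what a ReadSource guarantees)
    // out of partial reads (what a ReadStream provides).
    // mutable_buffer's data()/size() accessors are assumed.
    task<> read_exactly(auto& stream, mutable_buffer b)
    {
        while(b.size() > 0)
        {
            // read_some may transfer fewer bytes than requested --
            // 47 of 1024 is success, not an error
            std::size_t n = co_await stream.read_some(b);
            b = mutable_buffer(
                static_cast<char*>(b.data()) + n, b.size() - n);
        }
    }
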
diff --git a/doc/modules/ROOT/pages/7.examples/7.intro.adoc b/doc/modules/ROOT/pages/7.examples/7.intro.adoc
index c241f464..5890ab8d 100644
--- a/doc/modules/ROOT/pages/7.examples/7.intro.adoc
+++ b/doc/modules/ROOT/pages/7.examples/7.intro.adoc
@@ -11,26 +11,4 @@ The best way to learn a library is to watch it solve real problems.
 This section is a collection of complete, working programs that demonstrate how the pieces you have learned--tasks, buffers, streams, cancellation, composition--fit together in practice.
 
-Each example is self-contained. You can compile and run it. The code is followed by detailed explanations of what it does, why it is structured that way, and what happens at each step. Start with the examples that interest you most, or work through them in order for a guided tour of Capy's capabilities.
-
-== What You Will Find
-
-* **xref:7a.hello-task.adoc[Hello Task]** -- The minimal Capy program. Create a task, run it on a thread pool, and see coroutine execution in action.
-
-* **xref:7b.producer-consumer.adoc[Producer-Consumer]** -- Two coroutines communicating through a shared channel. A classic concurrency pattern, implemented without threads or locks.
-
-* **xref:7c.buffer-composition.adoc[Buffer Composition]** -- Assembling I/O from multiple memory regions using buffer sequences. Zero-copy message construction in practice.
-
-* **xref:7d.mock-stream-testing.adoc[Mock Stream Testing]** -- Testing I/O logic without a network. In-memory streams that simulate sockets, including partial reads and error injection.
-
-* **xref:7e.type-erased-echo.adoc[Type-Erased Echo]** -- An echo server that works over any transport. Demonstrates physical isolation and type erasure for streams.
-
-* **xref:7f.timeout-cancellation.adoc[Timeout with Cancellation]** -- Racing an operation against a deadline. Cooperative cancellation with `when_any` and stop tokens.
-
-* **xref:7g.parallel-fetch.adoc[Parallel Fetch]** -- Launching multiple operations concurrently and collecting results. Fan-out/fan-in with `when_all`.
-
-* **xref:7h.custom-dynamic-buffer.adoc[Custom Dynamic Buffer]** -- Implementing your own `DynamicBuffer` for specialized allocation strategies.
-
-* **xref:7i.echo-server-corosio.adoc[Echo Server with Corosio]** -- A complete multi-client echo server using Corosio for socket I/O. The full picture: accept loop, per-connection coroutines, graceful shutdown.
-
-* **xref:7j.stream-pipeline.adoc[Stream Pipeline]** -- Chaining stream transformations. Data flows through multiple processing stages, each implemented as a stream adapter.
+Every example is self-contained and compiles as a standalone program. The code is followed by detailed explanations of what it does, why it is structured that way, and what happens at each step. The examples range from minimal starting points to fully featured servers, covering real-world integration with Corosio. Start with whatever interests you most, or work through them in order for a guided tour of Capy's capabilities.
 
diff --git a/doc/modules/ROOT/pages/8.design/8.intro.adoc b/doc/modules/ROOT/pages/8.design/8.intro.adoc
index 9a69395e..a38010fc 100644
--- a/doc/modules/ROOT/pages/8.design/8.intro.adoc
+++ b/doc/modules/ROOT/pages/8.design/8.intro.adoc
@@ -11,28 +11,4 @@ Capy's public interface--tasks, buffers, streams--is intentionally small.
 Behind that interface are design decisions that determine how concepts compose, where responsibility boundaries fall, and what guarantees the library can make. This section documents those decisions.
 
-Each page in this section examines one concept or facility in depth. You will find the formal concept definition, the rationale for its design, the alternatives that were considered, and the tradeoffs that were made. If you have ever wondered _why_ `ReadStream` requires `read_some` instead of `read`, or why buffer sinks and sources exist as separate concepts from streams, the answers are here.
-
-== What You Will Find
-
-* **xref:8a.ReadStream.adoc[ReadStream]** -- The partial-read concept. Why `read_some` is the correct primitive, how it composes with algorithms, and its relationship to `ReadSource`.
-
-* **xref:8b.ReadSource.adoc[ReadSource]** -- The complete-read concept. Guaranteed delivery semantics, EOF handling, and the contract between sources and consumers.
-
-* **xref:8c.BufferSource.adoc[BufferSource]** -- Pairing complete reads with dynamic buffers. How protocol parsers use `BufferSource` to accumulate data incrementally.
-
-* **xref:8d.WriteStream.adoc[WriteStream]** -- The partial-write concept. Symmetric design with `ReadStream`, and how write algorithms handle short writes.
-
-* **xref:8e.WriteSink.adoc[WriteSink]** -- The complete-write concept. Guaranteed delivery for outbound data, and the composition with serialization layers.
-
-* **xref:8f.BufferSink.adoc[BufferSink]** -- Dynamic buffer output. How message builders and serializers produce output without knowing the transport.
-
-* **xref:8g.RunApi.adoc[Run API]** -- The entry points for executing coroutines: `run`, `run_async`, and the bridge between synchronous and asynchronous worlds.
-
-* **xref:8h.TypeEraseAwaitable.adoc[Type-Erasing Awaitables]** -- Erasing the concrete type of an awaitable behind a uniform interface. When type erasure is worth the cost, and how Capy implements it.
-
-* **xref:8i.any_buffer_sink.adoc[AnyBufferSink]** -- A type-erased buffer sink. Combining the `BufferSink` concept with type erasure for runtime polymorphism.
-
-* **xref:8j.Executor.adoc[Executor]** -- The executor concept. Why `dispatch` returns `void`, why `defer` was dropped, how `executor_ref` achieves zero-allocation type erasure, and the I/O completion pattern that motivates the design.
-
-These documents are reference material for library contributors and advanced users. They assume familiarity with the tutorial sections and focus on design reasoning rather than usage.
+Each page examines one concept or facility in depth: its formal definition, the rationale behind its design, the alternatives that were considered, and the tradeoffs that were made. If you have ever wondered _why_ a particular concept requires a specific primitive, or why certain abstractions exist as separate concepts, the answers are here. These documents are reference material for library contributors and advanced users. They assume familiarity with the tutorial sections and focus on design reasoning rather than usage.
 
diff --git a/example/CMakeLists.txt b/example/CMakeLists.txt
index 0d39ebba..0dceb66a 100644
--- a/example/CMakeLists.txt
+++ b/example/CMakeLists.txt
@@ -22,4 +22,5 @@ if(TARGET Boost::corosio)
     add_subdirectory(echo-server-corosio)
 endif()
 
+add_subdirectory(allocation)
 add_subdirectory(asio)
diff --git a/example/allocation/CMakeLists.txt b/example/allocation/CMakeLists.txt
new file mode 100644
index 00000000..47be9306
--- /dev/null
+++ b/example/allocation/CMakeLists.txt
@@ -0,0 +1,22 @@
+#
+# Copyright (c) 2026 Mungo Gill
+#
+# Distributed under the Boost Software License, Version 1.0. (See accompanying
+# file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
+#
+# Official repository: https://github.com/cppalliance/capy
+#
+
+file(GLOB_RECURSE PFILES CONFIGURE_DEPENDS *.cpp *.hpp
+    CMakeLists.txt
+    Jamfile)
+
+source_group(TREE ${CMAKE_CURRENT_SOURCE_DIR} PREFIX "" FILES ${PFILES})
+
+add_executable(capy_example_allocation ${PFILES})
+
+set_property(TARGET capy_example_allocation
+    PROPERTY FOLDER "examples")
+
+target_link_libraries(capy_example_allocation
+    Boost::capy)
diff --git a/example/allocation/Jamfile b/example/allocation/Jamfile
new file mode 100644
index 00000000..47b312b6
--- /dev/null
+++ b/example/allocation/Jamfile
@@ -0,0 +1,18 @@
+#
+# Copyright (c) 2026 Mungo Gill
+#
+# Distributed under the Boost Software License, Version 1.0. (See accompanying
+# file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
+#
+# Official repository: https://github.com/cppalliance/capy
+#
+
+project
+    : requirements
+      <library>/boost/capy//boost_capy
+      <include>.
+    ;
+
+exe allocation :
+    [ glob *.cpp ]
+    ;
diff --git a/example/allocation/allocation.cpp b/example/allocation/allocation.cpp
new file mode 100644
index 00000000..94359806
--- /dev/null
+++ b/example/allocation/allocation.cpp
@@ -0,0 +1,115 @@
+//
+// Copyright (c) 2026 Mungo Gill
+//
+// Distributed under the Boost Software License, Version 1.0. (See accompanying
+// file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
+//
+// Official repository: https://github.com/cppalliance/capy
+//
+
+//
+// Allocation Example
+//
+// Compares the performance of the default recycling frame allocator
+// against std::allocator (no recycling). A 4-deep coroutine chain
+// is driven two million times through test::blocking_context, once
+// with each allocator.
+//
+
+#include <boost/capy.hpp>
+#include <boost/capy/test/blocking_context.hpp>
+
+#include <atomic>
+#include <chrono>
+#include <cmath>
+#include <iomanip>
+#include <iostream>
+#include <memory>
+
+using namespace boost::capy;
+
+std::atomic<std::size_t> counter{0};
+
+// These coroutines simulate a "composed operation"
+// consisting of layered APIs. For example, a user's
+// business logic awaiting an HTTP client, awaiting
+// a TLS stream, awaiting a tcp_socket.
+
+task<> depth_4()
+{
+    counter.fetch_add(1, std::memory_order_relaxed);
+    co_return;
+}
+
+task<> depth_3()
+{
+    for(int i = 0; i < 3; ++i)
+        co_await depth_4();
+}
+
+task<> depth_2()
+{
+    for(int i = 0; i < 3; ++i)
+        co_await depth_3();
+}
+
+task<> depth_1()
+{
+    for(int i = 0; i < 5; ++i)
+        co_await depth_2();
+}
+
+task<> bench_loop(std::size_t n)
+{
+    for(std::size_t i = 0; i < n; ++i)
+        co_await depth_1();
+}
+
+int main()
+{
+    constexpr std::size_t iterations = 2000000;
+
+    // With recycling allocator
+    counter.store(0);
+    auto t0 = std::chrono::steady_clock::now();
+    {
+        test::blocking_context ctx;
+        ctx.set_frame_allocator(get_recycling_memory_resource());
+        run_async(ctx.get_executor(),
+            [&] { ctx.signal_done(); })(
+            bench_loop(iterations));
+        ctx.run();
+    }
+    auto t1 = std::chrono::steady_clock::now();
+
+    // With std::allocator (no recycling)
+    counter.store(0);
+    auto t2 = std::chrono::steady_clock::now();
+    {
+        test::blocking_context ctx;
+        run_async(ctx.get_executor(), std::allocator<char>{},
+            [&] { ctx.signal_done(); })(
+            bench_loop(iterations));
+        ctx.run();
+    }
+    auto t3 = std::chrono::steady_clock::now();
+
+    auto ms_recycling =
+        std::chrono::duration<double, std::milli>(t1 - t0).count();
+    auto ms_standard =
+        std::chrono::duration<double, std::milli>(t3 - t2).count();
+
+    auto pct = std::round((ms_standard / ms_recycling - 1.0) * 1000.0) / 10.0;
+
+    std::cout
+        << iterations << " iterations, "
+        << "4-deep coroutine chain\n\n"
+        << "  Recycling allocator: "
+        << ms_recycling << " ms\n"
+        << "  std::allocator:      "
+        << ms_standard << " ms\n"
+        << "  Speedup:             "
+        << std::fixed << std::setprecision(1)
+        << pct << "%\n";
+
+    return 0;
+}
diff --git a/include/boost/capy/ex/frame_allocator.hpp b/include/boost/capy/ex/frame_allocator.hpp
index c4f517a5..8da47250 100644
--- a/include/boost/capy/ex/frame_allocator.hpp
+++ b/include/boost/capy/ex/frame_allocator.hpp
@@ -15,6 +15,20 @@
 
 #include <memory_resource>
 
+/* Design rationale (pdimov):
+   This accessor is a thin wrapper over a thread-local pointer.
+   It returns exactly what was stored, including nullptr. No
+   dynamic initializer on the thread-local; a dynamic TLS
+   initializer moves you into a costlier implementation bucket
+   on some platforms - avoid it.
+
+   Null handling is the caller's responsibility (e.g. in
+   promise_type::operator new). The accessor must not substitute
+   a default, because there are multiple valid choices
+   (new_delete_resource, the default pmr resource, etc.). If
+   the allocator is not set, it reports "not set" and the
+   caller interprets that however it wants. */
+
 namespace boost {
 namespace capy {
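Reviewer note: the contract in that rationale comment is easy to mis-implement, so here is the shape it implies, reduced to a sketch. `current_frame_allocator` is the accessor this PR calls from `operator new`; the variable name and its placement are illustrative, not the actual implementation.

    // Sketch of the accessor contract described above.
    namespace boost { namespace capy { namespace detail {

    // Constant-initialized: no dynamic TLS initializer is emitted,
    // avoiding the costlier TLS bucket on some platforms.
    inline thread_local std::pmr::memory_resource*
        frame_allocator_tls = nullptr;

    } // detail

    inline std::pmr::memory_resource*
    current_frame_allocator() noexcept
    {
        // Returns exactly what was stored -- including nullptr.
        // The caller decides what "not set" means.
        return detail::frame_allocator_tls;
    }

    } } // namespace boost::capy
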
diff --git a/include/boost/capy/ex/io_awaitable_support.hpp b/include/boost/capy/ex/io_awaitable_support.hpp
index 553978e6..bb72296b 100644
--- a/include/boost/capy/ex/io_awaitable_support.hpp
+++ b/include/boost/capy/ex/io_awaitable_support.hpp
@@ -13,10 +13,12 @@
 #include <boost/capy/ex/executor.hpp>
 #include <boost/capy/ex/frame_allocator.hpp>
 #include <boost/capy/ex/io_result.hpp>
+#include <boost/capy/ex/recycling_memory_resource.hpp>
 
 #include <coroutine>
 #include <cstddef>
 #include <cstdint>
+#include <cstring>
 #include <memory_resource>
 #include <new>
 #include <utility>
@@ -138,40 +140,33 @@ class io_awaitable_support
 
     // Frame allocation support
     //----------------------------------------------------------
 
-private:
-    static constexpr std::size_t ptr_alignment = alignof(void*);
-
-    static std::size_t
-    aligned_offset(std::size_t n) noexcept
-    {
-        return (n + ptr_alignment - 1) & ~(ptr_alignment - 1);
-    }
-
 public:
     /** Allocate a coroutine frame.
 
        Uses the thread-local frame allocator set by run_async.
        Falls back to default memory resource if not set.
        Stores the allocator pointer at the end of each frame for
-        correct deallocation even when TLS changes.
+        correct deallocation even when TLS changes. Uses memcpy
+        to avoid alignment requirements on the trailing pointer.
+        Bypasses virtual dispatch for the recycling allocator.
     */
    static void*
    operator new(std::size_t size)
    {
+        static auto* const rmr = get_recycling_memory_resource();
+
        auto* mr = current_frame_allocator();
        if(!mr)
            mr = std::pmr::get_default_resource();
 
-        // Allocate extra space for memory_resource pointer
-        std::size_t ptr_offset = aligned_offset(size);
-        std::size_t total = ptr_offset + sizeof(std::pmr::memory_resource*);
-        void* raw = mr->allocate(total, alignof(std::max_align_t));
-
-        // Store the allocator pointer at the end
-        auto* ptr_loc = reinterpret_cast<std::pmr::memory_resource**>(
-            static_cast<char*>(raw) + ptr_offset);
-        *ptr_loc = mr;
-
+        auto total = size + sizeof(std::pmr::memory_resource*);
+        void* raw;
+        if(mr == rmr)
+            raw = static_cast<recycling_memory_resource*>(mr)
+                ->allocate_fast(total, alignof(std::max_align_t));
+        else
+            raw = mr->allocate(total, alignof(std::max_align_t));
+        std::memcpy(static_cast<char*>(raw) + size, &mr, sizeof(mr));
        return raw;
    }
@@ -179,18 +174,21 @@
        Reads the allocator pointer stored at the end of
        the frame to ensure correct deallocation regardless
        of current TLS.
+        Bypasses virtual dispatch for the recycling allocator.
    */
    static void
    operator delete(void* ptr, std::size_t size)
    {
-        // Read the allocator pointer from the end of the frame
-        std::size_t ptr_offset = aligned_offset(size);
-        auto* ptr_loc = reinterpret_cast<std::pmr::memory_resource**>(
-            static_cast<char*>(ptr) + ptr_offset);
-        auto* mr = *ptr_loc;
-
-        std::size_t total = ptr_offset + sizeof(std::pmr::memory_resource*);
-        mr->deallocate(ptr, total, alignof(std::max_align_t));
+        static auto* const rmr = get_recycling_memory_resource();
+
+        std::pmr::memory_resource* mr;
+        std::memcpy(&mr, static_cast<char*>(ptr) + size, sizeof(mr));
+        auto total = size + sizeof(std::pmr::memory_resource*);
+        if(mr == rmr)
+            static_cast<recycling_memory_resource*>(mr)
+                ->deallocate_fast(ptr, total, alignof(std::max_align_t));
+        else
+            mr->deallocate(ptr, total, alignof(std::max_align_t));
    }
 
    ~io_awaitable_support()
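Reviewer note: the trailing-pointer scheme used here (and again in run_async_trampoline below) stands alone well. This is a self-compiling model of the technique with illustrative names, not library code; note that memcpy imposes no alignment requirement on the trailing slot, which is why the old aligned_offset rounding could be dropped.

    #include <cstddef>
    #include <cstring>
    #include <memory_resource>

    // Standalone model of the trailing-pointer frame tag.
    void* tagged_allocate(std::pmr::memory_resource* mr, std::size_t size)
    {
        std::size_t total = size + sizeof(mr);
        void* raw = mr->allocate(total, alignof(std::max_align_t));
        // record which resource allocated this block, unaligned is fine
        std::memcpy(static_cast<char*>(raw) + size, &mr, sizeof(mr));
        return raw;
    }

    void tagged_deallocate(void* p, std::size_t size)
    {
        std::pmr::memory_resource* mr;
        std::memcpy(&mr, static_cast<char*>(p) + size, sizeof(mr));
        // frees through the resource that allocated, regardless of
        // which thread or TLS state performs the delete
        mr->deallocate(p, size + sizeof(mr), alignof(std::max_align_t));
    }

    int main()
    {
        void* p = tagged_allocate(std::pmr::new_delete_resource(), 200);
        tagged_deallocate(p, 200);
    }
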
diff --git a/include/boost/capy/ex/recycling_memory_resource.hpp b/include/boost/capy/ex/recycling_memory_resource.hpp
index 7d097b8e..604c538b 100644
--- a/include/boost/capy/ex/recycling_memory_resource.hpp
+++ b/include/boost/capy/ex/recycling_memory_resource.hpp
@@ -46,7 +46,11 @@ namespace capy {
     @see get_recycling_memory_resource
     @see run_async
 */
-class recycling_memory_resource : public std::pmr::memory_resource
+#ifdef _MSC_VER
+# pragma warning(push)
+# pragma warning(disable: 4275) // non dll-interface base class
+#endif
+class BOOST_CAPY_DECL recycling_memory_resource : public std::pmr::memory_resource
 {
     static constexpr std::size_t num_classes = 6;
     static constexpr std::size_t min_class_size = 64; // 2^6
@@ -111,15 +115,67 @@
         }
     };
 
-    BOOST_CAPY_DECL static pool& local() noexcept;
-    BOOST_CAPY_DECL static pool& global() noexcept;
-    BOOST_CAPY_DECL static std::mutex& global_mutex() noexcept;
+    static pool& local() noexcept
+    {
+        static thread_local pool p;
+        return p;
+    }
+
+    static pool& global() noexcept;
+    static std::mutex& global_mutex() noexcept;
+
+    void* allocate_slow(std::size_t rounded, std::size_t idx);
+    void deallocate_slow(void* p, std::size_t idx);
+
+public:
+    ~recycling_memory_resource();
+
+    /** Allocate without virtual dispatch.
+
+        Handles the fast path inline (thread-local bucket pop)
+        and falls through to the slow path for global pool or
+        heap allocation.
+    */
+    void*
+    allocate_fast(std::size_t bytes, std::size_t)
+    {
+        std::size_t rounded = round_up_pow2(bytes);
+        std::size_t idx = get_class_index(rounded);
+        if(idx >= num_classes)
+            return ::operator new(bytes);
+        auto& lp = local();
+        if(auto* p = lp.buckets[idx].pop())
+            return p;
+        return allocate_slow(rounded, idx);
+    }
+
+    /** Deallocate without virtual dispatch.
+
+        Handles the fast path inline (thread-local bucket push)
+        and falls through to the slow path for global pool or
+        heap deallocation.
+    */
+    void
+    deallocate_fast(void* p, std::size_t bytes, std::size_t)
+    {
+        std::size_t rounded = round_up_pow2(bytes);
+        std::size_t idx = get_class_index(rounded);
+        if(idx >= num_classes)
+        {
+            ::operator delete(p);
+            return;
+        }
+        auto& lp = local();
+        if(lp.buckets[idx].push(p))
+            return;
+        deallocate_slow(p, idx);
+    }
 
 protected:
-    BOOST_CAPY_DECL void*
+    void*
     do_allocate(std::size_t bytes, std::size_t) override;
 
-    BOOST_CAPY_DECL void
+    void
     do_deallocate(void* p, std::size_t bytes, std::size_t) override;
 
     bool
@@ -128,6 +184,9 @@
         return this == &other;
     }
 };
+#ifdef _MSC_VER
+# pragma warning(pop)
+#endif
 
 /** Returns pointer to the default recycling memory resource.
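Reviewer note: the size-class arithmetic that allocate_fast/deallocate_fast rely on is not shown in this diff. The model below is one plausible implementation consistent with the constants above (min_class_size = 64, i.e. 2^6, and num_classes = 6, giving classes 64..2048 bytes); the real private helpers may differ.

    #include <bit>
    #include <cstddef>

    // Plausible model of the private helpers, not the actual code.
    constexpr std::size_t round_up_pow2(std::size_t n)
    {
        return std::bit_ceil(n < 64 ? std::size_t(64) : n);
    }

    constexpr std::size_t get_class_index(std::size_t rounded)
    {
        // 64 -> 0, 128 -> 1, ..., 2048 -> 5; anything larger lands
        // past num_classes and takes the plain heap path
        return static_cast<std::size_t>(std::countr_zero(rounded)) - 6;
    }

    static_assert(get_class_index(round_up_pow2(1)) == 0);    // 64-byte class
    static_assert(get_class_index(round_up_pow2(200)) == 2);  // 256-byte class
    static_assert(get_class_index(round_up_pow2(4096)) >= 6); // heap fallback
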
diff --git a/include/boost/capy/ex/run_async.hpp b/include/boost/capy/ex/run_async.hpp
index 5af68ed3..7b44ca5c 100644
--- a/include/boost/capy/ex/run_async.hpp
+++ b/include/boost/capy/ex/run_async.hpp
@@ -22,6 +22,7 @@
 
 #include <coroutine>
 #include <cstddef>
+#include <cstring>
 #include <memory_resource>
 #include <new>
 #include <utility>
@@ -215,16 +216,16 @@ struct run_async_trampoline
     {
         auto total = size + sizeof(mr);
         void* raw = mr->allocate(total, alignof(std::max_align_t));
-        *reinterpret_cast<std::pmr::memory_resource**>(
-            static_cast<char*>(raw) + size) = mr;
+        std::memcpy(static_cast<char*>(raw) + size, &mr, sizeof(mr));
         return raw;
     }
 
     static void operator delete(void* ptr, std::size_t size)
     {
-        auto* mr = *reinterpret_cast<std::pmr::memory_resource**>(
-            static_cast<char*>(ptr) + size);
-        mr->deallocate(ptr, size + sizeof(mr), alignof(std::max_align_t));
+        std::pmr::memory_resource* mr;
+        std::memcpy(&mr, static_cast<char*>(ptr) + size, sizeof(mr));
+        auto total = size + sizeof(mr);
+        mr->deallocate(ptr, total, alignof(std::max_align_t));
     }
 
     std::pmr::memory_resource* get_resource() noexcept
diff --git a/src/ex/recycling_memory_resource.cpp b/src/ex/recycling_memory_resource.cpp
index ff8a3576..9246e72d 100644
--- a/src/ex/recycling_memory_resource.cpp
+++ b/src/ex/recycling_memory_resource.cpp
@@ -12,12 +12,7 @@
 namespace boost {
 namespace capy {
 
-recycling_memory_resource::pool&
-recycling_memory_resource::local() noexcept
-{
-    static thread_local pool p;
-    return p;
-}
+recycling_memory_resource::~recycling_memory_resource() = default;
 
 recycling_memory_resource::pool&
 recycling_memory_resource::global() noexcept
@@ -34,50 +29,41 @@ recycling_memory_resource::global_mutex() noexcept
 }
 
 void*
-recycling_memory_resource::do_allocate(std::size_t bytes, std::size_t)
+recycling_memory_resource::allocate_slow(
+    std::size_t rounded, std::size_t idx)
 {
-    std::size_t rounded = round_up_pow2(bytes);
-    std::size_t idx = get_class_index(rounded);
-
-    if(idx >= num_classes)
-        return ::operator new(bytes);
-
-    if(auto* p = local().buckets[idx].pop())
-        return p;
-
     {
         std::lock_guard<std::mutex> lock(global_mutex());
         if(auto* p = global().buckets[idx].pop(local().buckets[idx]))
             return p;
     }
-
     return ::operator new(rounded);
 }
 
 void
-recycling_memory_resource::do_deallocate(void* p, std::size_t bytes, std::size_t)
+recycling_memory_resource::deallocate_slow(
+    void* p, std::size_t idx)
 {
-    std::size_t rounded = round_up_pow2(bytes);
-    std::size_t idx = get_class_index(rounded);
-
-    if(idx >= num_classes)
-    {
-        ::operator delete(p);
-        return;
-    }
-
-    if(local().buckets[idx].push(p))
-        return;
-
     {
         std::lock_guard<std::mutex> lock(global_mutex());
         if(global().buckets[idx].push(p))
             return;
     }
-
     ::operator delete(p);
 }
 
+void*
+recycling_memory_resource::do_allocate(std::size_t bytes, std::size_t alignment)
+{
+    return allocate_fast(bytes, alignment);
+}
+
+void
+recycling_memory_resource::do_deallocate(void* p, std::size_t bytes, std::size_t alignment)
+{
+    deallocate_fast(p, bytes, alignment);
+}
+
 std::pmr::memory_resource*
 get_recycling_memory_resource() noexcept
 {
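Reviewer note: the last three files share one dispatch pattern, which is worth seeing in isolation. The model below is illustrative, not library code: a well-known singleton is compared by address, and on a match the call is routed through a non-virtual, inlinable fast path instead of the `do_allocate` virtual interface, while the virtual path simply forwards so both routes behave identically.

    #include <cstddef>
    #include <memory_resource>
    #include <new>

    // Reduced model of the dispatch-bypass pattern in this PR.
    struct fast_resource : std::pmr::memory_resource
    {
        // non-virtual and inlinable: the fast path
        void* allocate_fast(std::size_t n, std::size_t)
            { return ::operator new(n); }
        void deallocate_fast(void* p, std::size_t, std::size_t)
            { ::operator delete(p); }

        // virtual interface forwards, so pmr users see the same behavior
        void* do_allocate(std::size_t n, std::size_t a) override
            { return allocate_fast(n, a); }
        void do_deallocate(void* p, std::size_t n, std::size_t a) override
            { deallocate_fast(p, n, a); }
        bool do_is_equal(std::pmr::memory_resource const& o) const noexcept override
            { return this == &o; }
    };

    fast_resource* get_fast_resource()
    {
        static fast_resource r;
        return &r;
    }

    void* alloc(std::pmr::memory_resource* mr, std::size_t n)
    {
        // cache the well-known instance once; a pointer comparison then
        // licenses the static_cast and skips the virtual call entirely
        static auto* const fr = get_fast_resource();
        if(mr == fr)
            return static_cast<fast_resource*>(mr)
                ->allocate_fast(n, alignof(std::max_align_t));
        return mr->allocate(n, alignof(std::max_align_t));
    }

The identity check is what makes the static_cast safe: the exact dynamic type is known, so no virtual dispatch or dynamic_cast is needed on the hot path.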