feat(echo-cas): content-addressed blob store (Phase 1) #263
flyingrobots merged 4 commits into `main`
Conversation
Introduces `echo-cas`, a leaf crate providing the `BlobStore` trait and an in-memory implementation (`MemoryTier`) for content-addressed storage keyed by BLAKE3 hash. Sufficient for the in-browser website demo; disk/cold tiers and GC are deferred to Phase 3.
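As a rough picture of the intended call shape (a sketch only, assuming the names from the description above: `MemoryTier::new`, `put`, `get`, `pin`; the exact signatures live in the crate and may differ):

```rust
use echo_cas::{blob_hash, BlobStore, MemoryTier};

fn main() {
    let mut store = MemoryTier::new();

    // Store a blob; the returned key is its BLAKE3 hash.
    let hash = store.put(b"hello, echo");

    // Content addressing: identical bytes always map to the same key.
    assert_eq!(hash, blob_hash(b"hello, echo"));

    // Retrieval is keyed by hash; misses return None.
    let bytes = store.get(&hash).expect("blob was just stored");
    assert_eq!(&bytes[..], b"hello, echo");

    // Pinning marks a blob to survive GC (GC itself is Phase 3).
    store.pin(&hash);
}
```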
Walkthrough
A new `echo-cas` crate is added to the workspace, providing a content-addressed blob store: a 32-byte BLAKE3 `BlobHash` type, a `blob_hash()` helper, a `CasError` enum, a sync, object-safe `BlobStore` trait (`put` / `put_verified` / `get` / `has` / `pin` / `unpin`), and an in-memory `MemoryTier` implementation with optional byte budgeting and tests.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~23 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 12
🤖 Fix all issues with AI agents
In `@crates/echo-cas/Cargo.toml`:
- Line 19: Remove the dangling empty [dev-dependencies] header in Cargo.toml or
make it intentional: either delete the entire [dev-dependencies] section if
there are no dev dependencies, or populate it (e.g., add proptest/criterion) or
replace the header with a commented TODO (e.g., "# TODO: add dev-dependencies")
so the presence of the section is explicit; locate the orphaned
"[dev-dependencies]" token in the Cargo.toml and apply one of these options.
- Line 17: Update the thiserror dependency in Cargo.toml from the 1.x line to
the 2.x line: replace the dependency spec `thiserror = "1"` with `thiserror =
"2"` (or a caret/semver range like `^2`) and then run cargo update / cargo build
to verify proc-macro derives still compile for wasm32; if there is a documented
downstream blocker, add a comment next to the thiserror entry explaining why v2
cannot be used.
In `@crates/echo-cas/README.md`:
- Line 1: The README's SPDX header ("Apache-2.0 OR MIND-UCAL-1.0") conflicts
with the workspace Cargo.toml which uses license.workspace = true and the root
workspace license = "Apache-2.0"; fix by making the license declarations
consistent: either update the workspace/root Cargo.toml to include the
dual-license string "Apache-2.0 OR MIND-UCAL-1.0" (so all crates inherit the OR
license) or remove "MIND-UCAL-1.0" from the README SPDX header; check and update
the crate's Cargo.toml/license or the root workspace license field accordingly
to ensure Cargo.toml, license.workspace, and README.md match.
In `@crates/echo-cas/src/lib.rs`:
- Around line 60-62: Change BlobHash to hide its inner field (make the [u8; 32]
private) so callers cannot construct it directly; keep #[repr(transparent)] and
the derives, expose a single safe public constructor blob_hash(bytes: [u8;32])
(already present) for normal creation, add a from_bytes([u8;32]) constructor
intended for deserialization/wire use with a doc comment "caller asserts this
came from BLAKE3", and keep an as_bytes(&self) -> &[u8;32] accessor for
read-only access; if you intentionally allowed public construction, add a clear
doc warning on the BlobHash struct that constructing from non-BLAKE3 bytes is
legal but semantically meaningless. A sketch of this shape follows this list.
- Around line 130-132: The current trait method signature fn get(&self, hash:
&BlobHash) -> Option<Arc<[u8]>> bakes an Arc-backed in-memory representation
into the trait contract; add a FIXME comment above the get signature (reference
symbols: get, BlobHash, Arc<[u8]>) stating FIXME(phase3): avoid forcing
Arc<[u8]> — consider returning bytes::Bytes or using an associated type (e.g.,
type Blob: AsRef<[u8]>) or another zero-copy representation in Phase 3, so
future DiskTier/ColdTier implementations can choose mmap/Bytes/Cow/streams
instead of allocating an Arc (a sketch follows this list).
- Around line 71-78: The Display impl for BlobHash (fn fmt on BlobHash) does 32
separate write! calls via the Formatter; instead, build the full hex string
first and do a single write to the Formatter. In fmt, allocate a String with
capacity self.0.len()*2, use std::fmt::Write::write_fmt or write! into that
String in the loop to append each byte as "{:02x}", then call
f.write_str(&hex_string) once; this keeps the existing Display behavior but
reduces Formatter invocations while remaining safe (no unsafe code); see the sketch after this list.
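Taken together, the `BlobHash` encapsulation and the `Display` rewrite might look like the sketch below. This is a minimal illustration assuming the derives shown; the crate's actual derive list and doc text may differ.

```rust
use std::fmt::{self, Write as _};

/// 32-byte BLAKE3 content hash.
///
/// The inner field is private: construct via `blob_hash()`, or via
/// `from_bytes()` when deserializing.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct BlobHash([u8; 32]);

impl BlobHash {
    /// Constructor for deserialization/wire use: the caller asserts
    /// these bytes came from BLAKE3.
    pub fn from_bytes(bytes: [u8; 32]) -> Self {
        Self(bytes)
    }

    /// Read-only access to the raw hash bytes.
    pub fn as_bytes(&self) -> &[u8; 32] {
        &self.0
    }
}

impl fmt::Display for BlobHash {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Build the full hex string once, then hand it to the Formatter
        // in a single call instead of 32 separate write!s.
        let mut hex = String::with_capacity(self.0.len() * 2);
        for byte in self.0 {
            write!(hex, "{byte:02x}")?;
        }
        f.write_str(&hex)
    }
}
```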
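The `get` comment is purely advisory; roughly (trait shape assumed from the review, using the `BlobHash` from the previous sketch, not copied from the crate):

```rust
use std::sync::Arc;

pub trait BlobStore {
    // FIXME(phase3): avoid forcing Arc<[u8]>. Consider bytes::Bytes or an
    // associated `type Blob: AsRef<[u8]>` so future DiskTier/ColdTier
    // implementations can choose mmap/Bytes/Cow/streams instead of
    // allocating an Arc.
    fn get(&self, hash: &BlobHash) -> Option<Arc<[u8]>>;

    // put / put_verified / has / pin / unpin elided.
}
```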
In `@crates/echo-cas/src/memory.rs`:
- Around line 223-234: Add a unit test that verifies pin idempotence: create a
MemoryTier via MemoryTier::new(), put some data with put(...) to get a hash,
call pin(&hash) twice, assert pinned_count() == 1, then unpin(&hash) and assert
is_pinned(&hash) is false; this ensures pin uses set semantics rather than a
reference count (refer to MemoryTier::new, put, pin, pinned_count, unpin,
is_pinned). A test sketch follows this list.
- Around line 154-168: Add a new unit test that exercises the happy path and
idempotence of MemoryTier::put_verified: create a MemoryTier via
MemoryTier::new, compute the correct hash for some test data using blob_hash,
call put_verified(hash, data) and assert it returns Ok, then assert store.len()
== 1, store.byte_count() == data.len(), and that store.get(&hash) returns the
stored bytes; finally call put_verified again with the same hash/data and assert
it still returns Ok and that len() and byte_count() have not increased
(idempotence). Ensure the test is named (e.g., put_verified_happy_path) and
mirrors the style of existing tests (sketched after this list).
- Around line 27-32: Add a doc comment on MemoryTier explaining that byte_count
is a monotonically increasing tally of stored bytes for Phase 1 (no GC) and
therefore is_over_budget() will remain true once exceeded until explicit blob
removals are implemented in Phase 3; when implementing remove/evict in the
future you must decrement byte_count by the removed blob's length (compute via
the Arc<[u8]>::len() at removal time) so the budget can recover, and update the
is_over_budget() doc to state this invariant and recovery condition. A doc sketch appears after this list.
- Around line 98-106: The put method currently calls blob_hash then probes the
blobs map twice via contains_key and insert; change both put and put_verified to
use the HashMap entry API (e.g., blobs.entry(hash)) to perform a single lookup:
on Vacant(entry) insert Arc::from(bytes) and increment byte_count only when you
actually insert, and on Occupied return the existing hash without modifying
byte_count; apply the same entry-based logic in put_verified to avoid the double
lookup and ensure byte_count stays correct.
- Around line 108-118: The put_verified function currently does a double lookup
by calling self.blobs.contains_key(&computed) then insert, which can be replaced
with the HashMap entry API to do a single lookup and update byte_count only when
a new blob is actually inserted. Change put_verified to use
self.blobs.entry(computed) and on Vacant insert Arc::from(bytes) while
incrementing self.byte_count only for the new insertion (avoid incrementing when
the entry is Occupied); reference function name put_verified and map self.blobs
to locate the code to update.
- Around line 86-88: The is_over_budget method currently uses strict >
(self.max_bytes.is_some_and(|max| self.byte_count > max)), so a stored
byte_count equal to max_bytes is considered within budget; update the API docs
to explicitly state the budget is inclusive (i.e., exactly max_bytes is NOT over
budget) and mention the related symbols is_over_budget, with_limits, max_bytes,
and byte_count so callers understand the off-by-one behavior; if you prefer
exclusive semantics instead, change the comparison to >= in is_over_budget and
update tests/docs accordingly (see the doc sketch after this list).
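As concrete shapes for the two test asks above, something like the following could work. This is a sketch mirroring the file's existing test module; it assumes the accessor names mentioned in the prompts (`pinned_count`, `is_pinned`, `len`, `byte_count`) and a `put_verified(hash, bytes) -> Result<(), CasError>` signature:

```rust
#[cfg(test)]
mod suggested_tests {
    use super::*;

    #[test]
    fn pin_is_idempotent() {
        let mut store = MemoryTier::new();
        let hash = store.put(b"pinned blob");

        // Pinning twice must not double-count: set semantics, not refcount.
        store.pin(&hash);
        store.pin(&hash);
        assert_eq!(store.pinned_count(), 1);

        // A single unpin fully releases the pin.
        store.unpin(&hash);
        assert!(!store.is_pinned(&hash));
    }

    #[test]
    fn put_verified_happy_path() {
        let mut store = MemoryTier::new();
        let data = b"verified blob";
        let hash = blob_hash(data);

        assert!(store.put_verified(hash, data).is_ok());
        assert_eq!(store.len(), 1);
        assert_eq!(store.byte_count(), data.len());
        assert_eq!(store.get(&hash).as_deref(), Some(&data[..]));

        // Second insert of identical content is a no-op (idempotence).
        assert!(store.put_verified(hash, data).is_ok());
        assert_eq!(store.len(), 1);
        assert_eq!(store.byte_count(), data.len());
    }
}
```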
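And the budget/byte-count documentation asks could be folded into the method docs roughly like this (field names taken from the prompts):

```rust
impl MemoryTier {
    /// Returns true once stored bytes exceed the configured budget.
    ///
    /// A `byte_count` exactly equal to `max_bytes` is NOT over budget
    /// (the comparison is strict `>`). Note that in Phase 1 `byte_count`
    /// only ever grows (no removal/GC until Phase 3), so once this
    /// returns true it stays true; any future remove/evict must
    /// decrement `byte_count` by the removed blob's length for the
    /// budget to recover.
    pub fn is_over_budget(&self) -> bool {
        self.max_bytes.is_some_and(|max| self.byte_count > max)
    }
}
```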
```diff
@@ -0,0 +1,6 @@
+<!-- SPDX-License-Identifier: Apache-2.0 OR MIND-UCAL-1.0 -->
```
License mismatch: README declares Apache-2.0 OR MIND-UCAL-1.0 but Cargo.toml inherits workspace license Apache-2.0 only.
This is a compliance inconsistency. Cargo.toml → license.workspace = true → root workspace declares license = "Apache-2.0". This README introduces a second license (MIND-UCAL-1.0) that doesn't appear anywhere in the manifest chain. Either:
- Update the workspace/crate license field to match the dual-license, or
- Drop `MIND-UCAL-1.0` from the README header.
Shipping contradictory license declarations is a legal landmine.
```rust
fn put(&mut self, bytes: &[u8]) -> BlobHash {
    let hash = blob_hash(bytes);
    if self.blobs.contains_key(&hash) {
        return hash;
    }
    self.byte_count += bytes.len();
    self.blobs.insert(hash, Arc::from(bytes));
    hash
}
```
🧹 Nitpick | 🔵 Trivial
`put` computes the hash, then probes the map twice on the cold path.
`contains_key` followed by `insert` is two lookups. Use the entry API:
Proposed fix:

```diff
 fn put(&mut self, bytes: &[u8]) -> BlobHash {
     let hash = blob_hash(bytes);
-    if self.blobs.contains_key(&hash) {
-        return hash;
-    }
-    self.byte_count += bytes.len();
-    self.blobs.insert(hash, Arc::from(bytes));
+    if let std::collections::hash_map::Entry::Vacant(e) = self.blobs.entry(hash) {
+        self.byte_count += bytes.len();
+        e.insert(Arc::from(bytes));
+    }
     hash
 }
```

The same applies to `put_verified` (lines 113-115).
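For reference, the `put_verified` analogue of that entry-API fix might look like the following sketch (the signature and a unit-like `CasError::HashMismatch` variant are assumed from the surrounding review):

```rust
fn put_verified(&mut self, expected: BlobHash, bytes: &[u8]) -> Result<(), CasError> {
    let computed = blob_hash(bytes);
    if computed != expected {
        return Err(CasError::HashMismatch);
    }
    // Single map probe: insert only if vacant, and only then grow byte_count.
    if let std::collections::hash_map::Entry::Vacant(e) = self.blobs.entry(computed) {
        self.byte_count += bytes.len();
        e.insert(Arc::from(bytes));
    }
    Ok(())
}
```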
- Upgrade thiserror 1 → 2 (leaf crate, no compatibility concern)
- Add put_verified happy-path test (store, retrieve, idempotence)
- Add pin idempotence test (set semantics: double-pin keeps count at 1)
- Document is_over_budget budget semantics (exclusive) and byte_count monotonic invariant (no removal until Phase 3 GC)
- Remove empty [dev-dependencies] section
- Update bytes 1.11.0 → 1.11.1 (RUSTSEC-2026-0007, workspace-wide)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@crates/echo-cas/src/memory.rs`:
- Around line 286-296: Add a test that asserts the critical boundary where
stored bytes equal the configured limit remains within budget: create a
MemoryTier via MemoryTier::with_limits(n), call put to store exactly n bytes,
assert !store.is_over_budget(), then put one more byte and assert
store.is_over_budget(); reference the existing with_limits, put, and
is_over_budget methods so the test will catch off-by-one regressions (e.g., `>`
vs `>=`) and ensure the budget semantics remain exclusive. A sketch of this test follows this list.
- Around line 110-120: In put_verified, avoid computing blob_hash(bytes) when
the expected blob is already present: first check
self.blobs.contains_key(&expected) and return Ok(()) if found; only then compute
let computed = blob_hash(bytes) and compare to expected, returning
CasError::HashMismatch on mismatch; if absent, insert into self.blobs and update
self.byte_count — this also removes the redundant second contains_key probe
before insert. Use the existing symbols put_verified, expected, blob_hash,
self.blobs, and self.byte_count to locate and change the logic (sketched below, after the boundary test).
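A sketch of that boundary test, assuming `with_limits` takes the byte budget directly:

```rust
#[test]
fn budget_boundary_is_exclusive() {
    // Budget of exactly 4 bytes.
    let mut store = MemoryTier::with_limits(4);

    // Exactly at the limit: still within budget (strict `>`).
    store.put(b"abcd");
    assert!(!store.is_over_budget());

    // One more byte tips it over.
    store.put(b"e");
    assert!(store.is_over_budget());
}
```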
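And the reordered `put_verified` from the second comment, which skips hashing entirely when the blob is already present (same assumed signature as the earlier sketch):

```rust
fn put_verified(&mut self, expected: BlobHash, bytes: &[u8]) -> Result<(), CasError> {
    // Fast path: the blob is already stored under the expected hash, so
    // there is nothing to verify or insert; skip the BLAKE3 work entirely.
    if self.blobs.contains_key(&expected) {
        return Ok(());
    }
    let computed = blob_hash(bytes);
    if computed != expected {
        return Err(CasError::HashMismatch);
    }
    // Known absent at this point, so insert directly (no second probe).
    self.byte_count += bytes.len();
    self.blobs.insert(computed, Arc::from(bytes));
    Ok(())
}
```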
- Make BlobHash inner field private; add from_bytes() for deserialization
- Optimize Display impl: fold into pre-allocated String, single write_str
- Add FIXME(phase3) on get() return type (Arc<[u8]> coupling)
- Use entry API in put() to eliminate double map lookup
- Optimize put_verified() to skip hashing when blob already stored
- Add budget boundary test for exactly-at-limit off-by-one protection
Summary
- `echo-cas` leaf crate with `BlobStore` trait and `MemoryTier` implementation (`wasm32-unknown-unknown` target checks clean)

What's deferred

- `serde` derives, async methods, iteration API

Test plan

- `cargo check -p echo-cas`
- `cargo test -p echo-cas` — 14/14 pass
- `cargo clippy -p echo-cas -- -D warnings` — zero warnings
- `cargo check -p echo-cas --target wasm32-unknown-unknown`
- `cargo doc -p echo-cas --no-deps`