Skip to content

Basic CLI#1

Draft
gaurav wants to merge 32 commits intomainfrom
basic-implementation-in-uv
Draft

Basic CLI#1
gaurav wants to merge 32 commits intomainfrom
basic-implementation-in-uv

Conversation

@gaurav
Copy link
Collaborator

@gaurav gaurav commented Dec 3, 2025

This PR implements a basic CLI for some of the functionality we're targeting.

WIP

gaurav and others added 19 commits December 2, 2025 15:38
- Add IdentifierRecord dataclass to babel_xrefs.py (resolves TODO)
- Add 89 tests across 3 files: test_downloader (26), test_babel_xrefs (31), test_nodenorm (23)
- Unit tests (71) use mocks and run without network; integration tests (18) use real downloads/APIs
- Add session-scoped fixtures in conftest.py for shared Parquet file downloads
- Parametrize integration tests over tests/data/valid_curies.txt for easy expansion
- Add integration and slow pytest markers to pyproject.toml
- Update CLAUDE.md and README.md with testing documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request implements a basic version of babel-explorer in Python using the uv package manager. It's a tool for querying Babel intermediate files to understand why biological/chemical identifiers are considered equivalent. The implementation includes a downloader for large Parquet files with MD5 validation and resume support, NodeNorm API integration for label enrichment, DuckDB-based cross-reference querying, and a Click-based CLI.

Changes:

  • Initial project structure with uv-based package management (pyproject.toml, Python 3.11+)
  • Core functionality: BabelDownloader with streaming downloads and MD5 validation, NodeNorm API client with LRU caching, BabelXRefs for DuckDB-based Parquet queries
  • CLI with three commands: xrefs, ids, and test-concord
  • Comprehensive test suite with 80 tests split between unit tests (mocked) and integration tests (real network calls)

Reviewed changes

Copilot reviewed 15 out of 19 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
pyproject.toml Project configuration with dependencies (click, duckdb, requests, tqdm) and pytest markers
.python-version Specifies Python 3.11 requirement
.gitignore Excludes /data directory for downloaded files
README.md User documentation with setup, usage examples, and testing instructions
CLAUDE.md AI assistant guidance documentation (contains outdated wget reference)
src/babel_explorer/cli.py Click-based CLI with xrefs, ids, and test-concord commands
src/babel_explorer/core/downloader.py Streaming file downloader with MD5 validation and resume capability
src/babel_explorer/core/nodenorm.py NodeNorm API client for identifier normalization
src/babel_explorer/core/babel_xrefs.py DuckDB-based cross-reference query engine (has frozen dataclass bug)
tests/conftest.py Session-scoped pytest fixtures for shared test resources
tests/constants.py Shared test constants and CURIE loader utility
tests/data/valid_curies.txt Parametrized test data (one CURIE)
tests/test_downloader.py 26 tests for BabelDownloader (22 unit, 3 integration, 1 slow)
tests/test_nodenorm.py 23 tests for NodeNorm (18 unit, 5 integration)
tests/test_babel_xrefs.py 31 tests for BabelXRefs (22 unit, 8 integration, 1 slow)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +60 to +64
return result[curie]

@functools.lru_cache(maxsize=None)
def get_clique_identifiers(self, curie, **kwargs):
result = self.normalize_curie(curie, **kwargs)
Copy link

Copilot AI Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The normalize_curie method can raise a KeyError if the API response doesn't contain the requested curie as a key. This happens when the curie is not found or invalid. Based on the test at line 295 in tests/test_nodenorm.py, the expected behavior is to return None for nonexistent CURIEs. Add error handling to catch KeyError and return None instead.

Suggested change
return result[curie]
@functools.lru_cache(maxsize=None)
def get_clique_identifiers(self, curie, **kwargs):
result = self.normalize_curie(curie, **kwargs)
try:
return result[curie]
except KeyError:
logging.debug(f"NodeNorm response did not contain CURIE {curie!r}; returning None")
return None
@functools.lru_cache(maxsize=None)
def get_clique_identifiers(self, curie, **kwargs):
result = self.normalize_curie(curie, **kwargs)
if not result:
return None

Copilot uses AI. Check for mistakes.
Comment on lines +22 to +36
if local_path is None:
# Default to using TMPDIR.
# TODO: replace with a real temporary directory.
tmpdir = os.environ.get("TMPDIR")
if tmpdir:
local_path = tmpdir

# Make sure the local path is an existing directory or that we can create it.
if not os.path.exists(local_path):
os.makedirs(local_path, exist_ok=True)
self.local_path = local_path
elif os.path.exists(local_path) and os.path.isdir(local_path):
self.local_path = local_path
else:
raise ValueError(f"Invalid local_path (must be an existing directory): '{local_path}'")
Copy link

Copilot AI Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If local_path is None and TMPDIR is not set in the environment, local_path remains None. This will cause the os.path.exists(local_path) check on line 30 to fail with a TypeError. Add a fallback to a default temporary directory or raise a clear error message if no path can be determined.

Copilot uses AI. Check for mistakes.
Comment on lines +76 to +82
for curie in curies:
identifiers = nodenorm.get_clique_identifiers(curie)
for identifier in identifiers:
if identifier.label:
print(f"{curie}\t{identifier.curie}\t{identifier.label}\t{identifier.biolink_type}")
else:
print(f"{curie}\t{identifier.curie}\t\t{identifier.biolink_type}")
Copy link

Copilot AI Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test_concord command can fail with an AttributeError if nodenorm.get_clique_identifiers returns None (which happens when the API response doesn't contain 'equivalent_identifiers'). Add a check to handle None return values before iterating.

Copilot uses AI. Check for mistakes.
Comment on lines +44 to +59
def _calculate_md5(self, file_path, chunk_size=1024*1024):
"""
Calculate MD5 checksum of a file.

Args:
file_path: Path to the file to checksum
chunk_size: Size of chunks to read (default 1MB)

Returns:
str: Hexadecimal MD5 checksum
"""
md5_hash = hashlib.md5()
with open(file_path, 'rb') as f:
for chunk in iter(lambda: f.read(chunk_size), b''):
md5_hash.update(chunk)
return md5_hash.hexdigest()
Copy link

Copilot AI Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MD5 is used for checksum validation but is cryptographically weak. While MD5 is acceptable for detecting accidental file corruption, it's vulnerable to intentional tampering. For better security, consider upgrading to SHA-256. If MD5 must be used for compatibility with existing .md5 files, document this limitation.

Copilot uses AI. Check for mistakes.
Comment on lines +39 to +42
self.subj_label = subj_label
self.subj_biolink_type = subj_biolink_type
self.obj_label = obj_label
self.obj_biolink_type = obj_biolink_type
Copy link

Copilot AI Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LabeledCrossReference extends the frozen dataclass CrossReference and attempts to add new attributes by assigning them in init. This violates the frozen=True constraint and will raise a FrozenInstanceError at runtime. The class should be defined as a frozen dataclass with all fields declared, or CrossReference should not be frozen.

Suggested change
self.subj_label = subj_label
self.subj_biolink_type = subj_biolink_type
self.obj_label = obj_label
self.obj_biolink_type = obj_biolink_type
object.__setattr__(self, "subj_label", subj_label)
object.__setattr__(self, "subj_biolink_type", subj_biolink_type)
object.__setattr__(self, "obj_label", obj_label)
object.__setattr__(self, "obj_biolink_type", obj_biolink_type)

Copilot uses AI. Check for mistakes.
gaurav and others added 6 commits March 2, 2026 17:35
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Remove _calculate_md5/_fetch_remote_md5 (too slow on 2.5-3.9 GB files)
- Add sidecar .meta JSON files (ETag, Last-Modified, Content-Length, last_checked)
- Three-tier logic: freshness window → HEAD/ETag check → full re-download
- Add freshness_seconds param to BabelDownloader (default 3h)
- Add --check-download CLI option to xrefs and ids commands (e.g. 3h, never)
- Update tests: replace MD5 test classes with meta/ETag/tier coverage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gaurav gaurav changed the title Basic implementation in uv Basic CLI Mar 3, 2026
gaurav and others added 7 commits March 3, 2026 14:38
- Add pytest-xdist[psutil] and filelock to dev dependencies
- Enable parallel execution by default with addopts = "-n auto"
- Switch DuckDB connections to in-memory mode (duckdb.connect()) to
  eliminate file locking that would deadlock parallel workers
- Make test_data_dir teardown worker-aware (only gw0 cleans up)
- Wrap download fixtures with FileLock to serialize concurrent downloads
- Fix test_babel_xrefs.py: update expand= to recurse= to match renamed param

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The recurse=True path previously issued one DuckDB query per CURIE and
called itself recursively (O(diameter) queries, Python stack growth).
It now delegates to _get_curie_xrefs_recursive, which traverses the
full connected component in a single SQL query using WITH RECURSIVE.

A bidirectional `edges` CTE (subj→obj and obj→subj) collapses the two
traversal directions into one recursive arm; UNION (not UNION ALL)
provides automatic cycle detection. ignore_curies_in_expansion is now
a no-op on the recurse=True path and emits a DeprecationWarning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When pytest-xdist runs 8 workers, each worker session ends independently.
gw0 was deleting data/test/ as soon as it finished its own tests, but
other workers were still reading Concord.parquet.  This caused sporadic
IOException failures on any test that opened a fresh DuckDB connection
(e.g. _get_curie_xrefs_recursive) after gw0's teardown deleted the file.

Fix: only delete the shared test data directory in a sequential
(non-xdist) run where worker_id == "master".  In parallel runs the
directory persists; BabelDownloader's freshness-window logic re-validates
or re-downloads the files on the next run as needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants