Feat/huggingface backend #1665

Open
pmady wants to merge 17 commits into dragonflyoss:main from pmady:feat/huggingface-backend

pmady commented Feb 7, 2026

  • Feature

What does this PR do?

This PR adds support for downloading files from Hugging Face Hub repositories using the `hf://` URL scheme, addressing issue dragonflyoss/dragonfly#4419.

Features

  • New `hf://` URL scheme: Download models, datasets, and spaces from Hugging Face Hub
  • URL format: `hf://[<repo_type>/]<owner>/<repo>[/<path>][@<revision>]`
  • Authentication: Support via `--hf-token` flag or `HF_TOKEN` environment variable
  • Git LFS support: Handles large model files through Hugging Face HTTP API
  • Repository listing: Supports recursive downloads with `-r` flag
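The URL format above can be sketched as a small parser. This is a minimal illustration, not the PR's implementation: the type names `ParsedHfUrl` and `RepoType` appear in later commits of this PR, but the field names, the spelling of the type prefixes (`models`, `datasets`, `spaces`), the default revision `main`, and the rule that the revision follows the last `@` are all assumptions made for the example.

```rust
use std::convert::TryFrom;

/// Repository kinds named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RepoType {
    Model,
    Dataset,
    Space,
}

/// Parsed form of an `hf://` URL. Field names are hypothetical.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParsedHfUrl {
    pub repo_type: RepoType,
    pub owner: String,
    pub repo: String,
    /// File path inside the repository; `None` means the whole repository.
    pub path: Option<String>,
    pub revision: String,
}

impl TryFrom<&str> for ParsedHfUrl {
    type Error = String;

    fn try_from(url: &str) -> Result<Self, Self::Error> {
        let rest = url
            .strip_prefix("hf://")
            .ok_or_else(|| format!("unsupported scheme in {url}"))?;

        // Assumption: the revision is whatever follows the last '@',
        // and "main" is used when no revision is given.
        let (location, revision) = match rest.rsplit_once('@') {
            Some((location, revision)) if !revision.is_empty() => (location, revision),
            Some(_) => return Err(format!("empty revision in {url}")),
            None => (rest, "main"),
        };

        let mut segments: Vec<&str> = location.split('/').filter(|s| !s.is_empty()).collect();

        // Assumption: the optional type prefix is spelled like the Hub's own URL paths.
        let repo_type = match segments.first().copied() {
            Some("models") => {
                segments.remove(0);
                RepoType::Model
            }
            Some("datasets") => {
                segments.remove(0);
                RepoType::Dataset
            }
            Some("spaces") => {
                segments.remove(0);
                RepoType::Space
            }
            _ => RepoType::Model,
        };

        // Owner and repository are always required after the optional prefix.
        if segments.len() < 2 {
            return Err(format!("expected owner/repo in {url}"));
        }

        let path = if segments.len() > 2 {
            Some(segments[2..].join("/"))
        } else {
            None
        };

        Ok(ParsedHfUrl {
            repo_type,
            owner: segments[0].to_string(),
            repo: segments[1].to_string(),
            path,
            revision: revision.to_string(),
        })
    }
}
```

With these assumptions, `hf://deepseek-ai/DeepSeek-OCR/model.safetensors` parses as a model repository with path `model.safetensors` at revision `main`, and `hf://owner` is rejected because the repository segment is missing.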

Usage Examples

# Download a single file
dfget hf://deepseek-ai/DeepSeek-OCR/model.safetensors -O /tmp/model.safetensors

# Download entire repository
dfget hf://deepseek-ai/DeepSeek-OCR -O /tmp/DeepSeek-OCR/ -r

# With authentication for private repos
dfget hf://owner/private-repo/model.bin -O /tmp/model.bin --hf-token=<token>

Changes

  1. `dragonfly-client-backend/src/huggingface.rs` (new): Hugging Face backend implementation
  2. `dragonfly-client-backend/src/lib.rs`: Register `hf` backend in BackendFactory
  3. `dragonfly-client-backend/Cargo.toml`: Add serde dependencies
  4. `dragonfly-client/src/bin/dfget/main.rs`: Add `--hf-token` CLI argument and examples

Related Issues

Closes dragonflyoss/dragonfly#4419

Checklist

  • Code follows project style guidelines
  • Tests added/updated (46 tests pass)
  • Documentation updated (CLI help)
  • Commits are signed off

pmady added 4 commits February 7, 2026 16:42
Implement a new backend for downloading files from Hugging Face Hub
repositories using the hf:// URL scheme.

Features:
- Support for models, datasets, and spaces repositories
- URL parsing with revision/branch support (e.g., hf://owner/repo@v1.0)
- Authentication via HF_TOKEN environment variable
- Git LFS file support for large model files
- Repository listing for recursive downloads

Signed-off-by: pmady <pmady@users.noreply.github.com>
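
The commit above mentions fetching files, including Git LFS files, through the Hub's HTTP API. As a rough sketch of what that mapping can look like, the function below turns the parts of an `hf://` URL into the Hub's `resolve` download URL. The function name `resolve_url` and its signature are hypothetical, and whether the PR's backend builds its URLs exactly this way is an assumption. The sketch also skips percent-encoding, so revisions or paths with special characters would need extra handling.

```rust
/// Repository kinds supported by the `hf://` scheme.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RepoType {
    Model,
    Dataset,
    Space,
}

/// Builds the HTTPS download URL for one file in a Hub repository.
/// Hypothetical helper: models live at the root, while datasets and
/// spaces carry a path prefix.
pub fn resolve_url(
    repo_type: RepoType,
    owner: &str,
    repo: &str,
    revision: &str,
    path: &str,
) -> String {
    let prefix = match repo_type {
        RepoType::Model => "",
        RepoType::Dataset => "datasets/",
        RepoType::Space => "spaces/",
    };
    // No percent-encoding is applied in this sketch.
    format!("https://huggingface.co/{prefix}{owner}/{repo}/resolve/{revision}/{path}")
}
```

For example, `resolve_url(RepoType::Model, "deepseek-ai", "DeepSeek-OCR", "main", "model.safetensors")` yields `https://huggingface.co/deepseek-ai/DeepSeek-OCR/resolve/main/model.safetensors`.
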
Register the hf:// scheme backend in load_builtin_backends() and update
tests to include the new backend in expected backends list.

Signed-off-by: pmady <pmady@users.noreply.github.com>
Add serde and serde_json workspace dependencies required for parsing
Hugging Face API responses.

Signed-off-by: pmady <pmady@users.noreply.github.com>
Add --hf-token argument for Hugging Face authentication and include
usage examples in the CLI help documentation.

Examples added:
- Download single file: dfget hf://owner/repo/path -O /tmp/file
- Download repository: dfget hf://owner/repo -O /tmp/repo/ -r
- With authentication: dfget hf://... --hf-token=<token>

Signed-off-by: pmady <pmady@users.noreply.github.com>

codecov bot commented Feb 7, 2026

Codecov Report

❌ Patch coverage is 45.70637% with 392 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.66%. Comparing base (b71b4b5) to head (be0ef12).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| dragonfly-client-backend/src/hugging_face.rs | 39.96% | 344 Missing ⚠️ |
| dragonfly-client/src/bin/dfget/main.rs | 64.63% | 29 Missing ⚠️ |
| dragonfly-client/src/resource/persistent_task.rs | 0.00% | 6 Missing ⚠️ |
| dragonfly-client/src/resource/task.rs | 0.00% | 6 Missing ⚠️ |
| dragonfly-client-backend/src/object_storage.rs | 79.16% | 5 Missing ⚠️ |
| dragonfly-client-backend/src/hdfs.rs | 0.00% | 1 Missing ⚠️ |
| dragonfly-client/src/proxy/mod.rs | 0.00% | 1 Missing ⚠️ |

@@            Coverage Diff             @@
##             main    #1665      +/-   ##
==========================================
- Coverage   50.86%   50.66%   -0.21%     
==========================================
  Files          84       85       +1     
  Lines       20523    21196     +673     
==========================================
+ Hits        10439    10738     +299     
- Misses      10084    10458     +374     
| Files with missing lines | Coverage Δ |
|---|---|
| dragonfly-client-backend/src/http.rs | 97.68% <100.00%> (+0.05%) ⬆️ |
| dragonfly-client-backend/src/lib.rs | 96.10% <100.00%> (+0.06%) ⬆️ |
| dragonfly-client-core/src/error/mod.rs | 59.09% <ø> (ø) |
| dragonfly-client/src/announcer/mod.rs | 0.00% <ø> (ø) |
| dragonfly-client/src/grpc/dfdaemon_download.rs | 4.67% <ø> (ø) |
| dragonfly-client/src/grpc/dfdaemon_upload.rs | 0.00% <ø> (ø) |
| dragonfly-client/src/resource/piece.rs | 57.94% <ø> (ø) |
| dragonfly-client-backend/src/hdfs.rs | 40.90% <0.00%> (ø) |
| dragonfly-client/src/proxy/mod.rs | 0.00% <0.00%> (ø) |
| dragonfly-client-backend/src/object_storage.rs | 88.62% <79.16%> (ø) |

... and 4 more

Apply cargo fmt to fix formatting in huggingface.rs and lib.rs.

Signed-off-by: pmady <pmady@users.noreply.github.com>

pmady commented Feb 7, 2026

Hi maintainers, could you please add the enhancement label to this PR? The PR Label check requires one of: bug, enhancement, documentation, or dependencies. Thank you!

gaius-qi added the enhancement label Feb 9, 2026

gaius-qi commented Feb 9, 2026


Copilot AI left a comment


Pull request overview

Adds a new hf:// backend so dfdaemon/dfget can download from Hugging Face Hub repositories (models/datasets/spaces), including repo listing for recursive downloads, and introduces a CLI flag intended for HF authentication.

Changes:

  • Add huggingface backend implementation and register scheme hf in BackendFactory.
  • Extend dfget CLI help/examples and add --hf-token argument.
  • Add serde/serde_json deps for HF API response parsing.

Reviewed changes

Copilot reviewed 3 out of 5 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| dragonfly-client/src/bin/dfget/main.rs | Adds HF usage examples and --hf-token CLI option. |
| dragonfly-client-backend/src/lib.rs | Registers the new hf backend and updates backend factory tests. |
| dragonfly-client-backend/src/huggingface.rs | Implements HF backend: URL parsing, stat/list/get/exists, plus unit tests. |
| dragonfly-client-backend/Cargo.toml | Adds serde dependencies needed by the new backend. |
| Cargo.lock | Locks new transitive deps from serde additions. |


pmady added 3 commits February 9, 2026 10:24
- Update copyright year to 2026
- Remove environment variable fallback for HF_TOKEN, keep only CLI option
- Implement ParsedHfUrl with TryFrom<Url> and TryFrom<&str> traits
- Make ParsedHfUrl and RepoType public structs
- Update tests to use TryFrom pattern

Signed-off-by: pmady <pmady@users.noreply.github.com>
The HF backend was instantiated with HuggingFace::new() at startup,
making the --hf-token CLI flag ineffective since the token was stored
on the struct but never received from dfget.

Changes:
- dfget: inject --hf-token as Authorization header into request_header
  so it flows through gRPC to dfdaemon and into the backend
- HF backend: remove stored token field, read auth from request
  http_header instead via build_headers() method
- Remove new_with_token() constructor since it is no longer needed

Signed-off-by: pmady <pmady@users.noreply.github.com>
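
The header-based flow described in the commit above can be sketched roughly as follows: the CLI side turns the token into an `Authorization` header, and the backend side reads it back from the request headers. The function names `inject_hf_token` and `bearer_token` and the use of a plain `HashMap` for headers are hypothetical choices for the example. The `Bearer` scheme is what the Hub expects for access tokens. Note that a later commit in this PR replaces header-based auth with a dedicated proto field.

```rust
use std::collections::HashMap;

/// CLI side (hypothetical helper): store the token as a bearer
/// Authorization header so it travels with the download request.
pub fn inject_hf_token(headers: &mut HashMap<String, String>, token: Option<&str>) {
    // Skip missing or empty tokens so anonymous downloads stay untouched.
    if let Some(token) = token.filter(|t| !t.is_empty()) {
        headers.insert("Authorization".to_string(), format!("Bearer {token}"));
    }
}

/// Backend side (hypothetical helper): recover the token from the
/// request headers, matching the header name case-insensitively.
pub fn bearer_token(headers: &HashMap<String, String>) -> Option<&str> {
    headers
        .iter()
        .find(|(name, _)| name.eq_ignore_ascii_case("authorization"))
        .and_then(|(_, value)| value.strip_prefix("Bearer "))
}
```

Because the token rides on each request rather than living on the backend struct, a backend created once at startup can still serve requests that carry different tokens.
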
- Fix URL parsing: remove redundant early-return branch, always require
  owner/repo (two segments) after optional type prefix
- Fix list_files to return hf:// URLs instead of https:// so downstream
  downloads continue using the HF backend (preserving auth and semantics)
- Use versioned DEFAULT_USER_AGENT matching the HTTP backend pattern
  (concat!("dragonfly", "/", env!("CARGO_PKG_VERSION"))) and allow
  user-supplied User-Agent to override it
- Fix dataset test to use proper owner/repo URL format
- Add comprehensive test coverage: dataset, space, explicit model type,
  invalid scheme, missing repo, build_hf_url, build_headers behavior

Signed-off-by: pmady <pmady@users.noreply.github.com>
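
Two of the fixes in the commit above lend themselves to a short sketch: listing entries are emitted as `hf://` URLs so that follow-up downloads stay on the HF backend, and a user-supplied `User-Agent` takes precedence over the versioned default. The helper names `entry_url`, `default_user_agent`, and `user_agent` are hypothetical, as is the `HashMap` header representation. The commit message builds the default with `concat!` and `env!`; this sketch uses `option_env!` instead so that it also compiles outside Cargo.

```rust
use std::collections::HashMap;

/// Hypothetical helper: build the `hf://` URL for one listed file,
/// keeping the revision as an `@` suffix.
pub fn entry_url(owner: &str, repo: &str, file_path: &str, revision: &str) -> String {
    format!("hf://{owner}/{repo}/{file_path}@{revision}")
}

/// Versioned default User-Agent in the form "dragonfly/<version>".
/// Falls back to "unknown" when CARGO_PKG_VERSION is not set at compile time.
pub fn default_user_agent() -> String {
    format!(
        "dragonfly/{}",
        option_env!("CARGO_PKG_VERSION").unwrap_or("unknown")
    )
}

/// Hypothetical helper: prefer a caller-supplied User-Agent header
/// (matched case-insensitively) over the default.
pub fn user_agent(request_headers: &HashMap<String, String>) -> String {
    request_headers
        .iter()
        .find(|(name, _)| name.eq_ignore_ascii_case("user-agent"))
        .map(|(_, value)| value.clone())
        .unwrap_or_else(default_user_agent)
}
```

Returning `hf://` entries rather than `https://` ones means each listed file is routed back through the same backend, so authentication and revision handling apply to every file of a recursive download.
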

pmady commented Feb 9, 2026

@gaius-qi I've created a documentation PR at dragonflyoss/d7y.io#386 that adds:

  • Hugging Face integration page: New section documenting the native hf:// protocol with URL format, single file download, authentication (--hf-token), recursive repository download, dataset download, and revision-specific download examples.
  • dfget reference page: Added "Download with Hugging Face protocol" section with complete examples.

pmady requested a review from gaius-qi February 9, 2026 18:13
…gFace proto support

- Add HuggingFace proto message support in dfget for passing
  hf_token through the download request instead of injecting
  Authorization headers manually in convert_args.
- Update dragonfly-api dependency from 2.2.13 to 2.2.19

Signed-off-by: Gaius <gaius.qi@gmail.com>
gaius-qi commented

@pmady Thanks, I'll finish the review by this week.

Signed-off-by: Gaius <gaius.qi@gmail.com>
…ngface-backend

Signed-off-by: Gaius <gaius.qi@gmail.com>
…ngface-backend

Signed-off-by: Gaius <gaius.qi@gmail.com>
gaius-qi force-pushed the feat/huggingface-backend branch from 915c64f to a8debc5 Compare March 4, 2026 06:10
gaius-qi added 3 commits March 4, 2026 22:49
…port

- Rename huggingface.rs to hugging_face.rs to follow Rust naming conventions
- Rewrite HuggingFace backend with improved architecture:
  - Inject Arc<Config> for TLS and hickory DNS configuration
  - Replace http_header-based auth with dedicated HuggingFace proto field
  - Add insecure_skip_verify flag via --hf-insecure-skip-verify CLI arg
  - Use TryStreamExt over StreamExt for proper error propagation
  - Remove unused Default impl and redundant helper methods
- Add HuggingFace field to StatRequest, GetRequest, ExistsRequest, and
  PutRequest structs for passing auth tokens through the request pipeline
- Propagate hugging_face field through task, persistent_task, piece,
  dfdaemon_download, and dfdaemon_upload request handling
- Bump dragonfly-api to 2.2.20 for HuggingFace proto message support
- Refactor backend modules to use explicit crate imports instead of
  super:: prefix for cleaner module boundaries
- Update doc comments across backend modules to use imperative mood

Signed-off-by: Gaius <gaius.qi@gmail.com>
Signed-off-by: Gaius <gaius.qi@gmail.com>
Signed-off-by: Gaius <gaius.qi@gmail.com>
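
The rewrite above moves authentication out of the HTTP headers and into a dedicated field on the backend request structs. The sketch below illustrates that shape only. The struct names `HuggingFaceOptions` and `GetRequestSketch`, their fields, and the helper `authorization_value` are all hypothetical; the actual proto message in dragonfly-api and the real request structs may differ.

```rust
/// Hypothetical stand-in for the HuggingFace proto message.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct HuggingFaceOptions {
    /// Access token for private or gated repositories.
    pub token: Option<String>,
    /// Mirrors the --hf-insecure-skip-verify flag named in the commit.
    pub insecure_skip_verify: bool,
}

/// Hypothetical, trimmed-down request carrying the optional field.
#[derive(Debug, Clone, Default)]
pub struct GetRequestSketch {
    pub url: String,
    pub hugging_face: Option<HuggingFaceOptions>,
}

/// Derive the Authorization header value from the dedicated field,
/// returning `None` for anonymous requests or empty tokens.
pub fn authorization_value(request: &GetRequestSketch) -> Option<String> {
    request
        .hugging_face
        .as_ref()
        .and_then(|hf| hf.token.as_deref())
        .filter(|token| !token.is_empty())
        .map(|token| format!("Bearer {token}"))
}
```

Keeping the token in a typed field rather than a free-form header map makes it explicit which requests carry Hugging Face credentials as they pass through the task and piece layers.
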
@gaius-qi
Copy link
Member

gaius-qi commented Mar 5, 2026

@pmady The downloaded file is named LICENSE@main, but the @main part should be removed.

The downloaded file was named LICENSE@main instead of LICENSE because
make_output_by_entry used the raw URL path which includes the @revision
suffix. Strip the @revision from both the base URL path and entry URL
path when the scheme is hf to produce correct output filenames.

Signed-off-by: pmady <pmady@users.noreply.github.com>

pmady commented Mar 5, 2026

@gaius-qi Fixed in be0ef12. make_output_by_entry now strips the @revision suffix from HF URL paths, so the output filename is LICENSE instead of LICENSE@main. Added tests for both default and custom revisions.
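
The fix described above can be sketched as a small path helper. This is an illustration only: the names `strip_revision` and `output_file_name` are hypothetical and do not reflect how `make_output_by_entry` is actually written. The sketch assumes the revision is attached to the last path segment and contains no `/`, so a branch name such as `feature/x` would need different handling.

```rust
/// Hypothetical helper: drop a trailing `@revision` from an hf:// URL path.
/// Only the last segment is inspected, so an '@' in a directory name is kept.
pub fn strip_revision(path: &str) -> &str {
    let last_segment_start = path.rfind('/').map_or(0, |i| i + 1);
    match path[last_segment_start..].rfind('@') {
        Some(at) => &path[..last_segment_start + at],
        None => path,
    }
}

/// Hypothetical helper: the file name to write on disk for a given URL path.
pub fn output_file_name(path: &str) -> &str {
    strip_revision(path).rsplit('/').next().unwrap_or("")
}
```

Under these assumptions, `/owner/repo/LICENSE@main` maps to the file name `LICENSE`, and a path without a revision suffix is returned unchanged.
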

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Supports directly pulling repositories from Hugging Face

3 participants