feat: ingest URL content into knowledge table #39

Open
aicodecraft1004 wants to merge 1 commit into EmbeddedLLM:main from aicodecraft1004:feature

aicodecraft1004 commented Feb 9, 2026

feat: Direct URL ingestion for Knowledge Tables

Problem

Today, users must manually scrape a URL, save the content to a file, and then upload that file into a Knowledge Table. This adds friction and slows down knowledge base construction.

Solution

Introduce a first-class URL ingestion endpoint: POST /v2/gen_tables/knowledge/embed_url.
The endpoint fetches the URL, extracts readable text, and then reuses the existing Knowledge Table embedding/ingestion pipeline so chunking, embeddings, and storage behave consistently with file-based ingestion.


Changes

  • services/api/src/owl/url_loader.py (NEW): Async URL fetch + HTML parsing + content validation (non-empty/meaningful text)
  • services/api/src/owl/types/__init__.py: Add URLEmbedFormData (URL field + validation wiring)
  • services/api/src/owl/routers/gen_table.py: Add POST /v2/gen_tables/knowledge/embed_url + OPTIONS preflight; integrates URL content into the existing ingestion flow
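
The HTML parsing and content validation step described for url_loader.py could look roughly like the sketch below. It is not the PR's code: the names extract_readable_text, SKIPPED_TAGS and MIN_TEXT_LENGTH are hypothetical, the 50-character threshold is an assumption (the PR does not state one), and it uses the standard library's html.parser rather than whatever parser the PR actually depends on.

```python
from html.parser import HTMLParser

# Tags whose text content is dropped entirely. Assumption: this mirrors the
# script/style cleanup described in the PR; meta/link carry no text anyway.
SKIPPED_TAGS = {"script", "style"}
MIN_TEXT_LENGTH = 50  # hypothetical threshold, not taken from the PR


class _TextExtractor(HTMLParser):
    """Collects visible text while ignoring anything inside skipped tags."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return " ".join(self._parts)


def extract_readable_text(html: str, min_length: int = MIN_TEXT_LENGTH) -> str:
    """Return the visible text of an HTML page, or raise if it is too short."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = parser.text()
    if len(text) < min_length:
        raise ValueError("extracted content is empty or too short")
    return text
```

Raising on short output keeps near-empty pages (for example, ones rendered entirely by JavaScript) out of the embedding pipeline instead of storing meaningless chunks.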

Safety / Robustness (current behavior)

  • URL format validation + fetch error handling (invalid URL / HTTP errors)
  • Configurable request timeout (default: 30s)
  • Best-effort size guard via Content-Length pre-check (50MB max; note: not a hard streaming cap)
  • HTML cleanup prior to extraction (removes script/style/meta/link) and rejects empty/too-short extracted content
  • Uses existing permission/quota checks from the current ingestion path
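
To illustrate why the Content-Length pre-check is only best-effort, here is a small sketch that contrasts it with a byte-counting read. The function names, the exception class, and the plain async iterator standing in for an HTTP response body are all assumptions for illustration; only the 50 MB figure comes from the PR.

```python
import asyncio
from typing import AsyncIterator, Mapping

MAX_CONTENT_SIZE = 50 * 1024 * 1024  # 50 MB, the limit quoted in the PR


class ContentTooLargeError(ValueError):
    """Raised when a response body exceeds the allowed size."""


def declared_size_ok(headers: Mapping[str, str], max_bytes: int = MAX_CONTENT_SIZE) -> bool:
    """Best-effort pre-check: passes when Content-Length is absent or unparsable.

    Assumes the header key is spelled exactly "Content-Length"; real HTTP
    clients usually expose case-insensitive header mappings.
    """
    raw = headers.get("Content-Length")
    if raw is None:
        return True
    try:
        return int(raw) <= max_bytes
    except ValueError:
        return True


async def read_capped(chunks: AsyncIterator[bytes], max_bytes: int = MAX_CONTENT_SIZE) -> bytes:
    """Hard cap: count bytes as they arrive and fail once the limit is exceeded."""
    buffer = bytearray()
    async for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) > max_bytes:
            raise ContentTooLargeError(f"body exceeds {max_bytes} bytes")
    return bytes(buffer)


async def _demo_chunks() -> AsyncIterator[bytes]:
    # Stand-in for a streamed response body.
    yield b"hello "
    yield b"world"


if __name__ == "__main__":
    print(asyncio.run(read_capped(_demo_chunks(), max_bytes=64)))
```

A server that omits or understates Content-Length passes the first check regardless of the real body size, which is the gap the streaming follow-up would close.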

Follow-ups (if maintainers prefer)

  • Add SSRF protections (block localhost/private/link-local ranges + DNS-resolved private IPs)
  • Enforce hard max-bytes limit via streaming download (independent of Content-Length)
  • Add content-type allowlist (e.g., text/html, text/plain) + redirect cap
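
As a rough idea of what the SSRF follow-up might involve, the sketch below rejects loopback, private, and link-local addresses, including the ones a hostname resolves to. The function names are hypothetical and this is not code from the PR; it uses only the standard ipaddress and socket modules.

```python
import ipaddress
import socket


def is_blocked_address(host_ip: str) -> bool:
    """Return True for addresses that should not be fetched server-side."""
    ip = ipaddress.ip_address(host_ip)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:10.0.0.1) so the IPv4 rules apply.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_loopback
        or ip.is_private
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def host_is_safe(hostname: str) -> bool:
    """Resolve a hostname and require every resolved address to be allowed."""
    infos = socket.getaddrinfo(hostname, None)
    for info in infos:
        # Strip an IPv6 scope id such as "%eth0" before parsing.
        address = info[4][0].split("%")[0]
        if is_blocked_address(address):
            return False
    return True
```

Checking the resolved addresses and then fetching by hostname still leaves a window for DNS rebinding, so a complete fix would also need to connect to the address that was validated.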

API

POST /v2/gen_tables/knowledge/embed_url

{
  "url": "https://example.com",
  "table_id": "knowledge_table_id",
  "chunk_size": 2000,
  "chunk_overlap": 200
}

Add URL loader with validation/SSRF protection and size/timeout limits; reuse existing knowledge ingestion pipeline.
@aicodecraft1004
Author

aicodecraft1004 commented Feb 9, 2026

Implemented direct URL ingestion for knowledge tables and opened PR #39.

  • adds POST /v2/gen_tables/knowledge/embed_url (+ OPTIONS preflight)
  • fetches URL content, extracts readable text, and reuses the existing knowledge ingestion pipeline

CI is currently blocked due to fork workflow approval (“workflow awaiting approval”).
Could a maintainer please approve and run workflows for this PR?

Thanks. Happy to iterate on SSRF/streaming caps/content-type allowlist if you want those included in this PR.

@jiahuei
Member

jiahuei commented Feb 11, 2026

Hi there, thanks for your contribution! We will take a look.

Member

jiahuei left a comment

Thanks for your useful PR! I added some comments, please take a look.

MAX_CONTENT_SIZE = 50 * 1024 * 1024


async def load_url_content(url: str, timeout: int = 30) -> Tuple[str, str]:
Member

Perhaps this can build on top of open_uri_async (which has URL validation) and move into utils.io?



class URLEmbedFormData(BaseModel):
    url: Annotated[str, Field(description="The URL to extract content from.")]
Member

Can this be merged with FileEmbedFormData by making file nullable and url nullable? Then we can add a validator that requires either one to be not null.
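
One way to picture the merged model suggested here is the sketch below. To stay dependency-free it uses a plain dataclass rather than the project's Pydantic models, so the class name EmbedFormData and its fields are hypothetical, and the real change would more likely be a validator on FileEmbedFormData. It also assumes that supplying both file and url should be rejected, which is stricter than the comment states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmbedFormData:
    """Hypothetical merged form: accepts either an uploaded file or a URL."""

    table_id: str
    file: Optional[bytes] = None  # stand-in for an uploaded file object
    url: Optional[str] = None

    def __post_init__(self) -> None:
        # Exactly one source must be given. Rejecting "both" is an assumption;
        # the review comment only requires that one of them is not null.
        if (self.file is None) == (self.url is None):
            raise ValueError("provide exactly one of 'file' or 'url'")
```

With a check like this in place, a single endpoint handler can branch on which field is set instead of maintaining two routes.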



@router.post(
    "/v2/gen_tables/knowledge/embed_url",
Member

If possible, I suggest merging this with /v2/gen_tables/knowledge/embed_file, by merging URLEmbedFormData into FileEmbedFormData.
Then we can just check if file is None and/or url is None.
