feat: ingest URL content into knowledge table by aicodecraft1004 · Pull Request #39 · EmbeddedLLM/JamAIBase

aicodecraft1004 · 2026-02-09T17:32:23Z

feat: Direct URL ingestion for Knowledge Tables

Problem

Today users must manually scrape a URL, save the content to a file, and then upload that file into a Knowledge Table. This adds friction and slows down knowledge base construction.

Solution

Introduce a first-class URL ingestion endpoint: POST /v2/gen_tables/knowledge/embed_url.
The endpoint fetches the URL, extracts readable text, and then reuses the existing Knowledge Table embedding/ingestion pipeline so chunking, embeddings, and storage behave consistently with file-based ingestion.

Changes

services/api/src/owl/url_loader.py (NEW): Async URL fetch + HTML parsing + content validation (non-empty/meaningful text)
services/api/src/owl/types/__init__.py: Add URLEmbedFormData (URL field + validation wiring)
services/api/src/owl/routers/gen_table.py: Add POST /v2/gen_tables/knowledge/embed_url + OPTIONS preflight; integrates URL content into the existing ingestion flow

Safety / Robustness (current behavior)

URL format validation + fetch error handling (invalid URL / HTTP errors)
Configurable request timeout (default: 30s)
Best-effort size guard via Content-Length pre-check (50MB max; note: not a hard streaming cap)
HTML cleanup prior to extraction (removes script/style/meta/link) and rejects empty/too-short extracted content
Uses existing permission/quota checks from the current ingestion path
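
For reviewers, the cleanup-and-reject step above could look roughly like the sketch below. It is not the code in url_loader.py: it uses only the standard library's html.parser, and the names extract_readable_text and MIN_TEXT_CHARS, as well as the 50-character threshold, are hypothetical, since the description does not state the actual cutoff.

```python
from html.parser import HTMLParser

# Hypothetical threshold; the PR does not state the real minimum length.
MIN_TEXT_CHARS = 50


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # <meta> and <link> carry no text, so only script/style need tracking.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return " ".join(self._parts)


def extract_readable_text(html: str, min_chars: int = MIN_TEXT_CHARS) -> str:
    """Return the visible text of `html`, or raise if it is empty or too short."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = parser.text()
    if len(text) < min_chars:
        raise ValueError(f"Extracted text too short ({len(text)} chars)")
    return text
```

Rejecting short output early keeps near-empty pages (for example, JavaScript-only shells) from reaching the chunking and embedding steps.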

Follow-ups (if maintainers prefer)

Add SSRF protections (block localhost/private/link-local ranges + DNS-resolved private IPs)
Enforce hard max-bytes limit via streaming download (independent of Content-Length)
Add content-type allowlist (e.g. text/html, text/plain) + redirect cap
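
To make the first two follow-ups concrete, here is a minimal sketch of what they might involve, assuming standard-library checks only. The function names (is_blocked_address, host_resolves_to_public_only, read_capped) are hypothetical and not part of this PR. MAX_BYTES simply mirrors the 50MB figure above.

```python
import ipaddress
import socket
from typing import Iterable

MAX_BYTES = 50 * 1024 * 1024  # mirrors the 50MB limit mentioned in the PR


def is_blocked_address(ip: str) -> bool:
    """True for loopback, private, link-local and other non-public addresses."""
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )


def host_resolves_to_public_only(host: str, port: int = 443) -> bool:
    """Resolve `host` and require every returned address to be public."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # sockaddr[0] is the IP string; drop any IPv6 scope id such as "%eth0".
    return all(
        not is_blocked_address(info[4][0].split("%", 1)[0]) for info in infos
    )


def read_capped(chunks: Iterable[bytes], max_bytes: int = MAX_BYTES) -> bytes:
    """Accumulate streamed chunks, failing as soon as the cap is exceeded."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise ValueError(f"Response body exceeds {max_bytes} bytes")
    return bytes(buf)
```

One caveat: checking DNS before fetching still leaves a gap if the hostname resolves differently at connect time, so a full implementation would likely also need to connect to the validated address and re-check after each redirect.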

API

POST /v2/gen_tables/knowledge/embed_url

{
  "url": "https://example.com",
  "table_id": "knowledge_table_id",
  "chunk_size": 2000,
  "chunk_overlap": 200
}
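
A client call against this endpoint might be built as in the sketch below. Several things here are assumptions rather than confirmed behavior: the base URL is a placeholder, the body is sent as JSON to match the example above (the URLEmbedFormData name suggests the endpoint may expect form data instead), and any auth headers are left to the caller.

```python
import json
import urllib.request
from typing import Optional

EMBED_URL_PATH = "/v2/gen_tables/knowledge/embed_url"  # path from this PR


def build_embed_url_request(
    base_url: str,
    url: str,
    table_id: str,
    chunk_size: int,
    chunk_overlap: int,
    headers: Optional[dict] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the embed_url endpoint."""
    payload = {
        "url": url,
        "table_id": table_id,
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + EMBED_URL_PATH,
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: JSON body; switch to form encoding if the API requires it.
        headers={"Content-Type": "application/json", **(headers or {})},
        method="POST",
    )


# Placeholder host; values taken from the example request body above.
request = build_embed_url_request(
    "https://jamai.example",
    url="https://example.com",
    table_id="knowledge_table_id",
    chunk_size=2000,
    chunk_overlap=200,
)
# Sending it would be: urllib.request.urlopen(request, timeout=30)
```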

Add URL loader with validation/SSRF protection and size/timeout limits; reuse existing knowledge ingestion pipeline.

aicodecraft1004 · 2026-02-09T18:04:09Z

Implemented direct URL ingestion for knowledge tables and opened PR #39.

Adds POST /v2/gen_tables/knowledge/embed_url (+ OPTIONS preflight)
Fetches URL content, extracts readable text, and reuses the existing knowledge ingestion pipeline

CI is currently blocked pending fork workflow approval (“workflow awaiting approval”).
Could a maintainer please approve and run the workflows for this PR?

Thanks. Happy to iterate on SSRF protections, streaming caps, and a content-type allowlist if you want those included in this PR.

jiahuei · 2026-02-11T08:22:04Z

Hi there, thanks for your contribution! We will take a look.

jiahuei

Thanks for your useful PR! I added some comments, please take a look

jiahuei · 2026-02-13T08:04:04Z

services/api/src/owl/url_loader.py

+MAX_CONTENT_SIZE = 50 * 1024 * 1024
+
+
+async def load_url_content(url: str, timeout: int = 30) -> Tuple[str, str]:


Perhaps this could build on top of open_uri_async (which has URL validation) and move into utils.io?

jiahuei · 2026-02-13T08:06:06Z

services/api/src/owl/types/__init__.py



+class URLEmbedFormData(BaseModel):
+    url: Annotated[str, Field(description="The URL to extract content from.")]


Can this be merged with FileEmbedFormData by making file nullable and url nullable? Then we can add a validator that requires either one to be not null.
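
The suggested "either one must be non-null" rule could be sketched as follows. To stay self-contained this uses a plain dataclass with __post_init__ instead of the project's Pydantic BaseModel, so it only illustrates the check, not the real model. The class name EmbedFormData, the bytes stand-in for the uploaded file, and the source_kind helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmbedFormData:
    """Hypothetical merged form: accepts a file, a URL, or both."""

    table_id: str
    file: Optional[bytes] = None  # stand-in for the uploaded file object
    url: Optional[str] = None

    def __post_init__(self) -> None:
        # At least one source is required, as suggested in the review.
        # Whether supplying both should be rejected is left open here.
        if self.file is None and self.url is None:
            raise ValueError("Either `file` or `url` must be provided.")

    @property
    def source_kind(self) -> str:
        """Which ingestion path to take; a file wins if both are given."""
        return "file" if self.file is not None else "url"
```

In the actual Pydantic model the same check would live in a validator, and the route handler could branch on which field is set.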

jiahuei · 2026-02-13T08:08:10Z

services/api/src/owl/routers/gen_table.py



+@router.post(
+    "/v2/gen_tables/knowledge/embed_url",


If possible, I suggest merging this with /v2/gen_tables/knowledge/embed_file, by merging URLEmbedFormData into FileEmbedFormData.
Then we can just check if file is None and/or url is None.

feat: ingest URL content into knowledge table

7409d0f

Add URL loader with validation/SSRF protection and size/timeout limits; reuse existing knowledge ingestion pipeline.

jiahuei reviewed Feb 13, 2026

feat: ingest URL content into knowledge table #39
aicodecraft1004 wants to merge 1 commit into EmbeddedLLM:main from aicodecraft1004:feature
