feat: ingest URL content into knowledge table #39

aicodecraft1004 wants to merge 1 commit into EmbeddedLLM:main from
Conversation
Add URL loader with validation/SSRF protection and size/timeout limits; reuse existing knowledge ingestion pipeline.
Implemented direct URL ingestion for knowledge tables and opened PR #39.

CI is currently blocked pending fork workflow approval ("workflow awaiting approval"). Thanks! Happy to iterate on SSRF protections, streaming caps, or a content-type allowlist if you want those included in this PR.
Hi there, thanks for your contribution! We will take a look.
jiahuei
left a comment
Thanks for your useful PR! I added some comments, please take a look
MAX_CONTENT_SIZE = 50 * 1024 * 1024
async def load_url_content(url: str, timeout: int = 30) -> Tuple[str, str]:
Perhaps can build on top of open_uri_async (has URL validation) and move into utils.io?
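A minimal sketch of the SSRF guard such a shared loader would need, whichever module it lives in. The 50 MB constant follows this PR; `is_private_host` and `validate_url` are illustrative helper names, not existing code in the repository:

```python
import ipaddress
import socket
from urllib.parse import urlparse

MAX_CONTENT_SIZE = 50 * 1024 * 1024  # 50 MB, matching the PR's constant


def is_private_host(hostname: str) -> bool:
    """Resolve the host and flag private/loopback/link-local addresses (SSRF guard)."""
    for info in socket.getaddrinfo(hostname, None):
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            return True
    return False


def validate_url(url: str) -> None:
    """Reject non-HTTP(S) schemes and URLs that resolve to internal addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Unsupported scheme: {parsed.scheme!r}")
    if not parsed.hostname or is_private_host(parsed.hostname):
        raise ValueError("URL resolves to a private or invalid address")
```

Resolving before checking matters: a hostname like an attacker-controlled DNS record can point at `127.0.0.1` or an internal range, so filtering on the URL string alone is not enough.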
class URLEmbedFormData(BaseModel):
    url: Annotated[str, Field(description="The URL to extract content from.")]
Can this be merged with FileEmbedFormData by making file nullable and url nullable? Then we can add a validator that requires either one to be not null.
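The merged model the comment describes could be sketched with a Pydantic v2 `model_validator` (assumed field names; in the actual codebase the `file` field is likely an upload type rather than a plain string):

```python
from typing import Optional

from pydantic import BaseModel, Field, model_validator


class FileEmbedFormData(BaseModel):
    # Both sources optional; a validator enforces that exactly one is set.
    file_name: Optional[str] = Field(default=None, description="The file to embed.")
    url: Optional[str] = Field(default=None, description="The URL to extract content from.")

    @model_validator(mode="after")
    def require_one_source(self):
        # True == True (both missing) or False == False (both given) both fail.
        if (self.file_name is None) == (self.url is None):
            raise ValueError("Provide exactly one of `file_name` or `url`.")
        return self
```

The single equality check rejects both the "neither provided" and "both provided" cases at validation time, so route handlers never see an ambiguous request.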
@router.post(
    "/v2/gen_tables/knowledge/embed_url",
If possible, I suggest merging this with /v2/gen_tables/knowledge/embed_file, by merging URLEmbedFormData into FileEmbedFormData.
Then we can just check if file is None and/or url is None.
feat: Direct URL ingestion for Knowledge Tables
Problem
Today users must manually scrape a URL, save the content to a file, and then upload that file into a Knowledge Table. This adds friction and slows down knowledge base construction.
Solution
Introduce a first-class URL ingestion endpoint:
POST /v2/gen_tables/knowledge/embed_url. The endpoint fetches the URL, extracts readable text, and then reuses the existing Knowledge Table embedding/ingestion pipeline so that chunking, embeddings, and storage behave consistently with file-based ingestion.
Changes
- URLEmbedFormData (URL field + validation wiring)
- POST /v2/gen_tables/knowledge/embed_url + OPTIONS preflight; integrates URL content into the existing ingestion flow

Safety / Robustness (current behavior)

- Content-Length pre-check (50MB max; note: not a hard streaming cap)

Follow-ups (if maintainers prefer)

- Streaming size cap (not just Content-Length)
- Content-type allowlist (text/html, text/plain) + redirect cap

API
POST /v2/gen_tables/knowledge/embed_url

{
  "url": "https://example.com",
  "table_id": "knowledge_table_id",
  "chunk_size": 2000,
  "chunk_overlap": 200
}
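The "streaming size cap" follow-up could be sketched as a client-agnostic helper (assumed code, not part of this PR): consume the response body chunk by chunk and abort once the running total exceeds the limit, instead of trusting the declared Content-Length.

```python
from typing import AsyncIterator

MAX_CONTENT_SIZE = 50 * 1024 * 1024  # 50 MB, matching the PR's pre-check limit


async def read_capped(chunks: AsyncIterator[bytes], limit: int = MAX_CONTENT_SIZE) -> bytes:
    """Accumulate body chunks, failing fast once `limit` bytes are exceeded.

    Unlike a Content-Length pre-check, this bounds the bytes actually read,
    so a server that lies about (or omits) Content-Length cannot push more
    than `limit` bytes into memory.
    """
    buf: list[bytes] = []
    total = 0
    async for chunk in chunks:
        total += len(chunk)
        if total > limit:
            raise ValueError(f"Response body exceeds {limit} bytes")
        buf.append(chunk)
    return b"".join(buf)
```

Any async HTTP client that exposes a byte-chunk iterator (e.g. a streaming response) can feed this helper, keeping the cap independent of the client library chosen.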