- hardened mbox reading to also cope with invalid entries and other codecs; e.g. ubuntu-devel.mbox and kubuntu-users.mbox contained some unusual mails that caused errors before this hardening.
- added mbox handling
- added filters for the first n or last n mails of an mbox
- added filters for selecting mails by date range (after or before)
- use the verbose option to debug ingestion in detail
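For illustration, a date-range filter over mbox messages could look roughly like the sketch below. The helper names (`iter_mbox`, `message_date`, `filter_by_date`) are hypothetical and not taken from the PR; it also assumes that mails without a parseable Date header are skipped and that a Date without a zone is treated as UTC.

```python
import mailbox
from datetime import timezone
from email.utils import parsedate_to_datetime


def iter_mbox(path):
    """Yield every message in an mbox file (hypothetical helper)."""
    box = mailbox.mbox(path, create=False)
    try:
        yield from box  # iterating a Mailbox yields its messages
    finally:
        box.close()


def message_date(msg):
    """Return the Date header as an aware UTC datetime, or None."""
    raw = msg.get("Date")
    if raw is None:
        return None
    try:
        parsed = parsedate_to_datetime(str(raw))
    except (TypeError, ValueError):
        return None  # malformed Date header
    if parsed.tzinfo is None:
        # Assumption: treat a Date without zone information as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def filter_by_date(messages, after=None, before=None):
    """Yield messages with after <= date < before; bounds are optional."""
    for msg in messages:
        when = message_date(msg)
        if when is None:
            continue  # assumption: undated mails are skipped
        if after is not None and when < after:
            continue
        if before is not None and when >= before:
            continue
        yield msg
```

With this shape, `filter_by_date(iter_mbox("ubuntu-devel.mbox"), after=...)` would stream the matching mails without loading the whole mailbox into a list.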
Pull request overview
This PR adds mbox file support to the email ingestion tool, enabling processing of mbox mailbox files in addition to individual .eml files. The changes include filtering capabilities (first/last N emails, date ranges) and hardened encoding/header handling for robustness with real-world email data.
Changes:
- Added mbox file format support with index-based and date-based filtering
- Refactored argument parsing to use mutually exclusive groups for --eml and --mbox sources
- Enhanced email header handling to properly coerce Header objects to strings
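As a rough sketch of the mutually exclusive `--eml` and `--mbox` sources mentioned in the list above, an argparse setup might look like this. The `build_parser` name, the help texts, `nargs="+"` and `required=True` are assumptions for illustration, not the PR's actual code.

```python
import argparse


def build_parser():
    """Build a parser with mutually exclusive input sources (sketch)."""
    parser = argparse.ArgumentParser(description="Ingest emails (sketch)")
    # Exactly one of --eml / --mbox must be given (required=True is an assumption).
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument(
        "--eml", nargs="+", metavar="PATH", help="one or more .eml files"
    )
    source.add_argument("--mbox", metavar="FILE", help="a single mbox file")
    return parser
```

Passing both options at once makes argparse exit with a usage error, so the tool itself never has to check for the conflict.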
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tools/ingest_email.py | Added mbox support, filtering options (--first/--last/--after/--before), refactored email iteration logic into _iter_emails function |
| src/typeagent/emails/email_import.py | Added import_emails_from_mbox and count_emails_in_mbox functions, improved header type handling with _header_to_str, enhanced encoding error handling |
| .gitignore | Added /tests/testdata/email-mbox to ignore test data files |
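The implementation of `_header_to_str` is not shown in this conversation. A minimal sketch of what coercing header values to strings can involve, using a hypothetical `header_to_str` and assuming that undecodable values are kept as raw text, is:

```python
from email.errors import HeaderParseError
from email.header import Header, decode_header, make_header


def header_to_str(value):
    """Coerce a header value (str, Header, or None) to a plain str."""
    if value is None:
        return ""
    if isinstance(value, Header):
        return str(value)  # Header objects render themselves as text
    text = str(value)
    try:
        # Decode RFC 2047 encoded words such as =?utf-8?q?...?=
        return str(make_header(decode_header(text)))
    except (HeaderParseError, LookupError, UnicodeDecodeError, ValueError):
        # Assumption: on an unknown charset or bad bytes, keep the raw value.
        return text
```

Returning the raw text on failure keeps ingestion going for the kind of malformed mails found in real mailing list archives.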
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
more memory efficient for large mbox
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- added test cases
Two changes:
1. Default charset: `or "utf-8"` → `or "latin-1"`. When no charset header is present, latin-1 preserves all bytes.
2. LookupError fallback: same switch to "latin-1" (no `errors` parameter needed, since latin-1 never fails on any byte value).
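The two changes can be illustrated with a small sketch. `decode_payload` is a hypothetical name, and the use of `errors="replace"` for a declared charset is an assumption; only the latin-1 default and the LookupError fallback come from the comment above.

```python
from typing import Optional


def decode_payload(raw: bytes, charset: Optional[str]) -> str:
    """Decode body bytes, defaulting and falling back to latin-1."""
    try:
        # Change 1: with no declared charset, default to latin-1.
        # errors="replace" for a declared charset is an assumption.
        return raw.decode(charset or "latin-1", errors="replace")
    except LookupError:
        # Change 2: unknown codec name. latin-1 maps every byte value
        # to a code point, so no errors argument is needed here.
        return raw.decode("latin-1")
```

Because latin-1 is a one-to-one mapping of bytes 0 to 255, the original bytes can still be recovered later with `text.encode("latin-1")` if a better charset guess becomes available.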
gvanrossum left a comment
I haven't looked at the tests yet, but let's first get these comments out of the way. :-)
- mbox files are expanded to .eml
- removed test data
- fixed date argument handling
- removed first/last argument handling and replaced it with limit etc.
We could also remove the --eml switch again, because it is now the only way to import.
gvanrossum left a comment
Somehow I can't seem to view test_mbox.py on this flight. I'll look into it later.
Yes please.
gvanrossum left a comment
Here's the review for test_mbox.py.
- Updated .gitignore to reflect new email test data structure.
- Modified demos.md to clarify usage of email ingestion tools and updated command-line arguments.
- Changed references from `tools/gmail/` to `tools/mail/` in documentation and scripts.
- Refactored test cases in test_mbox.py to use new date filtering parameters: --start-date and --stop-date.
- Updated ingest_email.py to replace --after and --before with --start-date and --stop-date for date filtering.
- Added new tools for downloading Gmail and Outlook emails as .eml files: gmail_dump.py and outlook_dump.py.
- Introduced mbox_dump.py for extracting emails from mbox files into individual .eml files.
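The contents of mbox_dump.py are not shown here. As a hedged sketch of the general idea, extracting an mbox into individual .eml files with the standard library could look like this; the function name `dump_mbox_to_eml` and the numbered file naming scheme are assumptions.

```python
import mailbox
from pathlib import Path


def dump_mbox_to_eml(mbox_path, output_dir):
    """Write each message of an mbox file to its own numbered .eml file."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    box = mailbox.mbox(str(mbox_path), create=False)
    count = 0
    try:
        for key in box.iterkeys():
            # get_bytes returns the raw message without parsing it, so
            # malformed headers or odd encodings cannot break the dump.
            raw = box.get_bytes(key)
            # Assumption: files are named by their position in the mbox.
            (out / f"{count:06d}.eml").write_bytes(raw)
            count += 1
    finally:
        box.close()
    return count
```

Copying raw bytes leaves all decoding decisions to the later ingestion step, which then only has to deal with .eml files.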
…in date filtering tests
All review points should be implemented now.
gvanrossum left a comment
Bunch more nits. Getting pretty close!
I don't have the courage to look at outlook_dump.py yet. :-)
- improve date parsing logic
- add an output-dir option, similar to the Gmail client
- remove outlook_dump.py, as this is handled in microsoft#199 (Copilot added the file arbitrarily, grrr.)
Copilot added the file arbitrarily and I did not notice it during the commit.
added mbox handling
Test data which I am using:
Currently ingested the fwts and bazaar mboxes.
The Ubuntu mboxes contain several tens of thousands of mails, hence we should use selectors there (e.g. date ranges).