Email demo mbox #198

Open
bmerkle wants to merge 18 commits into microsoft:main from bmerkle:email-demo-mbox

@bmerkle
Contributor

@bmerkle bmerkle commented Feb 15, 2026

added mbox handling

  • major refactoring
  • added mbox handling
  • added filters for first n, or last n mails of mbox
  • added filters for selecting mails with a date range, after or before
  • use verbose option to debug ingestion in detail
  • hardened mbox reading to also cope with invalid entries and other codecs, e.g.

Test data I am using:

So far I have ingested the fwts and bazaar mboxes.
The ubuntu mboxes contain several tens of thousands of mails, so we should use selectors there (e.g. date ranges).

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---          13/02/2026    21:18        2140627 bazaar-announce.mbox
-a---          13/02/2026    21:55         648468 fwts-announce.mbox
-a---          13/02/2026    20:19      269515931 kubuntu-users.mbox
-a---          13/02/2026    20:18      195869391 ubuntu-devel.mbox
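
The selectors described in this PR (a limit on the number of mails, a date range) could be sketched with the standard-library `mailbox` module roughly as follows. This is only an illustration, not the PR's implementation: the function name `iter_mbox_in_range`, its parameters, and the choice to treat naive dates as UTC are all assumptions.

```python
import mailbox
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from itertools import islice
from typing import Iterator, Optional


def iter_mbox_in_range(
    path: str,
    start: Optional[datetime] = None,
    stop: Optional[datetime] = None,
    limit: Optional[int] = None,
) -> Iterator[mailbox.mboxMessage]:
    """Yield messages with start <= Date < stop, at most `limit` of them.

    Hypothetical helper; `start` and `stop` must be timezone-aware.
    """

    def matching() -> Iterator[mailbox.mboxMessage]:
        box = mailbox.mbox(path, create=False)
        try:
            for msg in box:  # messages are loaded one at a time
                raw_date = msg["Date"]
                if raw_date is None:
                    continue  # no Date header: cannot be range-filtered
                try:
                    sent = parsedate_to_datetime(str(raw_date))
                except (TypeError, ValueError):
                    continue  # skip messages with an unparseable Date header
                if sent.tzinfo is None:
                    # Assumption: treat naive timestamps as UTC.
                    sent = sent.replace(tzinfo=timezone.utc)
                if start is not None and sent < start:
                    continue
                if stop is not None and sent >= stop:
                    continue
                yield msg
        finally:
            box.close()

    # islice(..., None) passes everything through when no limit is given.
    return islice(matching(), limit)
```

For a mailbox the size of ubuntu-devel.mbox, a call such as `iter_mbox_in_range(path, start=..., limit=100)` would keep only a small window of messages in memory at a time.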

- hardened mbox reading to also cope with invalid entries and other codecs, e.g.
- ubuntu-devel.mbox and kubuntu-users.mbox contained some unusual mails, which caused errors before this hardening.
- added mbox handling
- added filters for first n, or last n mails of mbox
- added filters for selecting mails with a date range, after or before
- use verbose option to debug ingestion in detail
Copilot AI review requested due to automatic review settings February 15, 2026 22:53
Copilot AI left a comment

Pull request overview

This PR adds mbox file support to the email ingestion tool, enabling processing of mbox mailbox files in addition to individual .eml files. The changes include filtering capabilities (first/last N emails, date ranges) and hardened encoding/header handling for robustness with real-world email data.

Changes:

  • Added mbox file format support with index-based and date-based filtering
  • Refactored argument parsing to use mutually exclusive groups for --eml and --mbox sources
  • Enhanced email header handling to properly coerce Header objects to strings
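
A mutually exclusive source group of the kind summarized above might look like the following `argparse` sketch. The flag names `--eml`, `--mbox`, `--first` and `--last` come from this PR; everything else (the `build_parser` name, help texts, and making `--first`/`--last` exclusive with each other) is an assumption for illustration, not the code in tools/ingest_email.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: exactly one of --eml / --mbox must be given."""
    parser = argparse.ArgumentParser(description="Ingest emails (sketch)")

    # required=True makes argparse reject a command line with neither source.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--eml", nargs="+", metavar="PATH",
                        help=".eml files or directories")
    source.add_argument("--mbox", metavar="FILE", help="mbox file")

    # Assumption: selecting the first N and the last N at once makes no sense.
    window = parser.add_mutually_exclusive_group()
    window.add_argument("--first", type=int, metavar="N",
                        help="only the first N mails")
    window.add_argument("--last", type=int, metavar="N",
                        help="only the last N mails")
    return parser
```

Passing both `--eml` and `--mbox` makes `parse_args` exit with a usage error, so the calling code does not need its own check.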

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| tools/ingest_email.py | Added mbox support, filtering options (--first/--last/--after/--before), refactored email iteration logic into _iter_emails function |
| src/typeagent/emails/email_import.py | Added import_emails_from_mbox and count_emails_in_mbox functions, improved header type handling with _header_to_str, enhanced encoding error handling |
| .gitignore | Added /tests/testdata/email-mbox to ignore test data files |
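
To illustrate the header coercion mentioned for `_header_to_str`: with Python's default `compat32` policy, a header that contains raw non-ASCII bytes may come back as an `email.header.Header` object rather than a `str`. The helper below is a guess at the general shape of such a coercion, not the function from this PR; the name `header_to_str` and the empty-string default for missing headers are assumptions.

```python
import email
from email.header import Header
from typing import Union


def header_to_str(value: Union[str, Header, None]) -> str:
    """Hypothetical helper: always hand back a plain str for a header value."""
    if value is None:
        return ""  # assumption: a missing header becomes an empty string
    # str() on a Header joins its decoded chunks; on a str it is a no-op.
    return str(value)


# A Subject with a raw 8-bit byte, as found in some real-world mailboxes.
msg = email.message_from_bytes(b"Subject: caf\xe9\n\nbody\n")
subject = header_to_str(msg["Subject"])
```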

bmerkle and others added 5 commits February 16, 2026 00:01
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
more memory efficient for large mbox

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- added testcases
Two changes:

1. Default charset: or "utf-8" → or "latin-1" — when no charset header is present, latin-1 preserves all bytes.
2. LookupError fallback: same switch to "latin-1" (no errors param needed since latin-1 never fails on any byte value).
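
The two changes in this commit message can be sketched as below. The function name `decode_payload` and its signature are hypothetical; only the `or "latin-1"` default and the `LookupError` fallback are taken from the commit message.

```python
from typing import Optional


def decode_payload(raw: bytes, charset: Optional[str]) -> str:
    """Hypothetical helper showing the latin-1 default and fallback."""
    try:
        # No declared charset: latin-1 maps every byte value to a code point.
        return raw.decode(charset or "latin-1")
    except LookupError:
        # Unknown codec name: latin-1 cannot fail, so no errors= argument.
        return raw.decode("latin-1")
    # Note: a known but wrong charset can still raise UnicodeDecodeError;
    # this sketch does not cover that case.
```

Because latin-1 is a one-to-one mapping of the 256 byte values, `decode_payload(raw, None).encode("latin-1")` gives back the original bytes, which is what "preserves all bytes" refers to.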
Collaborator

@gvanrossum gvanrossum left a comment

I haven't looked at the tests yet, but let's first get these comments out of the way. :-)

- mbox files are expanded to .eml files
- removed test data
- fixed date argument handling
- removed first/last argument handling and replaced it with limit etc
@bmerkle
Contributor Author

bmerkle commented Feb 17, 2026

we could also remove the --eml switch again, because it is now the only way to import

Collaborator

@gvanrossum gvanrossum left a comment

Somehow I can't seem to view test_mbox.py on this flight. I'll look into it later.

@gvanrossum
Collaborator

> we could also remove the --eml switch again, because it is now the only way to import

Yes please.

Collaborator

@gvanrossum gvanrossum left a comment

Here's the review for test_mbox.py.

- Updated .gitignore to reflect new email test data structure.
- Modified demos.md to clarify usage of email ingestion tools and updated command-line arguments.
- Changed references from `tools/gmail/` to `tools/mail/` in documentation and scripts.
- Refactored test cases in test_mbox.py to use new date filtering parameters: --start-date and --stop-date.
- Updated ingest_email.py to replace --after and --before with --start-date and --stop-date for date filtering.
- Added new tools for downloading Gmail and Outlook emails as .eml files: gmail_dump.py and outlook_dump.py.
- Introduced mbox_dump.py for extracting emails from mbox files into individual .eml files.
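
Extracting an mbox into individual .eml files, as mbox_dump.py is described as doing, can be sketched with the standard-library `mailbox` module. This is not the script from the PR: the function name `dump_mbox_to_eml`, the zero-padded file naming, and the return value are assumptions.

```python
import mailbox
from pathlib import Path


def dump_mbox_to_eml(mbox_path: str, output_dir: str) -> int:
    """Hypothetical helper: write each mbox message to its own .eml file."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    box = mailbox.mbox(mbox_path, create=False)
    count = 0
    try:
        for key in box.iterkeys():
            # get_bytes returns the stored message bytes without the
            # leading "From " separator line of the mbox format.
            data = box.get_bytes(key)
            count += 1
            # Assumption: number the files in mailbox order.
            (out / f"{count:06d}.eml").write_bytes(data)
    finally:
        box.close()
    return count
```

Writing the stored bytes directly avoids re-serializing each message, which matters for mails whose headers or encodings are already malformed.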
@bmerkle
Contributor Author

bmerkle commented Feb 17, 2026

all review points should be implemented now

Collaborator

@gvanrossum gvanrossum left a comment

Bunch more nits. Getting pretty close!

I don't have the courage to look at outlook_dump.py yet. :-)

- improve date parsing logic
- add option for output-dir similar to gmail client
- remove outlook-dump.py as this is handled in microsoft#199 (copilot added the file arbitrarily, grrr.)
@bmerkle
Contributor Author

bmerkle commented Feb 19, 2026

outlook_dump.py

Copilot added the file arbitrarily and I did not notice it during the commit.
It is removed now. We will look at this in #199.
