MIME: Introduce MIME type parser #10640

dmsnell · 2025-12-16T21:02:24Z

Introduces the WP_Mime_Sniffer class for parsing MIME types from sources such as HTTP Content-Type headers, unknown binary files, and more.

WP_Mime_Sniffer::from_declaration( $supplied_type ) for decoding HTTP Content-Type headers, HTML <meta http-equiv> and <script type> tags, RFC 822 headers, and more, where the string is an affirmation of the type of content that should be contained within some associated resource.
WP_Mime_Sniffer::from_file( $file_path ) for inferring MIME type from the “resource header” of a file at the given path where harmonizing server and browser behaviors is warranted, largely to eliminate security vulnerabilities.
WP_Mime_Sniffer::from_binary_file_contents( $file_contents ) for the same, but when the file data has already been loaded, e.g. on media file upload or via HTTP GET.
$mime_type->serialize() to produce a normalized version of a potentially-malformed input.
$mime_type->minimize() to produce a privacy-sensitive stripped-down version of the MIME type suitable for use in APIs like PerformanceResourceTiming.
$mime_type->get_indicated_charset() to return a canonical character encoding referenced by the MIME type, if included and recognized.
A family of methods to indicate if a MIME type is of a given common set, such as $mime_type->is_json() and $mime_type->is_javascript().
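To make the idea of decoding a declaration concrete, here is a minimal sketch of splitting a declared MIME type into its essence and parameters. The function name sketch_parse_mime_declaration() is hypothetical and this is not the code from the patch. It deliberately simplifies the WHATWG parsing steps: quoted parameter values containing semicolons or backslash escapes are not handled.

```php
<?php
/**
 * Hypothetical helper, not the WP_Mime_Sniffer implementation from this PR.
 * Splits a declared MIME type into a lowercase essence and its parameters.
 */
function sketch_parse_mime_declaration( string $declared ): ?array {
	$http_whitespace = " \t\r\n";
	$token           = '/^[!#$%&\'*+\-.^_`|~0-9A-Za-z]+$/D';
	$declared        = trim( $declared, $http_whitespace );

	// The type runs up to the first "/"; a missing "/" is a parse failure.
	$type_length = strcspn( $declared, '/' );
	if ( 0 === $type_length || strlen( $declared ) === $type_length ) {
		return null;
	}
	$type = substr( $declared, 0, $type_length );

	// The subtype runs up to the first ";" with trailing whitespace removed.
	$rest           = (string) substr( $declared, $type_length + 1 );
	$subtype_length = strcspn( $rest, ';' );
	$subtype        = rtrim( substr( $rest, 0, $subtype_length ), $http_whitespace );

	if ( 1 !== preg_match( $token, $type ) || 1 !== preg_match( $token, $subtype ) ) {
		return null;
	}

	// Simplification: split parameters on ";" without honoring quoted strings.
	$parameters = array();
	foreach ( explode( ';', (string) substr( $rest, $subtype_length ) ) as $pair ) {
		$parts = explode( '=', $pair, 2 );
		if ( 2 !== count( $parts ) ) {
			continue;
		}
		$name  = strtolower( ltrim( $parts[0], $http_whitespace ) );
		$value = rtrim( $parts[1], $http_whitespace );
		if ( strlen( $value ) >= 2 && '"' === $value[0] && '"' === substr( $value, -1 ) ) {
			$value = substr( $value, 1, -1 );
		}
		// The first occurrence of a parameter name wins.
		if ( 1 === preg_match( $token, $name ) && '' !== $value && ! isset( $parameters[ $name ] ) ) {
			$parameters[ $name ] = $value;
		}
	}

	return array(
		'essence'    => strtolower( $type . '/' . $subtype ),
		'parameters' => $parameters,
	);
}
```

Calling it with ' Text/HTML; Charset="UTF-8" ' yields the essence text/html and a single charset parameter, while a string without a slash yields null.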

The ::declaring_javascript() and ::declaring_json() methods are interesting and might be worth emphasizing over from_declaration() if they stay in the patch. They only return a parsed MIME type if given something that matches those classes.

if ( WP_Mime_Sniffer::declaring_json( $content_type ) ) {
	$response = json_decode( $response );
}
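For readers unfamiliar with what counts as a JSON type, the check below sketches the classification as I read it in the MIME Sniffing spec: the essence is application/json or text/json, or the subtype ends in +json. The function name sketch_is_json_essence() is hypothetical, and it assumes the caller passes an already-parsed essence without parameters.

```php
<?php
/**
 * Hypothetical helper, not part of the patch.
 * Reports whether an essence ("type/subtype") names a JSON MIME type.
 */
function sketch_is_json_essence( string $essence ): bool {
	$essence = strtolower( $essence );

	return (
		'application/json' === $essence ||
		'text/json' === $essence ||
		// Structured syntax suffix, e.g. application/ld+json.
		'+json' === substr( $essence, -5 )
	);
}
```

A guard like this is what would make decoding a response body with json_decode() conditional on the declared type.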

Add ::from_http_headers_string( string $headers ) ?
Add ::from_http_headers_array( array<string> $headers ) ?

These two methods could simplify code that needs to infer a content type without knowing the details of Content-Type parsing: in download_url(), in SimplePie, in discover_pingback_server_uri(), even in wp_staticize_emoji_for_email()! The change would also update WP_REST_Request::get_content_type() and wp_finalize_template_enhancement_output_buffer().

The Encoding part unlocks non-UTF-8 inputs in the HTML API, which currently gives up with $this->bail( 'Cannot yet process META tags with http-equiv Content-Type to determine encoding.' );

Most of the labeled encodings are supported by the version of PHP running on my computer with the mbstring and iconv extensions. The unsupported ones:

ISO-8859-8-I is a variant of ISO-8859-8 which might be textually identical and possibly only specifies meta sequences based on the C0/C1 controls.
replacement groups security-risky encoding labels into a decoder that always fails. When decoded, the output is always '' (empty string).
x-user-defined maps non-US-ASCII bytes into the private-use area by adding 0xF700, so bytes 0x80 through 0xFF become U+F780 through U+F7FF.
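Since x-user-defined is not provided by mbstring or iconv here, a hand-written decoder is small. The sketch below follows the Encoding Standard rule that ASCII bytes pass through and each byte from 0x80 through 0xFF becomes the code point 0xF780 + byte - 0x80. The function name sketch_decode_x_user_defined() is hypothetical and not part of the patch; it emits UTF-8 by hand to avoid depending on any extension.

```php
<?php
/**
 * Hypothetical helper, not part of the patch.
 * Decodes x-user-defined bytes into a UTF-8 string.
 */
function sketch_decode_x_user_defined( string $bytes ): string {
	$decoded = '';
	$length  = strlen( $bytes );

	for ( $i = 0; $i < $length; $i++ ) {
		$byte = ord( $bytes[ $i ] );

		// ASCII bytes map to themselves.
		if ( $byte < 0x80 ) {
			$decoded .= $bytes[ $i ];
			continue;
		}

		// Bytes 0x80..0xFF land in the private-use range U+F780..U+F7FF.
		$code_point = 0xF780 + $byte - 0x80;

		// Every code point in that range needs a three-byte UTF-8 sequence.
		$decoded .= chr( 0xE0 | ( $code_point >> 12 ) )
			. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}

	return $decoded;
}
```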

github-actions · 2025-12-16T21:18:51Z

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

Various applications require making decisions based on a MIME type from an HTTP header, such as is found in the “Content-Type” header. This patch introduces a new utility class for parsing MIME media types compliant with the WHATWG MIME Sniffing algorithm. Co-authored-by: Jon Surrell <[email protected]>

dmsnell · 2025-12-31T04:04:21Z

Once again WPCS thinks vertical alignment is a crime…

Thoughts after letting this stew for a bit:

Perhaps we rename this to something more specific, like WP_MIME_Sniffer?

dmsnell · 2025-12-31T20:58:15Z

I’ve made some large updates to the interface and I’m overall much happier with the new primary methods. There are still some parts of the sniffing algorithm left to implement (WebM and MP3 without ID3) but those are trivial to add.

Still need to think about essence() as a term, because while it’s specific and matches the specification, it may not be widely recognized or understood. Still also need to think about levels of confidence and the relationship with the detected Apache bug.

westonruter · 2026-01-01T21:21:26Z

src/wp-includes/class-wp-mime-sniffer.php

+		return $serialization;
+	}
+
+	public function essence(): string {


What about get_essence()? While “essence” is still maybe esoteric, I also see that as a verb it could mean “to perfume or scent”. Since class methods are normally verbs, I would not immediately recognize what an essence function is supposed to do, but I would understand get_essence.

I’m really not happy overall with this term, as it was brand new to me, and I think to @sirreal, and I don’t know about you, but were it not for the spec I would not give it this name.

It seems worth continuing to review the spec (which I’m doing) and trying to understand how we want to use this in which contexts. Given the test failures there are things I still don’t understand and I think some of them stem from other specifications, such as the Fetch and HTTP Structured Fields specs.

For now I will leave this as-is not because I disagree with your advice, but mostly as a questionable artifact of implementing the spec as-is.

This discussion is rather insightful, and I think we will need to also implement Content-Type parsing from Fetch if we want to handle the full web-platform-tests suite. This addresses questions you and I have discussed around seeing multiple Content-Type headers, as even though the web servers we tested limit them to one, apparently not all do, plus there might be cases where the header is duplicated in order to split long lines. Various browsers still disagree around the edges, so we have some leeway.

I’m hoping this doesn’t get too complicated.

Scratch that: I had $position .= strcspn() instead of $position += strcspn() 🤦‍♂️
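For anyone skimming, the difference between the two operators is easy to reproduce in isolation. The snippet below is an illustration only, with a made-up input string and variable names that are not taken from the patch: += advances a numeric cursor by the span length, while .= silently turns the cursor into a string by appending digits.

```php
<?php
$input = 'text/html; charset=utf-8';

// Correct: advance the numeric cursor by the length of each span.
$position  = 0;
$position += strcspn( $input, '/', $position ); // Now at the "/".
$position += 1;                                 // Skip the "/".
$start     = $position;
$position += strcspn( $input, ';', $position ); // Now at the ";".
$subtype   = substr( $input, $start, $position - $start );

// Buggy: ".=" concatenates, so the cursor becomes the string "04".
$broken  = 0;
$broken .= strcspn( $input, '/', 0 );
```

After this runs, $position is the integer 9 and $subtype is "html", whereas $broken is the string "04" rather than the integer 4.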

The tests pass, but I did implement the Content-Type parsing and think it might be useful. The same tests pass and fail from the mime-types.json file.

Regarding essence, I was not familiar with the term before researching MIME.

It's obvious looking at the function, but the essence is type/subtype (without any parameters). I think most folks think of the essence when they think of a MIME type, things like text/plain or application/json.

It's redundant, but maybe mime_type_essence would be a good name. It includes mime_type, so it should show up in search and autocomplete when folks want the essence and look for something about MIME types.

Is "media type" not the most common alternative term? If so, get_media_type() would make sense to me as a method name. The phpdoc can also mention that this is the essence. Or would "media type" entail that parameters may also be present?

westonruter · 2026-01-01T21:22:42Z

src/wp-includes/class-wp-mime-sniffer.php

+		}
+	}
+
+	public function serialize(): string {


How about also implementing a __toString method which is just an alias for this serialize method?

In the past, people have been pretty averse to adding these implicit magic methods. I don’t have an opinion, though I do know that there are multiple legitimate ways to represent one of these as a string.

I don't recall that aversion. I see that WP_HTML_Tag_Processor includes it. I don't feel strongly about this, however.

It came up in WordPress/gutenberg#42485 which led to the creation of get_updated_html() in WordPress/gutenberg#44597, which was a long time ago. It also is the reason (string) casting the HTML API is never used in documentation.

For now I’m going to let this stew a bit without adding an implicit choice of what representation to use. I think serialize() makes sense, but the difference between the serialized and minimized form could matter in context-dependent ways.

For example, are we attempting to pass-through a normalized form of the input value? Are we trying to return the decoded MIME type? Are we trying to return the MIME type for an upload?

I think these three contexts warrant different returns:

serialize() will return forms of non-sensical types like !#$%&/!#$%& and x/x

minimize() does not.

minimize(), by contrast, is supposed to filter the media types by whether they are “supported by the user agent,” meaning whether the user agent knows how to display the type. In the case of WordPress I don’t have a great answer for what this means, so I have tossed this out as a start:

use wp_get_mime_types() for reporting decoded MIME type when support is relevant

use get_allowed_mime_types() for uploads or where user-permissions are relevant

My fear is that if we declare one canonical form of string value for the MIME type instance it will over-simplify this context-dependence and accidentally encourage developers to overlook it.
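To illustrate how context-dependent the minimized form is, here is a sketch of the minimization steps as I read them in the MIME Sniffing spec, with the "supported by the user agent" question reduced to an allow-list passed in by the caller. The function name sketch_minimize_essence() and the $supported parameter are hypothetical; in WordPress the list might come from wp_get_mime_types() or get_allowed_mime_types() depending on context, which is exactly the open question.

```php
<?php
/**
 * Hypothetical helper, not the minimize() method from the patch.
 * Collapses an essence into a privacy-sensitive form, or '' if unsupported.
 *
 * @param string   $essence   Lowercase or mixed-case "type/subtype".
 * @param string[] $supported Essences the caller considers supported.
 */
function sketch_minimize_essence( string $essence, array $supported ): string {
	$essence = strtolower( $essence );

	$javascript = array(
		'application/ecmascript', 'application/javascript',
		'application/x-ecmascript', 'application/x-javascript',
		'text/ecmascript', 'text/javascript', 'text/javascript1.0',
		'text/javascript1.1', 'text/javascript1.2', 'text/javascript1.3',
		'text/javascript1.4', 'text/javascript1.5', 'text/jscript',
		'text/livescript', 'text/x-ecmascript', 'text/x-javascript',
	);
	if ( in_array( $essence, $javascript, true ) ) {
		return 'text/javascript';
	}

	if ( 'application/json' === $essence || 'text/json' === $essence || '+json' === substr( $essence, -5 ) ) {
		return 'application/json';
	}

	// SVG is checked before the generic XML rule so it keeps its own name.
	if ( 'image/svg+xml' === $essence ) {
		return 'image/svg+xml';
	}

	if ( 'text/xml' === $essence || 'application/xml' === $essence || '+xml' === substr( $essence, -4 ) ) {
		return 'application/xml';
	}

	// Anything else survives only if the caller declares it supported.
	return in_array( $essence, $supported, true ) ? $essence : '';
}
```

With array( 'image/png' ) as the supported list, image/png is returned unchanged and x/x becomes the empty string, while the JSON, JavaScript, and XML families collapse regardless of the list.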

github-actions · 2026-01-02T10:20:35Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, westonruter, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

dmsnell · 2026-01-02T10:47:44Z

Finishing up today’s work I have made substantial-enough changes that I am marking this ready for review:

All of the web-platform-tests MIME sniffing tests pass for parsing of supplied MIME types. These are things like HTTP Content-Type headers, email (RFC822) headers, and certain HTML attributes.
The from_content_type() method is something I added when I thought I was misunderstanding the spec and overlooked a typo. It goes beyond what is required by the MIMESNIFF spec but could be useful, especially since we occasionally see multiple Content-Type headers and are likely to ingest malicious values. It has no tests currently but there might be tests covering the “getting, decoding, and splitting” part of the algorithm in the HTTP Structured Fields spec.
Minimization needs a better name, such as get_privacy_sensitive_type_string(). It’s specifically designed in response to PerformanceResourceTiming adding Content-Type and meant to strip away things that might fingerprint users.
essence() is something I don’t understand yet. I think it’s only relevant within the spec and may not be something outside users need or want. It’s effectively the MIME type string returned from serialize() when all parameters are stripped away. Unfortunately we cannot call it get_mime_type_string() because those may contain parameters.
The parsing from binaries isn’t complete and I don’t know if it needs to have a place in 7.0.0. We can of course finish it and verify it with tests, but the web-platform-tests suite isn’t bulky for them: I think they have one test file per type. They could provide a nice security uplift for server-side content/media type sniffing though, far better than file-extension-based methods.
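On the multiple Content-Type headers point, the selection rule in Fetch, as I understand it, is that each value is parsed in order, values that fail to parse or whose essence is */* are skipped, and the last surviving value wins. The sketch below shows only that selection step. The function name sketch_pick_content_type_essence() is hypothetical, it expects header values that have already been split, and it omits the charset carry-over that Fetch also specifies.

```php
<?php
/**
 * Hypothetical helper, not the from_content_type() method from the patch.
 * Picks the essence from a list of already-split Content-Type values.
 *
 * @param string[] $header_values One entry per Content-Type value.
 */
function sketch_pick_content_type_essence( array $header_values ): ?string {
	$token   = '[!#$%&\'*+\-.^_`|~0-9A-Za-z]+';
	$pattern = '@^[ \t\r\n]*(' . $token . ')/(' . $token . ')[ \t\r\n]*(?:;|$)@D';
	$essence = null;

	foreach ( $header_values as $value ) {
		// Skip values that do not start with a valid "type/subtype".
		if ( 1 !== preg_match( $pattern, $value, $matches ) ) {
			continue;
		}

		$candidate = strtolower( $matches[1] . '/' . $matches[2] );

		// The wildcard carries no information, so it never replaces a result.
		if ( '*/*' === $candidate ) {
			continue;
		}

		// Later valid values override earlier ones.
		$essence = $candidate;
	}

	return $essence;
}
```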

The tests cover a lot of edge cases.

OK (987 tests, 1676 assertions)

A few related specifications integrate here and it makes me think about creating a WHATWG spec family inside wp-includes/whatwg, though it’s not limited to WHATWG in this case.

The Encoding spec brings the concept of names and labels for character encoding schemes (i.e. “charsets”) and I think that could be very helpful for the HTML API and other places where Core attempts to make sense of user-supplied encodings. Curiously, ascii, latin1, and iso-8859-1 converge into the canonical windows-1252 enshrining the reuse of the C1 controls even when that should cause a fatal.
The Fetch spec is full of HTTP semantics and I’d like to read more into it; I’m guessing there are valuable insights in there for WordPress.
Fetch of course leans on RFC9651 for Structured Field Parsing, which should answer all of our questions on how to handle unexpected HTTP headers, duplicates, line-wrapping, comma-lists, etc… I find this spec a little less clear on the byte-level parsing of HTTP headers but it probably just requires more concentrated reading.
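The label convergence mentioned in the Encoding item can be pictured as a lookup table. The sketch below covers only a handful of labels from the Encoding Standard and is not the table from the patch; the function name sketch_encoding_name_from_label() is hypothetical. Labels are matched after stripping ASCII whitespace and ignoring ASCII case.

```php
<?php
/**
 * Hypothetical helper, not part of the patch.
 * Resolves an encoding label to its canonical name (partial table only).
 */
function sketch_encoding_name_from_label( string $label ): ?string {
	$names_by_label = array(
		'utf-8'        => 'UTF-8',
		'utf8'         => 'UTF-8',
		'ascii'        => 'windows-1252',
		'us-ascii'     => 'windows-1252',
		'latin1'       => 'windows-1252',
		'l1'           => 'windows-1252',
		'iso-8859-1'   => 'windows-1252',
		'cp1252'       => 'windows-1252',
		'windows-1252' => 'windows-1252',
	);

	// Strip ASCII whitespace, then compare case-insensitively.
	$key = strtolower( trim( $label, " \t\n\f\r" ) );

	return $names_by_label[ $key ] ?? null;
}
```

Here 'Latin1', ' ASCII ', and 'ISO-8859-1' all resolve to windows-1252, and an unrecognized label resolves to null.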

What should be reviewed?

Please focus on design review: how will this be used? how should it be used? Are there better names? Are there different constructions that will lead to more responsible use?

For example, I renamed from_string() to from_declaration() for two reasons:

There are multiple conflicting contexts for how to sniff a MIME type. This involves the elephant in the PHProom that binary data is a string! from_string() doesn’t help anyone understand when they should call it or with what kind of data.
The “declaration” terminology hints more at the idea that some external source or metadata has made a claim about a MIME type, yet that claim requires parsing.

Another example is the split in from_binary(), where I wanted to emphasize the primacy of from_file(), which reads at most 1,445 bytes rather than loading a file that may be multiple gigabytes in size. If I were looking for a method to sniff the type of a file I would start typing ->file and see what auto-completes. Renaming from_binary() to from_binary_file_contents() seemed to convey that it operates on file data and to hint that there could be a simpler function, from_file(), to reach for.
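The bounded read is the part worth picturing. The sketch below reads at most 1,445 bytes from a path and compares them against a few well-known signatures (PNG, JPEG, GIF, PDF). Both function names are hypothetical, the signature table is a small illustrative subset, and none of this is the code from the patch.

```php
<?php
/**
 * Hypothetical helper: read only the leading bytes of a file.
 */
function sketch_read_resource_header( string $path, int $max_bytes = 1445 ): ?string {
	$handle = @fopen( $path, 'rb' );
	if ( false === $handle ) {
		return null;
	}

	// Read at most $max_bytes, no matter how large the file is.
	$header = fread( $handle, $max_bytes );
	fclose( $handle );

	return false === $header ? null : $header;
}

/**
 * Hypothetical helper: match leading bytes against a few signatures.
 */
function sketch_sniff_signature( string $header ): ?string {
	$signatures = array(
		"\x89PNG\r\n\x1A\n" => 'image/png',
		"\xFF\xD8\xFF"      => 'image/jpeg',
		'GIF87a'            => 'image/gif',
		'GIF89a'            => 'image/gif',
		'%PDF-'             => 'application/pdf',
	);

	foreach ( $signatures as $signature => $mime_type ) {
		if ( 0 === strncmp( $header, $signature, strlen( $signature ) ) ) {
			return $mime_type;
		}
	}

	return null;
}
```

A caller that already holds the bytes in memory would skip the first function and pass its data straight to the second, which mirrors the split between the two entry points described above.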

Anyway, this is a bit of rambling but there is a risk in implementing specs like this that things are simpler than they appear. The spec provides us instructions on how to look at something and give the same answer that a browser would to the question, “what is this?” But the specs are general and not built for our unique or special use cases. We may choose to diverge where something better meets the needs of WordPress developers, as we collectively see fit, as long as it doesn’t violate spec-compliance.

Thank you for your time and consideration on this!

westonruter · 2026-01-02T23:29:53Z

how will this be used? how should it be used?

Could this PR (or a new sub-PR) implement this new MIME type parser to replace the ad-hoc parsers listed in the ticket? This would more easily demonstrate that it satisfies the current use cases.

dmsnell force-pushed the mime/introduce-mime-class branch from 304205a to 75854fd Compare December 16, 2025 23:00

This was referenced Dec 17, 2025

HTML API: Auto-escape JavaScript and JSON script tag contents when necessary #10635

Open

Scripts: Use HTML API to build SCRIPT tags #10639

Draft

dmsnell and others added 3 commits December 31, 2025 10:44

MIME: Binary sniffs

11b419c

Rename MIME sniffer class and restructure interface methods.

8f46374

dmsnell force-pushed the mime/introduce-mime-class branch from e2f6863 to 8f46374 Compare December 31, 2025 20:49

westonruter reviewed Jan 1, 2026

dmsnell added 8 commits January 1, 2026 16:41

MIME: Add test data downloader from web-platform-tests

1097018

MIME: Get basic tests running

04687c8

MIME: More tests

0e337c2

Fix some typos, add Content-Type parser, add minimize()

2eeaf34

PR feedback

5e44fc4

Pass remaining tests.

8ac14da

Test minimization

47c537d

Misalign vertical rhythm to silence yappy WPCS and use single quotes to save pixels.

65f3330

dmsnell marked this pull request as ready for review January 2, 2026 10:20

WPCS does not understand what the word "recommended" means

2063b80

dmsnell added 3 commits January 2, 2026 04:00

Add missing TODO items

906cb29

Rename sniff_json() to declaring_json() etc...

f16aae2

More notes/todos

6096a58
