@dmsnell
Member

@dmsnell dmsnell commented Dec 16, 2025

Trac ticket: Core-64427

Introduces the WP_Mime_Sniffer class for parsing MIME types from sources such as HTTP Content-Type headers, unknown binary files, and more.

  • WP_Mime_Sniffer::from_declaration( $supplied_type ) for decoding HTTP Content-Type headers, HTML <meta http-equiv> and <script type> tags, RFC 822 headers, and more, where the string is an affirmation of the type of content that should be contained within some associated resource.
  • WP_Mime_Sniffer::from_file( $file_path ) for inferring MIME type from the “resource header” of a file at the given path where harmonizing server and browser behaviors is warranted, largely to eliminate security vulnerabilities.
  • WP_Mime_Sniffer::from_binary_file_contents( $file_contents ) for the same, but when the file data has already been loaded, e.g. on media file upload or via HTTP GET.
  • $mime_type->serialize() to produce a normalized version of a potentially-malformed input.
  • $mime_type->minimize() to produce a privacy-sensitive stripped-down version of the MIME type suitable for use in APIs like PerformanceResourceTiming.
  • $mime_type->get_indicated_charset() to return a canonical character encoding referenced by the MIME type, if included and recognized.
  • A family of methods to indicate if a MIME type is of a given common set, such as $mime_type->is_json() and $mime_type->is_javascript().

The ::declaring_javascript() and ::declaring_json() methods are interesting and might be worth emphasizing over from_declaration() if they stay in the patch. They only return a parsed MIME type if given something that matches those classes.

if ( WP_Mime_Sniffer::declaring_json( $content_type ) ) {
	$response = json_decode( $response );
}
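
As a rough illustration of what a JSON check of this kind might test, the sketch below follows the WHATWG definition of a JSON MIME type: an essence of application/json or text/json, or a subtype ending in +json. sketch_is_json_essence() is a hypothetical helper that works on an already-extracted type/subtype string; it is not the patch's declaring_json().

```php
<?php
/**
 * Hypothetical helper, not part of the patch: reports whether a
 * "type/subtype" string names a JSON MIME type per the WHATWG definition.
 */
function sketch_is_json_essence( string $essence ): bool {
	// MIME types compare ASCII case-insensitively.
	$essence = strtolower( $essence );

	return 'application/json' === $essence
		|| 'text/json' === $essence
		// Structured-syntax suffix, e.g. application/ld+json.
		|| '+json' === substr( $essence, -5 );
}

var_dump( sketch_is_json_essence( 'application/ld+json' ) );
var_dump( sketch_is_json_essence( 'text/html' ) );
```

The suffix test is what lets types such as application/ld+json be treated as JSON without listing each one.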
  • Add ::from_http_headers_string( string $headers ) ?
  • Add ::from_http_headers_array( array<string> $headers ) ?

These two methods could simplify code that attempts to infer a content type without needing to know the details surrounding Content-Type parsing: in download_url(), in SimplePie, in discover_pingback_server_uri(), even in wp_staticize_emoji_for_email()! Adopting them would also mean updating WP_REST_Request::get_content_type() and wp_finalize_template_enhancement_output_buffer().

The Encoding part unlocks non-UTF-8 inputs in the HTML API, which currently stops at $this->bail( 'Cannot yet process META tags with http-equiv Content-Type to determine encoding.' );

Most of the labeled encodings are supported by the version of PHP running on my computer with the mbstring and iconv extensions. Of the unsupported ones:

  • ISO-8859-8-I is a variant of ISO-8859-8 that uses the same byte-to-character mapping; the two names differ in whether the text is treated as being in logical or visual order.
  • replacement groups security-risky encoding labels into a decoder that always fails. When decoded, the output is always '' (empty string).
  • x-user-defined maps the non-US-ASCII bytes 0x80 through 0xFF up by 0xF700 into the private-use area (U+F780 through U+F7FF).
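
For the x-user-defined case, a decoder is small enough to sketch by hand. The following assumes the WHATWG Encoding definition, where ASCII bytes pass through and byte 0x80 maps to U+F780. sketch_decode_x_user_defined() is a hypothetical name, and the function emits UTF-8 directly so that it does not depend on mbstring or iconv.

```php
<?php
/**
 * Hypothetical helper, not part of the patch: decodes x-user-defined
 * bytes into UTF-8 following the WHATWG Encoding definition.
 */
function sketch_decode_x_user_defined( string $bytes ): string {
	$utf8   = '';
	$length = strlen( $bytes );

	for ( $i = 0; $i < $length; $i++ ) {
		$byte = ord( $bytes[ $i ] );

		// ASCII bytes are passed through unchanged.
		if ( $byte < 0x80 ) {
			$utf8 .= $bytes[ $i ];
			continue;
		}

		// 0x80..0xFF land in the private-use range U+F780..U+F7FF.
		$code_point = 0xF780 + $byte - 0x80;

		// Every code point in that range needs three UTF-8 bytes.
		$utf8 .= chr( 0xE0 | ( $code_point >> 12 ) )
			. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}

	return $utf8;
}

echo bin2hex( sketch_decode_x_user_defined( "A\x80\xFF" ) ), "\n";
```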

@github-actions

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

dmsnell added a commit to dmsnell/wordpress-develop that referenced this pull request Dec 16, 2025
Various applications require making decisions based on a MIME type from
an HTTP header, such as is found in the “Content-Type” header.

This patch introduces a new utility class for parsing MIME media types
compliant with the WHATWG MIME Sniffing algorithm.

Co-authored-by: Jon Surrell <[email protected]>
@dmsnell dmsnell force-pushed the mime/introduce-mime-class branch from 304205a to 75854fd Compare December 16, 2025 23:00
@dmsnell
Member Author

dmsnell commented Dec 31, 2025

Once again WPCS thinks vertical alignment is a crime…

Thoughts after letting this stew for a bit:

  • Perhaps we rename this to something more specific, like WP_MIME_Sniffer?

dmsnell and others added 3 commits December 31, 2025 10:44
@dmsnell dmsnell force-pushed the mime/introduce-mime-class branch from e2f6863 to 8f46374 Compare December 31, 2025 20:49
@dmsnell
Member Author

dmsnell commented Dec 31, 2025

I’ve made some large updates to the interface and I’m overall much happier with the new primary methods. There are still some parts of the sniffing algorithm left to implement (webm and mp3 without id3) but those are trivial to add.

Still need to think about essence() as a term, because while it’s specific and matches the specification, it may not be widely recognized or understood. Still also need to think about levels of confidence and the relationship with the detected Apache bug.

return $serialization;
}

public function essence(): string {
Member

What about get_essence()? While “essence” is still maybe esoteric, I also see that, as a verb, it could mean “to perfume or scent”. Since class methods are normally verbs, I would not immediately recognize what an essence function is supposed to do, but I would understand get_essence.

Member Author

I’m really not happy with this term overall. It was brand new to me, and I think to @sirreal as well. I don’t know about you, but were it not for the spec I would not have given it this name.

It seems worth continuing to review the spec (which I’m doing) and trying to understand how we want to use this in which contexts. Given the test failures there are things I still don’t understand and I think some of them stem from other specifications, such as the Fetch and HTTP Structured Fields specs.

For now I will leave this as-is not because I disagree with your advice, but mostly as a questionable artifact of implementing the spec as-is.

This discussion is rather insightful, and I think we will need to also implement Content-Type parsing from Fetch if we want to handle the full web-platform-tests suite. This addresses questions you and I have discussed around seeing multiple Content-Type headers, as even though the web servers we tested limit them to one, apparently not all do, plus there might be cases where the header is duplicated in order to split long lines. Various browsers still disagree around the edges, so we have some leeway.

I’m hoping this doesn’t get too complicated.

Member Author

Scratch that: I had $position .= strcspn() instead of $position += strcspn() 🤦‍♂️
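
The slip is easy to miss because PHP accepts both operators on an integer. A minimal sketch of the difference, using a hypothetical header string and offset rather than code from the patch:

```php
<?php
$header   = 'text/html; charset=utf-8';
$position = 5; // Offset just past "text/".

// Buggy: ".=" concatenates, so the offset becomes a string of digits.
$buggy  = $position;
$buggy .= strcspn( $header, ';', $position );

// Intended: "+=" advances the offset by the length of the span.
$fixed  = $position;
$fixed += strcspn( $header, ';', $position );

var_dump( $buggy );                        // string(2) "54"
var_dump( $fixed );                        // int(9)
var_dump( substr( $header, 0, $fixed ) );  // string(9) "text/html"
```

Without strict types, the numeric string is silently coerced back to an integer the next time it is used as an offset, so the parser jumps ahead instead of raising an error.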

The tests pass, but I did implement the Content-Type parsing and think it might be useful. The same tests pass and fail from the mime-types.json file.

Member

Regarding essence, I was not familiar with the term before researching MIME.

It's obvious looking at the function, but the essence is type/subtype (without any parameters). I think most folks think of the essence when they think of a MIME type, things like text/plain or application/json.

It's redundant, but maybe mime_type_essence would be a good name. It includes mime_type, so it should be visible to search and autocomplete when folks want the essence and look for something about MIME types.
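
To make the term concrete, here is a minimal sketch of extracting an essence from a declared type. sketch_mime_essence() is a hypothetical helper, not the patch's essence() method, and it skips most of the WHATWG parsing algorithm: it only drops parameters, checks that both halves are HTTP tokens, and lowercases the result.

```php
<?php
/**
 * Hypothetical helper, not part of the patch: returns "type/subtype"
 * from a declared MIME type, or null if it does not look like one.
 */
function sketch_mime_essence( string $declared ): ?string {
	// Everything before the first ";" is the type/subtype pair.
	$pair = substr( $declared, 0, strcspn( $declared, ';' ) );

	// Strip HTTP whitespace: space, tab, CR, and LF.
	$pair = trim( $pair, " \t\r\n" );

	// Both halves must be non-empty runs of HTTP token characters.
	$token = '[!#$%&\'*+\-.^_`|~0-9A-Za-z]+';
	if ( 1 !== preg_match( '@^' . $token . '/' . $token . '$@D', $pair ) ) {
		return null;
	}

	return strtolower( $pair );
}

var_dump( sketch_mime_essence( 'Text/HTML; charset=utf-8' ) );
var_dump( sketch_mime_essence( 'not a mime type' ) );
```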

Member

Is "media type" not the most common alternative term? If so, get_media_type() would make sense to me as a method name. The phpdoc can also mention that this is also the essence. Or would "media type" entail that parameters may also be present?

}
}

public function serialize(): string {
Member

How about also implementing a __toString method which is just an alias for this serialize method?

Member Author

In the past, people have been pretty averse to adding these implicit magic methods. I don’t have an opinion, though I do know that there are multiple legitimate ways to represent one of these as a string.

Member

I don't recall that aversion. I see that WP_HTML_Tag_Processor includes it. I don't feel strongly about this, however.

Member Author

It came up in WordPress/gutenberg#42485 which led to the creation of get_updated_html() in WordPress/gutenberg#44597, which was a long time ago. It is also the reason that (string) casting is never used in the HTML API documentation.

Member Author

For now I’m going to let this stew a bit without adding an implicit choice of what representation to use. I think serialize() makes sense, but the difference between the serialized and minimized form could matter in context-dependent ways.

For example, are we attempting to pass-through a normalized form of the input value? Are we trying to return the decoded MIME type? Are we trying to return the MIME type for an upload?

I think these three contexts warrant different returns:

  • serialize() will return forms of nonsensical types like !#$%&/!#$%& and x/x
  • minimize() does not.

minimize(), by contrast, is supposed to filter the media types by whether they are “supported by the user agent,” meaning whether the user agent knows how to display the type. In the case of WordPress I don’t have a great answer for what this means, so I have tossed this out as a start:

  • use wp_get_mime_types() for reporting decoded MIME type when support is relevant
  • use get_allowed_mime_types() for uploads or where user-permissions are relevant

My fear is that if we declare one canonical form of string value for the MIME type instance it will over-simplify this context-dependence and accidentally encourage developers to overlook it.

@dmsnell dmsnell marked this pull request as ready for review January 2, 2026 10:20
@github-actions

github-actions bot commented Jan 2, 2026

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, westonruter, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@dmsnell
Member Author

dmsnell commented Jan 2, 2026

Finishing up today’s work I have made substantial-enough changes that I am marking this ready for review:

  • All of the web-platform-tests MIME sniffing tests pass for parsing of supplied MIME types. These are things like HTTP Content-Type headers, email (RFC822) headers, and certain HTML attributes.
  • The from_content_type() method is something I added when I thought I was misunderstanding the spec and overlooked a typo. It goes beyond what is required by the MIMESNIFF spec but could be useful, especially since we occasionally see multiple Content-Type headers and are likely to ingest malicious values. It has no tests currently but there might be tests covering the “getting, decoding, and splitting” part of the algorithm in the HTTP Structured Fields spec.
  • Minimization needs a better name, such as get_privacy_sensitive_type_string(). It’s specifically designed in response to PerformanceResourceTiming adding Content-Type and meant to strip away things that might fingerprint users.
  • essence() is something I don’t understand yet. I think it’s only relevant within the spec and may not be something outside users need or want. It’s effectively the MIME type string returned from serialize() when all parameters are stripped away. Unfortunately we cannot call it get_mime_type_string() because those may contain parameters.
  • The parsing from binaries isn’t complete and I don’t know if it needs to have a place in 7.0.0. We can of course finish it and verify it with tests, but the web-platform-tests suite has little coverage for it: I think they have one test file per type. It could provide a nice security uplift for server-side content/media type sniffing though, far better than file-extension-based methods.

The tests cover a lot of edge cases.

OK (987 tests, 1676 assertions)

A few related specifications integrate here and it makes me think about creating a WHATWG spec family inside wp-includes/whatwg, though it’s not limited to WHATWG in this case.

  • The Encoding spec brings the concept of names and labels for character encoding schemes (i.e. “charsets”) and I think that could be very helpful for the HTML API and other places where Core attempts to make sense of user-supplied encodings. Curiously, ascii, latin1, and iso-8859-1 converge into the canonical windows-1252 enshrining the reuse of the C1 controls even when that should cause a fatal.
  • The Fetch spec is full of HTTP semantics and I’d like to read more into it; I’m guessing there are valuable insights in there for WordPress.
  • Fetch of course leans on RFC9651 for Structured Field Parsing, which should answer all of our questions on how to handle unexpected HTTP headers, duplicates, line-wrapping, comma-lists, etc… I find this spec a little less clear on the byte-level parsing of HTTP headers but it probably just requires more concentrated reading.
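
The label-to-name idea from the Encoding spec can be sketched as a lookup table. The function name below is hypothetical and the table lists only a handful of the labels defined by the standard; it is meant to show the convergence described above, not to stand in for the patch's implementation.

```php
<?php
/**
 * Hypothetical lookup, not part of the patch: resolves an encoding label
 * to its canonical name. Only a few labels from the standard are listed.
 */
function sketch_encoding_name_from_label( string $label ): ?string {
	static $labels = array(
		'ascii'        => 'windows-1252',
		'us-ascii'     => 'windows-1252',
		'latin1'       => 'windows-1252',
		'iso-8859-1'   => 'windows-1252',
		'windows-1252' => 'windows-1252',
		'utf8'         => 'UTF-8',
		'utf-8'        => 'UTF-8',
	);

	// Labels match case-insensitively after stripping ASCII whitespace.
	$key = strtolower( trim( $label, " \t\n\f\r" ) );

	return $labels[ $key ] ?? null;
}

var_dump( sketch_encoding_name_from_label( ' Latin1 ' ) );
var_dump( sketch_encoding_name_from_label( 'klingon' ) );
```

Returning null for an unrecognized label leaves the caller to decide on a fallback, which seems preferable to guessing.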

What should be reviewed?

Please focus on design review: how will this be used? how should it be used? Are there better names? Are there different constructions that will lead to more responsible use?

For example, I renamed from_string() to from_declaration() for two reasons:

  • There are multiple conflicting contexts for how to sniff a MIME type. This involves the elephant in the PHP room that binary data is a string! from_string() doesn’t help anyone understand when they should call it or with what kind of data.
  • The “declaration” terminology hints more at the idea that some external source or metadata has made a claim about a MIME type, yet that claim requires parsing.

Another example is the split in from_binary() where I wanted to emphasize the primacy of from_file() which only reads in up to 1,445 bytes rather than reading in a file which may potentially be multiple gigabytes in size. If I were looking for a method to sniff a type of a file I would start with ->file and see what auto-completes. Renaming from_binary() to from_binary_file_contents() made sense to me that it would convey the sense that it’s operating on file data and hint that there could be a simpler function, from_file(), to reach for.
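
The bounded read is the important part of that split. A minimal sketch of the idea, assuming a hypothetical sketch_sniff_file() and a deliberately short signature table (the patch's from_file() covers far more of the sniffing algorithm):

```php
<?php
/**
 * Hypothetical sketch, not the patch's from_file(): sniffs a few common
 * types from the first bytes of a file without loading the whole file.
 */
function sketch_sniff_file( string $path ): ?string {
	$handle = fopen( $path, 'rb' );
	if ( false === $handle ) {
		return null;
	}

	// Read only the "resource header": at most 1,445 bytes.
	$header = fread( $handle, 1445 );
	fclose( $handle );
	if ( false === $header ) {
		return null;
	}

	// Illustrative subset of byte signatures.
	$signatures = array(
		"\x89PNG\r\n\x1A\n" => 'image/png',
		'GIF87a'            => 'image/gif',
		'GIF89a'            => 'image/gif',
		"\xFF\xD8\xFF"      => 'image/jpeg',
		'%PDF-'             => 'application/pdf',
	);

	foreach ( $signatures as $prefix => $type ) {
		if ( 0 === strncmp( $header, $prefix, strlen( $prefix ) ) ) {
			return $type;
		}
	}

	return null;
}
```

Because only the header is read, the cost stays the same whether the file is a few bytes or several gigabytes.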

Anyway, this is a bit of rambling but there is a risk in implementing specs like this that things are simpler than they appear. The spec provides us instructions on how to look at something and give the same answer that a browser would to the question, “what is this?” But the specs are general and not built for our unique or special use cases. We may choose to diverge where something better-meets the needs of WordPress developers, as we collectively see fit, as long as it doesn’t violate spec-compliance.

Thank you for your time and consideration on this!

@westonruter
Member

how will this be used? how should it be used?

Could this PR (or a new sub-PR) implement this new MIME type parser to replace the ad-hoc parsers listed in the ticket? This would more easily demonstrate that it satisfies the current use cases.
