WordPress.org

WordPress Developer Blog

The HTML API: process your tags, not your pain

If you regularly parse HTML with regular expressions, you have some pretty impressive skills across the WordPress stack.

They definitely work. Most of the time. But the patterns in the code look just plain weird, like something someone found chiseled on the wall of an ancient tomb.

And when they don’t? Stuff gets real, fast, with breakages and vulnerabilities everywhere you look.

New in 6.2, growing since

New in WordPress 6.2, the first stage of the HTML API was the HTML Tag Processor. All by itself, that processor is better than regular expressions. It’s convenient, reliable, fast—and You. Can. Read. It.

Take a look at two examples below—first with regex and then with the HTML Tag Processor. They happen to be the use cases that motivated contributors to build the processor in the first place. And you’ll probably think of a bunch of ways to use the processor yourself.

Example 1: How to lazy-load images with the loading attribute

First: Doing it the regex way

It’s easy to forget the pain when you get the result you want. But here’s hoping this walk back through the hard way will persuade you that it’s even better to avoid as much of the pain as you can, especially if you get a better result. Note that the regex way will take two iterations and some cleanup:

Iteration 1: a lazy-loading image pattern

Let’s build a pattern that does three things:

  • Matches the image tag
  • Captures its attributes
  • Appends the loading attribute to the end of the snippet

function render_block( $html ) {
    return preg_replace(
        '~<img(.*)>~',
        '<img$1 loading="lazy">',
        $html
    );
}

Now, that works fine if you have one image tag in the markup. 

More than that, though, and the pattern will consume as many characters as it can, all at once. (In regular expressions, the technical term is greedy, but that’s not a value judgment. It’s an observation.)

So this line will add the loading attribute once, to the very last tag on the line, even if that tag isn’t an image.
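To see the greedy match in action, here is a minimal sketch that runs the iteration 1 pattern over a line with two images. The function name add_lazy_greedy() and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: the same pattern as iteration 1, under a different name.
function add_lazy_greedy( string $html ): string {
    return preg_replace( '~<img(.*)>~', '<img$1 loading="lazy">', $html );
}

// Sample markup, made up for this sketch: two images on one line.
$html   = '<p><img src="a.jpg"> and <img src="b.jpg"></p>';
$result = add_lazy_greedy( $html );

echo $result, "\n";
// Prints: <p><img src="a.jpg"> and <img src="b.jpg"></p loading="lazy">
```

Neither image gets the attribute. The greedy group runs to the last > on the line, so the attribute lands on the closing paragraph tag.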

The <img> tag is a void element.

Iteration 2: a little tunnel vision

You can solve the immediate problem by changing the .* to .*? which makes the match lazy: it stops at the first > it finds instead of the last one. The pattern becomes ~<img(.*?)>~.

function render_block( $html ) {
    return preg_replace(
        '~<img(.*?)>~',
        '<img$1 loading="lazy">',
        $html
    );
}

But wait a minute. The > can go inside an HTML attribute, like this:

<img title="bears > tigers">

And give you some seriously mangled markup.

<img title="bears  loading="lazy"> tigers">
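Here is a small sketch of both outcomes with the lazy pattern. The helper name add_lazy_nongreedy() and the two-image sample are assumptions for illustration; the bears and tigers input is the one from above.

```php
<?php
// Hypothetical helper: the iteration 2 pattern, under a different name.
function add_lazy_nongreedy( string $html ): string {
    return preg_replace( '~<img(.*?)>~', '<img$1 loading="lazy">', $html );
}

// Two plain images: each one now gets its own attribute.
$two_images = add_lazy_nongreedy( '<img src="a.jpg"><img src="b.jpg">' );
echo $two_images, "\n";
// Prints: <img src="a.jpg" loading="lazy"><img src="b.jpg" loading="lazy">

// A > inside a quoted value: the match stops too early.
$mangled = add_lazy_nongreedy( '<img title="bears > tigers">' );
echo $mangled, "\n";
// Prints: <img title="bears  loading="lazy"> tigers">
```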

So you teach the pattern to recognize attributes that might be part of an HTML tag: if the sub-match group sees an attribute name, an =, and a quoted value all together, that’s part of a tag, and the pattern skips over it.

Except it doesn’t.

function render_block( $html ) {
    return preg_replace(
        '~<img(.*?)([a-z-]+="[^"]*"\s*)*>~',
        '<img$1$2 loading="lazy">',
        $html
    );
}
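A quick way to see where this version falls down is to feed it three inputs. This is only a sketch: the helper name add_lazy_attribute_aware() and the sample tags are invented for illustration, and the pattern is the one above with its closing quote in place.

```php
<?php
// Hypothetical helper: the attribute-aware pattern, under a different name.
function add_lazy_attribute_aware( string $html ): string {
    return preg_replace(
        '~<img(.*?)([a-z-]+="[^"]*"\s*)*>~',
        '<img$1$2 loading="lazy">',
        $html
    );
}

// Double quotes: the > inside the value is skipped, as intended.
$double = add_lazy_attribute_aware( '<img title="bears > tigers">' );
echo $double, "\n";
// Prints: <img title="bears > tigers" loading="lazy">

// Single quotes: the attribute group never matches, so the old bug is back.
$single = add_lazy_attribute_aware( "<img title='bears > tigers'>" );
echo $single, "\n";
// Prints: <img title='bears  loading="lazy"> tigers'>

// Two attributes: a repeated capture group keeps only its last repetition,
// so $2 holds the title attribute and src silently disappears.
$two = add_lazy_attribute_aware( '<img src="a.jpg" title="bears">' );
echo $two, "\n";
// Prints: <img title="bears" loading="lazy">
```

The last case is the sneaky one. Nothing looks wrong in the pattern, but the replacement can only see the final attribute the group matched.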

Now things are getting complicated, and the pattern isn’t even catching attributes that have single quotes or no quotes. Or that don’t have an = sign. 

That’s already a lot of cases to check for. You will probably miss some.

<img((?:\s+[a-z-]+(?:=(?:\w+|(['"]).*\2))?)*)>

What. A. Mess. It looks impressive.

With all these iterations, who can tell what you were trying to do in the first place? You took care of all the edge cases.

But did your code?

You’ve gone through this thing three times now. And iteration #3 isn’t really any better than the one you started with. If there’s already a loading attribute set by something else, the browser (or whatever is parsing the markup at this point) will treat the one you appended as a duplicate and ignore it.

Some common attribute-naming schemes will break this. Sometimes DOMDocument won’t work properly, even at a basic level.

<textarea>

There is no <img> inside a textarea because everything inside is plaintext.

</textarea>

<script>

console.log( "This also contains no <img> because it's inside a script." );

</script>

This contains an image but because it already has an earlier definition of the loading attribute, any duplicates appended to the end will be ignored by the browser.

<img loading=eager src="machine-rusty.jpg">
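To make those cases concrete, here is a minimal sketch that runs the lazy pattern from iteration 2 over a textarea and over an image that already has a loading attribute. The helper name append_loading_attribute() and the shortened textarea text are assumptions for illustration.

```php
<?php
// Hypothetical helper: the non-greedy pattern from iteration 2, under a different name.
function append_loading_attribute( string $html ): string {
    return preg_replace( '~<img(.*?)>~', '<img$1 loading="lazy">', $html );
}

// Plain text inside a textarea is rewritten as if it were a tag.
$textarea = append_loading_attribute( '<textarea>There is no <img> here.</textarea>' );
echo $textarea, "\n";
// Prints: <textarea>There is no <img loading="lazy"> here.</textarea>

// An existing loading attribute gets a second, later copy.
$duplicate = append_loading_attribute( '<img loading=eager src="machine-rusty.jpg">' );
echo $duplicate, "\n";
// Prints: <img loading=eager src="machine-rusty.jpg" loading="lazy">
```

In the first case the pattern has changed the user-visible text. In the second it has added an attribute that the browser will throw away.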

And then there is this: Say that, in this one spot, the regular expression pattern managed to understand the entire HTML syntax AND all its semantic processing rules.

That would have to be true in more places, a lot more. Thousands, over the length of your code, each doing this job in a slightly different way.

Do you really want to monitor 1575 different instances, give or take 762, for bugs that you then have to patch and track? (Ed. note: Your numbers may vary. Every time.)

Didn’t think so.

Writing intention with the HTML API

If you’re ready to try an alternative to regular expressions, you’ll want to level up your thinking. Regex makes you think about characters and spaces and quotes—very basic, string-level stuff. 

The HTML Tag Processor pulls your view up and out, to a higher-level approach that works with tag names, attribute names, and attribute values.

Take a look:

Iteration 1: Add the loading attribute to an image

function render_block( $html ) {
    $processor = new WP_HTML_Tag_Processor( $html );

    if ( $processor->next_tag( 'img' ) ) {
        $processor->set_attribute( 'loading', 'lazy' );
    }

    return $processor->get_updated_html();
}

This is the first iteration and the last—you get the job done in ONE go. It’s fast. It’s robust. And it’s secure.

Welcome to the Tag Processor

It’s the first interface in the new HTML API. It has one job: to find, and then read or alter, the attributes of specific HTML tags.

Here’s how you use it.

  • Make a new instance of the Tag Processor and add the input HTML you want to work with.
  • Call next_tag() to find and match the next tag you want.
  • Call get_attribute() and set_attribute() to read and modify attributes on the tags that match.

When you’re done, call get_updated_html() to return (or do whatever else you need to do to) the transformed output.

That’s it!
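The steps above can be sketched as one small function. This is an illustration, not the canonical way to do it: it assumes WordPress 6.2 or later is loaded (that is where WP_HTML_Tag_Processor comes from), and the name lazy_load_missing() is made up. It loops over every image rather than stopping at the first, and it reads before it writes.

```php
<?php
// Assumes WordPress 6.2+ is loaded, which provides WP_HTML_Tag_Processor.
// lazy_load_missing() is a hypothetical name for this sketch.
function lazy_load_missing( string $html ): string {
    // Step 1: hand the input HTML to a new processor.
    $processor = new WP_HTML_Tag_Processor( $html );

    // Step 2: visit every IMG tag, not only the first one.
    while ( $processor->next_tag( 'img' ) ) {
        // Step 3: get_attribute() returns null when the attribute is absent,
        // so an existing loading value (such as eager) is left alone.
        if ( null === $processor->get_attribute( 'loading' ) ) {
            $processor->set_attribute( 'loading', 'lazy' );
        }
    }

    // Step 4: collect the transformed markup.
    return $processor->get_updated_html();
}
```

Checking get_attribute() first is a design choice for this sketch: it keeps a deliberate loading=eager in place instead of overwriting it. Drop the check if you want lazy to win every time.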

Example 2: How to add a class to blocks that have a particular block attribute

First, another look at the hard way—expanding regular expressions. In case you were thinking of doing that.

Iteration 1: add class at the end of image tag

preg_replace(
    '~<img(.*?)class="([^"]*)"([^>]*)>~',
    '<img$1class="$2 wp-full-width"$3>',
    $html
);

This first iteration is kind of clear. It’s trying to add wp-full-width to the end of the `class` attribute in an image tag.

Never mind the issues around single and double quotes, and the duplicate attributes, in this example. Just pretend those things are fine.

If some other attributes have names that partially match the ones you want to work with—but not completely!—stuff already starts to break.

<img src="machine-rusty.jpg" data-image-class="black-and-white">

And the markup above turns into the markup below.

Not only has the regular expression failed to add the class, it has also garbled the application logic you need in the custom data attribute.

<img src="machine-rusty.jpg" data-image-class="black-and-white wp-full-width">
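You can reproduce that result with a few lines. This is a sketch with a made-up helper name, add_full_width_naive(). Its replacement puts class directly after $1, so the original spacing is kept and the output matches the markup shown above.

```php
<?php
// Hypothetical helper for this sketch. The pattern looks for class=" anywhere
// after <img, including at the tail end of a longer attribute name.
function add_full_width_naive( string $html ): string {
    return preg_replace(
        '~<img(.*?)class="([^"]*)"([^>]*)>~',
        '<img$1class="$2 wp-full-width"$3>',
        $html
    );
}

$input  = '<img src="machine-rusty.jpg" data-image-class="black-and-white">';
$result = add_full_width_naive( $input );

echo $result, "\n";
// Prints: <img src="machine-rusty.jpg" data-image-class="black-and-white wp-full-width">
```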

The fix typically has two steps:

  • Require a space before class with ~<img(.*?) class="([^"]*)"([^>]*)>~
  • After staring at it for maybe ten seconds, realize that attributes can be preceded by other kinds of whitespace, too: tabs, form feeds, and newlines. So the second step allows for all of them: ~<img(.*?)([ \t\f\r\n]+)class="([^"]*)"([^>]*)>~

Things are better. But the whole fix depends on an existing class attribute. No class? No fix.
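Here is a rough sketch of the whitespace-aware pattern at work. The helper name add_full_width_spaced(), the replacement string, and the sample tags are assumptions for illustration; only the pattern comes from the steps above.

```php
<?php
// Hypothetical helper: the whitespace-aware pattern, with an assumed replacement
// that puts the captured whitespace ($2) back in front of class.
function add_full_width_spaced( string $html ): string {
    return preg_replace(
        '~<img(.*?)([ \t\f\r\n]+)class="([^"]*)"([^>]*)>~',
        '<img$1$2class="$3 wp-full-width"$4>',
        $html
    );
}

// An existing class attribute: the class is appended.
$with_class = add_full_width_spaced( '<img src="a.jpg" class="photo">' );
echo $with_class, "\n";
// Prints: <img src="a.jpg" class="photo wp-full-width">

// The data attribute is safe now, but nothing is added either.
$data_only = add_full_width_spaced( '<img src="machine-rusty.jpg" data-image-class="black-and-white">' );
echo $data_only, "\n";
// Prints: <img src="machine-rusty.jpg" data-image-class="black-and-white">

// No class attribute at all: the markup comes back unchanged.
$no_class = add_full_width_spaced( '<img src="a.jpg">' );
echo $no_class, "\n";
// Prints: <img src="a.jpg">
```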

Iteration 2: externalize the logic

Because regex adds as many issues as it solves, lots of developers will extract a match, run the logic outside the expression, and then stitch the code back together. 

Sound fun?

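if ( 1 !== preg_match(
    // Fetch the attributes inside the IMG tag, along with their offset in $html.
    '~<img([^>]*)>~i',
    $html,
    $img_match,
    PREG_OFFSET_CAPTURE
) ) {
    return $html;
}

$attributes        = $img_match[1][0];
$attributes_offset = $img_match[1][1];

if ( 1 !== preg_match(
    // Find an existing class attribute and capture its value in group 2.
    '~[ \\t\\f\\r\\n]+class=([\'"])(.*?)\1~i',
    $attributes,
    $class_match,
    PREG_OFFSET_CAPTURE
) ) {
    // No class attribute yet: add one to the first IMG tag only.
    return preg_replace( '~<img([^>]*)>~i', '<img class="wp-full-width" $1>', $html, 1 );
}

$class  = $class_match[2][0];
// The class offset is relative to the attribute string, so shift it into $html.
$offset = $attributes_offset + $class_match[2][1];

$before = substr( $html, 0, $offset );
$after  = substr( $html, $offset + strlen( $class ) );

return $before . $class . " wp-full-width" . $after;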

All you wanted to do was add a class name!

But no: first you had to handle the HTML syntax. And the code will still break in way too many common HTML inputs.

Should anything be this hard?

Here’s an idea: what if you added a class by using the word `class`? (That’s what HTML uses to mean classes. Go figure. 😜)

Okay then. What will the Tag Processor do?

function add_additional_class( $html, $block ) {
    // The render_block filter passes the parsed block as an array.
    if ( ! isset( $block['attrs']['additional_class'] ) ) {
        return $html;
    }

    $processor = new WP_HTML_Tag_Processor( $html );
    if ( $processor->next_tag( 'img' ) ) {
        $processor->add_class( $block['attrs']['additional_class'] );
    }

    return $processor->get_updated_html();
}

add_filter( 'render_block', 'add_additional_class', 10, 2 );

Well, look at that.

Tag processor functions for CSS classes

The Tag Processor gives you two helper methods for CSS classes: add_class() and remove_class().

You can probably guess what they do. You can add the class you need to the tag that needs it, even if that tag already has a class attribute. (If it doesn’t, the processor will add one.)

And if the class attribute already contains the class name you’re trying to add? The Tag Processor will leave the markup alone.
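As a rough sketch, the two helpers fit naturally into a toggle. This assumes WordPress 6.2 or later is loaded; the function name set_full_width() and its boolean parameter are invented for illustration.

```php
<?php
// Assumes WordPress 6.2+ is loaded, which provides WP_HTML_Tag_Processor.
// set_full_width() is a hypothetical name for this sketch.
function set_full_width( string $html, bool $enabled ): string {
    $processor = new WP_HTML_Tag_Processor( $html );

    // Only the first IMG tag is touched, as in the example above.
    if ( $processor->next_tag( 'img' ) ) {
        if ( $enabled ) {
            // Adds the class, creating the class attribute if it is missing.
            $processor->add_class( 'wp-full-width' );
        } else {
            // Removes the class if present and leaves other classes alone.
            $processor->remove_class( 'wp-full-width' );
        }
    }

    return $processor->get_updated_html();
}
```

Calling set_full_width( $html, true ) twice in a row should give the same markup as calling it once, which is the whole point: you state the intention and the processor handles the syntax.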

How refreshing!

The HTML API produces markup that does what you need. Without a lot of gymnastics around syntax and case-sensitive this or that. Or endless bug tracking. Or naming conventions. Or anything.

When you trade in regex for the HTML API, your code will live as long as its intentions.

HTML bugs in plugins can get fixed in the HTML, without anyone touching the plugin logic. (Ed. note: Why is this amazing? It should be normal.)

Any code that uses the API will by default focus on the intention of that code, not the minutiae of its methods.

Resources to learn more

Processing HTML now or in the future? Read up on the Tag Processor in its documentation. It’s the first thing you can do with the API, but there’s more to come.

As future releases add to the API, its new capabilities will show up in the aptly named HTML Processor. So pay attention to that, and follow the coverage right here on the blog.

Props for inspiration @dmsnell and @zieladam
Props for reviews to @milana_cap @gaambo @greenshady @bph

5 responses to “The HTML API: process your tags, not your pain”

  1. just a coder

    Just a great read. I have no use for this tool right now. Despite that, your demolition of the “just use regex” argument deserves elevation.
    Thanks again!

  2. Greg Miller

    On Example 1, Iteration 1 and 2, I think there is an incorrect use of the template literals. Add a closing ” ‘ ” between the “~” and “,” on the third line. I don’t think it will change the results, but the code will look better.

    1. Dennis Snell

      Thanks Greg! We’ll get that updated. You’re right, it’s a formatting issue that occurred when preparing this post.

  3. Mark Howells-Mead

    I’m quite surprised to see the references to the use of regex to parse HTML: this was superseded by DomDocument with PHP 4.0 twenty-three years ago. Have people really still been trying to use regex to parse HTML? 😱

    I’ve been using DomDocument for years to parse HTML output with WordPress and although this has been lightning-fast for years — especially in combination with xPath — this new API seems like a good solution for less-experienced developers.

    1. Dennis Snell

      Thanks for the thoughts, Mark!

      `DOMDocument` was never really sufficient, and suffers from almost as many flaws as many RegExp approaches do. In addition, it’s a memory-heavy approach because it has to load in the full HTML document as a tree, and it inevitably rewrites the HTML it’s given. `DOMDocument` is a kind of neither-here-nor-there solution trying to be both XML and HTML. It was broken for HTML4 when it shipped, let alone HTML5.

      The HTML API is not just being developed for less-experienced developers, but also seasoned developers and Core itself. It’s designed to be low-overhead for a streaming API suitable for server work, designed to avoid changing more of the HTML than is required, and designed to quickly find and modify HTML inside a broader document.

      PHP will soon have an HTML5-compliant `DOMDocument` parser, fifteen years overdue, but even when that appears there will be value in avoiding the overhead and side-effects that interface necessarily brings.

      You may not be aware of the security and content vulnerabilities inherent in `DOMDocument`, so you might give this new API a try. It could be more convenient for you, depending on what you’re doing.
