Gets all the meta tag elements that have a ‘content’ attribute.
Parameters
$html
string (required) - The string of HTML to be parsed.
Source
private function get_meta_with_content_elements( $html ) {
	/*
	 * Parse all meta elements with a content attribute.
	 *
	 * Why first search for the content attribute rather than directly searching for name=description element?
	 * tl;dr The content attribute's value will be truncated when it contains a > symbol.
	 *
	 * The content attribute's value (i.e. the description to get) can have HTML in it and be well-formed as
	 * it's a string to the browser. Imagine what happens when attempting to match for the name=description
	 * first. Hmm, if a > or /> symbol is in the content attribute's value, then it terminates the match
	 * as the element's closing symbol. But wait, it's in the content attribute and is not the end of the
	 * element. This is a limitation of using regex. It can't determine "wait a minute this is inside of quotation".
	 * If this happens, what gets matched is not the entire element or all of the content.
	 *
	 * Why not search for the name=description and then content="(.*)"?
	 * The attribute order could be opposite. Plus, additional attributes may exist including being between
	 * the name and content attributes.
	 *
	 * Why not lookahead?
	 * Lookahead is not constrained to stay within the element. The first <meta it finds may not include
	 * the name or content, but rather could be from a different element downstream.
	 */
	$pattern = '#<meta\s' .

			/*
			 * Allows for additional attributes before the content attribute.
			 * Searches for anything other than > symbol.
			 */
			'[^>]*' .

			/*
			 * Find the content attribute. When found, capture its value (.*).
			 *
			 * Allows for (a) single or double quotes and (b) whitespace in the value.
			 *
			 * Why capture the opening quotation mark, i.e. (["\']), and then backreference,
			 * i.e. \1, for the closing quotation mark?
			 * To ensure the closing quotation mark matches the opening one. Why? Attribute values
			 * can contain quotation marks, such as an apostrophe in the content.
			 */
			'content=(["\']??)(.*)\1' .

			/*
			 * Allows for additional attributes after the content attribute.
			 * Searches for anything other than > symbol.
			 */
			'[^>]*' .

			/*
			 * \/?> searches for the closing > symbol, which can be in either /> or > format.
			 * # ends the pattern.
			 */
			'\/?>#' .

			/*
			 * These are the options:
			 * - i : case insensitive
			 * - s : allows newline characters for the . match (needed for multiline elements)
			 * - U : non-greedy matching
			 */
			'isU';

	preg_match_all( $pattern, $html, $elements );

	return $elements;
}
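Because the method is private, it cannot be called directly from outside its class. The following is a minimal sketch of how the pattern behaves, assuming the same regular expression copied into a standalone function. The function name `demo_get_meta_with_content_elements` and the sample markup are hypothetical and exist only for illustration. It also contrasts the result with a naive name-first pattern of the kind the source comments warn against.

```php
<?php
// Hypothetical standalone copy of the pattern above (the real method is private).
function demo_get_meta_with_content_elements( $html ) {
	$pattern = '#<meta\s[^>]*content=(["\']??)(.*)\1[^>]*\/?>#isU';
	preg_match_all( $pattern, $html, $elements );
	return $elements;
}

// Illustrative sample markup, not taken from the source.
$html = <<<'HTML'
<meta charset="utf-8">
<meta name="description" content="Tips & tricks: a > b, and it's fine" />
<meta content='Single "quoted" value' property="og:title">
HTML;

$elements = demo_get_meta_with_content_elements( $html );

// With the default PREG_PATTERN_ORDER flag:
// $elements[0] holds the full matched elements,
// $elements[1] holds the opening quotation mark of each content attribute,
// $elements[2] holds the captured content values.
foreach ( $elements[2] as $index => $content ) {
	echo $elements[1][ $index ], ' => ', $content, PHP_EOL;
}

// Naive name-first pattern for comparison: [^>]* stops at the > inside
// the content value, so the matched element is truncated.
preg_match( '#<meta\s[^>]*name=["\']description["\'][^>]*>#i', $html, $naive );
echo $naive[0], PHP_EOL;
```

In this sketch the `charset` element is skipped because it has no `content` attribute, while both remaining elements are matched in full, including the `>` and the apostrophe inside the first value.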
Changelog
Version | Description |
---|---|
5.9.0 | Introduced. |