Encode a code point number into the UTF-8 encoding.
Description
This encoder implements the UTF-8 encoding algorithm for converting a code point into a byte sequence. If it receives an invalid code point (zero or negative, a surrogate half in the range U+D800 to U+DFFF, or anything above U+10FFFF), it returns the Unicode Replacement Character U+FFFD (�).
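To make the validity rule and the byte-length ranges concrete, here is a minimal sketch. The helper `utf8_length_for_code_point()` is hypothetical and not part of WordPress; it only mirrors the range checks visible in the source listing on this page, returning 0 where the encoder would fall back to U+FFFD.

```php
<?php
// Hypothetical helper (not a WordPress API): reports how many UTF-8 bytes a
// code point needs, or 0 when the encoder on this page would return U+FFFD.
function utf8_length_for_code_point( int $code_point ): int {
	// Same pre-check as the documented method: zero or negative values,
	// surrogate halves, and anything above U+10FFFF are rejected.
	if (
		$code_point <= 0 ||
		( $code_point >= 0xD800 && $code_point <= 0xDFFF ) ||
		$code_point > 0x10FFFF
	) {
		return 0;
	}

	if ( $code_point <= 0x7F ) {
		return 1; // ASCII range.
	}

	if ( $code_point <= 0x7FF ) {
		return 2;
	}

	if ( $code_point <= 0xFFFF ) {
		return 3;
	}

	return 4; // U+10000 through U+10FFFF.
}

echo utf8_length_for_code_point( 0x1F170 ), "\n"; // 4
echo utf8_length_for_code_point( 0xD83C ), "\n";  // 0 (surrogate half)
```

Note that, following the source's `<= 0` check, U+0000 is treated as invalid here as well.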
Example:
'🅰' === WP_HTML_Decoder::code_point_to_utf8_bytes( 0x1f170 );
// Half of a surrogate pair is an invalid code point.
'�' === WP_HTML_Decoder::code_point_to_utf8_bytes( 0xd83c );
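The two return values in the example can also be inspected at the byte level. The sketch below does not call WordPress; it assumes PHP 7.0 or later and uses the language's `"\u{...}"` string escape to produce the same characters, then prints their bytes with `bin2hex()`.

```php
<?php
// "\u{...}" escapes (PHP 7.0+) expand to the UTF-8 bytes of a code point.
$letter      = "\u{1F170}"; // The character from the first example line.
$replacement = "\u{FFFD}";  // The Unicode Replacement Character.

echo bin2hex( $letter ), "\n";      // f09f85b0 (four bytes)
echo bin2hex( $replacement ), "\n"; // efbfbd (three bytes)
```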
See also
- https://www.rfc-editor.org/rfc/rfc3629: For the UTF-8 standard.
Parameters
$code_point
int (required): Which code point to convert.
Source
public static function code_point_to_utf8_bytes( $code_point ): string {
	// Pre-check to ensure a valid code point.
	if (
		$code_point <= 0 ||
		( $code_point >= 0xD800 && $code_point <= 0xDFFF ) ||
		$code_point > 0x10FFFF
	) {
		return '�';
	}

	if ( $code_point <= 0x7F ) {
		return chr( $code_point );
	}

	if ( $code_point <= 0x7FF ) {
		$byte1 = chr( ( $code_point >> 6 ) | 0xC0 );
		$byte2 = chr( $code_point & 0x3F | 0x80 );

		return "{$byte1}{$byte2}";
	}

	if ( $code_point <= 0xFFFF ) {
		$byte1 = chr( ( $code_point >> 12 ) | 0xE0 );
		$byte2 = chr( ( $code_point >> 6 ) & 0x3F | 0x80 );
		$byte3 = chr( $code_point & 0x3F | 0x80 );

		return "{$byte1}{$byte2}{$byte3}";
	}

	// Any values above U+10FFFF are eliminated above in the pre-check.
	$byte1 = chr( ( $code_point >> 18 ) | 0xF0 );
	$byte2 = chr( ( $code_point >> 12 ) & 0x3F | 0x80 );
	$byte3 = chr( ( $code_point >> 6 ) & 0x3F | 0x80 );
	$byte4 = chr( $code_point & 0x3F | 0x80 );

	return "{$byte1}{$byte2}{$byte3}{$byte4}";
}
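As a worked trace of the three-byte branch above, the following sketch applies the same shifts and masks to a single code point and prints the resulting bytes. U+20AC (the euro sign) is an arbitrary illustrative choice, not something taken from the WordPress source. The extra parentheses are only for readability: in PHP `&` binds more tightly than `|`, so `$code_point & 0x3F | 0x80` already evaluates as `( $code_point & 0x3F ) | 0x80`.

```php
<?php
// Illustrative input: U+20AC, which falls in the 0x800..0xFFFF range.
$code_point = 0x20AC;

// Top 4 bits placed behind the 1110xxxx lead-byte prefix.
$byte1 = ( $code_point >> 12 ) | 0xE0;
// Middle 6 bits placed behind the 10xxxxxx continuation prefix.
$byte2 = ( ( $code_point >> 6 ) & 0x3F ) | 0x80;
// Low 6 bits placed behind the 10xxxxxx continuation prefix.
$byte3 = ( $code_point & 0x3F ) | 0x80;

printf( "%02X %02X %02X\n", $byte1, $byte2, $byte3 ); // E2 82 AC

// Assemble the bytes into a string, as the method does with chr().
$encoded = chr( $byte1 ) . chr( $byte2 ) . chr( $byte3 );
```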
Changelog
Version | Description |
---|---|
6.6.0 | Introduced. |