Title: wp_scrub_utf8
Published: February 24, 2026

---

# wp_scrub_utf8( string $text ): string

Replaces ill-formed UTF-8 byte sequences with the Unicode Replacement Character.

## Description

Knowing what to do in the presence of text encoding issues can be complicated. This
function replaces invalid spans of bytes to neutralize any corruption that may be
there and prevent it from causing further problems downstream.

However, it’s not always ideal to replace those bytes. In some settings it may be
best to leave the invalid bytes in the string so that downstream code can handle
them in a specific way. Replacing the bytes too early, like escaping for HTML too
early, can introduce other forms of corruption and data loss.

When in doubt, use this function to replace spans of invalid bytes.
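One way to avoid replacing bytes too early is to keep the raw string untouched and scrub only the copy that is about to be displayed. The following is a minimal sketch under stated assumptions, not a recommended core pattern: `example_prepare_for_display()` is a hypothetical helper, the mbstring extension is assumed to be loaded, and the `function_exists()` stand-in only mirrors the Source listing on this page so the snippet can run outside WordPress.

```php
// Stand-in so the sketch runs outside WordPress; it mirrors the Source listing on this page.
if ( ! function_exists( 'wp_scrub_utf8' ) ) {
	function wp_scrub_utf8( $text ) {
		$prev = mb_substitute_character();
		mb_substitute_character( 0xFFFD );
		$scrubbed = mb_scrub( $text, 'UTF-8' );
		mb_substitute_character( $prev );

		return $scrubbed;
	}
}

/**
 * Hypothetical helper: leaves the raw bytes alone and scrubs only a display copy.
 */
function example_prepare_for_display( string $raw ): array {
	$is_valid = mb_check_encoding( $raw, 'UTF-8' );

	return array(
		'raw'       => $raw,                                     // Original bytes, still available downstream.
		'display'   => $is_valid ? $raw : wp_scrub_utf8( $raw ), // Safe to output.
		'was_valid' => $is_valid,
	);
}

// "\xC3\xA9" is a valid "é"; the lone "\xC0" byte is never valid UTF-8.
$result = example_prepare_for_display( "caf\xC3\xA9 \xC0 menu" );
```

Keeping both values lets later code decide how to treat the invalid bytes, while output paths always receive valid UTF-8.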

Replacement follows the “maximal subpart” algorithm for secure and interoperable
strings. This can lead to sequences of multiple replacement characters in a row.

Example:

```php
// Valid strings come through unchanged.
'test' === wp_scrub_utf8( 'test' );

// Invalid sequences of bytes are replaced.
$invalid = "the byte \xC0 is never allowed in a UTF-8 string.";
"the byte \u{FFFD} is never allowed in a UTF-8 string." === wp_scrub_utf8( $invalid );
'the byte � is never allowed in a UTF-8 string.' === wp_scrub_utf8( $invalid );

// Maximal subparts are replaced individually.
'.�.' === wp_scrub_utf8( ".\xC0." );              // C0 is never valid.
'.�.' === wp_scrub_utf8( ".\xE2\x8C." );          // Missing A3 at end.
'.��.' === wp_scrub_utf8( ".\xE2\x8C\xE2\x8C." ); // Maximal subparts replaced separately.
'.��.' === wp_scrub_utf8( ".\xC1\xBF." );         // Overlong sequence.
'.���.' === wp_scrub_utf8( ".\xED\xA0\x80." );    // Surrogate half.
```

Note! The Unicode Replacement Character is itself a Unicode character (U+FFFD).

Once a span of invalid bytes has been replaced by one, it will not be possible to
know whether the replacement character was originally intended to be there or if
it is the result of scrubbing bytes. It is ideal to leave replacement for display
only, but some contexts (e.g. generating XML or passing data into a large language
model) require valid input strings.
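Because the distinction is lost after scrubbing, code that needs it has to record it at scrubbing time. One possible approach, sketched here as an assumption rather than an established WordPress pattern, is to count the replacement characters before and after the call. `example_scrub_and_count()` is a hypothetical helper, and the `function_exists()` stand-in only mirrors the Source listing on this page so the snippet can run outside WordPress with mbstring loaded.

```php
// Stand-in so the sketch runs outside WordPress; it mirrors the Source listing on this page.
if ( ! function_exists( 'wp_scrub_utf8' ) ) {
	function wp_scrub_utf8( $text ) {
		$prev = mb_substitute_character();
		mb_substitute_character( 0xFFFD );
		$scrubbed = mb_scrub( $text, 'UTF-8' );
		mb_substitute_character( $prev );

		return $scrubbed;
	}
}

/**
 * Hypothetical helper: returns the scrubbed text together with the number of
 * replacement characters that scrubbing introduced.
 */
function example_scrub_and_count( string $text ): array {
	$replacement = "\u{FFFD}";

	// U+FFFD characters that were already present in the input.
	$before   = substr_count( $text, $replacement );
	$scrubbed = wp_scrub_utf8( $text );
	$after    = substr_count( $scrubbed, $replacement );

	return array(
		'text'     => $scrubbed,
		'replaced' => $after - $before,
	);
}

// One intentional U+FFFD plus one invalid byte.
$report = example_scrub_and_count( "kept \u{FFFD}, broken \xC0." );
```

The count says how many spans were replaced, not where they were, so callers that need positions would have to track them separately.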

### See also

 * [https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G40630](https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G40630)

## Parameters

`$text` string required

String which is assumed to be UTF-8 but may contain invalid sequences of bytes.

## Return

string Input text with invalid sequences of bytes replaced with the Unicode Replacement
Character.

## Source

```php
function wp_scrub_utf8( $text ) {
	/*
	 * While it looks like setting the substitute character could fail,
	 * the internal PHP code will never fail when provided a valid
	 * code point as a number. In this case, there’s no need to check
	 * its return value to see if it succeeded.
	 */
	$prev_replacement_character = mb_substitute_character();
	mb_substitute_character( 0xFFFD );
	$scrubbed = mb_scrub( $text, 'UTF-8' );
	mb_substitute_character( $prev_replacement_character );

	return $scrubbed;
}
```

[View all references](https://developer.wordpress.org/reference/files/wp-includes/utf8.php/)
[View on Trac](https://core.trac.wordpress.org/browser/tags/6.9.4/src/wp-includes/utf8.php#L109)
[View on GitHub](https://github.com/WordPress/wordpress-develop/blob/6.9.4/src/wp-includes/utf8.php#L109-L122)
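The listing above saves the current mbstring substitute character and restores it after scrubbing, so the global setting should look the same to the caller before and after the call. The following sketch illustrates that behavior under the assumption that the mbstring extension is loaded. The `function_exists()` stand-in is a copy of the listing so the snippet can run outside WordPress, and the variable names are illustrative only.

```php
// Stand-in copied from the Source listing so the sketch runs outside WordPress.
if ( ! function_exists( 'wp_scrub_utf8' ) ) {
	function wp_scrub_utf8( $text ) {
		$prev_replacement_character = mb_substitute_character();
		mb_substitute_character( 0xFFFD );
		$scrubbed = mb_scrub( $text, 'UTF-8' );
		mb_substitute_character( $prev_replacement_character );

		return $scrubbed;
	}
}

// Pick a site-wide substitute character: "?" (U+003F).
mb_substitute_character( 0x3F );

$setting_before = mb_substitute_character();
$scrubbed       = wp_scrub_utf8( ".\xC0." );
$setting_after  = mb_substitute_character();

// The scrubbed string uses U+FFFD, while the global setting is still "?".
$setting_preserved = ( $setting_before === $setting_after );
```

Other mbstring calls that rely on the substitute character, such as `mb_convert_encoding()`, are therefore not affected by an earlier call to this function.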

## Related

| Used by | Description |
| --- | --- |
| [wp_check_invalid_utf8()](https://developer.wordpress.org/reference/functions/wp_check_invalid_utf8/) `wp-includes/formatting.php` | Checks for invalid UTF8 in a string. |

## Changelog

| Version | Description |
| --- | --- |
| [6.9.0](https://developer.wordpress.org/reference/since/6.9.0/) | Introduced. |
