PHP regular expressions and UTF-8

Perl-compatible regular expression functions in PHP can properly work with Unicode strings. Just add /u modifier to turn on UTF-8 support in preg_replace, preg_match, preg_match_all, preg_split and other PCRE (preg) functions. This way you can parse strings with national characters. For example:

$clean = preg_replace('/\s\s+/u', ' ', $dirty);

If used without /u modifier this code damages UTF-8 encoded strings by replacing national character bytes improperly interpreted as whitespace characters. This and many other problems are caused by improper interpretation of every byte as ASCII character which is not always true for UTF-8.

The modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
I found this tip as well as many other useful info on regular-expressions.info. It’s not easy to find it in the PHP documentation but it’s actually hidden here.

2 thoughts on “PHP regular expressions and UTF-8”

Leave a Comment