PHP regular expression functions fail on GoDaddy shared hosting

While testing a crawler script on GoDaddy shared hosting, I noticed that the script was quitting without any notice at random points. Both web and CLI execution modes were affected. The script had previously been tested on an XAMPP server, where it worked fine.

Later, I found that the script always quits after calling one of the regular expression (PCRE) functions, such as preg_replace, preg_match and preg_match_all. The script called them hundreds of times, and one of the calls turned out to be fatal.

UPDATE: It actually appears to be a more general problem with operations on long strings. Still, switching to the multi-byte string regular expression functions (the mb_ereg family) helped in most scenarios.
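
When a preg call fails without crashing the process, it does so silently: preg_replace returns null and preg_match returns false. A minimal sketch of how such failures could be surfaced is shown here. The helper name safe_preg_replace is hypothetical, the code assumes PHP 7 or later, and it only catches errors that PCRE itself reports (backtrack or recursion limits, bad UTF-8), not a hard crash of the PHP process like the one described above.

```php
<?php
// Hypothetical helper (not part of PHP): wraps preg_replace and turns
// a silent failure into an exception that can be logged.
function safe_preg_replace(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);
    if ($result === null) {
        // preg_last_error() returns one of the PREG_*_ERROR constants,
        // for example PREG_BACKTRACK_LIMIT_ERROR or PREG_BAD_UTF8_ERROR.
        throw new RuntimeException(
            'preg_replace failed, PCRE error code ' . preg_last_error()
        );
    }
    return $result;
}

// Usage: a successful call simply returns the replaced string.
echo safe_preg_replace('/\s\s+/', ' ', "a   b"), "\n";
```

Logging the error code for every failed call at least shows which call and which input triggered the problem.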


PHP regular expressions and UTF-8

Perl-compatible regular expression functions in PHP can work properly with Unicode strings. Just add the /u modifier to turn on UTF-8 support in preg_replace, preg_match, preg_match_all, preg_split and other PCRE (preg) functions. This way you can parse strings with national characters. For example:

$clean = preg_replace('/\s\s+/u', ' ', $dirty);

Without the /u modifier, this code can damage UTF-8 encoded strings: some bytes of multi-byte national characters are misinterpreted as whitespace and get replaced. This and many similar problems come from treating every byte as a separate single-byte character, which does not hold for UTF-8.
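
The difference between byte-wise and character-wise matching can be seen in a small sketch. The sample string is my own illustration; it is written with explicit byte escapes so the result does not depend on the encoding of the source file.

```php
<?php
// "café" with é written as its two UTF-8 bytes, C3 A9.
$word = "caf\xC3\xA9";

// Without /u the dot matches one byte at a time;
// with /u it matches one whole character at a time.
$bytes = preg_match_all('/./', $word);   // 5 matches (bytes)
$chars = preg_match_all('/./u', $word);  // 4 matches (characters)
echo $bytes, ' ', $chars, "\n";

// Collapsing runs of whitespace with /u leaves the multi-byte character intact.
$clean = preg_replace('/\s\s+/u', ' ', "caf\xC3\xA9   au \t lait");
echo $clean, "\n";

// With /u the subject must be valid UTF-8; a truncated sequence makes
// preg_match return false instead of a match count.
$broken = preg_match('/./u', "\xC3");
var_dump($broken === false && preg_last_error() === PREG_BAD_UTF8_ERROR);
```

The last check matters in practice: with /u, input that is not valid UTF-8 makes the function fail rather than match, so the return value is worth testing.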

The modifier is available since PHP 4.1.0 on Unix and since PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
I found this tip, along with a lot of other useful information, on regular-expressions.info. It is not easy to find in the PHP documentation, but it is there, tucked away on the pattern modifiers page.