While PHP itself doesn't know about different character sets and treats all characters as being one byte long, the PCRE engine understands UTF-8. There's also
mb_ereg_match(), but I prefer the PCRE functions (preg_...). Here's a piece of code to see if your PHP was compiled with PCRE UTF-8 support.

    $str = 'ありがとう';
    echo "strlen('$str') = " . strlen($str) . "\n";
    echo "preg_match_all('/./', '$str', \$matches) = " . preg_match_all('/./', $str, $matches) . "\n";
    echo "preg_match_all('/(*UTF8)./u', '$str', \$matches) = " . preg_match_all('/(*UTF8)./u', $str, $matches) . "\n";
This outputs the correct length of 5 characters when you start your regular expression with (*UTF8) and use the /u modifier:

    strlen('ありがとう') = 15
    preg_match_all('/./', 'ありがとう', $matches) = 15
    preg_match_all('/(*UTF8)./u', 'ありがとう', $matches) = 5
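If you want this check in reusable form, here is a minimal sketch. The function names pcre_utf8_works() and utf8_length() are made up for illustration, and the sketch assumes PHP 7 or later, a source file saved as UTF-8, and that the /u modifier on its own is enough to put the preg_ functions into UTF-8 mode (so it leaves out the (*UTF8) prefix).

```php
<?php
// Hypothetical helpers for illustration; the names are not from any library.

// Returns true when preg_* accepts the /u modifier and treats a multibyte
// character as a single unit.
function pcre_utf8_works(): bool
{
    // 'あ' is 3 bytes in UTF-8, so '^.$' can only match it in UTF-8 mode.
    // The @ silences the warning raised if /u is not supported.
    return @preg_match('/^.$/u', 'あ') === 1;
}

// Counts characters (code points) instead of bytes.
// The s modifier lets "." match newlines as well.
function utf8_length(string $str): int
{
    $count = preg_match_all('/./us', $str);
    if ($count === false) {
        // preg_match_all() returns false on error, e.g. malformed UTF-8 input.
        throw new InvalidArgumentException('preg_match_all failed (invalid UTF-8?)');
    }
    return $count;
}

var_dump(pcre_utf8_works());
echo utf8_length('ありがとう'), "\n"; // 5
echo strlen('ありがとう'), "\n";      // 15
```

Throwing on a false return is a design choice of this sketch: with /u, PCRE rejects malformed UTF-8, and silently treating that as a length of 0 would hide the problem.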
You can also use Unicode character properties to match only letters (in any language), for example:

    // The WRONG way to do it, only works for ASCII:
    preg_match_all('/[a-zA-Z]/', $str, $matches);

    // This way it works with any language:
    preg_match_all('/(*UTF8)\p{L}/u', $str, $matches);
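To make the difference between the two patterns concrete, here is a small sketch that pulls runs of letters out of a mixed-script string. The sample string and the function names ascii_words() and unicode_words() are my own, and the example again assumes that /u alone switches on UTF-8 mode and that the file is saved as UTF-8.

```php
<?php
// Hypothetical helpers for illustration only.

// ASCII-only: multibyte characters are seen as raw bytes and never match.
function ascii_words(string $str): array
{
    preg_match_all('/[a-zA-Z]+/', $str, $matches);
    return $matches[0];
}

// Unicode-aware: \p{L} matches a letter in any script when /u is set.
function unicode_words(string $str): array
{
    preg_match_all('/\p{L}+/u', $str, $matches);
    return $matches[0];
}

$str = 'Hello, ありがとう! Привет 123';

print_r(ascii_words($str));   // only 'Hello'
print_r(unicode_words($str)); // 'Hello', 'ありがとう', 'Привет'
```

Digits, punctuation, and spaces are not in the L (letter) category, so they act as separators in both versions.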
You can see other Unicode character properties in the PHP Manual.