How To Detect UTF-8 Multi Byte Characters

Some time ago, I started playing with different encodings and character sets. One day I was looking at a string that mixed characters from several scripts. All I wanted to do was loop over the string and print out its characters. I thought it would be enough to serve the web page with UTF-8 as its charset for the characters to be output correctly. I discovered that it was not enough.

In this post I describe how to output the characters of a multi-byte string to a web page, one at a time. Along the way I cover the pitfall I ran into and a way around it.

Now, I am not a PHP expert, and I will be more than happy to hear your comments about this post.

Consider the following string. It contains characters from several scripts. At this stage, we do not yet know how many bytes each character occupies. Why we need this information, I will explain later in the post.

€Abبώиב¥£€¢₡₢₣₤₥₦§₧₨₩₪₫₭₮漢Ä©óíßä

The following snippet of PHP code loops over that string:

$str = "€Abبώиב¥£€¢₡₢₣₤₥₦§₧₨₩₪₫₭₮漢Ä©óíßä";

for ($index = 0; $index < strlen($str); $index++) {
	echo "char: ".$str[$index]."<br />";
}

Although the encoding of the page was set to UTF-8 (a multi-byte encoding), most of the characters were not printed properly. The only characters that came out correctly were “A” and “b”, letters of the Latin alphabet.

So what has happened?

The answer is that PHP treats a string as a sequence of bytes, so $str[$index] returns a single byte rather than a single character. Some of the characters in the above string are encoded as a single byte, and some as several. In UTF-8, the ASCII characters, which include the unaccented Latin letters, take one byte each, while every other character takes two, three, or four bytes.

When the loop printed the string, it output every byte as if it were a complete character, so only the genuinely single-byte characters were printed properly (in this case “A” and “b”).

To get around this problem, we can check how many bytes each character occupies before printing it.
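The difference between bytes and characters can be seen directly. The following is a minimal sketch; the euro sign is written as explicit byte escapes so that the result does not depend on how the source file itself is saved.

```php
<?php
// "\xE2\x82\xAC" is the UTF-8 byte sequence for the euro sign.
$euro = "\xE2\x82\xAC";

echo strlen($euro), "\n";      // 3, because strlen() counts bytes
echo bin2hex($euro), "\n";     // e282ac, the three bytes in hex
echo bin2hex($euro[0]), "\n";  // e2, indexing returns only the first byte
echo strlen("A"), "\n";        // 1, ASCII letters are a single byte
```

Printing $euro[0] on its own sends an incomplete sequence to the browser, which is why the character does not display.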

The following table (taken from http://en.wikipedia.org/wiki/UTF-8) shows the byte length of a character according to the value of its first byte.

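UTF-8 byte sequence length by value of the first byte

First byte (decimal)    First byte (hex)    Bytes in sequence
0-127                   0x00-0x7F           1
194-223                 0xC2-0xDF           2
224-239                 0xE0-0xEF           3
240-244                 0xF0-0xF4           4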
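The first-byte ranges (0-127, 194-223, 224-239 and 240-244) can also be wrapped in a small helper. This is only a sketch: utf8SequenceLength() is a hypothetical name, not a built-in PHP function, and it assumes its argument is a value in the range 0-255, such as one returned by ord().

```php
<?php
// Hypothetical helper, not part of PHP: returns how many bytes the UTF-8
// sequence starting with the given byte value occupies, or 0 when the byte
// cannot start a sequence (continuation bytes 128-191, and the values
// 192-193 and 245-255, which never appear in valid UTF-8).
function utf8SequenceLength(int $byte): int
{
    if ($byte <= 127) {
        return 1;
    }
    if ($byte >= 194 && $byte <= 223) {
        return 2;
    }
    if ($byte >= 224 && $byte <= 239) {
        return 3;
    }
    if ($byte >= 240 && $byte <= 244) {
        return 4;
    }
    return 0;
}

echo utf8SequenceLength(ord("A")), "\n"; // 1
echo utf8SequenceLength(0xE2), "\n";     // 3 (first byte of the euro sign)
echo utf8SequenceLength(0x82), "\n";     // 0 (a continuation byte)
```

Returning 0 for bytes that cannot start a sequence lets the caller decide whether to skip them or treat the input as malformed.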

Consider the following code. To find the byte length of each character, I use the ord() function to get the numeric value of its first byte. To print each character in full, I use the substr() function, whose length parameter is set to the character's size in bytes.

<?php
$str = "€Abبώиב¥£€¢₡₢₣₤₥₦§₧₨₩₪₫₭₮漢Ä©óíßä";

for ($index = 0; $index < strlen($str); $index++) {

	// Get the numeric value (0-255) of the byte at this position
	$byte = ord($str[$index]);

	if ($byte <= 127) {
		$length = 1;
		echo "1-byte char: "
			.substr($str, $index, $length)."<br />";
	}
	else if ($byte >= 194 && $byte <= 223) {
		$length = 2;
		echo "2-byte char: "
			.substr($str, $index, $length)."<br />";
	}
	else if ($byte >= 224 && $byte <= 239) {
		$length = 3;
		echo "3-byte char: "
			.substr($str, $index, $length)."<br />";
	}
	else if ($byte >= 240 && $byte <= 244) {
		$length = 4;
		echo "4-byte char: "
			.substr($str, $index, $length)."<br />";
	}
	// Continuation bytes (128-191) match no branch above, so the loop
	// passes over them without printing anything.
}
?>

The output is as follows:

3-byte char: €
1-byte char: A
1-byte char: b
2-byte char: ب
2-byte char: л
2-byte char: ώ
2-byte char: и
2-byte char: ב
2-byte char: ¥
2-byte char: £
3-byte char: €
2-byte char: ¢
3-byte char: ₣
3-byte char: ₤
2-byte char: §
3-byte char: ₧
3-byte char: ₪
3-byte char: ₫
3-byte char: 漢
2-byte char: Ä
2-byte char: ©
2-byte char: ó
2-byte char: í
2-byte char: ß
2-byte char: ä
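
As an alternative to checking the first byte by hand, the splitting can be left to preg_split() with the u modifier, which makes the pattern treat the string as UTF-8. The following is a minimal sketch, assuming PHP 7 or later and a PCRE build with UTF-8 support (the usual case); splitUtf8Chars() is a hypothetical name chosen for illustration, and the sample string is spelled with byte escapes so the result does not depend on the file's own encoding.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
// preg_split() returns false when the input is not valid UTF-8, in which
// case this sketch simply returns an empty array.
function splitUtf8Chars(string $str): array
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? [] : $chars;
}

// Euro sign (3 bytes), "A", "b", and A with diaeresis (2 bytes).
$sample = "\xE2\x82\xAC" . "Ab" . "\xC3\x84";

foreach (splitUtf8Chars($sample) as $char) {
    // strlen() on a single character gives its size in bytes
    echo strlen($char), "-byte char: ", $char, "<br />";
}
```

When the mbstring extension is available, mb_substr() and, from PHP 7.4, mb_str_split() offer similar character-aware access.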