25 8 / 2013

Perl UTF-8 crash course

I keep seeing perl programmers misunderstand perl’s very simple (and sometimes buggy but easily fixable) handling of Unicode strings.

I confess I had the same misunderstanding until 6-7 years ago, and I don’t want anyone to repeat the same mistake. Forget what you know for 5 minutes, and take this simple course.

1. print($a, $b) and print($a . $b);

Let’s forget, for a moment, about utf-8 flags, operator overloading and all other crazy stuff.

If you see print($a, $b) and print($a . $b), do you think they print the same output or different output?

> perl -le 'print "foo", "bar"'
foobar

> perl -le 'print "foo". "bar"'
foobar

They should print the same thing; anything else would be very confusing. Perl agrees.

2. Latin-1 by default

However, run this:

a) > perl -Mutf8 -MEncode -le \
        'print "テスト", encode_utf8("テスト")'
Wide character in print at -e line 1.
テストテスト

b) > perl -Mutf8 -MEncode -le \
        'print "テスト". encode_utf8("テスト")'
Wide character in print at -e line 1.
テストãã¹ã

Whoa, now you pass the same arguments to the comma and dot versions and get different results. That’s confusing, and one of them must be wrong. Which do you think is the correct behavior?

Some experts might skip the details and think “Oh yeah, utf-8 flags and auto-upgrades are happening in b), so that’s the bug”. If you’re one of them, you know perl very well, but maybe so well that you think backwards. To put it another way, you’re wrong.

You see garbage in the console with b). Does that mean a) is the “correct behavior” in perl?

No, b) is the correct behavior. The encode_utf8("テスト") part hands perl a 9-character byte string, and perl treats it as latin-1 encoded. So perl displays those 9 characters correctly in the console (although it looks like garbage to you). Perl handles string literals as latin-1 by default, and that is what you asked for.

The behavior in a) represents perl’s bug (due to historical reasons) where it changes its default output encoding based on the character range of its argument.

use utf8;
use Encode;
print "テスト";              # <- Wide characters, print them in UTF-8
print encode_utf8("テスト"); # <- Latin-1 characters, print them in Latin-1!

perl is printing two different strings in different encodings, which accidentally emit the exact same octet sequence.
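To make the "two different strings" point concrete, here is a minimal sketch that only counts characters and never prints the strings themselves. The variable names are made up for illustration, and it assumes the script file is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                            # the string literal below is UTF-8 source text
use Encode qw(encode_utf8);

my $chars  = "テスト";               # 3 characters, all above 0xFF
my $octets = encode_utf8($chars);    # 9 characters, all in the 0x80-0xFF range

# The dot concatenates characters: 3 wide ones plus 9 latin-1 ones.
my $joined = $chars . $octets;

# Encoding the result as UTF-8 turns each of the 9 latin-1 characters
# into 2 octets, which is the "garbage" seen in b).
my $wire = encode_utf8($joined);

my $hex = join ' ', map { sprintf '%02X', ord } split //, $octets;

printf "chars=%d octets=%d joined=%d wire=%d\n",
    length($chars), length($octets), length($joined), length($wire);
print "$hex\n";
# prints "chars=3 octets=9 joined=12 wire=27"
# prints "E3 83 86 E3 82 B9 E3 83 88"
```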

You can fix this behavior by specifying the output encoding with binmode, or with the command line flag -C (essentially binmoding utf8 on STDOUT):

c) > perl -C -Mutf8 -MEncode -le \
        'print "テスト", encode_utf8("テスト")'
テストãã¹ã

d) > perl -C -Mutf8 -MEncode -le \
        'print "テスト". encode_utf8("テスト")'
テストãã¹ã

Now both the comma version and the dot version “correctly” display the characters.
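The same effect can be sketched inside a script by putting an encoding layer on the handle. To keep the result checkable, this example writes to in-memory filehandles instead of STDOUT; in a real script the equivalent would be binmode(STDOUT, ':encoding(UTF-8)'). The variable names are illustrative only, and the file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8);

my $chars  = "テスト";
my $octets = encode_utf8($chars);

# Comma version, written through an explicit UTF-8 layer.
open my $comma_fh, '>:encoding(UTF-8)', \my $comma_out or die "open: $!";
print {$comma_fh} $chars, $octets;
close $comma_fh;

# Dot version, written through the same kind of layer.
open my $dot_fh, '>:encoding(UTF-8)', \my $dot_out or die "open: $!";
print {$dot_fh} $chars . $octets;
close $dot_fh;

# With the encoding stated up front, both versions emit identical octets
# and no "Wide character" warning is raised.
printf "same=%s bytes=%d\n",
    ($comma_out eq $dot_out ? 'yes' : 'no'), length($comma_out);
# prints "same=yes bytes=27"
```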

3. Latin-1 and ASCII

Now let’s step back. What if you print them without any wide characters?

e) > perl -Mutf8 -MEncode -MData::Dump=dump -le \
        'print dump("テスト"). encode_utf8("テスト")' 
"\x{30C6}\x{30B9}\x{30C8}"テスト

f) > perl -C -Mutf8 -MEncode -MData::Dump=dump -le \
        'print dump("テスト"). encode_utf8("テスト")'
"\x{30C6}\x{30B9}\x{30C8}"ãã¹ã

Again, both commands print the same values with the same code (both using the dot), and yet you get different output.

As we saw previously, perl printing latin-1 strings in latin-1 encoding, as in e), is a bug, and you get the correct result with the -C flag in f).

Summary

If you print octets to a filehandle without any encoding specified (STDOUT in this case) and expect the original encoding to be preserved, you’re relying on perl’s bug. You see the original octets only by accident: perl’s bug and the mismatch with your terminal’s encoding cancel each other out, so what looks right is actually an error.

Perl handles string literals as latin-1 strings, and however annoying it might be, you have to deal with it. Dealing with it means you tell perl which encoding the octets are in, with decode.
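As a rough sketch of what "dealing with it" looks like: decode octets as they come in, work on characters, and encode again on the way out. The input here is a hard-coded byte string standing in for data read from a file or socket, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 octets for "テスト", as they might arrive from a file or socket.
my $input = "\xE3\x83\x86\xE3\x82\xB9\xE3\x83\x88";

# Tell perl which encoding the octets are in.
my $text = decode('UTF-8', $input);

# Character-level operations now behave as expected.
my $count = length $text;
my $first = sprintf 'U+%04X', ord substr($text, 0, 1);

# Encode explicitly when the text leaves the program.
my $output = encode('UTF-8', $text . '!');

printf "count=%d first=%s out_bytes=%d\n", $count, $first, length($output);
# prints "count=3 first=U+30C6 out_bytes=10"
```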

The bug is not the implicit latin-1 to utf-8 upgrade. Implicit upgrade works as expected, and that’s what you asked for. The way to avoid implicit upgrade is not to avoid concatenation (see a), where it uses a comma instead of a dot), nor to escape the other variable (see e)). That just hides the issue and does not fix it.

The only situation where implicit upgrade causes a “bug”, however, is when you really want to print a byte stream to a raw file handle, e.g. when you create JPEG binary data and print it to a local file. If you build the whole JPEG data with wide characters in its EXIF area, perl suddenly treats the whole thing as a latin-1 string and generates UTF-8 data out of it.

Because there’s no way to tell perl that a given string holds octets rather than characters (I wish there were), you have to make sure everything printed to the raw handle is properly encoded. Mixing in wide characters without encoding them first is a bug on your part, not perl’s.
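Here is a small sketch of that rule. The 4-byte header only stands in for real binary data (it is not a valid JPEG), the names are invented for the example, and the file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8);

my $header  = "\xFF\xD8\xFF\xE1";   # stand-in binary octets, for illustration only
my $comment = "テスト";             # characters, not octets

# Wrong: wide characters leak into the binary string.
my $wrong = $header . $comment;
# When this is written out as UTF-8, every header octet above 0x7F doubles.
my $wrong_on_disk = encode_utf8($wrong);

# Right: encode the text first, so every piece is octets.
my $right = $header . encode_utf8($comment);

# An in-memory handle stands in for the raw file handle.
open my $fh, '>:raw', \my $file or die "open: $!";
print {$fh} $right;
close $fh;

printf "wrong=%d right=%d wide_in_right=%s\n",
    length($wrong_on_disk), length($file),
    ($right =~ /[^\x00-\xFF]/ ? 'yes' : 'no');
# prints "wrong=17 right=13 wide_in_right=no"
```

In the wrong version the leading \xFF comes out as \xC3\xBF, which is exactly the corruption described above.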

You’ll see “Wide character in print…” if you don’t do it correctly. There’s a reason for the warning.

See also

Unicode and Perl: maybe coincidentally, Dave Cross saw similar discussions and posted a simpler version of this quiz on his blog.
