Introducing utf8::all

Perl programmers are probably all aware of the utf8 pragma, which tells Perl that your source code is written in UTF-8. It is also a stumbling block for new programmers, who might think that utf8 makes filehandles use UTF-8 by default, automagically decodes incoming data as UTF-8, and ensures all outgoing data is UTF-8 as well. Sadly, that's not the case.

However, one of the great things about perl5i is that it turns on Unicode. All of it. UTF-8 source code, yes, but it also sets the default IO layers (aka "disciplines", just to be extra fancy) so your opens automatically behave like:

open my $fh, '<:encoding(UTF-8)', 'file-to-read';

It also decodes @ARGV from UTF-8. These are really useful things to do by default so you never have to think or worry about them, and Michael Schwern deserves credit for making it happen. I've split that code (which is a surprisingly small amount) into a new standalone pragma, utf8::all.
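To make that concrete, here is a rough sketch of doing the same three things by hand with core modules. It is not the actual implementation of utf8::all, just an approximation of the behaviour described above; the sample string and the temporary file are made up for illustration.

```perl
use strict;
use warnings;

use utf8;                              # the source code itself is UTF-8
use open qw(:std :encoding(UTF-8));    # default layers for open(), plus STDIN/STDOUT/STDERR
use Encode qw(decode);
use File::Temp qw(tempfile);

# Turn the raw bytes of the command-line arguments into character strings.
@ARGV = map { decode('UTF-8', $_) } @ARGV;

# Illustrative data only: 8 characters, 9 bytes once encoded as UTF-8.
my $text = "Narębski";

# Get a scratch file name; we reopen it ourselves so the default layers apply.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# No layer is given in the mode, so the default from "use open" is used.
open my $out, '>', $path or die "Cannot write $path: $!";
print {$out} $text;
close $out or die "Cannot close $path: $!";

open my $in, '<', $path or die "Cannot read $path: $!";
my $read_back = <$in>;
close $in;

printf "characters: %d, bytes on disk: %d\n", length($read_back), -s $path;
```

If the layers are doing their job, the string read back has the same number of characters as the one written, while the file on disk is one byte longer because "ę" takes two bytes in UTF-8.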

The first release is already up on the CPAN. I'd welcome bug reports, and suggestions for anything else where we can turn on UTF-8 by default (if there is anything more).

I'm also excited about Perl 5.14.0-RC1, which was released today. It includes, among other goodies, a feature called "unicode_strings", which causes Unicode semantics to be used for all string operations within its scope. Awesome! That'll be a huge headache removed for people who can require 5.14. I look forward to it!
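Here is a small sketch of the kind of difference the feature makes, using uc on a character in the 128-255 range. It assumes a perl recent enough to know the feature (I believe the name first appeared in 5.12, with coverage extended in 5.14) and that no locale pragma is in effect.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";   # "é" (U+00E9), stored internally as a single byte

my ($without, $with);
{
    # Feature not enabled here: only ASCII case rules apply to this string.
    $without = uc $e_acute;
}
{
    use feature 'unicode_strings';
    # Unicode rules apply, so "é" is upper-cased to "É" (U+00C9).
    $with = uc $e_acute;
}

printf "without unicode_strings: U+%04X\n", ord $without;   # U+00E9
printf "with unicode_strings:    U+%04X\n", ord $with;      # U+00C9
```

Since the feature is lexically scoped, it can be switched on one file or one block at a time.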

Comments
Comment from Jakub Narębski - April 21, 2011 at 10:32 am

How is it different from the

use open ':utf8';

pragma?

Comment from Walter - April 21, 2011 at 2:50 pm

Nice post, and thanks for the pragma; it should simplify a bunch of code issues.

As an aside, I'm not sure if you are aware of it, but the blog template you are using renders the text really poorly on Win7 / Chrome 10. If you are interested, I can send you a screenshot.

Just trying to be helpful, not critical. If you don't care, no worries.

Comment from Mark Lawrence - April 24, 2011 at 9:22 pm

Thanks for this! I have been caught multiple times by this "stumbling block" (design bug!) and still didn't know about all the places UTF8 would not be coming in... such as via @ARGV. Installing and adding to multiple projects now.