Revision 3 - 2009-03-11 at 09:00:17

Using Unicode in Catalyst Applications

It is essential that you load the Catalyst::Plugin::Unicode plugin to ensure proper decoding/encoding of incoming request parameters and the outgoing body response respectively. This is done in your MyApp.pm:

use Catalyst qw/ -Debug ConfigLoader Unicode /;

Do NOT use Catalyst::View::TT::ForceUTF8! It's a hack that basically tells TT "if there's a string of characters in the stash, interpret them as a string of UTF-8 bytes" (see difference between bytes and characters). As a matter of fact, you should try to stay away from ALL plugins and modules that promise to help you with your Unicode problems except Catalyst::Plugin::Unicode! :-)
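
The difference between bytes and characters can be illustrated with the core Encode module. This is only a minimal sketch, not part of any Catalyst plugin; the variable names are made up for illustration, and the string uses escapes so the example does not depend on the encoding of the source file:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string: "caf" followed by U+00E9 (e with acute accent).
my $chars = "caf\x{e9}";

# Encoding turns characters into bytes; U+00E9 needs two bytes in UTF-8.
my $bytes = encode('UTF-8', $chars);

# length() counts characters for $chars and bytes for $bytes.
my $char_len = length $chars;
my $byte_len = length $bytes;

# Decoding the bytes yields the original characters again.
my $round_trip_ok = decode('UTF-8', $bytes) eq $chars ? 1 : 0;

print "characters: $char_len, bytes: $byte_len, round trip ok: $round_trip_ok\n";
```

The general idea is to decode bytes into characters once on input, work with characters internally, and encode once on output.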

There are three main areas in which you should pay attention to Unicode. Beyond those, most Unicode-related things happen transparently and you shouldn't have to worry about them.

Templates

Either prefix your templates with the BOM or, much easier, tell TT the encoding of your templates in your Catalyst::View::TT module, for instance:

__PACKAGE__->config( {
    ENCODING     => 'utf-8',
} );

(This TT-specific config setting still seems to be undocumented but works just fine)

Stash (Controllers)

As perlunicode suggests, whenever you use (the preferred) UTF-8 Unicode encoding in string or regular expression literals, make sure you put:

use utf8;

at the top of your module (e.g. your Controllers).
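
To see what the pragma changes, here is a small sketch that is not taken from the original page. It assumes the file is saved as UTF-8 and compares the same literal with and without use utf8; the variable names are only illustrative:

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);

{
    use utf8;    # the literal is read as 4 Unicode characters
    $with_pragma = length 'サンプル';
}

{
    no utf8;     # the literal is left as its 12 raw UTF-8 bytes
    $without_pragma = length 'サンプル';
}

print "with use utf8: $with_pragma, without: $without_pragma\n";
```

Without the pragma, functions such as length and regular expression character classes operate on the individual bytes of the literal, which is rarely what you want.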

Model

Make sure your database and tables were created with the UTF-8 encoding, and that the database connection is aware that you're using/expecting UTF-8. PostgreSQL behaves the right way, but for MySQL you'll have to specify this in the connect_info hash in your Model glue-class. Here's an example for DBIx::Class and Catalyst::Model::DBIC::Schema:

use base 'Catalyst::Model::DBIC::Schema';
__PACKAGE__->config(
    schema_class    => 'MyApp::Schema',
    connect_info    => [
        'dbi:mysql:database:host',
        'username',
        'password',
        {
            AutoCommit        => 1,
            RaiseError        => 1,
            mysql_enable_utf8 => 1,          # PostgreSQL: "pg_enable_utf8 => 1", but not necessary
        },
    ]
);

Note

mysql_enable_utf8 is experimental and may change in future versions.

SEE ALSO: DBD::mysql - DATABASE HANDLES

Tips for Troubleshooting Unicode in Perl Web Applications

Troubleshooting encoding/Unicode problems in a web/db app can be difficult, especially when different parts of the system interpret the same data in different ways, or when data is encoded or decoded more than once at different stages. Here is a short generic procedure to help verify that things are working and to find where they need fixing. Note that these details assume Perl 5.8+, a desire for a clean end-to-end Unicode system, and relatively modern tools; even if that is not the case, they still mostly apply. These details are also generic to web applications, not specific to those written with Catalyst; you can read "your program code" and similar phrases as including the Catalyst framework itself where appropriate.
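
As an illustration of what double encoding looks like, here is a minimal sketch using the core Encode module. The variable names are made up for this example, and the test string is written with escapes so it does not depend on the source file encoding:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The Japanese test word as 4 characters.
my $text = "\x{30B5}\x{30F3}\x{30D7}\x{30EB}";

# Correct: encode once, giving 12 bytes (3 per character).
my $once = encode('UTF-8', $text);

# Bug: the bytes are mistaken for characters and encoded again.
# Every byte here is above 0x7F, so each one becomes 2 bytes.
my $twice = encode('UTF-8', $once);

my $len_once  = length $once;
my $len_twice = length $twice;

# Decoding twice undoes the damage, which helps confirm the diagnosis.
my $recovered = decode('UTF-8', decode('UTF-8', $twice));
my $recovered_ok = $recovered eq $text ? 1 : 0;

print "once: $len_once bytes, twice: $len_twice bytes, recovered: $recovered_ok\n";
```

If text only displays correctly after being decoded twice, some layer is probably encoding data that was already encoded; the fix is to find and remove the extra step rather than to add a matching decode.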

  1. Make sure all your text/code/template/non-binary/etc files are saved as UTF-8 text files (or they are 7-bit ASCII), and you have a Unicode-savvy text editor.

  2. Have a use utf8; at the top of every Perl file, so Perl treats your source files as being Unicode.

  3. Place a text string literal in your program code that you know isn't in ASCII ... for example I like to use the word 'サンプル', which is what came out of Google's translation tool when I asked it to translate the word 'sample' to Japanese. Then set up your program to display that text directly in your web page text, without any escaping.

  4. Make sure the HTTP response headers for the webpage with that text have a content-type charset value of UTF-8, and make sure that Perl is encoding its output as actual UTF-8; if you were writing directly to STDOUT, for example in a CGI, it could be: binmode *main::STDOUT, ':encoding(UTF-8)'; or similar. Make sure your web browser is Unicode savvy.

  5. At this point, if the web page displays correctly with the non-ASCII literal (and moreover, if you "view source" in the browser and the literal also displays literally), then you know your program can represent Unicode correctly internally, and it can output Unicode correctly to the browser. It is very important to get this step working first, in isolation, so that you are in a position to judge or troubleshoot other issues such as receiving Unicode input from a browser or using it with a database.

  6. Next test that you can receive Unicode from the browser in the various ways, whether by query string / HTTP headers or in an HTTP POST. E.g. try outputting a value and have the user submit it again, and compare for equality either in the Perl program or by displaying it again next to the original for visual inspection. If any differences come up, then you know any fixes you have to make concern either how you read and interpret the browser request, or perhaps how you instruct the browser to submit a request. Once that's all cleared up, then you know your I/O with the web browser works fine.

  7. To test a database, I suggest first using a known-good and Unicode-savvy alternate input method for putting some Unicode text in the database, such as an admin/utility tool that came with the DBMS. Also make sure that the database itself is using UTF-8 character strings in its schema, e.g. that the schema is declared this way.

  8. With a database known to contain some valid Unicode text, first test simply selecting that text from the database and displaying it. If anything doesn't match, it means you probably have to configure your DBMS client connection encoding so it is UTF-8 (often done with a few SQL commands), and then separately ensure that Perl is decoding the UTF-8 data into Perl text strings properly. It's important to make sure you can retrieve Unicode from the database properly so that you have a context for judging that you can insert such text in the database.

  9. Next try to insert some Unicode text in the database using your program, then select it back to check that it worked. If it didn't, then check DBMS client connection settings, or that Perl is encoding text as UTF-8 properly.

  10. Actually, when you have a known-good external tool to help you, you can alternatively start the DBMS tests with step 9, where your program inserts text, and then use the known-good tool to check that it was actually recorded properly.

That's it in a nutshell. Adjust as appropriate to account for any abstraction tools or frameworks you are using, which may mean your tests also involve testing or configuring those tools.
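
Steps 2 to 4 above can be sketched as a tiny CGI-style script. This is only an illustration under the assumptions that the file is saved as UTF-8 and that output goes to a plain filehandle; print_test_page is a hypothetical helper name, not part of Catalyst or any module:

```perl
use strict;
use warnings;
use utf8;    # step 2: tell Perl this source file is UTF-8

# Hypothetical helper: writes a minimal CGI-style response to a filehandle.
sub print_test_page {
    my ($fh) = @_;

    # Step 4: encode characters as UTF-8 on the way out.
    binmode $fh, ':encoding(UTF-8)';

    # Step 4: declare the charset in the response header.
    print {$fh} "Content-Type: text/html; charset=UTF-8\r\n\r\n";

    # Step 3: a known non-ASCII literal, printed without any escaping.
    print {$fh} "<html><body><p>サンプル</p></body></html>\n";

    return;
}

print_test_page(\*STDOUT);
```

If the word shows up intact both in the rendered page and in "view source", the output side is working as described in step 5. Under Catalyst the framework and the Unicode plugin take care of the output encoding, so this sketch applies mainly to plain CGI-style scripts.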
