Using Unicode in Catalyst Applications

Prior to Catalyst version 5.90040, you had to load the Catalyst::Plugin::Unicode::Encoding plugin to ensure that incoming request parameters were decoded and the outgoing response body was encoded properly. This is done in your MyApp.pm:

use Catalyst qw/ -Debug ConfigLoader Unicode::Encoding /;

Since that version, Unicode support has been part of core: the plugin ships with Catalyst and is loaded by default.
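As a rough sketch of what this can look like on a newer Catalyst, the application class below leaves the plugin out of the plugin list. The application name MyApp and the explicit encoding setting are assumptions for illustration; check the documentation of your installed Catalyst version to see whether setting encoding yourself is needed.

```perl
package MyApp;    # hypothetical application name
use Moose;
use namespace::autoclean;

use Catalyst::Runtime 5.90040;

# Unicode::Encoding is no longer listed among the plugins.
use Catalyst qw/ -Debug ConfigLoader /;

extends 'Catalyst';

__PACKAGE__->config(
    name     => 'MyApp',
    encoding => 'UTF-8',    # assumption: stated explicitly for clarity
);

__PACKAGE__->setup;

1;
```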

Note that only some content types are encoded. For version 0.8, those are the ones matching /^text|xml$|javascript$/.
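To get a feel for what that pattern covers, here is a small standalone sketch that runs it against a few content types. The helper name is_encoded_type and the list of sample types are made up for this illustration; exactly which string the plugin matches against is not shown here.

```perl
use strict;
use warnings;

# The pattern quoted above for version 0.8.
my $encodable = qr/^text|xml$|javascript$/;

# Hypothetical helper: true if a content type matches the pattern.
sub is_encoded_type {
    my ($content_type) = @_;
    return $content_type =~ $encodable ? 1 : 0;
}

for my $type (qw{ text/html application/xml application/javascript
                  application/json image/png }) {
    printf "%-26s %s\n", $type,
        is_encoded_type($type) ? 'encoded' : 'left alone';
}
```

Note that the alternation binds loosely: the pattern means "starts with text", "ends with xml", or "ends with javascript", so application/json and image/png are left alone.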

Do NOT use Catalyst::View::TT::ForceUTF8! It's a hack that basically tells TT "if there's a string of characters in the stash, interpret them as a string of UTF-8 bytes" (see difference between bytes and characters). As a matter of fact, you should try to stay away from ALL plugins and modules that promise to help you with your Unicode problems except Catalyst::Plugin::Unicode::Encoding! :-)
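The bytes-versus-characters distinction mentioned above can be seen in a few lines of plain Perl. This is only an illustration using the core Encode module, not a description of what any particular plugin does internally.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Three bytes: the UTF-8 encoding of the single character U+30B5.
my $bytes = "\xE3\x82\xB5";

# Decoding turns the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

# Encoding bytes that were never decoded treats each byte as a
# character and produces the classic double-encoded result.
my $double = encode('UTF-8', $bytes);

printf "bytes: %d, characters: %d, double-encoded bytes: %d\n",
    length $bytes, length $chars, length $double;
# prints "bytes: 3, characters: 1, double-encoded bytes: 6"
```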

There are basically three main areas in which you should pay attention to Unicode. Other than that most things Unicode-related happen transparently and you shouldn't have to worry about it.

View: TT Templates

Either prefix your templates with the BOM or, much easier, tell TT the encoding of your templates in your Catalyst::View::TT module, for instance:

__PACKAGE__->config( {
    ENCODING     => 'utf-8',
} );

(This TT-specific config setting is documented in Template::Manual::Config.)

View: Mason Templates

Use Catalyst::View::HTML::Mason and define these in the configuration:

<View::Mason>
    <interp_args>
        comp_root           __path_to(root)__
        preamble            "use utf8; "
        data_dir            __HOME__/var/mason-data
    </interp_args>
    template_extension  .mas
    # Do not set "encoding utf-8" here: it encodes/decodes the end result, and that's too late.
</View::Mason>

You need the data_dir defined for preamble to work.

If you are still using the older Catalyst::View::Mason, you can use this to signal that all your templates are in UTF-8:

package MyApp::View::Mason;
use Moose;
extends 'Catalyst::View::Mason';
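around 'new' => sub {
    my ($orig, $class, $app, @args) = @_;
    # Call the wrapped constructor through the $orig code reference.
    my $self = $class->$orig($app, @args);
    # Prepend "use utf8;" to every compiled template.
    $self->template->compiler->{preamble} = 'use utf8; ';
    return $self;
};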

Stash (Controllers)

As perlunicode suggests, whenever you use (the preferred) UTF-8 Unicode encoding in string or regular expression literals, make sure you put:

use utf8;

at the top of your module (e.g. your Controllers).
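A quick way to see what the pragma changes is to look at a non-ASCII literal's length. The sketch below assumes the file is saved as UTF-8; the variable names are made up for this example.

```perl
use strict;
use warnings;
use utf8;    # the literals in this file are UTF-8 encoded characters
use Encode qw(encode);

my $literal = 'サンプル';

# With "use utf8", Perl sees four characters ...
my $char_count = length $literal;

# ... whereas the file on disk holds twelve UTF-8 bytes, which is what
# Perl would see as the string's content without the pragma.
my $byte_count = length encode('UTF-8', $literal);

# Character-aware regular expressions now work on the literal.
my $is_katakana = $literal =~ /^\p{Katakana}+$/ ? 1 : 0;

print "characters: $char_count, bytes: $byte_count, katakana: $is_katakana\n";
```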

Model

Make sure your database and tables were created with the UTF-8 encoding, and that the database connection is aware that you're using/expecting UTF-8. PostgreSQL and Oracle behave the right way, but for MySQL or SQLite you'll have to specify this in the connect_info hash in your Model glue-class. Here's an example for DBIx::Class and Catalyst::Model::DBIC::Schema:

use base 'Catalyst::Model::DBIC::Schema';
__PACKAGE__->config(
    schema_class    => 'MyApp::Schema',
    connect_info    => [
        'dbi:mysql:database:host',
        'username',
        'password',
        {
            AutoCommit        => 1,
            RaiseError        => 1,
            mysql_enable_utf8 => 1,          # MySQL
            # PostgreSQL: "pg_enable_utf8 => 1", but this *may* not be necessary.
            # SQLite: "sqlite_unicode => 1" ("unicode => 1" in older DBD::SQLite releases).
        },
    ]
);

Note

mysql_enable_utf8 is experimental and may change in future versions.

SEE ALSO: DBD::mysql - DATABASE HANDLES

Schema

This isn't necessary for databases such as PostgreSQL and Oracle, or for MySQL with the mysql_enable_utf8 option, but for some other databases, make sure your ResultSource classes load the ForceUTF8 component:

__PACKAGE__->load_components('ForceUTF8', 'PK::Auto', 'Core');

This will, in effect, apply UTF8Columns to all columns in the current result source class. Don't confuse ForceUTF8 with UTF8Columns. UTF8Columns requires you to explicitly specify which columns you want to treat as UTF-8, while ForceUTF8 automatically enables this for all columns.

WARNING!! It seems that ForceUTF8 is broken with recent versions of DBIx::Class, so until it gets fixed, UTF8Columns should be used alone! For more details see bug #53520 for ForceUTF8.
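For comparison, a result class that uses UTF8Columns on its own might look roughly like the sketch below. The class name, table name and column names are hypothetical; only load_components and utf8_columns are the parts that matter here.

```perl
package MyApp::Schema::Result::Article;    # hypothetical result class
use strict;
use warnings;
use base 'DBIx::Class';

# UTF8Columns is loaded before Core, and without ForceUTF8.
__PACKAGE__->load_components('UTF8Columns', 'Core');

__PACKAGE__->table('article');                     # hypothetical table
__PACKAGE__->add_columns(qw/ id title body /);     # hypothetical columns
__PACKAGE__->set_primary_key('id');

# Only the columns named here are treated as UTF-8 text.
__PACKAGE__->utf8_columns(qw/ title body /);

1;
```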

For Oracle, you have to make sure that its client library returns Unicode by setting the environment variable NLS_LANG to something like AMERICAN_AMERICA.AL32UTF8. To avoid problems on new machines or when running under F(ast)CGI, which has its own environment, you might want to set it in your schema by overriding connection:

sub connection {
    my $self = shift;

    # make sure Oracle always returns unicode
    $ENV{NLS_LANG} = 'AMERICAN_AMERICA.AL32UTF8';

    $self->next::method(@_);
}

FormFu

See HTML::FormFu::Manual::Unicode.

Tips for Troubleshooting Unicode in Perl Web Applications

Troubleshooting encoding/Unicode problems in a web/db app can be difficult, especially when different parts of the web system may be interpreting the same data in different ways, or when data may be encoded or decoded more than once at different stages. Here is a short generic algorithm to help test that things are working and to find where things need fixing. Note that these details assume we're using Perl 5.8+, that we want a clean Unicode system end-to-end, and that we have relatively modern tools; even if not, they still mostly apply. These details are also generic to web applications in general, not specific to those written with Catalyst; you can read "your program code" and similar phrases as including the Catalyst framework itself where appropriate.

  1. Make sure all your text/code/template/non-binary/etc files are saved as UTF-8 text files (or they are 7-bit ASCII), and you have a Unicode-savvy text editor.

  2. Have a use utf8; at the top of every Perl file, so Perl treats your source files as being Unicode.

  3. Place a text string literal in your program code that you know isn't in ASCII. For example, I like to use the word 'サンプル', which is what came out of Google's translation tool when I asked it to translate the word 'sample' to Japanese. Then set up your program to display that text directly in your web page text, without any escaping.

  4. Make sure the HTTP response headers for the web page with that text have a Content-Type charset value of UTF-8, and make sure that Perl is encoding its output as actual UTF-8. If you were writing directly to STDOUT, as in a CGI script, that could be done with binmode *main::STDOUT, ':encoding(UTF-8)'; or similar. Make sure your web browser is Unicode-savvy.

  5. At this point, if the web page displays the non-ASCII literal correctly (and, moreover, the literal also appears as-is when you "view source" in the browser), then you know your program can represent Unicode correctly internally and can output it correctly to the browser. It is very important to get this step working first, in isolation, so that you are in a position to judge or troubleshoot other issues such as receiving Unicode input from a browser or using it with a database.

  6. Next test that you can receive Unicode from the browser in the various ways, whether by query string / HTTP headers or in an HTTP post. E.g. try outputting a value and have the user submit it again, and compare for equality either in the Perl program or by displaying it again next to the original for visual inspection. If any differences come up, then you know any fixes you have to do concern either how you read and interpret the browser request, or perhaps on how you instruct the browser on how to submit a request. Once that's all cleared up, then you know your I/O with the web browser works fine.

  7. To test a database, I suggest first using a known-good, Unicode-savvy alternate input method for putting some Unicode text in the database, such as an admin/utility tool that came with the DBMS. Also make sure that the database itself uses UTF-8 character strings in its schema, e.g., that the schema is declared this way.

  8. With a database known to contain some valid Unicode text, first test simply selecting that text from the database and displaying it. If anything doesn't match, you probably have to configure your DBMS client connection encoding so that it is UTF-8 (often done with a few SQL commands), and then separately ensure that Perl is decoding the UTF-8 data into Perl text strings properly. It's important to make sure you can retrieve Unicode from the database properly, so that you have a baseline for judging whether you can insert such text into the database.

  9. Next try to insert some Unicode text in the database using your program, then select it back to check that it worked. If it didn't, then check DBMS client connection settings, or that Perl is encoding text as UTF-8 properly.

  10. Actually, when you have a known-good external tool to help you, you can alternatively start the DBMS tests with step 9, where your program inserts text, and then use the known-good tool to check that it was recorded properly.

That's it in a nutshell. Adjust as appropriate to account for any abstraction tools or frameworks you are using, which may mean your tests also involve testing or configuring those tools.
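As a rough command-line sketch of steps 2 to 6, the script below outputs a known literal as UTF-8 and checks that a percent-encoded copy of it, as a browser might submit it in a query string, decodes back to the same characters. The helper decode_param and the hard-coded submitted value are assumptions made for this example; in a real application the framework does the request parsing for you.

```perl
use strict;
use warnings;
use utf8;                        # step 2: this source file is UTF-8
use Encode qw(decode);

my $sample = 'サンプル';         # step 3: a known non-ASCII literal

# Step 6 (hypothetical helper): undo percent-encoding to get raw bytes,
# then decode those UTF-8 bytes into a Perl character string.
sub decode_param {
    my ($raw) = @_;
    $raw =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $raw);
}

# What a browser would send for the sample word in a query string.
my $submitted = '%E3%82%B5%E3%83%B3%E3%83%97%E3%83%AB';
my $round_trip_ok = decode_param($submitted) eq $sample ? 1 : 0;

# Step 4: encode characters as UTF-8 on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print "literal: $sample\n";
print "round trip: ", ($round_trip_ok ? 'ok' : 'FAILED'), "\n";
```

If the round trip fails in a real application, the comparison tells you the problem lies in how the request is read or decoded, not in the output side.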
