Tuesday, August 15, 2006

Perl packages and modules

I've been working on Perl scripts that continue to grow. I decided that they were getting out of hand, and wanted to organize them better by breaking some functions out into Perl modules. I am somewhat new to Perl, and although I do own the camel book, I found its discussion of modules to be less than helpful. After some internet searching, I found a way that works. Here's the short-and-sweet version:

Let's say you have a module located here:
/home/myname/ver-8/Common/DBUtils.pm

Here's what DBUtils.pm looks like -

package Common::DBUtils;
use vars qw($VERSION @ISA @EXPORT @EXPORT_OK);
...
sub db_fun
{
...
}
1;
__END__


Here's a script that uses that function -

...
# points to the directory that contains Common/ (the root of the
# module's namespace), not to the directory that holds DBUtils.pm
use lib "/home/myname/ver-8";
use Common::DBUtils;
...
my $thing = Common::DBUtils::db_fun();
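
To see the package and the fully qualified call working together, here is a minimal single-file sketch. It inlines a stand-in for Common::DBUtils so it runs without anything on disk; the body of db_fun and the version number are placeholders I made up, since the real ones are elided above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for Common/DBUtils.pm, inlined so the sketch
# runs as a single file.
package Common::DBUtils;

our $VERSION = '0.01';    # assumed version number, not from the post

# Placeholder body: the real db_fun is elided in the post.
sub db_fun {
    return "result from db_fun";
}

# Back to the script's own namespace.
package main;

# With the module saved on disk, these two lines would load it instead:
#   use lib "/home/myname/ver-8";
#   use Common::DBUtils;

# A fully qualified call works without exporting anything.
my $thing = Common::DBUtils::db_fun();
print "$thing\n";
```

The package name is what ties the pieces together: Common::DBUtils maps to the relative path Common/DBUtils.pm, and use lib supplies the directory that path is resolved against.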


Simple enough! But it's like learning how to work with package names and the classpath in Java. A bit of a pain, to start. I need to do some more reading on the subject because I saw some discussions that indicate there are other ways to go about this.
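
One of those other ways is probably the Exporter module, which is what the @EXPORT and @EXPORT_OK variables declared in DBUtils.pm are for. The sketch below is only an illustration under my own assumptions: the module is inlined in a BEGIN block and marked as loaded through %INC so the example runs as one file, and the body of db_fun is again a placeholder.

```perl
#!/usr/bin/perl
use strict;
use warnings;

BEGIN {
    # Hypothetical inline version of Common/DBUtils.pm.
    package Common::DBUtils;

    use Exporter 'import';          # borrow Exporter's import method
    our @EXPORT_OK = ('db_fun');    # names a caller may ask for

    # Placeholder body: the real db_fun is elided in the post.
    sub db_fun {
        return "result from db_fun";
    }

    # Mark the module as already loaded so "use" does not search the disk.
    # A real module file would not need this line.
    $INC{'Common/DBUtils.pm'} = __FILE__;
}

# Ask for db_fun by name; it can then be called without the package prefix.
use Common::DBUtils qw(db_fun);

my $thing = db_fun();
print "$thing\n";
```

Listing names in @EXPORT_OK rather than @EXPORT means nothing lands in the caller's namespace unless the caller asks for it, which keeps name clashes visible.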

Tuesday, August 08, 2006

Perl/Mason documentation is poor - UTF8 example

Working with Perl/Mason is frustrating because the documentation is so poor. I've worked with several languages, and solving problems in Perl/Mason takes longer than anything I've ever experienced.

Most recent example: I have a Perl/Mason script that outputs a line of JSON in response to an AJAX request. I wanted the output to be Unicode (UTF-8), rather than Western (ISO-8859-1).

I'd hit this Mason script from a browser and see that the default encoding was set to Western. Output characters looked like junk because they were UTF-8 being displayed with Western encoding. Example:
{"name" : "crÃ¨me brÃ»lÃ©e"}
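
The junk can be reproduced outside the browser. This is a small sketch using the core Encode module: it encodes the name as UTF-8 bytes and then decodes those bytes as ISO-8859-1, which is roughly what a browser set to Western does. The variable names are mine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# The sample name, written with escapes so the file encoding does not matter.
my $text = "cr\x{e8}me br\x{fb}l\x{e9}e";

# The bytes a server would send for UTF-8 output.
my $bytes = encode('UTF-8', $text);

# What a reader sees when those bytes are interpreted as Western ...
my $as_western = decode('iso-8859-1', $bytes);

# ... and when they are interpreted as UTF-8.
my $as_utf8 = decode('UTF-8', $bytes);

binmode(STDOUT, ':encoding(UTF-8)');
print qq({"name" : "$as_western"}\n);
print qq({"name" : "$as_utf8"}\n);
```

Each accented letter takes two bytes in UTF-8, so the Western reading shows two junk characters where one letter should be.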

OK, the solution should be something like setting HTTP headers. I started searching for documentation with keywords "mason utf8", "mason charset" and so on. I couldn't find anything useful! It took me a few hours of searching and false starts before I finally came across a post describing the UTF-8 problem. Why isn't this described clearly at MasonHQ? If you search on "utf8" there, you get "no results found". Very frustrating.

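Anyway, the "solution" seems to be simply to add the single line "use utf8;" at the top of the <%init> section of your Mason code. (Strictly speaking, that pragma only tells Perl that the source code itself is written in UTF-8; it does not set the encoding of the output, which may be part of why my results were inconsistent.) Once I did this, I'd hit the script and find that it was defaulting to UTF-8. The output looked right, too [but see my note at the end of this post]: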
{"name" : "crème brûlée"}
I tried this with Chinese characters, which also worked.
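
For comparison, here is a hedged sketch of encoding the output explicitly in plain Perl, outside Mason, with the core Encode module. It is not what my Mason script does; it only shows that the character string and the UTF-8 bytes sent to the client are two different things. The variable names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# A character string: each accented letter is one character.
my $name = "cr\x{e8}me br\x{fb}l\x{e9}e";
my $json = qq({"name" : "$name"});

# Explicitly turn the characters into UTF-8 bytes for output.
my $octets = encode('UTF-8', $json);

printf "%d characters, %d bytes once encoded\n",
    length($json), length($octets);

# Write the bytes as they are. The alternative is to leave $json as
# characters and set binmode(STDOUT, ':encoding(UTF-8)') before printing.
binmode(STDOUT, ':raw');
print $octets, "\n";
```

The byte count is larger than the character count because each of the three accented letters needs two bytes in UTF-8.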

In comparison, it was practically a piece of cake to fix the encoding in the MySQL database. Originally, the schema for the table containing the "name" field did not specify a character set for that field, so it was using the default, which was latin1. I did just a little bit of hunting and found the solution pretty quickly: ALTER TABLE <table1> MODIFY COLUMN <col1> varchar(32) CHARACTER SET latin1. I just had to modify this command a little, so that the column was set to utf8:

ALTER TABLE <table1> MODIFY COLUMN <col1> varchar(32) character set utf8

I also found the MySQL SHOW VARIABLES command helpful:

mysql> show variables like "cha%";
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
...
+--------------------------+----------------------------+

and the SHOW FULL COLUMNS command:

show full columns from <table1>
+-------------+--------------+-------------------+...
| Field       | Type         | Collation         |...
+-------------+--------------+-------------------+
...
| name        | varchar(32)  | latin1_swedish_ci |
...
+-------------+--------------+-------------------+


After running the "ALTER TABLE" command above, the output looks like this:

show full columns from <table1>
+-------------+--------------+-------------------+...
| Field       | Type         | Collation         |...
+-------------+--------------+-------------------+
...
| name        | varchar(32)  | utf8_general_ci   |
...
+-------------+--------------+-------------------+


I also found a useful MySQL page describing the SET NAMES utf8 command. You must run this SQL in your Mason script before getting UTF-8 data from the database, otherwise you'll see output with extraneous question marks, like this:
{"name" : "cr�me br�l�e"}
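
One plausible reading of those marks, and it is my assumption rather than something I verified, is that the connection hands back latin1 bytes and the page then treats them as UTF-8. A lone latin1 byte for an accented letter is not valid UTF-8, so it gets replaced with the Unicode replacement character. The sketch below reproduces that with the core Encode module; the variable names are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

my $name = "cr\x{e8}me br\x{fb}l\x{e9}e";

# Assumed scenario: the connection delivers latin1 bytes ...
my $latin1_bytes = encode('iso-8859-1', $name);

# ... and the receiving side decodes them as UTF-8. Bytes that are not
# valid UTF-8 are replaced with U+FFFD.
my $shown = decode('UTF-8', $latin1_bytes);

# Count how many replacement characters ended up in the result.
my $bad = () = $shown =~ /\x{FFFD}/g;

binmode(STDOUT, ':encoding(UTF-8)');
print qq({"name" : "$shown"}\n);
print "replacement characters: $bad\n";
```

If that reading is right, SET NAMES utf8 helps because it makes the connection deliver UTF-8 bytes in the first place.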

All of this MySQL documentation was unearthed very quickly in comparison to the Perl/Mason stuff.

Just to be clear, my Mason script looks like this (a stripped down version of my code):


<%args>
</%args>


<%init>
use strict;
use warnings;

# Output must be UTF8 because names may be UTF8
use utf8;

###########################################################
# Set output to be UTF8 in headers - this MUST be done,
# otherwise output is displayed incorrectly in display page
# with Western encoding.
###########################################################
$m->auto_send_headers(0); # do not let Mason send the headers automatically
$r->content_type('text/html; charset=utf-8'); # the charset belongs in the Content-Type header

my $dbh = DBI->connect('DBI:mysql:blah1:blah2') or die;

###########################################################
# Default db connection is not set to utf8; set that here
# because results may contain utf8. If this is not done,
# there will be question marks in output.
###########################################################
$dbh->do("SET NAMES utf8");

# Get UTF-8 info from database
my $sql_query = "SELECT table1.name FROM table1";
my $sth = $dbh->prepare($sql_query);
$sth->execute();
my $names = "";
my $sep = "";
while (my @row = $sth->fetchrow_array) {
    $names .= $sep . $row[0];
    $sep = " ";
}
print $names;
exit();

</%init>
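
The stripped-down script above just joins the names with spaces. For the actual JSON response, one option is to let a JSON module do the quoting and the UTF-8 encoding. The following is a sketch in plain Perl, not Mason, and not what my script does. It assumes JSON::PP is available (it ships with newer Perl releases), and the second name in the list is made-up sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;

# Sample data standing in for the rows fetched from the database.
# "tarte tatin" is a hypothetical second row.
my @names = ("cr\x{e8}me br\x{fb}l\x{e9}e", "tarte tatin");

# utf8      : encode() returns UTF-8 bytes, ready to send to the client
# canonical : sort hash keys so the output is stable
my $json = JSON::PP->new->utf8->canonical->encode({ names => \@names });

# The result is already bytes, so write it without another encoding layer.
binmode(STDOUT, ':raw');
print $json, "\n";
```

A module also escapes quotes and backslashes inside the names, which hand-built JSON strings would not.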


I don't know whether this is actually the "proper" way to do things. If I hit the script directly, I still see Western encoding. At first, hitting the page showed the encoding set to UTF-8, but later it went back to defaulting to Western, and I don't know what changed. If you force the encoding to UTF-8 in the browser, the output looks right. Anyway, this output is just an intermediate response that is never shown to the user as is. It is displayed by JavaScript in the browser, and that display correctly appears as UTF-8 (for now!). The JavaScript does not use the escape function or any other special encoding or decoding function to display it.