[freeside-devel] [PATCH] Use of accentuated and special characters

Mathieu Dimanche mdimanche at free.fr
Tue Sep 18 03:26:33 PDT 2007


Hi

Having the whole database encoded in UTF-8 isn't the only way to get
special characters working in Freeside. In fact, I managed to get them
working well with just a few source modifications.

1) First, we need to be able to record ISO-8859-1 accented characters
in the database from the data received through the interface.

The whole problem is that the Perl regexes, which use "\w" to check for
alphanumeric characters, reject special characters, which are sometimes
"doubled characters": for example, an accented a (à) is not always a
single character, but can be an "a" plus a combining grave accent,
i.e. two characters [1]. So patching the checking functions in
FS/FS/Record.pm, replacing "\w" with "(?:\pL\pM*|\pN|_)", works like a
charm and allows us to record special characters in the database in
ISO-8859-1 format.

BTW, the Perl regex above means: any letter plus any optional combining
mark, or a numeral, or an underscore (yes, \w accepts the underscore as well).

So we only need to patch the functions below (a sketch follows the list):
FS::Record::ut_text
FS::Record::ut_textn
FS::Record::ut_alpha
FS::Record::ut_alphan
FS::Record::ut_name
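
For illustration, here is a minimal sketch of the idea applied to a
ut_text-style check; the allowed punctuation is illustrative, not the
exact list from FS::Record, so treat it as a sketch rather than the
actual patch:

    # Sketch only: the real ut_text allows more punctuation.  The point
    # is the "\w" replacement: (?:\pL\pM*|\pN|_) matches a letter plus
    # optional combining marks, a number, or an underscore.
    sub ut_text {
      my ( $self, $field ) = @_;
      $self->getfield($field) =~ /^((?:\pL\pM*|\pN|_|[ \,\.\-\'])+)$/
        or return "illegal or empty text $field: " . $self->getfield($field);
      $self->setfield( $field, $1 );
      '';
    }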

I read a lot about Perl being able to make \w accept special characters
based on the locale in use, but I spent 3 hours trying to get it to work
and failed.


2) Second, since there is another path for passing variables from the
interface to the Perl modules (namely Ajax calls), we need to patch that
one too.

The data is prepared in httemplate/elements/xmlhttp.html. Every single
argument is encoded with the "escape" JavaScript function, which doesn't
handle accented characters well, so it needs to be replaced with the
better "encodeURIComponent" function [2] (JavaScript 1.5); that also
lets us get rid of the awful hack that replaces plus signs.
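
For illustration (the variable names here are made up, not taken from
xmlhttp.html):

    // escape() uses a non-standard encoding for non-ASCII ("à" -> "%E0")
    // and leaves "+" alone, which is why the plus-sign hack was needed.
    // encodeURIComponent() percent-encodes UTF-8 ("à" -> "%C3%A0") and
    // turns "+" into "%2B", so the hack can simply be dropped.
    var value   = 'crème brûlée';
    var encoded = encodeURIComponent(value); // "cr%C3%A8me%20br%C3%BBl%C3%A9e"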

The receiving side of the Ajax process is handled in FS/UI/Web.pm. Since
Ajax calls wrap everything in XML, the ISO-8859-1 characters are
automatically converted to UTF-8 on their way through the XML request.
So the only thing to do in FS::UI::Web::start_job is to decode the
arguments from UTF-8 and then re-encode them in ISO-8859-1, so that
everything works exactly as if the data had been passed the "standard" way.

a "use Encode;" is then a dependency, but I believe it already was since 
it's a pretty common class.
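
A minimal sketch of the conversion, assuming the arguments live in a
hash (%param is an illustrative name, not necessarily what start_job
really uses):

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # The XML layer handed the values over as UTF-8 bytes...
    my %param = ( first => "Andr\xc3\xa9" );   # UTF-8 bytes for "André"

    # ...so decode from UTF-8 and re-encode to ISO-8859-1; downstream
    # code then sees the values exactly as if they had come from a
    # normal form post.
    $param{$_} = encode( 'iso-8859-1', decode( 'utf-8', $param{$_} ) )
        for keys %param;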


3) Making LaTeX documents show all ISO-8859-1 characters.

As long as the database contains special characters, the HTML templates
work very well and show the correct characters in the browser, but the
LaTeX engine simply discards those special characters. Making it
understand that the .tex file is encoded in ISO-8859-1 is the only way
to go, so we just need to put "\usepackage[latin1]{inputenc}" in the
.tex template.
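
For illustration, the preamble of the .tex template would gain the line
like this (the \documentclass line is only an example of a preamble,
not Freeside's actual template):

    \documentclass{article}
    % Declare the input encoding so accented ISO-8859-1 (Latin-1)
    % characters coming from the database are typeset instead of
    % being silently discarded.
    \usepackage[latin1]{inputenc}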



Now for the remarks:

* Perl regex checks are usually done by the checking functions in
FS::Record, but there seem to be checks using "\w" all over the place,
so we need to test thoroughly to make sure special characters are
accepted everywhere.

* I haven't checked whether the HTML emails we send display these special
characters, but forcing the email encoding to ISO-8859-1 should do the job.

* There may be a more serious problem with any procedures (if there are
any) that count characters to get a string's length. Since a special
character may in fact be two separate characters, the overall length of
a string could be reported incorrectly. Does anyone know whether there
are such checks in the code, and can anyone test this?
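
Here is a quick standalone illustration of the pitfall (plain Perl, not
Freeside code):

    use strict;
    use warnings;

    # The same visible glyph can be one code point or two:
    my $precomposed = "\x{e0}";    # "à" as the single code point U+00E0
    my $combining   = "a\x{300}";  # "a" followed by combining grave U+0300

    print length($precomposed), "\n";   # prints 1
    print length($combining),   "\n";   # prints 2 -- same glyph on screen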

* As some special characters are not a letter plus a combining mark but
a single special character on their own, the replacement for "\w"
mentioned above isn't entirely correct. As an example, I tried to input
a special French character (inherited from Latin: œ) and it was rejected
by the checking procedures, so it probably needs to become part of the
"\w" replacement pattern. And that is just one example: other languages
with special characters may have many more such cases (I'm thinking of
the German eszett (ß) and many others), so this pattern is just a
temporary workaround. It could be a really good idea to have a global
function somewhere in the code that returns this new pattern and to use
it in all the "\w"-checking regexes (though of course not in URL checks,
email checks, etc.); a sketch of such a helper follows this remark. It
could eventually be a configuration variable, but I don't think it
really belongs there. Is there somewhere such a global can be defined?
Yes, I know, globals are usually not the way to go when thinking
object-oriented, but one could be justified here.
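
Here is a minimal sketch of what such a shared pattern could look like;
the placement in FS::Record and the variable name are just my
suggestion, nothing that exists today:

    package FS::Record;

    # One shared definition of the "\w" replacement, so every checking
    # routine can be extended in a single place when more characters
    # (œ, ß, ...) turn out to be needed.
    our $WORD_CHAR = qr/(?:\pL\pM*|\pN|_)/;

    # A checking routine would then match against it, e.g.:
    #   $value =~ /^(?:$WORD_CHAR)+$/ or return "illegal $field";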

* This is just an ISO-8859-1 workaround; the full power of Unicode isn't
there. So having Japanese invoices isn't going to work as long as we
stay with this encoding.



You'll find a CVS patch attached with the changes described above. Since
I have several other changes in my development version, I edited the
patch file by hand, so it may be wrong.

Could non-US users test this patch with their own special characters and
report on the overall behavior?


Mathieu


PS: I tried to document my code modifications so I could include that in
the patch, but I can't seem to find a changelog. Don't you use that kind
of log?


[1] http://www.regular-expressions.info/unicode.html

[2] http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Reference:Global_Functions:encodeURIComponent





-------------- next part --------------
A non-text attachment was scrubbed...
Name: freeside_special_characters.patch
Type: text/x-patch
Size: 7025 bytes
Desc: not available
Url : http://420.am/pipermail/freeside-devel/attachments/20070918/04f0b78d/attachment.bin 

