[freeside-devel] [PATCH] Use of accentuated and special characters
Mathieu Dimanche
mdimanche at free.fr
Tue Sep 18 03:26:33 PDT 2007
Hi,
Encoding the whole database in UTF-8 isn't the only way to use special
characters in freeside. In fact, I got it working well with just a few
source modifications.
1) First, we need to be able to store iso-8859-1 accented characters
in the database from the data received from the interface.
The problem is that the Perl regexes, which use "\w" to check for
alphanumeric characters, reject special characters, which are sometimes
"doubled characters". For example, an accented a (à) is not always a
single character; it can be an "a" plus a "grave accent", which means 2
characters [1]. So patching the checking functions in FS/FS/Record.pm,
replacing "\w" with "(\pL\pM*)\pN_", works like a charm and allows us to
store special characters in the database in iso-8859-1 format.
BTW, the Perl regex above means: any letter followed by any optional
combining marks, or a numeral, or an underscore (yes, \w accepts
underscore as well). So we only need to patch the functions below:
FS::Record::ut_text
FS::Record::ut_textn
FS::Record::ut_alpha
FS::Record::ut_alphan
FS::Record::ut_name
I read a lot about Perl being able to have \w match special characters
based on the locale in use, but I spent 3 hours trying to get it to work
and failed.
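As a rough illustration of point 1 (a sketch only, not the code from the
patch): the snippet below assumes the replacement is meant as "a letter
with optional combining marks, or a numeral, or an underscore" and spells
that out as an alternation. The names $word_char and looks_alpha are made
up for the example; they are not part of FS::Record.

```perl
use strict;
use warnings;

# Assumed reading of the "\w" replacement: a letter followed by optional
# combining marks, or a numeral, or an underscore.
my $word_char = qr/(?:\pL\pM*|\pN|_)/;

# Hypothetical checker for illustration; not the real FS::Record::ut_alpha.
sub looks_alpha {
    my ($value) = @_;
    return $value =~ /^$word_char+$/ ? 1 : 0;
}

my $precomposed = "\x{E0}";     # à as a single character
my $decomposed  = "a\x{300}";   # "a" followed by a combining grave accent

# Prints 1 for accepted values and 0 for rejected ones.
for my $sample ($precomposed, $decomposed, "foo_42", "foo bar") {
    printf "%d\n", looks_alpha($sample);
}
```

Both forms of the accented a are accepted here, while the value with a
space is rejected, since a space is none of the three alternatives.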
2) Secondly, there's another way of passing variables from the
interface to the Perl modules (i.e. Ajax calls), so we need to patch
that one too.
The data is prepared in httemplate/elements/xmlhttp.html. Every
argument is encoded with the JavaScript "escape" function, which
doesn't handle accented characters well, so it needs to be replaced by
the better "encodeURIComponent" function [2] (JavaScript 1.5), which
also lets us get rid of the awful hack replacing plus signs.
The receiving part of the Ajax process is done in FS/UI/Web.pm. Since
Ajax calls put everything in XML format, the iso-8859-1 characters are
automatically converted to UTF-8 on their way through the xmlrequest
process. So the only thing to do in FS::Web::start_job is to decode the
arguments from UTF-8 and then encode them in iso-8859-1, so that it
works exactly as if the data had been passed the "standard" way.
A "use Encode;" is then a dependency, but I believe it already was,
since it's a pretty common module.
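The decode-then-encode step on the receiving side could look roughly like
this. It is a minimal sketch using the core Encode module; the helper
name utf8_to_latin1 and the sample value are invented for the example and
are not taken from start_job.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn UTF-8 octets (as received from an Ajax call)
# into iso-8859-1 octets (as the rest of the code expects).
sub utf8_to_latin1 {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);    # UTF-8 octets -> Perl characters
    return encode('ISO-8859-1', $chars);     # characters -> iso-8859-1 octets
}

my $from_ajax = "caf\xC3\xA9";               # "café" as UTF-8 octets (5 bytes)
my $latin1    = utf8_to_latin1($from_ajax);  # "café" as iso-8859-1 (4 bytes)

# Show the resulting bytes in hex: 63 61 66 E9
print join(' ', map { sprintf '%02X', ord } split //, $latin1), "\n";
```

Characters that have no iso-8859-1 equivalent cannot survive this
conversion, so it only helps for text that fits in that encoding.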
3) Having LaTeX documents show all iso-8859-1 characters.
Once the database contains special characters, the HTML templates
work very well and show the correct characters in the browser, but the
LaTeX engine simply discards those special characters. Telling it that
the tex file is encoded in iso-8859-1 is the only way to go, so we just
need to put a "\usepackage[latin1]{inputenc}" in the tex template.
Now for the remarks:
* Perl regex checks are usually done by the checking functions in
FS::Record, but there seem to be checks using "\w" all around, so we
need to test thoroughly to make sure special characters are accepted
everywhere.
* I haven't checked whether the HTML emails that are sent show these
special characters, but forcing the email encoding to iso-8859-1 should
do the job.
* There may be a more serious problem with any procedures that count
the characters in strings to get their length. Since special characters
may in fact be 2 separate characters, the overall length of strings
could be reported wrong. Does anyone know whether there are such checks
in the code, and could test them?
* As some special characters are not a letter plus a special mark but a
special mark alone, the replacement for "\w" mentioned above isn't
really correct. As an example, I tried to input a special French
character (inherited from Latin: œ) and it was rejected by the checking
procedures, so it probably needs to be part of the "\w" replacement
pattern. This is just one example; in other languages with special
characters there may be a lot more of these cases (I'm thinking of the
German eszett character and many others), so this pattern is just a
temporary workaround. It could be a really good idea to have a global
function somewhere in the code that returns this new pattern, and to use
it in all the regexes that check with "\w" (but of course not in URL
checks, email checks, etc.). It could eventually be a configuration
variable, but I don't think it really belongs there. Is there somewhere
such a global variable can be defined? Yes, I know, globals are usually
not the way to go in object-oriented thinking, but it could be justified
here.
* It's just an iso-8859-1 workaround; the power of Unicode isn't there.
So having Japanese invoices isn't going to work as long as we stay with
this encoding.
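To make the length remark and the "global function" idea a bit more
concrete, here is a small sketch. It assumes the strings have already
been decoded to Perl characters; word_char_pattern is a made-up name, and
using Unicode::Normalize (a core module) is only one possible way to deal
with the length issue, not something the patch does.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical shared helper: a single place that defines what counts as
# a "word" character, so that every check can use the same pattern.
sub word_char_pattern {
    return qr/(?:\pL\pM*|\pN|_)/;
}

my $decomposed = "a\x{300}";          # "a" + combining grave accent
my $composed   = NFC($decomposed);    # composed into the single character à

print length($decomposed), "\n";      # 2
print length($composed),   "\n";      # 1

# œ (U+0153) counts as a letter once the string is decoded to characters.
my $wc = word_char_pattern();
print "accepted\n" if "\x{153}uvre" =~ /^$wc+$/;
```

Note that œ is not part of iso-8859-1 at all (it was added in
iso-8859-15), which may be part of the reason it was rejected in my test.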
You'll find a cvs patch attached with the changes described above. Since
I have several other changes in my development version, I manually
edited the patch file, so it may be wrong.
Could non-US users test this patch with their own special characters and
report on the overall behavior?
Mathieu
PS: I tried to document my code modifications so as to include that in
the patch, but I can't seem to find a Changes log. Don't you use this
kind of log?
[1] : http://www.regular-expressions.info/unicode.html
[2] : http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Reference:Global_Functions:encodeURIComponent
-------------- next part --------------
A non-text attachment was scrubbed...
Name: freeside_special_characters.patch
Type: text/x-patch
Size: 7025 bytes
Desc: not available
Url : http://420.am/pipermail/freeside-devel/attachments/20070918/04f0b78d/attachment.bin