







Re: Bug in Mozilla parser
Peter Flynn

2005-10-16, 6:28 pm

Vinu wrote:

> Hi,
>
> I am using the mozilla parser originally written by James Clark.
> I set the encoding to latin_1.
>
> My XML buffer contains some characters greater than Ascii code 128. But
> these are in the CDATA sections.


Doesn't matter. A character in the document is a character in the
document, even if it's in a CDATA section or a comment.

> Now my XML buffer is clearly in latin_1 or ASCII encoding.


No, specifying <?xml version="1.0" encoding="iso-8859-1"?> just says
that you *claim* it's all Latin-1; it doesn't guarantee or check it.

> So when the
> parser encounters these characters(ASCII code > 128), it uses two bytes
> to represent it rather than the one byte used for latin1.


You mean the parser encounters two bytes where you were expecting your
document to use only one?

> This is obviously a UTF-8 representation.


Possibly.

> But I have clearly set the encoding as latin1. Then why is this
> happening?


Whatever process generates your XML (hand-edit? program? download?)
is storing the character in a character encoding other than ISO 8859-1.

You need to find out where the data is coming from and fix it.

> Also, isn't the parser supposed to ignore anything in the CDATA section?


No, a CDATA section only stops the parser scanning for markup characters.
The content still has to be in the document character set.
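
To make this concrete, here is a minimal sketch, assuming the parser in question is Expat (James Clark's parser) and using Python's standard `xml.parsers.expat` binding to it rather than the C API. The helper name `parse_text` and the sample document are made up for illustration. It shows that a byte inside a CDATA section is still decoded according to the declared encoding and handed back as a character.

```python
import xml.parsers.expat


def parse_text(xml_bytes):
    """Return all character data the parser reports, concatenated."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Text inside CDATA sections is delivered through the same handler
    # as ordinary character data.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    return "".join(chunks)


# Hypothetical document: byte 0xA4 inside a CDATA section, declared as Latin-1.
doc = b'<?xml version="1.0" encoding="iso-8859-1"?><a><![CDATA[\xa4]]></a>'
text = parse_text(doc)

# The parser reports the character U+00A4, not the raw byte.
print(repr(text), hex(ord(text)))
```

The CDATA markers only switch off markup recognition; the decoding step happens regardless.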

> Is this a well-known bug?


No, but it is a common problem, usually due to the XML document being
stored in a different encoding to the one you are expecting.

> Could somebody please suggest something on
> this.


If you're using a Unix-based system (Mac, Linux, etc), and if you are
positive that the offending character has a representation in ISO 8859-1,
use the iconv program to force the document to be stored as such, eg

$ iconv -t iso-8859-1 myfile.xml >newfile.xml

If the character has no representation in ISO 8859-1 you'll get an
error message, and you'll have to use UTF-8 or something else instead.

I don't know how you do this under Windows, though.
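
One option where iconv is not available is a few lines of Python. This is only a sketch, not something from the original exchange: the function names `reencode` and `reencode_file` are hypothetical, and you have to tell it the encoding the file is actually stored in. Like iconv, it fails with an error if a character has no representation in ISO 8859-1.

```python
def reencode(data: bytes, source: str, target: str = "iso-8859-1") -> bytes:
    """Decode data from `source` and re-encode it as `target`.

    Raises UnicodeDecodeError if data is not valid in `source`, and
    UnicodeEncodeError if a character has no representation in `target`.
    """
    return data.decode(source).encode(target)


def reencode_file(src_path: str, dst_path: str, source: str) -> None:
    """File-level wrapper around reencode(); reads and writes raw bytes."""
    with open(src_path, "rb") as src:
        converted = reencode(src.read(), source)
    with open(dst_path, "wb") as dst:
        dst.write(converted)


# Example: the two UTF-8 bytes C2 A4 become the single Latin-1 byte A4.
print(reencode(b"<a>\xc2\xa4</a>", "utf-8"))
```

Note that this only converts the bytes. The encoding named in the XML Declaration is not touched, so it still has to match what the file now contains.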

///Peter

Vinu

2005-10-17, 3:21 am

Hi Peter,

Thank you for the response. I have a specific problem and will try to
be clearer this time.

What I can see in memory (I use the C/C++ version of the parser in
Visual Studio) is that the XML buffer is represented as one byte per
character.
For example, the space character is represented as 32 (decimal), and so on.
Now this buffer also contains the character ¤, which has ASCII code
164 (decimal) or A4 (hex).
When the parser starts parsing this buffer to build a DOM tree, it
replaces ¤ (code 164, represented as one byte) with two bytes (which,
when represented in ASCII, are the characters Â¤). I think this is a
UTF-8 representation, where two bytes are used to represent ¤, whereas
in ASCII/latin1 it would be represented with just one byte.

So when I get back the XML from the DOM, and I am expecting a latin1
encoding, I get Â¤ wherever ¤ was expected.

So is it the behaviour of any XML parser to store XML internally in UTF-8,
and is it then the responsibility of the application using the parser
to convert this UTF-8 XML to whatever encoding it wishes?

Looking forward to your comments

Thanks
Vinu

Peter Flynn

2005-10-17, 6:35 pm

Vinu wrote:
> What i can see in memory(i use C/C++ version of the parser in
> VisualStudio) is that the XML buffer is represented as one byte per
> character.
> For e.g space char is represented as 32(dec) and so on.
> Now this buffer also contains the character ¤ which has ASCII code
> 164(decimal) or A4(hex).
> When the parser starts parsing this buffer to build a DOM tree, it is
> replacing ¤(code 164 and represented as one byte) with two bytes(which
> when represented in ASCII are the characters Â¤). I think this is a
> UTF-8 representation, where two bytes are used to represent ¤, whereas
> in ASCII/latin1 it would be represented with just one byte.


OK, now I understand. Sorry for being so slow.

> So when i get back the XML from the DOM, and i am expecting a latin1
> encoding, i get Â¤ wherever ¤ was expected.


Does your XML document start with an XML Declaration specifying
ISO-8859-1 as the character encoding? If not, then UTF-8 is assumed.
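
The effect of the declaration can be checked directly. The sketch below again assumes Expat, driven through Python's `xml.parsers.expat`; the helper `is_well_formed` and both sample documents are invented for illustration. The same Latin-1 byte is accepted when the encoding is declared and rejected when the parser falls back to UTF-8, because a lone 0xA4 is not a valid UTF-8 sequence.

```python
import xml.parsers.expat


def is_well_formed(xml_bytes):
    """Return True if Expat parses the bytes without raising an error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_bytes, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Same content byte (0xA4), with and without an XML Declaration.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?><a>\xa4</a>'
undeclared = b'<a>\xa4</a>'

print(is_well_formed(declared))
print(is_well_formed(undeclared))
```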

> So is the behaviour of any XML parser to store XML internally in UTF-8


Given that parsers are mandated to support UTF-8, probably yes.

> and then probably its the responsibility of the application using this
> parser to convert this XML in UTF-8 encoding to any encoding which it
> wishes.


Yes, that's probably true also.
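
The byte arithmetic behind the symptom can be sketched in a few lines of Python (illustrative only; the variable names are made up). The character ¤ is U+00A4: one byte in ISO 8859-1, two bytes in UTF-8. Reading those two UTF-8 bytes as if they were Latin-1 gives exactly the "Â¤" described above, and converting the parser's output back to Latin-1 is a single re-encoding step.

```python
currency = "\u00a4"  # the character ¤

latin1_bytes = currency.encode("iso-8859-1")  # one byte: A4
utf8_bytes = currency.encode("utf-8")         # two bytes: C2 A4

# Misreading the UTF-8 bytes as Latin-1 yields two characters, Â and ¤.
misread = utf8_bytes.decode("iso-8859-1")

# What the application would do with UTF-8 output from the parser:
# decode it as UTF-8, then encode to the encoding it actually wants.
restored = utf8_bytes.decode("utf-8").encode("iso-8859-1")

print(latin1_bytes.hex(), utf8_bytes.hex(), misread, restored.hex())
```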

///Peter
--
XML FAQ: http://xml.silmaril.ie/
