







need help with XMLHTTPreq and parsing response
jason pollard

2007-05-12, 7:13 pm

Hi, it's my first post here.

I'm grabbing a byte stream with XMLHttpRequest. Normally, the stream is
processed in a Java applet as a byte stream. Most of the data comes in as
expected, but an occasional character seems to be garbled. For example,
looking at the stream contents in Venkman, I can see something like this:
SE\x07\x03\u8220\x67
All the characters are recognized normally except the Unicode one. In this
case the \u8220 should be (int)147. I guess what I'm looking for is some
insight into how the data is parsed by the XMLHttpRequest object and how I
may convert the misrecognized characters to what they're supposed to be.
I've tried overriding the MIME type to x-application/octet-stream and
everything else I can think of, but it's always the same.
Any ideas? Thanks in advance,

--Jason


jason pollard

2007-05-12, 7:13 pm

I might clarify by saying I'm getting the raw text with request.responseText
and I'm not interested in parsed xml.




Martin Honnen

2007-05-12, 7:13 pm



jason pollard wrote:

> I might clarify by saying I'm getting the raw text with request.responseText


It's a responseText property, meaning the byte stream that arrives is
decoded into text.
XMLHttpRequest in Mozilla does not have a responseStream or responseBody
property. MSXML has those,
<http://msdn.microsoft.com/library/d...c0c12cbddde.asp>
<http://msdn.microsoft.com/library/d...f3819f5b918.asp>
though even there it is difficult to do much with them using
J(ava)Script (at least client-side).

In which context are you using XMLHttpRequest in Mozilla? Is that simply
script in an HTML or XML document loaded from an HTTP web server? I don't
think you can process any bytes that way.

--

Martin Honnen
http://JavaScript.FAQTs.com/

Christian Biesinger

2007-05-12, 7:13 pm

jason pollard wrote:
> All the characters are recognized normally, but the Unicode one. In this
> case the \u8220 should be (int)147.


Hm, that \u8220 should be \u201c, I'd think (8220 is the decimal value).
U+201C is "left double quotation mark"; in windows-1252 that's the
character at position 147 (0x93). The data you get is not bytes but a text
stream; the incoming data was converted to Unicode.

jason pollard

2007-05-12, 7:13 pm

Thanks for the replies...I'm using XMLHttpRequest w/ Mozilla.

>
> In which context are you using XMLHttpRequest in Mozilla? Is that simply
> script in a HTML or XML document loaded from a HTTP web server? I don't
> think you can process any bytes that way.
>
> --
>
> Martin Honnen
> http://JavaScript.FAQTs.com/



jason pollard

2007-05-12, 7:13 pm

Hi, Thanks for the response. I think you are correct that the unicode value
there is \u201c, I've just been outputting the charCodeAt() to screen which
gives the decimal value.
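
The decimal/hex mix-up above can be checked with a few lines of script. This is only a small sketch; the helper name toUnicodeEscape is made up for illustration and is not part of any browser API. It formats the value returned by charCodeAt() as the four-digit hex escape that a debugger would show.

```javascript
// Hypothetical helper: format the first character of a string as a
// \uXXXX escape, so the decimal value from charCodeAt() can be compared
// with the hex notation used in Unicode tables.
function toUnicodeEscape(ch) {
  var hex = ch.charCodeAt(0).toString(16);
  while (hex.length < 4) {
    hex = "0" + hex; // pad to four hex digits
  }
  return "\\u" + hex;
}

// charCodeAt() reports 8220 for the left double quotation mark,
// which is 201c in hex.
var quote = String.fromCharCode(8220);
var escaped = toUnicodeEscape(quote); // "\u201c"
```

So 8220 and \u201c are the same character, written in decimal and in hex respectively.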

At any rate, the value is above 255, which is screwing things up. I read
that Unicode strings (except for UTF-8) signal their arrival with a byte
order mark, xFFFE or xFEFF, so I was thinking that whatever is parsing the
incoming stream is putting that there, and I was trying to figure out a way
to extract the byte from the string, but no go.

You can access the actual stream here:
http://bvserver.inetats.com/SERVICE/SQUOTE?STOCK=DELL

The offending character in question is the "oe" one (\u0153, 339 decimal;
it should be \x9C, 156 decimal), if you can see that in your browser (I'm
using Western-1252 encoding). I understand that the stream is being
converted automatically to Unicode; I was just looking for some rhyme or
reason as to how it's being converted, or why some bytes are converted to
Unicode, so that I can convert back to bytes (actually ints <= 255). Could
someone post a link to the relevant source code? I took a look at the
Mozilla repository, but I'm not fluent in C so it didn't help at all.

Thanks,

--Jason

"Christian Biesinger" <cbiesinger@web.de> wrote in message
news:dpr6dh$e272@ripley.aoltw.net...
> jason pollard wrote:
>
> Hm, that \u8220 should be \u201c I'd think... U+201C is "left double
> quotation mark"; In windows-1252 that's the character at position 147
> (0x93). The data you get is not bytes, but a text stream; the incoming
> data was converted to unicode.



Christian Biesinger

2007-05-12, 7:13 pm

jason pollard wrote:
> I just was looking for some rhyme or reason how
> it's being converted, or why some bytes are converted to unicode, so that I
> can convert back to bytes (actually ints < 255).


Ah. Well, I mentioned that the data is apparently interpreted as
windows-1252, did I not? Just translate it back.

See, for example, the table at
http://www.microsoft.com/globaldev/.../sbcs/1252.mspx

> Could someone post a link
> to the relevant source code? I took a look at the Mozilla repository, but
> I'm not fluent in C so it didn't help at all.


http://lxr.mozilla.org/seamonkey/so...Request.cpp#520
is where XMLHttpRequest converts the server data to unicode. Looks like
all it does is get the charset from the HTTP response header, falling
back to UTF-8.

Although there's a harder case when there actually is an XML document
object; I'm not sure how the charset detection happens then.

jason pollard

2007-05-12, 7:13 pm

> > I just was looking for some rhyme or reason how
>
> Ah. Well, I mentioned that the data is apparently interpreted as
> windows-1252, did I not? Just translate it back.
>

I think that's where my question is going....I mean a \u201c character is a
\u201c, no matter what codepage is used to view it, right? So, then how do
I 'break out' the correct value? I'm assuming the correct value comes in
the bytestream (e.g. 147), but is mistakenly(?) converted to a bigger
unicode char (e.g. \u201c, 8220 dec). I've tried converting to Hex,
splitting, bitshifting, ANDing, ORing, etc. but can find no logic to it.

Here is some data I collected on the problem, maybe someone can see a
pattern:
charCodeAt() (dec)    should be* (dec)
8225                  135 (x87)
8220 (\u201C)         147 (x93)
8216                  145 (x91)
339                   156 (x9C)

*The "should be" column is what the Java version of the program returns,
i.e. InputStream.read().

Also, in this case the server doesn't return a codepage suggestion in the
header, so I'm assuming that Mozilla xmlhttpreq is assuming everything is in
UTF-8. All the other characters in the stream are interpreted correctly,
it's usually just 1 or 2 out of 30 or so that get turned to unicode -- go
out of range in other words.

To clarify one more point, I don't need to display the characters of the
response. I have to get the charCodeAt() value, which should be in the
single-byte range (0-255), for further processing. I therefore don't really
care about the codepage, unless it affects the bytestream. Currently, no
matter what I specify in overrideMimeType, it has no effect on the
processing of the bytestream.

> http://lxr.mozilla.org/seamonkey/so...Request.cpp#520
> is where XMLHttpRequest converts the server data to unicode. Looks like
> all it does is get the charset from the HTTP response header, falling
> back to UTF-8.


Thanks for that link. I will investigate this route further.

--Jason

P.S. I'm using readyState ==3 in my processReqChange function. Moz is the
only one to do it as documented. Good job.



jason pollard

2007-05-12, 7:13 pm


"Christian Biesinger" <cbiesinger@web.de> wrote in message
news:dpsbrq$k81@ripley.aoltw.net...
> jason pollard wrote:
>
> Ah. Well, I mentioned that the data is apparently interpreted as
> windows-1252, did I not? Just translate it back.
>
> See, for example, the table at
> http://www.microsoft.com/globaldev/.../sbcs/1252.mspx
>

Oooooh (smacks forehead)......I was taking another look at this table. I
thought it'd be useless because the byte values don't go above 255, but upon
looking at the x80 and x90 rows for one of my values, I noticed the Unicode
values listed for those win-1252 bytes are in the x2000s, and indeed they
map back to the correct int values I get using Java's InputStream.read().
So all I have to do is build a table to translate the characters in
question, and I'm good to go. A little more work than I wanted, but isn't
it always? So my problem is solved, I believe. Thanks for the help
everybody.
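
A translation table along those lines could look like the sketch below. The names CP1252_HIGH, UNICODE_TO_BYTE and textToBytes are made up for illustration. The code points are the ones windows-1252 assigns to bytes 0x80 through 0x9F; the five bytes the code page leaves undefined are marked null. The sketch assumes that anything at or below 255 in the decoded string already equals the original byte, which matches the values reported in this thread but is otherwise an assumption about the decoder.

```javascript
// Code points that windows-1252 assigns to bytes 0x80..0x9F, in byte order.
// null marks the bytes the code page leaves undefined.
var CP1252_HIGH = [
  0x20AC, null,   0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, null,   0x017D, null,
  null,   0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, null,   0x017E, 0x0178
];

// Reverse lookup: Unicode code point -> original byte value.
var UNICODE_TO_BYTE = {};
for (var i = 0; i < CP1252_HIGH.length; i++) {
  if (CP1252_HIGH[i] !== null) {
    UNICODE_TO_BYTE[CP1252_HIGH[i]] = 0x80 + i;
  }
}

// Hypothetical helper: turn windows-1252-decoded text back into an
// array of byte values (0..255).
function textToBytes(text) {
  var bytes = [];
  for (var j = 0; j < text.length; j++) {
    var code = text.charCodeAt(j);
    if (code > 0xFF) {
      if (!Object.prototype.hasOwnProperty.call(UNICODE_TO_BYTE, code)) {
        throw new Error("No windows-1252 byte for code point " + code);
      }
      code = UNICODE_TO_BYTE[code];
    }
    bytes.push(code);
  }
  return bytes;
}
```

With this, textToBytes(request.responseText) would give 147 for the \u201c character and 156 for the "oe" character, matching what InputStream.read() returns on the Java side.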

BTW, the overrideMimeType function is clearly ignored in this case, as we've
determined that the chars coming in are windows-1252. Why that is, I'm not
sure, but one thing I noticed, is that the other Western (ISO-8859-1 & 15 at
least) codepages don't have any characters defined for the x80s and 90s.
Why it doesn't just use UTF-8 I have no idea.
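
For what it's worth, a commonly described alternative to a lookup table is to pass a charset along with the MIME type, so that the decoder maps every byte to a predictable code point. The sketch below is not something tested in this thread: it assumes the browser honours a charset parameter in overrideMimeType and supports the x-user-defined charset, which maps bytes 0x80-0xFF to U+F780-U+F7FF. The function names fetchBytes and userDefinedTextToBytes are made up for illustration.

```javascript
// Recover byte values from text decoded as x-user-defined: bytes below
// 0x80 are unchanged, and bytes 0x80..0xFF arrive as U+F780..U+F7FF,
// so masking with 0xFF gives back the original byte.
function userDefinedTextToBytes(text) {
  var bytes = [];
  for (var i = 0; i < text.length; i++) {
    bytes.push(text.charCodeAt(i) & 0xFF);
  }
  return bytes;
}

// Hypothetical wrapper (browser only): request a URL and hand the
// response to a callback as an array of byte values.
function fetchBytes(url, onBytes) {
  var req = new XMLHttpRequest();
  req.open("GET", url, true);
  // Assumption: the charset given here overrides the decoder's default.
  req.overrideMimeType("text/plain; charset=x-user-defined");
  req.onreadystatechange = function () {
    if (req.readyState === 4 && req.status === 200) {
      onBytes(userDefinedTextToBytes(req.responseText));
    }
  };
  req.send(null);
}
```

If the override takes effect, no per-character table is needed, because the mask works for every byte value rather than only the x80 and x90 rows.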

--Jason


