Author Strange encoding issue
Dylan Parry

2007-10-25, 6:22 pm

Hi folks,

I'm having a bit of a problem with character encoding. For some reason I
am getting things like "Â»" and "Â©" appearing on a new site I am
building. The pages are being served up as UTF-8, and were created/saved
as UTF-8 in MS Expression Web.

Strangely, the problem only manifests on /some/ pages but not others.
All were created and served in the same way.

The problem occurs in all of Firefox/IE7/Opera/Safari, so it's not a
browser issue. All browsers are detecting the documents as UTF-8 and
displaying them as such. Manually overriding the character encoding
doesn't fix the problem, and in some cases makes things worse - for
example, I thought it could have been ISO-8859-1 being incorrectly
served as UTF-8, but changing to ISO-8859-1 causes text such as "»"
and "©" to appear instead.

If I open up the pages in Notepad, the code appears exactly how it
should, ie. "»" or "©" with no other characters. If I then save the file
without actually making any changes, then it works fine and the browser
once again shows the document as intended.

Any ideas?

--
Dylan Parry
http://electricfreedom.org | http://webpageworkshop.co.uk

The opinions stated above are not necessarily representative of
those of my cats. All opinions expressed are entirely your own.
Dylan Parry

2007-10-25, 6:22 pm

Dylan Parry wrote:

[...]
> The pages are being served up as UTF-8, and were created/saved
> as UTF-8 in MS Expression Web.

[...]

I've narrowed down the problem to where it occurs. I've noticed that I
only get this issue in files that have been affected by a global find
and replace operation, ie. find and replace in all files within a project.

So at least I now know what causes it, but it would be nice to be able
to use find and replace without it screwing up my site :(
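
For anyone who lands here with the same problem, a project-wide find and replace can be done outside the editor while keeping the encoding explicit. The following is only a minimal sketch, not a description of what Expression Web does internally; the function names and the suffix list are assumptions made up for illustration.

```python
from pathlib import Path


def replace_in_utf8_bytes(data: bytes, old: str, new: str) -> bytes:
    """Replace text in a UTF-8 file image without changing its encoding.

    Decoding is strict, so a file that is not valid UTF-8 raises
    UnicodeDecodeError instead of being silently mangled. A leading
    byte-order mark decodes to U+FEFF and is written back unchanged.
    """
    text = data.decode("utf-8")
    return text.replace(old, new).encode("utf-8")


def replace_in_tree(root, old, new, suffixes=(".aspx", ".html")):
    """Apply the replacement to every matching file under root.

    Returns the list of paths that were rewritten. The default suffixes
    are hypothetical and should be adjusted to the project.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        original = path.read_bytes()
        updated = replace_in_utf8_bytes(original, old, new)
        if updated != original:
            path.write_bytes(updated)
            changed.append(path)
    return changed
```

Working on bytes and decoding explicitly means the script never has to guess an encoding, which is the step most likely to go wrong in an automated open/edit/save cycle.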

--
Dylan Parry
http://electricfreedom.org | http://webpageworkshop.co.uk

The opinions stated above are not necessarily representative of
those of my cats. All opinions expressed are entirely your own.
Christoph Schneegans

2007-10-25, 6:22 pm

Dylan Parry wrote:

> I'm having a bit of a problem with character encoding. For some reason I
> am getting things like "Â»" and "Â©" appearing on a new site I am
> building. The pages are being served up as UTF-8, and were created/saved
> as UTF-8 in MS Expression Web.


Post the URL, please. Do you use ASP.NET? I think it is possible to
misconfigure the web.config file so that ASP.NET reads your UTF-8 encoded
.aspx files as ISO-8859-1.

My next guess would be include files that use a different encoding than the
including page.

> If I open up the pages in Notepad, the code appears exactly how it
> should, ie. "»" or "©" with no other characters. If I then save the file
> without actually making any changes, then it works fine and the browser
> once again shows the document as intended.


Which encoding does Notepad assume? Just open the file, call "File > Save
as..." and check the value of the "Encoding" combobox.

Notepad and xWeb both store UTF-8 files with a byte-order mark. However, one
notable difference is the handling of invalid UTF-8 sequences when loading
a file: Notepad just throws these bytes away, while xWeb tries to preserve
them. Thus, I think it is possible that when you open a UTF-8 encoded file
with some invalid bytes in xWeb and then save it, the invalid bytes might
still be there. On the other hand, UTF-8 encoded files saved from within
Notepad should never contain invalid sequences.

Are you 100 percent sure that your files are perfectly valid UTF-8? If your
files are XHTML, you can temporarily rename them to .xml and then open them
in IE.
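
If renaming to .xml is inconvenient, the same question (is the file valid UTF-8, and does it start with a byte-order mark?) can be answered with a few lines of script. This is a minimal sketch; the function name `inspect_utf8` is made up for illustration.

```python
import codecs


def inspect_utf8(data: bytes):
    """Return (has_bom, offset).

    has_bom is True if the buffer starts with the UTF-8 byte-order mark.
    offset is the index of the first byte that cannot be decoded as
    UTF-8, or None if the whole buffer decodes cleanly.
    """
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        data.decode("utf-8")  # strict decoding: any invalid byte raises
    except UnicodeDecodeError as exc:
        return has_bom, exc.start
    return has_bom, None
```

Note that a file whose characters were encoded twice is still perfectly valid UTF-8, so a clean result here rules out invalid bytes but not every kind of damage.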

--
<http://schneegans.de/lv/> · rfc 4646 compliant language tag validator

Dylan Parry

2007-10-25, 6:22 pm

Christoph Schneegans wrote:

> Post the URL, please. Do you use ASP.NET? I think it is possible to
> misconfigure the web.config file so that ASP.NET reads your UTF-8 encoded
> .aspx files as ISO-8859-1.


I would normally upload the files, but I can't do so this time as I'm
using ASP.NET and don't currently have access to a suitable server other
than the dev one, which isn't publicly visible. I've checked the
web.config, and nothing in there should be causing this.

> My next guess would be include files that use a different encoding than the
> including page.


It's not that. I've got a couple of included files, but they're
definitely in UTF-8, and aren't the files that contain the affected
characters either.

>
> Which encoding does Notepad assume? Just open the file, call "File > Save
> as..." and check the value of the "Encoding" combobox.


It shows as UTF-8.

> Notepad and xWeb both store UTF-8 files with a byte-order mark. However, one
> notable difference is the handling of invalid UTF-8 sequences when loading
> a file: Notepad just throws these bytes away, while xWeb tries to preserve
> them. Thus, I think it is possible that when you open a UTF-8 encoded file
> with some invalid bytes in xWeb and then save it, the invalid bytes might
> still be there. On the other hand, UTF-8 encoded files saved from within
> Notepad should never contain invalid sequences.


Ah, that does begin to cast some light on the issue...

> Are you 100 percent sure that your files are perfectly valid UTF-8? If your
> files are XHTML, you can temporarily rename them to .xml and then open them
> in IE.


Now I am not that sure. As I've mentioned in a follow-up post, it's only
occurring in those files affected by a global find and replace - so it
would seem that xWeb is corrupting these files whenever I do one. FWIW,
it works perfectly fine if I manually edit these files in xWeb, so it
would seem that the method xWeb uses to open/edit/save files
automatically in the global find and replace is what is causing this to
happen.

--
Dylan Parry
http://electricfreedom.org | http://webpageworkshop.co.uk

The opinions stated above are not necessarily representative of
those of my cats. All opinions expressed are entirely your own.
Andy Dingley

2007-10-25, 6:22 pm

On 25 Oct, 14:30, Dylan Parry <use...@dylanparry.com> wrote:
> Hi folks,
>
> I'm having a bit of a problem with character encoding. For some reason I
> am getting things like "Â»" and "Â©" appearing on a new site I am
> building. The pages are being served up as UTF-8, and were created/saved
> as UTF-8 in MS Expression Web.


Are you _sure_ they're served as UTF-8 ? Checked the HTTP header and
the browser's own metadata, not just a <meat> element in the header?
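
As a rough illustration of what "checking the HTTP header" means: the charset is a parameter of the Content-Type response header. The sketch below only parses a header value that has already been captured; the helper name `charset_from_content_type` is an assumption, and it relies on the standard library's MIME header parsing.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lower case, or None if the header does not
    declare one.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()
```

For example, `charset_from_content_type("text/html; charset=UTF-8")` gives `"utf-8"`, while a bare `"text/html"` gives `None`, in which case the browser has to fall back on other hints.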

Errors of that sort (Accented-A "Â" as a prefix character) are
indicative of UTF-8 content that has been handled as non-UTF-8. Most
likely this happens right at the last moment, when your browser
receives it by HTTP.
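
The effect is easy to reproduce. A minimal sketch, with a helper name chosen only for illustration:

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as ISO-8859-1.

    This mimics a consumer that receives UTF-8 octets but treats each
    octet as one ISO-8859-1 character.
    """
    return text.encode("utf-8").decode("iso-8859-1")
```

Here `misread_as_latin1("\u00bb")` returns the two characters U+00C2 U+00BB, which display as "Â»", and `misread_as_latin1("\u00a9")` returns U+00C2 U+00A9, which display as "Â©". Both symbols are encoded in UTF-8 as the octet 0xC2 followed by a second octet, and 0xC2 is "Â" in ISO-8859-1.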

Alternatively, something earlier on in the editing process has loaded
them as non-UTF-8, mangled them, then saved them back again as
something that's clearly and obviously non-UTF-8. This is hard to do!
It's hard to actually label a saved file as "not UTF-8". Even if a
broken old 8-bit ANSI editor were to open up UTF-8 and save it again,
so long as it doesn't change these octets (it doesn't have to
understand them), then the file will still remain as valid UTF-8.
That's why, dollars to doughnuts, it's happening at the very last
moment rather than in the previous edit process.
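
The two editing scenarios can be contrasted in a short sketch. It is an illustration of the general mechanism only, not a confirmed account of what any particular editor does, and the function names are invented for the example.

```python
def resave_same_encoding(data: bytes) -> bytes:
    """An 8-bit editor that loads octets as ISO-8859-1 and saves them
    back as ISO-8859-1: every octet survives, so valid UTF-8 stays valid."""
    return data.decode("iso-8859-1").encode("iso-8859-1")


def resave_as_utf8(data: bytes) -> bytes:
    """A tool that loads UTF-8 octets as ISO-8859-1 but saves as UTF-8:
    each non-ASCII octet is encoded a second time."""
    return data.decode("iso-8859-1").encode("utf-8")
```

Given the octets `b"\xc2\xbb"` (the UTF-8 form of "»"), the first function returns them unchanged, while the second returns `b"\xc3\x82\xc2\xbb"`, which is valid UTF-8 for the two characters "Â»". A file damaged in the second way would show "Â»" even when it is served and detected correctly as UTF-8.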

Andy Dingley

2007-10-25, 6:22 pm

On 25 Oct, 15:01, Dylan Parry <use...@dylanparry.com> wrote:
> I've narrowed down the problem to where it occurs. I've noticed that I
> only get this issue in files that have been affected by a global find
> and replace operation, ie. find and replace in all files within a project.


Is the content preceding the obviously broken characters non-ASCII?
To get this error it's usually necessary to inject some non-ASCII
characters before the well-formed UTF-8, then have them saved with an
ISO-8859-* encoding during storage. On receipt, the final user agent
sees the ISO-8859-* characters first and thus treats the document as
not being well-formed UTF-8.

Dylan Parry

2007-10-25, 6:22 pm

Andy Dingley wrote:

> Are you _sure_ they're served as UTF-8 ? Checked the HTTP header and
> the browser's own metadata, not just a <meat> element in the header?


Yes. Firefox's "page info" shows the document to be UTF-8, and also in
the drop-down menu for character encoding it's shown as UTF-8. There
isn't a meta element in the head in the offending documents, so nowhere
to get confused. (Incidentally, I think the <meat> element is invalid <g> )

> Errors of that sort (Accented-A "Â" as a prefix character) are
> indicative of UTF-8 content that has been handled as non-UTF-8. Most
> likely this happens right at the last moment, when your browser
> receives it by HTTP.
>
> Alternatively, something earlier on in the editing process has loaded
> them as non-UTF-8, mangled them, then saved them back again as
> something that's clearly and obviously non-UTF-8. This is hard to do!


Heh - I'm pretty sure that's what is happening though. I think it's
likely a bug in the find and replace with xWeb. Having avoided using it
since noticing this problem, no further occurrences have been noted.
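
For files that have already been damaged, and assuming the damage really is a second round of UTF-8 encoding applied to octets that were read as ISO-8859-1 (an assumption, not something established in this thread), the process can in principle be reversed. A minimal sketch with a made-up function name:

```python
import codecs


def undo_double_encoding(data: bytes) -> bytes:
    """Try to reverse one round of ISO-8859-1 -> UTF-8 double encoding.

    Returns the repaired bytes, or the input unchanged if the file does
    not look double-encoded. A leading byte-order mark is preserved.
    """
    bom = codecs.BOM_UTF8 if data.startswith(codecs.BOM_UTF8) else b""
    body = data[len(bom):]
    try:
        # Map each decoded character back to a single octet ...
        repaired = body.decode("utf-8").encode("iso-8859-1")
        # ... and accept the result only if it is itself valid UTF-8.
        repaired.decode("utf-8")
    except UnicodeError:
        return data
    return bom + repaired
```

The check is all-or-nothing: a file that mixes damaged and undamaged characters is left alone, and a healthy file that legitimately contains the text "Â»" would be altered, so the output should be reviewed before overwriting anything.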

--
Dylan Parry
http://electricfreedom.org | http://webpageworkshop.co.uk

The opinions stated above are not necessarily representative of
those of my cats. All opinions expressed are entirely your own.
Chaddy2222

2007-10-25, 6:22 pm


Dylan Parry wrote:
> Christoph Schneegans wrote:
>
>
> I would normally upload the files, but I can't do so this time as I'm
> using ASP.NET and don't currently have access to a suitable server other
> than the dev one, which isn't publicly visible. I've checked the
> web.config, and nothing in there should be causing this.
>
>
> It's not that. I've got a couple of included files, but they're
> definitely in UTF-8, and aren't the files that contain the affected
> characters either.
>
>
> It shows as UTF-8.
>
>
> Ah, that does begin to cast some light on the issue...
>
>
> Now I am not that sure. As I've mentioned in a follow-up post, it's only
> occurring in those files affected by a global find and replace - so it
> would seem that xWeb is corrupting these files whenever I do one. FWIW,
> it works perfectly fine if I manually edit these files in xWeb, so it
> would seem that the method xWeb uses to open/edit/save files
> automatically in the global find and replace is what is causing this to
> happen.
>

Yes, it sounds like an MSEW bug to me.
--
Regards Chad. http://freewebdesign.awardspace.biz

Dylan Parry

2007-10-25, 6:22 pm

Andy Dingley wrote:

> Is the content preceding the obviously broken characters non-ASCII?


No, it's all ASCII text (at least UTF-8 within the ASCII block) up to
the character that goes "wrong".

--
Dylan Parry
http://electricfreedom.org | http://webpageworkshop.co.uk

The opinions stated above are not necessarily representative of
those of my cats. All opinions expressed are entirely your own.