| Author |
Strange encoding issue
|
|
| Dylan Parry 2007-10-25, 6:22 pm |
| Hi folks,
I'm having a bit of a problem with character encoding. For some reason I
am getting things like "Â»" and "Â©" appearing on a new site I am
building. The pages are being served up as UTF-8, and were created/saved
as UTF-8 in MS Expression Web.
Strangely, the problem only manifests on /some/ pages but not others.
All were created and served in the same way.
The problem occurs in all of Firefox/IE7/Opera/Safari, so it's not a
browser issue. All browsers are detecting the documents as UTF-8 and
displaying them as such. Manually overriding the character encoding
doesn't fix the problem, and in some cases makes things worse - for
example, I thought it could have been ISO-8859-1 being incorrectly
served as UTF-8, but changing to ISO-8859-1 causes text such as "Ã‚Â»"
and "Ã‚Â©" to appear instead.
If I open up the pages in Notepad, the code appears exactly how it
should, ie. "»" or "©" with no other characters. If I then save the file
without actually making any changes, then it works fine and the browser
once again shows the document as intended.
Any ideas?
--
Dylan Parry
http://electricfreedom.org | http://webpageworkshop.co.uk
The opinions stated above are not necessarily representative of
those of my cats. All opinions expressed are entirely your own.
| |
| Dylan Parry 2007-10-25, 6:22 pm |
| Dylan Parry wrote:
[...]
> The pages are being served up as UTF-8, and were created/saved
> as UTF-8 in MS Expression Web.
[...]
I've narrowed down the problem to where it occurs. I've noticed that I
only get this issue in files that have been affected by a global find
and replace operation, ie. find and replace in all files within a project.
So at least I now know what causes it, but it would be nice to be able
to use find and replace without it screwing up my site :(
| |
| Christoph Schneegans 2007-10-25, 6:22 pm |
| Dylan Parry wrote:
> I'm having a bit of a problem with character encoding. For some reason I
> am getting things like "Â»" and "Â©" appearing on a new site I am
> building. The pages are being served up as UTF-8, and were created/saved
> as UTF-8 in MS Expression Web.
Post the URL, please. Do you use ASP.NET? I think it is possible to
misconfigure the web.config file so that ASP.NET reads your UTF-8 encoded
.aspx files as ISO-8859-1.
My next guess would be include files that use a different encoding than the
including page.
> If I open up the pages in Notepad, the code appears exactly how it
> should, ie. "»" or "©" with no other characters. If I then save the file
> without actually making any changes, then it works fine and the browser
> once again shows the document as intended.
Which encoding does Notepad assume? Just open the file, call "File > Save
as..." and check the value of the "Encoding" combobox.
Notepad and xWeb both store UTF-8 files with a byte-order mark. However, one
notable difference is the handling of invalid UTF-8 sequences when loading
a file: Notepad just throws these bytes away, while xWeb tries to preserve
them. Thus, I think it is possible that when you open a UTF-8 encoded file
with some invalid bytes in xWeb and then save it, the invalid bytes might
still be there. On the other hand, UTF-8 encoded files saved from within
Notepad should never contain invalid sequences.
Are you 100 percent sure that your files are perfectly valid UTF-8? If your
files are XHTML, you can temporarily rename them to .xml and then open them
in IE.
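
A scripted version of that validity check could look like the sketch below. It assumes Python 3 is available; the function names are made up for illustration and are not part of any tool mentioned in this thread.

```python
import codecs
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding: raises on any invalid sequence
    except UnicodeDecodeError as err:
        return err.start
    return None


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte-order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


if __name__ == "__main__":
    sample = b"abc\xbbdef"  # a lone 0xBB byte is not valid UTF-8
    print(first_invalid_utf8_offset(sample))
    print(has_utf8_bom(codecs.BOM_UTF8 + b"<html>"))
```

Note that a file whose characters were encoded twice is still perfectly valid UTF-8, so a clean result here does not rule out that kind of damage.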
--
<http://schneegans.de/lv/> · rfc 4646 compliant language tag validator
| |
| Dylan Parry 2007-10-25, 6:22 pm |
| Christoph Schneegans wrote:
> Post the URL, please. Do you use ASP.NET? I think it is possible to
> misconfigure the web.config file so that ASP.NET reads your UTF-8 encoded
> .aspx files as ISO-8859-1.
I would normally upload the files, but I can't do so this time as I'm
using ASP.NET and don't currently have access to a suitable server other
than the dev one, which isn't publicly visible. I've checked the
web.config, and nothing in there should be causing this.
> My next guess would be include files that use a different encoding than the
> including page.
It's not that. I've got a couple of included files, but they're
definitely in UTF-8, and aren't the files that contain the affected
characters either.
>
> Which encoding does Notepad assume? Just open the file, call "File > Save
> as..." and check the value of the "Encoding" combobox.
It shows as UTF-8.
> Notepad and xWeb both store UTF-8 files with a byte-order mark. However, one
> notable difference is the handling of invalid UTF-8 sequences when loading
> a file: Notepad just throws these bytes away, while xWeb tries to preserve
> them. Thus, I think it is possible that when you open a UTF-8 encoded file
> with some invalid bytes in xWeb and then save it, the invalid bytes might
> still be there. On the other hand, UTF-8 encoded files saved from within
> Notepad should never contain invalid sequences.
Ah, that does begin to cast some light on the issue...
> Are you 100 percent sure that your files are perfectly valid UTF-8? If your
> files are XHTML, you can temporarily rename them to .xml and then open them
> in IE.
Now I am not that sure. As I've mentioned in a follow-up post, it's only
occurring in those files affected by a global find and replace - so it
would seem that xWeb is corrupting these files whenever I do one. FWIW,
it works perfectly fine if I manually edit these files in xWeb, so it
would seem that the method xWeb uses to open/edit/save files
automatically in the global find and replace is what is causing this to
happen.
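
If the damaged files really contain characters that were encoded to UTF-8 twice, the damage can in principle be reversed. The sketch below is a heuristic in Python 3, not anything xWeb provides: it assumes the intermediate misreading was ISO-8859-1 (Windows-1252 would need different handling for bytes 0x80-0x9F), and the function name is hypothetical.

```python
import re

# A UTF-8 lead byte (0xC2-0xF4) followed by continuation bytes (0x80-0xBF),
# as they look after being misread as ISO-8859-1 characters.
_SUSPECT = re.compile("[\u00c2-\u00f4][\u0080-\u00bf]+")


def undo_double_encoding(text: str) -> str:
    """Replace runs that look like UTF-8 bytes misread as ISO-8859-1."""

    def fix(match):
        chunk = match.group(0)
        try:
            return chunk.encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            return chunk  # not a UTF-8 sequence after all; leave it alone

    return _SUSPECT.sub(fix, text)


if __name__ == "__main__":
    damaged = "\u00c2\u00bb Home \u00c2\u00a9 2007"
    print(undo_double_encoding(damaged))
```

The pattern can also match legitimate text (a real "Â" followed by a real "»", for instance), so any repaired file should be reviewed before it replaces the original.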
| |
| Andy Dingley 2007-10-25, 6:22 pm |
| On 25 Oct, 14:30, Dylan Parry <use...@dylanparry.com> wrote:
> Hi folks,
>
> I'm having a bit of a problem with character encoding. For some reason I
> am getting things like "Â»" and "Â©" appearing on a new site I am
> building. The pages are being served up as UTF-8, and were created/saved
> as UTF-8 in MS Expression Web.
Are you _sure_ they're served as UTF-8 ? Checked the HTTP header and
the browser's own metadata, not just a <meat> element in the header?
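
One way to check the header directly, rather than trusting what the browser reports, is sketched below in Python 3. The function names are illustrative, and the URL passed to `served_charset` would be whatever address the dev server answers on.

```python
from email.message import Message
from typing import Optional
from urllib.request import urlopen


def charset_from_content_type(value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # lower-cased, or None if absent


def served_charset(url: str) -> Optional[str]:
    """Fetch a URL and report the charset declared in its HTTP header."""
    with urlopen(url) as response:
        return response.headers.get_content_charset()


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=UTF-8"))
    print(charset_from_content_type("text/html"))
```

A result of None means the server declared no charset at all, which leaves the browser to guess.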
Errors of that sort (Accented-A "Â" as a prefix character) are
indicative of UTF-8 content that has been handled as non-UTF-8. Most
likely this happens right at the last moment, when your browser
receives it by HTTP.
Alternatively, something earlier on in the editing process has loaded
them as non-UTF-8, mangled them, then saved them back again as
something that's clearly and obviously non-UTF-8. This is hard to do!
It's hard to actually label a saved file as "not UTF-8". Even if a
broken old 8-bit ANSI editor were to open up UTF-8 and save it again,
so long as it doesn't change these octets (it doesn't have to
understand them), then the file will still remain as valid UTF-8.
That's why, dollars to doughnuts, it's happening at the very last
moment rather than in the previous edit process.
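
Both effects described here can be reproduced in a few lines. This is only an illustration in Python 3, using ISO-8859-1 as the stand-in for "non-UTF-8"; the helper name is made up.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode the bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")


# One misreading puts an accented A in front of the intended character.
once = misread_as_latin1("\u00bb")  # U+00BB is the right-pointing guillemet
assert once == "\u00c2\u00bb"

# Saving that result as UTF-8 and misreading it again compounds the damage.
twice = misread_as_latin1(once)
assert twice == "\u00c3\u0082\u00c2\u00bb"

# An 8-bit editor that leaves the octets untouched does no harm:
original = "\u00bb \u00a9".encode("utf-8")
roundtrip = original.decode("latin-1").encode("latin-1")
assert roundtrip == original
```

The last assertion is the reason a plain open-and-save in an 8-bit editor is harmless: every octet maps to one character and back again, so the bytes on disk never change.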
| |
| Andy Dingley 2007-10-25, 6:22 pm |
| On 25 Oct, 15:01, Dylan Parry <use...@dylanparry.com> wrote:
> I've narrowed down the problem to where it occurs. I've noticed that I
> only get this issue in files that have been affected by a global find
> and replace operation, ie. find and replace in all files within a project.
Is the content preceding the obviously broken characters non-ASCII?
To get this error it's usually necessary to inject some non-ASCII
characters before the well-formed UTF-8, then have them saved with an
ISO-8859-* encoding during storage. On receipt, the final user agent
sees the ISO-8859-* characters first and thus treats the document as
not being well-formed UTF-8.
| |
| Dylan Parry 2007-10-25, 6:22 pm |
| Andy Dingley wrote:
> Are you _sure_ they're served as UTF-8 ? Checked the HTTP header and
> the browser's own metadata, not just a <meat> element in the header?
Yes. Firefox's "page info" shows the document to be UTF-8, and also in
the drop-down menu for character encoding it's shown as UTF-8. There
isn't a meta element in the head in the offending documents, so nowhere
to get confused. (Incidentally, I think the <meat> element is invalid <g> )
> Errors of that sort (Accented-A "Â" as a prefix character) are
> indicative of UTF-8 content that has been handled as non-UTF-8. Most
> likely this happens right at the last moment, when your browser
> receives it by HTTP.
>
> Alternatively, something earlier on in the editing process has loaded
> them as non-UTF-8, mangled them, then saved them back again as
> something that's clearly and obviously non-UTF-8. This is hard to do!
Heh - I'm pretty sure that's what is happening though. I think it's
likely a bug in the find and replace with xWeb. Having avoided using it
since noticing this problem, no further occurrences have been noted.
| |
| Chaddy2222 2007-10-25, 6:22 pm |
| Dylan Parry wrote:
> I would normally upload the files, but I can't do so this time as I'm
> using ASP.NET and don't currently have access to a suitable server other
> than the dev one, which isn't publicly visible. I've checked the
> web.config, and nothing in there should be causing this.
>
> It's not that. I've got a couple of included files, but they're
> definitely in UTF-8, and aren't the files that contain the affected
> characters either.
>
> It shows as UTF-8.
>
> Ah, that does begin to cast some light on the issue...
>
> Now I am not that sure. As I've mentioned in a follow-up post, it's only
> occurring in those files affected by a global find and replace - so it
> would seem that xWeb is corrupting these files whenever I do one. FWIW,
> it works perfectly fine if I manually edit these files in xWeb, so it
> would seem that the method xWeb uses to open/edit/save files
> automatically in the global find and replace is what is causing this to
> happen.
>
Yes, it sounds like an MSEW bug to me.
--
Regards Chad. http://freewebdesign.awardspace.biz
| |
| Dylan Parry 2007-10-25, 6:22 pm |
| Andy Dingley wrote:
> Is the content preceding the obviously broken characters non-ASCII?
No, it's all ASCII text (at least UTF-8 within the ASCII block) up to
the character that goes "wrong".
|
|
|
|