Author character encoding?
T.J.

2005-01-21, 7:21 pm


I upgraded my hosting the other day, and now when validating
my pages, I get the following warning:

"The character encoding specified in the HTTP header (iso-8859-1)
is different from the value in the <meta> element"

Is this something to do with my host, and should I change the
encoding to iso-8859-1?
(The pages still validate OK.)
TIA.


Eric Jarvis

2005-01-21, 7:21 pm

T.J. no1@home.invalid wrote:
>
> I upgraded my hosting the other day, and now when validating
> my pages, I get the following warning:
>
> "The character encoding specified in the HTTP header (iso-8859-1)
> is different from the value in the <meta> element"
>
> Is this something to do with my host, and should I change the
> encoding to iso-8859-1?
> (The pages still validate OK.)
> TIA.
>


Do you have a good reason for using a different encoding? If so, ask your
web hosts how to go about changing the HTTP header. In general, for
single-language sites in English (and most Latin languages), iso-8859-1 is
the standard encoding and the best one to use. However, you'll need to check
that you aren't using any characters that don't fit with it.
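
One way to run that check is with a few lines of Python. This is only a minimal sketch: the function name `chars_outside_latin1` and the sample text are made up for illustration, and it assumes the page text has already been read into a string.

```python
def chars_outside_latin1(text):
    """Return (position, character) pairs that ISO-8859-1 cannot represent."""
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode("iso-8859-1")
        except UnicodeEncodeError:
            problems.append((index, char))
    return problems


# Hypothetical sample: c-cedilla (U+00E7) fits in ISO-8859-1,
# the euro sign (U+20AC) does not.
sample = "Pre\u00e7o: 10 \u20ac"
print(len(chars_outside_latin1(sample)))
```

An empty result means every character in the text fits in iso-8859-1.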

--
eric
www.ericjarvis.co.uk
"live fast, die only if strictly necessary"
T.J.

2005-01-21, 11:17 pm


"Eric Jarvis" <web@ericjarvis.co.uk> wrote in message
news:MPG.1c5b9608e32b612b98dc0e@news.individual.net...
> T.J. no1@home.invalid wrote:
>
> Do you have a good reason for using a different encoding? If so, ask your
> web hosts how to go about changing the HTTP header. In general, for
> single-language sites in English (and most Latin languages), iso-8859-1 is
> the standard encoding and the best one to use. However, you'll need to check
> that you aren't using any characters that don't fit with it.
>


I have a page which is very roughly translated into French, German,
Spanish, Italian and Portuguese.
The Portuguese page at
http://www.sim64.co.uk/pt.html
is giving an error when I try to validate it, but I just can't see where
the error is. It might be something blindingly obvious, but I can't see it.
Before I upgraded my hosting it validated fine.
I was using utf-8 then, but it looks like my host is overriding this.
Could this be what is causing the problem?


Eric Jarvis

2005-01-22, 4:15 am

T.J. no1@home.invalid wrote:
>
> "Eric Jarvis" <web@ericjarvis.co.uk> wrote in message
> news:MPG.1c5b9608e32b612b98dc0e@news.individual.net...
>
> I have a page which is very roughly translated into French, German,
> Spanish, Italian and Portuguese.
> The Portuguese page at
> http://www.sim64.co.uk/pt.html
> is giving an error when I try to validate it, but I just can't see where
> the error is. It might be something blindingly obvious, but I can't see it.
> Before I upgraded my hosting it validated fine.
> I was using utf-8 then, but it looks like my host is overriding this.
> Could this be what is causing the problem?
>


Yes. Tell your hosts that you need the HTTP headers to define the encoding
as utf-8. Ask them for instructions on how you can change it or, if
that's not an option, for them to change it themselves.
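
The mismatch the validator complains about can be illustrated with a small Python sketch. It is not how the validator itself works: the two helper names are invented here, the header and <meta> values are example strings rather than anything fetched from the real site, and the regular expressions are deliberately simple.

```python
import re


def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type style string, or None."""
    match = re.search(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", value, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(html):
    """Return the charset declared in the first <meta> element that names one."""
    for match in re.finditer(r"<meta\b[^>]*>", html, re.IGNORECASE):
        charset = charset_from_content_type(match.group(0))
        if charset:
            return charset
    return None


# Example values only: what the server might send versus what the page declares.
http_header = "text/html; charset=iso-8859-1"
page = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(charset_from_content_type(http_header), charset_from_meta(page))
```

When the two values differ, the HTTP header is the one that takes precedence, which is why the host's setting matters.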

I can't see what the error is on the Portuguese page. I'd just cut and
paste in the dtd from one of the pages that validates and if that works
just assume it to be a subtle typo somewhere.

--
eric
www.ericjarvis.co.uk
"live fast, die only if strictly necessary"
Norman L. DeForest

2005-01-22, 4:15 am


On Sat, 22 Jan 2005, T.J. wrote:

> "Eric Jarvis" <web@ericjarvis.co.uk> wrote in message
> news:MPG.1c5b9608e32b612b98dc0e@news.individual.net...
>
> I have a page which is very roughly translated into French, German,
> Spanish, Italian and Portuguese.
> The Portuguese page at
> http://www.sim64.co.uk/pt.html
> is giving an error when I try to validate it, but I just can't see where
> the error is. It might be something blindingly obvious, but I can't see it.


If you use a Unicode/UTF-8-supporting text editor, the problem character
may be invisible to you. (I just tried looking at a copy of that file
with EditPad Lite and that character is invisible when it is the
first thing in the file.) The first three bytes in that file (immediately
ahead of the <!DOCTYPE...> declaration) are hexadecimal EF, BB, BF, which
is the UTF-8 encoding for character 65279 (hexadecimal FEFF), a zero-width
no-break space (which is used by some software to determine byte order but
is interpreted in ISO-8859-1 as the three characters 'ï', '»', and '¿'). Try
deleting those three bytes with some other software that is *not*
UTF-8-aware and see if the file now validates.
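
If no suitable editor is at hand, the same deletion can be sketched in Python. This is an illustration only: `strip_utf8_bom` is a name made up here, and the `page` bytes are a stand-in for the real file contents, which would be read with `open(path, "rb")`.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 byte-order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):  # the bytes EF BB BF
        return data[len(codecs.BOM_UTF8):]
    return data


page = b"\xef\xbb\xbf<!DOCTYPE HTML>"  # stand-in for the real file contents
print(strip_utf8_bom(page))
```

Data that does not start with those three bytes is returned unchanged, so running it twice does no harm.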

> Before I upgraded my hosting it validated fine.
> I was using utf-8 then, but it looks like my host is overriding this.
> Could this be what is causing the problem?


It could be. The first three bytes may now be messing up the
DOCTYPE declaration as the DOCTYPE is no longer the first thing in the
file when the initial reading of the file assumes that the character
encoding is ISO-8859-1:

ï»¿<!DOCTYPE ....
^^^ <--- try getting rid of these.
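
The difference between the two readings of the same bytes can be shown with a short Python sketch. The byte string is a stand-in for the start of the file, not a copy of the real page.

```python
raw = b"\xef\xbb\xbf<!DOCTYPE"  # stand-in for the first bytes of the file

as_latin1 = raw.decode("iso-8859-1")  # three visible characters, then the DOCTYPE
as_utf8 = raw.decode("utf-8")         # one invisible U+FEFF, then the DOCTYPE

# ascii() is used so the output does not depend on the console's own encoding.
print(ascii(as_latin1))
print(ascii(as_utf8))
```

Read as ISO-8859-1 the text begins with three ordinary characters ahead of the DOCTYPE; read as UTF-8 it begins with a single U+FEFF.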

--
Norman De Forest http://www.chebucto.ns.ca/~af380/Profile.html
af380@chebucto.ns.ca [=||=] (A Speech Friendly Site)
My Usenet 2005 calendar: http://www.chebucto.ns.ca/~af380/Year-2005.txt
For explanation: http://www.chebucto.ns.ca/~af380/Links.Books.html#TandD

Norman L. DeForest

2005-01-22, 4:15 am


[more info, see below]

On Sat, 22 Jan 2005, Norman L. DeForest wrote:

> On Sat, 22 Jan 2005, T.J. wrote:
>
>
> If you use a Unicode/UTF-8-supporting text editor, the problem character
> may be invisible to you. (I just tried looking at a copy of that file
> with EditPad Lite and that character is invisible when it is the
> first thing in the file.) The first three bytes in that file (immediately
> ahead of the <!DOCTYPE...> declaration) are hexadecimal EF, BB, BF, which
> is the UTF-8 encoding for character 65279 (hexadecimal FEFF), a zero-width
> no-break space (which is used by some software to determine byte order but
> is interpreted in ISO-8859-1 as the three characters 'ï', '»', and '¿'). Try
> deleting those three bytes with some other software that is *not*
> UTF-8-aware and see if the file now validates.
>
>
> It could be. The first three bytes may now be messing up the
> DOCTYPE declaration as the DOCTYPE is no longer the first thing in the
> file when the initial reading of the file assumes that the character
> encoding is ISO-8859-1:
>
> ï»¿<!DOCTYPE ....
> ^^^ <--- try getting rid of these.


Testing a copy of your page in my temp directory:

http://validator.w3.org/check?uri=h...2Ftemp%2Fpt.htm

[snip]
: This page is not Valid [15] HTML 4.01 Transitional!
:
: Below are the results of attempting to parse this document with an
: SGML parser.
: 1. Line 1, column 0: character "ï" not allowed in prolog
: »¿<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
: [16] ✉
[snip]

Testing a copy of your page with the first three bytes removed:

http://validator.w3.org/check?uri=h...Ftemp%2Fpt4.htm

[snip]
: This Page Is Valid [13] HTML 4.01 Transitional!
[snip]

--
Norman De Forest http://www.chebucto.ns.ca/~af380/Profile.html
af380@chebucto.ns.ca [=||=] (A Speech Friendly Site)
My Usenet 2005 calendar: http://www.chebucto.ns.ca/~af380/Year-2005.txt
For explanation: http://www.chebucto.ns.ca/~af380/Links.Books.html#TandD

T.J.

2005-01-22, 12:15 pm


"Norman L. DeForest" <af380@chebucto.ns.ca> wrote in message
news:Pine.GSO.3.95.iB1.0.1050122034557.908A-100000@halifax.chebucto.ns.ca...


<snip>

>
> Testing a copy of your page in my temp directory:
>
> http://validator.w3.org/check?uri=h...2Ftemp%2Fpt.htm
>
> [snip]
> : This page is not Valid [15] HTML 4.01 Transitional!
> :
> : Below are the results of attempting to parse this document with an
> : SGML parser.
> : 1. Line 1, column 0: character "ï" not allowed in prolog
> : »¿<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> : [16] ✉
> [snip]
>
> Testing a copy of your page with the first three bytes removed:
>
> http://validator.w3.org/check?uri=h...Ftemp%2Fpt4.htm
>
> [snip]
> : This Page Is Valid [13] HTML 4.01 Transitional!
> [snip]
>


Thank you,
I had seen that was the problem, but where did those first three
bytes come from and why can't I see them?
I use Notepad, and all I had to do was copy and paste the whole code
without making any changes and it validates; reload the existing page
and the problem comes back.


Norman L. DeForest

2005-01-22, 12:15 pm


On Sat, 22 Jan 2005, T.J. wrote:

> "Norman L. DeForest" <af380@chebucto.ns.ca> wrote in message
> news:Pine.GSO.3.95.iB1.0.1050122034557.908A-100000@halifax.chebucto.ns.ca...

[snip]
> Thank you,
> I had seen that was the problem, but where did those first three
> bytes come from and why can't I see them?
> I use Notepad, and all I had to do was copy and paste the whole code
> without making any changes and it validates; reload the existing page
> and the problem comes back.


The three bytes, hexadecimal EF (decimal 239, 'ï'),
hexadecimal BB (decimal 187, '»'),
hexadecimal BF (decimal 191, '¿'),
are the UTF-8 encoding for a single character.
In binary with the character bits underlined:
11101111 10111011 10111111
    ^^^^   ^^^^^^   ^^^^^^
encodes the character hexadecimal FEFF (decimal 65279):
1111111011111111
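
The same bit-picking can be written out as a minimal Python sketch. The function name is invented for illustration, and it only handles the three-byte form described here (it does not check for overlong or otherwise invalid sequences).

```python
def decode_three_byte_utf8(b1, b2, b3):
    """Combine the payload bits of a three-byte UTF-8 sequence into a code point."""
    # A three-byte sequence has the bit pattern 1110xxxx 10xxxxxx 10xxxxxx.
    if (b1 & 0xF0) != 0xE0 or (b2 & 0xC0) != 0x80 or (b3 & 0xC0) != 0x80:
        raise ValueError("not a three-byte UTF-8 sequence")
    # Keep 4 bits from the first byte and 6 bits from each of the other two.
    return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)


code_point = decode_three_byte_utf8(0xEF, 0xBB, 0xBF)
print(hex(code_point), code_point)
```

Python's built-in decoder agrees: `ord(b"\xef\xbb\xbf".decode("utf-8"))` gives the same number.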

That character is listed in the Unicode chart at:
"Arabic Presentation Forms-B" (characters FE70 to FEFF):
http://www.unicode.org/charts/PDF/UFE70.pdf

It is a space character with *zero* width. From what I have read (if I
understood it correctly), it has two functions:

1. Some Arabic characters have more than one form depending on whether
they are at the beginning of a word or in the middle. If you need to
join two words visually without changing their form, you need to insert
a space between them. Since you don't want a visible space, you use
one that is defined as having zero width. Character U+FEFF is the
zero-width no-break space and would be the one used.

2. It is also used in software that processes the UTF-16 encoding of Unicode
(storing each character 0000 to FFFF (0 to 65535) as a two-byte word)
to indicate the byte order in each word. (Intel CPUs store the least
significant byte of a word first and Macs store the most significant
byte first.) Since hex. FFFE (65534) is guaranteed not to be a valid
character, FEFF can be used to flag the byte order used by the software.
This way documents can be shared between Macs and Windows machines and
each can detect when the byte order needs reversing for that processor.
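
Function 2 can be illustrated with a small Python sketch. The function name `utf16_byte_order` is made up for this example, and it only looks at the first two bytes, so it is a rough check rather than a full encoding detector.

```python
import codecs


def utf16_byte_order(data):
    """Guess the UTF-16 byte order from a leading byte-order mark, or return None."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE: least significant byte first
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF: most significant byte first
        return "big-endian"
    return None


# The same two characters (U+FEFF followed by "A") stored in each byte order.
little = "\ufeffA".encode("utf-16-le")
big = "\ufeffA".encode("utf-16-be")
print(little.hex(), utf16_byte_order(little))
print(big.hex(), utf16_byte_order(big))
```

A reader that sees FF FE where it expected FE FF knows every two-byte word needs its bytes swapped.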

I can only guess, but I suspect that some Unicode-aware software of yours
is adding the leading byte-order indicator to any document it fetches and
stores as Unicode or UTF-8 (function 2 above). When you view the text
with Unicode-aware software, the character is displayed *with zero width*
(function 1 above), so it is effectively invisible.
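
As an illustration of how software can add the bytes on saving and hide them again on loading, Python happens to have a codec that behaves this way. This is only an analogy, not a claim about which program touched the page; the sample text is made up.

```python
text = "<!DOCTYPE HTML>"  # made-up sample content

plain = text.encode("utf-8")               # no signature added
with_signature = text.encode("utf-8-sig")  # EF BB BF is prepended on writing

print(plain[:3], with_signature[:3])
# Reading back with the same codec silently drops the signature again,
# so the round trip looks unchanged to the user.
print(with_signature.decode("utf-8-sig") == text)
```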

What software are you using to fetch the page? Would I be correct in
assuming it is Internet Explorer?[1]

If you use something like wget instead, would that fetch the correct
page? If so, IE may be the problem. Try the Windows version at:
"GNU wget - GNU Project - Free Software Foundation (FSF)"
http://www.gnu.org/software/wget/wget.html

The problem could also be a cache somewhere that has stored the original
buggy page. Try clearing your browser cache to see if that's the problem.
Your ISP may also have cached the old page (but fetching it once again
with Lynx shows that it is now correct). I also thought that if you had
your preferred encoding set to UTF-8 in your browser, the server may have
been adding the extra bytes but setting Lynx to specify UTF-8 as the
preferred document encoding didn't get the extra bytes so the problem is
more likely to be with your software. That's where wget can be useful.

To view the file without the zero-width space being invisible, you could
try something that is *not* Unicode-aware. (Sometimes dumber software is
better for a job than alleged "smart" software.) The LIST command in 4DOS
(formerly shareware, now freeware) can view the file as text or as
hexadecimal bytes. 4DOS was Tip Number One on my computer tips
page long before it became freeware:
http://www.chebucto.ns.ca/~af380/Tips.html#Tip001
It's a replacement for COMMAND.COM that rivals Unix shells for
scripting and command-line capabilities.
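
For anyone without such a viewer, a hexadecimal view of the start of a file can be sketched in a few lines of Python. This is not the 4DOS LIST command, just a rough stand-in; the function name `hex_dump` and the sample bytes are made up, and a real file would be read with `open(path, "rb")`.

```python
def hex_dump(data, width=16):
    """Format bytes as offset, hex values, and printable ASCII, one row per line."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02X}" for byte in chunk)
        # Anything outside printable ASCII is shown as a dot.
        text_part = "".join(chr(byte) if 32 <= byte < 127 else "." for byte in chunk)
        lines.append(f"{offset:08X}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return lines


sample = b"\xef\xbb\xbf<!DOCTYPE HTML"  # stand-in for the first bytes of the page
for line in hex_dump(sample):
    print(line)
```

The three leading bytes show up as EF BB BF in the hex column and as three dots in the text column.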

An editor I use when I want to be able to see *everything* in a file
is NTED, my patched version of Tiny EDitor from PC Magazine. Pressing
Alt-V allows me to see carriage-returns, line-feeds, and tabs. Alt-B
toggles the appearance of characters 0 and 255 so they too can be
distinguished from spaces (when I'm using the default PC character set):
http://www.chebucto.ns.ca/~af380/Tips.html#TinyEd
It's only about 3000 bytes in size. It has a limit of 64KB for file size
but larger files can always be cut into pieces. (Look for "They Slice,
They Dice..." on my Computer Tips page.)

To view text in iso-8859-1 or CP1252, I load a CP1252 font into my
VGA card. The font is available on my Tips page in two forms, one with
the control characters displaying as in the standard PC character set and
one that uses miniature diagonal "^A", "^B", "^C", etc. figures instead:
http://www.chebucto.ns.ca/~af380/Tips.html#Tip019
(The control-code version also displays the ISO-8859-1 non-breaking space
character (A0 hexadecimal, 160 decimal) as a miniature diagonal "BL" for
"blank" so it can be distinguished from an ASCII space, character 32
(20 hex.).)

For more on Unicode and UTF-8 encoding, see:

rfc2279 -- UTF-8
http://www.cis.ohio-state.edu/htbin/rfc/rfc2279.html

"Unicode Code Charts (PDF Version)"
http://www.unicode.org/charts/


Norman "who prefers DOS utilities[2] over Windows ones" De Forest

[1] My ISP's newsletter has had an advisory against both Internet Explorer
and Outlook/Outlook Express for some time and recommends using
alternatives to both for security reasons:
http://beacon.chebucto.info/news.shtml
[2] or Unix utilities ported to DOS:
ftp://garbo.uwasa.fi/pc/unix
--
Norman De Forest http://www.chebucto.ns.ca/~af380/Profile.html
af380@chebucto.ns.ca [=||=] (A Speech Friendly Site)
My Usenet 2005 calendar: http://www.chebucto.ns.ca/~af380/Year-2005.txt
For explanation: http://www.chebucto.ns.ca/~af380/Links.Books.html#TandD

T.J.

2005-01-23, 12:25 pm


"Norman L. DeForest" <af380@chebucto.ns.ca> wrote in message
news:Pine.GSO.3.95.iB1.0.1050122074856.15459A-100000@halifax.chebucto.ns.ca...
>
> On Sat, 22 Jan 2005, T.J. wrote:
>
> [snip]
>
> The three bytes, hexadecimal EF (decimal 239, 'ï'),
> hexadecimal BB (decimal 187, '»'),
> hexadecimal BF (decimal 191, '¿'),
> are the UTF-8 encoding for a single character.
> In binary with the character bits underlined:
> 11101111 10111011 10111111
>     ^^^^   ^^^^^^   ^^^^^^
> encodes the character hexadecimal FEFF (decimal 65279):
> 1111111011111111


<snipped>

Thank you for all the info,
Unfortunately it's a bit wasted on me as I don't understand
90% of it.
I appreciate your trying to help though, and the problem is solved
now.
Thanks again.


Eric Jarvis

2005-01-23, 7:20 pm

T.J. no1@home.invalid wrote:
>
> "Norman L. DeForest" <af380@chebucto.ns.ca> wrote in message
> news:Pine.GSO.3.95.iB1.0.1050122074856.15459A-100000@halifax.chebucto.ns.ca...
>
>
> Thank you for all the info,
> Unfortunately it's a bit wasted on me as I don't understand
> 90% of it.
>


I got most of it and learned a lot, so thanks from me too, Norman. I
learned a few things that will come in handy.

--
eric
www.ericjarvis.co.uk
"live fast, die only if strictly necessary"