







Author Any tool to check against missing semicolons in entity and character references?
Jukka K. Korpela

2007-04-01, 6:21 pm

According to classic HTML (nominally, SGML-based) rules, a semicolon
(reference close) is optional in entity and character references, when the
reference is not immediately followed by a name character. Browsers haven't
had problems with this.

However, IE 7 is absurdly picky: it refuses to recognize
a) an entity reference referring to a character outside ISO Latin 1
b) a hexadecimal character reference
whenever it does not contain the trailing semicolon. Thus,
&rarr
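and a hexadecimal character reference lacking the semicolon
are displayed literally (whereas references rendered as é and →, likewise
written without the semicolon, are OK).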
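
As a rough illustration of cases a) and b) above, here is a Python sketch that flags unterminated references of those two kinds. The function name `ie7_risky_references` and the regular expression are assumptions made up for this example; they are not taken from IE 7 or from any existing checker. "Outside ISO Latin 1" is approximated as a code point above 0xFF, looked up in the HTML 4.01 entity table from Python's standard `html.entities` module.

```python
import re
from html.entities import name2codepoint  # HTML 4.01 entity name -> code point

# A reference that is NOT followed by ";" (or by another letter/digit).
# Hypothetical pattern for this sketch: hex, then decimal, then named.
UNTERMINATED = re.compile(
    r"&(?:#[xX](?P<hex>[0-9a-fA-F]+)"
    r"|#(?P<dec>[0-9]+)"
    r"|(?P<name>[A-Za-z][A-Za-z0-9]*))"
    r"(?![A-Za-z0-9;])"
)

def ie7_risky_references(text):
    """Return (reference, reason) pairs for the two problem cases."""
    found = []
    for m in UNTERMINATED.finditer(text):
        if m.group("hex"):
            # Case b): hexadecimal character reference without ";"
            found.append((m.group(0), "hexadecimal character reference"))
        elif m.group("name"):
            # Case a): known entity whose character lies beyond Latin 1
            code_point = name2codepoint.get(m.group("name"))
            if code_point is not None and code_point > 0xFF:
                found.append((m.group(0), "entity outside ISO Latin 1"))
        # Unterminated decimal references are not flagged here.
    return found
```

For the input `"&rarr and &#x2192 but &eacute and &#8594"`, this sketch reports `&rarr` and `&#x2192` and skips the other two, which matches the behaviour described above.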

Is there any practically useful checker that issues a warning about such
constructs, or about any entity or character reference not terminated by a
semicolon? As far as I can see, no.

Henri Sivonen's checker http://hsivonen.iki.fi/validator/ is closest to what
I mean, but not very close: it detects any reference without semicolon but
it only reports the first problem and then terminates, which isn't very
practical.
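
To make concrete what "report every problem, not just the first" could look like, here is a minimal Python sketch. The name `find_unterminated` and the pattern are hypothetical, chosen for this example only, and it does no real HTML parsing: a bare "&" inside a URL query string (for example `?a=1&b=2`) would be reported too.

```python
import re

# Hypothetical pattern: a named, decimal or hexadecimal reference that is
# not followed by ";" (nor by a further letter or digit).
REFERENCE = re.compile(
    r"&(?:#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*)(?![A-Za-z0-9;])"
)

def find_unterminated(text):
    """Return a list of (line, column, reference), all 1-based."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in REFERENCE.finditer(line):
            problems.append((lineno, m.start() + 1, m.group(0)))
    return problems

if __name__ == "__main__":
    import sys
    # Reads the document from standard input and lists every finding.
    for lineno, col, ref in find_unterminated(sys.stdin.read()):
        print(f"line {lineno}, column {col}: {ref} lacks a semicolon")
```

Columns are counted in characters of the decoded text, so the input encoding has to be known before the text is passed in.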

Of course, switching to XHTML and using a validator would solve this
problem - and create many others.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Chris Morris

2007-04-01, 6:21 pm

"Jukka K. Korpela" <jkorpela@cs.tut.fi> writes:
> Is there any practically useful checker that issues a warning about
> such constructs, or about any entity or character reference not
> terminated by a semicolon? As far as I can see, no.


Simple enough to write one, though.
http://compsoc.dur.ac.uk/~cim/EntityChecker.cgi

Currently it checks by file upload only, with a 50 KB limit. It
wouldn't be difficult to add a "check by URL" feature, or to convert
it to a command-line tool, if either would be more useful.

It gave the expected results on a couple of test documents that I fed
it, though it currently doesn't support input encodings other than
UTF-8 (non-multibyte encodings aren't likely to do anything worse than
a wrong column number, though)

--
Chris

Jukka K. Korpela

2007-04-01, 6:21 pm

Scripsit Chris Morris:

> "Jukka K. Korpela" <jkorpela@cs.tut.fi> writes:
>
> Simple enough to write one, though.


I also realized, after posting my message, that the check is fairly easy.
The problem, I guess, is how to get it included in popular checkers. And
actually it's easier to _fix_ the problematic constructs than just to
check for them: it can basically be done with the following Perl one-liner:

while (<>) { s/(&#?[0-9a-zA-Z]+);?/$1;/g; print; }

(This modifies some malformed constructs as well, but the above should cover
all character references and all entity references defined in HTML 4.01.)
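
For readers who prefer Python, the same substitution can be sketched as a small function. This is a straight port of the one-liner under the same assumptions, with the hypothetical name `add_semicolons`; it likewise appends a semicolon after any "&" followed by letters or digits, so a stray ampersand as in "AT&T" gets changed as well.

```python
import re

# Same idea as the one-liner: capture "&" or "&#" plus letters/digits,
# swallow an optional ";" and write the reference back with exactly one ";".
REFERENCE = re.compile(r"(&#?[0-9a-zA-Z]+);?")

def add_semicolons(text):
    """Terminate every entity or character reference with a semicolon."""
    return REFERENCE.sub(r"\1;", text)

if __name__ == "__main__":
    import sys
    # Filter usage: read from standard input, write the fixed text out.
    sys.stdout.write(add_semicolons(sys.stdin.read()))
```

References that already end in a semicolon come out unchanged, so running the filter twice gives the same result as running it once.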

> http://compsoc.dur.ac.uk/~cim/EntityChecker.cgi


It doesn't seem to check for hex character references like &#x2013 being
terminated by a semicolon.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
