This issue (parsing HTML with regexps) has been a heated topic of debate
for a long time amongst HTML authors and programmers. I'll try to
summarise what I know about it here.
Many programmers who try to parse HTML do so on the assumption that the
grammar for HTML tags is regular (and hence can be expressed through
regular expressions). The unfortunate fact is that it's not, so most
programs which attempt to parse HTML (particularly browsers) do a very
bad job of it.
To show you the result, here are some examples of tags which are totally
valid according to the specs, but because so many browsers parse them
incorrectly, their use is discouraged (this is a classic case of
majority rules even when the majority is wrong):
Example 1.
<IMG -- this is an ALT="Hello there" comment -- SRC="/images/blah.gif">
* according to the SGML definition, everything between the IMG
and the SRC is a comment, including the ALT= part.
* most browsers however will either get confused on the first --
and skip the rest of the tag, or try to parse the comment as
attributes (and succeed with the ALT attribute).
Example 2.
<IMG ALT=">Hello there!<" SRC="/images/blah.gif">
* according to the SGML definition, this is a single tag with an
ALT attribute with value ">Hello there!<".
* most browsers will parse this as a tag followed by some free
text followed by another tag, i.e.:
1. <IMG ALT=">
2. Hello there!
3. <" SRC="/images/blah.gif">
Of course, neither of the tags is valid HTML, so only the
text will appear.
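The mis-parse in Example 2 is easy to reproduce. Here is a minimal sketch in Python (not from the original discussion; the names NAIVE_TAG and QUOTE_AWARE_TAG are made up for illustration). It compares a naive "anything up to the next >" pattern with one that steps over quoted attribute values. The second pattern only copes with this one case; it is not a real HTML parser.

```python
import re

html = '<IMG ALT=">Hello there!<" SRC="/images/blah.gif">'

# Naive: a tag is "<", then anything that is not ">", then ">".
NAIVE_TAG = re.compile(r'<[^>]*>')

# Quote-aware: inside a tag, a quoted string is consumed as one unit,
# so a ">" inside quotes does not end the tag.
QUOTE_AWARE_TAG = re.compile(r'''<(?:"[^"]*"|'[^']*'|[^>"'])*>''')

naive_tags = NAIVE_TAG.findall(html)
aware_tags = QUOTE_AWARE_TAG.findall(html)

print(naive_tags)  # two bogus "tags", with "Hello there!" left between them
print(aware_tags)  # the whole IMG tag as a single match
```

The naive pattern produces exactly the three-way split listed above, while the quote-aware one returns the tag in one piece.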
Example 3.
4 < 5 and 6 > 5.
* It's a little-known fact (except in SGML circles) that in SGML
a < must be immediately followed by a tag name (or by "!", "/"
or "?") to be treated as markup; otherwise it's treated as an
ordinary character. So according to the HTML specs, this
text has no tags in it.
* Of course, most browsers will try to parse < 5 and 6 > as a
tag (which isn't valid HTML).
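The rule in Example 3 can be sketched as a small check. This is only an illustration in Python under a simplifying assumption: a < counts as the start of markup when the next character is an ASCII letter, "!", "/" or "?". The function name find_markup_starts is hypothetical, and real SGML delimiter recognition has more cases than this.

```python
def find_markup_starts(text):
    """Return the index of every "<" that could open markup."""
    starts = []
    for i, ch in enumerate(text):
        if ch != '<':
            continue
        nxt = text[i + 1:i + 2]  # empty string if "<" is the last character
        if (nxt.isalpha() and nxt.isascii()) or nxt in ('!', '/', '?'):
            starts.append(i)
    return starts

plain = find_markup_starts('4 < 5 and 6 > 5.')
mixed = find_markup_starts('<P>4 < 5</P>')
print(plain)
print(mixed)
```

For the text of Example 3 the list comes back empty, since the only < is followed by a space. In the second string only the < of <P> and of </P> are reported.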
Example 4.
<!-- <This is the first comment!> -- --> Here is the second! <-->
* I'll let you work out for yourself what this means in SGML,
and the various ways browsers could try to interpret it.
I doubt very much that a browser could parse this correctly
without using an SGML parser.
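For anyone who wants to check their answer to Example 4, here is a rough sketch in Python. It assumes the SGML rule that a comment declaration is "<!", then any number of "-- ... --" comments (optionally separated by whitespace), then ">". The function name split_comment_declaration is made up for this example, and it handles nothing beyond comment declarations.

```python
def split_comment_declaration(text):
    """Return (comments, rest) for text starting with a comment declaration."""
    if not text.startswith('<!'):
        raise ValueError('not a markup declaration')
    comments = []
    i = 2  # position just after "<!"
    while i < len(text):
        if text[i].isspace():
            i += 1  # whitespace is allowed between comments
        elif text.startswith('--', i):
            end = text.find('--', i + 2)  # the next "--" closes the comment
            if end == -1:
                raise ValueError('unterminated comment')
            comments.append(text[i + 2:end])
            i = end + 2
        elif text[i] == '>':
            return comments, text[i + 1:]  # ">" ends the declaration
        else:
            raise ValueError('unexpected character in declaration')
    raise ValueError('declaration never closed')

example = '<!-- <This is the first comment!> -- --> Here is the second! <-->'
comments, rest = split_comment_declaration(example)
print(comments)
print(repr(rest))
```

Under that reading the whole line is one declaration holding two comments, and no text is left over for display.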
Worse than that, it is not even possible to write a single piece of
parser code that will accept both all standard-compliant HTML and all
"accepted practice" HTML unambiguously.
|
| > Well?
|
Well indeed. It seems you can either:
* use an SGML parser which can do far more than you really expect
from something to parse HTML, and reject lots of pages because
they use common HTML techniques that aren't valid HTML (e.g. <B>
inside preformatted text, or <P> inside lists and tables)
* write your own "HTML parser" and make the same mistakes all the
other "HTML parsers" make (but that's OK because you're in the
majority, right?)
If after reading all this, you've decided you couldn't give a rat's
backside about this SGML and specs crap and just want a regex that
"works the way you expect it to", then the only thing that I can
recommend is Perl 5's extended regular expressions. They support
minimal matching, which avoids the particular problem you've been
having. This might work in Perl 5 (no guarantees, try it and see):
($alt) = /<img .*?alt="(.*?)".*?>/i;
which is similar to the (possibly) more familiar
($alt) = /<img .*alt="(.*)".*>/i;
except that the .*? wildcards will match the shortest strings which still
allow the text to be accepted.
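To see the difference in practice, here is a small sketch using Python's re module, which uses the same .*? syntax for minimal matching. The sample string and variable names are invented for illustration, and it assumes the problem in question is the greedy pattern running on past the first tag.

```python
import re

html = '<img src="a.gif" alt="first"> text <img src="b.gif" alt="second">'

# Greedy: each .* grabs as much as it can, so the match runs on
# to the last alt="..." in the string.
greedy = re.search(r'<img .*alt="(.*)".*>', html, re.IGNORECASE)

# Minimal: each .*? grabs as little as it can, so the match stops
# at the first alt="..." after "<img ".
minimal = re.search(r'<img .*?alt="(.*?)".*?>', html, re.IGNORECASE)

greedy_alt = greedy.group(1)
minimal_alt = minimal.group(1)
print(greedy_alt)   # second
print(minimal_alt)  # first
```

Neither pattern copes with the tags in Examples 1 to 4, of course; minimal matching only stops the match from swallowing more than one tag.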
| > jimmy
|
| > ps Anyone know the location of an html parser? C++ preferred, but C
| > accepted. There's probably a Java one around, no?
The most well-known SGML parser author is James Clark, who wrote the
free SGMLS parser in C. His newer parser is called SP, and is a
re-write in C++ with lots more features added to it.
The SP home page:
http://www.jclark.com/sp/
|
| how about getting the html 2/3 bnf specs and translating it into lex?
| then you could produce a C/C++ lexer/parser. have a look around im
| sure someone else has done it... a starting point would be the actual
| html specs themselves.
The HTML grammar is specified as an SGML DTD, not as a BNF. So what you
need to do is load an SGML parser with the DTD of the version of the
HTML specs you want to accept, and you'll be able to parse
HTML-compliant documents. In fact you can have the same parser accept
documents from multiple DTDs if the documents have a line like this at
the start:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
Unfortunately I've seen quite a few HTML editors which add lines like
this at the start, yet still generate non-SGML documents *sigh*.
PS. If I've told you far more than you ever wanted to know
about SGML, too bad :) If you want to know more about SGML and how
it relates to HTML, here are a couple of really good starting
points:
SGML/HTML Resource Centre:
http://www.geocities.com/Athens/2694/sgml.html
An intro which argues why people should write SGML/HTML
compliant documents instead of ones that "just look good in
Netscape", and follows it up with a good list of resources for
SGML and HTML, particularly parsers.
Content Models in the HTML 2.0, HTML 3.0 and HTML 3.2 DTDs
http://www.ozemail.com.au/~dkgsoft/html/
Explains SGML DTD basics and shows how they apply to HTML by
walking through the various HTML DTD trees in very readable
terms (considering how complex SGML can get).
This ends the lesson :)
Cheers,
--------------------------------+-----------------------------------
Dennis Clark | Email: dennis@nospam.ilanet.slnsw.gov.au
Programmer, ILANET | Tel: +61 2 230 1649
State Library of NSW | Fax: +61 2 232 8701
"What a tangled WWW we weave!"