The goal of this project was to implement
a Text-Only converter for Web pages. The post-processor takes as input an HTML document
with possibly rich formatting, such as tables used for layout and small IMG elements used
as bullets. It converts this into an output document in proper HTML format (not ASCII text)
that contains only text. This makes pages more accessible and faster to load.
This led to a C++ class library (framework) representing HTML documents and tags as
various objects. A lexical analyser and parser build a document's structure in memory as
a linked C++ object tree. The hierarchical object representation allows a
"smarter" text-only conversion than existing text-only scripts, which are mostly
written in Perl, can achieve. For example, graphical IMG bullets are recognised and output
as UL/LI, and IMG rulers are translated into HR.
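
The tree-based conversion described above can be illustrated with a minimal sketch. It is
not the project's actual class library: the Node type, the helper names (makeText, makeImg,
makeElem, simplify, render) and the file-name heuristic ("bullet" or "rule" in the IMG
source) are all assumptions made for illustration. Attributes, BR handling and ALT text
are left out to keep the example short.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type: an element (tag + children) or a text run (empty tag).
struct Node {
    std::string tag;             // e.g. "BODY", "IMG"; empty for text nodes
    std::string src;             // image source, only meaningful for IMG
    std::string text;            // character data, only meaningful for text nodes
    std::vector<Node> children;  // child nodes, in document order
};

Node makeText(std::string s) { Node n; n.text = std::move(s); return n; }
Node makeImg(std::string src) { Node n; n.tag = "IMG"; n.src = std::move(src); return n; }
Node makeElem(std::string tag, std::vector<Node> kids) {
    Node n; n.tag = std::move(tag); n.children = std::move(kids); return n;
}

bool contains(const std::string& s, const std::string& needle) {
    return s.find(needle) != std::string::npos;
}

// Returns a text-only copy of the tree rooted at `in`.
// Assumed heuristic: an IMG whose source contains "bullet" starts a list item,
// one whose source contains "rule" is a ruler, and any other IMG is dropped.
Node simplify(const Node& in) {
    Node out;
    out.tag = in.tag;
    out.text = in.text;
    bool inList = false;  // true while the last child of `out` is an open UL
    for (std::size_t i = 0; i < in.children.size(); ++i) {
        const Node& c = in.children[i];
        if (c.tag == "IMG" && contains(c.src, "bullet")) {
            if (!inList) { out.children.push_back(makeElem("UL", {})); inList = true; }
            out.children.back().children.push_back(makeElem("LI", {}));
        } else if (c.tag == "IMG" && contains(c.src, "rule")) {
            inList = false;
            out.children.push_back(makeElem("HR", {}));
        } else if (c.tag == "IMG") {
            inList = false;  // decorative or content image: omitted in this sketch
        } else if (inList && c.tag.empty()) {
            // Text directly after a bullet image becomes the content of the current LI.
            out.children.back().children.back().children.push_back(c);
        } else {
            inList = false;
            out.children.push_back(simplify(c));
        }
    }
    return out;
}

// Serialises the tree back to HTML; HR is written without a closing tag.
std::string render(const Node& n) {
    if (n.tag.empty()) return n.text;
    if (n.tag == "HR") return "<HR>";
    std::string s = "<" + n.tag + ">";
    for (const Node& c : n.children) s += render(c);
    return s + "</" + n.tag + ">";
}
```

Calling render(simplify(root)) on a parsed tree would then yield the text-only HTML. The
point of the sketch is that the decision to open a UL depends on sibling context in the
tree, which is awkward to express with line-by-line pattern substitution.
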
For related references, see the ALTifier
project. Some ideas expressed here have been reused there.