HTML4TeX,
a stupid project

Ever dreamed of interpreting HTML with TeX? It is theoretically possible, and I am trying to give a practical proof of it. The idea is not original: some people have written scripts that roughly convert HTML into LaTeX, and others have written XML parsers in TeX (xmltex).

Downloads

The following files used to change almost every day (this is no longer true). One can download them for "beta testing" or for typesetting simple documents. In any case, please read this entire page before using them. I do not guarantee anything about them, although they are supposed to be safe (I use them myself!).

Download	State	Description
html4tex.tex		general input file for plainTeX and LaTeX (fundamental but unstable).
htmlgraf.tex		paragraph processor (fundamental but unstable).
htmlinux.tex		system-dependent commands, checked on a Linux-based system only (optional but rather stable).
TeXscape		a poor TeXnician's HTML browser. It is still limited to local files, and I still love it. Support for the GNU `wget` command is experimental. Put it in some bin directory and type `texscape -h` for help.

Note that frames are not supported, that many tags are inactive, that tables are not fully supported, and that colors are only poorly supported (but they are supported, so don't try white text on a white background). Errors can occur even with well-formed HTML. On the other hand, one can check one's HTML syntax with these programs: the test is quite strict.

Nevertheless, I have recently added a small debugging facility for common errors (malformed comments and the like). Colors in htmlinux are also better supported: every Ghostscript color name is now recognized (write ``Black'', not ``black'').

General

There are three TeX files, which should work with both plain TeX and LaTeX. Two handle general HTML typesetting (html4tex.tex and htmlgraf.tex); the third, htmlinux.tex, handles special effects (colors, image inclusion, sounds, animations). This last file is system dependent and was written for a decent Linux distribution. It may work on other platforms; adapt its code to your own needs.

Typesetting

HTML fragments are typeset with the help of the input file html4tex.tex. It contains many fairly complex control sequences for processing tags and attributes (the kernel), the definitions of tags and attributes, and the common entities (ISO-8859-1 oriented). While it is being read, it also inputs the paragraph processor htmlgraf.tex.

Special effects

The material here relates to the optional input file htmlinux.tex. It has been tested on a Red Hat Linux 5.2 distribution and on AIX. The platform must understand shell scripts (sh, bash), and the programs named in quotes below must be available.

To see the effect of this file on the DVI output, typical use is « xdvi -allowshell foo.dvi » or « dvips foo.dvi -o; gv foo.ps ». Shell scripts embedded in the DVI output will then be executed. Be aware of the risk of such processing! To check that the output is safe, inspect the DVI file and search for the string « psfile="`..." », which announces a shell script.

Static images: dimensions must be given explicitly; no scaling is done. This relies on the « Netpbm package ».
Moving images: GIF animations are treated as static images, but MPEG animations are played with the famous « xanim » program as external objects (displayed outside the xdvi window).
Sounds: the Internet Explorer tag <bgsound> is fully supported (with loop) via the « play » and « mpeg123 » commands, as an internal event.
Colors: colors are displayed in the PostScript file.
True linking: this is quite unstable. Xdvi understands some hyper-reference tags and may process them internally and externally (!).
HTTP URLs: downloading remote material with « wget » is experimental.

Common errors

Attributes must be written in lowercase to be interpreted, and values must be surrounded by quotes:

<foo attribute="value">content</foo>

It is better to write tags in lowercase too. When an attribute is defined several times, only the first definition is taken into account. Beware: in some situations, malformed attributes may lead to catastrophic results. For instance, consider <td width=100 align="left">. Since 100 is not quoted, the value of width will be "left", which will cause an error.

Oddities

The <tex> tag

Many HTML tags have not been implemented, but a new one has been introduced: <tex>...</tex>, which allows TeX text within HTML documents (please ask W3C to do the same!). This tag is equivalent to <script language="TeX">...</script>. For now, TeX scripts are the only ones supported. Note that the syntax for HTML text inside a TeX document has changed slightly: the HTML text must be enclosed in opening and closing <html> tags, as in \HTML<html>...</html>, or \HTML<html>...</html>\endHTML, or even \beginHTML<html>...</html>\endHTML.
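
As an illustration, the three invocation forms and the <tex> tag could be combined as follows. This is an untested sketch: it only assumes that html4tex.tex is on TeX's input path and behaves as described above, and the HTML content is made up for the example.

```tex
\input html4tex

% Form 1: \HTML followed by a complete <html>...</html> element.
\HTML<html>
<p>An identity typeset by TeX itself: <tex>$e^{i\pi}+1=0$</tex>.</p>
</html>

% Form 2: the same, closed by an explicit \endHTML.
\HTML<html>
<p>A second fragment.</p>
</html>\endHTML

% Form 3: \beginHTML ... \endHTML.
\beginHTML<html>
<p>A third fragment.</p>
</html>\endHTML

\bye
```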

The <!ENTITY> experiment

Extensibility via

<!ENTITY entity
"replacement text">

is experimental. XML compliance is a strange question here, because the XML rules for this declaration are unclear and incompatible with the SGML ones. What is done is the following: the replacement text may contain anything (elements or entities), but literal " characters are absolutely forbidden. To give attributes to embedded elements, proceed this way:

<!ENTITY author "<div align=&quot;right&quot;>
                 <b>Anthony Phan &amp;amp; Co</b><br>
                 D&amp;eacute;partement de Math&amp;eacute;matiques, SP2MI,<br>
                 Boulevard Marie et Pierre Curie, T&amp;eacute;l&amp;eacute;port 2,<br>
                 BP 30179, F-86962 Futuroscope-Chasseneuil cedex</div>">

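Ampersands are also of a complex nature, as one can see. What happens is that literal &amp; and &quot; are replaced by & and " while the entity's definition is processed. Thus &amp;eacute; is stored as &eacute; and only becomes é when the entity is used, and &amp;amp; is stored as &amp;, which finally yields &.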

To be improved

Things that may be improved: <table>...</table>, <font>...</font>, <img>, paragraph processing... The font selection scheme and the paragraph processor have been rewritten. They are now much more powerful and easier to change. However, the general font selection scheme still has to be improved to become more versatile.

Typical use

Here is a typical plainTeX file that uses this translator.

%
% Just define some settings (page format)
%

\magnification=\magstep1
\voffset=0.12 true cm
\hoffset=0.31 true cm
\vsize=24.2 true cm
\hsize=15.3 true cm
\parindent=1.5em

%
% Optional. If \HTMLproofmode is set,
% one gets an enormous list of messages
% (with error reports and debugging info)
%

\let\HTMLproofmode=!

%
% Input files
%

\input html4tex

% \input htmlinux
%
% \noHTMLcolors (or) \HTMLcolors
%
% \noHTMLmultimedia (or) \HTMLmultimedia
%
% for htmlinux, I hope you know what you're doing...
%
% Include an HTML file from the current directory;
% otherwise, first write something like:
%
% \HTMLbase{/home/user/public_html/}
%

\includeHTML{main.html}

%
% One can write HTML directly
%

\HTML
<html>

(...HTML stuff...)

</html>

\bye

Main difficulties

Typographical rendering

(Very unstable.) With TeX, one knows what \par, \par\noindent, \medbreak and so on mean. HTML's typographical conventions are less well defined. What is the exact meaning of <br>? Is it similar to \par\noindent? And is <p> similar to some kind of break? I have chosen the following rules:

plainTeX                    HTML
\par\noindent        <--->  <br>
\par\noindent        <--->  <div>
\par                 <--->  no equivalent
\smallbreak\noindent <--->  <p>
                            and common breaks
                            between vertical lists

By vertical lists I mean lists and similar block-level constructs. These typographical conventions, and many others, are for now hard-wired into the packages.
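
To see what these rules look like on the TeX side, here is a small stand-alone plain TeX sketch. The macro names \likebr, \likediv and \likep are hypothetical and chosen only for this illustration; they are not the definitions used inside html4tex.tex, which are certainly more involved.

```tex
% Stand-alone plain TeX sketch; \likebr, \likediv and \likep are
% hypothetical names, not macros defined by html4tex.tex.
\def\likebr{\par\noindent}          % <br>: new line, no indentation
\def\likediv{\par\noindent}         % <div>: treated like <br>
\def\likep{\smallbreak\noindent}    % <p>: small vertical break, no indentation

\noindent First line of text.
\likebr Second line, as after a {\tt <br>}.
\likep A new block, as after a {\tt <p>}.
\bye
```

Running this through plain TeX shows the only visible difference between the two rules: the <p>-like break adds a small vertical skip, while the <br>-like break does not.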

Tables

Well, tables are hard stuff in both TeX and HTML, so translation is difficult. Worst of all, HTML allows vertical material to be embedded in table cells...

Unsupported

FRAMES! and MUCH MORE! (look for the word <unsupported> in html4tex.tex and also </unsupported>, <unsupp@rted>, <inactive>, </inactive>, <in@ctive>)

True linking has been under development since I discovered the existence of

\special{html:<a name="foo">}foo\special{html:</a>}

and other such special control sequences...
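
For readers who want to experiment with such specials directly, here is a minimal stand-alone plain TeX sketch. It follows the HyperTeX convention of writing HTML anchor tags into \special commands. The helper names \hashchar, \anchorname and \anchorref are hypothetical and not part of html4tex.tex, and whether the links are actually followed depends on the DVI viewer or driver.

```tex
% Stand-alone plain TeX sketch. \hashchar, \anchorname and \anchorref are
% hypothetical helper names, not part of html4tex.tex.

% A '#' of category "other", so that it is written into the \special once
% instead of being doubled as a macro parameter character.
{\catcode`\#=12 \gdef\hashchar{#}}

% Mark the text #2 as the target named #1.
\def\anchorname#1#2{\leavevmode
  \special{html:<a name="#1">}#2\special{html:</a>}}
% Make the text #2 a link to the target named #1 in the same document.
\def\anchorref#1#2{\leavevmode
  \special{html:<a href="\hashchar#1">}#2\special{html:</a>}}

\anchorname{top}{Start of the document.}
\vfill\eject
Back to \anchorref{top}{the start}.
\bye
```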

Copyright: the HTML texts and graphics are free of any copyright; they are copyleft. The TeX programs are also copyleft, but one can send a postcard. The MetaPost programs simply have a feel-free-to-send-me-a-postcard licence. The MetaFont programs have the standard LaTeX licence.

Anthony Phan,
Département de Mathématiques, SP2MI,
Boulevard Marie et Pierre Curie, Téléport 2,
BP 30179, F-86962 Futuroscope-Chasseneuil cedex
