HTML4TeX, a stupid project
Ever dream of interpreting HTML with TeX? It's theoretically
possible, and I am trying to give a practical proof of that.
This is not original: some people have written scripts
that roughly convert HTML into LaTeX, and others have written
XML parsers in TeX (xmltex).
Downloads
The following files used to change almost every day (this is no
longer true). One can download them for "beta testing" or for
typesetting simple documents. In any case, please read this entire
page before using these files. I do not guarantee
anything about them. They are supposed to be safe
(I'm using them!).
Download     | State | Description
html4tex.tex |       | General input file for plain TeX and LaTeX (fundamental but unstable).
htmlgraf.tex |       | Paragraph processor (fundamental but unstable).
htmlinux.tex |       | System dependent commands, checked on a Linux based system only (optional but rather stable).
TeXscape     |       | Poor TeXnician's HTML browser. It is still limited to local files and I still love it. Use of the GNU wget command is experimental... Put it in some bin directory and type texscape -h for help...
Note that frames are not supported, that many
tags are inactive, that tables are not fully supported, and
that colors are very poorly supported (but they are supported,
so don't try white text over a white background). Errors can
happen even with nice HTML code. However, one can check one's
HTML syntax with these programs: the test is quite severe.
Nevertheless, I've recently included a little debugging
facility for common errors (wrong comments and such). Also,
colors in htmlinux are now better supported: every
ghostscript color name is recognized (write "Black"
and not "black", ...).
General
There are 3 TeX files (which should work with both plain TeX and
LaTeX): two for general HTML typesetting (html4tex.tex, htmlgraf.tex)
and one for special effects (htmlinux.tex: colors, image inclusion,
sounds, animations). This last one is system dependent and written
for a decent Linux distribution. It may work on other platforms.
Adapt its code to your own needs.
Typesetting
Typesetting of HTML fragments is done with the help of
the input file html4tex.tex. It contains a lot
of quite complex control sequences for processing tags
and attributes (the kernel), and the definitions of tags,
attributes and common entities (iso-8859-1 oriented).
The paragraph processor htmlgraf.tex is input while
html4tex.tex is being read.
Special effects
The material here relates to the optional input file
htmlinux.tex. It has been tested on a Red Hat Linux 5.2
distribution, and also on AIX. The platform must understand
shell scripts (sh, bash), and the programs listed below within
quotes should be available.
To see the effect of this input file on the dvi output,
typical use is
« xdvi -allowshell foo.dvi »
or
« dvips foo.dvi -o; gv foo.ps ».
Shell scripts embedded in the dvi output will then be executed.
Be aware of the risk of such processing! To check
the safety of the output, one has to open the dvi file and search
for the string
« psfile="`..." »
which announces a shell script.
- Static images: dimensions must be given explicitly, no
  scaling is done, and it relies on the « Netpbm package ».
- Moving images: gif animations are treated as static images,
  but mpeg animations are processed with the famous « xanim »
  program as external objects (displayed outside the xdvi window).
- Sounds: the Internet Explorer tag <bgsound> is fully supported
  (with loop) via the « play » and « mpeg123 » commands as an
  internal event.
- Colors: colors are displayed in the PostScript file.
- True linking: this is quite unstable. Xdvi understands some
  hyper-reference tags and may process them internally and
  externally (!)...
- http-URL: « wget » is used experimentally for downloading
  remote material.
Common errors
Attributes must be written in lowercase to be
interpreted, and values must be surrounded by quotes:
<foo attribute="value">content</foo>
It's better to write tags in lowercase too. When an attribute is
defined several times, only the first definition is taken into account.
Beware: in some situations, badly formed attributes may
lead to catastrophic results. For instance, consider
<td width=100 align="left">.
Here the value of width will be "left", which will cause an error.
Oddities
The <tex> tag
Many HTML tags have not been implemented, but a new one
has been introduced: <tex>...</tex>, which allows TeX
text within HTML documents (please ask W3C to do the same!).
This tag is equivalent to
<script language="TeX">...</script>.
For now, TeX scripts are the only ones supported.
The syntax for HTML text inside TeX is now a little different.
One must surround the HTML text with opening and closing HTML tags:
\HTML<html>...</html>,
or \HTML<html>...</html>\endHTML,
or even \beginHTML<html>...</html>\endHTML.
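As a minimal sketch of the last form, assuming html4tex.tex is on TeX's input path and behaves as described on this page (the HTML content and the formula are only an illustration, and have not been checked against the current files):

```tex
% Sketch only: assumes html4tex.tex is installed and loads cleanly.
\input html4tex

% HTML text delimited on the TeX side by \beginHTML ... \endHTML.
\beginHTML<html>
<p>A formula typeset by TeX itself:
<tex>$e^{i\pi}+1=0$</tex></p>
</html>\endHTML

\bye
```

The <tex> element is used here only for a short piece of inline math; longer TeX fragments should work the same way if the tag behaves as announced.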
The <!ENTITY> experiment
Extensibility via <!ENTITY entity "replacement text">
is experimental.
XML compliance is a thorny question, because the XML
rules for that declaration are not clear: they are
incompatible with the SGML ones. What is done is the following:
the replacement text may contain anything (elements or entities),
but double quotes (") are absolutely forbidden.
If one has to give attributes to embedded elements, one has
to do it this way:
<!ENTITY author "<div align=&quot;right&quot;>
<b>Anthony Phan &amp; Co</b><br>
D&eacute;partement de Math&eacute;matiques, SP2MI,<br>
Boulevard Marie et Pierre Curie, T&eacute;l&eacute;port 2,<br>
BP 30179, F-86962 Futuroscope-Chasseneuil cedex</div>">
Ampersands are also tricky, as one can see. In fact,
the literal strings &amp; and &quot; are replaced by & and "
while the entity's definition is processed...
To be improved
Things that may be improved:
<table>...</table>, <font>...</font>, <img>, paragraph processing...
We've rewritten the font selection scheme and the paragraph processor.
They are now much more powerful and easier to change. However,
the general font selection scheme has to be improved in order
to be more versatile.
Typical use
Here is a typical plainTeX file that uses this translator.
%
% Just define some settings (page format)
%
\magnification=\magstep1
\voffset=0.12 true cm
\hoffset=0.31 true cm
\vsize=24.2 true cm
\hsize=15.3 true cm
\parindent=1.5em
%
% Optional. If \HTMLproofmode is set,
% one gets an enormous list of messages
% (with error reports and debugging info)
%
\let\HTMLproofmode=!
%
% Input files
%
\input html4tex
% \input htmlinux
%
% \noHTMLcolors (or) \HTMLcolors
%
% \noHTMLmultimedia (or) \HTMLmultimedia
%
% for htmlinux, I hope you know what you're doing...
%
% include a HTML file from the current directory,
% else write something like:
%
% \HTMLbase{/home/user/public_html/}
%
\includeHTML{main.html}
%
% One can write HTML directly
%
\HTML
<html>
(...HTML stuff...)
</html>
\bye
Main difficulties
Typographical rendering
(very unstable:) Dealing with TeX, one knows what
\par, \par\noindent, \medbreak and so on mean. With HTML,
typographical conventions are not as well defined. What is
the exact meaning of <br>? Is it similar
to \par\noindent? And is <p> similar to some break? I've chosen
the following rules:
plainTeX                    HTML
\par\noindent         <---> <br>
\par\noindent         <---> <div>
\par                  <---> no equivalent
\smallbreak\noindent  <---> <p> and common breaks
                            between vertical lists
By vertical lists I mean, of course, lists and the like. These
typographical conventions and many others are, up to now,
built into the packages.
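To make these rules concrete, here is a small plain TeX sketch. The macro names \htmlBR, \htmlDIV and \htmlP are hypothetical and chosen only for illustration; the actual definitions inside html4tex.tex and htmlgraf.tex are more involved and may differ.

```tex
% Hypothetical macro names, for illustration only; they are not
% taken from html4tex.tex.
\def\htmlBR{\par\noindent}        % <br>: end the line, no indentation
\def\htmlDIV{\par\noindent}       % <div>: same treatment as <br>
\def\htmlP{\smallbreak\noindent}  % <p>: small vertical break, no indentation

First line of text.\htmlBR
Second line, starting flush left.\htmlP
A new block after a small break, also flush left.
\bye
```

With these conventions no HTML element produces an indented paragraph, which is why \par alone has no HTML equivalent in the rules above.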
Tables
Well, tables are hard stuff in both TeX and HTML, so
translation is difficult. Worse, vertical
material can be embedded in table cells with HTML...
Unsupported
FRAMES! and MUCH MORE! (look for the word <unsupported>
in html4tex.tex, and also </unsupported>, <unsupp@rted>,
<inactive>, </inactive>, <in@ctive>)
As for true linking, it has been under development since I realized
the existence of
\special{html:<a name="foo">}foo\special{html:</a>}
and other such special control sequences...
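For illustration, such specials can be wrapped in two small plain TeX macros. The names \anchor, \linkto and \hashchar are hypothetical, and the href form is assumed to follow the same HyperTeX-style convention as the name form quoted above; whether a given previewer follows the link depends on its support for these specials.

```tex
% Hypothetical helper macros; only the \special{html:...} strings
% follow the convention quoted above.
{\catcode`\#=12 \gdef\hashchar{#}}  % a "#" that is safe inside \special

% \anchor{name}{text}: mark text as a link target.
\def\anchor#1#2{\special{html:<a name="#1">}#2\special{html:</a>}}
% \linkto{name}{text}: make text a link to that target.
\def\linkto#1#2{\special{html:<a href="\hashchar#1">}#2\special{html:</a>}}

\anchor{foo}{Here is the target.}
\vfill\eject
\linkto{foo}{Back to the target}
\bye
```

The catcode trick is needed because an ordinary # inside a macro body is a parameter character and would otherwise be doubled when the \special is written to the dvi file.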
Copyright: HTML texts and graphics are free of any
copyright; they are copyleft. TeX programs are also
copyleft, but one can send a postcard. MetaPost
programs have just a feel-free-to-send-me-a-postcard licence.
MetaFont programs have the standard LaTeX licence.
Anthony Phan,
Département de Mathématiques, SP2MI,
Boulevard Marie et Pierre Curie, Téléport 2,
BP 30179, F-86962 Futuroscope-Chasseneuil cedex