This code is highly experimental and still under construction. It
comes with no guarantees, no warranties and is not future-proof.
Note that version 0.8b has some file name changes. The most
important one is the renaming of the former aux directory
(see issue #1).
My Java application MakeJmlrBookGUI needs to parse LaTeX files so
that it can fix common problems. This mainly involves replacing
obsolete/problematic code and removing .eps from included graphics,
so that the book correctly compiles with pdflatex and my jmlrbook
class.
Unfortunately TeX syntax can be too complex for a regular
expression. The texparserlib.jar library is not intended as a TeX
engine, but as a way of parsing TeX code that's somewhat better than
a simple pattern match.
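As an illustration of why a pattern match falls short, here is a
minimal sketch of reading a nested brace group. It is not code from
texparserlib.jar: it is written in Python to keep it short, the
function name read_group is hypothetical, and it ignores comments,
category codes and verbatim text, all of which a real parser has to
deal with.

```python
import re

def read_group(src, start):
    """Return the text inside the brace group that opens at src[start]."""
    if src[start] != "{":
        raise ValueError("expected '{' at index %d" % start)
    depth = 0
    i = start
    while i < len(src):
        ch = src[i]
        if ch == "\\":
            i += 2  # skip an escaped character such as \{ or \}
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return src[start + 1:i]
        i += 1
    raise ValueError("unbalanced braces")

sample = r"\textbf{a {nested} group} tail"

# A non-greedy regular expression stops at the first closing brace.
regex_result = re.search(r"\\textbf\{(.*?)\}", sample).group(1)

# Counting brace depth finds the brace that really closes the argument.
scan_result = read_group(sample, sample.index("{"))
```

Here regex_result is cut short at the inner closing brace, whereas
scan_result holds the whole argument.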
Since TeX4HT no longer works with the jmlrbook class, I started to
extend the TeX parsing code so that it could convert the article
abstracts to HTML without requiring TeX4HT. This aspect is no
longer required as JMLR W&CP now generate the HTML files from the
.bib file associated with the proceedings, but the html part of the
TeX parser library allows the translation of code fragments, such as
an author name or article title, so that they can be rendered in
Java's HTMLDocument, which makes the GUI look a bit tidier.
Since I have other Java applications (for example, datatooltk and
bib2gls) that also need to parse LaTeX files or their associated
.aux or .bib files, I decided to split away the TeX parsing code
from MakeJmlrBookGUI into a separate library, namely
texparserlib.jar. This also makes it easier to test the library
without the additional overhead of the main program.
There are only a few LaTeX packages implemented and some of them
aren't a full implementation. These are provided because some
aspects of them are required by the applications using the
texparserlib.jar library. For example, MakeJmlrBookGUI needs to
convert articles that use the old jmlr2e package so that they
instead use the new jmlr class, the datatooltk application can
import probsoln data sets, and bib2gls needs to know symbols that
commonly occur in glossary entries. So, for example,
texparserlib.jar recognises siunitx's \si{} command as it's feasible
that a user might want to define units used in the document, but
it's less likely that a specific measurement might occur in the name
field (although measurements may well occur in the description when
defining constants, but that's less important to bib2gls since
entries aren't usually sorted by their description).
The accompanying texparsertest.jar is a command line application
provided to test the texparserlib.jar library. It's not intended for
general use. (For this reason, I've renamed it from texparserapp.jar
to texparsertest.jar. There is still a script called texparserapp in
the bin directory which does the same thing as texparsertest.)
Syntax:
texparsertest [--html] --in <tex file> --output <out dir>
This parses <tex file>, saves the new file in <out dir> and copies
over any included images. It will run epstopdf on any eps files and
wmf2eps on any wmf files. Both epstopdf and wmf2eps must be on your
system path. The --html switch indicates conversion to HTML. If this
switch is omitted, LaTeX to LaTeX conversion is assumed.
The output directory <out dir> must not exist. This is a precautionary measure to ensure you don't accidentally overwrite the original files.
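The two preconditions above (the helper programs must be on the
system path and the output directory must not already exist) can be
sketched as follows. This is only an illustration in Python, not the
code that texparsertest runs; the function names missing_tools and
check_output_dir are hypothetical.

```python
import os
import shutil

def missing_tools(names):
    """Return the helper programs that cannot be found on the system path."""
    return [name for name in names if shutil.which(name) is None]

def check_output_dir(out_dir):
    """Raise an error if out_dir already exists, so nothing is overwritten."""
    if os.path.exists(out_dir):
        raise FileExistsError("output directory already exists: " + out_dir)
    return out_dir

# Hypothetical usage: list whichever of the two converters is not installed.
not_found = missing_tools(["epstopdf", "wmf2eps"])
```

Refusing to continue when the directory exists is a simpler safeguard
than asking for confirmation, and it cannot overwrite the original
files by accident.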
I experimented with including a GUI for testing the library, but I've now removed it as I don't have time to develop it and it requires additional libraries.
TeX Java Help uses the TeX Parser Library.
The command line texjavahelpmk.jar application is similar to
texparsertest --html but is customized to work with the
texjavahelplib.jar library and has added support for
texjavahelp.sty. This is used with datatooltk to provide the
in-application manual created from the LaTeX source and will also be
used with future versions of flowframtk and makeglossariesgui.
Test files are in src/tests/.
The test file src/tests/test-obsolete/test-obs.tex contains obsolete
commands such as \bf, \centerline and \epsfig.
cd src/tests
texparsertest --in test-obsolete/test-obs.tex --output output/test-obsolete
This will create the directory output/test-obsolete and create a
file in it called test-obs.tex which is the original file with the
obsolete commands replaced. The eps file is also copied over to the
new directory and epstopdf is used to create a corresponding pdf
file. The \epsfig command is converted to \includegraphics with the
file extension removed.
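The \epsfig substitution described above can be sketched like this.
It is a simplified Python illustration rather than the library's
code: the function names are hypothetical, it assumes the file name
is given with a file= or figure= key, and it assumes any remaining
keys can be passed through unchanged as \includegraphics options.

```python
import re

def strip_graphics_ext(name):
    """Drop a trailing .eps so pdflatex can pick the .pdf version instead."""
    return re.sub(r"\.eps$", "", name, flags=re.IGNORECASE)

def convert_epsfig(src):
    """Rewrite \\epsfig{file=NAME.eps,KEY=VAL,...} as \\includegraphics."""
    def repl(match):
        filename = ""
        options = []
        for item in match.group(1).split(","):
            key, _, value = item.strip().partition("=")
            if key in ("file", "figure"):
                filename = strip_graphics_ext(value.strip())
            elif item.strip():
                options.append(item.strip())  # keep other settings as-is
        opt = "[" + ",".join(options) + "]" if options else ""
        return "\\includegraphics" + opt + "{" + filename + "}"
    # Only matches arguments without nested braces (a sketch limitation).
    return re.sub(r"\\epsfig\{([^{}]*)\}", repl, src)

converted = convert_epsfig(r"\epsfig{file=plot.eps,width=3in}")
```

With the extension removed, pdflatex is free to pick up the pdf file
that epstopdf created alongside the eps file.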
Font changing commands, such as \bf, are changed to the
corresponding LaTeX2e font declarations (such as \bfseries) in text
mode and the corresponding LaTeX2e math font commands (such as
\mathbf) in math mode. The commands are unchanged if they occur in
the argument of \verb or in a command definition.
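The mode-dependent replacement of \bf can be sketched as below. This
is a rough Python illustration, not the library's implementation: the
function names are hypothetical, math mode is assumed to be delimited
only by single $ characters, math-mode conversion is assumed to apply
only to the {\bf ...} form, and command definitions are not detected
at all.

```python
def matching_brace(src, open_idx):
    """Return the index of the brace that closes the group at open_idx."""
    depth = 0
    for i in range(open_idx, len(src)):
        if src[i] == "{":
            depth += 1
        elif src[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced braces")

def convert_bf(src):
    """Replace \\bf according to the current mode, leaving \\verb alone."""
    out = []
    i = 0
    math = False
    while i < len(src):
        if src.startswith("\\verb", i) and not src[i + 5:i + 6].isalpha():
            j = i + 5
            if src[j:j + 1] == "*":
                j += 1
            end = src.index(src[j], j + 1)  # closing \verb delimiter
            out.append(src[i:end + 1])      # copy verbatim text unchanged
            i = end + 1
        elif src[i] == "\\" and src[i + 1:i + 2] in ("$", "\\", "{", "}"):
            out.append(src[i:i + 2])        # escaped character, e.g. \$
            i += 2
        elif src[i] == "$":
            math = not math
            out.append("$")
            i += 1
        elif (math and src.startswith("{\\bf", i)
              and not src[i + 4:i + 5].isalpha()):
            close = matching_brace(src, i)
            out.append("\\mathbf{" + src[i + 4:close].strip() + "}")
            i = close + 1
        elif (not math and src.startswith("\\bf", i)
              and not src[i + 3:i + 4].isalpha()):
            out.append("\\bfseries")
            i += 3
        else:
            out.append(src[i])
            i += 1
    return "".join(out)

converted = convert_bf(r"{\bf bold} and ${\bf x}+y$ but \verb|\bf| and \bfx")
```

The check for a following letter stops the sketch from touching a
longer command name that merely starts with \bf (such as \bfx in the
sample, or \bfseries itself).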
The test file src/tests/test-sw/test-sw.tex simulates output from
Scientific Word. (I don't have SW so I can't test this. The code is
based on the type of code I've had to work with as a production
editor.) In general I don't want commands such as \bigskip in the
articles, as the inter-paragraph spacing should be dependent on the
book style. Also, I don't have tcilatex.tex used by SW, so
texparserapp removes \input{tcilatex} and substitutes \FRAME and
\Qcb. I don't know what other commands tcilatex defines. Those two
are the only ones I've encountered so far. texparserapp also
replaces
\special{language "Scientific Word";...;tempfilename 'imgname.wmf'}
with \includegraphics{imgname} and runs wmf2eps on the wmf image.
cd src/tests
texparsertest --in test-sw/test-sw.tex --output output/test-sw
This creates the directory output/test-sw and writes a copy of
test-sw.tex with the relevant substitutions. The image file
X0001.wmf is converted to eps and the eps file is then converted to
pdf.
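The \special substitution described above can be sketched with a
pattern that captures the image name. This is a Python illustration
only, not the library's code: the function name convert_sw_special is
hypothetical, and the key "value" setting in the sample is a made-up
placeholder standing in for the elided settings, not real Scientific
Word output.

```python
import re

SPECIAL_RE = re.compile(
    r'\\special\{language "Scientific Word";'  # start of the special
    r"[^{}]*?tempfilename\s*'([^']+)\.wmf'"    # capture name without .wmf
    r"[^{}]*\}"                                # any remaining settings
)

def convert_sw_special(src):
    """Return the converted source and the wmf files that need converting."""
    images = []
    def repl(match):
        images.append(match.group(1) + ".wmf")
        return "\\includegraphics{" + match.group(1) + "}"
    return SPECIAL_RE.sub(repl, src), images

# Placeholder input: only the first and last settings come from the README.
sample = (r"""\special{language "Scientific Word";key "value";"""
          r"""tempfilename 'X0001.wmf'}""")
converted, wmf_files = convert_sw_special(sample)
```

The list of wmf files returned alongside the text is what would then
be handed to wmf2eps, followed by epstopdf.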
Conversion to HTML just creates a single HTML file and copies over image files. It's very limited as I initially only needed to convert abstracts to HTML. MathJax is used to render math mode.
cd src/tests
texparsertest --in test-article/test-article.tex --output output/test-article --html
The bib2gls application uses the HTML conversion without MathJax to
interpret the sort value if the sort field is missing, so this test
file now includes some packages that have been added to help
bib2gls. These are mostly packages that provide symbols that might
appear in a glossary. Some support for datatool has also been added
to assist datatooltk.