Renard::API::MuPDF::mutool - Retrieve PDF image and text data via MuPDF's mutool
version 0.006
_call_mutool( @args )
Helper function which calls mutool
with the contents of the @args
array.
Returns the captured STDOUT
of the call.
This function dies if mutool
unsuccessfully exits.
get_mutool_pdf_page_as_png($pdf_filename, $pdf_page_no)
This function returns a PNG stream that renders page number $pdf_page_no
of the PDF file $pdf_filename
.
get_mutool_text_stext_raw($pdf_filename, $pdf_page_no)
This function returns an XML string that contains structured text from page number $pdf_page_no
of the PDF file $pdf_filename
.
The XML format is defined by the output of mutool
looks like this (for page 23 of the pdf_reference_1-7.pdf
file):
<?xml version="1.0"?>
<document name="(null)">
<page height="666" width="531">
<block bbox="261.18 616.16397 269.77766 625.2532">
<line bbox="261.18 616.16397 269.77766 625.2532" dir="1 0" wmode="0">
<font name="MyriadPro-Semibold" size="7.98">
<char bbox="261.18 616.16397 265.45729 625.2532" c="2" x="261.18" y="623.2582"/>
<char bbox="265.50038 616.16397 269.77766 625.2532" c="3" x="265.50038" y="623.2582"/>
</font>
</line>
</block>
<block bbox="225.78 88.20229 305.18159 117.93829">
<line bbox="225.78 88.20229 305.18159 117.93829" dir="1 0" wmode="0">
<font name="MyriadPro-Bold" size="24">
<char bbox="225.78 88.20229 239.724 117.93829" c="P" x="225.78" y="111.93829"/>
<char bbox="239.5176 88.20229 248.63759 117.93829" c="r" x="239.5176" y="111.93829"/>
<char bbox="248.4552 88.20229 261.1272 117.93829" c="e" x="248.4552" y="111.93829"/>
<char bbox="261.1128 88.20229 269.29679 117.93829" c="f" x="261.1128" y="111.93829"/>
</font>
</line>
</block>
</page>
</document>
Simplified, the high-level structure looks like:
<page> -> [list of blocks]
<block> -> [list of blocks]
a block is either:
- stext
<line> -> [list of lines] (all have same baseline)
<font> -> [list of fonts] (horizontal spaces over a line)
<char> -> [list of chars]
- image
# TODO document the image data from mutool
get_mutool_text_stext_xml($pdf_filename, $pdf_page_no)
Returns a HashRef of the structured text from from page number $pdf_page_no
of the PDF file $pdf_filename
.
See the function get_mutool_text_stext_raw for details on the structure of this data.
get_mutool_page_info_raw($pdf_filename)
Returns an XML string of the page bounding boxes of PDF file $pdf_filename
.
The data is in the form:
<document>
<page pagenum="1">
<MediaBox l="0" b="0" r="531" t="666" />
<CropBox l="0" b="0" r="531" t="666" />
<Rotate v="0" />
</page>
<page pagenum="2">
...
</page>
</document>
get_mutool_page_info_xml($pdf_filename)
Returns a HashRef containing the page bounding boxes of PDF file $pdf_filename
.
See function get_mutool_page_info_raw for information on the structure of the data.
fun get_mutool_outline_simple($pdf_filename)
Returns an array of the outline of the PDF file $pdf_filename
as an ArrayRef[HashRef]
which corresponds to the items
attribute of Renard::Incunabula::Outline.
fun get_mutool_get_trailer_raw($pdf_filename)
Returns the trailer of the PDF file $pdf_filename
as a string.
fun get_mutool_get_object_raw($pdf_filename, $object_id)
Returns the object given by the ID $object_id
for PDF file $pdf_filename
as a string.
fun get_mutool_get_info_object_parsed( $pdf_filename )
Returns the document information dictionary as a Renard::API::MuPDF::mutool::ObjectParser object.
See Table 10.2 on pg. 844 of the PDF Reference, version 1.7 to see the entries that usually used (e.g., Title, Author).
Project Renard
This software is copyright (c) 2017 by Project Renard.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.