How to include image in `Page`'s `to_html` or `to_xhtml` method? #69

LazyGeniusMan · 2023-05-19T01:40:07Z

When I convert a page that contains an image to HTML or XHTML, the image is not included in the output. I used this code:

use std::fs;

use mupdf::{Document, Page};

fn main() {
    let doc: Document = Document::open("C:\\Users\\LazyGeniusMan\\Downloads\\mupdf\\test.epub").unwrap();
    let page: Page = doc.load_page(341).unwrap();
    // Convert the page to HTML; the image is missing from this output.
    let html: String = page.to_html().unwrap();

    // Fail loudly if the file cannot be written instead of discarding the Result.
    fs::write("C:\\Users\\LazyGeniusMan\\Downloads\\mupdf\\rs-test.html", html).unwrap();
}

In the result, the image is missing: there should be an image above the "Figure 10.3" text.

I tried to do the same thing in PyMuPDF with this code:

import fitz

doc = fitz.Document('C:\\Users\\LazyGeniusMan\\Downloads\\mupdf\\test.epub')
page = doc[331] # the page index is somehow different for the same page I want
html = page.get_text("html")

with open("C:\\Users\\LazyGeniusMan\\Downloads\\mupdf\\py-test.html", "w") as file:
    file.write(html)

In this result, the image is included in base64 format.

I also tried the same thing with the `mutool convert` CLI and got the same result, but only after enabling an option. I can't find any way to set that option in this crate's `to_html` method. The options in mutool look like this:

Text output options:
        inhibit-spaces: don't add spaces between gaps in the text
        preserve-images: keep images in output
        preserve-ligatures: do not expand ligatures into constituent characters
        preserve-whitespace: do not convert all whitespace into space characters
        preserve-spans: do not merge spans on the same line
        dehyphenate: attempt to join up hyphenated words
        mediabox-clip=no: include characters outside mediabox
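As a possible stopgap until the crate exposes these options, the `mutool convert` call could be driven from Rust with `std::process::Command`. This is only a minimal sketch, not something the crate provides: the helper name `mutool_html_command`, the relative file names, and the page number are hypothetical, and it assumes `mutool` is installed and on `PATH`. Note that `mutool` page numbers are 1-based and may not match the page index used by either library.

```rust
use std::path::Path;
use std::process::Command;

/// Hypothetical helper: builds (but does not run) a `mutool convert`
/// invocation that renders one page to HTML with the
/// `preserve-images` text output option enabled.
fn mutool_html_command(input: &Path, output: &Path, page: u32) -> Command {
    let mut cmd = Command::new("mutool");
    cmd.arg("convert")
        .args(["-F", "html"]) // output format
        .args(["-O", "preserve-images"]) // keep images in output
        .arg("-o")
        .arg(output)
        .arg(input)
        .arg(page.to_string()); // page selection, 1-based in mutool
    cmd
}

fn main() {
    // Assumed file names and page number, for illustration only.
    let cmd = mutool_html_command(Path::new("test.epub"), Path::new("out.html"), 332);

    // Print the command rather than running it, so the sketch works
    // even where mutool is not installed. To execute it, declare `cmd`
    // as `mut` and call `cmd.status()`, then check the exit status.
    println!("{:?}", cmd);
}
```

Shelling out avoids touching the crate's bindings, at the cost of an external dependency and a round trip through the file system. A fix inside the crate would presumably need to pass the equivalent of `preserve-images` when the text page is built.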

messense · 2023-05-19T02:12:16Z

Sorry, this project is not actively maintained at the moment, but I'm happy to accept pull requests to fix this if anyone is up for it.
