html to fb2 omits <h2> elements #8123

Closed
phil294 opened this issue Jun 13, 2022 · 9 comments

phil294 commented Jun 13, 2022

Explain the problem.
Try some Wikipedia article:

pandoc -o rock.fb2 https://en.wikipedia.org/wiki/Rock_castle

and view the fb2 file: The subheaders (e.g. "Rock-hewn castles") are missing.

Pandoc version?
2.17.1.1 on Manjaro (Arch) Linux

Side notes:

Another problem: image captions are missing. But since Wikipedia marks them up as plain <div> text elements rather than semantic ones, Wikipedia is to blame for that.

Finally, it would be nice if the <img> width and height attributes were respected, perhaps via CSS.

Locally, I worked around the missing headers and captions with regex hacking, and the image sizes with ImageMagick command-line tools, before converting.

phil294 added the bug label Jun 13, 2022

jgm (Owner) commented Jun 14, 2022

@astanin - can you see what is happening here?

astanin (Contributor) commented Jun 14, 2022

@jgm I'm not using this feature anymore but I suppose that the problem is how Wikipedia HTML is parsed.

FB2 does not have an equivalent of the h2 tag. It provides only <title> within a <section>, but it allows nested sections. I think this becomes clearer if we look at the body of an FB2 document (source):

  <body>
    <section>
      <p>Frontispiece, with the caption: "He examined with his glass the word
        upon the wall, going over every letter of it with the most minute
        exactness." (<emphasis>Page</emphasis> 23.)</p>
    </section>
    <section>
      <title><p>PART I.</p></title>
      <section>
        <p>(<emphasis>Being a reprint from the reminiscences of</emphasis> JOHN
          H. WATSON, M.D.,<emphasis> late of the Army Medical
            Department.</emphasis>) <a xlink:href="#N2" type="note">2</a></p>
      </section>
      <section>
        <title><p>CHAPTER I. MR. SHERLOCK HOLMES.</p></title>
        <p>IN the year 1878 I took my degree of Doctor of Medicine of the
          University of London, and proceeded to Netley to go through the
          course prescribed for surgeons in the army. Having completed my
          studies there, I was duly attached to the Fifth Northumberland
          Fusiliers as Assistant Surgeon. The regiment was stationed in India
          at the time, and before I could join it, the second Afghan war had
          broken out. On landing at Bombay, I learned that my corps had
          advanced through the passes, and was already deep in the enemy's
          country. I followed, however, with many other officers who were in
          the same situation as myself, and succeeded in reaching Candahar in
          safety, where I found my regiment, and at once entered upon my new
          duties.</p>
        <p>The campaign brought honours and promotion to many, but for me it
          had nothing but misfortune and disaster. I was removed from my
          brigade and attached to the Berkshires, with whom I served at the
          fatal battle of Maiwand. There I was struck on the shoulder by a
          Jezail bullet, which shattered the bone and grazed the subclavian
          artery. I should have fallen into the hands of the murderous Ghazis
          had it not been for the devotion and courage shown by Murray, my
          orderly, who threw me across a pack-horse, and succeeded in bringing
          me safely to the British lines.</p>
      </section>
    </section>
  </body>

The FB2 writer apparently assumes that a section is represented by a Div block with a "section" class and a Header as its first nested block. As long as the HTML can be parsed into this structure, the FB2 converter should work fine.
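The shape described above can be written down as a small predicate. This is only a sketch in Python: `Header`, `Para`, `Div`, and `is_section` are simplified stand-ins invented for illustration, not pandoc's actual types or API.

```python
from dataclasses import dataclass, field


@dataclass
class Header:
    level: int
    text: str


@dataclass
class Para:
    text: str


@dataclass
class Div:
    classes: list
    blocks: list = field(default_factory=list)


def is_section(block) -> bool:
    """True for a Div with class "section" whose first child is a Header."""
    return (
        isinstance(block, Div)
        and "section" in block.classes
        and bool(block.blocks)
        and isinstance(block.blocks[0], Header)
    )


# The shape the description above expects.
well_formed = Div(["section"], [Header(2, "Rock-hewn castles"), Para("...")])
# Shapes like those in the Wikipedia parse: a non-section Div and a bare Header.
thumb = Div(["thumb", "tright"], [Para("...")])
bare_header = Header(2, "Rock-hewn castles")

print(is_section(well_formed))  # True
print(is_section(thumb))        # False
print(is_section(bare_header))  # False
```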

From what I can see from the native format output, Wikipedia HTML is parsed differently. Looking at the beginning of one of the sections:

    ,Div ("",["thumb","tright"],[])
     [Div ("",["thumbinner"],[("style","width:222px;")])
      [Plain [Link ("",["image"],[]) [Image ("",["thumbimage"],[("width","220"),("height","165"),("srcset","//upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Burg_Rotenhan.jpg/330px-Burg_Rotenhan.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Burg_Rotenhan.jpg/440px-Burg_Rotenhan.jpg 2x")]) [] ("//upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Burg_Rotenhan.jpg/220px-Burg_Rotenhan.jpg","")] ("/wiki/File:Burg_Rotenhan.jpg","")]
      ,Div ("",["thumbcaption"],[])
       [Div ("",["magnify"],[])
        [Plain [Link ("",["internal"],[]) [] ("/wiki/File:Burg_Rotenhan.jpg","Enlarge")]]
       ,Plain [Str "The",Space,Str "gateway",Space,Str "to",Space,Link ("",[],[]) [Str "Rotenhan",Space,Str "Castle"] ("/wiki/Rotenhan_Castle","Rotenhan Castle"),Str ",",Space,Str "which",Space,Str "was",Space,Str "entirely",Space,Str "hewn",Space,Str "out",Space,Str "of",Space,Str "the",Space,Str "sandstone"]]]]
    ,Header 2 ("rock-hewn-castlesedit",[],[]) [Span ("Rock-hewn_castles",["mw-headline"],[]) [Str "Rock-hewn",Space,Str "castles"],Span ("",["mw-editsection"],[]) [Span ("",["mw-editsection-bracket"],[]) [Str "["],Link ("",[],[]) [Str "edit"] ("/w/index.php?title=Rock_castle&action=edit&section=2","Edit section: Rock-hewn castles"),Span ("",["mw-editsection-bracket"],[]) [Str "]"]]]
    ,Para [Str "Castle",Space,Str "researcher",Space,Link ("",[],[]) [Str "Otto",Space,Str "Piper"] ("/wiki/Otto_Piper","Otto Piper"),Space,Str "used",Space,Str "the",Space,Str "German",Space,Str "phrase",Space,Emph [Str "ausgehauene",Space,Str "Burg"],Space,Str "(literally:",Space,Str "\"carved-out",Space,Str "castle\")",Space,Str "for",Space,Str "castles",Space,Str "that",Space,Str "had",Space,Str "rooms",Space,Str "artificially",Space,Str "hewn",Space,Str "out",Space,Str "of",Space,Str "the",Space,Str "rock",Space,Str "on",Space,Str "which",Space,Str "the",Space,Str "castle",Space,Str "stood.",Superscript [Link ("",[],[]) [Str "[1]"] ("#cite_note-1","")],Space,Str "His",Space,Str "examples",Space,Str "of",Space,Str "such",Space,Str "rock-hewn",Space,Str "castles",Space,Str "include",Space,Link ("",["mw-redirect"],[]) [Str "Fleckenstein"] ("/wiki/Fleckenstein_Castle","Fleckenstein Castle"),Str ",",Space,Link ("",[],[]) [Str "Trifels"] ("/wiki/Trifels_Castle","Trifels Castle"),Space,Str "and",Space,Link ("",["mw-redirect"],[]) [Str "Altwindstein"] ("/wiki/Altwindstein_Castle","Altwindstein Castle"),Str ".",Space,Str "From",Space,Str "a",Space,Str "constructional",Space,Str "point",Space,Str "of",Space,Str "view",Space,Str "there",Space,Str "is",Space,Str "a",Space,Str "close",Space,Str "relationship",Space,Str "with",Space,Link ("",[],[]) [Str "cave",Space,Str "castles"] ("/wiki/Cave_castle","Cave castle"),Str ",",Space,Str "which",Space,Str "are",Space,Str "also",Space,Str "often",Space,Str "enhanced",Space,Str "with",Space,Str "rooms",Space,Str "artificially",Space,Str "cut",Space,Str "out",Space,Str "of",Space,Str "the",Space,Str "rock."]

So it appears that the Header is not even inside a Div with the "section" class.

As a workaround, I would suggest clicking the page's Edit link on Wikipedia, copying the MediaWiki markup to a file, and converting that file instead of the rendered HTML. As a more permanent solution, the assumption about how a section is represented may have to be revised.

jgm (Owner) commented Jun 14, 2022

@astanin I don't think that's the heart of it. The FB2 writer has never expected the AST to be structured into sections. It starts by applying a function renderSections that converts a regular AST into this section Div structure.
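To illustrate what such a conversion does, here is a minimal Python sketch that nests a flat block list into section Divs, one per Header. The `Header`, `Para`, and `Div` classes and the `make_sections` and `outline` functions are hypothetical toy versions written for this example. They are not pandoc's implementation, and this toy only handles flat input without pre-existing Divs.

```python
from dataclasses import dataclass, field


@dataclass
class Header:
    level: int
    text: str


@dataclass
class Para:
    text: str


@dataclass
class Div:
    classes: list
    blocks: list = field(default_factory=list)


def make_sections(blocks):
    """Nest a flat block list into section Divs, one per Header (toy version)."""
    root = []
    stack = [(0, root)]  # (header level, child list) of the currently open sections
    for block in blocks:
        if isinstance(block, Header):
            # Close every open section at the same or a deeper level.
            while stack[-1][0] >= block.level:
                stack.pop()
            section = Div(["section"], [block])
            stack[-1][1].append(section)
            stack.append((block.level, section.blocks))
        else:
            stack[-1][1].append(block)
    return root


def outline(blocks, depth=0):
    """Indented summary; assumes every Div was created by make_sections."""
    lines = []
    for b in blocks:
        if isinstance(b, Div):
            lines.append("  " * depth + "section: " + b.blocks[0].text)
            lines += outline(b.blocks[1:], depth + 1)
        elif isinstance(b, Para):
            lines.append("  " * depth + "para: " + b.text)
    return lines


flat = [
    Para("intro"),
    Header(1, "A"), Para("a"),
    Header(2, "A.1"), Para("a1"),
    Header(1, "B"), Para("b"),
]
nested = make_sections(flat)
print("\n".join(outline(nested)))
```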

tarleb (Collaborator) commented Aug 3, 2022

It seems that this happens whenever the content is wrapped in a div and doesn't start with a header. Example:

::: wrapper
hello

# MISSING

section one
:::

Output of pandoc -t fb2 for this Markdown, tidied up for readability. Note that the first-level heading is missing.

<?xml version="1.0" encoding="utf-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0"
xmlns:l="http://www.w3.org/1999/xlink">
  <description>
    <title-info>
      <genre>unrecognised</genre>
    </title-info>
    <document-info>
      <program-used>pandoc</program-used>
    </document-info>
  </description>
  <body>
    <title>
      <p />
    </title>
    <section>
      <p>hello</p>
      <p>section one</p>
    </section>
  </body>
</FictionBook>

jgm (Owner) commented Aug 3, 2022

I think what's going on is that makeSections just leaves this Div alone, so the expectation of the fb2 writer (which is that after makeSections every Div will have the structure described above) is incorrect?

Here's the result of putting a trace on the block structure produced by makeSections in the FB2 writer:

[ Div
    ( "" , [ "section" ] , [] )
    [ Header 1 ( "" , [] , [] ) []
    , Div
        ( "" , [ "wrapper" ] , [] )
        [ Para [ Str "hello" ]
        , Div
            ( "missing" , [ "section" ] , [] )
            [ Header 1 ( "" , [] , [] ) [ Str "MISSING" ]
            , Para [ Str "section" , Space , Str "one" ]
            ]
        ]
    ]
]

jgm added a commit that referenced this issue Aug 3, 2022
This allows the writer to recurse into those Divs and
find new sections inside them. See #8123.

jgm (Owner) commented Aug 3, 2022

Pushed a potential fix, but I don't know enough about FB2 to know if this is right.
Output for @tarleb's snippet would be

<?xml version="1.0" encoding="UTF-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0" xmlns:l="http://www.w3.org/1999/xlink"><description><title-info><genre>unrecognised</genre></title-info><document-info><program-used>pandoc</program-used></document-info></description><body><title><p /></title><section><p>hello</p><section id="missing"><title><p>MISSING</p></title><p>section one</p></section></section></body></FictionBook>
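The difference between the earlier output and this one can be reproduced with a toy renderer in Python. Everything here is an assumption made for illustration: the `Header`, `Para`, and `Div` classes, the `render_section` and `render_block` functions, and the `recurse` flag are hypothetical and do not mirror the FB2 writer's actual code. The sketch assumes that a bare Header is dropped and that a generic Div is flattened into its children.

```python
from dataclasses import dataclass, field


@dataclass
class Header:
    level: int
    text: str


@dataclass
class Para:
    text: str


@dataclass
class Div:
    ident: str
    classes: list
    blocks: list = field(default_factory=list)


def is_section(block):
    return (isinstance(block, Div) and "section" in block.classes
            and bool(block.blocks) and isinstance(block.blocks[0], Header))


def render_section(div, recurse):
    """Render a section Div as <section>, with a <title> if the header has text."""
    header, *rest = div.blocks
    out = [f'<section id="{div.ident}">' if div.ident else "<section>"]
    if header.text:
        out.append(f"<title><p>{header.text}</p></title>")
    for child in rest:
        if is_section(child):
            out += render_section(child, recurse)  # direct subsection
        else:
            out += render_block(child, recurse)
    out.append("</section>")
    return out


def render_block(block, recurse):
    if isinstance(block, Para):
        return [f"<p>{block.text}</p>"]
    if recurse and is_section(block):
        # Assumed fix: a section Div found inside another Div is a section again.
        return render_section(block, recurse)
    if isinstance(block, Div):
        # Any other Div is flattened into its children.
        return [line for child in block.blocks for line in render_block(child, recurse)]
    return []  # assumption: a bare Header is dropped here


# The block structure from the trace above.
tree = Div("", ["section"], [
    Header(1, ""),
    Div("", ["wrapper"], [
        Para("hello"),
        Div("missing", ["section"], [Header(1, "MISSING"), Para("section one")]),
    ]),
])

before = render_section(tree, recurse=False)
after = render_section(tree, recurse=True)
print("\n".join(before))
print("---")
print("\n".join(after))
```

With `recurse=False` the "MISSING" title is lost because the inner section Div is flattened like any other Div. With `recurse=True` it comes back as a nested `<section>` with its own `<title>`.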

jgm (Owner) commented Aug 3, 2022

Is FB2 okay with a section element without a title element?

tarleb (Collaborator) commented Aug 4, 2022

The FB2 schema that I found says minOccurs="0" for titles in sections, so it seems that title-less sections are ok.
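To see which sections in a given FB2 file actually lack a title, a small check with Python's standard `xml.etree.ElementTree` is enough. This is a sketch, not schema validation: `section_titles` is a helper name made up for this example, and the sample document is the output from the fix above with the XML declaration and the `<description>` block omitted.

```python
import xml.etree.ElementTree as ET

FB2_NS = "http://www.gribuser.ru/xml/fictionbook/2.0"

# Output from the fix above, without the XML declaration and <description>.
doc = (
    '<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0" '
    'xmlns:l="http://www.w3.org/1999/xlink">'
    "<body><title><p /></title>"
    "<section><p>hello</p>"
    '<section id="missing"><title><p>MISSING</p></title><p>section one</p></section>'
    "</section></body></FictionBook>"
)


def section_titles(xml_text):
    """Return (id, title text or None) for every <section>, in document order."""
    root = ET.fromstring(xml_text)
    result = []
    for section in root.iter(f"{{{FB2_NS}}}section"):
        # find() only looks at direct children, so a nested section's
        # title is not mistaken for the title of its parent.
        title = section.find(f"{{{FB2_NS}}}title")
        text = "".join(title.itertext()).strip() if title is not None else None
        result.append((section.get("id"), text))
    return result


for ident, title in section_titles(doc):
    print(ident, title)
```

Here the outer section is reported without a title and the inner one with the title "MISSING".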

jgm (Owner) commented Aug 4, 2022

Closing this, then.

jgm closed this as completed Aug 4, 2022