Broken links between multiple input Markdown files #6384

bebuch · 2020-05-20T11:11:24Z

We convert our documentation from Markdown to PDF using Pandoc. Usually several Markdown files are converted to one PDF file.

Links between Markdown files that are included in the same PDF are broken in the PDF.

A very similar bug report, #2719, was already filed in 2016.

Minimal example

File test-1.md:

# Headline 1

some text

## Headline 2

more text

File test-2.md:

# Headline 1

some other text

## Another headline

more other text

[link to #headline-1](#headline-1) **wrong link**
refers to Headline 1 in test-1, should refer to Headline 1 in test-2

[link to #another-headline](#another-headline) **works**
as expected because anchor is unique

[link to test-1.md](test-1.md) **broken link**

[link to test-1.md#headline-1](test-1.md#headline-1) **broken link**

[link to test-1.md#headline-2](test-1.md#headline-2) **broken link**

[link to test-2.md](test-2.md) **broken link**

[link to test-2.md#headline-1](test-2.md#headline-1) **broken link**

[link to test-2.md#another-headline](test-2.md#another-headline) **broken link**

Compile it to PDF:

docker run --rm --volume "$(pwd):/data" --user $(id -u):$(id -g) pandoc/latex:2.9.2.1 test-1.md test-2.md -o test.pdf

You get the same behavior with HTML which is simpler to debug:

docker run --rm --volume "$(pwd):/data" --user $(id -u):$(id -g) pandoc/latex:2.9.2.1 test-1.md test-2.md -o test.html

Here is the HTML output:

<h1 id="headline-1">Headline 1</h1>
<p>some text</p>
<h2 id="headline-2">Headline 2</h2>
<p>more text</p>
<h1 id="headline-1-1">Headline 1</h1>
<p>some other text</p>
<h2 id="another-headline">Another headline</h2>
<p>more other text</p>
<p><a href="#headline-1">link to #headline-1</a> <strong>wrong link</strong> refers to Headline 1 in test-1, should refer to Headline 1 in test-2</p>
<p><a href="#another-headline">link to #another-headline</a> <strong>works</strong> as expected because anchor is unique</p>
<p><a href="test-1.md">link to test-1.md</a> <strong>broken link</strong></p>
<p><a href="test-1.md#headline-1">link to test-1.md#headline-1</a> <strong>broken link</strong></p>
<p><a href="test-1.md#headline-2">link to test-1.md#headline-2</a> <strong>broken link</strong></p>
<p><a href="test-2.md">link to test-2.md</a> <strong>broken link</strong></p>
<p><a href="test-2.md#headline-1">link to test-2.md#headline-1</a> <strong>broken link</strong></p>
<p><a href="test-2.md#another-headline">link to test-2.md#another-headline</a> <strong>broken link</strong></p>

Expected behavior

- Links to anchors without a file name should, in the converted file, refer to the matching headline from the same original Markdown file, not to the first identical headline across all Markdown files.
- Links to files with anchors, where the file is one of the Markdown files passed to Pandoc, should link to the corresponding anchor in the converted file, not to the original file itself.
- Links to files without anchors, where the file is one of the Markdown files passed to Pandoc, are a bit more difficult to resolve:
  1. If the referenced Markdown file starts with a headline (of any level), the link should point to this headline.
  2. Otherwise, an additional anchor must be inserted at that point in the target document, and the link should point to it. (Alternatively, such links could pragmatically be removed, since these cases are likely to be very rare, but the first solution would be preferable if it can be implemented, since very rare is not never.)


jgm · 2020-05-20T14:44:35Z

I think you have a misconception about how pandoc treats its input files.

If you include multiple files on the command line, pandoc concatenates their contents and parses the result, paying no attention to what file a particular bit of markdown is found in. (In this respect it works like a lot of other unix tools.)

So the fact that the link is found in the second file makes no difference.

You might try experimenting with the --file-scope option, depending on your needs. (Note that this has certain limitations, though: e.g. with that option you can't define a link reference in one file and use it in another.)
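The difference between the two modes can be illustrated with a small sketch. It is a simplified approximation of pandoc's automatic heading identifiers, not pandoc's actual code; `slugify` and `unique_ids` are hypothetical helper names. Parsing the concatenated input shares one set of used identifiers, so the second "Headline 1" gets a numeric suffix, while parsing each file separately starts from an empty set each time.

```python
import re


def slugify(heading):
    """Simplified approximation of an automatic heading identifier."""
    text = heading.lower()
    text = re.sub(r"[^\w\s.-]", "", text)      # drop most punctuation
    text = re.sub(r"\s+", "-", text.strip())   # whitespace becomes hyphens
    return re.sub(r"^[^a-z]+", "", text)       # strip everything before the first letter


def unique_ids(headings, seen=None):
    """Assign identifiers, appending -1, -2, ... to ones already in use."""
    seen = set() if seen is None else seen
    result = []
    for heading in headings:
        base = slugify(heading) or "section"
        ident, n = base, 0
        while ident in seen:
            n += 1
            ident = f"{base}-{n}"
        seen.add(ident)
        result.append(ident)
    return result


test_1 = ["Headline 1", "Headline 2"]
test_2 = ["Headline 1", "Another headline"]

# Default behavior: the files are concatenated and parsed as one document.
concatenated = unique_ids(test_1 + test_2)

# Per-file parsing: each file starts with a fresh set of identifiers.
per_file = [unique_ids(test_1), unique_ids(test_2)]
```

In the concatenated case a bare `#headline-1` can only ever reach the first heading, which matches the "wrong link" in the minimal example.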

bebuch · 2020-05-20T17:15:18Z

I understand that this is how it works currently, but this is not useful behavior.

Thanks for your comments, that helped me to better understand the current behavior. The current behavior is in accordance with the documentation, so this is not a bug report, but an enhancement request.

The point is that PDF is a file format that always consists of one file. All contents are embedded in it.

In Markdown (and also HTML for example), however, the same information is handled in multiple files within a directory structure. This is absolutely necessary, because images, for example, cannot be embedded in these formats.

A document converter should be able to convert such multi-file document formats into single-file formats without breaking the 'in document' links and image inclusions.

If I understand it correctly, the conversion process is currently divided into a read and a write process. To implement the proposed behavior, the reading process for markdown files would have to be adapted.

1. Load input files individually
2. Parse input files individually
3. Adjust relative links between all included files (including images)
4. Merge ASTs

Technically, I see no reason why this could not be implemented. It would make the tool much more useful and easier to use. For example, in many cases, the use of the parameter --resource-path would become obsolete, since the inclusion with the intelligent behavior simply works.

I understand this is a major change, but I think it would save a lot of people a lot of time and energy. I would therefore be very pleased about a second review.

If you agree, please reopen the issue.

jgm · 2020-05-20T18:01:08Z

--file-scope does 1, 2, and 4 -- did you try it?

The one thing it won't do is rewrite heading IDs or links, so in your example the two identical headings would not receive unique anchors. I'd suggest dealing with this by using explicit identifiers (see link attribute syntax) when you have duplicated headings.

jgm · 2020-05-20T18:14:28Z

If you'd like, you can create a more general issue that requests that the markdown parser be made sensitive to the file containing each particular bit of content. This would require a fairly big architectural change: the readers would have to be changed to take a [(FilePath, Text)] argument instead of just a Text. Currently they simply don't have access to information about the containing file.

bebuch · 2020-05-20T18:29:42Z

Answer to the first comment

Step 3 is the most important one in this process. ;-)

When I compile to HTML with --file-scope, all links described above as broken are still broken. In addition, the HTML is also invalid, because now two elements have the same ID.

<h1 id="headline-1">Headline 1</h1>
<p>some text</p>
<h2 id="headline-2">Headline 2</h2>
<p>more text</p>
<h1 id="headline-1">Headline 1</h1>
<p>some other text</p>
<h2 id="another-headline">Another headline</h2>
<p>more other text</p>
<p><a href="#headline-1">link to #headline-1</a> <strong>wrong link</strong> refers to Headline 1 in test-1, should refer to Headline 1 in test-2</p>
<p><a href="#another-headline">link to #another-headline</a> <strong>works</strong> as expected because anchor is unique</p>
<p><a href="test-1.md">link to test-1.md</a> <strong>broken link</strong></p>
<p><a href="test-1.md#headline-1">link to test-1.md#headline-1</a> <strong>broken link</strong></p>
<p><a href="test-1.md#headline-2">link to test-1.md#headline-2</a> <strong>broken link</strong></p>
<p><a href="test-2.md">link to test-2.md</a> <strong>broken link</strong></p>
<p><a href="test-2.md#headline-1">link to test-2.md#headline-1</a> <strong>broken link</strong></p>
<p><a href="test-2.md#another-headline">link to test-2.md#another-headline</a> <strong>broken link</strong></p>

The important point is that if you open the Markdown files with OOO, for example, all the links work before conversion. After the conversion to HTML or PDF, the same links are broken. The links are not included in the conversion, so the conversion is incomplete. (This assumes that the Markdown files are considered one document spread over several files.)

Answer to the second comment

If I interpret your description and the behavior of --file-scope correctly, then step 3 can be done before merging the ASTs.

I haven't checked the source code for this yet, but I would suspect that the merging of the ASTs takes place at a location where the original filenames (including path) are available. If this is the case, the addition of this step would be less complex.

I don't know if I can get around to checking it today, but probably sometime next week. Thanks for the feedback!

jgm · 2020-05-20T18:42:14Z

> step 3 can be done before merging the ASTs.

Yes, that's true, when --file-scope is used we have

        mconcat <$> mapM (readSource >=> r readerOpts) sources'

It would be possible, for example, to insert an identifier prefix derived from the file name before each internal identifier. Internal links to rewritten ids could also be rewritten.

I'm not so sure about links to other parts of the document (in other files). You are expecting it to work with a link to other-markdown-file.md#ident, so I guess your idea is that if other-markdown-file.md is one of the files on the command line, this would get rewritten to something like #other-markdown-file-md-ident.

That would mean that, for example, you couldn't link to the document's source file from the document -- perhaps an undesirable consequence for some users.

Paging @jkr who added the file-scope option originally and may have some thoughts on whether it should be changed in this way.

bebuch · 2020-05-20T19:01:09Z

> so I guess your idea is that if other-markdown-file.md is one of the files on the command line, this would get rewritten to something like #other-markdown-file-md-ident.

Exactly!

> I'm not so sure about links to other parts of the document (in other files). You are expecting it to work with a link to other-markdown-file.md#ident

These links can remain unchanged. The linked data is not part of the converted document, so it is okay if it breaks if the linked data is not copied separately.

Ideally, you could convert them to absolute HTTP paths using an additional Pandoc command line option like --rebase-relative-links-on 'https://github.com/jgm/pandoc/tree/master/doc', but that might need to be addressed in a separate issue afterwards.

Previously, when multiple file arguments were provided, pandoc simply concatenated them and passed the contents to the readers, which took a Text argument. As a result, the readers had no way of knowing which file was the source of any particular bit of text. This meant that we couldn't report accurate source positions on errors or include accurate source positions as attributes in the AST. More seriously, it meant that we couldn't resolve resource paths relative to the files containing them (see e.g. #5501, #6632, #6384, #3752).

Add Text.Pandoc.Sources (exported module), with a `Sources` type and a `ToSources` class. A `Sources` wraps a list of `(SourcePos, Text)` pairs. [API change] A parsec `Stream` instance is provided for `Sources`. The module also exports versions of parsec's `satisfy` and other Char parsers that track source positions accurately from a `Sources` stream (or any instance of the new `UpdateSourcePos` class). Text.Pandoc.Parsing now exports these modified Char parsers instead of the ones parsec provides.

Modified parsers to use a `Sources` as stream [API change].

The readers that previously took a `Text` argument have been modified to take any instance of `ToSources`. So, they may still be used with a `Text`, but they can also be used with a `Sources` object.

In Text.Pandoc.Error, modified the constructor PandocParsecError to take a `Sources` rather than a `Text` as first argument, so parse error locations can be accurately reported.

T.P.Error: showPos, do not print "-" as source name.

mbrucher · 2022-06-18T11:36:36Z

Is this solved now? I agree that multi file handling without this option is not very useful at this time :(

bebuch · 2022-09-06T17:17:29Z

@mbrucher Looks like 6e45607 does a lot of preparation to solve it, but the actual link fixing still isn't done.

jgm · 2022-09-06T17:31:59Z

I think the best fix is not a change in the reader itself, but rather modifying the behavior of --file-scope:
Instead of

        mconcat <$> mapM (readSource >=> r readerOpts) sources'

we could have something like

        mconcat <$> mapM (\s -> readSource s >>= r readerOpts >>= rewriteLinksAndIdentifiers s sources') sources'

Here rewriteLinksAndIdentifiers would change all the identifiers in source s by adding a prefix derived from s. It would also change all links of the form FILE(#anchor)?, where FILE is in sources', accordingly.

This would produce the behavior you're going for. One drawback, though, is that even explicitly provided anchors would change, and this might cause problems for some people.

bebuch · 2022-09-06T19:01:05Z

Sounds like a reasonable idea.

As for explicitly defined anchor IDs, unfortunately I can't think of a backwards-compatible solution either. If the generated prefix depended only on the name of the file in which the user-defined anchor ID was defined, the composed anchor ID would at least be predictable. The problem is that this is not possible: to be able to compile files from different directories with Pandoc, the prefix must necessarily include the file path, which of course depends on the computer. The problem can be reduced if a common root directory is determined for all source files. This would also shorten the prefixed IDs, which makes sense anyway.

External links to such anchors in the generated document must be updated accordingly, which is a breaking change of Pandoc.

If the user changes his/her directory structure so that a new common base directory results, links to anchors must be updated again. It is then the user's responsibility to avoid such a change if they wish to do so.

However, I believe that this use case affects very few users. The problem that the change would solve probably affects many more users.
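The idea of a common root directory can be sketched as follows. This is only an illustration of the suggestion above, not something pandoc is known to do; `strip_common_root` is a hypothetical helper, and paths are assumed to use `/` as separator.

```python
def strip_common_root(paths):
    """Remove the directory components shared by all input paths.

    The file name itself is never stripped, so a single input file
    or files in the same directory keep their bare names.
    """
    parts = [p.split("/") for p in paths]
    dirs = [p[:-1] for p in parts]  # directory components only
    shared = 0
    for components in zip(*dirs):
        if len(set(components)) != 1:
            break
        shared += 1
    return ["/".join(p[shared:]) for p in parts]


# The machine-specific part of the path disappears from the result.
relative = strip_common_root(["/home/u/proj/a.md", "/home/u/proj/sub/b.md"])
```

Prefixes derived from the shortened paths stay stable across machines as long as the layout below the common root does not change.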

jgm · 2022-09-06T19:26:33Z

> However, I believe that this use case affects very few users. The problem that the change would solve probably affects many more users.

It's always very hard to know. Usually when I make a change like this, I find that all sorts of people have been relying on the old behavior!

mbrucher · 2022-09-06T19:34:57Z

I think the fact that this doesn't work limits pandoc to small products. You can't create anything big that is still maintainable.
Now, I don't think there is an issue with the links. For one file, the links are internal, no difference. Once you have more than one file, the link derives from the common folder structure (with the root removed). This behavior should make it consistent when you only have one file as well, so not sure where the breakage would occur, as this doesn't work properly for multi files at the moment.

bebuch · 2022-09-06T19:48:26Z

True, you can't know. The only option to keep backward compatibility would be to introduce a new command-line option for the changed behavior. Indeed, that might be the best solution. Maybe something like '--file-prefixed-anchors'.

This change only affects the case where `--file-scope` is used and more than one file is specified on the command line. In this case, identifiers will be prefixed with a string derived from the file path, to disambiguate them. For example, an identifier `foo` in `contents/file1.txt` will become `contents__file1.txt__foo`.

Links will be adjusted accordingly: if `file2.txt` links to `file1.txt#foo`, then the link will be changed to point to `#file1.txt__foo`. Similarly, a link to `file1.txt` will point to `#file1.txt`. A Div with an identifier derived from the file path will be added around each file's content, so that links to files will still work.

Closes #6384.

[API change]: Text.Pandoc.Shared exports `textToIdentifier`.
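The naming scheme in the examples above can be mirrored with a short sketch. It only reproduces the three examples from the description and is not pandoc's implementation; the helper names are hypothetical. Two things are assumptions: that a bare `#anchor` link is prefixed with the current file, and that link targets are matched against the input paths exactly, without resolving relative directories.

```python
def file_prefix(path):
    """Identifier prefix for a file: path separators become '__'."""
    return path.replace("/", "__")


def prefixed_id(path, ident):
    """Identifier `ident` defined in file `path`, after prefixing."""
    return f"{file_prefix(path)}__{ident}"


def rewrite_target(target, current, inputs):
    """Rewrite a link target found in file `current`.

    `inputs` is the set of files given on the command line. Targets that
    do not point at one of them are returned unchanged.
    """
    path, _, fragment = target.partition("#")
    if path == "" and fragment:
        # Assumption: a bare '#anchor' refers to the current file.
        return "#" + prefixed_id(current, fragment)
    if path not in inputs:
        return target
    if fragment:
        return "#" + prefixed_id(path, fragment)
    # A link to the file itself targets the wrapper around its content.
    return "#" + file_prefix(path)


inputs = {"file1.txt", "file2.txt"}
linked = rewrite_target("file1.txt#foo", "file2.txt", inputs)
```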

bobro99 · 2024-09-18T10:34:17Z

Hello everyone!
I also encountered a similar problem. I have a large volume of markdown documentation in gitlab. There is a need to convert the documentation to pdf. Everything is fine, but there is a problem - anchors in a link to another pdf file do not work. When the file is in markdown format, anchors of links to another file work perfectly. But when the files are already converted to pdf, then anchors in links do not work. The concept of document formation does not allow making a single pdf. Tell me, will there be any development of this task? Maybe there are new issues?

jgm · 2024-09-18T16:23:45Z

You can use a Lua filter to modify the links so that they point to the pdfs. See docs on lua filters on the website.

bobro99 · 2024-09-18T16:57:41Z

> You can use a Lua filter to modify the links so that they point to the pdfs. See docs on lua filters on the website.

Thank you for such a quick response. I will try to follow your advice.

bobro99 · 2024-09-19T12:53:51Z

Unfortunately, I didn't have enough time to figure out how to create a lua filter. I would like to ask if anyone solves this problem, please let me know.

Enivex · 2024-09-22T13:15:02Z

I've been trying to convert some ebooks to LaTeX or typst, and this still seems to be a problem when converting (single) EPUB files, since they consist of several HTML files internally.

Was a bit surprised to see this issue closed. Is there a more appropriate one for my particular use case?

bpj · 2024-09-22T15:19:36Z

@bobro99 This filter should replace .md extensions in local link targets with .pdf solving your problem.

It requires Lpeg, so pandoc 2.16.2 or newer. Recommendation is to download and install latest version at https://github.com/jgm/pandoc/releases/latest

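-- Rewrite local link targets that end in .md (optionally followed by a
-- #fragment) so that they end in .pdf instead.
-- `re` is LPeg's regex-like module, used here as a global without require.
local md_url_pattern = re.compile([===[
  url <- {~ 'file:'? ( !ext [^:] )+ ext -> '.pdf' ( '#' .* )? !. ~}
  ext <- ( '.md' &( %p / !. ) )
]===])

-- Called for every Link element; the target is left unchanged when the
-- pattern does not match (e.g. for URLs containing a scheme such as https:).
function Link (link)
  local url = link.target
  link.target = md_url_pattern:match(url) or url
  return link
end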

If you use some other file extension(s) than .md change the second line of the pattern to

ext <- ( ( '.md' / '.mkd' / '.markdown' ) &( %p / !. ) )

adding/removing /-separated extensions as needed. (Of course there must be at least one!)
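To see which targets the pattern is meant to change, here is a rough Python counterpart using a regular expression. It is not a pandoc filter and does not reproduce every edge case of the LPeg grammar; `md_target_to_pdf` is a hypothetical name used only for this illustration.

```python
import re


def md_target_to_pdf(target, extensions=(".md",)):
    """Replace a trailing Markdown extension in a local link target with .pdf.

    A '#fragment' after the extension is kept. Targets containing a colon
    other than a leading 'file:' (for example https: URLs) are left alone.
    """
    ext = "|".join(re.escape(e) for e in extensions)
    pattern = re.compile(
        rf"^(?P<base>(?:file:)?[^:#]+?)(?:{ext})(?P<frag>#.*)?$"
    )
    match = pattern.match(target)
    if match is None:
        return target
    return match.group("base") + ".pdf" + (match.group("frag") or "")


examples = [
    md_target_to_pdf("test-1.md"),
    md_target_to_pdf("test-1.md#headline-1"),
    md_target_to_pdf("https://example.com/a.md"),
    md_target_to_pdf("#another-headline"),
]
```

Passing `extensions=(".md", ".mkd", ".markdown")` corresponds to the extended second line of the pattern.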

If you still have problems, such as too many or too few URLs being changed, or if you only want to change some of the links with the .md extension, add a class .pdf to the links you want to change, replacing something like

[link text](my-file.md)

with

[link text](my-file.md){.pdf}

and use this filter instead:

local md_ext_pattern = re.compile([===[
  ext <- ( '.md' &( %p / !. ) )
]===])

function Link (link)
  if link.classes:includes('pdf') then
    local url = link.target
    link.target = re.gsub(url, md_ext_pattern, '.pdf')
    return link
  end
  return nil
end

(Of course again adding extensions in the pattern as needed.)

This uses a somewhat laxer matching method, but works with any urls as long as the .pdf class is present on the link. You may want to use both the stricter pattern and the class check:

local md_url_pattern = re.compile([===[
  url <- {~ 'file:'? ( !ext [^:] )+ ext -> '.pdf' ( '#' .* )? !. ~}
  ext <- ( '.md' &( %p / !. ) )
]===])

function Link (link)
  if link.classes:includes('pdf') then
    local url = link.target
    link.target = md_url_pattern:match(url) or url
    return link
  end
  return nil
end

bobro99 · 2024-09-23T13:08:34Z

@bpj thank you very much for your help and for developing the Lua filter!
Unfortunately, anchors in the PDF did not work.
As I understand it, the conversion of links from .md to .pdf works wonderfully, but links like file.md#myanchor do not work in PDF either. That is, I do not land on the anchor as I do in Markdown. Perhaps in PDF you need to use some other symbol instead of #; I do not know.
I used the suggested lua filter for testing:

local md_url_pattern = re.compile([===[
  url <- {~ 'file:'? ( !ext [^:] )+ ext -> '.pdf' ( '#' .* )? !. ~}
  ext <- ( '.md' &( %p / !. ) )
]===])

function Link (link)
  local url = link.target
  link.target = md_url_pattern:match(url) or url
  return link
end

bpj · 2024-09-24T17:23:24Z

@bobro99 yes fragments don't work with PDF targets. I forgot that. I don't recall if the limitation lies with the PDF format, with hyperref, or if pandoc doesn't do something it could do, which perhaps could be fixed with a filter, or if my filter breaks something pandoc would do, although the last is unlikely.

Ultimately a filter could even bypass what the LaTeX writer does with Link elements by replacing them with the link text and raw LaTeX code, but hyperref is rather finicky and its documentation is a maze. I'm sure pandoc handles some otherwise easily overlooked edge cases.

For now you can move the ( '#' .* )? !. bit outside the capture — to the right of the ~} . That will give you a link to the file but not the section.

One thing a filter could easily do would be to insert a piece of text from an attribute giving the section title or some other suitable piece of text, or the link title, in human readable form next to the link, added to the link text or instead of the link text, e.g.

[here](foo.md#my-section){append-pdf="section My section"}

or

[here](foo.md#my-section "section My section")

being transformed as if you had written

[here (section My section)](foo.pdf)

If you are only targeting PDF you can simply add such hints in the link text manually of course.
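The transformation described above can be sketched on plain values. This is only an illustration of the intended result, not a filter; `link_with_hint` is a hypothetical name, and the hint stands for whatever the `append-pdf` attribute or the link title would supply.

```python
def link_with_hint(text, target, hint=None):
    """Return (link text, target) for a PDF-only build.

    For a local .md target the extension becomes .pdf, the fragment is
    dropped because it does not work for PDF targets, and the hint is
    appended to the link text in parentheses. Other targets are untouched.
    """
    path, _, _fragment = target.partition("#")
    if not path.endswith(".md"):
        return text, target
    new_text = f"{text} ({hint})" if hint else text
    return new_text, path[: -len(".md")] + ".pdf"


rewritten = link_with_hint("here", "foo.md#my-section", "section My section")
```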

IIRC hyperref/PDF doesn't let you use pop-up titles meaningfully either :-(

jgm closed this as completed May 20, 2020

jgm reopened this May 20, 2020

tarleb added API enhancement reader labels May 21, 2020

uppalabharath mentioned this issue Nov 8, 2021

epub rejected by Google Play Books, EPUBCheck quii/learn-go-with-tests#420

Closed

jgm mentioned this issue Sep 7, 2022

Add prefixes to identifiers with --file-scope. #8282

Merged

jgm closed this as completed in #8282 Sep 19, 2022

atc0005 mentioned this issue Sep 29, 2022

v13.0.x, v14.0.x, v15.0.x fails to import via Google Play Books quii/learn-go-with-tests#598

Closed

NicklasFranzen mentioned this issue Dec 1, 2022

Multiple input files using file-scope does not handle spaces in filenames #8467

Closed

jgm mentioned this issue Aug 21, 2023

Broken TOC links when converting multiple Markdown files to epub3 #9009

Open

Enivex mentioned this issue Sep 22, 2024

Inline links seem broken when converting de-drm epub to org file #8470

Closed
