Broken links between multiple input Markdown files #6384

Closed
bebuch opened this issue May 20, 2020 · 22 comments · Fixed by #8282

@bebuch

bebuch commented May 20, 2020

We convert our documentation from Markdown to PDF using Pandoc. Usually several Markdown files are converted to one PDF file.

Links between Markdown files that are included in the same PDF are broken in the PDF.

A very similar bug report already existed in 2016 with #2719.

Minimal example

File test-1.md:

# Headline 1

some text

## Headline 2

more text

File test-2.md:

# Headline 1

some other text

## Another headline

more other text

[link to #headline-1](#headline-1) **wrong link**
refers to Headline 1 in test-1, should refer to Headline 1 in test-2

[link to #another-headline](#another-headline) **works**
as expected because anchor is unique

[link to test-1.md](test-1.md) **broken link**

[link to test-1.md#headline-1](test-1.md#headline-1) **broken link**

[link to test-1.md#headline-2](test-1.md#headline-2) **broken link**

[link to test-2.md](test-2.md) **broken link**

[link to test-2.md#headline-1](test-2.md#headline-1) **broken link**

[link to test-2.md#another-headline](test-2.md#another-headline) **broken link**

Compile it to PDF:

docker run --rm --volume "$(pwd):/data" --user $(id -u):$(id -g) pandoc/latex:2.9.2.1 test-1.md test-2.md -o test.pdf

You get the same behavior with HTML which is simpler to debug:

docker run --rm --volume "$(pwd):/data" --user $(id -u):$(id -g) pandoc/latex:2.9.2.1 test-1.md test-2.md -o test.html

Here is the HTML output:

<h1 id="headline-1">Headline 1</h1>
<p>some text</p>
<h2 id="headline-2">Headline 2</h2>
<p>more text</p>
<h1 id="headline-1-1">Headline 1</h1>
<p>some other text</p>
<h2 id="another-headline">Another headline</h2>
<p>more other text</p>
<p><a href="#headline-1">link to #headline-1</a> <strong>wrong link</strong> refers to Headline 1 in test-1, should refer to Headline 1 in test-2</p>
<p><a href="#another-headline">link to #another-headline</a> works as expected because anchor is unique</p>
<p><a href="test-1.md">link to test-1.md</a> <strong>broken link</strong></p>
<p><a href="test-1.md#headline-1">link to test-1.md#headline-1</a> <strong>broken link</strong></p>
<p><a href="test-1.md#headline-2">link to test-1.md#headline-2</a> <strong>broken link</strong></p>
<p><a href="test-2.md">link to test-2.md</a> <strong>broken link</strong></p>
<p><a href="test-2.md#headline-1">link to test-2.md#headline-1</a> <strong>broken link</strong></p>
<p><a href="test-2.md#another-headline">link to test-2.md#another-headline</a> <strong>broken link</strong></p>

Expected behavior

  1. Links to anchors without a file name should resolve, in the converted file, to the matching headline from the Markdown file that contains the link, not to the first identical headline across all Markdown files.
  2. Links to a file plus anchor, where the file is in the list of Markdown files passed to Pandoc, should point to the corresponding anchor in the converted file, not to the original file itself.
  3. Links to a file without an anchor, where the file is in the list of Markdown files passed to Pandoc, are a bit more difficult to resolve:
    1. If the referenced Markdown file starts with a headline (of any level), the link should point to that headline.
    2. Otherwise, an additional anchor must be inserted at that point in the target document, and the link should point to it. (Alternatively, such links could simply be removed, since such cases are likely to be very rare, but the first solution would be preferable if it can be implemented, because very rare is not never.)
@jgm
Owner

jgm commented May 20, 2020

I think you have a misconception about how pandoc treats its input files.

If you include multiple files on the command line, pandoc concatenates their contents and parses the result, paying no attention to what file a particular bit of markdown is found in. (In this respect it works like a lot of other unix tools.)

So the fact that the link is found in the second file makes no difference.

You might try experimenting with the --file-scope option, depending on your needs. (Note that this has certain limitations, though: e.g. with that option you can't define a link reference in one file and use it in another.)

@jgm jgm closed this as completed May 20, 2020
@bebuch
Author

bebuch commented May 20, 2020

I understand that this is how it works currently, but this is not useful behavior.

Thanks for your comments, that helped me to better understand the current behavior. The current behavior is in accordance with the documentation, so this is not a bug report, but an enhancement request.


The point is that PDF is a file format that always consists of one file. All contents are embedded in it.

In Markdown (and also HTML for example), however, the same information is handled in multiple files within a directory structure. This is absolutely necessary, because images, for example, cannot be embedded in these formats.

A document converter should be able to convert such multi-file document formats into single-file formats without breaking the 'in document' links and image inclusions.

If I understand it correctly, the conversion process is currently divided into a read and a write process. To implement the proposed behavior, the reading process for markdown files would have to be adapted.

  1. Load input files individually
  2. Parse input files individually
  3. Adjust relative links between all included files (including images)
  4. Merge ASTs

Technically, I see no reason why this could not be implemented. It would make the tool much more useful and easier to use. For example, in many cases, the use of the parameter --resource-path would become obsolete, since the inclusion with the intelligent behavior simply works.

I understand this is a major change, but I think it would save a lot of people a lot of time and energy. I would therefore be very pleased about a second review.

If you agree, please reopen the issue.

@jgm
Owner

jgm commented May 20, 2020

--file-scope does 1, 2, and 4 -- did you try it?

The one thing it won't do is rewrite heading IDs or links, so in your example the two identical headings would not receive unique anchors. I'd suggest dealing with this by using explicit identifiers (see link attribute syntax) when you have duplicated headings.

@jgm
Owner

jgm commented May 20, 2020

If you'd like, you can create a more general issue that requests that the markdown parser be made sensitive to the file containing each particular bit of content. This would require a fairly big architectural change: the readers would have to be changed to take a [(FilePath, Text)] argument instead of just a Text. Currently they simply don't have access to information about the containing file.

@bebuch
Author

bebuch commented May 20, 2020

Answer to the first comment

Step 3 is the most important one in this process. ;-)

When I compile to HTML with --file-scope, all links described above as broken are still broken. In addition, the HTML is also invalid, because now two elements have the same ID.

<h1 id="headline-1">Headline 1</h1>
<p>some text</p>
<h2 id="headline-2">Headline 2</h2>
<p>more text</p>
<h1 id="headline-1">Headline 1</h1>
<p>some other text</p>
<h2 id="another-headline">Another headline</h2>
<p>more other text</p>
<p><a href="#headline-1">link to #headline-1</a> <strong>wrong link</strong> refers to Headline 1 in test-1, should refer to Headline 1 in test-2</p>
<p><a href="#another-headline">link to #another-headline</a> <strong>works</strong> as expected because anchor is unique</p>
<p><a href="test-1.md">link to test-1.md</a> <strong>broken link</strong></p>
<p><a href="test-1.md#headline-1">link to test-1.md#headline-1</a> <strong>broken link</strong></p>
<p><a href="test-1.md#headline-2">link to test-1.md#headline-2</a> <strong>broken link</strong></p>
<p><a href="test-2.md">link to test-2.md</a> <strong>broken link</strong></p>
<p><a href="test-2.md#headline-1">link to test-2.md#headline-1</a> <strong>broken link</strong></p>
<p><a href="test-2.md#another-headline">link to test-2.md#another-headline</a> <strong>broken link</strong></p>

The important point is that if you open the Markdown files with OOO, for example, all the links work before conversion. After the conversion to HTML or PDF, the same links are broken. The links are not included in the conversion, so the conversion is incomplete. (This always assumes that the Markdown files are treated as one document that is spread over several files.)

Answer to the second comment

If I interpret your description and the behavior of --file-scope correctly, then step 3 can be done before merging the ASTs.

I haven't checked the source code for this yet, but I would suspect that the merging of the ASTs takes place at a location where the original filenames (including path) are available. If this is the case, the addition of this step would be less complex.

I don't know if I can get around to checking it today, but probably sometime next week. Thanks for the feedback!

@jgm
Owner

jgm commented May 20, 2020

> step 3 can be done before merging the ASTs.

Yes, that's true, when --file-scope is used we have

        mconcat <$> mapM (readSource >=> r readerOpts) sources'

It would be possible, for example, to insert an identifier prefix derived from the file name before each internal identifier. Internal links to rewritten ids could also be rewritten.

I'm not so sure about links to other parts of the document (in other files). You are expecting it to work with a link to other-markdown-file.md#ident, so I guess your idea is that if other-markdown-file.md is one of the files on the command line, this would get rewritten to something like #other-markdown-file-md-ident.

That would mean that, for example, you couldn't link to the document's source file from the document -- perhaps an undesirable consequence for some users.
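
A minimal sketch of the rewriting described above, written in Python for illustration rather than in pandoc's Haskell, and not taken from pandoc's source. The names `file_prefix` and `rewrite_target`, and the slug rule (runs of non-alphanumeric characters become `-`), are assumptions made only for this example.

```python
import re

def file_prefix(path):
    """Turn a file name into an identifier-safe prefix (hypothetical rule)."""
    return re.sub(r"[^A-Za-z0-9]+", "-", path).strip("-").lower()

def rewrite_target(target, inputs):
    """Rewrite FILE#ident to #prefix-ident when FILE is one of the inputs."""
    path, sep, fragment = target.partition("#")
    if not sep or path not in inputs:
        # Plain anchors, file-only links and external targets are left
        # unchanged in this sketch.
        return target
    return "#" + file_prefix(path) + "-" + fragment

inputs = {"other-markdown-file.md", "test-1.md"}
print(rewrite_target("other-markdown-file.md#ident", inputs))
# -> #other-markdown-file-md-ident
print(rewrite_target("https://example.com/page.md#ident", inputs))
# -> https://example.com/page.md#ident
```

Under this rule a link such as `test-1.md#intro` always becomes an internal anchor, so it can no longer reach the source file itself, which is the drawback mentioned above.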

Paging @jkr who added the file-scope option originally and may have some thoughts on whether it should be changed in this way.

@jgm jgm reopened this May 20, 2020
@bebuch
Author

bebuch commented May 20, 2020

> so I guess your idea is that if other-markdown-file.md is one of the files on the command line, this would get rewritten to something like #other-markdown-file-md-ident.

Exactly!

> I'm not so sure about links to other parts of the document (in other files). You are expecting it to work with a link to other-markdown-file.md#ident

These links can remain unchanged. The linked data is not part of the converted document, so it is acceptable for such a link to break when the linked data is not copied along separately.

Ideally, you could convert them to absolute HTTP paths using an additional Pandoc command line option like --rebase-relative-links-on 'https://github.com/jgm/pandoc/tree/master/doc', but that might need to be addressed in a separate issue afterwards.

jgm added a commit that referenced this issue May 9, 2021
Previously, when multiple file arguments were provided, pandoc
simply concatenated them and passed the contents to the readers,
which took a Text argument.

As a result, the readers had no way of knowing which file
was the source of any particular bit of text.  This meant that
we couldn't report accurate source positions on errors or
include accurate source positions as attributes in the AST.
More seriously, it meant that we couldn't resolve resource
paths relative to the files containing them
(see e.g. #5501, #6632, #6384, #3752).

Add Text.Pandoc.Sources (exported module), with a `Sources` type
and a `ToSources` class.  A `Sources` wraps a list of `(SourcePos,
Text)` pairs. [API change] A parsec `Stream` instance is provided for
`Sources`.  The module also exports versions of parsec's `satisfy` and
other Char parsers that track source positions accurately from a
`Sources` stream (or any instance of the new `UpdateSourcePos` class).

Text.Pandoc.Parsing now exports these modified Char parsers instead of
the ones parsec provides.  Modified parsers to use a `Sources` as stream
[API change].

The readers that previously took a `Text` argument have been
modified to take any instance of `ToSources`. So, they may still
be used with a `Text`, but they can also be used with a `Sources`
object.

In Text.Pandoc.Error, modified the constructor PandocParsecError
to take a `Sources` rather than a `Text` as first argument,
so parse error locations can be accurately reported.

T.P.Error: showPos, do not print "-" as source name.
jgm added a commit that referenced this issue May 10, 2021
@mbrucher

Is this solved now? I agree that multi file handling without this option is not very useful at this time :(

@bebuch
Author

bebuch commented Sep 6, 2022

@mbrucher Looks like 6e45607 does a lot of preparation to solve it, but the actual link fixing still isn't done.

@jgm
Owner

jgm commented Sep 6, 2022

I think the best fix is not a change in the reader itself, but rather modifying the behavior of --file-scope:
Instead of

        mconcat <$> mapM (readSource >=> r readerOpts) sources'

we could have something like

        mconcat <$> mapM (\s -> readSource s >>= r readerOpts >>= rewriteLinksAndIdentifiers s sources') sources'

Here rewriteLinksAndIdentifiers would change all the identifiers in source s by adding a prefix derived from s. It would also change all links of the form FILE(#anchor)? where FILE is in sources' accordingly.

This would produce the behavior you're going for. One drawback, though, is that even explicitly provided anchors would change, and this might cause problems for some people.

@bebuch
Author

bebuch commented Sep 6, 2022

Sounds like a reasonable idea.

As for explicitly defined anchor IDs, unfortunately I can't think of a backwards-compatible solution either. As long as the generated prefix depends only on the name of the file in which the user-defined anchor ID is defined, the result is simply a new, composed anchor ID. The problem is that this is not possible: to compile files from different directories with Pandoc, the prefix must include the file path, and that path of course depends on the computer. The problem can be reduced by determining a common root directory for all source files. This would also shorten the prefixed IDs, which makes sense anyway.

External links to such anchors in the generated document must be updated accordingly, which is a breaking change of Pandoc.

If the user changes his/her directory structure so that a new common base directory results, links to anchors must be updated again. It is then the user's responsibility to avoid such a change if they wish to do so.
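
The common-root idea can be sketched as follows. This is an illustration in Python, not pandoc code: `path_prefixes` is a hypothetical helper, the example paths are invented, and the sketch assumes absolute POSIX-style paths.

```python
import posixpath

def path_prefixes(sources):
    """Map each absolute source path to its path relative to the common root."""
    # The common root is the deepest directory that contains every source file.
    root = posixpath.commonpath([posixpath.dirname(s) for s in sources])
    return {s: posixpath.relpath(s, root) for s in sources}

# Same project checked out on two machines: the prefixes agree.
alice = ["/home/alice/docs/intro.md", "/home/alice/docs/api/usage.md"]
build = ["/srv/build/docs/intro.md", "/srv/build/docs/api/usage.md"]
print(list(path_prefixes(alice).values()))  # ['intro.md', 'api/usage.md']
print(list(path_prefixes(build).values()))  # ['intro.md', 'api/usage.md']

# Adding a file from a sibling directory moves the common root up one level,
# so the prefix for intro.md changes and links to its anchors need updating.
wider = alice + ["/home/alice/notes/todo.md"]
print(path_prefixes(wider)["/home/alice/docs/intro.md"])  # docs/intro.md
```

Relative prefixes remove the dependence on the machine, but not the dependence on which files are compiled together.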

However, I believe that this use case affects very few users. The problem that the change would solve probably affects many more users.

@jgm
Owner

jgm commented Sep 6, 2022

> However, I believe that this use case affects very few users. The problem that the change would solve probably affects many more users.

It's always very hard to know. Usually when I make a change like this, I find that all sorts of people have been relying on the old behavior!

@mbrucher

mbrucher commented Sep 6, 2022

I think the fact that this doesn't work limits pandoc to small products. You can't create anything big that is still maintainable.
Now, I don't think there is an issue with the links. For one file, the links are internal, no difference. Once you have more than one file, the link derives from the common folder structure (with the root removed). This behavior should make it consistent when you only have one file as well, so not sure where the breakage would occur, as this doesn't work properly for multi files at the moment.

@bebuch
Author

bebuch commented Sep 6, 2022 via email

jgm added a commit that referenced this issue Sep 7, 2022
This change only affects the case where `--file-scope` is used
and more than one file is specified on the command line.

In this case, identifiers will be prefixed with a string
derived from the file path, to disambiguate them. For example,
an identifier `foo` in `contents/file1.txt` will become
`contents__file1.txt__foo`.  Links will be adjusted accordingly:
if `file2.txt` links to `file1.txt#foo`, then the link will
be changed to point to `#file1.txt__foo`.  Similarly, a link
to `file1.txt` will point to `#file1.txt`.  A Div with an
identifier derived from the file path will be added around
each file's content, so that links to files will still work.

Closes #6384.

[API change]: Text.Pandoc.Shared exports `textToIdentifier`.
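
The scheme in this commit message can be approximated with the sketch below. It is written in Python for illustration and is not pandoc's implementation: the helper names are hypothetical, the `__` joining rule is inferred only from the examples in the message, and the handling of same-file anchors such as `#foo` is an assumption. Source paths are assumed to be normalized and relative to the working directory.

```python
import posixpath

SEP = "__"

def file_id(path):
    """Identifier for a whole file: path separators become '__' (assumed rule)."""
    return posixpath.normpath(path).replace("/", SEP)

def prefixed_id(path, ident):
    """Identifier `ident` defined in `path`, disambiguated by the file prefix."""
    return file_id(path) + SEP + ident

def rewrite_link(target, current, sources):
    """Rewrite a link found in file `current` if it points into `sources`."""
    path, _, fragment = target.partition("#")
    if not path:
        # Same-file anchor such as "#foo": prefix with the current file.
        return "#" + prefixed_id(current, fragment) if fragment else target
    # Resolve the link relative to the directory of the file containing it.
    resolved = posixpath.normpath(posixpath.join(posixpath.dirname(current), path))
    if resolved not in sources:
        return target  # not part of the combined document: leave as is
    return "#" + (prefixed_id(resolved, fragment) if fragment else file_id(resolved))

sources = {"file1.txt", "file2.txt"}
print(prefixed_id("contents/file1.txt", "foo"))             # contents__file1.txt__foo
print(rewrite_link("file1.txt#foo", "file2.txt", sources))  # #file1.txt__foo
print(rewrite_link("file1.txt", "file2.txt", sources))      # #file1.txt
```

The file-only case works because each file's content is wrapped in a Div carrying the file-derived identifier, as the message describes.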
@jgm jgm closed this as completed in #8282 Sep 19, 2022
jgm added a commit that referenced this issue Sep 19, 2022
@bobro99

bobro99 commented Sep 18, 2024

Hello everyone!
I also encountered a similar problem. I have a large volume of markdown documentation in gitlab. There is a need to convert the documentation to pdf. Everything is fine, but there is a problem - anchors in a link to another pdf file do not work. When the file is in markdown format, anchors of links to another file work perfectly. But when the files are already converted to pdf, then anchors in links do not work. The concept of document formation does not allow making a single pdf. Tell me, will there be any development of this task? Maybe there are new issues?

@jgm
Owner

jgm commented Sep 18, 2024

You can use a Lua filter to modify the links so that they point to the pdfs. See docs on lua filters on the website.

@bobro99

bobro99 commented Sep 18, 2024

> You can use a Lua filter to modify the links so that they point to the pdfs. See docs on lua filters on the website.

Thank you for such a quick response. I will try to follow your advice.

@bobro99

bobro99 commented Sep 19, 2024

Unfortunately, I didn't have enough time to figure out how to create a lua filter. I would like to ask if anyone solves this problem, please let me know.

@Enivex

Enivex commented Sep 22, 2024

I've been trying to convert some ebooks to LaTeX or typst, and this still seems to be a problem when converting (single) EPUB files, since they consist of several HTML files internally.

Was a bit surprised to see this issue closed. Is there a more appropriate one for my particular use case?

@bpj

bpj commented Sep 22, 2024

@bobro99 This filter should replace .md extensions in local link targets with .pdf solving your problem.

It requires LPeg, so pandoc 2.16.2 or newer. The recommendation is to download and install the latest version from https://github.com/jgm/pandoc/releases/latest

local md_url_pattern = re.compile([===[
  url <- {~ 'file:'? ( !ext [^:] )+ ext -> '.pdf' ( '#' .* )? !. ~}
  ext <- ( '.md' &( %p / !. ) )
]===])

function Link (link)
  local url = link.target
  link.target = md_url_pattern:match(url) or url
  return link
end

If you use some other file extension(s) than .md change the second line of the pattern to

ext <- ( ( '.md' / '.mkd' / '.markdown' ) &( %p / !. ) )

adding/removing /-separated extensions as needed. (Of course there must be at least one!)

If you still have problems, such as too many or too few URLs being changed, or if you only want to change some of the links with the .md extension, add a class .pdf to the links you want to change, replacing something like

[link text](my-file.md)

with

[link text](my-file.md){.pdf}

and use this filter instead:

local md_ext_pattern = re.compile([===[
  ext <- ( '.md' &( %p / !. ) )
]===])

function Link (link)
  if link.classes:includes('pdf') then
    local url = link.target
    link.target = re.gsub(url, md_ext_pattern, '.pdf')
    return link
  end
  return nil
end

(Of course again adding extensions in the pattern as needed.)

This uses a somewhat laxer matching method, but works with any URLs as long as the .pdf class is present on the link. You may want to use both the stricter pattern and the class check:

local md_url_pattern = re.compile([===[
  url <- {~ 'file:'? ( !ext [^:] )+ ext -> '.pdf' ( '#' .* )? !. ~}
  ext <- ( '.md' &( %p / !. ) )
]===])

function Link (link)
  if link.classes:includes('pdf') then
    local url = link.target
    link.target = md_url_pattern:match(url) or url
    return link
  end
  return nil
end

@bobro99

bobro99 commented Sep 23, 2024

@bpj thank you very much for your help and participation in the development of the filter-lua!
But, unfortunately, anchors in pdf did not work.
As I understand, the conversion of links from md to pdf is wonderful, but links like file.md#myanchor do not work in pdf either. That is, I do not get to the anchor as it happens in md. Perhaps in pdf you need to use some other symbol, not # . I do not know.
I used the suggested lua filter for testing:

local md_url_pattern = re.compile([===[
  url <- {~ 'file:'? ( !ext [^:] )+ ext -> '.pdf' ( '#' .* )? !. ~}
  ext <- ( '.md' &( %p / !. ) )
]===])

function Link (link)
  local url = link.target
  link.target = md_url_pattern:match(url) or url
  return link
end

@bpj

bpj commented Sep 24, 2024

@bobro99 yes fragments don't work with PDF targets. I forgot that. I don't recall if the limitation lies with the PDF format, with hyperref, or if pandoc doesn't do something it could do, which perhaps could be fixed with a filter, or if my filter breaks something pandoc would do, although the last is unlikely.

Ultimately a filter could even bypass what the LaTeX writer does with Link elements by replacing them with the link text and raw LaTeX code, but hyperref is rather finicky and its documentation is a maze. I'm sure pandoc handles some otherwise easily overlooked edge cases.

For now you can move the ( '#' .* )? !. bit outside the capture — to the right of the ~} . That will give you a link to the file but not the section.

One thing a filter could easily do would be to insert a piece of text from an attribute giving the section title or some other suitable piece of text, or the link title, in human readable form next to the link, added to the link text or instead of the link text, e.g.

[here](foo.md#my-section){append-pdf="section My section"}

or

[here](foo.md#my-section "section My section")

being transformed as if you had written

[here (section My section)](foo.pdf)

If you are only targeting PDF you can simply add such hints in the link text manually of course.

IIRC hyperref/PDF doesn't let you use pop-up titles meaningfully either :-(
