Broken links between multiple input Markdown files #6384
I think you have a misconception about how pandoc treats its input files. If you include multiple files on the command line, pandoc concatenates their contents and parses the result, paying no attention to what file a particular bit of markdown is found in. (In this respect it works like a lot of other unix tools.) So the fact that the link is found in the second file makes no difference. You might try experimenting with the |
I understand that this is how it works currently, but this is not useful behavior. Thanks for your comments, that helped me to better understand the current behavior. The current behavior is in accordance with the documentation, so this is not a bug report, but an enhancement request. The point is that PDF is a file format that always consists of one file. All contents are embedded in it. In Markdown (and also HTML for example), however, the same information is handled in multiple files within a directory structure. This is absolutely necessary, because images, for example, cannot be embedded in these formats. A document converter should be able to convert such multi-file document formats into single-file formats without breaking the 'in document' links and image inclusions. If I understand it correctly, the conversion process is currently divided into a read and a write process. To implement the proposed behavior, the reading process for markdown files would have to be adapted.
Technically, I see no reason why this could not be implemented. It would make the tool much more useful and easier to use. For example, in many cases, the use of the parameter I understand this is a major change, but I think it would save a lot of people a lot of time and energy. I would therefore be very pleased about a second review. If you agree, please reopen the issue. |
The one thing it won't do is rewrite heading IDs or links, so in your example the two identical headings would not receive unique anchors. I'd suggest dealing with this by using explicit identifiers (see link attribute syntax) when you have duplicated headings. |
If you'd like, you can create a more general issue that requests that the markdown parser be made sensitive to the file containing each particular bit of content. This would require a fairly big architectural change: the readers would have to be changed to take a |
Answer to the first comment: Step 3 is the most important one in this process. ;-) When I compile to HTML with
<h1 id="headline-1">Headline 1</h1>
<p>some text</p>
<h2 id="headline-2">Headline 2</h2>
<p>more text</p>
<h1 id="headline-1">Headline 1</h1>
<p>some other text</p>
<h2 id="another-headline">Another headline</h2>
<p>more other text</p>
<p><a href="#headline-1">link to #headline-1</a> <strong>wrong link</strong> refers to Headĺine 1 in test-1, shoud refer to Headĺine 1 in test-2</p>
<p><a href="#another-headline">link to #another-headline</a> <strong>works</strong> as expected because anchor is unique</p>
<p><a href="test-1.md">link to test-1.md</a> <strong>broken link</strong></p>
<p><a href="test-1.md#headline-1">link to test-1.md#headline-1</a> <strong>broken link</strong></p>
<p><a href="test-1.md#headline-2">link to test-1.md#headline-2</a> <strong>broken link</strong></p>
<p><a href="test-2.md">link to test-2.md</a> <strong>broken link</strong></p>
<p><a href="test-2.md#headline-1">link to test-2.md#headline-1</a> <strong>broken link</strong></p>
<p><a href="test-2.md#another-headline">link to test-2.md#another-headline</a> <strong>broken link</strong></p>
The important point is that if you open the Markdown files with OOO for example, all the links work before conversion. After the conversion to HTML or PDF, the same links are broken. The links are not carried over by the conversion, so the conversion is incomplete. (This always assumes that the Markdown files are treated as one document spread over several files.)
Answer to the second comment: If I interpret your description and the behavior of I haven't checked the source code for this yet, but I suspect that the merging of the ASTs takes place at a location where the original file names (including paths) are available. If this is the case, adding this step would be less complex. I don't know if I can get around to checking it today, but probably sometime next week. Thanks for the feedback! |
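The two failure modes visible in the HTML output above (an `id` that occurs twice, and links that still name a source file) can be checked mechanically. The following is a minimal sketch using only the Python standard library; the names `AnchorAudit` and `audit` are made up for illustration and are not part of pandoc.

```python
from collections import Counter
from html.parser import HTMLParser


class AnchorAudit(HTMLParser):
    """Collect every id attribute and every <a href> seen in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.ids = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)
            elif name == "href" and value and tag == "a":
                self.hrefs.append(value)


def audit(html, source_files):
    """Return (ids used more than once, hrefs that still point at an input file)."""
    parser = AnchorAudit()
    parser.feed(html)
    duplicates = sorted(i for i, n in Counter(parser.ids).items() if n > 1)
    stale = [h for h in parser.hrefs if h.split("#", 1)[0] in source_files]
    return duplicates, stale


sample = """
<h1 id="headline-1">Headline 1</h1>
<h1 id="headline-1">Headline 1</h1>
<h2 id="another-headline">Another headline</h2>
<p><a href="#headline-1">ambiguous</a></p>
<p><a href="test-1.md#headline-2">stale</a></p>
"""

duplicates, stale = audit(sample, {"test-1.md", "test-2.md"})
print(duplicates)  # ['headline-1']
print(stale)       # ['test-1.md#headline-2']
```

Running a check like this on the converted output would flag exactly the links marked "wrong link" and "broken link" in the example.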
Yes, that's true, when mconcat <$> mapM (readSource >=> r readerOpts) sources' It would be possible, for example, to insert an identifier prefix derived from the file name before each internal identifier. Internal links to rewritten ids could also be rewritten. I'm not so sure about links to other parts of the document (in other files). You are expecting it to work with a link to That would mean that, for example, you couldn't link to the document's source file from the document -- perhaps an undesirable consequence for some users. Paging @jkr who added the file-scope option originally and may have some thoughts on whether it should be changed in this way. |
Exactly!
These links can remain unchanged. The linked data is not part of the converted document, so it is okay if it breaks if the linked data is not copied separately. Ideally, you could convert them to absolute HTTP paths using an additional Pandoc command line option like |
Previously, when multiple file arguments were provided, pandoc simply concatenated them and passed the contents to the readers, which took a Text argument. As a result, the readers had no way of knowing which file was the source of any particular bit of text. This meant that we couldn't report accurate source positions on errors or include accurate source positions as attributes in the AST. More seriously, it meant that we couldn't resolve resource paths relative to the files containing them (see e.g. #5501, #6632, #6384, #3752). Add Text.Pandoc.Sources (exported module), with a `Sources` type and a `ToSources` class. A `Sources` wraps a list of `(SourcePos, Text)` pairs. [API change] A parsec `Stream` instance is provided for `Sources`. The module also exports versions of parsec's `satisfy` and other Char parsers that track source positions accurately from a `Sources` stream (or any instance of the new `UpdateSourcePos` class). Text.Pandoc.Parsing now exports these modified Char parsers instead of the ones parsec provides. Modified parsers to use a `Sources` as stream [API change]. The readers that previously took a `Text` argument have been modified to take any instance of `ToSources`. So, they may still be used with a `Text`, but they can also be used with a `Sources` object. In Text.Pandoc.Error, modified the constructor PandocParsecError to take a `Sources` rather than a `Text` as first argument, so parse error locations can be accurately reported. T.P.Error: showPos, do not print "-" as source name.
Is this solved now? I agree that multi file handling without this option is not very useful at this time :( |
I think the best fix is not a change in the reader itself, but rather modifying the behavior of mconcat <$> mapM (readSource >=> r readerOpts) sources' we could have something like mconcat <$> mapM (\s -> readSource s >>= r readerOpts >>= rewriteLinksAndIdentifiers s sources') sources' Here This would produce the behavior you're going for. One drawback, though, is that even explicitly provided anchors would change, and this might cause problems for some people. |
Sounds like a reasonable idea. As for explicitly defined anchor IDs, unfortunately I can't think of a backwards compatible solution either. As long as the generated prefix ID depends exclusively on the file name in which the user-defined anchor ID was defined, a new composed anchor ID results. The problem is that this is not possible. To be able to compile files from different directories with Pandoc, the prefix ID must necessarily include the file path. This of course depends on the computer. The problem can be reduced if a common root directory is determined for all source files. This would also shorten the length of the (prefix-)IDs, which makes sense anyway. External links to such anchors in the generated document must be updated accordingly, which is a breaking change of Pandoc. If the user changes his/her directory structure so that a new common base directory results, links to anchors must be updated again. It is then the user's responsibility to avoid such a change if they wish to do so. However, I believe that this use case affects very few users. The problem that the change would solve probably affects many more users. |
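The common-root idea from the comment above can be sketched as follows. This is only an illustration of the proposal, not pandoc's behavior: the function name `file_prefixes` and the `__` separator are assumptions, and the example paths are invented.

```python
import posixpath


def file_prefixes(paths):
    """Map each input path to a prefix relative to the inputs' common directory."""
    absolute = [posixpath.abspath(p) for p in paths]
    root = posixpath.commonpath([posixpath.dirname(p) for p in absolute])
    return {
        original: posixpath.relpath(full, root).replace("/", "__")
        for original, full in zip(paths, absolute)
    }


before = file_prefixes(["/srv/docs/guide/a.md", "/srv/docs/guide/b.md"])
after = file_prefixes(
    ["/srv/docs/guide/a.md", "/srv/docs/guide/b.md", "/srv/docs/ref/c.md"]
)
print(before["/srv/docs/guide/a.md"])  # a.md
print(after["/srv/docs/guide/a.md"])   # guide__a.md
```

The two calls show the drawback mentioned in the comment: adding a file from another directory moves the common root, so the prefix of an unchanged file changes and external links to its anchors would have to be updated.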
It's always very hard to know. Usually when I make a change like this, I find that all sorts of people have been relying on the old behavior! |
I think the fact that this doesn't work limits pandoc to small products. You can't create anything big that is still maintainable. |
True, you can't know. The only option to keep backward compatibility would
be to introduce a new command line option for the changed behavior. Indeed
that might be the best solution. Maybe something like
'--file-prefixed-anchors'.
|
This change only affects the case where `--file-scope` is used and more than one file is specified on the command line. In this case, identifiers will be prefixed with a string derived from the file path, to disambiguate them. For example, an identifier `foo` in `contents/file1.txt` will become `contents__file1.txt__foo`. Links will be adjusted accordingly: if `file2.txt` links to `file1.txt#foo`, then the link will be changed to point to `#file1.txt__foo`. Similarly, a link to `file1.txt` will point to `#file1.txt`. A Div with an identifier derived from the file path will be added around each file's content, so that links to files will still work. Closes #6384. [API change]: Text.Pandoc.Shared exports `textToIdentifier`.
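The rewriting rules described in the commit message above can be modeled in a few lines. This is a simplified sketch built only from the examples in that message; pandoc's actual implementation is in Haskell and derives the prefix via `textToIdentifier`, which may normalize paths differently. The function names below, and the handling of same-file links such as `#foo`, are assumptions made for illustration.

```python
import posixpath


def path_prefix(path):
    """Turn a file path into an identifier prefix: contents/file1.txt -> contents__file1.txt."""
    return posixpath.normpath(path).replace("/", "__")


def prefix_identifier(path, identifier):
    """Prefix an identifier with the prefix of the file that defines it."""
    return f"{path_prefix(path)}__{identifier}"


def rewrite_link(target, linking_file, input_files):
    """Rewrite a link target if it points into one of the input files.

    Targets that do not resolve to an input file are returned unchanged.
    """
    path, _, fragment = target.partition("#")
    if not path:
        # Same-file link such as "#foo" (assumed behavior): scope it to the linking file.
        return "#" + prefix_identifier(linking_file, fragment) if fragment else target
    resolved = posixpath.normpath(
        posixpath.join(posixpath.dirname(linking_file), path)
    )
    if resolved not in input_files:
        return target
    if fragment:
        return "#" + prefix_identifier(resolved, fragment)
    return "#" + path_prefix(resolved)


inputs = {"file1.txt", "file2.txt"}
print(prefix_identifier("contents/file1.txt", "foo"))        # contents__file1.txt__foo
print(rewrite_link("file1.txt#foo", "file2.txt", inputs))    # #file1.txt__foo
print(rewrite_link("file1.txt", "file2.txt", inputs))        # #file1.txt
```

Links to files that are not among the inputs, and external URLs, pass through untouched in this model.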
Hello everyone! |
You can use a Lua filter to modify the links so that they point to the pdfs. See docs on lua filters on the website. |
Thank you for such a quick response. I will try to follow your advice. |
Unfortunately, I didn't have enough time to figure out how to create a lua filter. I would like to ask if anyone solves this problem, please let me know. |
I've been trying to convert some ebooks to LaTeX or typst, and this still seems to be a problem when converting (single) EPUB files, since they consist of several HTML files internally. Was a bit surprised to see this issue closed. Is there a more appropriate one for my particular use case? |
@bobro99 This filter should replace It requires LPeg, so pandoc 2.16.2 or newer. The recommendation is to download and install the latest version at https://github.com/jgm/pandoc/releases/latest
local md_url_pattern = re.compile([===[
url <- {~ 'file:'? ( !ext [^:] )+ ext -> '.pdf' ( '#' .* )? !. ~}
ext <- ( '.md' &( %p / !. ) )
]===])
function Link (link)
local url = link.target
link.target = md_url_pattern:match(url) or url
return link
end
If you use some other file extension(s) than
adding/removing If you still have problems, such as too many or too few URLs being changed/you only want to change some links with the [link text](my-file.md) with [link text](my-file.md){.pdf} and use this filter instead:
local md_ext_pattern = re.compile([===[
ext <- ( '.md' &( %p / !. ) )
]===])
function Link (link)
if link.classes:includes('pdf') then
local url = link.target
link.target = re.gsub(url, md_ext_pattern, '.pdf')
return link
end
return nil
end
(Of course again adding extensions in the pattern as needed.) This uses a somewhat laxer matching method, but works with any URLs as long as the
local md_url_pattern = re.compile([===[
url <- {~ 'file:'? ( !ext [^:] )+ ext -> '.pdf' ( '#' .* )? !. ~}
ext <- ( '.md' &( %p / !. ) )
]===])
function Link (link)
if link.classes:includes('pdf') then
local url = link.target
link.target = md_url_pattern:match(url) or url
return link
end
return nil
end |
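For readers who want to try the matching rule outside pandoc, here is a rough Python counterpart of the stricter pattern used in the filters above: an optional `file:` scheme, a path without colons ending in `.md`, and an optional `#fragment`. It is only an approximation (a backtracking regular expression rather than an LPeg grammar, and it excludes `#` from the path), so edge cases such as doubled extensions may behave differently. The name `md_target_to_pdf` is invented for this sketch.

```python
import re

# Optional "file:" scheme, a path with no ":" or "#", a final ".md",
# and an optional fragment that is carried over unchanged.
MD_TARGET = re.compile(r"(?P<scheme>file:)?(?P<stem>[^:#]+)\.md(?P<fragment>#.*)?")


def md_target_to_pdf(url):
    """Return url with a trailing .md extension swapped for .pdf, else url unchanged."""
    match = MD_TARGET.fullmatch(url)
    if match is None:
        return url
    scheme = match.group("scheme") or ""
    fragment = match.group("fragment") or ""
    return f"{scheme}{match.group('stem')}.pdf{fragment}"


print(md_target_to_pdf("my-file.md"))                     # my-file.pdf
print(md_target_to_pdf("docs/guide.md#setup"))            # docs/guide.pdf#setup
print(md_target_to_pdf("https://example.com/readme.md"))  # https://example.com/readme.md
```

URLs with a scheme other than `file:` contain a colon and are therefore left alone, which mirrors the intent of the `[^:]` class in the Lua pattern.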
@bpj thank you very much for your help and for taking part in developing the Lua filter!
|
@bobro99 yes fragments don't work with PDF targets. I forgot that. I don't recall if the limitation lies with the PDF format, with hyperref, or if pandoc doesn't do something it could do, which perhaps could be fixed with a filter, or if my filter breaks something pandoc would do, although the last is unlikely. Ultimately a filter could even bypass what the LaTeX writer does with Link elements by replacing them with the link text and raw LaTeX code, but hyperref is rather finicky and its documentation is a maze. I'm sure pandoc handles some otherwise easily overlooked edge cases. For now you can move the One thing a filter could easily do would be to insert a piece of text from an attribute giving the section title or some other suitable piece of text, or the link title, in human readable form next to the link, added to the link text or instead of the link text, e.g. [here](foo.md#my-section){append-pdf="section My section"} or [here](foo.md#my-section "section My section") being transformed as if you had written [here (section My Section)](foo.pdf) If you are only targeting PDF you can simply add such hints in the link text manually of course. IIRC hyperref/PDF doesn't let you use pop-up titles meaningfully either :-( |
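The hint-appending idea from the comment above can be modeled for a single link as follows. This is a plain-Python sketch of the transformation only, not a pandoc filter; the function name `append_pdf_hint` is hypothetical, and dropping the fragment is an assumption based on the remark that fragments do not work with PDF targets.

```python
def append_pdf_hint(text, target, attributes):
    """Model the proposed transformation for one link.

    text       -- the visible link text
    target     -- the link target, e.g. "foo.md#my-section"
    attributes -- dict of link attributes; "append-pdf" carries the hint
    Returns the new (text, target) pair.
    """
    hint = attributes.get("append-pdf")
    path, _, _fragment = target.partition("#")
    if hint is None or not path.endswith(".md"):
        return text, target
    # Drop the fragment and point at the PDF; the hint tells the reader where to look.
    return f"{text} ({hint})", path[: -len(".md")] + ".pdf"


print(append_pdf_hint("here", "foo.md#my-section", {"append-pdf": "section My section"}))
# ('here (section My section)', 'foo.pdf')
```

Links without the attribute, or with a target that is not a `.md` file, are returned unchanged.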
We convert our documentation from Markdown to PDF using Pandoc. Usually several Markdown files are converted to one PDF file.
Links between Markdown files that are included in the same PDF are broken in the PDF.
A very similar bug report already existed in 2016 with #2719.
Minimal example
File test-1.md:
File test-2.md:
Compile it to PDF:
You get the same behavior with HTML which is simpler to debug:
Here is the HTML output:
Expected behavior