strange results for quote marks in pdf metadata #5812

Closed
brainchild0 opened this issue Oct 11, 2019 · 14 comments

@brainchild0

brainchild0 commented Oct 11, 2019

Consider the following command:

pandoc -o x.pdf <<< '---
title: |+
 "One" -- "Two" --- "Three"
---'

The result is a simple document with a title formatted with curved quote marks, and an en- and em-dash:

[image: md-title]

However, the effect in the PDF metadata is less pleasant:

$ pdfinfo x.pdf 
Title:          ``One'' – ``Two'' — ``Three''
Subject:        
Keywords:       
Author:         
Creator:        LaTeX via pandoc
Producer:       pdfTeX-1.40.18
CreationDate:   Fri Oct 11 09:38:00 2019 EDT
ModDate:        Fri Oct 11 09:38:00 2019 EDT
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      612 x 792 pts (letter)
Page rot:       0
File size:      39795 bytes
Optimized:      no
PDF version:    1.5

The dashes were translated nicely, but the quotation marks are handled strangely. What are the possibilities for creating a plain string that resembles the printed title as cleanly as possible?

@agusmba
Contributor

agusmba commented Oct 11, 2019

I wouldn't say strangely since that is the standard notation for left and right quotation marks in latex.
Whether in this particular case the quotes should be removed is a different matter.

@jgm
Owner

jgm commented Oct 11, 2019

We use \texorpdfstring to ensure that regular tex commands don't go into the PDF bookmarks.
It seems that the usual quotation ligatures also don't work in this context.
You may find that it works properly if you use -t latex-smart --pdf-engine=xelatex. In this case pandoc won't use ligatures (because of -smart), and the Unicode quotes should be passed through unchanged.
I don't know whether a change to the defaults is called for, because Unicode quotes may not work without xelatex.
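
As a rough illustration of the plain-text form the metadata string would ideally take, here is a minimal Python sketch that maps the standard TeX quote and dash ligatures to the Unicode characters they typeset. The helper name `tex_ligatures_to_unicode` is hypothetical, and this is not how pandoc itself builds the PDF string; it only assumes the usual ligature conventions mentioned above.

```python
# Hypothetical helper, not part of pandoc: map TeX quote/dash ligatures
# to the Unicode characters they typeset. Order matters: longer
# ligatures are replaced before their shorter prefixes.
_LIGATURES = [
    ("``", "\u201c"),   # left double quotation mark
    ("''", "\u201d"),   # right double quotation mark
    ("---", "\u2014"),  # em dash (must come before the en-dash rule)
    ("--", "\u2013"),   # en dash
    ("`", "\u2018"),    # left single quotation mark
    ("'", "\u2019"),    # right single quotation mark / apostrophe
]


def tex_ligatures_to_unicode(text: str) -> str:
    """Replace each TeX ligature in text with its Unicode equivalent."""
    for tex, uni in _LIGATURES:
        text = text.replace(tex, uni)
    return text


# The Title string reported by pdfinfo in the original report.
title = "``One'' \u2013 ``Two'' \u2014 ``Three''"
print(tex_ligatures_to_unicode(title))  # “One” – “Two” — “Three”
```

A simple ordered replacement like this cannot tell an apostrophe from a closing single quote, which is one reason it is only a sketch.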

@brainchild0
Author

Why would use of Unicode be dependent on a particular LaTeX engine? Are other engines unable to support characters outside of ASCII? Do non-English languages lack support in all engines but one?

Assuming the Unicode characters are not presenting a particular issue, would it not be more likely to produce desired results if normal translation by the smart extension, in contrast to the special LaTeX behavior, were applied to the metadata fields so as to generate the correct plain text string without LaTeX ligatures?

In other words, from the manual:

In LaTeX, smart means to use the standard TeX ligatures for quotation marks

It simply seems that metadata might be a special case for this rule.

Would this create any problems other than the possibility that the engine cannot properly handle a Unicode string? And in any case, could basic ASCII quotation marks be used?

@jgm
Owner

jgm commented Oct 11, 2019

Are other engines unable to support characters outside of ASCII?

Correct. pdflatex doesn't support non-ASCII well. xelatex and lualatex do.

Did you try the fix I suggested?

@brainchild0
Author

Yes, with smart disabled, the document appearance seems the same, and the metadata looks correct. Both pdflatex and xelatex seem to work equally well.

But I am unsure of the penalties of disabling smart. Keeping smart enabled seems like the correct choice, given that I write Markdown using these conventions.

But more to the point of the issue, would it not be an improvement if handling occurred correctly even with the extension enabled, even if in some cases it would mean using only basic ASCII quotation marks?

@jgm
Owner

jgm commented Oct 11, 2019

No penalties disabling smart on latex output if you're just producing pdf with xelatex or lualatex.

We can leave this open with the suggestion of using ASCII quotation marks, but I'm not sure it's worth the additional code complexity.

@brainchild0
Author

Then maybe smart should be disabled for LaTeX, if it has no benefit and some liability.

By the way, is there an error case for using the Unicode string in pdflatex? It worked fine for me just now.

@TomBener
Contributor

TomBener commented Aug 21, 2023

Disabling the smart option for LaTeX may not be a good option. For straight quotes in headings, it would be great to wrap them with \texorpdfstring.

For example, converting:

\section{Pandoc's Features}\label{pandocs-features.md__pandocs-features}

to:

\section{\texorpdfstring{Pandoc's Features}{Pandoc’s Features}}\label{pandocs-features.md__pandocs-features}

#5909 is related to the issue.
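
To make the proposed transformation concrete, here is a minimal Python sketch that builds such a \section line: the first \texorpdfstring argument keeps the TeX source of the heading, and the second carries a Unicode version for the PDF bookmark. The function name `wrap_heading` is hypothetical, the label is omitted for brevity, and this is only an illustration of the idea, not pandoc's LaTeX writer.

```python
def wrap_heading(tex_title: str) -> str:
    """Return a \\section line whose PDF-string argument uses Unicode quotes.

    Hypothetical illustration only: the TeX argument is left untouched,
    while the bookmark argument gets curly quotes and apostrophes.
    """
    pdf_title = (
        tex_title.replace("``", "\u201c")  # TeX opening double quote
        .replace("''", "\u201d")           # TeX closing double quote
        .replace("`", "\u2018")            # opening single quote
        .replace("'", "\u2019")            # closing single quote / apostrophe
    )
    return "\\section{\\texorpdfstring{" + tex_title + "}{" + pdf_title + "}}"


print(wrap_heading("Pandoc's Features"))
```

For the heading in the example above, this prints the same \texorpdfstring wrapper as the proposed output, without the \label part.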

@jgm
Owner

jgm commented Aug 21, 2023

I think the original issue was solved long ago. Here's the result with current pandoc:

% pdfinfo x.pdf 
Title:           “One” – “Two” — “Three”

Thus, closing...

jgm closed this as completed Aug 21, 2023
@TomBener
Contributor

TomBener commented Aug 21, 2023

@jgm Wait, quotes in the headings are not processed correctly. If writing the heading in Markdown:

# "One"

Then converting to PDF via LaTeX, the PDF bookmark is still ``One'' instead of the desired “One”.

@jgm
Owner

jgm commented Aug 21, 2023

@TomBener I'm not seeing this. You may be using an old version of pandoc? (Or older tex packages?)

@TomBener
Contributor

@jgm You're correct. But I found a weird result. Let me clarify.

The contents of the Markdown file test.md are as follows:

# "One" Heading

Some texts here.

# Pandoc's Features

Then if I run the command:

pandoc --pdf-engine=xelatex test.md -o test.pdf

The generated PDF test.pdf had the correct bookmark.

[screenshot: CleanShot 2023-08-22 at 10 38 43@2x]

However, if I split this into two steps, i.e. first generate LaTeX with pandoc:

pandoc -s test.md -o test.tex

Then compile test.tex to PDF manually:

xelatex test.tex

Then the bookmark in the generated PDF was not the desired one.

[screenshot: CleanShot 2023-08-22 at 10 38 20@2x]

The Pandoc version:

$ pandoc --version
pandoc 3.1.6.1
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /Users/username/.local/share/pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

For workflow reasons, I need to generate LaTeX first and then compile it to PDF, so this difference matters to me. Could you help me with the issue? Thanks a lot.

@jgm
Owner

jgm commented Aug 22, 2023

I think this is because in generating PDF via latex, we disable the smart extension in writing the LaTeX. You could try with -t latex-smart.

@TomBener
Contributor

Disabling the smart extension could be an option. However, it has side effects when writing Chinese. As the screenshot below shows, the English quotes were also treated as Chinese punctuation and rendered quite wide.

[screenshot: CleanShot 2023-08-22 at 16 06 58@2x]

To generate the PDF above, the command below was executed:

pandoc --pdf-engine=xelatex -V CJKmainfont=NotoSerifCJKsc-Regular test.md -o test.pdf

Even if I loaded \usepackage[punct=plain]{ctex}, the issue remained.

The root of the problem is that Chinese and English quotation marks share the same Unicode code points. On Chinese LaTeX forums, the recommendation is to write quotes as follows:

``English Quotes''

“中文引号”

Indeed, this is an annoying problem. I don't expect pandoc to change anything for it; I just wanted to raise the issue.
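
The point about shared code points can be checked with a short Python sketch using the standard unicodedata module. The helper name `quote_codepoints` is hypothetical; the sketch only shows that the curly quotes around English and Chinese text are the same characters, so a typesetting engine has to infer from context or font setup which glyph width to use.

```python
import unicodedata


def quote_codepoints(text: str) -> list:
    """Return (code point, Unicode name) for each curly quote in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in text
        if ch in "\u2018\u2019\u201c\u201d"
    ]


english = "\u201cEnglish Quotes\u201d"
chinese = "\u201c中文引号\u201d"

# Both lines print the same pair of code points and names.
print(quote_codepoints(english))
print(quote_codepoints(chinese))
```

Because both strings yield identical code points, writing English quotes with the TeX ligatures, as recommended above, is one way to keep the two kinds of quotes distinguishable in the LaTeX source.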
