Copying text gives wrong characters when Source Han Sans / Source Han Serif is used with XeTeX #286

Closed
stone-zeng opened this issue Jun 10, 2017 · 14 comments

@stone-zeng
Member

The code is as follows:

\documentclass{ctexart}
\begin{document}
    {\CJKfontspec{Source Han Sans SC} 孤立子 ABC} \par
    {\CJKfontspec{Source Han Serif SC} 孤立子 ABC} \par
    {\CJKfontspec{Microsoft YaHei} 孤立子 ABC} \par
    孤立子 ABC
\end{document}

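After compiling with XeLaTeX, the three Han characters copied from the first two lines come out as U+5B64 U+2F74 U+2F26, while those copied from the last two lines are U+5B64 U+7ACB U+5B50. With Source Han Sans / Source Han Serif, the two characters "立子" turn into Kangxi radicals (U+2F74 U+2F26). The same problem occurs when only fontspec is used.

Compiling with LuaLaTeX gives the correct result.

As for PDF viewers, both Adobe Reader and SumatraPDF show the error.

Platform: Windows 10, TeX Live 2017, XeTeX 3.14159265-2.6-0.99998, LuaTeX 1.0.4.
Fonts: Source Han Sans 1.004, Source Han Serif 1.000.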

@leo-liu
Member

leo-liu commented Jun 11, 2017

This looks like a XeTeX bug. It can be reproduced with plain XeTeX alone.

% XeTeX
\font\1="Source Han Sans SC"
\font\2="Source Han Serif SC"
\font\3="Microsoft YaHei"

{\1 孤立子 ABC} \par
{\2 孤立子 ABC} \par
{\3 孤立子 ABC} \par

\bye

In any case, the ctex-kit project cannot fix this kind of problem.

Also, it does not seem to be a dvipdfmx bug, because the following code passes the test with upTeX + dvipdfmx:

% upTeX + dvipdfmx
\font\rm=upzhserif-h
\font\it=upzhserifit-h
\font\bf=upzhserifb-h

\special{pdf:mapline upserif-h   unicode SourceHanSansSC-Regular.otf}
\special{pdf:mapline upserifit-h unicode SourceHanSerifSC-Regular.otf}
\special{pdf:mapline upserifb-h  unicode msyh.ttc}

{\rm 孤立子 ABC} \par
{\it 孤立子 ABC} \par
{\bf 孤立子 ABC} \par

\bye

Alternatively, the following zhmCJK-based code also compiles correctly with latex + dvipdfmx:

% latex + dvipdfmx
\documentclass{article}
\usepackage{zhmCJK}
\setCJKmainfont{SourceHanSansSC-Regular.otf}
\setCJKsansfont{SourceHanSerifSC-Regular.otf}
\setCJKmonofont{msyh.ttc}
\begin{document}

{\rmfamily 孤立子 ABC} \par
{\sffamily 孤立子 ABC} \par
{\ttfamily 孤立子 ABC} \par

\end{document}

@leo-liu
Member

leo-liu commented Jun 11, 2017

马起园 has confirmed that this is a bug in XeTeX's CMap handling.

@jjgod
Member

jjgod commented Jun 11, 2017

As a workaround, does \XeTeXgenerateactualtext=1 help in this case?

I should add that although the ToUnicode map is incorrect, Preview from macOS seems to handle the copying perfectly fine.

A quick fix would be to simply blacklist the KANGXI RADICALS region (U+2F00 to U+2FD5), since those characters are rarely used. A smarter fix would be to leverage the actualtext Unicode string to generate a better ToUnicode map.
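
For reference, the \XeTeXgenerateactualtext=1 workaround asked about above amounts to setting the primitive before any text is typeset. The following is a minimal sketch applied to the first two lines of the original test file. It assumes a XeTeX version that provides the primitive, and whether copying improves still depends on the PDF viewer.

```latex
% Compile with XeLaTeX.
\documentclass{ctexart}
% Ask XeTeX to emit /ActualText for the typeset text, so that viewers
% which honour /ActualText can copy the original characters.
\XeTeXgenerateactualtext=1
\begin{document}
    {\CJKfontspec{Source Han Sans SC} 孤立子 ABC} \par
    {\CJKfontspec{Source Han Serif SC} 孤立子 ABC} \par
\end{document}
```

Setting the value back to 0 restores the default behaviour, which makes it easy to compare the copied text from both variants in the same viewer.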

@stone-zeng
Member Author

\XeTeXgenerateactualtext=1 does not work on my computer. I use SourceHanSans-Regular.ttc and SourceHanSerif-Regular.ttc.

By the way, I noticed that you have modified TeX Live. When will this change take effect?

@jjgod
Member

jjgod commented Jun 12, 2017

I modified my branch of texlive as an experiment; I’m not yet convinced that this is the best approach.

In general, TeXLive has an annual release schedule so any changes done now will be released next year.

@jjgod jjgod reopened this Jun 12, 2017
@leo-liu
Member

leo-liu commented Jun 12, 2017

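It seems more reasonable to address this problem on the XeTeX side. After all, (up)latex + dvipdfmx does not go wrong either, so it is XeTeX that loses part of the information during compilation. Patching dvipdfm-x feels a bit odd.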

@jjgod
Member

jjgod commented Jun 12, 2017

For the record, I’m not able to reproduce the issue even with Adobe Reader on macOS. So at this point I cannot confirm ToUnicode map is the problem here.

\XeTeXgenerateactualtext=1 is supposed to be the solution from the XeTeX side, so it should work. If it doesn’t, someone who can reproduce the problem should investigate.

I sent an email to http://tug.org/pipermail/xetex/2017-June/027142.html in case anyone is willing to help.

@jjgod
Member

jjgod commented Jun 12, 2017

http://tug.org/pipermail/xetex/2017-June/027143.html claimed that \XeTeXgenerateactualtext=1 works fine. Can either of you try?

@stone-zeng
Member Author

I have tried several PDF viewers. With \XeTeXgenerateactualtext=1, Adobe Reader DC and Adobe Acrobat DC give the correct result, while SumatraPDF v3.1.2, Windows Reader App (阅读器), Microsoft Edge and Microsoft Word 2016 do not.

My OS is Windows 10 1607 (Build 14393.1198).

@jjgod
Member

jjgod commented Jun 12, 2017

Thanks for testing again. I will see if Akira-san can build a new w32tex binary for xdvipdfmx for you to test.

@jjgod
Member

jjgod commented Jun 12, 2017

Akira-san kindly confirms my patch indeed helped the \XeTeXgenerateactualtext=0 situation and provided a new dvipdfmx.dll to use: http://tug.org/pipermail/xetex/2017-June/027150.html

Feel free to try.

Since it's not a ctex-kit bug I will close this and move further discussion elsewhere. If you have more to comment please reply to the email thread.

@jjgod jjgod closed this as completed Jun 12, 2017
@leo-liu
Member

leo-liu commented Jun 13, 2017

I ran a quick test on Windows; \XeTeXgenerateactualtext=1 is effective on Debian (Acroread 9.5, Evince 3.22.1).

aminophen added a commit to texjporg/tex-jp-build that referenced this issue Sep 8, 2017
@stone-zeng
Member Author

stone-zeng commented Apr 11, 2018

I have tested on TeX Live 2018 (revision 47303 2018-04-05 19:52:22 +0200) and XeTeX (3.14159265-2.6-0.99999).

Test file:

% Compiled with XeTeX
% \XeTeXgenerateactualtext=0 or 1

\font\1="Source Han Sans SC"  % 1.004
\font\2="Source Han Serif SC" % 1.001
\font\3="Microsoft YaHei"

\def\KANGXI{%
  ⽴% U+2F74
  ⼦% U+2F26
}

\def\HAN{%
  立% U+7ACB
  子% U+5B50
}

{\1 \KANGXI ~ \HAN} \par
{\2 \KANGXI ~ \HAN} \par
{\3 \KANGXI ~ \HAN} \par

\bye

Copying the strings from the PDF files into https://r12a.github.io/app-conversion/ gives the following results:

  • Adobe Acrobat Pro DC (2018.011.20038):
% \XeTeXgenerateactualtext=1
U+2F74 U+2F26 U+7ACB U+5B50
U+2F74 U+2F26 U+7ACB U+5B50
U+2F74 U+2F26 U+7ACB U+5B50

% \XeTeXgenerateactualtext=0
U+7ACB U+5B50 U+7ACB U+5B50
U+7ACB U+5B50 U+7ACB U+5B50
U+107419 U+107316 U+7ACB U+5B50
  • SumatraPDF (3.1.2 64-bit):
% \XeTeXgenerateactualtext=1
U+7ACB U+5B50 U+7ACB U+5B50
U+7ACB U+5B50 U+7ACB U+5B50
?? U+7ACB U+5B50

% \XeTeXgenerateactualtext=0
U+7ACB U+5B50 U+7ACB U+5B50
U+7ACB U+5B50 U+7ACB U+5B50
?? U+7ACB U+5B50
  • Google Chrome (65.0.3325.181):
% \XeTeXgenerateactualtext=1
U+2F74 U+2F26 U+7ACB U+5B50
U+2F74 U+2F26 U+7ACB U+5B50
U+2F74 U+2F26 U+7ACB U+5B50

% \XeTeXgenerateactualtext=0
U+7ACB U+5B50 U+7ACB U+5B50
U+7ACB U+5B50 U+7ACB U+5B50
U+7419 U+7316 U+7ACB U+5B50
  • Microsoft Edge (41.16299.334.0):
% \XeTeXgenerateactualtext=1
U+7ACB U+5B50 U+7ACB U+5B50
U+7ACB U+5B50 U+7ACB U+5B50
  U+7ACB U+5B50

% \XeTeXgenerateactualtext=0
U+7ACB U+5B50 U+7ACB U+5B50
U+7ACB U+5B50 U+7ACB U+5B50
  U+7ACB U+5B50

Notice that all programs give the correct results for normal Han characters; for Kangxi Radicals, however, the mapping still seems to be a mess.

OS: Windows 10 1709 (build 16299.334)

PS: @jjgod I don't think ctex-kit is the right place to discuss this issue. Could you point me to a better one? (The mailing list seems to be outdated.)

@muzimuzhi
Contributor

muzimuzhi commented Dec 6, 2019

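\XeTeXgenerateactualtext=1 works by writing /ActualText entries into the PDF (PDF Reference v1.7, sec. 10.6.2, table 10.10; sec. 10.8.3), which makes it possible to control, character by character, the result of copying text from the PDF. Even with the /ActualText entries written, whether the "copying error" is actually fixed still depends on support in the viewer.

PS: Judging from a quick search of the SumatraPDF source code, it still does not support /ActualText.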
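
To illustrate the /ActualText mechanism itself, here is a hedged sketch that attaches the entry by hand to individual characters using the accsupp package, rather than relying on \XeTeXgenerateactualtext. It assumes that accsupp's dvipdfm driver is suitable for XeTeX's xdvipdfmx backend, and \withactualtext is a hypothetical helper introduced only for this example. As with the primitive, what gets copied still depends on viewer support.

```latex
% Compile with XeLaTeX. Illustration only: marks two characters by hand.
\documentclass{ctexart}
% Assumption: the dvipdfm driver of accsupp works with xdvipdfmx.
\usepackage[dvipdfm]{accsupp}
% Hypothetical helper: #1 = UTF-16BE code units in hex, #2 = text to typeset.
% With method=hex and the unicode key, accsupp prepends the FEFF byte order
% mark itself, so only the code point is given.
\newcommand*{\withactualtext}[2]{%
  \BeginAccSupp{method=hex,unicode,ActualText=#1}#2\EndAccSupp{}%
}
\begin{document}
{\CJKfontspec{Source Han Sans SC}%
 孤\withactualtext{7ACB}{立}\withactualtext{5B50}{子} ABC}
\end{document}
```

In a viewer that honours /ActualText, copying the marked characters should yield U+7ACB and U+5B50 regardless of what the font's ToUnicode map says; viewers that ignore the entry will behave as before.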


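Some references on \XeTeXgenerateactualtext (there is at least one more email that I remember but could not find):

  • The original feature request email
    • mentions that the feature needs cooperation from the viewer
  • The announcement email from when the primitive was added
    • mentions how enabling the feature affects operations such as selecting and highlighting text in different viewers
  • The related description in the xetex news
  • \XeTeXgenerateactualtext has been documented in xetexref since 2019-12-09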
