Docx reader: metadata recognition is blocked if other elements come before title #8986
Comments
Example added. |
You are right: we just look at the style.
metaStyles :: M.Map ParaStyleName T.Text
metaStyles = M.fromList [ ("Title", "title")
                        , ("Subtitle", "subtitle")
                        , ("Author", "author")
                        , ("Date", "date")
                        , ("Abstract", "abstract")]
Paragraphs with these styles turn into metadata values. |
I would have assumed that the style ID would stay the same in localizations, while the style name changes, but you are reporting the reverse. It would be good to have more information here from others using localized versions of Word. EDIT: Also, the style names above are compared against style names, not ids, so it should work if your style name is really "Title". |
OK, this doesn't have anything to do with the style ID or with the language. Pandoc looks for "metadata" paragraphs only at the beginning of the document. Since your document begins with another element (an image of a cat), the paragraph is not treated as metadata. Removing the cat picture causes the text to be treated as a title. |
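To make the "only at the beginning" behavior concrete, here is a minimal sketch. It is not pandoc's actual reader code: `BodyPart` and `splitLeadingMeta` are hypothetical names, the style table uses plain `String` instead of `ParaStyleName` and `T.Text`, and a repeated style simply overwrites the earlier value.

```haskell
import qualified Data.Map as M

-- Hypothetical, simplified stand-in for a parsed docx body element.
data BodyPart
  = Para String String   -- paragraph: style name, text
  | Image String         -- any non-paragraph element, e.g. a picture
  deriving (Show, Eq)

-- Same style-name-to-metadata-key table as quoted earlier in this thread,
-- with String in place of ParaStyleName and T.Text (an assumption).
metaStyles :: M.Map String String
metaStyles = M.fromList
  [ ("Title", "title")
  , ("Subtitle", "subtitle")
  , ("Author", "author")
  , ("Date", "date")
  , ("Abstract", "abstract")
  ]

-- Consume only the leading run of metadata-styled paragraphs.
-- The first element that is not such a paragraph ends collection,
-- and everything from there on is returned as ordinary body content.
splitLeadingMeta :: [BodyPart] -> (M.Map String String, [BodyPart])
splitLeadingMeta = go M.empty
  where
    go acc (Para style txt : rest)
      | Just key <- M.lookup style metaStyles = go (M.insert key txt acc) rest
    go acc rest = (acc, rest)
```

Under this model, `splitLeadingMeta [Image "cat", Para "Title" "My title"]` returns an empty metadata map and leaves both elements in the body, which matches what is described above: the picture comes first, so the Title paragraph is never considered.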
@jgm Is this an issue that ought to be resolved in Pandoc itself, or is it better to do some preprocessing on our side? This anonymized example is based on a real-life document we got. So I assume it might be best to fix it in Pandoc? Or is a setup like this invalid per the standards? |
The pandoc behavior is intentional, but it could be changed. The current approach is conservative: we don't want to pick up a style "Date" that occurs deep in the body of the document as the metadata date... Changing it might produce some unexpected effects. |
Shouldn't specific styles always be considered 'metadata', such as "Title" or "Subtitle"? |
Who knows? Word has a Date style. Is it intended for the document date only? Or is it something one might use for other dates in the body of the document? In fact, a user could use it either way. If they did the latter we'd be picking up dates from the body of the document and treating them as metadata. I'm tempted to change things so that these styles are always considered metadata, even if they don't come at the beginning, but I'm also resisting the temptation, because it might have bad results -- and these would only become evident after the change was made. I think it was probably done this way for a reason. |
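The risk described here can be illustrated with a small sketch of the alternative policy, where a metadata style counts wherever it appears. This is only an illustration under assumptions: `Para`, `metaKey`, and `collectAnywhere` are hypothetical names, and a paragraph is modeled as a plain (style name, text) pair.

```haskell
import Data.Maybe (mapMaybe)

-- Hypothetical model: a paragraph is (style name, text).
type Para = (String, String)

-- Style name to metadata key, following the table quoted in this thread.
metaKey :: String -> Maybe String
metaKey style = lookup style
  [ ("Title", "title")
  , ("Subtitle", "subtitle")
  , ("Author", "author")
  , ("Date", "date")
  , ("Abstract", "abstract")
  ]

-- "Always metadata" policy: collect every paragraph with a metadata
-- style, regardless of its position in the document.
collectAnywhere :: [Para] -> [(String, String)]
collectAnywhere = mapMaybe toMeta
  where
    toMeta (style, txt) = fmap (\key -> (key, txt)) (metaKey style)
```

For a document such as `[("Title", "Report"), ("Date", "2023-01-01"), ("Normal", "Meeting held on"), ("Date", "12 March")]`, this returns two `date` entries. The second one comes from the body and was presumably never meant as document metadata, which is the kind of unexpected effect the conservative approach avoids.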
Could you give me examples of metadata that are not metadata in different contexts? |
See my previous comment on Date. Do I have a real-world example? No. I try to deal with Word documents as little as possible. But as I said: anyone can apply the Date style anywhere in the document they wish! So, maybe there are lots of documents where the Date style is not used for metadata. That would be my guess, anyway. |
I will try to compose some examples and strategies for extracting certain parts of metadata. |
.docx reader's handling of top styles in international documents
Explain the problem.
How does the .docx reader in Pandoc determine the top style, such as Title, and what implications does this approach have for international documents? Specifically, in Dutch (NL) documents, the top style for Title is often named Title but has a Style ID of Titel (the Dutch translation for Title). I believe this might result in the Title being converted to a regular paragraph.
Pandoc version?
Example
This example has been anonymized and therefore contains gibberish.
Visual representation of document in Microsoft Word
Expected
Expected "Znzxar txfnfdcestx turpfmdrhpff" to be marked as the Title of this document, as seen in the screenshot above.
Actual
Text "Znzxar txfnfdcestx turpfmdrhpff" is marked as a regular text paragraph, not the Title of the document.
Sources:
Input: input.docx
Output: none/expected.html
Our internal ID: NLDOC-837