Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ipynb reader] smart extension does not parse ASCII en-dash correctly #7928

Closed
ickc opened this issue Feb 18, 2022 · 18 comments
Closed

[ipynb reader] smart extension does not parse ASCII en-dash correctly #7928

ickc opened this issue Feb 18, 2022 · 18 comments
Labels

Comments

@ickc
Copy link
Contributor

ickc commented Feb 18, 2022

MWE with pandoc 2.17.1.1 on macOS 12.2.1.

In pandoc-bug.ipynb,

{
 "cells": [
 ],
 "metadata": {
  "title": "Number range 5–6"
 },
 "nbformat": 4,
 "nbformat_minor": 5
}

Running < pandoc-bug.ipynb | pandoc -s -f ipynb+smart -t ipynb+smart,

{
 "cells": [],
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "title": "Number range 5--6"
 }
}

Running it twice: < pandoc-bug.ipynb | pandoc -s -f ipynb+smart -t ipynb+smart | pandoc -s -f ipynb+smart -t ipynb+smart

{
 "cells": [],
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "title": "Number range 5\\--6"
 }
}

And you can see each round trip will add a \\.

A similar construct in markdown doesn't result in this behavior, e.g. echo '5–6' | pandoc -f markdown -t markdown | pandoc -f markdown -t markdown.

@ickc ickc added the bug label Feb 18, 2022
@jgm
Copy link
Owner

jgm commented Feb 18, 2022

A little easier to see the issue with:

% pandoc pandoc-bug.ipynb -s -f ipynb+smart -t native
Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "jupyter"
            , MetaMap
                (fromList
                   [ ( "nbformat" , MetaString "4" )
                   , ( "nbformat_minor" , MetaString "5" )
                   , ( "title" , MetaString "Number range 5--6" )
                   ])
            )
          ]
    }
  []

We're getting 5--6 in the native instead of 5[UNICODE NDASH]6.
Since this is a MetaString, the markdown writer properly escapes it, taking it literally.

@jgm
Copy link
Owner

jgm commented Feb 18, 2022

Notes:

  • the ipynb reader correctly parses this as MetaString "Number range 5\8211\&6" (whether or not smart is enabled)
  • the ipynb writer outputs a unicode en-dash when smart is not enabled, but -- when smart is enabled
  • the markdown writer has the same behavior

Conclusion: the root problem is with the markdown writer. The markdown writer should render MetaString "Number range 5\8211\&6" as a literal string with a unicode en-dash, rather than applying the "smart" transformation.

However, it's hard to see how we can fix this with current architecture. The function that builds our yaml metadata block gets passed something like [Str "5\8211\&6"] and renders it using the markdown writer, which does the smart transformation. We'd have to change the infrastructure around yaml metadata so that a separate function is used for MetaString. Low priority for now.

@ickc
Copy link
Contributor Author

ickc commented Feb 18, 2022

  • the markdown writer has the same behavior

Conclusion: the root problem is with the markdown writer.

Sorry don't quite follow here, echo '5–6' | pandoc -f markdown -t markdown | pandoc -f markdown -t markdown is ok?

(Context for others, markdown has the smart extension enabled by default but ipynb doesn't.)

@jgm
Copy link
Owner

jgm commented Feb 18, 2022

The problem arises because of the MetaString. With -f markdown -t markdown -s we parse the metadata field as inlines.

@jgm
Copy link
Owner

jgm commented Feb 18, 2022

Well, it's a complex problem. Writing the MetaString gives us 5--6 but then the ipynb reader treats this as a literal string rather than parsing it as Markdown. So maybe another solution would be to have the ipynb reader parse the metadata fields as Markdown, rather than taking them as literal strings. I think this would be dangerous for most fields, but probably okay for title?

@ickc
Copy link
Contributor Author

ickc commented Feb 18, 2022

This example might be slightly clearer:

Input: pandoc-bug.ipynb

{
 "cells": [],
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "title": "Number range 5--6"
 }
}

Each pipe below is broken down into one single conversion to/from AST,
which variants such as ipynb vs ipynb+smart vs markdown vs markdown-smart.

The only problematic one involves -t ipynb+smart. Edit: actually the markdown writing Number range 5\--6 is also problematic.

< pandoc-bug.ipynb | pandoc -s -f ipynb -t native
Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "jupyter"
            , MetaMap
                (fromList
                   [ ( "nbformat" , MetaString "4" )
                   , ( "nbformat_minor" , MetaString "5" )
                   , ( "title" , MetaString "Number range 5--6" )
                   ])
            )
          ]
    }
  []
❯ < pandoc-bug.ipynb | pandoc -s -f ipynb -t native | pandoc -s -f native -t ipynb
{
 "cells": [],
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "title": "Number range 5--6"
 }
}
❯ < pandoc-bug.ipynb | pandoc -s -f ipynb -t native | pandoc -s -f native -t ipynb+smart
{
 "cells": [],
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "title": "Number range 5\\--6"
 }
}
❯ < pandoc-bug.ipynb | pandoc -s -f ipynb -t native | pandoc -s -f native -t markdown
---
jupyter:
  nbformat: 4
  nbformat_minor: 5
  title: Number range 5\--6
---
❯ < pandoc-bug.ipynb | pandoc -s -f ipynb -t native | pandoc -s -f native -t markdown-smart
---
jupyter:
  nbformat: 4
  nbformat_minor: 5
  title: Number range 5--6
---
❯ < pandoc-bug.ipynb | pandoc -s -f ipynb -t native | pandoc -s -f native -t markdown | pandoc -f markdown -t native -s
Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "jupyter"
            , MetaMap
                (fromList
                   [ ( "nbformat" , MetaInlines [ Str "4" ] )
                   , ( "nbformat_minor" , MetaInlines [ Str "5" ] )
                   , ( "title"
                     , MetaInlines
                         [ Str "Number"
                         , Space
                         , Str "range"
                         , Space
                         , Str "5--6"
                         ]
                     )
                   ])
            )
          ]
    }
  []
❯ < pandoc-bug.ipynb | pandoc -s -f ipynb -t native | pandoc -s -f native -t markdown-smart | pandoc -f markdown-smart -t native -s
Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "jupyter"
            , MetaMap
                (fromList
                   [ ( "nbformat" , MetaInlines [ Str "4" ] )
                   , ( "nbformat_minor" , MetaInlines [ Str "5" ] )
                   , ( "title"
                     , MetaInlines
                         [ Str "Number"
                         , Space
                         , Str "range"
                         , Space
                         , Str "5--6"
                         ]
                     )
                   ])
            )
          ]
    }
  []

@ickc
Copy link
Contributor Author

ickc commented Feb 18, 2022

I think it is the ipynb reader didn't convert this to en-dash?

Input: pandoc-bug.ipynb

{
 "cells": [],
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "title": "Number range 5--6"
 }
}
< pandoc-bug.ipynb | pandoc -s -f ipynb+smart -t native 
Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "jupyter"
            , MetaMap
                (fromList
                   [ ( "nbformat" , MetaString "4" )
                   , ( "nbformat_minor" , MetaString "5" )
                   , ( "title" , MetaString "Number range 5--6" )
                   ])
            )
          ]
    }
  []

Edit: the direct comparison to markdown would be

Input: pandoc-bug.md

---
jupyter:
  nbformat: 4
  nbformat_minor: 5
  title: Number range 5--6
---
< pandoc-bug.md | pandoc -s -f markdown -t native 
Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "jupyter"
            , MetaMap
                (fromList
                   [ ( "nbformat" , MetaInlines [ Str "4" ] )
                   , ( "nbformat_minor" , MetaInlines [ Str "5" ] )
                   , ( "title"
                     , MetaInlines
                         [ Str "Number"
                         , Space
                         , Str "range"
                         , Space
                         , Str "5\8211\&6"
                         ]
                     )
                   ])
            )
          ]
    }
  []
❯ < pandoc-bug.md | pandoc -s -f markdown-smart -t native 
Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "jupyter"
            , MetaMap
                (fromList
                   [ ( "nbformat" , MetaInlines [ Str "4" ] )
                   , ( "nbformat_minor" , MetaInlines [ Str "5" ] )
                   , ( "title"
                     , MetaInlines
                         [ Str "Number"
                         , Space
                         , Str "range"
                         , Space
                         , Str "5--6"
                         ]
                     )
                   ])
            )
          ]
    }
  []

which correctly convert to en-dash with smart.

@ickc ickc changed the title ipynb+smart round trip not idempotent [ipynb reader] smart extension does not parse ASCII en-dash correctly Feb 18, 2022
@jgm
Copy link
Owner

jgm commented Feb 18, 2022

I think it is the ipynb reader didn't convert this to en-dash?

Yes - as I was trying to explain, this is because it's in the context of a MetaString, so the content is treated as plain text and not parsed as markdown.

@jgm
Copy link
Owner

jgm commented Feb 18, 2022

Two possible fixes were mentioned above:

  • modifying the writer to handle MetaString differently (difficult without big changes)
  • modifying the reader to parse some metadata fields as markdown (but which?)

@ickc
Copy link
Contributor Author

ickc commented Feb 18, 2022

@jgm
Copy link
Owner

jgm commented Feb 18, 2022

This doesn't say that the title field is supposed to be parsed as markdown, does it?

@ickc
Copy link
Contributor Author

ickc commented Feb 19, 2022

I tested and the result is that markdown in title/authors are not interpreted in nbconvert. (i.e. nbconvert does not parse title or authors in metadata as markdown.)

@jgm
Copy link
Owner

jgm commented Feb 19, 2022

Then that argues against the second solution above. Unfortunately, the first one would be quite difficult given how things are currently arranged...

@jgm
Copy link
Owner

jgm commented Feb 19, 2022

Actually, maybe this is simpler than I thought.

jgm added a commit that referenced this issue Feb 19, 2022
Previously we used the markdown writer to render metadata.
This had some undesirable consequences (e.g. en dash expanded
to `--` when `smart` enabled), so now we use the plain writer.

This addresses #7928, but I think a more elegant fix is possible.
@jgm
Copy link
Owner

jgm commented Feb 19, 2022

I'm going to call this closed by my last commit.

@jgm jgm closed this as completed Feb 19, 2022
@ickc
Copy link
Contributor Author

ickc commented Feb 25, 2022

Thanks.

I used the latest nightly from
https://github.com/jgm/pandoc/actions/runs/1891833666
to test this but it seems the problem from the MWE still persists?

I used

$ < pandoc-bug.ipynb | pandoc-nightly-macos-2022-02-24/pandoc -s -f ipynb+smart -t ipynb+smart | pandoc -s -f ipynb+smart -t ipynb+smart
{
 "cells": [],
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "title": "Number range 5\\--6"
 }
}

@jgm
Copy link
Owner

jgm commented Feb 25, 2022

The problem is that your second pandoc in the pipe is the system one, not the nightly.

@ickc
Copy link
Contributor Author

ickc commented Feb 26, 2022

Oh, right. How silly was me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants