Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API change] Modify Attr, add it to relevant types #79

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

despresc
Copy link
Contributor

@despresc despresc commented Sep 8, 2020

Resolves jgm/pandoc#684 in this library. The new Attr type is

data Attr = Attr Text [Text] (Map Text Text)

as I mentioned in that issue. Everything that could reasonably have an Attr has one now.

I haven't added

data ListItem = ListItem Attr [Block]
data ListLabel = ListLabel Attr [Inline]

for the lists yet, but I can do that if it's reasonable.

@despresc
Copy link
Contributor Author

despresc commented Sep 8, 2020

I think I would adapt pandoc to this request first, then the Format one (if both are good), since this one will involve more extensive modifications to the writer tests: the native input in the writer tests will have to be adapted to the IR type changes. To allow normal development in both packages, I can rebase the other request (once it's done) on top of this one, then modify pandoc in two requests, keeping the branches up to date with master.

I'm also interested in the Figure and Ref changes (since these four changes are essentially all the things in the list of projects that require a change to the IR), but I'm not sure how much appetite there is for so many changes.

@srid
Copy link

srid commented Sep 8, 2020

Is there a reason why this cannot just be:

type Attr = Map Text Text

i.e., why not put both id and class in the attribute Map?

@despresc
Copy link
Contributor Author

despresc commented Sep 9, 2020

Both are used pretty frequently in pandoc, internally, so one reason is that it's more convenient to have them separated so that a simple pattern-match can get to them. The id attribute could still go in the map, in principle, but class is better kept separate and as [Text], rather than keeping it encoded in a single Text value.

@jgm
Copy link
Owner

jgm commented Sep 10, 2020

If we're making a large scale change of this type, it might be worth considering a different alternative (though this would be an even more radical change).

Something like:

data Block = Block Attr BlockT
data BlockT = Header Int [Inline] | Para [Inline] | ...
data Inline = Inline Attr InlineT
data InlineT = Str Text | Emph [Inline] | ...

This gives attributes to everything in a uniform way. Granted, it would be a much more painful change to make to the code base, though maybe PatternSynonyms could make it less painful. Hence, probably not worth considering at the time. But I thought it should be mentioned.

@despresc
Copy link
Contributor Author

That was my thinking as well.

I should mention that in the Semigroup instance for Inlines, two adjacent Emph elements are only merged if they have null attributes (and similarly for the other relevant inlines). That could be too strict of a requirement.

@despresc
Copy link
Contributor Author

On that subject, the new Attr doesn't have a Semigroup instance defined. Would this:

(Attr i1 c1 k1) <> (Attr i2 c2 k2) = Attr i' (c1 <> c2) (k1 <> k2)
  where
    i' = if Text.null i1 then i2 else i1

be appropriate? The left/right priorities for the components could be different.

@despresc
Copy link
Contributor Author

I should also say that factoring out the Attr like you mentioned is a little more generous than what's done in this pull request. A few of the Inline and Block branches don't have an Attr component: all of Plain, Null, Str, Space, SoftBreak, and LineBreak, to be specific.

@jgm
Copy link
Owner

jgm commented Sep 16, 2020

General note: Thanks for all of these concrete proposals for pandoc-types. Because of other things lined up, it may be a few weeks before I can really think seriously about them.

On Attr: were you saying that you'd prefer the approach I sketched of separating out the Attr from the base block types?

On Semigroup: the instance you propose seems as good as any.

On Semigroup for Inlines: I would think that a good general principle would be that adjacent elements of the same type can merge if they have identical Attr. That would be more generous than "if they have null Attr." Is there something to be said against it?

On the Attr type: there is a drawback to using Map. (I don't know if this was discussed before.) The drawback is that you don't get deterministic output with Show (or with ToJSON, etc.). This could perhaps be dealt with by defining Show explicitly (similarly ToJSON, etc.). But my reason for using a list originally was to allow order information to be preserved deterministically. I'm not sure what to think about this. In many ways Map is more correct (disallowing multiple attributes with the same key) as well as more performant.

@despresc
Copy link
Contributor Author

You're welcome. I'll still be free for a while to adapt pandoc to any changes, so there isn't much rush (though I'd prefer working sooner rather than later, of course).

I think I like separating out attributes. That's a common approach for compilers that need to annotate every node of their AST with additional information, and a uniform treatment of attributes would be nice. Downsides (maybe small) that I can think of, other than it being a larger change: pattern matches will get a little more complex, at least in the short term, everything becoming Block attr (Constructor ...) instead of Constructor attr ...; the AST will be larger in native/json/lua output, since Str et al. get attributes; and things will get attributes that don't have them in any output format (though the attributes can simply be ignored in those cases).

Another design is this:

data WithAttr a = WithAttr Attr a
type Block = WithAttr BlockT
type Inline = WithAttr InlineT

for uniformity, but I'm not sure if it makes too much difference. Having separate types for other attribute-carrying elements (like RowT or CellT, or however they're named) is also possible; that would be more necessary in the WithAttr representation than yours, to keep the representation of attribute-carrying things more uniform.

I can't think of any objections to that method of combining Inlines.

I think the output of Show for Map is deterministic, since it converts the map to a list with toAscList before showing it. You're right that the ToJSON instance isn't, so that would need to be defined separately to preserve determinism. Actually, one possibility is to give everything an explicit toEncoding implementation. That way the Attr instance can use the toEncoding for Map, which folds over the map directly and so outputs an object with sorted keys, from what I can tell. I think that would also have the effect of making encode a bit faster, and could be used to make the output for Meta deterministic as well.

I can't really comment on the need to preserve key ordering or duplicate keys. I don't think that pandoc ever relies on that property, at least. There might be users or people in the wider pandoc ecosystem that do, but I don't have any examples (mostly from not being familiar with a lot of filters).

@despresc
Copy link
Contributor Author

Maybe the downsides for factoring out Attr are bigger than I think? There will be a certain amount of annoyance having to deal with an extra layer on top of the existing Block and Inline. That's harder to avoid in Haskell, but the json (and maybe the lua) serialization could be adapted. The data Block = Block Attr BlockT type could be serialized as

{"t":"BlockTConstructor","a":"attributes","c":"content"}

and the lua output might be similarly adjustable. That might be an improvement.

@tarleb
Copy link
Contributor

tarleb commented Sep 17, 2020

Different general note: I've started work on ast-migrator. The goal is to convert from one AST version to another. That should give us more room to do drastic changes without impacting library (and filter) users too much. Maybe it's enough to prevent a Python 2/3 kind of situation.

I'll move the project to the pandoc organization if it is deemed useful.

@jgm
Copy link
Owner

jgm commented Sep 17, 2020

I'm convinced that we can handle the switch to a Map rather than a list.

It's definitely true that going general with the Attr would be much more of a pain in the short term, since the code modifications required are much more intrusive. That may argue for a more gradual approach, even if it's less elegant.

The new Attr type, in addition to being a data type, changes the
key-value pairs representation to a Map.

An Attr component has been added to every relevant branch of Inline
and Block. It has also been added to Caption.

New Attr conversion functions have been added to Definition, and new
builders for the types that received an Attr component have been added
to Builder.

The representation of Attr in JSON is different: the key-value pairs
are now an object.

Additionally, the t_table test was a bit too large to inspect for
correctness directly. It has been simplified, and new tests for the
top-level components of Table have been added.
The semigroup operation is left-biased for identifiers and attributes,
and concatenates the classes.
The semigroup instance for Inlines will now fuse together appropriate
inline elements when they have equal attributes, not just null
attributes.
@tarleb
Copy link
Contributor

tarleb commented Oct 8, 2020

An alternative to using a Map could be to have something like Swift's Ordered Dictionary. The datatype combines an array with a map. My understanding is that JavaScript's Map does something similar.

@despresc
Copy link
Contributor Author

despresc commented Oct 9, 2020

Something like Vector (Text, Text)? Though it could be [(Text, Text)] still (maybe under a newtype), and just be sorted and have its duplicate keys removed.

@tarleb
Copy link
Contributor

tarleb commented Oct 9, 2020

I checked the Swift implementation, it is basically

data OrderedDictionary k v = OrderedDictionary
  { orderedKeys :: Vector k
  , keysToValues :: Map k v
  }

Which seems like overkill here.
Given our use-case, and the size of a typical attribute collection, a simple newtype might be a good idea.

@srid
Copy link

srid commented Oct 9, 2020

Data.Map.Ordered from ordered-containers?

An OMap behaves much like a Map, with mostly the same asymptotics, but also remembers the order that keys were inserted. All operations whose asymptotics are worse than Map have documentation saying so.

@jgm
Copy link
Owner

jgm commented Oct 9, 2020

That does fit, but it's always a liability to take on an additional library dependency. One more thing to break if upstream maintainers don't keep it up to date.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Permit adding attributes to all Markdown elements
5 participants