add more sorting options in longturtle serializer #2880

VladimirAlexiev · 2024-08-11T06:35:51Z

@nicholascar, @fititnt, maybe @gwhigs
The main purpose of longturtle is to facilitate better diffs.
Sorting has a lot to do with stability (diff minimization), so I'll share more experience.

Call me weird, but I prefer to read reasonably small ontologies (up to 50 classes, 100 props) in turtle rather than in an ontology editor.
So each time I get some ontology, I convert it to turtle (using riot --formatted=ttl) and then:

- add prefixes for all used props.
  - I don't use nsN: prefixes
  - I first consult https://prefix.cc
  - (*) If the namespace isn't listed there, I make a reasonable prefix myself.
- sort prefixes alphabetically
  - (*) but the ontology's own prefix first
  - I also align namespaces on <
- sort turtle blocks (subjects) alphabetically by prefixed name, not full URL
  - It's better to first sort by kind (see sections below)
  - but the ontology node first
  - (*) also author records and any other subsidiary nodes describing the ontology
  - (*) any external terms last
- I also add comments to delineate the sections. They come in this order:
  - ### Ontology
  - ### Classes
  - ### Properties
  - ### Individuals
  - ### External Terms
    - You could also sort them by kind, but then do you make 3 sections per external ontology?
    - Since there are usually few external terms, better to keep them all in one section.

Note: the (*) items are harder to implement, so they are optional.
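The ordering rules above can be sketched as plain sort keys. This is only an illustration of the proposal, not how rdflib's longturtle serializer works today: the function names, the `SECTION_ORDER` list, and the `(qname, kind)` input shape are all hypothetical, and it assumes each subject's kind has already been determined.

```python
# Hypothetical sketch of the ordering rules listed above; not rdflib's API.

# Section order as proposed in this issue.
SECTION_ORDER = ["Ontology", "Classes", "Properties", "Individuals", "External Terms"]


def sort_prefixes(prefixes, own_prefix):
    """Sort prefix names alphabetically, but put the ontology's own prefix first."""
    return sorted(prefixes, key=lambda p: (p != own_prefix, p))


def subject_sort_key(qname, kind, own_prefix):
    """Key for ordering subject blocks: section first, then prefixed name.

    qname: prefixed name such as "ex:Person"
    kind:  "Ontology", "Classes", "Properties" or "Individuals"
    """
    prefix = qname.split(":", 1)[0]
    # Terms from other namespaces all go into the single "External Terms"
    # section; the ontology node itself always stays in the first section.
    if prefix != own_prefix and kind != "Ontology":
        kind = "External Terms"
    return (SECTION_ORDER.index(kind), qname)


def sort_subjects(subjects, own_prefix):
    """subjects: iterable of (qname, kind) pairs; returns them in section order."""
    return sorted(subjects, key=lambda s: subject_sort_key(s[0], s[1], own_prefix))


if __name__ == "__main__":
    example = [
        ("ex:name", "Properties"),
        ("skos:Concept", "Classes"),
        ("ex:Person", "Classes"),
        ("ex:", "Ontology"),
        ("ex:Agent", "Classes"),
    ]
    for qname, _kind in sort_subjects(example, "ex"):
        print(qname)
```

Sorting on the prefixed name rather than the full URL keeps the order aligned with what the reader actually sees in the file.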

Related links:

- @fititnt in MVP of RDF/Turtle canonization/file formatting for generated dictionaries EticaAI/lexicographi-sine-finibus#46 describes similar enhancements they have done.
  - Another nice idea: emit multiple values as separate lines
  - Sort multiple langStrings by lang rather than value
- Graham Higgins in https://groups.google.com/g/rdflib-dev/c/EUW2fawv4mw mentions that some tools preserve the order of triples, but rdflib is not mentioned as one of them, and this cannot be relied upon.
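To make the two langString ideas concrete, here is a minimal sketch that orders language-tagged values by language tag and writes one value per line. It models literals as plain `(value, lang)` tuples rather than rdflib objects, and `langstring_sort_key` and `format_values` are hypothetical names; the exact indentation is an assumption, not the longturtle layout.

```python
# Hypothetical sketch: order language-tagged literals by language tag first,
# then by value, and emit one value per line. Literals are modelled as
# (value, lang) tuples; lang is None for plain literals.
# Note: no escaping of quotes or special characters is done here.


def langstring_sort_key(literal):
    value, lang = literal
    # Plain literals (no language tag) sort before tagged ones.
    return (lang or "", value)


def format_values(predicate, literals, indent="    "):
    """Render one predicate with each value on its own line."""
    ordered = sorted(literals, key=langstring_sort_key)
    rendered = [
        f'"{value}"@{lang}' if lang else f'"{value}"'
        for value, lang in ordered
    ]
    separator = " ,\n" + indent * 2
    return f"{indent}{predicate}\n{indent * 2}" + separator.join(rendered) + " ;"


if __name__ == "__main__":
    labels = [("Zebra", "en"), ("Apfel", "de"), ("Cheval", "fr")]
    print(format_values("rdfs:label", labels))
```

With this key, editing the text of one translation changes only its own line, because its position depends on the language tag and not on the value.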

AFAIK, rdflib and TQ are the only tools that care about the aesthetics of turtle output.
If this and #2881 are implemented, I'll switch from jena riot to rdflib.


fititnt · 2024-08-11T07:15:40Z

Just saw the mention, so I'll comment quickly (I have no strong opinions that I can recall at the moment).

I don't know who came up with the nice formatting (maybe it wasn't a single person), but it's really, really fantastic: both the formatting itself (at least compared with how this was done 15-20 years ago) and having some way to force programs (even if optional) to try hard to keep a consistent output, because that helps when doing diffs.

One of my use cases for this was running the same scripts that fetch data (for example from Wikidata, or by converting tables from open data sources) and generate several formats (RDF being one of them). If anything changes, even minimally, it is then possible to explore the diffs.

About the specific suggestions of @VladimirAlexiev, I have no opinion on the very specific details (but I do agree with the general idea). Also, it's not merely aesthetics: it helps with diffs.

(And I would also be okay if any of the programs I'm using changed their defaults at some point.)

VladimirAlexiev · 2024-09-09T15:20:47Z

atextor/turtle-formatter is a Jena/Java tool specifically for this purpose

nicholascar · 2024-09-24T05:52:48Z

@VladimirAlexiev @fititnt: I wrote the small changes into RDFLib to make the longturtle format. I can think of further enhancements that could be made to make it even better, and which are likely entirely in line with the suggestions above. I would also like to make longturtle the default turtle format for RDFLib 8.x, which may come out later this year.

Yes, I too read turtle all day and care about how it looks!

So consider longturtle as being under active development and I'll take this Issue as input into improvements for it.

I do want to follow up with the work done in the recent (current?) W3C canonical serialization WG to see if there are better things developed there that we could use in serialization here. Again, suggestions/pointers welcome.

VladimirAlexiev · 2024-09-24T18:42:00Z

I currently use the atextor tools (owl-cli and turtle-formatter) but it would be nice to have competition from python.

Afaik, RDF Canonicalization deals with numbering of blank nodes, not any other layout issues.

Lincoln-GR · 2024-11-11T08:06:28Z

My suggestion for sorting options in the turtle output (long or regular) would be whatever produces the least git diff between 2 slightly different graphs.

Currently the subjects are sorted by (is_bnode, num_of_references, subject) here:
https://github.com/RDFLib/rdflib/blob/main/rdflib/plugins/serializers/turtle.py#L76-L83

The "number of references" part is frustrating: adding a triple with a node as the object in the new version of the graph means that node can move around in the turtle output, even when it and all its properties haven't changed.

If it were sorted just by (is_bnode, subject), the turtle output of our graphs would be a lot more stable between changes.
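A small model of the two sort keys shows the effect. This is a simplified sketch, not the rdflib code linked above: triples are plain string tuples, blank nodes are assumed to be strings starting with `_:`, and `order_by_references` and `order_by_subject` are hypothetical names.

```python
# Simplified model of the two sort keys discussed above; this is not the
# rdflib implementation. Triples are (subject, predicate, object) tuples of
# strings, and blank nodes are strings starting with "_:".
from collections import Counter


def is_bnode(term):
    return term.startswith("_:")


def order_by_references(triples):
    """Order subjects by (is_bnode, number_of_references, subject)."""
    refs = Counter(o for _, _, o in triples)
    subjects = {s for s, _, _ in triples}
    return sorted(subjects, key=lambda s: (is_bnode(s), refs[s], s))


def order_by_subject(triples):
    """Order subjects by (is_bnode, subject) only."""
    subjects = {s for s, _, _ in triples}
    return sorted(subjects, key=lambda s: (is_bnode(s), s))


before = [
    ("ex:a", "ex:p", "1"),
    ("ex:b", "ex:p", "2"),
    ("ex:c", "ex:p", "3"),
]
# One added triple that merely points at ex:a; ex:a itself is unchanged.
after = before + [("ex:c", "ex:knows", "ex:a")]

if __name__ == "__main__":
    print(order_by_references(before), order_by_references(after))
    print(order_by_subject(before), order_by_subject(after))
```

In this model the reference-count key moves `ex:a` from first to last once another node points at it, while the `(is_bnode, subject)` key leaves the order untouched, so the diff would contain only the added triple.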

VladimirAlexiev mentioned this issue Sep 10, 2024

section sorting atextor/turtle-formatter#22

Open

VladimirAlexiev mentioned this issue Sep 17, 2024

Use a build tool qudt/qudt-public-repo#959

Open
