format-xml: Add a script for pretty-printing our XML source #432

wking · 2017-08-03T21:18:58Z

Using Python, because I'm familiar with it, and I've floated a Python script for spdx-spec in spdx/spdx-spec#10. I'm not sure which tools/languages other license-list maintainers are familiar with.

Also using LXML, since it's popular for XML editing within Python.

If you run this script on our current master, you get a lot of changes:

$ ./bin/format-xml.py src
$ git --no-pager diff --shortstat
 270 files changed, 16593 insertions(+), 16945 deletions(-)

I've left those changes off this PR for now while we discuss the feasability of this approach, but I can push them up publically somewhere else if folks want to browse them without running the script locally.

This is laying the ground-work for automatically manipulating our XML, e.g. converting to <bullet> (#414).

zvr · 2017-08-03T21:39:18Z

:-)
I've written a similar one myself.

But:

let's separate the "inline" from the "block" tags
you should get rid of all whitespace formatting on input
maximum line length is nice

wking · 2017-08-03T21:54:33Z

On Thu, Aug 03, 2017 at 02:39:18PM -0700, Alexios Zavras (zvr) wrote: - let's separate the "inline" from the "block" tags

That requires a mapping from the tags to inline/block. And I'm not sure if LXML provides a way to override their inline classification (which may be stored in [1], I'm not sure). If not, you'd need something more sophisticated than the current ‘pretty_print=True’. I think this would be nice, but not strongly enough to want to go through that work ;). Feel free to PR support for it, or to just PR your own script.

- you should get rid of all whitespace formatting on input

Do you mean my current tab-space conversion [2]? I don't think we want tabs in our output XML, and there are only a few licenses where our current XML contains tabs: $ git grep -c "$(echo -en '\t')" origin/master origin/master:src/APL-1.0.xml:11 origin/master:src/GPL-2.0-with-GCC-exception.xml:4 origin/master:src/GPL-2.0-with-autoconf-exception.xml:6 origin/master:src/GPL-2.0-with-bison-exception.xml:4 origin/master:src/GPL-2.0-with-classpath-exception.xml:5 origin/master:src/GPL-2.0-with-font-exception.xml:4 origin/master:src/GPL-3.0-with-autoconf-exception.xml:26 origin/master:src/OSET-PL-2.1.xml:193 origin/master:src/eCos-2.0.xml:8 I see my tab→space conversion as part of a guard against future accidental tab use, with something like Travis CI (or other CI tooling) making sure that whatever pretty-printer we settle on finds nothing to change in proposed PRs.

- maximum line length is nice

I guess, although then you have to decide what it should be ;). If we do have an opinion there, I'd be happy to accept tooling to automate it (note that we don't want to inject bugs like #430 and #431). I'm also fine handling this by hand, since that seems like less work to me then writing a reliable implementation. In any case, I think the output from this script is a lot better than what we have now ;). I'm in favor of landing it or something similar, and then iterating on those tools as we have time to improve things. [1]: http://lxml.de/api/lxml.html.defs-module.html#special_inline_tags [2]: https://github.com/spdx/license-list-XML/pull/432/files#diff-da7eafb36baacd32384e657599594a11R41

zvr · 2017-08-03T22:12:32Z

very quickly, because I have to get ready for a flight in a few hours.

yes, manual mapping of tags to the inline/block set; the built-in pretty_print was not enough, so I re-wrote it
you actually want something like re.sub(r'\s+', ' ', txt)
once the indent is calculated, the text is formatted to maxlength and lines filled, like fmt(1)

of course, as you say, none of this should be a blocking point, and I'd be happy to do another iteration later, when I get back

wking · 2017-08-03T22:30:20Z

On Thu, Aug 03, 2017 at 10:12:32PM +0000, Alexios Zavras (zvr) wrote: - yes, manual mapping of tags to the inline/block set; the built-in pretty_print was not enough, so I re-wrote it

Sounds good to me, as long as it's short and/or you're willing to maintain it ;).

- you actually want something like `re.sub(r'\s+', ' ', txt)`

I'm fine leaving things at this scale up to manual edits. But if we're comfortable not allowing doubled spaces, etc., then it's good to have tooling that enforces it. Are we comfortable forbidding doubled spaces?

- once the indent is calculated, the text is formatted to maxlength and lines filled, like fmt(1)

Do you shell out to fmt? It cites Knuth for detecting line breaks [1], so I doubt it is worth rolling our own version of that. Python also has a library for this sort of thing [2], although you'd want to set break_long_words [3] and break_on_hyphens [4] false to avoid issues like #430 and #431.

… of course, as you say, none of this should be a blocking point, and I'd be happy to do another iteration later, when I get back

And there's no rush either, so I'm happy to wait. [1]: https://www.gnu.org/software/coreutils/manual/html_node/fmt-invocation.html [2]: https://docs.python.org/3/library/textwrap.html [3]: https://docs.python.org/3/library/textwrap.html#textwrap.TextWrapper.break_long_words [4]: https://docs.python.org/3/library/textwrap.html#textwrap.TextWrapper.break_on_hyphens

jlovejoy · 2017-09-14T17:37:36Z

we'll address this in a later release, leaving issue open for discussion then.

wking · 2017-12-14T19:04:09Z

@zvr, have you had time to look over this or push forward with your alternative? The current trailing whitespace issues (#410) are leading to conflicts between in-flight fixups like #496 (which removed trailing whitespace here) and #514 (which makes a semantic change to that line). Basically, folks are contributing who seem to have editors which cleanup whitespace issues, and unless they go through the extra trouble to preserve those existing issues, there are likely to be more of these unnecessary merge conflicts.

zvr · 2017-12-15T09:59:16Z

Yes, the tool is ok and I've been using it for the licenses I am adding.
For typical output, take a loot at #472 or #470 ...

wking · 2017-12-15T15:56:58Z

Yes, the tool is ok and I've been using it...

Can you PR it so we can fix all the other licenses? I'd rather get that churn out of the way, and automated formatting would mean one less thing to manually review.

goneall · 2017-12-15T17:59:21Z

Can you PR it so we can fix all the other licenses? I'd rather get that churn out of the way, and automated formatting would mean one less thing to manually review.

Are we going to require all contributors to use the same pretty printing tool? Seems like there is going to be a lot of churn if we use this for all licenses. Not sure this is something we should be doing right before the release.

wking · 2017-12-15T18:34:47Z

On Fri, Dec 15, 2017 at 05:59:21PM +0000, goneall wrote: Are we going to require all contributors to use the same pretty printing tool?

I'd like to, with a Makefile rule to make it easy to apply and a Travis rule to enforce it.

Seems like there is going to be a lot of churn if we use this for all licenses.

Yes, there will be a single-commit bump for all the existing licenses. But the bump should be whitespace-only, so it should be easy to review. It will require rebasing for any open PRs that edit an existing license, but I'm ok with that because we're seeing similar conflicts already, even without shared format tooling [1].

Not sure this is something we should be doing right before the release.

There are only 14 open pull requests, and with 5 each by @mlinksva and me, that leaves 4 by other people which might need rebasing. I don't think that's too high a cost. And if we want to hustle the 9 that aren't mine in before running a reformat, I'm fine with that. I would like to get it done before the release. As I said above, I expect it to be easy to audit. And churning before the release means that diffing between this release and the next will be much easier. [1]: #432 (comment)

goneall · 2017-12-15T20:02:53Z

And churning before the release means that diffing between this release and the next will be much easier.

Good point. If we're going to pretty print all the XML, this would be the time to do it. Suggest after we merge most of the other PR's to avoid re-basing.

With all of the recent PR's, I'm expecting some changes once we test the XML files using the license generator tool. I'll want to coordinate my fixes either before or after pretty printing.

wking · 2017-12-15T20:27:28Z

On Fri, Dec 15, 2017 at 08:02:54PM +0000, goneall wrote: I'll want to coordinate my fixes either before or after pretty printing.

If we have the tool complete (and it sounds like we do), I'd recommend folks who want to avoid rebasing hold off until we get the tool landed. So the order would be: 1. Any acceptable current PRs. 2. The formatting tool, with a mass reformat. 3. New PRs. I'm fine with all of my existing PRs landing in phase 3 after I rebase them, so don't hold up 2 on my account.

wking · 2017-12-29T20:33:08Z

On Fri, Dec 15, 2017 at 08:02:54PM +0000, goneall wrote: > And churning before the release means that diffing between this > release and the next will be much easier. Good point. If we're going to pretty print all the XML, this would be the time to do it.

With 3.0 cut, this particular ship has sailed. But at the moment, the *only* open PR (besides mine, which I'm happy rebasing) that is touching existing license content is @mlinksva's #489. The other non-me PRs are adding new from-scratch content. So if we can agree on a pretty-printer (@zvr, still no PR for yours?), now seems like a great time to start using it.

wking · 2018-03-26T17:30:09Z

@zvr, still no PR for your pretty-printer? In the absence of that PR, can we land this on so we can get mostly auto-formatted XML? There are still very few PRs in flight that touch existing licenses, and if we are really concerned about merge conflicts I'm fine rolling out the auto-formatting in stages to work around any existing PRs.

zvr · 2018-03-26T19:32:28Z

I have published the pretty printer in https://github.com/zvr/xmlindent -- last touched when the schema changed 5 months ago.
I admit I don't remember the current state, but I used it on the massive update we did back then.

goneall · 2018-12-13T17:55:05Z

Discussed on the legal call 13 Dec 2016. The online tools will pretty print edited XML files prior to the pull request, so this should not be needed in the normal workflow. For other workflows, we would encourage submitter to pretty print prior to submit. Pretty printing after submitting may cause issues identifying change history in git. Decision made to close / reject the PR.

format-xml: Add a script for pretty-printing our XML source

b7183bb

wking mentioned this pull request Aug 3, 2017

Automated PR for RSCPL #197

Merged

wking mentioned this pull request Aug 9, 2017

Automated PR for MPL-2.0-no-copyleft-exception #128

Merged

jlovejoy added this to the Later Release milestone Sep 14, 2017

wking mentioned this pull request Oct 18, 2017

Fix parse errors #446

Merged

This was referenced Dec 17, 2017

fixed CDLA-Sharing-1.0 #521

Merged

Add EUPL-1.2 #511

Merged

jlovejoy added the technical issue label Dec 19, 2017

wking mentioned this pull request Jun 4, 2018

Add alt tag to handle BSD 3rd clause alternative language #650

Merged

goneall closed this Dec 13, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

format-xml: Add a script for pretty-printing our XML source #432

format-xml: Add a script for pretty-printing our XML source #432

wking commented Aug 3, 2017

zvr commented Aug 3, 2017

wking commented Aug 3, 2017 via email

zvr commented Aug 3, 2017

wking commented Aug 3, 2017 via email

jlovejoy commented Sep 14, 2017

wking commented Dec 14, 2017

zvr commented Dec 15, 2017

wking commented Dec 15, 2017

goneall commented Dec 15, 2017

wking commented Dec 15, 2017 via email

goneall commented Dec 15, 2017

wking commented Dec 15, 2017 via email

wking commented Dec 29, 2017 via email

wking commented Mar 26, 2018

zvr commented Mar 26, 2018

goneall commented Dec 13, 2018

format-xml: Add a script for pretty-printing our XML source #432

format-xml: Add a script for pretty-printing our XML source #432

Conversation

wking commented Aug 3, 2017

zvr commented Aug 3, 2017

wking commented Aug 3, 2017 via email

zvr commented Aug 3, 2017

wking commented Aug 3, 2017 via email

jlovejoy commented Sep 14, 2017

wking commented Dec 14, 2017

zvr commented Dec 15, 2017

wking commented Dec 15, 2017

goneall commented Dec 15, 2017

wking commented Dec 15, 2017 via email

goneall commented Dec 15, 2017

wking commented Dec 15, 2017 via email

wking commented Dec 29, 2017 via email

wking commented Mar 26, 2018

zvr commented Mar 26, 2018

goneall commented Dec 13, 2018