-
Notifications
You must be signed in to change notification settings - Fork 299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
format-xml: Add a script for pretty-printing our XML source #432
Conversation
:-) But:
|
On Thu, Aug 03, 2017 at 02:39:18PM -0700, Alexios Zavras (zvr) wrote:
- let's separate the "inline" from the "block" tags
That requires a mapping from the tags to inline/block. And I'm not
sure if LXML provides a way to override their inline classification
(which may be stored in [1], I'm not sure). If not, you'd need
something more sophisticated than the current ‘pretty_print=True’.
I think this would be nice, but not strongly enough to want to go
through that work ;). Feel free to PR support for it, or to just PR
your own script.
- you should get rid of all whitespace formatting on input
Do you mean my current tab-space conversion [2]? I don't think we
want tabs in our output XML, and there are only a few licenses where
our current XML contains tabs:
$ git grep -c "$(echo -en '\t')" origin/master
origin/master:src/APL-1.0.xml:11
origin/master:src/GPL-2.0-with-GCC-exception.xml:4
origin/master:src/GPL-2.0-with-autoconf-exception.xml:6
origin/master:src/GPL-2.0-with-bison-exception.xml:4
origin/master:src/GPL-2.0-with-classpath-exception.xml:5
origin/master:src/GPL-2.0-with-font-exception.xml:4
origin/master:src/GPL-3.0-with-autoconf-exception.xml:26
origin/master:src/OSET-PL-2.1.xml:193
origin/master:src/eCos-2.0.xml:8
I see my tab→space conversion as part of a guard against future
accidental tab use, with something like Travis CI (or other CI
tooling) making sure that whatever pretty-printer we settle on finds
nothing to change in proposed PRs.
- maximum line length is nice
I guess, although then you have to decide what it should be ;). If we
do have an opinion there, I'd be happy to accept tooling to automate
it (note that we don't want to inject bugs like #430 and #431).
I'm also fine handling this by hand, since that seems like less work
to me then writing a reliable implementation.
In any case, I think the output from this script is a lot better than
what we have now ;). I'm in favor of landing it or something similar,
and then iterating on those tools as we have time to improve things.
[1]: http://lxml.de/api/lxml.html.defs-module.html#special_inline_tags
[2]: https://github.com/spdx/license-list-XML/pull/432/files#diff-da7eafb36baacd32384e657599594a11R41
|
very quickly, because I have to get ready for a flight in a few hours.
of course, as you say, none of this should be a blocking point, and I'd be happy to do another iteration later, when I get back |
On Thu, Aug 03, 2017 at 10:12:32PM +0000, Alexios Zavras (zvr) wrote:
- yes, manual mapping of tags to the inline/block set; the built-in
pretty_print was not enough, so I re-wrote it
Sounds good to me, as long as it's short and/or you're willing to
maintain it ;).
- you actually want something like `re.sub(r'\s+', ' ', txt)`
I'm fine leaving things at this scale up to manual edits. But if
we're comfortable not allowing doubled spaces, etc., then it's good to
have tooling that enforces it. Are we comfortable forbidding doubled
spaces?
- once the indent is calculated, the text is formatted to maxlength
and lines filled, like fmt(1)
Do you shell out to fmt? It cites Knuth for detecting line breaks
[1], so I doubt it is worth rolling our own version of that. Python
also has a library for this sort of thing [2], although you'd want to
set break_long_words [3] and break_on_hyphens [4] false to avoid
issues like #430 and #431.
… of course, as you say, none of this should be a blocking point,
and I'd be happy to do another iteration later, when I get back
And there's no rush either, so I'm happy to wait.
[1]: https://www.gnu.org/software/coreutils/manual/html_node/fmt-invocation.html
[2]: https://docs.python.org/3/library/textwrap.html
[3]: https://docs.python.org/3/library/textwrap.html#textwrap.TextWrapper.break_long_words
[4]: https://docs.python.org/3/library/textwrap.html#textwrap.TextWrapper.break_on_hyphens
|
we'll address this in a later release, leaving issue open for discussion then. |
@zvr, have you had time to look over this or push forward with your alternative? The current trailing whitespace issues (#410) are leading to conflicts between in-flight fixups like #496 (which removed trailing whitespace here) and #514 (which makes a semantic change to that line). Basically, folks are contributing who seem to have editors which cleanup whitespace issues, and unless they go through the extra trouble to preserve those existing issues, there are likely to be more of these unnecessary merge conflicts. |
Can you PR it so we can fix all the other licenses? I'd rather get that churn out of the way, and automated formatting would mean one less thing to manually review. |
Are we going to require all contributors to use the same pretty printing tool? Seems like there is going to be a lot of churn if we use this for all licenses. Not sure this is something we should be doing right before the release. |
On Fri, Dec 15, 2017 at 05:59:21PM +0000, goneall wrote:
Are we going to require all contributors to use the same pretty printing tool?
I'd like to, with a Makefile rule to make it easy to apply and a
Travis rule to enforce it.
Seems like there is going to be a lot of churn if we use this for
all licenses.
Yes, there will be a single-commit bump for all the existing licenses.
But the bump should be whitespace-only, so it should be easy to
review. It will require rebasing for any open PRs that edit an
existing license, but I'm ok with that because we're seeing similar
conflicts already, even without shared format tooling [1].
Not sure this is something we should be doing right before the
release.
There are only 14 open pull requests, and with 5 each by @mlinksva and
me, that leaves 4 by other people which might need rebasing. I don't
think that's too high a cost. And if we want to hustle the 9 that
aren't mine in before running a reformat, I'm fine with that.
I would like to get it done before the release. As I said above, I
expect it to be easy to audit. And churning before the release means
that diffing between this release and the next will be much easier.
[1]: #432 (comment)
|
Good point. If we're going to pretty print all the XML, this would be the time to do it. Suggest after we merge most of the other PR's to avoid re-basing. With all of the recent PR's, I'm expecting some changes once we test the XML files using the license generator tool. I'll want to coordinate my fixes either before or after pretty printing. |
On Fri, Dec 15, 2017 at 08:02:54PM +0000, goneall wrote:
I'll want to coordinate my fixes either before or after pretty
printing.
If we have the tool complete (and it sounds like we do), I'd recommend
folks who want to avoid rebasing hold off until we get the tool
landed. So the order would be:
1. Any acceptable current PRs.
2. The formatting tool, with a mass reformat.
3. New PRs.
I'm fine with all of my existing PRs landing in phase 3 after I rebase
them, so don't hold up 2 on my account.
|
On Fri, Dec 15, 2017 at 08:02:54PM +0000, goneall wrote:
> And churning before the release means that diffing between this
> release and the next will be much easier.
Good point. If we're going to pretty print all the XML, this would
be the time to do it.
With 3.0 cut, this particular ship has sailed. But at the moment, the
*only* open PR (besides mine, which I'm happy rebasing) that is
touching existing license content is @mlinksva's #489. The other
non-me PRs are adding new from-scratch content. So if we can agree on
a pretty-printer (@zvr, still no PR for yours?), now seems like a
great time to start using it.
|
@zvr, still no PR for your pretty-printer? In the absence of that PR, can we land this on so we can get mostly auto-formatted XML? There are still very few PRs in flight that touch existing licenses, and if we are really concerned about merge conflicts I'm fine rolling out the auto-formatting in stages to work around any existing PRs. |
I have published the pretty printer in https://github.com/zvr/xmlindent -- last touched when the schema changed 5 months ago. |
Discussed on the legal call 13 Dec 2016. The online tools will pretty print edited XML files prior to the pull request, so this should not be needed in the normal workflow. For other workflows, we would encourage submitter to pretty print prior to submit. Pretty printing after submitting may cause issues identifying change history in git. Decision made to close / reject the PR. |
Using Python, because I'm familiar with it, and I've floated a Python script for spdx-spec in spdx/spdx-spec#10. I'm not sure which tools/languages other license-list maintainers are familiar with.
Also using LXML, since it's popular for XML editing within Python.
If you run this script on our current master, you get a lot of changes:
I've left those changes off this PR for now while we discuss the feasability of this approach, but I can push them up publically somewhere else if folks want to browse them without running the script locally.
This is laying the ground-work for automatically manipulating our XML, e.g. converting to
<bullet>
(#414).