Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Candidate clang-format config file #603

Closed
wants to merge 2 commits into from

Conversation

kinkie
Copy link
Contributor

@kinkie kinkie commented Apr 14, 2020

This is a candidate to have a discussion about whether
to adopt clang-format. This configuration is compatible
with clang-format-9 and later

@kinkie
Copy link
Contributor Author

kinkie commented Apr 14, 2020

To run it, in the root of the source tree:
$ clang-format --style=file -i --verbose $(find . -name '*.cc' -or -name *.h -or -name *.c)

@rousskov rousskov marked this pull request as draft April 14, 2020 20:07
Copy link
Contributor

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To approve painful changes, it would be nice to see a well-articulated proposal covering the obvious concerns and any known caveats: Why do you think that clang-format advantages are worth the large amount of conflicts the switch will create across branches? What are those advantages, and why are they worth the cost? Can clang-format do much better than mimicking our current (very limited) Code Style? Etc.

As you know from my squid-dev email, I only looked at a small portion of the previous incarnation of this branch but, FWIW, I saw:

  • many small improvements (e.g., [1]),
  • some small regressions (e.g., [2,3]),
  • more complex #include ordering rules (e.g., [4]),
  • numerous bad comment movements (e.g., [5]),
  • numerous pointless(?) last empty line removals (e.g., [6]).

I suspect some of the problems mentioned above can be reduced via clang-format configuration adjustments, but I did not study this issue.


The old references below were posted on squid-dev earlier but no longer work. I did not invest the time necessary to refresh them for this PR. Their English descriptions can be used to navigate the diff.

  1. Better tertiary operator spacing near master...kinkie:clang-format#diff-94b81725fceba313922066c3ce6071a7R19

  2. Broken xgetnameinfo() formatting near master...kinkie:clang-format#diff-31086c4678ad7442b4024e422f6e4351R318

  3. Extra ip6_parsenumeric() function name indent near master...kinkie:clang-format#diff-31086c4678ad7442b4024e422f6e4351R337

  4. This was discussed in a few comments of another PR, starting with Maintenance: Move sort-includes.pl to scripts/maintenance/ #599 (comment)

  5. Manual comment alignment for related code lines is a Bad Idea, but even if we do automate all such alignments, Squid code is not written to help clang identify related lines correctly as illustrated by the
    jobs_ member declaration changes near master...kinkie:clang-format#diff-57b6a8db44fb379395f9be80a4c44c5aR82

  6. Virtually any modified source file can be used as an example, but here is one (compat/inet_pton.h) where nothing else has changed: master...kinkie:clang-format#diff-563683e32b22291b9d5622652ec544d9L33

@rousskov rousskov self-requested a review April 14, 2020 20:28
@kinkie
Copy link
Contributor Author

kinkie commented Apr 14, 2020

Moving the conversation over from PR599:

Alex: I would not object to switching to this new sorting order if there are good reasons to switch.

Kinkie: Squid is currently working fine, so it's not possible to come with any hard evidence.

I do not see the relevance of the above assertion. We strive to make Squid headers independent, and most of them probably are. Thus, virtually any order (with squid.h being first) should "work fine" from Squid building point of view (or can be easily fixed if needed).

I think a rule makes sense to avoid circular dependencies. Sure, we could chase them but what's the benefit?

My main observation is that it may be more intuitive to keep interface and implementation more closely coupled than we do now.

Yes, I am sure that many would feel that the interface-first order is more intuitive. My primary concern is that it is very difficult to define "interface" precisely in many real cases, especially in Squid environment, for humans and programs alike. The more rules require human interpretation, and the more complex that interpretation is, the more wasteful conflicts we have. From conflict-avoidance and automation point of view, "[case-insensitive] alphabetical order" is a much better rule than "interface file(s) first and then [case-sensitive] alphabetical order".\

In terms of translation units, it's not too difficult. In the end it's a convention and no more.
But your point makes sense. I'll try it.

I am not a fan of being locked to a particular version of a tool, like we are with astyle right now.

There is a significant difference between the formatting rule being based on version X of the tool and being forced to use version X. When everything is fully automated, a developer can use any tool version as long as the result complies with the rule. For stable, well-written, and properly configured tools, upgrades (or reasonable version differences) are not a problem at all. Astyle was not such a tool when we started using it. Based on your clang-format experiments, clang-format is not quite there yet either (or our configuration does not cover some formatting aspects that it should cover)

There is for sure space for iterations; the configuration is a first attempt.

clang-format covers a much bigger spectrum of code aspects, and as such it may be interesting for us.

Yes, it is an argument for switching to clang-format if we are sure that modern astyle cannot control those formatting aspects. To avoid misunderstanding: There is no doubt in my mind that clang-format has the potential to be much better than astyle (because of what kind of information a clang-based tool has access to). I just do not know how much of that potentially available information clang-format is using now and what the current clang-format implementation quality is. The potential is certainly there!

Some high visibility projects are using it (among others, chromium, webkit, qt, linux). Of course this is not to say that we need to, just that it has been tested in large codebases and found adequate.

If we adopt it, it is likely there will be aspects of our coding style we'll need to change if we want to adopt the tool. The question is if it's worth it. I am quite convinced it is, but my level of acceptance of large changes is notoriously higher than average

Amos: The issue with tool differences is not whether version X+1 changes things version X missed. But whether any version X undoes changes version Y makes.

I would not formulate it like this. IMO, the primary "issue" here is any surprise/uncontrolled change of official Squid sources. Once we fully automate formatting, we do not want Squid sources to suddenly change because somebody ran "sudo apt full-upgrade" on the CI box. And before that full automation goal is reached, we want to reduce surprises and reformatting during PR review/merging process.

From my experiment, this risk is very low. It is also possible to use git commit hooks to ensure that new PRs are already conformant when they are submitted. Our current ad hoc solution makes this less feasible.

Again, this does not mean that others cannot use other tool versions. It means that if they do, they may face formatting changes down the road. We want to minimize that.

Sure

@rousskov
Copy link
Contributor

The question is if it's worth it. I am quite convinced it is

FYI: I do not recall seeing a coherent set of arguments convincing others that this change is worth it. Perhaps the reasons are obvious to you. They are not to me, and I cannot get the answers myself without investing a lot of time. The current lack of a well-articulated proposal does not mean we should not switch, of course. It means that we are operating in the dark, discussing without structure, and that is clearly a bad way to make decisions.

It is also possible to use git commit hooks to ensure that new PRs are already conformant when they are submitted. Our current ad hoc solution makes this less feasible.

In what way does an astyle-based solution make anything less feasible? I see no serious feasibility differences between astyle and clang-format as far as formatting automation is concerned. What am I missing?

@kinkie
Copy link
Contributor Author

kinkie commented Apr 14, 2020

To approve painful changes, it would be nice to see a well-articulated proposal covering the obvious concerns and any known caveats: Why do you think that clang-format advantages are worth the large amount of conflicts the switch will create across branches? What are those advantages, and why are they worth the cost? Can clang-format do much better than mimicking our current (very limited) Code Style? Etc.

As you know from my squid-dev email, I only looked at a small portion of the previous incarnation of this branch but, FWIW, I saw:

  • many small improvements (e.g., [1]),
  • some small regressions (e.g., [2,3]),

[2] doesn't seem like a regression but it's not what we want either.

[3] We can probably get rid of SQUIDCEXTERN by now, I suppose that is what's confusing the parser; there is not a lot of C calling into c++ anymore, is there?

  • more complex #include ordering rules (e.g., [4]),

I tested your idea of using negative weights to enforce our rules to override the "main include" concept and it works.
It also reformats c files which we are currently ignoring. Whether it is a good or a bad idea is up for discussion

  • numerous bad comment movements (e.g., [5]),

We can try to switch the AlignTrailingComments to false

  • numerous pointless(?) last empty line removals (e.g., [6]).

This cannot be configured that I could see. Is it a dealbreaker?

I suspect some of the problems mentioned above can be reduced via clang-format configuration adjustments, but I did not study this issue.

The old references below were posted on squid-dev earlier but no longer work. I did not invest the time necessary to refresh them for this PR. Their English descriptions can be used to navigate the diff.

I've added them as code blocks below.

  1. Better tertiary operator spacing near master...kinkie:clang-formatdiff-94b81725fceba313922066c3ce6071a7R19

I think you're referring to compat/GnuRegex.c:

                     dend = ((d >= string1 && d <= end1)
-                            ? end_match_1 : end_match_2);
+                                ? end_match_1
+                                : end_match_2);

  1. Broken xgetnameinfo() formatting near master...kinkie:clang-formatdiff-31086c4678ad7442b4024e422f6e4351R318
@@ -19,9 +19,8 @@ SQUIDCEXTERN int xgetnameinfo(const struct sockaddr *sa,
                               size_t hostlen,
                               char *serv,
                               size_t servlen,
-                              int flags );
+                              int flags);
  1. Extra ip6_parsenumeric() function name indent near master...kinkie:clang-formatdiff-31086c4678ad7442b4024e422f6e4351R337
 #if INET6
 static int
-ip6_parsenumeric(sa, addr, host, hostlen, flags)
-const struct sockaddr *sa;
+    ip6_parsenumeric(sa, addr, host, hostlen, flags)
+        const struct sockaddr *sa;
 const char *addr;
 char *host;
 size_t hostlen;
  1. This was discussed in a few comments of another PR, starting with #599 (comment)
  1. Manual comment alignment for related code lines is a Bad Idea, but even if we do automate all such alignments, Squid code is not written to help clang identify related lines correctly as illustrated by the
    jobs_ member declaration changes near master...kinkie:clang-formatdiff-57b6a8db44fb379395f9be80a4c44c5aR82
  2. Virtually any modified source file can be used as an example, but here is one (compat/inet_pton.h) where nothing else has changed: master...kinkie:clang-formatdiff-563683e32b22291b9d5622652ec544d9L33

@rousskov
Copy link
Contributor

Unfortunately, your guesses for old references 1 and 2 were wrong. I cannot quickly supply correct snippets because the PR branch code they were based on is no longer easily available on GitHub (one more reason to avoid force-pushes in publicly shared branches and to provide stable diffs folks can annotate). I doubt it is critically important to restore all of those references though -- I am sure you can find a lot more improvements and regressions than I did if you carefully inspect the current diff.

You have correctly restored the old reference 3, thank you! I doubt that regression is related to SQUIDCEXTERN because ip6_parsenumeric() is not declared SQUIDCEXTERN AFAICT, but it is probably easy for you (and currently difficult for me) to check.

Please do not misinterpret my six items as a comprehensive list. I just wanted to provide a glimpse at what issues a proper "formatter replacement" proposal should investigate/cover/address IMO. I am not working on such a proposal.

@yadij
Copy link
Contributor

yadij commented Apr 17, 2020

Good point. I will build a PR for SQUIDCEXTERN removal tonight. All the src/ ones can simply erase, the include and compat ones need converting to C++ compatible sub-scope form.

@kinkie
Copy link
Contributor Author

kinkie commented Apr 18, 2020

The question is if it's worth it. I am quite convinced it is

FYI: I do not recall seeing a coherent set of arguments convincing others that this change is worth it. Perhaps the reasons are obvious to you. They are not to me, and I cannot get the answers myself without investing a lot of time. The current lack of a well-articulated proposal does not mean we should not switch, of course. It means that we are operating in the dark, discussing without structure, and that is clearly a bad way to make decisions.

It is also possible to use git commit hooks to ensure that new PRs are already conformant when they are submitted. Our current ad hoc solution makes this less feasible.

In what way does an astyle-based solution make anything less feasible? I see no serious feasibility differences between astyle and clang-format as far as formatting automation is concerned. What am I missing?

I'll try to focus on providing a reasonable set of arguments to try and argue why clang-format is better for us than the existing astyle + formatter.pl + other custom solutions as we are using now

One thing that is I think not under contention is that using some tool to enforce some coding style is beneficial; this reduces friction and formatting changes due to individual contributors' own styles and tooling configuration differences. To get this benefit, it doesn't really matter what the style is as long as it's consistent.

  • as shown by my experiments, clang-format is doing a way better job as being consistent across releases. With astyle we are effectively locked at one specific, quite old, release. Upgrading it is painful, and that release is not included in any current distribution which means we need to rely on the current automatic reformatting commits
  • astyle is a very simplistic parser; clang-format uses the clang tokenizer and parser and this gives a high degree of confidence that any valid c++ is going to be correctly interpreted
  • (this is anecdotal, but it feels a bit overkill to do an in-depth analysis) astyle supports and can enforce a more limited set of coding rules than clang-format, ensuring a lesser amount of consistency
  • plenty of high-profile projects are using clang-format. This includes linux, chromium, chrome, webkit, qt, mozilla, blender. This doesn't mean that we need to follow the herd, but we have learnt for instance in switching revision control systems, or when we decided to lean more on stdlib than our own container, that there are benefits in doing so
  • there are integrations of clang-format in all major IDEs and editors (https://clang.llvm.org/docs/ClangFormat.html#vim-integration). This is of course not mandatory, but it can help reduce friction
  • if we decide to go with it, we can implement different migration strategies: flag day (which I favour, if we can bring in feature branches with low friction), apply on changed files, possibly helping developers with git pre-commit hooks, apply on changed lines (via https://clang.llvm.org/docs/ClangFormat.html#script-for-patch-reformatting). The latter strategy is not available in astyle, as far as I can tell

For a post explaining why and how mongodb switched from astyle to clang-format , please see https://engineering.mongodb.com/post/succeeding-with-clangformat-part-1-pitfalls-and-planning/

I hope these arguments will be enough to convince that there is benefit in going ahead; if so, we can discuss about what style to adopt, where the tool supports (or mandates) something which is not covered in our existing guidelines

@rousskov
Copy link
Contributor

I'll try to focus on providing a reasonable set of arguments to try and argue why clang-format is better for us than the existing astyle + formatter.pl + other custom solutions as we are using now

Thank you. FTR, I have been asking for analysis not because I have serious specific/substantiated doubts about clang-format advantages, but because I think we should make well-informed decisions, especially when it comes to large-cost changes. Please keep this overall goal in mind when reading my responses below. We need good information to make a decision. This is not an exersize of convincing me (or anybody else) that you are right. This is an exercise in presenting a high-quality analysis.

One thing that is I think not under contention is that using some tool to enforce some coding style is beneficial

Yes, of course. I would even go one step further -- the more [reasonable] styling we can automatically apply/enforce, the better. Ideally, all styling should be automatically applied/enforced.

  • as shown by my experiments, clang-format is doing a way better job as being consistent across releases.

FYI: AFAICT, the posted experiment does not demonstrate what you are describing because it does not compare clang-format changes across recent releases (those are shown) with astyle changes across recent releases (not shown AFAICT).

With astyle we are effectively locked at one specific, quite old, release. Upgrading it is painful

The last part, which is crucial to your argument, has not been fairly demonstrated AFAICT. To be fair, we have to show lots of changes beyond our control when upgrading from Astyle v3.0 to v3.1. The current ancient Astyle version we use is a red herring here IMO -- clearly, we could have upgraded it much earlier. This is less important, but I can speculate that clang-format releases led to significant formatting changes in the earlier days of that tool as well.

  • astyle is a very simplistic parser; clang-format uses the clang tokenizer and parser and this gives a high degree of confidence that any valid c++ is going to be correctly interpreted

Yes, although I would rephrase this assertion to emphasize the long-term potential of a clang-based tool rather than our confidence in the absence of clang-format bugs: Having access to higher quality input does not necessarily imply fewer output bugs. It only means that clang-format can (potentially) do more. Other factors decrease the probability of clang-format bugs a lot more than seeing a detailed AST IMO.

  • (this is anecdotal, but it feels a bit overkill to do an in-depth analysis) astyle supports and can enforce a more limited set of coding rules than clang-format, ensuring a lesser amount of consistency

Can you quickly estimate the number of configuration options that astyle has? That clang-format has? I do not think we need an in-depth analysis here. I hope that an estimate or two would either support this point well or remove it from the consideration.

  • plenty of high-profile projects are using clang-format.

Can you make the opposite argument about astyle? If not, this is not a relevant claim (or, to be more precise, this is a necessary precondition for using a tool).

  • there are integrations of clang-format in all major IDEs and editors

Yes, this is valuable, assuming there are no (or fewer) astyle integrations.

  • if we decide to go with it, we can implement different migration strategies: flag day (which I favour, if we can bring in feature branches with low friction), apply on changed files, possibly helping developers with git pre-commit hooks,

Astyle supports all that as well AFAICT so this first part is unimportant for this analysis IMO.

apply on changed lines (via https://clang.llvm.org/docs/ClangFormat.html#script-for-patch-reformatting). The latter strategy is not available in astyle, as far as I can tell

This may be important. If we switch, we will go for a flag day, but the ability to reformat patches may be very valuable anyway.

I hope these arguments will be enough to convince that there is benefit in going ahead

If I ignore invalid and incomplete arguments, then I am left with "higher long-term potential", "better integration with editors", and "patch reformatting support". Out of those three, only the "better integration with editors" argument can be honestly used as a partial justification for switching "today". I do not think that is enough to warrant such a costly switch.

However, I hope you can demonstrate that the clang-format offers a lot more control over the sources, which could (with all the other factors being equal or better for clang-format) be enough to warrant a switch.

Also, if possible, please see if you can find other examples of largish projects switching from astyle to clang-format (and confirm that you cannot find examples of the opposite switches!). This information can boost our confidence that we are on the right path (even if those projects switch for wrong or unrelated reasons, like the mongodb example you linked to).

@kinkie
Copy link
Contributor Author

kinkie commented Apr 24, 2020

I'll try to focus on providing a reasonable set of arguments to try and argue why clang-format is better for us than the existing astyle + formatter.pl + other custom solutions as we are using now

Thank you. FTR, I have been asking for analysis not because I have serious specific/substantiated doubts about clang-format advantages, but because I think we should make well-informed decisions, especially when it comes to large-cost changes. Please keep this overall goal in mind when reading my responses below. We need good information to make a decision. This is not an exersize of convincing me (or anybody else) that you are right. This is an exercise in presenting a high-quality analysis.

One thing that is I think not under contention is that using some tool to enforce some coding style is beneficial

Yes, of course. I would even go one step further -- the more [reasonable] styling we can automatically apply/enforce, the better. Ideally, all styling should be automatically applied/enforced.

I agree.

  • as shown by my experiments, clang-format is doing a way better job as being consistent across releases.

FYI: AFAICT, the posted experiment does not demonstrate what you are describing because it does not compare clang-format changes across recent releases (those are shown) with astyle changes across recent releases (not shown AFAICT).

With astyle we are effectively locked at one specific, quite old, release. Upgrading it is painful

The last part, which is crucial to your argument, has not been fairly demonstrated AFAICT. To be fair, we have to show lots of changes beyond our control when upgrading from Astyle v3.0 to v3.1. The current ancient Astyle version we use is a red herring here IMO -- clearly, we could have upgraded it much earlier. This is less important, but I can speculate that clang-format releases led to significant formatting changes in the earlier days of that tool as well.

It is indeed the case that astyle version 3.1 has the same results as 2.04; we can probably remove the version check. @yadij , what do you think?

  • astyle is a very simplistic parser; clang-format uses the clang tokenizer and parser and this gives a high degree of confidence that any valid c++ is going to be correctly interpreted

Yes, although I would rephrase this assertion to emphasize the long-term potential of a clang-based tool rather than our confidence in the absence of clang-format bugs: Having access to higher quality input does not necessarily imply fewer output bugs. It only means that clang-format can (potentially) do more. Other factors decrease the probability of clang-format bugs a lot more than seeing a detailed AST IMO.

  • (this is anecdotal, but it feels a bit overkill to do an in-depth analysis) astyle supports and can enforce a more limited set of coding rules than clang-format, ensuring a lesser amount of consistency

Can you quickly estimate the number of configuration options that astyle has? That clang-format has? I do not think we need an in-depth analysis here. I hope that an estimate or two would either support this point well or remove it from the consideration.

I have counted 54 options in astyle and 107 in clang-format. I have skimmed through the manuals, and tried to group options influencing the same behavior in different ways.

  • plenty of high-profile projects are using clang-format.

Can you make the opposite argument about astyle? If not, this is not a relevant claim (or, to be more precise, this is a necessary precondition for using a tool).

A quick search on google has not found any major project mentioning astyle in their check-in instructions.

  • there are integrations of clang-format in all major IDEs and editors

Yes, this is valuable, assuming there are no (or fewer) astyle integrations.

  • if we decide to go with it, we can implement different migration strategies: flag day (which I favour, if we can bring in feature branches with low friction), apply on changed files, possibly helping developers with git pre-commit hooks,

Astyle supports all that as well AFAICT so this first part is unimportant for this analysis IMO.

apply on changed lines (via https://clang.llvm.org/docs/ClangFormat.html#script-for-patch-reformatting). The latter strategy is not available in astyle, as far as I can tell

This may be important. If we switch, we will go for a flag day, but the ability to reformat patches may be very valuable anyway.

I agree.

I hope these arguments will be enough to convince that there is benefit in going ahead

If I ignore invalid and incomplete arguments, then I am left with "higher long-term potential", "better integration with editors", and "patch reformatting support". Out of those three, only the "better integration with editors" argument can be honestly used as a partial justification for switching "today". I do not think that is enough to warrant such a costly switch.

However, I hope you can demonstrate that the clang-format offers a lot more control over the sources, which could (with all the other factors being equal or better for clang-format) be enough to warrant a switch.

Is the comparison in number of option sufficient to convince you of that? In order to go more in depth I would have to manually compare the semantic of each of these 100 different options, you can imagine how I am not really keen to do it

Also, if possible, please see if you can find other examples of largish projects switching from astyle to clang-format (and confirm that you cannot find examples of the opposite switches!). This information can boost our confidence that we are on the right path (even if those projects switch for wrong or unrelated reasons, like the mongodb example you linked to).

astyle -> clang-format: I've found open-quantum-safe (open-quantum-safe/liboqs#56), openthread (openthread/openthread#2533) .
No evidence of anybody going from clang-format to astyle.

Interestigly, I've stumbled upon whatastyle (https://pypi.org/project/whatstyle/), which claims to try and match a style format best matching the style a source is written in.

@rousskov
Copy link
Contributor

I have counted 54 options in astyle and 107 in clang-format.

Thank you for reporting this. These stats do not raise any red flags, but the difference alone is not substantial enough to claim a proof of a significant advantage IMO.

Alex: I hope you can demonstrate that the clang-format offers a lot more control over the sources

Kinkie: Is the comparison in number of option sufficient to convince you of that?

No, that would be rather irrational of me. It is just a good stepping stone towards a convincing demonstration. After all, it is not about the raw number of options but what they control. I was hoping for a more drastic difference in raw numbers (e.g., an order of magnitude) or a completely different approach to specifying the desirable format.

In order to go more in depth I would have to manually compare the semantic of each of these 100 different options, you can imagine how I am not really keen to do it

FWIW, I do not think that comparing the semantics of each option is necessary or even helpful for those reviewing the "let's switch to clang-format" proposal. It would simply overwhelm the reviewers with low-level information. If we are going to switch, then I would expect somebody to go through those 107 options and set them the best we can (including leaving some at their defaults), but that activity can probably wait and is rather different from the comparative option-by-option analysis you rightfully want to avoid.

The proposal needs to demonstrate that clang-format is substantially better and, hence, worth the switching costs. If I were to make that proposal, I would try to show several specific areas of significant improvements over what modern astyle can do and demonstrate that there are no serious regressions. With 100 configuration options, I suspect both demonstrations are possible, but past demonstrations did neither.

With your updates in mind, I will re-state the current pros and cons in the next comment.

@rousskov
Copy link
Contributor

AFAICT, here are the current switching arguments, with a few invalid ones thrown away and a few adjustments based on Francesco's research and the discussion:

Pros:

  1. Clang-format can control X and Y and Z formatting aspects which even the latest astyle cannot handle. These formatting aspects are important to us. See examples at ... and ... and ...

  2. Clang-format supports integration with major IDEs and editors.

  3. Clang-format may be able to format patches.

  4. Clang-format is popular. We know of large projects using or switching to clang-format. We cannot find anybody switching in the opposite direction. Astyle has lost the popularity contest and, hence, is less likely to attract a lot of development/support energy going forward. This is not a valid argument for switching now, but it demonstrates that we will likely want to switch in the foreseeable future so at least some switching costs are likely inevitable.

Cons:

  1. Reformatting master (and v5) will cause a lot of conflicts in pending development branches.

  2. Not reformatting v4 (and v5) will increase backporting costs.

  3. We will need to adjust our formatting scripts.

  4. Applying clang-format causes formatting regressions A, B, and C. See examples at ... and ... and ....

Conclusion: Clang-format has several important advantages (see pros1 and pros2) and causes no important formatting regressions (see cons4 for a list of minor ones). Switching is likely so we should start enjoying a better tool sooner rather than later. Let's work on the best format/configuration file that we can agree on and switch.

All that is left is filling out X, Y, Z as well as A, B, C...

@kinkie
Copy link
Contributor Author

kinkie commented Apr 27, 2020

AFAICT, here are the current switching arguments, with a few invalid ones thrown away and a few adjustments based on Francesco's research and the discussion:

Pros:

  1. Clang-format can control X and Y and Z formatting aspects which even the latest astyle cannot handle. These formatting aspects are important to us. See examples at ... and ... and ...

From our code style guidelines:

  • "no sets of multiple empty lines"
    clang-format has directive "MaxEmptyLinesToKeep". astyle has no equivalent support

  • "one space between if and its parameter () brackets" - clang-format supports it via the SpaceBeforeParens: ControlStatements dirctive, I couldn't find anything similar in astyle

  • clang-format has a more flexible support for brace control, allowing more flexibility about newlines around them (via BreakBeforeBraces: Custom / BraceWrapping directive)

  • clang-format supports handy readability-enhancing features: SpaceBeforeAssignmentOperators , SpaceBeforeCpp11BracedList, SpaceBeforeInheritanceColon, SpaceBeforeRangeBasedForLoopColon

  1. Clang-format supports integration with major IDEs and editors.
  2. Clang-format may be able to format patches.
  3. Clang-format is popular. We know of large projects using or switching to clang-format. We cannot find anybody switching in the opposite direction. Astyle has lost the popularity contest and, hence, is less likely to attract a lot of development/support energy going forward. This is not a valid argument for switching now, but it demonstrates that we will likely want to switch in the foreseeable future so at least some switching costs are likely inevitable.

Cons:

  1. Reformatting master (and v5) will cause a lot of conflicts in pending development branches.
  2. Not reformatting v4 (and v5) will increase backporting costs.
  3. We will need to adjust our formatting scripts.
    I can do this
  4. Applying clang-format causes formatting regressions A, B, and C. See examples at ... and ... and ....
  • compat/GnuRegex.c is a bit of a basket case. It is actually made somewhat less readable by the transformation. We probably want to only apply the transformation to .cc files
  • with the default setting of ColumnLimit to 0, some conditionals that are right now on multiple lines get merged onto a single longer line. OTOH setting any arbitrary column limit will explode the size of the change and it's likely to hit some corner case anyway
  • we are most likely tightening the code style rules. This is arguably a good thing, but it's still a chagne

Conclusion: Clang-format has several important advantages (see pros1 and pros2) and causes no important formatting regressions (see cons4 for a list of minor ones). Switching is likely so we should start enjoying a better tool sooner rather than later. Let's work on the best format/configuration file that we can agree on and switch.

All that is left is filling out X, Y, Z as well as A, B, C...

@yadij
Copy link
Contributor

yadij commented Apr 28, 2020

compat/GnuRegex.c is a bit of a basket case. It is actually made somewhat less readable by the transformation. We probably want to only apply the transformation to .cc files

IIRC the delay in moving to std::regex was due to CentOS compiler being outdated. With CentOS 7 hitting EOL end of this year I hope we can resolve all these problems in v6 anyway.

@kinkie
Copy link
Contributor Author

kinkie commented Apr 28, 2020

compat/GnuRegex.c is a bit of a basket case. It is actually made somewhat less readable by the transformation. We probably want to only apply the transformation to .cc files

IIRC the delay in moving to std::regex was due to CentOS compiler being outdated. With CentOS 7 hitting EOL end of this year I hope we can resolve all these problems in v6 anyway.

Centos 6 is hitting EOL on 2020-11-30; Centos 7 is EOL'ed on 2024-06-30. But yes, I agree.

@rousskov
Copy link
Contributor

@kinkie, thank you for providing verbal descriptions of anticipated improvement examples. To be honest, I expected improvements that are a lot more valuable than the ones currently listed, but perhaps I was overly optimistic about current clang-format abilities. Or perhaps clang-format primary advantages are not in the individual formatting options.

I have a related question: Is there a way to configure clang-format so that a given large piece of non-trivial C++ code will always be formatted exactly the same way, no matter how the developer formats that code? I am hoping for a "yes" answer. A "no" answer to that question would imply that current clang-format just cannot control some C++ code formatting aspects; in other words, there exist two different inputs (of the same C++ code) such that their formatted versions will remain different.

The formatting regressions list feels inadequate to me because it does not contain previously discussed formatting regressions. There absence makes the reader wonder whether there are a lot more problems that we are currently unaware of. An honest comprehensive disclosure of formatting regressions is an important prerequisite for going forward IMO.

Please note that unlike the X, Y, Z examples in my template, the list of known regressions (A, B, C, ...) has to be as comprehensive as we can make it. We do not need a link to every regression of the same kind, but we do need a list item for every regression kind that we know about.

The current list also seems to focus on the amount of changes, which is already covered in another "cons" bullet. When building the formatting regressions list, the amount of changes is irrelevant. In this context, we should welcome any change that is improving formatting or, at the very least, makes it more consistent. This list should focus on formatting that became worse after applying clang-format.

@kinkie
Copy link
Contributor Author

kinkie commented Apr 28, 2020

@kinkie, thank you for providing verbal descriptions of anticipated improvement examples. To be honest, I expected improvements that are a lot more valuable than the ones currently listed, but perhaps I was overly optimistic about current clang-format abilities. Or perhaps clang-format primary advantages are not in the individual formatting options.

One advantage I see is we could move away from our current descriptive code guidelines and onto "format with clang-format version X or later and this configuration file".

I have a related question: Is there a way to configure clang-format so that a given large piece of non-trivial C++ code will always be formatted exactly the same way, no matter how the developer formats that code? I am hoping for a "yes" answer. A "no" answer to that question would imply that current clang-format just cannot control some C++ code formatting aspects; in other words, there exist two different inputs (of the same C++ code) such that their formatted versions will remain different.

The answer is "no". I've ran an experiment: take client_side.cc, arguably our most complicated source file, format it with clang-format-10, then run it through a random astyle style, and then clang-format-10 again and compare the differences.
There are 10 differences:

-    virtual bool canDial(AsyncCall &) const { return true; }
-    virtual void dial(AsyncCall &) { (handler)(portCfg, portTypeNote, sub); }
+    virtual bool canDial(AsyncCall &) const
+    {
+        return true;
+    }
+    virtual void dial(AsyncCall &)
+    {
+        (handler)(portCfg, portTypeNote, sub);
+    }

or

-    if (!sslServerBump || sslServerBump->act.step1 == Ssl::bumpClientFirst) {  // Either means client-first.
+    if (!sslServerBump || sslServerBump->act.step1 == Ssl::bumpClientFirst)  // Either means client-first.
+    {

Maybe there is an opportunity to make the configuration more prescritptive, or we can accept this

The formatting regressions list feels inadequate to me because it does not contain previously discussed formatting regressions. There absence makes the reader wonder whether there are a lot more problems that we are currently unaware of. An honest comprehensive disclosure of formatting regressions is an important prerequisite for going forward IMO.

getnameinfo.cc is quite messy, PR#619 is meant to clean it up, and it takes care of the noted regressions there (xgetnameinfo and ip6_parsenumeric).

header sorting: negative priorities work in keeping headers sorted according to our specifications, although with some oddities as we have noted when subdirectories are involved. The ordering is however consistent, so I'd leave this be

trailing comments alignment: I'm afraid there's only two options: either align them or force them to be a fixed number of characters before the last statement. I don't want to downplay your concerns, but I fail to see this as a showstopper. If you believe it is a showstopper (no need to explain why), then say so, and I'll just drop the PR.

Please note that unlike the X, Y, Z examples in my template, the list of known regressions (A, B, C, ...) has to be as comprehensive as we can make it. We do not need a link to every regression of the same kind, but we do need a list item for every regression kind that we know about.

The patch is 121k lines long . Going over it is going to be a lot of work.
Also, "regressions" is unfortunately a subjective matter. I don't consider aligning trailing comments to be a regression, while you do.

For instance, is

-Trie::Trie(TrieCharTransform *aTransform) : head(0), transform(aTransform)
-{}
+Trie::Trie(TrieCharTransform *aTransform) :
+    head(0), transform(aTransform)
+{
+}

a regression? Sure, it uses more lines, but to me it is more readable.

A lot of positive changes I see are about enforcing the rule of having a function's return type be on its own line - we have been inconsistent in enforcing it
I see some cases of function opening bracket being moved to a new line, but we've been inconsistent there as well.
I see pointer alignments (int* foo vs. int * foo vs. int *foo) being enforced. Astyle can do this, we just never turned it on

A pretty important regression can be

-            debugs(77, DBG_IMPORTANT, "WARNING: client_delay_pool #" << (i+1) <<
-                   " has no client_delay_access configured. " <<
-                   "No client will ever use it.");
+            debugs(77, DBG_IMPORTANT, "WARNING: client_delay_pool #" << (i + 1) << " has no client_delay_access configured. "
+                                                                     << "No client will ever use it.");

debugs() is hard to undertand, and as such the formatter gets confused and either creates very long lines or aligns to the wrong point

The current list also seems to focus on the amount of changes, which is already covered in another "cons" bullet. When building the formatting regressions list, the amount of changes is irrelevant. In this context, we should welcome any change that is improving formatting or, at the very least, makes it more consistent. This list should focus on formatting that became worse after applying clang-format.

I'm at 2% of the difference, and it's rather late in the day. Please let me know if there is any showstopper for you. IMO the regressions are worth the extra rigour this tool enables us, but I understand it's subjective so it carries limited weight.

@rousskov
Copy link
Contributor

One advantage I see is we could move away from our current descriptive code guidelines and onto "format with clang-format version X or later and this configuration file".

Such a move, as we discussed before, is a good approach (under certain conditions), but I do not see a unique clang-format advantage here. The same move can, in principle, be done while still using astyle AFAICT. Yes, clang-format should be able to control more aspects, but that nearly orthogonal advantage is already counted elsewhere (pros item 1).

I've ran an experiment: take client_side.cc, arguably our most complicated source file, format it with clang-format-10, then run it through a random astyle style, and then clang-format-10 again and compare the differences.
There are 10 differences: ...

AFAICT, before we make any conclusions from this experiment, we have to make sure that no clang-format configuration options can control the currently-changing output. For example, perhaps clang-format can split all one-liners longer than N characters and merge all shorter multi-liners? And control the position of the opening if statement curly brace, even in the presence of trailing comments? That would eliminate the changes you have observed.

Maybe there is an opportunity to make the configuration more prescritptive, or we can accept this

I believe that the more uniform code formatting is, the better, but that is a somewhat secondary concern. A "yes" answer to my question would allow us to leave all C++ formatting to the tool (with the exception of the text inside comments), which would be a significant improvement because it will squash review arguments over C++ code formatting.

header sorting: negative priorities work in keeping headers sorted according to our specifications, although with some oddities as we have noted when subdirectories are involved. The ordering is however consistent, so I'd leave this be

Please note that our current primary goal is to catalogue regressions/problems, not decide what to do about them. We can move to the decision/mitigation phases while all the regressions are known. I will try to resist discussing what problems we should leave unaddressed until those phases.

The patch is 121k lines long . Going over it is going to be a lot of work.

Yes, but it is the work that has to be done to make an informed decision: We should not switch until we understand what happens when we do (and until we know how to increase the benefits and/or reduce the pain from the switch).

Needless to say, nobody is forcing you to do this tedious work! However, if you want us to switch, the onus is on you to prepare a quality proposal, which includes, IMO, itemizing regression kinds. IIRC, after spending less than an hour looking at the initial diff, I found regressions that you have not warned us about, and that scares me -- how many more different regressions are out there?

Also, please note that, AFAICT, you are looking at a diff produced with a handful of clang-format options. If we switch, we will probably want to use a lot more options, which will probably change the diff. I do not know what the most efficient way to navigate this process is, but I bet it will require spending many hours working with large diffs.

Also, "regressions" is unfortunately a subjective matter. I don't consider aligning trailing comments to be a regression, while you do.

I agree that assessing some formatting changes require judgement calls. Use your best judgement and err on the side of inclusion. It does not hurt to warn about a few change categories that everybody is going to be OK with.

For instance, is

-Trie::Trie(TrieCharTransform *aTransform) : head(0), transform(aTransform)
-{}
+Trie::Trie(TrieCharTransform *aTransform) :
+    head(0), transform(aTransform)
+{
+}

a regression? Sure, it uses more lines, but to me it is more readable.

FWIW, I consider this an improvement. I would also prefer clang-format to initialize each member on its own line (except for legitimate one-liners where everything fits on one line).

BTW, "more lines" is not a very good criteria for judging formats IMO.

A pretty important regression can be

-            debugs(77, DBG_IMPORTANT, "WARNING: client_delay_pool #" << (i+1) <<
-                   " has no client_delay_access configured. " <<
-                   "No client will ever use it.");
+            debugs(77, DBG_IMPORTANT, "WARNING: client_delay_pool #" << (i + 1) << " has no client_delay_access configured. "
+                                                                     << "No client will ever use it.");

debugs() is hard to undertand, and as such the formatter gets confused and either creates very long lines or aligns to the wrong point

Yes, this change makes formatting worse. Hopefully, we can find a way to address this problem. For example, I would argue that alignment should use four spaces for each nesting level change, not opening parens or other alignment markers. It would also be great to leave unimportant << at the end of the lines, but I doubt clang-format is that powerful. However, again, the current phase is not really about problem mitigation (or, to be more precise, any mitigation should be done by the person making the switching proposal if they want to make the switch more appealing).

I'm at 2% of the difference, and it's rather late in the day. Please let me know if there is any showstopper for you.

I am not sure I understand the question. I hope we can find solutions (or decide to ignore) all the problems seen so far. Right now, the unknowns are the only known showstopper.

IMO the regressions are worth the extra rigour this tool enables us, but I understand it's subjective so it carries limited weight.

Right now, this is not so much about subjective decisions as it is about decisions in the dark. Here is how it feels:

  • Clang-format is better than astyle. Should we switch?
  • Yes, clang-format is probably better, but what problems will we need to address if we switch?
  • Clang-format is better than astyle! We should switch!
  • I see problems A, B, and C. What other problems will we need to address if we switch?
  • The problems you see are either minor or not problems at all. I did not find anything important after looking at 2% of a huge diff produced with a small subset of the formatting options (compared to what we are likely to actually use during the switch). You are stalling progress. Can we just decide to switch already?!

If your claim is that there are no regressions worse discussing (for whatever reason), you can, of course, make that claim (and explain your reasoning). You are in the driver seat.

@squid-anubis squid-anubis added M-failed-other https://github.com/measurement-factory/anubis#pull-request-labels and removed M-failed-other https://github.com/measurement-factory/anubis#pull-request-labels labels Jun 2, 2020
@rousskov rousskov added the S-waiting-for-author author action is expected (and usually required) label Dec 9, 2020
@kinkie
Copy link
Contributor Author

kinkie commented May 20, 2021

abandoning, we're staying on astyle for the time being

@kinkie kinkie closed this May 20, 2021
@rousskov
Copy link
Contributor

we're staying on astyle for the time being

Just to avoid misunderstanding: The above fact does not imply that future attempts to improve formatting, including attempts using clang tools, are not welcomed per se. When somebody works on formatting improvements, they should learn the lessons this PR offers, of course, but "do not do it" is not one of those lessons.

@rousskov rousskov removed their request for review May 20, 2021 13:39
@rousskov rousskov removed the S-waiting-for-author author action is expected (and usually required) label May 20, 2021
@kinkie
Copy link
Contributor Author

kinkie commented May 20, 2021 via email

@rousskov
Copy link
Contributor

Right. I actually left that comment for future developers. I think you and I are more-or-less on the same page here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants