Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make trailing whitespace at the end of multiline strings an error #19299

Open
wooster0 opened this issue Mar 14, 2024 · 13 comments · May be fixed by #20575
Open

Make trailing whitespace at the end of multiline strings an error #19299

wooster0 opened this issue Mar 14, 2024 · 13 comments · May be fixed by #20575
Labels
proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Milestone

Comments

@wooster0
Copy link
Contributor

wooster0 commented Mar 14, 2024

Status quo

Status quo of trailing whitespace:

  • Non-ASCII Unicode trailing whitespace outside of a comment or multiline string is an error:
test {}  
x.zig:1:9: error: expected type expression, found 'invalid bytes'
test {}  
        ^
x.zig:1:9: note: invalid byte: '\xe3'

There is a normal space followed by a full-width space.

  • ASCII trailing whitespace outside of a comment or multiline string is formatted away by zig fmt.
  • Non-ASCII Unicode trailing whitespace at the end of a comment is not formatted away:
//                                                   

There are a bunch of trailing full-width spaces.

Technically it'd be useful if zig fmt formatted away trailing Unicode whitespace as well but the Zig compiler currently doesn't really have much Unicode awareness.
With that in mind, this is what happens for multiline strings:

test {
    const ascii_spaces = \\               
    ;
    _ = ascii_spaces;
    const unicode_full_width_spaces = \\              
    ;
    _ = unicode_full_width_spaces;
}

zig fmt does not format anything away and the compiler doesn't error.

Trailing whitespace as it is handled right now is in my opinion perfectly fine except for multiline strings.

Proposal

I propose zig fmt to not change and the compiler to complain, for a start, when there is ASCII whitespace at the end of a multiline string.

The reason is that trailing whitespace at the end of a multiline string is almost always a mistake and the intent is rather unclear (Communicate intent precisely).
Was the intent to have this trailing whitespace or not?

Example:

\\ -fno-ms-extensions Disable support for Microsoft extensions
\\ -fdollars-in-identifiers
\\ Allow '$' in identifiers
\\ -fno-dollars-in-identifiers
\\ Disallow '$' in identifiers
\\ -fmacro-backtrace-limit=<limit>

test {
    const help_menu =
        \\  -fno-ms-extensions      Disable support for Microsoft extensions
        \\  -fdollars-in-identifiers        
        \\                          Allow '$' in identifiers
        \\  -fno-dollars-in-identifiers     
        \\                          Disallow '$' in identifiers
        \\  -fmacro-backtrace-limit=<limit>
    ;
    _ = help_menu;
}

The -fdollars-in-identifiers and -fno-dollars-in-identifiers lines have trailing ASCII spaces. Select the text to see them.
Mistakes like this are common in multiline strings and even your editor won't format them away for you if it knows about Zig multiline strings (and so knows that the spaces are part of a string, which are part of the program and have semantic implications).

Was this really a mistake or did iddev5 (https://github.com/Vexu/arocc/pull/152/files) mean to put these ASCII spaces there?

The intent is more or less unclear and so I propose the compiler to behave like this:

test {
    const help_menu =
        \\  -fno-ms-extensions      Disable support for Microsoft extensions
        \\  -fdollars-in-identifiers        
        \\                          Allow '$' in identifiers
        \\  -fno-dollars-in-identifiers     
        \\                          Disallow '$' in identifiers
        \\  -fmacro-backtrace-limit=<limit>
    ;
    _ = help_menu;
}
$ zig test x.zig
x.zig:4:44: error: trailing whitespace in multiline string
        \\  -fdollars-in-identifiers        
                                           ^
x.zig:4:10: note: use a singleline string if you meant to put this trailing whitespace

Meaning, the fix would be:

test {
    const help_menu =
        \\  -fno-ms-extensions      Disable support for Microsoft extensions
        \\  -fdollars-in-identifiers
    ++ "        " ++
        \\                          Allow '$' in identifiers
        \\  -fno-dollars-in-identifiers
    ++ "     " ++
        \\                          Disallow '$' in identifiers
        \\  -fmacro-backtrace-limit=<limit>
    ;
    _ = help_menu;
}

Now intent is very clear. You specifically concatenated spaces in between (for whatever reason) so clearly you intended to do this.
Because "" strings are singleline, you can inspect whatever is in between those two double quotes
and you don't have to worry about trailing whitespace in multiline strings that you didn't see.

Here, again, as above it would technically be useful if it errors for Unicode whitespace as well.
But because (as above) zig fmt doesn't format away trailing Unicode whitespace, I think it's fine for the compiler to only complain for ASCII spaces for now,
that is until zig fmt starts formatting away trailing Unicode whitespace as well.

Reasoning

The rules "Communicate intent precisely" and "Favor reading code over writing code".

When I say non-ASCII Unicode trailing whitespace, I am also talking about Related Unicode characters with property White_Space=no (second table), such as zero-width spaces. They fit this category as well.

@nektro
Copy link
Contributor

nektro commented Mar 14, 2024

as long as there can be a way such that zig build-* rejects it but zig fmt can still accept and fix it, then I support this.

@Vexu Vexu added the proposal This issue suggests modifications. If it also has the "accepted" label then it is planned. label Jun 2, 2024
@Vexu Vexu added this to the 0.14.0 milestone Jun 2, 2024
@matklad
Copy link
Contributor

matklad commented Jul 9, 2024

+1, we regularly step into this in TigerBeetle, where there's an un-noticed trailing whitespace in \\hello world (which is annoying in and of itself), which then subsequently leaks leaks into whatever we are using this string literal for.

I wanted to quickly fix this today, but, after about an hour hacking, I think I'll give up for now :-)

I did come up with some tests here:

matklad/zig@854e86c...6d2506b

Note for whoever wants to fix this next (perhaps future me?):

  • My first idea was to check for this error in the parser: when we parse a multiline string literal, check if it ends in " " | "\t" and warn. The problem with this approach is that the parser doesn't know where the token ends, only where it starts. So, this'll require scanning through the literal's text the second time around, which feels redundant.
  • My second idea was to add .multiline_string_literal_invalid to the set of tokens returned by the lexer, but I don't like it because that means duplciating all the call sites to handle both .multiline_string_literal and .multiline_string_literal_invalid
  • My third idea was to emit a new .invalid_trailing_whitespace token from the lexer, so that, onece it sees a multiline literal with trailing whitespace, it'd split it in two tokens. This doesn't work because in the AST multiline string literal holds a contiguous slice of tokens for individual lines, and, if you intersperse them with invalid tokens, everything that handles that AST node also have to changed.

Sadly, I was smart enough to come up with all those ideas, but not smart enough to realize that they don't work before implementing them, so I'll leave it at that for now 🤣

Some further ideas:

  • we can just eat up the second pass over the data in the parser, especially given that we'll do an extra pass in the AstGet anyway.
  • or we could push this error all the way up to the AstGen, as we are computing token end there anyway.

@mlugg
Copy link
Member

mlugg commented Jul 9, 2024

Yeah, I'd expect this to be done as an AstGen error.

@nektro
Copy link
Contributor

nektro commented Jul 9, 2024

another benefit of making it an AstGen error is that zig fmt could autofix it

@mlugg
Copy link
Member

mlugg commented Jul 9, 2024

@nektro per the literal first sentence of the proposal, we will not have zig fmt modify such code (emphasis my own):

I propose zig fmt to not change and the compiler to complain, for a start, when there is ASCII whitespace at the end of a multiline string.

Having zig fmt silently perform a hard-to-spot, potentially functional, change to code is not a good idea. The whole point of this error is that the intent is unclear. Thus, it would be completely irrational for zig fmt to assume that intent.

@nektro
Copy link
Contributor

nektro commented Jul 9, 2024

if the user wants trailing space they should not be using a multiline string. the whole point of this proposal is that it'd be completely invisible to both the writer and reader of the code. barely any software warns about it and zig fmt asserting/protecting you from it would not be a bad thing

edit: oh i see i already left this opinion when this was first opened so ill stop my comments here

matklad added a commit to matklad/zig that referenced this issue Jul 10, 2024
Trailing whitespace in multiline strings is annoying for all the usual
reasons trailing whitespace is annoying (unrelated line changes).

However, in this case it even can leak into whatever the string literal
is used for. For example, in the previous commit there is trailing
whitespace leaking into the help messages.

So it makes sense to forbid this outright.

`zig fmt` intentionally doesn't try to clean up here --- if there's a
trailing whitespace in a multiline literal, it is _ambiguous_ whether
the user added that by accident, or whether they _intended_ to have a
space there.

If you need to have trailing whitespace, you can contatenate `\\` and
`"` strings with `++`:

    const S =
        \\hello
    ++ "  \n" ++
        \\world
        \\
    ;

Closes: ziglang#19299
matklad added a commit to matklad/zig that referenced this issue Jul 10, 2024
Trailing whitespace in multiline strings is annoying for all the usual
reasons trailing whitespace is annoying (unrelated line changes).

However, in this case it even can leak into whatever the string literal
is used for. For example, in the previous commit there is trailing
whitespace leaking into the help messages.

So it makes sense to forbid this outright.

`zig fmt` intentionally doesn't try to clean up here --- if there's a
trailing whitespace in a multiline literal, it is _ambiguous_ whether
the user added that by accident, or whether they _intended_ to have a
space there.

If you need to have trailing whitespace, you can contatenate `\\` and
`"` strings with `++`:

    const S =
        \\hello
    ++ "  \n" ++
        \\world
        \\
    ;

Closes: ziglang#19299
matklad added a commit to matklad/zig that referenced this issue Jul 10, 2024
Trailing whitespace in multiline strings is annoying for all the usual
reasons trailing whitespace is annoying (unrelated line changes).

However, in this case it even can leak into whatever the string literal
is used for. For example, in the previous commit there is trailing
whitespace leaking into the help messages.

So it makes sense to forbid this outright.

`zig fmt` intentionally doesn't try to clean up here --- if there's a
trailing whitespace in a multiline literal, it is _ambiguous_ whether
the user added that by accident, or whether they _intended_ to have a
space there.

If you need to have trailing whitespace, you can contatenate `\\` and
`"` strings with `++`:

    const S =
        \\hello
    ++ "  \n" ++
        \\world
        \\
    ;

Closes: ziglang#19299
matklad added a commit to matklad/zig that referenced this issue Jul 10, 2024
Trailing whitespace in multiline strings is annoying for all the usual
reasons trailing whitespace is annoying (unrelated line changes).

However, in this case it even can leak into whatever the string literal
is used for. For example, in the previous commit there is trailing
whitespace leaking into the help messages.

So it makes sense to forbid this outright.

`zig fmt` intentionally doesn't try to clean up here --- if there's a
trailing whitespace in a multiline literal, it is _ambiguous_ whether
the user added that by accident, or whether they _intended_ to have a
space there.

If you need to have trailing whitespace, you can contatenate `\\` and
`"` strings with `++`:

    const S =
        \\hello
    ++ "  \n" ++
        \\world
        \\
    ;

Closes: ziglang#19299
matklad added a commit to matklad/zig that referenced this issue Jul 10, 2024
Trailing whitespace in multiline strings is annoying for all the usual
reasons trailing whitespace is annoying (unrelated line changes).

However, in this case it even can leak into whatever the string literal
is used for. For example, in the previous commit there is trailing
whitespace leaking into the help messages.

So it makes sense to forbid this outright.

`zig fmt` intentionally doesn't try to clean up here --- if there's a
trailing whitespace in a multiline literal, it is _ambiguous_ whether
the user added that by accident, or whether they _intended_ to have a
space there.

If you need to have trailing whitespace, you can contatenate `\\` and
`"` strings with `++`:

    const S =
        \\hello
    ++ "  \n" ++
        \\world
        \\
    ;

Closes: ziglang#19299
@thomab
Copy link

thomab commented Jul 19, 2024

For what it's worth, I don't like it. If I declare a string literal, then most likely that is literally the string that I want.

An example where this would be unfortunate is if you're generating markdown (or similar) files from templates, and you want to include explicit line-breaks (two trailing spaces) in the template.

Couldn't you lint this on your own machines or ci?

bash-5.2$ cat temp.zig
const _ =
    \\hello
    \\world
    \\oops
    \\foo
    \\bar
;
bash-5.2$ grep -EHn '^\s*\\\\.*\s+$' temp.zig
temp.zig:4:    \\oops

@matklad
Copy link
Contributor

matklad commented Jul 19, 2024

An example where this would be unfortunate is if you're generating markdown (or similar) files from templates, and you want to include explicit line-breaks (two trailing spaces) in the template.

This is precisely the case where it is beneficial to see trailing white-space marked explicitly in the source code. It's hard to spot a bug in

    writeMarkdown(
        \\Roses are red  
        \\  Violets are blue,  
        \\Sugar is sweet
        \\  And so are you.
    );

@thomab
Copy link

thomab commented Jul 19, 2024

see trailing white-space marked explicitly

But the proposal is to forbid it, not to mark it.

@matklad
Copy link
Contributor

matklad commented Jul 19, 2024

We don't need a new feature to mark trailing whitespace because there is a number of existing language features you can use for this niche purpose sufficiently effectively:

// Use normal string, the perfect solution for this specific example
    writeMarkdown("" ++
        "Roses are red  \n" ++
        "  Violets are blue,  \n" ++
        "Sugar is sweet  \n" ++
        "  And so are you.  \n");

// For one off line break, the following works
    writeMarkdown(
        \\Roses are red
        \\  Violets are blue,
    ++ "  \n" ++
        \\Sugar is sweet
        \\  And so are you.
    );

// If you need this a lot for a particular use-case, you can make domain-specific helper:
    writeMarkdown(pre(
        \\Roses are red
        \\  Violets are blue,
        \\Sugar is sweet
        \\  And so are you.
    ));
    writeMarkdown(
        \\Roses are red {br}
        \\  Violets are blue,  {br}
        \\Sugar is sweet  {br}
        \\  And so are you.  {br}
    );

// Finally, it might be worth to stop and think whether you actually _must_ 
// use a trailing whitespace, or whether there's a better solution:
    writeMarkdown(
        \\Roses are red\
        \\  Violets are blue,\
        \\Sugar is sweet\
        \\  And so are you.\
    );

@thomab
Copy link

thomab commented Jul 19, 2024

niche purpose

That is probably true, but there aren't very many people arguing either way
here.

\\Roses are red\

I suppose could live with this.

Edited:
Grepping a large code base such as zig or tigerbeetle for this takes milliseconds on my machine, though it's only a few years old, I understand it can take some amount longer on older hardware or laptops. Probably not that much longer, though.

@thomab
Copy link

thomab commented Jul 19, 2024

I wanted to see if looking for this at build time is feasible and came up with this proof of concept. It's rough and not portable but, it seems like it could be done.

diff --git a/build.zig b/build.zig
index 054b67b..b1ea75c 100644
--- a/build.zig
+++ b/build.zig
@@ -46,6 +46,7 @@ pub fn build(b: *std.Build) !void {
 
     // Top-level steps you can invoke on the command line.
     const build_steps = .{
+        .trail = b.step("trail", "Look for trailing whitespace"),
         .aof = b.step("aof", "Run TigerBeetle AOF Utility"),
         .check = b.step("check", "Check if TigerBeetle compiles"),
         .clients_c = b.step("clients:c", "Build C client library"),
@@ -257,6 +258,28 @@ pub fn build(b: *std.Build) !void {
         break :blk install.getDest();
     };
 
+    { // zig build trail
+        var code: u8 = undefined;
+        const grep = b.runAllowFail(&.{
+            "grep",
+            "-EHnr",
+            \\^\s.*\s+$
+            ,
+            "--include=*.zig",
+        }, &code, .Inherit) catch |err| switch (err) {
+            error.ExitCodeFailure => if (code == 1) "ok" else blk: {
+                std.debug.print("{}\n", .{code});
+                break :blk "what";
+            },
+            else => return err,
+        };
+        std.debug.print("{}\n", .{code});
+        if (code > 2) {
+            std.debug.print("{s}\n", .{grep});
+            return error.LintError;
+        }
+    }
+
     { // zig build aof
         const aof = b.addExecutable(.{
             .name = "aof",
diff --git a/src/vopr.zig b/src/vopr.zig
index da11e4c..414b56f 100644
--- a/src/vopr.zig
+++ b/src/vopr.zig
@@ -234,7 +234,7 @@ pub fn main() !void {
         \\          SEED={}
         \\
         \\          replicas={}
-        \\          standbys={}
+        \\          standbys={} 
         \\          clients={}
         \\          request_probability={}%
         \\          idle_on_probability={}%

And the output:

$ zig build trail
170
src/vopr.zig:237:        \\          standbys={} 

error: LintError
/home/thomas/Tools/tigerbeetle/build.zig:279:13: 0x11212c9 in build (build)
            return error.LintError;
            ^
/home/thomas/zig/0.13.0/files/lib/std/Build.zig:2117:24: 0x1103d37 in runBuild__anon_8835 (build)
        .ErrorUnion => try build_zig.build(b),
                       ^
/home/thomas/zig/0.13.0/files/lib/compiler/build_runner.zig:301:9: 0x10ff060 in main (build)
        try builder.runBuild(root);
        ^
error: the following build command failed with exit code 1:
/home/thomas/Tools/tigerbeetle/.zig-cache/o/5ba6ffbcb74aeb9315c6bd2f8977a108/build /home/thomas/zig/0.13.0/files/zig /home/thomas/Tools/tigerbeetle /home/thomas/Tools/tigerbeetle/.zig-cache /home/thomas/.cache/zig --seed 0x9d32edc7 -Zbd117d5a66a8de31 trail

@thomab
Copy link

thomab commented Jul 19, 2024

For what it's worth, I know I don't have much of a leg to stand on but I didn't want this to go through without mentioning it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants