Add string interpolation from nimboost #6507

Closed

bluenote10
Contributor

As discussed in #5600, this PR adds @vegansk's nice string interpolation implementation from nimboost to the standard library.

To discuss: Should this live in a separate module or in strutils? If separate, what's a good name?

In nimboost, the implementation relies on 4 modules: richstring, parsers, formatters, and limits. I minified the implementation by using parseInt instead of nimboost's strToInt. This gets rid of the requirements on limits and parsers, but I'm not sure if using strToInt would have any benefits. The remaining things fit reasonably into one module. Should we consider including it from strutils so that users don't need an extra import for string interpolation while keeping the implementation clean and separated?

CC @Yardanico

@dom96
Contributor

dom96 commented Oct 14, 2017

The module isn't that large, I think it's fine to put its code into strutils.

@ghost

ghost commented Oct 14, 2017

@bluenote10
Well, it's fine, but I think you need to add Nimboost author's copyright. Something like https://github.com/nim-lang/Nim/blob/devel/lib/pure/poly.nim#L4
About include - yeah, I think it can be included in strutils.

else:
result = $chr(ord(if lowerCase: 'a' else: 'A') + v - 10)

proc intToStr(n: SomeNumber, radix = 10, len = 0, fill = ' ', lowerCase = false): string =
Contributor

Can't this function be implemented in terms of the existing functions for converting an int to a string? It seems to me that this reinvents a lot of the code that we already have causing duplication.

Contributor Author

On a closer look, there is not as much duplication as it seems. The part that can be rewritten is line 43 to 49, but it would be slightly less efficient, because it would create several temporary strings and require two passes for the case conversion (toHex only produces upper case). Note that there is probably not much room for simplification in the sign placement logic at the bottom, because it depends on whether "zero filling" is desired. I do have a few smaller changes that make the code slightly more concise, which I'll push.

prefix[0] = '-'
result = prefix & result

proc alignStr(s: string, len: int, fill = ' ', trunc = false): string =
Contributor

To me this function seems to do too much, but maybe it makes the implementation easier. Might be worth considering splitting this up into a truncate and indent function.

Contributor

Also, we do have an indent in the strutils module already, could it be used?

https://nim-lang.org/docs/strutils.html#indent,string,Natural,string

Contributor Author

Yes, it would be possible, but from the perspective of the macro it is nicer to have just one function to call which does all expected behavior at once, instead of having to build a more complex AST with chained function calls. That's why I thought it helps to rename these functions.


proc floatToStr(v: SomeNumber, len = 0, prec = 0, sep = '.', fill = ' ', scientific = false): string =
## Converts ``v`` to string with precision == ``prec``. If result's length
## is lesser then ``len``, it aligns result to the right with ``fill`` char.
Contributor

s/then/than/

# nimboost's richstring
# -----------------------------------------------------------------------------

proc parseIntFmt(fmtp: string): tuple[maxLen: int, fillChar: char] =
Contributor

Would be nice to see some comments with examples of what this parses.

Same for the functions below.

Contributor Author

I have added some docstrings, but note that this is not the best place to document the behavior for users, because these are only the internal helper functions. From a user perspective this is just about getting the traditional printf-like behavior: the pad symbol of normal formatters like %5d is a plain space, and %05d is a special case where the pad symbol becomes a zero.
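
As a rough illustration of the %5d versus %05d distinction, here is a minimal sketch. `padInt` is a hypothetical helper written for this note, not the PR's code; it only assumes `align` from strutils. It also shows why plain `align` is not enough once zero filling meets a negative number.

```nim
import strutils

proc padInt(n: int, width: Natural, zeroFill = false): string =
  ## Hypothetical helper: right-aligns `n` in `width` columns, padding with
  ## spaces (like %5d) or with zeros (like %05d).
  let digits = $n
  if zeroFill and digits[0] == '-':
    # Zeros go between the sign and the digits, as printf does.
    result = "-" & align(digits[1 .. ^1], max(width - 1, 0), '0')
  elif zeroFill:
    result = align(digits, width, '0')
  else:
    result = align(digits, width)

echo padInt(42, 5)           # "   42"
echo padInt(42, 5, true)     # "00042"
echo padInt(-42, 5, true)    # "-0042"
echo align("-42", 5, '0')    # "00-42": plain align puts zeros before the sign
```

The sign is handled on the string rather than by negating the number, which sidesteps overflow for the lowest representable integer.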

Contributor

@dom96 dom96 Oct 14, 2017

@bluenote10 This isn't for users, it's for the programmer. It's nice to see an example of what this procedure parses.

Contributor Author

Are you happy with the comments I have added? Or anything else to review? (I didn't move it into strutils now, as wished by Araq)


macro fmt*(fmt: static[string]): untyped =
## String interpolation macro with scala-like format specifiers.
## Knows about:
Contributor

Note: An empty line is required for a newline to be present in the generated docs, currently Knows about will be on the same line as the above sentence.

@dom96
Contributor

dom96 commented Oct 14, 2017

Argh, that's a bit annoying. Please see the "outdated" comments too.

@bluenote10
Contributor Author

@Yardanico Good point, I have added Anatoly now. Using align directly is not possible, because the version here requires handling of negative values as well. A possibility would be to merge this extended behavior into align but I'm not sure if this is desired. I see these formatters more as special case helper function used by the macro. I have renamed them now (formatInt, formatFloat, formatInt) to make this a bit more clear.

@dom96 Yes in terms of the length it is not an issue. My only concern was that it comes with a large number of helper functions, which are only relevant to the macro and it would maybe be nice to have them in their own scope.

@ghost

ghost commented Oct 14, 2017

@bluenote10 maybe also change 2012 -> 2017 ?

@dom96
Contributor

dom96 commented Oct 14, 2017

My only concern was that it comes with a large number of helper functions, which are only relevant to the macro and it would maybe be nice to have them in their own scope.

It would be nice, but I don't think it's necessary. I still vote to put this in the strutils module.

@Araq
Member

Araq commented Oct 15, 2017

That's not how string interpolation should work in Nim. :-/ Nim is not about cryptic 1 letter mini languages. For example, to format toHex this should be used

format"as Hex ${x.toHex}"

We have the power of parseExpr to keep the code really simple, format should turn a string into a string concat. Another problem with d and s specifiers is that Nim already knows the type in the context so copying C's design (and Scala's ultimately comes from C) makes no sense.

@bluenote10
Contributor Author

bluenote10 commented Oct 15, 2017

It does make sense from the perspective:

  • printf-like formatting is by far the most commonly used 1-letter mini language for this purpose.
  • it can hardly be beaten in terms of conciseness.

Making the implementation purely concat-based misses the point of string interpolation in my opinion. Doesn't this mean that all formatting has to go into the interpolated expression? I rarely use an interpolator without a formatter. I don't care about things like the x formatters, but the crucial features to me are quick/easy solutions to (1) fixing widths, (2) left/right alignment, and (3) controlling float precision, which are fully covered by the printf mini language. Personally, I'm extremely picky when it comes to writing log messages, because my eyes are very bad at parsing text output when it is non-aligned / variable width, especially because my domain is heavy in numerical output. Therefore, one of the first things I learn in every language is to do string formatting/interpolation. I have worked with many systems now and so far Scala's approach was definitely the most convenient solution, especially because it is part of the language and IDEs even understand the code in the ${...} expression and provide proper syntax highlighting and auto-completion.

A few common use cases and how I would currently have to approach them without a formatting mini language:

# Output goal:
# Iteration:     1    MSE:    39194.898
# Iteration:     2    MSE:    26129.932
# Iteration:     3    MSE:    17419.955
# Iteration:     4    MSE:    11613.303
# Iteration:     5    MSE:     7742.202

var mse = 58792.347
for iter in 0 ..< 10:
  mse = mse / 1.5
  echo fmt"Iteration: ${iter+1}%5d    MSE: ${mse}%12.3f"
  # vs
  echo fmt"Iteration: ${align($(iter+1), 5)}    MSE: ${formatFloat(mse, ffDecimal, 3).align(12)}"

# Output goal:
# README.txt                  0.002 MB
# LICENSE                     0.005 MB
# movie.mp4                  11.952 MB

let files = @[
  ("README.txt", 1925),
  ("LICENSE", 5203),
  ("movie.mp4", 12532921)
]
for file, fileSize in files.items():
  echo fmt"${file}%-20s ${filesize.float / (1 shl 20)}%12.3f MB"
  # vs
  echo fmt"""${file & repeat(" ", 20 - len(file))} ${formatFloat(filesize.float / (1 shl 20), ffDecimal, 3).align(12)} MB"""

These are minimal examples; real-world use cases contain a whole bunch of variables, making conciseness a major factor. Having to do the formatting explicitly is so painful that even a formatting freak like me would avoid it. And the barrier to applying nice formatting has an immediate impact on the quality of all text output produced in a language. For instance, if you look at the visual quality of log messages of various projects, they reflect this formatting barrier, i.e., the average log message quality I have experienced goes something like Scala > Python > C++.

For some time I was also experimenting with the ideas Jehan proposed here. They are nice from a design point of view, but pulling the formatting into the expression will always have a trade-off problem between conciseness and name space. I.e., in order to make the formatting syntax concise, a lot of common symbol names have to be used, and even then it is less concise than the printf formatters. Conversely, using less conflicting symbols for the formatting makes the formatting cumbersome. My conclusion was that it is much easier to go for the printf-like syntax, because it covers such a broad range of use cases in a super concise way. If you absolutely don't want that in Nim we should discuss what we can offer in terms of formatting instead, but I think the above examples show that purely concat-based string interpolation is an almost useless feature.
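
For comparison, the two output goals above can also be sketched with the strformat module that the last comment in this thread mentions. This is not the syntax proposed in this PR: it assumes strformat's `{expr:spec}` form as found in current Nim, and `iterLine` and `fileLine` are hypothetical helper names.

```nim
import strformat

proc iterLine(iter: int, mse: float): string =
  # Width 5 for the counter, width 12 with 3 decimals for the error.
  fmt"Iteration: {iter:5}    MSE: {mse:12.3f}"

proc fileLine(name: string, bytes: int): string =
  let mb = bytes.float / float(1 shl 20)
  # '<' left-aligns the name in 20 columns.
  fmt"{name:<20} {mb:12.3f} MB"

var mse = 58792.347
for iter in 1 .. 5:
  mse = mse / 1.5
  echo iterLine(iter, mse)

let files = @[("README.txt", 1925), ("LICENSE", 5203), ("movie.mp4", 12532921)]
for item in files:
  echo fileLine(item[0], item[1])
```

Width, alignment, and precision stay next to the expression, which is the conciseness argument made above, only with a different delimiter.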

@Araq
Member

Araq commented Oct 16, 2017

  • You should have used formatSize in your example and I would argue that the format string misled you into using the inferior solution since it only knows "floating point vs integer". I think times and "sizes" should be supported too in the mini language.
  • The alignments are wrong for Unicode (a general stdlib issue)
  • The alignments are conceptually wrong, it's just a magic number you guessed and things misalign when the strings are too long. Something like flexibleTabs("${x}\t${y}", maxLineLen=80) seems more useful.
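
The first two bullets can be illustrated with a short sketch. It assumes a Nim version in which the unicode module provides a rune-aware `align` next to the byte-based one in strutils; the calls are qualified because both modules export that name.

```nim
import strutils, unicode

let name = "né"  # 2 characters, but 3 bytes in UTF-8

# strutils.align counts bytes, so only one space is added.
let byBytes = strutils.align(name, 4)
# unicode.align counts runes, so two spaces are added.
let byRunes = unicode.align(name, 4)

echo byBytes.runeLen  # 3: one column short of the requested width
echo byRunes.runeLen  # 4

# formatSize picks the unit itself instead of a hand-written division.
echo formatSize(4096, includeSpace = true)  # "4 KiB"
```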

That said, this is fine with me if you move it into its own stdlib module. I understand other programmers also expect to have this toolbox available.

@bluenote10 bluenote10 force-pushed the feature/string_interpolation branch from 46073b2 to b8e1e02 Compare October 21, 2017 09:54
@bluenote10 bluenote10 force-pushed the feature/string_interpolation branch from b8e1e02 to 1369d2b Compare October 21, 2017 10:03
@bluenote10
Contributor Author

Is there anything left here you want me to adjust? I hope everyone is happy as it is now.

@Araq
Member

Araq commented Oct 23, 2017

As I said, IMO sizes and times should be supported too and your alignments ignore unicode.

@bluenote10
Contributor Author

Oh sorry, I thought your comments were meant for future iterations. I have implemented support for unicode formatting now, and I came up with a solution for "sizes" formatters. A few design decisions:

  • I decided it makes the most sense to combine them with the existing float formatters (e.g. fmt"${1024}%.3Kif kB" would give "1.000 kB"), because the general width and precision modifiers are the same and we don't waste another terminal character.
  • I did not include automatic suffix output, because (1) sometimes people may want to use the "wrong" suffix, and (2) it would be necessary to specify e.g. include_space as well. Since we are doing string interpolation it is much easier for users to type whatever textual suffix they prefer in the string.

Are you happy with it? Would you mind moving the time formatters into a second PR? This would require a bit more thought.

Oh, and I think there is still a Windows specific issue I'll have to fix...

@Araq
Member

Araq commented Oct 24, 2017

I don't agree with these design decisions. :-) The problem is that single letter format specifiers do not scale, so the first step is to allow multiple characters here:

${x}s  # short for
${x}s$

With this optional terminating $ (it is only optional for the existing single letter shortcut) we have an extensible syntax that can keep things sane:

${x}size kb$ # with space, always in kb
${x}size_kb$ # without space, always in kb
${x}size_$ # choose the right suffix, no space
${x}size $  # choose the right suffix, with space
${x}time 00:00:00$ # oh look, times work too

@bluenote10
Contributor Author

Okay, I've been thinking about this for a while, but I can't come up with something that is not ambiguous. My basic problems are:

What is "${x}size $"? Is it "<x_parsed_as_s>ize $" or "<x_parsed_as_size>"?
What is "${x}size ${y}"? Is it "<x>size <y>" or "<x_parsed_as_size>{y}"?
How to produce the output string "<x_without_a_formatter>size $"?

I've been thinking about doubling/escaping the characters following the closing bracket, and doubling/escaping of the terminal $, but the resulting languages look pretty user-unfriendly to me and hard to parse. As far as I can see, this would also mean a complete rewrite of the parser, because relying on interpolatedFragments and regularity of the suffix is no longer possible :(. I'm a bit at a loss now.

@Araq
Member

Araq commented Oct 25, 2017

What is "${x}size $"? Is it "<x_parsed_as_s>ize $" or "<x_parsed_as_size>"?

Parsed as "<x_parsed_as_size>"

What is "${x}size ${y}"? Is it "<x>size <y>" or "<x_parsed_as_size>{y}"?

"<x_parsed_as_size>{y}"

How to produce the output string "<x_without_a_formatter>size $"?

"${x}$ size $$"

As far as I can see, this would also mean a complete rewrite of the parser, because relying on interpolatedFragments and regularity of the suffix is no longer possible :(. I'm a bit at a loss now.

We should patch interpolatedFragments for consistency.

I've been thinking about doubling/escaping the characters following the closing bracket, and doubling/escaping of the terminal $, but the resulting languages look pretty user-unfriendly to me and hard to parse.

Well I don't know. We could accept for now what we already have, but that potential "$ marks the end" rule cannot be reintroduced without breaking things so we better get it right this time.

@andreaferretti
Collaborator

I think that the perfect is the enemy of the good. This seems to me a nice macro that supports a variety of common use cases.

@bluenote10
Contributor Author

A few more thoughts:

  • If we want to go for this modified mini language, I don't think we should make the terminal $ optional. Since the language no longer has a clear formatting start character like the % before, it is just too easy to run into something ambiguous and get surprising results. For instance, ${filename}.txt is <filename_parsed_as_s>.txt, but if you are dealing with a different extension you can get a surprise: ${filename}.flt may be parsed using the f formatter with an empty precision followed by lt. Or, when renaming ${filename}-not-special.dat to ${filename}-special.dat, the -s is suddenly swallowed as a formatter, and the user ends up with a pecial.dat. Basically, a user would have to be aware of all possible formatter prefixes and be very careful with the strings following an interpolation expression. To avoid this and keep things sane, we probably have to live with a mandatory terminal $. This is actually okay for cases which do use a formatter: the length should be the same because the leading % is replaced by a trailing $. The only case where it would suck is the plain ${x}$ or x = ${x}$ y = ${y}$. There is a temptation to come up with special rules where the trailing $ might be omitted, like at the end of the string, at whitespace, or at typical separating characters. But I'm not sure we want to go down that rabbit hole. It probably leads to exactly the problems mentioned above.
  • How to handle "pure variable identifiers" (i.e., without parentheses like x = $x, y = $y)? They would probably not only need the same mandatory terminal $, but for them we would have to bring back the leading formatter sign, probably % again. Thus, they don't fit into the language as well as before, and they lose some brevity, because they now require at least one more character than before. However, omitting all the parentheses is still very convenient for debugging, so it is better to allow x = $x$, y = $y$, z = $z$ than not to have them at all.
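
The hazard described in the first bullet can be made concrete with a small sketch. The parsing rule below is an assumption made purely for illustration (an optional '-', then digits or '.', then one terminal letter, with no mandatory terminator); it is not the PR's parser, and `splitSuffix` is a hypothetical name.

```nim
const terminals = {'d', 'f', 's', 'x', 'X', 'e'}

proc splitSuffix(suffix: string): tuple[spec, rest: string] =
  ## Splits the text following `${expr}` into a format spec and the literal
  ## remainder, greedily and without a terminating `$`.
  var i = 0
  if i < suffix.len and suffix[i] == '-': inc i
  while i < suffix.len and suffix[i] in {'0'..'9', '.'}: inc i
  if i < suffix.len and suffix[i] in terminals:
    result = (suffix[0 .. i], suffix[i + 1 .. ^1])
  else:
    # No terminal letter found: the whole suffix is literal text.
    result = ("", suffix)

echo splitSuffix("-not-special.dat")  # no spec, text kept as is
echo splitSuffix("-special.dat")      # "-s" is taken as a spec, "pecial.dat" remains
echo splitSuffix(".flt")              # ".f" is taken as a spec, "lt" remains
```

Under such a rule, a harmless rename of the literal text changes how the expression is formatted, which is the surprise the bullet warns about.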

Overall I'm not yet sure if the language based on a unique terminal symbol is really an improvement in those 99% standard use cases. Also I don't think string interpolation is so much about being able to format any type in any imaginable way, but more about the basic problem of assembling a nicely formatted string from a large number of basic types like strings+ints+floats.

My current proposal would be to keep the current % based approach, but introduce maybe just one more terminal symbol c for "custom". When using the custom formatter, the interpolation calls a proc named e.g. customFmt(x: <TYPE_OF_EXPRESSION>, fmtp: string): string, where fmtp is the format parameter string, i.e., everything between the % and the c. This would allow users to quickly bring their own formatters into scope if need be. And the strinterp module could export a set of customFmt procs for certain special use cases like times/dates. Currently the restrictions for the fmtp string are defined by the set of terminal symbols (then {'d', 'f', 's', 'x', 'X', 'e', 'c'}) and the rules of interpolatedFragments for when to break a fragment. I think a lot is already possible with that (more than I need). And if we later realize that the format strings need more flexibility, we could maybe just allow escaping of these characters within fmtp. This kind of escaping would in my opinion be easier to handle for users, because they know that after the % symbol they are in a special mode where they have to be careful and escape e.g. s characters if they need them. That feels different from the $ terminal mode, where users think they are writing a normal string continuation but accidentally match a formatter prefix.

@Araq
Member

Araq commented Oct 29, 2017

Also I don't think string interpolation is so much about being able to format any type in any imaginable way, but more about the basic problem of assembling a nicely formatted string from a large number of basic types like strings+ints+floats.

That's biased. Times and sizes are often more common than plain ints and floats which have little meaning of their own.

@FedericoCeratto
Member

I would suggest a compromise: a simple formatter that handles named variables, left padding, and knows only strings, ints and floats. This covers 90% of common use cases and can be used without having to look up the minilanguage syntax.
More complex or type-specific formatting could be performed by calling the right converters in advance.

of "Z": result.divisor = -7
of "Y": result.divisor = -8
else:
quit "Illegal float format suffix: " & remainder
Member

Raise a ValueError here instead.
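
A minimal sketch of that change, using only the two suffix cases visible in the fragment above. `suffixDivisor` is a hypothetical wrapper; the PR's surrounding proc is not shown in this thread.

```nim
proc suffixDivisor(remainder: string): int =
  ## Maps a float format suffix to its divisor exponent.
  case remainder
  of "Z": result = -7
  of "Y": result = -8
  else:
    # Raising lets the caller recover; `quit` would end the whole program.
    raise newException(ValueError, "Illegal float format suffix: " & remainder)

try:
  discard suffixDivisor("Q")
except ValueError as e:
  echo "caught: ", e.msg
```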

@dom96
Contributor

dom96 commented Nov 16, 2017

👍 to what @FedericoCeratto said.

I actually just got an idea after looking at the Python f-string implementation. Something that would be extensible:

type
  MyObject = object
    x: int

proc format(f: MyObject, pattern: string): string =
  ## Custom implementation for formatting of MyObject
  return repeat(" ", pattern.parseInt()) & $f.x

let t = getGmTime(getTime())
let c = MyObject(x: 42)

echo(f"Hello. It is currently {t:m past H}. My cool custom object's value\n{c:4}")

# Hello. It is currently 38 past 9. My cool custom object's value
#     42

This would allow for some very powerful extensibility and significantly simplify the formatter.

The rule is that everything after the first : is passed to the specified type's format function.

@Araq
Member

Araq commented Nov 16, 2017

@dom96 I like it very much. It saves typing and yet is "obvious" enough. Really nice.

@bluenote10
Contributor Author

I'll see what I can do. But since this requires a complete rewrite it's probably best to have another PR once it's ready. I remember having a closer look at Python f-strings before and concluding that they have some drawbacks which made the Scala formatters superior. Fortunately, I can't remember exactly what was the issue :).

@dom96
Contributor

dom96 commented Nov 16, 2017

If we do decide to choose my idea, and I'd love to hear other people's opinions, then I think it'd be best to move your implementation into a Nimble package (sorry!).

@bluenote10
Contributor Author

bluenote10 commented Nov 16, 2017

There are already three string interpolation libraries on Nimble, what's the point of having a fourth one? The whole point was to have one in the stdlib, everyone agreed to that in the RFC.

@dom96
Contributor

dom96 commented Nov 17, 2017

If we do decide to choose my idea

By that I mean: if we decide to add the formatter I describe in my previous comment into the stdlib then your current implementation should go into Nimble.

@bluenote10
Contributor Author

By that I mean: if we decide to add the formatter I describe in my previous comment into the stdlib then your current implementation should go into Nimble.

Yeah, that's fine. In fact this already comes from "nimboost" and I'll probably just push back the changes/improvements I made.

I think an issue I had with a Python f-string approach is that I'm not sure if the formatter string validation can happen at compile time or at runtime only. For Python this doesn't matter, but it is just super annoying if your string interpolation fails at runtime just because of a stupid formatter mistake. I think for built-in types it is mandatory to do the formatter validation at compile time to avoid running into runtime errors all the time. The current implementation achieves this. I think my problem with implementing it was that I couldn't figure out how to build the AST depending on the type of the parseExpr expression. If this was possible the formatter can switch to static validation for int/float/string and only use dynamic runtime formatting for custom types. I'll have to give it another try to see if there is a way to make it work.

Note that your format(f: MyObject, pattern: string) is exactly what I have proposed above with the customFmt formatters. I.e., exactly the same could be done with this current approach, with the only exception that pattern must not use any of the 7 terminal symbols. The benefit of relying on terminal symbols is that

  • implementing static + dynamic validation is easy.
  • we could also make it clear that only the c formatter relies on runtime validation.
  • users can easily make explicit "formatting type conversion" via the terminal symbol, e.g., formatting an int in float notation. If we dispatch depending on the type of the parsed expression this is probably harder to do.

There was another syntactical issue with f-strings as well, but I still can't remember what it was.

@yglukhov
Member

Formatters could choose between compile time and runtime by accepting static strings.

@bluenote10
Contributor Author

Hm, I might be confused, but the o in format(o: MyObject, pattern: static[string]) only comes into play at runtime. So if the generated AST is e.g. "prefix string" & format(myObj, "invalid format string") & "suffix string", the validation would only happen at runtime. Do you mean something like a factory formatFactory(pattern: static[string]): (MyObject -> string) so that we (somehow) could call the factory at compile (doing the pattern validation) and leaving only the actual "object to string" conversion for runtime?
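
One possible reading of the static-string suggestion, sketched under assumptions: because a `static[string]` parameter is known when the call is compiled, the pattern can be parsed in a `const` inside the proc, so only the object-to-string step is left for runtime. `formatChecked` is a hypothetical name, and the "pattern is a number of leading spaces" rule is borrowed from the MyObject example earlier in the thread.

```nim
import strutils

type
  MyObject = object
    x: int

proc formatChecked(o: MyObject, pattern: static[string]): string =
  ## Hypothetical formatter: `pattern` is the number of leading spaces.
  # `pattern` is a compile-time value, so this const is evaluated by the
  # compiler. A non-numeric pattern should make parseInt fail during
  # compilation instead of at runtime.
  const width = parseInt(pattern)
  result = repeat(' ', width) & $o.x

echo formatChecked(MyObject(x: 42), "4")  # "    42"
```

No separate factory is needed in this sketch, since the proc is instantiated once per distinct pattern.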

var v = n.int64
let s = v.int64 < 0
if s:
v = v * -1
Member

This overflows for low(int64)
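
The reason is that the magnitude of low(int64) does not fit into int64, so `v * -1` has no valid result there. One way around it, sketched with hypothetical helper names, is to compute the magnitude in unsigned arithmetic:

```nim
proc magnitude(n: int64): uint64 =
  ## Absolute value of `n` as an unsigned number; defined for low(int64) too.
  if n < 0:
    # Two's complement negation done in unsigned arithmetic, which wraps
    # around instead of raising an overflow error.
    result = (not cast[uint64](n)) + 1'u64
  else:
    result = uint64(n)

proc signedToStr(n: int64): string =
  ## Builds the decimal string from sign and magnitude.
  result = (if n < 0: "-" else: "") & $magnitude(n)

echo signedToStr(low(int64))  # -9223372036854775808
echo signedToStr(-5)          # -5
```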

@Araq
Member

Araq commented Dec 17, 2017

strformat is now part of the stdlib, feel free to create PRs to improve it.
