Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raw strings + multi-line string discussion #1004

Closed
andrevidela opened this issue Jan 30, 2021 · 10 comments
Closed

Raw strings + multi-line string discussion #1004

andrevidela opened this issue Jan 30, 2021 · 10 comments

Comments

@andrevidela
Copy link
Collaborator

andrevidela commented Jan 30, 2021

This proposal aims to motivate and explain the design behind two new string features in Idris: Raw and multi-line strings. Each proposal has a motivation, from which follow design goals. The proposal answers the design goals at the end, a discussion section talks about the different alternatives and how they do, or do not answer the design goals stated. This 3000 words long expose ends with a note on implementation and an invitation to discuss the design goals.

Multi-line strings

Mutli-line strings are a feature that allow to write string literals across multiple lines.

Motivation

Strings are ubiquitous in programming and allow programmers to communicate through text. Another use for strings is to carry data that will be used by other systems that read strings. Typical examples of this are:

  • Generate some HTML headers
  • Print out some ascii-art.
  • Write out manually an HTTP request.
  • Output files given a known template.

What every example has in common is to write down a string literal across multiple lines. Additionally, some of them would benefit from string interpolation #555 while others would benefit from avoiding character escapes. The first two examples would benefit from avoid escaping characters, while the last two would benefit from string interpolation.

Here are each example with existing string syntax

HTML header

let header = "<head>\n    <script type=\"text/javascript\" src=\"./tracking.js\"></script>\n</head>"

or

let header = "<head>\n"
          ++ "    <script type=\"text/javascript\" src=\"./tracking.js\"></script>\n" 
          ++ "</head>"

in order to show the layout of the file using vertical space.

ascii-art

This example is taken from the idris compiler itself

banner : String
banner = "     ____    __     _         ___                                           \n" ++
         "    /  _/___/ /____(_)____   |__ \\                                          \n" ++
         "    / // __  / ___/ / ___/   __/ /     Version " ++ showVersion True version ++ "\n" ++
         "  _/ // /_/ / /  / (__  )   / __/      https://www.idris-lang.org           \n" ++
         " /___/\\__,_/_/  /_/____/   /____/      Type :? for help                     \n" ++
         "\n" ++
         "Welcome to Idris 2.  Enjoy yourself!"

As you can see this suffers from two problems: the \ needs to be escaped, and the version number needs to be inserted. In this proposal I will not suggest to fix both of them at the same time but will provide a solution for each of them individually, leading to two different compromises that have different properties.

Write out manually an HTTP request.

testRequest : String
testRequest = 
"POST /cgi-bin/process.cgi HTTP/1.1\n" ++ 
"User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)\n" ++
"Host: www.tutorialspoint.com\n" ++
"Content-Type: application/x-www-form-urlencoded\n" ++
"Content-Length: length\n" ++
"Accept-Language: en-us\n" ++
"Accept-Encoding: gzip, deflate\n" ++
"Connection: Keep-Alive\n" ++
"\n" ++
"<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"++
"<string xmlns=\"http://clearforest.com/\">string</string>"

This program describes an HTTP request that could be used for testing purposes (for example testing either the parsing, or the execution of the request). As you can see we have to add additional \n in order and escape the inner "

Output files given a known template.

packageDesc : (name, source, execName : String) -> String
packageDesc name source exec =
    "package = " ++ name ++ "\n"
    "sourcedir = " ++ source ++ "\n"
    "executable = " ++ exec

This program allows to describe idris packages containing the information necessary to create a binary out of an idris project. Typically a package manager written in idris would use such programs to generate project files interactively. We will see that this program does not actually benefit from multi-line strings if they do not support string interpolation.

The design goals are therefore as follows:

• Allow strings to be written across multiple lines.
• Allow indentation to be respected in nested programs.
• Remain compatible with string interpolation.

The proposal

Support multi-line strings using the same syntax rules as Swift that is:

"""hello"""

is an error, because the string is not multi-line

"""
hello\"""
this is an escaped multiline
string
\"""
"""

is allowed because the """ token is escaped

let verylongfunctionname = """
    the indentation here is 
    the one that
    everyone expects
    """

trims the leading whitespace using the last """ such that the string has value

the indentation here is 
the one that
everyone expects

This allows to control both the indentation and the trimming. In turn

hello = """
this is an error
        """

This error because the content of the multi-line string has an indentation level that is lower than what we ask to trim from the left.

The following are errors:

hello = """"""
hello = """""""
hello = """
this is not escaped """
"""
hello = """
the closing has to be on the next line """

In more technical terms, """ is both the leading and closing tokens of multi-line strings. The leading """ has to be followed by a line return that will be trimmed out of the string. The end """ has to be only preceded by whitespace. This whitespace allows to pick how many leading white-spaces are trimmed out of every line of the multi-line block.

The previous examples can be rewritten as

let header = """
    <head>
        <script type="text/javascript" src="./tracking.js"></script>
    </head>
    """

As you can see, the trimming whitespace allows to have a natural indentation in the let block.

However multi-line string is not a miracle solution to all our problems, in particular in the following:

banner : String
banner = """
         ____    __     _         ___                                           
        /  _/___/ /____(_)____   |__ \\                                          
        / // __  / ___/ / ___/   __/ /     Version 
    """ ++ showVersion True version ++ """
      _/ // /_/ / /  / (__  )   / __/      https://www.idris-lang.org           
     /___/\\__,_/_/  /_/____/   /____/      Type :? for help                     
    
    Welcome to Idris 2.  Enjoy yourself!
    """

This example suffers from being mixed with concatenation, a problem that is fixed with string interpolation.

testRequest : String
testRequest = """
POST /cgi-bin/process.cgi HTTP/1.1
User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)
Host: www.tutorialspoint.com
Content-Type: application/x-www-form-urlencoded
Content-Length: length
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: Keep-Alive

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://clearforest.com/">string</string>
"""

Allowing to pick the trimming white-space allows to copy-paste code easily from example, specially when they come from external sources (like a request-creating tool like postman), and inserted in code (for tests typically).

The last example does not benefit from multi-line strings despite looking like it would. This is because the feature would need to be compatible with interpolated strings (#555) in order to behave as one would expect:

packageDesc : (name, source, execName : String) -> String
packageDesc name source exec = """
    package = {name}
    sourcedir = {source}
    executable = {exec}
    """

Interaction with string interpolation

String interpolation #555 has not been implemented, however, the last example clearly shows that this feature is incomplete without it. We could salvage the same design from #555 and use s""" as starting token of interpolated multi-line strings. One issue we notice is that the braces{ symbol might not be the correct symbol to use anymore since one might be tempted to write json templates using the multi-line syntax:

let jsonUser = """
\{ "name" = {name}
, "date" = \{ "day" : {day}, 
           "month" : {month}, 
           "year" : {year} \}
\}
"""

This is because { and } are defined in #555 as delimiters for interpolated expressions. Which means they need to be escaped for defining json objects.

Since JSON is a very common format to use and expose as API. I suggest we amend #555 in order to replace the delimiters for interpolated strings to \{ and }.

With the amendement the previous example would look like this:

let jsonUser = """
{ "name" = {name}
, "date" = { "day" : \{day}, 
           "month" : \{month}, 
           "year" : \{year} }
}
"""

Which allows the JSON structure to be more visible.

Raw-strings proposal

This proposal might seem unrelated at first glance but is here to address its overlap wrt multi-line strings. Raw strings allow to write string literal without escaping any characters.

Motivation

Raw strings is a feature that allow programmers to write strings between special delimiters such that they do not have to escape anything and the string is interpreted as-is without any transformation from the compiler. This allows for the following use cases:

• Writing regex expression while avoiding escape characters
• Writing strings which are heavy in " which need escape characters
• Writing strings which contains special character that need to be rendered as is, specially characters like\n

Without raw strings, this is how the given example would be expressed in Idris:

let regex = "(\\w\\d)+"
let documentation = "the token for multiline string is \"\"\""
let multilines = "in order to go to the next line, use \\n \n like so"

Those example get particularly bad when mixed together:

let regex = "\"\"\"\\n(\\s*\\w+\\n)*\\s*\"\"\""

Additionally, raw strings are useful when pasting code from other sources, typically one might want to insert a bash script as a literal string as part of a CI jobs:

ci_job : String 
ci_job = "rm -rf build/ttc # clean the ttc files to avoid reusing old build\n" ++
"make all # Compile everything \n"++
"make test | tee results.back # save the results as a backup file `results.back` is always the current run\n"++
"cp results.back ../previous\\ runs/test\\ results:\\ $(DATE).back # Save the current run for reference later\n"++
"                                                          # useful to compare a previous run with results.back\n"++
"make clean"

The design goal for this proposal is thefore as follow:

• Allow any string to be represented within raw string, including line breaks
• Forbid any kind of escape sequences

The proposal

Reuse the syntax and semantics of Rust's raw strings r#"…"#. It comes with the following benefits:

• Already familiar for Rust programmers
• Handles any string without escape characters using dynamic delimiters for the string
• The changes from normal strings are minimal

This means the previous examples can be rewritten like this:

let regex = r#"(\w\d)+"#
let documentation = r#"the token for multiline string is """"#
let multilines = r#"in order to go to the next line, use \n
like so"#
let regex = r#""""\n(\s*\w+\n)*\\s*""""#

You will notice the second line is wrapping to the next line, that is beacause raw strings also interpret the carriage return from the text editor. This allows programmers to copy paste code safely without having character be accidentally escaped or white space trimmed like happens with string interpolation.

Additionally, the Rust implementation allows to declare exactly what the closing delimiter should be by using the opening delimiter. This allows to express every string witout restriction on its content. Take this HTML example

let link = r#"<a href="#C4">Jump to Chapter 4</a>"#

This would not parse because the sequence "# appears in the content of the string. However we can change the delimiters for raw strings to be "## instead in order to insert "# while avoiding escapes:

let link = r##"<a href="#C4">Jump to Chapter 4</a>"## -- valid

FInally, the bash example can be pasted without risk of errors:

ci_job : String 
ci_job = r##"rm -rf build/ttc # clean the ttc files to avoid reusing old build
make all # Compile everything
make test | tee results.back # save the results as a backup file `results.back` is always the current run
cp results.back ../previous\ runs/test\ results:\ $(DATE).back # Save the current run for reference later
                                                          # useful to compare a previous run with results.back
make clean"##

Overlap and difference with multi-line strings

One might notice that we now have two ways to writing a line across multiple lines, we can use either multi-line strings:

let hello = """
This is a multi-line string
"""

or we can use raw strings:

let hello = r#"
This is a raw string
"#

The difference is that the first string allows characters to escape, while the second interpret the content between quotes literally. For reference this is how each of those strings will be interpreted:

This is a multi-line string

and

\nThis is a raw string\n

Again, this is because raw strings do not trim or interpret the content between quotes, it parses everything in quote literally, including line breaks.

Additionally, by definition, raw strings cannot be interpolated, since they interpret their content witout any modification whatsoever. Supporting interpolation would go against the design goal nr2 because it would include escape sequence for the interpolation.

If we take the banner example we see that there is no mix of feature that allow to both bypass the escaping of character and allow for interpollation:

banner : String
banner = """
         ____    __     _         ___                                           
        /  _/___/ /____(_)____   |__ \\                                          
        / // __  / ___/ / ___/   __/ /     Version \{showVersion True version}
      _/ // /_/ / /  / (__  )   / __/      https://www.idris-lang.org           
     /___/\\__,_/_/  /_/____/   /____/      Type :? for help                     
    
    Welcome to Idris 2.  Enjoy yourself!
    """

where \ has to be escaped because it itself is the escape character

banner : String
banner = r#"
         ____    __     _         ___                                           
        /  _/___/ /____(_)____   |__ \                                          
        / // __  / ___/ / ___/   __/ /     Version \{showVersion True version}
      _/ // /_/ / /  / (__  )   / __/      https://www.idris-lang.org           
     /___/\__,_/_/  /_/____/   /____/      Type :? for help                     
    
    Welcome to Idris 2.  Enjoy yourself!"

Here we do not need to escape \ anymore but \{ is now interpreted as raw text rather than an instruction to interpolate the string.

Discussion and alternatives

The reason why those two proposals are brought up together is that both of those proposals is that they both have the ability to write strings over multiple lines. The great thing about those two proposals is that they borrow from other proposals which already have gone through their stages of discussion and design.

Here I will summarise some examples of already implemented proposals and why I personally think are unsuitable for Idris:

Swift's raw strings

Swift uses #"…"# as delimiters for raw strings, and the same delimiters for multi-line strings with """. The proposal is here:

https://github.com/apple/swift-evolution/blob/master/proposals/0200-raw-string-escaping.md

The issue with this design is that is allows for # to be an escape character for interpolated strings. Which blurs the line between what is actually raw and what is still interpreted.

This aspect is the one I am the least confident about since it can fix the problem we see on the banner example. But clashes with the second design goal of raw strings.

Python's multi-line and raw strings

Python uses """ as multi-line strings but does not allow to trim prefix whitespace on each line, which make it incompatible with Idris' culture of indentation:

longFunctionName = let indentedBit = """
                                        this won't ever work
                                        """ in ...

Raw strings are delcared by prepending r in front of a string, they work with multi-line strings as well. However they have the problem that they do not allow every single character to appear in raw strings since the closing delimiter could appear within the string, and since the delimiter is not customisable they cannot be represented. This clashes with the first design goal of raw strings

Scala's raw and multi-line strings

Scala has a very interesting way to go about this because it does all the string transformations at runtime with the pattern id"…" where id is a method call which will use the string literal following it as an argument.

Scala also supports multiline strings with """ which does not perform any trimming either in term of line breaks or leading whitespace, however it supports a method called stripMargin which remove leading whitespace using | as a delimiter:

"""Four score and
  |seven years ago
  |our fathers ...""".stripMargin

This works but is not very ideal since it incurs runtime execution for something that could be done at compile-time. This is specially important for proofs on multi-line strings.

C++ Raw multi-line strings

As far as I know, c++ does not have interpolated string. It has multiline strings through two mechanisms:

string hello = "this is a multiline"
               "string that spans two lines";

or

string hello = R"][(this is a multiline
string that spans two lines)][";

The second one is a raw string, which means leading whitespace won't be trimmed away. Additionally, this syntax has the benefit of allowing arbitrary delimiters for multiline strings. However, allowing arbitrary delimiters might lead to completely unreadable code. Additionally, the semantics of the double quote " has been changed to mean "start selecting the delimiter" since the delimiter is between the double quote and the left parenthesis. This does meet the design goals of raw strings, but might be more complicated than necessary as the discussion on the rust proposal already pointed out.

Implementation considerations

This document is not a way to encourage implementing both those features at the same time. As of today, we have a Pull Request (#999) which almost implements multi-line strings. I suggest we perform the following changes to it so that it conforms to the design outlined here:

• Use the second delimiter to pick the whitespace trimming
• Forbid anyother character than a line break after the oppening token
• Forbid any other character others than whitespace before the closing token

As for raw strings, if someone needs them very quickly we can start by implementing r#"…"# as delimiters for raw strings without being able to adjust the delimiters. And implement the dynamic delimiters at a later date.

Finally, since technically speaking multi-line strings and string interpolation are orthogonal, we can implement interpolation at a later date on multi-line strings.

Issues tracking the progress of each of those features should be openned so that we can track the state of their implementation. Issue #555 should be amended to mention support for multi-line strings.

Conclusion

It seems both those proposals capture something we want in the language, I propose to talk here about them. I am particularly interested in the design goal of raw strings, that is, supporting mult-line raw strings and forbid any kind of escape characters. Maybe we do not actually want this in the language and raw string should be single line and have escape characters!

Additionally, I would suggest we keep with existing solutions without modifications, such that users comming from either rust or swift don't get tricked into thinking the syntax works the way the same way as their host language, when in reality the behaviour has been altered.

Finally I propose to close this issue once the discussion is over and split those issues in two in order to keep track of them separately. Thank you for your time!

@andylokandy
Copy link
Contributor

In the example of the interpolated multi-line banner, why the "{" is escaped?

@andrevidela
Copy link
Collaborator Author

@andylokandy in the section about how Multi-line strings interact with string interpolation I suggest we change #555 to \{ in order to allow JSON strings without escapes.

Since JSON is a very common format to use and expose as API. I suggest we amend #555 in order to replace the delimiters for interpolated strings to { and }.

@andylokandy
Copy link
Contributor

Why not adopt the common #{ or ${?

@ohad
Copy link
Collaborator

ohad commented Jan 31, 2021

Looks like #999 now implements the raw string proposal thanks to @andylokandy

@andrevidela
Copy link
Collaborator Author

@andylokandy Either of them could work in principle. However, I'd rather not overload the meaning of $ any more than necessary since there was already a suggestion to use ${ for positional implicit argument binding (The proposal hasn't been written yet). #{ can work but it would make it awkward to use with multi-line raw strings with string interpolation (a-la swift) if we want to implement it down the line (Maybe that is what we want right now!). Because r#"This is a string : "hello"#{world}"# is unparsable, of course we could add the dynamic delimiters to fix it but it somewhat unfortunate if two language features clash in such a predictable way.

@andylokandy
Copy link
Contributor

andylokandy commented Jan 31, 2021

Since raw string and interpolation string are two non-orthogonal features, I think it's ok to share the same special escape character. Further, with # we can interpolate string without {}:

main : IO ()
main =
  do
    let name = generateName
    let body = generateBody complexArg0 complexArg1 complexArg2
    let content = f"""
      module Test

      #name : IO ()
      #name = #body
      """
    writeToFile content "Test.idr" 

@andrevidela
Copy link
Collaborator Author

That looks okay but it adds # as a new character that needs to be escaped in interpolated strings. Whereas with \{ only \ need to be escaped, which was already the case anyways. Worth noting is that if we want to display \{ in interpolated strings we would need to escape the escape which would lead to \\{ but that is already the expected behaviour of \ since the same problem occurs when printing \n.

@andylokandy
Copy link
Contributor

I feel good with adding # as a new to-be-escaped character (that's why I was saying that in the interpolated string we want to escape more).

@andylokandy
Copy link
Contributor

andylokandy commented Jan 31, 2021

I am wondering if we can completely get rid of the leading s or f of the interpolation string after adopting \{?

This was referenced Feb 4, 2021
@andrevidela
Copy link
Collaborator Author

I've openned #1027 and #1029 to keep track of the two features separately, we also need to ammend #555 since we don't need the s prefix now that interpolated sections are escaped

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants