Add syntax files to simplify porting re2c to new languages. #450
Comments
FWIW, this would be an amazingly cool feature. |
Just wanted to say, again, this would be an amazingly cool feature. |
I started some experimental work on this. I'm constrained on time at the moment, so it's not moving fast, but it's my next most important goal for re2c. |
An update. I've been doing some experimental work on |
Next thing for me is to write a syntax config for D (very close to C/C++) and resurrect test cases from #431. Feedback on the current DSL I used in the syntax file is welcome. Although it's not documented yet, it shouldn't be very hard to understand. There are different groups of configurations:
The first two groups are simple, and the last group is basically templates for language constructs that are used in different parts of codegen. The DSL allows conditionals |
This is all experimental work, all configurations are subject to change while they are on |
Handling a couple of more diverse languages (OCaml, Python) might be an interesting test here. I'm not sure I entirely understand how the init file works btw. |
Dlang support was added in d492026. OCaml support was added in c1ccefa (see discussion in #449). Now, as suggested by @pmetzger, I started looking at Python. Basic example (with a custom syntax file, not shared here):

/*!re2c
    re2c:define:YYFN = ["lex;", "str;", "cur;"];
    re2c:define:YYPEEK = "str[cur]";
    re2c:define:YYSKIP = "cur += 1";
    re2c:yyfill:enable = 0;

    number = [1-9][0-9]*;

    number { return True }
    *      { return False }
*/

def main():
    str = "1234\x00"
    if not lex(str, 0):
        raise RuntimeError("error")

if __name__ == "__main__":
    main()

The generated code looks like this:

# Generated by re2c
def yy0(str, cur):
    yych = str[cur]
    cur += 1
    if yych <= '0':
        return yy1(str, cur)
    elif yych <= '9':
        return yy2(str, cur)
    else:
        return yy1(str, cur)

def yy1(str, cur):
    return False

def yy2(str, cur):
    yych = str[cur]
    if yych <= '/':
        return yy3(str, cur)
    elif yych <= '9':
        cur += 1
        return yy2(str, cur)
    else:
        return yy3(str, cur)

def yy3(str, cur):
    return True

def lex(str, cur):
    return yy0(str, cur)

def main():
    str = "1234\x00"
    if not lex(str, 0):
        raise RuntimeError("error")

if __name__ == "__main__":
    main()

Does it look reasonable? I plan to use the recursive functions code model by default, but the loop/switch model should work as well. Which one is preferable? Do function calls add much overhead in Python? I'll do some benchmarks myself later, but I'm curious to hear what others think. |
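For comparison, here is a hand-written sketch of what a loop/switch version of the same automaton might look like in Python. It is not re2c output: the `lex_loop` name and the integer `state` variable are assumptions made for illustration, and the transitions simply mirror the yy0..yy3 functions above.

```python
def lex_loop(s, cur):
    """Loop/switch sketch of the yy0..yy3 automaton (hypothetical, not re2c output)."""
    state = 0
    while True:
        if state == 0:
            yych = s[cur]
            cur += 1
            if yych <= '0':
                state = 1
            elif yych <= '9':
                state = 2
            else:
                state = 1
        elif state == 1:
            return False
        elif state == 2:
            yych = s[cur]
            if yych <= '/':
                state = 3
            elif yych <= '9':
                cur += 1
                state = 2  # stay in the digit loop without a function call
            else:
                state = 3
        else:  # state == 3
            return True
```

Like the generated code, this relies on the input ending with a sentinel character such as "\x00". Since the digit loop is a plain `while` iteration, the stack depth does not depend on the input length.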
Python is interpreted, and doesn't have much of an optimizer. I suspect loop switch will be faster, but I don't know for sure. Benchmarks will be needed. Another thing about python: it has a // operator, which potentially might be mistaken for a comment. Generally, I think that it might be good if the comment character for a particular language could be defined rather than using the default. Oh, and lastly: python has optional type annotations. Those might be helpful in the generated code for those using mypy. |
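A rough way to check the call-overhead question is a small `timeit` comparison. The sketch below is only an assumption about how such a benchmark could be set up: `scan_recursive` and `scan_loop` are hypothetical stand-ins for the two code models, not re2c output, and the numbers will vary by interpreter and machine. The type annotations are optional and only show how mypy-friendly signatures might look.

```python
import timeit

def scan_recursive(s: str, cur: int) -> int:
    # One Python call per input character, like the recursive-functions model.
    if '0' <= s[cur] <= '9':
        return scan_recursive(s, cur + 1)
    return cur

def scan_loop(s: str, cur: int) -> int:
    # A single loop, like the loop/switch model.
    while '0' <= s[cur] <= '9':
        cur += 1
    return cur

# 300 digits plus a sentinel: deep enough to measure, but well under
# CPython's default recursion limit.
data = "1234567890" * 30 + "\x00"

t_rec = timeit.timeit(lambda: scan_recursive(data, 0), number=200)
t_loop = timeit.timeit(lambda: scan_loop(data, 0), number=200)
print(f"recursive: {t_rec:.4f}s  loop: {t_loop:.4f}s")
```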
Huh, I got |
@skvadrik Oh! Python does not eliminate tail calls. I had not noticed how you were doing it; if you want to use recursion for this in Python, you need a trampoline function so that the recursion depth doesn't grow with the input and hit the recursion limit. I guess using |
For reference, https://github.com/0x65/trampoline describes how to do trampolines with python. |
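A minimal sketch of the trampoline idea applied to the example automaton, assuming each state function returns either a final result or a zero-argument callable for the next step. This is a hand-written illustration, not the API of the linked library and not re2c output; the `trampoline` helper is hypothetical.

```python
def trampoline(state, s, cur):
    """Keep calling returned thunks until a non-callable result appears."""
    step = state(s, cur)
    while callable(step):
        step = step()
    return step

def yy0(s, cur):
    yych = s[cur]
    cur += 1
    if yych <= '0':
        return lambda: yy1(s, cur)
    elif yych <= '9':
        return lambda: yy2(s, cur)
    else:
        return lambda: yy1(s, cur)

def yy1(s, cur):
    return False

def yy2(s, cur):
    yych = s[cur]
    if yych <= '/':
        return lambda: yy3(s, cur)
    elif yych <= '9':
        # Return a thunk instead of calling yy2 directly, so the stack stays flat.
        return lambda: yy2(s, cur + 1)
    else:
        return lambda: yy3(s, cur)

def yy3(s, cur):
    return True

print(trampoline(yy0, "1234\x00", 0))
```

Each state returns immediately, so even a very long run of digits never nests calls more than a couple of frames deep. The cost is one extra closure allocation per transition.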
Just looked at the python example, it seems pretty reasonable. |
Vlang support was added in 73853c5. |
So I find myself wanting to use the Python support. I'm an adult and understand that all the syntax etc. for syntax files may change in the future. Could a suitable version of re2c get tagged (perhaps not officially released) for people who want to experiment with real code? |
Use this: https://github.com/skvadrik/re2c/releases/tag/python-experimental. I previously rebased git history so that all python-specific work goes before it, and I shouldn't break git history up to this commit with my future changes. |
@pmetzger It will be very helpful if you try it out and report any issues. :) |
Haskell support was added in 4e78ef8. The configurations have to be a bit more verbose, as even simple operations have to update lexer state and propagate it further down the program (see https://github.com/skvadrik/re2c/tree/syntax-files/examples/haskell). I'm thinking that this can benefit from a language-specific default API (so far it only exists for the C/C++ backend, but the definitions are now all in syntax files, so each syntax file may provide its own default API). There are monadic and pure styles for Haskell. |
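To illustrate what "update lexer state and propagate it further down the program" means in a pure style, here is a small Python sketch with an immutable state record. It only mimics the idea; the `State`, `peek`, `skip` and `lex_digits` names are hypothetical and are not taken from the Haskell examples.

```python
from typing import NamedTuple

class State(NamedTuple):
    text: str  # input with a trailing sentinel
    cur: int   # current position

def peek(st: State) -> str:
    return st.text[st.cur]

def skip(st: State) -> State:
    # No in-place update: every operation returns a new state.
    return st._replace(cur=st.cur + 1)

def lex_digits(st: State) -> State:
    # The updated state has to be threaded through every step explicitly.
    while '0' <= peek(st) <= '9':
        st = skip(st)
    return st

start = State("1234\x00", 0)
end = lex_digits(start)
print(start.cur, end.cur)
```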
🔥 |
JS support was added in 74ace08. |
Zig support was added in 5cd48a8. |
My further plan is to focus on polishing syntax file API (and who knows - maybe eventually even releasing it :D). If you have other interesting languages in mind, please mention them in this thread - the API is not frozen yet and it's possible to change it. That said, for the last three languages (Java, JS, Zig) no changes were needed, which means it should be expressive enough (at least for C-like languages). |
My main issue remains the "comment syntax" for the re2c blocks, but I will confess I haven't dived in deeply enough to things like the API. Maybe I should. One option is to do a release soon but make the support for languages using the syntax files "experimental" to get more widespread feedback. |
@pmetzger I rebased syntax-files branch and pulled it into master. Sorry if I broke your workflow. From now on just use master - I will keep merging syntax-files into it. |
I will give it more thought.
It's not the re2c way to break backward compatibility, if possible to avoid it - I don't think we have a big enough community to get timely feedback. |
So on the comments: there are going to end up being languages where // or /* is valid syntax. (For example, in Python, // is the integer division operator.) It feels safer to be able to use comments that make sense in the context of a given language. |
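A tiny sketch of the clash, assuming a scanner that naively applies the C rule "everything after // is a comment" to Python source. The `strip_c_line_comment` helper is hypothetical and is not how re2c's lexer works; it only shows why C comment syntax cannot be assumed for every host language.

```python
def strip_c_line_comment(line):
    """Naive C-style rule: treat everything after '//' as a comment."""
    return line.split("//", 1)[0].rstrip()

python_line = "half = total // 2"

# In Python, '//' is integer division, so the naive rule cuts off real code.
print(strip_c_line_comment(python_line))
```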
That's a good point about syntax clash: I don't think it's a problem for the opening comment
A language-specific lexer will be hard to implement. At the moment the lexer is written in re2c, and I'd like to keep it this way both for dogfooding and performance reasons. Also, not all languages have multiline comments. Instead of trying to use language-specific syntax, we can do what lex and bison do: use syntax that fits equally badly into any language, namely
What I'm more worried about are single quotes (some languages allow them as parts of identifiers, labels, etc.). Syntax files already have some configurations that tell re2c whether to expect single quotes, backtick-quoted strings, etc. |
I think that's certainly an option, especially if that can be shifted to an alternative in the unlikely event that a specific language is using that specific bracket pair for real syntax.
ML descended languages use them to identify type variables. Lisp uses them to identify unevaluated forms. |
It occurs to me that, with very high likelihood, nothing is ever going to use |
Exactly, that's the way it already works. We just need to extend
Good, let's keep a list of all such cases and gradually add support for them in the lexer (it already knows about some). So far there's one boolean-valued configuration |
Lisp will do both things like |
Nice! I'm curious why you're allowing arbitrary text before the |
I think it's useful (it saves space) to allow starting a block in the middle of a line, e.g.:
|
Makes sense. I also see (given that this is using an re2c regex) why it would be hard to have several different flavors of braces etc. I almost wonder if adding one more character (something like |
I think |
As I was updating docs for the next release, I noticed one more thing to make the syntax more consistent for new languages: block start markers have the word re2c in them (/*!re2c, /*!max:re2c, etc.) and configurations also start with re2c.
For block start markers the easy and logical solution is to use /*!re2go, /*!re2ocaml, etc. (same as the name of the binary) and to allow /*!re2c for backwards compatibility. Of course we have language-independent %{, but I find comment syntax nice for languages that have C-style multiline comments.
For configurations, I always felt that the re2c prefix is a waste of space, as they are obviously inside of a re2c block, so there's no need to type it again. Also, in configurations like re2c:define:YYCTYPE the only part carrying useful information is YYCTYPE, so define: can be dropped as well. This leaves just :YYCTYPE or e.g. :yyfill:enable. The leading colon is useful, as it allows the lexer to know immediately that it's a configuration.
The change is on the syntax-files branch: cb247ad, and examples are updated in later commits, e.g. the commit for C/C++ is 48ac0c2. Of course, old syntax is still supported, but all the examples use the new simplified syntax.
@pmetzger @helly25 @trofi @sergeyklay (as our most active participants, anyone else welcome) what do you think? Any problems with the new syntax? |
The goal of the current naming schemes is to be clear and to express a hierarchy. We claim the root prefix 're2c'.
In the blocks you could indeed omit the prefix, but then you lose clarity in larger code. As it is, even the untrained eye can easily spot those elements. And the middle part 'define' clearly expresses the meta functionality. Next, the separators make certain identification and handling easier; a single ':' prefix would be much harder for humans to work with. It would be faster to write, but much, much harder to read and maintain. Last but not least, the longer identifiers are easier to change when the re2c code is itself generated.
I do agree with your observations regarding the multi-line blocks. They appear to be the best solution, since indexers and other tools will naturally ignore them without knowing anything about the tool. Other solutions are very problematic for tooling.
Cheers
Marcus
|
Thanks Marcus, note that I'm not suggesting to take away any existing syntax: if you find the longer prefixes preferable, just use them. But some new syntax is needed for new language backends to match their tool names (re2d, re2go, etc.). Adding them all as configuration prefixes is possible, of course, but I felt like it's a waste of lexing effort (re2c needs to lex and validate them) and typing effort (on behalf of the user). I also find it less readable, but that's only a personal preference.
|
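To make the proposed shortening concrete, here is a sketch that maps the long configuration names to the short form described in the proposal above (drop the re2c prefix, then drop define:, keep the leading colon). The `shorten` function is hypothetical, the rule is inferred only from the two examples given in the thread (re2c:define:YYCTYPE and re2c:yyfill:enable), and the syntax that was finally adopted may differ.

```python
def shorten(name):
    """Map a long configuration name to the proposed short form (assumed rule)."""
    if name.startswith("re2c:"):
        name = name[len("re2c"):]             # keep the leading ':'
    if name.startswith(":define:"):
        name = ":" + name[len(":define:"):]   # 'define:' carries no information
    return name

for long_name in ("re2c:define:YYCTYPE", "re2c:yyfill:enable", ":YYCTYPE"):
    print(long_name, "->", shorten(long_name))
```

Names that are already in the short form pass through unchanged, which matches the statement that the old syntax stays supported alongside the new one.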
@trofi suggested (offline) to rebrand "re2c" from "regular expressions to C" to "regular expression to compiler" and use it for all language backends. It may be a good idea (although at the moment I'm perhaps irrationally opposed to it after a few weeks of work on writing scripts to auto-fix all the docs to use the right tool name). The question is, should we still install individual tools like re2go, re2ocaml, etc.? We already have re2go and re2rust, so taking them away will break people's workflows. But it seems illogical to have re2ocaml but re2c: prefix in configurations. |
Why? What does it matter what the prefix is? Plus, with the rebranding you already have a solution for all problems. It also does not matter how fast you can do it. Start with the intro on the web page, readmes and major docs... That'll settle it and we can rest this discussion. Keep It Simple Safe 😄
|
Alright. I spoke to some long-time active re2c users that are not on github, and they also like the prefix and want it to be So all of the opinions I've heard are unanimous and I'll go by I still want to allow dropping |
+1 for "regular expressions to code". |
I must admit, I'm not the best person to provide advice on the conciseness of syntax, as I tend to shy away from abbreviations in my work, whenever a longer version exists. Given the vast array of contexts I work with daily, I've developed a habit of using descriptive lexemes everywhere — even down to the parameters in command-line options. This helps me avoid having to recall what a shorthand might mean, which could be a time sink in my case. That being said, I have no issues with the clarity or convenience of the proposed simplifications. I agree that within the re2c blocks, these prefixes may indeed be unnecessary, as the context is already clear. Simplifying the syntax will undoubtedly make configuration files more readable and easier to write, which is always a good thing. And if I've understood the thread correctly — the old syntax will remain supported, so folks like me who prefer more verbose prefixes, or already have systems set up using them, won’t be affected. In summary, the proposed simplifications seem like a reasonable step towards improving usability. I’d support these changes, particularly if they maintain backward compatibility and uphold the principles of structured configuration where it matters. |
Thanks for a detailed answer @sergeyklay !
Sure, backward compatibility is a strict policy, we definitely won't break it for something like adding new configuration syntax. If we want any backwards compatible changes though, now is a good time, as the upcoming release will add support for 8 new languages.
Thanks! Given that every single re2c user or contributor I asked has either a personal, or a general preference for |
Syntax files should be config files that describe a language backend via a set of configurations. When generating code, re2c would map various codegen concepts to the descriptions provided by the syntax file. This way a new language can be added easily by supplying a syntax file (by the user or by re2c developers --- existing backends should be described via syntax files, distributed with the re2c source code and as part of a re2c installation).
The main difficulty is to decide on a minimal set of configurations that are orthogonal and capable of describing different languages, so that we don't have to add new ad-hoc configurations for each new language. Once this is decided, the codegen subsystem should be modified to support syntax files, and existing backends should be rewritten using syntax files (before adding new ones).
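As a rough illustration of the idea (not re2c's actual syntax-file DSL), a backend can be thought of as a table of templates keyed by codegen concept, with the code generator filling in the parameters. Everything in this sketch is an assumption: the concept names, the template strings, and the `emit` helper are hypothetical.

```python
# Hypothetical "syntax files": one template per codegen concept.
C_LIKE = {
    "assign": "{lhs} = {rhs};",
    "return": "return {value};",
    "if": "if ({cond}) {{ {body} }}",
}
PYTHON_LIKE = {
    "assign": "{lhs} = {rhs}",
    "return": "return {value}",
    "if": "if {cond}: {body}",
}

def emit(syntax, concept, **params):
    """Render one codegen concept using the given backend description."""
    return syntax[concept].format(**params)

# The same codegen request, rendered for two different backends.
for syntax in (C_LIKE, PYTHON_LIKE):
    body = emit(syntax, "return", value="True")
    print(emit(syntax, "if", cond="yych <= '9'", body=body))
```

Adding a language then means supplying another table rather than touching the generator, which is the property the description above asks for. The hard part it names, choosing a small orthogonal set of concepts, is exactly what this toy table glosses over.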
Related bugs/commits: