Why is the start rule feature undocumented? #77

Open
bpj opened this issue Mar 15, 2020 · 12 comments

Comments

@bpj

bpj commented Mar 15, 2020

Why is the start rule feature undocumented? I have a very good use case for it. Should I refrain from using it?

@mohawk2
Collaborator

mohawk2 commented Mar 24, 2020

Sorry for responding slowly. It's documented that the first rule is the "start" rule. Can you spell out a bit more what problem you're facing?

@bpj
Author

bpj commented Mar 24, 2020 via email

@mohawk2
Collaborator

mohawk2 commented Mar 24, 2020

Why not make a PR to document it? @ingydotnet any thoughts?

@bpj As an alternative thought, it sounds like you're taking quite a code-driven approach to your parsing. Have you considered a more data-driven approach whereby you produce the entire parse-tree (which means you'd only need to call parse on the top-level document), then give it semantic meaning in a following phase?

@bpj
Author

bpj commented Mar 25, 2020

@mohawk2 I think you don't understand. It has everything to do with my "approach" to the data.

  1. First I parse the input text which contains certain directives, some of which contain "paths" to some location in a data structure which as yet is unavailable.

  2. The AST is traversed/evaluated to produce the output. Only now is the data structure which the "paths" point into provided as an argument to the evaluation method.

  3. Some "paths" point to values which themselves contain strings specifying "paths". These secondary/"indirect" paths must now be parsed using a subset of the main grammar and the values they point to are then retrieved from the same data structure, which was not available during the main parsing phase.

There simply is no way both paths can be parsed during the same phase, because the secondary/indirect path does not yet "exist" during the main parsing phase.

As a concrete example suppose the main text contains a path !/foo/1/bar, where the ! indicates that the value pointed to by this path contains the path to the actual value. Now in phase 2 the evaluation method is passed a data structure looking like this (represented as YAML for brevity):

foo:
  - some value
  - bar: '/biz/buz/quux'
    # presumably more data here
biz:
  buz:
    quux: The actual data
    # presumably more data here
# presumably more data here

When the evaluator sees the piece of the AST representing the primary path /foo/1/bar it fetches the value pointed to by that path, which is the "secondary" path /biz/buz/quux. The evaluator now calls on a parser instance to parse that path using the subset of the grammar which parses paths, and then the evaluator fetches the value "The actual data" pointed to by that path.
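The two-phase lookup described here can be sketched as follows. This is a minimal Python illustration, not Pegex and not the author's code: the names `parse_path`, `fetch` and `evaluate` are hypothetical, and the path syntax is reduced to plain slash-separated keys (no quoted keys, no Unicode handling).

```python
import re

# Hypothetical, simplified path syntax: "/" followed by a key, repeated.
# The real DSL in this thread also has quoted keys; those are ignored here.
PATH_RE = re.compile(r"(?:/[^/]+)+")

def parse_path(text):
    """Parse a bare path string into a list of keys (ints for list indexes)."""
    if not PATH_RE.fullmatch(text):
        raise ValueError(f"not a path: {text!r}")
    return [int(k) if k.isdigit() else k for k in text.split("/")[1:]]

def fetch(data, keys):
    """Walk nested dicts/lists following the parsed keys."""
    for key in keys:
        data = data[key]
    return data

def evaluate(path_text, data):
    """Resolve a path; a leading '!' means the value found is itself a path."""
    indirect = path_text.startswith("!")
    body = path_text[1:] if indirect else path_text
    value = fetch(data, parse_path(body))
    if indirect:
        # Second parse: only possible now, because the string being parsed
        # comes out of the data, which did not exist during the main parse.
        value = fetch(data, parse_path(value))
    return value

# The YAML example above, written as a Python structure.
data = {
    "foo": ["some value", {"bar": "/biz/buz/quux"}],
    "biz": {"buz": {"quux": "The actual data"}},
}
result = evaluate("!/foo/1/bar", data)
```

The point of the sketch is that `parse_path` is called twice with the same rules: once on a path taken from the main text and once on a string that only appears at evaluation time.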

Since the syntax for specifying paths is the same in both cases it is only natural to parse the "secondary" path using the same grammar, but using the path rule as start rule instead of the full_input rule. Note that the path syntax is a bit more complex than just / (: SLASH WORD+ )+ /, since there can be keys which don't match / WORD+ /: even "simple" keys are Unicode aware, so that the actual regex for matching a "word" is more along the lines of

/ (: (= \pL ) \X (: \p{Dash}? (= [\pL\pN] ) \X | _ )* ) /

Please see the perlre and perluniprops documentation if you don't know what these escapes mean. Basically: "A 'word' starts with a letter in the Unicode sense, followed by zero or more underscores/letters/numerics in the Unicode sense, possibly with following combining diacritical marks and possibly separated by dashes in the Unicode sense". While this is a regex, it is complicated enough that I don't want to have to maintain it in more than one place!

There are also keys which are "quoted" using angle brackets. These may include whitespace, character escapes, a syntax for references to characters by codepoint, and some other things, notably slashes. /people/<Kurt G$#<0xf6>del>/<email/url> is a possible example (I won't go into the matter of matching hash keys with Unicode normalization!). So you can't just skip over the path using some regex and parse it later: you have to parse the path in the main input, and each key in it, to see where the path ends. Again, I would like to not need to keep the same piece of grammar in more than one place.

Besides, the "pointy-quoted string" syntax is used in other places too. There is also (although not permitted in paths) a variant which uses the Unicode pointy brackets ‹…›, and a "double-quoted" variant with <<...>> or «…» which allows data interpolation. So there are already four grammar rules, each with its recursive subrule for nested balanced delimiters, which are very similar, and which I want to keep all in one place.

(FWIW there is a point in not using ordinary ASCII quotes and backslash escaping: you are supposed to be able to use this syntax inside YAML or JSON quoted strings without ending up in Escaping Level Hell! Thus the ASCII quotes and backslash are not used in my DSL syntax.)

I hope this explains better what I mean. Note that English isn't my native language, which unfortunately may mean that I don't know the right words to use for some concepts.

I'll be happy to take a stab at documenting the alternative start rule syntax if there is an interest.

@mohawk2
Collaborator

mohawk2 commented Mar 28, 2020

I don't understand why you wouldn't have a rule called something like pathspec. You could then, in the original parse run (not requiring a second call), have a rule something like:

dollar-ref: TICK BANG pathspec TICK

That way the original AST would contain the pathspec already parsed.

Are you sure you're not solving the wrong problem here? :-)

@bpj
Author

bpj commented Mar 28, 2020

Of course the grammar for the whole language would contain the rule, and an AST from a parse of a whole text would contain the paths contained in that text, but some paths are fetched from elsewhere after the whole text/program has already been parsed. Now how would I parse a string containing a path fetched from elsewhere, which is not embedded in any other text, without specifying the rule for parsing a path as the start rule instead of the top rule used when parsing a whole program?

The problem is that I need to parse some strings using a subset of the grammar. I can't see how I can do that without either

  • Keeping that subset in a separate file/string and concatenating it to the rest of the grammar whenever I want to parse the whole language, which is hell to maintain.

  • Specifying the pathspec rule as alternative start rule when I want to parse a string containing only a path.

I can't see what would be wrong with the second approach. I could of course set things up so that the grammar always parses either a whole text or a bare path, but that seems wrong, since sometimes I want a whole text and sometimes I want a bare path, but never either/or.

@mohawk2
Collaborator

mohawk2 commented Mar 29, 2020

My gut says that if you provide a suitable subset of your program, I can provide an answer. Please prove me wrong so we can justify this API change :-)

@ingydotnet
Collaborator

I think you both misunderstand what start_rules is for.

It is a set of rules passed to the Pegex compiler. The compiler takes a textual Pegex grammar and turns it into a grammar object. That's phase 1.

Then it does a combinate phase. It takes the starting rule, follows all the rule references, and applies certain combining effects. Any rule that is not reached in this process is removed from the grammar object. Note: they don't need to be removed but currently that's what happens.

So start_rules is a list of alternate starting rules whose trees contain rules that you want to survive the combinate phase, that otherwise would not.
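The pruning behaviour described above amounts to a reachability walk from the start rules. The following is a rough Python sketch of that idea only; it is not the Pegex compiler's code, and the `combinate` function and the rule names in it are hypothetical.

```python
def combinate(rules, start_rules):
    """Keep only the rules reachable from any of the given start rules."""
    keep, todo = set(), list(start_rules)
    while todo:
        name = todo.pop()
        if name in keep:
            continue
        keep.add(name)
        todo.extend(rules[name])  # follow this rule's references
    return {name: refs for name, refs in rules.items() if name in keep}

# Hypothetical grammar: rule name -> names of the rules it references.
rules = {
    "full_input": ["directive", "text"],
    "directive": ["path"],
    "path": ["key"],
    "key": [],
    "text": [],
    "orphan": ["key"],  # not referenced from full_input
}

# With only the default start rule, 'orphan' is dropped but 'path' survives.
default = combinate(rules, ["full_input"])

# Listing 'orphan' as an extra start rule keeps it in the result.
extended = combinate(rules, ["full_input", "orphan"])
```

In this picture a rule such as `path`, which the top rule already reaches, survives without being listed as an extra start rule.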

Now there is a related concept in Pegex::Parser of a starting rule. Look in Pegex/Parser.pm and you'll see:

sub parse {
    my ($self, $input, $start) = @_;

You can do a parse with the grammar using an alternate starting rule. This sounds like what you are trying to do. You only need the start_rules attribute if the compiler is throwing out the non-default rules you need during its compile/combinate phase.

It doesn't sound like you need start_rules at all because the rule in question is already part of your main start rule, so it will be available also for your alternate start rule.
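As a language-neutral illustration of one grammar with two entry points, here is a toy Python sketch. `ToyParser` is a hypothetical stand-in, not Pegex::Parser; it only mirrors the shape of the `parse($input, $start)` signature quoted above, and its "rules" are plain regexes.

```python
import re

class ToyParser:
    """Hypothetical stand-in for a parser whose parse() takes a start rule."""

    def __init__(self, rules, default_start):
        self.rules = rules                  # rule name -> regex source
        self.default_start = default_start

    def parse(self, text, start=None):
        # An omitted start argument falls back to the default (top) rule.
        name = start or self.default_start
        match = re.fullmatch(self.rules[name], text)
        if match is None:
            raise ValueError(f"{text!r} does not match rule {name!r}")
        return match.group(0)

# One rule table; 'path' is defined once and reused inside 'full_input'.
PATH = r"(?:/\w+)+"
rules = {"path": PATH, "full_input": rf"(?:[^/!]|!?{PATH})*"}
parser = ToyParser(rules, "full_input")

whole = parser.parse("see !/foo/1/bar here")         # default start rule
bare = parser.parse("/biz/buz/quux", start="path")   # alternate start rule
```

The path rule is written once and serves both calls, which is the maintenance property being asked for in this thread.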

I hope I understood things right, and that this is helpful.

@bpj
Author

bpj commented Apr 3, 2020 via email

@ingydotnet
Collaborator

The optional start rule feature will not go away. It should be documented. Feel free to make a pull request if you'd like to do that.

@bpj
Author

bpj commented Apr 4, 2020 via email

@bpj
Author

bpj commented Apr 18, 2020

If it's OK I'll leave this issue open as a reminder until I've made that docu PR.

3 participants