
grammar files get opened an unnecessary amount of times, causing an enormous loading time when creating a parser #992

Open

ornariece opened this issue Sep 12, 2021 · 72 comments

@ornariece
Contributor

it seems that, when using a FromPackageLoader object, a grammar file is opened and read from each time another grammar uses a rule imported from it. this means opening the same file over and over again, once for each occurrence of a rule contained in that file.

while this may not be noticeable for parsers that only use grammar files contained in the same directory (meaning no custom FromPackageLoader is necessary), it becomes highly problematic when using many FromPackageLoaders, as the time required to construct a parser goes up enormously.

by placing a print(resource_name) in the get_data() function of the stdlib module pkgutil.py, i was able to count how many times each of my grammar files was loaded. for example, the common.lark grammar provided by lark gets opened 61 (!) times, one of my own grammars 25 times, another 16, etc.
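
for reference, this is roughly the setup (a minimal sketch; the package name, search path and grammar text here are placeholders, not my actual project):

from lark import Lark
from lark.load_grammar import FromPackageLoader

# resolve %import statements against the "grammars" directory of "my_package";
# under the hood each lookup goes through pkgutil.get_data()
custom_loader = FromPackageLoader("my_package", ("grammars",))

parser = Lark(
    """
    start: name
    %import shared_rules.name
    """,
    parser="lalr",
    import_paths=[custom_loader],
)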

@MegaIng
Member

MegaIng commented Sep 12, 2021

Can you provide an example grammar (or several)? lark should not do more than one import of common.lark per grammar file.

@ornariece
Contributor Author

i have 46 grammar files, some of which don't use common.lark.

lark should not do more than one import of common.lark per grammar file.

do you mean the common.lark file gets opened and read once per grammar file? wouldn't it be better to cache the grammar once and for all?

Can you provide an example grammar (or several)?

let me cook smth up so that i don't have to link my whole project lol

@MegaIng
Member

MegaIng commented Sep 12, 2021

wouldn't it be better to cache the grammar once and for all?

Parsing/loading the grammar file is almost certainly not the problem. Also, consider these grammars:

// common_rules.lark
name: CNAME

%import common.CNAME

// a.lark

start_a: "A:" name

%import common_rules.name


// b.lark

start_b: "B:" name

%import common_rules.name


// main.lark

start: start_a | start_b

%import a.start_a
%import b.start_b

when you now parse "A:hello", you get the tree Tree('start', [Tree('start_a', [Tree('a__name', [Token('a__common_rules__CNAME', 'hello')])])]). Notice how the import path (main (no prefix) imports a, which imports common_rules) is included in the rule and token names. This would be quite hard if we cached the parsed grammar, for almost no benefit, because parsing is so fast.

The slowdown you have is because of the sheer number of rules, which is hard to avoid since the prefixes create all those new rules (e.g. a__name and b__name are completely distinct rules, each analyzed on its own).
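
To see the prefixes concretely, you can load main.lark and print the tree (a sketch, assuming the four files above sit next to the script):

from lark import Lark

# main.lark imports a.lark and b.lark, which both import common_rules.lark
parser = Lark.open("main.lark", rel_to=__file__, parser="lalr")

print(parser.parse("A:hello"))
# the imported rules appear under their import path, e.g. Tree('a__name', ...)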

@ornariece
Contributor Author

i get that (i used to merge transformers manually before merge_transformers existed, so i'm familiar with the whole naming of imported rules). what i'm questioning is not the parsing of the grammar each time, but rather the opening of the file containing the grammar each time. or am i completely mistaken in thinking that what's slowing down the construction of the parser is the opening and closing of the files (we're talking about literally 100+ file openings)?

@erezsh
Member

erezsh commented Sep 12, 2021

Parsing the lark grammar is done using LALR, so unless it's thousands of lines, I don't think it will be noticeable.

However, reading the file again and again might be noticeable on an extremely slow filesystem, such as NFS.

@MegaIng
Member

MegaIng commented Sep 12, 2021

or am i completely mistaken in thinking that what's slowing down the construction of the parser is the opening and closing of the files (we're talking about literally 100+ file openings)?

Probably, although I can't be 100% sure. You can add a simple print/timing statement in Lark.__init__ near line 300, after load_grammar. At that point all file access (except the cache) has been done.

@ornariece
Contributor Author

ornariece commented Sep 12, 2021

You can add a simple print/timing statement in Lark.__init__ near line 300, after load_grammar. At that point all file access (except the cache) has been done.

how will that help me distinguish between the opening of the files and the parsing of the grammars though? at that point both have been done if i'm not mistaken

@MegaIng
Member

MegaIng commented Sep 12, 2021

@ornariece I would expect that whole operation to be the fast part and the slow part to come afterwards (when compiling the grammar).

@ornariece
Contributor Author

i just measured. in my case, the part after it takes less time in each call of Lark.__init__. notably, for the call that takes the most time of all, i get 0.78s of loading and 0.55s of compiling.
so yeah, opening each file just once would probably reduce that loading time to at most a few hundredths of a second

@MegaIng
Member

MegaIng commented Sep 12, 2021

@ornariece Can you check if the loading part is the slow stuff by adding

from functools import lru_cache
import pkgutil
pkgutil.get_data = lru_cache(pkgutil.get_data)

before import lark?

(This should at the very least remove all file access for common.lark)

@ornariece
Contributor Author

do you mean lru_cache()(pkgutil.get_data)?

@MegaIng
Member

MegaIng commented Sep 12, 2021

For me it works without the (), but yes. (probably a Python version difference)

@ornariece
Contributor Author

ah yes, i was using 3.7.
well, no notable difference... so it's mostly parsing time.
i guess if i want to avoid that, i should limit my use of imported grammars...

@MegaIng
Member

MegaIng commented Sep 12, 2021

@ornariece Can you provide the full project? It is not impossible to cache the parsing, but I would like to do the profiling myself.

@MegaIng
Member

MegaIng commented Sep 12, 2021

But also note that cache=True for the whole Lark instance should be really fast either way.

@ornariece
Contributor Author

@ornariece Can you provide the full project? It is not impossible to cache the parsing, but I would like to do the profiling myself.

i can't do that, unless you're willing to sign an NDA :/

@MegaIng
Member

MegaIng commented Sep 12, 2021

Ok, I will try to provide a patch without the project for the moment.

@ornariece
Contributor Author

But also note that cache=True for the whole Lark instance should be really fast either way.

i've tried that before, without seeing any improvement or any cache file

@MegaIng
Member

MegaIng commented Sep 12, 2021

i've tried that before, without seeing any improvement or any cache file

That, on the other hand, seems like a bug. The cache file gets stored in your temp folder. Can you try manually specifying a cache file name (e.g. cache="grammar_cache.tmp")?
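
Something like this (a sketch; the grammar path is a placeholder):

from lark import Lark

# cache only works with the LALR parser: the first run compiles the grammar
# and pickles the result, later runs load the pickle instead of recompiling
parser = Lark.open("main.lark", parser="lalr", cache="grammar_cache.tmp")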

@ornariece
Contributor Author

that creates the file alright, but i can't notice any measurable improvement.
(i'm using open_from_package to create the parser; maybe that matters, although from what i've seen it shouldn't)
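
(for reference, roughly how i create it; the package and file names are placeholders:)

from lark import Lark

# loads "grammars/main.lark" from inside the installed package "my_package"
parser = Lark.open_from_package(
    "my_package", "main.lark", ("grammars",),
    parser="lalr", cache=True,
)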

@erezsh
Member

erezsh commented Sep 12, 2021

Iirc, in order to detect changes in imported files, we end up reading and parsing all of the grammars, even when cache=True.

@MegaIng
Member

MegaIng commented Sep 12, 2021

@erezsh But only once. I would be surprised if that were as slow as reading and parsing everything repeatedly. (And yes, open_from_package shouldn't matter.)

@MegaIng
Member

MegaIng commented Sep 12, 2021

@ornariece Can you try this branch? pip install git+https://github.com/MegaIng/lark.git@load_files_once

@ornariece
Contributor Author

no noticeable change still. i checked that the change you made in load_grammar is indeed reached, and it is.
so yeah, i think we can be pretty certain now that it's the parsing of each imported library that's eating up the time. i didn't expect it to take that much time for such small grammars tbh (390 lines in total). i get the trickiness of handling imported grammars without re-parsing them, but is there hope you'll go down that path at some point haha?

@MegaIng
Member

MegaIng commented Sep 12, 2021

@ornariece That branch should also only parse each grammar once.

@MegaIng
Member

MegaIng commented Sep 12, 2021

@ornariece Could you use something like this:

from time import time


def log_wrap(f):
    # count calls to f and time each one (the {args=} f-strings need Python 3.8+)
    i = 0

    def wraps(*args, **kwargs):
        nonlocal i
        i += 1
        print(f"{i:3}: {f} called with {args=} and {kwargs=}")
        start = time()
        r = f(*args, **kwargs)  # note: if f raises, the elapsed time is never printed
        end = time()
        print(f"{i:3}: {f} took {end - start}s")
        return r

    return wraps


import pkgutil
import io

io.open = log_wrap(io.open)
pkgutil.get_data = log_wrap(pkgutil.get_data)

from lark import Lark

from lark import load_grammar
load_grammar._parse_grammar = log_wrap(load_grammar._parse_grammar)

And try to figure out if any of that takes long?

@ornariece
Contributor Author

@ornariece That branch should also only parse each grammar once.

ok now i'm confused

And try to figure out if any of that takes long?

no function is taking significantly long. what's curious though is that many calls to get_data don't output their elapsed time. there are 970 calls to get_data, and only 235 prints of elapsed time (counting all functions)

@MegaIng
Member

MegaIng commented Sep 12, 2021

@ornariece That is expected, since not all calls to get_data succeed (they throw FileNotFoundError).

Can you do a full-on profile of the entire application and look at which functions take the most time?

I would suggest https://nejc.saje.info/pstats-viewer.html or https://jiffyclub.github.io/snakeviz/ to look at the results.

It might be that deepcopy takes a lot of time.

@erezsh
Member

erezsh commented Sep 12, 2021

Why not just use the regular profile module? Sort by cumulative time and you should see who's at fault here.
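
Something along these lines (a sketch; build_parsers stands in for whatever constructs your Lark instances):

import cProfile
import pstats

# profile just the parser construction, then sort by cumulative time
cProfile.run("build_parsers()", "parser_init.prof")
pstats.Stats("parser_init.prof").sort_stats("cumulative").print_stats(20)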

@ornariece
Contributor Author

ornariece commented Sep 12, 2021

i've already done that actually (using snakeviz). but here i'm talking about the time the application spends initializing in the imports (i.e. including creating the parsers). i've analyzed the import times using https://pypi.org/project/importtime-waterfall/, and what comes out of it is that creating the parsers is ~100% of it. but wait, let me analyze the creation of the parsers specifically

@ornariece
Contributor Author

    cached_parser_data = pickle.load(f)
AttributeError: 'ElementsTransformer' object has no attribute '__str__'

@ornariece
Contributor Author

do you have to duplicate the rules though? can't you make some custom "accessor" that accesses them by properly walking the import tree?
not sure what you mean by that %include directive

@MegaIng
Member

MegaIng commented Sep 12, 2021

ElementsTransformer is a class of yours, right? So you are using transformer=ElementsTransformer(). But why does that break...

@erezsh Suggestion: change the default level of the logger to ERROR, so that that specific call to logger.exception does not get swallowed by default (WARNING still gets hidden)

@ornariece
Contributor Author

oops yea that's from me. i'm not using that transformer directly, it's merged into a transformer that is in turn merged with the one i use for this parser

@MegaIng
Member

MegaIng commented Sep 13, 2021

At some point we are going to have to duplicate the rules, unless we completely rewrite a lot of the library (if that is even possible without taking performance hits, since the data in the resulting Tree would then depend on context, e.g. the outer rules using it). However, it might be possible to do a lot of restructuring of load_grammar. (note to self: turn the import resolving inside out, e.g. first determine all imports that are to be done, then work backwards up the import tree)

not sure what you mean by that %include directive

I mean something like this:

// main.lark

start: start_a | start_b

%include a
%include b

(assuming the same a.lark and b.lark as above). Then parsing "A:hello" would instead result in the tree Tree('start', [Tree('start_a', [Tree('name', [Token('common_rules__CNAME', 'hello')])])]), without the a__ prefix, since the grammars a and b are completely merged. There would also be only one import of common_rules that gets done. Would that be an option?

@ornariece
Contributor Author

how would that work with merging transformers though? with no grammar namespace, rules from different grammars would map to the same transformer method
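
(for context, this is roughly how i rely on the namespacing today; a sketch with placeholder transformers, using the grammars from earlier:)

from lark import Transformer
from lark.visitors import merge_transformers

class MainTransformer(Transformer):
    def start_a(self, children):  # directly imported rules keep their plain name
        return children

class CommonRulesTransformer(Transformer):
    def name(self, children):
        return children

# copies CommonRulesTransformer's methods in under the "a__" prefix, so its
# name() handles the namespaced rule a__name from the trees shown earlier
merged = merge_transformers(MainTransformer(), a=CommonRulesTransformer())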

@MegaIng
Member

MegaIng commented Sep 13, 2021

It wouldn't.

@ornariece
Contributor Author

ornariece commented Sep 13, 2021

if i'm getting this correctly, the only requirement for such a statement to work for me is to not have naming conflicts among the imported ("included") grammars?

@MegaIng
Member

MegaIng commented Sep 13, 2021

Yes. But I am also not sure how big the benefit is.

@MegaIng
Member

MegaIng commented Sep 13, 2021

But I am currently more interested in why the cache fails. (It shouldn't.) Can you post the line where you are calling the Lark initialization, with all the options you provide? And does ElementsTransformer have a __str__ method/attribute?

@ornariece
Contributor Author

i think it might be quite big if it can avoid duplicating rules each time a grammar is imported. i might be overestimating it though; the gain would be made on nr_deepcopy_tree and resolve_term_references, right?

@ornariece
Contributor Author

[screenshot of the Lark initialization call]

@ornariece
Contributor Author

and no, ElementsTransformer doesn't have a __str__ attribute (neither do any of my transformers)

@MegaIng
Member

MegaIng commented Sep 13, 2021

@ornariece Can you try to create a minimal example that falls outside the NDA and create a new issue for the cache bug?

@ornariece
Contributor Author

so i added a __str__ to it, and...
[screenshot]

@MegaIng
Member

MegaIng commented Sep 13, 2021

To what did you add the __str__ method?

@ornariece
Contributor Author

@ornariece Can you try to create a minimal example that falls outside the NDA and create a new issue for the cache bug?

i will

@ornariece
Contributor Author

To what did you add the __str__ method?

to my ElementsTransformer class

@MegaIng
Member

MegaIng commented Sep 13, 2021

What was the problem, and why did that fix it? Or did you have a somewhat broken __str__, potentially in a base class?

@ornariece
Contributor Author

the problem was the exception i provided, and i have no clue why the absence of a __str__ method is problematic, nor why adding one fixes it. ElementsTransformer only inherits from Transformer, so i am clueless as to what's causing that

@ornariece
Contributor Author

ornariece commented Sep 13, 2021

i'm also surprised at how much time is saved by the cache option!

@ornariece
Contributor Author

ok, even weirder: with cache=True the same exception still gets raised even though i've added the __str__ method. i'll include that in the bug report

@MegaIng
Member

MegaIng commented Sep 13, 2021

It completely skips the frontend (text -> grammar object -> BNF rules -> LALR state table) and only leaves a bit of loading from file. It is designed to essentially act as a "precompile the grammar" option.
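
A quick way to see the difference (a sketch; the grammar path is a placeholder):

import time
from lark import Lark

# the first construction compiles the grammar and writes the cache file,
# the second one just unpickles the ready-made parser
for run in ("cold", "warm"):
    t0 = time.perf_counter()
    Lark.open("main.lark", parser="lalr", cache="grammar_cache.tmp")
    print(run, f"{time.perf_counter() - t0:.3f}s")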

Is the performance now acceptable? I might at some point still redesign parts of load_grammar.py, but if your problem is solved for the moment I would postpone that a bit.

@ornariece
Contributor Author

well, we went from 2s to 0.4s, so it would be kinda ungrateful of me to say it's not haha. i still think ~1s to create all my parsers for my rather small grammars is excessive, but i wouldn't want to push my luck too far.
is there still a point to the %include syntax with the cache option enabled? because i think such a syntax might improve things

@MegaIng
Member

MegaIng commented Sep 13, 2021

%include might have general benefits, since it would reduce the number of duplicated rules, which should in general provide better performance.

But also note: the last image you posted still shows calls to load_grammar, which means that at least one of your grammars is not being cached.

@ornariece
Contributor Author

i was importing another parser in the same file... fixing that, we are now at under 0.1s. i can't decently call that anything other than satisfying!

well, thanks a lot for your time and the various improvements! (i'd still be interested in a %include syntax, but that is not very pressing)

@MegaIng
Member

MegaIng commented Sep 13, 2021

@ornariece I will create a PR, probably tomorrow. Now I gotta sleep. :-)

@erezsh
Member

erezsh commented Sep 13, 2021

@MegaIng We can deduplicate rules in other, maybe better ways. For example, nested grammars can stand on their own, so that importing a.lark from two different grammars will result in the same rules and terminals (and the namespace will be just a, not nested)
