grammar files get opened an unnecessary amount of times, causing an enormous loading time when creating a parser #992
Can you provide an example grammar (or grammars)? lark should not do more than one import of
i have 46 grammar files, some of which don't use
do you mean the
let me cook smth up so that i don't have to link my whole project lol
Parsing/loading the grammar file is almost certainly not the problem. Also, consider these grammars:
when you now parse
The slowdown you have is because of many, many rules, which is hard to avoid, since the prefixes create all those new rules. (e.g.
i get that (i used to merge transformers manually before
Parsing the lark grammar is done using LALR, so unless it's thousands of lines, I don't think it will be noticeable. However, reading the file again and again might be noticeable on an extremely slow filesystem, such as NFS.
Probably, although I can't be 100% sure. You can add a simple print/timing statement in
how will that help me distinguish between the opening of the files and the parsing of the grammars though? at that point both have been done if i'm not mistaken |
@ornariece I would expect that whole operation to be the fast part and the slow part to come afterwards (when compiling the grammar).
i just measured. in my case, the part after it takes less time in each call of
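To separate file-access time from grammar-compilation time, a small timing helper can be wrapped around each phase. This is a minimal stdlib-only sketch; the labels and the stand-in bodies are illustrative placeholders, not lark's internals:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    # Record the wall-clock time of the enclosed block under `label`.
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with timed("read_grammar", timings):
    text = "start: WORD\n"      # stand-in for open(path).read()
with timed("build_parser", timings):
    parsed = text.upper()       # stand-in for e.g. Lark(text, parser="lalr")

print({k: f"{v * 1000:.2f} ms" for k, v in timings.items()})
```

Comparing the two entries directly answers whether file opening or grammar compilation dominates.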
@ornariece Can you check if the loading part is the slow part by adding
before `import lark`? (This should at the very least remove all file access for
do you mean
For me it works without
ah yes i was using 3.7.
@ornariece Can you provide the full project? It is not impossible to cache the parsing, but I would like to do the profiling myself.
But also note that
i can't do that, unless you're willing to sign an NDA :/
Ok, I will try to provide a patch without the project for the moment.
i've tried that before, without seeing any improvement or cache file
That, on the other hand, seems like a bug. The cache file will get stored in your
that creates the file alright, but i can't notice any measurable improvement.
Iirc, in order to detect changes in imported files, we end up reading and parsing all of the grammars, even when cache=True. |
@erezsh But only once. I would be surprised if that would be as slow as reading and parsing everything repeatedly. (And yes
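The "read everything once to validate the cache" behaviour described above can be sketched as a content hash over all imported grammar files: the files still have to be read to detect changes, but each only once, not once per imported rule. This is an illustration of the pattern, not lark's actual cache implementation; `cache_key` is a hypothetical helper:

```python
import hashlib
from pathlib import Path

def cache_key(grammar_paths):
    # Derive a cache key from the contents of every grammar file.
    # If any file changes, the key changes and the cached parser
    # must be rebuilt; otherwise the cached artifact can be reused.
    h = hashlib.sha256()
    for p in sorted(grammar_paths):
        h.update(Path(p).read_bytes())
    return h.hexdigest()
```

A cache layer would compare this key against the one stored alongside the cached parser before deciding to reuse it.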
@ornariece Can you try this branch?
no noticeable change still. i checked that the change you made in
@ornariece That branch should also only parse each grammar once.
@ornariece Could you use something like this:
And try to figure out whether any of that takes long?
ok now i'm confused
no function is taking significantly long. what's curious though is that many calls to the
@ornariece That is expected, since not all calls to
Can you do a full profile of the entire application and look at which functions take the most time? I would suggest https://nejc.saje.info/pstats-viewer.html or https://jiffyclub.github.io/snakeviz/ to look at the results. It might be
Why not just use the regular profile module? Sort by cumulative time and you should see who's at fault here.
i've already done that actually (using snakeviz). but here i'm talking about the time the application spends initializing in the imports (i.e. including creating the parsers). i've analyzed the import times using https://pypi.org/project/importtime-waterfall/, and what comes out of it is that creating the parsers is ~100% of it. but wait, let me analyze the creation of the parsers specifically
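The cumulative-time profile suggested above can be produced with the stdlib `cProfile`/`pstats` modules; `build_parsers` here is a placeholder for the real parser-construction code:

```python
import cProfile
import io
import pstats

def build_parsers():
    # Placeholder for the real work, e.g. constructing the Lark parsers.
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
build_parsers()
profiler.disable()

# Sort by cumulative time so callers that dominate total runtime
# appear at the top, then print the ten most expensive entries.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())
```

The resulting `.prof` data can also be dumped with `stats.dump_stats("out.prof")` and opened in snakeviz for an interactive view.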
do you have to duplicate the rules though? can't you make some custom "accessor" to access them by properly going through the import tree?
@erezsh Suggestion: Change the default level of the logger to
oops yea that's from me. i'm not using that transformer directly, it's merged into a transformer that is in turn merged with the one i use for this parser
At some point we are going to have to duplicate the rules, unless we completely rewrite a lot of the library (if that is even possible without taking performance hits, since the data in the resulting Tree would now depend on context, e.g. outer using rules). However it might be possible to do a lot of restructuring of
I mean something like this:
(assuming same
how would that work with merging transformers though? no grammar namespace means redefining the rule transforming method
It wouldn't.
if i'm getting this correctly, the only requirement for such a statement to work for me is to not have naming conflicts among the imported ("included") grammars?
Yes. But I am also not sure how big the benefit is.
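For context, these are the `%import` forms lark currently documents; each brings a rule or terminal into the importing grammar under a (possibly prefixed) name, which is what creates the duplicated, namespaced rules discussed above:

```lark
%import common.INT                  // terminal from the bundled common.lark
%import .my_grammar.value           // rule from a relative grammar file
%import .my_grammar.value -> v      // same rule, renamed locally
%import common (INT, WORD)          // several names from one grammar at once
```

The hypothetical "include"-style statement discussed here would instead merge the rules without a namespace, which only works when the imported grammars have no naming conflicts.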
But I am currently more interested in why the cache fails. (It shouldn't). Can you post the line where you are calling
i think it might be quite big if it can avoid duplicating rules each time a grammar is imported. i might be overestimating it though; the gain would be made on
and no
@ornariece Can you try to create a minimal example that falls outside the NDA and create a new issue for the cache bug?
To what did you add
i will
to my
What was the problem, and why did that fix it? Or did you have a somewhat broken
the problem was the exception i provided, and i have no clue why the absence of a
i'm also surprised at how much time is saved by the
ok even weirder, with
It completely skips the frontend (text -> grammar object -> BNF rules -> LALR state table) and only leaves a bit of loading from file. It is designed to essentially act as a "precompile the grammar" option. Is the performance now acceptable? I might at some point still redesign parts of
well we went from 2s to 0.4s so it would be kinda ungrateful of me to say it's not haha. i still think ~1s to create all my parsers for my rather small grammars is excessive, but i wouldn't want to push my luck too far
But also note, the last image you posted still shows calls to
i was importing another parser in the same file... fixing that, we are now at under 0.1s. i can't decently call that anything other than satisfying! well, thanks a lot for your time and various improvements! (i'd still be interested in a
@ornariece I will create a PR, probably tomorrow. Now I gotta sleep. :-)
@MegaIng We can deduplicate rules in other, maybe better ways. For example, nested grammars can stand on their own, so that importing
it seems that, when using a `FromPackageLoader` object, a grammar file is opened and read from each time another grammar uses a rule that is imported from that former grammar. this means opening the same file over and over again, for each occurrence of a rule contained in that file.

while this may not be noticeable for parsers that only use grammar files contained in the same directory (meaning no custom `FromPackageLoader` is necessary), it becomes highly problematic when using many `FromPackageLoader`s, as the time required to construct a parser goes up by an absurd amount.

by placing a `print(resource_name)` in the `get_data()` function of the python lib `pkgutil.py`, i was able to count how many times each of my grammar files was loaded. for example, the `common.lark` grammar provided by lark gets opened 61 (!) times, one of my own grammars 25 times, another 16, etc.
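The fix the issue asks for amounts to reading each grammar file once and memoizing the text, so repeated imports of rules from the same file reuse the cached contents. A minimal sketch of that pattern, where `load_grammar` and `OPEN_COUNT` are illustrative names, not lark's API:

```python
import functools

# Tracks how many times each file is physically opened (for demonstration).
OPEN_COUNT = {}

@functools.lru_cache(maxsize=None)
def load_grammar(path):
    # Read a grammar file once; later imports of rules from the same
    # file hit the lru_cache and reuse the text instead of reopening it.
    OPEN_COUNT[path] = OPEN_COUNT.get(path, 0) + 1
    with open(path, encoding="utf-8") as f:
        return f.read()
```

With this in place, the `print(resource_name)` experiment above would show each grammar opened exactly once per process, regardless of how many rules are imported from it.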