
grammar files get opened an unnecessary amount of times, causing an enormous loading time when creating a parser #992

Open

ornariece opened this issue Sep 12, 2021 · 72 comments

@ornariece
Contributor

it seems that, when using a FromPackageLoader object, a grammar file is opened and read from each time another grammar uses a rule imported from it. this means opening the same file over and over again, once for each occurrence of a rule contained in that file.

while this may not be noticeable for parsers that only use grammar files contained in the same directory (meaning no custom FromPackageLoader is necessary), it becomes highly problematic when using many FromPackageLoaders, as the time required to construct a parser goes up enormously.

by placing a print(resource_name) in the get_data() function of the stdlib module pkgutil.py, i was able to count how many times each of my grammar files was loaded. for example, the common.lark grammar provided by lark gets opened 61 (!) times, one of my own grammars 25 times, another 16, etc.
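
for reference, this is roughly the setup (a minimal sketch; the package name, search path and grammar text here are placeholders, not my actual project):

from lark import Lark
from lark.load_grammar import FromPackageLoader

# resolve %import statements against the "grammars" directory of "my_package";
# under the hood each lookup goes through pkgutil.get_data()
custom_loader = FromPackageLoader("my_package", ("grammars",))

parser = Lark(
    """
    start: name
    %import shared_rules.name
    """,
    parser="lalr",
    import_paths=[custom_loader],
)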

@MegaIng
Member

MegaIng commented Sep 12, 2021

Can you provide an example grammar (or several)? lark should not do more than one import of common.lark per grammar file.

@ornariece
Contributor Author

i have 46 grammar files, some of which don't use common.lark.

lark should not do more than one import of common.lark per grammar file.

do you mean the common.lark file gets opened and read once per grammar file? wouldn't it be better to cache the grammar once and for all?

Can you provide an example grammar (or several)?

let me cook smth up so that i don't have to link my whole project lol

@MegaIng
Member

MegaIng commented Sep 12, 2021

wouldn't it be better to cache the grammar once and for all?

Parsing/loading the grammar file is almost certainly not the problem. Also, consider these grammars:

// common_rules.lark
name: CNAME

%import common.CNAME

// a.lark

start_a: "A:" name

%import common_rules.name


// b.lark

start_b: "B:" name

%import common_rules.name


// main.lark

start: start_a | start_b

%import a.start_a
%import b.start_b

when you now parse "A:hello", you get the tree Tree('start', [Tree('start_a', [Tree('a__name', [Token('a__common_rules__CNAME', 'hello')])])]). Notice how the import path (main (no prefix) imports a, which imports common_rules) is included in the rule and token names. This would be quite hard if we cached the parsed grammar, for almost no benefit, because parsing is so fast.

The slowdown you have is because of the sheer number of rules, which is hard to avoid since the prefixes create all those new rules (e.g. a__name and b__name are completely distinct rules, each analyzed on its own).
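
To see the prefixes concretely, you can load main.lark and print the tree (a sketch, assuming the four files above sit next to the script):

from lark import Lark

# main.lark imports a.lark and b.lark, which both import common_rules.lark
parser = Lark.open("main.lark", rel_to=__file__, parser="lalr")

print(parser.parse("A:hello"))
# the imported rules appear under their import path, e.g. Tree('a__name', ...)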

@ornariece
Contributor Author

i get that (i used to merge transformers manually before merge_transformers existed, so i'm familiar with the whole naming of imported rules). what i'm questioning is not the parsing of the grammar each time, but rather the opening of the file containing the grammar each time. or am i completely mistaken in thinking that what's slowing down the construction of the parser is the opening and closing of the files (we're talking about literally 100+ file openings)?

@erezsh
Member

erezsh commented Sep 12, 2021

Parsing the lark grammar is done using LALR, so unless it's thousands of lines, I don't think it will be noticeable.

However, reading the file again and again might be noticeable on an extremely slow filesystem, such as NFS.

@MegaIng
Member

MegaIng commented Sep 12, 2021

or am i completely mistaken in thinking that what's slowing down the construction of the parser is the opening and closing of the files (we're talking about literally 100+ file openings)?

Probably, although I can't be 100% sure. You can add a simple print/timing statement in Lark.__init__ near line 300, after load_grammar. At that point all file access (except the cache) has been done.

@ornariece
Contributor Author

ornariece commented Sep 12, 2021

You can add a simple print/timing statement in Lark.__init__ near line 300, after load_grammar. At that point all file access (except the cache) has been done.

how will that help me distinguish between the opening of the files and the parsing of the grammars though? at that point both have been done if i'm not mistaken

@MegaIng
Member

MegaIng commented Sep 12, 2021

@ornariece I would expect that whole operation to be the fast part and the slow part to come afterwards (when compiling the grammar).

@ornariece
Contributor Author

i just measured. in my case, the part after it takes less time in each call of Lark.__init__. notably, for the call that takes the most time of all, i get 0.78s of loading and 0.55s of compiling.
so yeah, opening each file just once would probably reduce that loading time to at most a few hundredths of a second

@MegaIng
Member

MegaIng commented Sep 12, 2021

@ornariece Can you check if the loading part is the slow stuff by adding

from functools import lru_cache
import pkgutil
pkgutil.get_data = lru_cache(pkgutil.get_data)

before import lark?

(This should at the very least remove all file access for common.lark)

@ornariece
Contributor Author

do you mean lru_cache()(pkgutil.get_data)?

@MegaIng
Member

MegaIng commented Sep 12, 2021

For me it works without the (), but yes. (probably a Python version difference)

@ornariece
Contributor Author

ah yes, i was using 3.7.
well, no notable difference... so it's mostly parsing time.
i guess if i want to avoid that, i should limit my use of imported grammars...

@MegaIng
Member

MegaIng commented Sep 12, 2021

@ornariece Can you provide the full project? It is not impossible to cache the parsing, but I would like to do the profiling myself.

@MegaIng
Member

MegaIng commented Sep 12, 2021

But also note that cache=True for the whole Lark instance should be really fast either way.

@ornariece
Contributor Author

@ornariece Can you provide the full project? It is not impossible to cache the parsing, but I would like to do the profiling myself.

i can't do that, unless you're willing to sign an NDA :/

@MegaIng
Member

MegaIng commented Sep 12, 2021

Ok, I will try to provide a patch without the project for the moment.

@ornariece
Contributor Author

But also note that cache=True for the whole Lark instance should be really fast either way.

i've tried that before, without seeing any improvement or any cache file

@MegaIng
Member

MegaIng commented Sep 12, 2021

i've tried that before, without seeing any improvement or any cache file

That, on the other hand, seems like a bug. The cache file gets stored in your temp folder. Can you try manually specifying a cache file name (e.g. cache="grammar_cache.tmp")?
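
Something like this (a sketch; the grammar path is a placeholder):

from lark import Lark

# cache only works with the LALR parser: the first run compiles the grammar
# and pickles the result, later runs load the pickle instead of recompiling
parser = Lark.open("main.lark", parser="lalr", cache="grammar_cache.tmp")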

@ornariece
Contributor Author

that creates the file alright, but i can't notice any measurable improvement.
(i'm using open_from_package to create the parser; maybe that matters, although from what i've seen it shouldn't)
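
(for reference, roughly how i create it; the package and file names are placeholders:)

from lark import Lark

# loads "grammars/main.lark" from inside the installed package "my_package"
parser = Lark.open_from_package(
    "my_package", "main.lark", ("grammars",),
    parser="lalr", cache=True,
)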

@erezsh
Member

erezsh commented Sep 12, 2021

Iirc, in order to detect changes in imported files, we end up reading and parsing all of the grammars, even when cache=True.

@MegaIng
Member

MegaIng commented Sep 12, 2021

@erezsh But only once. I would be surprised if that were as slow as reading and parsing everything repeatedly. (And yes, open_from_package shouldn't matter.)

@MegaIng
Member

MegaIng commented Sep 12, 2021

@ornariece Can you try this branch? pip install git+https://github.com/MegaIng/lark.git@load_files_once

@ornariece
Contributor Author

no noticeable change still. i checked that the change you made in load_grammar is indeed reached, and it is.
so yeah, i think we can be pretty certain now that it's the parsing of each imported library that's eating up the time. i didn't expect it to take that much time for such small grammars tbh (390 lines in total). i get the trickiness of handling imported grammars without re-parsing them, but is there hope you'll go down that path at some point haha?

@MegaIng
Member

MegaIng commented Sep 12, 2021

@ornariece That branch should also only parse each grammar once.

@MegaIng
Member

MegaIng commented Sep 12, 2021

@ornariece Could you use something like this:

from time import time


def log_wrap(f):
    # count calls to f and time each one (the {args=} f-strings need Python 3.8+)
    i = 0

    def wraps(*args, **kwargs):
        nonlocal i
        i += 1
        print(f"{i:3}: {f} called with {args=} and {kwargs=}")
        start = time()
        r = f(*args, **kwargs)  # note: if f raises, the elapsed time is never printed
        end = time()
        print(f"{i:3}: {f} took {end - start}s")
        return r

    return wraps


import pkgutil
import io

io.open = log_wrap(io.open)
pkgutil.get_data = log_wrap(pkgutil.get_data)

from lark import Lark

from lark import load_grammar
load_grammar._parse_grammar = log_wrap(load_grammar._parse_grammar)

And try to figure out if any of that takes long?

@ornariece
Contributor Author

@ornariece That branch should also only parse each grammar once.

ok now i'm confused

And try to figure out if any of that takes long?

no function is taking significantly long. what's curious though is that many calls to get_data don't output their elapsed time. there are 970 calls to get_data, and only 235 prints of elapsed time (counting all functions)

@MegaIng
Member

MegaIng commented Sep 12, 2021

@ornariece That is expected, since not all calls to get_data succeed (they throw FileNotFoundError).

Can you do a full-on profile of the entire application and look at which functions take the most time?

I would suggest https://nejc.saje.info/pstats-viewer.html or https://jiffyclub.github.io/snakeviz/ to look at the results.

It might be that deepcopy takes a lot of time.

@erezsh
Member

erezsh commented Sep 12, 2021

Why not just use the regular profile module? Sort by cumulative time and you should see who's at fault here.
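
Something along these lines (a sketch; build_parsers stands in for whatever constructs your Lark instances):

import cProfile
import pstats

# profile just the parser construction, then sort by cumulative time
cProfile.run("build_parsers()", "parser_init.prof")
pstats.Stats("parser_init.prof").sort_stats("cumulative").print_stats(20)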

@ornariece
Contributor Author

ornariece commented Sep 12, 2021

i've already done that actually (using snakeviz). but here i'm talking about the time the application spends initializing in the imports (i.e. including creating the parsers). i've analyzed the import times using https://pypi.org/project/importtime-waterfall/, and what comes out of it is that creating the parsers is ~100% of it. but wait, let me analyze the creation of the parsers specifically

@ornariece
Contributor Author

    cached_parser_data = pickle.load(f)
AttributeError: 'ElementsTransformer' object has no attribute '__str__'

@ornariece
Contributor Author

do you have to duplicate the rules though? can't you make some custom "accessor" that accesses them by properly walking the import tree?
not sure what you mean by that %include directive

@MegaIng
Member

MegaIng commented Sep 12, 2021

ElementsTransformer is a class of yours, right? So you are using transformer=ElementsTransformer(). But why does that break...

@erezsh Suggestion: change the default level of the logger to ERROR, so that that specific call to logger.exception does not get swallowed by default (WARNING still gets hidden)

@ornariece
Contributor Author

oops yea that's from me. i'm not using that transformer directly, it's merged into a transformer that is in turn merged with the one i use for this parser

@MegaIng
Member

MegaIng commented Sep 13, 2021

At some point we are going to have to duplicate the rules, unless we completely rewrite a lot of the library (if that is even possible without taking performance hits, since the data in the resulting Tree would then depend on context, e.g. the outer rules using it). However, it might be possible to do a lot of restructuring of load_grammar. (note to self: turn the import resolving inside out, e.g. first determine all imports that are to be done, then work backwards up the import tree)

not sure what you mean by that %include directive

I mean something like this:

// main.lark

start: start_a | start_b

%include a
%include b

(assuming the same a.lark and b.lark as above). Then parsing "A:hello" would instead result in the tree Tree('start', [Tree('start_a', [Tree('name', [Token('common_rules__CNAME', 'hello')])])]), without the a__ prefix, since the grammars a and b are completely merged. There would also be only one import of common_rules that gets done. Would that be an option?

@ornariece
Contributor Author

how would that work with merging transformers though? with no grammar namespace, rules from different grammars would map to the same transformer method
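
(for context, this is roughly how i rely on the namespacing today; a sketch with placeholder transformers, using the grammars from earlier:)

from lark import Transformer
from lark.visitors import merge_transformers

class MainTransformer(Transformer):
    def start_a(self, children):  # directly imported rules keep their plain name
        return children

class CommonRulesTransformer(Transformer):
    def name(self, children):
        return children

# copies CommonRulesTransformer's methods in under the "a__" prefix, so its
# name() handles the namespaced rule a__name from the trees shown earlier
merged = merge_transformers(MainTransformer(), a=CommonRulesTransformer())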

@MegaIng
Member

MegaIng commented Sep 13, 2021

It wouldn't.

@ornariece
Contributor Author

ornariece commented Sep 13, 2021

if i'm getting this correctly, the only requirement for such a statement to work for me is to not have naming conflicts among the imported ("included") grammars?

@MegaIng
Member

MegaIng commented Sep 13, 2021

Yes. But I am also not sure how big the benefit is.

@MegaIng
Member

MegaIng commented Sep 13, 2021

But I am currently more interested in why the cache fails. (It shouldn't.) Can you post the line where you are calling the Lark initialization, with all the options you provide? And does ElementsTransformer have a __str__ method/attribute?

@ornariece
Contributor Author

i think it might be quite big if it can avoid duplicating rules each time a grammar is imported. i might be overestimating it though; the gain would be made on nr_deepcopy_tree and resolve_term_references, right?

@ornariece
Contributor Author

[screenshot of the Lark initialization call]

@ornariece
Contributor Author

and no, ElementsTransformer doesn't have a __str__ attribute (neither do any of my transformers)

@MegaIng
Member

MegaIng commented Sep 13, 2021

@ornariece Can you try to create a minimal example that falls outside the NDA and create a new issue for the cache bug?

@ornariece
Contributor Author

so i added a __str__ to it, and...
[screenshot]

@MegaIng
Member

MegaIng commented Sep 13, 2021

To what did you add the __str__ method?

@ornariece
Contributor Author

@ornariece Can you try to create a minimal example that falls outside the NDA and create a new issue for the cache bug?

i will

@ornariece
Contributor Author

To what did you add the __str__ method?

to my ElementsTransformer class

@MegaIng
Member

MegaIng commented Sep 13, 2021

What was the problem, and why did that fix it? Or did you have a somewhat broken __str__, potentially in a base class?

@ornariece
Contributor Author

the problem was the exception i provided, and i have no clue why the absence of a __str__ method is problematic, nor why adding one fixes it. ElementsTransformer only inherits from Transformer, so i am clueless as to what's causing that

@ornariece
Contributor Author

ornariece commented Sep 13, 2021

i'm also surprised at how much time is saved by the cache option!

@ornariece
Contributor Author

ok, even weirder: with cache=True the same exception still gets raised even though i've added the __str__ method. i'll include that in the bug report

@MegaIng
Member

MegaIng commented Sep 13, 2021

It completely skips the frontend (text -> grammar object -> BNF rules -> LALR state table) and only leaves a bit of loading from file. It is designed to essentially act as a "precompile the grammar" option.
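
A quick way to see the difference (a sketch; the grammar path is a placeholder):

import time
from lark import Lark

# the first construction compiles the grammar and writes the cache file,
# the second one just unpickles the ready-made parser
for run in ("cold", "warm"):
    t0 = time.perf_counter()
    Lark.open("main.lark", parser="lalr", cache="grammar_cache.tmp")
    print(run, f"{time.perf_counter() - t0:.3f}s")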

Is the performance now acceptable? I might at some point still redesign parts of load_grammar.py, but if your problem is solved for the moment I would postpone that a bit.

@ornariece
Contributor Author

well, we went from 2s to 0.4s, so it would be kinda ungrateful of me to say it's not haha. i still think ~1s to create all my parsers for my rather small grammars is excessive, but i wouldn't want to push my luck too far.
is there still a point to the %include syntax with the cache option enabled? because i think such a syntax might improve things

@MegaIng
Member

MegaIng commented Sep 13, 2021

%include might have general benefits, since it would reduce the number of duplicated rules, which should in general provide better performance.

But also note: the last image you posted still shows calls to load_grammar, which means that at least one of your grammars is not being cached.

@ornariece
Contributor Author

i was importing another parser in the same file... fixing that, we are now at under 0.1s. i can't decently call that anything other than satisfying!

well, thanks a lot for your time and the various improvements! (i'd still be interested in a %include syntax, but that is not very pressing)

@MegaIng
Member

MegaIng commented Sep 13, 2021

@ornariece I will create a PR, probably tomorrow. Now I gotta sleep. :-)

@erezsh
Member

erezsh commented Sep 13, 2021

@MegaIng We can deduplicate rules in other, maybe better ways. For example, nested grammars can stand on their own, so that importing a.lark from two different grammars will result in the same rules and terminals (and the namespace will be just a, not nested)
