Converting YAML to JSON is super slow #422

Closed
JoakimSoderberg opened this issue Apr 22, 2020 · 17 comments

@JoakimSoderberg

Describe the bug
When converting a 5000 line YAML to JSON using yq:

time docker run \                                                      
        --rm \
        -u $(id -ru):$(id -rg) \
        -v $(pwd):/shared \
        -w /shared \
        mikefarah/yq yq r --tojson swagger.yml

It takes 2 minutes:

...
docker run --rm -u $(id -ru):$(id -rg) -v $(pwd):/shared -w /shared  yq r    0,07s user 0,03s system 0% cpu 2:18,42 total

Similarly, a 500-line file takes around 5s, which is in itself super slow.

Input YAML

5000 lines of YAML, not pasting that here.

Command
The command you ran:

yq r --tojson swagger.yml

Actual behavior

Super slow

Expected behavior

It should take under a second; that's how long a simple Python program takes to do the same conversion.


@PaulCharlton

+1

@PaulCharlton

PaulCharlton commented Apr 22, 2020

in my experience, it slowed down a lot sometime after early January

> yq -V
yq version 3.2.1

@PaulCharlton

same for

> yq -V
yq version 3.3.0

I'm imagining an O(N^2) algorithm somewhere it doesn't belong.

@PaulCharlton

after downgrading to yq 2.4.1 --> zippy fast again (i.e. a YAML file that took 30s to parse in 3.x is back to a sub-second response with 2.4.1)

this helped with the downgrade -> https://stackoverflow.com/questions/53837861/brew-list-and-install-specific-versions-of-formula

@mikefarah

Ah yeah, this will be from when I moved to 3.0. As part of handling YAML anchors on conversion to JSON, I made it iterate through the YAML structure. Let me play around to see if I can do that better... or even skip it.

@mikefarah

Ok, what I can do is make it not handle merge anchors by default when converting to JSON. However, if there are merge anchors in the YAML file, you will then need to pass --explodeAnchors/-X along with the -j flag; otherwise the command will fail (and suggest that you retry with --explodeAnchors). Thoughts?

@PaulCharlton

> Ok, what I can do is make it not handle merge anchors by default when converting to JSON. However, if there are merge anchors in the YAML file, you will then need to pass --explodeAnchors/-X along with the -j flag; otherwise the command will fail (and suggest that you retry with --explodeAnchors). Thoughts?

Hey Mike -- will --explodeAnchors still be slow? If so, it doesn't get my vote. I have written plenty of one-pass parsers, and what it has boiled down to in the past is using some memory to cache the anchors (macros) as they are found, then doing an efficient look-up (hashed by name) at the point of reference.
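
For illustration, a minimal sketch in Go of that cache-and-lookup approach. The node type and its fields are made up for this example and are not yq's actual data structures; it only shows the single-pass walk with a hashed anchor cache.

package main

import "fmt"

// node is a hypothetical parse-tree node, used only to illustrate the idea;
// yq's real data structures are different.
type node struct {
	anchor   string // set if this node defines an anchor (&name)
	alias    string // set if this node references an anchor (*name)
	value    string
	children []*node
}

// explode walks the document once. Anchored subtrees are cached in a map as
// they are encountered, so resolving an alias is a single hashed lookup at
// the point of reference rather than a re-scan of the document.
func explode(n *node, anchors map[string]*node) *node {
	if n == nil {
		return nil
	}
	if n.anchor != "" {
		anchors[n.anchor] = n // cache the anchored subtree by name
	}
	if n.alias != "" {
		if target, ok := anchors[n.alias]; ok {
			return target // O(1) substitution; YAML defines anchors before use
		}
	}
	for i, c := range n.children {
		n.children[i] = explode(c, anchors)
	}
	return n
}

func main() {
	base := &node{anchor: "base", value: "shared"}
	doc := &node{children: []*node{base, {alias: "base"}}}
	doc = explode(doc, map[string]*node{})
	fmt.Println(doc.children[1].value) // prints "shared"
}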

@PaulCharlton

PaulCharlton commented Apr 23, 2020

Or, if by merge anchors you mean the special << key: sort both the LHS and RHS of the merge by key name, then do an O(1)-per-key zipper across the two key lists?
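
A rough sketch of that zipper idea (again just illustrative Go, with a made-up entry type rather than real YAML nodes): sort both sides once, then merge them in a single pass, letting the mapping's own keys override keys pulled in via <<, as the YAML merge-key rules require.

package main

import (
	"fmt"
	"sort"
)

type entry struct {
	key, val string
}

// mergeSorted merges a mapping's explicit keys (own) with the keys pulled in
// via a << merge anchor (inherited). Both slices are sorted by key, then
// zipped in a single pass; explicit keys win, per YAML merge-key rules.
func mergeSorted(own, inherited []entry) []entry {
	sort.Slice(own, func(i, j int) bool { return own[i].key < own[j].key })
	sort.Slice(inherited, func(i, j int) bool { return inherited[i].key < inherited[j].key })

	out := make([]entry, 0, len(own)+len(inherited))
	i, j := 0, 0
	for i < len(own) && j < len(inherited) {
		switch {
		case own[i].key < inherited[j].key:
			out = append(out, own[i])
			i++
		case own[i].key > inherited[j].key:
			out = append(out, inherited[j])
			j++
		default: // same key on both sides: the explicit key overrides
			out = append(out, own[i])
			i++
			j++
		}
	}
	out = append(out, own[i:]...)
	out = append(out, inherited[j:]...)
	return out
}

func main() {
	own := []entry{{"name", "web"}, {"port", "8080"}}
	inherited := []entry{{"image", "nginx"}, {"port", "80"}}
	fmt.Println(mergeSorted(own, inherited))
	// prints [{image nginx} {name web} {port 8080}]
}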

@mikefarah

Yep, it will still be slow. That sounds pretty neat, would def take a PR for it :)

@PaulCharlton

@mikefarah I would love to take a look and do a PR -- before beginning, is there some architectural reason why 2.4.x is quick and 3.x is slow?

@mikefarah

Yeah, so 2.4.x didn't handle anchors -- the underlying YAML parser did not expose them.

In 3.x I updated the underlying parser, which let me access the merge anchor information.

Now, when outputting to JSON, I need to process the anchors into a regular object so I can encode to JSON correctly; this wasn't done in 2.4.x.

In 3.x, it is done here: https://github.com/mikefarah/yq/blob/master/cmd/utils.go#L184 before being passed to the encoder (https://github.com/mikefarah/yq/blob/master/pkg/yqlib/encoder.go#L61)
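
For comparison only (this is not yq's code path), a minimal YAML-to-JSON round trip in Go using gopkg.in/yaml.v3, which I believe is the updated parser referred to above. Decoding into a generic value resolves anchors, aliases and, as far as I know, << merge keys, so the result can be handed straight to encoding/json.

package main

import (
	"encoding/json"
	"fmt"
	"log"

	"gopkg.in/yaml.v3"
)

func main() {
	// Small document with an anchor and a merge key, the case discussed here.
	src := []byte(`
base: &base
  image: nginx
  port: 80
web:
  <<: *base
  port: 8080
`)
	var doc interface{}
	if err := yaml.Unmarshal(src, &doc); err != nil {
		log.Fatal(err)
	}
	out, err := json.MarshalIndent(doc, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(out))
}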

@mikefarah

Further context: the way explodeAnchors works (it's quite naive :)) is to deeply recurse through all matching nodes and merge all the leaf nodes into a new, empty data structure.
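
To illustrate why that deep recursion gets expensive on documents with multi-level anchor reuse, here is a toy Go program (unrelated to yq's actual code): each level aliases the previous level twice, and the exploded node count, which is roughly the amount of copying a naive explode has to do, doubles at every level.

package main

import "fmt"

// item is a toy tree node used only to show the cost pattern: exploding by
// deep recursion copies every aliased subtree anew.
type item struct {
	children []*item
}

// size counts nodes as if every shared (aliased) subtree were copied out,
// i.e. the size of the exploded document.
func size(n *item) int {
	total := 1
	for _, c := range n.children {
		total += size(c)
	}
	return total
}

func main() {
	// Each level references the previous level twice, the way multi-level
	// anchor reuse does; the exploded size doubles at every level.
	prev := &item{}
	for level := 1; level <= 10; level++ {
		prev = &item{children: []*item{prev, prev}}
		fmt.Printf("level %d: exploded size %d nodes\n", level, size(prev))
	}
}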

@dd-ssc

dd-ssc commented Jun 1, 2020

+1:

A YAML input file of mine is 827 lines long and makes heavy, multi-level use of anchors and aliases; exploding it into a 10294-line output file takes 10-12 minutes. In some situations, I can speed up development by exploding it into a JSON file and loading that using jq (which takes <1s), but obviously, whenever the original YAML file changes, I am stalled. The only workaround for now is to comment out the aliases that have not changed.

I do understand that anchor/alias support was implemented in a naive way in order to support it at all - and @mikefarah: I do sincerely appreciate the great work 👍 - anchors and aliases enable me to do fantastic stuff with YAML - but whatever makes this faster would bring a significant benefit to my work.

BTW: I know nothing about Go, but from what I've heard it's really good at parallelizing tasks ("goroutines" and whatnot?); looking at top while the exploding is running shows yq using ~180% CPU (with the system at roughly 25% user, 5% sys, 70% idle) on a mid-2015 MacBook Pro (Intel Core i7 quad core); wouldn't this ideally max out my CPU? (at - I don't know - 400% / 95% user or so?)

@PaulCharlton: Is there anything I can do to support you with the PR you mention? I'm able and more than happy to test dev versions and report benchmarks - somewhat self-serving, as it would give me a chance to get GOing (jeez, my puns have seen better days...)

@dd-ssc

dd-ssc commented Jun 2, 2020

Update: my YAML input file keeps growing (now 958 lines, including comments), and I'm making smarter re-use of multi-level sections at multiple locations (63972 lines in the JSON output file).
YAY to YAML anchors and aliases! 🎉

However, the time yq requires to explode the YAML file and/or convert it to JSON is getting prohibitive; from my log output: Time to load yaml file: 1h 19m 53s.

In order to keep making progress on my project, I will have to stop using yq for the time being and work around this issue by converting the YAML input file to a JSON output file up front; that takes <1s with a fairly simple Python script.

@mikefarah

Alright I think I have a good way forward with this - working on it, should be available within a few days if I get enough time. Does anyone have a sample large doc to test it with? I've made my own but not sure how representative it is...

mikefarah added this to the 3.3.1 milestone on Jun 10, 2020
@dd-ssc

dd-ssc commented Jun 17, 2020

I re-ran the file I mentioned earlier; it now takes less than a second. That's one heck of a speed-up!
@mikefarah: Great work, mate! 👍

@mikefarah

Fantastic, glad to hear that :)

Thanks for the feedback
