Converting YAML to JSON is super slow #422

JoakimSoderberg · 2020-04-22T08:44:02Z

Describe the bug
When converting a 5000 line YAML to JSON using yq:

time docker run \                                                      
        --rm \
        -u $(id -ru):$(id -rg) \
        -v $(pwd):/shared \
        -w /shared \
        mikefarah/yq yq r --tojson swagger.json

It takes 2 minutes:

...
docker run --rm -u $(id -ru):$(id -rg) -v $(pwd):/shared -w /shared  yq r    0,07s user 0,03s system 0% cpu 2:18,42 total

Similarly, a 500 line file takes around 5s, which is in itself super slow.

Input Yaml

5000 lines of YAML, not pasting that here.

Command
The command you ran:

yq r --tojson swagger.yml

Actual behavior

Super slow

Expected behavior

It should take under a second. That's what a simple Python program does it in.

Additional context
Add any other context about the problem here.

The text was updated successfully, but these errors were encountered:

PaulCharlton · 2020-04-22T19:44:57Z

+1

PaulCharlton · 2020-04-22T19:45:52Z

in my experience - it slowed down a lot some time since early January

> yq -V
yq version 3.2.1

PaulCharlton · 2020-04-22T21:06:14Z

same for

> yq -V
yq version 3.3.0

I am imagining an O(N^2) algorithm somewhere inappropriate

PaulCharlton · 2020-04-22T21:31:48Z

after downgrade to yq 2.4.1 --> zippy fast again (ie: 30s to parse yaml file in 3.x became sub second response time with 2.4.1)

this helped to downgrade -> https://stackoverflow.com/questions/53837861/brew-list-and-install-specific-versions-of-formula

mikefarah · 2020-04-22T23:59:15Z

Ah yeah, this will be when I moved to 3.0. As part of handling yaml anchors on conversion to JSON I made it iterate through the yaml structure. Let me play around to see if I can do that better...or even skip it

mikefarah · 2020-04-23T01:11:08Z

Ok what I can do is make it not handle merge anchors by default when converting to JSON - however if there are merge anchors in the yaml file, then you will need to pass in --explodeAnchors/-X with the -j flag. Otherwise, the command will fail (and suggest to you to retry with --explodeAnchors). Thoughts?

PaulCharlton · 2020-04-23T01:16:19Z

Ok what I can do is make it not handle merge anchors by default when converting to JSON - however if there are merge anchors in the yaml file, then you will need to pass in --explodeAnchors/-X with the -j flag. Otherwise, the command will fail (and suggest to you to retry with --explodeAnchors). Thoughts?

Hey Mike -- will the "--explodeAnchors' still be slow? if so, does not get my vote. I have written plenty of one-pass parsers, and what it has boiled down to in the past is using some memory to cache the anchors (macros) as they are found, then doing an efficient look-up (hashed by name) at the point of reference.

PaulCharlton · 2020-04-23T01:19:41Z

or, if by merge anchors you mean the special key <<, then sort by key name both the LHS and RHS of the merge and then do a zipper O(1) across the key lists?

mikefarah · 2020-04-23T01:26:53Z

Yep will still be slow, that sounds pretty neat, would def take a PR for it :)

PaulCharlton · 2020-04-23T01:29:30Z

@mikefarah I am loving to take a look and do a PR -- before beginning, is there some architectural change reason that 2.4.x works quickly and 3.x is slow?

mikefarah · 2020-04-23T04:06:15Z

Yeah so 2.4.x didn't handle anchors, it was because the underlying yaml parser did not expose them.

In 3.x I updated the underlying parser which let me access the merge anchor information.

Now when outputting to json - I needed to process the anchors into a regular object to I can encode to JSON correctly, this wasn't done in 2.4.x.

In 3.x, it is done here: https://github.com/mikefarah/yq/blob/master/cmd/utils.go#L184 before being passed to the encoder (https://github.com/mikefarah/yq/blob/master/pkg/yqlib/encoder.go#L61)

mikefarah · 2020-04-23T04:11:29Z

Further context, the way explodeAnchors works, (it's quite naive :)) is to deeply recurse all matching nodes and merge all the leaf nodes into a new empty data structure

dd-ssc · 2020-06-01T10:12:59Z

+1:

A yaml input file of mine is 827 lines long and makes heavy, multi-level use of anchors and aliases; exploding it into a 10294-line output file takes 10-12 minutes. In some sitations, I can speed up development by exploding it into a json file and loading that using jq (which takes <1s), but obviously, whenever the original yaml file changes, I am stalled. Only workaround for now is to comment out the aliases that have not changed.

I do understand that anchor / alias support was implemented in a naive way in order to support them at all - and @mikefarah: I do sincerely appreciate the great work 👍 - anchors and aliases enable me to do fantastic stuff with yaml - but whatever makes this faster would bring a significant benefit to my work.

BTW: I know nothing about Go, but from I've heard, it's really good at parallelizing tasks ("co-routines" and what not ?!?); looking at top while the exploding is running shows yq using ~180% CPU (e.g. CPU usage: 25% user, 5% sys, 70% idle) on a mid-2015 MacBook Pro (Intel Core i7 Quad Core); wouldn't this ideally max out my CPU ? (at - I don't know - 400% / 95% user or so ?!?)

@PaulCharlton: Is there anything I can do to support you with the PR you mention ? I'm able and more than happy to test dev versions and report benchmarks - somewhat self-serving, would get me a chance to get GOing (jeez, my puns have seen better days...)

dd-ssc · 2020-06-02T06:20:28Z

Update: My yaml input file keeps growing (e.g. 958 lines, including comments) and making smarter re-use of multi-level sections used at multiple locations (e.g. 63972 lines in the json output file).
YAY to yaml anchors and aliases! 🎉

However, the time yq requires to explode the yaml file and/or convert it to json is getting prohibitive; from my log output: Time to load yaml file: 1h 19m 53s.

In order to keep making progress in my project, I will have to stop using yq for the time being and work around this issue by converting the yaml input file to a json output file up front; takes <1s with a fairly simple Python script.

mikefarah · 2020-06-10T05:38:09Z

Alright I think I have a good way forward with this - working on it, should be available within a few days if I get enough time. Does anyone have a sample large doc to test it with? I've made my own but not sure how representative it is...

dd-ssc · 2020-06-17T07:47:56Z

I re-ran the file I mentioned earlier, takes less than a second. That's one heck of a speed-up!
@mikefarah : Great work mate! 👍

mikefarah · 2020-06-17T08:57:26Z

Fantastic, glad to hear that :)

Thanks for the feedback

JoakimSoderberg added the bug label Apr 22, 2020

mikefarah added this to the 3.3.1 milestone Jun 10, 2020

mikefarah closed this as completed Jun 11, 2020

0-sv mentioned this issue Jan 18, 2021

performance issues yq linode/apl-core#261

Closed

renovate bot mentioned this issue Apr 10, 2024

feat(deps)!: Update mikefarah/yq ( 3.3.0 → v4.43.1 ) billimek/k8s-gitops#3754

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Converting YAML to JSON is super slow #422

Converting YAML to JSON is super slow #422

JoakimSoderberg commented Apr 22, 2020

PaulCharlton commented Apr 22, 2020

PaulCharlton commented Apr 22, 2020 •

edited

Loading

PaulCharlton commented Apr 22, 2020

PaulCharlton commented Apr 22, 2020

mikefarah commented Apr 22, 2020

mikefarah commented Apr 23, 2020

PaulCharlton commented Apr 23, 2020

PaulCharlton commented Apr 23, 2020 •

edited

Loading

mikefarah commented Apr 23, 2020

PaulCharlton commented Apr 23, 2020

mikefarah commented Apr 23, 2020

mikefarah commented Apr 23, 2020

dd-ssc commented Jun 1, 2020 •

edited

Loading

dd-ssc commented Jun 2, 2020

mikefarah commented Jun 10, 2020

dd-ssc commented Jun 17, 2020

mikefarah commented Jun 17, 2020

Converting YAML to JSON is super slow #422

Converting YAML to JSON is super slow #422

Comments

JoakimSoderberg commented Apr 22, 2020

PaulCharlton commented Apr 22, 2020

PaulCharlton commented Apr 22, 2020 • edited Loading

PaulCharlton commented Apr 22, 2020

PaulCharlton commented Apr 22, 2020

mikefarah commented Apr 22, 2020

mikefarah commented Apr 23, 2020

PaulCharlton commented Apr 23, 2020

PaulCharlton commented Apr 23, 2020 • edited Loading

mikefarah commented Apr 23, 2020

PaulCharlton commented Apr 23, 2020

mikefarah commented Apr 23, 2020

mikefarah commented Apr 23, 2020

dd-ssc commented Jun 1, 2020 • edited Loading

dd-ssc commented Jun 2, 2020

mikefarah commented Jun 10, 2020

dd-ssc commented Jun 17, 2020

mikefarah commented Jun 17, 2020

PaulCharlton commented Apr 22, 2020 •

edited

Loading

PaulCharlton commented Apr 23, 2020 •

edited

Loading

dd-ssc commented Jun 1, 2020 •

edited

Loading