Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support large files #57

Closed
maxbrunet opened this issue Jul 6, 2021 · 8 comments · Fixed by #76
Closed

Support large files #57

maxbrunet opened this issue Jul 6, 2021 · 8 comments · Fixed by #76

Comments

@maxbrunet
Copy link
Contributor

Righ now files are limited to 4MB

const maxResourceSize = 4 * 1024 * 1024 // 4MB ought to be enough for everybody

maxResourceSize := 4 * 1024 * 1024 // 4MB ought to be enough for everybody

For example, I have a 17MB output for our Grafana deployment, mainly dozens of dashboards in ConfigMaps: 😛

Summary: 0 resource found parsing stdin - Valid: 0, Invalid: 0, Errors: 0, Skipped: 0

kubeval had the same issue and solved it with a default buffer (expend itself dynamically) and io.Copy() like in instrumenta/kubeval#220. Not sure how it affects the overall performance.

If you want to reproduce, here's a couple quick and dirty scripts to generate large files:

json.go
package main

import "fmt"

const (
	N = 20000
)

func main() {
	fmt.Println(
		`{
    "apiVersion": "v1",
    "kind": "List",
    "items": [`,
	)

	for i := 0; i < N; i++ {
		fmt.Printf(
			`        {
            "apiVersion": "v1",
            "kind": "ConfigMap",
            "metadata": {
                "name": "yet-another-confimap-%d"
            },
            "data": {
                "key": "value"
            }
        }`,
			i,
		)
		if i == N - 1 {
			fmt.Println()
		} else {
			fmt.Println(",")
		}
	}

	fmt.Println(`    ]
}
`)
}
yaml.go
package main

import "fmt"

func main() {
	fmt.Println(
		`apiVersion: v1
kind: List
items:`,
	)

	for i := 0; i < 50000; i++ {
		fmt.Printf(
			`- apiVersion: v1
  kind: ConfigMap
  metadata:
    name: yet-another-confimap-%d
  data:
    key: value
`,
			i,
		)
	}
}
@yannh
Copy link
Owner

yannh commented Jul 11, 2021

I have a patch when reading from files using the file size as an indicator to grow the buffer, I am still wondering what might be the best approach when reading from a stream like stdin 🤔

Note: I believe the io.Copy approach will read all of stdin in memory before starting to process anything. Kubeconform takes the approach of reading and processing immediately as data comes in. This is a bit faster and uses a lot less memory when reading a lot of data from stdin.

@yannh
Copy link
Owner

yannh commented Jul 11, 2021

@maxbrunet let me know what you think of my branch :) for stdin however... apart from trying to load all of stdin in memory before doing anything, which I'm not so keen on, I'm not sure what the best solution might be..

@maxbrunet
Copy link
Contributor Author

Kubeconform takes the approach of reading and processing immediately as data comes in.

I do not see how you would start unmarshalling a partial JSON, or even the value of unmarshalling a partial YAML (nothing tells if it'll be valid in the end). IMHO it has to be loaded all in memory, and if the file or stream is large, it is the responsibility of the user to have enough memory available. What is the approach of other tools like kubectl?

@yannh
Copy link
Owner

yannh commented Aug 29, 2021

@maxbrunet do you want an (undocumented?) environment variable to override the default value while we figure this out, if this is blocking you? :)

@maxbrunet
Copy link
Contributor Author

No hurries, I'm doing fine for now with kubeval, your schema repository, and the openapi2jsonschema.py script. Thank you @yannh

@yannh
Copy link
Owner

yannh commented Sep 26, 2021

Done, I guess I just needed to read the doc more carefully. Bufio.Scanner will resize the buffer on its own when needed. I also added a test for it.

@yannh yannh closed this as completed Sep 26, 2021
@maxbrunet
Copy link
Contributor Author

I have finally taken the time to test this, and I have been able to validate all my manifests, thank you! Unfortunately, I do not observe any significant performance difference with kubeval, probably because my CI pipeline runs on a single core.

@yannh
Copy link
Owner

yannh commented Dec 18, 2021

Glad this works, thanks for posting an update! Indeed performance usually improves with number of cores, even though there could be a tiny startup win since we also download the schemas in parallel. That might not make such a massive difference in most cases though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants