
use a worker pool to ingest entries faster #1964

Merged - 2 commits merged into master on May 19, 2021
Conversation

ozkatz (Collaborator) commented on May 18, 2021:

No description provided.

@ozkatz ozkatz requested review from nopcoder and itaiad200 May 18, 2021 16:10
arielshaqed (Contributor) left a comment:
Neat!

I'm not sure how much speedup you can get from something like this -- each client goroutine appears to do less work than the server, and the server can load multiple DB threads but they all contend for the same resource. Still, at least now the client does the most it can do.


func stageWorker(wg *sync.WaitGroup, requests <-chan *stageRequest, responses chan<- *stageResponse) {
defer wg.Done()
client := getClient()
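For context, the full worker-pool shape under discussion looks roughly like this (a sketch only: the request/response types and the staging step are placeholders, not the PR's actual code):

```go
package main

import "sync"

// Hypothetical request/response types standing in for the ones in the diff.
type stageRequest struct{ path string }
type stageResponse struct{ path string }

// stageWorker drains requests and reports results, mirroring the shape of
// the function in the diff (minus the actual API call).
func stageWorker(wg *sync.WaitGroup, requests <-chan stageRequest, responses chan<- stageResponse) {
	defer wg.Done()
	for req := range requests {
		responses <- stageResponse{path: req.path}
	}
}

// runPool fans paths out to `concurrency` workers and collects responses.
func runPool(paths []string, concurrency int) int {
	requests := make(chan stageRequest)
	responses := make(chan stageResponse)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go stageWorker(&wg, requests, responses)
	}
	go func() {
		for _, p := range paths {
			requests <- stageRequest{path: p}
		}
		close(requests)
		wg.Wait()        // all workers have finished sending
		close(responses) // so the collector loop below can terminate
	}()
	staged := 0
	for range responses {
		staged++
	}
	return staged
}
```

Closing the request channel is what lets each worker's `range` loop end; the feeder goroutine then waits for all workers before closing the response channel.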
Contributor commented:

Can we share clients? I think HTTP connection pooling/reuse happens at a level below anyway, but a shared client might improve connection re-use, which could increase performance (or decrease it...).

Contributor commented:

From a different angle: getClient should be called once. It can fail, and we don't want to dump multiple errors. The workers all share the same transport anyway, so there's no need to perform the same construction multiple times.
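In other words, the construction moves out of the worker. A sketch (here `getClient` and the `client` type are stand-ins for the real helpers, which can also fail):

```go
package main

import (
	"fmt"
	"sync"
)

// client is a placeholder for the real API client type.
type client struct{ endpoint string }

// getClient stands in for the real helper. In the real code it can fail,
// which is exactly why it should run once rather than once per worker.
func getClient() (*client, error) {
	return &client{endpoint: "http://localhost:8000"}, nil
}

func run(workers int) error {
	c, err := getClient() // construct once; a failure yields a single error
	if err != nil {
		return fmt.Errorf("create client: %w", err)
	}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = c // every worker shares the one client (and its transport)
		}()
	}
	wg.Wait()
	return nil
}
```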

ozkatz (Collaborator, Author) replied:

Done


// update every 10k records
if summary.Objects%10000 == 0 {
Write("Staged {{ .Objects | green }} objects so far...\n", summary)
Contributor commented:

Suggested change:
-    Write("Staged {{ .Objects | green }} objects so far...\n", summary)
+    Write("Staged {{ .Objects | green }} objects so far...\r", summary)

might give a cute single-line "animation". (But you'd have to write that \n later, when done.)
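The trick being suggested: a carriage return moves the cursor back to column 0 without advancing a line, so successive writes overwrite each other. A minimal illustration (not the PR's template code):

```go
package main

import (
	"fmt"
	"os"
)

// progressLine formats a status line that overwrites the previous one:
// the leading \r returns the cursor to column 0 instead of starting a
// new line.
func progressLine(staged int) string {
	return fmt.Sprintf("\rStaged %d objects so far...", staged)
}

func demo() {
	for staged := 1; staged <= 3; staged++ {
		fmt.Fprint(os.Stdout, progressLine(staged))
	}
	fmt.Fprintln(os.Stdout) // the final \n, written once at the end
}
```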

ozkatz (Collaborator, Author) replied:

Nice idea. Done.

itaiad200 (Contributor) left a comment:
Tested it, works like a charm!

@@ -81,5 +135,6 @@ func init() {
_ = ingestCmd.MarkFlagRequired("to")
ingestCmd.Flags().Bool("dry-run", false, "only print the paths to be ingested")
ingestCmd.Flags().BoolP("verbose", "v", false, "print stats for each individual object staged")
ingestCmd.Flags().IntP("concurrency", "C", 64, "max concurrent API calls to make to the lakeFS server")
Contributor commented:

Why capital 'C'?

Suggested change:
-    ingestCmd.Flags().IntP("concurrency", "C", 64, "max concurrent API calls to make to the lakeFS server")
+    ingestCmd.Flags().IntP("concurrency", "c", 64, "max concurrent API calls to make to the lakeFS server")

ozkatz (Collaborator, Author) replied:

Because we already use -c on the global lakectl command as short for --config.

@@ -12,6 +14,32 @@ const ingestSummaryTemplate = `
Staged {{ .Objects | yellow }} external objects (total of {{ .Bytes | human_bytes | yellow }})
`

type stageRequest struct {
ctx context.Context
Contributor commented:

ctx can be passed to the workers once, on creation.

ozkatz (Collaborator, Author) replied:

Done

Comment on lines 77 to 79
if lakefsURI.Path != nil && *lakefsURI.Path != "" {
path := *lakefsURI.Path
if strings.HasSuffix(*lakefsURI.Path, "/") {
Contributor commented:

This can be done once, outside the loop - just set prefix to the right value up front.
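Something along these lines - the trailing-slash normalization runs once before the ingest loop (illustrative; the actual logic in the PR may differ):

```go
package main

import "strings"

// ingestPrefix normalizes the destination path once, before the ingest
// loop, so the trailing-slash check doesn't run per entry.
func ingestPrefix(path string) string {
	if path == "" {
		return ""
	}
	return strings.TrimSuffix(path, "/") + "/"
}
```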

ozkatz (Collaborator, Author) replied:

Done

}()

for response := range responses {
DieOnResponseError(response.resp, response.error)
Contributor commented:

We just die on error - so there's no point in passing errors back from the workers. We can drop the error field from the response channel and just process responses.

ozkatz (Collaborator, Author) replied:

Done

summary.Bytes += api.Int64Value(response.resp.JSON201.SizeBytes)

// update every 10k records
if summary.Objects%10000 == 0 {
Contributor commented:

Consider reporting based on elapsed time instead of round record counts, or an optional progress bar.

ozkatz (Collaborator, Author) replied:

Done


ozkatz (Collaborator, Author) commented on May 19, 2021:

> Neat!
>
> I'm not sure how much speedup you can get from something like this -- each client goroutine appears to do less work than the server, and the server can load multiple DB threads but they all contend for the same resource. Still, at least now the client does the most it can do.

@arielshaqed the main speedup comes from the fact that this is a lakectl command - there's a good chance a user will run it locally from their machine. In that case, the latency between the user and the lakeFS server could be significant, so staging serially, one object at a time, really adds up.

arielshaqed (Contributor) left a comment:

Neat. Love the progress-by-time idea & implementation.

nopcoder (Contributor) left a comment:

👍

@ozkatz ozkatz merged commit 4bfead3 into master May 19, 2021
@ozkatz ozkatz deleted the feature/concurrent-ingest branch May 19, 2021 11:44