use a worker pool to ingest entries faster #1964
Neat!
I'm not sure how much speedup you can get from something like this -- each client goroutine appears to do less work than the server, and the server can load multiple DB threads but they all contend for the same resource. Still, at least now the client does the most it can do.
cmd/lakectl/cmd/ingest.go
Outdated
func stageWorker(wg *sync.WaitGroup, requests <-chan *stageRequest, responses chan<- *stageResponse) {
	defer wg.Done()
	client := getClient()
Can we share clients? I think HTTP connection pooling/reuse happens at a lower level anyway, so sharing one client might change how connections get reused, which could increase performance (or decrease it...).
From a different angle: getClient should be called once, since it can fail and we do not want to dump multiple errors.
The workers all share the same transport anyway, so there is no need to repeat the same operation to construct it.
Done
cmd/lakectl/cmd/ingest.go
Outdated
// update every 10k records
if summary.Objects%10000 == 0 {
	Write("Staged {{ .Objects | green }} objects so far...\n", summary)
- Write("Staged {{ .Objects | green }} objects so far...\n", summary)
+ Write("Staged {{ .Objects | green }} objects so far...\r", summary)
might give a cute single-line "animation". (But you'd have to write that \n later, when done.)
Nice idea. Done.
Tested it, works like a charm!
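For reference, a small sketch of the `\r` trick discussed above, using plain `fmt` instead of lakectl's templated `Write` helper. `progressLine` is a hypothetical helper name.

```go
package main

import "fmt"

// progressLine ends in "\r", which returns the cursor to the start of the
// line without advancing, so the next update overwrites this one.
func progressLine(objects int) string {
	return fmt.Sprintf("Staged %d objects so far...\r", objects)
}

func main() {
	for i := 1; i <= 3; i++ {
		fmt.Print(progressLine(i * 10000))
	}
	// The updates never wrote a newline, so emit one when done to keep
	// later output off the progress line.
	fmt.Println()
}
```

One caveat: if a later update is shorter than an earlier one, leftover characters from the longer line stay visible. A monotonically growing counter avoids that.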
@@ -81,5 +135,6 @@ func init() {
	_ = ingestCmd.MarkFlagRequired("to")
	ingestCmd.Flags().Bool("dry-run", false, "only print the paths to be ingested")
	ingestCmd.Flags().BoolP("verbose", "v", false, "print stats for each individual object staged")
	ingestCmd.Flags().IntP("concurrency", "C", 64, "max concurrent API calls to make to the lakeFS server")
Why capital 'C'?
- ingestCmd.Flags().IntP("concurrency", "C", 64, "max concurrent API calls to make to the lakeFS server")
+ ingestCmd.Flags().IntP("concurrency", "c", 64, "max concurrent API calls to make to the lakeFS server")
Because we use -c on the global lakectl command as short for --config.
cmd/lakectl/cmd/ingest.go
Outdated
@@ -12,6 +14,32 @@ const ingestSummaryTemplate = `
Staged {{ .Objects | yellow }} external objects (total of {{ .Bytes | human_bytes | yellow }})
`

type stageRequest struct {
	ctx context.Context
ctx can be passed to the workers once on creation.
Done
cmd/lakectl/cmd/ingest.go
Outdated
if lakefsURI.Path != nil && *lakefsURI.Path != "" {
	path := *lakefsURI.Path
	if strings.HasSuffix(*lakefsURI.Path, "/") {
This can be done once outside the loop: just set prefix to the right value.
Done
cmd/lakectl/cmd/ingest.go
Outdated
}()

for response := range responses {
	DieOnResponseError(response.resp, response.error)
We just die on an error, so there is no point in passing errors back from the worker. We can drop the error-carrying response plumbing and just process the responses.
Done
cmd/lakectl/cmd/ingest.go
Outdated
summary.Bytes += api.Int64Value(response.resp.JSON201.SizeBytes)

// update every 10k records
if summary.Objects%10000 == 0 {
Consider reporting based on elapsed time (time since the last update) instead of round object counts, or offering an optional progress bar.
Done
@arielshaqed the main speedup comes from the fact that this is a |
Neat. Love the progress-by-time idea & implementation.
👍