Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make bloom gateway tasks cancelable (and other fixes) #11717

Closed
wants to merge 12 commits into from

Conversation

chaudum
Copy link
Contributor

@chaudum chaudum commented Jan 19, 2024

Why we need it:

A bloom gateway request calling FilterChunkRefs that timed out caused a deadlock of the workers, because when the request context is cancelled, the request returned, but the task continued to process until the block querier tried to write to the results channel. That writing was blocked, because the request that was supposed to read from the results channel to gather the result was gone.

How is is solved:

The PR introduces two changes:

  1. The request context is also passed when creating a new Task, which allows to check the cancellation of a request also in the task.
  2. The channels of the Task are replaced by a "pipe channel", which consists of a sender channel and a receiver channel. The pipe reads from the sender channel until it is closed, and forwards the received item to the receiver channel, but only if the internal context of the pipe isn't closed.
  3. In the worker, when dequeuing and processing the tests, the context of tasks is checked for error, to shortcut operations and return errors early.

Further smaller changes/fixes:

  1. Fix a potential index out of bounds panic in the code that removes filtered out chunk refs from th original request.
  2. Make sure that the DequeueMany call on the task queue returns successive tasks based on their enqueue order.
  3. Do not remove cancelled task from task list once they have been partitioned by fingerprint ranged.

@chaudum chaudum marked this pull request as ready for review January 19, 2024 13:57
@chaudum chaudum requested a review from a team as a code owner January 19, 2024 13:57
Copy link
Contributor

@vlad-diachenko vlad-diachenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks brilliant 💎


// The sender needs to close the channel once it's done sending to it.
// Otherwise the forwarder in the gorouting never exits.
func (ch ChanPipe[T]) forward(ctx context.Context) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

}

// Snd returns the channel for sending to the pipe.
func (ch ChanPipe[T]) Snd() chan<- T {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd rather use Send and Receive

@chaudum
Copy link
Contributor Author

chaudum commented Jan 20, 2024

During testing I found another issue with this PR:

2024-01-19 15:08:03.363		/src/enterprise-logs/vendor/github.com/grafana/dskit/services/basic_service.go:119 +0x105
2024-01-19 15:08:03.363	created by github.com/grafana/dskit/services.(*BasicService).StartAsync.func1 in goroutine 772
2024-01-19 15:08:03.363		/src/enterprise-logs/vendor/github.com/grafana/dskit/services/basic_service.go:190 +0x1b4
2024-01-19 15:08:03.363	github.com/grafana/dskit/services.(*BasicService).main(0xc00214cc80)
2024-01-19 15:08:03.363		/src/enterprise-logs/vendor/github.com/grafana/loki/pkg/bloomgateway/worker.go:238 +0x1666
2024-01-19 15:08:03.363	github.com/grafana/loki/pkg/bloomgateway.(*worker).running(0xc000a42700, {0x37eadf8, 0xc0001437c0})
2024-01-19 15:08:03.363		/src/enterprise-logs/vendor/github.com/grafana/loki/pkg/bloomgateway/multiplexing.go:158
2024-01-19 15:08:03.363	github.com/grafana/loki/pkg/bloomgateway.Task.Close(...)
2024-01-19 15:08:03.363		/src/enterprise-logs/vendor/go.uber.org/atomic/bool.go:66
2024-01-19 15:08:03.363	go.uber.org/atomic.(*Bool).CompareAndSwap(...)
2024-01-19 15:08:03.363	goroutine 780 [running]:
2024-01-19 15:08:03.363	
2024-01-19 15:08:03.363	[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x22bc6c6]
2024-01-19 15:08:03.363	panic: runtime error: invalid memory address or nil pointer dereference

Not quite sure yet what the nil pointer dereference is caused by.

@chaudum chaudum marked this pull request as draft January 22, 2024 07:29
@chaudum chaudum force-pushed the chaudum/make-bloomfilter-task-cancelable branch 3 times, most recently from c8e4275 to cc92906 Compare January 22, 2024 15:10
@chaudum chaudum changed the title Make bloom gateway tasks cancelable Make bloom gateway tasks cancelable (and other fixes) Jan 22, 2024
@chaudum chaudum marked this pull request as ready for review January 22, 2024 15:19
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
@chaudum chaudum force-pushed the chaudum/make-bloomfilter-task-cancelable branch from ea7295b to 5ec5fcc Compare January 23, 2024 12:20
Copy link
Member

@owen-d owen-d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skimmed, lgtm

@chaudum
Copy link
Contributor Author

chaudum commented Jan 24, 2024

Going to open a few smaller PRs with individual fixes.

@chaudum
Copy link
Contributor Author

chaudum commented Jan 26, 2024

Superseded by #11792

@chaudum chaudum closed this Jan 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants