refactor(page_service): Timeline gate guard holding + cancellation + shutdown #8339

Merged: 51 commits, Jul 31, 2024
Commits
2e3e4c1
remove sensitivity to Tenant::cancel in handle_pagerequests
problame Jul 5, 2024
c1da923
remove sensitivity to handler_timeline while in read-request/write-re…
problame Jul 5, 2024
a45d714
`PageStreamError` no longer has an `Other` variant
problame Jul 5, 2024
5f78c8f
comments and better structure
problame Jul 5, 2024
8efdfce
remove page_service `show <tenant_id>`
problame Jul 11, 2024
b227697
postgres_backend: pass the `.run()` CancellationToken to Handler::pro…
problame Jul 11, 2024
3406f0e
Revert "postgres_backend: pass the `.run()` CancellationToken to Hand…
problame Jul 11, 2024
f3172f0
refactor: don't use task_mgr for libpq, mgmt API, consumption metrics…
problame Jul 12, 2024
c40cd3e
clean up consumption metrics launch & shutdown
problame Jul 12, 2024
ba552e4
no more task_mgr for disk-usage-based eviction
problame Jul 12, 2024
1cfe9d8
no more task_mgr for page_service
problame Jul 12, 2024
77ea76a
track background purges without task_mgr
problame Jul 16, 2024
4f25f0e
WIP: implement cache
problame Jul 16, 2024
ec68633
polish names & add docstrings
problame Jul 17, 2024
f46903b
address the TODO
problame Jul 17, 2024
95c5094
WIP: shard routing in cache
problame Jul 17, 2024
94726f9
compile fix
problame Jul 19, 2024
c62ec52
WIP: integrate ("wait for active") is missing
problame Jul 19, 2024
3469436
inline remaining get_active_... methods
problame Jul 19, 2024
6388f15
ShardSelector == GetArg
problame Jul 19, 2024
8aafc0d
WIP impl the tennat manager trait & bring back wait-for-active
problame Jul 19, 2024
76f1b58
WIP
problame Jul 19, 2024
f523457
finish (no more smart drops, shut_down flag instead)
problame Jul 19, 2024
2a90034
get rid of the Arc Mutex inside Cache
problame Jul 19, 2024
b6e858a
doc fix
problame Jul 19, 2024
b2a8085
Merge remote-tracking branch 'origin/main' into problame/slow-detach-fix
problame Jul 19, 2024
2c404ca
Merge remote-tracking branch 'origin/main' into problame/slow-detach-fix
problame Jul 19, 2024
1cf1ff9
fixup(no more task_mgr for page_service)
problame Jul 20, 2024
97acc0c
fix: PerTimelineState is initialized to shut down state
problame Jul 20, 2024
bc5d3a5
fix: WillNotBecomeActive not resulting in connection shutdown
problame Jul 21, 2024
7d9b535
fix: TimelineHandles resolution failure not resulting in connection s…
problame Jul 21, 2024
ffbfd4f
no more task_mgr for secondary controller
problame Jul 21, 2024
cb7147a
task_mgr::spawn: require a TenantId to dis-incentivize global tasks v…
problame Jul 21, 2024
0a5fe9d
Merge remote-tracking branch 'origin/main' into problame/slow-detach-fix
problame Jul 22, 2024
79d2530
Revert "task_mgr::spawn: require a TenantId to dis-incentivize global…
problame Jul 22, 2024
59b1d76
Timeline::handlers => Timline::handles
problame Jul 22, 2024
750e4f3
be generic over timeline type so we can test the cache in isolation
problame Jul 22, 2024
6938e10
implement test that validates behavior for timeline shutdown
problame Jul 22, 2024
1e84913
test multiple timelines / timeline deletion case and fix bug uncovere…
problame Jul 22, 2024
f06570e
add test for what happens during shard split (and fix a bug exposed b…
problame Jul 22, 2024
46295e8
add test for behavior on connection drop (and fix a bug uncovered by it)
problame Jul 22, 2024
5a0b941
improve efficiency of the fix from previous commit
problame Jul 22, 2024
dfe1120
fix docstrings
problame Jul 25, 2024
4fcbb39
Revert "fix docstrings"
problame Jul 25, 2024
17a0f64
replace trait Types with individual generic params
problame Jul 25, 2024
0506b83
Revert "replace trait Types with individual generic params"
problame Jul 25, 2024
544cc48
fix docstrings
problame Jul 25, 2024
2df61e4
centralize docs & document design in module-level comment
problame Jul 26, 2024
46744ea
Merge remote-tracking branch 'origin/main' into problame/slow-detach-fix
problame Jul 28, 2024
f9961a0
clippy
problame Jul 29, 2024
38b0f3c
ACTIVE_TENANT_TIMEOUT: fix accidental use of http timeout; https://gi…
problame Jul 31, 2024
43 changes: 11 additions & 32 deletions pageserver/src/bin/pageserver.rs
@@ -17,11 +17,9 @@ use pageserver::config::PageserverIdentity;
 use pageserver::control_plane_client::ControlPlaneClient;
 use pageserver::disk_usage_eviction_task::{self, launch_disk_usage_global_eviction_task};
 use pageserver::metrics::{STARTUP_DURATION, STARTUP_IS_LOADING};
-use pageserver::task_mgr::WALRECEIVER_RUNTIME;
+use pageserver::task_mgr::{COMPUTE_REQUEST_RUNTIME, WALRECEIVER_RUNTIME};
 use pageserver::tenant::{secondary, TenantSharedResources};
-use pageserver::{
-    CancellableTask, ConsumptionMetricsTasks, HttpEndpointListener, LibpqEndpointListener,
-};
+use pageserver::{CancellableTask, ConsumptionMetricsTasks, HttpEndpointListener};
 use remote_storage::GenericRemoteStorage;
 use tokio::signal::unix::SignalKind;
 use tokio::time::Instant;
@@ -31,11 +29,9 @@ use tracing::*;
 use metrics::set_build_info_metric;
 use pageserver::{
     config::PageServerConf,
-    context::{DownloadBehavior, RequestContext},
     deletion_queue::DeletionQueue,
     http, page_cache, page_service, task_mgr,
-    task_mgr::TaskKind,
-    task_mgr::{BACKGROUND_RUNTIME, COMPUTE_REQUEST_RUNTIME, MGMT_REQUEST_RUNTIME},
+    task_mgr::{BACKGROUND_RUNTIME, MGMT_REQUEST_RUNTIME},
     tenant::mgr,
     virtual_file,
 };
@@ -593,30 +589,13 @@ fn start_pageserver(

     // Spawn a task to listen for libpq connections. It will spawn further tasks
     // for each connection. We created the listener earlier already.
-    let libpq_listener = {
-        let cancel = CancellationToken::new();
-        let libpq_ctx = RequestContext::todo_child(
-            TaskKind::LibpqEndpointListener,
-            // listener task shouldn't need to download anything. (We will
-            // create a separate sub-contexts for each connection, with their
-            // own download behavior. This context is used only to listen and
-            // accept connections.)
-            DownloadBehavior::Error,
-        );
-
-        let task = COMPUTE_REQUEST_RUNTIME.spawn(task_mgr::exit_on_panic_or_error(
-            "libpq listener",
-            page_service::libpq_listener_main(
-                tenant_manager.clone(),
-                pg_auth,
-                pageserver_listener,
-                conf.pg_auth_type,
-                libpq_ctx,
-                cancel.clone(),
-            ),
-        ));
-        LibpqEndpointListener(CancellableTask { task, cancel })
-    };
+    let page_service = page_service::spawn(conf, tenant_manager.clone(), pg_auth, {
+        let _entered = COMPUTE_REQUEST_RUNTIME.enter(); // TcpListener::from_std requires it
+        pageserver_listener
+            .set_nonblocking(true)
+            .context("set listener to nonblocking")?;
+        tokio::net::TcpListener::from_std(pageserver_listener).context("create tokio listener")?
+    });

     let mut shutdown_pageserver = Some(shutdown_pageserver.drop_guard());

@@ -644,7 +623,7 @@ fn start_pageserver(
         shutdown_pageserver.take();
         pageserver::shutdown_pageserver(
             http_endpoint_listener,
-            libpq_listener,
+            page_service,
             consumption_metrics_tasks,
             disk_usage_eviction_task,
             &tenant_manager,
5 changes: 5 additions & 0 deletions pageserver/src/http/routes.rs
@@ -296,6 +296,11 @@ impl From<GetActiveTenantError> for ApiError {
             GetActiveTenantError::WaitForActiveTimeout { .. } => {
                 ApiError::ResourceUnavailable(format!("{}", e).into())
             }
+            GetActiveTenantError::SwitchedTenant => {
+                // in our HTTP handlers, this error doesn't happen
+                // TODO: separate error types
+                ApiError::ResourceUnavailable("switched tenant".into())
+            }
         }
     }
 }
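
The hunk above adds a match arm because a `From` conversion written as an exhaustive `match` stops compiling as soon as the source error enum gains a variant. The following is a minimal, standalone sketch of that pattern, not the pageserver's actual types: both enums are simplified stand-ins, the `waited_ms` field and the timeout message text are hypothetical, and `ResourceUnavailable` holds a plain `String` here for brevity.

```rust
use std::fmt;

// Simplified stand-in for the error enum in the hunk; the real type has more
// variants. The `waited_ms` field is hypothetical.
#[derive(Debug)]
enum GetActiveTenantError {
    WaitForActiveTimeout { waited_ms: u64 },
    SwitchedTenant,
}

impl fmt::Display for GetActiveTenantError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::WaitForActiveTimeout { waited_ms } => write!(
                f,
                "timed out after {} ms waiting for tenant to become active",
                waited_ms
            ),
            Self::SwitchedTenant => write!(f, "switched tenant"),
        }
    }
}

// Simplified stand-in for the HTTP error type (assumption: String payload).
#[derive(Debug, PartialEq)]
enum ApiError {
    ResourceUnavailable(String),
}

impl From<GetActiveTenantError> for ApiError {
    fn from(e: GetActiveTenantError) -> Self {
        // No wildcard arm: adding a variant to GetActiveTenantError makes this
        // match non-exhaustive, so the compiler forces a decision here.
        match e {
            GetActiveTenantError::WaitForActiveTimeout { .. } => {
                ApiError::ResourceUnavailable(e.to_string())
            }
            GetActiveTenantError::SwitchedTenant => {
                ApiError::ResourceUnavailable("switched tenant".to_string())
            }
        }
    }
}
```

With this shape, `ApiError::from(GetActiveTenantError::SwitchedTenant)` yields `ResourceUnavailable("switched tenant")`, mirroring the arm added in the diff.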
10 changes: 4 additions & 6 deletions pageserver/src/lib.rs
@@ -30,7 +30,6 @@ pub mod walingest;
 pub mod walrecord;
 pub mod walredo;

-use crate::task_mgr::TaskKind;
 use camino::Utf8Path;
 use deletion_queue::DeletionQueue;
 use tenant::{
@@ -63,7 +62,6 @@ pub struct CancellableTask {
     pub cancel: CancellationToken,
 }
 pub struct HttpEndpointListener(pub CancellableTask);
-pub struct LibpqEndpointListener(pub CancellableTask);
 pub struct ConsumptionMetricsTasks(pub CancellableTask);
 pub struct DiskUsageEvictionTask(pub CancellableTask);
 impl CancellableTask {
@@ -77,7 +75,7 @@ impl CancellableTask {
 #[allow(clippy::too_many_arguments)]
 pub async fn shutdown_pageserver(
     http_listener: HttpEndpointListener,
-    libpq_listener: LibpqEndpointListener,
+    page_service: page_service::Listener,
     consumption_metrics_worker: ConsumptionMetricsTasks,
     disk_usage_eviction_task: Option<DiskUsageEvictionTask>,
     tenant_manager: &TenantManager,
@@ -89,8 +87,8 @@ pub async fn shutdown_pageserver(
     use std::time::Duration;
     // Shut down the libpq endpoint task. This prevents new connections from
     // being accepted.
-    timed(
-        libpq_listener.0.shutdown(),
+    let remaining_connections = timed(
+        page_service.stop_accepting(),
         "shutdown LibpqEndpointListener",
         Duration::from_secs(1),
     )
@@ -108,7 +106,7 @@ pub async fn shutdown_pageserver(
     // Shut down any page service tasks: any in-progress work for particular timelines or tenants
     // should already have been canclled via mgr::shutdown_all_tenants
     timed(
-        task_mgr::shutdown_tasks(Some(TaskKind::PageRequestHandler), None, None),
+        remaining_connections.shutdown(),
         "shutdown PageRequestHandlers",
         Duration::from_secs(1),
     )
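
The two `lib.rs` hunks above split page service shutdown into two phases: `stop_accepting()` consumes the listener and returns the set of remaining connections, and `shutdown()` on that set then cancels and waits for the handlers. The sketch below illustrates that ownership-passing shape using only std threads. It is an assumption-laden illustration, not the `page_service` implementation: the `Event` channel, `connect`, the `AtomicBool` flag standing in for a `CancellationToken`, and the returned count are all hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

// Hypothetical events driving the simulated accept loop.
enum Event {
    Incoming,
    Stop,
}

// Phase-1 handle: owns the accept loop.
struct Listener {
    events: Sender<Event>,
    accept_loop: JoinHandle<Connections>,
}

// Phase-2 handle: owns the connection handlers that were already accepted.
struct Connections {
    cancel: Arc<AtomicBool>, // stand-in for a cancellation token
    handlers: Vec<JoinHandle<()>>,
}

impl Listener {
    fn spawn() -> Listener {
        let (events, rx) = mpsc::channel();
        let accept_loop = thread::spawn(move || {
            let cancel = Arc::new(AtomicBool::new(false));
            let mut handlers = Vec::new();
            // Accept until a Stop event arrives or every sender is dropped.
            while let Ok(Event::Incoming) = rx.recv() {
                let cancel = Arc::clone(&cancel);
                handlers.push(thread::spawn(move || {
                    // Simulated connection handler: runs until cancelled.
                    while !cancel.load(Ordering::SeqCst) {
                        thread::sleep(Duration::from_millis(1));
                    }
                }));
            }
            Connections { cancel, handlers }
        });
        Listener { events, accept_loop }
    }

    // Simulates a client connecting.
    fn connect(&self) {
        self.events.send(Event::Incoming).expect("accept loop is running");
    }

    // Phase 1: no new connections after this returns. Consuming `self` means
    // the caller cannot keep using the listener by mistake.
    fn stop_accepting(self) -> Connections {
        let _ = self.events.send(Event::Stop);
        self.accept_loop.join().expect("accept loop panicked")
    }
}

impl Connections {
    // Phase 2: cancel the handlers and wait for each of them to exit.
    fn shutdown(self) -> usize {
        self.cancel.store(true, Ordering::SeqCst);
        let count = self.handlers.len();
        for handler in self.handlers {
            handler.join().expect("connection handler panicked");
        }
        count
    }
}

// Usage: accept two connections, then run both shutdown phases in order.
fn demo() -> usize {
    let listener = Listener::spawn();
    listener.connect();
    listener.connect();
    let remaining = listener.stop_accepting();
    remaining.shutdown()
}
```

Because phase 2 is only reachable through the value returned by phase 1, the type system enforces the ordering that `shutdown_pageserver` relies on: other work (such as tenant shutdown) can run between the two calls, but connection handlers cannot be joined before the listener has stopped accepting.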