Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reoccurring errors with checkpointing #688

Closed
hazelnut-99 opened this issue Jul 16, 2024 · 7 comments · Fixed by #689
Closed

Reoccurring errors with checkpointing #688

hazelnut-99 opened this issue Jul 16, 2024 · 7 comments · Fixed by #689
Labels
bug Something isn't working

Comments

@hazelnut-99
Copy link

Hi!
We are currently running Arroyo version 0.11.0. After launching a pipeline and allowing it to run for an extended period, the checkpointing process appears to become stuck. The controller continuously logs the following error messages:

{"timestamp":"2024-07-16T15:09:48.179959Z","level":"ERROR","fields":{"message":"cleanup panicked","job_id":"job_4sCHsbvu3N","error":"JoinError::Panic(Id(1797740), ...)"},"target":"arroyo_controller::job_controller"}
{"timestamp":"2024-07-16T15:09:48.281679Z","level":"INFO","fields":{"message":"Starting cleaning","job_id":"job_4sCHsbvu3N","min_epoch":32,"new_min":34},"target":"arroyo_controller::job_controller"}
{"timestamp":"2024-07-16T15:09:48.283224Z","level":"INFO","fields":{"message":"Cleaning checkpoint","min_epoch":34,"job_id":"job_4sCHsbvu3N"},"target":"arroyo_state::parquet"}
{"timestamp":"2024-07-16T15:09:48.286151Z","level":"ERROR","fields":{"message":"panicked at crates/arroyo-state/src/parquet.rs:276:81:\ncalled `Result::unwrap()` on an `Err` value: missing table config for operator join_12, table right, metadata is TableCheckpointMetadata { table_type: ExpiringKeyedTimeTable, data: [10, 113, 8, 33, 18, 78, 106, 111, 98, 95, 52, 115, 67, 72, 115, 98, 118, 117, 51, 78, 47, 99, 104, 101, 99, 107, 112, 111, 105, 110, 116, 115, 47, 99, 104, 101, 99, 107, 112, 111, 105, 110, 116, 45, 48, 48, 48, 48, 48, 51, 51, 47, 111, 112, 101, 114, 97, 116, 111, 114, 45, 106, 111, 105, 110, 95, 49, 50, 47, 116, 97, 98, 108, 101, 45, 114, 105, 103, 104, 116, 45, 48, 48, 49, 24, 203, 149, 220, 231, 152, 152, 186, 210, 128, 1, 32, 129, 202, 212, 174, 220, 142, 185, 146, 254, 1, 40, 191, 231, 132, 154, 224, 171, 135, 3, 10, 110, 8, 33, 18, 78, 106, 111, 98, 95, 52, 115, 67, 72, 115, 98, 118, 117, 51, 78, 47, 99, 104, 101, 99, 107, 112, 111, 105, 110, 116, 115, 47, 99, 104, 101, 99, 107, 112, 111, 105, 110, 116, 45, 48, 48, 48, 48, 48, 51, 51, 47, 111, 112, 101, 114, 97, 116, 111, 114, 45, 106, 111, 105, 110, 95, 49, 50, 47, 116, 97, 98, 108, 101, 45, 114, 105, 103, 104, 116, 45, 48, 48, 48, 24, 219, 167, 250, 128, 237, 226, 137, 14, 32, 175, 168, 210, 193, 177, 179, 246, 212, 127, 40, 191, 231, 132, 154, 224, 171, 135, 3] }, operator_metadata is OperatorCheckpointMetadata { operator_metadata: Some(OperatorMetadata { job_id: \"job_4sCHsbvu3N\", operator_id: \"join_12\", epoch: 34, min_watermark: Some(1721139479000000), max_watermark: Some(1721139479000000), parallelism: 2 }), start_time: 1721139495376226000, finish_time: 1721139495399593, table_checkpoint_metadata: {\"left\": TableCheckpointMetadata { table_type: ExpiringKeyedTimeTable, data: [10, 109, 8, 34, 18, 77, 106, 111, 98, 95, 52, 115, 67, 72, 115, 98, 118, 117, 51, 78, 47, 99, 104, 101, 99, 107, 112, 111, 105, 110, 116, 115, 47, 99, 104, 101, 99, 107, 112, 111, 105, 110, 116, 45, 48, 48, 48, 48, 48, 51, 52, 47, 111, 112, 101, 114, 97, 116, 111, 114, 45, 106, 111, 105, 110, 95, 49, 50, 47, 116, 97, 98, 108, 101, 45, 108, 101, 102, 116, 45, 48, 48, 48, 24, 228, 137, 139, 230, 156, 192, 183, 30, 32, 140, 192, 186, 202, 209, 191, 154, 239, 127, 40, 191, 157, 225, 159, 224, 171, 135, 3, 10, 112, 8, 34, 18, 77, 106, 111, 98, 95, 52, 115, 67, 72, 115, 98, 118, 117, 51, 78, 47, 99, 104, 101, 99, 107, 112, 111, 105, 110, 116, 115, 47, 99, 104, 101, 99, 107, 112, 111, 105, 110, 116, 45, 48, 48, 48, 48, 48, 51, 52, 47, 111, 112, 101, 114, 97, 116, 111, 114, 45, 106, 111, 105, 110, 95, 49, 50, 47, 116, 97, 98, 108, 101, 45, 108, 101, 102, 116, 45, 48, 48, 49, 24, 136, 147, 236, 197, 236, 211, 177, 164, 128, 1, 32, 207, 156, 225, 253, 144, 194, 239, 236, 255, 1, 40, 191, 157, 225, 159, 224, 171, 135, 3] }}, table_configs: {\"left\": TableConfig { table_type: ExpiringKeyedTimeTable, config: [10, 4, 108, 101, 102, 116, 18, 14, 108, 101, 102, 116, 32, 106, 111, 105, 110, 32, 100, 97, 116, 97, 42, 137, 9, 10, 254, 8, 123, 34, 102, 105, 101, 108, 100, 115, 34, 58, 91, 123, 34, 110, 97, 109, 101, 34, 58, 34, 95, 107, 101, 121, 95, 48, 34, 44, 34, 100, 97, 116, 97, 95, 116, 121, 112, 101, 34, 58, 34, 85, 116, 102, 56, 34, 44, 34, 110, 117, 108, 108, 97, 98, 108, 101, 34, 58, 116, 114, 117, 101, 44, 34, 100, 105, 99, 116, 95, 105, 100, 34, 58, 48, 44, 34, 100, 105, 99, 116, 95, 105, 115, 95, 111, 114, 100, 101, 114, 101, 100, 34, 58, 102, 97, 108, 115, 101, 44, 34, 109, 101, 116, 97, 100, 97, 116, 97, 34, 58, 123, 125, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 95, 107, 101, 121, 95, 49, 34, 44, 34, 100, 97, 116, 97, 95, 116, 121, 112, 101, 34, 58, 34, 85, 116, 102, 56, 34, 44, 34, 110, 117, 108, 108, 97, 98, 108, 101, 34, 58, 116, 114, 117, 101, 44, 34, 100, 105, 99, 116, 95, 105, 100, 34, 58, 48, 44, 34, 100, 105, 99, 116, 95, 105, 115, 95, 111, 114, 100, 101, 114, 101, 100, 34, 58, 102, 97, 108, 115, 101, 44, 34, 109, 101, 116, 97, 100, 97, 116, 97, 34, 58, 123, 125, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 115, 116, 114, 97, 116, 101, 103, 121, 34, 44, 34, 100, 97, 116, 97, 95, 116, 121, 112, 101, 34, 58, 34, 85, 116, 102, 56, 34, 44, 34, 110, 117, 108, 108, 97, 98, 108, 101, 34, 58, 116, 114, 117, 101, 44, 34, 100, 105, 99, 116, 95, 105, 100, 34, 58, 48, 44, 34, 100, 105, 99, 116, 95, 105, 115, 95, 111, 114, 100, 101, 114, 101, 100, 34, 58, 102, 97, 108, 115, 101, 44, 34, 109, 101, 116, 97, 100, 97, 116, 97, 34, 58, 123, 125, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 101, 120, 112, 105, 114, 121, 34, 44, 34, 100, 97, 116, 97, 95, 116, 121, 112, 101, 34, 58, 34, 85, 116, 102, 56, 34, 44, 34, 110, 117, 108, 108, 97, 98, 108, 101, 34, 58, 116, 114, 117, 101, 44, 34, 100, 105, 99, 116, 95, 105, 100, 34, 58, 48, 44, 34, 100, 105, 99, 116, 95, 105, 115, 95, 111, 114, 100, 101, 114, 101, 100, 34, 58, 102, 97, 108, 115, 101, 44, 34, 109, 101, 116, 97, 100, 97, 116, 97, 34, 58, 123, 125, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 119, 105, 110, 100, 111, 119, 34, 44, 34, 100, 97, 116, 97, 95, 116, 121, 112, 101, 34, 58, 123, 34, 83, 116, 114, 117, 99, 116, 34, 58, 91, 123, 34, 110, 97, 109, 101, 34, 58, 34, 115, 116, 97, 114, 116, 34, 44, 34, 100, 97, 116, 97, 95, 116, 121, 112, 101, 34, 58, 123, 34, 84, 105, 109, 101, 115, 116, 97, 109, 112, 34, 58, 91, 34, 78, 97, 110, 111, 115, 101, 99, 111, 110, 100, 34, 44, 110, 117, 108, 108, 93, 125, 44, 34, 110, 117, 108, 108, 97, 98, 108, 101, 34, 58, 102, 97, 108, 115, 101, 44, 34, 100, 105, 99, 116, 95, 105, 100, 34, 58, 48, 44, 34, 100, 105, 99, 116, 95, 105, 115, 95, 111, 114, 100, 101, 114, 101, 100, 34, 58, 102, 97, 108, 115, 101, 44, 34, 109, 101, 116, 97, 100, 97, 116, 97, 34, 58, 123, 125, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 101, 110, 100, 34, 44, 34, 100, 97, 116, 97, 95, 116, 121, 112, 101, 34, 58, 123, 34, 84, 105, 109, 101, 115, 116, 97, 109, 112, 34, 58, 91, 34, 78, 97, 110, 111, 115, 101, 99, 111, 110, 100, 34, 44, 110, 117, 108, 108, 93, 125, 44, 34, 110, 117, 108, 108, 97, 98, 108, 101, 34, 58, 102, 97, 108, 115, 101, 44, 34, 100, 105, 99, 116, 95, 105, 100, 34, 58, 48, 44, 34, 100, 105, 99, 116, 95, 105, 115, 95, 111, 114, 100, 101, 114, 101, 100, 34, 58, 102, 97, 108, 115, 101, 44, 34, 109, 101, 116, 97, 100, 97, 116, 97, 34, 58, 123, 125, 125, 93, 125, 44, 34, 110, 117, 108, 108, 97, 98, 108, 101, 34, 58, 116, 114, 117, 101, 44, 34, 100, 105, 99, 116, 95, 105, 100, 34, 58, 48, 44, 34, 100, 105, 99, 116, 95, 105, 115, 95, 111, 114, 100, 101, 114, 101, 100, 34, 58, 102, 97, 108, 115, 101, 44, 34, 109, 101, 116, 97, 100, 97, 116, 97, 34, 58, 123, 125, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 98, 97, 115, 101, 95, 118, 111, 108, 34, 44, 34, 100, 97, 116, 97, 95, 116, 121, 112, 101, 34, 58, 34, 70, 108, 111, 97, 116, 54, 52, 34, 44, 34, 110, 117, 108, 108, 97, 98, 108, 101, 34, 58, 116, 114, 117, 101, 44, 34, 100, 105, 99, 116, 95, 105, 100, 34, 58, 48, 44, 34, 100, 105, 99, 116, 95, 105, 115, 95, 111, 114, 100, 101, 114, 101, 100, 34, 58, 102, 97, 108, 115, 101, 44, 34, 109, 101, 116, 97, 100, 97, 116, 97, 34, 58, 123, 125, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 116, 115, 34, 44, 34, 100, 97, 116, 97, 95, 116, 121, 112, 101, 34, 58, 34, 73, 110, 116, 54, 52, 34, 44, 34, 110, 117, 108, 108, 97, 98, 108, 101, 34, 58, 116, 114, 117, 101, 44, 34, 100, 105, 99, 116, 95, 105, 100, 34, 58, 48, 44, 34, 100, 105, 99, 116, 95, 105, 115, 95, 111, 114, 100, 101, 114, 101, 100, 34, 58, 102, 97, 108, 115, 101, 44, 34, 109, 101, 116, 97, 100, 97, 116, 97, 34, 58, 123, 125, 125, 44, 123, 34, 110, 97, 109, 101, 34, 58, 34, 95, 116, 105, 109, 101, 115, 116, 97, 109, 112, 34, 44, 34, 100, 97, 116, 97, 95, 116, 121, 112, 101, 34, 58, 123, 34, 84, 105, 109, 101, 115, 116, 97, 109, 112, 34, 58, 91, 34, 78, 97, 110, 111, 115, 101, 99, 111, 110, 100, 34, 44, 110, 117, 108, 108, 93, 125, 44, 34, 110, 117, 108, 108, 97, 98, 108, 101, 34, 58, 102, 97, 108, 115, 101, 44, 34, 100, 105, 99, 116, 95, 105, 100, 34, 58, 48, 44, 34, 100, 105, 99, 116, 95, 105, 115, 95, 111, 114, 100, 101, 114, 101, 100, 34, 58, 102, 97, 108, 115, 101, 44, 34, 109, 101, 116, 97, 100, 97, 116, 97, 34, 58, 123, 125, 125, 93, 44, 34, 109, 101, 116, 97, 100, 97, 116, 97, 34, 58, 123, 125, 125, 16, 7, 26, 2, 0, 1, 32, 1] }} }","panic.file":"crates/arroyo-state/src/parquet.rs","panic.line":276,"panic.column":81},"target":"arroyo_server_common"}
{"timestamp":"2024-07-16T15:09:48.482357Z","level":"ERROR","fields":{"message":"cleanup panicked","job_id":"job_4sCHsbvu3N","error":"JoinError::Panic(Id(1797919), ...)"},"target":"arroyo_controller::job_controller"}
{"timestamp":"2024-07-16T15:09:48.583549Z","level":"INFO","fields":{"message":"Starting cleaning","job_id":"job_4sCHsbvu3N","min_epoch":32,"new_min":34},"target":"arroyo_controller::job_controller"}
{"timestamp":"2024-07-16T15:09:48.584957Z","level":"INFO","fields":{"message":"Cleaning checkpoint","min_epoch":34,"job_id":"job_4sCHsbvu3N"},"target":"arroyo_state::parquet"}

Subsequent attempts to launch new pipelines also encounter the same errors.

@mwylde
Copy link
Member

mwylde commented Jul 17, 2024

Hi @hazelnut-99 — thanks for reporting! Can you provide some more details about how you're running Arroyo? In particular:

  • What scheduler are you using?
  • Where are checkpoints being stored (a local filesystem? s3? somewhere else?)
  • What does the query look like?

Thanks!

@mwylde mwylde added the bug Something isn't working label Jul 17, 2024
@hazelnut-99
Copy link
Author

Scheduler: Kubernetes Scheduler
Checkpoint storage: local filesystem
Query:

CREATE TABLE my_kafka_sink (
    field_1 TEXT,
    field_2 TEXT,
    field_3 DOUBLE,
    ts_1 BIGINT,
    ts_2 BIGINT
) WITH (
    'connector' = 'kafka',
    'avro.confluent_schema_registry' = 'true',
    'bootstrap_servers' = 'my_servers',
    'schema_registry.endpoint' = 'my_endpoint',
    'type' = 'sink',
    'topic' = 'my_topic',
    'format' = 'avro'
);

INSERT INTO my_kafka_sink 
select table_1.field_1, table_1.field_2, 
    (table_2.field_y - table_1.field_z) as field_3,
    table_1.ts as ts_1, table_2.ts as ts_2
from
(
    select field_1, field_2, hop(INTERVAL '5' second, INTERVAL '3' minute) as window, 
    last_value(field_z) as field_z, last_value("timestampMillis") as ts
    from connection_1
    group by 1, 2, 3
) as table_1 join 
(
    select field_1, field_2, hop(INTERVAL '5' second, INTERVAL '3' minute) as window, 
    last_value(value_latest) as field_y, last_value("timestampMillis") as ts
    from connection_2
    group by 1, 2, 3
) as table_2
on table_1.field_1 = table_2.field_1 and 
    table_1.field_2 = table_2.field_2 and 
    table_1.window = table_2.window

@mwylde
Copy link
Member

mwylde commented Jul 17, 2024

And to double check is this a distributed kubernetes cluster or a local one (like running in minikube or similar)?

All nodes in the cluster need to be able to read from the state, so if you're running on a distributed kubernetes cluster it's necessary to use some kind of distributed state storage (typically this will be an object store like S3 but could be a shared filesystem like NFS as well). This is described in the docs (https://doc.arroyo.dev/deployment/kubernetes) but could be made clearer.

@mwylde mwylde removed the bug Something isn't working label Jul 17, 2024
@hazelnut-99
Copy link
Author

Hi!
This is a distributed k8s cluster. I am aware the checkpoint destination needs to be available to all pods and have configured volumes and volumeMounts according to the doc.

To provide more context, this error is reproducible when I'm running it locally using the default process scheduler as well. Also typically when I launch the new pipeline, the first 100 (or so) checkpoints can succeed but will be stuck due to the reported error afterwards.

Thanks so much!

@mwylde
Copy link
Member

mwylde commented Jul 17, 2024

Thanks so much for providing the query — I'm able to reproduce the issue.

@mwylde mwylde added the bug Something isn't working label Jul 17, 2024
mwylde added a commit that referenced this issue Jul 17, 2024
@mwylde
Copy link
Member

mwylde commented Jul 17, 2024

Thanks for the great bug report and for helping me diagnose this issue! I've merged a fix that I believe should address it. Please try out the latest master (available as the docker image ghcr.io/arroyosystems/arroyo:tip) and please re-open if you're still seeing it.

@hazelnut-99
Copy link
Author

Wow, that was swift!!! I'll test it out, thanks so much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants