Skip to content

Conversation

@huanliwang-db
Copy link
Contributor

"23/02/23 23:57:44 INFO Executor: Running task 2.0 in stage 57.1 (TID 363)
"23/02/23 23:58:44 ERROR RocksDB StateStoreId(opId=0,partId=3,name=default): RocksDB instance 
could not be acquired by [ThreadId: Some(49), task: 3.0 in stage 57, TID 363] as it was not released by 
[ThreadId: Some(51), task: 3.1 in stage 57, TID 342] after 60002 ms.

We are seeing those error messages for a testing query. The taskId != partitionId but we fail to be clear on this in the error log.

It's confusing when we see those logs: the second log entry seems to talk about task 3.0 (it's actually partition 3 and retry attempt 0), but the TID 363 is already occupied by task 2.0 in stage 57.1.

Also, it's unclear at which stage retry attempt, the lock is acquired (or fails to be acquired)

What changes were proposed in this pull request?

  • add partition after task: in the log message for clarification
  • add stage attempt to distinguish different stage retries.

Why are the changes needed?

improve the log message for a better debuggability

Does this PR introduce any user-facing change?

No

How was this patch tested?

only log message change

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Nice!

@HeartSaVioR HeartSaVioR changed the title [SPARK-42565][SS] Error log improve ment for the lock acquisition of RocksDB state store instance [SPARK-42565][SS] Error log improvement for the lock acquisition of RocksDB state store instance Feb 24, 2023
@HeartSaVioR
Copy link
Contributor

Thanks! Merged to master.

Kimahriman pushed a commit to Kimahriman/spark that referenced this pull request Jun 20, 2023
…ocksDB state store instance

```
"23/02/23 23:57:44 INFO Executor: Running task 2.0 in stage 57.1 (TID 363)
"23/02/23 23:58:44 ERROR RocksDB StateStoreId(opId=0,partId=3,name=default): RocksDB instance
could not be acquired by [ThreadId: Some(49), task: 3.0 in stage 57, TID 363] as it was not released by
[ThreadId: Some(51), task: 3.1 in stage 57, TID 342] after 60002 ms.
```

We are seeing those error messages for a testing query. The `taskId != partitionId` but we fail to be clear on this in the error log.

It's confusing when we see those logs: the second log entry seems to talk about `task 3.0` (it's actually partition 3 and retry attempt 0), but the `TID 363` is already occupied by `task 2.0 in stage 57.1`.

Also, it's unclear at which stage retry attempt, the lock is acquired (or fails to be acquired)

### What changes were proposed in this pull request?
* add `partition ` after `task: ` in the log message for clarification
* add stage attempt to distinguish different stage retries.

### Why are the changes needed?

improve the log message for a better debuggability

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

only log message change

Closes apache#40161 from huanliwang-db/rocksdb.

Authored-by: Huanli Wang <huanli.wang@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants