
Compactor failing to compact blocks as they contain out-of-order samples #4573

Open
1 of 2 tasks
alanprot opened this issue Nov 26, 2021 · 7 comments
Labels
keepalive Skipped by stale bot

Comments

@alanprot
Member

alanprot commented Nov 26, 2021

Describe the bug

Compactor failing to compact blocks as they contain out-of-order samples.

level=error ts=2021-11-25T23:23:46.612236147Z caller=compactor.go:537 component=compactor msg="failed to compact user blocks" user=redacted err="compaction: group 0@16569674436192844872: blocks with out-of-order chunks are dropped from compaction: /data/compactor/compact/0@16569674436192844872/01FNAYWZRW617H05QAY4KCH7P8: 3/214285 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"

I'm not sure if Cortex is the right place to open this issue, as this seems to be an issue in TSDB.

This is not a compactor problem but rather ingesters accepting duplicate or out-of-order samples.

Looking at the problematic block, I can see 3 problematic series, each with an overlapping pair of chunks (a minimal sketch of the overlap check follows the list):

  • Series 1:

    • Chunk 1:
      • Time range: 1637803960602 to 1637803960602
      • Samples: 1
    • Chunk 2:
      • Time range: 1637803960602 to 1637805560140
      • Samples: 64
  • Series 2:

    • Chunk 1:
      • Time range: 1637803960806 to 1637803960806
      • Samples: 1
    • Chunk 2:
      • Time range: 1637803960806 to 1637805590480
      • Samples: 98
  • Series 3:

    • Chunk 1:
      • Time range: 1637803960806 to 1637803960806
      • Samples: 1
    • Chunk 2:
      • Time range: 1637803960806 to 1637805590480
      • Samples: 98
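To make the failure mode concrete, here is a minimal, self-contained sketch of the kind of overlap check that trips here. The type and function names are illustrative stand-ins, not the actual compactor/TSDB code; a chunk whose start time is not strictly after the previous chunk's end time is flagged as out-of-order:

```go
package main

import "fmt"

// chunkMeta is a simplified stand-in for TSDB chunk metadata,
// keeping only the fields needed for the illustration.
type chunkMeta struct {
	MinTime, MaxTime int64 // milliseconds since epoch
}

// hasOutOfOrderChunks reports whether any chunk starts at or before the
// end of the previous chunk, which is what gets reported as an
// "out-of-order chunk" and dropped from compaction.
func hasOutOfOrderChunks(chks []chunkMeta) bool {
	for i := 1; i < len(chks); i++ {
		if chks[i].MinTime <= chks[i-1].MaxTime {
			return true
		}
	}
	return false
}

func main() {
	// Series 1 from the block above: a single-sample chunk followed by a
	// chunk whose first sample carries the same timestamp.
	series1 := []chunkMeta{
		{MinTime: 1637803960602, MaxTime: 1637803960602}, // Chunk 1, 1 sample
		{MinTime: 1637803960602, MaxTime: 1637805560140}, // Chunk 2, 64 samples
	}
	fmt.Println("out-of-order:", hasOutOfOrderChunks(series1)) // out-of-order: true
}
```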

Looking at those blocks, we can see that the problems all seem to happen around 01:32:40.806 GMT, which, interestingly, is when the ingester was compacting and creating a new block:

level=info ts=2021-11-25T01:32:37.052823025Z caller=head.go:803 org_id=redacted msg="Head GC completed" duration=317.076532ms
level=info ts=2021-11-25T01:32:36.735244358Z caller=compact.go:518 org_id=redacted msg="write block" mint=1637791200000 maxt=1637798400000 ulid=01FNABJ4WSAZWYVSJFVBYGWYVA duration=1.637787092s
level=info ts=2021-11-25T01:32:35.082028862Z caller=head.go:803 org_id=redacted msg="Head GC completed" duration=101.006245ms
level=info ts=2021-11-25T01:32:34.979894064Z caller=compact.go:518 org_id=redacted msg="write block" mint=1637791040000 maxt=1637791200000 ulid=01FNABJ3RPEM22F743KWWBTTB8 duration=1.037618327s

We can also see that lots of series were being created/removed from the ingester memory at the same time (probably because of the GC after compacting the head):

[Screenshot: ingester in-memory series created/removed spiking around the same time]

Looking at the TSDB code, it seems there is a possible race condition when this happens:

Example timeline (a toy model of this interleaving follows the list):

  • SeriesA had samples from 22:00 to 23:50
  • At 1:30 we compact a new block (with data up to 00:00)
  • We receive a new sample (t1, v1) for SeriesA and the push goroutine is on this line
  • At the same point SeriesA is GCed here and here
  • (t1, v1) is pushed to the store
  • We receive another sample (t1, v2); we create SeriesA in memory and allow this push
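The sketch below is a toy model of that interleaving, not the real head/append code; all the names (head, series, getOrCreate, append) are invented for the illustration. It shows how an in-flight append holding a reference to a series that head GC then drops loses the per-series timestamp bookkeeping, so a later duplicate sample is accepted into a freshly created series:

```go
package main

import "fmt"

// series and head are toy stand-ins for the in-memory TSDB head,
// keeping only what is needed to show the interleaving.
type series struct {
	lastT int64
	lastV float64
}

type head struct {
	series map[string]*series
}

func (h *head) getOrCreate(key string) *series {
	s, ok := h.series[key]
	if !ok {
		s = &series{lastT: -1}
		h.series[key] = s
	}
	return s
}

// append mirrors the per-series duplicate/out-of-order check: samples at
// or before the last accepted timestamp are rejected.
func (s *series) append(t int64, v float64) error {
	if t <= s.lastT {
		return fmt.Errorf("out-of-order or duplicate sample at t=%d", t)
	}
	s.lastT, s.lastV = t, v
	return nil
}

func main() {
	h := &head{series: map[string]*series{}}
	const t1 = 1637803960806

	// Push goroutine resolves SeriesA and is about to append (t1, v1).
	sA := h.getOrCreate("SeriesA")

	// Head GC runs after "write block" and drops SeriesA from memory,
	// throwing away its lastT bookkeeping.
	delete(h.series, "SeriesA")

	// The in-flight append still succeeds on the stale reference.
	fmt.Println(sA.append(t1, 1.0)) // <nil>

	// A second push for (t1, v2) recreates SeriesA with no history, so the
	// duplicate is accepted instead of rejected: two chunks now start at t1.
	sA2 := h.getOrCreate("SeriesA")
	fmt.Println(sA2.append(t1, 2.0)) // <nil>
}
```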

To Reproduce
Run Cortex and send very sparse data with duplicated samples. It will eventually happen.

I also have a unit test that triggers the issue: alanprot/prometheus@e30d871

Expected behavior
The sample (t1, v2) should not be accepted.
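For reference, a simplified sketch of the per-series validation the append path is expected to enforce while SeriesA's history is still tracked (the function is illustrative; the error cases mirror the out-of-order and duplicate-sample errors the Prometheus storage package exposes):

```go
package main

import (
	"errors"
	"fmt"
)

// checkSample is an illustrative version of the per-series check: with
// SeriesA still in memory, a second sample at the same timestamp but a
// different value must be rejected as a duplicate.
func checkSample(lastT int64, lastV float64, t int64, v float64) error {
	switch {
	case t < lastT:
		return errors.New("out of order sample")
	case t == lastT && v != lastV:
		return errors.New("duplicate sample for timestamp")
	}
	return nil
}

func main() {
	const t1 = 1637803960806
	// SeriesA already accepted (t1, v1); the later push of (t1, v2) should
	// fail instead of starting a new single-sample chunk at the same timestamp.
	fmt.Println(checkSample(t1, 1.0, t1, 2.0)) // duplicate sample for timestamp
}
```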

Environment:

  • Infrastructure: [e.g., Kubernetes]

Storage Engine

  • Blocks
  • Chunks

Additional Context

@pstibrany
Contributor

pstibrany commented Nov 26, 2021

This is now hopefully fixed by prometheus/prometheus#9856, but it needs to be updated in Cortex too.

@alanprot
Member Author

alanprot commented Nov 26, 2021

Thanks @pstibrany ...
But maybe it is not the same problem, as it still happens on the current main branch of Prometheus (my repro case is based on the latest changes).

@pstibrany
Contributor

In that case I would definitely suggest reporting it to Prometheus and including your repro test case. /cc @codesome

@alanprot
Member Author

Potential fix for this should be something like this: alanprot/prometheus@fbc4206

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 2, 2022
@jeromeinsf
Contributor

Not stale, the discussion in Prometheus is still happening.

@yeya24
Contributor

yeya24 commented Nov 10, 2023

We can probably close this issue now as prometheus/prometheus#10624 was also closed? @alanprot
