
Compactor failing to compact blocks as they contain out-of-order samples #4573

Open
1 of 2 tasks
alanprot opened this issue Nov 26, 2021 · 7 comments
Labels
keepalive Skipped by stale bot

Comments

@alanprot
Member

alanprot commented Nov 26, 2021

Describe the bug

Compactor failing to compact blocks as they contain out-of-order samples.

level=error ts=2021-11-25T23:23:46.612236147Z caller=compactor.go:537 component=compactor msg="failed to compact user blocks" user=redacted err="compaction: group 0@16569674436192844872: blocks with out-of-order chunks are dropped from compaction: /data/compactor/compact/0@16569674436192844872/01FNAYWZRW617H05QAY4KCH7P8: 3/214285 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"

I'm not sure if Cortex is the right place to open this issue, as this seems to be an issue in TSDB.

This is not a compactor problem but rather ingesters accepting duplicate or out-of-order samples.

Looking at the problematic block, I can see 3 problematic series, each with an overlapping pair of chunks (a minimal sketch of the overlap check follows the list):

  • Series 1:

    • Chunk 1:
      • Time range: 1637803960602 to 1637803960602
      • Samples: 1
    • Chunk 2:
      • Time range: 1637803960602 to 1637805560140
      • Samples: 64
  • Series 2:

    • Chunk 1:
      • Time range: 1637803960806 to 1637803960806
      • Samples: 1
    • Chunk 2:
      • Time range: 1637803960806 to 1637805590480
      • Samples: 98
  • Series 3:

    • Chunk 1:
      • Time range: 1637803960806 to 1637803960806
      • Samples: 1
    • Chunk 2:
      • Time range: 1637803960806 to 1637805590480
      • Samples: 98
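To make the failure mode concrete, here is a minimal, self-contained sketch of the kind of overlap check that trips here. The type and function names are illustrative stand-ins, not the actual compactor/TSDB code; a chunk whose start time is not strictly after the previous chunk's end time is flagged as out-of-order:

```go
package main

import "fmt"

// chunkMeta is a simplified stand-in for TSDB chunk metadata,
// keeping only the fields needed for the illustration.
type chunkMeta struct {
	MinTime, MaxTime int64 // milliseconds since epoch
}

// hasOutOfOrderChunks reports whether any chunk starts at or before the
// end of the previous chunk, which is what gets reported as an
// "out-of-order chunk" and dropped from compaction.
func hasOutOfOrderChunks(chks []chunkMeta) bool {
	for i := 1; i < len(chks); i++ {
		if chks[i].MinTime <= chks[i-1].MaxTime {
			return true
		}
	}
	return false
}

func main() {
	// Series 1 from the block above: a single-sample chunk followed by a
	// chunk whose first sample carries the same timestamp.
	series1 := []chunkMeta{
		{MinTime: 1637803960602, MaxTime: 1637803960602}, // Chunk 1, 1 sample
		{MinTime: 1637803960602, MaxTime: 1637805560140}, // Chunk 2, 64 samples
	}
	fmt.Println("out-of-order:", hasOutOfOrderChunks(series1)) // out-of-order: true
}
```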

Looking at those blocks, we can see that the problems all seem to happen around 01:32:40.806 GMT, which, interestingly, is when the ingester was compacting and creating a new block:

level=info ts=2021-11-25T01:32:37.052823025Z caller=head.go:803 org_id=redacted msg="Head GC completed" duration=317.076532ms
level=info ts=2021-11-25T01:32:36.735244358Z caller=compact.go:518 org_id=redacted msg="write block" mint=1637791200000 maxt=1637798400000 ulid=01FNABJ4WSAZWYVSJFVBYGWYVA duration=1.637787092s
level=info ts=2021-11-25T01:32:35.082028862Z caller=head.go:803 org_id=redacted msg="Head GC completed" duration=101.006245ms
level=info ts=2021-11-25T01:32:34.979894064Z caller=compact.go:518 org_id=redacted msg="write block" mint=1637791040000 maxt=1637791200000 ulid=01FNABJ3RPEM22F743KWWBTTB8 duration=1.037618327s

We can also see that lots of series were being created/removed from the ingester memory at the same time (probably because of the GC after compacting the head):

[Screenshot: ingester in-memory series created/removed spiking around the same time]

Looking at the TSDB code, it seems there is a possible race condition when this happens:

Example timeline (a toy model of this interleaving follows the list):

  • SeriesA had samples from 22:00 to 23:50
  • At 1:30 we compact a new block (with data up to 00:00)
  • We receive a new sample (t1, v1) for SeriesA and the push goroutine is on this line
  • At the same point SeriesA is GCed here and here
  • (t1, v1) is pushed to the store
  • We receive another sample (t1, v2); we create SeriesA in memory and allow this push
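The sketch below is a toy model of that interleaving, not the real head/append code; all the names (head, series, getOrCreate, append) are invented for the illustration. It shows how an in-flight append holding a reference to a series that head GC then drops loses the per-series timestamp bookkeeping, so a later duplicate sample is accepted into a freshly created series:

```go
package main

import "fmt"

// series and head are toy stand-ins for the in-memory TSDB head,
// keeping only what is needed to show the interleaving.
type series struct {
	lastT int64
	lastV float64
}

type head struct {
	series map[string]*series
}

func (h *head) getOrCreate(key string) *series {
	s, ok := h.series[key]
	if !ok {
		s = &series{lastT: -1}
		h.series[key] = s
	}
	return s
}

// append mirrors the per-series duplicate/out-of-order check: samples at
// or before the last accepted timestamp are rejected.
func (s *series) append(t int64, v float64) error {
	if t <= s.lastT {
		return fmt.Errorf("out-of-order or duplicate sample at t=%d", t)
	}
	s.lastT, s.lastV = t, v
	return nil
}

func main() {
	h := &head{series: map[string]*series{}}
	const t1 = 1637803960806

	// Push goroutine resolves SeriesA and is about to append (t1, v1).
	sA := h.getOrCreate("SeriesA")

	// Head GC runs after "write block" and drops SeriesA from memory,
	// throwing away its lastT bookkeeping.
	delete(h.series, "SeriesA")

	// The in-flight append still succeeds on the stale reference.
	fmt.Println(sA.append(t1, 1.0)) // <nil>

	// A second push for (t1, v2) recreates SeriesA with no history, so the
	// duplicate is accepted instead of rejected: two chunks now start at t1.
	sA2 := h.getOrCreate("SeriesA")
	fmt.Println(sA2.append(t1, 2.0)) // <nil>
}
```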

To Reproduce
Run Cortex and send very sparse data with duplicated samples. It will eventually happen.

I also have a unit test that triggers the issue: alanprot/prometheus@e30d871

Expected behavior
The sample (t1, v2) should not be accepted.
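For reference, a simplified sketch of the per-series validation the append path is expected to enforce while SeriesA's history is still tracked (the function is illustrative; the error cases mirror the out-of-order and duplicate-sample errors the Prometheus storage package exposes):

```go
package main

import (
	"errors"
	"fmt"
)

// checkSample is an illustrative version of the per-series check: with
// SeriesA still in memory, a second sample at the same timestamp but a
// different value must be rejected as a duplicate.
func checkSample(lastT int64, lastV float64, t int64, v float64) error {
	switch {
	case t < lastT:
		return errors.New("out of order sample")
	case t == lastT && v != lastV:
		return errors.New("duplicate sample for timestamp")
	}
	return nil
}

func main() {
	const t1 = 1637803960806
	// SeriesA already accepted (t1, v1); the later push of (t1, v2) should
	// fail instead of starting a new single-sample chunk at the same timestamp.
	fmt.Println(checkSample(t1, 1.0, t1, 2.0)) // duplicate sample for timestamp
}
```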

Environment:

  • Infrastructure: [e.g., Kubernetes]

Storage Engine

  • Blocks
  • Chunks

Additional Context

@pstibrany
Contributor

pstibrany commented Nov 26, 2021

This is now hopefully fixed by prometheus/prometheus#9856, but it needs to be updated in Cortex too.

@alanprot
Member Author

alanprot commented Nov 26, 2021

Thanks @pstibrany ...
But maybe it is not the same problem, as it still happens on the current main branch of Prometheus (my repro case is based on the latest changes).

@pstibrany
Contributor

In that case I would definitely suggest reporting it to Prometheus and including your repro test case. /cc @codesome

@alanprot
Member Author

Potential fix for this should be something like this: alanprot/prometheus@fbc4206

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 2, 2022
@jeromeinsf
Contributor

Not stale, the discussion in Prometheus is still happening.

@yeya24
Contributor

yeya24 commented Nov 10, 2023

We can probably close this issue now as prometheus/prometheus#10624 was also closed? @alanprot
