Prune checkpoints in Lambda #4777

Merged
rdettai merged 9 commits into main from lambda-prune-checkpoints
Apr 3, 2024

Conversation

rdettai
Collaborator

@rdettai rdettai commented Mar 21, 2024

Description

Closes #4613

Avoid accumulating file sources when running the Lambda indexer.

How was this PR tested?

Describe how you tested this PR.

@rdettai rdettai self-assigned this Mar 21, 2024
Collaborator Author

@rdettai rdettai left a comment

One bug remains. Only one source is created:

    "checkpoint": {
      "ingest-lambda-source-1711043830": {
        "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711043826.gz": "00000000000016044634",
        "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711044126.gz": "00000000000016043012",
        "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711044426.gz": "00000000000016038941",
        "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711044726.gz": "00000000000016041053",
        "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711045026.gz": "00000000000016041903",
        "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711045326.gz": "00000000000016044080",
        "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711045626.gz": "00000000000016041526",
        "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711045925.gz": "00000000000016042481",
        "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711046225.gz": "00000000000016044227",
        "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711046526.gz": "00000000000016042532",
        "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711046825.gz": "00000000000016043210",
        "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711047126.gz": "00000000000016042689",
        "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711047426.gz": "00000000000016042582",
        "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711047726.gz": "00000000000016042450"
      }
    },
    "create_timestamp": 1711039331,
    "sources": [
      {
        "version": "0.8",
        "source_id": "ingest-lambda-source-1711043830",
        "num_pipelines": 1,
        "enabled": true,
        "source_type": "file",
        "params": {
          "filepath": "s3://mockdatastack-sourcemockdata26422bfc-mpua0jb4rrh1/mock-sales/1711043826.gz"
        },
        "input_format": "json"
      }
    ]
  },
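
A mismatch like the one in the dump above (many checkpointed files under a single source whose `filepath` names only the first one) can be spotted programmatically. The following is a minimal sketch, assuming the metastore entry has the same shape as the fragment shown: a `checkpoint` map keyed by source id and a `sources` list. The function name `checkpoint_summary`, the source id, and the bucket in the sample are hypothetical, and none of this is part of Quickwit.

```python
def checkpoint_summary(index_metadata):
    """Summarize checkpoint partitions per source.

    Assumes the shape seen in the fragment above:
    {"checkpoint": {source_id: {partition: position}}, "sources": [...]}.
    """
    # Filepath declared by each file source, keyed by source id.
    declared = {
        source["source_id"]: source.get("params", {}).get("filepath")
        for source in index_metadata.get("sources", [])
    }
    summary = {}
    for source_id, partitions in index_metadata.get("checkpoint", {}).items():
        filepath = declared.get(source_id)
        summary[source_id] = {
            "partitions": len(partitions),
            "declared_filepath": filepath,
            # Checkpointed files other than the one the source declares.
            "unexpected": sorted(p for p in partitions if p != filepath),
        }
    return summary


# Hypothetical, shortened sample mirroring the structure of the dump.
sample = {
    "checkpoint": {
        "ingest-lambda-source-1": {
            "s3://example-bucket/mock-sales/1.gz": "00000000000000000010",
            "s3://example-bucket/mock-sales/2.gz": "00000000000000000020",
            "s3://example-bucket/mock-sales/3.gz": "00000000000000000030",
        }
    },
    "sources": [
        {
            "source_id": "ingest-lambda-source-1",
            "source_type": "file",
            "params": {"filepath": "s3://example-bucket/mock-sales/1.gz"},
        }
    ],
}

summary = checkpoint_summary(sample)
print(summary["ingest-lambda-source-1"]["partitions"])
print(summary["ingest-lambda-source-1"]["unexpected"])
```

A non-empty `unexpected` list for a source indicates that checkpoints for several files were recorded against one source.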

@rdettai rdettai linked an issue Mar 21, 2024 that may be closed by this pull request
@rdettai rdettai changed the title from "Lambda-prune-checkpoints" to "Prune checkpoints in Lambda" Mar 21, 2024
@rdettai rdettai force-pushed the improve-lambda-merges branch from 4f9cde0 to a3de045 March 22, 2024 13:56
Base automatically changed from improve-lambda-merges to main March 22, 2024 14:08
@rdettai rdettai marked this pull request as ready for review March 26, 2024 08:11
@rdettai rdettai force-pushed the lambda-prune-checkpoints branch from 4d0f3aa to f6065a4 March 26, 2024 08:14
@rdettai rdettai requested a review from trinity-1686a March 28, 2024 17:21
Contributor

@trinity-1686a trinity-1686a left a comment

i'd rather we always keep the last X checkpoints so as to make sure we are resilient to redundant notifications, but i'm not sure how that can be done (embed a timestamp in the partition id, after a # maybe?). Anyway, that's definitely an improvement.

@rdettai
Collaborator Author

rdettai commented Apr 3, 2024

i'd rather we always keep the last X checkpoints

Yes, couldn't agree more. In #4613 I logged a few things I tried in order to achieve that, and why they failed. We could definitely come up with a solution, but it would require a larger rewrite of the metastore or the source.
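
For illustration only, the idea raised in the review (keep the last X checkpoints by embedding a timestamp in the partition id after a `#`) could be sketched as follows. This is not what the PR implements, and the thread notes that doing it for real would need larger changes to the metastore or the source. All function names are hypothetical, and the sketch assumes the timestamp is an integer such as the notification time in seconds.

```python
from typing import Dict, Tuple


def make_partition_id(uri: str, timestamp: int) -> str:
    # Hypothetical encoding: "<file uri>#<timestamp>".
    return f"{uri}#{timestamp}"


def split_partition_id(partition_id: str) -> Tuple[str, int]:
    # Split on the last "#" so a "#" inside the URI does not break parsing.
    uri, _, timestamp = partition_id.rpartition("#")
    return uri, int(timestamp)


def prune_checkpoints(checkpoints: Dict[str, str], keep_last: int) -> Dict[str, str]:
    """Keep only the `keep_last` entries with the most recent timestamps."""
    if keep_last <= 0:
        return {}
    newest = sorted(
        checkpoints,
        key=lambda pid: split_partition_id(pid)[1],
        reverse=True,
    )[:keep_last]
    return {pid: checkpoints[pid] for pid in newest}


def already_ingested(checkpoints: Dict[str, str], uri: str) -> bool:
    # A redundant notification is detectable only while the file's entry is kept.
    return any(split_partition_id(pid)[0] == uri for pid in checkpoints)


checkpoints = {
    make_partition_id("s3://example-bucket/a.gz", 100): "00000000000000000010",
    make_partition_id("s3://example-bucket/b.gz", 200): "00000000000000000020",
    make_partition_id("s3://example-bucket/c.gz", 300): "00000000000000000030",
}
kept = prune_checkpoints(checkpoints, keep_last=2)
print(sorted(kept))
```

The trade-off in this sketch is that a redundant notification for a file older than the retained window would no longer be recognized, so X would have to cover the period in which duplicates are expected.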

@rdettai rdettai force-pushed the lambda-prune-checkpoints branch from d82b649 to b352bd2 April 3, 2024 08:00
@rdettai rdettai enabled auto-merge (squash) April 3, 2024 08:00
@rdettai rdettai merged commit d097326 into main Apr 3, 2024
4 checks passed
@rdettai rdettai deleted the lambda-prune-checkpoints branch April 3, 2024 08:11
Development

Successfully merging this pull request may close these issues.

Garbage collect ingested files in the metastore in Lambda indexer
2 participants