Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[frontend] List named blobs SQL #2955

Open
wants to merge 50 commits into
base: master
Choose a base branch
from
Open

[frontend] List named blobs SQL #2955

wants to merge 50 commits into from

Conversation

snalli
Copy link
Contributor

@snalli snalli commented Dec 3, 2024

Fix list-named-blobs SQL to use smaller joins. This improves query performance by 1000x. Yes you read that right.

Copy link
Contributor

@SophieGuo410 SophieGuo410 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be careful about what we changed here.
I suggest we could have a new query and switch by config. so we can rollback easily if something is wrong.
Also we need to add integration test to make sure it cover every case.
For example,
if we have the delete_ts is null for older version and deleted_ts is not null for a new version, we will return the old version blobName.

@snalli
Copy link
Contributor Author

snalli commented Dec 3, 2024

We should be careful about what we changed here. I suggest we could have a new query and switch by config. so we can rollback easily if something is wrong. Also we need to add integration test to make sure it cover every case. For example, if we have the delete_ts is null for older version and deleted_ts is not null for a new version, we will return the old version blobName.

will add a test, the coverage seems limited, because the tests passed even though my prev query was not what is required.

@snalli
Copy link
Contributor Author

snalli commented Dec 3, 2024

We should be careful about what we changed here. I suggest we could have a new query and switch by config. so we can rollback easily if something is wrong. Also we need to add integration test to make sure it cover every case. For example, if we have the delete_ts is null for older version and deleted_ts is not null for a new version, we will return the old version blobName.

converted the query itself to a config to avoid a v2/v3 switch, as we far too many of those.

* In each group, it selects the blob with the highest version.
* The outer query filters out blobs that have been deleted, and return the latest blob in each group that is ready.
*/
@Config(LIST_NAMED_BLOBS_SQL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like having the actual query as a config is not good. We want to make the config as simple as possible, besides, having a sql cmd in config instead of in the MySqlNamedBlobDb makes the code hard to track. if you think we have too many boolean configs, you can easily remove them after the query has been fully verified by actual traffic.

time.sleep(100);

// delete blob and list should return empty
namedBlobDb.delete(account.getName(), container.getName(), blobName).get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add some test like like

  1. add one permanent version.
  2. upsert the named blob and delete it.
    In this case, we should list nothing cause the latest version already been deleted.

if (blobNamePrefix == null) {
// list-all no prefix
statement.setInt(1, accountId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little bit confused here.
You didn't change the LIST_ALL_QUERY_V2, why you can update the statement settings?

+ " AND blob_state = %1$s "
+ " AND blob_name LIKE ? " // 7
+ " AND blob_name >= ? " // 8
+ " GROUP BY blob_name "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think GROUP BY logic operates independently of the WHERE clause, this GROUP BY account_id, container_id, blob_name ensures that the aggregation (e.g., MAX(version)) respects the boundaries of each account and container. Again, we are not the expert of mysql queries, and you might want to add some test case like adding named blob for different account and container and test the list. Even though you use where to select the corresponding account and container, it might not give you the right max version based on my understanding.

@Config(LIST_NAMED_BLOBS_SQL)
public static final String DEFAULT_LIST_NAMED_BLOBS_SQL = ""
+ " WITH "
+ " BlobsAllVersion AS ( "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this query is still pretty complex. I suggest we can think of some simpler queries which makes the code easy to read. Sometimes having two queries is also ok cause we are trying to avoid the complex sql queris which might missing edge cases. cc @justinlin-linkedin

statement.setInt(6, containerId);
statement.setString(7, blobNamePrefix + "%");
statement.setString(8, pageToken != null ? pageToken : blobNamePrefix);
statement.setInt(9, maxKeysValue + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some time like this:
blobName1 blobId1 version1
blobName1 blobId2 version2
so listing we should return blobName1 blobId1 version2 right?
but I think in some case we might return the blobName1 blobId2 version.
This is ok but we should not return the NamedBlobRecord with blobId info. Instead, we can only return the blobName, cause blobId1 and blobId2 all map to blobName1 and we don't need blobId for listing.

@snalli snalli changed the title [WIP][frontend] List named blobs [frontend] List named blobs SQL Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants