Enforce globally unique table locations #67


eric-maynard (Contributor) commented Aug 1, 2024

Description

This PR introduces a new flag, ENFORCE_GLOBALLY_UNIQUE_TABLE_LOCATIONS, which enforces that all newly-created tables must have a unique location which does not overlap with any other existing table.

This PR additionally introduces another new flag, ENFORCE_TABLE_LOCATIONS_INSIDE_NAMESPACE_LOCATIONS, which can be used to disable the requirement that a table must reside within a location that is a child of its namespace's location.

Together, these two options can be used to create tables in essentially arbitrary locations within a catalog without violating the invariant that one table cannot be stored in another table's location.
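
The invariant described above can be illustrated with a minimal sketch. The names `LocationOverlap`, `normalize`, and `overlaps` are hypothetical and not taken from this PR. The sketch assumes `/` is the path separator and appends a trailing `/` before comparing, so that `s3://bucket/ns/t1` is not mistaken for a prefix of `s3://bucket/ns/t10`.

```java
final class LocationOverlap {
  private LocationOverlap() {}

  // Append a trailing slash so prefix checks respect path-element boundaries.
  static String normalize(String location) {
    return location.endsWith("/") ? location : location + "/";
  }

  // Two locations overlap if they are equal or one is an ancestor of the other.
  static boolean overlaps(String a, String b) {
    String na = normalize(a);
    String nb = normalize(b);
    return na.startsWith(nb) || nb.startsWith(na);
  }

  public static void main(String[] args) {
    // A table nested inside another table's location overlaps: prints true
    System.out.println(overlaps("s3://bucket/ns/t1", "s3://bucket/ns/t1/data"));
    // Sibling tables sharing a name prefix do not overlap: prints false
    System.out.println(overlaps("s3://bucket/ns/t1", "s3://bucket/ns/t10"));
  }
}
```

The check is symmetric on purpose: a new table must be rejected both when it would sit inside an existing table's location and when an existing table already sits inside the new location.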

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Implemented a new test, PolarisOverlappingTableTest

Checklist:

Please delete options that are not relevant.

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • If adding new functionality, I have discussed my implementation with the community using the linked GitHub issue
  • I have signed and submitted the ICLA and if needed, the CCLA. See Contributing for details.

AND typeCode = :table_code
AND (
location LIKE CONCAT(:location, '%')
OR :location LIKE CONCAT(location, '%')
Member

Doesn't the 2nd condition cause a full-table-scan?

Contributor Author

Yes, as written it unfortunately will.

Of course, this is all dependent on the metastore manager implementation (and in this case the backing database's implementation).

In any event, if this check were always performed, I think there would be some easy ways to optimize it. But since the check is optional, we may need to check every path. I'm still testing ways to improve the performance of this optional check.

Member

I was thinking about this problem a bit. It's a tricky-fun one ;)
Just brain-dumping some thoughts:

There are a few things that play a role here, and with locations in general.

I think we can rely on / as a separator. With that in mind, a location is a pair of "bucket" (S3/GCS bucket or ADLS fs-name) plus a list of path elements. If we distinguish parent directories from table directories, the check becomes easier, i.e. with a separate metastore entity that is only used to track locations.

CREATE TABLE locations (
  bucket TEXT NOT NULL,     -- storage bucket, e.g. s3://bucket
  path TEXT NOT NULL,       -- storage location path, e.g. my/path/to/my-table
  kind TEXT NOT NULL,       -- marker: 'parent' (parent directory) or 'table' (table location)
  entity_id TEXT NULL,      -- id of the table, NULL for parent directories
  PRIMARY KEY (bucket, path) -- needed so that conflicting INSERTs are detected
);

When you want to add/check for a new location like s3://bucket/my/path/my-table, the following INSERTs could do the trick:

INSERT INTO locations (bucket, path, kind, entity_id) VALUES ('s3://bucket', 'my', 'parent', NULL) ON CONFLICT DO NOTHING;
INSERT INTO locations (bucket, path, kind, entity_id) VALUES ('s3://bucket', 'my/path', 'parent', NULL) ON CONFLICT DO NOTHING;
INSERT INTO locations (bucket, path, kind, entity_id) VALUES ('s3://bucket', 'my/path/my-table', 'table', '1234');

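If the last one fails -> location already used -> fail hard. If any of the parent INSERTs hit a conflict and inserted nothing, verify that the existing rows are all kind = 'parent'; if one of them is a table location, the new table would sit inside another table.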

I suspect this needs some more thought around race conditions (two tables with conflicting locations).
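
The bookkeeping behind this idea can be sketched in memory. Everything below is hypothetical and only illustrates the parent/table distinction: `LocationRegistry`, `parentPaths`, and `register` are made-up names, a `HashMap` stands in for the proposed `locations` table, and paths are assumed to have no leading or trailing `/`. The `synchronized` keyword only serializes callers within a single JVM; it does not address the database-level race conditions mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LocationRegistry {
  enum Kind { PARENT, TABLE }

  // Stands in for the proposed `locations` table, keyed by "bucket|path".
  private final Map<String, Kind> rows = new HashMap<>();

  // Returns every proper ancestor path: "my/path/my-table" -> ["my", "my/path"].
  static List<String> parentPaths(String path) {
    List<String> parents = new ArrayList<>();
    int idx = path.indexOf('/');
    while (idx >= 0) {
      parents.add(path.substring(0, idx));
      idx = path.indexOf('/', idx + 1);
    }
    return parents;
  }

  // Registers a table location; returns false if it conflicts with an existing table.
  synchronized boolean register(String bucket, String path) {
    String tableKey = bucket + "|" + path;
    if (rows.containsKey(tableKey)) {
      return false; // already a table here, or a parent directory of one
    }
    List<String> parents = parentPaths(path);
    for (String parent : parents) {
      if (rows.get(bucket + "|" + parent) == Kind.TABLE) {
        return false; // an ancestor directory is itself a table location
      }
    }
    for (String parent : parents) {
      rows.putIfAbsent(bucket + "|" + parent, Kind.PARENT);
    }
    rows.put(tableKey, Kind.TABLE);
    return true;
  }

  public static void main(String[] args) {
    LocationRegistry registry = new LocationRegistry();
    System.out.println(registry.register("s3://bucket", "my/path/my-table"));
    System.out.println(registry.register("s3://bucket", "my/path"));
  }
}
```

Each registration touches one row per path element, all by exact key, which is what would let a database use an index instead of the prefix scan discussed earlier in the thread.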

Comment on lines +442 to +458
queryString = IntStream.range(0, directoryList.size())
.mapToObj(i ->
"SELECT location " +
"FROM ModelEntityActive " +
"WHERE location IS NOT NULL " +
"AND typeCode = :table_code " +
"AND location = :directory_" + i
)
.collect(Collectors.joining(" UNION ALL "));

queryString += " UNION ALL " +
"SELECT location " +
"FROM ModelEntityActive " +
"WHERE location IS NOT NULL " +
"AND typeCode = :table_code " +
"AND location LIKE :locationPrefix";
}
Contributor

Can we not do a location IN (:directory_list) query? I think EclipseLink must support lists as parameters.
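
A rough sketch of what that suggestion might look like follows. It only builds the query text and a parameter map, using a hypothetical helper class `OverlapQuery` that is not part of this PR. It assumes the query is JPQL and that the provider accepts a collection-valued parameter in an `IN` expression, which Jakarta Persistence defines. In real code the string would be passed to `EntityManager.createQuery` and each map entry to `Query.setParameter`. The parameter name `directory_list` follows the comment above; `table_code` and `locationPrefix` follow the original hunk.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OverlapQuery {
  final String jpql;
  final Map<String, Object> parameters;

  private OverlapQuery(String jpql, Map<String, Object> parameters) {
    this.jpql = jpql;
    this.parameters = parameters;
  }

  // Builds a single query in place of one UNION ALL branch per parent directory.
  static OverlapQuery build(List<String> directoryList, String locationPrefix, Object tableCode) {
    Map<String, Object> params = new LinkedHashMap<>();
    params.put("table_code", tableCode);
    params.put("locationPrefix", locationPrefix);

    StringBuilder sb = new StringBuilder()
        .append("SELECT location ")
        .append("FROM ModelEntityActive ")
        .append("WHERE location IS NOT NULL ")
        .append("AND typeCode = :table_code ")
        .append("AND (location LIKE :locationPrefix");
    // Skip the IN clause for an empty list; empty IN lists are rejected by many databases.
    if (!directoryList.isEmpty()) {
      sb.append(" OR location IN :directory_list");
      params.put("directory_list", List.copyOf(directoryList));
    }
    sb.append(")");
    return new OverlapQuery(sb.toString(), params);
  }

  public static void main(String[] args) {
    OverlapQuery q = build(
        List.of("s3://bucket/my", "s3://bucket/my/path"),
        "s3://bucket/my/path/my-table/%",
        0); // placeholder type code, not the real value used by Polaris
    System.out.println(q.jpql);
    System.out.println(q.parameters.keySet());
  }
}
```

Whether this performs better than the `UNION ALL` form depends on the backing database; the main gain in the sketch is a query whose text does not grow with the depth of the path.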

@mayankvadariya

@eric-maynard are you planning to resume work on this PR? If not, any insight into why this work was halted would be appreciated. Thanks!

@eric-maynard
Contributor Author

eric-maynard commented Sep 20, 2024

Hi @mayankvadariya, thanks for your interest in this feature. There are a number of considerations here that contributed to me putting this work on hold. I'll walk through them here, but feel free to open up a discussion or thread on Zulip if you want to get into more detail. There's a lot of context here.

  1. Initially, this feature was intended to enable users to set the table config write.object-storage.enabled or to otherwise decouple the physical arrangement of their tables in object storage from the logical arrangement of them in namespaces & catalogs. However, other work (e.g.) has unblocked this use case if users are willing to accept that credentials vended for one table could in theory be used to access data in another table. In this sense, this solution is no longer as urgently necessary.
  2. For users who are unwilling to accept this, it's not obvious that Polaris's existing controls are sufficient to fully lock down access to data. For example, tables could be moved, could be dropped but not yet purged, or there could be some non-tabular data that a table's vended credentials might be able to read. Given this, the solution might not be complete enough. We'd probably need a bona fide proposal to talk through all the security implications in the right forum.
  3. In light of the above points, there hasn't been appetite for this feature yet. Our internal use of Polaris doesn't yet require it, and I haven't heard from other users who want it. So I put it on hold for now.


4 participants