Enforce globally unique table locations #67

eric-maynard · 2024-08-01T23:01:52Z

Description

This PR introduces a new flag, ENFORCE_GLOBALLY_UNIQUE_TABLE_LOCATIONS, which enforces that all newly-created tables must have a unique location which does not overlap with any other existing table.

This PR additionally introduces another new flag, ENFORCE_TABLE_LOCATIONS_INSIDE_NAMESPACE_LOCATIONS, which can be used to disable the requirement that a table must reside within a location that is a child of its namespace's location.

Together, these two options can be used to create tables in essentially arbitrary locations within a catalog without violating the invariant that one table cannot be stored in another table's location.
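To make the invariant concrete, here is a minimal sketch of what "overlap" between two table locations could mean. The class and method names (LocationOverlap, normalize, overlaps) are hypothetical and are not taken from this PR. The sketch assumes locations are plain strings that use / as the separator, and it appends a trailing slash before comparing so that sibling paths sharing a textual prefix (t1 and t10) are not reported as overlapping.

```java
// Hypothetical helper, not part of the PR: illustrates the overlap invariant only.
final class LocationOverlap {
    private LocationOverlap() {}

    /** Appends a trailing slash so prefix checks compare whole path segments. */
    static String normalize(String location) {
        return location.endsWith("/") ? location : location + "/";
    }

    /** Returns true if the locations are equal or one is nested inside the other. */
    static boolean overlaps(String a, String b) {
        String left = normalize(a);
        String right = normalize(b);
        return left.startsWith(right) || right.startsWith(left);
    }
}
```

With this definition, s3://bucket/ns/t1 overlaps s3://bucket/ns/t1/data, but it does not overlap s3://bucket/ns/t10.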

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Implemented a new test, PolarisOverlappingTableTest

Checklist:

Please delete options that are not relevant.

I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules
If adding new functionality, I have discussed my implementation with the community using the linked GitHub issue
I have signed and submitted the ICLA and if needed, the CCLA. See Contributing for details.


snazy · 2024-08-05T09:40:12Z

...src/main/java/io/polaris/extension/persistence/impl/eclipselink/PolarisEclipseLinkStore.java

+ AND typeCode = :table_code
+ AND (
+ location LIKE CONCAT(:location, '%')
+ OR :location LIKE CONCAT(location, '%')


Doesn't the 2nd condition cause a full-table-scan?

eric-maynard · 2024-08-05

Yes, as written it unfortunately will.

Of course, this is all dependent on the metastore manager implementation (and in this case the backing database's implementation).

In any event, if this check were always performed, I think there would be some easy ways to optimize it. But since the check is optional, we may need to check every path. I'm still testing ways to improve the performance of this optional check.

snazy · 2024-08-06

I was thinking about this problem a bit. It's a tricky-fun one ;)
Just brain-dumping some thoughts:

A few things play a role here, and with locations in general.

I think we can rely on / as a separator. With that in mind, a location is a pair of a "bucket" (S3/GCS bucket or ADLS fs-name) plus a list of path elements. If we distinguish parent directories from table directories, the check becomes easier, i.e. we could use a separate metastore entity that is only used to track locations.

CREATE TABLE locations (
  bucket TEXT NOT NULL,      -- storage bucket, e.g. s3://bucket
  path TEXT NOT NULL,        -- storage location path, e.g. my/path/to/my-table
  kind TEXT NOT NULL,        -- marker for "parent-directory" or "table-location"
  entity_id TEXT NULL,       -- id of the table
  PRIMARY KEY (bucket, path) -- needed so that a second INSERT for the same location conflicts
);

When you want to add/check for a new location like s3://bucket/my/path/my-table, the following INSERTs could do the trick:

INSERT INTO locations (bucket, path, kind, entity_id)
  VALUES ('s3://bucket', 'my', 'parent', NULL) ON CONFLICT DO NOTHING;
INSERT INTO locations (bucket, path, kind, entity_id)
  VALUES ('s3://bucket', 'my/path', 'parent', NULL) ON CONFLICT DO NOTHING;
INSERT INTO locations (bucket, path, kind, entity_id)
  VALUES ('s3://bucket', 'my/path/my-table', 'table', '1234');

If the last one fails -> location already used -> fail hard. If any of the parent inserts was skipped because of a conflict, verify that the existing rows all have kind = 'parent'.

I suspect this needs some more thought around race conditions (two tables with conflicting locations being created at the same time).
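The bookkeeping described above can be sketched in memory as follows. This is only an illustration under stated assumptions: LocationRegistry, Kind, and registerTable are hypothetical names, a HashMap stands in for the proposed locations table, and the synchronized method stands in for whatever transaction or locking the real metastore would need to rule out the race conditions mentioned. Paths are assumed to have no leading or trailing slash.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the proposed "locations" table.
final class LocationRegistry {
    enum Kind { PARENT, TABLE }

    // Key is bucket + "/" + path, mirroring a unique (bucket, path) pair.
    private final Map<String, Kind> rows = new HashMap<>();

    /** Registers a table location; returns false if it overlaps an existing table. */
    synchronized boolean registerTable(String bucket, String path) {
        String[] segments = path.split("/");
        List<String> parents = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < segments.length - 1; i++) {
            if (i > 0) {
                prefix.append('/');
            }
            prefix.append(segments[i]);
            parents.add(bucket + "/" + prefix);
        }
        String tableKey = bucket + "/" + path;

        // Any existing row at the exact path (table or parent) is a conflict.
        if (rows.containsKey(tableKey)) {
            return false;
        }
        // Every ancestor that already exists must be a parent, never a table.
        for (String parent : parents) {
            if (rows.get(parent) == Kind.TABLE) {
                return false;
            }
        }
        // Validation passed: record the parents (if missing) and the table itself.
        for (String parent : parents) {
            rows.putIfAbsent(parent, Kind.PARENT);
        }
        rows.put(tableKey, Kind.TABLE);
        return true;
    }
}
```

Validating before writing keeps a rejected registration from leaving stray parent rows behind; in a database, rolling back the transaction would serve the same purpose.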

collado-mike · 2024-08-05T22:47:35Z

...src/main/java/io/polaris/extension/persistence/impl/eclipselink/PolarisEclipseLinkStore.java

+ queryString = IntStream.range(0, directoryList.size())
+ .mapToObj(i ->
+ "SELECT location " +
+ "FROM ModelEntityActive " +
+ "WHERE location IS NOT NULL " +
+ "AND typeCode = :table_code " +
+ "AND location = :directory_" + i
+ )
+ .collect(Collectors.joining(" UNION ALL "));
+
+ queryString += " UNION ALL " +
+ "SELECT location " +
+ "FROM ModelEntityActive " +
+ "WHERE location IS NOT NULL " +
+ "AND typeCode = :table_code " +
+ "AND location LIKE :locationPrefix";
+ }


Can we not do a location IN (:directory_list) query? I think EclipseLink must support lists as parameters.
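A rough sketch of that suggestion follows. It is an assumption-laden illustration, not the PR's code: ParentLocations, selfAndAncestors, the :directories parameter, and the e alias are hypothetical, and it assumes the query is JPQL (where the JPA specification allows a collection-valued parameter in an IN expression) and that stored locations end with a trailing slash. The entity and field names come from the diff above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes the values to bind to a single IN parameter.
final class ParentLocations {
    private ParentLocations() {}

    // Assumed JPQL shape; ":directories" would be bound to the list returned below.
    static final String QUERY =
        "SELECT e.location FROM ModelEntityActive e "
            + "WHERE e.typeCode = :table_code "
            + "AND (e.location IN :directories OR e.location LIKE :locationPrefix)";

    /** Returns each ancestor directory of the location, then the location itself. */
    static List<String> selfAndAncestors(String location) {
        String trimmed = location.endsWith("/")
            ? location.substring(0, location.length() - 1)
            : location;
        int schemeEnd = trimmed.indexOf("://");
        int pathStart = schemeEnd < 0 ? 0 : schemeEnd + 3;

        List<String> result = new ArrayList<>();
        int slash = trimmed.indexOf('/', pathStart);
        while (slash >= 0) {
            // Keep the trailing slash so entries match normalized stored locations.
            result.add(trimmed.substring(0, slash + 1));
            slash = trimmed.indexOf('/', slash + 1);
        }
        result.add(trimmed + "/");
        return result;
    }
}
```

For s3://bucket/my/path/my-table this yields the bucket root, the two intermediate directories, and the table location itself, so the exact-match half of the check needs one bound list rather than one UNION ALL branch per directory.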

mayankvadariya · 2024-09-20T01:41:37Z

@eric-maynard are you planning to resume work on this PR? If not, any insight into why this work was halted would be appreciated. Thanks!

eric-maynard · 2024-09-20T01:54:35Z

Hi @mayankvadariya, thanks for your interest in this feature. There are a number of considerations here that contributed to me putting this work on hold. I'll walk through them here, but feel free to open up a discussion or thread on Zulip if you want to get into more detail. There's a lot of context here.

Initially, this feature was intended to enable users to set the table config write.object-storage.enabled or to otherwise decouple the physical arrangement of their tables in object storage from the logical arrangement of them in namespaces & catalogs. However, other work (e.g.) has unblocked this use case if users are willing to accept that credentials vended for one table could in theory be used to access data in another table. In this sense, this solution is no longer as urgently necessary.
For users who are unwilling to accept this, it's not obvious that Polaris's existing controls are sufficient to fully lock down access to data. For example, tables could be moved, could be dropped but not yet purged, or there could be some non-tabular data that a table's vended credentials might be able to read. For these users, this solution might not be complete enough. We'd probably need a bona fide proposal to talk through all the security implications in the right forum.
In light of the above points, there hasn't been appetite for this feature yet. Our internal use of Polaris doesn't yet require it, and I haven't heard from other users who want it. So I put it on hold for now.

eric-maynard added 18 commits August 1, 2024 12:56

initial commit

d7a6acc

semi-stable

bdc92c5

add check

ce5b0fd

wip

97a1700

start implementing test

5c656a3

progress on test

9919f80

entity locations are null

279bb64

check in

daae37b

try fixing loation routing

834696a

getting close

3d24bd8

stable

def7330

ENFORCE_TABLE_LOCATIONS_INSIDE_NAMESPACE_LOCATIONS; stable

88f9cfb

stable

2230a93

fix an existing test

3103f91

debugging

092cffa

Merge branch 'main' of github.com:polaris-catalog/polaris into enforc…

a95bcde

…e-globally-unique-table-location

progress on refactor

e91d45e

pause for now

797d467

snazy reviewed Aug 5, 2024


eric-maynard added 3 commits August 5, 2024 11:34

push location check into PolarisBaseEntity

e8e7fe2

somewhat optimize query

38eb126

harden query

2d3e886

collado-mike reviewed Aug 5, 2024


eric-maynard closed this Aug 20, 2024

