Fix an issue in InlineProjections that may cause query failure when multiple column masks exist on a table #12262

Merged
merged 1 commit into from
May 9, 2022

Conversation

weiatwork
Contributor

@weiatwork weiatwork commented May 5, 2022

Description

Is this change a fix, improvement, new feature, refactoring, or other?
Bug fix

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)
Query engine

How would you describe this change to a non-technical end user or system administrator?
Fixes an issue in InlineProjections that may cause query failure when multiple column masks exist on a table

Related issues, pull requests, and links

I incorporated unit test cases from #10437. The change in #10437 can be seen as an improvement to avoid unnecessary Project node stacking, but the root cause is how we handle multiple Project nodes when performing symbol inlining.

Fixes #10370

Documentation

(X) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(X) Release notes entries required with the following suggested text:

# Security
* Fix some query failures when multiple column masks exist on a table. ({issue}`10370`)

@cla-bot cla-bot bot added the cla-signed label May 5, 2022
@weiatwork weiatwork force-pushed the columnmaskdependency branch from 53fe8d8 to 31dfa0b Compare May 5, 2022 23:51
@findepi
Member

findepi commented May 6, 2022

@weiatwork weiatwork requested a review from kokosing May 6, 2022 15:54
@weiatwork weiatwork marked this pull request as ready for review May 6, 2022 15:54
@weiatwork weiatwork requested a review from guyco33 May 6, 2022 15:55
@kokosing kokosing requested a review from kasiafi May 6, 2022 19:49
Member

@kokosing kokosing left a comment

% comments

@weiatwork weiatwork force-pushed the columnmaskdependency branch from 28e4d26 to 600b3a7 Compare May 6, 2022 22:51
When there are multiple column masks on a table, it is possible to have
symbols in child projection that don't qualify for inlining. If that's
the case, we should keep their original assignment instead of resetting
them to identity projection.
@weiatwork weiatwork force-pushed the columnmaskdependency branch 2 times, most recently from 22e3ab0 to 4b6d7d1 Compare May 7, 2022 05:23
@kokosing kokosing merged commit 49348fd into trinodb:master May 9, 2022
@kokosing
Member

kokosing commented May 9, 2022

Merged, thanks.

@kokosing kokosing mentioned this pull request May 9, 2022
@github-actions github-actions bot added this to the 381 milestone May 9, 2022
@kasiafi
Member

kasiafi commented May 9, 2022

This issue was originally caused by the way column masks are planned. While planning column masks, we add projections which change the semantics of the masked symbols:

masked_column <-- mask_expression(masked_column)

This seems brittle, because different plan transformations we have in Trino span across sequences of plan nodes, and they assume that a symbol consistently has the same semantics. I would not be surprised if we had silent correctness issues resulting from that.

Another issue with planning column masks this way is the notion of "sequential masking". The masks which are applied later "see" the other columns already masked (but maybe it is the way it should be?). Consider this example:

    @Test
    public void testSequentialMasking()
    {
        accessControl.reset();

        accessControl.columnMask(
                new QualifiedObjectName(CATALOG, "tiny", "region"),
                "regionkey",
                USER,
                new ViewExpression(USER, Optional.empty(), Optional.empty(), "-1"));

        accessControl.columnMask(
                new QualifiedObjectName(CATALOG, "tiny", "region"),
                "name",
                USER,
                new ViewExpression(USER, Optional.empty(), Optional.empty(), "if(regionkey > 2, cast('***' as varchar(25)), name)"));

        assertThat(assertions.query("SELECT regionkey, name FROM region"))
                .matches("VALUES " +
                        "(BIGINT '-1', CAST('AFRICA' as varchar(25)))," +
                        "(BIGINT '-1', CAST('AMERICA' as varchar(25))), " +
                        "(BIGINT '-1', CAST('ASIA' as varchar(25))), " +
                        "(BIGINT '-1', CAST('EUROPE' as varchar(25))), " +
                        "(BIGINT '-1', CAST('MIDDLE EAST' as varchar(25)))");
    }

In this example, the mask on the regionkey column is applied first. As a result, the other mask on the name column never takes effect: its expression was written against the original values of regionkey, but it is evaluated against the masked value (-1), so the condition regionkey > 2 is never true.

The solution to both issues would be to plan masks so that the resulting symbol is a new symbol.
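
To make the difference concrete, here is a minimal, self-contained sketch that models the region example outside of Trino. It is not the planner code: the `Row` record and the `maskInPlace` and `maskIntoNewSymbols` helpers are hypothetical names introduced only for illustration, and the two methods merely mimic "overwrite the masked symbol" versus "assign the mask result to a new symbol".

```java
public class FreshSymbolMaskSketch
{
    // Hypothetical stand-in for one row of the region table (illustration only).
    record Row(long regionkey, String name) {}

    // Models in-place masking: the regionkey mask overwrites regionkey,
    // so the name mask is evaluated against the already-masked value.
    static Row maskInPlace(Row original)
    {
        Row afterFirstMask = new Row(-1L, original.name());
        String name = afterFirstMask.regionkey() > 2 ? "***" : afterFirstMask.name();
        return new Row(afterFirstMask.regionkey(), name);
    }

    // Models the proposal: each mask reads the original symbols and writes
    // its result to a new symbol, so masks cannot observe each other.
    static Row maskIntoNewSymbols(Row original)
    {
        long regionkeyMasked = -1L;
        String nameMasked = original.regionkey() > 2 ? "***" : original.name();
        return new Row(regionkeyMasked, nameMasked);
    }

    public static void main(String[] args)
    {
        Row europe = new Row(3L, "EUROPE");
        System.out.println(maskInPlace(europe));        // Row[regionkey=-1, name=EUROPE]
        System.out.println(maskIntoNewSymbols(europe)); // Row[regionkey=-1, name=***]
    }
}
```

Under this toy model, only the second variant masks the name for rows whose original regionkey is greater than 2. Which behavior is the intended one is exactly the open question in this thread.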

@martint @kokosing @weiatwork

@weiatwork weiatwork deleted the columnmaskdependency branch May 9, 2022 17:25
@weiatwork
Contributor Author

Thanks @kasiafi for pointing that out. On your second note about "sequential masking", I agree there is a mask dependency issue, though I'm not sure what the correct/compliant behavior is. What's worse, if the masks are applied in a non-deterministic order, the query results will be inconsistent as well.

@kokosing
Member

Each mask is like a view. If you have multiple masks, then you have a recursive view that reads from a view that reads from a table.

So each mask is applied independently. If you have a table with columns a and b, and two mask functions, M1: a = hash1(a) and M2: b = hash2(a), then the result is a = hash1(original a) and b = hash2(hash1(original a)).

Surely the order is important, and the engine applies the masks in the order returned by the plugin. Not sure how much test coverage we have here.
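
The M1/M2 example can be traced with a small standalone sketch. This is a toy model, not Trino code: the `Mask` record, the `applyInOrder` helper, and the string-building stand-ins for hash1 and hash2 are all assumptions made for illustration. It shows that when each mask reads the output of the previous one, swapping the order changes the value of b.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class SequentialMaskSketch
{
    // Hypothetical model of a column mask: target column plus an expression over the current row.
    record Mask(String column, Function<Map<String, String>, String> expression) {}

    // Stand-ins for M1: a = hash1(a) and M2: b = hash2(a); they build strings instead of hashing.
    static final Mask M1 = new Mask("a", row -> "hash1(" + row.get("a") + ")");
    static final Mask M2 = new Mask("b", row -> "hash2(" + row.get("a") + ")");

    // Applies the masks one after another; each mask reads the row left by the previous mask.
    static Map<String, String> applyInOrder(Map<String, String> row, List<Mask> masks)
    {
        Map<String, String> current = new LinkedHashMap<>(row);
        for (Mask mask : masks) {
            current.put(mask.column(), mask.expression().apply(current));
        }
        return current;
    }

    public static void main(String[] args)
    {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("a", "a0");
        row.put("b", "b0");

        System.out.println(applyInOrder(row, List.of(M1, M2))); // {a=hash1(a0), b=hash2(hash1(a0))}
        System.out.println(applyInOrder(row, List.of(M2, M1))); // {a=hash1(a0), b=hash2(a0)}
    }
}
```

In this model the value of a is the same in both orders, but b depends on whether M1 ran before M2, which is why the order returned by the plugin matters under sequential application.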

assertThat(assertions.query(query))
.matches("VALUES (CAST('***' as varchar(79)), 'O', CAST('***#000000951' as varchar(15)))");

// Mask "comment" and "orderstatus" using "clerk" ("clerk" appears between "orderstatus" and "comment" in table definition)
Member

The order of columns shouldn't determine any mask/filter related behavior.

I don't understand this comment, but it looks like it is explaining why the expected behavior is what it is.
If the expected behavior was identified as dependent on the column order in the table definition, this comment should be a TODO with a link to a GitHub issue.

see also #14420 (comment)

Member

The comments on the order of column definitions do not seem related. The tests don't show any dependency between the column order and the effective semantics of the masks. @weiatwork, am I correct? What's the purpose of those comments?

Contributor Author

@kasiafi That's right. That test didn't demonstrate any scenario where one column mask depends on another; it only tried to avoid the behavior that was not well understood at the time.

Now that I'm more confident this behavior is problematic, I adjusted the test case to intentionally demonstrate that there should be no dependencies after the fix: https://github.com/trinodb/trino/pull/14420/files#diff-637b4407daa07892834452da6da9e755fd1dfb72813298df590c7fe019796249R849

Member

@weiatwork please see my comment #14420 (comment).
If the behavior of the masks does not depend on the order of column definitions, then it would be worth removing the related part of the comments, e.g. ("clerk" appears between "orderstatus" and "comment" in table definition), as it might mislead someone into thinking that mask semantics depend on the column definition order.

Contributor Author

Currently there is a dependency. Now I realize there shouldn't be any dependency among masks.

Development

Successfully merging this pull request may close these issues.

Query fails while masking table columns with different schema order
4 participants