Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query fails while masking table columns with different schema order #10370

Closed
guyco33 opened this issue Dec 21, 2021 · 5 comments · Fixed by #12262
Closed

Query fails while masking table columns with different schema order #10370

guyco33 opened this issue Dec 21, 2021 · 5 comments · Fixed by #12262
Labels
bug Something isn't working syntax-needs-review

Comments

@guyco33
Copy link
Member

guyco33 commented Dec 21, 2021

etc/access_control-rules.json

{
  "tables": [
    {
      "catalog": "hive",
      "schema": "temp",
      "table": "test_mask_(good|bad)",
      "columns": [
        {
          "name": "payload",
          "mask": "payload||''"
        },
        {
          "name": "col1",
          "mask": "if(json_extract_scalar(payload,'$.attr') IN ('a'), '***', col1)"
        },
        {
          "name": "col2",
          "mask": "if(json_extract_scalar(payload,'$.attr') IN ('a') or json_extract_scalar(payload,'$.attr') IN ('b'), '***', col2)"
        }
      ],
      "privileges": [
        "SELECT"
      ]
    }
  ]
}

GOOD:

CREATE TABLE temp.test_mask_good(col1 varchar, col2 varchar, payload varchar);

EXPLAIN SELECT col1, col2 from temp.test_mask_good;

Fragment 0 [SINGLE]
    Output layout: [col1, col2]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    Output[col1, col2]
    │   Layout: [col1:varchar, col2:varchar]
    │   Estimates: {rows: 0 (0B), cpu: 0, memory: 0B, network: 0B}
    └─ RemoteSource[1]
           Layout: [col2:varchar, col1:varchar]

Fragment 1 [SOURCE]
    Output layout: [col2, col1]
    Output partitioning: SINGLE []
    Stage Execution Strategy: UNGROUPED_EXECUTION
    ScanProject[table = hive:temp:test_mask_good, grouped = false]
        Layout: [col2:varchar, col1:varchar]
        Estimates: {rows: 0 (0B), cpu: 0, memory: 0B, network: 0B}/{rows: 0 (0B), cpu: 0, memory: 0B, network: 0B}
        col2 := (CASE WHEN ((json_extract_scalar("payload", "$literal$"(from_base64('DgAAAFZBUklBQkxFX1dJRFRIAQAAAAYAAAAABgAAACQuYXR0cg=='))) = VARCHAR 'a') OR (json_extract_scalar("payload", "$literal$"(from_base64('DgAAAFZBUklBQkxFX1dJRFRIAQAAAAYAAAAABgAAACQuYXR0cg=='))) = VARCHAR 'b')) THEN VARCHAR '***' ELSE "col2" END)
        col1 := (CASE WHEN (json_extract_scalar("payload", "$literal$"(from_base64('DgAAAFZBUklBQkxFX1dJRFRIAQAAAAYAAAAABgAAACQuYXR0cg=='))) = VARCHAR 'a') THEN VARCHAR '***' ELSE "col1" END)
        payload := payload:string:REGULAR
        col2 := col2:string:REGULAR
        col1 := col1:string:REGULAR

BAD: (INTERNAL_ERROR)

CREATE TABLE temp.test_mask_bad(col1 varchar, payload varchar, col2 varchar);

EXPLAIN SELECT col1, col2 from temp.test_mask_bad;

java.lang.IllegalStateException: Symbol payload already has assignment "@concat@52QIVVB44009Q180UA4I07RG96DG090K6U5ML5K1KLK2QAVH0OL2DK23J1VO46IUK6LC5U81PLGJ831QOD68L252G81AGB3B419NEG27AMO6PAIC72IPLVJ0BE5PSSIPD8J0M5SHJSDH8125O20UD016EA9RDGOF08O21UEJ3G33KMU5KGGTDVBSJL15LDT05G4BLMN2VBSLTKI478T2D9R7Q1ORAFA54GTH64G07K970F7GOKO4G4EGECECU3KSMU3T73NJ33134LHICKVL213N69TN108MM9U5LHHJ08P9GUHQLEA9O==="("payload", CAST('' AS varchar)), while adding "payload"
	at com.google.common.base.Preconditions.checkState(Preconditions.java:845)
	at io.trino.sql.planner.plan.Assignments$Builder.put(Assignments.java:270)
	at io.trino.sql.planner.plan.Assignments$Builder.putIdentity(Assignments.java:297)
	at io.trino.sql.planner.iterative.rule.InlineProjections.inlineProjections(InlineProjections.java:124)
	at io.trino.sql.planner.iterative.rule.InlineProjections.apply(InlineProjections.java:83)
	at io.trino.sql.planner.iterative.rule.InlineProjections.apply(InlineProjections.java:58)
	at io.trino.sql.planner.iterative.IterativeOptimizer.transform(IterativeOptimizer.java:184)
	at io.trino.sql.planner.iterative.IterativeOptimizer.exploreNode(IterativeOptimizer.java:159)
	at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:131)
	at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:217)
	at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:126)
	at io.trino.sql.planner.iterative.IterativeOptimizer.optimize(IterativeOptimizer.java:109)
	at io.trino.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:232)
	at io.trino.sql.analyzer.QueryExplainer.getLogicalPlan(QueryExplainer.java:176)
	at io.trino.sql.analyzer.QueryExplainer.getDistributedPlan(QueryExplainer.java:187)
	at io.trino.sql.analyzer.QueryExplainer.getPlan(QueryExplainer.java:104)
	at io.trino.sql.rewrite.ExplainRewrite$Visitor.getQueryPlan(ExplainRewrite.java:147)
	at io.trino.sql.rewrite.ExplainRewrite$Visitor.visitExplain(ExplainRewrite.java:125)
	at io.trino.sql.rewrite.ExplainRewrite$Visitor.visitExplain(ExplainRewrite.java:73)
	at io.trino.sql.tree.Explain.accept(Explain.java:61)
	at io.trino.sql.tree.AstVisitor.process(AstVisitor.java:27)
	at io.trino.sql.rewrite.ExplainRewrite.rewrite(ExplainRewrite.java:70)
	at io.trino.sql.rewrite.StatementRewrite.rewrite(StatementRewrite.java:53)
	at io.trino.sql.analyzer.Analyzer.analyze(Analyzer.java:88)
	at io.trino.sql.analyzer.Analyzer.analyze(Analyzer.java:83)
	at io.trino.execution.SqlQueryExecution.analyze(SqlQueryExecution.java:250)
	at io.trino.execution.SqlQueryExecution.<init>(SqlQueryExecution.java:177)
	at io.trino.execution.SqlQueryExecution$SqlQueryExecutionFactory.createQueryExecution(SqlQueryExecution.java:768)
	at io.trino.dispatcher.LocalDispatchQueryFactory.lambda$createDispatchQuery$0(LocalDispatchQueryFactory.java:132)
	at io.trino.$gen.Trino_366____20211221_071947_2.call(Unknown Source)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
@guyco33 guyco33 added the bug Something isn't working label Dec 21, 2021
@findepi
Copy link
Member

findepi commented Dec 21, 2021

Column masks should not be used as a replacement for generated columns.
It's counter-intuitive -- and to me it's unintentional feature -- that a column mask can reference other columns.
IMO we should forbid this.

cc @kokosing

@guyco33
Copy link
Member Author

guyco33 commented Dec 21, 2021

@findepi maybe the payload||'' function wasn't a good one. I just used it to demonstrate a dummy masking function.

There are cases that some attributes in the payload (that are not masked in the json) should be used to determine if other columns need to be masked.

{
  "tables": [
    {
      "catalog": "hive",
      "schema": "temp",
      "table": "test_mask_(good|bad)",
      "columns": [
        {
          "name": "payload",
          "mask": "function_that_mask_some_attributes_in_the_json(payload)"
        },
        {
          "name": "col1",
          "mask": "if(json_extract_scalar(payload,'$.unmasked_attr') IN ('a'), '***', col1)"
        },
        {
          "name": "col2",
          "mask": "if(json_extract_scalar(payload,'$.unmasked_attr') IN ('a') or json_extract_scalar(payload,'$.unmasked_attr') IN ('b'), '***', col2)"
        }
      ],
      "privileges": [
        "SELECT"
      ]
    }
  ]
}

@martint
Copy link
Member

martint commented Dec 22, 2021

and to me it's unintentional feature -- that a column mask can reference other columns.

It's actually intentional, and sometimes necessary, as the mask may depend on various dynamic conditions (e.g., you might need to mask some column based on whether the row contains data for some country -- indicated by some other column). However, it should only be able to see columns before masking is applied, and should not be able to generate columns that don't exist in the table. I believe both of these are true today.

@martint
Copy link
Member

martint commented Dec 22, 2021

@findepi, there's no syntax change proposed here. What syntax needs to be reviewed (or was the label applied by mistake)?

@kokosing
Copy link
Member

I don't find anything wrong in the examples above. It looks like a bug. Order of columns in the table should not matter here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working syntax-needs-review
4 participants