C++: Switch to use-use flow in IR dataflow #10190

MathiasVP · 2022-08-26T14:11:32Z

This is a very large redesign of the C++ IR dataflow library. Most importantly, it switches C++ to use-use flow (instead of def-use flow) like all the other languages that use the shared dataflow library.

Previously, this has been impossible to do for C++. To see why, consider the following example:

const char* path = getenv("PATH"); // *path is tainted
const char* path2 = path;
sink(path);

Most other languages would have flow from path on line 1 to path on line 2, and from path on line 2 to path on line 3. However, if we just step from the expression node on line 1 to the expression node on line 2 we conflate the pointed-to value with the pointer value.

Instead, C++ currently has flow from path on line 1 to path on both lines 2 and 3. This prevents pointee-pointer conflation, but it also has a number of bad side effects:

It makes barriers terribly difficult to use.
It will sometimes explode on codebases where a single definition have thousands of uses leading to bad performance.
It makes it impossible to implement clearsContent for C/C++ (whereas this is a trivial one-liner in languages with use-use flow).

All of these issues will be fixed by this PR.

I expect this PR will take quite a while to go through. While each commit doesn't compile on its own, it's meant to be reviewed commit-by-commit.

commit we will introduce the "real" SSA analysis that we'll end up using in dataflow (which handles) indirections. However, this will introduce many additional reads and writes. So to avoid this blowing up we will first perform a SSA stage that conflates &x, x, *x, etc. The final relation we will use from this stage is the 'Def' class defined in the very bottom. This class has been pruned to only contain those definitions which are eventually read. Both SSA stages uses a common 'SsaInternalsCommon' file that specifies which indirection of a variable a given store writes to, and which indirection a given read reads from. However, this number isn't used in this first SSA stage. It will, however, be used in the next stage.

the rest of dataflow.

order to have use-use flow we need a way of representing data behind an indirection. The IR does have memory instructions for exactly this purpose, but they are the result of a must-use analysis which is not useful for dataflow in general. Instead, we add two new dataflow nodes: IndirectOperand and IndirectInstruction, which represent values that can be obtained by loading the underlying operand (or instruction) a number of times (the number given by the 'index' column). We obain the maximum number of indices for an operand (or instruction) by counting the number of indirections in the type. We can use these IPA branches for lots of things: - They will be used to represent the pre-update note after a field update. - They will be used to represent use-use flow. - They will be used to represent indirections going into function calls, and indirect return values out of function calls. Note, however, that we cannot use them to represent the value of an argument after they have been updated by a function call since we need these dataflow nodes to be distinct from the one going into the function. So we add a 'TIndirectArgumentOutNode' IPA branch to represent those (and their pre-update node will also be an 'IndirectOperand'). All this adds a lot of new dataflow nodes. I've done my best to prune the set of dataflow nodes by removing unnecessary instructions and operands from 'InstructionNode' and 'OperandNode'.

This commit also caches 'toString' and 'getLocation' as these were becoming very expensive to compute as we add more dataflow nodes.

functions. For flow into functions we add an additional column to 'IndirectionPosition' so that a parameter/argument position is (i, j) where i represent the argument index, and j represent the number of times the variable must be loaded. For example: ```cpp void f(int* x, int y) { ... } ``` will have 3 arguments/parameters: - 'x' as 'TDirectPosition(0)' - '*x' as 'TIndirectionPosition(0, 1)' - y as 'TDirectPosition(1)' Similarly for returns, we add an index to both the 'TNormalReturnKind' and the 'TIndirectReturnKind' branches that represent the number of loads necessary to obtain the value.

instruction/operand flow is now much more simple as SSA will give us much of this flow via the 'defUseFlow' relation. The flow through the indirect nodes is also quite simple: there's flow from '*x' to '*y' if there's flow from 'x' to 'y'. Additionally, we also allow flow from an operand of a pointer arithmetic instruction to the result of the pointer arithmetic instruction when the instruction is an indirect instruction. This reflects that we want flow in the following example: ```cpp char* data = source(); // *data is tainted sink(*(data + 1)) // *(data + 1) is also tainted ```

but there are some subtleties. In addition to keeping track of indices on SSA read/writes, and dataflow nodes, we also need to keep indices on field contents. This is because we don't want dataflow in this example: ```cpp struct S { int* x; }; ... S s; *s.x = source(); sink(s.x) // reads the pointer, but not the pointee. ``` Additionally, now we've done all these smart things for field updates on non-union objects, but none of it works for union fields since these were previously excluded by our field flow. The only thing needed for proper dataflow through unions were a read step and a store step, which I've included in this commit. A store to a union field creates a union `Content` containing the union as a column instead of the field. This means that we'll correctly find this flow: ```cpp union U { int x; int y; } ... U u; u.x = source(); sink(u.y); ``` since both the store and read step will use a 'TUnionContent(U)'. I omitted the index column in the example, but it has a similar rolse as in the 'struct' case.

which indirection has data we should now be able to write that down in our modelling language. This commit extends the 'FunctionInput' and 'FunctionOutput' classes with an index, and updates how we interpret those member predicates in dataflow and taintflow. I've left the existing (non-indexed) member predicates in there so we preserve backwards compatibility.

the old taint rules can now be stated as "flow from x to *x" by reducing the index in indirect operands and instructions by 1.

doesn't always give you back an InstructionNode. We cannot always give back an instruction as we now also sometimes want to state that an indirect node is a barrier.

select a sink that was backwards compatible with the old taint-library, and this commit places a couple of additional layers on that. We now conflate indirect dataflow nodes with their underlying node and use the underlying node's 'adjustedSink' as the 'adjustedSink' for the indirect nodes. in addition, we have to repair a couple of the barriers that previously assumed instruction nodes.

with indirect dataflow nodes.

the logic from ModelUtil and would have required the same update as in the earlier commit. Instead of doing that, this commit imports ModelUtil and forwards the hard job of interpreting the remote flow source/sink functions to ModelUtil.

as tainted (instead of the pointee), or vice versa. Because of existing dataflow pointer/pointee conflation we never noticed that, but since this PR removes those imprecisions we now need to update these models.

cpp/ql/lib/semmle/code/cpp/models/interfaces/FunctionInputsAndOutputs.qll

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/TaintTrackingUtil.qll

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/TaintTrackingUtil.qll

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

and 'cpp/system-data-exposure' changes.

rdmarsh2 · 2022-08-31T18:24:40Z

cpp/ql/src/Security/CWE/CWE-129/ImproperArrayIndexValidation.ql

+  exists(VariableAccess va |
+    va = node.asExpr() and
+    va.getTarget() = checkedVar and
+    access.getDef().getAst() = va and
+    any(IRGuardCondition guard).ensuresEq(access, _, _, access.getDef().getBlock(), true)
+  )


Is this actually equivalent? It looks like access and node were only tied together by the Variable previously, but now they're tied by the VariableAccess.

aschackmull · 2022-09-01T06:52:25Z

Here's something to aspire to: Put qldoc on all predicates. See e.g. cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternalsCommon.qll, which doesn't have even a single line of qldoc.

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternalsCommon.qll

hvitved

I only did a light review of the SSA parts.

hvitved · 2022-09-01T11:02:06Z

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternals.qll

+   * the enclosing basic block. To obtain this index, use
+   * `DefOrUseImpl::hasIndexInBlock/2` or `DefOrUseImpl::hasIndexInBlock/3`.
+   */
+  abstract int getIndirectionIndex();


Would getIndirectionCount be a better name?

Yeah, that would also work. I'll think about whether that might be a more appropriate name. Thanks!

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternals.qll

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

… the index for phi nodes and (2) don't use 'phiHasInputFromBlock' in 'fromPhiNode'.

hvitved · 2022-09-02T07:45:43Z

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternals.qll

@@ -254,6 +254,13 @@ class UseImpl extends DefOrUseImpl, TUseImpl {
  predicate isCertain() { isUse(true, operand, _, _, ind) }
 }

+/**
+ * Holds if `defOrUse1` is a definition which is first read by `use`,
+ * or if `defOrUse1` is a use and `use` is the next subsequent use.


nit: a next subsequent use (there could be multiple)

Good point. Fixed in 98435d1.

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternalsCommon.qll

MathiasVP · 2022-09-02T09:23:31Z

Here's something to aspire to: Put qldoc on all predicates. See e.g. cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternalsCommon.qll, which doesn't have even a single line of qldoc.

Thanks for the callout on this. I definitely intended to write QLDoc for the predicates in that file, but it somehow slipped my mind 🤦. e5e6356 adds the required QLDoc to that file.

cpp/ql/test/library-tests/dataflow/dereferences/ASTConfiguration.qll

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowPrivate.qll

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/ModelUtil.qll

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/ssa0/SsaInternals.qll

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternals.qll

github-advanced-security

Found 23 potential problems in the proposed changes. Check the Files changed tab for more details.

MathiasVP · 2022-09-09T16:17:55Z

Superseded by #10366.

MathiasVP added 18 commits August 26, 2022 13:15

C++: Introduce the final SSA stage that we use for

ae9edb0

the rest of dataflow.

C++: Create wrappers for the new IPA branches.

4afb66c

This commit also caches 'toString' and 'getLocation' as these were becoming very expensive to compute as we add more dataflow nodes.

C++: Update the local taint step relation. Many of

fbc0de3

the old taint rules can now be stated as "flow from x to *x" by reducing the index in indirect operands and instructions by 1.

C++: Fixup the 'BarrierGuard' module so that 'getABarrierNode'

0b4f911

doesn't always give you back an InstructionNode. We cannot always give back an instruction as we now also sometimes want to state that an indirect node is a barrier.

C++: Add some user-facing predicates to interact

4aa0cd4

with indirect dataflow nodes.

C++: Qualified import for 'cpp'.

a4a6839

C++: Repair a few broken models that were incorrectly a pointer

e6db1e9

as tainted (instead of the pointee), or vice versa. Because of existing dataflow pointer/pointee conflation we never noticed that, but since this PR removes those imprecisions we now need to update these models.

C++: Repair ExecTainted.

722e2bb

C++: Repair XXE.

b0ab293

C++: Repair 'OverflowDestination' and 'TaintedAllocationSize.ql'

dc71613

github-actions bot added the C++ label Aug 26, 2022

github-advanced-security bot found potential problems Aug 26, 2022

View reviewed changes

MathiasVP added 10 commits August 29, 2022 10:29

C++: Add a testcase with precise indirection flow testing.

4516405

C++: Accept library test changes.

39224a5

C++: Accept 'cpp/overflow-destination' changes.

0fc5273

C++: Accept 'cpp/cleartext-storage-buffer' changes.

5399b50

C++: Accept 'cpp/unbounded-write' changes.

107d817

C++: Accept 'cpp/potential-system-data-exposure'

c91d6f4

and 'cpp/system-data-exposure' changes.

C++: Accept 'cpp/tainted-arithmetic' changes.

e518109

C++: Accept 'cpp/tainted-format-string' changes.

d0da66b

C++: Accept 'cpp/tainted-format-string-through-global' changes.

be90d1b

C++: Accept 'cpp/integer-overflow-tainted' changes.

9972004

rdmarsh2 reviewed Aug 31, 2022

View reviewed changes

aschackmull reviewed Sep 1, 2022

View reviewed changes

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternalsCommon.qll Outdated Show resolved Hide resolved

aschackmull reviewed Sep 1, 2022

View reviewed changes

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternalsCommon.qll Outdated Show resolved Hide resolved

hvitved reviewed Sep 1, 2022

View reviewed changes

MathiasVP added 7 commits September 1, 2022 13:48

Merge branch 'main' into use-use-flow-for-cpp

eec6ce9

C++: Put 'isUse' and 'isDef' into a shared cached module.

3aa825b

C++: Add QLDoc to 'adjacentDefRead'.

1ebf0d0

C++: Cache 'TSsaDefOrUse'.

471e08b

C++: Two fixes based on review comments: (1) Don't rely on '-1' being…

b692aed

… the index for phi nodes and (2) don't use 'phiHasInputFromBlock' in 'fromPhiNode'.

C++: Fix QLDoc formatting.

53b58ac

C++: Rename defUseFlow to ssaFlow.

e1f7cb6

hvitved reviewed Sep 2, 2022

View reviewed changes

C++: Fix QLDoc.

98435d1

MathiasVP added the no-change-note-required This PR does not need a change note label Sep 2, 2022

github-advanced-security bot found potential problems Sep 2, 2022

View reviewed changes

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternalsCommon.qll Fixed Show fixed Hide fixed

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternalsCommon.qll Fixed Show fixed Hide fixed

cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternalsCommon.qll Fixed Show fixed Hide fixed

C++: Add QLDoc to 'SsaInternalsCommon'.

e5e6356

MathiasVP force-pushed the use-use-flow-for-cpp branch from 89cafee to e5e6356 Compare September 2, 2022 09:24

rdmarsh2 mentioned this pull request Sep 7, 2022

C++: further work on experimental buffer overflow queries #10340

Closed

Merge branch 'main' into use-use-flow-for-cpp

2dc681c

github-advanced-security bot found potential problems Sep 8, 2022

View reviewed changes

MathiasVP marked this pull request as draft September 9, 2022 10:54

MathiasVP force-pushed the use-use-flow-for-cpp branch from 2dc681c to a891b15 Compare September 9, 2022 10:54

github-advanced-security bot found potential problems Sep 9, 2022

View reviewed changes

MathiasVP force-pushed the use-use-flow-for-cpp branch from a891b15 to 2dc681c Compare September 9, 2022 11:14

MathiasVP mentioned this pull request Sep 9, 2022

C++: Use-use flow in experimental #10366

Merged

MathiasVP closed this Sep 9, 2022

MathiasVP mentioned this pull request Nov 15, 2022

C++: Fix spurious reference flow #11254

Merged

C++: Switch to use-use flow in IR dataflow #10190

C++: Switch to use-use flow in IR dataflow #10190

Uh oh!

Conversation

MathiasVP commented Aug 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rdmarsh2 Aug 31, 2022

Choose a reason for hiding this comment

Uh oh!

aschackmull commented Sep 1, 2022

Uh oh!

Uh oh!

Uh oh!

hvitved left a comment

Choose a reason for hiding this comment

Uh oh!

hvitved Sep 1, 2022

Choose a reason for hiding this comment

Uh oh!

MathiasVP Sep 1, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hvitved Sep 2, 2022

Choose a reason for hiding this comment

Uh oh!

MathiasVP Sep 2, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MathiasVP commented Sep 2, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-advanced-security bot left a comment

Choose a reason for hiding this comment

Uh oh!

MathiasVP commented Sep 9, 2022

Uh oh!

Uh oh!

MathiasVP commented Aug 26, 2022 •

edited

Loading

MathiasVP commented Sep 2, 2022 •

edited

Loading