-
Notifications
You must be signed in to change notification settings - Fork 1.7k
C++: Switch to use-use flow in IR dataflow #10190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
commit we will introduce the "real" SSA analysis that we'll end up using in dataflow (which handles) indirections. However, this will introduce many additional reads and writes. So to avoid this blowing up we will first perform a SSA stage that conflates &x, x, *x, etc. The final relation we will use from this stage is the 'Def' class defined in the very bottom. This class has been pruned to only contain those definitions which are eventually read. Both SSA stages uses a common 'SsaInternalsCommon' file that specifies which indirection of a variable a given store writes to, and which indirection a given read reads from. However, this number isn't used in this first SSA stage. It will, however, be used in the next stage.
the rest of dataflow.
order to have use-use flow we need a way of representing data behind an indirection. The IR does have memory instructions for exactly this purpose, but they are the result of a must-use analysis which is not useful for dataflow in general. Instead, we add two new dataflow nodes: IndirectOperand and IndirectInstruction, which represent values that can be obtained by loading the underlying operand (or instruction) a number of times (the number given by the 'index' column). We obain the maximum number of indices for an operand (or instruction) by counting the number of indirections in the type. We can use these IPA branches for lots of things: - They will be used to represent the pre-update note after a field update. - They will be used to represent use-use flow. - They will be used to represent indirections going into function calls, and indirect return values out of function calls. Note, however, that we cannot use them to represent the value of an argument after they have been updated by a function call since we need these dataflow nodes to be distinct from the one going into the function. So we add a 'TIndirectArgumentOutNode' IPA branch to represent those (and their pre-update node will also be an 'IndirectOperand'). All this adds a lot of new dataflow nodes. I've done my best to prune the set of dataflow nodes by removing unnecessary instructions and operands from 'InstructionNode' and 'OperandNode'.
This commit also caches 'toString' and 'getLocation' as these were becoming very expensive to compute as we add more dataflow nodes.
functions. For flow into functions we add an additional column to 'IndirectionPosition' so that a parameter/argument position is (i, j) where i represent the argument index, and j represent the number of times the variable must be loaded. For example: ```cpp void f(int* x, int y) { ... } ``` will have 3 arguments/parameters: - 'x' as 'TDirectPosition(0)' - '*x' as 'TIndirectionPosition(0, 1)' - y as 'TDirectPosition(1)' Similarly for returns, we add an index to both the 'TNormalReturnKind' and the 'TIndirectReturnKind' branches that represent the number of loads necessary to obtain the value.
instruction/operand flow is now much more simple as SSA will give us much of this flow via the 'defUseFlow' relation. The flow through the indirect nodes is also quite simple: there's flow from '*x' to '*y' if there's flow from 'x' to 'y'. Additionally, we also allow flow from an operand of a pointer arithmetic instruction to the result of the pointer arithmetic instruction when the instruction is an indirect instruction. This reflects that we want flow in the following example: ```cpp char* data = source(); // *data is tainted sink(*(data + 1)) // *(data + 1) is also tainted ```
but there are some subtleties. In addition to keeping track of indices on SSA read/writes, and dataflow nodes, we also need to keep indices on field contents. This is because we don't want dataflow in this example: ```cpp struct S { int* x; }; ... S s; *s.x = source(); sink(s.x) // reads the pointer, but not the pointee. ``` Additionally, now we've done all these smart things for field updates on non-union objects, but none of it works for union fields since these were previously excluded by our field flow. The only thing needed for proper dataflow through unions were a read step and a store step, which I've included in this commit. A store to a union field creates a union `Content` containing the union as a column instead of the field. This means that we'll correctly find this flow: ```cpp union U { int x; int y; } ... U u; u.x = source(); sink(u.y); ``` since both the store and read step will use a 'TUnionContent(U)'. I omitted the index column in the example, but it has a similar rolse as in the 'struct' case.
which indirection has data we should now be able to write that down in our modelling language. This commit extends the 'FunctionInput' and 'FunctionOutput' classes with an index, and updates how we interpret those member predicates in dataflow and taintflow. I've left the existing (non-indexed) member predicates in there so we preserve backwards compatibility.
the old taint rules can now be stated as "flow from x to *x" by reducing the index in indirect operands and instructions by 1.
doesn't always give you back an InstructionNode. We cannot always give back an instruction as we now also sometimes want to state that an indirect node is a barrier.
select a sink that was backwards compatible with the old taint-library, and this commit places a couple of additional layers on that. We now conflate indirect dataflow nodes with their underlying node and use the underlying node's 'adjustedSink' as the 'adjustedSink' for the indirect nodes. in addition, we have to repair a couple of the barriers that previously assumed instruction nodes.
with indirect dataflow nodes.
the logic from ModelUtil and would have required the same update as in the earlier commit. Instead of doing that, this commit imports ModelUtil and forwards the hard job of interpreting the remote flow source/sink functions to ModelUtil.
as tainted (instead of the pointee), or vice versa. Because of existing dataflow pointer/pointee conflation we never noticed that, but since this PR removes those imprecisions we now need to update these models.
cpp/ql/lib/semmle/code/cpp/models/interfaces/FunctionInputsAndOutputs.qll
Fixed
Show fixed
Hide fixed
cpp/ql/lib/semmle/code/cpp/models/interfaces/FunctionInputsAndOutputs.qll
Fixed
Show fixed
Hide fixed
and 'cpp/system-data-exposure' changes.
exists(VariableAccess va | | ||
va = node.asExpr() and | ||
va.getTarget() = checkedVar and | ||
access.getDef().getAst() = va and | ||
any(IRGuardCondition guard).ensuresEq(access, _, _, access.getDef().getBlock(), true) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually equivalent? It looks like access
and node
were only tied together by the Variable
previously, but now they're tied by the VariableAccess
.
Here's something to aspire to: Put qldoc on all predicates. See e.g. |
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternalsCommon.qll
Outdated
Show resolved
Hide resolved
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternalsCommon.qll
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only did a light review of the SSA parts.
* the enclosing basic block. To obtain this index, use | ||
* `DefOrUseImpl::hasIndexInBlock/2` or `DefOrUseImpl::hasIndexInBlock/3`. | ||
*/ | ||
abstract int getIndirectionIndex(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would getIndirectionCount
be a better name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that would also work. I'll think about whether that might be a more appropriate name. Thanks!
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternals.qll
Outdated
Show resolved
Hide resolved
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternals.qll
Outdated
Show resolved
Hide resolved
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
Outdated
Show resolved
Hide resolved
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
Outdated
Show resolved
Hide resolved
cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/DataFlowUtil.qll
Outdated
Show resolved
Hide resolved
… the index for phi nodes and (2) don't use 'phiHasInputFromBlock' in 'fromPhiNode'.
@@ -254,6 +254,13 @@ class UseImpl extends DefOrUseImpl, TUseImpl { | |||
predicate isCertain() { isUse(true, operand, _, _, ind) } | |||
} | |||
|
|||
/** | |||
* Holds if `defOrUse1` is a definition which is first read by `use`, | |||
* or if `defOrUse1` is a use and `use` is the next subsequent use. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: a next subsequent use
(there could be multiple)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Fixed in 98435d1.
Thanks for the callout on this. I definitely intended to write QLDoc for the predicates in that file, but it somehow slipped my mind 🤦. e5e6356 adds the required QLDoc to that file. |
89cafee
to
e5e6356
Compare
2dc681c
to
a891b15
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found 23 potential problems in the proposed changes. Check the Files changed tab for more details.
a891b15
to
2dc681c
Compare
Superseded by #10366. |
This is a very large redesign of the C++ IR dataflow library. Most importantly, it switches C++ to use-use flow (instead of def-use flow) like all the other languages that use the shared dataflow library.
Previously, this has been impossible to do for C++. To see why, consider the following example:
Most other languages would have flow from
path
on line 1 topath
on line 2, and frompath
on line 2 topath
on line 3. However, if we just step from the expression node on line 1 to the expression node on line 2 we conflate the pointed-to value with the pointer value.Instead, C++ currently has flow from
path
on line 1 topath
on both lines 2 and 3. This prevents pointee-pointer conflation, but it also has a number of bad side effects:clearsContent
for C/C++ (whereas this is a trivial one-liner in languages with use-use flow).All of these issues will be fixed by this PR.
I expect this PR will take quite a while to go through. While each commit doesn't compile on its own, it's meant to be reviewed commit-by-commit.