Skip to content

C++: Switch to use-use flow in IR dataflow #10190

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 54 commits into from

Conversation

MathiasVP
Copy link
Contributor

@MathiasVP MathiasVP commented Aug 26, 2022

This is a very large redesign of the C++ IR dataflow library. Most importantly, it switches C++ to use-use flow (instead of def-use flow) like all the other languages that use the shared dataflow library.

Previously, this has been impossible to do for C++. To see why, consider the following example:

const char* path = getenv("PATH"); // *path is tainted
const char* path2 = path;
sink(path);

Most other languages would have flow from path on line 1 to path on line 2, and from path on line 2 to path on line 3. However, if we just step from the expression node on line 1 to the expression node on line 2 we conflate the pointed-to value with the pointer value.

Instead, C++ currently has flow from path on line 1 to path on both lines 2 and 3. This prevents pointee-pointer conflation, but it also has a number of bad side effects:

  • It makes barriers terribly difficult to use.
  • It will sometimes explode on codebases where a single definition have thousands of uses leading to bad performance.
  • It makes it impossible to implement clearsContent for C/C++ (whereas this is a trivial one-liner in languages with use-use flow).

All of these issues will be fixed by this PR.

I expect this PR will take quite a while to go through. While each commit doesn't compile on its own, it's meant to be reviewed commit-by-commit.

commit we will introduce the "real" SSA analysis
that we'll end up using in dataflow (which handles)
indirections. However, this will introduce many
additional reads and writes. So to avoid this blowing
up we will first perform a SSA stage that conflates
&x, x, *x, etc.

The final relation we will use from this stage is the
'Def' class defined in the very bottom. This class has
been pruned to only contain those definitions which are
eventually read.

Both SSA stages uses a common 'SsaInternalsCommon' file
that specifies which indirection of a variable a given
store writes to, and which indirection a given read
reads from. However, this number isn't used in this
first SSA stage. It will, however, be used in the next
stage.
order to have use-use flow we need a way of representing
data behind an indirection. The IR does have memory
instructions for exactly this purpose, but they are the
result of a must-use analysis which is not useful for
dataflow in general. Instead, we add two new dataflow nodes:
IndirectOperand and IndirectInstruction, which represent values
that can be obtained by loading the underlying operand (or
instruction) a number of times (the number given by the 'index'
column). We obain the maximum number of indices for an operand
(or instruction) by counting the number of indirections in the
type.

We can use these IPA branches for lots of things:
- They will be used to represent the pre-update note after a
field update.
- They will be used to represent use-use flow.
- They will be used to represent indirections going into function
calls, and indirect return values out of function calls.

Note, however, that we cannot use them to represent the value
of an argument after they have been updated by a function call
since we need these dataflow nodes to be distinct from the one
going into the function. So we add a 'TIndirectArgumentOutNode'
IPA branch to represent those (and their pre-update node will
also be an 'IndirectOperand').

All this adds a lot of new dataflow nodes. I've done my best to
prune the set of dataflow nodes by removing unnecessary
instructions and operands from 'InstructionNode' and 'OperandNode'.
This commit also caches 'toString' and 'getLocation'
as these were becoming very expensive to compute as
we add more dataflow nodes.
functions. For flow into functions we add an
additional column to 'IndirectionPosition' so
that a parameter/argument position is (i, j)
where i represent the argument index, and j
represent the number of times the variable
must be loaded. For example:
```cpp
void f(int* x, int y) { ... }
```
will have 3 arguments/parameters:
- 'x' as 'TDirectPosition(0)'
- '*x' as 'TIndirectionPosition(0, 1)'
- y as 'TDirectPosition(1)'

Similarly for returns, we add an index to both the
'TNormalReturnKind' and the 'TIndirectReturnKind'
branches that represent the number of loads necessary
to obtain the value.
instruction/operand flow is now much more simple
as SSA will give us much of this flow via the
'defUseFlow' relation. The flow through the indirect
nodes is also quite simple: there's flow from '*x' to
'*y' if there's flow from 'x' to 'y'. Additionally, we
also allow flow from an operand of a pointer arithmetic
instruction to the result of the pointer arithmetic
instruction when the instruction is an indirect instruction.
This reflects that we want flow in the following example:
```cpp
char* data = source(); // *data is tainted
sink(*(data + 1)) // *(data + 1) is also tainted
```
but there are some subtleties.
In addition to keeping track of indices on SSA read/writes,
and dataflow nodes, we also need to keep indices on field
contents. This is because we don't want dataflow in this example:
```cpp
struct S { int* x; };
...
S s;
*s.x = source();
sink(s.x) // reads the pointer, but not the pointee.
```
Additionally, now we've done all these smart things for
field updates on non-union objects, but none of it
works for union fields since these were previously
excluded by our field flow. The only thing needed for
proper dataflow through unions were a read step and a
store step, which I've included in this commit.

A store to a union field creates a union `Content` containing
the union as a column instead of the field. This means that
we'll correctly find this flow:
```cpp
union U { int x; int y; }
...
U u;
u.x = source();
sink(u.y);
```
since both the store and read step will use a
'TUnionContent(U)'.
I omitted the index column in the example, but
it has a similar rolse as in the 'struct' case.
which indirection has data we should now be able to write that
down in our modelling language. This commit extends the
'FunctionInput' and 'FunctionOutput' classes with an index, and
updates how we interpret those member predicates in dataflow and
taintflow.

I've left the existing (non-indexed) member predicates in there
so we preserve backwards compatibility.
the old taint rules can now be stated as "flow from
x to *x" by reducing the index in indirect operands
and instructions by 1.
doesn't always give you back an InstructionNode. We cannot
always give back an instruction as we now also sometimes
want to state that an indirect node is a barrier.
select a sink that was backwards compatible with the old
taint-library, and this commit places a couple of additional
layers on that. We now conflate indirect dataflow nodes with
their underlying node and use the underlying node's 'adjustedSink'
as the 'adjustedSink' for the indirect nodes.
in addition, we have to repair a couple of the barriers that
previously assumed instruction nodes.
the logic from ModelUtil and would have required the same update
as in the earlier commit.

Instead of doing that, this commit imports ModelUtil and forwards
the hard job of interpreting the remote flow source/sink functions
to ModelUtil.
as tainted (instead of the pointee), or vice versa. Because of
existing dataflow pointer/pointee conflation we never noticed that,
but since this PR removes those imprecisions we now need to update
these models.
@github-actions github-actions bot added the C++ label Aug 26, 2022
Comment on lines +49 to +54
exists(VariableAccess va |
va = node.asExpr() and
va.getTarget() = checkedVar and
access.getDef().getAst() = va and
any(IRGuardCondition guard).ensuresEq(access, _, _, access.getDef().getBlock(), true)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this actually equivalent? It looks like access and node were only tied together by the Variable previously, but now they're tied by the VariableAccess.

@aschackmull
Copy link
Contributor

Here's something to aspire to: Put qldoc on all predicates. See e.g. cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternalsCommon.qll, which doesn't have even a single line of qldoc.

Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only did a light review of the SSA parts.

* the enclosing basic block. To obtain this index, use
* `DefOrUseImpl::hasIndexInBlock/2` or `DefOrUseImpl::hasIndexInBlock/3`.
*/
abstract int getIndirectionIndex();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would getIndirectionCount be a better name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that would also work. I'll think about whether that might be a more appropriate name. Thanks!

@@ -254,6 +254,13 @@ class UseImpl extends DefOrUseImpl, TUseImpl {
predicate isCertain() { isUse(true, operand, _, _, ind) }
}

/**
* Holds if `defOrUse1` is a definition which is first read by `use`,
* or if `defOrUse1` is a use and `use` is the next subsequent use.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a next subsequent use (there could be multiple)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Fixed in 98435d1.

@MathiasVP MathiasVP added the no-change-note-required This PR does not need a change note label Sep 2, 2022
@MathiasVP
Copy link
Contributor Author

MathiasVP commented Sep 2, 2022

Here's something to aspire to: Put qldoc on all predicates. See e.g. cpp/ql/lib/semmle/code/cpp/ir/dataflow/internal/SsaInternalsCommon.qll, which doesn't have even a single line of qldoc.

Thanks for the callout on this. I definitely intended to write QLDoc for the predicates in that file, but it somehow slipped my mind 🤦. e5e6356 adds the required QLDoc to that file.

@MathiasVP MathiasVP marked this pull request as draft September 9, 2022 10:54
@MathiasVP MathiasVP force-pushed the use-use-flow-for-cpp branch from 2dc681c to a891b15 Compare September 9, 2022 10:54
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 23 potential problems in the proposed changes. Check the Files changed tab for more details.

@MathiasVP
Copy link
Contributor Author

Superseded by #10366.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ no-change-note-required This PR does not need a change note
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants