Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Symbolic tracing #157

Draft
wants to merge 17 commits into
base: master
Choose a base branch
from
Draft

Conversation

damienmaier
Copy link
Contributor

The goal of this PR is to add the tracing feature described in section 5.3 here

@damienmaier
Copy link
Contributor Author

I still have to wok on it before it is ready. What I plan to do if ok for you is :

  • Fix the coding style
  • Put all code added by this PR inside some #ifdef such that it is compiled only if a flag is set when running cmake
  • Make the tracing file being put in the same folder as the normal symcc output

Copy link
Contributor

@lekcyjna123 lekcyjna123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you planning to add any tests for the Tracer? Without them it will be hard to prevent regressions.

#include "Runtime.h"
#include <cstdint>
#include <fstream>
#include <nlohmann/json.hpp>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Json library has not been added to SymCC dependencies. Please update the README, dockerfile and github workflows.

nlohmann::json Tracer::currentTrace;
nlohmann::json Tracer::expressions;
nlohmann::json Tracer::pathConstraints;
const std::string Tracer::BACKEND_TRACE_FILE = "/tmp/backend_trace.json";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be either configurable or created in more generic way (e.g. mktemp).

for (auto byteAddress = pageAddress; byteAddress < pageAddress + kPageSize;
byteAddress++) {
auto byteExpr = _sym_read_memory((u_int8_t *)byteAddress, 1, true);
if (byteExpr != nullptr && !byteExpr->isConcrete()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can compare the current memory state with previous one to don't duplicate information in each step and only print diffs? This will reduce number of IO operations and needed memory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would certainly be a useful improvement but unfortunately I will not have time to work on this.

#include "Shadow.h"
#include "llvm/Support/raw_ostream.h"

nlohmann::json Tracer::currentTrace;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In RuntimeCommon.cpp there is a goal to do the runtime thread local. You use here the global state. Could you refactor your code so that it can be run in multiple threads in future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure to understand. Are we talking about threads of the tested program ? In my understanding, the shadow memory pages, the set of accumulated path constraints, etc. should be shared across threads of the target program.

As this tracing feature simply takes snapshots of this state, I do not see how it could be thread local.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my understanding, the shadow memory pages, the set of accumulated path constraints, etc. should be shared across threads of the target program.

I think that in case of multi-thread tested programs the set of accumulated path constraints should be thread local, because each thread will execute potentially different path, so the set should be different. According to the shadow memory, you are right.

My idea was to have the possibility to execute the multiple runtimes in different threads to implement #14 without forking, but I don't know if that has a much sense.

Nevertheless if there is a possibility to doesn't have global state we should work in that direction, because programs with such state doesn't scale well.

target_link_libraries(SymRuntime ${Z3_LIBRARIES} ${QSYM_LLVM_DEPS})
include(FetchContent)

FetchContent_Declare(json URL https://github.com/nlohmann/json/releases/download/v3.11.2/json.tar.xz)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? Why this is in qsym_backend specific directory? Why you fetch the code instead of require nlohmann-json to be available in operating system? This is very popular package and is available on major distrubiutions, e.g:
Ubuntu https://packages.ubuntu.com/jammy/nlohmann-json3-dev
Debian https://packages.debian.org/bookworm/nlohmann-json3-dev
Arch Linux https://archlinux.org/packages/extra/any/nlohmann-json/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no experience working with CMake, I simply used a solution that was suggested in the README of nlohmann-json. If this is a bad solution maybe you could change it ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to prepare a PR till the end of week.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
I created a PR: damienmaier#1

@@ -449,7 +469,7 @@ bool _sym_feasible(SymExpr expr) {
//

void _sym_collect_garbage() {
if (allocatedExpressions.size() < g_config.garbageCollectionThreshold)
// if (allocatedExpressions.size() < g_config.garbageCollectionThreshold)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why you left that comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tracer identifies each expression with a unique ID, that is used in the JSON file. Currently, I used the memory address of the expression as ID and thus to avoid ID collision we need to disable the garbage collection.

As discussed above, I will put the code of this PR inside #ifdefs such that garbage collection is disabled only when performing tracing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe changing way of creating IDs will be a good idea? I am afraid that after turning off garbage collection the memory usage can explode.

@@ -321,6 +340,7 @@ void _sym_push_path_constraint(SymExpr constraint, int taken,
if (constraint == nullptr)
return;

Tracer::tracePathConstraint(constraint, taken != 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be calls to tracer put into compiler plugin code? So that it will be runtime independent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get it. What do you mean by compiler plugin code ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compiler directory in symcc repository root (https://github.com/eurecom-s3/symcc/tree/master/compiler). There is a code, which is executed on compilation phase as a plugin to clang.

Copy link
Contributor

@rmalmain rmalmain Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imho it should also not be put in compiler, otherwise, SymQEMU will not be able to use tracing.
I guess the best option would be to have _sym_push_path_constraint wrapped into a runtime-agnostic function calling to the tracer and the runtime-specific corresponding function.
The extra call indirection should be optimized out at compile-time.


Args
trace_file: The path to the trace file generated by SymCC.
memory_region_names_file: The path to a file containing a mapping from memory region names to their addresses.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to generate such file using SymCC? What should be its format?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know how to generate the memory region file with SymCC. I added code in SymQEMU such that it outputs this file, and I plan to submit a PR to SymQEMU with this feature.

I will document the format of the file, and I will also edit this python function such that the memory region files is not mandatory.

For a bit of context, the motivation for using this memory region information is that, because of ASLR (and because I was comparing the behavior of different versions of SymQEMU), equivalent memory areas were put at different locations in memory across different runs. To be able to compare different traces, I needed to express the addresses relatively to their area.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, you are adding SymQEMU specific code to the SymCC I don't think that this is a good idea, because not everyone who will use SymCC has to use SymQEMU. So I think that the solution should be made more general to work without SymQEMU. In you paper I see:

As a solution, we added some code in both versions of SymQEMU
that makes them output, in another json file, the addresses of the different memory regions,
including the addresses of the guest memory segments (stack, data, etc.)

Here I some my ideas how this can be improved:

  • Maybe will be enough to use variables provided by linker? It provides you end of text, end of data and end of initialized memory symbols: https://stackoverflow.com/a/4309620.
  • If you need more data you can add a code which reads /proc/<pid>/maps file to get addresses of interesting sections in runtime.
  • You can start tested program using setarch -R to disable adress randomization
  • Instead of addresses you can generate ID's in other more generic way (e.g. hashes similar to qsym).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You added information how to install it, but not how to use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will improve the README

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add tests. Also type and format checking will be useful (e.g. pyright, black).

In a previous commit I got allocatedExpressions out of the anonymous
namespace, but this is not necessary anymore as it can be accessed with
getAllocatedExpressions
Those are lines of code that I did not intend to include in the PR
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants