Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

i#2144 : single step on jump #2295

Merged
merged 32 commits into from
May 16, 2017
Merged

Conversation

Simorfo
Copy link
Contributor

@Simorfo Simorfo commented Mar 15, 2017

Changing exception address when a single step exception occurs when
it happens in the stub.

Fixes i#2144

@Simorfo
Copy link
Contributor Author

Simorfo commented Mar 15, 2017

The fix is not perfect. I think it should change the exception address to the jump destination and not the address of the jump instruction. But I do not know how to get this value.

@derekbruening
Copy link
Contributor

I wonder if you could paste the relevant pieces from loglevel 4 into the issue tracker to make sure we understand what's happening here? I put a request there.

* i#2144 : We exclude single step cases in a jump.
*/
if (pExcptRec->ExceptionCode == EXCEPTION_SINGLE_STEP &&
!in_fcache(pExcptRec->ExceptionAddress)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment below is now in the wrong place, about calling in_fcache last.

Logically this would be better out in the caller and not here in check_internal_exception which is just supposed to handle a crash in DR itself.

Probably what is needed is a query in_exit_stub().

!in_fcache(pExcptRec->ExceptionAddress)) {
fragment_t wrapper;
linkstub_t *l;
fragment_t *f = fragment_pclookup(dcontext, dcontext->next_tag, &wrapper);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

next_tag is only set on entering the cache. This could have run thousands of blocks since then, so next_tag is irrelevant at this point.

It seems that to handle this we need to decode the stub and extract the linkstub pointer. We have existing stub decoding for coarse stubs today but not for fine.

fragment_t wrapper;
linkstub_t *l;
fragment_t *f = fragment_pclookup(dcontext, dcontext->next_tag, &wrapper);
l = FRAGMENT_EXIT_STUBS(f);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no check for pclookup failure.

@derekbruening
Copy link
Contributor

An alternative could be to insert a nop during mangling after everything that could trigger this: just popf and iret?

@Simorfo
Copy link
Contributor Author

Simorfo commented Mar 16, 2017

Maybe your alternative is good.
But I did not feel confident about the exhaustivity of the conditions which could trigger this.
And even so, it would not be totally transparent about the exception address sent to the exception handler.
I will put some level 4 logs as requested.

@derekbruening
Copy link
Contributor

But I did not fell confident about the exhaustivity of the conditions which could trigger this.

You mean, things like execution under a debugger?

And even so, it would not be totally transparent about the exception address sent to the exception handler.

We would arrange to have the translation do the right thing.

@Simorfo
Copy link
Contributor Author

Simorfo commented Mar 16, 2017

You mean, things like execution under a debugger?

That is a good example.
It just seems hard to me to be sure that we thought about every possible case...

@Simorfo
Copy link
Contributor Author

Simorfo commented Mar 27, 2017

So, what should be done ?
Look for pop and iret ?
Or add some code to see if we are in a stub ?

First solution seems easier, but maybe not best for transparency (ExceptionAddress)

@derekbruening
Copy link
Contributor

It is up to you. Certainly putting a nop after the couple of explicit flag writes is the easiest thing to implement and should cover most cases here, whille implementing stub translation is a bigger project.

Execution under a debugger already hits cases we just can't solve from user space: e.g., on Windows, OutputDebugString can cause us to lose control under a debugger. So bailing on handling this under a debugger for now does not seem terrible.

Something like setting a thread's context to trigger this seems like more of a problem. We could at least add a check for that which complains even if it doesn't handle it.

@Simorfo
Copy link
Contributor Author

Simorfo commented Mar 29, 2017

Okay.
I will go for popf and iret

Should I dot it in mangle function ? something like

#ifdef FOOL_CPUID
        else if (instr_get_opcode(instr) == OP_cpuid) {
            mangle_cpuid(dcontext, ilist, instr, next_instr);
            continue;
        }
#endif

@derekbruening
Copy link
Contributor

Yes, sthg like mangle() asking if (instr_can_set_single_step()) mangle_single_step()?

@Simorfo
Copy link
Contributor Author

Simorfo commented Mar 30, 2017

For iret, I did not know what to do.
I think that we cannot insert a no after a iret because we go to a different basic block.

@derekbruening
Copy link
Contributor

For iret, I did not know what to do.
I think that we cannot insert a no after a iret because we go to a different basic block.

We split iret up into its pieces and could put a nop after the popf.

@derekbruening
Copy link
Contributor

For iret, I did not know what to do.
I think that we cannot insert a no after a iret because we go to a different basic block.

We split iret up into its pieces and could put a nop after the popf.

I guess that means we need interactions between iret mangling and this single-step mangling.

* FIXME i#2144 : to be absolutely transparent, we should translate the
* exception address as if we did not insert this nop.
*/
POST(ilist, instr, INSTR_CREATE_nop(dcontext));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the instrlist_set_translation_target() is probably automagically giving this a translation target but it's equal to the popf, which is not what we want, right? Also, when translating the single-step exception that points at the nop, won't translate_walk_track() consider this "unsupported_mangle" and thus translate_walk_good_state() will return false and we'll get an assert?

So we need two things: one, set the translation to the next app pc; and two, recognize the nop during translation.

xor eax, eax
/* push flags on the stack */
#ifdef X64
pushfq
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a cpp layer (see asm_defines.asm) to avoid needing ifdefs: use PUSHF, PTRSZ, and REG_XSP.

inc eax
/* popping flags from top of the stack */
#ifdef X64
popfq
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use POPF and no ifdef is needed

@Simorfo
Copy link
Contributor Author

Simorfo commented Apr 3, 2017

I hope this fix is ok.
There still remains that the ExceptionAddress is not transparent : it should be the jump target, not the jump address.
I do not know if it could be fixed in the translate_walk_track and such process.

@Simorfo
Copy link
Contributor Author

Simorfo commented Apr 3, 2017

I did not manage to reproduce the appveyor 64bit test failure. The 64 bit test build went fine by me.

@Simorfo
Copy link
Contributor Author

Simorfo commented Apr 3, 2017

In facts, the 64-bit test works when running with my client, and fails without it (because mangling does not insert the nop)... No idea why...

@derekbruening
Copy link
Contributor

derekbruening commented Apr 3, 2017

Both 32-bit and 64-bit singlestep tests are crashing on a STATUS_SINGLE_STEP exception:

start of test, count = 0
<Application C:\projects\dynamorio\build\build_debug-internal-32\suite\tests\bin\security-win32.singlestep.exe (1716).  DynamoRIO internal crash at PC 0x1a385f94.  Please report this at http://dynamorio.org/issues/.  Program aborted.
0x80000004 0x00000000 0x1a385f94 0x1a385f94 0x00000000 0x00000000
Base: 0x742b0000
Registers: eax=0x00000001 ebx=0x00000000 ecx=0x002f219c edx=0x0071a731
	esi=0x002f1f1f edi=0x002f1f1f esp=0x0042fb04 ebp=0x0042fb08
	eflags=0x0000
version 6.2.17259, custom build
-no_dynamic_options -code_api -probe_api -msgbox_mask 0 -dumpcore_mask 125 -stderr_mask 12 -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -staged -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct -no_aslr_dr -pad_jmps_mark_no_trace 
0x0042fb08 0x002f1e5c
0x0042fb50 0x756f7c04
0x0042fb64 0x778eab8f
0x0042fbac 0x778eab5a
0x0042fbbc 0x00000000>

@derekbruening
Copy link
Contributor

There still remains that the ExceptionAddress is not transparent : it should be the jump target, not the jump address.

I'm not sure what jump your'e referring to? With the nop, isn't the STATUS_SINGLE_STEP pointing at the nop, which is in_fcache and thus gets translated (hence my prior comments about properly translating the nop)? There's no jump involved, right?

@derekbruening
Copy link
Contributor

derekbruening commented Apr 3, 2017

There still remains that the ExceptionAddress is not transparent : it should be the jump target, not the jump address.

I'm not sure what jump your'e referring to? With the nop, isn't the STATUS_SINGLE_STEP pointing at the nop, which is in_fcache and thus gets translated (hence my prior comments about properly translating the nop)? There's no jump involved, right?

Keep forgetting the details here: ok, so you're saying that even once the mange-inserted nop is translated to the next app instr, when that's a jump, the exception address is supposed to go further and point at the jump target. I guess that weird behavior is limited to STATUS_SINGLE_STEP and we'd handle that in callback.c?

@@ -2403,7 +2405,17 @@ mangle_single_step(dcontext_t *dcontext, instrlist_t *ilist, instr_t *instr)
* FIXME i#2144 : to be absolutely transparent, we should translate the
* exception address as if we did not insert this nop.
*/
POST(ilist, instr, INSTR_CREATE_nop(dcontext));
instr_t* next = instr_get_next_app(instr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: t *n

@@ -2403,7 +2405,17 @@ mangle_single_step(dcontext_t *dcontext, instrlist_t *ilist, instr_t *instr)
* FIXME i#2144 : to be absolutely transparent, we should translate the
* exception address as if we did not insert this nop.
*/
POST(ilist, instr, INSTR_CREATE_nop(dcontext));
instr_t* next = instr_get_next_app(instr);
/* Assumes popf cannot end a basic block. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a valid assumption as blocks are truncated prior to a branch for various reasons. Can we just have NULL be another condition.

@@ -2403,7 +2405,17 @@ mangle_single_step(dcontext_t *dcontext, instrlist_t *ilist, instr_t *instr)
* FIXME i#2144 : to be absolutely transparent, we should translate the
* exception address as if we did not insert this nop.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't we already doing that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we should get the jump target, not the jump instruction address...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the jump target instead of the jump is unique to STATUS_SINGLE_STEP, right? Nothing else does that. So the code to do that shouldn't be in general translate.c but instead in callback.c where single step is handled. IMHO it should be part of this CL: it's just a couple of lines of code and without it the tests fail, don't they?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the jump target instead of the jump is unique to STATUS_SINGLE_STEP, right?

Yes to the best of my knowledge.

The tests I added will not fail because my fault handler does not check the exception address (just the exception code)

I will try to put this check before check_internal_exception

* because the ExceptionAddress should be the jump target.
*/
if (instr_is_cti(next)) {
/* Next app instruction gives a correct translation to this nop. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But won't instrlist_set_translation be set to instr and not next?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in my tests although I am not fully aware of instrlist_set_translation code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mangle_shared.c:865:

            app_pc xl8 = get_app_instr_xl8(instr);
            instrlist_set_translation_target(ilist, xl8);

It doesn't matter where the nop is inserted: its xl8 will be set to ilist->translation_target, which is the xl8 of the instr being mangled, which is the popf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I understand.
But the ExceptionAddress is set to the EIP value after execution of a single instruction.
So, the ExceptionAddress we get is the EIP value after the no we insert, i.e. the jump address.
Considering that, I think it is ok to leave the translation address of the no to the popf

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow -- isn't it like this: we put into the code cache "popf; nop". The popf writes to TF which generates the exception. The exception points at the next instruction from the one that generated it, so it points at the nop. Right? Then we process the exception and we translate it to the popf in the original app code.

Copy link
Contributor Author

@Simorfo Simorfo Apr 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The popf writes to TF

yes

which generates the exception.

no : the next instruction after popf does

The exception points at the next instruction from the one that generated it

yes

so it points at the nop. Right?

no, it points at the jump

Then we process the exception and we translate it to the popf in the original app code.

no we translate it to the jump but we should translate it to the address after the jump

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Right. I also looked at the manual and now I believe I finally have all the details. Sorry, I should have thought this through earlier. Unfortunately this nop solution is not going to work for the general case. We can make it work for "popf;jmp" or "popf;jcc" with no client, because we can emulate the jmp or jcc, but the presence of instrumentation means we need a different solution as we can't easily emulate arbitrary sequences of instructions. I believe we have to disable and later forge the single-step exception. I will add something to the issue tracker.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move the discussion to #2144 (comment) where I try to describe the problems (instru or mangling between popf and next app instr) that require a larger solution.

@Simorfo
Copy link
Contributor Author

Simorfo commented Apr 4, 2017

Keep forgetting the details here: ok, so you're saying that even once the mange-inserted nop is translated to the next app instr, when that's a jump, the exception address is supposed to go further and point at the jump target.

Exactly

@Simorfo
Copy link
Contributor Author

Simorfo commented Apr 4, 2017

I think that I understand why there is a crash with no client (and success with a client).

It looks like the loop over instructions in mangle function does not take into account the popf instruction because it is somehow not marked as interesting. (with my client, every instruction is marked as interesting)
It looks that this is done in build_bb_ilist for other kinds of instructions.

Did I point out something here ?
Can you tell me more ?

@derekbruening
Copy link
Contributor

I think that I understand why there is a crash with no client (and success with a client).

It looks like the loop over instructions in mangle function does not take into account the popf instruction because it is somehow not marked as interesting. (with my client, every instruction is marked as interesting)
It looks that this is done in build_bb_ilist for other kinds of instructions.

With no client, DR uses IR Level 0 bundles of instructions for non-"interesting" instructions: with no client, it doesn't care about non-branches in general. The mangle loop skips instrs with no opcode. You want to update decode_cti() -- probably next to the decode of iret.

@Simorfo
Copy link
Contributor Author

Simorfo commented Apr 4, 2017

You want to update decode_cti()

Thanks, that was exactly what I was looking for

I moved the FIXME comment to callback.c
I do not know how I should get the next EIP.
In the case of a conditional jump, we should decide which branch to take...

@derekbruening
Copy link
Contributor

I'm afraid that the mixedmode_late failure is from this CL -- I filed as #1421 but it seems to be consistently happening here and only here. Does that test pass locally for you? (Not sure but it may only be enabled in a full suite run so sthg like "ctest" in one build dir is not sufficient.)

@Simorfo
Copy link
Contributor Author

Simorfo commented May 7, 2017

The mixedmode.exe does not work on my windows 7 (without dynamorio) (I get the popup asking to close this program because of a failure)

@Simorfo
Copy link
Contributor Author

Simorfo commented May 10, 2017

From the appveyor log, I gather that the problem comes from an added instruction by this CL (there is a popf in mixedmode)
I guess that this is the instruction where we set dynamically the exit reason of the special single step basic block.
I just made it the same as other parts of code even if I do not understand the difference implied by the condition DYNAMO_OPTION(private_ib_in_tls) || TEST(FRAG_SHARED, flags)

@Simorfo
Copy link
Contributor Author

Simorfo commented May 10, 2017

Can you provide me a log for the error with mixedmode ? I do not manage to run the test by myself.

@Simorfo
Copy link
Contributor Author

Simorfo commented May 15, 2017

OK, I got the log and here is the problem :
ERROR: Could not find encoding for: popf %rsp (%rsp)[8byte] -> %rsp

@Simorfo
Copy link
Contributor Author

Simorfo commented May 15, 2017

Test now works by me.
The problem was the fast decoding of popf which did not take into account the size of esp register (when running 32 bit program on 64 bit dynamorio)
Maybe this patch can be better written with something like resolve_variable_size_dc for registers

resolve_variable_size_dc(dcontext, 0, OPSZ_ret, false)));
instr_set_dst(instr, 0, opnd_create_reg(REG_XSP));
(stack_sized_reg, REG_NULL, 0, 0,
resolve_variable_size_dc(dcontext, 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to pass prefixes here, not 0, to handle a data-size prefix on popf

Copy link
Contributor Author

@Simorfo Simorfo May 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like this ? even if we cannot get this far with prefixes > 0 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you say "we cannot get this far with prefixes > 0"? Maybe you could add to your test a case with the data size prefix to make sure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because before my patched code, there is

    if (prefixes > 0) {
        /* prefixes are rare on ctis
         * rather than handle them all here, just do full decode
         * FIXME: if we start to see more and more jcc branch hints we
         * may change our minds here!  This is case 211206/6749.
         */
        if (decode(dcontext, start_pc, instr) == NULL)
            return NULL;
        else
            return (start_pc + sz);
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so it goes to the slow path. It doesn't have to for this case but it's not worth reordering. It still seems safer to pass prefixes here rather than 0 to future-proof. It would also be nice to have a test w/ a size prefix but there is a long list of "nice to have" things in the tests so this seems sufficient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update to master and merge.

@derekbruening
Copy link
Contributor

Appveyor failure is api.detach #2246.

@derekbruening derekbruening merged commit 47b56c8 into DynamoRIO:master May 16, 2017
Simorfo pushed a commit to Simorfo/dynamorio that referenced this pull request May 17, 2017
@derekbruening
Copy link
Contributor

There are two problems with what was checked in: style violations and worse a crash, documented at #2144 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants