This repository has been archived by the owner on Oct 15, 2023. It is now read-only.

inconsistent use of immediate operand placeholders #17

Open
robertmuth opened this issue Nov 23, 2021 · 5 comments

Comments

@robertmuth

Example:
"add" , "x:al, ib/ub" , "I" , "04 ib"

The placeholders do not match: ib/ub vs. ib.

On the other hand

"add" , "x:r16/m16, ib" , "MI" , "66 83 /0 ib"

uses ib consistently

@kobalicek
Member

kobalicek commented Nov 23, 2021

Can you elaborate?

I don't see the issue personally.

In the first example:

"add" , "x:al, ib/ub" , "I" , "04 ib"

The first ib/ub means that the immediate can be either signed or unsigned, and 04 ib is the opcode as shown in instruction manuals. Instruction manuals don't make a signed/unsigned distinction in the opcode notation; that information is provided by the accompanying documentation.

In the second example:

"add" , "x:r16/m16, ib" , "MI" , "66 83 /0 ib"

It describes an instruction using a 16-bit register or memory operand, followed by an 8-bit immediate, which is always signed and is sign-extended to 16 bits. Then in the opcode 66 83 /0 ib it's again shown as ib, because this is how instruction manuals describe immediates.

In general, instruction manuals don't really care about the signedness of immediates, but AsmJit, which uses asmdb, does distinguish between signed and unsigned.
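
To make the two fields concrete, here is a minimal Python sketch that splits a row as quoted in this issue into its operand alternatives and the immediate placeholders of the opcode field. The function name `parse_row` and the handling of the `x:` prefix (treated here as an annotation that does not affect the immediate) are assumptions for illustration; this is not asmdb's own parser.

```python
# Immediate placeholders that can appear in the opcode field.
IMM_TOKENS = {"ib", "iw", "id", "iq"}

def parse_row(row):
    """Split one row into (name, operand alternatives, opcode immediates)."""
    name, operands, _encoding, opcode = row
    parsed_operands = []
    for operand in operands.split(","):
        operand = operand.strip()
        # Assumption: a leading "x:" style prefix is an annotation, so drop it.
        if ":" in operand:
            operand = operand.split(":", 1)[1]
        # "ib/ub" lists the alternatives accepted in this operand position.
        parsed_operands.append(operand.split("/"))
    opcode_imms = [tok for tok in opcode.split() if tok in IMM_TOKENS]
    return name, parsed_operands, opcode_imms

first = parse_row(["add", "x:al, ib/ub", "I", "04 ib"])
second = parse_row(["add", "x:r16/m16, ib", "MI", "66 83 /0 ib"])
```

In this sketch the operand field of the first row yields the alternatives `ib` and `ub`, while the opcode field of both rows yields only `ib`, which is the mismatch the issue title refers to.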

@robertmuth
Author

BTW: thanks for this really awesome project.

As you probably figured, I am processing the table programmatically for my own instruction encoder/decoder.
For the processing it would be helpful to know whether an immediate is signed or unsigned.
In the second case there is no confusion, but in the first there is.
If the immediate in the first case must always be signed, why be vague about it and say "ib/ub"? Or are there situations
where it could be unsigned?

Having said that, saying that this is just a transcription of the manuals is totally fine.

I would love to base my encoder completely on your table without having to consult additional documentation. ;-)
How does AsmJit determine the signedness in the first case?

@kobalicek
Member

kobalicek commented Nov 23, 2021

Although I designed the tables, I still found it pretty difficult to programmatically generate an assembler or disassembler out of the table. The problem is categorizing instructions into groups that you can use to implement parts of the decoder / encoder. Not saying it's impossible, but it's difficult to group stuff - so maybe you will end up generating each instruction separately, which is wasteful :)

For encoding, you can describe the immediate value as int64_t - that would be enough for all immediates in the X86/X86_64 instruction set. If you have such a type, then ib vs ub makes a difference. For example, you can write and r8, -1 and and r8, 0xFF - these two are equivalent, but when you convert that -1 to unsigned you get 0xFFFFFFFFFFFFFFFF, which is out of range for an unsigned 8-bit check, while 0xFF is out of range for a signed 8-bit check. This is why asmdb provides this information: you can check both signed and unsigned and continue when either one passes.
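
The check described above can be sketched in Python as follows. The helper names `fits` and `accepts` are hypothetical, and the sketch assumes the placeholder spelling used in this issue (`i` or `u` followed by `b`, `w`, `d` or `q`); it is not how AsmJit implements it.

```python
# Width in bits for each placeholder suffix.
IMM_BITS = {"b": 8, "w": 16, "d": 32, "q": 64}

def fits(value, placeholder):
    """Return True if value is in range for one placeholder such as 'ib' or 'ub'."""
    bits = IMM_BITS[placeholder[1]]
    if placeholder[0] == "i":   # signed range
        return -(1 << (bits - 1)) <= value < (1 << (bits - 1))
    if placeholder[0] == "u":   # unsigned range
        return 0 <= value < (1 << bits)
    raise ValueError(f"unknown placeholder: {placeholder}")

def accepts(value, alternatives):
    """Accept the value if any alternative in e.g. 'ib/ub' passes its check."""
    return any(fits(value, p) for p in alternatives.split("/"))
```

With an operand written as `ib/ub`, both `-1` and `0xFF` are accepted because each passes one of the two checks; with a plain `ib` operand, `0xFF` is rejected.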

For decoding, signed would be preferred when working with GP instructions, and unsigned when working with SIMD. For example, a predicate in PSHUFD would be decoded as unsigned, while add r32, 0xFFFFFFFF would be decoded as add r32, -1. But that's up to you - the important thing is that the assembler can assemble it back.
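
As a small illustration of that decoding choice, the following Python sketch interprets the same little-endian immediate bytes either way. The function name `decode_imm` is hypothetical; which instructions get the signed or unsigned treatment is a policy decision of the decoder.

```python
def decode_imm(raw, signed):
    """Interpret little-endian immediate bytes as a signed or unsigned integer."""
    return int.from_bytes(raw, "little", signed=signed)

# A 32-bit immediate of a GP instruction, shown as signed.
gp_imm = decode_imm(bytes.fromhex("ffffffff"), signed=True)

# An 8-bit SIMD predicate (for example the PSHUFD immediate), shown as unsigned.
simd_imm = decode_imm(bytes([0xE4]), signed=False)
```

Both interpretations round-trip to the same bytes when encoded again, so the choice only affects how the operand is displayed.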

(not sure I answered all the questions)

@robertmuth
Author

I am quite confident that at least the decoder can be done entirely table driven.
The encoder as well, as long as you do not want to optimize for short encodings.
For this you need to know which instruction groups have the same effects, so you can pick the shortest among them.
I assume this is what you mean by grouping.
(BTW: it would be really nice if the tables contained that information as well.)

I only care about rather simple instructions that a typical compiler would generate, so segment stuff is not important.

I am also rather new to x86 encodings, so I am not 100% sure that knowing signedness is as important as I think it is.
For ARM it definitely is. Here is a hypothetical example:

mov.q reg, -1

the immediate is only one byte and will be extended to a quad word by the CPU. Obviously, the value stored in the register
will be different depending on whether the CPU interprets the bytes as signed or unsigned.
Again, I am not sure if this issue exists on x86.

@robertmuth
Author

I have completed the work on my decoder based on your tables and have confirmed that my output matches
that of objdump for x86-64 executables.

(The exact list of opcodes considered is here: https://github.com/robertmuth/Cwerg/blob/master/CpuX64/opcode_tab.py)

I am super pleased with asmdb and will focus on an encoder next.
Now that I have more experience with the immediate encodings I have the following suggestion:

  • keep the use of ib/iw/id/iq in the encoding field unchanged
  • change the immediate placeholder in the operand field to reflect the immediate width actually used

For example:
"add" , "X:r32/m32, ib" , "MI" , "83 /0 ib"
would become
"add" , "X:r32/m32, id" , "MI" , "83 /0 ib"

since ib gets sign-extended to id before the addition.
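
A minimal Python sketch of that sign extension, assuming the immediate is given as the raw encoded byte; the helper name `sign_extend` is hypothetical:

```python
def sign_extend(value, from_bits, to_bits):
    """Sign-extend a from_bits-wide value and return it as a to_bits-wide bit pattern."""
    sign_bit = 1 << (from_bits - 1)
    value &= (1 << from_bits) - 1            # keep only the encoded bits
    extended = (value ^ sign_bit) - sign_bit  # negative if the sign bit was set
    return extended & ((1 << to_bits) - 1)

# Encoded byte 0xFF in "83 /0 ib" acts as the 32-bit operand 0xFFFFFFFF.
imm32 = sign_extend(0xFF, 8, 32)

# Adding it to a 32-bit register holding 0x10 wraps around to 0x0F.
result = (0x10 + imm32) & 0xFFFFFFFF
```

The encoded width (ib) and the width the instruction effectively operates on (id) differ, which is the distinction the proposal above tries to capture.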

More importantly:
"push" , "id" , "I" , "68 id"
and
"push" , "ib" , "I" , "6A ib"

should change the ib and id operands to something that indicates that the real width of the operand is either 64-bit (= iq)
or dependent on the mode (maybe im).


2 participants