Hello everyone, it's been a while since I posted something here, but I wanted to share an update on the current state of the project. If you'd like a more detailed description of what I've been doing with the disassembler, I recorded some demo videos and posted them on YouTube (https://www.youtube.com/channel/UCumRmCCamu0sJnRywpfWVTw), where I try to keep a more up-to-date blog, since I think it's easier to follow a video that explains and shows the code and the data as it goes. That said, I think written posts are really valuable too, and I'll do my best to keep posting about disassembler updates at least once a month.
So, as it stands right now, the disassembler supports the following x86 instruction groups:
1. general
2. system
3. x87 FPU
4. MMX
5. SSE
6. SSE1
7. SSE2
8. SSE3
9. SSSE3
10. SSE4
11. VMX
12. SMX
(both one-byte and two-byte opcode instructions)
What I'll try to explain here is the data transformations needed to automatically generate the opcode tables used by the eaglefly disassembler. In this video (https://www.youtube.com/watch?v=cZ8c6LeIdiM&t=53s) I explain more thoroughly how the source database is transformed into eaglefly's database.
The opcode tables are generated from MazeGen's awesome tables (http://ref.x86asm.net/). These tables have a thorough description of each instruction: opcodes, assembly syntax, instruction classes, and flags modified by the instruction, among other things. What MazeGen provides is an XML database of all instructions for the 16-, 32- and 64-bit modes, which I lexed and transformed to create the eaglefly database.
The first thing I did to transform MazeGen's data was to understand XML files and their structure:
<element attr_name="attr_value">text</element>
For that I made a tool that generates the XML tree layout, so I could easily see the nested properties of the data and be able to lex them.
By the way, I think XML's file structure sucks: it doesn't have a header describing the layout of the data in the body, which would have made it unnecessary to build the XML tree layout myself.
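To give an idea of what such a tree-layout tool can look like, here is a minimal sketch in C. It is not the actual eaglefly tool: the function name xml_outline is made up for illustration, and it takes shortcuts (no CDATA, and it assumes no '>' appears inside attribute values or comments). It just writes every element name on its own line, indented by nesting depth.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Appends an indented outline of the element names found in `xml` to `out`.
   Returns the number of elements seen. Hypothetical helper, not eaglefly code.
   Simplifications: no CDATA, no '>' inside attribute values or comments. */
static int xml_outline(const char *xml, char *out, size_t out_size)
{
    int depth = 0, count = 0;
    size_t used = 0;
    assert(out_size > 0);
    out[0] = '\0';

    for (const char *p = xml; *p; ++p) {
        if (*p != '<') continue;                 /* skip text content */
        const char *end = strchr(p, '>');
        if (!end) break;                         /* truncated input */

        if (p[1] == '?' || p[1] == '!') {        /* declaration, comment, doctype */
            p = end;
            continue;
        }
        if (p[1] == '/') {                       /* closing tag: go up one level */
            if (depth > 0) --depth;
            p = end;
            continue;
        }

        /* opening tag: the name runs up to whitespace, '/' or '>' */
        size_t name_len = strcspn(p + 1, " \t\r\n/>");
        int n = snprintf(out + used, out_size - used, "%*s%.*s\n",
                         depth * 2, "", (int)name_len, p + 1);
        if (n < 0 || (size_t)n >= out_size - used) break;   /* buffer full */
        used += (size_t)n;
        ++count;

        if (end[-1] != '/') ++depth;             /* not self-closing: go down */
        p = end;
    }
    return count;
}
```

Feeding it a small document and printing the buffer with puts() gives one element name per line, which is enough to see what is nested inside what before writing the real lexer.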
Once I had the XML file tree, I could start lexing the file and tokenizing the data I really needed: the opcode identifiers and the assembly syntax. For convenience I split them into two files, so I could easily clean up the data and handle entry collisions (more on that later). In other words, I filtered the source data into two data streams, one for opcode identifiers and one for assembly syntaxes.
For example, opcode identifiers for 64-bit:
one-byte
pri_opcd = 00
entry = 0
opcode_id (8,0x00) mod reg rm
pri_opcd = 01
entry = 1
opcode_id (8,0x01) mod reg rm
pri_opcd = 02
entry = 2
opcode_id (8,0x02) mod reg rm
pri_opcd = 03
entry = 3
opcode_id (8,0x03) mod reg rm
pri_opcd = 04
entry = 4
opcode_id (8,0x04)
pri_opcd = 05
entry = 5
opcode_id (8,0x05)
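As a rough illustration of how an opcode_id line like the ones above could be read back, here is a small C sketch. The OpcodeToken struct and parse_opcode_id function are names I made up for this example, and I'm assuming each token is either a "(width,value)" bit pattern or a lowercase field name such as mod, reg or rm.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { MAX_TOKENS = 8 };

typedef struct {
    int is_field;        /* 1 for a named field such as mod/reg/rm */
    unsigned width;      /* bit width, for "(width,value)" tokens */
    unsigned value;      /* literal value, for "(width,value)" tokens */
    char field[8];       /* field name, for field tokens */
} OpcodeToken;

/* Parses the text after "opcode_id", e.g. "(8,0x00) mod reg rm".
   Returns the token count, or -1 on malformed input. Hypothetical helper. */
static int parse_opcode_id(const char *line, OpcodeToken *tokens, int max_tokens)
{
    int count = 0;
    const char *p = line;
    assert(max_tokens > 0);

    for (;;) {
        int consumed = 0;
        while (*p == ' ') ++p;
        if (*p == '\0') break;
        if (count == max_tokens) return -1;

        OpcodeToken *t = &tokens[count];
        memset(t, 0, sizeof *t);
        if (*p == '(') {
            /* %n stays 0 if the closing ')' is missing */
            if (sscanf(p, "(%u,%x)%n", &t->width, &t->value, &consumed) != 2 || consumed == 0)
                return -1;
        } else {
            if (sscanf(p, "%7[a-z]%n", t->field, &consumed) != 1)
                return -1;
            t->is_field = 1;
        }
        p += consumed;
        ++count;
    }
    return count;
}
```

For "(8,0x00) mod reg rm" this yields four tokens: one 8-bit literal with value 0x00 followed by the three field names.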
Assembly syntaxes for 64-bit:
one-byte
pri_opcd = 00
entry = 0
grammar ADD <Eb>,<Gb>
pri_opcd = 01
entry = 1
grammar ADD <Evqp>,<Gvqp>
pri_opcd = 02
entry = 2
grammar ADD <Gb>,<Eb>
pri_opcd = 03
entry = 3
grammar ADD <Gvqp>,<Evqp>
pri_opcd = 04
entry = 4
grammar ADD <AL>,<Ib>
pri_opcd = 05
entry = 5
grammar ADD <rAX>,<Ivds>
Notice I wrote 64-bit, since I did this for three operating modes: 16-bit real-address, 32-bit protected, and 64-bit long mode. Which was fun!
Once I had this data layout I could start to see entry collisions, since the source database has much more information than I needed, such as undocumented instructions, legacy instructions, etc.
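The grammar lines can be picked apart in a similar way. The sketch below is only an assumption about how one might do it, with made-up names (Operand, Grammar, split_operand, parse_grammar): it reads the mnemonic, then each code between < and >, and splits a code like Eb into an uppercase addressing-method part (E) and a lowercase operand-type part (b). Codes that don't have that shape, like AL or rAX, are kept as fixed operands.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

enum { MAX_OPERANDS = 4 };

typedef struct {
    char method[4];   /* addressing method, e.g. "E", "G", "V"; empty if fixed */
    char type[8];     /* operand type, e.g. "b", "dq", "vqp"; empty if fixed */
    char text[12];    /* the code exactly as written between < and > */
} Operand;

typedef struct {
    char mnemonic[16];
    int operand_count;
    Operand operands[MAX_OPERANDS];
} Grammar;

/* Splits "Vdq" into method "V" + type "dq". Codes that do not fit the
   UPPERCASE-then-lowercase shape (AL, rAX) are kept as fixed operands. */
static void split_operand(const char *code, Operand *op)
{
    size_t m = 0, i;
    memset(op, 0, sizeof *op);
    snprintf(op->text, sizeof op->text, "%s", code);
    while (isupper((unsigned char)code[m])) ++m;
    if (m == 0 || m >= sizeof op->method || code[m] == '\0') return;
    for (i = m; code[i]; ++i)
        if (!islower((unsigned char)code[i])) return;
    if (i - m >= sizeof op->type) return;
    memcpy(op->method, code, m);
    memcpy(op->type, code + m, i - m);
}

/* Parses e.g. "ADD <Eb>,<Gb>". Returns 0 on success, -1 on malformed input. */
static int parse_grammar(const char *line, Grammar *g)
{
    int consumed = 0;
    const char *p;
    assert(line != NULL && g != NULL);
    memset(g, 0, sizeof *g);
    if (sscanf(line, "%15[A-Z0-9]%n", g->mnemonic, &consumed) != 1) return -1;
    p = line + consumed;
    while (*p) {
        char code[12];
        while (*p == ' ' || *p == ',') ++p;
        if (*p == '\0') break;
        if (g->operand_count == MAX_OPERANDS) return -1;
        consumed = 0;
        /* %n stays 0 if the closing '>' is missing */
        if (sscanf(p, "<%11[A-Za-z0-9]>%n", code, &consumed) != 1 || consumed == 0)
            return -1;
        split_operand(code, &g->operands[g->operand_count++]);
        p += consumed;
    }
    return 0;
}
```

The uppercase/lowercase split is a heuristic that happens to fit the codes shown in this post; the full list of codes on MazeGen's site may need a proper lookup table instead.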
One simple cleanup was to erase the REX prefix entries and keep the one-byte INC and DEC instructions in the non-64-bit modes (the same bytes, 40H-4FH, are REX prefixes only in 64-bit mode). I also added duplicate instructions in 64-bit mode, one with REX and one without, so the disassembler can more easily identify SIMD instructions with mandatory prefixes: since the REX prefix has to come after all legacy prefixes, it was convenient to put it in the tables as well. For example:
MOVD/MOVQ SSE2 instructions:
grammar MOVD <Vdq>,<Ed>
opcode_id (8,0x66) (8,0x0F) (8,0x6E) mod reg rm
extra_grammar MOVQ <Vdq>,<Eqp>
extra_opcode_id (8,0x66) rex (8,0x0F) (8,0x6E) mod reg rm
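To make the prefix ordering concrete, here is a small C sketch of a 64-bit prefix scan. It is not how eaglefly does it (eaglefly matches against its tables); the Prefixes struct and scan_prefixes64 function are names invented for this example. It only shows the rule the table entries rely on: legacy prefixes come first, and a REX byte (40H-4FH), if present, sits right before the opcode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int has_66;          /* 66H seen: operand-size or mandatory SIMD prefix */
    int has_rex;         /* a REX byte directly precedes the opcode */
    int rex_w;           /* REX.W bit: 64-bit operand size */
    size_t opcode_at;    /* index of the first opcode byte */
} Prefixes;

static int is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0xF0: case 0xF2: case 0xF3:                 /* LOCK, REPNE, REP */
    case 0x2E: case 0x36: case 0x3E: case 0x26:      /* CS, SS, DS, ES */
    case 0x64: case 0x65:                            /* FS, GS */
    case 0x66: case 0x67:                            /* operand/address size */
        return 1;
    default:
        return 0;
    }
}

/* 64-bit mode only: legacy prefixes first, then at most one REX byte. */
static Prefixes scan_prefixes64(const uint8_t *code, size_t len)
{
    Prefixes p = {0, 0, 0, 0};
    size_t i = 0;
    assert(code != NULL);

    while (i < len && is_legacy_prefix(code[i])) {
        if (code[i] == 0x66) p.has_66 = 1;
        ++i;
    }
    if (i < len && (code[i] & 0xF0) == 0x40) {       /* REX is 40H..4FH */
        p.has_rex = 1;
        p.rex_w = (code[i] >> 3) & 1;                /* W is bit 3 */
        ++i;
    }
    p.opcode_at = i;
    return p;
}
```

With the bytes 66 48 0F 6E C0 the scan reports 66H, a REX with W set, and the opcode starting at index 2, which corresponds to the extra_opcode_id form above. Without the 48 byte it corresponds to the plain opcode_id form.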
The meaning of the operand codes within angle brackets is documented on MazeGen's website; some of them were created by him/her.
For example, for the MOVD instruction:
"V" means the reg field of the ModR/M byte selects a 128-bit XMM register.
"dq" is the operand type, a double quadword, independent of the operand-size attribute.
"E" means the mod and rm fields select either a general-purpose register or a memory address.
"d" means doubleword, independent of the operand-size attribute.
After this I needed to make some arrangements so the disassembler could identify assembly syntaxes that match the same opcode identifier but vary with the operand-size attribute.
Since the operand-size attribute is something the disassembler knows at run time, my solution was to group these assembly syntaxes into one assembly syntax group.
For example:
the INS instruction has the same opcode identifier, 6DH, for 16-bit and 32-bit operand sizes, so the disassembler picks the matching syntax at run time from the operand-size attribute:
grammar_group INSGR <Yv>
grammar_entry #16 INSW
grammar_entry #32 INSD
Notice there's no 64-bit operand size for I/O operations.
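The run-time lookup for a group like this can be as simple as the C sketch below. The struct and function names (GrammarEntry, GrammarGroup, select_mnemonic) are made up for illustration and are not eaglefly's actual types; only the INSGR data comes from the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned operand_size;     /* 16, 32 or 64 */
    const char *mnemonic;
} GrammarEntry;

typedef struct {
    const char *name;
    const GrammarEntry *entries;
    size_t entry_count;
} GrammarGroup;

/* grammar_group INSGR <Yv>, with one entry per operand size */
static const GrammarEntry insgr_entries[] = {
    {16, "INSW"},
    {32, "INSD"},
};
static const GrammarGroup INSGR = {"INSGR", insgr_entries, 2};

/* Returns the mnemonic for the given operand size,
   or NULL if the group has no entry for it. */
static const char *select_mnemonic(const GrammarGroup *group, unsigned operand_size)
{
    assert(group != NULL);
    for (size_t i = 0; i < group->entry_count; ++i)
        if (group->entries[i].operand_size == operand_size)
            return group->entries[i].mnemonic;
    return NULL;
}
```

Here select_mnemonic(&INSGR, 16) gives "INSW", 32 gives "INSD", and 64 gives NULL because the group has no 64-bit entry.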
Once I had made these cleanups I was ready to metaprogram the code that generates eaglefly's database:
For the first instruction, this is the code generated:
//First instruction ADD <Eb>,<Gb>
//opcode identifiers:
BeginInstructionBits(FileManager);
CopyBitsToFile(FileManager, BitsIDFile(8,0x00));
CopyBitsToFile(FileManager, mod);
CopyBitsToFile(FileManager, reg);
CopyBitsToFile(FileManager, rm);
EndInstructionBits(FileManager);
//assembly syntax:
InitOperandArray(&OperandArray);
PushOperand(&OperandArray, OperandType_E,OperandSize_b);
PushOperand(&OperandArray, OperandType_G,OperandSize_b);
PushInstructionInfo(FileManager, "ADD", OperandArray.Count, (u32*)OperandArray.TypesArray, (u32*)OperandArray.SizesArray, !DEFAULTSF64, !NOT_SUPPORTED, !INVALID_64, !REP_VALID, !REPNE_VALID, !LOCK_VALID);
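To show the metaprogramming idea in a small form, here is a hedged C sketch of a generator that writes out the operand part of the code above as text. The OperandSpec struct and the emitf and generate_operands functions are hypothetical names for this example; the emitted InitOperandArray and PushOperand lines follow the generated code shown above. I left out PushInstructionInfo, since its flags depend on data this sketch doesn't model.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *method;   /* e.g. "E" */
    const char *type;     /* e.g. "b" */
} OperandSpec;

/* Appends formatted text to buf; returns 0 on success, -1 if it does not fit. */
static int emitf(char *buf, size_t size, size_t *used, const char *fmt, ...)
{
    va_list args;
    int n;
    assert(*used < size);
    va_start(args, fmt);
    n = vsnprintf(buf + *used, size - *used, fmt, args);
    va_end(args);
    if (n < 0 || (size_t)n >= size - *used) return -1;
    *used += (size_t)n;
    return 0;
}

/* Writes the "assembly syntax" part of the generated code into buf. */
static int generate_operands(const char *mnemonic, const OperandSpec *ops,
                             size_t count, char *buf, size_t size)
{
    size_t used = 0;
    assert(size > 0);
    buf[0] = '\0';
    if (emitf(buf, size, &used, "//assembly syntax for %s:\n", mnemonic)) return -1;
    if (emitf(buf, size, &used, "InitOperandArray(&OperandArray);\n")) return -1;
    for (size_t i = 0; i < count; ++i)
        if (emitf(buf, size, &used,
                  "PushOperand(&OperandArray, OperandType_%s,OperandSize_%s);\n",
                  ops[i].method, ops[i].type))
            return -1;
    return 0;
}
```

Called with "ADD" and the operands {"E","b"} and {"G","b"}, it fills the buffer with the InitOperandArray line followed by the two PushOperand lines for the first instruction.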
So that's it. In the picture below you can see a block of assembly instructions disassembled from ray.exe, the raytracer made by Casey Muratori. The disassembly is wrong right now, heh; only the first two instructions are correct, but the point is that it outputs something, which is what I needed.
The format is still loose as well, but that's something for the future.
I hope you enjoyed this post. If you have any questions you can contact me at
[email protected]
Don't forget to follow the project on YouTube:
https://www.youtube.com/channel/UCumRmCCamu0sJnRywpfWVTw.
eaglefly_decoder.exe -64 C:\Raytracer\build\ray.exe > eaglefly_64.asm
-64
MOV 8byte ptr SS : [ RSP + + 0x20 ]R9
MOV 8byte ptr SS : [ RSP + + 0x18 ]R8
PUSH R12
AND AL0x10
MOV 8byte ptr SS : [ RSP + + 0x08 ]RCX
PUSH ESI
PUSH EDI
SUB RSP0x000001A8
MOV RAXCS : [ RIP + 0x00058FE4 ]
XOR RAXRSP
MOV 8byte ptr SS : [ RSP + + 0x00000190 ]RAX
LEA RAXinvalid ptr SS : [ RSP + + 0x00000170 ]
MOV RDIRAX
XOR EAXEAX
MOV ECX0x00000010
STOSB
MOV 4byte ptr SS : [ RSP + + 0x48 ]0x00000000
MOVSS XMM0CS : [ RIP + 0x000442D9 ]
MOVSS 16byte ptr SS : [ RSP + + 0x4C ]XMM0
MOVSS XMM0CS : [ RIP + 0x000442CB ]
MOVSS 16byte ptr SS : [ RSP + + 0x50 ]XMM0
MOV 4byte ptr SS : [ RSP + + 0x30 ]0x00000000
JMP 0x0A
MOV EAX4byte ptr SS : [ RSP + + 0x30 ]
INC EAX
MOV 4byte ptr SS : [ RSP + + 0x30 ]EAX
CMP 4byte ptr SS : [ RSP + + 0x30 ]0x02
JAE 0x00000461
MOVSS XMM0CS : [ RIP + 0x000442BA ]
MOVSS 16byte ptr SS : [ RSP + + 0x20 ]XMM0
MOV 4byte ptr SS : [ RSP + + 0x24 ]0x00000000
JMP 0x0A
MOV EAX4byte ptr SS : [ RSP + + 0x24 ]
INC EAX
MOV 4byte ptr SS : [ RSP + + 0x24 ]EAX
MOV RAX8byte ptr SS : [ RSP + + 0x000001C8 ]
MOV EAX4byte ptr DS : [ RAX + 0x10 ]
CMP 4byte ptr SS : [ RSP + + 0x24 ]EAX
JAE 0x00000130
MOV EAX4byte ptr SS : [ RSP + + 0x24 ]
IMUL RAXRAX0x14
MOV RCX8byte ptr SS : [ RSP + + 0x000001C8 ]
ADD RAX8byte ptr DS : [ RCX + 0x18 ]
MOV 8byte ptr SS : [ RSP + + 0x58 ]RAX
LEA RAXinvalid ptr SS : [ RSP + + 0x00000090 ]
MOV RDIRAX
MOV RSI8byte ptr SS : [ RSP + + 0x000001D8 ]
MOV ECX0x0000000C
MOVSB
LEA RAXinvalid ptr SS : [ RSP + + 0x000000A0 ]
MOV RCX8byte ptr SS : [ RSP + + 0x58 ]
MOV RDIRAX
LEA RSIinvalid ptr DS : [ RCX + 0x04 ]
MOV ECX0x0000000C
MOVSB
LEA RDXinvalid ptr SS : [ RSP + + 0x00000090 ]