AlejandroArmenta
Alejadro Armenta
3 posts / 2 projects

24 years old, Engine programmer, coding EagleFly disassembler for x86 architectures.

#16791 Project EagleFly - Disassembler for IA-32 & x64 architectures
7 months ago Edited by Alejadro Armenta on Nov. 17, 2018, 6:57 p.m.

Hello everyone, my name is Alejandro Armenta and I'm making a disassembler for the IA-32 and x64 architectures. The disassembler is part of a bigger project, a debugger/binary analysis tool, so I decided to start with the disassembler as its basis.

The project follows the handmade principles and is inspired by many of the projects being made throughout the network. Its sessions are recorded and uploaded to YouTube so people can watch them whenever they want.

The goal is to make a fast and robust disassembler that the debugger can use, and to explain the methods used to solve the problems we face along the way.

The disassembler is now taking shape: it disassembles a test instruction stream covering different instruction encoding types (one-, two- and three-byte opcodes, opcode extensions, x87, SSE2, all addressing modes and REX prefixes). We've already started to define its usage code, and we are starting to see the API design we are looking for.

If you feel like it, follow us on YouTube and retweet our tweets on Twitter!
YouTube: https://www.youtube.com/channel/UCumRmCCamu0sJnRywpfWVTw
Twitter: https://twitter.com/lex_armenta

If you have any questions, please feel free to ask directly here! If you have a session-specific question, you can post it on the corresponding YouTube video or Twitter post.


Current x64 output:

Arg1: V:\build\win32_eaglefly_disasm..exe
PUSH R12 , ,
PUSH RAX , ,
MOV SPL , [RAX] ,
MOV RBX , [R13 + RAX * 2 + 0000000000000000] ,
MOV RBX , [RBP + RAX * 2 + 0000000000000000] ,
MOV RBX , [RBP + RAX * 1 + FFFFFFFFFFFFFFFF] ,
MOV RBX , [RAX * 4 + 0000000033221100] ,
MOV RBP , [RSP + R12 * 8 + 0000000033221100] ,
MOV R8 , [R12] ,
MOV R8 , [R13 + 0000000000000000] ,
MOV R8 , [RAX + 0000000033221100] ,
MOV R8D , [RAX + 0000000033221100] ,
MOV RAX , RDX ,
ADD [RAX + FFFFFFFFFFFFFFF0] , , 0B
FADD ST(0) , ST(1) ,
FUCOM ST(1) , ST(0) ,
FLD , [0000000000000004] ,
VMRESUME , ,
PALIGNR XMM0 , XMM1 , 08
SHLD [0000000000000000] , EAX , 03
ADD EAX , [0000000000000000] ,
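
The addressing modes in the output above come from the ModRM, SIB and REX bytes. As a rough illustration only (this is not EagleFly's code; every type and function name here is hypothetical), a minimal C sketch of how those fields can be split out, following the bit layouts in the Intel manuals:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the bit layouts come from the Intel SDM. */
typedef struct { uint8_t mod, reg, rm; } ModRM;
typedef struct { uint8_t scale, index, base; } SIB;

/* In 64-bit mode, any byte in 0x40..0x4F is a REX prefix (0100WRXB). */
static int is_rex(uint8_t byte) { return (byte & 0xF0) == 0x40; }

/* A SIB byte follows ModRM when mod != 3 and rm == 4. */
static int modrm_has_sib(uint8_t modrm)
{
    return (modrm >> 6) != 3 && (modrm & 7) == 4;
}

/* Pass rex = 0 when the instruction has no REX prefix.
   Special cases such as mod == 0, rm == 5 (disp32 / RIP-relative)
   are left to the caller. */
static ModRM decode_modrm(uint8_t modrm, uint8_t rex)
{
    ModRM m;
    assert(rex == 0 || is_rex(rex));
    m.mod = (uint8_t)(modrm >> 6);
    m.reg = (uint8_t)(((modrm >> 3) & 7) | ((rex & 0x04) ? 8 : 0)); /* REX.R */
    m.rm  = (uint8_t)(modrm & 7);
    if (!modrm_has_sib(modrm) && (rex & 0x01))
        m.rm |= 8; /* REX.B extends rm only when no SIB byte follows */
    return m;
}

static SIB decode_sib(uint8_t sib, uint8_t rex)
{
    SIB s;
    assert(rex == 0 || is_rex(rex));
    s.scale = (uint8_t)(1u << (sib >> 6));                           /* 1, 2, 4 or 8 */
    s.index = (uint8_t)(((sib >> 3) & 7) | ((rex & 0x02) ? 8 : 0)); /* REX.X */
    s.base  = (uint8_t)((sib & 7) | ((rex & 0x01) ? 8 : 0));        /* REX.B */
    return s;
}
```

For instance, with REX = 0x4A, ModRM = 0xAC and SIB = 0xE4 this yields reg 5 (RBP), base 4 (RSP), index 12 (R12) and scale 8, which matches the operand shape of the `MOV RBP , [RSP + R12 * 8 + ...]` line.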


Instruction groups to disassemble:

1.1. General Purpose
1.2. x87 FPU
1.3. x87 FPU and SIMD state management
1.4. Intel MMX technology
1.5. SSE/SSE2/SSE3/SSSE3/SSE4 extensions
1.6. AESNI/PCLMULQDQ
1.7. Intel AVX extensions
1.8. F16C, RDRAND, RDSEED, FS/GS base access
1.9. FMA extensions
1.10. Intel AVX2 Extensions
1.11. Intel transactional synchronization extensions
1.12. System Instructions
1.13. IA-32e mode: 64-bit mode instructions
1.14. VMX Instructions
1.15. SMX Instructions
1.16. ADCX and ADOX
1.17. Intel Memory Protection Extensions
1.18. Intel Software Guard Extensions

mmozeiko
Mārtiņš Možeiko
1934 posts / 1 project
#16792 Project EagleFly - Disassembler for IA-32 & x64 architectures
7 months ago Edited by Mārtiņš Možeiko on Nov. 16, 2018, 4:20 a.m.

Nice! I have written a couple of (limited) disassemblers in the past, and I've found it is a great way to learn about assembly and the CPU.
Btw, if the goal is to be fast, I expect to see benchmarks against Zydis :) Afaik it is the fastest and most accurate open-source x86/x64 disassembler (pretty small too).

Not to discourage you, but there are so many good x86/x64 disassemblers out there... ARM ones, on the other hand: I have seen no good ones (fast/small/complete). Especially nowadays, when ARM has released a formal instruction set reference in machine-readable files: ARM - Exploration Tools. That means it should be possible to write a generator for a disassembler.
AlejandroArmenta
Alejadro Armenta
3 posts / 2 projects

#16795 Project EagleFly - Disassembler for IA-32 & x64 architectures
7 months ago


Awesome, I've been looking for benchmarks that I could compare the disassembler against. I found that Zydis' benchmarks were made against Intel XED, the x86 encoder/decoder, so I'll give it a shot.

Yeah! I think it'd be great if Intel had a formal specification as well, so that we could use it to generate a disassembler. It's really sad =(.

QUESTION: Do you know how to attach local pictures inside these posts? I tried with "", but it didn't work.





mmozeiko
Mārtiņš Možeiko
1934 posts / 1 project
#16796 Project EagleFly - Disassembler for IA-32 & x64 architectures
7 months ago

You cannot attach them. You need to host them somewhere (imgur, Google Drive, ... your choice). Then you use the following BBCode tag with the full image address:

[img]http://address/to/image.png[/img]
AlejandroArmenta
Alejadro Armenta
3 posts / 2 projects

#21057 Demo 00 - Automatically generating opcode tables
1 month ago Edited by Alejadro Armenta on May 19, 2019, 3:04 a.m.

Hello everyone, it's been a while since I posted something here, but I wanted to share a project update and the current state of the project. If you'd like a more detailed description of what I've been doing with the disassembler, I recorded some demo videos and posted them on YouTube (https://www.youtube.com/channel/UCumRmCCamu0sJnRywpfWVTw). I try to keep the channel more up to date, since I think it is easier to follow a video that explains and shows the code and the data as it goes. Still, I think writing posts is really valuable too, and I'll do my best to keep posting disassembler updates, at least once a month.


So, as it stands right now, the disassembler supports the following x86 instruction groups:


1. general
2. system
3. x87 FPU
4. MMX
5. SSE
6. SSE1
7. SSE2
8. SSE3
9. SSSE3
10. SSE4
11. VMX
12. SMX instructions

(both one-byte and two-byte instructions)


What I'll try to explain here is the data transformations needed to automatically generate the opcode tables used by the EagleFly disassembler. In this video (https://www.youtube.com/watch?v=cZ8c6LeIdiM&t=53s) I explain more thoroughly how the source database is transformed into EagleFly's database.


The opcode tables are generated from MazeGen's awesome tables (http://ref.x86asm.net/). These tables have a thorough description of each instruction: opcodes, assembly syntax, instruction classes and flags modified by the instruction, among other things. What MazeGen has is an XML database of all instructions for 16-, 32- and 64-bit modes, which I lexed and transformed to create the EagleFly database.


To transform MazeGen's data, I first had to understand XML files and their structure:

<element attr_name="attr_value">text</element>

For that I made a tool that generates the XML tree layout, so I could easily see how the data's properties are nested and be able to lex them.
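
A tree-layout tool of this kind can be quite small. The following C sketch is only an assumption about what such a tool might look like, not the one used for EagleFly: it prints one indented line per opening tag and ignores text, attributes, comments and declarations. The function name `xml_tree_layout` is made up, and it will be confused by a `>` inside a comment or attribute value.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the element nesting of `xml` into `out`,
   two spaces of indentation per level. Returns the number of bytes
   written; stops early if `out` is too small. */
static size_t xml_tree_layout(const char *xml, char *out, size_t cap)
{
    size_t used = 0;
    int depth = 0;
    const char *p = xml;

    assert(xml != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    while ((p = strchr(p, '<')) != NULL) {
        const char *end = strchr(p, '>');
        if (end == NULL)
            break;                               /* unterminated tag: stop */
        if (p[1] == '/') {                       /* closing tag */
            if (depth > 0)
                depth--;
        } else if (p[1] != '?' && p[1] != '!') { /* opening tag */
            const char *name = p + 1;
            int len = 0;
            int n;
            while (name + len < end && name[len] != '/' &&
                   !isspace((unsigned char)name[len]))
                len++;
            n = snprintf(out + used, cap - used, "%*s%.*s\n",
                         depth * 2, "", len, name);
            if (n < 0 || (size_t)n >= cap - used) {
                out[used] = '\0';                /* drop the truncated line */
                break;
            }
            used += (size_t)n;
            if (end[-1] != '/')                  /* not self-closing */
                depth++;
        }
        p = end + 1;
    }
    return used;
}
```

Feeding it a fragment shaped like `<one-byte><pri_opcd value="00"><entry>...</entry></pri_opcd></one-byte>` (element names assumed here for illustration) gives `one-byte`, `pri_opcd` and `entry` on successively deeper indentation levels.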

By the way, I think XML's file structure sucks, since it doesn't have a header that describes the layout of the data in the body. With one, I wouldn't have needed to build the XML tree layout myself.

Once I had the XML file tree, I could start to lex the file and tokenize the data I really needed: the opcode identifiers and the assembly syntax. I filtered the source data into two data streams, one for opcode identifiers and one for assembly syntaxes, and for convenience wrote them to two files so I could easily clean up the data and handle entry collisions (more on that later).

For example, opcode identifiers for 64-bit:

one-byte
pri_opcd = 00
entry = 0
opcode_id (8,0x00) mod reg rm
pri_opcd = 01
entry = 1
opcode_id (8,0x01) mod reg rm
pri_opcd = 02
entry = 2
opcode_id (8,0x02) mod reg rm
pri_opcd = 03
entry = 3
opcode_id (8,0x03) mod reg rm
pri_opcd = 04
entry = 4
opcode_id (8,0x04)
pri_opcd = 05
entry = 5
opcode_id (8,0x05)
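
To show how lines in this intermediate format could be read back, here is a hedged C sketch of a parser for the `opcode_id` lines. It assumes that `(8,0x00)` means "8 literal bits with value 0x00", which is a reading of the listing above rather than something the post states; the `OpcodeId` struct and `parse_opcode_id` function are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_LITERALS 4

/* Hypothetical in-memory form of one opcode_id line. */
typedef struct {
    unsigned bits[MAX_LITERALS];   /* width of each literal, e.g. 8 */
    unsigned value[MAX_LITERALS];  /* value of each literal, e.g. 0x0F */
    int literal_count;
    int rex_after;                 /* literals seen before "rex", or -1 */
    int has_mod, has_reg, has_rm;
} OpcodeId;

/* Returns 1 on success, 0 if the line is not a well-formed opcode_id line. */
static int parse_opcode_id(const char *line, OpcodeId *id)
{
    char word[32];
    int used = 0;

    assert(line != NULL && id != NULL);
    memset(id, 0, sizeof *id);
    id->rex_after = -1;

    if (sscanf(line, "%31s%n", word, &used) != 1)
        return 0;
    if (strcmp(word, "opcode_id") != 0 && strcmp(word, "extra_opcode_id") != 0)
        return 0;
    line += used;

    for (;;) {
        while (isspace((unsigned char)*line))
            line++;
        if (*line == '\0')
            return 1;
        used = 0;
        if (*line == '(') {                       /* literal bits: (width,value) */
            unsigned bits, value;
            if (sscanf(line, "(%u,%x)%n", &bits, &value, &used) != 2 || used == 0)
                return 0;
            if (id->literal_count == MAX_LITERALS)
                return 0;
            id->bits[id->literal_count] = bits;
            id->value[id->literal_count] = value;
            id->literal_count++;
        } else {                                  /* field keyword */
            if (sscanf(line, "%31s%n", word, &used) != 1)
                return 0;
            if (strcmp(word, "mod") == 0)      id->has_mod = 1;
            else if (strcmp(word, "reg") == 0) id->has_reg = 1;
            else if (strcmp(word, "rm") == 0)  id->has_rm = 1;
            else if (strcmp(word, "rex") == 0) id->rex_after = id->literal_count;
            else return 0;
        }
        line += used;
    }
}
```

Recording where `rex` appeared (`rex_after`) keeps the ordering information that matters for entries with a mandatory prefix in front of the REX byte.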


Assembly syntaxes for 64-bit:

one-byte
pri_opcd = 00
entry = 0
grammar ADD <Eb>,<Gb>
pri_opcd = 01
entry = 1
grammar ADD <Evqp>,<Gvqp>
pri_opcd = 02
entry = 2
grammar ADD <Gb>,<Eb>
pri_opcd = 03
entry = 3
grammar ADD <Gvqp>,<Evqp>
pri_opcd = 04
entry = 4
grammar ADD <AL>,<Ib>
pri_opcd = 05
entry = 5
grammar ADD <rAX>,<Ivds>


Notice I wrote 64-bit: I did this for three operating modes (16-bit real-address mode, 32-bit protected mode and 64-bit long mode), which was fun!

Once I had this data layout I could start to see entry collisions, since the source database has much more information than I needed, such as undocumented instructions, legacy instructions, etc.

One simple cleanup was to erase the REX prefix entries and keep the one-byte INC and DEC instructions in the non-64-bit modes. I also added duplicate instructions in 64-bit mode, one with REX and one without it, so that the disassembler can more easily identify SIMD instructions with mandatory prefixes: since the REX prefix has to come after all legacy prefixes, it was convenient to include it in the tables as well. For example:

MOVD/MOVQ SSE2 instructions:

grammar MOVD <Vdq>,<Ed>
opcode_id (8,0x66) (8,0x0F) (8,0x6E) mod reg rm

extra_grammar MOVQ <Vdq>,<Eqp>
extra_opcode_id (8,0x66) rex (8,0x0F) (8,0x6E) mod reg rm
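
The ordering rule behind these two entries (legacy prefixes first, then an optional REX byte, then the opcode) can be sketched in C as below. This is an illustrative prefix scanner for 64-bit mode with made-up names (`Prefixes`, `scan_prefixes_64`), not the table-driven matching the post describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of the prefix bytes in front of an opcode. */
typedef struct {
    int has_66, has_f2, has_f3; /* prefixes that SIMD opcodes may mandate */
    uint8_t rex;                /* REX byte, or 0 when absent */
    size_t opcode_offset;       /* index of the first opcode byte */
} Prefixes;

static int is_legacy_prefix(uint8_t b)
{
    switch (b) {
    case 0xF0: case 0xF2: case 0xF3:            /* LOCK, REPNE, REP */
    case 0x2E: case 0x36: case 0x3E: case 0x26: /* CS, SS, DS, ES overrides */
    case 0x64: case 0x65:                       /* FS, GS overrides */
    case 0x66: case 0x67:                       /* operand/address size */
        return 1;
    default:
        return 0;
    }
}

/* 64-bit mode only: a byte in 0x40..0x4F counts as REX when it comes
   immediately before the opcode, after all legacy prefixes. */
static Prefixes scan_prefixes_64(const uint8_t *code, size_t len)
{
    Prefixes p = {0, 0, 0, 0, 0};
    size_t i = 0;

    assert(code != NULL);
    while (i < len && is_legacy_prefix(code[i])) {
        if (code[i] == 0x66) p.has_66 = 1;
        if (code[i] == 0xF2) p.has_f2 = 1;
        if (code[i] == 0xF3) p.has_f3 = 1;
        i++;
    }
    if (i < len && (code[i] & 0xF0) == 0x40)
        p.rex = code[i++];
    p.opcode_offset = i;
    return p;
}
```

With the bytes `66 48 0F 6E C0` the scanner reports the 0x66 prefix, REX = 0x48 and the opcode starting at offset 2; with `66 0F 6E C0` it reports no REX and offset 1, mirroring the MOVQ and MOVD entries.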


The meaning of the grammars within angle brackets is documented on MazeGen's website; some of them were created by MazeGen.

For example, for the MOVD instruction:

"V" means the reg field of the ModRM byte selects a 128-bit XMM register.
"dq" is the operand type, a double quadword, regardless of the operand-size attribute.

"E" means the mod and rm fields select either a general-purpose register or a memory address.
"d" means doubleword, regardless of the operand-size attribute.
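
As a small illustration of that notation, the C sketch below splits a token such as `<Vdq>` into its addressing-method letter and operand-type code. The rule it uses (one uppercase letter followed only by lowercase letters is method plus type; anything else, like `<AL>` or `<rAX>`, is a fixed register) is a heuristic assumed for this example, and the `Operand` struct and `parse_operand` function are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical parsed form of one "<...>" operand token. */
typedef struct {
    char method;   /* addressing-method letter, or 0 for a fixed register */
    char type[8];  /* operand-type code, e.g. "b", "dq", "vqp" */
    char fixed[8]; /* fixed register name, e.g. "AL", "rAX" */
} Operand;

static int all_lower(const char *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (!islower((unsigned char)s[i]))
            return 0;
    return 1;
}

/* `token` must start at '<'. Returns 1 on success, 0 on malformed input. */
static int parse_operand(const char *token, Operand *op)
{
    const char *close;
    size_t len;

    assert(token != NULL && op != NULL);
    memset(op, 0, sizeof *op);
    if (token[0] != '<')
        return 0;
    close = strchr(token, '>');
    if (close == NULL)
        return 0;
    len = (size_t)(close - token) - 1;   /* characters between < and > */
    if (len == 0 || len >= sizeof op->fixed)
        return 0;

    /* Assumed heuristic: "Vdq" is method 'V' plus type "dq". */
    if (isupper((unsigned char)token[1]) && all_lower(token + 2, len - 1)) {
        op->method = token[1];
        memcpy(op->type, token + 2, len - 1);
    } else {
        memcpy(op->fixed, token + 1, len);
    }
    return 1;
}
```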

After this I needed to make some arrangements so that the disassembler could handle assembly syntaxes that match the same opcode identifier but vary with the operand-size attribute.

Since the operand-size attribute is something the disassembler knows at run time, my solution was to group these assembly syntaxes into one assembly syntax group.

For example:

The INS instruction has the same opcode identifier, 6DH, for 16-bit and 32-bit operand sizes, so the disassembler picks the matching syntax at run time from the operand-size attribute:

grammar_group INSGR <Yv>
grammar_entry #16 INSW
grammar_entry #32 INSD

Notice there's no 64-bit operand size for I/O operations.
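
A run-time lookup over such a group might look like the C sketch below. The data mirrors the INSGR group above; the struct layout and the function names (`effective_operand_size`, `select_mnemonic`) are assumptions made for illustration, and the operand-size rule is simplified to the 0x66 prefix only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table layout for a grammar group. */
typedef struct { unsigned operand_size; const char *mnemonic; } GrammarEntry;
typedef struct {
    const char *name;
    const GrammarEntry *entries;
    size_t count;
} GrammarGroup;

static const GrammarEntry ins_entries[] = { {16, "INSW"}, {32, "INSD"} };
static const GrammarGroup ins_group = {
    "INSGR", ins_entries, sizeof ins_entries / sizeof ins_entries[0]
};

/* Simplified: the 0x66 prefix toggles between the 16- and 32-bit sizes.
   REX.W is deliberately ignored in this sketch. */
static unsigned effective_operand_size(unsigned default_size, int has_66_prefix)
{
    assert(default_size == 16 || default_size == 32);
    if (!has_66_prefix)
        return default_size;
    return default_size == 16 ? 32u : 16u;
}

/* Returns the mnemonic for the given operand size, or NULL if the group
   has no entry for it (e.g. 64 for the I/O instructions). */
static const char *select_mnemonic(const GrammarGroup *group, unsigned operand_size)
{
    size_t i;
    assert(group != NULL);
    for (i = 0; i < group->count; i++)
        if (group->entries[i].operand_size == operand_size)
            return group->entries[i].mnemonic;
    return NULL;
}
```

Returning NULL for a size that has no entry lets the caller decide how to report it, which fits the note that there is no 64-bit form for I/O operations.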


Once I had made these cleanups I was ready to metaprogram the code that generates EagleFly's database.

For the first instruction, this is the generated code:

//First instruction ADD <Eb>,<Gb>

//opcode identifiers:


BeginInstructionBits(FileManager);
CopyBitsToFile(FileManager, BitsIDFile(8,0x00));
CopyBitsToFile(FileManager, mod);
CopyBitsToFile(FileManager, reg);
CopyBitsToFile(FileManager, rm);
EndInstructionBits(FileManager);

//assembly syntax:

InitOperandArray(&OperandArray);
PushOperand(&OperandArray, OperandType_E,OperandSize_b);
PushOperand(&OperandArray, OperandType_G,OperandSize_b);
PushInstructionInfo(FileManager, "ADD", OperandArray.Count, (u32*)OperandArray.TypesArray, (u32*)OperandArray.SizesArray, !DEFAULTSF64, !NOT_SUPPORTED, !INVALID_64, !REP_VALID, !REPNE_VALID, !LOCK_VALID);
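
To give an idea of the metaprogramming step, here is a hedged C sketch of a generator that emits the operand lines shown above as text. The emitted call names (`InitOperandArray`, `PushOperand`, `OperandType_*`, `OperandSize_*`) are copied from the generated code in this post; the generator itself, including `OperandSpec` and `emit_operand_code`, is hypothetical and leaves out the `PushInstructionInfo` line, whose flags depend on data not shown here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical input: one operand as (method letter, type code). */
typedef struct { char method; const char *type; } OperandSpec;

/* Writes the operand-setup lines into `out`. Returns the number of
   bytes written, or 0 if `out` is too small. */
static size_t emit_operand_code(char *out, size_t cap,
                                const OperandSpec *ops, int count)
{
    size_t used;
    int i, n;

    assert(out != NULL && cap > 0 && (ops != NULL || count == 0));
    n = snprintf(out, cap, "InitOperandArray(&OperandArray);\n");
    if (n < 0 || (size_t)n >= cap)
        return 0;
    used = (size_t)n;

    for (i = 0; i < count; i++) {
        assert(ops[i].type != NULL && strlen(ops[i].type) > 0);
        n = snprintf(out + used, cap - used,
                     "PushOperand(&OperandArray, OperandType_%c,OperandSize_%s);\n",
                     ops[i].method, ops[i].type);
        if (n < 0 || (size_t)n >= cap - used)
            return 0;
        used += (size_t)n;
    }
    return used;
}
```

Called with the operands `{'E', "b"}` and `{'G', "b"}` it reproduces the three operand lines of the `ADD <Eb>,<Gb>` example.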


So that's it. Below you can see a block of assembly instructions disassembled from ray.exe, the raytracer made by Casey Muratori. The disassembly is still wrong right now, heh (only the first two instructions are correct), but the point is that it outputs something, which is what I needed.

The format is still rough as well, but that's something for the future.

I hope you enjoyed this post. If you have any questions you can contact me at [email protected]

Don't forget to follow the project on YouTube: https://www.youtube.com/channel/UCumRmCCamu0sJnRywpfWVTw.

eaglefly_decoder.exe -64 C:\Raytracer\build\ray.exe > eaglefly_64.asm

-64
MOV 8byte ptr SS : [ RSP + + 0x20 ]R9
MOV 8byte ptr SS : [ RSP + + 0x18 ]R8
PUSH R12
AND AL0x10
MOV 8byte ptr SS : [ RSP + + 0x08 ]RCX
PUSH ESI
PUSH EDI
SUB RSP0x000001A8
MOV RAXCS : [ RIP + 0x00058FE4 ]
XOR RAXRSP
MOV 8byte ptr SS : [ RSP + + 0x00000190 ]RAX
LEA RAXinvalid ptr SS : [ RSP + + 0x00000170 ]
MOV RDIRAX
XOR EAXEAX
MOV ECX0x00000010
STOSB
MOV 4byte ptr SS : [ RSP + + 0x48 ]0x00000000
MOVSS XMM0CS : [ RIP + 0x000442D9 ]
MOVSS 16byte ptr SS : [ RSP + + 0x4C ]XMM0
MOVSS XMM0CS : [ RIP + 0x000442CB ]
MOVSS 16byte ptr SS : [ RSP + + 0x50 ]XMM0
MOV 4byte ptr SS : [ RSP + + 0x30 ]0x00000000
JMP 0x0A
MOV EAX4byte ptr SS : [ RSP + + 0x30 ]
INC EAX
MOV 4byte ptr SS : [ RSP + + 0x30 ]EAX
CMP 4byte ptr SS : [ RSP + + 0x30 ]0x02
JAE 0x00000461
MOVSS XMM0CS : [ RIP + 0x000442BA ]
MOVSS 16byte ptr SS : [ RSP + + 0x20 ]XMM0
MOV 4byte ptr SS : [ RSP + + 0x24 ]0x00000000
JMP 0x0A
MOV EAX4byte ptr SS : [ RSP + + 0x24 ]
INC EAX
MOV 4byte ptr SS : [ RSP + + 0x24 ]EAX
MOV RAX8byte ptr SS : [ RSP + + 0x000001C8 ]
MOV EAX4byte ptr DS : [ RAX + 0x10 ]
CMP 4byte ptr SS : [ RSP + + 0x24 ]EAX
JAE 0x00000130
MOV EAX4byte ptr SS : [ RSP + + 0x24 ]
IMUL RAXRAX0x14
MOV RCX8byte ptr SS : [ RSP + + 0x000001C8 ]
ADD RAX8byte ptr DS : [ RCX + 0x18 ]
MOV 8byte ptr SS : [ RSP + + 0x58 ]RAX
LEA RAXinvalid ptr SS : [ RSP + + 0x00000090 ]
MOV RDIRAX
MOV RSI8byte ptr SS : [ RSP + + 0x000001D8 ]
MOV ECX0x0000000C
MOVSB
LEA RAXinvalid ptr SS : [ RSP + + 0x000000A0 ]
MOV RCX8byte ptr SS : [ RSP + + 0x58 ]
MOV RDIRAX
LEA RSIinvalid ptr DS : [ RCX + 0x04 ]
MOV ECX0x0000000C
MOVSB
LEA RDXinvalid ptr SS : [ RSP + + 0x00000090 ]