Compiling without the C run-time library: C functions vs. intrinsics; single-statement asserts

I'm trying to compile some C code without the CRT. I followed part of the wiki article on the subject. Most of my code doesn't use the CRT, but I use some single-file header libraries that use some of it (e.g. stb_truetype). I replaced malloc, free, realloc and qsort with functions I implemented myself.

For math functions, as I'm not confident I would do a good job even for the simple ones, I wanted to look at how they are implemented in the CRT. I searched a little bit, and it seems that musl libc is well organized and easy to browse. So my idea was to make a header where I would copy the musl functions I needed.

But for some of those functions there seem to be intrinsics available, for example floor, ceil, round and sqrt. Is there a reason to use the C functions instead of the intrinsics? My only target is x64 with SSE 4.2. Will I encounter problems using the intrinsics, expecting them to behave like the CRT?



A second issue I encountered is that my asserts don't work with stb libs.

My assert is simply:
#define _assert( expression ) if ( !( expression ) ) { __debugbreak( ); }


But in the stb libs several "if"s don't use curly braces:
if ( a )
    stb_assert( expression_a );
else if ( b )
    stb_assert( expression_b );
else
    stb_assert( expression_c );


Which, once expanded, produces:
if ( a )
    if ( !( expression_a ) ) {
        __debugbreak( );
    }; else if ( b )
        if ( !( expression_b ) ) {
            __debugbreak( );
        }; else
            if ( !( expression_c ) ) {
                 __debugbreak( );
            };


There are two issues: the nesting of the ifs is wrong, and there is a semicolon between the closing curly brace and the "else". Is there some trick I can use to fix that? The assert from MSVC's assert.h looks like this:
#ifdef  __cplusplus
extern "C" {
#endif

_CRTIMP void __cdecl _wassert(_In_z_ const wchar_t * _Message, _In_z_ const wchar_t *_File, _In_ unsigned _Line);

#ifdef  __cplusplus
}
#endif

#define assert(_Expression) (void)( (!!(_Expression)) || (_wassert(_CRT_WIDE(#_Expression), _CRT_WIDE(__FILE__), __LINE__), 0) )


I tried to do
#define _assert( expression ) ( ( expression ) || __debugbreak( ) )

and some variations, with no success. I get the following error: error C2297: '||': illegal, right operand has type 'void'. I guess that makes sense, but why does it work in the MSVC code? I don't understand what the MSVC macro is trying to do. It casts something to void, and it has "!!", which I guess just cancels itself out...?

musl has a similar macro. It also uses _Noreturn, which is __attribute__((__noreturn__)), but I don't know the equivalent for MSVC, or even whether I can set it for __debugbreak( ), since it's an intrinsic that outputs int3 in the assembly.
#define assert(x) ((void)((x) || (__assert_fail(#x, __FILE__, __LINE__, __func__),0)))

#ifdef __cplusplus
extern "C" {
#endif

_Noreturn void __assert_fail (const char *, const char *, int, const char *);

#ifdef __cplusplus
}
#endif


Edited by Simon Anciaux on Reason: typo
The macro nesting issue is usually handled using a do while loop:
#define _assert( expression ) do { if ( !( expression ) ) { __debugbreak( ); } } while(0)
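As a minimal sketch of why the do/while wrapper helps, here is the macro used in a brace-less if/else chain like the ones in the stb libs. The names nocrt_assert, assert_failures and check are made up for this example, and a counter stands in for __debugbreak( ) so the snippet can run anywhere:

```c
/* Hypothetical stand-in for __debugbreak( ): count failures instead of trapping. */
static int assert_failures = 0;

/* do { ... } while ( 0 ) turns the macro into a single statement that still needs
   the trailing semicolon, so it nests correctly in brace-less if/else chains. */
#define nocrt_assert( expression ) do { if ( !( expression ) ) { assert_failures++; } } while ( 0 )

/* Same shape as the stb code: no curly braces around the branches. */
static int check( int a, int b ) {
    if ( a )
        nocrt_assert( a > 0 );
    else if ( b )
        nocrt_assert( b > 0 );
    else
        nocrt_assert( 0 );
    return assert_failures;
}
```

With the plain "if ( ... ) { ... }" macro this function would not compile, because of the stray semicolon in front of each "else".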
The way these assert macros work is by using short-circuit evaluation and the comma operator.

#define _assert( expression ) ( ( expression ) || __debugbreak( ) )


The problem is __debugbreak( ) returns void and so it can't be used with the || operator.

So instead you can use the comma operator:
#define _assert( expression ) ( ( expression ) || (__debugbreak( ), 0) )


Then the expressions on both sides of the comma get evaluated, but the value of the whole comma expression is the 0 on the right, which has type int and so can be used with the ||.

The (!!(_Expression)) is used for forcing the value to a 0 or 1.
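Putting those pieces together, a sketch of the expression form could look like this. nocrt_assert, fake_debugbreak, trap_count and check_sign are names invented for the example; fake_debugbreak is a void function standing in for __debugbreak( ) so the snippet can run on any compiler.

```c
static int trap_count = 0;

/* Hypothetical stand-in for __debugbreak( ); like the intrinsic, it returns void. */
static void fake_debugbreak( void ) { trap_count++; }

/* - !!( expression ) normalizes any scalar (pointer, float, ...) to 0 or 1.
   - || short-circuits: the right side only runs when the expression is false.
   - ( fake_debugbreak( ), 0 ) calls the void function, then yields the int 0 for ||.
   - ( void ) discards the result, so the macro can't be used as a value. */
#define nocrt_assert( expression ) \
    ( ( void ) ( ( !!( expression ) ) || ( fake_debugbreak( ), 0 ) ) )

/* Because the macro is a single expression, brace-less if/else works as expected. */
static int check_sign( int value ) {
    if ( value >= 0 )
        nocrt_assert( value < 1000 );
    else
        nocrt_assert( value > -1000 );
    return trap_count;
}
```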
Yep, "do { ... } while (0)" is the preferred way.

In fact, all statement-like macros should do that to be safe, especially when you are shipping them in a library for others to use, as you don't know in what kind of code they will be used.
Thanks everybody. I just didn't think about using the "do while trick" here.

@william: I completely missed the ", 0" at the end. I knew there was something possible with commas, but I didn't know what. Also, I never thought about "!!" forcing the value to 0 or 1, but it makes sense.

Why is there a cast to void in front of everything? Is it to prevent misusing the assert? For example, if ( assert( expression ) ) ... would not work?


Any thoughts about math C functions vs. intrinsics, as this was my primary question?

Not sure, but maybe that cast is there to prevent the compiler from generating a warning about the value not being used?

There is a reason to use the C functions if you need full IEEE compliance, meaning correct handling of denormals, NaNs and infinities. A lot of approximations and some intrinsics do not do that.
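To illustrate what handling those special values means in practice, here is a sketch comparing a typical shortcut with a guarded version. Both functions are written for this example (they are not taken from musl or cephes), and they assume float is IEEE 754 binary32 and int is at least 32 bits.

```c
/* Shortcut often seen in hand-rolled code: fine for "normal" values, but the cast
   to int is undefined behavior for NaN, infinities and anything outside the int range. */
static float naive_floorf( float x ) {
    int i = ( int ) x;
    return ( float ) ( i - ( x < ( float ) i ) );
}

static float guarded_floorf( float x ) {
    if ( x != x ) {
        return x; /* NaN is the only value not equal to itself. */
    }
    if ( x >= 8388608.0f || x <= -8388608.0f ) {
        return x; /* |x| >= 2^23 (including infinities): already an integer. */
    }
    float truncated = ( float ) ( int ) x; /* Safe now: x fits in an int. */
    if ( truncated == x ) {
        return x; /* Already integral; returning x keeps the sign of -0.0f. */
    }
    return ( truncated > x ) ? truncated - 1.0f : truncated;
}
```

If the inputs are known to stay in a small range, the naive version behaves the same; the extra branches only matter for the special values.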

If you don't expect your code to deal with such floats, then there's no reason not to use the intrinsics directly.
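As a sketch of what using the intrinsics directly could look like on x64 (the nocrt_* names and the NOCRT_SSE41 macro are invented for this example; the _mm_* intrinsics come from the compiler header smmintrin.h). One difference from the CRT that is worth knowing: the "nearest" mode of _mm_round_ss rounds halfway cases to the nearest even integer, while C's roundf rounds them away from zero.

```c
#include <smmintrin.h> /* Compiler header (not part of the CRT): SSE4.1 intrinsics. */

/* gcc/clang need SSE4.1 enabled, either with -msse4.1 or per function like this. */
#if defined( __GNUC__ )
#define NOCRT_SSE41 __attribute__( ( target( "sse4.1" ) ) )
#else
#define NOCRT_SSE41
#endif

NOCRT_SSE41 static float nocrt_floorf( float x ) {
    __m128 v = _mm_set_ss( x );
    return _mm_cvtss_f32( _mm_floor_ss( v, v ) );
}

NOCRT_SSE41 static float nocrt_ceilf( float x ) {
    __m128 v = _mm_set_ss( x );
    return _mm_cvtss_f32( _mm_ceil_ss( v, v ) );
}

/* Rounds halfway cases to even (2.5f gives 2.0f), unlike roundf (2.5f gives 3.0f). */
NOCRT_SSE41 static float nocrt_round_to_even( float x ) {
    __m128 v = _mm_set_ss( x );
    return _mm_cvtss_f32( _mm_round_ss( v, v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC ) );
}

/* sqrtss is plain SSE, which every x64 CPU has. */
static float nocrt_sqrtf( float x ) {
    return _mm_cvtss_f32( _mm_sqrt_ss( _mm_set_ss( x ) ) );
}
```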

musl libc functions sometimes rely on external assembly files with arcane (and slow) operations. For example, atan2 uses an x87 FPU instruction.
I suggest looking at the cephes code instead: http://www.netlib.org/cephes/ It has both float and double versions, and handles all denormal/NaN/inf values. You can always strip out that part of the code if you don't need the special value handling.

Edited by Mārtiņš Možeiko on
Thanks.

I looked into cephes and it's easy to get what I want. The code looks old though (pre-C89, I believe) and needs some (simple) changes.

mmozeiko
Not sure, but maybe that cast is there to prevent the compiler from generating a warning about the value not being used?


I tried without the cast to void, with /W4, and there were no warnings. The cast to void makes using the assert in an if produce an error, so I suppose that's it.

@mmozeiko Do you know of any resource or example code for compiling without the CRT on Linux? I found this article with some information, but it uses assembly, which I would like to avoid if possible, and I don't know if it works with x64.

So I can pass -nostdlib to the compiler and -emain to the linker to have no libs and have main as the entry point. But it segfaults, and to my understanding I need to call _exit or generate an exit syscall, but those are functions inside the stdlib. Or maybe there is another lib containing them? Also, how can I access the command line (argc and argv)?
If you want to completely avoid the C runtime library on Linux, you will need to write some assembly to do syscalls, like _exit. It does not need to be external assembly, it can be inline assembly. But it will be assembly, and thus it will be different for each target architecture you need to support.

Example of inline asm syscall for x86_64:
static inline long syscall(long num, long arg0, long arg1, long arg2, long arg3, long arg4, long arg5)
{
    register long r10 __asm__("r10") = arg3;
    register long r8 __asm__("r8") = arg4;
    register long r9 __asm__("r9") = arg5;

    long ret;
    __asm__ __volatile__(
        "syscall"
        : "=a"(ret)
        : "a"(num), "D"(arg0), "S"(arg1), "d"(arg2), "r"(r10), "r"(r8), "r"(r9)
        : "memory", "rcx", "r11");
    return ret;
}

You can create variants that take fewer arguments if you need such syscalls. 6 is the maximum x86_64 supports.
Here's a table with x86_64 syscalls numbers: https://filippo.io/linux-syscall-table/
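For instance, assuming x86_64 Linux and gcc or clang, a three-argument variant and two wrappers built on it might look like this. nocrt_syscall3, nocrt_write and nocrt_exit are names made up for the example; the syscall numbers (1 for write, 231 for exit_group) are the x86_64 ones from the table above.

```c
/* x86_64 Linux only. The kernel returns the result in rax; on failure it is
   a negative errno value (e.g. -9 for EBADF) instead of -1 plus errno. */
static inline long nocrt_syscall3( long num, long arg0, long arg1, long arg2 ) {
    long ret;
    __asm__ __volatile__(
        "syscall"
        : "=a"( ret )
        : "a"( num ), "D"( arg0 ), "S"( arg1 ), "d"( arg2 )
        : "memory", "rcx", "r11" );
    return ret;
}

/* write( fd, buffer, size ): syscall number 1 on x86_64. */
static long nocrt_write( int fd, const void* buffer, unsigned long size ) {
    return nocrt_syscall3( 1, fd, ( long ) buffer, ( long ) size );
}

/* exit_group( status ): syscall number 231 on x86_64. It does not return. */
static void nocrt_exit( int status ) {
    nocrt_syscall3( 231, status, 0, 0 );
    for ( ;; ) { } /* Not reached; keeps the compiler from assuming we return. */
}
```

A program entry point would then end with something like nocrt_exit( main( argc, argv ) ) instead of returning.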

The documentation for x86_64 syscall ABI is in this document: https://refspecs.linuxfoundation.org/elf/x86_64-abi-0.99.pdf ("A.2 AMD64 Linux Kernel Conventions").
To access the command line, read "3.4.1 Initial Stack and Register State" in the same document. It specifies what is on the stack when _start is called (_start is basically what WinMainCRTStartup is for MSVC). This includes the command-line arguments. You can access all of that with a bit of inline asm to get the rsp value. After that it is just some pointer arithmetic in C to get all the necessary information. This is also architecture specific: different architectures can have different conventions for this.
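The pointer arithmetic part can be sketched in plain C. This assumes the layout from section 3.4.1 (argc, then the argv pointers, a null pointer, then the envp pointers, each in a pointer-sized slot) and that initial_sp holds the value rsp had on entry to _start; nocrt_args and nocrt_parse_initial_stack are hypothetical names.

```c
typedef struct nocrt_args {
    int argc;
    char** argv;
    char** envp;
} nocrt_args;

/* initial_sp: the stack pointer value at process entry, viewed as pointer-sized slots. */
static nocrt_args nocrt_parse_initial_stack( char** initial_sp ) {
    nocrt_args args;
    args.argc = ( int ) ( long ) initial_sp[ 0 ]; /* First slot holds argc, not a pointer. */
    args.argv = initial_sp + 1;                   /* The argv pointers start right after it. */
    args.envp = args.argv + args.argc + 1;        /* Skip the argv entries and their null terminator. */
    return args;
}
```

Getting the initial rsp value itself still has to happen in assembly before any compiler-generated code runs.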

Make sure you understand that if you want to use OpenGL with Nvidia binary driver, then there is no way to avoid C runtime. It will use C runtime in your process - as it will load most of system X libraries into your process, including libc.

Edited by Mārtiņš Možeiko on
Thanks, I'll have a look.
[EDIT] Please read the rest of the posts before using any of the code in this post, as it contains errors.

I finally took the time to look into it.

Getting the exit (or exit_group) syscall was easy once I learnt a little bit about inline assembly. Getting the command line parameters was a bit harder. The ABI says that upon startup the last thing on the stack (what RSP points to, or ESP on x86) is the argc value, followed on the stack by argv and envp.

I used simple assembly to retrieve the value of RSP, but I couldn't figure out a way to make sure my code was the first instruction of the program. Depending on the optimization level and the code in the entry point function, the compiler can produce some assembly before my code, but not always. For example, when compiling with -O0, RBP is pushed on the stack and then set to the value of RSP (standard stack management), so I could use offsets from RBP to get what I wanted. But with optimization that was not the case: sometimes my code was the first instruction, sometimes not.

void no_crt_main( void ) {
    
    /* No guarantee that the assembly will be the first instruction, so RSP might have changed. */
    uint64_t* rsp; 
    __asm__ __volatile__ ( "movq %%rsp, %0" : "=r"( rsp ) ); 
    
    int argc = ( int ) *rsp; 
    rsp++;
    char** argv = ( char** ) rsp; /* The argv pointers are stored on the stack itself, so no dereference. */
    rsp += ( argc + 1 ); /* Skip the argv entries and their null terminator. */
    char** envp = ( char** ) rsp;
        
    int status = main( argc, argv );
    
    no_crt_exit( status );
}


@mmozeiko Is there a way to make sure the inline assembly is the first thing in the executable entry point function? I read the gcc inline assembly doc but didn't see anything like that.

So I ended up writing the program entry point (_start) completely in assembly, writing the actual program in C and linking the two together. I think the code is correct, but since it's the first time I've tried writing and compiling x86 assembly, I would be glad if someone could check it. I left some comments I wrote while learning the calling conventions.

x64 version:
global _start
extern main

section .text

_start:
    mov rdi, [rsp] ; argc
    lea rsi, [rsp + 8] ; argv
    lea rdx, [rsi + rdi * 8 + 8 ] ; envp

    mov rbp, rsp ; Set initial stack base pointer.

    call main

    mov rdi, rax ; Exit status value in rdi.
    mov rax, 231 ; exit_group
    syscall

; This is only for Linux, Windows uses different conventions.

; Stack grows downward from high address to low address.

; When the program starts:
; - rsp seems to be 16 bytes aligned (I didn't find documentation on that);
; - rsp points to the argument count value;
; - rsp + 8 points to argv;
; - argv + argc * 8 + 8 (argv is followed by a null pointer) points to envp.

; rsp points to the last thing that was pushed on the stack (if you write at rsp you overwrite the last value that was pushed).
; When you use "push" it decrements rsp, then stores the value at rsp.
; When you use "pop" it reads the value at rsp, then increments rsp.

; When calling a function, we put values in registers, or on the stack if we filled the available registers.
; So arguments are part of the caller stack frame.
; When putting arguments on the stack, if the value is less than 64 bits, we still push 64 bit of space but use only the required size.
; (that's what I observed, I didn't find documentation on that).
; That means that to access the value we still compute the address as a 64 bit value.
; Integer parameters can be passed in: rdi, rsi, rdx, rcx, r8, r9
; Floating point parameters can be passed in: xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7
; Compound types can pass fields in registers.
; Compound types of more than 4 (2?) 64-bit values are passed on the stack (that's my understanding).

; The call instruction will decrement rsp by 8 and then store the return address (rip + size of the call instruction) at rsp.
; Before a "call" instruction, rsp must be 16 bytes aligned.
; From x64 ABI: "The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on the stack) byte boundary."
; That means that when executing the first instruction of a function, rsp + 8 must be a multiple of 16 (since the return address was pushed on the stack).
; Upon entering a function, the first stack argument will be at rsp + 8, the second at rsp + 16...
; Does that mean that the stack alignment must be done before pushing arguments?
; rbp is generally used as the base stack pointer. So we want it to have the value of rsp upon entering a function.
; But we want to restore rbp before exiting the function, so we need to save its value by pushing it on the stack upon entering a function.
; So rbp is the value of rsp when you entered the function - 8.

; push rbp
; mov rbp, rsp

; Now the first stack argument is rsp + 16 or rbp + 16.

; We need to restore rbp before exiting the function. We can use mov (or pop if rsp is at the right place).
; We also need to set rsp to point at the return address before calling "ret".
; ret will increment rsp after reading the return address.

; rbp, rbx, r12, r13, r14, r15, belong to the caller, the called function needs to save and restore those values.
; The ABI doesn't list rsp as belonging to the caller, but it needs to be restored for the program to work (my interpretation).

; If the return value needs to be put on the stack, the caller reserves space and passes the address of that space in rdi. rax will contain a copy of rdi when the function returns.
; Integer values can be returned in rax or rdx:rax (high:low).
; Floats can be returned in xmm0 or xmm1:xmm0 (high:low).

#! /bin/bash

clang main.c -nostdlib -g -O0 -c -o main.o
nasm -felf64 no_crt_main.asm -o no_crt_main.o
ld main.o no_crt_main.o -nostdlib -o no_crt.bin


global _start
extern main

section .text

_start:
    mov edi, [esp] ; argc
    lea esi, [esp + 4] ; argv
    lea edx, [esi + edi * 4 + 4] ; envp

    mov ebp, esp ; Set initial stack base pointer.

    ; Push argc, argv, envp right to left.
    push edx
    push esi
    push edi
    call main

    mov ebx, eax ; Exit status value in ebx (first argument register for int 0x80 syscalls).
    mov eax, 252 ; exit_group
    int 0x80 ; syscall

; ebp, ebx, edi, esi, esp belong to the caller, the called function needs to save and restore those values.
; The x86 ABI actually specifies that arguments pushed on the stack that are less than a word (32 bits) must be tail padded to be a multiple of a word.
; The caller must reserve space in its stack frame for returning struct/union and pass the pointer as the first argument (the last pushed on the stack).
; The callee must remove that address from the stack before returning.
; ABI example:
; pop eax => eax contains the return address, esp points to the struct return pointer
; xchg [esp], eax => copy the content of eax (return address) at esp, copy memory pointed by esp (struct return pointer) in eax
; push ebp => back to normal and the struct return pointer isn't on the stack anymore.
; ...
; mov eax, [ebp - 4]
; leave
; ret

; float and double are passed on the stack. double don't need to be 8 bytes aligned.

#! /bin/bash

clang main.c -m32 -nostdlib -g -O0 -c -o main.o
nasm -felf32 no_crt_main_x32.asm -o no_crt_main.o
ld -m elf_i386 main.o no_crt_main.o -nostdlib -o no_crt.bin

Edited by Simon Anciaux on
Yeah, don't do that in C. It will break depending on optimization levels. Have a small assembly startup like you did.
But you don't need to put it into an external assembly file. You can do it as inline assembly directly in the C code just fine:

int main(int argc, char** argv, char** envp);

__asm__(
    ".global _start              \n"
    "_start:                     \n"
    "  movq 0(%rsp), %rdi        \n"
    "  leaq 8(%rsp), %rsi        \n"
    "  leaq 8(%rsi,%rdi,8), %rdx \n"

    "  movq %rsp, %rbp \n"  // not sure why you do this. ABI doc seems to say that rbp should be set to 0

    "  call main \n"
    
    "  movq %rax, %rdi \n"
    "  movl $231, %eax \n"
    "  syscall         \n"
);

Change to intel asm if you prefer that syntax.
You can easily #ifdef this part depending on architecture.

; When putting arguments on the stack, if the value is less than 64 bits, we still push 64 bit of space but use only the required size.
; (that's what I observed, I didn't find documentation on that).

All this is documented in the System V ABI for AMD64, in the "3.2.3 Parameter Passing" section. It has all the details on how to pass arguments to functions, including objects larger than 64 bits.

; rbp is generally used as the base stack pointer. So we want it to have the value of rsp upon enterring a function.
You can do that, or you can ignore this and use rbp as a general-purpose register for your own needs. There's no requirement to use it as the base stack pointer.

Edited by Mārtiņš Možeiko on
Thanks. I just didn't think about using a "global" asm block in a C file, but now it seems obvious.

mmozeiko
movq %rsp, %rbp // not sure why you do this. ABI doc seems to say that rbp should be set to 0.


Upon startup rbp seems to be 0, and it took me quite some time to figure things out (I thought at some point that the value of rbp was important for the function that was going to be called). While I was testing things, setting rbp seemed to make GDB able to display stack information. That wasn't actually the case, I was using GDB wrong. And then I forgot about the doc saying to set it to zero to mark the deepest stack frame. Thanks for reminding me.

; When putting arguments on the stack, if the value is less than 64 bits, we still push 64 bit of space but use only the required size.
; (that's what I observed, I didn't find documentation on that).


My concern there was about types smaller than 64 bits. For example, if my 7th argument was an int, do I need to push only 32 bits on the stack? I thought the doc didn't address that, but it kind of does, it's just a bit convoluted in my opinion (or I missed the obvious thing). But it makes sense to always push a 64-bit value. The x86 ABI, on the other hand, clearly states that values smaller than 32 bits should be padded.

mmozeiko
; rbp is generally used as the base stack pointer. So we want it to have the value of rsp upon entering a function.
You can do that, or you can ignore this and use rbp as a general-purpose register for your own needs. There's no requirement to use it as the base stack pointer.


Yes, thanks. It was easier for me to understand things using rbp at first. And I think when you compile with optimizations off, functions always start with push rbp; mov rbp, rsp;.

Can I just use Intel syntax directly in inline asm? I thought I read in the GDB doc that I needed some setup (compiler flags or defines), and I didn't want to figure that out at the moment. Do you have a reason for using either AT&T or Intel syntax? I was more used to Intel syntax because of the Visual Studio disassembly, but I don't really mind using one or the other.
Yes, you can use intel syntax in gcc inline asm like this:
__asm__(
    ".intel_syntax noprefix\n"

    ".global _start                  \n"
    "_start:                         \n"
    "  mov rdi, [rsp]                \n"
    "  lea rsi, [rsp + 8]            \n"
    "  lea rdx, [rsi + rdi * 8 + 8 ] \n"

    "  xor rbp, rbp \n"

    "  call main \n"

    "  mov rdi, rax \n"
    "  mov rax, 231 \n"
    "  syscall      \n"

    ".att_syntax prefix\n" // switches back to AT&T for rest of the code
);


Edited by Mārtiņš Možeiko on
Thanks again.

For anyone who's reading this thread, the prefix and noprefix "keywords" specify if you need to put '%' in front of registers.