I've implemented some pcg variants in C to include in my codebase: one with a 64bit state generating 32bit values, and one with a 128bit state generating 64bit values.
The implementation from the site (pcg_setseq_128_xsl_rr_64_random_r) uses the clang/gcc __int128 type, which I don't believe can be used with MSVC. So I implemented the few 128bit math operations required (multiply, add, shift left and right) using two uint64_t.
The code was working, but when I did a simple performance test, generating 2*u32_max random numbers, my code was about 10 seconds slower than the official code (23.3s for the official implementation using __int128, 33.8s for my 128bit code), using clang 12.0.1 with -O2. Compiling with MSVC (19.29) was a lot worse, about 93.4s, also with -O2 (and MSVC is my primary target).
After looking at the disassembly, I noticed that the 128bit math functions were inlined in the main function, but the register values were spilled to the stack instead of being used directly.
; This is from clang
;...
mov qword ptr [rsp+0x20], r11
mov qword ptr [rsp+0x38], r8
mov qword ptr [rsp+0x28], r13
mov qword ptr [rsp+0x30], rsi
mov rcx, qword ptr [rsp+0x20]
imul rcx, qword ptr [rsp+0x30]
mov rsi, qword ptr [rsp+0x38]
imul rsi, qword ptr [rsp+0x28]
mov rax, qword ptr [rsp+0x20]
mul qword ptr [rsp+0x28]
;...
I changed the functions (multiply and add) to be macros instead, and the generated code (both clang and MSVC) uses only registers and is much closer to the __int128 code. The performance is a lot closer too (__int128 => 23.7s, my code => 25.5s).
My question is: is there a particular reason why the compilers spilled the registers to the stack before using them? Is there a way to keep the function version but make it generate better assembly?
Here is the code. You'll need pcg_variants.h and pcg-advance-128.c. In random_64_get
you can change which version of the 128bit multiply and add to use.
/* clang pcg_test.c -opcg_test.exe -O2 -fno-exceptions -ffast-math -fno-rtti -Werror -g -msse2 -msse4.1 -gcodeview -gno-column-info -fdiagnostics-absolute-paths */

#include <windows.h>
#include <stdint.h>
#include <intrin.h>
#include <stdio.h>

typedef uint8_t u8;
typedef uint32_t u32;
typedef uint64_t u64;
typedef int64_t s64;
typedef double r64;

#define false 0
#define true 1
#define u32_max 0xffffffff

#define safe_macro( macro ) do { macro; } while ( 0 )

#if defined ( __clang__ )
#define _assert( expression ) safe_macro( if ( !( expression ) ) { __builtin_debugtrap( ); } )
#else
#define _assert( expression ) safe_macro( if ( !( expression ) ) { __debugbreak( ); } )
#endif

#if defined ( __clang__ )
/*
 * PCG Random Number Generation for C.
 *
 * Copyright 2014-2019 Melissa O'Neill <[email protected]>,
 * and the PCG Project contributors.
 *
 * SPDX-License-Identifier: (Apache-2.0 OR MIT)
 *
 * Licensed under the Apache License, Version 2.0 (provided in
 * LICENSE-APACHE.txt and at http://www.apache.org/licenses/LICENSE-2.0)
 * or under the MIT license (provided in LICENSE-MIT.txt and at
 * http://opensource.org/licenses/MIT), at your option. This file may not
 * be copied, modified, or distributed except according to those terms.
 *
 * Distributed on an "AS IS" BASIS, WITHOUT WARRANTY OF ANY KIND, either
 * express or implied. See your chosen license for details.
 *
 * For additional information about the PCG random number generation scheme,
 * visit http://www.pcg-random.org/.
 */
#include "pcg_variants.h"
#include "pcg-advance-128.c"
#endif

#define perf_get_frequency( u64_ptr ) QueryPerformanceFrequency( ( LARGE_INTEGER* ) u64_ptr )
#define perf_get_time( u64_ptr ) QueryPerformanceCounter( ( LARGE_INTEGER* ) u64_ptr )

/* == 128 bit math == */

typedef struct u128_t {
    u64 low, high;
} u128_t;

#define u128_make( low, high ) ( u128_t ) { low, high }

#define math_u128_mul_inline( r, a, b ) { u128_t t; t.low = _umul128( a.low, b.low, &t.high ); t.high += ( ( a.low * b.high ) + ( a.high * b.low ) ); r = t; }

u128_t math_u128_mul( volatile u128_t a, volatile u128_t b ) {

    /* NOTE simon: The idea is to do a similar thing as written multiplication: multiply each
       digit from a with each digit of b and add each multiplication result. But we use 64bit
       limbs instead of digits.
       a.low * b.low is a 64bit * 64bit multiply where the result is 128bits. We keep all
       128bits by using the `mul` instruction, which takes rax (implied) and another register
       and stores the result in rdx (high 64bits) and rax (low 64bits).
       We then multiply a.low with b.high, which is a regular 64bit multiply because we can't
       store the bits that would overflow (bits that would be outside of the 128bits). We do a
       similar thing with a.high and b.low.
       We don't need to multiply a.high with b.high as that can only produce bits higher than
       the 128bits. e.g. if we used 2-digit numbers, the smallest multiply we could do would be
       10 * 10, which is 100 and requires 3 digits.
       The result is the sum of those partial results. */

    u128_t result;
    u64 high_1 = a.low * b.high;
    u64 high_2 = a.high * b.low;
    u64 high_3;
    result.low = _umul128( a.low, b.low, &high_3 );
    result.high = high_1 + high_2 + high_3;

    return result;
}

#define math_u128_add_inline( r, a, b ) { u128_t t; t.low = a.low + b.low; t.high = a.high + b.high + ( ( t.low < a.low ) ? 1 : 0 ); r = t; }

u128_t math_u128_add( u128_t a, u128_t b ) {

    u128_t result;
    result.low = a.low + b.low;
    result.high = a.high + b.high + ( ( result.low < a.low ) ? 1 : 0 );

    return result;
}

u128_t math_u128_left_shift( u128_t a, u8 bit_count ) {

    u128_t result;

    if ( bit_count <= 64 ) {
        result.low = a.low << bit_count;
        result.high = a.high << bit_count;
        result.high |= a.low >> ( 64 - bit_count );
    } else {
        result.low = 0;
        result.high = a.low << ( bit_count - 64 );
    }

    return result;
}

u128_t math_u128_right_shift( u128_t a, u8 bit_count ) {

    u128_t result;

    if ( bit_count <= 64 ) {
        result.high = a.high >> bit_count;
        result.low = a.low >> bit_count;
        result.low |= a.high << ( 64 - bit_count );
    } else {
        result.high = 0;
        result.low = a.high >> ( bit_count - 64 );
    }

    return result;
}

/* == Custom pcg implementation. */

#define random_64_multiplier ( u128_t ) { /* low */ 4865540595714422341ull, /* high */ 2549297995355413924ull }

typedef struct random_64_t {
    u128_t state;
    u128_t increment;
} random_64_t;

random_64_t random_64_make( u128_t seed, u128_t sequence ) {

    random_64_t result = { 0 };

    result.increment = math_u128_left_shift( sequence, 1 );
    result.increment.low |= 1;

    result.state = math_u128_add( result.increment, seed );
    result.state = math_u128_mul( result.state, random_64_multiplier );
    result.state = math_u128_add( result.state, result.increment );

    return result;
}

u64 random_64_get( random_64_t* random_64 ) {

    /* NOTE simon: 128 xsl rr 64 */
    u64 result = 0;

    random_64->state = math_u128_mul( random_64->state, random_64_multiplier );
    // math_u128_mul_inline( random_64->state, random_64->state, random_64_multiplier );
    random_64->state = math_u128_add( random_64->state, random_64->increment );
    // math_u128_add_inline( random_64->state, random_64->state, random_64->increment );

#if 0
    u8 rotation = ( u8 ) math_u128_right_shift( random_64->state, 122 ).low;
    u64 temp = math_u128_right_shift( random_64->state, 64 ).low;
    temp ^= random_64->state.low;
#else
    /* NOTE simon: 2^6 => 64 bit rotation. */
    u8 rotation = ( u8 ) ( random_64->state.high >> ( 64 - 6 ) );
    u64 temp = random_64->state.high ^ random_64->state.low;
#endif

    result = ( temp >> rotation ) | ( temp << ( 64 - rotation ) );

    return result;
}

void random_64_advance( random_64_t* random_64, u128_t delta ) {

    u128_t multiplier = random_64_multiplier;
    u128_t increment = random_64->increment;
    u128_t multiply_accumulator = u128_make( 1, 0 );
    u128_t increment_accumulator = u128_make( 0, 0 );

    while ( delta.low > 0 || delta.high > 0 ) {

        if ( delta.low & 1 ) {
            multiply_accumulator = math_u128_mul( multiply_accumulator, multiplier );
            increment_accumulator = math_u128_mul( increment_accumulator, multiplier );
            increment_accumulator = math_u128_add( increment_accumulator, increment );
        }

        u128_t temp = math_u128_add( multiplier, u128_make( 1, 0 ) );
        increment = math_u128_mul( temp, increment );
        multiplier = math_u128_mul( multiplier, multiplier );
        delta = math_u128_right_shift( delta, 1 ); /* NOTE simon: delta /= 2. */
    }

    random_64->state = math_u128_mul( multiply_accumulator, random_64->state );
    random_64->state = math_u128_add( random_64->state, increment_accumulator );
}

u64 random_64_get_range( random_64_t* random_64, u64 min, u64 max ) {

    u64 result = 0;
    u64 range = max - min;
    u64 threshold = ( ( u64 ) ( -( s64 ) range ) ) % range;

    while ( true ) {

        u64 number = random_64_get( random_64 );

        if ( number >= threshold ) {
            result = min + ( number % range );
            break;
        }
    }

    return result;
}

int main( int argc, char** argv ) {

    u64 frequency;
    perf_get_frequency( &frequency );

    u64 sum = 0;

    struct pcg_state_setseq_128 rng_1;
    // pcg64_srandom_r( &rng_1, ( pcg128_t ) &rng_1, ( pcg128_t ) &rng_1 );
    pcg64_srandom_r( &rng_1, ( pcg128_t ) 0x000000DF8F19FBD0, ( pcg128_t ) 0x000000DF8F19FBD0 );

    // random_64_t rng_2 = random_64_make( u128_make( ( u64 ) &rng_1, 0 ), u128_make( ( u64 ) &rng_1, 0 ) );
    random_64_t rng_2 = random_64_make( u128_make( ( u64 ) 0x000000DF8F19FBD0, 0 ), u128_make( ( u64 ) 0x000000DF8F19FBD0, 0 ) );

    /* Some tests.
    u64 a_high = ( u64 ) ( rng_1.state >> 64 );
    u64 a_low = ( u64 ) ( rng_1.state );
    u64 b_high = ( u64 ) ( rng_1.inc >> 64 );
    u64 b_low = ( u64 ) ( rng_1.inc );

    _assert( a_high == rng_2.state.high );
    _assert( a_low == rng_2.state.low );
    _assert( b_high == rng_2.increment.high );
    _assert( b_low == rng_2.increment.low );

    pcg64_advance_r( &rng_1, 12345 );
    pcg64_advance_r( &rng_1, 145 );
    pcg64_advance_r( &rng_1, 1235 );
    pcg64_advance_r( &rng_1, 1245 );
    pcg64_advance_r( &rng_1, 14839255 );

    random_64_advance( &rng_2, u128_make( 12345, 0 ) );
    random_64_advance( &rng_2, u128_make( 145, 0 ) );
    random_64_advance( &rng_2, u128_make( 1235, 0 ) );
    random_64_advance( &rng_2, u128_make( 1245, 0 ) );
    random_64_advance( &rng_2, u128_make( 14839255, 0 ) );

    a_high = ( u64 ) ( rng_1.state >> 64 );
    a_low = ( u64 ) ( rng_1.state );
    b_high = ( u64 ) ( rng_1.inc >> 64 );
    b_low = ( u64 ) ( rng_1.inc );

    _assert( a_high == rng_2.state.high );
    _assert( a_low == rng_2.state.low );
    _assert( b_high == rng_2.increment.high );
    _assert( b_low == rng_2.increment.low );
    */

    for ( u32 i = 0; i < u32_max; i++ ) {
        uint64_t a = pcg64_random_r( &rng_1 );
        uint64_t b = random_64_get( &rng_2 );
        _assert( a == b );
        uint64_t c = a + b;
        _assert( c == a * 2 );
        sum += c;
    }

    /* Time */
    // pcg64_srandom_r( &rng_1, ( pcg128_t ) &rng_1, ( pcg128_t ) &rng_1 );
    pcg64_srandom_r( &rng_1, ( pcg128_t ) 0x000000DF8F19FBD0, ( pcg128_t ) 0x000000DF8F19FBD0 );

    u64 pcg_start;
    perf_get_time( &pcg_start );

    for ( u32 i = 0; i < u32_max; i++ ) {
        uint64_t a = pcg64_random_r( &rng_1 );
        uint64_t b = pcg64_random_r( &rng_1 );
        uint64_t c = a + b;
        sum += c;
    }

    u64 pcg_end;
    perf_get_time( &pcg_end );

    r64 pcg_time = ( r64 ) ( pcg_end - pcg_start ) / ( r64 ) frequency;
    printf( "pcg: %f\n", pcg_time );

    // rng_2 = random_64_make( u128_make( ( u64 ) &rng_1, 0 ), u128_make( ( u64 ) &rng_1, 0 ) );
    rng_2 = random_64_make( u128_make( ( u64 ) 0x000000DF8F19FBD0, 0 ), u128_make( ( u64 ) 0x000000DF8F19FBD0, 0 ) );

    u64 custom_start;
    perf_get_time( &custom_start );

    for ( u32 i = 0; i < u32_max; i++ ) {
        uint64_t a = random_64_get( &rng_2 );
        uint64_t b = random_64_get( &rng_2 );
        uint64_t c = a + b;
        sum += c;
    }

    u64 custom_end;
    perf_get_time( &custom_end );

    r64 custom_time = ( r64 ) ( custom_end - custom_start ) / ( r64 ) frequency;
    printf( "custom: %f\n", custom_time );

    return ( u32 ) sum;
}
That happens because you declared the math_u128_mul arguments with the volatile qualifier. volatile means you are telling the compiler "please do not optimize these variables, always keep them in memory, and every access to them must be a memory load & store". Don't do volatile.
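For illustration, the same function with plain value parameters - the compiler is then free to keep a and b in registers:

u128_t math_u128_mul( u128_t a, u128_t b ) {

    u128_t result;
    u64 high_1 = a.low * b.high;
    u64 high_2 = a.high * b.low;
    u64 high_3;
    result.low = _umul128( a.low, b.low, &high_3 );
    result.high = high_1 + high_2 + high_3;

    return result;
}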
Also, you can get better codegen for your add/shift operations on MSVC by using intrinsics similar to _umul128 - use _addcarry_u64 and __shiftright128.
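For example, a sketch of the add and right shift with those intrinsics (untested; assumes <intrin.h> is included as in the posted code, and note __shiftright128 only covers shift counts below 64, so the >= 64 branch stays):

u128_t math_u128_add_intrin( u128_t a, u128_t b ) {

    u128_t result;
    /* _addcarry_u64 returns the carry out, which is fed into the high part. */
    u8 carry = _addcarry_u64( 0, a.low, b.low, &result.low );
    _addcarry_u64( carry, a.high, b.high, &result.high );

    return result;
}

u128_t math_u128_right_shift_intrin( u128_t a, u8 bit_count ) {

    u128_t result;

    if ( bit_count < 64 ) {
        /* shrd: shifts a.high:a.low right and returns the low 64 bits. */
        result.low = __shiftright128( a.low, a.high, bit_count );
        result.high = a.high >> bit_count;
    } else {
        result.low = a.high >> ( bit_count - 64 );
        result.high = 0;
    }

    return result;
}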
Here's my pcg code that implements these operations for all kinds of compilers - gcc/clang, MSVC for x64 and 64-bit ARM, and even a generic 32-bit fallback.
https://gist.github.com/mmozeiko/1561361cd4105749f80bb0b9223e9db8#file-pcg64-h
I feel completely stupid.
I added those volatile qualifiers when I was trying stuff in godbolt, precisely to try to prevent the compiler from inlining the function, and I never looked back at the function signature. What a way to waste half a day.
Thanks for the code, I'll look into that tomorrow.
I tried to use _addcarry_u64, but most of the time the code is slower than the basic code I had. Note that I'm using an i7 860, which is quite old, so it might be faster on more recent processors.
The only time the speed is the same is when I have the test loop that verifies that my generator produces the same numbers as the official pcg code (and when using a single _addcarry_u64). The generated code in that case is a bit different for some reason (the correctness loop and timing loop are different, and I changed the timing loops to only generate one number per iteration and add it to the sum).
u128_t math_u128_add( u128_t a, u128_t b ) {

    /* Using __int128 add I have 13.1s */
#if 1
    // 12.7s
    u128_t result;
    result.low = a.low + b.low;
    result.high = a.high + b.high + ( ( result.low < a.low ) ? 1 : 0 );
#elif 0
    // 13.7s , when testing correctness before 12.7s
    u128_t result;
    result.low = a.low + b.low;
    _addcarry_u64( ( result.low < a.low ) ? 1 : 0, a.high, b.high, &result.high );
#elif 0
    // 13.4s
    u128_t result;
    _addcarry_u64( _addcarry_u64( 0, a.low, b.low, &result.low ), a.high, b.high, &result.high );
#else
    // 13.4s
    u128_t result;
    u8 carry = _addcarry_u64( 0, a.low, b.low, &result.low );
    _addcarry_u64( carry, a.high, b.high, &result.high );
#endif

    return result;
}
; naive code
mov rax, rsi
mul r12
imul rsi, r14
imul rdi, r12
add rdi, rsi
add rdi, rdx
lea rsi, ptr [rax+r15*1]
xor ecx, ecx
cmp rax, r9
setnbe cl
add rdi, rcx
mov rcx, rdi
shr rcx, 0x3a
mov rax, rdi
xor rax, rsi
ror rax, cl

; single addcarry_u64 without correctness
mov rax, rsi
mul r12
imul rsi, r14
imul rdi, r12
add rdi, rsi
add rdi, rdx
lea rsi, ptr [rax+r15*1]
cmp rax, r9
setnbe al
add al, 0xff
adc rdi, 0x0
mov rcx, rdi
shr rcx, 0x3a
mov rax, rdi
xor rax, rsi
ror rax, cl

; single addcarry_u64 with correctness
mov rax, r12
mul rsi
imul r12, r8
imul r15, rsi
add r15, r12
add r15, rdx
lea r12, ptr [rax+r9*1]
cmp rdi, rax
adc r15, 0x0
mov rcx, r15
shr rcx, 0x3a
mov rax, r15
xor rax, r12
ror rax, cl

; double __addcarry_u64
mov rax, rcx
mul r8
imul rcx, r9
imul rsi, r8
add rsi, rcx
add rsi, rdx
add rax, r10
adc rsi, 0x0
mov rcx, rsi
shr rcx, 0x3a
mov rdx, rsi
xor rdx, rax
ror rdx, cl
That was with clang. With MSVC the naive code is the fastest overall (12.3s) and has the cleanest assembly (for MSVC). When using _addcarry_u64 the code was 4 times slower (44.2s) and the assembly contains some stack reads and stores as well as some SSE register use.
; msvc naive code
imul r8, rsi
mov rax, rsi
mul r9
imul r9, rbp
add r8, rdx
add r8, r9
lea r9, ptr [rdi+rax*1]
cmp r9, rax
mov rdx, r9
adc r8, r11
xor rdx, r8
mov rcx, r8
shr rcx, 0x3a
ror rdx, cl

; msvc with __addcarry_u64
mov rcx, qword ptr [rbp-0x48]
mov rax, rdi
mul rcx
imul rcx, rsi
mov r8, rax
mov rax, qword ptr [rbp-0x40]
imul rax, rdi
add rdx, rax
add rdx, rcx
add r8, r11
mov qword ptr [rbp-0x58], r8
adc rdx, r10
xor r8, rdx
mov qword ptr [rbp-0x50], rdx
movups xmm0, xmmword ptr [rbp-0x58]
mov rcx, rdx
shr rcx, 0x3a
ror r8, cl
I also found out that in the multiply function, the order of the statements affects performance.
u128_t math_u128_mul( u128_t a, u128_t b ) {

    u128_t result;

#if 1
    /* This is 0.8s faster than the alternative with clang. */
    result.low = _umul128( a.low, b.low, &result.high );
    u64 high_1 = a.low * b.high;
    u64 high_2 = a.high * b.low;
    result.high += high_1 + high_2;
#else
    u64 high_1 = a.low * b.high;
    u64 high_2 = a.high * b.low;
    result.low = _umul128( a.low, b.low, &result.high );
    result.high += high_1 + high_2;
#endif

    return result;
}
Could you explain, or point me to an explanation of, the code for generating floats and doubles? I mean, I understand what the code is doing, but not exactly what it does to the number. It looks similar to what the PCG site mentions, double d = ldexp(pcg32_random_r(&myrng), -32); (which uses double so it can store a u32 precisely), but 0x1.0p-24f would be 2^-36?
static inline float pcg64_nextf(pcg64* rng) {
    uint64_t x = pcg64_next(rng);
    return (float)(int32_t)(x >> 40) * 0x1.0p-24f;
}

static inline double pcg64_nextd(pcg64* rng) {
    uint64_t x = pcg64_next(rng);
    return (double)(int64_t)(x >> 11) * 0x1.0p-53;
}
What I'm understanding (correct me if I'm wrong) is that you are getting 24 bits (mantissa + sign bit?) and dividing them by 2^36? It seems that we should use 0x1.0e-24f to divide by 2^24? p means the exponent is a hexadecimal value, right? Why 24 bits and not 23 (the size of the mantissa), so as not to use the sign bit? I assume we want numbers between 0 and 1, both included? Is there a reason for using the upper bits instead of the bottom ones?
[EDIT] I misunderstood the meaning of p
in floating point numbers. The rest of the question stands though.
I wrote this: https://gist.github.com/mmozeiko/37010d7352cffcdf535624f783a5944a and then ran it using Google's benchmark library.
With clang -O2:
Benchmark                Time             CPU   Iterations
-------------------------------------------------------------
BenchMy               1.66 ns         1.67 ns    448000000
BenchMyNoAddIntrin    1.81 ns         1.80 ns    373333333
With MSVC -O2:
Benchmark                Time             CPU   Iterations
-------------------------------------------------------------
BenchMy               10.1 ns         10.3 ns     74666667
BenchMyNoAddIntrin    10.2 ns         10.3 ns     74666667
So pretty much the same, or a bit better when using _addcarry_u64.
The code you posted looks very strange - why is there an xmm0 operation in there at all? There shouldn't be any: https://godbolt.org/z/voKETdnos
As for the float versions - I'm not sure where you are getting 2^-36. 0x1.0p-24f is 2^-24. All that code is doing is generating a 24-bit uniform integer in [0..2^24) and then dividing it by 2^24 to get the [0..1) range. 24 bits is the max you can take while guaranteeing a uniform float. Taking more bits would make the float use a non-zero exponent, which means there would be more values closer to 0 than to 1. You can take fewer bits, of course. Generating 10 bits of integer and multiplying by 0x1.0p-10f would also work, it just generates fewer possible float values.
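As a sketch of that last point (hypothetical helper name, same shape as pcg64_nextf above, just with fewer bits):

// 10-bit variant: only 2^10 evenly spaced values in [0..1)
static inline float pcg64_nextf_10bit(pcg64* rng) {
    uint64_t x = pcg64_next(rng);
    return (float)(uint32_t)(x >> 54) * 0x1.0p-10f; // top 10 of 64 bits
}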
It is 24 bits for the mantissa, not 23, because the mantissa stores only the bits after the first 1 bit. The first 1 bit is implicit, so there's no point in storing it in the actual value; thus only 23 bits are stored for the 24-bit mantissa value in a float. Similarly, for double, 52 bits are stored for the 53-bit mantissa value.
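For illustration (not from the thread), a tiny standalone check of that layout:

#include <stdint.h>
#include <string.h>
#include <assert.h>

int main(void) {
    /* 1.5f: sign 0, biased exponent 127, stored mantissa 0x400000 (the leading 1 is implicit),
       so the value is (1 + 0x400000 / 2^23) * 2^(127 - 127) = 1.5 and the bit pattern is 0x3FC00000. */
    float f = 1.5f;
    uint32_t bits;
    memcpy(&bits, &f, 4);
    assert(bits == 0x3FC00000);
    return 0;
}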
Lower bits are as good as upper bits, no special reason. A minor reason is that the shift operation encodes more compactly - taking the lower bits would require an and with a 32-bit constant, which is an extra 4 bytes (or 8 bytes for double) to encode in the x86 opcode. Also (not for this random generator, but for other random generators) the upper bits might provide higher quality randomness than the lower ones, which is why you typically take the upper ones.
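For what it's worth, the two forms side by side (illustration only, assuming x is a full 64-bit random value):

#include <stdint.h>

/* upper 24 bits: a single shift, the shift count fits in one immediate byte */
static inline uint32_t take24_upper(uint64_t x) { return (uint32_t)(x >> 40); }

/* lower 24 bits: needs an and with a 32-bit immediate, a few more bytes to encode */
static inline uint32_t take24_lower(uint64_t x) { return (uint32_t)(x & 0xFFFFFF); }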
Using the ldexp[f] function often generates a lot more code, because that function is meant for arbitrary values. It potentially won't be inlined, and it validates its input with extra branches to check whether the input is in the proper range. Here's a typical implementation of such a function: https://git.musl-libc.org/cgit/musl/tree/src/math/scalbnf.c - ldexpf(x,n) directly calls scalbnf(x,n).
In my case, one integer shift, one integer-to-float cast, and one float multiply is all that's needed - very simple. As long as you have a reasonably efficient int-to-float operation for your target architecture, this will be very fast. It can be easily expressed in SIMD as well.
The code you posted looks very strange - why is there an xmm0 operation in there at all? There shouldn't be any: https://godbolt.org/z/voKETdnos
Here is the code on godbolt: https://godbolt.org/z/vv7jz54v5 If you can see what I'm doing wrong, I'm interested. Changing the #if in math_u128_add generates different code.
As for the float versions - I'm not sure where you are getting 2^-36. 0x1.0p-24f is 2^-24.
I edited the post after writing it to say that I misunderstood the meaning of p (maybe you saw it before the edit). I had never used hexadecimal to write floating point numbers, and after a web search I misread the explanation and thought that 0x1.0p-24f meant that the exponent 24 was a hexadecimal number, meaning 36 in decimal.
It is 24 bits for the mantissa, not 23, because the mantissa stores only the bits after the first 1 bit. The first 1 bit is implicit, so there's no point in storing it in the actual value; thus only 23 bits are stored for the 24-bit mantissa value in a float. Similarly, for double, 52 bits are stored for the 53-bit mantissa value.
I knew that, I should have figured it out. Sorry.
Lower bits are as good as upper bits, no special reason. A minor reason is that the shift operation encodes more compactly - taking the lower bits would require an and with a 32-bit constant, which is an extra 4 bytes (or 8 bytes for double) to encode in the x86 opcode. Also (not for this random generator, but for other random generators) the upper bits might provide higher quality randomness than the lower ones, which is why you typically take the upper ones.
Isn't modifying the source bits changing the uniform "distribution"? For example, if we generate a 32bit integer and shift it right by 8 bits, then any number generated up to 255 will generate a 0, so 0 would have 255 more chances to be generated?
If that's correct, and assuming we don't care about speed, couldn't we use the pcg32_range function to generate numbers between 0 and 2^24 and then do the divide?
After thinking more about this, I think that every number gets 255 more chances, so it's fine. So the two previous paragraphs are wrong.
Another question: if I want floats including 1, is dividing by ( float ) 0xffffff ok? I assume it will be slower than the multiply, but what about in terms of uniformity of the generated numbers?
Seems the compiler really tries to keep your u128_t type in an SSE register, because two uint64_t fit there. The code it generates for that is terrible though - lots of unnecessary moves between GPR and SSE registers. You really don't want such codegen.
After thinking more about this, I think that every number gets 255 more chances, so it's fine. So the two previous paragraphs are wrong.
Think of it this way - if you have a uniform 2^n random number and want to get a uniform 2^k random number (k <= n), then you can simply take k bits from that n-bit number, anywhere. All it does is map a 2^(n-k)-sized bucket of inputs to each k-bit number. You only need to use the range() function when 2^n is not divisible by the range value.
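A tiny exhaustive check of that bucket argument (illustration only, with small n and k so it can sweep every input):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* n = 8, k = 3: taking the top 3 bits maps 2^(8-3) = 32 inputs to each output. */
    uint32_t counts[8] = { 0 };
    for (uint32_t x = 0; x < 256; x++) {
        counts[x >> 5] += 1;
    }
    for (uint32_t i = 0; i < 8; i++) {
        printf("%u: %u\n", i, counts[i]); /* prints 32 for every bucket */
    }
    return 0;
}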
Another question: if I want floats including 1, is dividing by ( float ) 0xffffff ok? I assume it will be slower than the multiply, but what about in terms of uniformity of the generated numbers?
The first question I would ask is: does it really matter to sometimes, with super low probability, get the value 1.0 exactly? For double that will be a 2^-53 probability. That's much lower than winning a million-dollar lottery. And even without the exact 1.0 value there will be so many other values super close to it - 0.999999999999999, 0.999999999999998, 0.999999999999997, ... - all these numbers are closer to 1.0 than the size of an atom in meters, so you might as well think of them as the value 1.0.
But if you really want it - I would not do it with division, because that won't produce uniform output. The result of the division will be all kinds of floats that are not exactly representable, so they will snap to the closest float - and you'll have a bunch of floats closer to each other, a bunch of them further apart, etc. Instead I would generate the mantissa bits exactly, so you have exact uniformity.
For floats:
1.0 is 0x3f800000
0x3fYxxxxx // (where Y>=8) numbers between 1.0 and 2.0
2.0 is 0x40000000
so generate a random number in this range, inclusive, and subtract 1.0.
static inline float pcg64_nextf_inclusive(pcg64* rng) {
    // range is not exact pow2 number, so cannot just take K bits
    uint32_t x = pcg64_range(rng, 0x3f800000, 0x40000000+1);
    float f;
    memcpy(&f, &x, 4);
    return f - 1.0f;
}
Just adding this recent blog post from Matt Pharr that talks about uniform float pseudo-random numbers: Sampling in Floating Point. The linked articles and papers seemed interesting too.