I've implemented some pcg variants in C to include in my codebase: one with a 64bit state generating 32bit values, and one with a 128bit state generating 64bit values.
The implementation from the site (pcg_setseq_128_xsl_rr_64_random_r) uses the clang/gcc __int128 type, which I don't believe can be used with MSVC. So I implemented the few 128bit math operations required (multiply, add, shift left and right) using two uint64_t.
The code was working, but when I did a simple performance test, generating 2*u32_max random numbers, my code was about 10 seconds slower than the official code (23.3s for the official implementation using __int128, 33.8s for my 128bit code), using clang 12.0.1 with -O2. Compiling with MSVC (19.29) was a lot worse, about 93.4s, also with -O2 (and MSVC is my primary target).
After looking at the disassembly, I noticed that the 128bit math functions were inlined in the main function, but the register values were spilled to the stack instead of being used directly.
; This is from clang
;...
mov qword ptr [rsp+0x20], r11
mov qword ptr [rsp+0x38], r8
mov qword ptr [rsp+0x28], r13
mov qword ptr [rsp+0x30], rsi
mov rcx, qword ptr [rsp+0x20]
imul rcx, qword ptr [rsp+0x30]
mov rsi, qword ptr [rsp+0x38]
imul rsi, qword ptr [rsp+0x28]
mov rax, qword ptr [rsp+0x20]
mul qword ptr [rsp+0x28]
;...
I changed the functions (multiply and add) to be macros instead, and the generated code (both clang and MSVC) uses only registers and is much closer to the __int128 code. The performance is a lot closer too (__int128 => 23.7s, my code => 25.5s).
My question is: is there a particular reason why the compilers spilled the registers to the stack before using them? Is there a way to keep the function version but make it generate better assembly?
Here is the code. You'll need pcg_variants.h and pcg-advance-128.c. In random_64_get
you can change which version of the 128bit multiply and add to use.
/* clang pcg_test.c -opcg_test.exe -O2 -fno-exceptions -ffast-math -fno-rtti -Werror -g -msse2 -msse4.1 -gcodeview -gno-column-info -fdiagnostics-absolute-paths */

#include <windows.h>
#include <stdint.h>
#include <intrin.h>
#include <stdio.h>

typedef uint8_t u8;
typedef uint32_t u32;
typedef uint64_t u64;
typedef int64_t s64;
typedef double r64;

#define false 0
#define true 1
#define u32_max 0xffffffff

#define safe_macro( macro ) do { macro; } while ( 0 )

#if defined ( __clang__ )
#define _assert( expression ) safe_macro( if ( !( expression ) ) { __builtin_debugtrap( ); } )
#else
#define _assert( expression ) safe_macro( if ( !( expression ) ) { __debugbreak( ); } )
#endif

#if defined ( __clang__ )
/*
 * PCG Random Number Generation for C.
 *
 * Copyright 2014-2019 Melissa O'Neill <[email protected]>,
 * and the PCG Project contributors.
 *
 * SPDX-License-Identifier: (Apache-2.0 OR MIT)
 *
 * Licensed under the Apache License, Version 2.0 (provided in
 * LICENSE-APACHE.txt and at http://www.apache.org/licenses/LICENSE-2.0)
 * or under the MIT license (provided in LICENSE-MIT.txt and at
 * http://opensource.org/licenses/MIT), at your option. This file may not
 * be copied, modified, or distributed except according to those terms.
 *
 * Distributed on an "AS IS" BASIS, WITHOUT WARRANTY OF ANY KIND, either
 * express or implied. See your chosen license for details.
 *
 * For additional information about the PCG random number generation scheme,
 * visit http://www.pcg-random.org/.
 */
#include "pcg_variants.h"
#include "pcg-advance-128.c"
#endif

#define perf_get_frequency( u64_ptr ) QueryPerformanceFrequency( ( LARGE_INTEGER* ) u64_ptr )
#define perf_get_time( u64_ptr ) QueryPerformanceCounter( ( LARGE_INTEGER* ) u64_ptr )

/* == 128 bit math == */

typedef struct u128_t {
    u64 low, high;
} u128_t;

#define u128_make( low, high ) ( u128_t ) { low, high }

#define math_u128_mul_inline( r, a, b ) { u128_t t; t.low = _umul128( a.low, b.low, &t.high ); t.high += ( ( a.low * b.high ) + ( a.high * b.low ) ); r = t; }

u128_t math_u128_mul( volatile u128_t a, volatile u128_t b ) {

    /* NOTE simon: The idea is to do a similar thing as written multiplication: multiply each
       digit from a with each digit of b and add each multiplication result. But we use 64bit
       limbs instead of digits.
       a.low * b.low is a 64bit * 64bit multiply where the result is 128bits. We keep all
       128bits by using the `mul` instruction, which takes rax (implied) and another register
       and stores the result in rdx (high 64bits) and rax (low 64bits).
       We then multiply a.low with b.high, which is a regular 64bit multiply because we can't
       store the bits that would overflow (bits that would be outside of the 128bits). We do a
       similar thing with a.high and b.low.
       We don't need to multiply a.high with b.high as that can only produce bits higher than
       the 128bits. e.g. if we used 2-digit numbers, the smallest multiply we could do would be
       10 * 10, which is 100 and requires 3 digits.
       The result is the sum of those partial results. */

    u128_t result;
    u64 high_1 = a.low * b.high;
    u64 high_2 = a.high * b.low;
    u64 high_3;
    result.low = _umul128( a.low, b.low, &high_3 );
    result.high = high_1 + high_2 + high_3;

    return result;
}

#define math_u128_add_inline( r, a, b ) { u128_t t; t.low = a.low + b.low; t.high = a.high + b.high + ( ( t.low < a.low ) ? 1 : 0 ); r = t; }

u128_t math_u128_add( u128_t a, u128_t b ) {

    u128_t result;
    result.low = a.low + b.low;
    result.high = a.high + b.high + ( ( result.low < a.low ) ? 1 : 0 );

    return result;
}

u128_t math_u128_left_shift( u128_t a, u8 bit_count ) {

    u128_t result;

    if ( bit_count <= 64 ) {
        result.low = a.low << bit_count;
        result.high = a.high << bit_count;
        result.high |= a.low >> ( 64 - bit_count );
    } else {
        result.low = 0;
        result.high = a.low << ( bit_count - 64 );
    }

    return result;
}

u128_t math_u128_right_shift( u128_t a, u8 bit_count ) {

    u128_t result;

    if ( bit_count <= 64 ) {
        result.high = a.high >> bit_count;
        result.low = a.low >> bit_count;
        result.low |= a.high << ( 64 - bit_count );
    } else {
        result.high = 0;
        result.low = a.high >> ( bit_count - 64 );
    }

    return result;
}

/* == Custom pcg implementation. */

#define random_64_multiplier ( u128_t ) { /* low */ 4865540595714422341ull, /* high */ 2549297995355413924ull }

typedef struct random_64_t {
    u128_t state;
    u128_t increment;
} random_64_t;

random_64_t random_64_make( u128_t seed, u128_t sequence ) {

    random_64_t result = { 0 };

    result.increment = math_u128_left_shift( sequence, 1 );
    result.increment.low |= 1;

    result.state = math_u128_add( result.increment, seed );
    result.state = math_u128_mul( result.state, random_64_multiplier );
    result.state = math_u128_add( result.state, result.increment );

    return result;
}

u64 random_64_get( random_64_t* random_64 ) {

    /* NOTE simon: 128 xsl rr 64 */
    u64 result = 0;

    random_64->state = math_u128_mul( random_64->state, random_64_multiplier );
    // math_u128_mul_inline( random_64->state, random_64->state, random_64_multiplier );
    random_64->state = math_u128_add( random_64->state, random_64->increment );
    // math_u128_add_inline( random_64->state, random_64->state, random_64->increment );

#if 0
    u8 rotation = ( u8 ) math_u128_right_shift( random_64->state, 122 ).low;
    u64 temp = math_u128_right_shift( random_64->state, 64 ).low;
    temp ^= random_64->state.low;
#else
    /* NOTE simon: 2^6 => 64 bit rotation. */
    u8 rotation = ( u8 ) ( random_64->state.high >> ( 64 - 6 ) );
    u64 temp = random_64->state.high ^ random_64->state.low;
#endif

    result = ( temp >> rotation ) | ( temp << ( 64 - rotation ) );

    return result;
}

void random_64_advance( random_64_t* random_64, u128_t delta ) {

    u128_t multiplier = random_64_multiplier;
    u128_t increment = random_64->increment;
    u128_t multiply_accumulator = u128_make( 1, 0 );
    u128_t increment_accumulator = u128_make( 0, 0 );

    while ( delta.low > 0 || delta.high > 0 ) {

        if ( delta.low & 1 ) {
            multiply_accumulator = math_u128_mul( multiply_accumulator, multiplier );
            increment_accumulator = math_u128_mul( increment_accumulator, multiplier );
            increment_accumulator = math_u128_add( increment_accumulator, increment );
        }

        u128_t temp = math_u128_add( multiplier, u128_make( 1, 0 ) );
        increment = math_u128_mul( temp, increment );
        multiplier = math_u128_mul( multiplier, multiplier );
        delta = math_u128_right_shift( delta, 1 ); /* NOTE simon: delta /= 2. */
    }

    random_64->state = math_u128_mul( multiply_accumulator, random_64->state );
    random_64->state = math_u128_add( random_64->state, increment_accumulator );
}

u64 random_64_get_range( random_64_t* random_64, u64 min, u64 max ) {

    u64 result = 0;
    u64 range = max - min;
    u64 threshold = ( ( u64 ) ( -( s64 ) range ) ) % range;

    while ( true ) {

        u64 number = random_64_get( random_64 );

        if ( number >= threshold ) {
            result = min + ( number % range );
            break;
        }
    }

    return result;
}

int main( int argc, char** argv ) {

    u64 frequency;
    perf_get_frequency( &frequency );

    u64 sum = 0;

    struct pcg_state_setseq_128 rng_1;
    // pcg64_srandom_r( &rng_1, ( pcg128_t ) &rng_1, ( pcg128_t ) &rng_1 );
    pcg64_srandom_r( &rng_1, ( pcg128_t ) 0x000000DF8F19FBD0, ( pcg128_t ) 0x000000DF8F19FBD0 );

    // random_64_t rng_2 = random_64_make( u128_make( ( u64 ) &rng_1, 0 ), u128_make( ( u64 ) &rng_1, 0 ) );
    random_64_t rng_2 = random_64_make( u128_make( ( u64 ) 0x000000DF8F19FBD0, 0 ), u128_make( ( u64 ) 0x000000DF8F19FBD0, 0 ) );

    /* Some tests.
    u64 a_high = ( u64 ) ( rng_1.state >> 64 );
    u64 a_low = ( u64 ) ( rng_1.state );
    u64 b_high = ( u64 ) ( rng_1.inc >> 64 );
    u64 b_low = ( u64 ) ( rng_1.inc );

    _assert( a_high == rng_2.state.high );
    _assert( a_low == rng_2.state.low );
    _assert( b_high == rng_2.increment.high );
    _assert( b_low == rng_2.increment.low );

    pcg64_advance_r( &rng_1, 12345 );
    pcg64_advance_r( &rng_1, 145 );
    pcg64_advance_r( &rng_1, 1235 );
    pcg64_advance_r( &rng_1, 1245 );
    pcg64_advance_r( &rng_1, 14839255 );

    random_64_advance( &rng_2, u128_make( 12345, 0 ) );
    random_64_advance( &rng_2, u128_make( 145, 0 ) );
    random_64_advance( &rng_2, u128_make( 1235, 0 ) );
    random_64_advance( &rng_2, u128_make( 1245, 0 ) );
    random_64_advance( &rng_2, u128_make( 14839255, 0 ) );

    a_high = ( u64 ) ( rng_1.state >> 64 );
    a_low = ( u64 ) ( rng_1.state );
    b_high = ( u64 ) ( rng_1.inc >> 64 );
    b_low = ( u64 ) ( rng_1.inc );

    _assert( a_high == rng_2.state.high );
    _assert( a_low == rng_2.state.low );
    _assert( b_high == rng_2.increment.high );
    _assert( b_low == rng_2.increment.low );
    */

    for ( u32 i = 0; i < u32_max; i++ ) {
        uint64_t a = pcg64_random_r( &rng_1 );
        uint64_t b = random_64_get( &rng_2 );
        _assert( a == b );
        uint64_t c = a + b;
        _assert( c == a * 2 );
        sum += c;
    }

    /* Time */
    // pcg64_srandom_r( &rng_1, ( pcg128_t ) &rng_1, ( pcg128_t ) &rng_1 );
    pcg64_srandom_r( &rng_1, ( pcg128_t ) 0x000000DF8F19FBD0, ( pcg128_t ) 0x000000DF8F19FBD0 );

    u64 pcg_start;
    perf_get_time( &pcg_start );

    for ( u32 i = 0; i < u32_max; i++ ) {
        uint64_t a = pcg64_random_r( &rng_1 );
        uint64_t b = pcg64_random_r( &rng_1 );
        uint64_t c = a + b;
        sum += c;
    }

    u64 pcg_end;
    perf_get_time( &pcg_end );

    r64 pcg_time = ( r64 ) ( pcg_end - pcg_start ) / ( r64 ) frequency;
    printf( "pcg: %f\n", pcg_time );

    // rng_2 = random_64_make( u128_make( ( u64 ) &rng_1, 0 ), u128_make( ( u64 ) &rng_1, 0 ) );
    rng_2 = random_64_make( u128_make( ( u64 ) 0x000000DF8F19FBD0, 0 ), u128_make( ( u64 ) 0x000000DF8F19FBD0, 0 ) );

    u64 custom_start;
    perf_get_time( &custom_start );

    for ( u32 i = 0; i < u32_max; i++ ) {
        uint64_t a = random_64_get( &rng_2 );
        uint64_t b = random_64_get( &rng_2 );
        uint64_t c = a + b;
        sum += c;
    }

    u64 custom_end;
    perf_get_time( &custom_end );

    r64 custom_time = ( r64 ) ( custom_end - custom_start ) / ( r64 ) frequency;
    printf( "custom: %f\n", custom_time );

    return ( u32 ) sum;
}
That happens because you declared the math_u128_mul arguments with the volatile qualifier. volatile means you are telling the compiler "please do not optimize these variables, always keep them in memory, and every access to them must be a memory load & store". Don't do volatile.
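For illustration, the same function with plain value parameters - the compiler is then free to keep a and b in registers:

u128_t math_u128_mul( u128_t a, u128_t b ) {

    u128_t result;
    u64 high_1 = a.low * b.high;
    u64 high_2 = a.high * b.low;
    u64 high_3;
    result.low = _umul128( a.low, b.low, &high_3 );
    result.high = high_1 + high_2 + high_3;

    return result;
}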
Also, you can get better codegen for your add/shift operations on MSVC by using intrinsics similar to _umul128 - use _addcarry_u64 and __shiftright128.
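For example, a sketch of the add and right shift with those intrinsics (untested; assumes <intrin.h> is included as in the posted code, and note __shiftright128 only covers shift counts below 64, so the >= 64 branch stays):

u128_t math_u128_add_intrin( u128_t a, u128_t b ) {

    u128_t result;
    /* _addcarry_u64 returns the carry out, which is fed into the high part. */
    u8 carry = _addcarry_u64( 0, a.low, b.low, &result.low );
    _addcarry_u64( carry, a.high, b.high, &result.high );

    return result;
}

u128_t math_u128_right_shift_intrin( u128_t a, u8 bit_count ) {

    u128_t result;

    if ( bit_count < 64 ) {
        /* shrd: shifts a.high:a.low right and returns the low 64 bits. */
        result.low = __shiftright128( a.low, a.high, bit_count );
        result.high = a.high >> bit_count;
    } else {
        result.low = a.high >> ( bit_count - 64 );
        result.high = 0;
    }

    return result;
}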
Here's my pcg code that implements these operations for all kinds of compilers - gcc/clang, MSVC for x64 and 64-bit ARM, and even a generic 32-bit fallback.
https://gist.github.com/mmozeiko/1561361cd4105749f80bb0b9223e9db8#file-pcg64-h
I feel completely stupid.
I added those volatile qualifiers when I was trying stuff in godbolt, precisely to try to prevent the compiler from inlining the function, and I never looked back at the function signature. What a way to waste half a day.
Thanks for the code, I'll look into that tomorrow.
I tried to use _addcarry_u64, but most of the time the code is slower than the basic code I had. Note that I'm using an i7 860, which is quite old, so it might be faster on more recent processors.
The only time the speed is the same is when I have the test loop that verifies that my generator produces the same numbers as the official pcg code (and when using a single _addcarry_u64). The generated code in that case is a bit different for some reason (the correctness loop and timing loop are different, and I changed the timing loops to only generate one number per iteration and add it to the sum).
u128_t math_u128_add( u128_t a, u128_t b ) {

    /* Using __int128 add I have 13.1s */
#if 1
    // 12.7s
    u128_t result;
    result.low = a.low + b.low;
    result.high = a.high + b.high + ( ( result.low < a.low ) ? 1 : 0 );
#elif 0
    // 13.7s , when testing correctness before 12.7s
    u128_t result;
    result.low = a.low + b.low;
    _addcarry_u64( ( result.low < a.low ) ? 1 : 0, a.high, b.high, &result.high );
#elif 0
    // 13.4s
    u128_t result;
    _addcarry_u64( _addcarry_u64( 0, a.low, b.low, &result.low ), a.high, b.high, &result.high );
#else
    // 13.4s
    u128_t result;
    u8 carry = _addcarry_u64( 0, a.low, b.low, &result.low );
    _addcarry_u64( carry, a.high, b.high, &result.high );
#endif

    return result;
}
; naive code
mov rax, rsi
mul r12
imul rsi, r14
imul rdi, r12
add rdi, rsi
add rdi, rdx
lea rsi, ptr [rax+r15*1]
xor ecx, ecx
cmp rax, r9
setnbe cl
add rdi, rcx
mov rcx, rdi
shr rcx, 0x3a
mov rax, rdi
xor rax, rsi
ror rax, cl

; single addcarry_u64 without correctness
mov rax, rsi
mul r12
imul rsi, r14
imul rdi, r12
add rdi, rsi
add rdi, rdx
lea rsi, ptr [rax+r15*1]
cmp rax, r9
setnbe al
add al, 0xff
adc rdi, 0x0
mov rcx, rdi
shr rcx, 0x3a
mov rax, rdi
xor rax, rsi
ror rax, cl

; single addcarry_u64 with correctness
mov rax, r12
mul rsi
imul r12, r8
imul r15, rsi
add r15, r12
add r15, rdx
lea r12, ptr [rax+r9*1]
cmp rdi, rax
adc r15, 0x0
mov rcx, r15
shr rcx, 0x3a
mov rax, r15
xor rax, r12
ror rax, cl

; double __addcarry_u64
mov rax, rcx
mul r8
imul rcx, r9
imul rsi, r8
add rsi, rcx
add rsi, rdx
add rax, r10
adc rsi, 0x0
mov rcx, rsi
shr rcx, 0x3a
mov rdx, rsi
xor rdx, rax
ror rdx, cl
That was with clang. With MSVC the naive code is the fastest overall (12.3s) and has the cleanest assembly (for MSVC). When using _addcarry_u64 the code was 4 times slower (44.2s) and the assembly contains some stack reads and stores as well as some SSE register use.
; msvc naive code
imul r8, rsi
mov rax, rsi
mul r9
imul r9, rbp
add r8, rdx
add r8, r9
lea r9, ptr [rdi+rax*1]
cmp r9, rax
mov rdx, r9
adc r8, r11
xor rdx, r8
mov rcx, r8
shr rcx, 0x3a
ror rdx, cl

; msvc with __addcarry_u64
mov rcx, qword ptr [rbp-0x48]
mov rax, rdi
mul rcx
imul rcx, rsi
mov r8, rax
mov rax, qword ptr [rbp-0x40]
imul rax, rdi
add rdx, rax
add rdx, rcx
add r8, r11
mov qword ptr [rbp-0x58], r8
adc rdx, r10
xor r8, rdx
mov qword ptr [rbp-0x50], rdx
movups xmm0, xmmword ptr [rbp-0x58]
mov rcx, rdx
shr rcx, 0x3a
ror r8, cl
I also found out that in the multiply function, the order of the statements affects performance.
u128_t math_u128_mul( u128_t a, u128_t b ) {

    u128_t result;

#if 1
    /* This is 0.8s faster than the alternative with clang. */
    result.low = _umul128( a.low, b.low, &result.high );
    u64 high_1 = a.low * b.high;
    u64 high_2 = a.high * b.low;
    result.high += high_1 + high_2;
#else
    u64 high_1 = a.low * b.high;
    u64 high_2 = a.high * b.low;
    result.low = _umul128( a.low, b.low, &result.high );
    result.high += high_1 + high_2;
#endif

    return result;
}
Could you explain, or point me to an explanation of, the code for generating floats and doubles? I mean, I understand what the code is doing, but not exactly what it does to the number. It looks similar to what the PCG site mentions, double d = ldexp(pcg32_random_r(&myrng), -32); (which uses double so it can store a u32 precisely), but 0x1.0p-24f would be 2^-36?
static inline float pcg64_nextf(pcg64* rng) {
    uint64_t x = pcg64_next(rng);
    return (float)(int32_t)(x >> 40) * 0x1.0p-24f;
}

static inline double pcg64_nextd(pcg64* rng) {
    uint64_t x = pcg64_next(rng);
    return (double)(int64_t)(x >> 11) * 0x1.0p-53;
}
What I'm understanding (correct me if I'm wrong) is that you are getting 24 bits (mantissa + sign bit?) and dividing them by 2^36? It seems that we should use 0x1.0e-24f to divide by 2^24? p means the exponent is a hexadecimal value, right? Why 24 bits and not 23 (the size of the mantissa), so as not to use the sign bit? I assume we want numbers between 0 and 1, both included? Is there a reason for using the upper bits instead of the bottom ones?
[EDIT] I misunderstood the meaning of p
in floating point numbers. The rest of the question stands though.
I wrote this: https://gist.github.com/mmozeiko/37010d7352cffcdf535624f783a5944a and then ran it using Google's benchmark library.
With clang -O2:
Benchmark                Time             CPU   Iterations
-------------------------------------------------------------
BenchMy               1.66 ns         1.67 ns    448000000
BenchMyNoAddIntrin    1.81 ns         1.80 ns    373333333
With MSVC -O2:
Benchmark                Time             CPU   Iterations
-------------------------------------------------------------
BenchMy               10.1 ns         10.3 ns     74666667
BenchMyNoAddIntrin    10.2 ns         10.3 ns     74666667
So pretty much the same, or a bit better when using _addcarry_u64.
The code you posted looks very strange - why is there an xmm0 operation in there at all? There shouldn't be any: https://godbolt.org/z/voKETdnos
As for the float versions - I'm not sure where you are getting 2^-36. 0x1.0p-24f is 2^-24. All that code is doing is generating a 24-bit uniform integer in [0..2^24) and then dividing it by 2^24 to get the [0..1) range. 24 bits is the max you can take while guaranteeing a uniform float. Taking more bits would make the float use a non-zero exponent, which means there would be more values closer to 0 than to 1. You can take fewer bits, of course. Generating 10 bits of integer and multiplying by 0x1.0p-10f would also work, it just generates fewer possible float values.
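As a sketch of that last point (hypothetical helper name, same shape as pcg64_nextf above, just with fewer bits):

// 10-bit variant: only 2^10 evenly spaced values in [0..1)
static inline float pcg64_nextf_10bit(pcg64* rng) {
    uint64_t x = pcg64_next(rng);
    return (float)(uint32_t)(x >> 54) * 0x1.0p-10f; // top 10 of 64 bits
}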
It is 24 bits for the mantissa, not 23, because the mantissa stores only the bits after the first 1 bit. The first 1 bit is implicit, so there's no point in storing it in the actual value; thus only 23 bits are stored for the 24-bit mantissa value in a float. Similarly, for double, 52 bits are stored for the 53-bit mantissa value.
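For illustration (not from the thread), a tiny standalone check of that layout:

#include <stdint.h>
#include <string.h>
#include <assert.h>

int main(void) {
    /* 1.5f: sign 0, biased exponent 127, stored mantissa 0x400000 (the leading 1 is implicit),
       so the value is (1 + 0x400000 / 2^23) * 2^(127 - 127) = 1.5 and the bit pattern is 0x3FC00000. */
    float f = 1.5f;
    uint32_t bits;
    memcpy(&bits, &f, 4);
    assert(bits == 0x3FC00000);
    return 0;
}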
Lower bits are as good as upper bits, no special reason. A minor reason is that the shift operation encodes more compactly - taking the lower bits would require an and with a 32-bit constant, which is an extra 4 bytes (or 8 bytes for double) to encode in the x86 opcode. Also (not for this random generator, but for other random generators) the upper bits might provide higher quality randomness than the lower ones, which is why you typically take the upper ones.
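For what it's worth, the two forms side by side (illustration only, assuming x is a full 64-bit random value):

#include <stdint.h>

/* upper 24 bits: a single shift, the shift count fits in one immediate byte */
static inline uint32_t take24_upper(uint64_t x) { return (uint32_t)(x >> 40); }

/* lower 24 bits: needs an and with a 32-bit immediate, a few more bytes to encode */
static inline uint32_t take24_lower(uint64_t x) { return (uint32_t)(x & 0xFFFFFF); }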
Using the ldexp[f] function often generates a lot more code, because that function is meant for arbitrary values. It potentially won't be inlined, and it validates its input with extra branches to check whether the input is in the proper range. Here's a typical implementation of such a function: https://git.musl-libc.org/cgit/musl/tree/src/math/scalbnf.c - ldexpf(x,n) directly calls scalbnf(x,n).
In my case, one integer shift, one integer-to-float cast, and one float multiply is all that's needed - very simple. As long as you have a reasonably efficient int-to-float operation for your target architecture, this will be very fast. It can be easily expressed in SIMD as well.
The code you posted looks very strange - why is there an xmm0 operation in there at all? There shouldn't be any: https://godbolt.org/z/voKETdnos
Here is the code on godbolt: https://godbolt.org/z/vv7jz54v5 If you can see what I'm doing wrong, I'm interested. Changing the #if in math_u128_add generates different code.
As for the float versions - I'm not sure where you are getting 2^-36. 0x1.0p-24f is 2^-24.
I edited the post after writing it to say that I misunderstood the meaning of p (maybe you saw it before the edit). I had never used hexadecimal to write floating point numbers, and after a web search I misread the explanation and thought that 0x1.0p-24f meant that the exponent 24 was a hexadecimal number, meaning 36 in decimal.
It is 24 bits for the mantissa, not 23, because the mantissa stores only the bits after the first 1 bit. The first 1 bit is implicit, so there's no point in storing it in the actual value; thus only 23 bits are stored for the 24-bit mantissa value in a float. Similarly, for double, 52 bits are stored for the 53-bit mantissa value.
I knew that, I should have figured it out. Sorry.
Lower bits are as good as upper bits, no special reason. A minor reason is that the shift operation encodes more compactly - taking the lower bits would require an and with a 32-bit constant, which is an extra 4 bytes (or 8 bytes for double) to encode in the x86 opcode. Also (not for this random generator, but for other random generators) the upper bits might provide higher quality randomness than the lower ones, which is why you typically take the upper ones.
Isn't modifying the source bits changing the uniform "distribution"? For example, if we generate a 32bit integer and shift it right by 8 bits, then any number generated up to 255 will generate a 0, so 0 would have 255 more chances to be generated?
If that's correct, and assuming we don't care about speed, couldn't we use the pcg32_range function to generate numbers between 0 and 2^24 and then do the divide?
After thinking more about this, I think that every number gets 255 more chances, so it's fine. So the two previous paragraphs are wrong.
Another question: if I want floats including 1, is dividing by ( float ) 0xffffff ok? I assume it will be slower than the multiply, but what about in terms of uniformity of the generated numbers?
Seems the compiler really tries to keep your u128_t type in an SSE register, because two uint64_t fit there. The code it generates for that is terrible though - lots of unnecessary moves between GPR and SSE registers. You really don't want such codegen.
After thinking more about this, I think that every number gets 255 more chances, so it's fine. So the two previous paragraphs are wrong.
Think of it this way - if you have a uniform 2^n random number and want to get a uniform 2^k random number (k <= n), then you can simply take k bits from that n-bit number, anywhere. All it does is map a 2^(n-k)-sized bucket of inputs to each k-bit number. You only need to use the range() function when 2^n is not divisible by the range value.
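A tiny exhaustive check of that bucket argument (illustration only, with small n and k so it can sweep every input):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* n = 8, k = 3: taking the top 3 bits maps 2^(8-3) = 32 inputs to each output. */
    uint32_t counts[8] = { 0 };
    for (uint32_t x = 0; x < 256; x++) {
        counts[x >> 5] += 1;
    }
    for (uint32_t i = 0; i < 8; i++) {
        printf("%u: %u\n", i, counts[i]); /* prints 32 for every bucket */
    }
    return 0;
}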
Another question: if I want floats including 1, is dividing by ( float ) 0xffffff ok? I assume it will be slower than the multiply, but what about in terms of uniformity of the generated numbers?
The first question I would ask is: does it really matter to sometimes, with super low probability, get the value 1.0 exactly? For double that will be a 2^-53 probability. That's much lower than winning a million-dollar lottery. And even without the exact 1.0 value there will be so many other values super close to it - 0.999999999999999, 0.999999999999998, 0.999999999999997, ... - all these numbers are closer to 1.0 than the size of an atom in meters, so you might as well think of them as the value 1.0.
But if you really want it - I would not do it with division, because that won't produce uniform output. The result of the division will be all kinds of floats that are not exactly representable, so they will snap to the closest float - and you'll have a bunch of floats closer to each other, a bunch of them further apart, etc. Instead I would generate the mantissa bits exactly, so you have exact uniformity.
For floats:
1.0 is 0x3f800000
0x3fYxxxxx // (where Y>=8) numbers between 1.0 and 2.0
2.0 is 0x40000000
so generate a random number in this range, inclusive, and subtract 1.0.
static inline float pcg64_nextf_inclusive(pcg64* rng) {
    // range is not exact pow2 number, so cannot just take K bits
    uint32_t x = pcg64_range(rng, 0x3f800000, 0x40000000+1);
    float f;
    memcpy(&f, &x, 4);
    return f - 1.0f;
}
Just adding this recent blog post from Matt Pharr that talks about uniform float pseudo-random numbers: Sampling in Floating Point. The linked articles and papers seemed interesting too.