How to "sample" 2D cubic bezier for 1D "animation"

← Back to index

How to "sample" 2D cubic bezier for 1D "animation"

Simon Anciaux

#30440

March 26, 2025

I'm working on audio for a game, and I'm trying to get frequency (and other parameters) interpolation for playing notes in a simple "tracker". I have several interpolation types, and one of them is to draw a curve in the editor that will drive the interpolation.

A way to do it is just to say that we pass the "time value" (normalized between 0 and 1) to the bezier curve, and use the resulting value as the output.

/*
p0 is the start value;
p1 is the end value;
cp0 is the first control point;
cp1 is the second control point;
All are single f32.
*/
f32 frequency_lerp_factor = bezier_cubic(p0, cp0, cp1, p1, t);

I can graph the curve by using t for x, and frequency_lerp_factor for y. This is (what I call) 1D. I can also graph the points and control points, but it gets confusing because their value would only be the y component, there isn't really a x here (except that p0 is at t = 0 and p1 is at t = 1. The control points, if I graph them, can be at any x in the graph, it doesn't matter where because their x value doesn't affect the curve.

Am I correct about it so far ?

But I'd like more control over the curve, so I want to "draw" a 2d curve using tangents (with some constraints to make sure the curve doesn't go backward). I've used 2d curves with tangents in the past while working on 3D/2D animations, so that was what I wanted when I started to work on the curve for interpolation.

In this case, the curve is controlled by 2D points, that I can graph normally, and I can get the curve by using bezier_cubic on x and y. But now, I think that to get the frequency lerp factor, I need to "sample" the resulting curve.

If I want to get the lerp factor (y) at x on the graph, I need to figure out which t will give me x, and use that t in bezier_cubic to get the y.

/* I don't know how to do bezier_cubic_inverse. */
f32 t = bezier_cubic_inverse( p0.x, cp0.x, cp1.x, p1.x, x );
f32 frequency_lerp_factor = bezier_cubic( p0.y, cp0.y, cp1.y, p1.y, t );

I tried to do the math and got to a point where it looks like I need to solve a cubic polynomial, which I looked a bit how to do, but it doesn't look like there is an easy way to do that on a computer.

This might be wrong. Only using x components of the points.
t³(-p0+3cp0-3cp1+p1) + t²(3p0-6cp0+3cp1) + t(-3p0+3cp0) + p0 = x
Since p0, p1, cp0, cp1, x are known, we could compute the inside of the parenthesis
and it becomes:
t³a + t²b + tc + p0 = x
And we need to solve for t.

If all that is correct, is solving the equation the way to do this or is there another way ? Or have I misunderstood something ?

Edited by Simon Anciaux on April 14, 2025, 1:28pm

Simon Anciaux

#30454

April 14, 2025

On Discord, people suggested to use the Newton–Raphson method to estimate the value we are searching. They also pointed me to A Primer on Bézier Curves which contains a lot of information about Bézier curves.

So I was correct that we need to solve the cubic equation to get what we need. I wanted to write a small summary in case it's useful to somebody.

To re-state the problem: we use a cubic Bézier interpolation with an interpolation factor t, a start point p0, a end point p1, a first control point cp0 (that with p0 creates a tangent at the start point), a second control point cp1 (that with p1 creates a tangent at the end point), and we interpolate on both x and y to "get" the curve. The equation for the curve is: f32 result = (u*u*u * p0) + (3*u*u*t * cp0) + (3*u*t*t * cp1) + (t*t*t * p1); Where u = (1-t).

There are constraints on this curve, since we want to use it for interpolation, we don't want the curve to "go back" on x, in other words: x(n+1) > x(n). To make sure of that we have: p0x < p1x and p0x <= cp0x <= p1x and p0x <= cp1x <= p1x.

vec2 point = { 0 };
point.x = bezier_cubic( p0.x, cp0.x, cp1.x, p1.x, t);
point.y = bezier_cubic( p0.y, cp0.y, cp1.y, p1.y, t);
/* Point is a point on the curve at interpolation factor t. */

This allows us to draw a curve, but how can we use it to interpolate something. If we consider only the graph, we consider the x axis as being the interpolation factor, and y being the result of the interpolation. So we want to say result = cubic(..., x ); but we don't know x, as it is the result of the interpolation (point.x above).

So the question is what interpolation factor t gives us a x value on the graph that correspond to the interpolation factor we want from the curve.

In a math sense (which I'm not good at), we can see it as: x = (u*u*u * p0) + (3*u*u*t * cp0) + (3*u*t*t * cp1) + (t*t*t * p1) where we know x, p0, p1, cp0 and cp1, and we search t, so we want an equation that looks like t = ... . But this is a cubic equation so there is no simple way to do that. Cubic equation can have several solutions, including solutions with complex numbers. But with the constraints we have, we know that there will be only a single value of t that gives us the value of x we search. We want a single real (as in real number) solution.

We can work out the equation to get it in the form of standard cubic equations at³+bt²+ct+d=0:

x = (u*u*u * p0) + (3*u*u*t * cp0) + (3*u*t*t * cp1) + (t*t*t * p1)
x = u³p0 + 3u²tcp0 + 3ut²cp1 + t³p1
/* replace u by 1-t */
x = (1-t)³p0 + 3(1-t)²tcp0 + 3(1-t)t²cp1 + t³p1
/* (a - b)³ = a³ - 3a²b + 3ab² - b³ */
/* (1-t)³ = 1 - 3t + 3t² - t³*/
/* (1-t)² = 1 - 2t + t² */
x = (1-3t+3t²-t³)p0 + 3(1-2t+t²)tcp0 + 3(1-t)t²cp1 + t³p1
/* Distributions */
x = (1-3t+3t²-t³)p0 + (3-6t+3t²)tcp0 + (3-3t)t²cp1 + t³p1
x = (1-3t+3t²-t³)p0 + (3t-6t²+3t³)cp0 + (3t²-3t³)cp1 + t³p1
x = (p0-3p0t+3p0t²-p0t³) + (3cp0t-6cp0t²+3cp0t³) + (3cp1t²-3cp1t³) + t³p1
x = p0-3p0t+3p0t²-p0t³ + 3cp0t-6cp0t²+3cp0t³ + 3cp1t²-3cp1t³ + p1t³
/* Group by power */
x = -p0t³ + 3cp0t³ - 3cp1t³ + p1t³ + 3p0t² - 6cp0t² + 3cp1t² - 3p0t + 3cp0t + p0
/* I don't know what is the term in English, but just put the common term in front. */
x = t³(-p0 + 3cp0 - 3cp1 + p1) + t²(3p0 - 6cp0 + 3cp1) + t(-3p0 + 3cp0) + p0
0 = t³(-p0 + 3cp0 - 3cp1 + p1) + t²(3p0 - 6cp0 + 3cp1) + t(-3p0 + 3cp0) + p0 - x
/* We can replace things that are constant */
a = -p0 + 3cp0 - 3cp1 + p1
b = 3p0 - 6cp0 + 3cp1
c = -3p0 + 3cp0
d = p0 - x
0 = t³a + t²b + tc + d

We can see that there is no easy way to get t=.... If we graph that equation, we have the t value on the x axis, and the x value (point.x from above) on the y axis. https://www.desmos.com/calculator/3ruowk5mvb

Since we know there is only one t value that gives us a given x value, that means that the function described by the equation, should cross the x axis at only one place (the root of the function), and that value is what we search: t. On the graph if you change the slider for p (the name are all weird, because I don't think I can name things better in desmos), you change which value we are searching: if it's set at zero, we search which t gives us 0; if it's set to 0.3 we search which t gives us 0.3 and the answer is always "where the graph crosses the x axis".

That still doesn't help us find the solution in code. The Bézier Primer linked above has some function that can compute all the roots of cubic, but I can't say I understand everything and it seems relatively expensive (I haven't measure it). That's when people on Discord suggested to use the Newton–Raphson method.

The method uses the derivative of the function to iteratively approach the value searched. It's a method that converges fast toward the solution, but isn't guaranteed to converge.

One of the interpretation of the derivative of a function at a point, is that it's the slope of the tangent at that point. The algorithm uses the tangent as an approximation of the function at a point.

We start with an initial guess, a coordinate on the x axis, which in this is the interpolation factor t, which in my test I always used the x value we a searching for. We compute the derivative at that point to get the tangent and compute where the it intersects the x axis. That is our new guess (new t) that we hope is better than the previous one. We use that value in the equation to compute a y value. If that y value is close enough to 0, than we accept t. Otherwise we do more iterations until we get a result or determine that the function doesn't converge.

/* function */
0 = t³a + t²b + tc + d
/* Derivative */
0 = 3t²a + 2tb + c
/* The slope of the tangent equals */
s = 3t²a + 2tb + c

/* The slope of a line defined by 2 points is */
s = (y1 - y0) / (x1-x0)
/*
x0 is the initial guess.
y0 is the value of the function if we pass it x0
We search x1.
We are searching for a point that crosses the x axis, so y1 = 0 */
s = -y0 / (x1-x0)
s * (x1-x0) = -y0
x1-x0 = -y0 / s
x1 = -y0/s + x0
x1 = x0 - (y0/s)
/* Which means */
x1 = x0 - (function(x0)/derivative(x0))

In the following graph, you can view the tangent (in purple) line at a guess value (the q slider). You can still set the p value for the point we are searching for.

Below the q slider is the x value at which the tangent crosses the x axis. You can use this value in for q to have the next iteration and repeat to see the function converging. Below that is the distance of the current guess to the x axis (if you pass the current guess q through the function, we get that value (distance on the y axis) and we want that value to be as close to zero as possible). This is also visualized as a blue line.

https://www.desmos.com/calculator/8bofj334lp

Some notes:

If the derivative value is zero, we get an horizontal tangent, which will never cross the x axis, and so the method will not converge.
The function can diverge at some point, often after the first guess, but still converge after.
The t value can go outside the 0 to 1 range at first, and then converge.
In my tests (not much) the only time the method didn't converge is when the tangent is horizontal (derivative == 0). Maybe the special case of what we are doing guarantees that the function will converge, I don't know for sure.
This method often converge fast (what I saw and what other told me).
Since there are no guarantee (as far as I know) that we will converge, people and code I've seen recommend to have a fallback to a bisection method, because bisection is guaranteed to converge (but probably slower).
Someone recommended Oscar Veliz youtube channel, which has a lot of videos about finding the root of functions using different methods.

Here is some code that implement the Newton-Raphson method. Note that this is just an example, it doesn't try to properly handle the case where it doesn't converge.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

typedef float f32;
typedef uintptr_t umm;
typedef int32_t s32;

static f32 interpolation_linear( f32 start, f32 end, f32 t ) {
    f32 result = start * ( 1 - t ) + end * t;
    return result;
}

static f32 interpolation_cubic( f32 start, f32 cp_1, f32 cp_2, f32 end, f32 t ) {
    f32 u = 1 - t;
    f32 result =  ( u * u * u * start ) + ( 3 * u * u * t * cp_1 ) + ( 3 * u * t * t * cp_2 ) + ( t * t * t * end );
    return result;
}

static f32 interpolation_cubic_t_from_x_NR( f32 start, f32 cp_1, f32 cp_2, f32 end, f32 x, f32 epsilon, s32* iteration_count ) {
    
    ( *iteration_count ) = 0;
    
    f32 a = ( -start + 3 * cp_1 - 3 * cp_2 + end );
    f32 b = ( 3 * start - 6 * cp_1 + 3 * cp_2 );
    f32 c = ( -3 * start + 3 * cp_1 );
    f32 d = ( start - x );
    
    f32 t = x;
    
    for ( umm i = 0; i < 8; i++ ) {
        
        f32 f = a * t * t * t + b * t * t + c * t + d;
        
        if ( fabsf( f ) < epsilon ) {
            break;
        }
        
        f32 derivative = 3 * a * t * t + 2 * b * t + c;
        
        if ( derivative == 0 ) {
            /* The function did not converve, you need to use bisection instead. */
            break;
        }
        
        t = t - ( f / derivative );
        ( *iteration_count )++;
    }
    
    return t;
}

static f32 interpolation_cubic_t_from_x_bisection( f32 start, f32 cp_1, f32 cp_2, f32 end, f32 x, f32 epsilon, s32* iteration_count ) {
    
    ( *iteration_count ) = 0;
    
    f32 l = 0;
    f32 r = 1;
    f32 t = x;
    
    while ( l < r ) {
        
        f32 x2 = interpolation_cubic( start, cp_1, cp_2, end, t );
        
        if ( fabs( x - x2 ) < epsilon ) {
            break;
        }
        
        if ( x2 > x ) {
            r = t;
        } else {
            l = t;
        }
        
        t = interpolation_linear( l, r, 0.5f );
        ( *iteration_count )++;
    }
    
    return t;
}

int main( int argc, char** argv ) {
    
    f32 start = 0;
    f32 cp_1 = 1;
    f32 cp_2 = 0;
    f32 end = 1;
    
    f32 x = 0;
    
    for ( umm i = 0; i < 11; i++ ) {
        
        s32 it_1 = 0, it_2 = 0;
        f32 t1 = interpolation_cubic_t_from_x_NR( start, cp_1, cp_2, end, x, 0.001f, &it_1 );
        f32 t2 = interpolation_cubic_t_from_x_bisection( start, cp_1, cp_2, end, x, 0.001f, &it_2 );
        
        f32 x1 = interpolation_cubic( start, cp_1, cp_2, end, t1 );
        f32 x2 = interpolation_cubic( start, cp_1, cp_2, end, t2 );
        
        printf( "NR: x: %f -> t: %f -> x: %f | iteration: %d\n", x, t1, x1, it_1 );
        printf( "BI: x: %f -> t: %f -> x: %f | iteration: %d\n", x, t2, x2, it_2 );
        
        x += 0.1f;
    }
    
    return 0;
}

Edited by Simon Anciaux on April 23, 2025, 11:58am Reason: Fixed desmos graphs

Simon Anciaux

#30471

April 23, 2025

I wanted to do some performance tests, to see how much faster or slower some methods were. I implemented Cardano's algorithm using the French Wikipedia page as it was most straight forward than the English page (and I speak French). I'm not sure I implemented right because sometime it doesn't produce correct value, but I can't say I understand how it works, I just wanted some idea of how much times it would take.

In the course of doing all that, I also checked the results to see when Newton Raphson and Cardano didn't work. I also did a different implementation of the bisection method which also failed under some conditions. Here are some of the things I got from those test.

The tests were to use the different methods with different parameters:

start is always 0 and end is always 1;
cp1 and cp2 change between 0 and 1 (loop on cp1 and cp2 values with some fixed step to try to cover "every" combination);
for methods that require a starting guess (Bisection and newton Raphson), I also loop over "every" values with some fixed step.
there is an "epsilon" value for defining which error/precision we want (for bisection and Newton Raphson).
there is no fallback if a functions fails to give a valid result.

This started as a performance test but in the end it just gives a broad idea of what's going on, I wouldn't count the numbers as really representing the performance. I didn't try to do any optimization, I use very straight forward version of the methods. I measured the performance by taking the difference between calls of __rdtsc(). The code was compiles with MSVC (19.43.34809 x64) with "/O2". I tested 6 functions: 2 bisection methods, Newton Raphson using floats and doubles, and Cardano using floats and doubles.

The two bisection methods are:

computing the result of the Bézier equation (method A);
computing the result of the Bézier equation transformed in the form of ax³+bx²+cx+d=0 (as describe in the previous post) (method B).

For the performances, in general Newton Raphson is faster, then Cardano, then bisection. Newton Raphson and bisection takes more time with more precision requested, but Newton Raphson stay's faster with a precision of 0.000001 (in my use case I think a precision of 0.001 is enough, but I wanted to test it). Bisection at low precision is faster then Cardano. In all my tests I had weird "peaks" were for no apparent reason a method would take a lot of time, for example an average of 615 cycles but some peaks at 23244629 (about 0.001% of the time) for bisection.

This is just to get an idea (as I think my test isn't good) and keep in mind that only the bisection methods got the correct values on every calls:

Precision: 0.00001.

Bisection method A
cycles: 64065365548 | average: 615 | min: 21 | max: 23244629
iteration min: 1 | iteration max: 20 | iteration average: 16

Bisection method B
cycles: 52389432600 | average: 503 | min: 21 | max: 1433929
iteration min: 1 | iteration max: 20 | iteration average: 16

Newton Raphson f32
cycles: 14889205012 | average: 143 | min: 18 | max: 4251809
iteration min: 1 | iteration max: 61 | iteration average: 5

Newton Raphson f64
cycles: 16517484053 | average: 158 | min: 24 | max: 1164929
iteration min: 1 | iteration max: 61 | iteration average: 5

Cardano f32
cycles: 287757556 | average: 279 | min: 122 | max: 326471

Cardano f64
cycles: 400542181 | average: 388 | min: 148 | max: 606724

What ends up more interesting, is why the functions failed. I won't talk about the Cardano's functions as, as I said, I don't understand them, and the only improvement I was able to do is to use the f64 version to solve some cases.

The bisection method A never failed. As we are just computing the result of the Bézier curve for x until we find a value close enough to what we want, it can't fail (unless we ask for a precision too high).

The bisection method B failed because when we compute a value that as the correct precision for ax³+bx²+cx+d=0, when we use that result in the Bézier equation x=u³p0 + 3u²tcp0 + 3ut²cp1 + t³p1 (with u=1-t), we can end up with a value slightly outside the desired precision. A simple way so solve that was to ask for a slightly higher precision in the first equation. This work up to some precision as we run at some point in more precision than floating point number offer. This method is also slightly faster than method A.

In the previous post I was wondering if due to the constraints of our curve, Newton Raphson was guaranteed to converge unless the derivative was 0. The answer is no. I've seen 3 types of case where NR doesn't work.

The first one is what we knew, when the derivative is zero. This can happen often, for example when start = 0 and cp1 = 0 and the initial guess is 0, or end = 1 and cp2 = 1 and the initial guess is 1. Those are not (at all) the only cases and I don't know if it is a common setup.
The second one is when the method oscillate between two values. You do the computation with one value, it gives you a second value. When you do the computation with the second value, you're back to the first one. The more precision you want, the more likely you're going to run into that problem. This is due to the floating point precision, not an issue with the algorithm.
The third one is when the function converges, but to a root that is outside the [0;1] range. The parameter we use for our curve guarantee that there is only one root in [0;1], but there might be other root outside that range. See this graph for example: https://www.desmos.com/calculator/se3bomd3el there are roots at around -0.6 and 50. If you use an initial guess that is near where the tangent is horizontal (0.00075 in the graph linked), you can end up converging toward a root outside the desired range.

While looking at the first and third issue, it became clear that choosing a good initial guess can help the function converge, and converge to the correct value. And one observation is that since I want to use those curve to interpolate, I'll most likely try to search for value close to the last one found, so a good guess might be to use the previous result. A side note is not to forget to reset it when we reach the end of the curve and want to start at the beginning (at the end the result is 1 since the end point is at 1, and 1 is a bad guess for the first value of the curve. I failed to do that at some point which can lead to the method not converging).

There is also a similar issue, where the function converges to the correct root, but as with bisection method B, this value when plugged into the Bézier equation can produce a value that is slightly off the requested precision.

Here is the test code. Please don't use it, it's bad, there are left overs from different tests...

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <float.h>
#define _USE_MATH_DEFINES
#include <math.h>

typedef float f32;
typedef double f64;
typedef uintptr_t umm;
typedef uint8_t u8;
typedef uint32_t u32;
typedef uint64_t u64;
typedef int32_t s32;

#define cast( type, expr ) ( type ) ( expr )
#define array_count( array ) ( ( umm ) ( sizeof( array ) / sizeof( ( array )[ 0 ] ) ) )
#define _assert( expr ) if ( !( expr ) ) { __debugbreak( ); }

static f32 interpolation_linear( f32 start, f32 end, f32 t ) {
    f32 result = start * ( 1 - t ) + end * t;
    return result;
}

static f32 interpolation_cubic( f32 start, f32 cp_1, f32 cp_2, f32 end, f32 t ) {
    f32 u = 1 - t;
    f32 result =  ( u * u * u * start ) + ( 3 * u * u * t * cp_1 ) + ( 3 * u * t * t * cp_2 ) + ( t * t * t * end );
    return result;
}

typedef f32 cubic_func_t( f32 start, f32 cp_1, f32 cp_2, f32 end, f32 x, f32 guess, f32 epsilon, u32* iteration_count );

static f32 interpolation_cubic_t_from_x_bisection( f32 start, f32 cp_1, f32 cp_2, f32 end, f32 x, f32 guess, f32 epsilon, u32* iteration_count ) {
    
    ( *iteration_count ) = 0;
    
    f32 l = 0;
    f32 r = 1;
    
    f32 t = guess;
    
    while ( l < r ) {
        
        ( *iteration_count )++;
        
        f32 x2 = interpolation_cubic( start, cp_1, cp_2, end, t );
        
        if ( fabsf( x - x2 ) < epsilon ) {
            break;
        }
        
        if ( x2 > x ) {
            r = t;
        } else {
            l = t;
        }
        
        t = interpolation_linear( l, r, 0.5f );
    }
    
    return t;
}

static f32 interpolation_cubic_t_from_x_bisection_2( f32 start, f32 cp_1, f32 cp_2, f32 end, f32 x, f32 guess, f32 epsilon, u32* iteration_count ) {
    
    ( *iteration_count ) = 0;
    
    f32 a = ( -start + 3 * cp_1 - 3 * cp_2 + end );
    f32 b = ( 3 * start - 6 * cp_1 + 3 * cp_2 );
    f32 c = ( -3 * start + 3 * cp_1 );
    f32 d = ( start - x );
    
    f32 l = 0;
    f32 r = 1;
    f32 t = guess;
    
    while ( l < r ) {
        
        ( *iteration_count )++;
        
        f32 f = a * t * t * t + b * t * t + c * t + d;
        
        if ( fabsf( f ) < epsilon ) {
            break;
        }
        
        if ( f - d > x ) {
            r = t;
        } else {
            l = t;
        }
        
        t = interpolation_linear( l, r, 0.5f );
    }
    
    return t;
}

static f32 interpolation_cubic_t_from_x_NR_f32( f32 start, f32 cp_1, f32 cp_2, f32 end, f32 x, f32 guess, f32 epsilon, u32* iteration_count ) {
    
    /* NOTE simon (18/04/25 16:44:31): Choosing a good starting point is important.
I tried to optimize by remembering the last result and using it as a starting point, since I want to use that for solving for near values.
Which is ok, except when we end a curve and start another, and the we have a guess of 1 for the starting point which is as far as we can be
that leads to the function not converging or taking a long time to converge.
     */
    /* NOTE simon (18/04/25 16:46:54): I ran into cases where the guess was oscillating between two values and never converging. */
    
    ( *iteration_count ) = 0;
    
    f32 a = ( -start + 3 * cp_1 - 3 * cp_2 + end );
    f32 b = ( 3 * start - 6 * cp_1 + 3 * cp_2 );
    f32 c = ( -3 * start + 3 * cp_1 );
    f32 d = ( start - x );
    
    f32 previous[ 2 ] = { FLT_MAX, FLT_MAX };
    
    f32 t = guess;
    
    if ( c == 0 ) {
        
        if ( t == 0 ) {
            t += epsilon;
        } else if ( t == 1 ) {
            t -= epsilon;
        }
    }
    
    for ( umm i = 0; i < 100; i++ ) {
        
        ( *iteration_count )++;
        
        f32 f = a * t * t * t + b * t * t + c * t + d;
        
        if ( fabsf( f ) < epsilon ) {
            break;
        }
        
        f32 derivative = 3 * a * t * t + 2 * b * t + c;
        
        if ( derivative == 0 ) {
            /* The function can't converve, use bisection instead. */
            t = -1;
            break;
        }
        
        t = t - ( f / derivative );
        
        if ( t == previous[ 0 ] || t == previous[ 1 ] ) {
            t = -2;
            break;
        }
        
        previous[ 0 ] = previous[ 1 ];
        previous[ 1 ] = t;
    }
    
    return t;
}

static f32 interpolation_cubic_t_from_x_NR_f64( f32 start, f32 cp_1, f32 cp_2, f32 end, f32 x, f32 guess, f32 epsilon, u32* iteration_count ) {
    
    /* NOTE simon (18/04/25 16:44:31): Choosing a good starting point is important.
I tried to optimize by remembering the last result and using it as a starting point, since I want to use that for solving for near values.
Which is ok, except when we end a curve and start another, and the we have a guess of 1 for the starting point which is as far as we can be
that leads to the function not converging or taking a long time to converge.
     */
    /* NOTE simon (18/04/25 16:46:54): I ran into cases where the guess was oscillating between two values and never converging. */
    
    ( *iteration_count ) = 0;
    
    f64 a = ( -start + 3.0 * cp_1 - 3.0 * cp_2 + end );
    f64 b = ( 3.0 * start - 6.0 * cp_1 + 3.0 * cp_2 );
    f64 c = ( -3.0 * start + 3.0 * cp_1 );
    f64 d = ( start - cast( f64, x ) );
    
    f64 previous[ 2 ] = { guess, guess };
    
    f64 t = guess;
    
    if ( t == 0 && c == 0 ) {
        t += epsilon;
    }
    
    for ( umm i = 0; i < 100; i++ ) {
        
        ( *iteration_count )++;
        
        f64 f = a * t * t * t + b * t * t + c * t + d;
        
        if ( fabs( f ) < epsilon ) {
            break;
        }
        
        f64 derivative = 3 * a * t * t + 2 * b * t + c;
        
        if ( derivative == 0 ) {
            /* The function can't converve, use bisection instead. */
            t = -1;
            break;
        }
        
        t = t - ( f / derivative );
        
        if ( t == previous[ 0 ] || t == previous[ 1 ] ) {
            t = -2;
            break;
        }
        
        previous[ 0 ] = previous[ 1 ];
        previous[ 1 ] = t;
    }
    
    return cast( f32, t );
}

static f32 interpolation_cubic_t_from_x_cardano_f32( f32 start, f32 cp_1, f32 cp_2, f32 end, f32 x, f32 guess, f32 epsilon, u32* iteration_count ) {
    
    ( *iteration_count ) = 1;
    
    f32 a = ( -start + 3 * cp_1 - 3 * cp_2 + end );
    f32 b = ( 3 * start - 6 * cp_1 + 3 * cp_2 );
    f32 c = ( -3 * start + 3 * cp_1 );
    f32 d = ( start - x );
    
    f32 result = 0;
    
    if ( a == 0 ) {
        
        if ( b == 0 ) {
            
            if ( c == 0 ) {
                result = 0; /* NOTE simon (22/04/25 15:58:42): No solution. */
            } else {
                result = -d / c;
            }
            
        } else {
            
            f32 delta = ( b*b ) - ( 4*a*c );
            
            if ( delta >= 0 ) {
                
                f32 result_1 = ( -b + sqrtf( delta ) ) / ( 2 * a );
                f32 result_2 = ( -b - sqrtf( delta ) ) / ( 2 * a );
                
                result = result_1;
                
                if ( result_1 < 0 || result_1 > 1 ) {
                    result = result_2;
                }
            } else {
                result = 0; /* NOTE simon (22/04/25 15:59:25): No solution. */
            }
        }
        
    } else {
        
        f32 p = ((3*a*c)-(b*b))/((3*a)*(3*a));
        f32 q = ((-9*a*b*c) + (27*a*a*d) + (2*b*b*b)) / (27*a*a*a);
        
        f32 discriminant = (-4*p*p*p ) -(27*q*q);
        
        if ( discriminant < 0 ) {
            
            f32 u1 = ( -q + sqrtf( -discriminant / 27 ) ) / 2;
            f32 u = cbrtf( u1 );
            
            f32 v1 = ( -q - sqrtf( -discriminant / 27 ) ) / 2;
            f32 v = cbrtf( v1 );
            
            result = ( u + v ) - ( b/(3*a) );
            
        } else if ( discriminant == 0 ) {
            
            if ( p == 0 && q == 0 ) {
                result = 0 - ( b/(3*a) );
            } else {
                
                f32 result_1 = ( 3*q ) / p;
                result = result_1 - ( b/(3*b ) );
                
                f32 result_2 = (-3*q)/(2*p);
                
                if ( result < 0 || result > 1 ) {
                    result = result_2 - ( b/(3*b) );
                }
            }
            
        } else {
            
            f32 result_1 = 2.0f * sqrtf( -p / 3.0f ) * cosf( (1.0f/3.0f) * acosf( ((3*q)/(2*p)) * sqrtf(3/-p)) + ((2.0f*0*( f32 ) M_PI ) / 3 ) );
            f32 result_2 = 2.0f * sqrtf( -p / 3.0f ) * cosf( (1.0f/3.0f) * acosf( ((3*q)/(2*p)) * sqrtf(3/-p)) + ((2.0f*1*( f32 ) M_PI ) / 3 ) );
            f32 result_3 = 2.0f * sqrtf( -p / 3.0f ) * cosf( (1.0f/3.0f) * acosf( ((3*q)/(2*p)) * sqrtf(3/-p)) + ((2.0f*2*( f32 ) M_PI ) / 3 ) );
            
            result = result_1 - ( b/(3*b ) );
            
            if ( result < 0 || result > 1 ) {
                result = result_2 - ( b/(3*b ) );
            }
            
            if ( result < 0 || result > 1 ) {
                result = result_3 - ( b/(3*b ) );
            }
        }
    }
    
    return result;
}

static f32 interpolation_cubic_t_from_x_cardano_f64( f32 start, f32 cp_1, f32 cp_2, f32 end, f32 x, f32 guess, f32 epsilon, u32* iteration_count ) {
    
    ( *iteration_count ) = 1;
    
    f64 a = ( -start + 3.0 * cp_1 - 3.0 * cp_2 + end );
    f64 b = ( 3.0 * start - 6.0 * cp_1 + 3.0 * cp_2 );
    f64 c = ( -3.0 * start + 3.0 * cp_1 );
    f64 d = ( start - cast( f64, x ) );
    
    f32 result = 0;
    
    if ( a == 0 ) {
        
        if ( b == 0 ) {
            
            if ( c == 0 ) {
                result = 0; /* NOTE simon (22/04/25 15:58:42): No solution. */
            } else {
                result = cast( f32, -d / c );
            }
            
        } else {
            
            f64 delta = ( b*b ) - ( 4*a*c );
            
            if ( delta >= 0 ) {
                
                f64 result_1 = ( -b + sqrt( delta ) ) / ( 2 * a );
                f64 result_2 = ( -b - sqrt( delta ) ) / ( 2 * a );
                
                result = cast( f32, result_1 );
                
                if ( result_1 < 0 || result_1 > 1 ) {
                    result = cast( f32, result_2 );
                }
                
            } else {
                result = 0; /* NOTE simon (22/04/25 15:59:25): No solution. */
            }
        }
        
    } else {
        
        f64 p = ((3*a*c)-(b*b))/((3*a)*(3*a));
        f64 q = ((-9*a*b*c) + (27*a*a*d) + (2*b*b*b)) / (27*a*a*a);
        
        f64 discriminant = (-4*p*p*p ) -(27*q*q);
        
        if ( discriminant < 0 ) {
            
            f64 u1 = ( -q + sqrt( -discriminant / 27 ) ) / 2;
            f64 u = cbrt( u1 );
            
            f64 v1 = ( -q - sqrt( -discriminant / 27 ) ) / 2;
            f64 v = cbrt( v1 );
            
            result = cast( f32, ( u + v ) - ( b/(3*a) ) );
            
        } else if ( discriminant == 0 ) {
            
            if ( p == 0 && q == 0 ) {
                result = cast( f32, 0 - ( b/(3*a) ) );
            } else {
                
                f64 result_1 = ( 3*q ) / p;
                result = cast( f32, result_1 - ( b/(3*b ) ) );
                
                f64 result_2 = (-3*q)/(2*p);
                
                if ( result < 0 || result > 1 ) {
                    result = cast( f32, result_2 - ( b/(3*b) ) );
                }
            }
            
        } else {
            
            f64 result_1 = 2.0 * sqrt( -p / 3 ) * cos( (1.0/3) * acos( ((3*q)/(2*p)) * sqrt(3/-p)) + (( 2.0*0*M_PI ) / 3 ) );
            f64 result_2 = 2.0 * sqrt( -p / 3 ) * cos( (1.0/3) * acos( ((3*q)/(2*p)) * sqrt(3/-p)) + (( 2.0*1*M_PI ) / 3 ) );
            f64 result_3 = 2.0 * sqrt( -p / 3 ) * cos( (1.0/3) * acos( ((3*q)/(2*p)) * sqrt(3/-p)) + (( 2.0*2*M_PI ) / 3 ) );
            
            result = cast( f32, result_1 - ( b/(3*b ) ) );
            
            if ( result < 0 || result > 1 ) {
                result = cast( f32, result_2 - ( b/(3*b ) ) );
            }
            
            if ( result < 0 || result > 1 ) {
                result = cast( f32, result_3 - ( b/(3*b ) ) );
            }
        }
    }
    
    return result;
}

typedef struct stats_t {
    cubic_func_t* func;
    char* name;
    u8 is_iterative;
    u64 hits;
    u64 cycles_total;
    u64 cycle_ranges[ 101 ];
    u64 cycles_min;
    u64 cycles_max;
    u64 iterations[ 100 ];
    f32 error_min;
    f32 error_max;
    u64 above_epsilon;
} stats_t;

#define print(...) pos += snprintf( out + pos, reserved - pos, __VA_ARGS__ )

int main( int argc, char** argv ) {
    
    s32 reserved = 1000 * 1024 * 1024;
    char* out = malloc( reserved );
    s32 pos = 0;
    
    stats_t all_stats[ ] = {
        {
            .func = interpolation_cubic_t_from_x_bisection,
            .name = "bisection 1",
            .is_iterative = 1,
            .cycles_min = 0xffffffffffffffff,
            .error_min = FLT_MAX,
        },
        {
            .func = interpolation_cubic_t_from_x_bisection_2,
            .name = "bisection 2",
            .is_iterative = 1,
            .cycles_min = 0xffffffffffffffff,
            .error_min = FLT_MAX,
        },
        {
            .func = interpolation_cubic_t_from_x_NR_f32,
            .name = "newton raphson f32",
            .is_iterative = 1,
            .cycles_min = 0xffffffffffffffff,
            .error_min = FLT_MAX,
        },
        {
            .func = interpolation_cubic_t_from_x_NR_f64,
            .name = "newton raphson f64",
            .is_iterative = 1,
            .cycles_min = 0xffffffffffffffff,
            .error_min = FLT_MAX,
        },
        {
            .func = interpolation_cubic_t_from_x_cardano_f32,
            .name = "cardano f32",
            .is_iterative = 0,
            .cycles_min = 0xffffffffffffffff,
            .error_min = FLT_MAX,
        },
        {
            .func = interpolation_cubic_t_from_x_cardano_f64,
            .name = "cardano f64",
            .is_iterative = 0,
            .cycles_min = 0xffffffffffffffff,
            .error_min = FLT_MAX,
        },
    };
    
    u32 cp_iteration_count = 101;
    f32 cp_step = 0.01f;
    
    u32 guess_iteration_count = 101;
    f32 guess_step = 0.01f;
    
    u32 iteration_count = 101;
    f32 step = 0.01f;
    
    f32 epsilon = 0.00001f;
    
    f32 start = 0;
    f32 end = 1;
    
    for ( u32 stats_index = 0; stats_index < array_count( all_stats ); stats_index++ ) {
        
        stats_t* stats = all_stats + stats_index;
        u32 guess_count = guess_iteration_count;
        
        if ( !stats->is_iterative ) {
            guess_count = 1;
        }
        
        print( "# %s\n", stats->name );
        
        for ( u32 cp_1_it = 0; cp_1_it < cp_iteration_count; cp_1_it++ ) {
            
            for ( u32 cp_2_it = 0; cp_2_it < cp_iteration_count; cp_2_it++ ) {
                
                f32 cp_1 = cp_1_it * cp_step;
                f32 cp_2 = cp_2_it * cp_step;
                
                for ( u32 guess_i = 0; guess_i < guess_count; guess_i++ ) {
                    
                    f32 initial_guess = guess_i * guess_step;
                    
                    for ( u32 i = 0; i < iteration_count; i++ ) {
                        
                        f32 x = step * i;
                        
                        u32 it = 0;
                        u64 cycles = 0;
                        u32 count = 0;
                        f32 t = 0;
                        
                        u64 cycle_start = __rdtsc( );
                        t = stats->func( start, cp_1, cp_2, end, x, initial_guess, epsilon * 0.75f, &it );
                        u64 cycle_end = __rdtsc( );
                        cycles = cycle_end - cycle_start;
                        
                        f32 x_check = interpolation_cubic( start, cp_1, cp_2, end, t );
                        f32 error = fabsf( x - x_check );
                        
                        stats->hits++;
                        stats->cycles_total += cycles;
                        
                        u32 found_range = 0;
                        
                        for ( u32 r = 0; r < 101; r++ ) {
                            
                            u32 range_start = r * 100;
                            u32 range_end = ( r + 1 ) * 100;
                            
                            if ( cycles >= range_start && cycles < range_end ) {
                                stats->cycle_ranges[ r ]++;
                                found_range = 1;
                                break;
                            }
                        }
                        
                        if ( !found_range ) {
                            stats->cycle_ranges[ 100 ]++;
                        }
                        
                        if ( cycles < stats->cycles_min ) {
                            stats->cycles_min = cycles;
                        }
                        
                        if ( cycles > stats->cycles_max ) {
                            stats->cycles_max = cycles;
                        }
                        
                        stats->iterations[ it ]++;
                        
                        if ( error < stats->error_min ) {
                            stats->error_min = error;
                        }
                        
                        if ( error > stats->error_max ) {
                            stats->error_max = error;
                        }
                        
                        if ( error > epsilon ) {
                            stats->above_epsilon++;
                        }
                        
                        if ( it >= 100 ) {
                            print( "failed: t: %f | cp_1: %f | cp_2: %f | x: %f | guess: %f | it: %d | iteration >= 100\n", t, cp_1, cp_2, x, initial_guess, it );
                        } else if ( error > epsilon ) {
                            print( "failed: t: %f | cp_1: %f | cp_2: %f | x: %f | guess: %f | it: %d | ", t, cp_1, cp_2, x, initial_guess, it );
                            
                            if ( t == -1 ) {
                                print( "f' = 0\n" );
                            } else if ( t == -2 ) {
                                print( "oscillate between 2 values\n" );
                            } else {
                                print( "error > epsilon : %f > %f : %e > %e\n", error, epsilon, error, epsilon );
                            }
                        }
                    }
                }
            }
        }
        
        for ( u32 i = 0; i < 100; i++ ) {
            
            if ( !stats->cycle_ranges[ i ] ) {
                continue;
            }
            
            print( "    cycles %d -> %d: %lld\n", i * 100, ( i + 1 ) * 100, stats->cycle_ranges[ i ] );
        }
        
        print( "    cycles > 10000: %lld\n", stats->cycle_ranges[ 100 ] );
        print( "    cycles: %lld | average: %lld | min: %lld | max: %lld\n", stats->cycles_total, stats->cycles_total / stats->hits, stats->cycles_min, stats->cycles_max );
        
        u64 iteration = 0;
        u64 iteration_min = 0xffffffffffffffff;
        u64 iteration_max = 0;
        
        for ( u32 i = 0; i < 100; i++ ) {
            
            if ( stats->iterations[ i ] == 0 ) {
                continue;
            }
            
            if ( i < iteration_min ) {
                iteration_min = i;
            }
            
            if ( i > iteration_max ) {
                iteration_max = i;
            }
            
            iteration += ( stats->iterations[ i ] * i );
            print( "    iterations[ %d ]: %lld\n", i, stats->iterations[ i ] );
        }
        
        u32 iteration_average = cast( u32, round( cast( f64, iteration ) / cast( f64, stats->hits ) ) );
        print( "    iteration min: %lld | iteration max: %lld | iteration average: %d\n", iteration_min, iteration_max, iteration_average );
        print( "    error min: %f | max: %f | above epsilon: %lld\n\n", stats->error_min, stats->error_max, stats->above_epsilon );
    }
    
    FILE* file = fopen( "stats.txt", "wb" );
    fwrite( out, pos, 1, file );
    fclose( file );
    
    return 0;
}

Simon Anciaux

#30472

April 30, 2025

[EDIT] There is an issue with the test setup, the numbers here aren't representative of a real use case. See next post.

To continue on this, I tried to do a SIMD version of the bisection. Since we can compute 4 values in the same time as we compute 1 value, we can split the range in 5 instead of 2, computing values at 0.2, 0.4, 0.6, 0.8. And then check which range is we fall in and again split that in 5 until we found the correct value. The variable names aren't very explicit, but hopefully it's understandable.

static f32 interpolation_cubic_t_from_x_bisection_simd( f32 start, f32 cp_1, f32 cp_2, f32 end, f32 x, f32 epsilon ) {
    
    f32 a = ( -start + 3 * cp_1 - 3 * cp_2 + end );
    f32 b = ( 3 * start - 6 * cp_1 + 3 * cp_2 );
    f32 c = ( -3 * start + 3 * cp_1 );
    f32 d = ( start - x );
    
    f32 result = 0;
    
    /* NOTE simon (24/04/25 18:41:29): The order is inverted to my mental model. */
    __m128 factors_1 = _mm_set_ps( 0.2f, 0.4f, 0.6f, 0.8f );
    __m128 factors_2 = _mm_set_ps( 0.8f, 0.6f, 0.4f, 0.2f );
    __m128 ts = factors_2;
    __m128 as = _mm_set1_ps( a );
    __m128 bs = _mm_set1_ps( b );
    __m128 cs = _mm_set1_ps( c );
    __m128 ds = _mm_set1_ps( d );
    __m128 epsilons = _mm_set_ps1( epsilon );
    __m128 xs = _mm_set1_ps( x );
    __m128 ls = _mm_set1_ps( 0 );
    __m128 rs = _mm_set1_ps( 1 );
    
    while ( 1 ) {
        
        /* NOTE simon (26/04/25 13:57:04): Compute cubic. */
        __m128 tt = _mm_mul_ps( ts, ts );
        __m128 ttt = _mm_mul_ps( tt, ts );
        __m128 attt = _mm_mul_ps( as, ttt );
        __m128 btt = _mm_mul_ps( bs, tt );
        __m128 ct = _mm_mul_ps( cs, ts );
        __m128 f1 = _mm_add_ps( attt, btt );
        __m128 f2 = _mm_add_ps( ct, ds );
        __m128 fr = _mm_add_ps( f1, f2 );
        
        /* NOTE simon (26/04/25 13:57:12): Absolute value */
        __m128 n_zero = _mm_set1_ps( -0.0f );
        __m128 abs = _mm_andnot_ps( n_zero, fr );
        
        /* NOTE simon (26/04/25 13:57:37): Have we found the result ? */
        __m128 cmp = _mm_cmple_ps( abs, epsilons );
        u32 mask = _mm_movemask_ps( cmp );
        
        if ( mask ) {
            
            /* NOTE simon (26/04/25 13:58:08): Extract the result. */
            __m128 shuffled;
            
            if ( mask & 0x1 ) {
                shuffled = _mm_shuffle_ps( ts, ts, 0 );
            } else if ( mask & 0x2 ) {
                shuffled = _mm_shuffle_ps( ts, ts, 1 );
            } else if ( mask & 0x4 ) {
                shuffled = _mm_shuffle_ps( ts, ts, 2 );
            } else  {
                // _assert( mask & 0x8 );
                shuffled = _mm_shuffle_ps( ts, ts, 3 );
            }
            
            result = _mm_cvtss_f32( shuffled );
            break;
        }
        
        /* NOTE simon (26/04/25 13:58:20): Update left and right limits. */
        __m128 fxs = _mm_sub_ps( fr, ds );
        __m128 setl = _mm_cmplt_ps( fxs, xs );
        
        u32 lmask = _mm_movemask_ps( setl );
        u32 rmask = ~lmask;
        
        if ( lmask & 0x8 ) {
            // 0b11111111
            ls = _mm_shuffle_ps( ts, ts, 0xff );
        } else if ( lmask & 0x4 ) {
            // 0b10101010
            ls = _mm_shuffle_ps( ts, ts, 0xaa );
        } else if ( lmask & 0x2 ) {
            // 0b01010101
            ls = _mm_shuffle_ps( ts, ts, 0x55 );
        } else if ( lmask & 0x1 ) {
            // 0b00000000
            ls = _mm_shuffle_ps( ts, ts, 0 );
        }
        
        if ( rmask & 0x1 ) {
            rs = _mm_shuffle_ps( ts, ts, 0 );
        } else if ( rmask & 0x2 ) {
            rs = _mm_shuffle_ps( ts, ts, 0x55 );
        } else if ( rmask & 0x4 ) {
            rs = _mm_shuffle_ps( ts, ts, 0xaa );
        } else if ( rmask & 0x8 ) {
            rs = _mm_shuffle_ps( ts, ts, 0xff );
        }
        
        /* NOTE simon (24/04/25 18:12:28): Lerp */
        __m128 left = _mm_mul_ps( ls, factors_1 );
        __m128 right = _mm_mul_ps( rs, factors_2 );
        ts = _mm_add_ps( left, right );
    }
    
    return result;
}

mmozeiko helped on discord to get a bit more perf and try other things. We tried "pack left" to try and remove the branching to update the left and right bounds, but it was slower than the branching version. They linked to this GDC talk that explained the pack left:

https://deplinenoise.wordpress.com/wp-content/uploads/2015/03/gdc2015_afredriksson_simd.pdf

https://www.gdcvault.com/play/1022248/SIMD-at-Insomniac-Games-How

At some point, they suggested to do bisection in a different way, by keeping track of the left value, and a distance from there, instead of using left and right. This method get rid of most of the branching. I did a scalar version of that, and it was a lot faster than the left and right version. It was close to the perf of the SIMD version.

static f32 interpolation_cubic_t_from_x_bisection( f32 start, f32 cp_1, f32 cp_2, f32 end, f32 x, f32 epsilon ) {
    
    f32 a = ( -start + 3 * cp_1 - 3 * cp_2 + end );
    f32 b = ( 3 * start - 6 * cp_1 + 3 * cp_2 );
    f32 c = ( -3 * start + 3 * cp_1 );
    f32 d = ( start - x );
    
    f32 l = 0;
    f32 w = 1;
    f32 t = 0;
    
    while ( 1 ) {
        
        w *= 0.5f;
        t = l + w;
        
        f32 f = a * t * t * t + b * t * t + c * t;
        f32 f_d = f + d;
        
        if ( fabsf( f_d ) < epsilon ) {
            break;
        }
        
        l = ( f > x ) ? l : t;
    }
    
    return t;
}

I than did a SIMD version of that which is the fastest way I have at the moment:

static f32 interpolation_cubic_t_from_x_bisection_simd( f32 start, f32 cp_1, f32 cp_2, f32 end, f32 x, f32 epsilon ) {
    
    f32 t = 0;
    
    f32 a = ( -start + 3 * cp_1 - 3 * cp_2 + end );
    f32 b = ( 3 * start - 6 * cp_1 + 3 * cp_2 );
    f32 c = ( -3 * start + 3 * cp_1 );
    f32 d = ( start - x );
    
    __m128 as = _mm_set1_ps( a );
    __m128 bs = _mm_set1_ps( b );
    __m128 cs = _mm_set1_ps( c );
    __m128 ds = _mm_set1_ps( d );
    __m128 epsilons = _mm_set_ps1( epsilon );
    __m128 xs = _mm_set_ps1( x );
    __m128 ls = _mm_set_ps1( 0 );
    __m128 factors = _mm_set_ps( 0.8f, 0.6f, 0.4f, 0.2f );
    __m128 distance = _mm_set_ps1( 1.0f );
    __m128 scale = _mm_set_ps1( 0.2f );
    
    while ( 1 ) {
        
        __m128 ws = _mm_mul_ps( distance, factors );
        __m128 ts = _mm_add_ps( ls, ws );
        
        /* NOTE simon (26/04/25 13:57:04): Compute cubic. */
        __m128 tt = _mm_mul_ps( ts, ts );
        __m128 ttt = _mm_mul_ps( tt, ts );
        __m128 attt = _mm_mul_ps( as, ttt );
        __m128 btt = _mm_mul_ps( bs, tt );
        __m128 ct = _mm_mul_ps( cs, ts );
        __m128 f1 = _mm_add_ps( attt, btt );
        __m128 f2 = _mm_add_ps( ct, ds );
        __m128 fr = _mm_add_ps( f1, f2 );
        
        /* NOTE simon (26/04/25 13:57:12): Absolute value */
        __m128 n_zero = _mm_set1_ps( -0.0f );
        __m128 abs = _mm_andnot_ps( n_zero, fr );
        
        /* NOTE simon (26/04/25 13:57:37): Have we found the result ? */
        __m128 cmp = _mm_cmple_ps( abs, epsilons );
        u32 mask = _mm_movemask_ps( cmp );
        
        if ( mask ) {
            
            /* NOTE simon (26/04/25 13:58:08): Extract the result. */
            __m128 shuffled;
            
            if ( mask & 0x1 ) {
                shuffled = _mm_shuffle_ps( ts, ts, 0 );
            } else if ( mask & 0x2 ) {
                shuffled = _mm_shuffle_ps( ts, ts, 1 );
            } else if ( mask & 0x4 ) {
                shuffled = _mm_shuffle_ps( ts, ts, 2 );
            } else  {
                // _assert( mask & 0x8 );
                shuffled = _mm_shuffle_ps( ts, ts, 3 );
            }
            
            t = _mm_cvtss_f32( shuffled );
            break;
        }
        
        /* NOTE simon (26/04/25 13:58:20): Update left and distance. */
        __m128 fxs = _mm_sub_ps( fr, ds );
        __m128 setl = _mm_cmplt_ps( fxs, xs );
        
        u32 lmask = _mm_movemask_ps( setl );
        
        if ( lmask & 0x8 ) {
            // 0b11111111
            ls = _mm_shuffle_ps( ts, ts, 0xff );
        } else if ( lmask & 0x4 ) {
            // 0b10101010
            ls = _mm_shuffle_ps( ts, ts, 0xaa );
        } else if ( lmask & 0x2 ) {
            // 0b01010101
            ls = _mm_shuffle_ps( ts, ts, 0x55 );
        } else if ( lmask & 0x1 ) {
            // 0b00000000
            ls = _mm_shuffle_ps( ts, ts, 0 );
        }
        
        distance = _mm_mul_ps( distance, scale );
    }
    
    return t;
}

If you're interested, here is a test file with some things we tested: https://sisyphe.be/stuff/simd_test.c

Edited by Simon Anciaux on May 8, 2025, 2:44pm

Simon Anciaux

#30475

May 8, 2025

After more testing I found an error in my test setup that biased the results. While working on the scalar bisection and Newton Raphson method, I had a loop to test different initial guess values. When I went to work on the SIMD versions, I stopped using the initial guess, but I didn't remove the loop, which caused the same test to run 100 times in a row. That seem to "train" the branch predictor (I'm not sure, it's a guess) which lead to results that don't reflect a real use case. After removing those useless iteration, the faster method seemed to be the "pack left + blend" version.

I also did a Newton Raphson method in SIMD. The idea is to try 4 different locations to have a better chance at finding a places that converges, and is in the 0 to 1 range.

In the end, the choice is between the Newton Raphson scalar and SIMD version. The scalar version is faster if we know a good guess for the point we are searching. Since this is used for animation and we will try to compute the points with a small increment between each call, we often know a very good guess: the last result we got from the function. The SIMD version is faster if we randomly choose point to search.

NR scalar with good guess: +- 30 cycles.
NR scalar without good guess: +- 170 cycles.
NR SIMD: +- 110 cycles.

#if 1
static f32 interpolation_cubic_t_from_x_NR_BI( cubic_state_t* state, f32 x, f32 guess, f32 epsilon ) {
    
    f32 d = ( state->start - x );
    
    b32 do_bisection = 1;
    f32 bisection_l = 0;
    f32 bisection_r = 1;
    f32 t = guess;
    
    for ( umm i = 0; i < 10; i++ ) {
        
        f32 f = state->a * t * t * t + state->b * t * t + state->c * t + d;
        
        if ( fabsf( f ) < epsilon ) {
            do_bisection = 0;
            break;
        }
        
        f32 derivative = 3 * state->a * t * t + 2 * state->b * t + state->c;
        
        if ( derivative == 0 ) {
            break;
        }
        
        t = t - ( f / derivative );
    }
    
    if ( t < 0 || t > 1 || do_bisection ) {
        
#if 0
        t = guess;
        
        while ( bisection_l < bisection_r ) {
            
            f32 f = state->a * t * t * t + state->b * t * t + state->c * t;
            f32 f_d = f + d;
            
            if ( fabsf( f_d ) < epsilon ) {
                break;
            }
            
            if ( f > x ) {
                bisection_r = t;
            } else {
                bisection_l = t;
            }
            
            t = bisection_l * 0.5f + bisection_r * 0.5f;
        }
#elif 1
        t = 0;
        f32 l = 0;
        f32 w = 1;
        
        while ( 1 ) {
            
            w *= 0.5f;
            t = l + w;
            
            f32 f = state->a * t * t * t + state->b * t * t + state->c * t;
            f32 f_d = f + d;
            
            if ( fabsf( f_d ) < epsilon ) {
                break;
            }
            
            l = ( f > x ) ? l : t;
        }
#else
        t = interpolation_cubic_t_from_x_bisection_simd_5( state, x, epsilon );
#endif
    }
    
    return t;
}

#else

/* NOTE simon (07/05/25 15:03:52): The generated code by MSVC doesn't look goods, lots of unpck ? */
static f32 interpolation_cubic_t_from_x_NR_BI( cubic_state_t* state, f32 x, f32 guess, f32 epsilon ) {
    
    f32 d = ( state->start - x );
    
    b32 do_bisection = 1;
    f32 bisection_l = 0;
    f32 bisection_r = 1;
    f32 t = guess;
    
    __m128 ts = _mm_set_ps1( t );
    __m128 as = _mm_set_ps1( state->a );
    __m128 bs = _mm_set_ps1( state->b );
    __m128 cs = _mm_set_ps1( state->c );
    __m128 abs_mask = _mm_castsi128_ps( _mm_set_epi32( 0, 0, 0x80000000, 0x80000000 ) );
    __m128 epsilons = _mm_set_ps1( epsilon );
    
    for ( umm i = 0; i < 10; i++ ) {
        
#if 0
        f32 f = state->a * t * t * t + state->b * t * t + state->c * t + d;
        
        if ( fabsf( f ) < epsilon ) {
            do_bisection = 0;
            break;
        }
        
        f32 derivative = 3 * state->a * t * t + 2 * state->b * t + state->c;
        
        if ( derivative == 0 ) {
            break;
        }
        
        t = t - ( f / derivative );
#else
        /* NOTE simon (06/05/25 18:29:22): Trying to compute function and derivative at the same time. */
        __m128 y = _mm_mul_ps( as, ts ); // at | at
        y = _mm_mul_ps( y, ts ); // att | att
        __m128 t_3 = _mm_blend_ps( ts, _mm_set_ps1( 3 ), 0x2 ); // t | 3
        y = _mm_mul_ps( y, t_3 ); // attt | 3att
        
        __m128 z = _mm_mul_ps( bs, ts ); // bt | bt
        __m128 t_2 = _mm_blend_ps( ts, _mm_set_ps1( 2 ), 0x2 ); // t | 2
        z = _mm_mul_ps( z, t_2 ); // btt | 2bt
        
        __m128 t_1 = _mm_blend_ps( ts, _mm_set_ps1( 1 ), 0x2 ); // t | 1
        __m128 w = _mm_mul_ps( cs, t_1 ); // ct | c
        
        __m128 f1 = _mm_add_ps( y, z );
        __m128 f2 = _mm_add_ps( w, _mm_set_ps( 0, 0, 0, d ) );
        __m128 f = _mm_add_ps( f1, f2 );
        
        __m128 abs = _mm_andnot_ps( abs_mask, f );
        __m128 cmp = _mm_cmple_ps( abs, epsilons );
        u32 mask = _mm_movemask_ps( cmp );
        
        if ( mask & 0x1 ) {
            t = _mm_cvtss_f32( ts );
            do_bisection = 0;
            break;
        }
#if 0
        /* NOTE simon (07/05/25 16:01:11): Derivative is close to zero. Not doing this test is faster on average. */
        else if ( mask & 0x2 ) {
            break;
        }
#endif
        
        __m128 a = _mm_shuffle_ps( f, f, 0x0 );
        __m128 b = _mm_shuffle_ps( f, f, 0x5 );
        __m128 div = _mm_div_ps( a, b );
        ts = _mm_sub_ps( ts, div );
#endif
    }
    
    if ( t < 0 || t > 1 || do_bisection ) {
        
#if 0
        t = guess;
        
        while ( bisection_l < bisection_r ) {
            
            f32 f = state->a * t * t * t + state->b * t * t + state->c * t;
            f32 f_d = f + d;
            
            if ( fabsf( f_d ) < epsilon ) {
                break;
            }
            
            if ( f > x ) {
                bisection_r = t;
            } else {
                bisection_l = t;
            }
            
            t = bisection_l * 0.5f + bisection_r * 0.5f;
        }
#elif 1
        t = 0;
        f32 l = 0;
        f32 w = 1;
        
        while ( 1 ) {
            
            w *= 0.5f;
            t = l + w;
            
            f32 f = state->a * t * t * t + state->b * t * t + state->c * t;
            f32 f_d = f + d;
            
            if ( fabsf( f_d ) < epsilon ) {
                break;
            }
            
            l = ( f > x ) ? l : t;
        }
#else
        t = interpolation_cubic_t_from_x_bisection_simd_5( state, x, epsilon );
#endif
    }
    
    return t;
}

#endif

static f32 interpolation_cubic_t_from_x_NR_BI_SIMD( cubic_state_t* state, f32 x, f32 epsilon ) {
    
    f32 d = ( state->start - x );
    
    __m128 as = _mm_set1_ps( state->a );
    __m128 bs = _mm_set1_ps( state->b );
    __m128 cs = _mm_set1_ps( state->c );
    __m128 ds = _mm_set1_ps( d );
    __m128 epsilons = _mm_set_ps1( epsilon );
    
    // __m128 zeros = _mm_set_ps1( 0 );
    // __m128 ones = _mm_set_ps1( 1 );
    __m128 ts = _mm_set_ps( 0.8f, 0.6f, 0.4f, 0.2f );
    
    b32 do_bisection = 1;
    f32 t = 0;
    u32 active_lanes = 0xf;
    
    for ( umm i = 0; i < 10; i++ ) {
        
        // f32 f = a * t * t * t + b * t * t + c * t + d;
        
        __m128 tt = _mm_mul_ps( ts, ts );
        __m128 at = _mm_mul_ps( as, ts );
        __m128 attt = _mm_mul_ps( at, tt );
        __m128 bt = _mm_mul_ps( bs, ts );
        __m128 btt = _mm_mul_ps( bt, ts );
        __m128 ct = _mm_mul_ps( cs, ts );
        __m128 f1 = _mm_add_ps( attt, btt );
        __m128 f2 = _mm_add_ps( ct, ds );
        __m128 fr = _mm_add_ps( f1, f2 );
        
        /* NOTE simon (26/04/25 13:57:12): Absolute value */
        __m128 n_zero = _mm_set1_ps( -0.0f );
        __m128 abs = _mm_andnot_ps( n_zero, fr );
        
        /* NOTE simon (26/04/25 13:57:37): Have we found the result ? */
        __m128 cmp = _mm_cmple_ps( abs, epsilons );
        u32 mask = _mm_movemask_ps( cmp );
        mask &= active_lanes;
        
        if ( mask ) {
            
            /* NOTE simon (05/05/25 14:42:12): We can converge outside the 0 to 1 range, so verify that the result is between 0 and 1 */
            __m128 gez = _mm_cmpge_ps( ts, _mm_set_ps1( 0 ) );
            __m128 leo = _mm_cmple_ps( ts, _mm_set_ps1( 1 ) );
            __m128 valid = _mm_and_ps( gez, leo );
            u32 valid_mask = _mm_movemask_ps( valid );
            u32 extract = mask & valid_mask;
            
            if ( extract ) {
                /* NOTE simon (26/04/25 13:58:08): Extract the result. */
                __m128i shuffle_mask = *( __m128i* ) ( shuffle_lut + ( extract * 16 ) );
                __m128i shuffled = _mm_shuffle_epi8( _mm_castps_si128( ts ), shuffle_mask );
                t = _mm_cvtss_f32( _mm_castsi128_ps( shuffled ) );
                do_bisection = 0;
                break;
            }
            
            active_lanes ^= ( mask & ~valid_mask );
        }
        
        /* NOTE simon (06/05/25 15:16:01): f32 derivative = 3 * a * t * t + 2 * b * t + c; */
        __m128 t3 = _mm_mul_ps( ts, _mm_set_ps1( 3 ) );
        __m128 att3 = _mm_mul_ps( at, t3 );
        __m128 bt2 = _mm_mul_ps( bt, _mm_set_ps1( 2 ) );
        __m128 derivative = _mm_add_ps( att3, bt2 );
        derivative = _mm_add_ps( derivative, cs );
        
        /* NOTE simon (02/05/25 16:22:59): Can produce inf as we might divide by zero. */
        __m128 divs = _mm_div_ps( fr, derivative );
        ts = _mm_sub_ps( ts, divs );
    }
    
    if ( do_bisection ) {
        t = interpolation_cubic_t_from_x_bisection_simd_5( state, x, epsilon );
    }
    
    return t;
}

There is also a version of NR that tries to compute the function and its derivative at the same time, using only 2 lane of the SIMD registers, but it didn't work great (it's not fast).

I also randomized the control points instead of incrementing them by a fixed amount to be sure I'm not generating "easy" values for the algorithm.

There are also a few tweaks, for example pre-computing a, b and c, or using a single lookup table for pack left and extracting the result to minimize the amount of cache we use.

The file is pretty disorganized at this point, but here it is: https://sisyphe.be/donotdelete/simd_test.c

Edited by Simon Anciaux on May 8, 2025, 3:08pm

Simon Anciaux

#30482

May 20, 2025

Found another error. In those tests, I always assumed p0 is at 0 and p1 is at 1. In my use case it isn't the case. I have animation curves that are split in several bezier segments, and while the full curve is between 0 and 1, each segment can be any part of that range (e.g. p0 = 0.213 and p1 = 0.426).

The code was setup to handle that, except when doing bisection I got it wrong. For the bisection only, d should just be equal to p0, and I need to subtract x when comparing to epsilon (instead of doing d = start - x). Since in the test start was equal to 0, the result of the test were correct, even if the logic being them wasn't good.

It doesn't change the performance results.

stu f32 interpolation_cubic_t_from_x_bisection( interpolation_cubic_t* cubic, f32 x, f32 epsilon ) {
    
    f32 d = cubic->start;
    
    f32 l = 0;
    f32 w = 1;
    f32 t = 0;
    
    while ( 1 ) {
        
        w *= 0.5f;
        t = l + w;
        
        f32 f = cubic->a * t * t * t + cubic->b * t * t + cubic->c * t + d;
        f32 f_minus_x = f - x;
        
        if ( math_abs( f_minus_x ) < epsilon ) {
            break;
        }
        
        l = ( f > x ) ? l : t;
    }
    
    return t;
}

I've updated the test file: https://sisyphe.be/donotdelete/simd_test.c

Edited by Simon Anciaux on May 20, 2025, 12:49pm

Reply to Thread

← Back to index