The fast inverse square root, one of my favourite algorithms!
I like a lot of algorithms for their ingenuity and simplicity in solving problems, but one in particular stands out when those two qualities are taken into account: the fast inverse square root, made famous by John Carmack’s Quake III Arena, which made it possible for personal computers of the ’90s to render enough surfaces to make 3D video gaming viable. The inverse of a square root is used to normalize the magnitude of the vectors that represent each surface to be rendered, so this operation is performed extensively by the game engine. Consequently, any reduction in its computational cost has a heavy impact on the overall performance of the code.
Carmack’s implementation of the fast inverse square root (see Algorithm 1, which includes the original censored comments) was written in C despite the fact that every operation it needs was already available through the <math.h> library. It was written that way in order to replace the costly library operations with much faster ones.
float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the f***?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}
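As a quick sanity check, here is a minimal sketch of how one might call Q_rsqrt and compare it against the standard library; the sample values and the formatting are my own choices, not part of the original code. Compile it together with Algorithm 1 and link with -lm:

#include <stdio.h>
#include <math.h>

float Q_rsqrt( float number );   /* Algorithm 1 above */

int main(void) {
    float samples[] = { 0.01f, 0.25f, 1.0f, 4.0f, 100.0f };
    for (int k = 0; k < 5; k++) {
        float x     = samples[k];
        float fast  = Q_rsqrt(x);
        float exact = 1.0f / sqrtf(x);
        printf("x = %7.2f  Q_rsqrt = %.6f  1/sqrtf = %.6f  rel. error = %.2e\n",
               x, fast, exact, fabsf(fast - exact) / exact);
    }
    return 0;
}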
The most interesting aspect of this algorithm is that it exploits a very good approximation (for x in the required range, i.e., 0 to 1) that relates the bit pattern of a floating-point number to its logarithm: the mantissa and the exponent of a number can be read off by simple operations over the bits that represent it. The in-depth explanation won’t be covered here, but it relies on the fact that log2(1 + x) ≈ x. The equality holds exactly for x ∈ {0, 1}, and the average error over the interval can be reduced by adding a small constant (often called σ).
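To make this concrete, here is a small self-contained sketch (the sample values are arbitrary) showing that a positive float’s bit pattern, read as an integer and rescaled, already approximates log2 of that float, precisely because of the log2(1 + x) ≈ x trick. Link with -lm:

#include <stdio.h>
#include <string.h>
#include <math.h>

int main(void) {
    float samples[] = { 0.15f, 1.0f, 2.0f, 10.0f, 100.0f };
    for (int k = 0; k < 5; k++) {
        float f = samples[k];
        unsigned int bits;
        memcpy(&bits, &f, sizeof bits);   /* reinterpret the float's bits */
        /* bits = (exponent + 127) * 2^23 + mantissa * 2^23, so dividing by 2^23
           gives exponent + mantissa + 127 ~ log2(f) + 127, since log2(1+x) ~ x */
        double approx = bits / (double)(1 << 23) - 127.0;
        printf("f = %8.3f   bits-based approx = %9.5f   log2(f) = %9.5f\n",
               f, approx, log2(f));
    }
    return 0;
}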
In order to perform those operations efficiently, some bit manipulation is required (the i = 0x5f3759df - ( i >> 1 ) line of Algorithm 1), which in turn requires the number to be held in an integer register. However, if a plain cast to integer were applied to the float variable y, the compiler would convert the numeric value, truncating the fractional part and discarding the bit-level layout of the float. That is why the conversion is done in a way that makes the compiler reinterpret the bits held by the float as a long, which explains the long pointer applied to the float’s address (the i = * ( long * ) &y line of Algorithm 1).
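As an aside, that pointer cast violates the strict-aliasing rules of modern C, so today the same bit reinterpretation is usually written with memcpy, which compilers optimize down to the same register move. A minimal sketch, with function names that are mine and not part of the original code:

#include <stdint.h>
#include <string.h>

/* Well-defined bit reinterpretation in modern C. */
static uint32_t bits_of_float(float f) {
    uint32_t i;
    memcpy(&i, &f, sizeof i);   /* optimizes to a plain move, no real copy */
    return i;
}

static float float_of_bits(uint32_t i) {
    float f;
    memcpy(&f, &i, sizeof f);
    return f;
}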
Once the input number is available as the integer i, the operations over the value begin. Taking the base-2 logarithm of y = 1/sqrt(x) gives log2(y) = -(1/2) * log2(x); and since, as shown above, a float’s bit pattern read as an integer is approximately a scaled and shifted log2 of its value, the inverse square root reduces to a halving and a subtraction, two much simpler and faster operations.
The division by 2 can be applied optimally with the infamous bit shift to the right (the ( i >> 1 ) part of Algorithm 1). For example, when 110, which represents 6 in binary, is shifted to the right, it becomes 11, which represents 3. That is how simply and quickly a division (or multiplication, if the shift is to the left) by 2 happens at the register level; it is most likely done in a single processing cycle, i.e., O(1).
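A three-line illustration of both shifts (the value 6 is just the example from above):

#include <stdio.h>

int main(void) {
    int n = 6;                          /* binary 110 */
    printf("6 >> 1 = %d\n", n >> 1);    /* prints 3 (binary 11): division by 2 */
    printf("6 << 1 = %d\n", n << 1);    /* prints 12: multiplication by 2 */
    return 0;
}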
That explains the - ( i >> 1 ) bit, but what about 0x5f3759df? This hexadecimal number is a constant that combines the two ingredients mentioned before: the bookkeeping that recovers an approximation of log2(y) from y’s bit representation (the exponent bias of the float format), and the correction constant σ that reduces the average error of the logarithmic approximation.
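The closed form is usually written as (3/2) * 2^23 * (127 - σ). Here is a hedged sketch that evaluates it numerically; σ ≈ 0.0450466 is the commonly cited value, and the truncated result lands within a couple of units of the constant in the code:

#include <stdio.h>

int main(void) {
    double sigma = 0.0450466;   /* error-reducing offset; value is approximate */
    long magic = (long)(1.5 * (1 << 23) * (127.0 - sigma));
    printf("0x%lx\n", magic);   /* prints a value within a few units of 0x5f3759df */
    return 0;
}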
As said, the result is an approximation, so after casting the integer i back to our float y (the y = * ( float * ) &i line of Algorithm 1), the last instruction is a Newton–Raphson iteration which, as the name suggests, is a step of an iterative approximation method. The idea is to turn a decent approximation into a very good one. Since a single iteration already yields a very low error in this case, the second one was commented out.
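For the curious, here is a minimal sketch of the refinement step on its own; the input and the deliberately rough starting guess are arbitrary choices of mine. For f(y) = 1/y^2 - x, the Newton update simplifies to y ← y * (3/2 - (x/2) * y * y), which is exactly the last line of Algorithm 1 (link with -lm):

#include <stdio.h>
#include <math.h>

int main(void) {
    float x = 4.0f;     /* we want 1/sqrt(4) = 0.5 */
    float y = 0.6f;     /* deliberately rough starting guess */
    for (int k = 1; k <= 3; k++) {
        y = y * (1.5f - 0.5f * x * y * y);   /* Newton step for f(y) = 1/y^2 - x */
        printf("iteration %d: y = %.7f   error = %.2e\n", k, y, fabsf(y - 0.5f));
    }
    return 0;
}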
By swapping a square root and a division for four multiplications, two subtractions and a bit shift, while still obtaining a very good approximation, Carmack’s fast inverse square root saved far more than just a “bit” of processing time.