Charles Petzold



The Limits of float

May 30, 2010
Roscoe, N.Y.

One of the first things the veteran C# programmer notices when learning XNA programming is that all floating-point values are single-precision float rather than double-precision double. This not only reduces storage space (4 bytes each rather than 8 bytes) but also improves performance, at least in theory. (My extremely brief experimentation with the performance differential on the PC reveals something in the range of only about 5% improvement, but it may be more substantial on other devices.)

Although float is fine for most purposes in computer graphics, it can be problematic in some circumstances. For example, suppose you want to animate some graphic or text by continuously rotating it 360° every second. Both the Draw and DrawString methods of SpriteBatch have overloads that accept a rotation angle of type float. It's common to store this rotation angle as a field:

float angle;

A new value is then calculated during every call to the Game derivative's Update method. Recently I've been using a calculation that looks something like this:

angle = MathHelper.TwoPi * (float)gameTime.TotalGameTime.TotalSeconds;

The GameTime argument to the Update method has a TotalGameTime property of type TimeSpan indicating the total time since the game began. The TotalSeconds property of the TimeSpan object is of type double. I simply cast that to a float and multiply it by MathHelper.TwoPi (an XNA static field of type float) to obtain an angle in radians.

From the very first time I typed in a statement like this, I've known two things: the code would work fine in the short term, and it would fail in the long term.

I knew the code would fail because of the insufficiency of float to maintain accuracy with large values. But I didn't have an intuitive sense of when the "short term" became the "long term"!

As you might know from reading Chapter 23 of my book Code: The Hidden Language of Computer Hardware and Software, the ANSI/IEEE Standard 754-1985, also known as the IEEE Standard for Binary Floating-Point Arithmetic, defines single-precision floating-point values with a 1-bit sign (s in the formula below), a 23-bit significand fraction (f), and an 8-bit exponent (e). In the normal case, a number stored in this format can be calculated as:

(−1)^s × 1.f × 2^(e−127)

Here 1.f is the binary number formed by a leading 1, a binary point, and the 23 bits of the fraction.
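
To see the three fields of any particular float, a small helper along the following lines might be useful. This is only a sketch: the FloatFields class and its Decode and Recompose methods are names invented here for illustration, and Recompose handles only the normal case described above.

```csharp
using System;

// Hypothetical helper for illustration; not part of XNA or .NET.
public static class FloatFields
{
    // Splits a float into the sign (s), the stored exponent (e),
    // and the 23-bit significand fraction (f).
    public static void Decode(float value, out int s, out int e, out int f)
    {
        // Reinterpret the 4 bytes of the float as a 32-bit integer.
        int bits = BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
        s = (bits >> 31) & 0x1;    // bit 31
        e = (bits >> 23) & 0xFF;   // bits 30 through 23
        f = bits & 0x7FFFFF;       // bits 22 through 0
    }

    // Rebuilds a normal value (not zero, denormal, infinite, or NaN)
    // from its fields using (-1)^s x 1.f x 2^(e-127).
    public static double Recompose(int s, int e, int f)
    {
        double significand = 1.0 + f / 8388608.0;   // 8388608 is 2^23
        return (s == 0 ? 1.0 : -1.0) * significand * Math.Pow(2.0, e - 127);
    }
}
```

Passing the fields from Decode straight back into Recompose should return the original value for any normal float.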

Let's take an example. Suppose I write a program with some code that calculates an angle value in the way I've shown. After the program has been running a full day, the TotalGameTime property represents a TimeSpan of 24 hours or 86,400 seconds. In binary, that's 1 0101 0001 1000 0000. Because the leading digit is always 1 in conversion to binary, it doesn't have to be stored. The remaining binary digits become the first 16 digits of the 23-bit fraction value:

f = 0101 0001 1000 0000 0000 000

Or:

86,400 = 1.0101 0001 1000 0000 0000 000 (binary) × 2^16

Those 7 additional bits in the significand fraction allow the representation of fractional seconds. Thus, the values are accurate to 1/128 of a second. Since the video frame rate in XNA ranges from 30 frames per second (for a Zune or Windows Phone 7) to 60 or so (for a PC), accuracy of 1/128 second is fine.

Here's a shortcut technique for visualizing the float representation of numbers greater than 1: Simply write the number in binary with a leading 1 and exactly 24 digits. For example, 86,400 becomes:

1 0101 0001 1000 0000.0000 000

Notice the binary point preceding the fractional 7 digits. This shows clearly that numbers in this region have 7-bit fractions for an accuracy of 1/128.
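
The shortcut is easy to mechanize. The following is a minimal sketch, assuming a hypothetical Binary24 class with a Format method (neither exists in .NET or XNA); it rounds the value to whatever fits in 24 binary digits and inserts the binary point.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class Binary24
{
    // Writes a value of at least 1 as 24 binary digits with a binary point.
    // A value just below a power of two can round up into a 25th digit;
    // this sketch ignores that edge case.
    public static string Format(double value)
    {
        if (value < 1 || value >= 16777216)   // 16777216 is 2^24
            throw new ArgumentOutOfRangeException("value");

        // Count the binary digits needed for the integer part.
        int integerDigits = 0;
        for (double power = 1; power <= value; power *= 2)
            integerDigits++;

        int fractionDigits = 24 - integerDigits;

        // Shift the fraction digits into the integer part and round
        // (Math.Round rounds ties to even, as IEEE arithmetic does by default).
        long scaled = (long)Math.Round(value * Math.Pow(2, fractionDigits));
        string digits = Convert.ToString(scaled, 2);

        return digits.Substring(0, digits.Length - fractionDigits) + "." +
               digits.Substring(digits.Length - fractionDigits);
    }
}
```

With this sketch, Binary24.Format(86400) produces 10101000110000000.0000000, which has the same 7 digits after the binary point.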

Now let's run the game for a week. At the end of a week, the TotalGameTime property is a TimeSpan representing 604,800 seconds. Write that as a 24-bit binary:

1001 0011 1010 1000 0000.0000

Now the fractional part is only 4 bits, and the number is accurate to only 1/16 second. As you cast this TotalSeconds property of the TotalGameTime to a float you are essentially rounding to the nearest 1/16 second, and effectively reducing your frame rate to 16 frames per second. Between one day and seven days, as the float value becomes increasingly unable to accurately represent total time, you'll get some visible jitter and skippiness in the animation.
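
The loss of resolution can be measured rather than reasoned out. Here is a rough sketch under a few assumptions: the FloatSpacing class and both method names are invented for this example, GapAbove expects a positive finite value, and the frame times are simulated rather than taken from a real GameTime object.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class FloatSpacing
{
    // Returns the gap between a positive, finite float and the next larger float.
    public static float GapAbove(float value)
    {
        int bits = BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
        float next = BitConverter.ToSingle(BitConverter.GetBytes(bits + 1), 0);
        return next - value;
    }

    // Counts how many different float values one second of frames produces
    // when each frame's time in seconds is cast from double to float.
    public static int DistinctTimesInOneSecond(double startSeconds, int framesPerSecond)
    {
        int distinct = 0;
        float previous = float.NaN;
        for (int frame = 0; frame < framesPerSecond; frame++)
        {
            float t = (float)(startSeconds + frame / (double)framesPerSecond);
            if (t != previous)   // NaN never compares equal, so frame 0 always counts
            {
                distinct++;
                previous = t;
            }
        }
        return distinct;
    }
}
```

GapAbove(86400f) comes out to 1/128 and GapAbove(604800f) to 1/16, matching the fraction widths above. Calling DistinctTimesInOneSecond(604800, 60) shows how few different time values 60 frames actually receive after a week.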

Might it help to perform the calculation using the original double value of the TotalSeconds property and cast to float when storing the final value? Here's the code:

angle = (float)(2 * Math.PI * gameTime.TotalGameTime.TotalSeconds);

Now I'm using the regular .NET Math.PI static field of type double rather than the XNA MathHelper.TwoPi static field of type float. After one week, the angle value is calculated as 604,800 × 2π, which is about 3,800,070.4738, or in 24-bit binary:

11 1001 1111 1100 0000 0110.10

Now there's only a two-bit fractional part, so the angle is accurate to 1/4 radian, or about 14°. No good! (This problem also exists with the original code but I chose to focus on the time rather than the resultant angle.) You'll have the same problem if you increment the angle based on the ElapsedGameTime property of the GameTime argument:

angle += MathHelper.TwoPi * (float)gameTime.ElapsedGameTime.TotalSeconds;

The ElapsedGameTime is (usually) the time between video frames, either 1/30 or 1/60 second or thereabouts. That's fine for a float and the multiplication is OK as well. The problem occurs when accumulating that incremental value in an already large angle field.
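
A short experiment makes the accumulation problem concrete. This is a sketch only: AccumulationDemo and Advance are made-up names, the local twoPi variable stands in for MathHelper.TwoPi so the code runs without XNA, and the elapsed time is passed in as a plain double.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class AccumulationDemo
{
    // Adds one frame's worth of rotation to a float angle,
    // mirroring the incremental statement shown earlier.
    public static float Advance(float angle, double elapsedSeconds)
    {
        float twoPi = (float)(2 * Math.PI);   // stands in for MathHelper.TwoPi
        // The explicit cast forces the sum to be rounded to float precision.
        return (float)(angle + twoPi * (float)elapsedSeconds);
    }
}
```

Under these assumptions, Advance(3800070.5f, 1.0 / 60) returns 3800070.5f. The increment of roughly 0.105 radian is less than half the 1/4-radian spacing of float values in that range, so the sum rounds right back to where it started and the rotation stops advancing.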

What's the solution? The easiest solution is simply assuming that nobody's going to be running your games for more than a few hours! But probably the best solution involves performing the calculations using double and then normalizing the result between 0 and 2π by finding the remainder using the modulus operation (%). Then it's safe to cast to a float:

angle = (float)(2 * Math.PI * gameTime.TotalGameTime.TotalSeconds %
                (2 * Math.PI));

Or, increment the angle field using the ElapsedGameTime property and then normalize the result:

angle += MathHelper.TwoPi * (float)gameTime.ElapsedGameTime.TotalSeconds;
angle %= MathHelper.TwoPi;

Either version will preserve adequate computational accuracy well beyond a week and for at least a millennium.
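
For experimenting outside of an XNA project, both approaches can be sketched without any XNA types. The RotationAngle class and its method names are hypothetical, the TwoPi field stands in for MathHelper.TwoPi, and the times are passed as plain double values in place of the GameTime properties.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class RotationAngle
{
    static readonly float TwoPi = (float)(2 * Math.PI);   // stands in for MathHelper.TwoPi

    // First approach: do the arithmetic in double, take the remainder,
    // and only then cast to float. The result lies between 0 and 2π.
    public static float FromTotalSeconds(double totalSeconds)
    {
        double twoPi = 2 * Math.PI;
        return (float)(twoPi * totalSeconds % twoPi);
    }

    // Second approach: add this frame's rotation to the float angle,
    // then wrap the result so it never grows large.
    public static float Step(float angle, double elapsedSeconds)
    {
        angle += TwoPi * (float)elapsedSeconds;
        angle %= TwoPi;
        return angle;
    }
}
```

A quarter second past the one-week mark, FromTotalSeconds(604800.25) should land very close to π/2, a quarter of a rotation, which the unnormalized float calculation cannot resolve at that point.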