Floating-Point Perils

To many programmers, floating-point numbers are seen as “real numbers” as opposed to integers, or even “decimal numbers”. Both of these views are ignorant at best, and dangerous at worst. Floating-point numbers are deceptively easy to use, but dangerously hard to understand fully.

Floating-point came about from the desire to represent fractional numbers in computer programs. The most simple way to do this is fixed-point arithmetic, a lesser known sibling of floating-point arithmetic .

Fixed-point vs floating-point

Fixed-point decimal number

Fixed-point decimal number

A fixed-point number is simply an integer with a certain number of bits set aside as the fraction. This is similar to how we first learn to write decimal numbers in school — a whole part and a fraction part of a given number of digits.

This technique was common to use in early computer games and other performance-intensive programs. The early PCs didn’t have floating-point co-processors, which meant you either had to rely on very computationally expensive software emulation or roll your own fixed-point arithmetic.

The advantages of using fixed-point math was that it was easy to understand and blazingly fast, since it turns out you could happily use the built-in integer arithmetic of your CPU to perform the fixed-point arithmetic (I’ll leave exactly how for you to ponder — building a fixed-point library is an interesting learning experience).

The disadvantages are that you have to decide in advance how you want to make the trade-off between the precision you get and the range you can represent. Just like regular integers, the moment you decide the size and fixed-point position of the type, you also know the maximum and minimum number it can represent.

This whole deal makes it harder to support fixed-point arithmetic as a packaged solution to the “fractional number math” problem. In comes floating-point arithmetic to save the day — here’s a numeric type capable of storing both the distance to the moon in inches and the distance between two cells in your body. This capacity comes at the price of added complexity, however. Before I get into the details of that, let’s look at what a floating-point number actually is.

Scientific Notation

Scientific Notation

Floating-point numbers bear a strong resemblance to scientific notation. They have two parts — a normalized mantissa and an exponential. Since we already know the base (10 in the example here, 2 in a binary float), we don’t need to save that, and to save a very large or very small number, we only need to increase or decrease the mantissa. Sweet!

Now, to actually store this thing, we need to decide how much space we want to use up for the mantissa, and how much for the exponent. Using the example from the images, let’s say we allocate 5 digits for the mantissa, and one for the exponent.

Floating-point decimal number

Floating-point decimal number

It’s immediately clear from this that we’ve used up one more digit than we did before. It’s also less clear how to perform arithmetic on these values (though that’s also an interesting experience to implement).

In actual IEEE standard single-precision floating-point, things get a bit more complicated: there’s a sign bit to save the sign of a number, 8 bits exponent that’s biased (adjusted by a value) to support negative exponents without making comparison a pain, and a 23-bit fraction which is the mantissa minus one (because the binary mantissa will always start with one, so you can save one bit).

This format is very cleverly designed to make integer comparisons work as floating-point comparisons as well, making them efficient and making the format obfuscated and hard to understand at the same time, allowing for exciting edge cases like a negative zero and denormalized numbers.

Floating-point pitfall #0.99999989

The first, and most major, mistake that people make with floating-point arithmetic is to treat a floating-point value as precise, or as decimal. It’s neither, and surprisingly some of the fractional numbers we find the easiest to grasp are impossible to represent, like 0.1.

If you try to store 1/10 as a binary fraction, what happens is essentially the same thing as if you try to store 1/3 as a decimal fraction: you get an infinite sequence of numbers. For 1/10, it looks like this:

0.0001100110011001100110011001100110011001100110011 ...

Converting this back to decimal, you get not-quite-0.1:


… which is surprising to most people, who react with a joke about the computer not being able to count. Adding this to itself 10 times thus gives you not-quite-1:


This happens for the same reason you don’t get 1 if you convert 1/3 to decimal form and add it to itself 3 times (which, by the way, is not a problem at all for the computer to do with floating-point arithmetic and come up with the correct answer). This inexactness is something to always take into account when programming with floating-point numbers — even if you had exact data to begin with, you probably lost a bit each step along the way.

The result of this is that you should never ever compare a floating-point number to a number and expect equality. Imagine, for instance, that your user gave you a series of data points:

sub printFractions(values) {
    float total = 0;
    foreach (value in values)
        total = total + value;
    foreach (value in values)
        value = value / total;
    // Print the fractions
    total = 0;
    foreach (value in values) {
        print "Value: " . value;
        total = total + value;
    print "Total: " . total;

At this point it’s likely that total actually doesn’t equal one at the end of the procedure. Not knowing this and expecting it can cause extremely odd bugs. The most common example is testing for equality of floating-point numbers (like comparing one to a constant). The chance that the numbers are equal is actually minimal, even if they seem to be if you inspect them.

So if you need to compare two floating-point numbers, what to you do? Figure out what kind of accuracy you need, and then compare the numbers within some epsilon (a small number, but larger than what you’ve accumulated as rounding errors), like this:

abs(a/b - 1) < epsilon

Where did my precision go?

Another thing to note about floating-point arithmetic is that as your value increases, the number of bits in your mantissa remains the same. This is completely different from fixed-point or integer arithmetic, where increasing the value will lead to more bits being used (until you run out of bits, that is).

The result of this is that you lose the finer details of your numbers — adding a small number to a large number may not make a difference at all! And subtracting two large numbers from each others may leave you with nothing at all:

10000000000000000 + 1 = 10000000000000000
1000000000000001 - 1000000000000000 = 0

This last effect is known as cancellation, and you can see it in action with google calc.

Here’s where things get interesting… in order to use a data type, you should really know what kind of precision you can expect, so you should calculate that. In whatever range of numbers you intend to represent, you’ll have more precision the closer you get to 0 — what’s interesting is what precision you’ll have when at the far end of your spectrum of numbers.

An example may make this clearer: Let’s say we want to represent positions along a 4 km long track. At 4 km, your exponent will be 11 (212 is 4096, which is more than you need). This means you’ll have 12 bits of mantissa after the binary point (since there’s 23 bits in total), so the precision will be 1/212 ≈ 0.000244.

That’s a fair amount of precision, but if your race track is 40 km long instead, you’re up to a precision of 0.00391. This sort of precision loss becomes quite apparent if you’re creating a computer game and trying to represent positions in a large world using floating-point values, for instance. At the same time, you’ll end up seeing lots of cancellation effects (since you’re likely to want to compare positions of objects).

Another application where these effects show clearly is time: Measuring time in seconds in floating-point can create very nasty bugs where programs run fine for weeks, and then suddenly stop working. After roughly 18 hours, you’ve lost your hundredth of a second, and after 3 months you can’t even separate one second from the next.

The normal solution to this problem is to throw more bits at it — use double-precision floats instead of single-precision floats. Whether or not that’s a good idea varies from situation to situation (that gives you 52 bits of mantissa and 11 bits of exponent — I’ll leave the calculations up to you).

How much precision is too much precision?

There’s one more issue you have to watch out for with floats. Your typical PC CPU will perform floating-point operations in a register with more bits than needed to store the numbers, in order to not lose precision during the calculations. The Intel Pentium runs 80-bit floating-point registers, for instance.

This seems like a great thing at first, until you realize that the results of your calculations are now dependent on whether or not the intermediate steps got saved to memory at some point along the way. The only way to find out may be to look at the assembly code of your compiled program — if you can even access that in your language of choice.

The conclusion to this is simply that before expecting any kind of precision from floating-point arithmetic, you need to think about a great deal of things.

Edge cases

I mentioned two edge cases above — negative-zero and denormalized numbers. There are several more you need to know about when dealing with floats.

Negative-zero is actually guaranteed to be equal to positive-zero, but may or may not compare as the same as positive zero. Follow me? No? Well, for instance, in java, the conditional:

(negativeZero == 0 && negativeZero < 0)

is true. In other languages, it varies. In C++, it’s not even defined which way it’s supposed to behave. This is yet another reason to not compare floating-point numbers directly. Denormalized numbers are thankfully something you don’t need to care much about, so I won’t go into more detail on those.

It turns out there are two other beasts you do need to care about, however: Infinity (positive and negative), and NaN. These numbers can be the result of operations like division by zero, and they eat anything in their path. Infinity plus anything is infinity. Infinity minus anything is infinity. Well, except infinity minus infinity, which is NaN, of course!

Even testing for these values can be tricky. NaN is not a number (duh), but a range of numbers, and testing for it turns out to be one of those things that can’t be tested in a portable way in some languages, as it seems to have been neglected by a fair number of standards, causing each implementation to have a slightly different function for it.


The next time you need a fractional number, think about what implementation is the best one for you. Are you really ok with only single-precision floating-point? Maybe you need double-precision? Maybe even a fixed-point number matches what you need better than the intricacies of floating-point arithmetic. Whichever implementation you choose — make sure you know the pros and cons and make an informed choice — don’t just go with the float because the language has one, and hope it has enough precision for you.

Many languages support decimal types, usually encoded as BCD (binary-coded decimal). These are rather slow, take up a bit more space, but can accurately represent numbers like we’re used to. If you’re ever dealing with concepts like money, your language’s decimal type is your best bet (or making your own, if you don’t have access to one).

If you want to know even more about floating-point arithmetic, go read What Every Computer Scientist Should Know About Floating-Point Arithmetic (though I disagree with the name — it’s filled with lots of theorems that you really do not need to know) or Java theory and practice: Where’s your point? (especially if you’re coding in java).


  • By magice, Thursday, May 28, 2009 @ 3:01

    Your opinion has 2 problems:

    a. The view that “fixed-point is fast, floating-point is slow” is extremely archaic in most of today computers (or at least desktop computers). Even in the general purpose CPU (like Pentium), the speed for floating-point number calculations is either as fast or sometimes faster than that of fixed-point calculation. I did not believe it myself, but that was the result after we time the calculation in one of my computer architecture class. Of course, this does not take in account the fact that many of these calculations (graphic related) are done by the GPU, free up CPU for other tasks.

    2. Double precision floating point number is actually pretty precise. In fact, many programming languages, such Javascript, Lua, etc. , provide only 1 type of number, namely double-precision floating point. For most of the cases, this precision is as good as one needs (as long as you don’t do == directly in C/C++ family).

  • By slicedlime, Thursday, May 28, 2009 @ 6:28

    My point was hardly that fixed-point is better — the part about the speed of calculations was meant to be in the context of early computing, not today. I’ve only actually used it once in my life — but I do know situations where a fixed-point solution is a whole lot better than double-precision floats (especially in the business of huge-world games, where the precision problems just become too noticeable).

    Thanks for the comment, I’ll edit the post to make those things more clear.

  • By Rick Regan, Thursday, May 28, 2009 @ 15:15

    The interesting thing about the “10 times 0.1” problem is that, not only is not equal to 1, but it’s LESS than 1. 10 times 0.1000000000000000055511151231257827021181583404541015625 is 1.000000000000000055511151231257827021181583404541015625, but double precision floating-point arithmetic gives 0.99999999999999988897769753748434595763683319091796875.

    So in other words, there are two aspects that come into play: the inexactness of the conversion and the inexactness of the arithmetic.

  • By Rick Regan, Thursday, May 28, 2009 @ 15:28

    I didn’t really mean “10 times” as in multiplication, I meant “adding 10 times”. Multiplying by 10 correctly gives 1.0!

Other Links to this Post

RSS feed for comments on this post. TrackBack URI

Leave a comment

WordPress Themes