Black Box Black Box Testing

One of my friends is in college, and is currently feeling the full idiocy of a system that was only beginning to be rolled out as I left. Let me explain how it works.

Essentially, the system is meant to test the students’ solutions to homework problems. This is done by providing a solid definition of what the input and output of the application are supposed to be on the standard in/out channels, and setting up a whole bunch of test cases, including a memory limit and a CPU time limit. Students submit their source code to the system, which compiles it and runs all the test cases against the application in a black box test. So far, so good.

Seeing these guys at work, compared to me and my colleagues at work, makes a few things very apparent: even with a fairly solid grasp of algorithms and data structures, their number one problem is code. Where professional programmers swim through code like sharks in the sea, the students appear to be more or less drowning. Theoretical learning aside, the education lacks practical programming, debugging, practical programming and some more practical programming.

It would seem that these programming exercises would be the perfect opportunity to get that kind of experience, if it weren’t for the fact that the test system is itself a black box. You put in your code, and it tells you yes or no. It’s not quite a boolean pass/fail answer, but close enough: you get one result from the set Didn’t compile, Passed, Failed, Crashed, Time Limit Exceeded. When I first heard of the system, it was justified by the fact that sometimes in professional programming, that’s all you get.
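To make the setup concrete, here is a minimal sketch of what such a judge might do for a single test case. It is an assumption about how a system like this could work, not the college’s actual implementation: `run_test_case` and its parameters are made-up names, and the compile step and the memory limit are left out.

```python
import subprocess
import sys


def run_test_case(command, stdin_text, expected_stdout, time_limit_s=2.0):
    """Run one black box test and return only a verdict, as the students see it."""
    try:
        result = subprocess.run(
            command,
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=time_limit_s,
        )
    except subprocess.TimeoutExpired:
        return "Time Limit Exceeded"
    if result.returncode != 0:
        return "Crashed"
    # Compare ignoring leading and trailing whitespace (a hypothetical policy).
    if result.stdout.strip() == expected_stdout.strip():
        return "Passed"
    return "Failed"


# Stand-in for a student submission: reads a number and prints its double.
doubler = [sys.executable, "-c", "print(int(input()) * 2)"]
verdict = run_test_case(doubler, "21\n", "42\n")
print(verdict)
```

Note how little comes back: the input, the actual output and any error text are all thrown away, which is exactly the complaint.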

I agree. Sometimes, you get gnarly bugs that give you less information than a world pro’s poker face. I’ve spent weeks tracking bugs like that sometimes, using all kinds of tools at my disposal to try to wring more information out of the error, until finally the knot was untied. But for all the bugs like that I’ve been through, none of them were eventually solved by guessing what was wrong and how to fix it.

Supposedly, the tool is meant to teach the students to debug their code… which it somehow does by disallowing all normal debugging tools. You can’t run a debugger on it, you can’t print traces, you’re not allowed to log to a file or socket, you’re not even allowed to know what input caused the error. The only tools you have at your disposal are your wit in coming up with your own test cases and code reviews.

Any attempts at normal debugging would be classified as cheating. If I were faced with a bug under those circumstances, I would do whatever I could to get more information out of it. Hey, I can crash it with different signals; that’s a few bits of information I could get back from it. All those kinds of tricks of the trade that real programmers use to, you know, solve problems… would be cheating.

This leads to a skewing of results… very simple bugs turn into monster problems, since you can’t identify and fix them. What they are learning is not how to debug their programs but how to painstakingly solve the very specific problem of pleasing the system. By artificially making easy things hard, the system has effectively found a way to avoid teaching the students essential skills in programming: simple debugging tools like tracing and breaking into a debugger. Instead, they learn programming by coincidence: poke something until you (hopefully, eventually) get a green light.

That’s not a lesson to learn.

The only way to go about this, faced by the obstacle made up of this system, is to learn a different skill: testing. More on that later.

Don’t Be an Open Source Douchebag

I love open source software. It provides both a neat training ground for programmers and a good place to go scratch that itch. On the other side of things, it provides awesome software for people, including some software that would never come out of a big development house.

Still, there are some issues with free software that don’t really show up to the same degree with commercial software. One such thing is documentation. It’s painfully obvious that documentation is written by people who:

  1. Already know the software in and out.
  2. Don’t like writing documentation.
  3. Know nothing about how people learn.

For instance, when I started a side project a few months back, I was looking for a build system. After settling on CMake, I set about trying to make sense of it. There’s the ever-present getting started example, of course. And then there’s the full reference of everything you could possibly want (almost).

But in between those, there’s nothing. Well, nothing except a book, which just goes to show you that there’s something missing — a professional writer could obviously make some money out of explaining things in a reasonable way.

The problem with this is that it doesn’t match how people learn. Getting started is a good step, but a relatively small one. Most of the time will be spent incrementally expanding the knowledge, moving from beginner to expert. Most time will thus be spent in some kind of zone in between the “getting started” and “reference of everything” levels.

Worse than that, some open source programmers have a tendency to view their full reference documentation as an appropriate resource for everyone. “It’s all in there,” right? But pointing a beginner at a 40-page document detailing all the options of some application when all they want is to run it properly isn’t very helpful. I’m sure you know what I’m talking about if you’ve ever used an open source command line tool.

That brings us to the really dark side of free software culture. The true douchebags out there will not only be extremely smartass in their RTFM comments, they’ll also be incredibly sensitive and defensive about the software they’re working on.

I ran into a problem with Cygwin’s SSHD implementation last week. In searching for the solution, I found this mailing list answer:

  Wrong.  That is uninformed speculation and guesswork.  Stop
spreading misinformation.

  Cygwin SSHD has had the support for fully logging in as any
user since 1.7, as you have already been told and completely
ignored.  Go and read the manual.  The link was in the previous
email I sent in this thread.

  freesshd works exactly as Cygwin *used* to before it got
subauth support: when you log in with a key, rather than a
password, you just end up as an admin user.

Wow. This kind of answer is wrong on so many levels. First of all, while he makes it seem like the functionality has been there forever, Cygwin 1.7 is still not even out of beta. The chance that an end user has it is about 0. So, with the current version (1.5), supposedly Cygwin sshd works just like freesshd. This is clearly false, because the original poster reports one working and the other not (which is, by the way, exactly the same result that I had).

So, a user reporting a problem about logging in gets pointed to a long document about security settings in a beta version, doesn’t understand a word of that document (no surprise there), and as a result gets told to “stop spreading misinformation”. Truth is, simply installed like any normal user installs applications, one works and the other doesn’t, something made quite clear by an answer from the original poster in a different place in the thread:

> Are you talking about password or public key authentication?
> If the latter, Have you tried the LSA authentication package
> in Cygwin 1.7?
I don't know. I'll try to deciper that. Sounds complicated. In
the meantime, friend is using freesshd.

The essence of what he’s saying (which has been completely missed by the cygwin developers) is that the effort required to get cygwin to work like one would reasonably expect of it is much higher than the effort required to just google for something that just works out of the box. The fact that you could potentially make it work is irrelevant, because he’s not getting any help actually making it work.

He might as well just have said, “I don’t care about making it work for you. It works for me.”

Software companies usually compensate for their complete lack of useful technical support with a good (or at least reasonably decent) amount of help documentation. Free software usually has neither.

I encourage any programmer to practice their technical skills on an open source project. But while you do so, take the opportunity to practice your people skills a bit as well, or why not your writing skills? Don’t be an open source douchebag — someone reporting your software’s flaws is not attacking you personally.

Deleting Code

I came upon an event handling function a year or two back. It was late, there was a crunch going on, I was tired as hell, and I needed to figure out what it did. Here’s something like it (or see the whole thing if you can’t read it in this format):

void EventHandler::handleEvent(const Vector& eventPos, const Transform& objectTrans, Object* sourceObject, int sourceId, float time)
{
    if (sourceObject && sourceId != -1)
    {
        Actor* sourceActor = getActorFromId(sourceId);
        if (sourceActor)
        {
            // Event was fired by an actor. Accumulate data in the actor's memory about how much firing strength it has fired.
            // Get information about source weapon.
 
            Object* actorVehicle = sourceActor->getPlayer()->getCurrentObject();
            Armaments* actorArms = actorVehicle ? (actorVehicle->getArmament()) : 0;
            int weaponId = actorArms ? actorArms->getActiveWeaponIndex() : -1;
            if (weaponId != -1)
            {
                if (actorArms)
                {
                    Weapon* weapon = actorArms->getWeapon(weaponId);
                    if (weapon)
                    {
//                        if (weapon->getExplosionRadius() > 0)
//                            sourceActor->getMemory()->setTimeOfEvent(time);
 
                        // Get information about target, and bots memory of target.
                        ObjectProxy currentTargetHandle = sourceActor->getMemory()->getTarget();
                        Object* currentTarget = currentTargetHandle.get();
 
                        // Always deal with primaryObject
                        if (currentTarget && !currentTarget->isPrimary())
                        {
                            currentTargetHandle = currentTarget->getPrimaryObject()->proxy();
                            currentTarget = currentTargetHandle.get();
                        }
 
                        const SensingData* currentTargetSensingData = sourceActor->getSenses()->getSensingData(currentTarget);
                        if (currentTarget && currentTargetSensingData)
                        {
                            ASSERT(0 <= weaponId && weaponId < 8, "Illegal weaponId: &u", weaponId);
                            //DEBUG_OUTPUT("FiringStrength accumulation for weaponId %u... %f\n", weaponId, cyrrentTargetSensingData->typesFired[weaponId];
                            int targetType = currentTarget->getType();
                            if (targetType <= 3) // Lighter types
                            {
                                //for (int strType = targetType; strType <= 3; ++strType)
                                //    currentTargetSensingData->typesFired[weaponId] += weapon->getTypeValue(strType);
                            }
                            else if (targetType == 4) // Other type
                            {
                                //currentTargetSensingData->typesFired[weaponId] += weapon->getTypeValue(4);
                            }
                            else if (targetType == 5) // Third type
                            {
                                //currentTargetSensingData->typesFired[weaponId] += weapon->getTypeValue(2); // Other type
                                //currentTargetSensingData->typesFired[weaponId] += weapon->getTypeValue(3); // Other type
                                //currentTargetSensingData->typesFired[weaponId] += weapon->getTypeValue(5); // Third type
                                /*                                ObjectProxy victimHandle = targetObject->getVehicle(sourceActor->getTeam());
                                SensingData* spottedObj = sourceActor->getSensingDataFromHandle(victimHandle);
                                if (spottedObj)
                                {
                                spottedObj->fireSuccess += 1;
                                }*/
                            }
                            //DEBUG_OUTPUT("...became %f\n", currentTargetSensingData->typesFired[weaponId]);
                        }
                    }
                }
            }
        }
    }
}

There are lots of things I could say about the quality of the code… but it’s late, there’s a crunch, I’m tired and… what does the function do? I’ll let you think about that for a bit. You can safely assume that functions don’t have side effects.

Many people focus almost exclusively on writing new code. Some people even stay away from deleting code, simply commenting things out instead. As you see above, this makes the code extremely hard to read. Also, the commented-out code doesn’t work anymore: the compiler no longer checks it for errors, so it silently rots as the code around it changes (for instance, some of the functions the commented-out code calls had been removed entirely).

This also misses the entire purpose of having source control in the first place! If you want the code back, go check it out in Perforce, Subversion, CVS, or whatever source control system you’re using. If you’re not using a source control system… well, you have bigger problems than commented code.

For all there is to be said about writing code, often the best thing you can do is delete some code. Delete the code you don’t need — delete the extra pieces that are in the way of working with the code efficiently.

I once took over ownership of a code base, and the first thing I did was strip out nearly half of the code: unused support code, unnecessary interfaces, adaptation code for layers that no one wanted or needed, and code of such bad quality that it was better to rewrite it than to maintain it.

So… what does the code do? Cookies for the first commenter with the right answer.

(Disclaimer: the code above is not exactly the code I found… but it matches in form, and is similar enough to make the point)

Update: Added a link to the code as a separate file so you can view it without the scrollbar headache.

Floating-Point Perils

To many programmers, floating-point numbers are seen as “real numbers” as opposed to integers, or even “decimal numbers”. Both of these views are ignorant at best, and dangerous at worst. Floating-point numbers are deceptively easy to use, but dangerously hard to understand fully.

Floating-point came about from the desire to represent fractional numbers in computer programs. The simplest way to do this is fixed-point arithmetic, a lesser known sibling of floating-point arithmetic.

Fixed-point vs floating-point

[Figure: Fixed-point decimal number]

A fixed-point number is simply an integer with a certain number of bits set aside as the fraction. This is similar to how we first learn to write decimal numbers in school — a whole part and a fraction part of a given number of digits.

This technique was common in early computer games and other performance-intensive programs. The early PCs didn’t have floating-point co-processors, which meant you either had to rely on very computationally expensive software emulation or roll your own fixed-point arithmetic.

The advantages of using fixed-point math were that it was easy to understand and blazingly fast, since it turns out you could happily use the built-in integer arithmetic of your CPU to perform the fixed-point arithmetic (I’ll leave exactly how for you to ponder; building a fixed-point library is an interesting learning experience).

The disadvantage is that you have to decide in advance how you want to make the trade-off between the precision you get and the range you can represent. Just like regular integers, the moment you decide the size and fixed-point position of the type, you also know the maximum and minimum number it can represent.
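If you would rather peek than ponder, here is a minimal sketch of the idea, assuming a made-up 16.16 layout (16 fraction bits). The function names are invented for this illustration. Python integers never overflow, so the range limit described above does not show up here; in a fixed-width integer type it would.

```python
FRACTION_BITS = 16          # hypothetical layout: 16 bits after the binary point
ONE = 1 << FRACTION_BITS    # the fixed-point representation of 1.0


def to_fixed(x):
    """Convert a float to the nearest fixed-point value (stored as an int)."""
    return round(x * ONE)


def to_float(f):
    """Convert a fixed-point value back to a float for display."""
    return f / ONE


def fixed_add(a, b):
    # Both operands share the same scale, so plain integer addition works.
    return a + b


def fixed_mul(a, b):
    # The raw product carries the scale twice, so shift one scale away.
    return (a * b) >> FRACTION_BITS


def fixed_div(a, b):
    # Pre-scale the dividend so the quotient keeps one scale factor.
    return (a << FRACTION_BITS) // b


product = to_float(fixed_mul(to_fixed(1.5), to_fixed(2.25)))
quotient = to_float(fixed_div(to_fixed(1.0), to_fixed(4.0)))
print(product, quotient)
```

Everything is ordinary integer arithmetic plus a shift, which is why this was so fast on CPUs without a floating-point unit.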

This whole deal makes it harder to support fixed-point arithmetic as a packaged solution to the “fractional number math” problem. In comes floating-point arithmetic to save the day — here’s a numeric type capable of storing both the distance to the moon in inches and the distance between two cells in your body. This capacity comes at the price of added complexity, however. Before I get into the details of that, let’s look at what a floating-point number actually is.

[Figure: Scientific notation]

Floating-point numbers bear a strong resemblance to scientific notation. They have two parts: a normalized mantissa and an exponent. Since we already know the base (10 in the example here, 2 in a binary float), we don’t need to save that, and to save a very large or very small number, we only need to increase or decrease the exponent. Sweet!

Now, to actually store this thing, we need to decide how much space we want to use up for the mantissa, and how much for the exponent. Using the example from the images, let’s say we allocate 5 digits for the mantissa, and one for the exponent.

[Figure: Floating-point decimal number]

It’s immediately clear from this that we’ve used up one more digit than we did before. It’s also less clear how to perform arithmetic on these values (though that’s also an interesting experience to implement).

In actual IEEE standard single-precision floating-point, things get a bit more complicated: there’s a sign bit to save the sign of a number, an 8-bit exponent that’s biased (adjusted by a fixed value, 127) to support negative exponents without making comparison a pain, and a 23-bit fraction which is the mantissa minus one (because a normalized binary mantissa always starts with one, so you can save one bit).

This format is very cleverly designed to make integer comparisons work as floating-point comparisons as well, making them efficient and making the format obfuscated and hard to understand at the same time, allowing for exciting edge cases like a negative zero and denormalized numbers.
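To see the layout for yourself, here is a small sketch that pulls a single-precision value apart using Python’s `struct` module. The helper names are invented for this example, and `rebuild_normal` only handles normalized numbers (no zeros, denormals, infinities or NaN).

```python
import struct


def float32_bits(x):
    """Return the 32-bit pattern of x stored as an IEEE single-precision float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits


def decode_float32(x):
    """Split x into (sign, biased exponent, fraction) fields."""
    bits = float32_bits(x)
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, biased_exponent, fraction


def rebuild_normal(sign, biased_exponent, fraction):
    """Reassemble a normalized value: put back the implicit leading one and the bias."""
    mantissa = 1 + fraction / 2**23
    return (-1) ** sign * mantissa * 2.0 ** (biased_exponent - 127)


fields = decode_float32(-2.5)   # -2.5 is -1.25 * 2^1
print(fields, rebuild_normal(*fields))

# For positive floats, comparing the raw bit patterns as integers
# gives the same order as comparing the floats themselves.
print(float32_bits(1.5) < float32_bits(2.0))
```

The last line is the comparison trick in action: the exponent sits above the fraction, so a bigger exponent always means a bigger integer.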

Floating-point pitfall #0.99999989

The first, and most major, mistake that people make with floating-point arithmetic is to treat a floating-point value as precise, or as decimal. It’s neither, and surprisingly some of the fractional numbers we find the easiest to grasp are impossible to represent, like 0.1.

If you try to store 1/10 as a binary fraction, what happens is essentially the same thing as if you try to store 1/3 as a decimal fraction: you get an infinite sequence of digits. For 1/10, it looks like this:

0.0001100110011001100110011001100110011001100110011 ...

Converting this back to decimal, you get not-quite-0.1:

0.1000000000000000055511151231257827021181583404541015625

… which is surprising to most people, who react with a joke about the computer not being able to count. Adding this to itself 10 times thus gives you not-quite-1:

0.99999999999999989

This happens for the same reason you don’t get 1 if you convert 1/3 to decimal form and add it to itself 3 times (which, by the way, is not a problem at all for the computer to do with floating-point arithmetic and come up with the correct answer). This inexactness is something to always take into account when programming with floating-point numbers — even if you had exact data to begin with, you probably lost a bit each step along the way.
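The numbers above are easy to reproduce. This sketch uses Python, whose `float` is a double-precision value; `Decimal(0.1)` converts the stored binary value exactly, without rounding, so it shows what is really in memory.

```python
from decimal import Decimal

# The exact value that gets stored when you write 0.1.
stored_tenth = Decimal(0.1)

# Add 0.1 to itself ten times, one rounding step per addition.
total = 0.0
for _ in range(10):
    total += 0.1

# The 1/3 case mentioned above happens to round back to exactly 1.
thirds = 1 / 3 + 1 / 3 + 1 / 3

print(stored_tenth)
print(total)
print(thirds == 1.0)
```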

The result of this is that you should never ever compare a floating-point number to another number and expect equality. Imagine, for instance, that your user gave you a series of data points:

def print_fractions(values):
    # Normalize the values so that, mathematically, they sum to 1.
    total = 0.0
    for value in values:
        total = total + value
    fractions = [value / total for value in values]

    # Print the fractions and add them back up.
    total = 0.0
    for fraction in fractions:
        print("Value:", fraction)
        total = total + fraction
    print("Total:", total)

At this point it’s likely that total actually doesn’t equal one at the end of the procedure. Not knowing this and expecting exactly one can cause extremely odd bugs. The most common example is testing for equality of floating-point numbers (like comparing one to a constant). The chance that the numbers are equal is actually minimal, even if they seem to be if you inspect them.

So if you need to compare two floating-point numbers, what do you do? Figure out what kind of accuracy you need, and then compare the numbers within some epsilon (a small number, but larger than what you’ve accumulated as rounding errors), like this:

abs(a/b - 1) < epsilon
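The formula above divides by b, so it breaks down when b is zero. A common variation, sketched here with made-up tolerance defaults, scales epsilon by the larger operand and adds a small absolute floor for values near zero. Python ships the same idea as `math.isclose`.

```python
import math


def nearly_equal(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """Compare within a relative tolerance, with an absolute floor near zero.

    The default tolerances are placeholders; pick them from the accuracy
    your own calculation actually needs.
    """
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)


total = 0.0
for _ in range(10):
    total += 0.1

print(total == 1.0)               # exact comparison fails
print(nearly_equal(total, 1.0))   # tolerant comparison succeeds
print(math.isclose(total, 1.0))   # standard library equivalent
```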

Where did my precision go?

Another thing to note about floating-point arithmetic is that as your value increases, the number of bits in your mantissa remains the same. This is completely different from fixed-point or integer arithmetic, where increasing the value will lead to more bits being used (until you run out of bits, that is).

The result of this is that you lose the finer details of your numbers: adding a small number to a large number may not make a difference at all! And subtracting two large numbers from each other may leave you with nothing at all:

10000000000000000 + 1 = 10000000000000000
1000000000000001 - 1000000000000000 = 0

This last effect is known as cancellation, and you can see it in action with google calc.
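Here is the same effect with double-precision values in Python, as a small sketch. `math.ulp` (available from Python 3.9) returns the gap between a float and the next larger one, which explains why the added 1 vanishes.

```python
import math

big = 1e16

# At this magnitude neighbouring doubles are 2 apart, so adding 1 is lost.
gap = math.ulp(big)
absorbed = (big + 1.0) == big

# Subtracting the two "different" large numbers then leaves nothing.
difference = (big + 1.0) - big

print(gap, absorbed, difference)
```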

Here’s where things get interesting… in order to use a data type, you should really know what kind of precision you can expect, so you should calculate that. In whatever range of numbers you intend to represent, you’ll have more precision the closer you get to 0 — what’s interesting is what precision you’ll have when at the far end of your spectrum of numbers.

An example may make this clearer: Let’s say we want to represent positions, in meters, along a 4 km long track. At 4 km, your exponent will be 11 (2^12 is 4096, which is more than you need). This means you’ll have 12 bits of mantissa after the binary point (since there are 23 bits in total), so the precision will be 1/2^12 ≈ 0.000244.

That’s a fair amount of precision, but if your race track is 40 km long instead, you’re up to a precision of 0.00391. This sort of precision loss becomes quite apparent if you’re creating a computer game and trying to represent positions in a large world using floating-point values, for instance. At the same time, you’ll end up seeing lots of cancellation effects (since you’re likely to want to compare positions of objects).

Another application where these effects show clearly is time: Measuring time in seconds in floating-point can create very nasty bugs where programs run fine for weeks, and then suddenly stop working. After roughly 18 hours, you’ve lost your hundredth of a second, and after 3 months you can’t even separate one second from the next.
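The calculation generalizes to any magnitude. This sketch assumes positive, normalized single-precision values; `float32_spacing` and `to_float32` are names invented for the example. The second helper rounds a Python double to single precision and back, so you can watch the detail disappear.

```python
import math
import struct

FRACTION_BITS = 23  # stored fraction bits in IEEE single precision


def float32_spacing(x):
    """Gap between neighbouring single-precision values around positive x."""
    _, exponent = math.frexp(x)   # x = m * 2**exponent with 0.5 <= m < 1
    return 2.0 ** (exponent - 1 - FRACTION_BITS)


def to_float32(x):
    """Round x to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


print(float32_spacing(4000.0))     # 4 km track, in meters
print(float32_spacing(40000.0))    # 40 km track, in meters
print(float32_spacing(65536.0))    # about 18 hours, in seconds
print(float32_spacing(2.0 ** 23))  # about 97 days, in seconds
print(to_float32(4000.0001))       # a tenth of a millimeter is below the spacing
```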

The normal solution to this problem is to throw more bits at it — use double-precision floats instead of single-precision floats. Whether or not that’s a good idea varies from situation to situation (that gives you 52 bits of mantissa and 11 bits of exponent — I’ll leave the calculations up to you).

How much precision is too much precision?

There’s one more issue you have to watch out for with floats. Your typical PC CPU will perform floating-point operations in a register with more bits than needed to store the numbers, in order to not lose precision during the calculations. The Intel Pentium runs 80-bit floating-point registers, for instance.

This seems like a great thing at first, until you realize that the results of your calculations are now dependent on whether or not the intermediate steps got saved to memory at some point along the way. The only way to find out may be to look at the assembly code of your compiled program — if you can even access that in your language of choice.

The conclusion to this is simply that before expecting any kind of precision from floating-point arithmetic, you need to think about a great many things.

Edge cases

I mentioned two edge cases above — negative-zero and denormalized numbers. There are several more you need to know about when dealing with floats.

Negative-zero is actually guaranteed to be equal to positive-zero, but may or may not be treated as the same value as positive zero. Follow me? No? Well, for instance, in Java, the conditional:

(negativeZero == 0 && !(negativeZero < 0))

is true, because the primitive comparison operators can’t tell the two zeros apart. Yet Double.valueOf(-0.0).equals(0.0) is false, and Double.compare(-0.0, 0.0) orders negative zero before positive zero. In other languages, it varies. In C++, the standard doesn’t even require IEEE 754 arithmetic, so the behavior is left to the implementation. This is yet another reason to not compare floating-point numbers directly. Denormalized numbers are thankfully something you don’t need to care much about, so I won’t go into more detail on those.

It turns out there are two other beasts you do need to care about, however: Infinity (positive and negative), and NaN. These numbers can be the result of operations like division by zero, and they eat anything in their path. Infinity plus anything is infinity. Infinity minus anything is infinity. Well, except infinity minus infinity, which is NaN, of course!

Even testing for these values can be tricky. NaN is not a number (duh), and not even a single value but a whole range of bit patterns. Testing for it turns out to be one of those things that can’t be done in a portable way in some languages, as it seems to have been neglected by a fair number of standards, causing each implementation to have a slightly different function for it.
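As one data point, here is how these special values behave in Python, which does provide portable tests in its `math` module. Note that Python raises `ZeroDivisionError` for float division by zero instead of returning infinity, so the sketch builds the values directly.

```python
import math

negative_zero = -0.0
inf = float("inf")
nan = inf - inf   # infinity minus infinity is NaN

# The two zeros compare equal, but the sign is still stored.
zeros_equal = negative_zero == 0.0
zero_sign = math.copysign(1.0, negative_zero)

# Infinity swallows any finite value.
still_inf = (inf + 1e308 == inf) and (inf - 1e308 == inf)

# NaN is not equal to anything, including itself, so use math.isnan.
nan_equals_itself = nan == nan
nan_detected = math.isnan(nan)

print(zeros_equal, zero_sign, still_inf, nan_equals_itself, nan_detected)
```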

Options

The next time you need a fractional number, think about what implementation is the best one for you. Are you really ok with only single-precision floating-point? Maybe you need double-precision? Maybe even a fixed-point number matches what you need better than the intricacies of floating-point arithmetic. Whichever implementation you choose — make sure you know the pros and cons and make an informed choice — don’t just go with the float because the language has one, and hope it has enough precision for you.

Many languages support decimal types, usually encoded as BCD (binary-coded decimal). These are rather slow, take up a bit more space, but can accurately represent numbers like we’re used to. If you’re ever dealing with concepts like money, your language’s decimal type is your best bet (or making your own, if you don’t have access to one).
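Python’s `decimal` module is one such decimal type. A short sketch of the difference for money-like values (the amounts are made up):

```python
from decimal import Decimal

# Binary floats: ten cents plus twenty cents is not quite thirty cents.
float_total = 0.10 + 0.20

# Decimal values built from strings keep the digits exactly as written.
decimal_total = Decimal("0.10") + Decimal("0.20")

print(float_total == 0.30)
print(decimal_total == Decimal("0.30"), decimal_total)
```

Build the decimals from strings, not from floats: `Decimal(0.10)` would faithfully copy the binary rounding error along.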

If you want to know even more about floating-point arithmetic, go read What Every Computer Scientist Should Know About Floating-Point Arithmetic (though I disagree with the name, since it’s filled with lots of theorems that you really do not need to know) or Java theory and practice: Where’s your point? (especially if you’re coding in Java).

Handling Errors

One of the trickiest subjects in programming is the proper handling of errors. What do you do when things go wrong? Some errors are predictable so you can plan for them occurring. Some errors are predictable, but you still won’t plan for their occurrence, and a third category of horrible situations are circumstances you could never have guessed would occur.

In order to properly manage errors, you first need to identify what kind of errors you’re dealing with. The best software deals with all errors, but deals with different kinds of errors in different ways. I divide errors into four different categories:

  • Full crashes.
  • Programming errors.
  • Exceptional circumstances.
  • User errors.

Depending on where in the list you are (the worst kind of error is the first one), you’ll want to take a different approach to handling the error.

Crashes

The outright crash is the worst kind of programming error you’ll find. The code is malfunctioning in such a bad way that uninitialized memory is accessed, memory is overwritten, null pointers are dereferenced, unaligned memory is read or something similar.

In native code, crashes normally cause messages like General Protection Fault, Segmentation Fault (segfault) or Bus Error (misaligned memory access). In bytecode-compiled languages (managed code), crashes are usually raised as exceptions. For all languages (worth mentioning) it’s possible to handle outright crashes. In managed languages like Java or C#, you can catch the exception and do something with it (regardless of how significant). In C or C++, you can install error handling code for these occurrences.

Regardless of how your language deals with crashes, you should be treating them as separate issues from other exceptions. In general, there are two things to consider when dealing with crashes:

  1. Crash early. This is one of the vital tips from the book The Pragmatic Programmer by Andrew Hunt and David Thomas. If you crash as soon as possible when there’s an error, you avoid running with trashed data, trying to salvage the situation. The probable result of trying to recover from a crash is that you’ll run with data that is broken, and save that data somewhere — maybe you’ll overwrite the user’s settings with gibberish, or trash data in a vital database.
  2. Don’t crash. This may seem a bit contradictory to the above rule, but in essence this is about what you expose to your user. Jeff Atwood calls it crashing responsibly, but in my world it’s about not putting the user through the experience of a crash: you reduce the crash to a normal application failure (which is better) by showing the user your own explanatory text, preferably with an apology and some way for them to know that you’re working on fixing that crash (you are, aren’t you?). This means you’ve got to automatically report and track all crashes. Don’t leave it up to your users to report crashes to you. They most probably won’t, since they’ll be too busy either getting your application restarted so they can finish what they were doing, or looking at your competitor to find an application that doesn’t crash.
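In a managed language, both rules can be combined in a last-resort handler: stop immediately, but on your own terms. The sketch below uses Python’s `sys.excepthook`; `send_crash_report` is a stand-in for whatever reporting channel you have, and here it only collects reports in a list.

```python
import sys
import traceback

crash_reports = []  # stand-in for a real crash-reporting backend


def send_crash_report(report):
    """Hypothetical uploader; a real one would post to your bug tracker."""
    crash_reports.append(report)


def handle_crash(exc_type, exc_value, exc_traceback):
    """Report the crash automatically and show the user a friendly message."""
    report = "".join(
        traceback.format_exception(exc_type, exc_value, exc_traceback)
    )
    send_crash_report(report)
    print(
        "Sorry, something went wrong and the application has to close. "
        "A report has been sent to the developers.",
        file=sys.stderr,
    )


def install_crash_handler():
    # From now on, any uncaught exception ends up in handle_crash.
    sys.excepthook = handle_crash


# Simulate a crash without actually taking the interpreter down.
try:
    raise RuntimeError("simulated crash")
except RuntimeError:
    handle_crash(*sys.exc_info())
```

The handler does not try to recover or save anything; the process still ends, it just ends politely.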

Programming Errors

A programming error is an error caused by the code failing to abide by the rules set forth by other parts of the code. Violating contracts, failing to follow the documented restrictions of an interface or similar. Normally, you’ll use asserts to catch programming errors. In languages that don’t have asserts, you’ll cry for a while, spend a few minutes contemplating switching to a better language, and then probably do the check and throw an exception.

There are a few common mistakes with regards to asserts and programming errors:

  • Using asserts for other things than programming errors: asserts should be used only to check things in called code that the calling code could have and should have checked before making the call. You can think of an assertion as a statement of something that should never happen.
  • Allowing asserts to be ignored: Assertion failures should be treated just like crashes when it comes to handling. An assertion is an unconditional error in the code, something that should be fixed immediately, and if you ended up getting an assertion failure you have lost track of the well-being of the system. Crash, automatically report, and fix the problem for your next release. This is again good advice from The Pragmatic Programmer. Switching assertions off when you ship an application indicates that you think you’ve fixed all bugs. This is a rather naïve attitude, and you’ll quickly learn it doesn’t hold true. The only difference between debug and release might be how you handle your assertion failures. Assertions make it easier for you to find and fix the errors than the crash you might otherwise get, even after you’ve shipped.
  • Switching asserts off for release: Asserts are nearly always switched off for release builds. The built-in assertion mechanism of C and C++ switches off every assert at once when NDEBUG is defined, but building your own assert is not as hard as junior (and even some senior) programmers tend to think it is. Sometimes, you may need to switch some assertions off for release to address performance concerns. This should be a conscious, well-considered decision on specific asserts however, not a default.
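Rolling your own assert can be very little code. Python has the same trap as C here: the built-in `assert` statement is stripped when the interpreter runs with `-O`. The sketch below is one way to get a check that survives release builds; `always_assert` and `average` are illustrative names.

```python
def always_assert(condition, message):
    """Assertion that stays active even when Python runs with -O."""
    if not condition:
        raise AssertionError(message)


def average(values):
    # The caller could and should have checked this, so it is a programming error.
    always_assert(len(values) > 0, "average() requires at least one value")
    return sum(values) / len(values)


print(average([1, 2, 3]))

try:
    average([])
except AssertionError as failure:
    failure_message = str(failure)
    print("assertion failed:", failure_message)
```

In a real application the failure would go to the same crash reporting as any other crash instead of being caught on the spot.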

Exceptional Circumstances

Exceptions are the somewhat mangled, used-for-everything error handlers of most object-oriented languages. Be careful how you use exceptions: they should only be used for exceptional circumstances, meaning unexpected but detectable problems.

Note that the one thing you should never do is exception checking. Say for instance that you’re reading user settings from a file, but the file may not exist since the user may not have started the application before. The wrong thing to do here is to try to open the file, catch any exceptions and move on. The right thing to do is to check whether the file exists before trying to open it.

Remember — exceptions are supposed to be used when something unexpected happens — if you already know the file may not exist, it’s not unexpected that it doesn’t. However, if when saving the settings file it won’t open because it’s write-protected, that’s a good place for an exception.
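
As a rough sketch of the settings example, assuming a JSON settings file and a made-up DEFAULT_SETTINGS dictionary, the check-first version could look like this in Python:

```python
import json
from pathlib import Path

# Hypothetical defaults, used the first time the application is started.
DEFAULT_SETTINGS = {"volume": 5}


def load_settings(path: Path) -> dict:
    # A missing file is an expected situation (first run), so it is
    # handled with an ordinary check, not with an exception handler.
    if not path.exists():
        return dict(DEFAULT_SETTINGS)
    # Anything that fails past this point is genuinely unexpected and
    # is allowed to propagate as an exception.
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

The file can still vanish between the check and the open, but that really is an unexpected event, so an exception is the right response to it.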

To summarize this as a simple rule: “Never use exceptions as a control structure”. There are several reasons for this, but with exceptions representing something gone wrong, it should be reasonably easy to understand that things should not continuously go wrong during normal execution. Practically, a program that only throws exceptions when things go wrong is much easier to debug than one that throws exceptions here and there.

Another thing to think about with exceptions is that in general, they are specific to the context in which they’re thrown. For instance, a FileAccessException makes sense when the file can’t be opened in the above example, but as little as one or two steps up the call stack, a FileAccessException makes no sense at all. Usually it’s a good idea to catch the exceptions and convert them to a type that makes sense in the current context. This makes it easier to decide where it’s appropriate to handle them.
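
A small sketch of that conversion in Python, where the file-level error is OSError. The SettingsError type and the save_settings function are hypothetical names for this example:

```python
class SettingsError(Exception):
    """Hypothetical error type that makes sense to callers of the settings code."""


def save_settings(path: str, text: str) -> None:
    try:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
    except OSError as exc:
        # Translate the low-level file error into the vocabulary of this
        # layer. "from exc" keeps the original error attached as __cause__,
        # so no debugging information is lost.
        raise SettingsError(f"could not save settings to {path}") from exc
```

Code further up the call stack only needs to know about SettingsError, while the traceback still shows the underlying file error.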

User Errors

The final category of errors is user errors. These are things that aren’t even (or shouldn’t be) unexpected, and certainly not exceptional. Always assume that your user will input something wrong in any kind of input form. The file name you asked for wasn’t a valid file name, you asked for a number but got the string “ten” — your imagination will not be capable of coming up with all the wickedly “stupid” ways your users will try to use your application (stupid in this context is a programmer-view of the world — to programmers many of the natural ways people communicate seem stupid when applied to computers, but the stupidity is generally on the side of the computer, not of the user).

A user error should never manifest itself as any of the above kinds of errors — you should always be checking and validating user input before letting it propagate into the system. Failing to do so will likely cause unexpected and weird behaviour from your application — which in turn is a programming error, not a user error. The fault was yours for not validating the input, not the user’s for trying to use your application.
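
For illustration, here is one way such a check at the boundary might look in Python. The function name parse_count and the range limits are assumptions made for this sketch. Bad input is reported back as an ordinary result, not as an assert or an exception:

```python
from typing import Optional, Tuple


def parse_count(raw: str, low: int = 1, high: int = 100) -> Tuple[Optional[int], Optional[str]]:
    """Return (value, None) for valid input, or (None, message) for bad input."""
    text = raw.strip()
    # Allow one leading sign, then require decimal digits only.
    digits = text[1:] if text[:1] in ("+", "-") else text
    if not digits.isdecimal():
        return None, f"{text!r} is not a whole number"
    value = int(text)
    if not low <= value <= high:
        return None, f"{value} is outside the range {low} to {high}"
    return value, None
```

Only a value that passes both checks is handed on to the rest of the system. The message can be shown to the user so they get another try.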

Error Handling Code

Once you start working on properly handling errors, you’ll inevitably start producing lots of code. This code will be run very rarely, which means it’ll likely be less well tested than the rest of your code base. You only get occasional shots at fixing this code (when something else is broken), so fix error handling code first.

Another thing to note about error handlers is that you’re usually limited in what you do in them. Depending on the kind of error, you may have no possibility of allocating memory to deal with your error message (although applications that gracefully manage out of memory errors are a truly rare find).

I recently fixed a problem which caused a hard crash of our data building pipeline, which was getting stuck in an infinite loop trying to build a malformed shader. There are several things to fix here:

  • Fix the hard crash: why wasn’t our error handling gracefully exiting the build? It turned out that our error handler attempted to read the callstack and print it. This is a good idea on a normal crash, since it could then be reported to the programmers, but it’s a very bad idea on a stack overflow. Not only is the stack extremely large and unlikely to yield much information to the programmers, but there’s no stack space left to deal with the stack overflow. The hard crash was the application’s error handler entering a function with a stack-allocated string to manage the callstack lines. Changing the handler to not try to list the callstack fixed the crash.
  • Fix the programming error: the pipeline ended up in an infinite loop due to bad user data. We added validation to ensure that the input graph was properly acyclic. Note that this is something you should do after you’ve fixed your stack overflow error handler; otherwise you can’t know if your fixes for the handler worked.
  • Fix the user data: actually fix the faulty shader graph. As a bonus, we added error-checking code to our editor to prevent the error from ever occurring again.
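
The acyclic check in the second step can be sketched roughly as follows. This is not our pipeline's code. It assumes the graph is a plain dictionary mapping each node name to the nodes it feeds into, and it uses Kahn's algorithm, which needs no recursion and so cannot itself overflow the stack:

```python
from collections import deque


def is_acyclic(graph: dict[str, list[str]]) -> bool:
    """Return True if the directed graph contains no cycle (Kahn's algorithm)."""
    # Count incoming edges, including nodes that only appear as targets.
    incoming = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            incoming[target] = incoming.get(target, 0) + 1

    # Repeatedly remove nodes that nothing points to any more.
    ready = deque(node for node, count in incoming.items() if count == 0)
    removed = 0
    while ready:
        node = ready.popleft()
        removed += 1
        for target in graph.get(node, []):
            incoming[target] -= 1
            if incoming[target] == 0:
                ready.append(target)

    # Nodes on a cycle never reach zero incoming edges, so they are never removed.
    return removed == len(incoming)
```

Running a check like this before the build starts turns the malformed graph into an ordinary validation message.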

Errors can teach you much about the health of your code base if you listen to them. What have your errors taught you?

(Thanks to cdamian and Justin Marty on flickr for the images)
