Introduction
There I am, working on the computer. Running all kinds of programs to accomplish my goals. Just cruising along, in that mental space where I can do no wrong.
Dink! “Operation Failed.”
Huh? What operation? I thought I was just saving a file. I try again.
Dink! “Operation Failed.”
This is frustrating. I try again and again, tweaking settings and preferences and options, all to no avail. The stupid program just won't save my work because the “Operation failed”. I'm no longer in the groove and suddenly everything is going wrong.
Welcome to the world of poor error handling.
Definition of an Error
What is a software error anyway? Every program is made up of many actions. Each of those actions has two sets of requirements: 1) one or more requirements that must be met before the action can be performed and 2) one or more requirements that determine if the action was a success. If any of these requirements, before or after, are not met, then an error has occurred.
For example, an action that adds two numbers together might have the following requirements:
Before the Action
- Two numbers have been provided
- The first number is an integer between the most negative 32-bit integer and the most positive 32-bit integer
- The second number is an integer between the most negative 32-bit integer and the most positive 32-bit integer
After the Action (success requirements)
- The result is an integer between the most negative 32-bit integer and the most positive 32-bit integer
- The result did not overflow the limits of a 32-bit integer
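As a rough sketch of how these requirements turn into checks, the following Python function tests the "before" requirements on entry and the "after" requirements on exit. The names add_int32, is_int32, INT32_MIN, and INT32_MAX are invented here for illustration, and raising an exception is only one possible way to signal the failure.

```python
INT32_MIN = -2**31      # most negative 32-bit integer
INT32_MAX = 2**31 - 1   # most positive 32-bit integer


def is_int32(value):
    """Return True if value is an integer that fits in 32 bits."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    return (isinstance(value, int) and not isinstance(value, bool)
            and INT32_MIN <= value <= INT32_MAX)


def add_int32(a, b):
    """Add two 32-bit integers, checking requirements before and after."""
    # Before the action: both inputs must be 32-bit integers.
    if not is_int32(a):
        raise ValueError(f"first number {a!r} is not a 32-bit integer")
    if not is_int32(b):
        raise ValueError(f"second number {b!r} is not a 32-bit integer")

    result = a + b

    # After the action: the result must also fit in 32 bits.
    if not is_int32(result):
        raise OverflowError(f"{a} + {b} overflows a 32-bit integer")
    return result
```

Calling add_int32(1, 2) returns 3, while add_int32(INT32_MAX, 1) fails the "after" requirement and raises OverflowError.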
Errors in a program are generally caused by influences from outside the program. The user enters bad data, a file contains a bad format, a hardware device fails and responds unpredictably, a disk drive fills up, and so on.
Errors can also occur in the design of a program. You can think of a design feature as an action and thus the feature has a list of requirements. In fact, those feature requirements become part of the requirements for each of the actions that make up the feature. Errors in design are most often known as design flaws or bugs, which brings us to the next question.
What is a bug? A bug is an error for which the programmer is responsible. There are three common scenarios in which a bug is introduced:
- The programmer did not understand all of the requirements for an action and therefore the implementation is faulty
- The programmer did not know all of the requirements for an action and therefore the implementation does not work as expected
- The action was simply not implemented correctly so as to satisfy the success requirements
Sounds simple enough. For a bug-free program, know and understand all of the requirements for every action that a program can take and then implement those actions correctly. So why are there so many buggy programs around? There are a lot of smart programmers creating programs big and little, and yet bugs still creep in. Look at the action of adding two numbers together. I provided five key requirements for that simple action. However, there are other requirements I didn't list, such as the action must be implemented in a specific language, must run on a certain kind of hardware, must use registers or must use a stack, must meet a certain level of accuracy, and must perform as fast as possible or must use as little memory as possible. Note that some of these requirements are contradictory.
Faced with contradictory or vague requirements, the programmer must make a choice. That choice could be made based on additional requirements (for example, the whole program must run as fast as possible so the requirement about being faster is the correct choice). But if a requirement is vague, how does the programmer know they implemented the action correctly?
After making all the (hopefully correct) choices, the programmer finishes creating the action and it looks like all the requirements that can be achieved have been achieved. Later, another action is added to the program that has a requirement that says the first action must produce a result of a certain accuracy. Now suppose the first action cannot produce accurate enough results due to the programmer's choices on how best to meet the original requirements of the first action (for example, one original requirement might have stated that speed was a higher priority than greater accuracy). To allow the second action to have one of its requirements met, the first action must be modified to produce higher accuracy, but what about the first action's requirement for speed? Is that now sacrificed to satisfy the second action? More choices must be made.
Multiply that situation by a hundred for a small program. Multiply that by ten thousand for a large program. Now change the programmer of the second action to someone else and put three years between the time when the first action and the second action are created. And have the first programmer no longer be around to answer questions. Oh, and by the way, management wants to move to a newer version of the computer language.
That is one way bugs happen, despite the best efforts of very smart people.
Or bugs could happen because a programmer skipped breakfast or had too little sleep the night before or was at the end of a third twelve-hour day in a row. When the human mind is overwhelmed from both within (by the enormous complexity of the software) and without (too little sleep, too many distractions), it's bound to make mistakes. Unfortunately, this is usually how bugs appear in software, even to the very smartest of people.
A programmer can follow good sleeping habits, eat right, get plenty of exercise, and work sensible hours. And the programmer can still make mistakes.
So we have things that can cause errors from outside the program and bugs that cause trouble from inside the program. So what can be done about this ridiculous complexity? The simplest approach is to recognize how amazingly complicated software really is and then build in mechanisms that let the program respond in a graceful manner to unpredictable situations. In other words: error handling.
Why Error Handling
We have established that error handling is probably a good idea. Now we have to figure out what it means to have error handling.
Let's start with a simple goal: An error handled successfully leaves the program in a usable state to try again.
The simplest approach to satisfying our goal is for an action to put itself into a known good state immediately after an error has been detected. For example, if an action is to read a file from disk and an error occurred while opening the file, the action doesn't have to do much to put itself back into a good state (the state before the file was opened). However, if the action was in the middle of reading the file, the action should close the file and discard whatever was read in to get back into the state before the file was opened. Only then can the error be processed. If each action leaves itself in a known good state after an error has occurred but before that error is processed, then error handling is reduced to either the program trying to work around the error or reporting the problem to the user and letting them work around the error.
Things get more complicated if an action detects an error but the action can't get back to a known good state. The only choice then is to pass the error back up to the parent of that action and let the parent handle the error. For example, using the file reading action, suppose that a buffer of data is read from a file and passed to a sub-action which looks for a piece of text surrounded by square brackets. If the piece of text couldn't be found or the wrong piece of text was found first, that is an error. The sub-action can't get back to a known good state (the data was malformed) so it passes the error to the action that called it. The parent action sees the error, closes the file and discards whatever data it had read in up to that point (getting back to a known good state), then reports the error to the user.
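The parent and sub-action relationship just described might be sketched in Python as follows. The function names find_bracketed_text and read_labeled_file are hypothetical, and using ValueError to pass the error upward is an assumption made for this illustration rather than the only possible mechanism.

```python
def find_bracketed_text(buffer, expected):
    """Sub-action: return the first [bracketed] text found in buffer.

    It cannot repair malformed data, so it passes the error up by raising.
    """
    start = buffer.find("[")
    end = buffer.find("]", start + 1)
    if start == -1 or end == -1:
        raise ValueError("no text in square brackets was found")
    text = buffer[start + 1:end]
    if text != expected:
        raise ValueError(f"found [{text}] before [{expected}]")
    return text


def read_labeled_file(path, expected):
    """Parent action: return (data, error message); one of them is None."""
    data = None
    try:
        # The with statement closes the file however the block is left.
        with open(path, encoding="utf-8") as handle:
            data = handle.read()
            find_bracketed_text(data, expected)
    except (OSError, ValueError) as error:
        # Back to the known good state: file closed, partial data discarded.
        data = None
        return None, f"Could not read '{path}': {error}"
    return data, None
```

The sub-action only detects the problem. The parent is the one that knows a file is open and data has been read, so it is the one that restores the good state and produces the message for the user.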
The sub-actions could also call other sub-sub-actions and so on. The program design needs to identify which action can handle getting the program back into a known good state in the face of an error. It is that action that handles the error. In general terms, errors should be handled — that is, get the program back into a known good state — at the point where the fewest number of steps are needed to get the program back into a good state (fewer steps means fewer opportunities for additional errors). Following this approach to the extreme, the error is passed all the way back to the top level of the program, which reports the error and shuts the program down. This is, of course, little better than the error causing the program to crash. Look for a point where there is a balance between taking several steps to put the program back on track and taking one step and shutting the program down.
On the other hand, a lot of programs don't handle the error except at the topmost level. In these situations, the only thing the program can really do is reset the entire program (almost as bad as exiting the program and telling the user to start again), which could be very inefficient as well as costly to the user's data. Also, there may no longer be sufficient information to clearly explain to the user what went wrong when errors are handled at the top level (Dink! “Operation Failed.”). And if the user doesn't know how the error happened, they will either try the whole set of actions again and run into the error again or give up on the program entirely because it appears to be too buggy to use — even if the program doesn't actually contain any bugs in those actions.
And that brings us to the perception of errors.
Error Perception
In the following exercise, we are going to put ourselves in the mind of a typical user of our program to figure out what we as programmers need to put into an error message so that we as users can solve a problem.
So, say we (as users) are using a program and something goes wrong. How do we know it went wrong? Typically, an error message is displayed. If no message is given, perhaps the program refuses to perform some action (we keep selecting the menu option but nothing appears to happen). Or the penultimate failure occurs and the program crashes. Of course, the ultimate failure is all our data being wiped out, which we don't discover until after the program exits.
As a user, what do we need to know about an error to allow us to successfully complete our task? The first thing we need to know is what went wrong. This is typically the action that reports the error (usually the top level action being executed at the time of the error). For example, the program says:
That piece of information tells us the problem occurred while the program was reading a file, as opposed to saving a file. However, this is not enough detail to keep us from running into the error again (other than to stop trying to get the program to read the file). We know the context in which the error occurred but we could use more information. At the moment, all we know is the program can't read a specific file.
The next bit of useful information is what data was the program working on at the time of the error. In our file-reading example, this data would be the name of the file:
Now we're getting somewhere. As long as we don't try to load the “simonsays.txt” file, we should be all right. But what if we try loading another file and something goes wrong:
Now it seems we're back to not being able to load any files. Note that we as users haven't tried all files, but after two files we are generally not going to have the patience to try any more. Our perception of the program at this point is “what a stupid program! I'm going to find something else!”
It would be very useful to have a little more detail about the action the program was performing when the error occurred. We already know it was reading a file but what, exactly, about that file caused problems for the program? This information comes from the action that actually detected the error. For example, we have the primary action, which was reading the file, then there's the sub-action that was looking for some kind of header in the file. Let's add that to the error message:
This is better but it doesn't actually tell us what the error is and without that tidbit of information, we are stuck with not reading any files that have headers in them, whatever a “header” is.
The final piece of information we need is which requirement failed. Remember, all actions have requirements that must be met before the action starts and requirements that when met, mean the action is a success. If the error message specifies which requirement failed, we should have enough information to avoid the error in the future. Now we have the error message:
We now know the program was looking for a piece of text called “HEADER” in the file and failed to find it before it encountered the “DATA” section. This suggests there may be a problem with the text file and that gives us as users something to work with. We can look in the “simonsays.txt” file with a text editor and see if there is a “HEADER” section and if found, see if the “HEADER” section appears after the “DATA” section. We might even be brave enough to move the “HEADER” section to the start of the file, save the file, and see if the program can now load the file.
Error Reporting
When an error occurred, the program provided us with the following information:
- The context in which the error happened (typically the primary action being executed)
- What data was being worked on at the time of the error
- What action actually detected the error
- What requirement of that action failed to be met
With this information, we (the user) were able to correct a bad file and continue working.
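One way to make sure all four pieces of information travel together is to collect them in a small record. The sketch below is only an illustration: the ErrorReport class, its field names, and the exact wording of the message are assumptions, not something prescribed by any library.

```python
from dataclasses import dataclass


@dataclass
class ErrorReport:
    context: str       # the primary action being executed
    data: str          # what was being worked on at the time
    detected_by: str   # the action that actually detected the error
    requirement: str   # the requirement that failed to be met

    def message(self):
        """Combine all four pieces into one line for the user."""
        return (f"Error while {self.context} '{self.data}' "
                f"({self.detected_by}): {self.requirement}.")


report = ErrorReport(
    context="reading file",
    data="simonsays.txt",
    detected_by="looking for the HEADER section",
    requirement="HEADER was not found before the DATA section",
)
print(report.message())
```

Because every field is required, an action cannot create a report without stating the context, the data, the detecting action, and the failed requirement.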
If all actions that handle errors provide the aforementioned bits of information, the user stands a good chance of being able to correct the errors themselves. And when the user cannot fix the problem, there is enough information that a developer can narrow down the problem to one area of the code. Although this kind of error handling is very useful and the program will be all the more stable and robust because of it (and the user won't be made to feel like a dummy by bad error messages), this level of error handling does have a cost. Expect anywhere between 50% and 80% of your code to be devoted to handling errors. Good error handling is not an afterthought, something to be slapped into place with a hope and a prayer. Good error handling, like a good algorithm, must be properly designed and tested. Good error handling is worth the investment because the payoff is more satisfied customers.
Now, if the program was just smarter about handling errors on its own so we don't have to do all the work, we'd be really happy users. For a program to be smart enough to recover from many errors, the program needs to be designed to deal with errors.
Design Away Errors
The first step in dealing with errors in a program is to design the program to not have errors. Okay, so that's perhaps reaching for the impossible. We can at least design the program to minimize the errors that can occur. For each error our design makes impossible, that's one less error we have to handle and the user has to see.
Way back in the beginning, I had an example of an action that added two numbers. Here is a summary:
Before the Action
- Two numbers have been provided
- The first number is an integer between the most negative 32-bit integer and the most positive 32-bit integer
- The second number is an integer between the most negative 32-bit integer and the most positive 32-bit integer
After the Action (success requirements)
- The result is an integer between the most negative 32-bit integer and the most positive 32-bit integer
- The result did not overflow the limits of a 32-bit integer
If the program can handle it, we can eliminate one error if we change the requirements for success to be:
After the Action (success requirements)
- The result is an integer between the most negative 64-bit integer and the most positive 64-bit integer
By making the result a 64-bit integer, we've eliminated the possibility of an overflow (no matter how you add two 32-bit values, the result will always fit in a 64-bit integer). However, now our program must handle 64-bit values. Plus, what happens if we want to add two 64-bit numbers? Change everything to use 128-bit integers? In this example, to avoid a never-ending escalation of bitness, we could make a choice, a compromise, that says our code will never add two 64-bit integers.
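The claim that two 32-bit values always sum to something that fits in 64 bits can be checked by looking at the two extreme cases. This is a small sketch with invented names (add_int32_wide and the range constants). Python integers have no fixed width, so the constants stand in for the limits a fixed-width language would impose.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def add_int32_wide(a, b):
    """Add two 32-bit integers into a result treated as a 64-bit integer.

    Only the "before" checks remain; the overflow check is designed away.
    """
    for value in (a, b):
        if not INT32_MIN <= value <= INT32_MAX:
            raise ValueError(f"{value} is not a 32-bit integer")
    return a + b


# Every possible sum lies between these two extremes.
smallest = add_int32_wide(INT32_MIN, INT32_MIN)   # -2**32
largest = add_int32_wide(INT32_MAX, INT32_MAX)    # 2**32 - 2
assert INT64_MIN <= smallest <= largest <= INT64_MAX
```

The input checks are still there, but the success requirement about overflow no longer needs a test because it cannot fail.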
Let's look at the file reading example, where the file reader was looking for a “HEADER” section before a “DATA” section. If we design the actions to allow the sections to appear in any order, we can eliminate the error where the “HEADER” section must appear before the “DATA” section. The trade-off is the action needs to load in the entire file instead of reading the file one line at a time, looking for the desired sections. So the action will take up more memory.
A second alternative could be the action reads the file twice, once to find the “HEADER” section and a second time to find the “DATA” section. That might take a long time, if the file is stored across a network connection. So the action will take more time.
A third alternative is to have a default “HEADER” section, in case an actual “HEADER” section cannot be found. The trade-off here is whether the default section will work for most “DATA” sections. There is probably a very good reason the “HEADER” section is expected in the file, after all.
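The first and third alternatives can be combined in a short sketch. It assumes, purely for illustration, that a section starts with a line such as [HEADER] or [DATA]; the function names and the contents of DEFAULT_HEADER are made up as well.

```python
DEFAULT_HEADER = ["version=1"]   # hypothetical default header content


def parse_sections(text):
    """Collect the lines of each section, in whatever order they appear."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]            # a new section begins
            sections.setdefault(current, [])
        elif current is not None and line:
            sections[current].append(line)  # a line inside the section
    return sections


def load(text):
    """Return (header lines, data lines) from the whole file text."""
    sections = parse_sections(text)
    if "DATA" not in sections:
        raise ValueError("DATA section not found")
    # Third alternative: fall back to a default when HEADER is absent.
    header = sections.get("HEADER", DEFAULT_HEADER)
    return header, sections["DATA"]
```

The ordering error is gone, but the trade-offs described above are visible in the code: the whole text is held in memory, and a file with no header silently receives default values that may not suit its data.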
Don't expect to be able to design away all errors; the best you can do is eliminate the more obvious sources of errors. Sometimes you have to go through the process of implementing the design to see a way to eliminate the possibility of an error. Design and implementation of a complex product such as software is an iterative process, where one step requires revisiting a previous step with new information. Also, every change made in the design usually means a trade-off was made somewhere else.
Summary of Error Handling Design
So now you know the fundamental art of designing error handling:
- Identify the actions a program can take
- Identify the requirements for each action
- Determine which actions will handle errors so the program can be left in a usable state
- Design those actions to detect the appropriate errors
- Provide sufficient information about each error so the user has a good chance of correcting it themselves rather than calling support for help
- Where possible, design away the possibility of at least some errors
The Mechanics of Error Handling
What about the mechanics, the nuts-and-bolts of error handling? Just how do you go about implementing error handling?
A program is built up from a collection of procedures, sub-routines, functions, methods, call them what you will. At the topmost level, each action a program can take is represented by one primary procedure (such as load a file, save a file, or add two numbers together). That means each procedure has a list of requirements to be checked before starting and a list of requirements for determining success after the procedure is done.
For example, the high level requirements for loading a file could be:
Before the Action
- A valid file name is specified
After the Action
- The file has loaded successfully
Steps for recovering from an error
- Remove any side effects from trying to load the file
- Report the error to the user
Sub-actions that help with loading the file would each have more specific requirements but at the top level, the requirements are pretty simple.
Up until now, I've provided mainly a lot of theory about error handling, but what actually happens when an error is detected? And if you subscribe to the theory that programs should be structured in layers (to keep changes in one layer from affecting another layer), how do you communicate an error at a lower layer to an upper layer? Just how are errors detected and communicated within the program and ultimately communicated to the user?
There are several schools of thought on how to pass detected errors from one level of a program to another and eventually to the user. All of them boil down to two approaches: error codes and exceptions. Which approach you take largely depends on the computer language you are using and the coding style (if any) put into place for error handling. Pick one approach and be consistent. Mixing multiple styles of error handling generally leads to confusion on how an error is supposed to be sent up through the layers of code.
Detecting Errors
The actual detection of errors is straightforward: make sure the data coming into the procedure is as expected and make sure the state of the program itself is as expected. The requirements for going into the procedure are known and they should translate almost directly into tests to ensure the validity of the data and the state of the program. The same goes for the success requirements, which are nothing more than tests to verify the results are as expected. If any test fails, an error condition has occurred and must be dealt with.
For example, when reading a file, the following error detections are created based on the “before execution” and “after execution” requirements:
- (BEFORE) If the file name is not valid (by whatever measure you use to determine a valid file name), reject the file name with an error
- (BEFORE) If the file is expected to exist and it does not exist, reject the file name with an error
- (AFTER) If the format of the file does not fit what is expected, then the load was unsuccessful
- (AFTER) If the file did not contain all the expected information, then the load was unsuccessful
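These detections translate fairly directly into code. The sketch below is one possible shape, with hypothetical function names; the rule used for a "valid" file name (non-empty, no NUL character) and the representation of the loaded file as a dictionary of sections are assumptions made for the example. It covers the two "before" checks and the "all expected information" check from the list above.

```python
import os


def check_before_load(path, must_exist=True):
    """Return the list of failed "before" requirements (empty means OK)."""
    failures = []
    if not path or "\0" in path:
        failures.append("file name is not valid")
    elif must_exist and not os.path.isfile(path):
        failures.append("file does not exist")
    return failures


def check_after_load(sections, required=("HEADER", "DATA")):
    """Return the list of failed "after" requirements (empty means OK)."""
    failures = []
    for name in required:
        if name not in sections:
            failures.append(f"{name} section is missing")
    return failures
```

Returning the failed requirements, rather than a bare true or false, keeps the information needed later to tell the user which requirement was not met.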
Error Codes
There are two ways to handle error codes: 1) return the error code from each procedure and 2) use a global location to store an error code and then have the procedure return a value indicating an error has occurred.
Return Error Code
The first approach is the simplest. Define a list of error codes that your program knows about and then designate one of those error codes as indicating success. Every procedure that can detect an error is written to return one or more error codes for error conditions or return the success code if no error is found. Every call to a procedure tests the return value to see if an error has occurred and if so, either handle the error right there or pass the error on to a higher layer.
In pseudo-code, returning an error code looks something like this:
mainlevel() procedure:
    set done = false
    while done is false do
        errorcode = call sublevel() procedure
        if errorcode does not equal success then
            report error to user
        else
            continue processing
            if processing is done then
                set done = true
            end if
        end if
    end while
end mainlevel()

sublevel() procedure:
    if program state is wrong then
        return error code indicating bad program state
    else
        do something useful
        if something useful failed then
            return error code indicating the algorithm failed
        else
            return success code
        end if
    end if
end sublevel()
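For readers who prefer something they can run, here is a loose Python rendering of the same idea. The ErrorCode values and the way the work is passed in as callables are inventions for this sketch; the essential pattern is that every call returns a code and every caller tests it.

```python
from enum import IntEnum


class ErrorCode(IntEnum):
    SUCCESS = 0            # the one code that means "no error"
    BAD_STATE = 1
    ALGORITHM_FAILED = 2


def sublevel(state_ok, work):
    """Return an error code instead of a result."""
    if not state_ok:
        return ErrorCode.BAD_STATE
    if not work():                      # do something useful
        return ErrorCode.ALGORITHM_FAILED
    return ErrorCode.SUCCESS


def mainlevel(steps):
    """Run each step and return the codes that would be reported."""
    reported = []
    for state_ok, work in steps:
        code = sublevel(state_ok, work)
        if code != ErrorCode.SUCCESS:
            reported.append(code)       # stand-in for telling the user
    return reported
```

Nothing forces the caller to look at the value sublevel() returns, which is the weakness of this approach discussed later.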
Global Error Location
The C standard library has used a global error code since its original design. Each function returns a value that is pertinent to what the function does; for example, fopen() returns a pointer to a FILE structure representing the file just opened. If fopen() fails, it returns a NULL pointer, indicating an error has occurred. Your program then consults the global errno variable to find out what happened. Some later additions to the library, such as the ones from Microsoft, moved away from relying on a global errno variable toward returning error codes from each function. For example, fopen() gained a counterpart, fopen_s(), which returns an error code and hands back the FILE pointer through an argument instead. Global error variables such as errno are not a good idea because of the problem of one procedure possibly overwriting the error code for a previous procedure before the first error was handled. It can be argued that the program was not properly structured to prevent overwritten error codes but a global variable makes it very easy to get into this kind of situation.
There is one place where the concept of a “global” error code can still find use and that is with classes. A software “class” represents data and methods (procedures) that are closely associated. The error code becomes just a property of the class and all of that class's methods use the class error code. There is much less chance of one procedure stomping on the error code of another.
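A minimal sketch of an error code kept as a property of a class might look like this. The Accumulator class and its limit rule are hypothetical; the point is only that each object carries its own error code.

```python
class Accumulator:
    SUCCESS = 0
    OUT_OF_BOUNDS = 1

    def __init__(self, limit):
        self.limit = limit
        self.total = 0
        self.error = Accumulator.SUCCESS   # error code owned by this object

    def add(self, amount):
        """Add amount to the total; on failure leave the total unchanged."""
        self.error = Accumulator.SUCCESS
        if abs(self.total + amount) > self.limit:
            self.error = Accumulator.OUT_OF_BOUNDS
            return self.total
        self.total += amount
        return self.total


first = Accumulator(limit=10)
second = Accumulator(limit=10)
first.add(20)    # fails: first.error becomes OUT_OF_BOUNDS
second.add(5)    # succeeds and does not touch first.error
assert first.error == Accumulator.OUT_OF_BOUNDS
assert second.error == Accumulator.SUCCESS
```

A later call on the same object can still overwrite its error code, so the caller must check after each call, but unrelated objects no longer interfere with one another.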
The only difficulty with using any kind of “global” error location is figuring out what is an invalid value to return from a procedure. For memory pointers it's easy: return a 0 or null pointer, since such pointers are almost universally not allowed to be accessed. But an arithmetic operation that returns an integer where the value could be any one of the allowed integer values presents a problem. Do you reserve one value that is not likely to occur and treat that as the flag that an error occurred? Or do you tell the caller to check the global error location to see if it has been set to an error after the arithmetic operation has completed? The latter approach would look something like this:
set global_error = success

mainlevel() procedure:
    result = call do_add(1, 2) procedure
    if global_error does not equal success then
        report to user that an error occurred in do_add()
    else
        report result to user
    end if
end mainlevel()

do_add(a, b) procedure:
    set global_error = success
    set result = 0
    if a is out of bounds or b is out of bounds then
        set global_error = code for out of bounds error
    else
        set result = a + b
    end if
    return result to caller
end do_add()
Every procedure that can return an error must be sure to set the global error location to a success value as the first step. Every caller of a procedure must remember to check the global error location to see if an error occurred. That's two things that can be missed by a developer and if either one is missed, an error could be missed and the user will get bad results for no explained reason.
This is why global error locations are not used much anymore: it is too easy to miss one vital step and cause the program to no longer be trustworthy. Returning error codes is a little better; however, it is still up to the caller to check the return value and respond to errors accordingly. All too often, a programmer will ignore the return codes and thus an error can escape unnoticed until the user's data is corrupted. To be fair, error return codes may be the only way to handle errors in situations where more robust approaches such as exceptions are not supported. The only thing you can do then is be very diligent and consistent in checking return codes.
One tip for making error codes work a little easier is to mandate that all procedures either return nothing or they return an error code. If a procedure needs to return a non-error value, it should do so through some other mechanism than the return statement (for example, through a reference or pointer to a variable passed in as an argument).
Error Code as Message
An error code is a number and, by itself, a number is meaningless. So it is very useful to convert an error code into human-readable text. But how to translate an error code into the right kind of message for the user?
There are several ways to solve this problem; what I present here is just one solution. The assumption in this example is every error code is unique, although parts of the actual message might change depending on the exact failure mechanism (for example, an error code for File Not Found might have a message that says “Failed to find ‘simonsays.txt’ file.”).
The following is presented in pseudo-code form. There is the action that detects and reports the error, and two helper actions. One helper action is used to save the error message with the error code while the second helper action retrieves the error message given the error code.
readfile(filename) procedure:
    if filename does not exist then
        set message = "Failed to find {0}", where {0} is filename
        call set_error_message('File Not Found' code, message) procedure
        return 'File Not Found' code
    end if
    read the file
    return success code
end readfile()

set_error_message(error code, message) procedure:
    set global error list at index error code = message
end set_error_message()

get_error_message(error code) procedure:
    set message = value in global error list at index error code
    return message
end get_error_message()
Here is an example of how the above would be called to get the error message:
mainlevel(filename) procedure:
    set return code = readfile(filename)
    if return code is success code then
        do something with file contents
    else
        set message = call get_error_message(return code) procedure
        report return code, message
    end if
end mainlevel()
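The pseudo-code above might be rendered in Python roughly as follows. This is a sketch under a few assumptions: the global error list becomes a dictionary, readfile() returns the file contents alongside the code (the pseudo-code leaves open where the contents go), and the fallback text for an unknown code is invented.

```python
import os

SUCCESS, FILE_NOT_FOUND = 0, 1

_error_messages = {}   # the "global error list": error code -> message


def set_error_message(code, message):
    _error_messages[code] = message


def get_error_message(code):
    return _error_messages.get(code, f"Unknown error {code}")


def readfile(filename):
    """Return (error code, contents); contents is None on failure."""
    if not os.path.exists(filename):
        set_error_message(FILE_NOT_FOUND,
                          f"Failed to find '{filename}' file.")
        return FILE_NOT_FOUND, None
    with open(filename, encoding="utf-8") as handle:
        return SUCCESS, handle.read()


def mainlevel(filename):
    """Return the text that would be shown to the user."""
    code, contents = readfile(filename)
    if code == SUCCESS:
        return contents
    return f"Error {code}: {get_error_message(code)}"
```

The message list is itself a global location, so it shares the overwriting risk described earlier: a second File Not Found error replaces the message stored for the first.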
Exceptions have the advantage of storing the message in the exception object itself, thus eliminating the need for a “global” list that maps an error code to a string.
Exceptions
Exceptions are used in computer languages that support them to jump from one location in the code to another. Unlike a goto statement (that is, “goto line #”), however, an exception works by backing out of the procedure that triggered the exception and following the call stack up until the exception is caught. Before I explain how this works, here are some terms you need to know regarding exceptions.
Exception Terminology
Trigger an exception. This action is commonly called raising or throwing an exception, depending on the mechanism used for exceptions in the computer language. Triggering an exception puts the exception in motion and is done after an error has been detected.
Catching an exception. The part of the program that catches the exception is typically known as an exception handler and is indicated by a code construct that says “all exceptions of a particular type stop here”. The code that catches the exception can then do something about the exception, from handling it to reporting it to the user to re-triggering the exception. Another term sometimes seen for an exception handler is a “catch block”, named after the language keyword used to mark an exception handler (the term “catch block” is specific to C++, C#, and Java since “catch” is the keyword those languages use to mark an exception handler).
Unwinding the stack. This action represents what occurs when an exception is triggered. The exception follows the call stack, undoing each procedure call as the exception looks for an exception handler. Note that some computer languages don't have stacks, but nonetheless have some way of following a sequence of calls that needs to be unwound. As the stack is unwound, each procedure is allowed to clean up after itself. This is why exceptions are so nice to use, as there is a (more or less) automatic way to handle the process of getting the program back into a good state.
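All three terms can be seen in a few lines of Python, where the keywords are raise, except, and finally. The procedure names are hypothetical, and the log list exists only so the order of events can be inspected afterwards.

```python
def parse_header(log):
    log.append("parse_header: start")
    try:
        # Trigger (raise) the exception once the error is detected.
        raise ValueError("HEADER section not found")
    finally:
        log.append("parse_header: cleanup")   # runs during unwinding


def read_file(log):
    try:
        parse_header(log)
        log.append("read_file: done")   # skipped: the exception passes by
    finally:
        log.append("read_file: cleanup")      # runs during unwinding


def main_action():
    log = []
    try:
        read_file(log)
    except ValueError as error:               # the exception handler
        log.append(f"main_action: caught '{error}'")
    return log
```

Calling main_action() returns the entries in this order: parse_header: start, parse_header: cleanup, read_file: cleanup, and finally main_action: caught 'HEADER section not found'. Each procedure gets its chance to clean up as the stack is unwound, even though only main_action() handles the exception.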
What are Exceptions?
The fact that I'm providing definitions for terms associated with exceptions should be a clue that exceptions are much more complex than simple error codes. Exceptions are a natural fit for error conditions in object-oriented programming that error codes handle poorly, such as an object that fails to be constructed (in many languages a constructor has no return value in which to pass back an error code). It quickly became obvious, though, that exceptions were good for a whole range of situations, not just object construction.
Exceptions are implemented in a computer language to minimize the damage done to a program in the face of an error, and are almost always associated with object-oriented languages. A computer language generally has to work very hard to process a thrown exception. For example, in the process of unwinding the stack to look for exception handlers, the language must make sure that all objects allocated in a procedure being unwound are properly de-allocated and cleaned up. Not to mention any local variables that might have been declared on the stack. The developer has to be involved in every step along the way to make sure the program is left in some kind of usable state as the stack is unwound. In fact, this is the heart of error handling using exceptions, also known as exception handling.
Exceptions affect the overall structure of the program in a more fundamental way than using error codes. Error codes follow the normal program flow and thus are easy to trace (call a procedure and if it fails, do something; otherwise do something else). Exceptions don't seem to follow normal program flow as an exception can jump from the middle of a procedure up many levels before being caught. You have to be aware of all the calls between the point where the exception was thrown and where it was caught and make sure each procedure leaves the program in a usable state in the face of an exception passing through — even if the exception is not explicitly handled in those intermediate procedures.
So what benefits do exceptions provide that make all this complexity worthwhile? There are three primary benefits: 1) error handling is concentrated in selected locations, 2) the code is much easier to read, and 3) it is easy to provide custom error messages with each exception.
Here is a fragment of code in C# that shows these benefits (e.Message provides the custom error message).
void ShowDataFile(string filename)
{
    try
    {
        DataFile file = OpenDataFile(filename);
        ParseTree tree = ParseDataFile(file);
        DataFilters filters = LoadFilters(tree);
        RunFilters(filters, tree);
        ShowData(tree);
    }
    catch (FileNotFoundException e)
    {
        Console.WriteLine(
            "Error! While opening file {0}: {1}",
            filename, e.Message);
        Console.WriteLine(
            "Did you misspell the file name?");
    }
    catch (BadDataFileException e)
    {
        Console.WriteLine(
            "Error! The file {0} is not in the correct format: {1}",
            filename, e.Message);
    }
    catch (ParseDataException e)
    {
        Console.WriteLine(
            "Error! While parsing file {0}: {1}",
            filename, e.Message);
    }
    catch (FilterException e)
    {
        Console.WriteLine(
            "Error! While running filters on file {0}: {1}",
            filename, e.Message);
    }
    catch (Exception e)
    {
        Console.WriteLine(
            "An unexpected error occurred while processing file {0}: {1}",
            filename, e.Message);
    }
}
Whether you read C# or not, the overall flow should be understandable. At no time in the normal flow of the code (lines 5 through 9, the five calls inside the try block) are errors even considered. Instead, the try..catch block that surrounds the normal flow of code will catch any exceptions thrown as a result of an error. Also note the number of exception handlers (“catch blocks”), each dedicated to a specific exception. This allows the code to provide meaningful messages to the user.
As you may have noticed, there are more lines dedicated here to error handling than there are lines apparently doing anything useful. This is normal. There are two good reasons for this: 1) all of the error handling for the ShowDataFile() action is in one place and 2) the functions being called do not have to report errors to the user, only detect them (making those functions simpler).