Introduction
Logging is the art of debugging without a debugger. The intention of logging is to get information about what is going on in a program where you cannot attach a debugger. In fact, with sufficient information logged, the goal is to deduce what the problem is and figure out how to fix it without ever resorting to a debugger.
In my experience, about 95% of the problems found in a program can be resolved with logging. And those last 5% were identified as needing a debugger because logging narrowed the problem down.
For logging to be really useful, it must be built into the program from the start. It can be quite tedious to add logging afterwards and it can be error prone due to simple copy-paste errors. For logging to be trusted, it must be accurate.
I'm going to discuss what needs to be logged and where in a program the logging statements should be placed. I'm also going to discuss the features most needed from a logging infrastructure to make logging very useful.
Logging is applicable to any programming language; I'll be using C# where code is presented because C# is reasonably easy to comprehend without needing a lot of background knowledge.
Kinds of Logging
There are a number of different kinds of logging that can appear in an application. Logging can be used to inform the user about a series of steps being taken by the program. For example, an installer can show details of what is being installed. Another application may run scripts as part of its task and the output of those scripts may be displayed in a window of the application.
Another form of logging is for key or interesting events. These key events could be things such as the application starting up or shutting down, opening or closing a connection to an external server, or a crash error. These key events may be written to an application log or an operating system log or both.
The third form of logging is more advanced and more invasive. It is what a developer can use to diagnose problems in an application running remotely. This diagnostic logging is generally directed to a file that can then be sent to the developer, who can then attempt to reproduce the problem based on what is in the log. Depending on the design of the application and the type of diagnostic logging, it is even possible for the developer to "replay" the log to drive the application to reproduce the problem.
All of these kinds of logging are related. Diagnostic logging is discussed in this document but the other kinds of logging are really just a sub-set of diagnostic logging. Basically, diagnostic logging with some details turned off.
What to Log
There are two key pieces of information that need to appear in a log for the log to be useful for diagnostics. These are location (in the flow of execution) and state of the program (at the location being logged). I will talk later about how to make the logs useful for users as well.
Location, Location, Location
For determining where the program is in the flow of execution, you can provide two kinds of location logging: milestone and trace.
Milestone Logging
Milestone logging provides a hint where the program is in the flow of execution. A milestone can represent a few functions or a dozen but it must be unambiguous about its location (that is, no two milestones should have the same message in the log). The logging of milestones can allow the developer to manually trace through the source code from one milestone to the next. Milestone logging is also relatively cheap in terms of performance so it will have a minimal impact on the overall performance of the program. The developer just has to work harder with milestones to determine where in the program something might have gone wrong. The challenge is deciding how many milestones to place so as to get the optimal trade-off between logging information and performance.
Note that errors are a milestone because the code is in an error handler. All errors should thus be treated as milestones.
An example of milestone logging are the events seen in the Windows Event Viewer. These tell a user when something starts, stops, and has an error, among other messages deemed important by the application developer.
Trace Logging
Trace logging means log the entrance to and exit from every function. Every. Single. Function. Leave one function out of the trace and there will always be that nagging doubt about whether that function was actually called or was the source of an unhandled exception. Trace logging is very expensive and will impact the overall performance of the program. However, sometimes trace logging is needed to figure out exactly what the program is doing. Trace logging is especially useful in diagnosing multi-threaded problems by helping to uncover race and deadlock conditions.
For example, the following log shows two threads waiting for a signal that can only be set by the two threads. This is a classic deadlock situation:
In this case, the Main thread starts two background threads. Each thread is going to try to lock a global queue of resources. If the lock is achieved, the resource is extracted from the queue and the thread waits for the resource to actually become available. The Main thread is going to wait for both threads to signal they have started.
This log highlights two obvious problems: 1) The threads are apparently not setting a signal saying they have started so the Main thread is going to wait forever and 2) Thread A apparently doesn't release the lock on the resource queue so Thread B is stuck until Thread A releases the lock on the resource queue; if Thread A never gets the resource, Thread B is blocked forever. Without examining the source code, we don't know the exact nature of the problem but based on the above log a good guess is the threads are not setting a signal to the Main thread that they are ready to start and the threads are not releasing the lock on the resource queue immediately after getting the resource from the queue.
State of the Program
The state of a program changes during execution. This is to be expected, of course. To help debug a program using logging, a snapshot of relevant state of the program needs to be logged. A good rule of thumb is to log relevant state whenever a milestone log entry is made. That way, each milestone has some amount of useful program state. If the program state at a milestone is bad, trace back through previous milestones until the program state is good. The problem lies after the milestone with good state and before the milestone with bad state.
So what state should be logged? That depends on the location. For example, a low level function that is accessing a driver should log the parameters passed to the driver but really doesn't need to log the state of any menu items that might be in the Graphical User interface (GUI) the user is looking at. On the other hand, the code handling changes in the GUI might very well need to log the state of the menu items but doesn't care about the parameters passed to a low level driver.
When looking at a milestone in the code, determine what state needs to be shown to determine if the program is in a good state or a bad state at that milestone. For example, being in an error handler because an error occurred is a milestone so the state of the program would include information about the error and any state that contributed to the error, such as a bad parameter or a bad result of a calculation.
Example of Milestone and State Logging
A common approach to milestone and state logging is to combine the two into a single log entry. For example:
If device discovery started and there are devices already in the list, you know something went wrong before the discovery process. After the discovery process completes and the device does not contain the expected number of items, you know something went wrong with the discovery process itself. Additional state could be a list of the devices found.
What Not to Log
Okay, so it's all good and fine and dandy that logging can reduce the need for a debugger because you can log anything and everything. However, now is the time to step back and consider what is actually being logged. It might be that logging is revealing some sensitive information that shouldn't be revealed. For example, if an encryption key is needed to accomplish a particular task, it is not a good idea to log the key itself, only that a key was being used (possibly along with its type and where it came from).
Another example, for low level programming that talks with a USB device, it is often useful to log the USB packets sent out to the USB device or the packets received from the device. However, great care needs to be taken in to ensure that if that log of packets is given to someone who shouldn't have it, they can't learn something that you want to keep secret, such as the encryption key that might have been sent to the USB device.
A final example: Do not log passwords. This may seem obvious but it does happen every so
often. It does not matter if the password is obfuscated or encrypted, do not log passwords. If
you really need to log the milestone about the password, you can replace the password with
“<CENSORED>
” or
“<REDACTED>
” or just
“*********
”.
Logging Infrastructure
Being able to log at different levels of granularity (a little logging, some logging, a lot of logging) provides a way to run the program at nearly full speed and still log the overall flow through the program (that is, using milestone logging) but also sacrifice performance to log all detail as required (trace logging). The program will run slower but the extra detail in the log can make it easier to debug the problem.
Logging levels also provides a good way of thinking about what needs to be logged.
A good logging infrastructure should provide at least four levels of logging. Level 0 is no logging (logging is disabled), level 1 is all possible logging, level 2 is some logging, and level 3 is minimal logging. Another way to think of this is to assign labels to the logging level. For example:
Each level encompasses the previous level, so DEBUG includes INFO and ERROR while INFO includes ERROR.The minimum logging should always include errors, which is why it is often labeled ERROR.
Additional levels are possible, increasing the granularity of what is logged, but that depends on the logging infrastructure. For example, some logging infrastructures include TRACE (below DEBUG, indicating maximum logging) and WARNING (typically between INFO and ERROR).
The way logging works with logging levels is the user sets the desired logging level. The program writes entries to the log at a designated level. If the entry's designated logging level is contained within the desired logging level, the entry is written, otherwise, the entry is discarded.
Logging does introduce some performance overhead, which is why multiple levels are desired. If all levels are logged, the performance impact is greatest. When logging is disabled, the performance impact should be minimal to non-existent (depending on the logging infrastructure).
Example of what needs to be logged
Below is a typical function that takes in parameters, does something, and returns something. This is written in C# but the intention of the logging should be clear enough (this uses log4net).
This function reads a specific register from a device using the internal helper function,
_GetRegisterValue()
. That function can throw a number of exceptions. The function then
returns the value obtained unless an exception occurs.
- // Create and execute an operation to read a register from a device.
- BitData _GetRegisterValue(Device device, string registerName)
- {
- _logger.DebugFormat("(Internal)_GetRegisterValue(" +
- "device={0} (id={1}), registerName='{2}') in",
- device.Name, device.ID, registerName);
- Operation operation = new Operation(READ_REGISTER, registerName, device);
- // Throws InvalidDeviceException, InvalidRegisterException, or
- // LostConnectionException
- _ExecuteOperation(operation);
- BitData returnValue = operation.GetValue();
- _logger.DebugFormat("(Internal)_GetRegisterValue():" +
- "on device '{0}', register '{1}' = {2}",
- device.Name, registerName,returnValue.ToString());
- _logger.Debug("(Internal)_GetRegisterValue() out");
- return returnValue;
- }
- // Read a register from a device, converting from internal value representation
- // to external.
- ulong ReadRegister(uint deviceId, string registerName)
- {
- _logger.DebugFormat("ReadRegister(deviceId={0}, registerName='{1}') in",
- deviceId, registerName);
- // Throws DeviceNotFoundException
- Device device = _GetDevice(deviceId);
- BitData value = _GetRegisterValue(device, registerName);
- ulong registerValue = value.ToUInt64();
- _logger.InfoFormat("ReadRegister(): Value of register '{0}' = 0x{1:x}",
- registerName, registerValue);
- _logger.Debug("ReadRegister() out");
- return registerValue;
- }
- // Handle the command for reading a specific register and displaying
- // the output.
- void CmdHandler_ReadRegister(uint deviceId, string registerName,)
- StreamWriter output)
- {
- _logger.DebugFormat(
- "CmdHandler_ReadRegister(deviceId={0}, registerName='{1}') in",
- deviceId, registerName);
- try
- {
- _logger.InfoFormat("CmdHandler_ReadRegister():" +
- "reading register '{0}' from device {1}", registerName, deviceId);
- ulong value = ReadRegister(deviceid, registerName);
- output.WriteLine("Register {0} = 0x{1:x}", registerName, value);
- }
- catch(Exception e)
- {
- string message;
- message = String.Format("Failed to get the value for register '{0}'" +
- " from device {1}!{2}: {3}",
- registerName, deviceId, e.GetType().Name, e.Message);
- output.WriteLine("Error! {0}", message);
- _logger.ErrorFormat("CmdHandler_ReadRegister(): Error! {0}", message);
- _logger.Error(e);
- }
- _logger.Debug("CmdHandler_ReadRegister() out");
- }
What is being logged?
Here is what is being logged in this example.
Debug Level Logging
Lines 4-6, 17, 26-27, 35, 45-47, and 66 log entering and exiting the function. This is known as a function
trace because when all functions have this, it is possible to trace the flow of execution throw the
program. However, since this can get rather tedious in the log, — not to mention greatly slowing
performance — it is logged at a DEBUG
level. If the logging infrastructure has a
TRACE
level then that would be used instead (log4net does not support
a “Trace” level). Since trace logging already has a performance impact,
we go ahead and include the parameters passed in as part of the logging. These
parameters are very useful as they allow us to determine exactly what the function
is working with.
Lines 13-15 are technically a milestone logging but because _GetRegisterValue()
is an internal
function and not normally of interest to an end user, it is logged at a DEBUG
level. The results
of the register read are displayed in the calling function as a milestone logging so no information is lost.
Information Level Logging
Lines 33-34 and 50-51 are milestone log entries. They provide context and results. If all goes well and
the register value is obtained, we log the value. If only INFO
level of logging was enabled, the
log entries would look something like this:
From this we know that a command handler was called to read a register called
'SomeRegisterName'
on device 6 and that the resulting value is 0x231231
.
Since there are no errors in the log (which would be included because the INFO
level includes
the ERROR
level as well), we know this was successful.
Error Level Logging
Errors happen. In this example, errors are propagated as exceptions. The example shows a typical flow of error handling where a command handler (possibly called from a GUI) makes the requested call and traps all exceptions. In this particular example, the command handler doesn't attempt to handle any of the exceptions but instead just reports them to the user using the given output mechanism.
If only error logging is enabled, the error message provides sufficient context to figure out what is going on when the error occurred. For example, if an InvalidDeviceException was thrown, the logged message would look something like this:
This would be followed by a stack trace of the exception.
This log entry tells us the context (in a command handler to
"Read a register named
'SomeRegistername' from device 6"
) and what went wrong
("InvalidDeviceException"
) along with a reason for the failure
("Device "UCORE0" does not support the "READ_REGISTER" operation."
).
We can further deduce that device ID 6 presumably corresponds to device named
“UCORE0”
and that device does not have registers.
If full logging was enabled for this example and we attempted to read “SomeRegisterName” from device 6, the log would look something like this:
Note that we don't see any of the other logging from the _GetRegisterValue()
or
ReadRegister()
functions because of the exception. In fact, since we don't see
the expected “out
” trace logging from those two functions,
that is a strong clue an exception occurred.
Who Uses Logs Anyway
There are two main consumers of logs: users and developers (I'm grouping quality assurance (QA) testers in with developers). Developers can use logs to debug a program at a distance, when the option of attaching a debugger is not available. In this case, the user enables logging to the requested level, reproduces the problem, then sends the log to the developer for analysis. From the log, the developer figures out what went wrong and either advises the user to change how they use the program or change the program to fix a bug.
Users can use logs to verify the program is doing what they think it should be doing. This means verifying parameters are passed in correctly and the flow of the program is correct. If the developer did their job correctly and logged all errors, the user can use the log file for errors to determine more detail about an error. Of course, the user interface of the program should be providing these details already as part of good error handling. However, not all programs are created with users in mind or may not even have a user interface (such as a service) so logging is sometimes the only way a user has to find out what is going on behind the scenes.
The needs of these two consumer types are not all that different from each other. The secret to making a log more user-friendly is to be clear in the log message. This works for both user and developer, especially when the log needs to be analyzed by a developer (or QA person) who did not write the original code.
Avoid dumping variable names unless the variable names are self-explanatory. For example, don't use:
but instead use:
Avoid dumping data structures with no explanation of what is being shown (again, self- explanatory variable names can help here). For example, don't do this:
Instead, the following is more readable:
Yes, this takes more effort to log but it makes the log much easier to read and understand, even for a developer.
In short, use clear language to describe what's going on. Avoid abbreviations and acronyms by spelling things out. For example, instead of:
write
The second approach is much clearer and there is no opportunity
for misinterpretation (is the first approach a command to write
a register named 'AEEN'
with the value 0x131313
?).
Consider what is being logged. If it has no value to the user, log it at a DEBUG
or TRACE
level. That way, the user can avoid seeing it if they don't want to
by setting the log filtering to a higher level.
Of course, in the ideal world, levels would be of great use in reducing the clutter a user has to see in a log file thereby making the log file more useful to the user. In reality, the user is going to turn on maximum logging, generate the log to show the error, then throw the log file at the developer with “Here! It's broken. Fix it!” It is also conceivable that setting log levels is too complicated for most users, in which case the whole instruction to the user is reduced to “Turn on logging”. Making it easy to select different log levels can help so account for it in whatever GUI feature enables logging.
Of course, with logging levels clearly called out, a utility can be used to filter the log after the fact so always utilize the logging levels available.
Logging Optimizations
The single biggest objection to logging is its impact on performance of the application. And logging with all levels enabled can certainly impact performance in a big way. However, there are ways to mitigate that performance impact, as discussed in the following sections.
(Don't) Remove Logging at Compile Time
In the olden days of computing, a common trick was to enable logging during the compile or build phase of the program (assuming the program needed to be compiled). This way, logging could be added or removed easily using a compiler switch. And logging that has been removed has no performance impact on the code.
For example, in C or C++, use a macro to leave out logging entirely. This requires all log statements themselves to be macros. For example:
Although this could be made prettier or more functional or more flexible, it still relies on macros. But beyond that — and much more important — is that it requires re-compiling the code to enable logging.
The last thing you want to tell a customer is to recompile the program to enable logging. In fact, it will likely be the last thing you tell that customer.
This approach should never be used if the application needs to perform any kind of logging. It is included here to inform you that it once existed and to tell you not to use it.
Use Logging Level Enable Flags
In most logging infrastructures that support logging levels, the logging object also supports querying the currently desired logging level or querying whether a particular logging level is enabled. This allows for logging code like this (this is log4net in C#):
In this case, log4net has properties it exposes on the logger object that can be queried whether a particular logging level is enabled. If it is enabled then go ahead and log at that level; otherwise, skip the logging entirely.
In terms of performance, this adds a method call (getting the value for the IsDebugEnabled
property) but uses it to avoid passing multiple parameters to the DebugFormat()
function, which
is more expensive.
Note that a function call with a number of parameters that does nothing (because logging at that level is disabled) might not have a measurable effect on performance if the function doing the logging is not called very often. Since the presence of the logging level enable flag makes the code harder to read, it might be more beneficial for code readability to not use the level enable flag until it is really needed.
On the other hand, if that logging is very expensive, then the level enable flag should be used. For example:
In this case, we want to log all the device IDs coming into the ReadRegisters()
function so we
call a helper function to format the array into a string so we can log it. Since this helper function
is called only when debug level logging is enabled, checking the level enable flag avoids the
expense of calling the helper function all the time, especially when logging is disabled.
If you know the logging function is going to get called dozens of times a second or the code that formats what is going to be logged is expensive, then use the logging level enable flags. Otherwise, if the logging is just a string or a string with a handful of parameters, don't worry about the performance hit.
Of course, anything related to performance should be optimized after getting measurable results that proves there is a significant performance problem.
Partition Logging into Different Categories
Some logging infrastructures allows logs to be organized into different categories. For example, logging from one class would go in one category while logging from another class would go into a different category. The logging infrastructure then provides a way to enable or disable logging based on category. This way, the user can selectively disable logging in one area of code while leaving logging enabled for another area, thus reducing the impact of logging on the program as a whole.
This can be extended to sending log output for a category to a different file, thus reducing the chance of multiple threads hitting the same log file at the same time, causing a bottleneck. For example, log4net and its ilk have the ability to create a logger for a particular class, like this:
In this case, the logger object is based on the class it is in (“MyClass
”)
and that is the category all logging made with that logger object is written under. The parameter
to the GetLogger()
method could also be any string such as “MyClass” or
“Feature1”. It is generally more convenient in this situation to use the name of the
class within which the logger object resides (log4net derives the name from the type passed in).
In addition to help filtering out what needs to be logged, categorization is another way in which to provide context for each logging statement since the category is generally sent out to the log file along with the timestamp, thread identifier, and actual log text.
Leave Logging Disabled Until Needed
With almost any logging infrastructure, it is possible to disable logging altogether. In fact, that's the whole point of a logging infrastructure: the user has the power to turn on logging or leave it off. If the program is running just fine, logging is not needed and the program proceeds at an optimal speed.
This assumes the developer has cooperated with the logging infrastructure to break up the logging into multiple levels, with multiple categories, and avoiding expensive logging calls if a particular logging level is not enabled.
Remember, logging is generally not needed except when trying to diagnose a problem. If the problem is reproducible, the user exits the program, enables the appropriate level of logging, and reruns the program to reproduce the failure, this time with logging enabled. In this situation, the user is deliberately enabling logging and therefore acknowledges that some performance impact is going to occur. The result is a log file that should provide sufficient information to narrow in on the problem and eventually provide a fix for it. Ideally, this would give the user the power to fix the problem; otherwise, it is the developer who develops the fix and provides an update to the program.
Happy logging!