Introduction

There has been a lot written about how to write good code in C over the last 20 years or more. I myself started writing a book on the subject in 1993. Despite all the good tips I don't believe the average code quality has improved a great deal. What has changed, from my own observations, is that there is far more variety in the way bad code is written. Superficially much code appears better but there is a lot of confusion when you look at it closely.

Part of the problem is that there are quite a few fallacies kicking around about how to write good code. Often these are just good ideas that are misunderstood or wrongly applied.

This article is an attempt to dispel some of them. Almost certainly you will disagree with some of what I have to say, but please keep an open mind. If you have a good argument against my points then post a comment, but please no flame wars. (From personal experience I know that many of these fallacies can be defended with almost religious zeal.) It's better to shed light without generating too much heat.

Note that, generally, the discussion is about programming in C, though much is applicable to other languages. There are also a few special notes about C++, and other languages, where relevant.

1. Use descriptive variable names

I thought I would start with a simple one. For most people this is a no-brainer, but I still sometimes see code (usually poorly written code) with very long variable names (i.e. 30 characters or more).

I guess this fallacy goes back to another fallacy (not covered here) that code should be self-describing. The idea is to avoid or minimize the need for comments. The thinking goes that comments, if they even exist at all, are often wrong, often out of date, or simply not read. So why not avoid them altogether?

I won't go into a long discussion of comments and their maintenance, as that has been written about quite enough. I will just make the observation that identifiers can become out of date just as easily as comments -- and maintainers seem even more reluctant to refactor code than they are to fix comments. What is worse is that some programmers take this further and create temporary variables just so they can give them a name that helps to explains the code -- I have more to say about unnecessary temporary variables in Fallacy 9, but it is always better to have simple code with explanatory comments (that can be referred to if necessary) than more complex code with no comments.

Getting back to the fallacy, my point is that trying to fit too much information in an identifier is the wrong approach. If a variable is used often in the code you don't want to be continually reminded what it is used for. You need a simple name that is easily remembered and distinguishable from other identifiers used in the same part of the code.

In brief, use a short variable name to make the code easier to scan. Use a comment (usually at the place that the variable is declared) to describe what the variable is for.

A common statement in many coding standards is to use identifiers that are as long as necessary to describe their purpose. Notwithstanding that this is rather vague, I actually agree that this is a good idea for something rarely seen (such as a global variable) but in general it is not. Most variables have a purpose that is too difficult to summarize in a few words. The result is usually a long name that tries to describe what it is used for but (for some brevity) is incomplete and/or ambiguous. Worse, is that the programmer now feels justified in not giving a proper description in a comment at the point of declaration.

Guidelines:

use fairly short, easy to read identifiers for high-use, limited-scope variables
use longer, more descriptive names for less often seen identifiers
use variable names that give an idea of their purpose (see K&R for examples)
ensure identifiers are long enough and different enough to be easily distinguished
ensure identifiers are easily remembered (i.e. not a random jumble of letters)
give a full description of the purpose of a variable in a comment where it is declared

2. Use lots of parentheses and/or spaces in expressions

This is really two fallacies, though closely related. The first fallacy is that it is good to add extra, redundant parentheses (also called round brackets) in an expression to ensure that anyone reading the code understands the order of operations. The second fallacy is that you should add lots of spaces in expressions to aid readability.

Let's look at parentheses first. It is true that the precedence of some operators in C is not well-known. Examples, are the shift operators (which some people expect to have higher precedence than addition and subtraction), bit-wise operators, and the relative precedence of assignment (=, +=, etc), ternary (?:) and comma (,) operators. You should use parentheses when you are unsure, even if this means that you sometimes have redundant parentheses.

However, extra parentheses where precedence is well known can make the code much less readable. (If you like lots of parentheses you should be programming in Lisp.) I have seen coding standards that require this:

  if (((variable >= 10) && (variable < 20)) || ((variable >= 30) && (variable < 40)))
     ....

rather than:

  if (variable >= 10 && variable < 20 || variable >= 30 && variable < 40)
     ....

I think you would agree that the latter is at least slightly more readable. When more complex expressions are used the extra parentheses just add to the clutter. You should instead use spaces to highlight precedence.

This brings us to the use of spaces. Everyone agrees that use of spaces makes C code easier to read. My contention is that excessive use of spaces has the opposite effect. In fact some informal tests I conducted showed that, at least in some circumstances, adding too many spaces can make reading code more error-prone.

Of course, if you conduct your own studies your results might vary -- it depends, to a large extent, on the type of code you are used to reading. My argument is that all C programmers should be familiar with the style of code that I am advocating since it is the style that is used in K&R and in the examples in the C standard as well as many other important books like The C++ Programming Language (with the exception of Bjarne's unusual spacing for pointer and reference declarations).

Unfortunately, parentheses have too many uses in the syntax of the C language -- e.g., for function calls, grouping in expressions, casts, etc. So it is useful to uses spaces to emphasize their different use. A good convention is to always have a space after a keyword which is followed by a left-bracket (if, while, etc), but never have a space between a function name and the following left-bracket. (An exception would be function-like keywords such as sizeof.) This makes scanning the code easier and less error-prone.

Some programmers use the convention of having a space after a left-bracket and another before every right-bracket. I don't mind this convention so much, but I find that it can make code harder, not easier, to read, when many parentheses are necessary.

One place you should always add a space is after a comma. First it makes scanning a long comma-separated list extremely tedious if you don't. A comma is also easily mistaken for a dot (especially at the screen resolution I use) and so may be interpreted as the member operator or a decimal point when between some types of arguments.

Guidelines:

don't add redundant parentheses unless the precedence is confusing
use spaces for readability - e.g., to highlight precedence of operators
don't add too many spaces as it detracts from readability
always place a space after a comma and after keywords (except sizeof)
when in doubt look at how parentheses and spaces are used in K&R

3. Use shotgun initialization

Shotgun initialization is a name used for a practice where all variables are initialized to some value when they are created. This is done in case a variable is never set to its proper initial value due to some bug in the code. It is an example, of defensive programming in that the program is more likely to be able to recover from the problem than if the variable contained some random value. (Defensive programming is a useful strategy but does not fix bugs; it just ameliorates their effect.)

One problem with this approach is it can defeat the ability of the compiler to detect uninitialized (and unused) variables at compile time. Another problem with defensive programming is that, although (and because) the program gracefully recovers from a bug this very recovery can hide the fact that a bug even exists - so nobody will ever be asked to fix it.

However, my main concern is that it is an example of redundant code since the variable is initialized twice (once with some arbitrary value, usually zero, then later with its correct initial value). Redundant code is bad for many reasons, one of which is: it can confuse anyone maintaining the code. For example, someone might modify the code assuming that the zero value is the correct initial value.

Note that in C++ this problem does not occur as often, as you typically only declare variables at the point you first need them and then you can initialize them immediately. It is prevalent in C because you can only declare variables in C at the start of a compound statement (and by convention at the start of the function - see Fallacy 8 below).

[You can still have the problem in a constructor in C++, if the initial value of all member variables is not known at the time the object is created.]

Shotgun initialization is discussed in depth in my recent blog.

Guidelines:

if you don't know the correct initial value for a variable when you declare it then don't initialize it to some arbitrary value like zero
let the compiler set the value of uninitialized locals (debug builds) in order to make bugs reproducible (e.g. VC++ uses 0xCC)
as far as possible, don't declare a variable until you know what its correct initial value should be (JIT declarations)

4. Don't use magic numbers

A great fuss is made of not using magic numbers in code, and rightly so. However, sometimes this goes too far. I have seen coding standards that specify that code should have no literals of any type which can lead to strange things like this:

#define ZERO 0
#define ONE  1

Of course, a later version of the coding standard required use of enum for integer constants. The above had to be changed to this:

enum { ZERO, ONE, };

Obviously, doing this is stupid. We need to look at the reasoning behind the no magic numbers rule and decide when it is appropriate. To do this we first need to understand that there are two reasons to avoid magic numbers - understandability and maintainability.

Understandability

Magic numbers are called that for the very reason that their meaning is mysterious. One way to remove that mystery (and hence make the code more understandable) is to give them a meaningful name. I guess this also goes back to the idea that code should be self-describing (see Fallacy 1 above). However, a comment in the code can also explain the purpose just as well, except for the risk of the code and comments becoming out of sync.

Maintainability

Giving a name to a value (using #define, enum, or const) can also make your code more maintainable if the same value is used in different places in the code. If it is likely, or even just possible, that the value may change in the future then you should define it in one place. This is usually in a header file for shared values or at the top of a .C file if it is private to that file.

static const int cards_per_deck = 53;  // includes a joker

It is then a simple matter of changing the value in one place and rebuilding the affected parts. This avoids creating bugs if you somehow fail to find all the places in the code where you used the value. Further it also makes it easy to enhance the code if, for example, the value was no longer required to be a constant - that is if it needs to change dynamically.

static int cards_per_deck = 53;
  ...
  if (players == 4)
     cards_per_deck = 43;
  else if (players == 6)
     cards_per_deck = 63;
  ...

The Fallacy

There are great advantages in not using magic numbers in code, so where is the fallacy? When both of the above reasons do not apply then there is no reason not to embed literals in the code. Moreover, excessive use of the technique can make code very difficult to read.

As a simple example, I once saw this in a header file:

#define ENDSTR 0   // end of string character

then ENDSTR was used throughout the code when testing for end of string. Now even the most novice C programmer should know that an end of string is signified by a character having the value zero which is conventionally written as '\0'. Not only is using ENDSTR more confusing than simply using '\0', it also makes the code harder to read.

A larger example is taken from some code I had to maintain that was used to generate CBDs (command blocks) to send to SCSI devices. The code made extensive use of #defined values like this:

  unsigned char cdb[LOG_SENSE_CDB_SIZE];

  cdb[OPCODE]   = LOG_SENSE_OPCODE;
  cdb[PCOFFSET] = (pc & LOG_SENSE_PC_BITS) << LOG_SENSE_PC_POS |
                  (pcode & LOG_SENSE_PCODE_BITS) << LOG_SENSE_PCODE_POS;

Now imagine thousands of lines of code like this and you get the picture. I found that I could not just look at the code and the SCSI standard and understand the code at a glance. First the long names made everything harder to read. I also found that I was always checking the header file to make sure that all the #defined values were correct and what I expected them to be (and there were some mistakes).

I later rewrote some of the code in this way:

  unsigned char cdb[10];   // LOG SENSE CDB is 10 bytes

  cdb[0] = 0x4D;            // LOG SENSE = 4D
  cdb[1] = 0;
  cdb[2] = (pc & 0x3) << 6 | (pcode & 0x3F);

I think anyone would agree that this change makes it much easier to see what is happening. It is also safer as it will be obvious if one of the bytes of cbd[] is not initialized. (For example, in the original code above cdb[1] was not set.)

These changes also do not make the code any less maintainable since values (like the LOG SENSE opcode) are only used at one place in the code (and in any case are part of the SCSI standard and so will never change). The values may appear to be magic to anyone casually inspecting the source code, but anyone who has to maintain it should be referring to the SCSI standard which explains precisely what all the numbers mean.

Guidelines:

never embed literals in your code when they could possibly change
consider embedding literals in your code if it makes it easier to understand

5. Use typedef's

A similar fallacy to the previous is that using lots of typedefs makes code better in some way. Typedef is very useful in some circumstances (like pointers to functions) but it is often abused. It should not be used to create aliases for transparent types such as many structs and pointers.

[Transparent types are those where internal data is exposed such as fields in a struct or the type of object a pointer points to. Opaque types are where you have an int or a pointer which acts as a handle to an object.]

This idea gained a lot of traction due to its use in the Windows header files and promotion by certain authors.

Sometimes the justification given for typedef'ing a struct is that it is information hiding. The benefits of information hiding for reducing complexity of code are enormous (see my blog post), but information hiding is all about hiding irrelevant information. When typedef'ing a struct the information is often not irrelevant so should not be hidden. Even just hiding the fact that we are dealing with a struct reduces the understandability of the code.

C++

Another argument for the use of typedefs on structs is that it is effectively what is actually built into C++. When you use a struct (or class) in C++ you do not need to prefix the tag with the struct keyword in declarations. So this is acceptable in c++:

  struct s { int a, b; };
  ...
  s s1, *ps;

The difference is that generally classes (and many structs) are opaque types. They often have different uses to structs used in C code, which is why they have things like private members. However, I recommend that when using a struct in C++ for a transparent type you still prefix declarations with the struct keyword for clarity.

Standard Headers

I have also seen the argument that the practice is appropriate in C as it is used in the standard C header files - for example the FILE type.

[Actually the standard C headers are not a paragon of good practices, so this is not a good argument anyway.]

While the the way stdio.h defines the FILE type is a good idea this is a furphy as the internals of the FILE structure can be, and should be, hidden. The FILE typedef is just used to create handles (of type FILE *) which are passed around between the file handling functions (fopen(), fread(), etc). This is a good example, of the valid use of information hiding.

Example

I guess it's time for an example. I will just use one of the first typedefs I came across in the Windows headers.

typedef struct tagSIZE
{
    LONG cx;
    LONG cy;
} SIZE, *PSIZE, *LPSIZE;

Now if I saw a declaration like this:

extern SIZE GetSize(HANDLE h, PSIZE inner);

I might assume that the return value of GetSize() was just a scalar value. Even if I knew that a SIZE had an X and Y component I might think that these were perhaps stored in the upper and lower halves of a long. Perhaps I know that SIZE is a struct but it would still be easy to miss that inner was a pointer to the same type and that it is actually used to return another SIZE.

Now consider:

struct size { long cx, cy; };
extern struct size GetSize(HANDLE h, struct size *inner);

Here it's plain to see that the function returns a struct. Further, since inner is a pointer (and not a const pointer) it's obvious that another size is also returned via *inner.

So poor use of typedefs can hide useful information and make code less understandable. Furthermore, typedefs of structs can lead to confusion about the size of data (especially with nested structs). It is often important to know the rough size of types, for example, to know whether to pass by value or using a const pointer (or const reference in C++). Copying large structures around can slow things down appreciably.

Guidelines:

never use typedefs to hide important information
use typedef for opaque types such as FILE
use typedef to simplify declarations involving function pointers

6. Define your own integer types

Like most of the fallacies in this list, a useful idea is blindly used even when inappropriate. Again this comes back to understanding the point of the original idea so that you know when and when not to use it.

Like the previous fallacy this one is perpetrated by the Windows header files, which typedef many integer types (WORD, DWORD, etc). However, I have seen this basic idea promoted in coding standards since the early 1980's. For example, in one of my first jobs as a C programmer the coding standards required that the normal C integer data types were not used but instead we were required to use typedef'ed values with precisely defined sizes called INT8, INT16, INT32, UINT8, etc.

In brief, the idea can be useful for portability and compatibility, but in general you should simply use int (or long if int may not always be large enough). (Similarly, for floating point data just use double, or perhaps long double.)

Portability

The idea began as an attempt to make code more portable, and indeed it is sometimes useful in this regard (see below). The problem is that there is a perceived vagueness of the C language regarding the size of different integer types, which is incorrect. In fact the types in C are very carefully designed to allow efficient and portable code.

[Personally, I don't insist on code being portable when it does not need to be, but generally well written code is usually quite portable anyway.]

First, I will give an example of how the practice should be used to make code portable. A typical use would be a library, written in C, that reads and writes a binary data file, or creates packets for a communications protocol. In this case it is essential that on all target systems that the code always uses the same sizes for data fields of specific types. This allows data files created on one system to be used on another or the same software on different types of systems to talk to each other. (Of course, there are other considerations such as byte order, etc, which I don't want to go into here.)

Here is an example:

#if SYSTEM1
   typedef int   INT16;
   ...
#else
   typedef short INT16;
   ...
#endif

struct packet
{
  UINT8   type;
  INT8    status;
  UINT16  number;
  ...
};

The problem is when the same technique is used to overcome problems of poorly written code. It is not uncommon for C code to make unnecessary assumptions about the data types in use. For example this code was written for a 16-bit processor (ie, sizeof(int) == 2):

  int  i = 0xFFFF;     // set all bits on

the trouble with this code is on a 32-bit processor it only sets the bottom 16-bits. In an attempt to make the code more portable something like this is often used:

  INT16  i = 0xFFFF;

There are so many things I don't like about this, but it is mainly that it is simply unnecessary and clutters the code with non-standard types. It is generally better to use an int, since that has been chosen as the natural size for the target system. Using another type (like short) can often be less efficient on a 32-bit system since it may require extra instructions to truncate the result wasting time and/or memory.

Here is code that is simple, portable and efficient:

  int i = ~0;    // set all bits on

Finally, I will mention a problem that I have seen a lot of recently. I am currently working with some very old legacy code. I am continually coming across pieces of code that use 16-bit integers (eg using short or even the Windows WORD type) where using an int would have been more obvious. Now this code worked fine in its original (16-bit) environment but is being asked to do a lot more when ported to 32-bit environment. I have lost count how many bugs I have fixed which would simply have been avoided by using an int instead of a short.

C Language Problems

As an aside I will mention that the C language is showing its age, and the idea that an int is the natural size for a processor has been blurred somewhat. I guess Dennis Ritchie envisioned that one day the hardware would be advanced enough that shorts could be 16-bits, ints 32-bits and longs 64-bits. Well, we have already passed that point where we now have processors where the natural size for integers is 64-bits, but we still probably need 8, 16, and 32 bit integers (as well as perhaps 128).

Of course, the above-mentioned problem (of poorly written code making assumptions about the size of data types) means that int will never be larger than 32-bits. Actually, there have been very few compiler manufacturers who have even been game enough to use 64-bit longs - which is why the long long data type was invented, and recently added to the C standard.

Guidelines:

avoid writing code that makes assumptions about data types (such as the size of an int)
prefer int for integer data
also see the guidelines for the next fallacy
prefer double for floating point data

7. Use the exact type that you require

This is a less common fallacy but I think it should be dispelled as it is rarely discussed. I guess it is not normally a problem because people follow the conventions in K&R and most (if not all) books on C programming. But despite all the evidence to the contrary some C programmers seem to believe that there are advantages to making use of various "short" primitive data types (unsigned char, short, float, etc). A variation, that is more common, is to declare integers to be unsigned if they are never going to be negative (even though they will also never have large positive values).

Note that this is different to (but related to) the previous fallacy, since it is all about the use (or misuse) of standard C integer types.

Let's consider an example. Here is something that I just took from some real code (slightly simplified):

void message(char *buf, unsigned short len, unsigned char message_type, short id)
{
   char is_blank = TRUE;
   short count = 0;
   ...

The use of short and char types serves little purpose over simply using int.

Before I explain this I will note that this code was written for a 32-bit processor, so sizes used by the compiler were: char = 8 bits, short = 16 bits, pointer/int/size_t = 32 bits. It is important to remember that a 32-bit processor reads and writes to memory 4 bytes at a time (ie, 32-bits or a double-word in Windows parlance). This means that to read a byte from memory the processor must read a double-word and extract the required 8 bits. The C code could be compiled for another compiler/environment such as a 64-bit processor but the basics are the same.

Of course, it is up to the compiler how variables are arranged in memory, so we will consider two scenarios:

1. The compiler places the function arguments and local variables consecutively in memory (on the stack) for total memory requirements of 4 + 2 + 1 + 2 + 1 + 2 = 10. (Of course, there will be other things on the stack too, such as the return address.)

In this scenario up to 14 bytes are saved (on the stack) while the function is running. However, considering (in my example) that this code will run on systems with at least 256 MBytes of memory, the saving is miniscule and only for a very short amount of time (while the function is running).

Although the advantages are negligible there are significant potential disadvantages. The processor may need more time to extract the relevant bits after loading the 32-bit memory value. The processor will definitely need two memory reads to load id since it straddles a 4-byte memory boundary.

Even worse, the compiler may need to generate extra instructions (masking and shifting) to get to the extra bits of the variables. So you might save a few bytes of stack space (only while the function is executing) but waste a lot more memory in the code segment with extra instructions. This memory will be in use until the program terminates (unless swapped to disk).

2. The compiler stores the arguments and local variables on DWORD (4 byte) memory boundaries for total memory requirements: 4 + 4 + 4 + 4 + 4 + 4 = 24.

In this scenario the compiler basically ignores your declarations of chars and short and stores everything in 32-bit memory locations. Typically, this results in the same assembly code as you would get if you just used this:

void message(char *buf, int len, int message_type, int id)
{
    int is_blank = TRUE;
    int count = 0;
    ...

However, with the first code the compiler may need to (depending on the CPU instruction set), or feel obliged to, create extra code to handle the shorter data types.

In brief, whatever the compiler does, using char and short etc is no better than using int and possibly worse. Similar arguments usually apply to using float rather than double.

For completeness I will say that the code is probably best written like this:

void message(const char *buf, size_t len, enum message_type msg_type, int id)
{
   BOOL is_blank = TRUE;      // BOOL is equivalent to int
   int count = 0;
   ...

When should you use them?

Obviously signed char, short, float etc were invented for a reason, so they must have a valid use. They should only be used when you need a lot of different numbers to save space (e.g., in memory or on disk) or time (eg, transmission time). For example, you might have an array of millions of doubles where a high level of precision is not important so instead you use an array of floats. Or you may have a struct that is repeatedly stored in a file so that there could be billions of instances of the data in files on many different disk drives - in this case you want to use the smallest possible integers types (such as short, unsigned char, etc.) that can store the full range of valid values. If you have very specific ranges or want control over individual bits (e.g., for Boolean flags) then you could even use bit-fields.

Understandability

Having said all the above I will concede that there are times when using shorter types is useful. First, you should use char when dealing with single-byte character data, like ASCII. Even though char is essentially an 8-bit integer using char instead of int makes it more obvious that you are working with text. (And you have to use arrays of char when working with strings.)

Also, I still occasionally declare parameters of char or short, when there are multiple integer parameters with different purposes. As an example, consider the standard C library function memchr(). I have created and encountered several bugs in the past caused by the 2nd and 3rd parameters passed to memchr() (and memset()) being transposed. Here is the prototype for memchr():

void *memchr(const void *s, int c, size_t n);

The 2nd and 3rd parameters are easily switched, and there is no way for the compiler to warn of this mistake. So this erroneous call would not be flagged by the compiler:

[Actually a C compiler could warn if size_t and int have different sizes, but for most compilers they are the same size. One exception is MSDOS C compilers from a long time ago - when using large memory model ints were 16 bits, whereas size_t and ptrdiff_t were 32 bits in size.]

  char *s, c;
  ...
  memchr(s, strlen(s), c); // ERROR: should be memchr(s, c, strlen(s))

Of course, you can't change the prototypes for standard C functions like memchr() (though you could create a wrapper), but you can use this to make calls to your own functions safer.

Unsigned

Finally, some people seem to have an overwhelming compulsion to declare integer values unsigned just because they can never take a negative value. This is a bad idea except in some specific circumstances mentioned below.

First, the code may later need to be changed to store a special value to indicate an unused or error state. Conventionally, for integers the value -1 is used for this (even when zero is not a valid value).

More importantly, it is very easy for expressions involving unsigned ints to have unexpected behaviour. This code looks reasonable:

  unsigned int i, first, last;
  ...
  for (i = last; i >= first; i--) ...

The problem here is that when first is zero the code goes into an infinite loop.

Here's another one:

  FILE_ADDRESS first, last, diff;
  ...
  diff = last - first;

The problem here is when FILE_ADDRESS is typedef'ed to be unsigned long. If first is bigger than last then diff needs to take a negative value but since it is unsigned it will actually end up with a very large positive value. Even if you notice this problem and declare diff as signed long you may still get warnings from the compiler when you mix signed and unsigned types in an expression.

[One thing that puzzles many new C programmers is why certain functions in the C standard library return signed values when unsigned would have seemed logical. For example, ftell() returns a long, but the length of a file can never be negative so why does it not return an unsigned long? It is precisely for the reasons mentioned here.]

Having tracked down and fixed many such bugs due to use of unsigned variables I recommend always using signed integers unless you are performing some sort of bit-manipulation or doing modular arithmetic such as used sometimes in encryption, random number generators, etc (i.e. where overflow is ignored).

Guidelines:

prefer int for integer arithmetic and double for floating point
use long if your code needs to be portable and values can possibly be greater than 32,767
use long long if values can possibly be greater than 2,147,483,647
when you need to store a lot of data you may want to use char, short, or float
use char for integers that are less than 128
use unsigned char for integers that are less than 256 and always zero or positive
use signed char for integers that may be negative but are always in the range -127 to 127
use short for integers that are less than 32,768
use unsigned short for positive integers possibly greater than 32,767 and always less than 65,536
use char for text
use signed char (or possibly unsigned char) when you have a large number of small values
use unsigned integers for bit-manipulation and modular arithmetic

8. Place all local declarations at the start of functions

This fallacy is the only one that I actually followed for a long time. In C, by convention (and according to many coding standards) all local variables are declared at the top of the function. The idea is to make it quick and easy to find the declaration when you need to refer to it.

On a related (side) note, many coding standards are also highly prescriptive of where other things are placed. For example, I have seen it said that header files must have various sections in an exact order such as #includes (in alphabetical order) at the top of header files, followed by #defines, then type definitions, etc. It *is* a good idea to have header files well organized when it assists understandability but if the rules are too rigid they are counterproductive. Using the above ordering would prevent grouping logically related things together.

There are many reasons that declaring variables at the start of a function is a bad idea. First, as explained in Fallacy 3, declaring a variable well before its use causes problems such as the quandary of having uninitialized memory which may be accidentally used.

Also by declaring variables with a larger scope than necessary you run the risk that they will be accidentally used by some other piece of code.

  int i = something;
  if ...
  {
     ...
     i = something_else;   // code added which accidentally uses existing variable
  }

  use(i);                  // use of i did not expect something_else

There is also a tendency to reuse a variable later on in the same function when you know that it has been finished with. The urge to do this can be irresistible. The problem is that you may not always be fully cognizant of all uses of the variable or later changes to the code may cause problems. To avoid confusion each variable should be used for one thing only.

In C++ this is not a problem as you can (and should) postpone declaring a variable until you need to use it. In C you can only declare variables at the start of a compound statement, but you should declare local variables in the inner-most scope in which they are used.

As a general rule of thumb: keep related things as close as possible. For one thing this increases the chances of spotting problems. Postponing declaring variables until they are needed is one example of this. (So is grouping related things in header files.)

Guidelines:

declare variables as close as possible to their first use

9. Never use goto

The first mainstream high-level languages (Fortran and Cobol) initially used goto for the purposes of loops and branches. By the time I went to university the goto statement was considered to be pure evil. The abuse of goto in Fortran and BASIC programs even then was mind-boggling and I believe the advent and general acceptance of structured programming had enormous benefits (e.g., far more than OO) for the advancement of our industry.

That said, the blanket banning of goto (as in some coding standards) is a bad idea. A commonly cited example of when you should use goto is where you need to break from a nested loop. The C break statement only allows you to exit the innermost loop (and also can't be used to break out of a loop if you are within a switch).

Flag Variables

Sometimes in an attempt to avoid a goto a programmer will set a flag variable which is used to later control the flow of the program. This makes the code difficult to follow. (In fact I consider this technique to be the "poor cousin" of that most heinous of programmer crimes - self-modifying code.)

  BOOL time_to_exit = FALSE;
  while (!time_to_exit)
  {
     ....
     do
     {
        if (some_error_condition_detected())
           time_to_exit = TRUE;
        else
           ...
     } while (not_finished && !time_to_exit);
  }
  ...

Getting rid of the flag variable makes this simpler and easier to understand. You also avoid the possibility that the flag is incorrectly set or accidentally modified.

  while (TRUE)
  {
     ....
     do
     {
        if (some_error_condition_detected())
           goto error_condition;
        ...
     } while (not_finished);
  }
error_condition:
  ...

Note that flag variables seem to be used a lot more in Pascal than in C, which is one reason I hate reading Pascal and Delphi programs. This may be due to the fact that Pascal does not support break, continue, and return statements, and goto (though allowed) is rarely used.

Single Exit Point

Another place where goto is useful is when you have a function that needs to have a single exit point in order to tidy up resources. Many functions need to return from more than one place, generally when some sort of error is detected. Luckily, the C language makes this easy to handle using the return statement. However, sometimes a function may need to tidy up before returning which means the tidy up code has to be duplicated before every return.

void f()
{
   ...
   if (file_not_found())
   {
      // clean up code
      return;
   }
	
   while ...
   {
      ...
      if (unexpected_eof())
      {
         // clean up code
         return;
      }
   }
   // clean up code
}

The cleanup code could be used to close a file, release some memory, free some other type of resource, etc. The point is that duplicating this code at every return statement is a bad idea - for one, it makes the code less maintainable. A better way is to go through a single exit point.

[Some coding standards require that all functions should have a single exit point. I disagree with this as a blanket rule, but that debate will have to wait for a future article.]

void f()
{
   ...
   if (file_not_found())
      goto exit_f;
	
   while ...
   {
      ...
      if (unexpected_eof())
         goto exit_f;
   }
exit_f:
   // clean up code
}

A similar use for goto is to jump between case labels in a switch. (In C# it is essential to use goto for this as there is no fall-through between cases). Consider this switch where different cases share common code:

  switch (val)
  {
  case 1:
     // unique code 1
     // common code
     break;
  case 2:
     // unique code 2
     // common code
     break;
  case 3:
     // common code
     break;
  ...
  }

It is always a good idea to eliminate duplicate code. (An alternative might be to push all the common code into a function, but for a few lines of repeated code or something that uses many local variables this may not be practical.)

  switch (val)
  {
  case 1:
     // unique code 1
     goto common_code;
  case 2:
     // unique code 2
     goto common_code;
  case 3:
  common_code:
     // common code
     break;
  ...
  }

Again using goto avoids the bad practice of having the same code in multiple places.

Guidelines:

avoid goto unless it makes the code easier to understand and maintain
never use a backward goto or jump into control statements

10. Always return an error status

The most ubiquitous problem in the history of the C language is that of code that ignores the possibility of an error value being returned from a function. There is no C programmer who has not at some time thought "I don't need to check the return code from this function as it can never fail in this case". For example, I have never seen code that checks the return value of printf() even though it can return -1 for a variety of (admittedly unlikely) errors.

Even worse is the programmer who is oblivious to any error return value. It is good practice to indicate that you are aware that the function can generate an error but consider it impossible or extremely unlikely by casting the return value to void.

  (void)printf("No errors here!\n");

It can often be safe to ignore error return values but I have seen many cases where the practice has caused nasty problems. If lucky, an unhandled error might immediately cause the program to abort with a run-time error - for example, by dereferencing through a NULL pointer that was returned to indicate an error. Worse is where the problem is silently ignored but later results in a nasty surprise like data corruption.

Now a great part of this problem is due to the fact that many functions return error values when they don't need to (which encourages an ill-disciplined attitude to checking error return values). This is the fallacy that I want to discuss. There are some bad practices that are used in the name of implementing "professional standard" error-handling which I will look at.

Worst Case

At the extreme end is a function that always returns success:

int f()
{
    // some code that does not generate errors
    return SUCCESS;
}

The usual justification for this, is that it may be possible that in the future the function would be modified to do things that might generate an error. My first riposte is the Extreme Programming principle of YAGNI (You Aren't Going to Need It). The problem is that you burden the caller with the responsibility of handling the error, even though there is none. Of course, in all likelihood the caller will just ignore the return value (since it never changes) with the consequence that if it is changed to return errors in the future that these errors will not be detected and handled.

This is far better:

void f()
{
    // some code that does not generate errors
    return;
}

Then even if the function is changed in the future to return errors, it is more likely that all the calls will be tracked down and modified appropriately.

Inventing Errors

Another bad practice is turning bugs into run-time errors. This is when the software detects some internal inconsistency and then returns an error code. For example:

int display(const char *str)
{
    if (str == NULL)
        return -1;
		
    // display str
    return 0;
}

Now it is obvious (and should be documented) that display() should always be passed a valid string. It is a bug if whoever calls it passes a NULL pointer. It should simply be written thusly:

void display(const char *str)
{
    assert(str != NULL);
    // display str
    return;
}

C++

Note that, like many of these fallacies, the problem can be avoided in C++, which supports exception-handling to handle errors. In fact, one major benefit of C++ exceptions, which is not generally recognised, is that there is no way to accidentally ignore errors and blissfully continue. (The other major benefit is that it makes the code much easier to understand, not being cluttered with much error-handling code.)

[I probably should also mention the standard C library function setjmp(), which is a poor man's version of C++ exception handling. However, setjmp() is not often used and I don't recommend it in general as it is fraught with danger even if you know exactly how to use it. It is also not as useful as C++ exceptions anyway.]

If you are using C++ then Scott Meyers has more to say about removing the possibility of errors in his excellent book Effective C++. (See Item 46: Prefer compile-time and link-time errors to runtime errors.)

Guidelines:

avoid creating error return values whenever possible
don't turn bugs into run-time errors
understand the possible errors that a function returns and don't ignore them

Conclusion

When I started this article I did not have a hard list of 10 fallacies. I even considered changing the title to "Seven Fallacies...". However, as I wrote the article (and also worked on maintaining some horrible old C code at the same time) I have thought of quite a few more, so that I am well on my way to another ten. One I probably should have mentioned is Hungarian Notation which I have been saying for 20 years is a bad idea and thankfully even Microsoft are saying not to use now (though the original Application Hungarian was actually a very good idea). I will probably write a follow up article if there is enough interest.

You probably don't even agree with all of the ten I have listed above. However, I implore you to at least think about what I have said. This list is based on about 30 years experience of C programming, a lot of thought and research. If you don't like something in the list maybe you didn't really understand my point. It may become clear on a second reading. If you still disagree then post a message or send me an email saying why.