C Traps and Pitfalls*
Andrew Koenig
AT&T Bell Laboratories
Murray Hill, New Jersey 07974
ABSTRACT
The C language is like a carving knife: simple, sharp, and extremely useful in
skilled hands. Like any sharp tool, C can injure people who don’t know how to handle it.
This paper shows some of the ways C can injure the unwary, and how to avoid injury.
0. Introduction
The C language and its typical implementations are designed to be used easily by experts. The lan-
guage is terse and expressive. There are few restrictions to keep the user from blundering. A user who has
blundered is often rewarded by an effect that is not obviously related to the cause.
In this paper, we will look at some of these unexpected rewards. Because they are unexpected, it
may well be impossible to classify them completely. Nevertheless, we have made a rough effort to do so
by looking at what has to happen in order to run a C program. We assume the reader has at least a passing
acquaintance with the C language.
Section 1 looks at problems that occur while the program is being broken into tokens. Section 2 fol-
lows the program as the compiler groups its tokens into declarations, expressions, and statements. Section
3 recognizes that a C program is often made out of several parts that are compiled separately and bound
together. Section 4 deals with misconceptions of meaning: things that happen while the program is actually
running. Section 5 examines the relationship between our programs and the library routines they use. In
section 6 we note that the program we write is not really the program we run; the preprocessor has gotten at
it first. Finally, section 7 discusses portability problems: reasons a program might run on one implementa-
tion and not another.
1. Lexical Pitfalls
The first part of a compiler is usually called a lexical analyzer. This looks at the sequence of charac-
ters that make up the program and breaks them up into tokens. A token is a sequence of one or more char-
acters that have a (relatively) uniform meaning in the language being compiled. In C, for instance, the
token -> has a meaning that is quite distinct from that of either of the characters that make it up, and that is
independent of the context in which the -> appears.
For another example, consider the statement:
if (x > big) big = x;
Each non-blank character in this statement is a separate token, except for the keyword if and the two
instances of the identifier big.
In fact, C programs are broken into tokens twice. First the preprocessor reads the program. It must
tokenize the program so that it can find the identifiers, some of which may represent macros. It must then
replace each macro invocation by the result of evaluating that macro. Finally, the result of the macro
replacement is reassembled into a character stream which is given to the compiler proper. The compiler
then breaks the stream into tokens a second time.
__________________
* This paper, greatly expanded, is the basis for the book C Traps and Pitfalls (Addison-Wesley, 1989, ISBN
0–201–17928–8); interested readers may wish to refer there as well.
In this section, we will explore some common misunderstandings about the meanings of tokens and
the relationship between tokens and the characters that make them up. We will talk about the preprocessor
later.
1.1. = is not ==
Programming languages derived from Algol, such as Pascal and Ada, use := for assignment and =
for comparison. C, on the other hand, uses = for assignment and == for comparison. This is because
assignment is more frequent than comparison, so the more common meaning is given to the shorter symbol.
Moreover, C treats assignment as an operator, so that multiple assignments (such as a=b=c) can be
written easily and assignments can be embedded in larger expressions.
This convenience causes a potential problem: one can inadvertently write an assignment where one
intended a comparison. Thus, this statement, which looks like it is checking whether x is equal to y:
if (x = y)
foo();
actually sets x to the value of y and then checks whether that value is nonzero. Or consider the following
loop that is intended to skip blanks, tabs, and newlines in a file:
while (c = ' ' || c == '\t' || c == '\n')
c = getc (f);
The programmer mistakenly used = instead of == in the comparison with ' '. Because = has lower
precedence than ||, this ‘‘comparison’’ actually assigns to c the value of the entire expression
' ' || c == '\t' || c == '\n'. Since ' ' is not zero, that expression is always 1, so the condition is always
true and the loop will eat the entire file. What it does after that depends on whether the
particular implementation allows a program to keep reading after it has reached end of file. If it does, the
loop will run forever.
Some C compilers try to help the user by giving a warning message for conditions of the form e1 =
e2. To avoid warning messages from such compilers, when you want to assign a value to a variable and
then check whether the variable is zero, consider making the comparison explicit. In other words, instead
of:
if (x = y)
foo();
write:
if ((x = y) != 0)
foo();
This will also help make your intentions plain.
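As a rough sketch of the advice in this section, the following shows the blank-skipping loop with == in every comparison, together with the explicit assign-then-test form. The function names skip_blanks and assign_and_test are hypothetical and are not part of the original program; skip_blanks assumes that f is already open for reading.

```c
#include <stdio.h>

/* Skip blanks, tabs, and newlines; return the first other character
   read from f, or EOF if the input runs out first. */
int skip_blanks(FILE *f)
{
    int c = getc(f);

    while (c == ' ' || c == '\t' || c == '\n')
        c = getc(f);
    return c;
}

/* Copy y into *x, then report whether the copied value is nonzero.
   The explicit != 0 shows that the assignment is intentional. */
int assign_and_test(int *x, int y)
{
    if ((*x = y) != 0)
        return 1;
    return 0;
}
```

Note that c is declared as an int rather than a char so that it can hold EOF as well as any character value.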
1.2. & and | are not && or ||
It is easy to miss an inadvertent substitution of = for == because so many other languages use = for
comparison. It is also easy to interchange & and &&, or | and ||, especially because the & and | operators
in C are different from their counterparts in some other languages. We will look at these operators more
closely in section 4.
1.3. Multi-character Tokens
Some C tokens, such as /, *, and =, are only one character long. Other C tokens, such as /* and ==,
and identifiers, are several characters long. When the C compiler encounters a / followed by an *, it must
be able to decide whether to treat these two characters as two separate tokens or as one single token. The C
reference manual tells how to decide: ‘‘If the input stream has been parsed into tokens up to a given charac-
ter, the next token is taken to include the longest string of characters which could possibly constitute a
token.’’ Thus, if a / is the first character of a token, and the / is immediately followed by a *, the two
characters begin a comment, regardless of any other context.
The following statement looks like it sets y to the value of x divided by the value pointed to by p:
y = x/*p
/* p points at the divisor */;
In fact, /* begins a comment, so the compiler will simply gobble up the program text until the */ appears.
In other words, the statement just sets y to the value of x and doesn’t even look at p. Rewriting this state-
ment as
y = x / *p
/* p points at the divisor */;
or even
y = x/(*p)
/* p points at the divisor */;
would cause it to do the division the comment suggests.
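The difference can be seen in a small, self-contained sketch. The function names divide_by_pointee and divide_pitfall are made up for this illustration; the second function deliberately reproduces the mistake so that its effect can be observed.

```c
/* Intended meaning: divide x by the value p points to.  The blank
   between / and * keeps the two characters from starting a comment. */
int divide_by_pointee(int x, const int *p)
{
    return x / *p;
}

/* The pitfall: with no blank, the slash and star open a comment that
   swallows "p" and everything up to the comment terminator, so the
   statement is just y = x. */
int divide_pitfall(int x, const int *p)
{
    int y;

    (void)p;  /* p is never examined; this only silences warnings */
    y = x/*p     the comment has already started here */;
    return y;
}
```

With x equal to 12 and p pointing at 4, the first function returns 3 and the second returns 12.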
This sort of near-ambiguity can cause trouble in other contexts. For example, older versions of C use
=+ to mean what present versions mean by +=. Such a compiler will treat
a=-1;
as meaning the same thing as
a =- 1;
or
a = a - 1;
This will surprise a programmer who intended
a = -1;
On the other hand, compilers for these older versions of C would interpret
a=/*b;
as
a =/ * b ;
even though the /* looks like a comment.
1.4. Exceptions
Compound assignment operators such as += are really multiple tokens. Thus,
a + /* strange */ = 1
means the same as
a += 1
These operators are the only cases in which things that look like single tokens are really multiple tokens. In
particular,
p - > a
is illegal. It is not a synonym for
p -> a
As another example, the >> operator is a single token, so >>= is made up of two tokens, not three.
On the other hand, those older compilers that still accept =+ as a synonym for += treat =+ as a single
token.
1.5. Strings and Characters
Single and double quotes mean very different things in C, and there are some contexts in which con-
fusing them will result in surprises rather than error messages.
A character enclosed in single quotes is just another way of writing an integer. The integer is the one
that corresponds to the given character in the implementation’s collating sequence. Thus, in an ASCII
implementation, 'a' means exactly the same thing as 0141 or 97. A string enclosed in double quotes, on
the other hand, is a short-hand way of writing a pointer to a nameless array that has been initialized with the
characters between the quotes and an extra character whose binary value is zero.
The following two program fragments are equivalent:
printf ("Hello world\n");

char hello[] = {'H', 'e', 'l', 'l', 'o', ' ',
'w', 'o', 'r', 'l', 'd', '\n', 0};
printf (hello);
Using a pointer instead of an integer (or vice versa) will often cause a warning message, so using
double quotes instead of single quotes (or vice versa) is usually caught. The major exception is in function
calls, where most compilers do not check argument types. Thus, saying
printf('\n');
instead of
printf ("\n");
will usually result in a surprise at run time.
Because an integer is usually large enough to hold several characters, some C compilers permit multiple
characters in a character constant. This means that writing 'yes' instead of "yes" may well go
undetected. The latter means ‘‘the address of the first of four consecutive memory locations containing y,
e, s, and a null character, respectively.’’ The former means ‘‘an integer that is composed of the values of
the characters y, e, and s in some implementation-defined manner.’’ Any similarity between these two
quantities is purely coincidental.
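A few of the claims in this section can be checked directly. The sketch below uses hypothetical function names, and the check on 'a' assumes an ASCII implementation, which C itself does not guarantee.

```c
#include <string.h>

/* Size in bytes of the nameless array behind "yes": three letters
   plus the terminating null character. */
size_t yes_array_size(void)
{
    return sizeof("yes");
}

/* Returns 1 if the string literal and the explicitly initialized
   array hold the same characters and occupy the same number of bytes. */
int hello_fragments_match(void)
{
    char hello[] = {'H', 'e', 'l', 'l', 'o', ' ',
                    'w', 'o', 'r', 'l', 'd', '\n', 0};

    return strcmp("Hello world\n", hello) == 0
        && sizeof hello == sizeof "Hello world\n";
}

/* A character constant is an integer.  ASSUMPTION: the character set
   is ASCII, where 'a' is 97 decimal, which is 0141 octal. */
int a_is_97(void)
{
    return 'a' == 97 && 97 == 0141;
}
```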
2. Syntactic Pitfalls
To understand a C program, it is not enough to understand the tokens that make it up. One must also
understand how the tokens combine to form declarations, expressions, statements, and programs. While
these combinations are usually well-defined, the definitions are sometimes counter-intuitive or confusing.
In this section, we look at some syntactic constructions that are less than obvious.
2.1. Understanding Declarations
I once talked to someone who was writing a C program that was going to run stand-alone in a small
microprocessor. When this machine was switched on, the hardware would call the subroutine whose
address was stored in location 0.
In order to simulate turning power on, we had to devise a C statement that would call this subroutine
explicitly. After some thought, we came up with the following:
(*(void(*)())0)();
Expressions like these strike terror into the hearts of C programmers. They needn’t, though, because
they can usually be constructed quite easily with the help of a single, simple rule: declare it the way you use
it.
Every C variable declaration has two parts: a type and a list of stylized expressions that are expected
to evaluate to that type. The simplest such expression is a variable:
float f, g;
indicates that the expressions f and g, when evaluated, will be of type float. Because the thing declared
is an expression, parentheses may be used freely:
float ((f));
means that ((f)) evaluates to a float and therefore, by inference, that f is also a float.
Similar logic applies to function and pointer types. For example,
float ff();
means that the expression ff() is a float, and therefore that ff is a function that returns a float.
Analogously,
float *pf;
means that *pf is a float and therefore that pf is a pointer to a float.
These forms combine in declarations the same way they do in expressions. Thus
float *g(), (*h)();
says that *g() and (*h)() are float expressions. Since () binds more tightly than *, *g() means
the same thing as *(g()): g is a function that returns a pointer to a float, and h is a pointer to a func-
tion that returns a float.
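The two declarations can be exercised in a short sketch. It is only an illustration: the names stored, three, and sum_of_both are invented here, and the parameter lists are written as (void) in the prototype style of later C rather than with the empty parentheses used above.

```c
static float stored = 2.5f;

/* *g() is a float, so g is a function returning a pointer to a float. */
float *g(void)
{
    return &stored;
}

/* An ordinary function returning a float, for h to point at. */
float three(void)
{
    return 3.0f;
}

/* (*h)() is a float, so h is a pointer to a function returning a float. */
float (*h)(void) = three;

/* Use both names exactly the way they were declared. */
float sum_of_both(void)
{
    return *g() + (*h)();
}
```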
Once we know how to declare a variable of a given type, it is easy to write a cast for that type: just
remove the variable name and the semicolon from the declaration and enclose the whole thing in parenthe-
ses. Thus, since
float *g();
declares g to be a function returning a pointer to a float, (float *()) is a cast to that type.
Armed with this knowledge, we are now prepared to tackle (*(void(*)())0)(). We can ana-
lyze this statement in two parts. First, suppose that we have a variable fp that contains a function pointer
and we want to call the function to which fp points. That is done this way:
(*fp)();
If fp is a pointer to a function, *fp is the function itself, so (*fp)() is the way to invoke it. The paren-
theses in (*fp) are essential because the expression would otherwise be interpreted as *(fp()). We
have now reduced the problem to that of finding an appropriate expression to replace fp.
This problem is the second part of our analysis. If C could read our mind about types, we could
write:
(*0)();
This doesn’t work because the * operator insists on having a pointer as its operand. Furthermore, the
operand must be a pointer to a function so that the result of * can be called. Thus, we need to cast 0 into a
type loosely described as ‘‘pointer to function returning void.’’
If fp is a pointer to a function returning void, then (*fp)() is a void value, and its declaration
would look like this:
void (*fp)();
Thus, we could write:
void (*fp)();
(*fp)();
at the cost of declaring a dummy variable. But once we know how to declare the variable, we know how to
cast a constant to that type: just drop the name from the variable declaration. Thus, we cast 0 to a ‘‘pointer
to function returning void’’ by saying:
(void(*)())0
and we can now replace fp by (void(*)())0:
(*(void(*)())0)();
The semicolon on the end turns the expression into a statement.
At the time we tackled this problem, there was no such thing as a typedef declaration. Using it,
we could have solved the problem more clearly:
typedef void (*funcptr)();
(* (funcptr) 0)();
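Calling location 0 is only meaningful on hardware like the machine described above, so the following sketch applies the same two casts to the address of an ordinary function instead. Everything here is an assumption made for illustration: the names startup, anyfunc, raw, and call_both_ways are hypothetical, and raw merely stands in for the constant 0. C permits converting a function pointer to another function pointer type and back, which is what makes the calls well defined.

```c
static int calls = 0;

/* Stand-in for the subroutine whose address the hardware would call. */
static void startup(void)
{
    calls++;
}

typedef void (*funcptr)(void);   /* pointer to function returning void */
typedef int (*anyfunc)(int);     /* some other function pointer type,
                                    used here only as raw storage */

/* Call startup twice through casts and return the call count so far. */
int call_both_ways(void)
{
    anyfunc raw = (anyfunc)startup;  /* plays the role of the constant 0 */

    (*(void(*)(void))raw)();         /* the cast spelled out in full */
    (*(funcptr)raw)();               /* the same cast, using the typedef */
    return calls;
}
```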
2.2. Operators Don’t Always Have the Precedence You Want
Suppose that the manifest constant FLAG is an integer with exactly one bit turned on in its binary
representation (in other words, a power of two), and you want to test whether the integer variable flags
has that bit turned on. The usual way to write this is:
if (flags & FLAG) ...
The meaning of this is plain to most C programmers: an if statement tests whether the expression in the
parentheses evaluates to 0 or not. It might be nice to make this test more explicit for documentation pur-
poses:
if (flags & FLAG != 0) ...
The statement is now easier to understand. It is also wrong, because != binds more tightly than &, so the
interpretation is now:
if (flags & (FLAG != 0)) ...
This will work (by coincidence) if FLAG is 1 or 0 (!), but not for any other power of two.*
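To make the difference concrete, here is a small sketch with a hypothetical value of 4 for FLAG. The second function is written with the grouping that the compiler actually applies to the unparenthesized test, so that its behavior can be compared with the intended one.

```c
#define FLAG 4   /* hypothetical flag value: exactly one bit set */

/* Intended test: isolate the bit first, then compare with zero. */
int flag_is_set(int flags)
{
    return (flags & FLAG) != 0;
}

/* What flags & FLAG != 0 really means: FLAG != 0 is 1, so this
   tests the low-order bit of flags instead of the FLAG bit. */
int flag_test_pitfall(int flags)
{
    return flags & (FLAG != 0);
}
```

With flags equal to 4, the first function yields 1 and the second yields 0.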
Suppose you have two integer variables, h and l, whose values are between 0 and 15 inclusive, and
you want to set r to an 8-bit value whose low-order bits are those of l and whose high-order bits are those
of h. The natural way to do this is to write:
r = h<<4 + l;
Unfortunately, this is wrong. Addition binds more tightly than shifting, so this example is equivalent to
r = h << (4 + l);
Here are two ways to get it right:
r = (h << 4) + l;
r = h << 4 | l;
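The same comparison can be made for the shift example. In this sketch the function names are hypothetical and the operands are unsigned; the second function spells out the grouping the compiler gives to h<<4 + l.

```c
/* Combine two 4-bit values: h in the high-order bits and l in the
   low-order bits of the result. */
unsigned pack_nibbles(unsigned h, unsigned l)
{
    return (h << 4) | l;
}

/* What h<<4 + l really means: the addition happens first, so h is
   shifted left by 4 + l places. */
unsigned pack_pitfall(unsigned h, unsigned l)
{
    return h << (4 + l);
}
```

For h equal to 1 and l equal to 2, the intended result is 18 (hexadecimal 12), while the mistaken grouping gives 64.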
One way to avoid these problems is to parenthesize everything, but expressions with too many paren-
theses are hard to understand, so it is probably useful to try to remember the precedence levels in C.
Unfortunately, there are fifteen of them, so this is not always easy to do. It can be made easier,
though, by classifying them into groups.
The operators that bind the most tightly are the ones that aren’t really operators: subscripting, func-
tion calls, and structure selection. These all associate to the left.
Next come the unary operators. These have the highest precedence of any of the true operators.
Because function calls bind more tightly than unary operators, you must write (*p)() to call a function
pointed to by p; *p() implies that p is a function that returns a pointer. Casts are unary operators and
have the same precedence as any other unary operator. Unary operators are right-associative, so *p++ is
interpreted as *(p++) and not as (*p)++.
__________________
* Recall that the result of != is always either 1 or 0.
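The two groupings of * and ++ do quite different things, as this sketch suggests. The function names sum and bump are made up for the illustration.

```c
/* Sum the n elements starting at p.  *p++ groups as *(p++): it
   fetches the element p points to and then advances p. */
int sum(const int *p, int n)
{
    int total = 0;

    while (n-- > 0)
        total += *p++;
    return total;
}

/* (*p)++ increments the value p points to and leaves p itself
   alone; the expression yields the value before the increment. */
int bump(int *p)
{
    return (*p)++;
}
```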
Next come the true binary operators. The arithmetic operators have the highest precedence, then the
shift operators, the relational operators, the logical operators, the conditional operator, and finally the
assignment operators. The two most important things to keep in mind are:
1. Every logical operator has lower precedence than every relational operator.
2. The shift operators bind more tightly than the relational operators but less tightly than the arithmetic
operators.
Within the various operator classes, there are few surprises. Multiplication, division, and remainder
have the same precedence, addition and subtraction have the same precedence, and the two shift operators
have the same precedence.
One small surprise is that the six relational operators do not all have the same precedence: == and !=
bind less tightly than the other relational operators. This allows us, for instance, to see if a and b are in the
same relative order as c and d by the expression
a < b == c < d
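As a brief sketch, the expression can be wrapped in a function (the name same_order is hypothetical). The parentheses are redundant; they only spell out the grouping that the precedence rules already produce, and they keep compilers that warn about chained comparisons quiet.

```c
/* Returns 1 when a and b are in the same relative order as c and d,
   that is, when both pairs are ascending or neither is. */
int same_order(int a, int b, int c, int d)
{
    return (a < b) == (c < d);
}
```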
Within the logical operators, no two have the same precedence. The bitwise operators all bind more
tightly than the sequential operators (&& and ||), each and operator binds more tightly than the
corresponding or operator, and the bitwise exclusive or operator (^) falls between bitwise and and bitwise or.
The ternary conditional operator has lower precedence than any we have mentioned so far. This per-
mits the selection expression to contain logical combinations of relational operators, as in
z = a < b && b < c ? d : e
This example also shows that it makes sense for assignment to have a lower precedence than the con-
ditional operator. Moreover, all the compound assignment operators have the same precedence and they all
group right to left, so that
a = b = c
means the same as
b = c; a = b;
Lowest of all is the comma operator. This is easy to remember because the comma is often used as a
substitute for the semicolon when an expression is required instead of a statement.
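The groupings described for the conditional and assignment operators can be tried out in a short sketch. The function names pick and chain are invented for this purpose.

```c
/* z = a < b && b < c ? d : e groups as
   z = (((a < b) && (b < c)) ? d : e). */
int pick(int a, int b, int c, int d, int e)
{
    int z;

    z = a < b && b < c ? d : e;
    return z;
}

/* a = b = c groups right to left: b receives c, then a receives
   the value just stored in b. */
int chain(int c)
{
    int a, b;

    a = b = c;
    return a + b;
}
```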
Assignment is another operator often involved in precedence mixups. Consider, for example, the fol-
lowing loop intended to copy one file to another:
while (c=getc(in) != EOF)
putc(c,out);
The way the expression in the while statement is written makes it look like c should be assigned the value
of getc(in) and then compared with EOF to terminate the loop. Unhappily, assignment has lower
precedence than any comparison operator, so c is assigned the result of comparing getc(in) with EOF,
and the character that getc returned is discarded. Thus, the ‘‘copy’’ of the file will consist of a stream of bytes
whose value is 1.
It is not too hard to see that the example above should be written:
while ((c=getc(in)) != EOF)
putc(c,out);
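A complete version of the corrected loop might look like the sketch below. The function name copy_stream and the returned character count are additions made for illustration; the sketch assumes that both streams are already open.

```c
#include <stdio.h>

/* Copy everything from in to out and return the number of
   characters copied.  c is an int so that it can hold EOF. */
long copy_stream(FILE *in, FILE *out)
{
    int c;
    long n = 0;

    /* The inner parentheses make the assignment happen before the
       comparison with EOF. */
    while ((c = getc(in)) != EOF) {
        putc(c, out);
        n++;
    }
    return n;
}
```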
However, errors of this sort can be hard to spot in more complicated expressions. For example, several
versions of the lint program distributed with the UNIX system have the following erroneous line:
if( (t=BTYPE(pt1->aty)==STRTY) || t==UNIONTY ){
This was intended to assign a value to t and then see if t is equal to STRTY or UNIONTY. The actual
effect is quite different.*
The precedence of the C logical operators comes about for historical reasons. B, the predecessor of
C, had logical operators that corresponded roughly to C’s & and | operators. Although they were defined to
act on bits, the compiler would treat them as && and || if they were in a conditional context. When the
two usages were split apart in C, it was deemed too dangerous to change the precedence much.**
2.3. Watch Those Semicolons!
An extra semicolon in a C program usually makes little difference: either it is a null statement, which
has no effect, or it elicits a diagnostic message from the compiler, which makes it easy to remove. One
important exception is after an if or while clause, which must be followed by exactly one statement.
Consider this example:
if (x[i] > big);
big = x[i];
The semicolon on the first line will not upset the compiler, but this program fragment means something
quite different from:
if (x[i] > big)
big = x[i];
The first one is equivalent to:
if (x[i] > big) { }
big = x[i];
which is, of course, equivalent to:
big = x[i];
(unless x, i, or big is a macro with side effects).
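Placed inside a loop that looks for the largest element of an array, the stray semicolon changes the result noticeably. In this sketch the function names are hypothetical, and the faulty version is written with the empty braces shown above as equivalent, so that the code makes the effect explicit.

```c
/* Largest of the n elements of x; assumes n is at least 1. */
int largest(const int x[], int n)
{
    int big = x[0];
    int i;

    for (i = 1; i < n; i++)
        if (x[i] > big)
            big = x[i];
    return big;
}

/* The same loop as it behaves with a stray semicolon after the if:
   the test controls nothing, the assignment always happens, and the
   result is simply the last element. */
int largest_pitfall(const int x[], int n)
{
    int big = x[0];
    int i;

    for (i = 1; i < n; i++) {
        if (x[i] > big) { }
        big = x[i];
    }
    return big;
}
```

For the array 3, 9, 4 the first function returns 9 and the second returns 4.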
Another place that a semicolon can make a big difference is at the end of a declaration just before a
function definition. Consider the following fragment:
struct foo {
int x;
}
f()
{
}
. . .
There is a semicolon missing between the first } and the f that immediately follows it. The effect of this is
to declare that the function f returns a struct foo, which is defined as part of this declaration. If the
semicolon were present, f would be defined by default as returning an integer.†
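Both situations can be shown side by side in a sketch. The second structure, struct bar, and the function g are invented for the illustration, and the return type of f is written out as int because later versions of C no longer supply it by default.

```c
struct foo {
    int x;
};              /* this semicolon ends the structure declaration */

int f(void)     /* f returns an int, here stated explicitly */
{
    return 0;
}

struct bar {
    int x;
}               /* no semicolon: the declaration continues... */
g(void)         /* ...so g is a function returning a struct bar */
{
    struct bar b = { 42 };

    return b;
}
```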
2.4. The Switch Statement
C is unusual in that the cases in its switch statement can flow into each other. Consider, for exam-
ple, the following program fragments in C and Pascal:
__________________
* Thanks to Guy Harris for pointing this out to me.
** Dennis Ritchie and Steve Johnson both pointed this out to me.
† Thanks to an anonymous benefactor for this one.