C is a low-level procedural programming language. C is very fast because it is very close to machine language, but for the same reason it is also much harder to program in. C code is usually lengthy because C does only exactly what we tell it to do. There is little runtime overhead; thus, we gain speed but lose a lot of the control and checking that higher-level programming languages do for us behind the scenes. We should be especially careful when writing C code and should try to follow best practices. Of course, as with any other language, we should always strive to use the most efficient algorithm, because when we say C is faster we are comparing it to equivalent implementations in other languages. In addition to our standard code manual, RPs should pay attention to the following principles for coding in C.
The pointer is one of the distinctive features of C and C++. Pointers are versatile and powerful, but their use can also lead to disasters. Before diving into C coding, one should understand pointers well - most of the other differences between C and other languages are mainly in the syntax. C pointers relate directly to how the machine deals with memory allocation and with complicated data structures such as matrices and strings. At the end of this C manual we provide a list of tutorials and references that can be very helpful in learning about pointers.
When we allocate memory for a pointer, we could either state the memory size as a bare number or use the sizeof operator to spell out the type of the elements being allocated. We should always prefer the latter, both for clarity and for portability. On a given platform, pointers to data usually share the same size (typically 4 bytes on 32-bit systems and 8 bytes on 64-bit systems), so sizeof(double *) and sizeof(int ***) will normally evaluate to the same number. However, we should still write the correct type even when a wrong one happens to have the same effect.
For example, if we have a pointer to pointer to pointer of double, double ***a, and we want to allocate memory to a[1], which is a pointer to a pointer of double and therefore points to elements of type double *, then we should write
a[1] = mxMalloc(nsize * sizeof(double *));
instead of
a[1] = mxMalloc(nsize * 4);
or
a[1] = mxMalloc(nsize * sizeof(int ***));
so that it is clear to any reader what the elements of a[1] are. Writing sizeof(*a[1]) is an equivalent alternative that stays correct if the declaration of a ever changes.
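The allocation idiom can be sketched as follows. This is a minimal illustration, not lab code: it uses the standard malloc and free as stand-ins for mxMalloc and mxFree so that it compiles outside Matlab, and the function names alloc_rows and free_rows are hypothetical. Writing sizeof *rows asks the compiler for the size of whatever rows points to, so the element type never has to be repeated by hand.

```c
#include <stdlib.h>

/* Allocate an nrows-by-ncols table of doubles as an array of row pointers.
 * Returns NULL if any allocation fails (nothing is leaked in that case). */
double **alloc_rows(size_t nrows, size_t ncols)
{
    size_t i, k;
    double **rows = malloc(nrows * sizeof *rows);   /* elements are double * */

    if (rows == NULL) return NULL;
    for (i = 0; i < nrows; i++) {
        rows[i] = malloc(ncols * sizeof *rows[i]);  /* elements are double */
        if (rows[i] == NULL) {
            /* Undo the rows allocated so far before reporting failure. */
            for (k = 0; k < i; k++) free(rows[k]);
            free(rows);
            return NULL;
        }
    }
    return rows;
}

/* Mirror image of alloc_rows: free every row, then the row table. */
void free_rows(double **rows, size_t nrows)
{
    size_t i;

    if (rows == NULL) return;
    for (i = 0; i < nrows; i++) free(rows[i]);
    free(rows);
}
```

A caller would pair the two functions, for instance double **t = alloc_rows(3, 4); followed later by free_rows(t, 3);.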
To allocate memory, mxMalloc() is preferred to mxCalloc() when we do not need the allocated memory to be initialized to 0. mxCalloc() allocates the memory and then also zero-initializes it. If we have multiple memory allocations inside a loop, mxMalloc() will be the more efficient choice. Also, mxMalloc/mxCalloc/mxFree is preferred over malloc/calloc/free, because the former group of functions allocates and deallocates memory through the Matlab memory manager, so the memory is released automatically when the MEX function returns (which prevents memory leaks). However, we should still always call mxFree on the pointers we allocate because (a) it is good programming practice and (b) doing so generally makes the entire system run more efficiently (excerpt from the Matlab documentation on mxFree). More about memory clean-up in the next point.
Dereferencing a NULL pointer, or attempting to access or alter memory that the program does not own, will typically cause a segmentation fault. This is a fatal error in C code: the operating system terminates the process, and in a MEX file that means crashing the whole Matlab session. Worse, an invalid write that happens to land in memory the program does own can silently corrupt data. Pointers have to be given allocated memory before they are used, and that memory has to be freed before the program ends. Imagine a simple script being called a million times by our project, and each time it eats up a few bytes of memory permanently because we did not free the memory allocated to certain pointers. This is what we call a “memory leak” - an extremely bad programming practice (though using mxMalloc/mxCalloc instead of malloc/calloc could prevent such a leak because Matlab automatically cleans up). For these reasons, we should make memory allocation and clean-up as symmetric and transparent as possible. We should also check carefully after compiling C programs to make sure that all allocated memory is properly freed. A quick-and-dirty way to do this is to call the script a million times and monitor the memory usage of “matlab” in Windows Task Manager. A more methodical way is to use valgrind, a very useful open-source tool (though no one in our lab has installed this tool and actually used it yet).
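One common way to keep allocation and clean-up symmetric is to route every exit of a function through a single clean-up label. The sketch below is only an illustration under stated assumptions: it uses the standard malloc and free rather than mxMalloc and mxFree so that it compiles outside Matlab, and the function name sum_sq_dev is hypothetical.

```c
#include <stdlib.h>

/* Computes the sum of squared deviations from the mean of x[0..n-1].
 * Returns 0 on success and -1 if n is 0 or the allocation fails. */
int sum_sq_dev(const double *x, size_t n, double *result)
{
    int status = -1;
    double *dev = NULL;          /* scratch buffer, owned by this function */
    double mean = 0.0, total = 0.0;
    size_t i;

    if (n == 0) return -1;       /* nothing allocated yet, safe to return */

    dev = malloc(n * sizeof *dev);
    if (dev == NULL) goto cleanup;

    for (i = 0; i < n; i++) mean += x[i];
    mean /= (double) n;

    for (i = 0; i < n; i++) {
        dev[i] = x[i] - mean;
        total += dev[i] * dev[i];
    }
    *result = total;
    status = 0;

cleanup:
    free(dev);                   /* free(NULL) is a harmless no-op */
    return status;
}
```

Because every path after the allocation passes through the cleanup label, the buffer is freed exactly once whether the function succeeds or fails.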
If we want to pass many related variables between functions, for clarity and efficiency we should combine them into a structure and pass a pointer to that structure. For example, if we want to pass the 3-D coordinates (x, y, z) of an object to the function foo, then
void foo(...variable list..., typeCoord *coord);
is clearer than
void foo(...variable list..., double x, double y, double z);
When passing arrays and structures between functions, we should pass pointers to them as function inputs instead of the objects themselves. A pointer takes up only a few bytes (typically 4 or 8, depending on the platform), regardless of how large the object it points to is. A structure passed by value, in contrast, is copied in full for every function call, which costs both time and memory. (Arrays are a special case: C never copies an array argument, because an array passed to a function is automatically converted to a pointer to its first element.) In many higher-level languages, passing by reference is done automatically behind the scenes.
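As a small sketch of passing a structure by pointer: the text only names the type typeCoord, so the field layout below and the function names coord_norm_sq and coord_scale are assumptions made for illustration.

```c
/* Hypothetical layout for the typeCoord structure mentioned above. */
typedef struct {
    double x, y, z;
} typeCoord;

/* Reads the caller's structure through a pointer; no copy is made. */
double coord_norm_sq(const typeCoord *coord)
{
    return coord->x * coord->x + coord->y * coord->y + coord->z * coord->z;
}

/* Modifies the caller's structure in place. */
void coord_scale(typeCoord *coord, double factor)
{
    coord->x *= factor;
    coord->y *= factor;
    coord->z *= factor;
}
```

A caller passes the address of its own variable, for example coord_scale(&c, 2.0), and sees the change in c afterwards, which would not happen if the structure were passed by value.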
Some pointer arguments of functions only serve as inputs and are not supposed to change during function calls. The standard C practice is to declare them as “const” so that the compiler reports an error if the code tries to modify the values they point to. This also helps differentiate between input and output pointer arguments. For example, one could write
void foo(double *output, const double *input);
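A minimal sketch of an input/output pair of pointer arguments; the function name scale_array and its extra parameters n and factor are hypothetical.

```c
#include <stddef.h>

/* Writes factor * input[i] into output[i] for i = 0..n-1.
 * input is read-only; output is the only array that is modified. */
void scale_array(double *output, const double *input, size_t n, double factor)
{
    size_t i;

    for (i = 0; i < n; i++) {
        output[i] = factor * input[i];
        /* input[i] = 0.0;   would be rejected by the compiler,
         *                   because input points to const double */
    }
}
```

Reading the prototype alone is enough to tell which argument the function may change.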
For variables and functions that should only be accessed within the same source file, we should declare them as static to prevent them from being accidentally accessed from other files.
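A short sketch of file-local helpers; all names here (call_count, square, sum_of_squares, sum_of_squares_calls) are hypothetical. Only the two non-static functions can be called from other source files.

```c
/* File-local state: invisible to other source files. */
static int call_count = 0;

/* File-local helper: other files cannot call or redefine-clash with it. */
static double square(double v)
{
    return v * v;
}

/* Public function: the interface other files are meant to use. */
double sum_of_squares(const double *v, int n)
{
    double total = 0.0;
    int i;

    call_count++;
    for (i = 0; i < n; i++) total += square(v[i]);
    return total;
}

/* Public accessor for the otherwise hidden counter. */
int sum_of_squares_calls(void)
{
    return call_count;
}
```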
C allows users to define macros, which the preprocessor expands by what is basically “search and replace” within the code. For basic functions such as min and max, macros can work better than functions because they avoid the overhead of a function call (although modern compilers often inline small functions anyway). Keep in mind that a macro argument is evaluated every time it appears in the expansion, so arguments with side effects such as i++ should be avoided. One could also use macros for indexing so that the code looks cleaner (e.g. replace data[j*nrows + i] with data[INDEX(i, j)]). Macro names are commonly capitalized. The parentheses are important in macros, for example in the following indexing macro:
#define INDEX(i, j) ((i) + (j) * nrows)
To understand what happens without the parentheses, we can write out what the preprocessor replaces the call INDEX(i, j + 1) with. With the parentheses, INDEX(i, j + 1) is replaced with ((i) + (j + 1) * nrows). Without them, it would be replaced with (i + j + 1 * nrows), which computes a different index.
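The effect of the parentheses can be checked with a small sketch. The macro and function names below are hypothetical, and unlike the macro above, nrows is passed as an explicit macro argument so that the example is self-contained.

```c
/* Every argument is parenthesized, so any expression can be passed. */
#define INDEX_SAFE(i, j, nrows)   ((i) + (j) * (nrows))

/* Deliberately broken: arguments are pasted in without parentheses. */
#define INDEX_UNSAFE(i, j, nrows) (i + j * nrows)

int safe_index(int i, int j, int nrows)
{
    /* Expands to ((i) + (j + 1) * (nrows)) */
    return INDEX_SAFE(i, j + 1, nrows);
}

int unsafe_index(int i, int j, int nrows)
{
    /* Expands to (i + j + 1 * nrows): the "+ 1" escapes the multiplication */
    return INDEX_UNSAFE(i, j + 1, nrows);
}
```

With i = 2, j = 3 and nrows = 10, the first function returns 2 + 4 * 10 = 42, while the second returns 2 + 3 + 1 * 10 = 15.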
One should be aware that the conditional logic in a macro is evaluated every time the macro is invoked, so embedding “if-else” type macros in loops could be inefficient. For example, the following code could run faster if we replaced the macro with a single “if-else” statement outside of the loop (an optimizing compiler may already do this for us, but we should not rely on it):
#define VAL(mode) ((mode) == 1 ? 1 : 2)
...
out = 0;
for (i = 0; i < n; i++) {
    out += VAL(mode);
}
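The two variants can be written side by side as a sketch; the function names are hypothetical, and the actual speed difference depends on the compiler and optimization level.

```c
/* The condition is tested on every iteration of the loop. */
long sum_branch_inside(int mode, int n)
{
    long out = 0;
    int i;

    for (i = 0; i < n; i++) {
        out += (mode == 1 ? 1 : 2);
    }
    return out;
}

/* The condition is tested once, before the loop starts. */
long sum_branch_hoisted(int mode, int n)
{
    const int step = (mode == 1) ? 1 : 2;
    long out = 0;
    int i;

    for (i = 0; i < n; i++) {
        out += step;
    }
    return out;
}
```

Both functions return the same value for the same inputs; only the placement of the branch differs.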
The standard practice for coding in C is that header files should contain only declarations, not the actual function definitions. Header files are used to link different C files together so that shared functions and data types are declared in one central location. When a header file is included, the preprocessor copies and pastes its contents in place of the #include directive. A header file should not be included multiple times due to nested includes, and to prevent this, every header file should have an “include guard”. The template for an “include guard” is the following:
//calc.h
#ifndef CALC_H
#define CALC_H
...(content of the header file)...
#endif
C requires variable types to be declared up front and unchanged during the course of the program. Thus, it is important to keep track of variable types. The best practice is to group variables with the same type in the same section of the declaration. For example:
int i, j;
int *index;
double input, output;
typePoint *p;
Moreover, C allows implicit type conversion: when an expression mixes types, the value of the narrower type (for example int) is automatically converted to the broader type (for example double) so that the comparison or arithmetic is carried out correctly. For example:
double d;
int i;
if (d > i) d = i;
Data can be lost when values are converted from one type to another, especially from broader types to narrower types (for example, from double to int, which discards the fractional part). If this is what we want, it is best to make the type conversion explicit. Even when the result is stored in a broader type, typecasting is often necessary to avoid unexpected results, because the arithmetic is carried out in the type of the operands. For example:
if (d > i) d = (double) i / 2;
If i == 5, d will be assigned the correct value 2.5. Without typecasting, i / 2 is an integer division and d will be assigned 2 instead.
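The difference between integer and floating-point division, and the truncation caused by a narrowing conversion, can be seen in a small sketch; the function names are hypothetical.

```c
/* i / 2 is computed in int arithmetic first, then converted to double. */
double half_int_division(int i)
{
    return i / 2;
}

/* The cast converts i to double first, so the division is floating-point. */
double half_with_cast(int i)
{
    return (double) i / 2;
}

/* Explicit narrowing conversion: the fractional part is discarded. */
int truncate_to_int(double d)
{
    return (int) d;
}
```

For i = 5 the first function returns 2.0 and the second returns 2.5. truncate_to_int(2.9) returns 2 and truncate_to_int(-2.9) returns -2, since the conversion truncates toward zero rather than rounding.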
Whenever possible, we should use memset, memcpy, and memmove (declared in string.h) instead of for loops. These functions are highly optimized in the standard library and they can be used in very flexible ways. memset is particularly good when we want to set a block of data to 0. memcpy should be used to copy the content of one array to another array. memmove is similar to memcpy, but should be used when the source and destination overlap. Please see a good C reference for more details. Here is an example with memcpy instead of a for loop:
int a[100], b[100];
int i;
// compute values for a
...
// copy values from a to b
memcpy(b, a, 100 * sizeof(a[0]));
// now all 100 values pointed to by b
// are the same as those pointed to by a
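The three standard library functions can be compared in one short sketch. The wrapper names zero_ints, copy_ints and shift_right_one are hypothetical; the calls themselves are the standard functions from string.h. Note that all three take a size in bytes, hence the multiplication by the element size.

```c
#include <string.h>

/* Set all n ints in buf to 0. */
void zero_ints(int *buf, size_t n)
{
    memset(buf, 0, n * sizeof *buf);
}

/* Copy n ints from src to dst; the two arrays must not overlap. */
void copy_ints(int *dst, const int *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Shift buf[0..n-2] one slot to the right within the same array.
 * Source and destination overlap, so memmove is required.
 * buf[0] keeps its old value. */
void shift_right_one(int *buf, size_t n)
{
    if (n > 1) {
        memmove(buf + 1, buf, (n - 1) * sizeof *buf);
    }
}
```

Applied to the array {1, 2, 3, 4}, shift_right_one leaves {1, 1, 2, 3}.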
A good understanding of the bit-pattern representations of C data types is useful. For example, knowing that integers are represented as binary numbers, we can see that 2 to the power n can be computed as 1 << n: shifting the bit 1 to the left n times, as long as n is smaller than the number of bits in the integer type (note that ^ in C is the bitwise XOR operator, not exponentiation). Not only is this method faster because it is a primitive operation directly supported by the processor, but it also stays exact in integer arithmetic, whereas pow(2, n) returns a double. Bitwise operators are thus important.
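A brief sketch of two bitwise idioms; the function names pow2 and is_power_of_two are hypothetical.

```c
/* Returns 2 to the power n for 0 <= n < 32.
 * unsigned long is guaranteed to have at least 32 bits;
 * shifting by the full width of the type or more is undefined. */
unsigned long pow2(unsigned int n)
{
    return 1UL << n;
}

/* A power of two has exactly one bit set, so clearing the lowest
 * set bit with v & (v - 1) must leave zero. */
int is_power_of_two(unsigned long v)
{
    return v != 0 && (v & (v - 1)) == 0;
}
```

For example, pow2(10) returns 1024, whereas the expression 2 ^ 10 in C is a bitwise XOR and evaluates to 8.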
The order of computation can affect precision. It is also best practice to be mindful of potential arithmetic overflow and underflow. The following n choose k program is a good example:
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
// nchoosek(n, k) = nchoosek(n, n - k);
r = MIN(k, n - k);
multiplier = n;
divisor = 1;
result = 1;
for(i = 1; i <= r; i++){
result = result * multiplier / divisor;
multiplier --; divisor ++;
}
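Notice that alternating between multiplying and dividing ensures that we never calculate n!, which can be a very large number and could overflow before the divisions even take place. Similarly, we do not divide first, which would produce intermediate values that are truncated by integer division or that are essentially 0 at machine precision in floating point. With integer variables the alternating scheme stays exact: after step i, result equals n choose i, so the product result * multiplier is always divisible by divisor.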
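The fragment above can be wrapped into a complete function as a sketch. The choice of unsigned long long for the result and the handling of k > n are assumptions made here, not part of the original code; very large results can still overflow.

```c
/* Returns n choose k, or 0 if k > n.
 * Multiplication and division alternate so that intermediate
 * values stay close in size to the final result. */
unsigned long long nchoosek(unsigned int n, unsigned int k)
{
    unsigned long long result = 1;
    unsigned int r, i;

    if (k > n) return 0;

    /* nchoosek(n, k) == nchoosek(n, n - k): loop over the smaller one */
    r = (k < n - k) ? k : n - k;

    for (i = 1; i <= r; i++) {
        /* Before this step result == C(n, i - 1), so
         * result * (n - i + 1) is exactly divisible by i. */
        result = result * (n - i + 1) / i;
    }
    return result;
}
```

For instance, nchoosek(52, 5) returns 2598960, the number of five-card hands from a standard deck.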
When compiling C code into MEX files, Visual Studio by default uses the /MD flag, such that MSVCR100.dll will be required for the program to run. MSVCR100.dll is the Visual C++ runtime library that comes with a Visual Studio installation (or the matching redistributable package), so it may be missing on machines without it. For this reason, we override the /MD flag with /MT when compiling C code, so that the runtime library is statically linked into the compiled program and MSVCR100.dll is no longer required for the execution of the program. To do this, we pass COMPFLAGS="$COMPFLAGS /MT" as an option to the mex Matlab command.
An example for compiling:
mex COMPFLAGS="$COMPFLAGS /MT" -outdir ../../mex ../../src/demand_c.c
Note that by manually overriding the /MD flag with /MT, the compiler will issue the following warning in the log:
cl : Command line warning D9025 : overriding '/MD' with '/MT'
- The C Programming Language (by Kernighan and Ritchie, also known as the “K&R” book);
- The C Book.