Merge pull request #4 from diana-hep/introduce-cpp-to-fix-column-names-bug

Introduce cpp to fix column names bug
diana-hep · Oct 24, 2016 · 3adedba
2 parents 2d0a3c9 + a36a00d
Showing 3 changed files with 80 additions and 145 deletions.
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 # c2numpy
 
-Write Numpy (.npy) files from C or C++ for analysis in [Numpy](http://www.numpy.org/), [Scipy](https://www.scipy.org/), [Scikit-Learn](http://scikit-learn.org/stable/), [Pandas](http://pandas.pydata.org/), etc.
+Write Numpy (.npy) files from C++ for analysis in [Numpy](http://www.numpy.org/), [Scipy](https://www.scipy.org/), [Scikit-Learn](http://scikit-learn.org/stable/), [Pandas](http://pandas.pydata.org/), etc.
 
 Fills a collection of .npy files with a maximum size, a common prefix and a rotating number (like rotating log files). Each file contains one [structured array](http://docs.scipy.org/doc/numpy/user/basics.rec.html), consisting of named, typed columns (numbers and fixed-size strings) and many rows. In Python, you access rows and columns with string and integer indexing:
 
@@ -10,11 +10,11 @@ myarray[3:5]         # all columns, slice of rows
                      # etc.
 ```
 
-This project does not support _reading_ of Numpy files in C.
+This project does not support _reading_ of Numpy files in C++.
 
 ## Installation
 
-Put `c2numpy.h` in your C or C++ project and compile. No libraries are required. Adheres to strict [ISO C99](http://www.iso-9899.info/wiki/The_Standard).
+Put `c2numpy.h` in your C++ project and compile. No libraries are required. Earlier versions of this worked with strict C99, but this project now requires C++.
 
 For an example and testing, `test.c` is provided. Compile and run it with
 
@@ -29,9 +29,9 @@ python -c "import numpy; print numpy.load(open('testout0.npy'));"
 python -c "import numpy; print numpy.load(open('testout1.npy'));"
 ```
 
-## C example
+## C++ example
 
-```c
+```c++
 // declare writer
 c2numpy_writer writer;
 
@@ -60,13 +60,15 @@ c2numpy_string(&writer, "THREE");
 c2numpy_close(&writer);
 ```
 
-## C API
+## C-like API
+
+The original version of this project could be used in pure C projects, and hence it has a pure C API. However, now that the internals require C++, this API will be replaced by a C++ API. This documentation will always be in sync with the codebase (in the same branch of GitHub).
 
 ### Enumeration constants for Numpy types: `c2numpy_type`
 
 See [number type definitions](http://docs.scipy.org/doc/numpy/user/basics.types.html) in the Numpy documentation.
 
-```c
+```c++
 C2NUMPY_BOOL        // Boolean (True or False) stored as a byte
 C2NUMPY_INT         // Default integer type (same as C long; normally either int64 or int32)
 C2NUMPY_INTC        // Identical to C int (normally int32 or int64)
@@ -103,18 +105,18 @@ Not currently supported:
 
 A writer contains the following fields. Some of them are internal, and all of them should be treated as read-only. Use the associated functions to manipulate.
 
-```c
+```c++
 typedef struct {
     char buffer[16];              // (internal) used for temporary copies in c2numpy_row
 
     FILE *file;                   // output file handle
-    char *outputFilePrefix;       // output file name, not including the rotating number and .npy
+    std::string outputFilePrefix; // output file name, not including the rotating number and .npy
     int64_t sizeSeekPosition;     // (internal) keep track of number of rows to modify before closing
     int64_t sizeSeekSize;         // (internal)
 
     int32_t numColumns;           // number of columns in the record array
-    char **columnNames;           // column names
-    c2numpy_type *columnTypes;    // column types
+    std::vector<std::string> columnNames;  // column names
+    std::vector<c2numpy_type> columnTypes; // column types
 
     int32_t numRowsPerFile;       // maximum number of rows per file
     int32_t currentColumn;        // current column number
@@ -125,15 +127,15 @@ typedef struct {
 
 ### Numpy description string from type: `c2numpy_descr`
 
-```c
+```c++
 const char *c2numpy_descr(c2numpy_type type);
 ```
 
 Rarely needed by typical users; converts a `c2numpy_type` to the corresponding Numpy "descr" string. **Returns** `NULL` if the `type` is invalid.
 
 ### Initialize a writer object: `c2numpy_init`
 
-```c
+```c++
 int c2numpy_init(c2numpy_writer *writer, const char *outputFilePrefix, int32_t numRowsPerFile);
 ```
 
@@ -148,7 +150,7 @@ This is the first function you should call on a new writer. After this, call `c2
 
 ### Add a column to the writer: `c2numpy_addcolumn`
 
-```c
+```c++
 int c2numpy_addcolumn(c2numpy_writer *writer, const char *name, c2numpy_type type);
 ```
 
@@ -163,7 +165,7 @@ This is the second function you should call on a new writer. Call it once for ea
 
 ### Optional open file: `c2numpy_open`
 
-```c
+```c++
 int c2numpy_open(c2numpy_writer *writer);
 ```
 
@@ -175,7 +177,7 @@ Open a file and write its header to disk. If you don't call this explicitly, wri
 
 The following suite of functions push one datum (item in a row/column) to the writer. They check the requested data type against the expected data type for the current column, but cannot prevent column-misalignment if all data types are the same.
 
-```c
+```c++
 int c2numpy_bool(c2numpy_writer *writer, int8_t data);        // "bool" is just a byte
 int c2numpy_int(c2numpy_writer *writer, int64_t data);        // Numpy's default int is 64-bit
 int c2numpy_intc(c2numpy_writer *writer, int data);           // the built-in C int
@@ -204,18 +206,14 @@ The string form, `c2numpy_string`, **only writes** the string `data`, so you are
 
 ### Required close file: `c2numpy_close`
 
-```c
+```c++
 int c2numpy_close(c2numpy_writer *writer);
 ```
 
 If you do not explicitly close the writer, your last file may be corrupted. Be sure to do this after your loop over data.
 
 **Returns:** 0 if successful and -1 otherwise.
 
-## C++ example and C++ API
-
-Not written yet (will wrap the C functions with C++ class structure using [__cplusplus](http://stackoverflow.com/a/6779715/1623645)).
-
 ## To do
 
    * Add convenience function to calculate number of rows for a target file size.
@@ -224,4 +222,4 @@ Not written yet (will wrap the C functions with C++ class structure using [__cpl
    * Faster guessing of header size and column types.
    * Float16 and complex numbers.
    * Distinct return values for different errors and documentation of those errors.
-   * Optional C++ API.
+   * C++ API.
diff --git a/c2numpy.h b/c2numpy.h
@@ -19,7 +19,11 @@
 #include <stdarg.h>
 #include <string.h>
 
-const char* C2NUMPY_VERSION = "1.1";
+#include <sstream>
+#include <string>
+#include <vector>
+
+const char* C2NUMPY_VERSION = "1.2";
 
 // http://docs.scipy.org/doc/numpy/user/basics.types.html
 typedef enum {
@@ -49,16 +53,14 @@ typedef enum {
 
 // a Numpy writer object
 typedef struct {
-    char buffer[16];              // (internal) used for temporary copies in c2numpy_row
-
     FILE *file;                   // output file handle
-    char *outputFilePrefix;       // output file name, not including the rotating number and .npy
+    std::string outputFilePrefix;       // output file name, not including the rotating number and .npy
     int64_t sizeSeekPosition;     // (internal) keep track of number of rows to modify before closing
     int64_t sizeSeekSize;         // (internal)
 
     int32_t numColumns;           // number of columns in the record array
-    char **columnNames;           // column names
-    c2numpy_type *columnTypes;    // column types
+    std::vector<std::string> columnNames;           // column names
+    std::vector<c2numpy_type> columnTypes;    // column types
 
     int32_t numRowsPerFile;       // maximum number of rows per file
     int32_t currentColumn;        // current column number
@@ -137,16 +139,13 @@ const char *c2numpy_descr(c2numpy_type type) {
     return NULL;
 }
 
-int c2numpy_init(c2numpy_writer *writer, const char *outputFilePrefix, int32_t numRowsPerFile) {
+int c2numpy_init(c2numpy_writer *writer, const std::string outputFilePrefix, int32_t numRowsPerFile) {
     writer->file = NULL;
-    writer->outputFilePrefix = (char*)malloc(strlen(outputFilePrefix) + 1);
-    strcpy(writer->outputFilePrefix, outputFilePrefix);
+    writer->outputFilePrefix = outputFilePrefix;
     writer->sizeSeekPosition = 0;
     writer->sizeSeekSize = 0;
 
     writer->numColumns = 0;
-    writer->columnNames = NULL;
-    writer->columnTypes = NULL;
 
     writer->numRowsPerFile = numRowsPerFile;
     writer->currentColumn = 0;
@@ -156,108 +155,67 @@ int c2numpy_init(c2numpy_writer *writer, const char *outputFilePrefix, int32_t n
     return 0;
 }
 
-int c2numpy_addcolumn(c2numpy_writer *writer, const char *name, c2numpy_type type) {
+int c2numpy_addcolumn(c2numpy_writer *writer, const std::string name, c2numpy_type type) {
     writer->numColumns += 1;
-
-    char *newColumnName = (char*)malloc(strlen(name) + 1);
-    strcpy(newColumnName, name);
-
-    char **oldColumnNames = writer->columnNames;
-    writer->columnNames = (char**)malloc(writer->numColumns * sizeof(char*));
-    for (int column = 0;  column < writer->numColumns - 1;  ++column)
-        writer->columnNames[column] = oldColumnNames[column];
-    writer->columnNames[writer->numColumns - 1] = newColumnName;
-    if (oldColumnNames != NULL)
-        free(oldColumnNames);
-
-    c2numpy_type *oldColumnTypes = writer->columnTypes;
-    writer->columnTypes = (c2numpy_type*)malloc(writer->numColumns * sizeof(c2numpy_type));
-    for (int column = 0;  column < writer->numColumns - 1;  ++column)
-        writer->columnTypes[column] = oldColumnTypes[column];
-    writer->columnTypes[writer->numColumns - 1] = type;
-    if (oldColumnTypes != NULL)
-        free(oldColumnTypes);
-
+    writer->columnNames.push_back(name);
+    writer->columnTypes.push_back(type);
     return 0;
 }
 
 int c2numpy_open(c2numpy_writer *writer) {
-    char *fileName = (char*)malloc(strlen(writer->outputFilePrefix) + 15);
-    sprintf(fileName, "%s%d.npy", writer->outputFilePrefix, writer->currentFileNumber);
-    writer->file = fopen(fileName, "wb");
-
-    // FIXME: better initial guess about header size before going in 128 byte increments
-    char *header = NULL;
-    for (int64_t headerSize = 128;  headerSize <= 4294967295;  headerSize += 128) {
-        if (header != NULL) free(header);
-        header = (char*)malloc(headerSize + 1);
-
-        char version1 = headerSize <= 65535;
-        uint32_t descrSize;
-        if (version1)
-            descrSize = headerSize - 10;
-        else
-            descrSize = headerSize - 12;
-
-        header[0] = 147;                            // magic
-        header[1] = 'N';
-        header[2] = 'U';
-        header[3] = 'M';
-        header[4] = 'P';
-        header[5] = 'Y';
-        if (version1) {
-            header[6] = 1;                          // format version 1.0
-            header[7] = 0;
-            const uint16_t descrSize2 = descrSize;
-            *(uint16_t*)(header + 8) = descrSize2;   // version 1.0 has a 16-byte descrSize
-        }
-        else {
-            header[6] = 2;                          // format version 2.0
-            header[7] = 0;
-            *(uint32_t*)(header + 8) = descrSize;   // version 2.0 has a 32-byte descrSize
-        }
+    std::stringstream fileNameStream;
+    fileNameStream << writer->outputFilePrefix;
+    fileNameStream << writer->currentFileNumber;
+    fileNameStream << ".npy";
+    std::string fileName = fileNameStream.str();
+    writer->file = fopen(fileName.c_str(), "wb");
+
+    std::stringstream headerStream;
+    headerStream << "{'descr': [";
+
+    int column;
+    for (column = 0;  column < writer->numColumns;  ++column) {
+      headerStream << "('" << writer->columnNames[column] << "', '" << c2numpy_descr(writer->columnTypes[column]) << "')";
+      if (column < writer->numColumns - 1)
+        headerStream << ", ";
+    }
 
-        int64_t offset = headerSize - descrSize;
-        offset += snprintf((header + offset), headerSize - offset + 1, "{'descr': [");
-        if (offset >= headerSize) continue;
+    headerStream << "], 'fortran_order': False, 'shape': (";
 
-        for (int column = 0;  column < writer->numColumns;  ++column) {
-            offset += snprintf((header + offset), headerSize - offset + 1, "('%s', '%s')",
-                              writer->columnNames[column],
-                              c2numpy_descr(writer->columnTypes[column]));
-            if (offset >= headerSize) continue;
+    writer->sizeSeekPosition = headerStream.str().size();
 
-            if (column < writer->numColumns - 1)
-                offset += snprintf((header + offset), headerSize - offset + 1, ", ");
-            if (offset >= headerSize) continue;
-        }
+    headerStream << writer->numRowsPerFile;
 
-        offset += snprintf((header + offset), headerSize - offset + 1, "], 'fortran_order': False, 'shape': (");
-        if (offset >= headerSize) continue;
+    writer->sizeSeekSize = headerStream.str().size() - writer->sizeSeekPosition;
 
-        writer->sizeSeekPosition = offset;
-        writer->sizeSeekSize = snprintf((header + offset), headerSize - offset + 1, "%d", writer->numRowsPerFile);
-        offset += writer->sizeSeekSize;
-        if (offset >= headerSize) continue;
+    headerStream << ",), }";
 
-        offset += snprintf((header + offset), headerSize - offset + 1, ",), }");
-        if (offset >= headerSize) continue;
+    int headerSize = headerStream.str().size();
+    char version = 1;
 
-        while (offset < headerSize) {
-            if (offset < headerSize - 1)
-                header[offset] = ' ';
-            else
-                header[offset] = '\n';
-            offset += 1;
-        }
-        header[headerSize] = 0;
-
-        fwrite(header, 1, headerSize, writer->file);
+    if (headerSize > 65535) version = 2;
+    while ((6 + 2 + (version == 1 ? 2 : 4) + headerSize) % 16 != 0) {
+      headerSize += 1;
+      headerStream << " ";
+      if (headerSize > 65535) version = 2;
+    }
 
-        return 0;
+    fwrite("\x93NUMPY", 1, 6, writer->file);
+    if (version == 1) {
+      fwrite("\x01\x00", 1, 2, writer->file);
+      fwrite(&headerSize, 1, 2, writer->file);
+      writer->sizeSeekPosition += 6 + 2 + 2;
+    }
+    else {
+      fwrite("\x02\x00", 1, 2, writer->file);
+      fwrite(&headerSize, 1, 4, writer->file);
+      writer->sizeSeekPosition += 6 + 2 + 4;
     }
 
-    return -1;
+    std::string header = headerStream.str();
+    fwrite(header.c_str(), 1, header.size(), writer->file);
+
+    return 0;
 }
 
 #define C2NUMPY_CHECK_ITEM {                                                    \
@@ -453,7 +411,8 @@ int c2numpy_close(c2numpy_writer *writer) {
             // so go back to the part of the header where that was written
             fseek(writer->file, writer->sizeSeekPosition, SEEK_SET);
             // overwrite it with spaces
-            for (int i = 0;  i < writer->sizeSeekSize;  ++i)
+            int i;
+            for (i = 0;  i < writer->sizeSeekSize;  ++i)
                 fputc(' ', writer->file);
             // now go back and write it again (it MUST be fewer or an equal number of digits)
             fseek(writer->file, writer->sizeSeekPosition, SEEK_SET);
@@ -463,13 +422,6 @@ int c2numpy_close(c2numpy_writer *writer) {
         fclose(writer->file);
     }
 
-    // and clear the malloc'ed memory
-    free(writer->outputFilePrefix);
-    for (int column = 0;  column < writer->numColumns;  ++column)
-        free(writer->columnNames[column]);
-    free(writer->columnNames);
-    free(writer->columnTypes);
-
     return 0;
 }
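
The rewritten `c2numpy_open` assembles the header with a `std::stringstream`, pads it with spaces to a 16-byte boundary, and writes `headerSize` straight from memory. As a point of comparison, here is a minimal standalone sketch of the same header layout. It is not part of this commit: `buildNpyPreamble` and `paddedLength` are hypothetical names. It differs from the diff in two ways that follow the published .npy format description: the length field is emitted byte by byte in little-endian order instead of relying on the host's byte order, and the padded dictionary ends with a newline.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: length of the dict once padded with spaces and
// terminated by '\n', so that magic + version + length field + dict
// is a multiple of 16 bytes.
static std::size_t paddedLength(std::size_t dictSize, std::size_t lengthBytes) {
    std::size_t unpadded = 6 + 2 + lengthBytes + dictSize + 1;  // +1 for '\n'
    std::size_t padding = (16 - unpadded % 16) % 16;
    return dictSize + padding + 1;
}

// Hypothetical helper (not in c2numpy): builds everything that precedes the
// row data in a .npy file holding a 1-D structured array.
// Each column is a (name, descr) pair such as {"one", "<i8"}.
std::string buildNpyPreamble(
        const std::vector<std::pair<std::string, std::string>>& columns,
        std::int64_t numRows) {
    std::string dict = "{'descr': [";
    for (std::size_t i = 0; i < columns.size(); ++i) {
        if (i > 0) dict += ", ";
        dict += "('" + columns[i].first + "', '" + columns[i].second + "')";
    }
    dict += "], 'fortran_order': False, 'shape': (" +
            std::to_string(numRows) + ",), }";

    // Format 1.0 stores the dict length in 2 bytes, format 2.0 in 4 bytes.
    char version = 1;
    std::size_t lengthBytes = 2;
    std::size_t headerLen = paddedLength(dict.size(), lengthBytes);
    if (headerLen > 65535) {
        version = 2;
        lengthBytes = 4;
        headerLen = paddedLength(dict.size(), lengthBytes);
    }

    std::string out("\x93NUMPY", 6);  // magic string
    out.push_back(version);           // major version
    out.push_back(0);                 // minor version
    for (std::size_t i = 0; i < lengthBytes; ++i)  // little-endian length
        out.push_back(static_cast<char>((headerLen >> (8 * i)) & 0xFF));

    out += dict;
    out.append(headerLen - dict.size() - 1, ' ');  // space padding
    out += '\n';                                   // terminating newline

    assert(out.size() % 16 == 0);
    return out;
}
```

Calling `buildNpyPreamble({{"one", "<i8"}, {"two", "<f8"}}, 1000)` returns a string that could be passed to `fwrite` before the row data. The sketch does not track the position of the row count for later rewriting, which the diff handles through `sizeSeekPosition` and `sizeSeekSize`.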