This repository has been archived by the owner on Mar 31, 2019. It is now read-only.

Merge pull request #4 from diana-hep/introduce-cpp-to-fix-column-names-bug

Introduce cpp to fix column names bug
jpivarski authored Oct 24, 2016
2 parents 2d0a3c9 + a36a00d commit 3adedba
Showing 3 changed files with 80 additions and 145 deletions.
42 changes: 20 additions & 22 deletions README.md
@@ -1,6 +1,6 @@
# c2numpy

Write Numpy (.npy) files from C or C++ for analysis in [Numpy](http://www.numpy.org/), [Scipy](https://www.scipy.org/), [Scikit-Learn](http://scikit-learn.org/stable/), [Pandas](http://pandas.pydata.org/), etc.
Write Numpy (.npy) files from C++ for analysis in [Numpy](http://www.numpy.org/), [Scipy](https://www.scipy.org/), [Scikit-Learn](http://scikit-learn.org/stable/), [Pandas](http://pandas.pydata.org/), etc.

Fills a collection of .npy files with a maximum size, a common prefix and a rotating number (like rotating log files). Each file contains one [structured array](http://docs.scipy.org/doc/numpy/user/basics.rec.html), consisting of named, typed columns (numbers and fixed-size strings) and many rows. In Python, you access rows and columns with string and integer indexing:

@@ -10,11 +10,11 @@ myarray[3:5] # all columns, slice of rows
# etc.
```
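
The rotating-file behaviour described above (a common prefix plus an increasing number) can be sketched in a few lines of C++. This is only an illustration: `rotating_file_name` and `file_number_for_row` are hypothetical helpers that are not part of `c2numpy.h`, and the sketch assumes that a new file is started every `numRowsPerFile` rows.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (not in c2numpy.h): builds "<prefix><number>.npy",
// e.g. "testout0.npy", "testout1.npy", ...
std::string rotating_file_name(const std::string &prefix, int fileNumber) {
  assert(fileNumber >= 0);
  std::stringstream name;
  name << prefix << fileNumber << ".npy";
  return name.str();
}

// Hypothetical helper: index of the file that a zero-based row would land in,
// assuming every file holds at most numRowsPerFile rows.
int file_number_for_row(long row, int numRowsPerFile) {
  assert(row >= 0 && numRowsPerFile > 0);
  return static_cast<int>(row / numRowsPerFile);
}
```

Under these assumptions, row 250 with 100 rows per file would be written to the file named by `rotating_file_name("testout", file_number_for_row(250, 100))`.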

This project does not support _reading_ of Numpy files in C.
This project does not support _reading_ of Numpy files in C++.

## Installation

Put `c2numpy.h` in your C or C++ project and compile. No libraries are required. Adheres to strict [ISO C99](http://www.iso-9899.info/wiki/The_Standard).
Put `c2numpy.h` in your C++ project and compile. No libraries are required. Earlier versions of this worked with strict C99, but this project now requires C++.

For an example and testing, `test.c` is provided. Compile and run it with

@@ -29,9 +29,9 @@ python -c "import numpy; print numpy.load(open('testout0.npy'));"
python -c "import numpy; print numpy.load(open('testout1.npy'));"
```

## C example
## C++ example

```c
```c++
// declare writer
c2numpy_writer writer;

@@ -60,13 +60,15 @@ c2numpy_string(&writer, "THREE");
c2numpy_close(&writer);
```
## C API
## C-like API
The original version of this project could be used in pure C projects, and hence it has a pure C API. However, now that the internals require C++, this API will be replaced by a C++ API. This documentation will always be in sync with the codebase (in the same branch of GitHub).
### Enumeration constants for Numpy types: `c2numpy_type`
See [number type definitions](http://docs.scipy.org/doc/numpy/user/basics.types.html) in the Numpy documentation.
```c
```c++
C2NUMPY_BOOL // Boolean (True or False) stored as a byte
C2NUMPY_INT // Default integer type (same as C long; normally either int64 or int32)
C2NUMPY_INTC // Identical to C int (normally int32 or int64)
@@ -103,18 +105,18 @@ Not currently supported:

A writer contains the following fields. Some of them are internal, and all of them should be treated as read-only. Use the associated functions to manipulate.

```c
```c++
typedef struct {
char buffer[16]; // (internal) used for temporary copies in c2numpy_row

FILE *file; // output file handle
char *outputFilePrefix; // output file name, not including the rotating number and .npy
std::string outputFilePrefix; // output file name, not including the rotating number and .npy
int64_t sizeSeekPosition; // (internal) keep track of number of rows to modify before closing
int64_t sizeSeekSize; // (internal)

int32_t numColumns; // number of columns in the record array
char **columnNames; // column names
c2numpy_type *columnTypes; // column types
std::vector<std::string> columnNames; // column names
std::vector<c2numpy_type> columnTypes; // column types

int32_t numRowsPerFile; // maximum number of rows per file
int32_t currentColumn; // current column number
@@ -125,15 +127,15 @@ typedef struct {

### Numpy description string from type: `c2numpy_descr`

```c
```c++
const char *c2numpy_descr(c2numpy_type type);
```
Rarely needed by typical users; converts a `c2numpy_type` to the corresponding Numpy "descr" string. **Returns** `NULL` if the `type` is invalid.
### Initialize a writer object: `c2numpy_init`
```c
```c++
int c2numpy_init(c2numpy_writer *writer, const char *outputFilePrefix, int32_t numRowsPerFile);
```

@@ -148,7 +150,7 @@ This is the first function you should call on a new writer. After this, call `c2

### Add a column to the writer: `c2numpy_addcolumn`

```c
```c++
int c2numpy_addcolumn(c2numpy_writer *writer, const char *name, c2numpy_type type);
```
@@ -163,7 +165,7 @@ This is the second function you should call on a new writer. Call it once for ea
### Optional open file: `c2numpy_open`
```c
```c++
int c2numpy_open(c2numpy_writer *writer);
```

@@ -175,7 +177,7 @@ Open a file and write its header to disk. If you don't call this explicitly, wri

The following suite of functions push one datum (item in a row/column) to the writer. They check the requested data type against the expected data type for the current column, but cannot prevent column-misalignment if all data types are the same.

```c
```c++
int c2numpy_bool(c2numpy_writer *writer, int8_t data); // "bool" is just a byte
int c2numpy_int(c2numpy_writer *writer, int64_t data); // Numpy's default int is 64-bit
int c2numpy_intc(c2numpy_writer *writer, int data); // the built-in C int
@@ -204,18 +206,14 @@ The string form, `c2numpy_string`, **only writes** the string `data`, so you are
### Required close file: `c2numpy_close`
```c
```c++
int c2numpy_close(c2numpy_writer *writer);
```

If you do not explicitly close the writer, your last file may be corrupted. Be sure to do this after your loop over data.

**Returns:** 0 if successful and -1 otherwise.
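
The reason closing matters can be illustrated with a small sketch. The header is written with a placeholder row count, and the real count is patched in when the file is closed. The helper below is hypothetical (it is not the code in `c2numpy.h`); it assumes the caller recorded the byte offset and width of the placeholder when the header was written, which is the role `sizeSeekPosition` and `sizeSeekSize` appear to play in the writer struct.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: overwrite the `width`-byte row-count field that starts
// at byte `position` with `rows`, left-aligned and padded with spaces.
// Returns false if the new number has more digits than the field can hold.
bool rewrite_row_count(std::FILE *file, long position, int width, long rows) {
  assert(file != nullptr && width > 0 && rows >= 0);
  std::string digits = std::to_string(rows);
  if (digits.size() > static_cast<std::size_t>(width)) return false;
  digits.append(static_cast<std::size_t>(width) - digits.size(), ' ');
  if (std::fseek(file, position, SEEK_SET) != 0) return false;
  return std::fwrite(digits.data(), 1, digits.size(), file) == digits.size();
}
```

If this step is skipped, the header still claims the placeholder number of rows, so a reader would expect more data than the last file actually contains.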

## C++ example and C++ API

Not written yet (will wrap the C functions with C++ class structure using [__cplusplus](http://stackoverflow.com/a/6779715/1623645)).

## To do

* Add convenience function to calculate number of rows for a target file size.
@@ -224,4 +222,4 @@ Not written yet (will wrap the C functions with C++ class structure using [__cpl
* Faster guessing of header size and column types.
* Float16 and complex numbers.
* Distinct return values for different errors and documentation of those errors.
* Optional C++ API.
* C++ API.
164 changes: 58 additions & 106 deletions c2numpy.h
@@ -19,7 +19,11 @@
#include <stdarg.h>
#include <string.h>

const char* C2NUMPY_VERSION = "1.1";
#include <sstream>
#include <string>
#include <vector>

const char* C2NUMPY_VERSION = "1.2";

// http://docs.scipy.org/doc/numpy/user/basics.types.html
typedef enum {
@@ -49,16 +53,14 @@ typedef enum {

// a Numpy writer object
typedef struct {
char buffer[16]; // (internal) used for temporary copies in c2numpy_row

FILE *file; // output file handle
char *outputFilePrefix; // output file name, not including the rotating number and .npy
std::string outputFilePrefix; // output file name, not including the rotating number and .npy
int64_t sizeSeekPosition; // (internal) keep track of number of rows to modify before closing
int64_t sizeSeekSize; // (internal)

int32_t numColumns; // number of columns in the record array
char **columnNames; // column names
c2numpy_type *columnTypes; // column types
std::vector<std::string> columnNames; // column names
std::vector<c2numpy_type> columnTypes; // column types

int32_t numRowsPerFile; // maximum number of rows per file
int32_t currentColumn; // current column number
@@ -137,16 +139,13 @@ const char *c2numpy_descr(c2numpy_type type) {
return NULL;
}

int c2numpy_init(c2numpy_writer *writer, const char *outputFilePrefix, int32_t numRowsPerFile) {
int c2numpy_init(c2numpy_writer *writer, const std::string outputFilePrefix, int32_t numRowsPerFile) {
writer->file = NULL;
writer->outputFilePrefix = (char*)malloc(strlen(outputFilePrefix) + 1);
strcpy(writer->outputFilePrefix, outputFilePrefix);
writer->outputFilePrefix = outputFilePrefix;
writer->sizeSeekPosition = 0;
writer->sizeSeekSize = 0;

writer->numColumns = 0;
writer->columnNames = NULL;
writer->columnTypes = NULL;

writer->numRowsPerFile = numRowsPerFile;
writer->currentColumn = 0;
@@ -156,108 +155,67 @@ int c2numpy_init(c2numpy_writer *writer, const char *outputFilePrefix, int32_t n
return 0;
}

int c2numpy_addcolumn(c2numpy_writer *writer, const char *name, c2numpy_type type) {
int c2numpy_addcolumn(c2numpy_writer *writer, const std::string name, c2numpy_type type) {
writer->numColumns += 1;

char *newColumnName = (char*)malloc(strlen(name) + 1);
strcpy(newColumnName, name);

char **oldColumnNames = writer->columnNames;
writer->columnNames = (char**)malloc(writer->numColumns * sizeof(char*));
for (int column = 0; column < writer->numColumns - 1; ++column)
writer->columnNames[column] = oldColumnNames[column];
writer->columnNames[writer->numColumns - 1] = newColumnName;
if (oldColumnNames != NULL)
free(oldColumnNames);

c2numpy_type *oldColumnTypes = writer->columnTypes;
writer->columnTypes = (c2numpy_type*)malloc(writer->numColumns * sizeof(c2numpy_type));
for (int column = 0; column < writer->numColumns - 1; ++column)
writer->columnTypes[column] = oldColumnTypes[column];
writer->columnTypes[writer->numColumns - 1] = type;
if (oldColumnTypes != NULL)
free(oldColumnTypes);

writer->columnNames.push_back(name);
writer->columnTypes.push_back(type);
return 0;
}

int c2numpy_open(c2numpy_writer *writer) {
char *fileName = (char*)malloc(strlen(writer->outputFilePrefix) + 15);
sprintf(fileName, "%s%d.npy", writer->outputFilePrefix, writer->currentFileNumber);
writer->file = fopen(fileName, "wb");

// FIXME: better initial guess about header size before going in 128 byte increments
char *header = NULL;
for (int64_t headerSize = 128; headerSize <= 4294967295; headerSize += 128) {
if (header != NULL) free(header);
header = (char*)malloc(headerSize + 1);

char version1 = headerSize <= 65535;
uint32_t descrSize;
if (version1)
descrSize = headerSize - 10;
else
descrSize = headerSize - 12;

header[0] = 147; // magic
header[1] = 'N';
header[2] = 'U';
header[3] = 'M';
header[4] = 'P';
header[5] = 'Y';
if (version1) {
header[6] = 1; // format version 1.0
header[7] = 0;
const uint16_t descrSize2 = descrSize;
*(uint16_t*)(header + 8) = descrSize2; // version 1.0 has a 16-byte descrSize
}
else {
header[6] = 2; // format version 2.0
header[7] = 0;
*(uint32_t*)(header + 8) = descrSize; // version 2.0 has a 32-byte descrSize
}
std::stringstream fileNameStream;
fileNameStream << writer->outputFilePrefix;
fileNameStream << writer->currentFileNumber;
fileNameStream << ".npy";
std::string fileName = fileNameStream.str();
writer->file = fopen(fileName.c_str(), "wb");

std::stringstream headerStream;
headerStream << "{'descr': [";

int column;
for (column = 0; column < writer->numColumns; ++column) {
headerStream << "('" << writer->columnNames[column] << "', '" << c2numpy_descr(writer->columnTypes[column]) << "')";
if (column < writer->numColumns - 1)
headerStream << ", ";
}

int64_t offset = headerSize - descrSize;
offset += snprintf((header + offset), headerSize - offset + 1, "{'descr': [");
if (offset >= headerSize) continue;
headerStream << "], 'fortran_order': False, 'shape': (";

for (int column = 0; column < writer->numColumns; ++column) {
offset += snprintf((header + offset), headerSize - offset + 1, "('%s', '%s')",
writer->columnNames[column],
c2numpy_descr(writer->columnTypes[column]));
if (offset >= headerSize) continue;
writer->sizeSeekPosition = headerStream.str().size();

if (column < writer->numColumns - 1)
offset += snprintf((header + offset), headerSize - offset + 1, ", ");
if (offset >= headerSize) continue;
}
headerStream << writer->numRowsPerFile;

offset += snprintf((header + offset), headerSize - offset + 1, "], 'fortran_order': False, 'shape': (");
if (offset >= headerSize) continue;
writer->sizeSeekSize = headerStream.str().size() - writer->sizeSeekPosition;

writer->sizeSeekPosition = offset;
writer->sizeSeekSize = snprintf((header + offset), headerSize - offset + 1, "%d", writer->numRowsPerFile);
offset += writer->sizeSeekSize;
if (offset >= headerSize) continue;
headerStream << ",), }";

offset += snprintf((header + offset), headerSize - offset + 1, ",), }");
if (offset >= headerSize) continue;
int headerSize = headerStream.str().size();
char version = 1;

while (offset < headerSize) {
if (offset < headerSize - 1)
header[offset] = ' ';
else
header[offset] = '\n';
offset += 1;
}
header[headerSize] = 0;

fwrite(header, 1, headerSize, writer->file);
if (headerSize > 65535) version = 2;
while ((6 + 2 + (version == 1 ? 2 : 4) + headerSize) % 16 != 0) {
headerSize += 1;
headerStream << " ";
if (headerSize > 65535) version = 2;
}

return 0;
fwrite("\x93NUMPY", 1, 6, writer->file);
if (version == 1) {
fwrite("\x01\x00", 1, 2, writer->file);
fwrite(&headerSize, 1, 2, writer->file);
writer->sizeSeekPosition += 6 + 2 + 2;
}
else {
fwrite("\x02\x00", 1, 2, writer->file);
fwrite(&headerSize, 1, 4, writer->file);
writer->sizeSeekPosition += 6 + 2 + 4;
}

return -1;
std::string header = headerStream.str();
fwrite(header.c_str(), 1, header.size(), writer->file);

return 0;
}

#define C2NUMPY_CHECK_ITEM { \
@@ -453,7 +411,8 @@ int c2numpy_close(c2numpy_writer *writer) {
// so go back to the part of the header where that was written
fseek(writer->file, writer->sizeSeekPosition, SEEK_SET);
// overwrite it with spaces
for (int i = 0; i < writer->sizeSeekSize; ++i)
int i;
for (i = 0; i < writer->sizeSeekSize; ++i)
fputc(' ', writer->file);
// now go back and write it again (it MUST be fewer or an equal number of digits)
fseek(writer->file, writer->sizeSeekPosition, SEEK_SET);
@@ -463,13 +422,6 @@ int c2numpy_close(c2numpy_writer *writer) {
fclose(writer->file);
}

// and clear the malloc'ed memory
free(writer->outputFilePrefix);
for (int column = 0; column < writer->numColumns; ++column)
free(writer->columnNames[column]);
free(writer->columnNames);
free(writer->columnTypes);

return 0;
}
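
For readers following the header logic in the rewritten `c2numpy_open`, here is a standalone sketch of the same idea: wrap the header dictionary in the `.npy` preamble and pad it so that the data begins on a 16-byte boundary. It is not the committed code. `wrap_npy_header` is a hypothetical name, and the sketch differs from the diff in two labelled ways: it writes the length field byte by byte in little-endian order instead of copying an `int` from memory, and it ends the padded header with a newline, as the NPY format description asks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns magic + version + length field + padded header.
// `dict` is the Python-literal dictionary text, e.g. "{'descr': ..., }".
std::string wrap_npy_header(const std::string &dict) {
  // Header length (dict + padding + '\n') for a given length-field width,
  // chosen so that the total preamble size is a multiple of 16 bytes.
  auto paddedLength = [&dict](std::size_t lengthBytes) {
    std::size_t total = 6 + 2 + lengthBytes + dict.size() + 1;
    return dict.size() + 1 + (16 - total % 16) % 16;
  };

  int version = 1;               // format 1.0: 2-byte header length
  std::size_t lengthBytes = 2;
  std::size_t headerLen = paddedLength(lengthBytes);
  if (headerLen > 65535) {       // too long for 2 bytes: use format 2.0
    version = 2;
    lengthBytes = 4;
    headerLen = paddedLength(lengthBytes);
  }
  assert(headerLen <= 0xffffffffu);

  std::string out;
  out.push_back(static_cast<char>(0x93));  // magic byte, then "NUMPY"
  out += "NUMPY";
  out.push_back(static_cast<char>(version));
  out.push_back('\0');                     // minor version 0
  for (std::size_t i = 0; i < lengthBytes; ++i)  // little-endian length
    out.push_back(static_cast<char>((headerLen >> (8 * i)) & 0xff));
  out += dict;
  out.append(headerLen - dict.size() - 1, ' ');  // space padding
  out.push_back('\n');                           // terminating newline
  return out;
}
```

The returned string could be written with a single `fwrite`, and the offset of the row count inside the file is then the size of the preamble (10 or 12 bytes) plus its offset inside `dict`.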
