Add format metadata to generated data files. #70

hamishmorgan · 2012-04-18T09:33:10Z

It would greatly simplify the whole system if format metadata were stored in the data files themselves. The IO layer could then open each data file in the correct manner automatically, without having to be specifically directed by the higher layers. This would simplify everything by reducing coupling.

Information that should be stored in the files would include:

- Whether entries are enumerated, and if so:
  - The relevant enumeration file(s)
  - Whether the skip index is enabled on each column
  - The enumeration file format (JDBM vs TSV dump)
- Whether compact format is enabled
- Character encoding
- Ordering of the data
- Column names
- Date written
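
As a rough illustration of how a writer might emit the fields listed above, here is a minimal Python sketch. It assumes the `## key = value` line convention used in the sample file later in this issue. The function name `format_header` and the type handling are hypothetical and not part of the existing code base. Values are assumed to contain no newlines, so no escaping is attempted.

```python
from datetime import datetime


def format_header(metadata: dict) -> str:
    """Render metadata as '## key = value' lines, one per entry.

    Hypothetical helper: booleans become 'true'/'false', sequences are
    comma-joined, and datetimes use the HH:MM:SS DD-MM-YYYY layout seen
    in the sample file.
    """
    lines = []
    for key, value in metadata.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        elif isinstance(value, datetime):
            value = value.strftime("%H:%M:%S %d-%m-%Y")
        lines.append(f"## {key} = {value}")
    return "".join(line + "\n" for line in lines)


header = format_header({
    "column.names": ["entry", "features", "count"],
    "entry.enumeration.enabled": True,
    "entry.enumeration.file": "sample-thesaurus.entry-index",
    "entry.enumeration.format": "jdbm",
    "entry.enumeration.skip": True,
    "compact": True,
    "charset": "UTF-8",
    "order": "entry id asc, feature id asc",
    "date": datetime(2012, 4, 16, 10, 29, 3),
})
```

The writer would place `header` at the top of the file before any records, so that everything known up front travels with the data.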

Other information that may be useful but is harder to write in a header (because this information is not always available a priori) could include:

- Number of records
- Data statistics
- Checksum

This feature could be implemented by adding comments and structured escaping to the file format, as in the following sample:


# Sample accumulated events file
#
# This line is a comment
#
# The following line is blank (which should now be allowed)

# The following lines are metadata, denoted by an additional hash character
#
## type = weighted-token-pair
## column.names = entry,features,count
## column.types = string,string,integer
## entry.enumeration.enabled = true
## entry.enumeration.file = sample-thesaurus.entry-index
## entry.enumeration.format = jdbm
## entry.enumeration.skip = true
## feature.enumeration.file = sample-thesaurus.feature-index
## feature.enumeration.format = jdbm
## feature.enumeration.skip = true
## compact = true
## charset = UTF-8
## order = entry id asc, feature id asc
## date = 10:29:03 16-04-2012
#
#

1   1   4   1   1   2   1   2   6   3   4   2   3   8   2   3   2   15  1   6   2   3   1   2   12  1   22  1   11  1   11  1   74  1   3   1   10  1   148 1   82  1   164 1   53  1   34  1   83  1   72  1   32  1   1374    1   456 1   6   1   1376    1   12  1   189 1   429 1   143 1   50  1   127 1   994 1   455 1   1342    1   548 1   6   1   1131    1
1   -9497   7   2   1   1   34  2   18  1   1   4   1   26  5   3   2   2   1   1   2   30  3   17  14  1   1   1   27  1   20  3   2   2   113 1   12  1   2   1   22  1   54  1   110 1   114 1   45  1   37  1   4   1   9   1   3   1   49  4   26  1   67  1   47  1   12  1   27  1   192 1   191 1   38  1   50  1   3   2   11  1   28  1   53  1   19  1   193 1   159 1   7   1   43  1   183 1   64  2   66  1   140 1   66  1   46  1   227 1   787 1   135 1   151 1   43  1   596 1   26  1   388 1   174 1   19  1   5   1   1   1   1102    1   287 1   319 1   694 1   1591    1   475 1   529 1
1   -10011  2   3   2   1   1   1   3   3   3   8   1   1   1   1   1   1   1   35  1   1   1   1   226 1   383 1   626 1   612 1   719 1   1729    1   1   1   3632    2   883 1
1   -8867   2   3   1   2   10  1   1   2   7   10  1   3   1   11  1   1   3   14  3   4   1   1   15  1   3   1   1   1   32  1   162 1   38  1   53  1   19  4   32  1   1   1   97  1   154 1   158 1   39  1   369 1   18  1   624 1   1447    1   432 1   3233    1   1002    1   1   1   32  1
1   -8011   1
1   0   1
1   1   2   1   3   1   1   39  1   13  1   1053    1   347 1
1   -1457   2   2   1   1   5   2   6   1   3   16  1   22  1   13  1   12  1   85  1   110 2   143387  1   50  1   266 1   347 1   43  1   321 1   528 1   2502    1   12  1   1860    1
1   -6720   2   2   1   52  1   1400    1
1   -1457   3   2   12  1   1   2   5   2   1   24  1   4   1   3   3   28  1   1   1   3   1   16  16  2   34  1   18  1   701 1   37  1   167 1   28  1   1241    1   753 1   806 1   383 1   662 1   470 1   2807    1   1   1   227 1
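
To show how the IO layer might consume a file like the sample above, here is a minimal Python sketch of a reader. It assumes the conventions the sample suggests: `##` lines carry `key = value` metadata, single `#` lines are comments, blank lines are ignored, and every other line is a record of whitespace-separated fields. The name `read_data_file` is hypothetical. The sketch also assumes the stream has already been decoded to text, and it does not handle escaping of records that begin with `#`.

```python
import io


def read_data_file(stream):
    """Split a data file into (metadata, records).

    Assumed conventions, taken from the sample file:
      '## key = value' -> metadata entry
      '# ...'          -> comment, ignored
      blank line       -> ignored
      anything else    -> one record of whitespace-separated fields
    """
    metadata = {}
    records = []
    for raw in stream:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("##"):
            key, sep, value = line[2:].partition("=")
            if sep:  # skip malformed metadata lines that lack '='
                metadata[key.strip()] = value.strip()
        elif line.startswith("#"):
            continue
        else:
            records.append(line.split())
    return metadata, records


# A shortened version of the sample file, for illustration only.
sample = io.StringIO(
    "# Sample accumulated events file\n"
    "#\n"
    "\n"
    "## type=weighted-token-pair\n"
    "## column.names = entry,features,count\n"
    "## compact = true\n"
    "## charset = UTF-8\n"
    "#\n"
    "1\t-8011\t1\n"
    "1\t0\t1\n"
)
metadata, records = read_data_file(sample)

# Higher layers no longer need to pass these in; they come from the file.
compact = metadata.get("compact", "false") == "true"
columns = metadata["column.names"].split(",")
```

With something along these lines, the decision about compact format, enumeration, and column layout is driven by the file itself rather than by flags supplied from the higher layers.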
