Add format metadata to each data to generated files. #70

Open
hamishmorgan opened this issue Apr 18, 2012 · 0 comments


It would greatly simplify the whole system if format metadata were stored in the data files themselves. The IO layer could then open each data file in the correct manner automatically, without being specifically directed by the higher layers. This would simplify everything by reducing coupling.

Information that should be stored in the files would include:

  • Whether entries are enumerated, and if so:
    • The relevant enumeration file(s)
    • Whether skip index is enabled on each column
    • The enumeration file format (JDBM vs TSV dump)
  • Whether compact format is enabled.
  • Character encoding
  • Ordering of the data.
  • Column names
  • Date written

Other information that may be useful but is harder to write in a header (because this information is not always available a priori) could include:

  • Number of records
  • Data statistics
  • Checksum

This feature could be implemented by adding comments and structured escaping to the file format.


# Sample accumulated events file
#
# This line is a comment
#
# The following line is blank (which should now be allowed)

# The following lines are meta data, denoted with the additional hash character
#
## type=weighted-token-pair
## column.names = entry,features,count
## column.types = string,string,integer
## entry.enumeration.enabled = true
## entry.enumeration.file = sample-thesaurus.entry-index
## entry.enumeration.format = jdbm
## entry.enumeration.skip = true
## feature.enumeration.file = sample-thesaurus.feature-index
## feature.enumeration.format = jdbm
## feature.enumeration.skip = true
## compact = true
## charset = UTF-8
## order = entry id asc, feature id asc
## date = 10:29:03 16-04-2012
#
#

1   1   4   1   1   2   1   2   6   3   4   2   3   8   2   3   2   15  1   6   2   3   1   2   12  1   22  1   11  1   11  1   74  1   3   1   10  1   148 1   82  1   164 1   53  1   34  1   83  1   72  1   32  1   1374    1   456 1   6   1   1376    1   12  1   189 1   429 1   143 1   50  1   127 1   994 1   455 1   1342    1   548 1   6   1   1131    1
1   -9497   7   2   1   1   34  2   18  1   1   4   1   26  5   3   2   2   1   1   2   30  3   17  14  1   1   1   27  1   20  3   2   2   113 1   12  1   2   1   22  1   54  1   110 1   114 1   45  1   37  1   4   1   9   1   3   1   49  4   26  1   67  1   47  1   12  1   27  1   192 1   191 1   38  1   50  1   3   2   11  1   28  1   53  1   19  1   193 1   159 1   7   1   43  1   183 1   64  2   66  1   140 1   66  1   46  1   227 1   787 1   135 1   151 1   43  1   596 1   26  1   388 1   174 1   19  1   5   1   1   1   1102    1   287 1   319 1   694 1   1591    1   475 1   529 1
1   -10011  2   3   2   1   1   1   3   3   3   8   1   1   1   1   1   1   1   35  1   1   1   1   226 1   383 1   626 1   612 1   719 1   1729    1   1   1   3632    2   883 1
1   -8867   2   3   1   2   10  1   1   2   7   10  1   3   1   11  1   1   3   14  3   4   1   1   15  1   3   1   1   1   32  1   162 1   38  1   53  1   19  4   32  1   1   1   97  1   154 1   158 1   39  1   369 1   18  1   624 1   1447    1   432 1   3233    1   1002    1   1   1   32  1
1   -8011   1
1   0   1
1   1   2   1   3   1   1   39  1   13  1   1053    1   347 1
1   -1457   2   2   1   1   5   2   6   1   3   16  1   22  1   13  1   12  1   85  1   110 2   143387  1   50  1   266 1   347 1   43  1   321 1   528 1   2502    1   12  1   1860    1
1   -6720   2   2   1   52  1   1400    1
1   -1457   3   2   12  1   1   2   5   2   1   24  1   4   1   3   3   28  1   1   1   3   1   16  16  2   34  1   18  1   701 1   37  1   167 1   28  1   1241    1   753 1   806 1   383 1   662 1   470 1   2807    1   1   1   227 1
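A reader for this layout might look roughly like the following. This is only a minimal Python sketch of the idea, not the project's actual IO code: the function name `read_with_metadata` is hypothetical, it assumes each `##` line holds a single `key = value` pair, and it assumes data fields are whitespace-separated. It does not attempt the structured escaping mentioned above, so a data line that genuinely starts with `#` would be misread as a comment.

```python
import io


def read_with_metadata(stream):
    """Split a text stream into (metadata dict, list of data rows).

    Assumed conventions, taken from the sample above:
      - blank lines are ignored
      - lines starting with '##' carry 'key = value' metadata
      - other lines starting with '#' are plain comments
      - everything else is a whitespace-separated data row
    """
    metadata = {}
    rows = []
    for raw in stream:
        line = raw.strip()
        if not line:
            continue  # blank lines are allowed
        if line.startswith("##"):
            key, sep, value = line[2:].partition("=")
            if sep:  # ignore '##' lines that have no '='
                metadata[key.strip()] = value.strip()
            continue
        if line.startswith("#"):
            continue  # ordinary comment
        rows.append(line.split())
    return metadata, rows


# A shortened version of the sample file, for illustration only.
sample = (
    "# Sample accumulated events file\n"
    "#\n"
    "\n"
    "## type=weighted-token-pair\n"
    "## column.names = entry,features,count\n"
    "## compact = true\n"
    "## charset = UTF-8\n"
    "#\n"
    "1\t-8011\t1\n"
    "1\t0\t1\n"
)

metadata, rows = read_with_metadata(io.StringIO(sample))
column_names = metadata["column.names"].split(",")
```

With the metadata parsed up front, the caller can decide how to interpret the rows (for example, whether `compact` is set, or which enumeration files to load) without being told by the higher layers.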

