filesdb is a simple utility to of files generated under different experimental conditions.
Keeping track of files generated in computational experiments is very important,
and gets complicated quickly. For simple problems, using filenames is sufficient
(e.g., a-1_b-2.txt
). However, sometimes code changes, or more parameters
become important, and now filenames may look like
a-1_b-2_c-5_version-2.txt
, breaking your naming scheme and
complicating any code you might have that parses file names.
filesdb solves these problems by tracking all your files in an sqlite database. Additional parameters can be added at any time. The database can quickly be searched by parameter values using a command line or python interface.
Say you are have some code that generates an output file and takes two parameters, a and b. To get a new filename and add it to the database:
Bash | Python |
---|---|
filename=$(filesdb add a=${a} b=${b}) |
filename = filesdb.add({'a': a, 'b': b}) |
Then ${filename}
can be passed as the output parameter to your code. (filesdb
does not actually create the file.) This process can be repeated for different
values of a and b. You can also specify the filename yourself:
Bash | Python |
---|---|
filesdb add --filename=myfile.txt a=${a} b=${b} |
filesdb.add({'a': a, 'b': b}, filename='myfile.txt') |
Note that keys cannot end in "!".
To list all the files with a=${a}
Bash | Python |
---|---|
filesdb search a=${a} |
entries = filesdb.search({'a': a}) |
Later, if you decided you also want to track files by a new parameter, (e.g, c), you can simply add a new parameter:
Bash | Python |
---|---|
filename=$(filesdb add a=${a} b=${b} c=${c}) |
filename = filesdb.add({'a': a, 'b': b, 'c': c}) |
Finally, let's say you discover a bug and your code, and all files generated with a=0 are invalid and you decide to delete them.
Bash | Python |
---|---|
filesdb delete a=0 |
deleted_entries = filesdb.delete({'a': 0}) |
This WILL delete the file from the file system as well as remove it from the database. The bash version will also print the deleted entries. To make sure that you're only deleting the files you want, delete has a dry run option, which prints (or returns, in the python case) the entries to be deleted but does not actually delete them.
Bash | Python |
---|---|
filesdb delete --dry_run a=0 |
entries = filesdb.delete({'a': 0}, dryrun=True) |
Both search and delete support "not equal" comparisons.
Bash | Python |
---|---|
filesdb search a!=${a} |
entries = filesdb.search({'a!': a}) |
Note that this syntax prevents keys from ending in "!".
The copy command copies a file and its data base entry to a new directory. For example:
Bash | Python |
---|---|
N/A | filesdb.copy(filename, outdir, db='files.db', wd='.', outdb='files.db') |
Will copy the ${filename}
in the current directory to ${outdir}
,
and copy its row entry to files.db
in the current director to
files.db
in ${outdir}
.
Note that this method does not currently have a command line interface
Merge one database into another.
Bash | Python |
---|---|
filesdb --wd='.' --db=files.db merge files2.db |
filesdb.merge('files2.db', 'files.db', wd='.') |
will add all entries from files2.db
to files.db
(if they aren't already
present)
For values that are mostly the same for every experiment (e.g., the versions of each software package used), filesdb supports a separate environment dictionary. (Internally, this is stored as a separate table, with an identifying environment hash value in the main table.)
filesdb.add({'a': a, 'b': b}, filename='myfile.txt',
environment={'pkgname1': '0.1.1', pkgname2: '2.1.0'})
To return environment data with search results (via a JOIN):
filesdb.search({}, with_environments=True)
To search based on environment values:
filesdb.search({}, environment={'pkgname1': '0.1.1'})