Data Collection: Assemble Development Databases #6

Open
hrovira opened this issue Nov 18, 2013 · 7 comments

@hrovira
Contributor

hrovira commented Nov 18, 2013

Load the latest FMP run into a group of databases, one per tumor type, for use in development.

  • Consolidate reference data for use in RE and GS
  • Establish development server : KRAKEN
@spacepod
Contributor

Is the goal for this issue to have a list of the relevant FMP directories/files, or to have the contents of those files imported?

Likewise, for this issue, should the reference datasets be collected and listed, or actually imported?

@hrovira
Contributor Author

hrovira commented Nov 20, 2013

The goal is to have a set of databases, one per tumor type, and a datamodel.json that can be used in development for RE and GS. The databases should reside on the server, and connections should be allowed from dev workstations.

Reference data is a lesser priority for this task.

@spacepod
Contributor

FYI I've installed mongodb on kraken, under /local/mongodb. I've created /local/mongodb/bin which links to the current versions of the mongodb installation, and I've updated the www user's path accordingly.

@spacepod
Contributor

Notes: start mongo server under www user with

numactl --interleave=all mongod --dbpath /local/mongodb/db

see http://docs.mongodb.org/manual/administration/production-notes/#mongodb-on-numa-hardware
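
If the startup ever needs to be scripted, the command above could be assembled as an argument list. This is only a sketch: the function name build_mongod_command is hypothetical, and the only flags used are the numactl --interleave=all wrapper and --dbpath from the note above.

```python
import shlex


def build_mongod_command(dbpath="/local/mongodb/db", interleave=True):
    """Return the argv list for starting mongod, optionally under numactl."""
    cmd = ["mongod", "--dbpath", dbpath]
    if interleave:
        # Interleave memory across NUMA nodes, as in the note above.
        cmd = ["numactl", "--interleave=all"] + cmd
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it; pass the list to
    # subprocess.run(...) as the www user to actually start the server.
    print(shlex.join(build_mongod_command()))
```

Keeping the command as a list avoids shell quoting problems if the db path ever changes.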

@spacepod
Contributor

current filesystem structure:

mongodb files live here:

/local/mongodb/db

data files

$cr9/workspaces/canonical_datasets
|-- BLCA
|   `-- 20131113
|       |-- BLCA.SEQ.20131113.tsv
|       `-- BLCA.SEQ.20131113-provenance.tsv
|-- BRCA
|   `-- 20131113
|       |-- BRCA.SEQ.20131113.tsv
|       |-- BRCA.SEQ.20131113-provenance.tsv
|       |-- BRCA.ARY.20131113.tsv
|       `-- BRCA.ARY.20131113-provenance.tsv
…
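
A layout like the one above could be enumerated with a short script. This is a rough sketch, assuming every data file is named TUMOR.PLATFORM.YYYYMMDD.tsv and sits under TUMOR/YYYYMMDD/; the function name find_datasets and the regular expression are illustrative, not something already in the workspace.

```python
import re
from pathlib import Path

# Assumed naming pattern: <TUMOR>.<PLATFORM>.<YYYYMMDD>.tsv
# Provenance files (…-provenance.tsv) do not match and are skipped.
DATASET_RE = re.compile(r"^([A-Z0-9]+)\.([A-Z0-9]+)\.(\d{8})\.tsv$")


def find_datasets(root):
    """Return sorted (tumor_type, platform, date, path) tuples found under root."""
    found = []
    for path in Path(root).glob("*/*/*.tsv"):
        match = DATASET_RE.match(path.name)
        if match is None:
            continue
        tumor, platform, date = match.groups()
        # Only accept files whose names agree with their parent directories.
        if path.parent.name == date and path.parent.parent.name == tumor:
            found.append((tumor, platform, date, path))
    return sorted(found)
```

Calling find_datasets on the canonical_datasets directory would then give one tuple per feature matrix file, which is enough to drive an import loop.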

@spacepod
Contributor

Resolved.

A bare-bones sample datamodel.json with only one tumor type is available for review at $cr9/workspaces/canonical_datasets.json.
All SEQ data has been loaded into the local mongodb. Each tumor type per platform per date is one database, such as:
BRCA-SEQ-20131113
Each database currently contains one collection: feature_matrix
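
For dev code that needs to pick a database, the naming scheme above can be written down as a pair of helpers. This is a sketch only; database_name and parse_database_name are hypothetical names, and it assumes tumor type and platform never contain a dash themselves.

```python
COLLECTION = "feature_matrix"  # the single collection in each database


def database_name(tumor_type, platform, date, sep="-"):
    """Build a name such as BRCA-SEQ-20131113 from its three parts."""
    return sep.join([tumor_type.upper(), platform.upper(), date])


def parse_database_name(name, sep="-"):
    """Split a database name back into tumor type, platform and date."""
    parts = name.split(sep)
    if len(parts) != 3:
        raise ValueError(f"unexpected database name: {name!r}")
    tumor_type, platform, date = parts
    return {"tumor_type": tumor_type, "platform": platform, "date": date}
```

With a driver such as pymongo, the result of database_name would be used as the database key and COLLECTION as the collection name.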

Optional changes which can be discussed:

  1. renaming databases with underscores instead of dashes, if that is an existing convention;
  2. dropping the platform type (SEQ vs ARY) from the db name, so that each type instead exists as a separate feature_matrix collection (feature_matrix_seq, etc.), with some metadata in the datamodel.json.
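
To make the two options easier to compare, here is a sketch of what each would produce. Everything in it is an assumption for discussion: the function names are hypothetical, option 2 keeps the dash separator only because nothing else has been agreed, and the returned dict is not the actual datamodel.json format.

```python
def underscore_name(db_name):
    """Option 1: the current name with dashes replaced by underscores."""
    return db_name.replace("-", "_")


def per_platform_layout(tumor_type, date, platforms, sep="-"):
    """Option 2: one database per tumor type and date, one collection per platform."""
    return {
        "database": sep.join([tumor_type, date]),
        # Map each platform to its own collection, e.g. SEQ -> feature_matrix_seq.
        "collections": {p: "feature_matrix_" + p.lower() for p in platforms},
    }
```

For example, underscore_name("BRCA-SEQ-20131113") gives BRCA_SEQ_20131113, while per_platform_layout("BRCA", "20131113", ["SEQ", "ARY"]) describes a single BRCA-20131113 database holding feature_matrix_seq and feature_matrix_ary.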

@spacepod spacepod reopened this Nov 27, 2013
@spacepod
Contributor

(awaiting hrovira's comments and/or closing issue)
