
File Formats

ypnos edited this page Feb 4, 2021 · 2 revisions

Datasets

Belki imports datasets from tab-separated values (TSV) files. These are text files that represent tabular data and can be written by spreadsheet applications or data processing software, e.g. Python Pandas.

Data points

In general, Belki operates on multidimensional data, i.e. a vector of numerical values for each protein, with the same length for all proteins. The dimensions can be freely named.

Practical example: You perform several MS measurements and would like to visualize their combined output in Belki. Each of these measurements contributes a dimension of a protein's data vector. If a protein is missing in one of the measurements, insert zero.
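The practical example above can be sketched in Python as follows. This is only an illustration of the zero-filling rule, not part of Belki: the function name `combine_measurements`, the dimension names "Run A" and "Run B", and the values are made up for the example.

```python
def combine_measurements(measurements):
    """Merge per-measurement {protein: value} dicts into equal-length vectors.

    `measurements` maps a dimension name to a {protein: value} dict.
    A protein missing from a measurement gets 0.0 in that dimension.
    """
    dimensions = list(measurements)

    # Collect every protein once, in first-seen order.
    proteins = []
    seen = set()
    for values in measurements.values():
        for protein in values:
            if protein not in seen:
                seen.add(protein)
                proteins.append(protein)

    vectors = {
        protein: [measurements[dim].get(protein, 0.0) for dim in dimensions]
        for protein in proteins
    }
    return dimensions, vectors


# Hypothetical input: two measurements that cover different proteins.
example_measurements = {
    "Run A": {"PLEC_HUMAN": 2.1, "AHNK_HUMAN": 3.145},
    "Run B": {"PLEC_HUMAN": 0.5, "MYH9_HUMAN": 12.0},
}
dimensions, vectors = combine_measurements(example_measurements)
```

Every protein ends up with one value per measurement, so all vectors have the same length.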

You may provide data either normalized to the range [0, 1] or raw. If your data has a high dynamic range and auto-normalization is off, Belki displays it on a logarithmic scale.
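If you prefer to normalize the data yourself before import, one common choice is min-max scaling per dimension. The sketch below is an assumption about how you might prepare such data; it does not claim to reproduce Belki's own auto-normalization, and the helper name `normalize_columns` is hypothetical.

```python
def normalize_columns(vectors):
    """Scale each dimension to [0, 1] by min-max scaling.

    `vectors` maps a protein name to a list of values (one per dimension).
    A dimension whose values are all equal is mapped to 0.0.
    """
    if not vectors:
        return {}
    proteins = list(vectors)
    n_dims = len(vectors[proteins[0]])
    result = {protein: [] for protein in proteins}
    for d in range(n_dims):
        column = [vectors[protein][d] for protein in proteins]
        lowest, highest = min(column), max(column)
        span = highest - lowest
        for protein in proteins:
            value = vectors[protein][d]
            result[protein].append((value - lowest) / span if span else 0.0)
    return result


# Made-up raw values for illustration.
raw = {
    "PLEC_HUMAN": [2.0, 0.0],
    "AHNK_HUMAN": [4.0, 0.0],
    "MYH9_HUMAN": [3.0, 12.0],
}
normalized = normalize_columns(raw)
```

Whether to scale per dimension or across the whole dataset depends on your experiment; the per-dimension choice here is only one option.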

You may also provide a confidence score for each data point. These scores are visualized in Belki alongside their respective data points. A score can be any number; there is no predefined range. Currently, lower scores are treated as better (this will become configurable).

Pivot table format

This is the simplest format understood by Belki. It has a one-line header naming the dimensions, followed by one row per protein. Illustrative example:

             Measurment 1  Meas. B  Another run
PLEC_HUMAN   2.1           0.5      0
AHNK_HUMAN   3.145         0.6      0
MYH9_HUMAN   0.456         0.7      12
MYOF_HUMAN   0.123         0.8      0
DYHC1_HUMAN  0.123         0.9      6

⚠️ When using this format, ensure that the very first field in the table is empty. The pivot table is detected by the first column of the header row being empty.

⚠️ This data format currently enforces auto-normalization.
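As a minimal sketch, a pivot table with the required empty first header field could be written with Python's standard `csv` module. The function name `write_pivot_tsv` and the sample values are assumptions made for this example.

```python
import csv
import io


def write_pivot_tsv(stream, dimensions, vectors):
    """Write a pivot-style TSV: header of dimension names, one row per protein."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    # The very first header field stays empty; this marks the pivot format.
    writer.writerow([""] + list(dimensions))
    for protein, values in vectors.items():
        writer.writerow([protein] + list(values))


# Write to an in-memory buffer here; pass an open text file to write to disk.
buffer = io.StringIO()
write_pivot_tsv(
    buffer,
    ["Meas. A", "Meas. B"],
    {"PLEC_HUMAN": [2.1, 0.5], "MYH9_HUMAN": [0.456, 0.7]},
)
pivot_text = buffer.getvalue()
```

The header line of `pivot_text` starts with a tab character, which is what an empty first field looks like in a TSV file.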

Regular table format

This format is more complex but can carry additional information. It uses multiple lines for each protein, one per dimension. Again, an illustrative example representing the same data as above:

Protein     Pair          Dist   Score
PLEC_HUMAN  Measurment 1  2.1    1
PLEC_HUMAN  Meas. B       0.5    1
PLEC_HUMAN  Another run   0      1
AHNK_HUMAN  Measurment 1  3.145  0.6
AHNK_HUMAN  Meas. B       0.6    1
AHNK_HUMAN  Another run   0      1

⚠️ The first column needs to list proteins and must be titled "Protein" in the header row. The header fields "Pair", "Dist", and "Score" are currently mandatory (but may appear in any order, and more columns are allowed but will be ignored).
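The header rules for both formats can be checked before import with a small script. The following is a sketch that mirrors the rules described on this page; it is not Belki's actual detection code, and `detect_format` is a hypothetical helper.

```python
import csv
import io

REQUIRED_COLUMNS = {"Pair", "Dist", "Score"}


def detect_format(tsv_text):
    """Classify TSV text as 'pivot' or 'regular' based on its header row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, None)
    if not header:
        raise ValueError("missing header row")
    # Pivot tables are recognized by an empty first header field.
    if header[0] == "":
        return "pivot"
    if header[0] != "Protein":
        raise ValueError('first column must be titled "Protein"')
    # The remaining mandatory columns may appear in any order;
    # extra columns are tolerated.
    missing = REQUIRED_COLUMNS - set(header[1:])
    if missing:
        raise ValueError("missing columns: " + ", ".join(sorted(missing)))
    return "regular"


pivot_kind = detect_format("\tMeas. A\tMeas. B\nPLEC_HUMAN\t2.1\t0.5\n")
regular_kind = detect_format("Protein\tScore\tPair\tDist\tNote\n")
```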

ℹ️ To disable auto-normalization, you currently need to rename the feature column from "Dist" to "AbundanceLeft" and import the dataset using "Load Abundance Values" from the File menu.
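If your data already exists as a pivot table, it can be converted to the regular format with a short script. This is a sketch under assumptions: `pivot_to_regular` is a hypothetical helper, and because a pivot table carries no scores, the constant `default_score` is only a placeholder. Pass `value_column="AbundanceLeft"` to produce the column name mentioned above.

```python
import csv
import io


def pivot_to_regular(pivot_text, value_column="Dist", default_score=1):
    """Convert pivot-style TSV text into regular-format TSV text."""
    rows = list(csv.reader(io.StringIO(pivot_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    dimensions = header[1:]  # header[0] is the empty first field

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Protein", "Pair", value_column, "Score"])
    for row in body:
        if not row:
            continue  # skip blank lines
        protein, values = row[0], row[1:]
        # One output line per (protein, dimension) pair.
        for dimension, value in zip(dimensions, values):
            writer.writerow([protein, dimension, value, default_score])
    return out.getvalue()


sample_pivot = "\tMeas. A\tMeas. B\nPLEC_HUMAN\t2.1\t0.5\n"
regular_text = pivot_to_regular(sample_pivot)
abundance_text = pivot_to_regular(sample_pivot, value_column="AbundanceLeft")
```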

Other data sources

Projects

Belki projects are stored in files with suffix .belki. These files use the CBOR format to encode the data.

CBOR is a binary data serialization format loosely based on JSON, similar to MessagePack. It can be easily read and written by other applications. The file contents are self-explanatory, but a schema is currently not available.
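Since no schema is available, a first step when exploring a project file is to look at its top-level CBOR item. The sketch below decodes only the head of the first data item (major type and argument) using the Python standard library, following the general CBOR encoding rules. It assumes nothing about what Belki stores, and `cbor_head` is a hypothetical helper. For full decoding you would normally use a CBOR library instead.

```python
MAJOR_TYPES = (
    "unsigned int", "negative int", "byte string", "text string",
    "array", "map", "tag", "simple/float",
)


def cbor_head(data):
    """Return (major type name, argument) of the first CBOR data item.

    For strings, arrays and maps the argument is the length or item count;
    it is None for indefinite-length items.
    """
    if not data:
        raise ValueError("empty input")
    initial = data[0]
    major = initial >> 5   # high 3 bits: major type
    info = initial & 0x1F  # low 5 bits: additional information
    if info < 24:
        argument = info    # small values are stored directly
    elif info <= 27:
        size = 1 << (info - 24)  # 1, 2, 4 or 8 following bytes
        if len(data) < 1 + size:
            raise ValueError("truncated input")
        argument = int.from_bytes(data[1:1 + size], "big")
    elif info == 31:
        argument = None    # indefinite length
    else:
        raise ValueError("reserved additional information value")
    return MAJOR_TYPES[major], argument


# 0xA2 encodes the start of a map with two key/value pairs.
head = cbor_head(bytes([0xA2]))

# To inspect a real file (the path is hypothetical):
# with open("project.belki", "rb") as f:
#     print(cbor_head(f.read(9)))
```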
