diff --git a/lapis2-docs/src/components/TsvExample.astro b/lapis2-docs/src/components/TsvExample.astro
new file mode 100644
index 00000000..4e3d497a
--- /dev/null
+++ b/lapis2-docs/src/components/TsvExample.astro
@@ -0,0 +1,8 @@
+---
+
+---
+
+{/*prettier-ignore -- prettier will remove the tabs*/}
+primaryKey	pango_lineage	region	age	qc_value	insertions	aaInsertions
+sequence001	B.1.1.7	Europe	46	0.98	segment1:123:AAA,segment2:456:GTT	gene1:123:EPE
+sequence002	B.1.1.7		46	0.98		gene2:123:EPE,gene2:125:EPE
diff --git a/lapis2-docs/src/content/docs/maintainer-docs/references/preprocessing.mdx b/lapis2-docs/src/content/docs/maintainer-docs/references/preprocessing.mdx
index 0ebc0e21..1925a9e6 100644
--- a/lapis2-docs/src/content/docs/maintainer-docs/references/preprocessing.mdx
+++ b/lapis2-docs/src/content/docs/maintainer-docs/references/preprocessing.mdx
@@ -3,9 +3,298 @@ title: Preprocessing
description: Reference on the SILO preprocessing
---
-TODO #565
+import TsvExample from '../../../../components/TsvExample.astro';
-- preprocessing config. Has defaults set in Docker. Only set what you need, leave the rest to the defaults
-- input format of data
+:::tip[Why preprocessing?]
+SILO contains an in-memory database.
+Building this database from the raw input data is computationally intensive,
+so it is done before starting SILO.
+This is called "preprocessing".
+The result is a serialized version of the database that can be loaded into SILO in a much shorter time.
+:::
-## Preprocessing config
+The SILO preprocessing accepts input data in two formats:
+
+- `NDJSON`: a single [NDJSON](https://ndjson.org/) file containing all the data,
+- `TSV/FASTA`: a directory containing
+  - a TSV file with the metadata
+  - FASTA files with the sequences
+
+The preprocessing configuration file determines which format should be used.
+
+## Preprocessing Configuration
+
+The preprocessing configuration file is a YAML file that allows the keys shown in the table below.
+All keys are optional and have default values.
+Some keys are relevant only for one of the two input file formats.
+
+:::tip
+When using the Docker image, you can keep the defaults and mount the files to the corresponding locations.
+You only need to specify `ndjsonInputFilename` or `pangoLineageDefinitionFilename`
+if you wish to use the corresponding features.
+:::
+
+| Key | Input Format | Default | Default in Docker Image |
+| -------------------------------- | ------------ | -------------------------------- | ------------------------ |
+| `inputDirectory` | both | `./` (current working directory) | `/preprocessing/input/` |
+| `outputDirectory` | both | `./output/` | `/preprocessing/output/` |
+| `intermediateResultsDirectory` | both | `./temp/` | `/preprocessing/temp/` |
+| `preprocessingDatabaseLocation` | both | (absent) | |
+| `ndjsonInputFilename` | `NDJSON` | (absent) | |
+| `metadataFilename` | `TSV/FASTA` | `metadata.tsv` | |
+| `pangoLineageDefinitionFilename` | both | (absent) | |
+| `referenceGenomeFilename` | both | `reference_genomes.json` | |
+| `nucleotideSequencePrefix` | `TSV/FASTA` | `nuc_` | |
+| `genePrefix` | `TSV/FASTA` | `gene_` | |
+
+:::note
+All filenames are relative to the `inputDirectory`.
+:::
+
+:::caution
+`ndjsonInputFilename` and `metadataFilename` must not both be specified, because they determine which input format is used.
+:::
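+
+As an illustration of the keys above, a minimal sketch of a configuration that selects the `NDJSON` format could look as follows.
+The file name `sequences.ndjson` and the directory values are hypothetical placeholders, not names required by SILO;
+`referenceGenomeFilename` is shown with its default value only to make the lookup location explicit.
+
+```yaml
+# Hypothetical example values; adjust to your setup.
+inputDirectory: './input/'
+outputDirectory: './output/'
+# Specifying this key selects the NDJSON format.
+# The file name is resolved relative to inputDirectory.
+ndjsonInputFilename: 'sequences.ndjson'
+# Default value from the table above, also relative to inputDirectory.
+referenceGenomeFilename: 'reference_genomes.json'
+```
+
+For `TSV/FASTA` input, such a configuration would omit `ndjsonInputFilename` and, if needed, set `metadataFilename` instead.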
+
+### Description of Keys for Both Formats
+
+- `inputDirectory`:
+ The directory where input files are located.
+- `outputDirectory`:
+ The directory where output files will be placed.
+- `intermediateResultsDirectory`:
+ The directory for storing intermediate results not relevant to the end user, mainly for debugging.
+- `preprocessingDatabaseLocation`:
+ The file for storing internal, intermediate database states for debugging.
+- `pangoLineageDefinitionFilename`:
+ The file with Pango lineage definitions, relative to the `inputDirectory`.
+ See [The Pango Lineage Definition File](#the-pango-lineage-definition-file) below for details.
+- `referenceGenomeFilename`:
+ The file with [reference genomes](/maintainer-docs/references/reference-genomes), relative to the `inputDirectory`.
+
+## `NDJSON` Format
+
+SILO expects the input data in the `NDJSON` format
+if `ndjsonInputFilename` is specified in the preprocessing configuration.
+
+Each line in the NDJSON file must be a JSON object with the following keys:
+
+| Key | Type | Description |
+| ---------------------------- | -------- | ---------------------------------------------------------------------------- |
+| metadata | `object` | An object containing all metadata as key-value pairs. |
+| unalignedNucleotideSequences | `object` | A [sequences object](#sequences-object) with unaligned nucleotide sequences. |
+| alignedNucleotideSequences | `object` | A [sequences object](#sequences-object) with aligned nucleotide sequences. |
+| alignedAminoAcidSequences | `object` | A [sequences object](#sequences-object) with aligned amino acid sequences. |
+| aminoAcidInsertions | `object` | An [insertions object](#insertions-object) with amino acid insertions. |
+| nucleotideInsertions | `object` | An [insertions object](#insertions-object) with nucleotide insertions. |
+
+:::note
+You must configure two metadata columns for insertions in the
+[database configuration](/maintainer-docs/references/database-configuration)
+with exactly the names and types shown in this snippet:
+
+```yaml
+schema:
+  metadata:
+    - name: nucleotideInsertions
+      type: insertion
+    - name: aminoAcidInsertions
+      type: aaInsertion
+```
+
+Otherwise, SILO will not recognize insertions in the NDJSON format.
+:::
+
+#### Sequences Object
+
+The sequences object contains sequences for each segment or gene.
+It must include all `nucleotideSequences` (or `genes`, respectively) specified in the
+[reference genomes](/maintainer-docs/references/reference-genomes)
+as keys.
+Its values are the sequences as strings of
+[valid symbols](/references/nucleotide-and-amino-acid-symbols)
+or `null`.
+
+#### Insertions Object
+
+The insertions object contains a list of insertions for each segment or gene.
+It must include all `nucleotideSequences` (or `genes`, respectively) specified in the
+[reference genomes](/maintainer-docs/references/reference-genomes)
+as keys.
+Its values are arrays of strings in the format `