[ENH] Create a glossary for unclear terms (#138)

* Initial glossary
* Apply suggestions from code review

Co-authored-by: Alyssa Dai <alyssa.ydai@gmail.com>

---------

Co-authored-by: Alyssa Dai <alyssa.ydai@gmail.com>
neurobagel · Dec 20, 2023 · e42338b
1 parent e98298d

Showing 2 changed files with 168 additions and 6 deletions.
diff --git a/docs/glossary.md b/docs/glossary.md
@@ -0,0 +1,158 @@
+This glossary compiles some key terms used in the Neurobagel documentation and defines them in the context of the Neurobagel ecosystem.
+
+### Data dictionary
+:   A JSON file that describes the information contained in columns from a tabular data file, 
+    along with the meaning and properties (format of numerical data, unique “levels” 
+    of categorical data, etc.) of values in each column. In the context of Neurobagel, 
+    the meanings of columns and column values are encoded using terms from standardised vocabularies.
+
+### Data model
+**Used interchangeably with**: data schema
+
+:   A structure designed to represent
+    a specific kind of information.
+    A data model is made up of generic types or classes that are relevant
+    to the data model designers (for Neurobagel, examples include "Research Participant"
+    and "Neuroimaging Dataset"), the properties these types can
+    have (e.g., "Age in years", "Dataset name"), and the 
+    relationships that can exist between them (e.g., "is part of").
+    The goal of a data model is to give information a structure
+    so that we can write programs that can consume the information.
+
+    The Neurobagel data model is designed to represent the kind
+    of information that is important to support the most relevant
+    cohort definition queries, and thus models types, properties,
+    and relationships that are important for this purpose.
+    The data model is not static: we extend it
+    as we support new use cases that require
+    additional kinds of information.
+
+### Controlled term
+:   A unique identifier or code for a concept that is described in a controlled vocabulary.
+
+    A controlled term has
+
+    - a clear definition
+    - a unique and persistent identifier
+    - a place in a specific curated list of terms, such as a vocabulary, taxonomy, or ontology
+
+    An example is the controlled term for 
+    ["Parkinson's disease" from the ICD-11 taxonomy](https://icd.who.int/browse11/l-m/en#/http://id.who.int/icd/entity/296066191)
+    with the unique code `8A00.0`.
+
+### Controlled vocabulary
+**Used interchangeably with**: taxonomy, ontology
+
+:   A **controlled vocabulary** is a collection of controlled terms that
+    usually all relate to one specific topic. The main benefit of a
+    controlled vocabulary is that it provides unambiguous terms with
+    clear definitions that people have agreed to use to describe their
+    information. This removes the need to align variable names and value
+    formats between datasets and enables interoperability.
+
+    For example, many websites use the [schema.org](https://schema.org/)
+    vocabulary to describe things like
+    [products](https://schema.org/Product) to purchase,
+    [events](https://schema.org/Event) to book, and
+    [recipes](https://schema.org/Recipe) to cook
+    in a consistent way that can be understood by
+    the crawlers of major search engines.
+
+??? Note "Reusing controlled vocabularies"
+    Creating a controlled vocabulary is laborious: it requires
+    deep subject matter expertise, often from many experts,
+    and the vocabulary must be maintained to remain relevant.
+    You should therefore almost always **reuse** an existing vocabulary
+    rather than creating your own. 
+
+    A **taxonomy** is a more specific form of a controlled vocabulary 
+    that organizes terms into hierarchical relationships. For example,
+    a ["Recipe"](https://schema.org/Recipe) in schema.org is a subtype of
+    a "HowTo", which itself is a subtype of a "CreativeWork". This hierarchy
+    lets you do things like search for "CreativeWork" and also find
+    "Recipe", even if you never stated this link explicitly.
+
+    An **ontology** is an even more specific form of a taxonomy 
+    where terms can have very complex relationships with each other
+    that include logical constraints. In an ontology, you could for example
+    express that for someone to be a "sister" to someone else, 
+    both the subject and the object of the relationship have to be "human",
+    only the subject of the relation has to be "female", and both have to 
+    have at least one parent in common. These complex expressions are very
+    labour-intensive to create but can also provide very
+    rich ways of validating and even inferring information.
+
+### Graph database
+**Used interchangeably with**: knowledge graph store, graph store, graph
+
+:   A type of database, in the same way that a relational database is a type of database.
+    The main distinguishing feature of graph databases is that they 
+    represent entities as nodes in a graph, 
+    and relationships between entities as edges between these nodes.
+    This data model makes it easy to add new information
+    by drawing a new edge between two nodes.
+
+??? note
+    A single Neurobagel graph database can contain harmonised information about multiple datasets and their respective subjects. Each subject is represented by a node, and their harmonised phenotypic and imaging data characteristics are described using controlled terms connected to the subject node via a series of edges that individually encode the type of attribute described by the controlled term.
+
+    Neurobagel uses the RDF graph data model. See also [https://en.wikipedia.org/wiki/Graph_database](https://en.wikipedia.org/wiki/Graph_database).
+
+### Annotation
+:   In the context of Neurobagel, annotation refers to the process
+    of describing tabular demographic, cognitive, and/or clinical (phenotypic) data for a dataset
+    with terms from controlled vocabularies to create
+    machine-understandable data dictionaries for the data. You can learn
+    more about this process in our [documentation](annotation_tool.md).
+
+### Aggregated results
+:   If the owner of a Neurobagel node decides that query responses
+    should not include information at the level of individual 
+    participants, they can configure their node to only return
+    aggregated results. In this mode, the node will aggregate
+    all participants that match a query at the dataset level
+    and only respond with counts of matching participants.
+
+### Data owner
+:   A person or institution
+    that is responsible, in the data governance sense,
+    for one or more datasets. In the context of Neurobagel, one data owner can have one or
+    more Neurobagel nodes, but every Neurobagel node can only
+    have one data owner who is responsible for all of the data
+    stored inside the node.
+
+### Federation API
+**Used interchangeably with**: f-API
+:   A standalone service that allows query users to send a single
+    query and have it automatically sent to many Neurobagel node APIs
+    (n-APIs) without having to know where these node APIs are located.
+    The f-API takes care of keeping an up-to-date list of available
+    n-APIs, federating queries, retrieving and combining results, 
+    and returning them to the user.
+
+    The f-API is designed to closely resemble the behaviour and
+    the endpoints of an n-API, so that services can be built that
+    work either directly with a single n-API or with an f-API.
+
+### Node API
+**Used interchangeably with**: n-API
+:   A Neurobagel "node" is a locally deployed service
+    that holds information about data for one data owner who controls
+    and manages the node. A node has two core components:
+
+    - a graph backend to store the harmonised data for querying
+    - a RESTful **node API** that exposes query endpoints for
+    users or programs to send queries and retrieve results
+
+    One important purpose of the n-API is to act as a barrier
+    between the user and the graph backend so that the user cannot
+    execute arbitrary queries on the graph, and the data owner
+    can control how detailed the query responses should be.
+
+### Tabular data
+**Used interchangeably with**: phenotypic data
+
+:   Tabular text files (e.g., .tsv or .csv) that contain information about
+    participants such as their demographic information or data from
+    cognitive or clinical assessments they have completed. 
+    We often refer to this information as phenotypic data
+    because it describes observable characteristics of the participants.
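
Reviewer note (not part of the commit): the "Tabular data" and "Aggregated results" entries above might be easier to picture with a small example. The sketch below is a simplification under stated assumptions: the table contents, the column names (`dataset`, `participant_id`, `age`), and the function `aggregate_matches` are all hypothetical, and a real Neurobagel node answers queries from its graph backend rather than from a TSV file. It only illustrates the idea of responding with per-dataset counts instead of participant-level rows.

```python
import csv
import io
from collections import Counter

# Hypothetical phenotypic table; dataset names, IDs, and ages are made up.
PHENOTYPIC_TSV = (
    "dataset\tparticipant_id\tage\n"
    "ds-a\tsub-01\t34\n"
    "ds-a\tsub-02\t61\n"
    "ds-b\tsub-01\t58\n"
    "ds-b\tsub-02\t70\n"
    "ds-b\tsub-03\t45\n"
)


def aggregate_matches(tsv_text, min_age):
    """Return the number of matching participants per dataset.

    Only counts are returned; no participant-level rows leave the function.
    """
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter(
        row["dataset"] for row in rows if float(row["age"]) >= min_age
    )
    return dict(counts)


if __name__ == "__main__":
    # prints {'ds-a': 1, 'ds-b': 2}
    print(aggregate_matches(PHENOTYPIC_TSV, min_age=50))
```

The design point is that the caller never sees `participant_id` values, only how many participants in each dataset satisfy the query.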
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -52,20 +52,24 @@ nav:
     - Pull requests: "contributing/pull_requests.md"
     - Our team: "contributing/team.md"
   - Getting help: "getting_help.md"
+  - Glossary: "glossary.md"
   - Citing Neurobagel: "cite.md"
 
 markdown_extensions:
-  - tables
+  - abbr
   - admonition
-  - pymdownx.details
-  - pymdownx.superfences
-  - pymdownx.tabbed:
-      alternate_style: true 
   - attr_list
+  - def_list
+  - md_in_html  # for annotations
+  - pymdownx.details
   - pymdownx.emoji:
       emoji_index: !!python/name:materialx.emoji.twemoji
       emoji_generator: !!python/name:materialx.emoji.to_svg
-  - md_in_html  # for annotations
+  - pymdownx.superfences
+  - pymdownx.tabbed:
+      alternate_style: true 
+  - tables
+
 
 plugins:
   - search