# Document dj.Top() and add missing pages #1197

Draft · wants to merge 1 commit into `master`
2 changes: 1 addition & 1 deletion .vscode/settings.json
@@ -17,5 +17,5 @@
"[dockercompose]": {
"editor.defaultFormatter": "disable"
},
"files.autoSave": "off"
"files.autoSave": "afterDelay"
}
131 changes: 93 additions & 38 deletions docs/src/concepts/data-model.md
@@ -2,11 +2,23 @@

## What is a data model?

A **data model** is a conceptual framework that defines how data is organized,
represented, and transformed. It gives us the components for creating blueprints for the
structure and operations of data management systems, ensuring consistency and efficiency
in data handling.

Data management systems are built to accommodate these models, allowing us to manage
data according to the principles laid out by the model. If you’re studying data science
or engineering, you’ve likely encountered different data models, each providing a unique
approach to organizing and manipulating data.

A data model is defined by considering the following key aspects:

+ What are the fundamental elements used to structure the data?
+ What operations are available for defining, creating, and manipulating the data?
+ What mechanisms exist to enforce the structure and rules governing valid data interactions?

## Types of data models

Among the most familiar data models are those based on files and folders: data of any
kind are lumped together into binary strings called **files**, files are collected into
@@ -24,17 +36,16 @@ objects in memory with properties and methods for transformations of such data.
## Relational data model

The **relational model** is a way of thinking about data as sets and operations on sets.
Formalized almost a half-century ago ([Codd,
1969](https://dl.acm.org/citation.cfm?doid=362384.362685)), the relational data model
is one of the most powerful and precise ways to store and manage structured data. At
its core, this model organizes all data into tables (representing mathematical
relations), where each table consists of rows (representing mathematical tuples) and
columns (often called attributes).

### Core principles of the relational data model

**Data representation:**
Data are represented and manipulated in the form of relations.
A relation is a set (i.e. an unordered collection) of entities, each supplying a value
for each of the relation's named attributes.
@@ -43,26 +54,26 @@
A collection of base relations with their attributes, domain constraints, uniqueness
constraints, and referential constraints is called a schema.

**Domain constraints:**
Each attribute (column) in a table is associated with a specific attribute domain (or
datatype), i.e. a predefined set of possible values, ensuring that the data entered
are valid.
Attribute domains may not include relations, which keeps the data model
flat, i.e. free of nested structures.

**Uniqueness constraints:**
Entities within relations are addressed by values of their attributes.
To identify and relate data elements, uniqueness constraints are imposed on subsets
of attributes.
Such subsets are then referred to as keys.
One key in a relation is designated as the primary key used for referencing its elements.

**Referential constraints:**
Associations among data are established by means of referential constraints with the
help of foreign keys.
A referential constraint on relation A referencing relation B allows only those
entities in A whose foreign key attributes match the key attributes of an entity in B.

**Declarative queries:**
Data queries are formulated through declarative, as opposed to imperative,
specifications of sought results.
This means that query expressions convey the logic for the result rather than the
@@ -86,23 +97,26 @@ Similar to spreadsheets, relations are often visualized as tables with *attribut
corresponding to *columns* and *entities* corresponding to *rows*.
In particular, SQL uses the terms *table*, *column*, and *row*.

## The DataJoint Model

DataJoint is a conceptual refinement of the relational data model offering a more
expressive and rigorous framework for database programming ([Yatsenko et al.,
2018](https://arxiv.org/abs/1807.11104)). The DataJoint model facilitates conceptual
clarity, efficiency, workflow management, and precise and flexible data
queries. By enforcing entity normalization,
simplifying dependency declarations, offering a rich query algebra, and visualizing
relationships through schema diagrams, DataJoint makes relational database programming
more intuitive and robust for complex data pipelines.

The model has emerged over a decade of continuous development of complex data
pipelines for neuroscience experiments ([Yatsenko et al.,
2015](https://www.biorxiv.org/content/early/2015/11/14/031658)). DataJoint has allowed
researchers with no prior knowledge of databases to collaborate effectively on common
data pipelines sustaining data integrity and supporting flexible access. DataJoint is
currently implemented as client libraries in MATLAB and Python. These libraries work by
transpiling DataJoint queries into SQL before passing them on to conventional relational
database systems that serve as the backend, in combination with bulk storage systems for
storing large contiguous data objects.

DataJoint comprises:

@@ -115,3 +129,44 @@ modeled entities
The key refinement of DataJoint over other relational data models and their
implementations is DataJoint's support of
[entity normalization](../design/normalization.md).

### Core principles of the DataJoint model

**Entity Normalization**
DataJoint enforces entity normalization, ensuring that every entity set (table) is
well-defined, with each element belonging to the same type, sharing the same
attributes, and distinguished by the same primary key. This principle reduces
redundancy and avoids data anomalies, similar to Boyce-Codd Normal Form, but with a
more intuitive structure than traditional SQL.

**Simplified Schema Definition and Dependency Management**
DataJoint introduces a schema definition language that is more expressive and less
error-prone than SQL. Dependencies are explicitly declared using arrow notation
(->), making referential constraints easier to understand and visualize. The
dependency structure is enforced as an acyclic directed graph, which simplifies
workflows by preventing circular dependencies.
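
For illustration, here is a minimal sketch of this notation (the schema, table, and
attribute names are hypothetical; `Subject` is assumed to be a previously declared
table):

```python
import datajoint as dj

schema = dj.Schema("lab")  # hypothetical schema name


@schema
class Session(dj.Manual):
    definition = """
    -> Subject             # dependency: inherits Subject's primary key attributes
    session_id: int        # completes the primary key
    ---
    session_date: date     # secondary attribute
    """
```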

**Integrated Query Operators producing a Relational Algebra**
DataJoint introduces five query operators (restrict, join, project, aggregate, and
union) with algebraic closure, allowing them to be combined seamlessly. These
operators are designed to maintain operational entity normalization, ensuring query
outputs remain valid entity sets.
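
As a sketch of how these operators compose (all table names are hypothetical):

```python
# restriction: keep only entities satisfying a condition
female = Subject & 'sex = "F"'

# join: combine matching entities from two entity sets
sessions = female * Session

# projection: keep the primary key plus selected attributes
dates = sessions.proj("session_date")

# aggregation: summarize one entity set per entity of another
counts = Subject.aggr(Session, n="count(session_id)")
```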

**Diagramming Notation for Conceptual Clarity**
DataJoint’s schema diagrams simplify the representation of relationships between
entity sets compared to ERM diagrams. Relationships are expressed as dependencies
between entity sets, which are visualized using solid or dashed lines for primary
and secondary dependencies, respectively.
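
A minimal sketch (assuming `schema` is an existing `dj.Schema` object):

```python
import datajoint as dj

# Render the dependency graph of all tables declared in `schema`;
# solid lines mark primary dependencies, dashed lines secondary ones.
dj.Diagram(schema)
```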

**Unified Logic for Binary Operators**
DataJoint simplifies binary operations by requiring attributes involved in joins or
comparisons to be homologous (i.e., sharing the same origin). This avoids the
ambiguity and pitfalls of natural joins in SQL, ensuring more predictable query
results.

**Optimized Data Pipelines for Scientific Workflows**
DataJoint treats the database as a data pipeline where each entity set defines a
step in the workflow. This makes it ideal for scientific experiments and complex
data processing, such as in neuroscience. Its MATLAB and Python libraries transpile
DataJoint queries into SQL, bridging the gap between scientific programming and
relational databases.
14 changes: 7 additions & 7 deletions docs/src/concepts/data-pipelines.md
@@ -157,10 +157,10 @@ with external groups.
## Summary of DataJoint features

1. A free, open-source framework for scientific data pipelines and workflow management
2. Data hosting in cloud or in-house
3. MySQL, filesystems, S3, and Globus for data management
4. Define, visualize, and query data pipelines from MATLAB or Python
5. Enter and view data through GUIs
6. Concurrent access by multiple users and computational agents
7. Data integrity: identification, dependencies, groupings
8. Automated distributed computation
52 changes: 52 additions & 0 deletions docs/src/design/alter.md
@@ -1 +1,53 @@
# Altering Populated Pipelines

Tables can be altered after they have been declared and populated. This is useful when
you want to add new secondary attributes or change the data type of existing attributes.
Users can use the `definition` property to update a table's attributes and then use
`alter` to apply the changes in the database. Currently, `alter` does not support
changes to primary key attributes.

Let's say we have a table `Student` with the following attributes:

```python
@schema
class Student(dj.Manual):
definition = """
student_id: int
---
first_name: varchar(40)
last_name: varchar(40)
home_address: varchar(100)
"""
```

We can modify the table to include a new attribute `email`:

```python
Student.definition = """
student_id: int
---
first_name: varchar(40)
last_name: varchar(40)
home_address: varchar(100)
email: varchar(100)
"""
Student.alter()
```

The `alter` method will update the table in the database to include the new attribute
`email` added by the user in the table's `definition` property.
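
To confirm the change, the definition now stored in the database can be printed (a
minimal sketch; `describe` reads the table's current definition back from the server):

```python
print(Student.describe())
```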

Similarly, you can modify the data type or length of an existing attribute. For example,
to alter the `home_address` attribute to have a length of 200 characters:

```python
Student.definition = """
student_id: int
---
first_name: varchar(40)
last_name: varchar(40)
home_address: varchar(200)
email: varchar(100)
"""
Student.alter()
```
27 changes: 26 additions & 1 deletion docs/src/design/tables/blobs.md
@@ -1 +1,26 @@
# Overview

DataJoint provides functionality for serializing and deserializing complex data types
into binary blobs for efficient storage and compatibility with MATLAB's mYm
serialization. This includes support for:

+ Basic Python data types (e.g., integers, floats, strings, dictionaries).
+ NumPy arrays and scalars.
+ Specialized data types like UUIDs, decimals, and datetime objects.

## Serialization and Deserialization Process

Serialization converts Python objects into a binary representation for efficient storage
within the database. Deserialization converts the binary representation back into the
original Python object.

Blobs over 1 KiB are compressed using the zlib library to reduce storage requirements.

## Supported Data Types

DataJoint supports the following data types for serialization:

+ Scalars: Integers, floats, booleans, strings.
+ Collections: Lists, tuples, sets, dictionaries.
+ NumPy: Arrays, structured arrays, and scalars.
+ Custom Types: UUIDs, decimals, datetime objects, MATLAB cell and struct arrays.
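
As a sketch of how this looks in practice (the schema and table names are
hypothetical), a supported object stored in a `longblob` attribute is serialized on
insert and deserialized on fetch:

```python
import numpy as np
import datajoint as dj

schema = dj.Schema("demo_blobs")  # hypothetical schema name


@schema
class Waveform(dj.Manual):
    definition = """
    waveform_id: int
    ---
    trace: longblob    # serialized automatically; compressed if over 1 KiB
    """


Waveform.insert1({"waveform_id": 1, "trace": np.random.randn(10_000)})

# Fetching deserializes the blob back into a NumPy array:
trace = (Waveform & "waveform_id = 1").fetch1("trace")
```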
23 changes: 12 additions & 11 deletions docs/src/faq.md
@@ -4,17 +4,18 @@

It is common to enter data during experiments using a graphical user interface.

1. The [DataJoint Works](https://works.datajoint.com) platform is a web-based, fully
managed service to host and execute data pipelines.

2. [DataJoint LabBook](https://github.com/datajoint/datajoint-labbook) is an open
source project for data entry but is no longer actively maintained.

## Does DataJoint support other programming languages?

DataJoint [Python](https://datajoint.com/docs/core/datajoint-python/) is the most
up-to-date version and all future development will focus on the Python API. The
[Matlab](https://datajoint.com/docs/core/datajoint-matlab/) API was actively developed
through 2023. Previous projects implemented some DataJoint features in
[Julia](https://github.com/BrainCOGS/neuronex_workshop_2018/tree/julia/julia) and
[Rust](https://github.com/datajoint/datajoint-core). DataJoint's data model and data
representation are largely language independent, which means that any language with a
@@ -92,15 +93,15 @@ The entry of metadata can be manual, or it can be an automated part of data acqu
into the database).

Depending on their size and contents, raw data files can be stored in a number of ways.
In the simplest and most common scenario, raw data continue to be stored in either a
local filesystem or in the cloud as collections of files and folders.
The paths to these files are entered in the database (again, either manually or by
automated processes).
This is the point at which the notion of a **data pipeline** begins.
Below these "manual tables" that contain metadata and file paths are a series of tables
that load raw data from these files, process it in some way, and insert derived or
summarized data directly into the database.
For example, in an imaging application, the very large raw `.TIFF` stacks would reside on
the filesystem, but the extracted fluorescent trace timeseries for each cell in the
image would be stored as a numerical array directly in the database.
Or the raw video used for animal tracking might be stored in a standard video format on
@@ -163,8 +164,8 @@ This brings us to the final important question:

## How do I get my data out?

This is the fun part. See [queries](query/operators.md) for details of the DataJoint
query language directly from Python.

## Interfaces

14 changes: 7 additions & 7 deletions docs/src/internal/transpilation.md
@@ -34,7 +34,7 @@ restriction appending the new condition to the input's restriction.

Property `support` represents the `FROM` clause and contains a list of either
`QueryExpression` objects or table names in the case of base queries.
The join operator `*` adds new elements to the `support` attribute.

At least one element must be present in `support`. Multiple elements in `support`
indicate a join.
@@ -56,10 +56,10 @@ self: `heading`, `restriction`, and `support`.

The input object is treated as a subquery in the following cases:

1. A restriction is applied that uses alias attributes in the heading.
2. A projection uses an alias attribute to create a new alias attribute.
3. A join is performed on an alias attribute.
4. An aggregation is used as a restriction.

An error arises if

@@ -117,8 +117,8 @@ input — the *aggregated* query expression.
The SQL equivalent of aggregation is

1. the NATURAL LEFT JOIN of the two inputs.
2. followed by a GROUP BY on the primary key arguments of the first input
3. followed by a projection.
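
For instance (a sketch with hypothetical tables `Subject` and `Session`, where
`Session` references `Subject`):

```python
# aggregation in DataJoint:
counts = Subject.aggr(Session, n="count(session_id)")

# transpiles to SQL along these lines:
#   SELECT subject_id, count(session_id) AS n
#   FROM subject NATURAL LEFT JOIN session
#   GROUP BY subject_id
```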

The projection works the same as `.proj` with respect to the first input.
With respect to the second input, the projection part of aggregation allows only
2 changes: 1 addition & 1 deletion docs/src/manipulation/transactions.md
@@ -6,7 +6,7 @@ interrupting the sequence of such operations halfway would leave the data in an
state.
While the sequence is in progress, other processes accessing the database will not see
the partial results until the transaction is complete.
The sequence may include [data queries](../query/principles.md) and
The sequence may include [data queries](../query/principles.md) and
[manipulations](index.md).

In such cases, the sequence of operations may be enclosed in a transaction.
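
For example (a minimal sketch; `Subject` and `Session` are hypothetical tables, and
`dj.conn()` returns the active connection):

```python
import datajoint as dj

connection = dj.conn()

# Either both inserts commit together, or neither does:
with connection.transaction:
    Subject.insert1({"subject_id": 1})
    Session.insert1({"subject_id": 1, "session_id": 1})
```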
2 changes: 1 addition & 1 deletion docs/src/publish-data.md
@@ -27,7 +27,7 @@ The code and the data can be found at https://github.com/sinzlab/Sinz2018_NIPS

## Exporting into a collection of files

Another option for publishing and archiving data is to export the data from the
DataJoint pipeline into a collection of files.
DataJoint provides features for exporting and importing sections of the pipeline.
Several ongoing projects are implementing the capability to export from DataJoint