diff --git a/book/00-introduction/49-connect.ipynb b/book/00-introduction/49-connect.ipynb index 5c4c2ad..2cfd956 100644 --- a/book/00-introduction/49-connect.ipynb +++ b/book/00-introduction/49-connect.ipynb @@ -28,19 +28,34 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Connect with DataJoint\n\nDataJoint is the primary way to connect to the database in this book. The DataJoint client library reads the database credentials from the environment variables `DJ_HOST`, `DJ_USER`, and `DJ_PASS`. \n\nSimply importing the DataJoint library is sufficient—it will connect to the database automatically when needed. Here we call `dj.conn()` only to verify the connection, but this step is not required in normal use." + "source": [ + "# Connect with DataJoint\n", + "\n", + "DataJoint is the primary way to connect to the database in this book. The DataJoint client library reads the database credentials from the environment variables `DJ_HOST`, `DJ_USER`, and `DJ_PASS`. \n", + "\n", + "Simply importing the DataJoint library is sufficient—it will connect to the database automatically when needed. Here we call `dj.conn()` only to verify the connection, but this step is not required in normal use." + ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": "import datajoint as dj\ndj.conn() # test the connection (optional)" + "source": [ + "import datajoint as dj\n", + "dj.conn() # test the connection (optional)" + ] }, { "cell_type": "markdown", "metadata": {}, - "source": "# Connect with SQL Magic\n\nSQL \"Jupyter magic\" allows executing SQL statements directly in Jupyter notebooks, implemented by the [`jupysql`](https://ploomber.io/blog/jupysql/) library. This is useful for quick interactive SQL queries and for learning SQL syntax. We will use SQL magic in this book for demonstrating SQL concepts, but it is not used as part of Python application code.\n\nThe following cell sets up the SQL magic connection to the database." 
+ "source": [ + "# Connect with SQL Magic\n", + "\n", + "SQL \"Jupyter magic\" allows executing SQL statements directly in Jupyter notebooks, implemented by the [`jupysql`](https://ploomber.io/blog/jupysql/) library. This is useful for quick interactive SQL queries and for learning SQL syntax. We will use SQL magic in this book for demonstrating SQL concepts, but it is not used as part of Python application code.\n", + "\n", + "The following cell sets up the SQL magic connection to the database." + ] }, { "cell_type": "code", @@ -51,43 +66,74 @@ } }, "outputs": [], - "source": "%load_ext sql\n%sql mysql+pymysql://dev:devpass@db" + "source": [ + "%load_ext sql\n", + "%sql mysql+pymysql://dev:devpass@db" + ] }, { "cell_type": "markdown", "metadata": {}, - "source": "You can issue SQL commands from a Jupyter cell by starting it with `%%sql`.\nChange the cell type to `SQL` for appropriate syntax highlighting." + "source": [ + "You can issue SQL commands from a Jupyter cell by starting it with `%%sql`.\n", + "Change the cell type to `SQL` for appropriate syntax highlighting." + ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": "%%sql\n-- show all users\nSELECT User FROM mysql.user" + "source": [ + "%%sql\n", + "-- show all users\n", + "SELECT User FROM mysql.user" + ] }, { "cell_type": "markdown", "metadata": {}, - "source": "# Connect with a Python MySQL Client\n\nTo issue SQL queries directly from Python code (outside of Jupyter magic), you can use a conventional SQL client library such as `pymysql`. This approach gives you full programmatic control over database interactions and is useful when you need to execute raw SQL within Python scripts." + "source": [ + "# Connect with a Python MySQL Client\n", + "\n", + "To issue SQL queries directly from Python code (outside of Jupyter magic), you can use a conventional SQL client library such as `pymysql`. 
This approach gives you full programmatic control over database interactions and is useful when you need to execute raw SQL within Python scripts." + ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": "import os\nimport pymysql\n\n# create a database connection\nconn = pymysql.connect(\n host=os.environ['DJ_HOST'], \n user=os.environ['DJ_USER'], \n password=os.environ['DJ_PASS']\n)" + "source": [ + "import os\n", + "import pymysql\n", + "\n", + "# create a database connection\n", + "conn = pymysql.connect(\n", + " host=os.environ['DJ_HOST'], \n", + " user=os.environ['DJ_USER'], \n", + " password=os.environ['DJ_PASS']\n", + ")" + ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": "# create a query cursor and issue an SQL query\ncur = conn.cursor()\ncur.execute('SELECT User FROM mysql.user')\ncur.fetchall()" + "source": [ + "# create a query cursor and issue an SQL query\n", + "cur = conn.cursor()\n", + "cur.execute('SELECT User FROM mysql.user')\n", + "cur.fetchall()" + ] }, { "cell_type": "markdown", "metadata": {}, - "source": "We are all set for executing all the database queries in this book!" + "source": [ + "We are all set for executing all the database queries in this book!" + ] } ], "metadata": { diff --git a/book/20-concepts/04-integrity.md b/book/20-concepts/04-integrity.md index 32f4877..e38e83a 100644 --- a/book/20-concepts/04-integrity.md +++ b/book/20-concepts/04-integrity.md @@ -1,8 +1,5 @@ --- title: Data Integrity -date: 2025-10-31 -authors: - - name: Dimitri Yatsenko --- # Why Data Integrity Matters @@ -95,7 +92,7 @@ Entity integrity ensures a **one-to-one correspondence** between real-world enti **Example:** Each mouse in the lab has exactly one unique ID, and that ID refers to exactly one mouse—never two different mice sharing the same ID, and never one mouse having multiple IDs. 
**Covered in:** -- [Primary Keys](../30-design/020-primary-key.md) — Entity integrity and the 1:1 correspondence guarantee (elaborated in detail) +- [Primary Keys](../30-design/018-primary-key.md) — Entity integrity and the 1:1 correspondence guarantee (elaborated in detail) - [UUID](../85-special-topics/025-uuid.ipynb) — Universally unique identifiers --- @@ -111,7 +108,7 @@ Referential integrity maintains logical associations across tables: **Example:** A recording session cannot reference a non-existent mouse. **Covered in:** -- [Foreign Keys](../30-design/030-foreign-keys.ipynb) — Cross-table relationships +- [Foreign Keys](../30-design/030-foreign-keys.md) — Cross-table relationships - [Relationships](../30-design/050-relationships.ipynb) — Dependency patterns --- @@ -161,7 +158,7 @@ Workflow integrity maintains valid operation sequences through: **Example:** An analysis pipeline cannot compute results before acquiring raw data. If `NeuronAnalysis` depends on `SpikeData`, which depends on `RecordingSession`, the database enforces that recordings are created before spike detection, which occurs before analysis—maintaining the integrity of the entire scientific workflow. **Covered in:** -- [Foreign Keys](../30-design/030-foreign-keys.ipynb) — How foreign keys encode workflow dependencies +- [Foreign Keys](../30-design/030-foreign-keys.md) — How foreign keys encode workflow dependencies - [Populate](../40-operations/050-populate.ipynb) — Automatic workflow execution and dependency resolution --- @@ -212,8 +209,8 @@ Now that you understand *why* integrity matters, the next chapter introduces how The [Design](../30-design/010-schema.ipynb) section then shows *how* to implement each constraint type: 1. **[Tables](../30-design/015-table.ipynb)** — Basic structure with domain integrity -2. **[Primary Keys](../30-design/020-primary-key.md)** — Entity integrity through unique identification -3. 
**[Foreign Keys](../30-design/030-foreign-keys.ipynb)** — Referential integrity across tables +2. **[Primary Keys](../30-design/018-primary-key.md)** — Entity integrity through unique identification +3. **[Foreign Keys](../30-design/030-foreign-keys.md)** — Referential integrity across tables Each chapter builds on these foundational integrity concepts. ``` diff --git a/book/30-design/010-schema.ipynb b/book/30-design/010-schema.ipynb index 69fad32..2bedbb2 100644 --- a/book/30-design/010-schema.ipynb +++ b/book/30-design/010-schema.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "---\ntitle: Schemas\nauthors:\n - name: Dimitri Yatsenko\n---\n\n# What is a schema?\n\nThe term schema has two related meanings in the context of databases:\n\n## 1. Schema as a Data Blueprint\nA **schema** is a formal specification of the structure of data and the rules governing its integrity.\nIt serves as a blueprint that defines how data is organized, stored, and accessed within a database.\nThis ensures that the database reflects the rules and requirements of the underlying business or research project it supports.\n\nIn structured data models, such as the relational model, a schema provides a robust framework for defining:\n* The structure of tables (relations) and their attributes (columns).\n* Rules and constraints that ensure data consistency, accuracy, and reliability.\n* Relationships between tables, such as primary keys (unique identifiers for records) and foreign keys (references to related records in other tables).\n\n### Aims of Good Schema Design\n* **Data Integrity**: Ensures consistency and prevents anomalies.\n* **Query Efficiency**: Facilitates fast and accurate data retrieval, supports complex queries, and optimizes database performance.\n* **Scalability**: Allows the database to grow and adapt as data volumes increase.\n\n### Key Elements of Schema Design\n* **Tables and Attributes**: Each table is defined with specific attributes 
(columns), each assigned a data type.\n* **Primary Keys**: Uniquely identify each record in a table.\n* **Foreign Keys**: Establish relationships between entities in tables.\n* **Indexes**: Support efficient queries.\n\nThrough careful schema design, database architects create systems that are both efficient and flexible, meeting the current and future needs of an organization. The schema acts as a living document that guides the structure, operations, and integrity of the database.\n\n## 2. Schema as a Database Module\n\nIn complex database designs, the term \"schema\" is also used to describe a distinct module of a larger database with its own namespace that groups related tables together. \nThis modular approach:\n* Separates tables into logical groups for better organization.\n* Avoids naming conflicts in large databases with multiple schemas." + "source": "---\ntitle: Schemas\n---\n\n# What is a schema?\n\nThe term schema has two related meanings in the context of databases:\n\n## 1. 
Schema as a Data Blueprint\nA **schema** is a formal specification of the structure of data and the rules governing its integrity.\nIt serves as a blueprint that defines how data is organized, stored, and accessed within a database.\nThis ensures that the database reflects the rules and requirements of the underlying business or research project it supports.\n\nIn structured data models, such as the relational model, a schema provides a robust framework for defining:\n* The structure of tables (relations) and their attributes (columns).\n* Rules and constraints that ensure data consistency, accuracy, and reliability.\n* Relationships between tables, such as primary keys (unique identifiers for records) and foreign keys (references to related records in other tables).\n\n### Aims of Good Schema Design\n* **Data Integrity**: Ensures consistency and prevents anomalies.\n* **Query Efficiency**: Facilitates fast and accurate data retrieval, supports complex queries, and optimizes database performance.\n* **Scalability**: Allows the database to grow and adapt as data volumes increase.\n\n### Key Elements of Schema Design\n* **Tables and Attributes**: Each table is defined with specific attributes (columns), each assigned a data type.\n* **Primary Keys**: Uniquely identify each record in a table.\n* **Foreign Keys**: Establish relationships between entities in tables.\n* **Indexes**: Support efficient queries.\n\nThrough careful schema design, database architects create systems that are both efficient and flexible, meeting the current and future needs of an organization. The schema acts as a living document that guides the structure, operations, and integrity of the database.\n\n## 2. Schema as a Database Module\n\nIn complex database designs, the term \"schema\" is also used to describe a distinct module of a larger database with its own namespace that groups related tables together. 
\nThis modular approach:\n* Separates tables into logical groups for better organization.\n* Avoids naming conflicts in large databases with multiple schemas." }, { "cell_type": "markdown", @@ -40,7 +40,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Using the `schema` Object\n\nThe schema object groups related tables together and helps prevent naming conflicts.\n\nBy convention, the object created by `dj.Schema` is named `schema`. Typically, only one schema object is used in any given Python namespace, usually at the level of a Python module.\n\nThe schema object serves multiple purposes:\n* **Creating Tables**: Used as a *class decorator* (`@schema`) to declare tables within the schema. \nFor details, see the next section, [Create Tables](015-table.ipynb)\n* **Visualizing the Schema**: Generates diagrams to illustrate relationships between tables.\n* **Exporting Data**: Facilitates exporting data for external use or backup.\n\nWith this foundation, you are ready to begin declaring tables and building your data pipeline." + "source": "# Using the `schema` Object\n\nThe schema object groups related tables together and helps prevent naming conflicts.\n\nBy convention, the object created by `dj.Schema` is named `schema`. Typically, only one schema object is used in any given Python namespace, usually at the level of a Python module.\n\nThe schema object serves multiple purposes:\n* **Creating Tables**: Used as a *class decorator* (`@schema`) to declare tables within the schema. \nFor details, see the next section, [Tables](015-table.ipynb)\n* **Visualizing the Schema**: Generates diagrams to illustrate relationships between tables.\n* **Exporting Data**: Facilitates exporting data for external use or backup.\n\nWith this foundation, you are ready to begin declaring tables and building your data pipeline." 
}, { "cell_type": "markdown", diff --git a/book/30-design/015-table.ipynb b/book/30-design/015-table.ipynb index fc4845d..7ad9072 100644 --- a/book/30-design/015-table.ipynb +++ b/book/30-design/015-table.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "---\ntitle: Create Tables\nauthors:\n - name: Dimitri Yatsenko\ndate: 2025-10-31\n---\n\n# Tables: The Foundation of Data Integrity\n\nIn the [Data Integrity](../20-concepts/04-integrity.md) chapter, we learned that relational databases excel at enforcing **data integrity constraints**. Tables are where these constraints come to life through:\n\n1. **Domain Integrity** — Data types restrict values to valid ranges\n2. **Completeness** — Required vs. optional attributes ensure necessary data is present\n3. **Entity Integrity** — Primary keys uniquely identify each record (covered in [Primary Keys](020-primary-key.md))\n4. **Referential Integrity** — Foreign keys enforce relationships (covered in [Foreign Keys](030-foreign-keys.ipynb))\n\nThis chapter shows how to declare tables in DataJoint with proper data types and attribute specifications that enforce these constraints automatically.\n\n```{admonition} Learning Objectives\n:class: note\n\nBy the end of this chapter, you will:\n- Declare tables with the `@schema` decorator\n- Specify attributes with appropriate data types\n- Distinguish between primary key and dependent attributes\n- Insert, view, and delete data\n- Understand how table structure enforces integrity\n```" + "source": "---\ntitle: Tables\n---\n\n# Tables: The Foundation of Data Integrity\n\nIn the [Data Integrity](../20-concepts/04-integrity.md) chapter, we learned that relational databases excel at enforcing **data integrity constraints**. Tables are where these constraints come to life through:\n\n1. **Domain Integrity** — Data types restrict values to valid ranges\n2. **Completeness** — Required vs. optional attributes ensure necessary data is present\n3. 
**Entity Integrity** — Primary keys uniquely identify each record (covered in [Primary Keys](018-primary-key.md))\n4. **Referential Integrity** — Foreign keys enforce relationships (covered in [Foreign Keys](030-foreign-keys.md))\n\nThis chapter shows how to declare tables in DataJoint with proper data types and attribute specifications that enforce these constraints automatically.\n\n```{admonition} Learning Objectives\n:class: note\n\nBy the end of this chapter, you will:\n- Declare tables with the `@schema` decorator\n- Specify attributes with appropriate data types\n- Distinguish between primary key and dependent attributes\n- Insert, view, and delete data\n- Understand how table structure enforces integrity\n```" }, { "cell_type": "markdown", @@ -199,7 +199,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "```{admonition} Integrity Enforcement at Insert\n:class: important\n\nThe database validates each insert:\n- **Domain integrity**: Rejects invalid data types (e.g., string in `int` field)\n- **Completeness**: Rejects missing required attributes\n- **Entity integrity**: Rejects duplicate primary keys (see [Primary Keys](020-primary-key.md))\n\nViolations raise immediate errors — no invalid data ever enters the database.\n```" + "source": "```{admonition} Integrity Enforcement at Insert\n:class: important\n\nThe database validates each insert:\n- **Domain integrity**: Rejects invalid data types (e.g., string in `int` field)\n- **Completeness**: Rejects missing required attributes\n- **Entity integrity**: Rejects duplicate primary keys (see [Primary Keys](018-primary-key.md))\n\nViolations raise immediate errors — no invalid data ever enters the database.\n```" }, { "cell_type": "markdown", @@ -297,11 +297,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "```{warning}\n", - "Deletion is **permanent** and will cascade to dependent tables (see [Foreign Keys](030-foreign-keys.ipynb)). 
Always verify what will be deleted before confirming.\n", - "```" - ] + "source": "```{warning}\nDeletion is **permanent** and will cascade to dependent tables (see [Foreign Keys](030-foreign-keys.md)). Always verify what will be deleted before confirming.\n```" }, { "cell_type": "markdown", @@ -336,12 +332,12 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Table Base Classes\n\nDataJoint provides four base classes for different data management patterns:\n\n| Base Class | Purpose | When to Use |\n|------------|---------|-------------|\n| `dj.Manual` | Manually entered data | Subject info, experimental protocols |\n| `dj.Lookup` | Reference data, rarely changes | Equipment lists, parameter sets |\n| `dj.Imported` | Data imported from external files | Raw recordings, behavioral videos |\n| `dj.Computed` | Derived from other tables | Spike sorting results, analyses |\n\nWe'll explore `Imported` and `Computed` tables in the [Populate](050-populate.ipynb) chapter.\n\n```{seealso}\n- [Lookup Tables](018-lookup-tables.ipynb) — Managing reference data\n- [Populate](050-populate.ipynb) — Automated data processing\n```" + "source": "# Table Base Classes\n\nDataJoint provides four base classes for different data management patterns:\n\n| Base Class | Purpose | When to Use |\n|------------|---------|-------------|\n| `dj.Manual` | Manually entered data | Subject info, experimental protocols |\n| `dj.Lookup` | Reference data, rarely changes | Equipment lists, parameter sets |\n| `dj.Imported` | Data imported from external files | Raw recordings, behavioral videos |\n| `dj.Computed` | Derived from other tables | Spike sorting results, analyses |\n\nWe'll explore `Imported` and `Computed` tables in the [Populate](050-populate.ipynb) chapter.\n\n```{seealso}\n- [Lookup Tables](020-lookup-tables.ipynb) — Managing reference data\n- [Populate](050-populate.ipynb) — Automated data processing\n```" }, { "cell_type": "markdown", "metadata": {}, - "source": "# 
Summary\n\nTables are the fundamental building blocks where data integrity is enforced:\n\n1. **Table declarations** specify structure using the `@schema` decorator and `definition` string\n2. **Data types** enforce domain integrity by restricting values to valid ranges\n3. **Primary keys** (above `---`) enforce entity integrity through unique identification\n4. **Required attributes** enforce completeness by ensuring necessary data is present\n5. **DataJoint operations** (`insert`, `delete`, `drop`) respect these integrity constraints\n\n```{admonition} Next Steps\n:class: tip\n\nNow that you understand table structure and data types, the next chapters explore:\n- **[Default Values](017-default-values.ipynb)** — Making attributes optional\n- **[Primary Keys](020-primary-key.md)** — Strategies for unique entity identification\n- **[Foreign Keys](030-foreign-keys.ipynb)** — Linking tables through relationships\n```" + "source": "# Summary\n\nTables are the fundamental building blocks where data integrity is enforced:\n\n1. **Table declarations** specify structure using the `@schema` decorator and `definition` string\n2. **Data types** enforce domain integrity by restricting values to valid ranges\n3. **Primary keys** (above `---`) enforce entity integrity through unique identification\n4. **Required attributes** enforce completeness by ensuring necessary data is present\n5. 
**DataJoint operations** (`insert`, `delete`, `drop`) respect these integrity constraints\n\n```{admonition} Next Steps\n:class: tip\n\nNow that you understand table structure and data types, the next chapters explore:\n- **[Default Values](017-default-values.ipynb)** — Making attributes optional\n- **[Primary Keys](018-primary-key.md)** — Strategies for unique entity identification\n- **[Foreign Keys](030-foreign-keys.md)** — Linking tables through relationships\n```" } ], "metadata": {
diff --git a/book/30-design/018-primary-key.md b/book/30-design/018-primary-key.md
new file mode 100644
index 0000000..404f483
--- /dev/null
+++ b/book/30-design/018-primary-key.md
@@ -0,0 +1,780 @@
+---
+title: Primary Keys
+---
+
+# Primary Keys: Ensuring Entity Integrity
+
+In the [Tables](015-table.ipynb) chapter, we learned that attributes above the `---` line form the **primary key**. But why does this matter? The primary key is the cornerstone of **entity integrity**—the guarantee that each real-world entity corresponds to exactly one database record, and vice versa.
+
+```{admonition} Learning Objectives
+:class: note
+
+By the end of this chapter, you will:
+- Understand entity integrity and its importance
+- Apply the "three questions" framework for designing primary keys
+- Choose between natural keys and surrogate keys
+- Understand when keys become composite (multiple attributes)
+- Recognize schema dimensions and their role in semantic matching
+- Design primary keys that reflect real-world identification systems
+```
+
+# What is a Primary Key?
+
+A **primary key** is a column or combination of columns that uniquely identifies each row in a table.
+
+```{card} Primary Key Requirements
+In DataJoint, every table must have a primary key. Primary key attributes:
+- Cannot be NULL
+- Must be unique across all rows
+- Cannot be changed after insertion (immutable)
+- Are declared above the `---` line in the table definition
+```
+
+# Entity Integrity: The Core Concept
+
+**Entity integrity** ensures a one-to-one correspondence between real-world entities and their database records:
+
+- Each real-world entity → exactly one database record
+- Each database record → exactly one real-world entity
+
+Without entity integrity, databases become unreliable:
+
+| Integrity Failure | Consequence |
+|-------------------|-------------|
+| Same entity, multiple records | Fragmented data, conflicting information |
+| Multiple entities, same record | Mixed data, privacy violations |
+| Cannot match entity to record | Lost data, broken workflows |
+
+Imagine what kinds of difficulties would arise if entity integrity broke down in the systems you interact with every day:
+
+- What would happen if your university or company HR department had two different identifiers for you in their records?
+- What would happen if your HR department occasionally updated your records with another person's information?
+- What if the same occurred in your dentist's office?
+
+**Example:** If your university had two student records for you, your transcript might show incomplete courses, financial aid could be miscalculated, and graduation requirements might be incorrectly tracked.
+
+# The Three Questions of Entity Integrity
+
+When designing a primary key, you must answer three questions:
+
+1. **How do I prevent duplicate records?** — Ensure the same entity cannot appear twice
+2. **How do I prevent record sharing?** — Ensure different entities cannot share a record
+3. **How do I match entities to records?** — When an entity arrives, how do I find its record?
+
+## Example: Laboratory Mouse Database
+
+Consider a neuroscience lab tracking mice:
+
+| Question | Answer |
+|----------|--------|
+| Prevent duplicates? | Each mouse gets a unique ear tag at arrival; database rejects duplicate tags |
+| Prevent sharing? | Ear tags are never reused; retired tags are archived |
+| Match entities? | Read the ear tag → look up record by primary key |
+
+`````{tab-set}
+````{tab-item} DataJoint
+:sync: datajoint
+```python
+@schema
+class Mouse(dj.Manual):
+    definition = """
+    ear_tag : char(6) # unique ear tag (e.g., 'M00142')
+    ---
+    date_of_birth : date
+    sex : enum('M', 'F', 'U')
+    strain : varchar(50)
+    """
+```
+````
+````{tab-item} SQL
+:sync: sql
+```sql
+CREATE TABLE mouse (
+    ear_tag CHAR(6) NOT NULL COMMENT 'unique ear tag (e.g., M00142)',
+    date_of_birth DATE NOT NULL,
+    sex ENUM('M', 'F', 'U') NOT NULL,
+    strain VARCHAR(50) NOT NULL,
+    PRIMARY KEY (ear_tag)
+);
+```
+````
+`````
+
+## Example: University Student Database
+
+Consider a university registrar's office tracking students:
+
+| Question | Answer |
+|----------|--------|
+| Prevent duplicates? | Each student gets a unique ID at enrollment; verification against existing records using name, date of birth, and government ID |
+| Prevent sharing? | Photo ID cards issued; IDs are never reused even after graduation |
+| Match entities? | Student presents ID card → look up record by student ID |
+
+`````{tab-set}
+````{tab-item} DataJoint
+:sync: datajoint
+```python
+@schema
+class Student(dj.Manual):
+    definition = """
+    student_id : char(8) # unique student ID (e.g., 'S2024001')
+    ---
+    first_name : varchar(50)
+    last_name : varchar(50)
+    date_of_birth : date
+    enrollment_date : date
+    """
+```
+````
+````{tab-item} SQL
+:sync: sql
+```sql
+CREATE TABLE student (
+    student_id CHAR(8) NOT NULL COMMENT 'unique student ID (e.g., S2024001)',
+    first_name VARCHAR(50) NOT NULL,
+    last_name VARCHAR(50) NOT NULL,
+    date_of_birth DATE NOT NULL,
+    enrollment_date DATE NOT NULL,
+    PRIMARY KEY (student_id)
+);
+```
+````
+`````
+
+Notice how both examples follow the same pattern: a real-world identification system (ear tags, student IDs) enables the three questions to be answered consistently.
+
+The database enforces the first two questions automatically through the primary key constraint. The third question requires a **physical identification system**—ear tags, barcodes, or RFID chips that link physical entities to database records.
+
+```{admonition} Entity Integrity Requires Real-World Systems
+:class: important
+
+The database can enforce uniqueness, but cannot create it. You must establish identification systems *outside* the database:
+- Laboratory animals: ear tags, microchips
+- Students: ID cards, student numbers
+- Products: SKUs, barcodes
+- Citizens: government IDs, SSNs
+
+The primary key in the database mirrors and enforces the real-world identification system.
+```
+
+```{admonition} Historical Example: The Social Security Number
+:class: note dropdown
+
+Establishing the Social Security system in the United States required reliable identification of workers by all employers to report their income across their entire careers. For this purpose, in 1936, the Federal Government established a new process to ensure that each US worker would be assigned a unique number—the Social Security Number (SSN).
+
+The SSN would be assigned at birth or upon entering the country for employment, and no person would be allowed to have two such numbers. Establishing and enforcing such a system is not easy and takes considerable effort.
+
+**Questions to consider:**
+- Why do you think the US government did not need to assign unique identifiers to taxpayers when it began levying federal taxes in 1913?
+- What abuses would become possible if a person could obtain two SSNs, or if two persons could share the same SSN?
+
+**Learn more** about the history and uses of the SSN:
+- [History of establishing the SSN](https://www.ssa.gov/history/ssn/firstcard.html)
+- [How the SSN works](https://www.ssa.gov/policy/docs/ssb/v69n2/v69n2p55.html)
+- [IRS timeline](https://www.irs.gov/irs-history-timeline)
+```
+
+# Types of Primary Keys
+
+Primary keys can be classified along two independent dimensions:
+
+1. **Usage**: Natural keys (used in the real world) vs. Surrogate keys (used only inside the database)
+2. **Composition**: Simple keys (one attribute) vs. Composite keys (multiple attributes)
+
+These dimensions are independent—a natural key can be simple or composite, and so can a surrogate key.
+
+## Natural Keys
+
+A **natural key** is an identifier used *outside* the database to refer to entities in the real world. The defining characteristic is that the key requires a **real-world mechanism** to establish and maintain the permanent association between entities and their identifiers.
+
+Natural keys may originate from:
+- External standards (ISBN for books, VIN for vehicles)
+- Government systems (SSN, passport numbers)
+- Institutional systems (student IDs, employee numbers)
+- Laboratory systems (animal IDs generated by colony management software)
+
+Even when a database or management system *generates* the identifier, if that identifier is then used in the real world to refer to the entity—printed on labels, written in lab notebooks, referenced in conversations—it functions as a natural key.
+
+**Example: Laboratory Animal IDs**
+
+A colony management system might generate animal IDs like `M00142`. Once that ID is printed on an ear tag and attached to a mouse, it becomes the natural key. The real-world mechanism (the ear tag) maintains the association between the physical mouse and its identifier.
+
+`````{tab-set}
+````{tab-item} DataJoint
+:sync: datajoint
+```python
+@schema
+class Mouse(dj.Manual):
+    definition = """
+    animal_id : char(6) # colony-assigned ID (e.g., 'M00142')
+    ---
+    date_of_birth : date
+    sex : enum('M', 'F', 'U')
+    strain : varchar(50)
+    """
+```
+````
+````{tab-item} SQL
+:sync: sql
+```sql
+CREATE TABLE mouse (
+    animal_id CHAR(6) NOT NULL COMMENT 'colony-assigned ID (e.g., M00142)',
+    date_of_birth DATE NOT NULL,
+    sex ENUM('M', 'F', 'U') NOT NULL,
+    strain VARCHAR(50) NOT NULL,
+    PRIMARY KEY (animal_id)
+);
+```
+````
+`````
+
+**Examples of composite natural keys:**
+- (State, District) for U.S. Congressional Districts
+- (Building, Room Number) for rooms
+- (Subject, Session) when session numbers are recorded in lab notebooks
+
+**Advantages:**
+- Meaningful to users—they can discuss and search for entities by their key
+- Enables matching between database records and physical entities
+- Already established and enforced by external systems
+
+**Disadvantages:**
+- Requires reliable real-world identification systems
+- May change (though ideally should not)
+- Privacy concerns for personal identifiers
+- Format inconsistencies across sources
+
+```{admonition} Real-World Identification Standards
+:class: seealso dropdown
+
+Establishing rigorous identification systems often requires costly standardization efforts with many systems for enforcement and coordination. Examples include:
+
+- [Vehicle Identification Number (VIN)](https://www.iso.org/standard/52200.html) — regulated by the International Organization for Standardization
+- [Radio-Frequency Identification for Animals (ISO 11784/11785)](https://en.wikipedia.org/wiki/ISO_11784_and_ISO_11785) — standards for implanted microchips in animals
+- [US Aircraft Registration Numbers](https://www.faa.gov/licenses_certificates/aircraft_certification/aircraft_registry/forming_nnumber) — the N-numbers seen on aircraft tails, regulated by the FAA
+
+When a science lab establishes a data management process, the first step is often to establish a uniform system for identifying test subjects, experiments, protocols, and treatments. Standard nomenclatures exist to standardize names across institutions, and labs must be aware of them and follow them.
+```
+
+## Surrogate Keys
+
+A **surrogate key** is an identifier used *primarily inside* the database, with minimal or no exposure to end users. Users typically don't search for entities by surrogate keys or use them in conversation.
+
+**Examples:**
+- Internal post IDs on social media (users search by content, not by ID)
+- Database row identifiers that never appear in user interfaces
+- System-generated UUIDs for internal tracking
+
+`````{tab-set}
+````{tab-item} DataJoint
+:sync: datajoint
+```python
+@schema
+class InternalRecord(dj.Manual):
+    definition = """
+    record_id : int unsigned # internal identifier, not exposed to users
+    ---
+    created_timestamp : timestamp
+    data : longblob
+    """
+```
+````
+````{tab-item} SQL
+:sync: sql
+```sql
+CREATE TABLE internal_record (
+    record_id INT UNSIGNED NOT NULL
+        COMMENT 'internal identifier, not exposed to users',
+    created_timestamp TIMESTAMP NOT NULL,
+    data LONGBLOB NOT NULL,
+    PRIMARY KEY (record_id)
+);
+```
+````
+`````
+
+**Key distinction from natural keys:** Surrogate keys don't require external identification systems because users don't need to match physical entities to records by these keys. The database maintains uniqueness, but the key itself isn't used for entity identification in the real world.
+
+**When surrogate keys are appropriate:**
+- Entities that exist only within the system (no physical counterpart)
+- Privacy-sensitive contexts where natural identifiers shouldn't be stored
+- Internal system records that users never reference directly
+
+``````{admonition} No Default Values in Primary Keys
+:class: important
+
+**DataJoint prohibits default values for primary key attributes.** Every primary key value must be explicitly provided by the client when inserting a new record. This includes prohibiting the use of `auto_increment`, which is commonly used in other frameworks.
+
+This design enforces entity integrity at the point of data entry:
+
+- **Explicit identification required**: The client must communicate the identifying information for each new entity. This forces users to think about entity identity *before* insertion.
+- **Prevents communication errors**: If a client fails to provide a key value, the insertion fails rather than silently creating a record with a generated key that may not correspond to the intended entity.
+- **Prevents duplicate entities**: Running the same insertion code multiple times with the same explicit key produces an error (duplicate key) rather than creating multiple records for the same entity.
+
+`````{tab-set}
+````{tab-item} DataJoint
+:sync: datajoint
+```python
+@schema
+class Session(dj.Manual):
+    definition = """
+    -> Subject
+    session : smallint unsigned # session number for this subject
+    ---
+    session_date : date
+    notes : varchar(1000)
+    """
+
+# Explicit key required - this is the DataJoint way
+Session.insert1({
+    'subject_id': 'M001', 'session': 1,
+    'session_date': '2024-01-15', 'notes': ''
+})
+
+# Running the same insert again produces a duplicate key error, not a second record
+```
+````
+````{tab-item} SQL
+:sync: sql
+```sql
+CREATE TABLE session (
+    subject_id VARCHAR(12) NOT NULL,
+    session SMALLINT UNSIGNED NOT NULL COMMENT 'session number for this subject',
+    session_date DATE NOT NULL,
+    notes VARCHAR(1000) NOT NULL,
+    PRIMARY KEY (subject_id, session),
+    FOREIGN KEY (subject_id) REFERENCES subject(subject_id)
+);
+
+-- Explicit key required
+INSERT INTO session (subject_id, session, session_date, notes)
+VALUES ('M001', 1, '2024-01-15', '');
+
+-- Running the same insert again produces a duplicate key error, not a second record
+```
+````
+`````
+
+**Generating surrogate keys**: Since DataJoint requires explicit key values, how do you generate unique surrogate keys? Use client-side generation methods:
+
+- **UUIDs and related systems**: Generate universally unique identifiers client-side before insertion. UUIDs (UUID1, UUID4, UUID5), ULIDs (sortable), and NANOIDs (compact) all provide collision-resistant unique identifiers.
See [UUIDs](../85-special-topics/025-uuid.ipynb) for implementation details and guidance on choosing the right type. +- **Client-side counters**: Query the current maximum value and increment before insertion. +- **External ID services**: Use institutional or laboratory ID assignment systems that generate unique identifiers. + +These approaches maintain DataJoint's requirement for explicit key specification while providing unique identifiers for surrogate keys. +`````` + +## Composite Keys in Hierarchical Relationships + +Composite primary keys commonly arise when tables inherit foreign keys as part of their primary key. This creates hierarchical relationships where child entities are identified within the context of their parent. + +`````{tab-set} +````{tab-item} DataJoint +:sync: datajoint +```python +@schema +class Subject(dj.Manual): + definition = """ + subject_id : varchar(12) # subject identifier + --- + species : varchar(30) + """ + +@schema +class Session(dj.Manual): + definition = """ + -> Subject + session : smallint unsigned # session number within subject + --- + session_date : date + """ +``` +```` +````{tab-item} SQL +:sync: sql +```sql +CREATE TABLE subject ( + subject_id VARCHAR(12) NOT NULL COMMENT 'subject identifier', + species VARCHAR(30) NOT NULL, + PRIMARY KEY (subject_id) +); + +CREATE TABLE session ( + subject_id VARCHAR(12) NOT NULL COMMENT 'subject identifier', + session SMALLINT UNSIGNED NOT NULL COMMENT 'session number within subject', + session_date DATE NOT NULL, + PRIMARY KEY (subject_id, session), + FOREIGN KEY (subject_id) REFERENCES subject(subject_id) +); +``` +```` +````` + +In this example, `Session` has a composite primary key `(subject_id, session)`. Each session is uniquely identified by *which subject* and *which session number*. This pattern is covered in detail in the [Relationships](050-relationships.ipynb) chapter. 
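
The composite-key behavior described here can be tried without a DataJoint server. The sketch below is an illustration only: it assumes Python's built-in `sqlite3` module with an in-memory database as a stand-in for MySQL, and the helper name `try_insert_session` is hypothetical. It shows that a session number may repeat across subjects, while a repeated `(subject_id, session)` pair or a reference to a missing subject is rejected.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL server used in this book (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE subject (
        subject_id   TEXT NOT NULL,
        species      TEXT NOT NULL,
        PRIMARY KEY (subject_id)
    );
    CREATE TABLE session (
        subject_id   TEXT NOT NULL,
        session      INTEGER NOT NULL,
        session_date TEXT NOT NULL,
        PRIMARY KEY (subject_id, session),
        FOREIGN KEY (subject_id) REFERENCES subject(subject_id)
    );
""")
conn.executemany("INSERT INTO subject VALUES (?, ?)",
                 [("M001", "mouse"), ("M002", "mouse")])
conn.commit()


def try_insert_session(subject_id, session, session_date):
    """Hypothetical helper: True if the row was inserted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO session VALUES (?, ?, ?)",
                         (subject_id, session, session_date))
        return True
    except sqlite3.IntegrityError:
        return False


first = try_insert_session("M001", 1, "2024-01-15")      # new (subject, session) pair
other = try_insert_session("M002", 1, "2024-01-16")      # same session number, different subject
duplicate = try_insert_session("M001", 1, "2024-01-15")  # repeats an existing composite key
orphan = try_insert_session("M999", 1, "2024-01-17")     # no such subject: foreign key fails
session_count = conn.execute("SELECT COUNT(*) FROM session").fetchone()[0]
print(first, other, duplicate, orphan, session_count)
```

Only the first two inserts should succeed. The session number alone identifies nothing; it is meaningful only together with the subject it belongs to.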
+ +```{seealso} +For detailed coverage of composite keys through foreign key inheritance and hierarchical relationships, see [Relationships](050-relationships.ipynb). +``` + +# Schema Dimensions + +A **schema dimension** is created when a table defines a new primary key attribute directly, rather than inheriting it through a foreign key. Tables that introduce new primary key attributes are said to create new schema dimensions. + +## Identifying Schema Dimensions + +Consider this hierarchy: + +`````{tab-set} +````{tab-item} DataJoint +:sync: datajoint +```python +@schema +class Subject(dj.Manual): + definition = """ + subject_id : varchar(12) # NEW DIMENSION: defines subject identity + --- + species : varchar(30) + """ + +@schema +class Session(dj.Manual): + definition = """ + -> Subject # inherits subject_id dimension + session : smallint unsigned # NEW DIMENSION: defines session identity within subject + --- + session_date : date + """ + +@schema +class Scan(dj.Manual): + definition = """ + -> Session # inherits subject_id and session dimensions + scan : smallint unsigned # NEW DIMENSION: defines scan identity within session + --- + scan_time : time + """ +``` +```` +````{tab-item} SQL +:sync: sql +```sql +-- NEW DIMENSION: defines subject identity +CREATE TABLE subject ( + subject_id VARCHAR(12) NOT NULL COMMENT 'defines subject identity', + species VARCHAR(30) NOT NULL, + PRIMARY KEY (subject_id) +); + +-- inherits subject_id dimension; NEW DIMENSION: session +CREATE TABLE session ( + subject_id VARCHAR(12) NOT NULL, + session SMALLINT UNSIGNED NOT NULL + COMMENT 'defines session identity within subject', + session_date DATE NOT NULL, + PRIMARY KEY (subject_id, session), + FOREIGN KEY (subject_id) REFERENCES subject(subject_id) +); + +-- inherits subject_id and session dimensions; NEW DIMENSION: scan +CREATE TABLE scan ( + subject_id VARCHAR(12) NOT NULL, + session SMALLINT UNSIGNED NOT NULL, + scan SMALLINT UNSIGNED NOT NULL + COMMENT 'defines scan identity 
within session', + scan_time TIME NOT NULL, + PRIMARY KEY (subject_id, session, scan), + FOREIGN KEY (subject_id, session) + REFERENCES session(subject_id, session) +); +``` +```` +````` + +In this example: +- `Subject` creates the `subject_id` dimension +- `Session` inherits `subject_id` and creates the `session` dimension +- `Scan` inherits both `subject_id` and `session`, and creates the `scan` dimension + +## Diagram Notation + +In DataJoint diagrams, tables that introduce new schema dimensions have their names **underlined**. Tables that only inherit their primary key through foreign keys (without adding new attributes) are not underlined—they represent the same identity as their parent. + +```{admonition} Underlined Names in Diagrams +:class: tip + +When viewing a schema diagram: +- **Underlined table names** indicate tables that introduce new dimensions +- **Non-underlined table names** indicate tables whose identity is fully determined by their parent(s) + +This visual distinction helps you quickly identify which tables define new entity types versus which extend existing ones. +``` + +## Why Schema Dimensions Matter + +Schema dimensions are fundamental to how DataJoint performs **semantic matching** in queries. When you join tables or use one table to restrict another, DataJoint matches rows based on shared schema dimensions—not just attributes with the same name. + +Two attributes match semantically when they: +1. Have the **same name** +2. Trace back to the **same original dimension** through foreign key chains + +This is why `subject_id` in `Subject`, `Session`, and `Scan` all refer to the same dimension and will be matched in joins, while an unrelated `subject_id` in a completely separate table hierarchy would not match. + +## Schema Dimensions and Auto-Populated Tables + +Auto-populated tables (`dj.Computed` and `dj.Imported`) have a special constraint: **they cannot introduce new schema dimensions directly**. 
Their primary key must be fully determined by their upstream dependencies through foreign keys. + +This constraint ensures that auto-populated tables compute results for entities that are already defined elsewhere in the pipeline. The `make` method receives a key from the key source (derived from parent tables), and the computation produces results for that specific key. + +`````{tab-set} +````{tab-item} DataJoint +:sync: datajoint +```python +@schema +class ProcessedScan(dj.Computed): + definition = """ + -> Scan # inherits subject_id, session, scan dimensions + --- # NO new primary key attributes allowed here + processed_data : longblob + quality_score : float + """ +``` +```` +````{tab-item} SQL +:sync: sql +```sql +-- Primary key inherits all dimensions from scan; no new dimensions added +CREATE TABLE processed_scan ( + subject_id VARCHAR(12) NOT NULL, + session SMALLINT UNSIGNED NOT NULL, + scan SMALLINT UNSIGNED NOT NULL, + processed_data LONGBLOB NOT NULL, + quality_score FLOAT NOT NULL, + PRIMARY KEY (subject_id, session, scan), + FOREIGN KEY (subject_id, session, scan) + REFERENCES scan(subject_id, session, scan) +); +``` +```` +````` + +However, **part tables can introduce new dimensions**. 
When a computation produces multiple related results (e.g., detecting multiple cells in an image), the part table can add a new dimension to distinguish them: + +`````{tab-set} +````{tab-item} DataJoint +:sync: datajoint +```python +@schema +class CellDetection(dj.Computed): + definition = """ + -> Scan # master table inherits dimensions + --- + detection_method : varchar(60) + """ + + class Cell(dj.Part): + definition = """ + -> master + cell_id : smallint unsigned # NEW DIMENSION: identifies cells within scan + --- + cell_x : float + cell_y : float + cell_type : varchar(30) + """ +``` +```` +````{tab-item} SQL +:sync: sql +```sql +-- Master table: inherits dimensions from scan +CREATE TABLE cell_detection ( + subject_id VARCHAR(12) NOT NULL, + session SMALLINT UNSIGNED NOT NULL, + scan SMALLINT UNSIGNED NOT NULL, + detection_method VARCHAR(60) NOT NULL, + PRIMARY KEY (subject_id, session, scan), + FOREIGN KEY (subject_id, session, scan) + REFERENCES scan(subject_id, session, scan) +); + +-- Part table: adds cell_id as NEW DIMENSION +CREATE TABLE cell_detection__cell ( + subject_id VARCHAR(12) NOT NULL, + session SMALLINT UNSIGNED NOT NULL, + scan SMALLINT UNSIGNED NOT NULL, + cell_id SMALLINT UNSIGNED NOT NULL + COMMENT 'identifies cells within scan', + cell_x FLOAT NOT NULL, + cell_y FLOAT NOT NULL, + cell_type VARCHAR(30) NOT NULL, + PRIMARY KEY (subject_id, session, scan, cell_id), + FOREIGN KEY (subject_id, session, scan) + REFERENCES cell_detection(subject_id, session, scan) +); +``` +```` +````` + +In this example, `CellDetection` (the master) cannot introduce new dimensions, but `CellDetection.Cell` (the part table) introduces the `cell_id` dimension to identify individual detected cells. 
+ +```{admonition} Why This Constraint Exists +:class: note + +This design ensures that: +- Computations are reproducible and traceable to their inputs +- The key source for auto-populated tables is well-defined +- New entity types are introduced through manual or lookup tables, not through automated computation +- Part tables handle the case where a single computation produces multiple output entities +``` + +# Choosing the Right Primary Key Strategy + +| Scenario | Recommended Approach | +|----------|---------------------| +| Established external ID system exists | Use the natural key | +| Entity naturally identified by multiple attributes | Use composite natural key | +| Entity identified within parent context | Inherit foreign key + add local identifier | +| No natural identifier exists | Create explicit surrogate key | +| Privacy-sensitive context | Surrogate key (not natural) | + +```{admonition} No Default Values in Primary Keys +:class: warning + +DataJoint prohibits default values (including `auto_increment`) for primary key attributes. All key values must be explicitly provided at insertion. See [No Default Values in Primary Keys](#no-default-values-in-primary-keys) above for details and alternatives. +``` + +# Entity Integrity Varies by Context + +Different applications require different levels of entity integrity: + +| Level | Example | Enforcement | +|-------|---------|-------------| +| **Strict** | Airlines, banks | Government ID verification, biometrics | +| **Moderate** | Universities, hospitals | Photo ID, documentation | +| **Flexible** | Gyms, loyalty programs | Basic verification, some sharing tolerated | +| **Minimal** | Social media | Email verification only | + +**Example: Strict vs. Flexible** + +An airline *must* know exactly who boards each flight (strict entity integrity). A grocery store loyalty program may not care if family members share a card (flexible entity integrity). 
+ +## Partial Entity Integrity + +Sometimes only **one direction** of entity integrity is required: + +- **Record → Entity (uniqueness)**: Each record corresponds to at most one entity, but an entity might have multiple records +- **Entity → Record (completeness)**: Each entity has a record, but records might be shared + +**Example:** A social media platform might ensure that each user account is tied to exactly one person (preventing account sharing), but not prevent a person from creating multiple accounts. This is partial entity integrity—the record-to-entity direction is enforced, but not entity-to-record. + +For many applications, partial integrity is sufficient. Design your primary keys to match your actual requirements—don't over-engineer for scenarios that don't matter to your domain. + +## Entity Integrity Without Natural Keys + +When no natural key can be established—no external identifier exists and no real-world mechanism can maintain the entity-to-record association—full entity integrity is still possible but requires a **multi-step identification process**. + +Consider a scenario where anonymous survey responses must be linked to follow-up surveys from the same respondent: + +1. **Generate a unique token** at the time of first response +2. **Provide the token** to the respondent (email, printed card, etc.) +3. **Require the token** for follow-up responses +4. **Trust the process** to maintain the association + +The database ensures uniqueness of records through the primary key, but **matching records to real-world entities requires comprehensive process design** outside the database. The token becomes a natural key only if the external process reliably maintains the association. 
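
The four steps above can be sketched in a few lines of Python. This is a hedged illustration, not a prescribed implementation: the `responses` dictionary stands in for a database table, the function names `first_response` and `follow_up` are hypothetical, and `secrets.token_urlsafe` is just one reasonable way to produce hard-to-guess tokens.

```python
import secrets

# Stand-in for a database table keyed by token (assumption: a real system
# would store these rows in a table whose primary key is the token).
responses = {}


def first_response(answers):
    """Step 1: generate a unique token and store the first response under it."""
    token = secrets.token_urlsafe(16)  # random, URL-safe, hard to guess
    while token in responses:          # collisions are extremely unlikely; check anyway
        token = secrets.token_urlsafe(16)
    responses[token] = [answers]
    return token                       # Step 2: hand this token to the respondent


def follow_up(token, answers):
    """Step 3: accept a follow-up only when it carries a known token."""
    if token not in responses:
        raise KeyError("unknown token: cannot link this response to a respondent")
    responses[token].append(answers)


token = first_response({"q1": "yes"})
follow_up(token, {"q1": "no"})
linked = len(responses[token])  # number of responses linked to this one record

try:
    follow_up("not-a-real-token", {"q1": "maybe"})
    rejected = False
except KeyError:
    rejected = True
```

Step 4 has no code counterpart. Nothing here prevents a respondent from losing or sharing the token, which is why the surrounding process matters as much as the key itself.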
+ +```{admonition} The Database's Role +:class: note + +The database can only ensure: +- **Uniqueness**: No two records share the same primary key +- **Referential integrity**: Foreign keys point to valid records + +What the database *cannot* ensure: +- That a given record corresponds to the intended real-world entity +- That an entity doesn't have multiple records (unless enforced externally) + +Entity integrity for real-world entities always requires some external identification process—whether it's ear tags on mice, ID cards for students, or carefully designed token systems. +``` + +# Primary Keys in DataJoint Queries + +Primary keys have special significance in DataJoint queries: + +1. **Semantic matching in joins** — When you join tables with `*`, DataJoint matches on shared schema dimensions, not just attribute names +2. **Semantic matching in restrictions** — When you restrict a table by another (`A & B`), matching is performed on shared schema dimensions +3. **Restrictions are efficient** — Queries by primary key use indexes for fast lookups +4. 
**Results always have primary keys** — Every query result is itself a valid relation with a well-defined primary key + +`````{tab-set} +````{tab-item} DataJoint +:sync: datajoint +```python +# Efficient: restriction by primary key +Mouse & {'animal_id': 'M00142'} + +# Join matches on shared schema dimensions +Subject * Session * Scan # All three share the subject_id dimension + +# The result of any query has a well-defined primary key +(Subject * Session).primary_key # Combines dimensions from both tables +``` +```` +````{tab-item} SQL +:sync: sql +```sql +-- Efficient: restriction by primary key +SELECT * FROM mouse WHERE animal_id = 'M00142'; + +-- Join matches on shared schema dimensions +SELECT * FROM subject + NATURAL JOIN session + NATURAL JOIN scan; + +-- Combined primary key from joined tables: (subject_id, session, scan) +``` +```` +````` + +```{admonition} Semantic Matching via Schema Dimensions +:class: note + +DataJoint's join and restriction operations differ from SQL's `NATURAL JOIN`. Two attributes are matched only when they belong to the **same schema dimension**: + +1. They have the **same name** in both tables +2. They trace back to the **same original definition** through foreign key chains + +This prevents accidental matches on attributes that happen to share a name but originate from different dimensions. For example, two tables might both have a `name` attribute, but if one refers to a person's name and the other to a course name, they represent different dimensions and will not be matched. + +For details, see the [Join](../50-queries/040-join.ipynb) chapter. 
+``` + +# Summary + +Primary keys are the foundation of entity integrity in relational databases: + +| Concept | Key Points | +|---------|------------| +| **Entity Integrity** | 1:1 correspondence between entities and records; requires external processes | +| **Three Questions** | Prevent duplicates, prevent sharing, enable matching | +| **Natural Keys** | Identifiers used in the real world to refer to entities; require external association mechanisms | +| **Surrogate Keys** | Identifiers used only inside the database; not exposed to users | +| **Composite Keys** | Multiple attributes forming the key (applies to both natural and surrogate) | +| **Partial Integrity** | Sometimes only one direction of entity-record correspondence is needed | +| **Schema Dimensions** | New primary key attributes define dimensions; inherited attributes share them | +| **Semantic Matching** | Joins and restrictions match on shared schema dimensions | + +```{admonition} Design Principles +:class: tip + +1. **Design external processes** — The database ensures uniqueness; you must design processes to match entities to records +2. **Use natural keys when possible** — If identifiers are used in the real world, use them as primary keys +3. **Define explicitly** — Avoid auto-increment; always specify identifiers explicitly to maintain entity integrity +4. 
**Match requirements** — Don't over-engineer; partial entity integrity may be sufficient for your application +``` + +```{admonition} Next Steps +:class: note + +Now that you understand how primary keys ensure entity integrity, the next chapters explore: +- **[Lookup Tables](020-lookup-tables.ipynb)** — Reference data with pre-populated primary keys +- **[Foreign Keys](030-foreign-keys.md)** — How primary keys enable referential integrity across tables +``` diff --git a/book/30-design/017-default-values.ipynb b/book/30-design/019-default-values.ipynb similarity index 71% rename from book/30-design/017-default-values.ipynb rename to book/30-design/019-default-values.ipynb index 2cc2152..56d2ad2 100644 --- a/book/30-design/017-default-values.ipynb +++ b/book/30-design/019-default-values.ipynb @@ -3,25 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# Default Values\n", - "\n", - "When designing database tables, setting default values for attributes can be a powerful tool for ensuring data consistency, reducing errors, and simplifying data entry. \n", - "\n", - "Default values allow you to predefine certain field values in a table, so that that attribute can be omitted at insert and the default value is then used.\n", - "\n", - "## Benefits of Using Default Values\n", - "1. **Consistency**: Default values help maintain uniformity across records by ensuring that certain fields always have a predefined value unless explicitly overridden. This is particularly useful for fields that have common or standard values.\n", - "\n", - "2. **Error Reduction**: By automatically filling in certain fields with default values, you minimize the chances of missing or incorrect data entry. This is especially beneficial in large-scale data entry operations where manual input errors can occur.\n", - "\n", - "3. **Efficiency**: Default values streamline the process of adding new records, as users do not need to repeatedly enter the same information for every new record. 
This saves time and reduces the cognitive load on researchers.\n", - "\n", - "4. **Clarity:** Setting default values can make the intent of a database design clearer. It signals to users that certain fields are expected to have a particular value unless there is a specific reason to deviate.\n", - "\n", - "\n", - "Frequent default values are the empty string `\"\"`, current date or time, zero, or `null`." - ] + "source": "# Default Values\n\nWhen designing database tables, setting default values for attributes can be a powerful tool for ensuring data consistency, reducing errors, and simplifying data entry. \n\nDefault values allow you to predefine the value of an attribute, so that the attribute can be omitted at insert and the default value is used instead.\n\n```{admonition} Primary Key Attributes Cannot Have Default Values\n:class: warning\n\nDefault values apply only to **dependent attributes** (those below the `---` line). DataJoint prohibits default values for primary key attributes to ensure entity integrity—every identifying value must be explicitly provided by the client at insertion time. See [Primary Keys](018-primary-key.md) for details.\n```\n\n## Benefits of Using Default Values\n1. **Consistency**: Default values help maintain uniformity across records by ensuring that certain fields always have a predefined value unless explicitly overridden. This is particularly useful for fields that have common or standard values.\n\n2. **Error Reduction**: By automatically filling in certain fields with default values, you minimize the chances of missing or incorrect data entry. This is especially beneficial in large-scale data entry operations where manual input errors can occur.\n\n3. **Efficiency**: Default values streamline the process of adding new records, as users do not need to repeatedly enter the same information for every new record. This saves time and reduces the cognitive load on researchers.\n\n4. 
**Clarity:** Setting default values can make the intent of a database design clearer. It signals to users that certain fields are expected to have a particular value unless there is a specific reason to deviate.\n\n\nFrequent default values are the empty string `\"\"`, current date or time, zero, or `null`." }, { "cell_type": "markdown", @@ -126,4 +108,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file diff --git a/book/30-design/018-lookup-tables.ipynb b/book/30-design/020-lookup-tables.ipynb similarity index 78% rename from book/30-design/018-lookup-tables.ipynb rename to book/30-design/020-lookup-tables.ipynb index d0640c2..ead42a2 100644 --- a/book/30-design/018-lookup-tables.ipynb +++ b/book/30-design/020-lookup-tables.ipynb @@ -53,22 +53,10 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "@schema\n", - "class LetterGrade(dj.Lookup):\n", - " definition = \"\"\"\n", - " grade_letter: char(2) # Letter grade\n", - " ----\n", - " grade_point = null: decimal(3,2) unsigned # Corresponding grade point\n", - " \"\"\"\n", - " contents = [\n", - " ('A', 4.00), ('A-', 3.67), ('B+', 3.33), ('B', 3.00), ('B-', 2.67), ('C+', 2.33),\n", - " ('C', 2.00), ('C-', 1.67), ('D+', 1.33), ('D', 1.00), ('F', 0.00), ('I', None)\n", - " ]" - ] + "source": "@schema\nclass LetterGrade(dj.Lookup):\n definition = \"\"\"\n grade_letter: char(2) # Letter grade\n ----\n grade_point = null: decimal(3,2) unsigned # Corresponding grade point\n \"\"\"\n contents = [\n ('A', 4.00), ('A-', 3.67), ('B+', 3.33), ('B', 3.00),\n ('B-', 2.67), ('C+', 2.33), ('C', 2.00), ('C-', 1.67),\n ('D+', 1.33), ('D', 1.00), ('F', 0.00), ('I', None)\n ]" }, { "cell_type": "code", @@ -230,55 +218,10 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "elements = [\n", - " (1, 'Hydrogen', 'H', 1.008, '1s1'), (2, 'Helium', 'He', 4.0026, 
'1s2'),\n", - " (3, 'Lithium', 'Li', 6.94, '[He] 2s1'), (4, 'Beryllium', 'Be', 9.0122, '[He] 2s2'), \n", - " (5, 'Boron', 'B', 10.81, '[He] 2s2 2p1'), (6, 'Carbon', 'C', 12.011, '[He] 2s2 2p2'), \n", - " (7, 'Nitrogen', 'N', 14.007, '[He] 2s2 2p3'), (8, 'Oxygen', 'O', 15.999, '[He] 2s2 2p4'), \n", - " (9, 'Fluorine', 'F', 18.998, '[He] 2s2 2p5'), (10, 'Neon', 'Ne', 20.18, '[He] 2s2 2p6'), \n", - " (11, 'Sodium', 'Na', 22.99, '[Ne] 3s1'), (12, 'Magnesium', 'Mg', 24.305, '[Ne] 3s2'),\n", - " (13, 'Aluminum', 'Al', 26.982, '[Ne] 3s2 3p1'), (14, 'Silicon', 'Si', 28.085, '[Ne] 3s2 3p2'), \n", - " (15, 'Phosphorus', 'P', 30.974, '[Ne] 3s2 3p3'), (16, 'Sulfur', 'S', 32.06, '[Ne] 3s2 3p4'),\n", - " (17, 'Chlorine', 'Cl', 35.45, '[Ne] 3s2 3p5'), (18, 'Argon', 'Ar', 39.948, '[Ne] 3s2 3p6'),\n", - " (19, 'Potassium', 'K', 39.098, '[Ar] 4s1'), (20, 'Calcium', 'Ca', 40.078, '[Ar] 4s2'),\n", - " (21, 'Scandium', 'Sc', 44.956, '[Ar] 3d1 4s2'), (22, 'Titanium', 'Ti', 47.867, '[Ar] 3d2 4s2'),\n", - " (23, 'Vanadium', 'V', 50.942, '[Ar] 3d3 4s2'), (24, 'Chromium', 'Cr', 51.996, '[Ar] 3d5 4s1'),\n", - " (25, 'Manganese', 'Mn', 54.938, '[Ar] 3d5 4s2'), (26, 'Iron', 'Fe', 55.845, '[Ar] 3d6 4s2'), \n", - " (27, 'Cobalt', 'Co', 58.933, '[Ar] 3d7 4s2'), (28, 'Nickel', 'Ni', 58.693, '[Ar] 3d8 4s2'),\n", - " (29, 'Copper', 'Cu', 63.546, '[Ar] 3d10 4s1'), (30, 'Zinc', 'Zn', 65.38, '[Ar] 3d10 4s2'),\n", - " (31, 'Gallium', 'Ga', 69.723, '[Ar] 3d10 4s2 4p1'), (32, 'Germanium', 'Ge', 72.63, '[Ar] 3d10 4s2 4p2'), \n", - " (33, 'Arsenic', 'As', 74.922, '[Ar] 3d10 4s2 4p3'), (34, 'Selenium', 'Se', 78.971, '[Ar] 3d10 4s2 4p4'), \n", - " (35, 'Bromine', 'Br', 79.904, '[Ar] 3d10 4s2 4p5'), (36, 'Krypton', 'Kr', 83.798, '[Ar] 3d10 4s2 4p6'), \n", - " (37, 'Rubidium', 'Rb', 85.468, '[Kr] 5s1'), (38, 'Strontium', 'Sr', 87.62, '[Kr] 5s2'), \n", - " (39, 'Yttrium', 'Y', 88.906, '[Kr] 4d1 5s2'), (40, 'Zirconium', 'Zr', 91.224, '[Kr] 4d2 5s2'), \n", - " (41, 'Niobium', 'Nb', 92.906, '[Kr] 4d3 5s2'), (42, 
'Molybdenum', 'Mo', 95.95, '[Kr] 4d4 5s2'), \n", - " (43, 'Technetium', 'Tc', 98, '[Kr] 4d5 5s2'), (44, 'Ruthenium', 'Ru', 101.07, '[Kr] 4d6 5s2'), \n", - " (45, 'Rhodium', 'Rh', 102.91, '[Kr] 4d7 5s2'), (46, 'Palladium', 'Pd', 106.42, '[Kr] 4d8 5s2'), \n", - " (47, 'Silver', 'Ag', 107.87, '[Kr] 4d10'), (48, 'Cadmium', 'Cd', 112.41, '[Kr] 4d10 5s2'), \n", - " (49, 'Indium', 'In', 114.82, '[Kr] 4d10 5s2 5p1'), (50, 'Tin', 'Sn', 118.71, '[Kr] 4d10 5s2 5p2'), \n", - " (51, 'Antimony', 'Sb', 121.76, '[Kr] 4d10 5s2 5p3'), (52, 'Tellurium', 'Te', 127.6, '[Kr] 4d10 5s2 5p4'), \n", - " (53, 'Iodine', 'I', 126.9, '[Kr] 4d10 5s2 5p5'), (54, 'Xenon', 'Xe', 131.29, '[Kr] 4d10 5s2 5p6'), \n", - " (55, 'Cesium', 'Cs', 132.91, '[Xe] 6s1'), (56, 'Barium', 'Ba', 137.33, '[Xe] 6s2'), \n", - " (57, 'Lanthanum', 'La', 138.91, '[Xe] 4f1 5d1 6s2'), (58, 'Cerium', 'Ce', 140.12, '[Xe] 4f2 6s2'), \n", - " (59, 'Praseodymium', 'Pr', 140.91, '[Xe] 4f3 6s2'), (60, 'Neodymium', 'Nd', 144.24, '[Xe] 4f4 6s2')]\n", - "\n", - "@schema\n", - "class Element(dj.Lookup):\n", - " definition = \"\"\"\n", - " atomic_number : tinyint unsigned \n", - " ---\n", - " element : varchar(30)\n", - " symbol : char(2)\n", - " atomic_weight : float \n", - " electron_orbitals : varchar(20)\n", - " unique index(symbol)\n", - " \"\"\"\n", - " contents = elements" - ] + "source": "elements = [\n (1, 'Hydrogen', 'H', 1.008, '1s1'),\n (2, 'Helium', 'He', 4.0026, '1s2'),\n (3, 'Lithium', 'Li', 6.94, '[He] 2s1'),\n (4, 'Beryllium', 'Be', 9.0122, '[He] 2s2'),\n (5, 'Boron', 'B', 10.81, '[He] 2s2 2p1'),\n (6, 'Carbon', 'C', 12.011, '[He] 2s2 2p2'),\n (7, 'Nitrogen', 'N', 14.007, '[He] 2s2 2p3'),\n (8, 'Oxygen', 'O', 15.999, '[He] 2s2 2p4'),\n (9, 'Fluorine', 'F', 18.998, '[He] 2s2 2p5'),\n (10, 'Neon', 'Ne', 20.18, '[He] 2s2 2p6'),\n (11, 'Sodium', 'Na', 22.99, '[Ne] 3s1'),\n (12, 'Magnesium', 'Mg', 24.305, '[Ne] 3s2'),\n (13, 'Aluminum', 'Al', 26.982, '[Ne] 3s2 3p1'),\n (14, 'Silicon', 'Si', 28.085, '[Ne] 3s2 3p2'),\n (15, 
'Phosphorus', 'P', 30.974, '[Ne] 3s2 3p3'),\n (16, 'Sulfur', 'S', 32.06, '[Ne] 3s2 3p4'),\n (17, 'Chlorine', 'Cl', 35.45, '[Ne] 3s2 3p5'),\n (18, 'Argon', 'Ar', 39.948, '[Ne] 3s2 3p6'),\n (19, 'Potassium', 'K', 39.098, '[Ar] 4s1'),\n (20, 'Calcium', 'Ca', 40.078, '[Ar] 4s2'),\n (21, 'Scandium', 'Sc', 44.956, '[Ar] 3d1 4s2'),\n (22, 'Titanium', 'Ti', 47.867, '[Ar] 3d2 4s2'),\n (23, 'Vanadium', 'V', 50.942, '[Ar] 3d3 4s2'),\n (24, 'Chromium', 'Cr', 51.996, '[Ar] 3d5 4s1'),\n (25, 'Manganese', 'Mn', 54.938, '[Ar] 3d5 4s2'),\n (26, 'Iron', 'Fe', 55.845, '[Ar] 3d6 4s2'),\n (27, 'Cobalt', 'Co', 58.933, '[Ar] 3d7 4s2'),\n (28, 'Nickel', 'Ni', 58.693, '[Ar] 3d8 4s2'),\n (29, 'Copper', 'Cu', 63.546, '[Ar] 3d10 4s1'),\n (30, 'Zinc', 'Zn', 65.38, '[Ar] 3d10 4s2'),\n (31, 'Gallium', 'Ga', 69.723, '[Ar] 3d10 4s2 4p1'),\n (32, 'Germanium', 'Ge', 72.63, '[Ar] 3d10 4s2 4p2'),\n (33, 'Arsenic', 'As', 74.922, '[Ar] 3d10 4s2 4p3'),\n (34, 'Selenium', 'Se', 78.971, '[Ar] 3d10 4s2 4p4'),\n (35, 'Bromine', 'Br', 79.904, '[Ar] 3d10 4s2 4p5'),\n (36, 'Krypton', 'Kr', 83.798, '[Ar] 3d10 4s2 4p6'),\n (37, 'Rubidium', 'Rb', 85.468, '[Kr] 5s1'),\n (38, 'Strontium', 'Sr', 87.62, '[Kr] 5s2'),\n (39, 'Yttrium', 'Y', 88.906, '[Kr] 4d1 5s2'),\n (40, 'Zirconium', 'Zr', 91.224, '[Kr] 4d2 5s2'),\n (41, 'Niobium', 'Nb', 92.906, '[Kr] 4d3 5s2'),\n (42, 'Molybdenum', 'Mo', 95.95, '[Kr] 4d4 5s2'),\n (43, 'Technetium', 'Tc', 98, '[Kr] 4d5 5s2'),\n (44, 'Ruthenium', 'Ru', 101.07, '[Kr] 4d6 5s2'),\n (45, 'Rhodium', 'Rh', 102.91, '[Kr] 4d7 5s2'),\n (46, 'Palladium', 'Pd', 106.42, '[Kr] 4d8 5s2'),\n (47, 'Silver', 'Ag', 107.87, '[Kr] 4d10'),\n (48, 'Cadmium', 'Cd', 112.41, '[Kr] 4d10 5s2'),\n (49, 'Indium', 'In', 114.82, '[Kr] 4d10 5s2 5p1'),\n (50, 'Tin', 'Sn', 118.71, '[Kr] 4d10 5s2 5p2'),\n (51, 'Antimony', 'Sb', 121.76, '[Kr] 4d10 5s2 5p3'),\n (52, 'Tellurium', 'Te', 127.6, '[Kr] 4d10 5s2 5p4'),\n (53, 'Iodine', 'I', 126.9, '[Kr] 4d10 5s2 5p5'),\n (54, 'Xenon', 'Xe', 131.29, '[Kr] 4d10 5s2 5p6'),\n (55, 
'Cesium', 'Cs', 132.91, '[Xe] 6s1'),\n (56, 'Barium', 'Ba', 137.33, '[Xe] 6s2'),\n (57, 'Lanthanum', 'La', 138.91, '[Xe] 4f1 5d1 6s2'),\n (58, 'Cerium', 'Ce', 140.12, '[Xe] 4f2 6s2'),\n (59, 'Praseodymium', 'Pr', 140.91, '[Xe] 4f3 6s2'),\n (60, 'Neodymium', 'Nd', 144.24, '[Xe] 4f4 6s2'),\n]\n\n@schema\nclass Element(dj.Lookup):\n definition = \"\"\"\n atomic_number : tinyint unsigned \n ---\n element : varchar(30)\n symbol : char(2)\n atomic_weight : float \n electron_orbitals : varchar(20)\n unique index(symbol)\n \"\"\"\n contents = elements" }, { "cell_type": "code", @@ -480,4 +423,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file diff --git a/book/30-design/020-primary-key.md b/book/30-design/020-primary-key.md deleted file mode 100644 index 7a2bc75..0000000 --- a/book/30-design/020-primary-key.md +++ /dev/null @@ -1,817 +0,0 @@ -# Primary Key - -The **primary key** is the cornerstone of relational database design, serving as the unique identifier for each record in a table. It ensures **entity integrity**—the guarantee of a one-to-one correspondence between real-world entities and their digital representations in the database. - -```{card} -**Primary Key** is a column or combination of columns that uniquely identifies each row in a table. - -In DataJoint, each table must have a primary key. This applies to both tables stored in the database as well as tables resulting from queries. - -**Entity integrity** is the guarantee of a 1:1 correspondence between real-world entities and their digital representations. - -Within the domain governed by the data management process, each real-world entity must be represented by exactly one unique record in the database; conversely, each record must correspond to a single, distinct real-world entity. -``` - -Imagine what kinds of difficulties would arise if entity integrity broke down in the systems you interact with every day. 
-* For example, what would happen if your university or company HR department had two different identifiers for you in their records? -* What would happen if your HR department occasionally updated your records with another person's information? -* What if the same occurred in your dentist's office? - -Without entity integrity, it is impossible to maintain other aspects of integrity within the database. -For example, a foreign key relationship assumes that every referenced entity exists uniquely and correctly in the database—an assumption that can only hold true if entity integrity is enforced. - -# Challenges to Entity Integrity -It is perhaps no coincidence that the word *integrity* is synonymous with *honesty* and entity integrity often relies on the participants' knowledge, honesty, trust, transparency, and open communication. -However, for large and complex data operations, entity integrity must be designed into the system. - -The challenge of ensuring entity integrity lies in the fact that it cannot be fully solved by the database system alone. -A reliable system for identifying objects in the real world must be established outside the database to ensure that each entity has a unique and persistent identifier that can be consistently used across all related data records by all participants. - -This requires setting up a disciplined process outside the database. - -## The Three Questions of Entity Integrity - -When designing a database system, you must be able to answer three fundamental questions about entity integrity: - -1. **How do I prevent duplicate records?** - Ensure that the same real-world entity cannot be represented by multiple database records. - -2. **How do I prevent entities sharing the same record?** - Ensure that different real-world entities cannot be represented by the same database record. - -3. **How do I match entities?** - When a real-world entity comes to you, how do you find its corresponding record in the database? 
- -### Example: Laboratory Mice Database - -Consider a neuroscience laboratory that needs to track mice used in experiments: - -**Question 1: How do I prevent duplicate records?** -- Each mouse gets a unique ear tag number when it arrives at the lab -- The database enforces that no two mice can have the same ear tag number -- Before inserting a new mouse record, the system checks if that ear tag already exists - -**Question 2: How do I prevent entities sharing the same record?** -- Each ear tag number can only be assigned to one mouse -- If a mouse dies and the ear tag is reused, the old record must be properly archived or marked as inactive - -**Question 3: How do I match entities?** -- When a researcher brings a mouse to the lab, they can look up the mouse by its ear tag number -- The database can quickly find the mouse's record using the ear tag as the primary key -- All related experiment records can be linked to this mouse through the ear tag - -## Entity Integrity in Practice - -The three questions of entity integrity must be answered not just in theory, but in practice through your database design and business processes. - -### Example: University Student Database - -**Question 1: How do I prevent duplicate records?** -- Each student gets a unique student ID number when they enroll -- To ensure that the student has not previously registered, additional verification is performed to search the existing records. The student may be asked to provide other names that they have used in the past. -- Additional uniqueness constraints may be imposed on other attributes such as drivers license number, passport number, email, social security number, cellphone, etc. -- Before inserting a new student record, the system checks if that student ID already exists - -**Question 2: How do I prevent entities sharing the same record?** -- Each student ID can only be assigned to one person. 
-- At registration, the university will issue a photo ID card to the student that allows verifying that another student is not pretending to be the student in the records. -- If a student graduates and the ID is reused years later, the old record must be properly archived, although in most systems, such IDs are retired and not reused. - -**Question 3: How do I match entities?** -- When a student comes to the registrar's office, they can look up their record by student ID. -- The database can quickly find the student's record using the student ID as the primary key. -- All related records (grades, courses, payments) can be linked to this student through the student ID. - -### Example: Laboratory Animal Database - -**Question 1: How do I prevent duplicate records?** -- Each animal gets a unique ear tag or microchip ID when it arrives at the lab -- The database enforces that no two animals can have the same ID -- Before inserting a new animal record, the system checks if that ID already exists - -**Question 2: How do I prevent entities sharing the same record?** -- Each ID can only be assigned to one animal -- If an animal dies and the ID is reused, the old record must be properly archived or marked as inactive - -**Question 3: How do I match entities?** -- When a researcher brings an animal to the lab, they can look up the animal by its ID -- The database can quickly find the animal's record using the ID as the primary key -- All related experiment records can be linked to this animal through the ID - -If you can answer these three questions clearly for your domain, then you have designed for entity integrity. - -For example, establishing the Social Security system in the United States required a reliable identification of workers by all employers to report their income across their entire careers. 
-For this purpose, in 1936, the Federal Government established a new process to ensure that each US worker would be assigned a unique number, the Social Security Number (SSN). -The SSN would be assigned at birth or at entering the country for employment and no person would be allowed to have two such numbers. -Establishing and enforcing such a system is not easy and takes a considerable effort. -The SSN allows for the accurate and consistent representation of individuals across various government databases, ensuring that each person is correctly identified. - -**Question**: Why do you think the US government did not have the need to assign unique identifiers to tax payers when it began levying federal taxes in 1913? - -**Question**: What abuses would become possible if a person could obtain two SSNs or if two persons could share the same SSN? - -**Learn more** about the history and uses of the SSN: - * [History of establishing the SSN.](https://www.ssa.gov/history/ssn/firstcard.html) - * [How the SSN works.](https://www.ssa.gov/policy/docs/ssb/v69n2/v69n2p55.html) - * [IRS timeline.](https://www.irs.gov/irs-history-timeline) - -Similar rigor is required for identifying other objects in the real world: - * [Vehicle Identification Number](https://www.iso.org/standard/52200.html), regulated by the International Organization for Standardization. - * [Radio-Frequency Identification for Animals](https://en.wikipedia.org/wiki/ISO_11784_and_ISO_11785) for implanted microchips in animals. - * [US Aircraft Registration numbers](https://www.faa.gov/licenses_certificates/aircraft_certification/aircraft_registry/forming_nnumber), the N-numbers seen on the tails, regulated by the FAA. - -These examples demonstrate that establishing a rigorous system for identification may require costly standardization efforts with many systems for enforcement and coordination. 
- -When a science lab sets out to establish a data management process for its experiment, the first step is to establish a uniform system for identifying test subjects, experiments, protocols, and treatments. -Standard nomenclatures are established to standardize names across all institutions and labs must be aware of them and follow them. - -# Entity Integrity in Schema Design - -Several key aspects of relational database design contribute to maintaining entity integrity: - -## Step 1. Determine Entity Types - -Each table in a relational database should clearly indicate the type of real-world entity it represents. -The table name plays a crucial role in conveying this information. -For instance, if a table is named `Person`, the database must enforce entity integrity for individuals, ensuring each record corresponds to a unique person. - -However, if the table uses identifiers that do not ensure a 1:1 mapping to actual persons—such as cell phone numbers, which might be shared or changed—a more appropriate table name should be chosen, like `UserAccount`, to reflect the specific entity type being represented. -This clarity helps avoid confusion and ensures that the database design accurately mirrors the real-world relationships it models. - -## Step 2. Establish Entity Identification -For all entities tracked by the database, determine how the entities will be identified in the physical world. -This may require establishing and enforcing an identification system. - -## Step 3. Define Primary Keys and Secondary Unique Indexes - -Every table in a relational database must have a **primary key**, and the attributes of the primary key cannot be nullable. This requirement is essential for maintaining **entity integrity**, as it prevents the creation of records that cannot be uniquely identified. 
- -In addition to the primary key, a table may have **secondary unique indexes** to help enforce a one-to-one correspondence between records and the real-world entities they represent. These unique indexes can be applied to other attributes that need to be unique across the table but are allowed to be nullable. - -For example, in a `Person` table, unique indexes might be enforced on attributes like **email addresses**, **usernames**, **driver’s licenses**, **cellphone numbers**, or **Social Security numbers**. These indexes ensure that no duplicate entries exist for attributes that are intended to be unique, further supporting entity integrity. - -## Looser Entity Integrity - -Depending on the business needs, **entity integrity** requirements can vary. Some businesses may allow multiple digital identities per person, while others may tolerate multiple people sharing a single digital identity. Other businesses may require a strict one-to-one match between real-world entities and their digital representations. - -### Example 1: Gym Memberships - -A gym may enforce that no two people use the same membership (ensuring uniqueness per membership). However, they may not need to prevent an individual from opening multiple memberships, leading to looser enforcement of entity integrity. - -**Gym's Entity Integrity Policy:** -- ✅ **Prevent**: Two people sharing the same membership -- ❌ **Allow**: One person having multiple memberships -- **Reasoning**: They want to track individual usage patterns but don't mind duplicate memberships for revenue - -### Example 2: Grocery Store Discount Cards (2010s) - -Grocery stores issued discount cards to shoppers to qualify them for discounts, but they did not strictly enforce a one-to-one mapping between cards and individual shoppers. 
- -**Grocery Store's Entity Integrity Policy:** -- ✅ **Allow**: Multiple people using the same card -- ✅ **Allow**: One person using multiple cards -- **Reasoning**: Primary goal is data collection for marketing, not strict individual tracking - -**Why this works for grocery stores:** -- They care more about **purchase patterns** than individual identity -- Multiple family members can share one card -- Customers can get multiple cards for different households -- The data is still valuable for inventory and marketing decisions - -### Example 3: Airline Security (Strict Entity Integrity) - -Airlines require **absolute entity integrity** because: -- **Security**: Must verify passenger identity matches ticket holder -- **Safety**: Need to know exactly who is on each flight -- **Legal**: Required by government regulations - -**Airline's Entity Integrity Policy:** -- ❌ **Prevent**: Two people sharing the same ticket -- ❌ **Prevent**: One person having multiple identities -- **Enforcement**: Photo ID verification, biometric checks, government databases - -This flexibility in entity integrity allows businesses to balance strict data rules with practical needs for customer management and data collection. - - -## Using Natural Keys - -A table can be designed with a **natural key**, which is an identifier that exists in the real world. For example, a Social Security Number (SSN) can serve as a natural key for a person because it is a unique number used and recognized in real-world systems. - -In some cases, a natural key already exists, or one can be specifically created for data management purposes and then introduced into the real world to be permanently associated with physical entities. - -For instance, grocery and hardware stores use **SKU (Stock Keeping Unit)** numbers to track inventory. Each item is assigned an SKU, and store clerks can look it up for specific products. 
- -Other common examples of natural keys include **phone numbers** and **email addresses**, which are often used as unique identifiers in phone apps and online services. - -Phone numbers, in particular, have become popular as identifiers as mobile phones have evolved from being associated with homes or offices to being personal devices carried by individuals. - -# Composite Primary Keys - -Sometimes, a single column cannot uniquely identify a record. In these cases, we use a **composite primary key**—a primary key made up of multiple columns that together uniquely identify each row. - -## Example: U.S. House of Representatives - -Consider tracking U.S. representatives. A single district number (like "District 1") is not sufficient because there are multiple District 1s across different states. To uniquely identify a representative, you need both the **state** and the **district number**. - -(DataJoint) -```python -@schema -class USRepresentative(dj.Manual): - definition = """ - state : char(2) - district : tinyint unsigned - --- - name : varchar(60) - party : char(1) - phone : varchar(20) - office_room : varchar(20) - """ -``` -(Equivalent SQL) -```sql -CREATE TABLE us_representative ( - state CHAR(2) NOT NULL, - district TINYINT UNSIGNED NOT NULL, - name VARCHAR(60) NOT NULL, - party CHAR(1) NOT NULL, - phone VARCHAR(20), - office_room VARCHAR(20), - PRIMARY KEY (state, district) -); -``` - -The composite primary key `(state, district)` ensures that: -- No two representatives can have the same state-district combination -- Each representative is uniquely identified by their state and district -- The table accurately represents the real-world constraint that each congressional district belongs to exactly one state - -## Example: Boston Marathon Champions - -For tracking marathon champions, you need both the **year** and the **division** (men's or women's) to uniquely identify each champion. 
- -::::{tab-set} -::: {tab-item} DataJoint -```python -@schema -class MarathonChampion(dj.Manual): - definition = """ - year : int - division : enum('men', 'women') - --- - name : varchar(60) - country : char(2) - time_in_seconds : decimal(8,3) - """ -``` -::: -::: {tab-item} SQL -```sql -CREATE TABLE marathon_champions ( - year YEAR NOT NULL, - division ENUM('men', 'women') NOT NULL, - name VARCHAR(60) NOT NULL, - country CHAR(2) NOT NULL, - time_in_seconds DECIMAL(8,3) NOT NULL, - PRIMARY KEY (year, division) -); -``` -::: -:::: -This design ensures that each year has exactly one men's champion and one women's champion, preventing duplicate entries for the same year-division combination. - -## When to Use Composite Primary Keys - -Use composite primary keys when: -- **Multiple attributes together** uniquely identify an entity -- **Single attributes are insufficient** for unique identification -- **Real-world constraints** require multiple pieces of information for identification -- **Natural business rules** dictate that combinations must be unique - -# Using Surrogate Keys - -In many cases, it makes more sense to use a **surrogate key** as the primary key in a database. A surrogate key has no relationship to the real world and is used solely within the database for identification purposes. These keys are often generated automatically as an **auto-incrementing number** or a **random string** like a UUID (Universally Unique Identifier) or GUID (Globally Unique Identifier). - -When using surrogate keys, entity integrity can still be maintained by using other unique attributes (such as secondary unique indexes) to help identify and match entities to their digital representations. - -Surrogate keys are especially useful for entities that exist only in digital form (e.g., social media posts, email messages) and don't need to be uniquely identified outside of the digital system. In these cases, surrogate keys are an appropriate and efficient choice. 
- -## Universally Unique Identifiers (UUIDs) - -**UUIDs** (Universally Unique Identifiers) are 128-bit identifiers that are designed to be globally unique across time and space. They are standardized by [RFC 9562](https://www.rfc-editor.org/rfc/rfc9562.html) (which obsoletes RFC 4122) and provide a reliable way to generate surrogate keys without coordination between different systems. - -### UUID Format - -UUIDs are typically represented as hexadecimal strings in the format: `xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx` - -For example: `550e8400-e29b-41d4-a716-446655440000` - -This format uses 32 hexadecimal digits (128 bits total) arranged in groups of 8-4-4-4-12 characters, separated by hyphens. - -### UUID Types - -There are several types of UUIDs, each with different characteristics: - -#### UUID1: Time-based UUIDs - -UUID1 generates identifiers based on: -- **Timestamp**: Current time (60-bit timestamp) -- **Node ID**: MAC address of the network interface (48 bits) -- **Clock sequence**: Sequence number to handle clock rollbacks (14 bits) - -```python -import uuid - -# Generate UUID1 -uuid1_value = uuid.uuid1() -print(uuid1_value) # e.g., 6ba7b810-9dad-11d1-80b4-00c04fd430c8 -``` - -**Characteristics:** -- **Sequential**: UUIDs generated close in time will be similar -- **Sortable**: Can be used for ordering records by creation time -- **Traceable**: Contains information about the computer that generated it -- **Collision-resistant**: Very low probability of duplicates - -**Use cases:** -- Database primary keys where ordering matters -- Log entries -- Event tracking systems - -#### UUID4: Random UUIDs - -UUID4 generates purely random identifiers: - -```python -# Generate UUID4 -uuid4_value = uuid.uuid4() -print(uuid4_value) # e.g., f47ac10b-58cc-4372-a567-0e02b2c3d479 -``` - -**Characteristics:** -- **Random**: No predictable pattern -- **Not sortable**: Random values don't maintain chronological order -- **Anonymous**: No information about the generating system -- 
**Collision-resistant**: Extremely low probability of duplicates - -**Use cases:** -- Anonymous identifiers -- Session tokens -- API keys -- Any case where you don't need ordering - -#### UUID3 and UUID5: Name-based UUIDs - -UUID3 and UUID5 generate deterministic identifiers based on a namespace and a name: - -```python -# Define a namespace (typically another UUID) -namespace = uuid.uuid4() - -# Generate UUID5 (recommended over UUID3) -uuid5_value = uuid.uuid5(namespace, "neuroscience") -print(uuid5_value) # Same result every time for same namespace + name - -# Generate UUID3 (uses MD5 hash) -uuid3_value = uuid.uuid3(namespace, "neuroscience") -print(uuid3_value) # Different from UUID5 but also deterministic -``` - -**Characteristics:** -- **Deterministic**: Same input always produces same UUID -- **Hierarchical**: Can create structured identifier systems -- **Collision-resistant**: Different names produce different UUIDs -- **Reproducible**: Same namespace + name = same UUID - -**Use cases:** -- Hierarchical data structures -- Content-addressable systems -- Topic categorization -- File system identifiers - -### Practical UUID Examples - -For detailed examples of using UUIDs in DataJoint tables, including table definitions, insertion code, and working with UUIDs in foreign key relationships, see [UUIDs in DataJoint](../85-special-topics/025-uuid.ipynb). 
- -Here are some conceptual examples showing UUIDs as primary keys: - -#### Example 1: Social Media Posts -(DataJoint) -```python -@schema -class Post(dj.Manual): - definition = """ - post_id : uuid - --- - -> User - content : varchar(1024) - created_at = CURRENT_TIMESTAMP : timestamp - updated_at = CURRENT_TIMESTAMP : timestamp - visibility : enum('public', 'friends', 'private') - """ -``` -(Equivalent SQL) -```sql -CREATE TABLE posts ( - post_id CHAR(36) PRIMARY KEY, -- UUID as string - user_id INT NOT NULL, - content TEXT NOT NULL, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP -); - --- Insert with UUID4 (random) -INSERT INTO posts (post_id, user_id, content) -VALUES ('f47ac10b-58cc-4372-a567-0e02b2c3d479', 123, 'Hello world!'); -``` - -#### Example 2: Hierarchical Categories - -```python -import uuid - -# Create a hierarchical category system -root_namespace = uuid.uuid4() - -# Generate consistent IDs for categories -science_id = uuid.uuid5(root_namespace, "science") -biology_id = uuid.uuid5(science_id, "biology") -neuroscience_id = uuid.uuid5(biology_id, "neuroscience") - -print(f"Science: {science_id}") -print(f"Biology: {biology_id}") -print(f"Neuroscience: {neuroscience_id}") -``` - -#### Example 3: File Management System - -(DataJoint) -```python -@schema -class File(dj.Manual): - definition = """ - file_id : uuid - --- - original_name : varchar(255) - file_path : varchar(500) - file_size : bigint unsigned - upload_date = CURRENT_TIMESTAMP : timestamp - """ -``` -(Equivalent SQL) -```sql -CREATE TABLE files ( - file_id BINARY(16) NOT NULL, - original_name VARCHAR(255) NOT NULL, - file_path VARCHAR(500) NOT NULL, - file_size BIGINT UNSIGNED NOT NULL, - upload_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - PRIMARY KEY (file_id) -); -``` -**Inserting data** -Below is the code for inserting an entry into the `File` table. 
-(DataJoint) -```python -File.insert1((uuid.uuid1(), 'document.pdf', '/uploads/doc.pdf', 1024000)) -``` - -(Equivalent SQL) -```sql -INSERT INTO files (file_id, original_name, file_path, file_size) -VALUES (UUID_TO_BIN(UUID()), 'document.pdf', '/uploads/doc.pdf', 1024000); -``` - -```sql --- MySQL --- Using UUID1 for sortable file IDs -INSERT INTO files (file_id, original_name, file_path, file_size) -VALUES (UUID_TO_BIN('6ba7b810-9dad-11d1-80b4-00c04fd430c8'), 'document.pdf', '/uploads/doc.pdf', 1024000); -``` - -### Choosing the Right UUID Type - -| UUID Type | When to Use | Advantages | Disadvantages | -|-----------|-------------|------------|--------------| -| **UUID1** | Need chronological ordering | Sortable, traceable | Contains system info | -| **UUID4** | Need random identifiers | Anonymous, simple | Not sortable | -| **UUID3/5** | Need deterministic IDs | Reproducible, hierarchical | Requires namespace management | - -### Database Support for UUIDs - -Different databases handle UUIDs differently: - -- **PostgreSQL**: Native UUID type -- **MySQL**: CHAR(36) or BINARY(16) -- **SQLite**: TEXT or BLOB -- **SQL Server**: UNIQUEIDENTIFIER type - -In DataJoint, UUIDs are automatically stored as `BINARY(16)` in MySQL for efficient storage. See [UUIDs in DataJoint](../85-special-topics/025-uuid.ipynb) for practical implementation examples. - - -# Practical Examples of Ensuring Entity Integrity - -Consider the importance and challenges of entity integrity in the following scenarios. Each organization must implement specific processes to establish and verify unique identification. 
- -## University Students - -**Entity Type**: Individual students -**Primary Key**: Student ID number -**Identification Process**: -- Assigned unique ID during first enrollment -- Cross-referenced with government ID (driver's license, passport) -- Photo ID verification for campus access -- Regular verification against enrollment records - -**Database Design**: -```sql -CREATE TABLE students ( - student_id INT PRIMARY KEY, - ssn CHAR(11) UNIQUE, -- Secondary unique index - email VARCHAR(100) UNIQUE, - first_name VARCHAR(50) NOT NULL, - last_name VARCHAR(50) NOT NULL, - enrollment_date DATE NOT NULL -); -``` - -**Challenges**: Students may change names (marriage), addresses, or contact information. The student ID remains constant. - -## Daycare Center Children - -**Entity Type**: Individual children -**Primary Key**: Child ID number -**Identification Process**: -- Unique ID assigned at registration -- Parent/guardian verification required -- Photo documentation for pickup authorization -- Emergency contact verification - -**Database Design**: -```sql -CREATE TABLE child ( - child_id INT PRIMARY KEY, - first_name VARCHAR(50) NOT NULL, - last_name VARCHAR(50) NOT NULL, - birth_date DATE NOT NULL, - parent_guardian_id INT NOT NULL, - enrollment_date DATE NOT NULL, - photo_path VARCHAR(255), - UNIQUE INDEX (first_name, last_name, birth_date) -); -``` - -**Challenges**: Children grow and change appearance. Parents may divorce or change custody arrangements. 
- -## Airline Bookings - -**Entity Type**: Individual passengers -**Primary Key**: Booking reference number -**Identification Process**: -- Government-issued photo ID verification -- Biometric checks at security -- Cross-reference with government watch lists -- Real-time verification against booking system - -**Database Design**: -(DataJoint) -```python -@schema -class Booking(dj.Manual): - definition = """ - booking_ref : char(6) - --- - -> Passenger - -> [unique] FlightSeat - """ -``` -(Equivalent SQL) -```sql -CREATE TABLE booking ( - booking_ref CHAR(6) NOT NULL, - passenger_id INT NOT NULL, - flight_number VARCHAR(6) NOT NULL, - flight_date DATE NOT NULL, - seat_number VARCHAR(10) NOT NULL, - PRIMARY KEY (booking_ref), - FOREIGN KEY (passenger_id) REFERENCES passenger(passenger_id), - FOREIGN KEY (flight_number, flight_date, seat_number) REFERENCES flight_seat(flight_number, flight_date, seat_number), - UNIQUE INDEX (flight_number, flight_date, seat_number)); -``` - -**Challenges**: Passengers may have multiple forms of ID, name changes, or international travel requirements. 
- -## Gym Members - -**Entity Type**: Membership accounts -**Primary Key**: Membership ID -**Identification Process**: -- Photo ID verification at signup -- Photo stored for access control -- Payment method verification -- Regular membership renewal - -**Database Design**: -(DataJoint) -```python -@schema -class Membership(dj.Manual): - definition = """ - member_id : int - --- - member_name : varchar(100) - phone : varchar(20) - email : varchar(100) - start_date : date - end_date : date - photo_path : varchar(255) - """ -``` -(Equivalent SQL) -```sql -CREATE TABLE membership ( - member_id INT NOT NULL, - member_name VARCHAR(100) NOT NULL, - phone VARCHAR(20) NOT NULL, - email VARCHAR(100) NOT NULL, - start_date DATE NOT NULL, - end_date DATE, - photo_path VARCHAR(255), - PRIMARY KEY (member_id), - UNIQUE INDEX (phone), - UNIQUE INDEX (email) -); -``` -**Challenges**: Members may share memberships (family plans), change contact information, or cancel/reactivate memberships. - -## Online Video Game Players - -**Entity Type**: Player accounts -**Primary Key**: Player ID (UUID) -**Identification Process**: -- Email verification for account creation -- Username uniqueness check -- Optional phone number verification -- Anti-cheat system monitoring - -**Database Design**: -(DataJoint) -```python -@schema -class Player(dj.Manual): - definition = """ - player_id : uuid - --- - username : varchar(30) - email = null : varchar(100) - phone = null : varchar(20) - registration_date : date - account_status = "active" : enum('active', 'suspended', 'banned') - last_login : timestamp - UNIQUE INDEX (username) - UNIQUE INDEX (email) - UNIQUE INDEX (phone) - """ -``` - -(Equivalent SQL) -```sql -CREATE TABLE players ( - player_id BINARY(16) NOT NULL, -- UUID - username VARCHAR(30) NOT NULL, - email VARCHAR(100), - phone VARCHAR(20), - registration_date DATE NOT NULL, - account_status ENUM('active', 
'suspended', 'banned') DEFAULT 'active', - UNIQUE INDEX (username), - UNIQUE INDEX (email), - UNIQUE INDEX (phone), - PRIMARY KEY (player_id) -); -``` - -**Challenges**: Players may create multiple accounts, share accounts, or use different devices. - -## Social Media Posts - -**Entity Type**: Individual posts -**Primary Key**: Post ID (UUID) -**Identification Process**: -- Automatic UUID generation -- User authentication verification -- Content moderation checks -- Timestamp recording - -**Database Design**: -(DataJoint) -```python -@schema -class Post(dj.Manual): - definition = """ - post_id : uuid - --- - -> User - content : varchar(1024) - created_at = CURRENT_TIMESTAMP : timestamp - updated_at = CURRENT_TIMESTAMP : timestamp - visibility : enum('public', 'friends', 'private') - """ -``` -(Equivalent SQL) -```sql -CREATE TABLE post ( - post_id BINARY(16) NOT NULL, -- UUID4 - user_id INT NOT NULL, - content varchar(1024) NOT NULL, - created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, - visibility ENUM('public', 'friends', 'private') DEFAULT 'public', - PRIMARY KEY (post_id), - FOREIGN KEY (user_id) REFERENCES user(user_id) -); -``` - -**Challenges**: Posts may be edited, deleted, or shared across platforms. 
- -## Mortgage Loans - -**Entity Type**: Individual loan accounts -**Primary Key**: Loan number -**Identification Process**: -- Borrower identity verification -- Credit check and income verification -- Property appraisal and title search -- Legal documentation review - -**Database Design**: -DataJoint -```python -@schema -class Loan(dj.Manual): - definition = """ - loan_number : varchar(20) - --- - -> Borrower - property_address : varchar(255) - loan_amount : decimal(15,2) - interest_rate : decimal(5,4) - term_months : int - origination_date : date - maturity_date : date - """ -``` - -Equivalent SQL: -```sql -CREATE TABLE loan ( - loan_number VARCHAR(20) NOT NULL, - borrower_ssn CHAR(11) NOT NULL, - property_address varchar(255) NOT NULL, - loan_amount DECIMAL(15,2) NOT NULL, - interest_rate DECIMAL(5,4) NOT NULL, - term_months INT NOT NULL, - origination_date DATE NOT NULL, - maturity_date DATE NOT NULL, - PRIMARY KEY (loan_number), - FOREIGN KEY (borrower_ssn) REFERENCES borrower(borrower_ssn)); -``` - - -**Challenges**: Borrowers may refinance, modify loans, or transfer property ownership. - -## Key Takeaways - -Each scenario demonstrates different levels of entity integrity requirements: - -1. **Strict Integrity** (Airlines, Banks): Government-level verification required -2. **Moderate Integrity** (Universities, Daycare): Photo ID and documentation verification -3. **Flexible Integrity** (Gyms, Games): Basic verification with some tolerance for sharing -4. 
**Digital-Only Integrity** (Social Media): Automated systems with minimal real-world verification - -The choice of primary key and identification process depends on: -- **Business requirements** -- **Legal/regulatory constraints** -- **Security needs** -- **User experience considerations** -- **Cost of verification processes** - -# Summary - -The **primary key** is the foundation of relational database design, ensuring **entity integrity**—the guarantee of a one-to-one correspondence between real-world entities and their digital representations in the database. - -## Key Concepts Covered - -1. **Primary Key Definition**: A column or combination of columns that uniquely identifies each row in a table -2. **Entity Integrity**: The principle that each real-world entity must be represented by exactly one unique record -3. **Natural Keys**: Identifiers that exist in the real world (SSN, phone numbers, email addresses) -4. **Surrogate Keys**: Database-generated identifiers with no real-world meaning (auto-increment, UUIDs) -5. **Composite Primary Keys**: Multi-column primary keys for complex identification scenarios -6. **UUIDs**: Universally Unique Identifiers with different types (UUID1, UUID3, UUID4, UUID5) for various use cases -7. **Normalization Principle**: Each table should represent one distinct entity class -8. 
**Business Context**: Different organizations require different levels of entity integrity based on their needs - -## Design Principles - -- **Choose appropriate primary keys** based on business requirements and identification processes -- **Separate different entity types** into different tables following normalization principles -- **Use secondary unique indexes** to enforce additional uniqueness constraints -- **Consider the trade-offs** between natural keys and surrogate keys -- **Implement appropriate verification processes** to maintain entity integrity in the real world - -By implementing these principles, you can create robust databases that faithfully mirror the real-world entities and relationships they are intended to manage, supporting reliable and accurate data management across various business contexts. diff --git a/book/30-design/030-foreign-keys.ipynb b/book/30-design/030-foreign-keys.ipynb deleted file mode 100644 index a5d58e0..0000000 --- a/book/30-design/030-foreign-keys.ipynb +++ /dev/null @@ -1,511 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Foreign Keys" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "# Modeling Relationships with Foreign Keys\n\nWhile **entity integrity** ensures that each record uniquely represents a real-world entity, **referential integrity** ensures that the *relationships between* these entities are valid and consistent. It's a guarantee that you won't have an employee assigned to a non-existent department or a task associated with a deleted project.\n\nCrucially, **referential integrity is impossible without entity integrity**. You must first have a reliable way to identify unique entities before you can define their relationships.\n\nIn relational databases, these relationships are established and enforced using **foreign keys**. 
A foreign key creates a link between a **child table** (the one with the reference) and a **parent table** (the one being referenced). Think of `Employee` as the child and `Title` as the parent; an employee must have a valid, existing title.\n\nA foreign key is a column (or set of columns) in the child table that refers to the primary key of the parent table. In DataJoint, a foreign key *always* references a parent's primary key, which is a highly recommended practice for clarity and consistency.\n\n## Referential Integrity + Workflow Dependencies\n\nIn DataJoint, foreign keys serve a **dual role** that extends beyond traditional relational databases:\n\n1. **Referential integrity** (like traditional databases): Ensures that child references must exist in the parent table\n2. **Workflow dependencies** (DataJoint's addition): Prescribes the order of operations—the parent must be created before the child\n\nThis transforms the schema into a **directed acyclic graph (DAG)** representing valid workflow execution sequences. The foreign key `-> Title` in `Employee` not only ensures that each employee has a valid title, but also establishes that titles must be created before employees can be assigned to them.\n\nFor more on how DataJoint extends foreign keys with workflow semantics, see [Relational Workflows](../20-concepts/05-workflows.md).\n\nIn the following example, we define the parent table `Title` and the child table `Employee`, which references `Title`." 
- }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Exception reporting mode: Minimal\n" - ] - } - ], - "source": [ - "%xmode Minimal\n", - "\n", - "import datajoint as dj\n", - "schema = dj.Schema('company')\n", - "\n", - "\n", - "@schema\n", - "class Title(dj.Lookup):\n", - " definition = \"\"\"\n", - " title_code : char(8)\n", - " ---\n", - " full_title : varchar(120)\n", - " \"\"\"\n", - " \n", - " contents = [\n", - " (\"SW-Dev1\", \"Software Developer 1\"),\n", - " (\"SW-Dev2\", \"Software Developer 2\"),\n", - " (\"SW-Dev3\", \"Software Developer 3\"),\n", - " (\"Web-Dev1\", \"Web Developer 1\"),\n", - " (\"Web-Dev2\", \"Web Developer 2\"),\n", - " (\"Web-Dev3\", \"Web Developer 3\"),\n", - " (\"HR-Mgr\", \"Human Resources Manager\")\n", - " ]\n", - "\n", - "@schema\n", - "class Employee(dj.Manual):\n", - " definition = \"\"\"\n", - " person_id : int \n", - " ---\n", - " first_name : varchar(30)\n", - " last_name : varchar(30)\n", - " -> Title\n", - " \"\"\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here the arrow `-> Title` creates a foreign key from `Employee` (child) to `Title` (parent). This foreign key:\n", - "- **Enforces referential integrity**: Ensures each employee has a valid title that exists in the `Title` table\n", - "- **Establishes workflow dependency**: Requires that titles must be created before employees can be assigned to them\n", - "\n", - "We can use the `dj.Diagram` class to visualize the relationships created by the foreign keys." 
- ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "Title\n", - "\n", - "\n", - "Title\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Employee\n", - "\n", - "\n", - "Employee\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Title->Employee\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dj.Diagram(schema)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The parent table `Title` is above and the child table `Employee` is below. The arrow direction indicates both the referential relationship (Employee references Title) and the workflow dependency (Title must be created before Employee)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Defining Foreign Keys in SQL\n", - "In standard SQL, the same relationship is defined with more verbose syntax. 
The child table must explicitly redefine the columns of the parent's primary key and then declare the foreign key constraint.\n", - "\n", - "Notice that the data type (`char(8)`) of the primary key in `Title` must be exactly repeated for the `title_code` column in `Employee`.\n", - "\n", - "```sql\n", - "-- Parent Table\n", - "CREATE TABLE title (\n", - " title_code CHAR(8) NOT NULL,\n", - " full_title VARCHAR(120) NOT NULL,\n", - " PRIMARY KEY (title_code)\n", - ");\n", - "\n", - "-- Child Table\n", - "CREATE TABLE employee (\n", - "person_id INT NOT NULL,\n", - "first_name VARCHAR(30) NOT NULL,\n", - "last_name VARCHAR(30) NOT NULL,\n", - "title_code CHAR(8) NOT NULL,\n", - "PRIMARY KEY (person_id),\n", - "FOREIGN KEY (title_code) REFERENCES title(title_code)\n", - ");\n", - "```\n", - "\n", - "The concise `-> Title` syntax in DataJoint handles this automatically, reducing redundancy and preventing potential errors if the parent's primary key definition changes." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```{admonition} A Logical Constraint, not a Physical Pointer\n", - ":class: tip\n", - "\n", - "A revolutionary concept in the relational model is that a foreign key is **not a physical pointer** to a location on a disk. Instead, it is a **logical constraint** enforced at runtime.\n", - "\n", - "When you try to insert a row into a child table, the database doesn't follow a pre-existing \"link.\" It performs a search on the parent table to see if a record with a matching primary key exists. If a match is found, the insert is allowed; otherwise, it is rejected.\n", - "\n", - "This is fundamentally different from other data models like HDF5, where data is often linked by direct pointers or paths [^1]. The logical nature of foreign keys gives relational databases their flexibility and integrity.\n", - "\n", - "[^1]: The HDF Group. \"HDF5 User's Guide: Groups and Links\". 
[https://docs.hdfgroup.org/hdf5/develop/H5.intro.html#intro-groups](https://docs.hdfgroup.org/hdf5/develop/H5.intro.html#intro-groups)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## The Five Effects of a Foreign Key\n", - "\n", - "Foreign keys enforce **referential integrity** by regulating the relationships between a **parent table** (referenced entity set) and a **child table** (dependent entity set). In DataJoint, they also establish **workflow dependencies** that prescribe the order of operations. In addition to defining how entities relate, foreign keys also impose important constraints on data operations. \n", - "\n", - "Below are the five key effects of foreign keys:\n", - "\n", - "### Effect 1. The primary key columns from the parent become embedded as foreign key columns in the child \n", - "When a foreign key relationship is established, the **primary key** (or unique key) of the parent table becomes part of the child table’s schema. The child table includes the foreign key attribute(s) with **matching name and datatype** to ensure that each row in the child table refers to a valid parent record.\n", - "\n", - "If you examine the heading of `Employee`, you will find that it now contains a `title_code` field. It will have the same data type as the corresponding field in `Title`. \n" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
[HTML table preview of Employee: columns person_id, first_name, last_name, title_code; Total: 0]
\n", - " " - ], - "text/plain": [ - "*person_id first_name last_name title_code \n", - "+-----------+ +------------+ +-----------+ +------------+\n", - "\n", - " (Total: 0)" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "Employee()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Effect 2: Inserts into the Child Table are Restricted\n", - "\n", - "A foreign key ensures that no \"orphaned\" records are created. An insert into the child table is only permitted if the foreign key value corresponds to an existing primary key in the parent table.\n", - "\n", - "The rule is simple: **Inserts are restricted in the child, not the parent.** You can always add new job titles, but you cannot add an employee with a `title_code` that doesn't exist in the `Title` table.\n", - "\n", - "**In DataJoint, this enforces workflow order**: The parent entity must be created before the child entity can reference it. This ensures workflows execute in the correct sequence." 
- ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "# This works!\n", - "Employee.insert1((1, 'Mark', 'Sommers', 'Web-Dev1'))" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "ename": "IntegrityError", - "evalue": "Cannot add or update a child row: a foreign key constraint fails (`company`.`employee`, CONSTRAINT `employee_ibfk_1` FOREIGN KEY (`title_code`) REFERENCES `#title` (`title_code`) ON DELETE RESTRICT ON UPDATE CASCADE)", - "output_type": "error", - "traceback": [ - "\u001b[31mIntegrityError\u001b[39m\u001b[31m:\u001b[39m Cannot add or update a child row: a foreign key constraint fails (`company`.`employee`, CONSTRAINT `employee_ibfk_1` FOREIGN KEY (`title_code`) REFERENCES `#title` (`title_code`) ON DELETE RESTRICT ON UPDATE CASCADE)\n" - ] - } - ], - "source": [ - "# This fails!\n", - "Employee.insert1((2, 'Brenda', 'Means', 'BizDev'))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Effect 3: Deletes from the Parent Table are Restricted\n", - "\n", - "To prevent broken relationships, a parent record cannot be deleted if any child records still refer to it.\n", - "\n", - "The rule is the inverse of the insert rule: **Deletes are restricted in the parent, not the child.** You can always delete an employee, but you cannot delete a title if it is still assigned to an employee.\n", - "\n", - "In standard SQL, this operation would fail with a constraint error. DataJoint, however, implements a **cascading delete**. It will warn you that deleting the parent record will also delete all dependent child records, which can cascade through many levels of a deep hierarchy.\n", - "\n", - "**In DataJoint, this maintains workflow consistency**: When you delete a parent entity, all downstream workflow artifacts that depend on it are also deleted. 
This ensures computational validity—if the inputs are gone, any results based on those inputs must be removed as well. This is essential for maintaining workflow integrity in computational pipelines." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[2025-09-18 14:31:21,418][INFO]: Deleting 1 rows from `company`.`employee`\n", - "[2025-09-18 14:31:21,422][INFO]: Deleting 7 rows from `company`.`#title`\n", - "[2025-09-18 14:31:29,056][WARNING]: Delete cancelled\n" - ] - }, - { - "data": { - "text/plain": [ - "7" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "Title.delete()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Effect 4: Updates to Referenced Keys are Restricted\n", - "\n", - "To maintain referential integrity, updates to a parent's primary key or a child's foreign key are restricted.\n", - "\n", - "In general relational theory, databases can be configured to handle this with **cascading updates**, where changing a parent's primary key automatically propagates that change to all child records.\n", - "\n", - "However, DataJoint does not support updating primary key values, as this can risk breaking referential integrity in complex scientific workflows. The preferred and safer pattern in DataJoint is to **delete the old record and insert a new one** with the updated information.\n", - "\n", - "**In DataJoint, this preserves workflow immutability**: Workflow artifacts are treated as immutable once created. If upstream data changes, the workflow must be re-executed from that point forward. This ensures that all downstream results remain consistent with their inputs, maintaining computational validity throughout the workflow." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Effect 5: Performance Optimization with Secondary Indexes\n", - "\n", - "A secondary index is automatically created on the foreign key in the child table to accelerate common operations and queries associated with foreign keys.\n", - "\n", - "**Why indexes matter:**\n", - "\n", - "1. **Delete operations**: When deleting from the parent table, the database must look up all matching child records to enforce referential integrity. An index on the foreign key makes these lookups fast, even when dealing with large child tables.\n", - "\n", - "2. **Join operations**: When joining a parent table with a child table, the database matches the foreign key in the child to the primary key in the parent. An index on the foreign key allows the database to quickly locate matching rows, dramatically improving join performance.\n", - "\n", - "3. **Subqueries**: When checking whether a foreign key value exists in the parent table, the database uses the index to quickly verify the existence of the referenced record. This is especially important for insert operations that must validate foreign key constraints.\n", - "\n", - "**In DataJoint, this optimization is automatic**: When you define a foreign key with `-> Parent`, DataJoint automatically creates the necessary index. This ensures that workflow operations—from populating tables to cascading deletes—remain efficient even as your data grows.\n", - "\n", - "The index is created automatically by the database system, so you don't need to explicitly define it. However, understanding its existence helps you appreciate why foreign key operations remain performant even with large datasets.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Association Tables: Many-to-Many Relationships\n", - "\n", - "The simple one-to-many relationship we've seen so far (Employee → Title) is just the beginning. 
In real-world applications, you often need to model **many-to-many relationships** where entities can be connected in complex ways.\n", - "\n", - "Consider a scenario where you want to track which languages each person speaks and their fluency level. A person can speak multiple languages, and each language can be spoken by multiple people. This is a classic many-to-many relationship.\n", - "\n", - "To model this, we create an **association table** (also called a **junction table** or **bridge table**) that serves as an intermediary between the two main entities. The association table contains foreign keys to both parent tables, and often includes additional attributes that describe the relationship itself.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Create a new schema for our language example\n", - "import random\n", - "from faker import Faker\n", - "\n", - "schema_lang = dj.Schema('languages')\n", - "fake = Faker()\n", - "\n", - "@schema_lang\n", - "class Language(dj.Lookup):\n", - " definition = \"\"\"\n", - " lang_code : char(4)\n", - " ---\n", - " language_name : varchar(30)\n", - " \"\"\"\n", - " \n", - " contents = [\n", - " (\"ENG\", \"English\"),\n", - " (\"SPA\", \"Spanish\"), \n", - " (\"JPN\", \"Japanese\"),\n", - " (\"TAG\", \"Tagalog\"),\n", - " (\"MAN\", \"Mandarin\"),\n", - " (\"POR\", \"Portuguese\")\n", - " ]\n", - "\n", - "@schema_lang\n", - "class Person(dj.Manual):\n", - " definition = \"\"\"\n", - " person_id : int\n", - " ---\n", - " name : varchar(60)\n", - " date_of_birth : date\n", - " \"\"\"\n", - "\n", - "@schema_lang \n", - "class Fluency(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Person\n", - " -> Language\n", - " ---\n", - " fluency_level : enum('beginner', 'intermediate', 'fluent')\n", - " \"\"\"\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "# 
Summary\n\nForeign keys ensure referential integrity by linking a child table to a parent table. In DataJoint, they also establish **workflow dependencies** that prescribe the order of operations. This link imposes five key effects:\n\n1. **Schema Embedding**: The parent's primary key is added as columns to the child table.\n2. **Insert Restriction**: A row cannot be added to the **child** if its foreign key doesn't match a primary key in the **parent**. In DataJoint, this enforces workflow order—the parent must be created before the child.\n3. **Delete Restriction**: A row cannot be deleted from the **parent** if it is still referenced by any rows in the **child**. In DataJoint, cascading deletes maintain workflow consistency by removing dependent downstream artifacts.\n4. **Update Restriction**: Updates to the primary and foreign keys are restricted to prevent inconsistencies. In DataJoint, this preserves workflow immutability—workflow artifacts must be re-executed rather than updated.\n5. **Performance Optimization**: An index is automatically created on the foreign key in the child table to speed up searches and joins.\n\n**In DataJoint, foreign keys transform the schema into a directed acyclic graph (DAG)** that represents valid workflow execution sequences. The schema becomes an executable specification of your workflow, where foreign keys not only enforce referential integrity but also prescribe the order of operations and maintain computational validity throughout the workflow.\n\nFor more on how DataJoint extends foreign keys with workflow semantics, see [Relational Workflows](../20-concepts/05-workflows.md)." 
- } - ], - "metadata": { - "kernelspec": { - "display_name": "base", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.2" - }, - "orig_nbformat": 4 - }, - "nbformat": 4, - "nbformat_minor": 2 -} \ No newline at end of file diff --git a/book/30-design/030-foreign-keys.md b/book/30-design/030-foreign-keys.md new file mode 100644 index 0000000..0647e62 --- /dev/null +++ b/book/30-design/030-foreign-keys.md @@ -0,0 +1,652 @@ +--- +title: Foreign Keys +--- + +# Foreign Keys: Ensuring Referential Integrity + +While **entity integrity** ensures that each record uniquely represents a real-world entity, **referential integrity** ensures that *relationships between* entities are valid and consistent. +A foreign key guarantees that you won't have an employee assigned to a non-existent department or a task associated with a deleted project. + +**Referential integrity is impossible without entity integrity.** You must first have a reliable way to identify unique entities before you can define relationships between them. + +```{admonition} Learning Objectives +:class: note + +By the end of this chapter, you will: +- Understand referential integrity and how foreign keys enforce it +- Learn the five effects of foreign keys on database operations +- Master foreign key modifier syntax: `nullable` and `unique` +- Understand when modifiers can and cannot be applied +- Design tables that properly express relationship constraints +``` + +# What is a Foreign Key? + +A **foreign key** is a column (or set of columns) in a child table that references the primary key of a parent table. +This link establishes a relationship between entities and enforces referential integrity by ensuring that references point to valid records. 
+ +```{card} Foreign Key Characteristics +In DataJoint, foreign keys: +- Always reference the **primary key** of the parent table +- Automatically inherit the parent's primary key attributes (name and datatype) +- Create both a **referential constraint** and a **workflow dependency** +- Can be placed in the primary key (above `---`) or as secondary attributes (below `---`) +``` + +`````{tab-set} +````{tab-item} DataJoint +:sync: datajoint +```python +@schema +class Title(dj.Lookup): + definition = """ + title_code : char(8) # job title code + --- + full_title : varchar(120) # full title description + """ + +@schema +class Employee(dj.Manual): + definition = """ + person_id : int # employee identifier + --- + first_name : varchar(30) + last_name : varchar(30) + -> Title # foreign key to Title + """ +``` +```` +````{tab-item} SQL +:sync: sql +```sql +CREATE TABLE title ( + title_code CHAR(8) NOT NULL COMMENT 'job title code', + full_title VARCHAR(120) NOT NULL COMMENT 'full title description', + PRIMARY KEY (title_code) +); + +CREATE TABLE employee ( + person_id INT NOT NULL COMMENT 'employee identifier', + first_name VARCHAR(30) NOT NULL, + last_name VARCHAR(30) NOT NULL, + title_code CHAR(8) NOT NULL, + PRIMARY KEY (person_id), + FOREIGN KEY (title_code) REFERENCES title(title_code) +); +``` +```` +````` + +The arrow `-> Title` in DataJoint creates a foreign key from `Employee` (child) to `Title` (parent). +Notice how the SQL version requires explicit repetition of the column name and datatype—DataJoint handles this automatically. + +# Referential Integrity + Workflow Dependencies + +In DataJoint, foreign keys serve a **dual role** that extends beyond traditional relational databases: + +1. **Referential integrity** (like traditional databases): Ensures that child records reference existing parent records +2. 
**Workflow dependencies** (DataJoint's addition): Prescribes the order of operations—the parent must exist before the child can reference it + +This transforms the schema into a **directed acyclic graph (DAG)** representing valid workflow execution sequences. +The foreign key `-> Title` in `Employee` not only ensures that each employee has a valid title, but also establishes that titles must be created before employees can be assigned to them. + +```{seealso} +For more on how DataJoint extends foreign keys with workflow semantics, see [Relational Workflows](../20-concepts/05-workflows.md). +``` + +```{admonition} A Logical Constraint, Not a Physical Pointer +:class: tip + +A revolutionary concept in the relational model is that a foreign key is **not a physical pointer** to a location on disk. +Instead, it is a **logical constraint** enforced at runtime. + +When you insert a row into a child table, the database doesn't follow a pre-existing "link." +It performs a search on the parent table to verify that a matching primary key exists. +If found, the insert succeeds; otherwise, it is rejected. + +This differs fundamentally from other data models like HDF5, where data is often linked by direct pointers or paths. +The logical nature of foreign keys gives relational databases their flexibility and integrity. +``` + +# The Five Effects of a Foreign Key + +Foreign keys impose important constraints on data operations. +Understanding these effects is essential for designing schemas that maintain integrity. + +## Effect 1: Schema Embedding + +When a foreign key is declared, the primary key columns from the parent become embedded in the child table with **matching names and datatypes**. 
+ +`````{tab-set} +````{tab-item} DataJoint +:sync: datajoint +```python +@schema +class Employee(dj.Manual): + definition = """ + person_id : int + --- + first_name : varchar(30) + last_name : varchar(30) + -> Title # embeds title_code with type char(8) + """ + +# The Employee table now contains: +# person_id (int) - primary key +# first_name (varchar(30)) +# last_name (varchar(30)) +# title_code (char(8)) - inherited from Title +``` +```` +````{tab-item} SQL +:sync: sql +```sql +-- The foreign key requires explicit column definition +CREATE TABLE employee ( + person_id INT NOT NULL, + first_name VARCHAR(30) NOT NULL, + last_name VARCHAR(30) NOT NULL, + title_code CHAR(8) NOT NULL, -- must match Title's primary key type + PRIMARY KEY (person_id), + FOREIGN KEY (title_code) REFERENCES title(title_code) +); +``` +```` +````` + +## Effect 2: Insert Restriction on Child + +A foreign key ensures that no "orphaned" records are created. +An insert into the child table succeeds only if the foreign key value corresponds to an existing primary key in the parent. + +**The rule**: Inserts are restricted in the **child**, not the parent. +You can always add new job titles, but you cannot add an employee with a `title_code` that doesn't exist in `Title`. 
+ +`````{tab-set} +````{tab-item} DataJoint +:sync: datajoint +```python +# This works - 'Web-Dev1' exists in Title +Employee.insert1((1, 'Mark', 'Sommers', 'Web-Dev1')) + +# This fails - 'BizDev' does not exist in Title +Employee.insert1((2, 'Brenda', 'Means', 'BizDev')) +# IntegrityError: Cannot add or update a child row: +# a foreign key constraint fails +``` +```` +````{tab-item} SQL +:sync: sql +```sql +-- This works - 'Web-Dev1' exists in title +INSERT INTO employee (person_id, first_name, last_name, title_code) +VALUES (1, 'Mark', 'Sommers', 'Web-Dev1'); + +-- This fails - 'BizDev' does not exist in title +INSERT INTO employee (person_id, first_name, last_name, title_code) +VALUES (2, 'Brenda', 'Means', 'BizDev'); +-- ERROR: Cannot add or update a child row: +-- a foreign key constraint fails +``` +```` +````` + +**In DataJoint, this enforces workflow order**: The parent entity must be created before the child entity can reference it. + +## Effect 3: Delete Restriction on Parent + +To prevent broken relationships, a parent record cannot be deleted if any child records still reference it. + +**The rule**: Deletes are restricted in the **parent**, not the child. +You can always delete an employee, but you cannot delete a title if employees still have that title. + +In standard SQL, this would fail with a constraint error. +DataJoint implements **cascading delete**—it warns you that deleting the parent will also delete all dependent child records, which can cascade through many levels of a deep hierarchy. + +**In DataJoint, this maintains workflow consistency**: When you delete a parent entity, all downstream workflow artifacts that depend on it are also deleted. +This ensures computational validity—if the inputs are gone, any results based on those inputs must be removed. + +## Effect 4: Update Restriction on Keys + +Updates to a parent's primary key or a child's foreign key are restricted to maintain referential integrity. 
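
The update restriction can be tried out without a database server. The following is a minimal sketch, assuming plain SQL in an in-memory SQLite database through Python's built-in `sqlite3` module rather than DataJoint or MySQL. It relies on SQLite's default foreign key action, and the `is_rejected` helper is a hypothetical name introduced only for this illustration.

```python
import sqlite3

# In-memory SQLite database (an assumption for illustration; the book uses MySQL).
conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE title (
    title_code CHAR(8) NOT NULL PRIMARY KEY,
    full_title VARCHAR(120) NOT NULL
);
CREATE TABLE employee (
    person_id INTEGER NOT NULL PRIMARY KEY,
    first_name VARCHAR(30) NOT NULL,
    last_name VARCHAR(30) NOT NULL,
    title_code CHAR(8) NOT NULL,
    FOREIGN KEY (title_code) REFERENCES title(title_code)
);
INSERT INTO title VALUES ('Web-Dev1', 'Web Developer 1');
INSERT INTO employee VALUES (1, 'Mark', 'Sommers', 'Web-Dev1');
""")


def is_rejected(sql):
    """Hypothetical helper: True if the statement violates a constraint."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# Changing a parent's primary key that a child still references is rejected.
parent_update_rejected = is_rejected(
    "UPDATE title SET title_code = 'Web-Dev9' WHERE title_code = 'Web-Dev1'"
)

# Pointing a child's foreign key at a non-existent parent is rejected.
child_update_rejected = is_rejected(
    "UPDATE employee SET title_code = 'BizDev' WHERE person_id = 1"
)

print(parent_update_rejected, child_update_rejected)
```

Both statements fail because no foreign key action such as `ON UPDATE CASCADE` was declared, so the database falls back to rejecting any change that would leave a child row without a matching parent.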
 + +DataJoint does not support updating primary key values, as this risks breaking referential integrity in complex scientific workflows. +The preferred pattern is to **delete the old record and insert a new one** with the updated information. + +**In DataJoint, this preserves workflow immutability**: Workflow artifacts are treated as immutable once created. +If upstream data changes, the workflow must be re-executed from that point forward. + +## Effect 5: Automatic Index Creation + +A secondary index is automatically created on the foreign key columns in the child table. +This index accelerates: + +1. **Delete operations**: Fast lookup of child records when checking whether a parent can be deleted +2. **Join operations**: Efficient matching of foreign keys to primary keys +3. **Constraint validation**: Quick re-checking of child rows when a parent key changes; validating a child insert relies on the parent's primary key index instead + +You don't need to create this index manually: MySQL, the database backend used in this book, creates it automatically when the foreign key is declared. +Not every database system does this (PostgreSQL, for example, does not index the referencing columns automatically). + +# Foreign Key Modifiers + +DataJoint provides two modifiers that alter foreign key behavior: `nullable` and `unique`. +These modifiers control whether the relationship is optional and whether it enforces uniqueness. + +## The `nullable` Modifier + +By default, foreign key attributes are **required** (NOT NULL): every child record must reference a valid parent. +The `nullable` modifier makes the relationship **optional**, allowing child records to exist without a parent reference. + +```{card} Nullable Foreign Key Syntax +`-> [nullable] ParentTable` + +This creates foreign key attributes that accept NULL values, indicating "no associated parent."
+``` + +`````{tab-set} +````{tab-item} DataJoint +:sync: datajoint +```python +@schema +class Account(dj.Manual): + definition = """ + account_id : int unsigned # account identifier + --- + -> [nullable] Customer # optional owner - can be NULL + open_date : date + balance : decimal(10,2) + """ + +# Accounts can exist without an owner +Account.insert1({ + 'account_id': 1001, + 'customer_id': None, # NULL - no owner assigned + 'open_date': '2024-01-15', + 'balance': 0.00 +}) +``` +```` +````{tab-item} SQL +:sync: sql +```sql +CREATE TABLE account ( + account_id INT UNSIGNED NOT NULL COMMENT 'account identifier', + customer_id INT UNSIGNED NULL, -- allows NULL values + open_date DATE NOT NULL, + balance DECIMAL(10,2) NOT NULL, + PRIMARY KEY (account_id), + FOREIGN KEY (customer_id) REFERENCES customer(customer_id) +); + +-- Accounts can exist without an owner +INSERT INTO account (account_id, customer_id, open_date, balance) +VALUES (1001, NULL, '2024-01-15', 0.00); +``` +```` +````` + +**Use cases for nullable foreign keys:** +- Accounts that may not yet have an assigned owner +- Products that may not have a designated supplier +- Tasks that have not yet been assigned to an employee + +````{admonition} Primary Key Foreign Keys Cannot Be Nullable +:class: warning + +Foreign keys that are part of the **primary key** (declared above the `---` line) **cannot be made nullable**. +Primary key attributes must always have values—they identify the entity. + +```python +# INVALID - primary key foreign keys cannot be nullable +@schema +class Session(dj.Manual): + definition = """ + -> [nullable] Subject # ERROR: primary key cannot be NULL + session : int + --- + session_date : date + """ +``` + +Only foreign keys in **secondary attributes** (below the `---` line) can be nullable. +```` + +## The `unique` Modifier + +By default, a secondary foreign key allows **many-to-one** relationships—multiple child records can reference the same parent. 
+The `unique` modifier restricts this to **one-to-one**: at most one child can reference each parent. + +```{card} Unique Foreign Key Syntax +`-> [unique] ParentTable` + +This adds a unique constraint on the foreign key attributes, ensuring each parent is referenced by at most one child. +``` + +`````{tab-set} +````{tab-item} DataJoint +:sync: datajoint +```python +@schema +class Employee(dj.Manual): + definition = """ + employee_id : int unsigned + --- + full_name : varchar(60) + """ + +@schema +class ParkingSpot(dj.Manual): + definition = """ + spot_number : int unsigned # parking spot identifier + --- + -> [unique] Employee # at most one spot per employee + location : varchar(30) + """ + +# The referenced employee must exist before a spot can be assigned +Employee.insert1({'employee_id': 1, 'full_name': 'Jane Doe'}) + +# Each employee can have at most one parking spot +ParkingSpot.insert1({ + 'spot_number': 101, + 'employee_id': 1, + 'location': 'Garage A' +}) + +# This would fail - employee 1 already has a spot +ParkingSpot.insert1({ + 'spot_number': 102, + 'employee_id': 1, # ERROR: duplicate entry + 'location': 'Garage B' +}) +``` +```` +````{tab-item} SQL +:sync: sql +```sql +CREATE TABLE employee ( + employee_id INT UNSIGNED NOT NULL, + full_name VARCHAR(60) NOT NULL, + PRIMARY KEY (employee_id) +); + +CREATE TABLE parking_spot ( + spot_number INT UNSIGNED NOT NULL COMMENT 'parking spot identifier', + employee_id INT UNSIGNED NOT NULL, + location VARCHAR(30) NOT NULL, + PRIMARY KEY (spot_number), + UNIQUE KEY (employee_id), -- unique constraint on foreign key + FOREIGN KEY (employee_id) REFERENCES employee(employee_id) +); +``` +```` +````` + +**Use cases for unique foreign keys:** +- Parking spots assigned to employees (one spot per employee) +- Primary contact person for a department +- Default billing address for a customer + +## Combining Modifiers + +The `nullable` and `unique` modifiers can be combined to create an **optional one-to-one** relationship: + +```{card} Combined Modifier Syntax +`-> [nullable, unique] ParentTable` or `-> [unique, nullable] ParentTable` + +This creates an
 optional relationship where each parent can be referenced by at most one child (or none). +``` + +`````{tab-set} +````{tab-item} DataJoint +:sync: datajoint +```python +@schema +class Account(dj.Manual): + definition = """ + account_id : int unsigned + --- + -> [nullable, unique] Customer # optional, one account per customer max + open_date : date + """ + +# Account without owner +Account.insert1({'account_id': 1, 'customer_id': None, 'open_date': '2024-01-01'}) + +# Account with owner - customer 100 (must already exist in Customer) +Account.insert1({'account_id': 2, 'customer_id': 100, 'open_date': '2024-01-02'}) + +# This fails - customer 100 already has an account +Account.insert1({'account_id': 3, 'customer_id': 100, 'open_date': '2024-01-03'}) +# DuplicateError: Duplicate entry '100' for key 'customer_id' +``` +```` +````{tab-item} SQL +:sync: sql +```sql +CREATE TABLE account ( + account_id INT UNSIGNED NOT NULL, + customer_id INT UNSIGNED NULL, -- nullable + open_date DATE NOT NULL, + PRIMARY KEY (account_id), + UNIQUE KEY (customer_id), -- unique (NULLs don't violate uniqueness) + FOREIGN KEY (customer_id) REFERENCES customer(customer_id) +); +``` +```` +````` + +```{admonition} NULL Values and Unique Constraints +:class: note + +In MySQL, as in most SQL databases, NULL values are **not considered equal** for uniqueness purposes. +Multiple rows can have NULL in a column with a unique constraint; only non-NULL values must be unique. + +This is why `[nullable, unique]` works as expected: many accounts can have no owner (NULL), but each customer can own at most one account. 
+``` + +# Modifier Summary + +| Modifier | Placement | Effect | Use Case | +|----------|-----------|--------|----------| +| (none) | Secondary | Required many-to-one | Default: every child references exactly one parent | +| `nullable` | Secondary only | Optional many-to-one | Child may exist without parent reference | +| `unique` | Secondary | Required one-to-one | Each parent referenced by at most one child | +| `nullable, unique` | Secondary only | Optional one-to-one | Optional relationship, but exclusive if present | + +```{admonition} Modifiers Apply Only to Secondary Foreign Keys +:class: important + +Foreign keys in the **primary key** (above `---`): +- Cannot use `nullable` (primary keys cannot be NULL) +- The `unique` modifier is redundant (primary keys are already unique) + +Modifiers are meaningful only for foreign keys declared as **secondary attributes** (below `---`). +``` + +# Foreign Key Placement + +Where you place a foreign key—above or below the `---` line—fundamentally changes its meaning: + +| Placement | Primary Key? 
| Relationship | Line Style in Diagram | +|-----------|--------------|--------------|----------------------| +| Above `---` (only FK) | Yes, entire PK | One-to-one (extension) | Thick solid | +| Above `---` (with other attrs) | Yes, part of PK | One-to-many (containment) | Thin solid | +| Below `---` | No | One-to-many (reference) | Dashed | +| Below `---` + `unique` | No | One-to-one (reference) | Dashed | + +`````{tab-set} +````{tab-item} DataJoint +:sync: datajoint +```python +# Foreign key IS the entire primary key (thick solid line) +@schema +class CustomerPreferences(dj.Manual): + definition = """ + -> Customer # customer_id IS the primary key + --- + theme : varchar(20) + """ + +# Foreign key is PART OF primary key (thin solid line) +@schema +class CustomerAccount(dj.Manual): + definition = """ + -> Customer # customer_id is part of primary key + account_num : int # together they form the primary key + --- + balance : decimal(10,2) + """ + +# Foreign key is a secondary attribute (dashed line) +@schema +class Order(dj.Manual): + definition = """ + order_id : int # order_id is the primary key + --- + -> Customer # customer_id is a secondary attribute + order_date : date + """ +``` +```` +````{tab-item} SQL +:sync: sql +```sql +-- Foreign key IS the entire primary key +CREATE TABLE customer_preferences ( + customer_id INT NOT NULL, + theme VARCHAR(20) NOT NULL, + PRIMARY KEY (customer_id), + FOREIGN KEY (customer_id) REFERENCES customer(customer_id) +); + +-- Foreign key is PART OF primary key +CREATE TABLE customer_account ( + customer_id INT NOT NULL, + account_num INT NOT NULL, + balance DECIMAL(10,2) NOT NULL, + PRIMARY KEY (customer_id, account_num), + FOREIGN KEY (customer_id) REFERENCES customer(customer_id) +); + +-- Foreign key is a secondary attribute +CREATE TABLE order_ ( + order_id INT NOT NULL, + customer_id INT NOT NULL, + order_date DATE NOT NULL, + PRIMARY KEY (order_id), + FOREIGN KEY (customer_id) REFERENCES customer(customer_id) +); +``` 
+```` +````` + +```{seealso} +To see how these foreign key placements appear in schema diagrams, see [Diagramming](040-diagrams.ipynb). +For detailed coverage of relationship patterns, see [Relationships](050-relationships.ipynb). +``` + +# Association Tables: Many-to-Many Relationships + +A single foreign key creates a one-to-many (or one-to-one) relationship. +To model **many-to-many** relationships, use an **association table** with foreign keys to both entities: + +`````{tab-set} +````{tab-item} DataJoint +:sync: datajoint +```python +@schema +class Person(dj.Manual): + definition = """ + person_id : int + --- + name : varchar(60) + """ + +@schema +class Language(dj.Lookup): + definition = """ + lang_code : char(4) + --- + language_name : varchar(30) + """ + +@schema +class Fluency(dj.Manual): + definition = """ + -> Person # part of primary key + -> Language # part of primary key + --- + fluency_level : enum('beginner', 'intermediate', 'fluent') + """ +``` +```` +````{tab-item} SQL +:sync: sql +```sql +CREATE TABLE person ( + person_id INT NOT NULL, + name VARCHAR(60) NOT NULL, + PRIMARY KEY (person_id) +); + +CREATE TABLE language ( + lang_code CHAR(4) NOT NULL, + language_name VARCHAR(30) NOT NULL, + PRIMARY KEY (lang_code) +); + +CREATE TABLE fluency ( + person_id INT NOT NULL, + lang_code CHAR(4) NOT NULL, + fluency_level ENUM('beginner', 'intermediate', 'fluent') NOT NULL, + PRIMARY KEY (person_id, lang_code), + FOREIGN KEY (person_id) REFERENCES person(person_id), + FOREIGN KEY (lang_code) REFERENCES language(lang_code) +); +``` +```` +````` + +The `Fluency` table has a **composite primary key** combining both foreign keys. +This allows: +- Each person to speak multiple languages +- Each language to be spoken by multiple people +- Each person-language combination to appear at most once + +```{seealso} +For more association table patterns and variations, see [Relationships](050-relationships.ipynb). 
+``` + +# Summary + +Foreign keys ensure referential integrity by linking child tables to parent tables. +In DataJoint, they also establish **workflow dependencies** that prescribe the order of operations. + +| Effect | Description | +|--------|-------------| +| **Schema Embedding** | Parent's primary key attributes are added to child table | +| **Insert Restriction** | Child inserts require valid parent reference | +| **Delete Restriction** | Parent deletes cascade to remove dependent children | +| **Update Restriction** | Primary/foreign key values cannot be updated in place | +| **Index Creation** | Automatic index on foreign key for performance | + +| Modifier | Syntax | Effect | Restriction | +|----------|--------|--------|-------------| +| `nullable` | `-> [nullable] Parent` | Allows NULL (no parent) | Secondary attributes only | +| `unique` | `-> [unique] Parent` | One-to-one relationship | Secondary attributes only | +| Both | `-> [nullable, unique] Parent` | Optional one-to-one | Secondary attributes only | + +```{admonition} Key Principles +:class: tip + +1. **Primary key foreign keys cannot be nullable** — They define entity identity +2. **Modifiers apply only to secondary foreign keys** — Foreign keys in the primary key have fixed behavior +3. **Diagrams don't show modifiers** — Check table definitions for nullable and unique constraints +4. 
**Foreign keys transform schemas into DAGs** — They prescribe workflow execution order +``` + +```{admonition} Next Steps +:class: note + +Now that you understand foreign keys and their modifiers: +- **[Diagramming](040-diagrams.ipynb)** — Learn to read and interpret schema diagrams +- **[Relationships](050-relationships.ipynb)** — Explore relationship patterns: one-to-one, one-to-many, many-to-many +``` diff --git a/book/30-design/040-diagrams.ipynb b/book/30-design/040-diagrams.ipynb new file mode 100644 index 0000000..1769637 --- /dev/null +++ b/book/30-design/040-diagrams.ipynb @@ -0,0 +1,1361 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cell-0", + "metadata": {}, + "source": "---\ntitle: Diagramming\n---\n\n# Diagramming: Visualizing Schema Structure\n\nSchema diagrams are essential tools for understanding and designing DataJoint pipelines.\nThey provide a visual representation of tables and their dependencies, making complex workflows comprehensible at a glance.\n\nBuilding on the [Foreign Keys](030-foreign-keys.md) chapter—which covered how foreign key placement affects relationship structure—this chapter focuses on how to **read and interpret** the visual representation of these relationships.\n\nAs introduced in [Relational Workflows](../20-concepts/05-workflows.md), DataJoint schemas form **Directed Acyclic Graphs (DAGs)** where:\n\n- **Nodes** represent tables (workflow steps)\n- **Edges** represent foreign key dependencies\n- **Direction** flows from parent (referenced) to child (referencing) tables\n\nThis DAG structure embodies a core principle of the Relational Workflow Model: **the schema is an executable specification**.\nTables at the top are independent entities; tables below depend on tables above them.\nReading the diagram top-to-bottom reveals the workflow execution order.\n\n```{admonition} Learning Objectives\n:class: note\n\nBy the end of this chapter, you will:\n- Understand the three line styles and their semantic 
meanings\n- Identify relationship types from diagram structure\n- Recognize what diagrams show and don't show\n- Use diagram operations to explore large schemas\n- Compare DataJoint notation with traditional ER diagrams\n```" + }, + { + "cell_type": "markdown", + "id": "cell-1", + "metadata": {}, + "source": [ + "# Quick Reference\n", + "\n", + "| Line Style | Appearance | Relationship | Child's Primary Key | Cardinality |\n", + "|------------|------------|--------------|---------------------|-------------|\n", + "| **Thick Solid** | ━━━ | Extension | Parent PK only | One-to-one |\n", + "| **Thin Solid** | ─── | Containment | Parent PK + own field(s) | One-to-many |\n", + "| **Dashed** | ┄┄┄ | Reference | Own independent PK | One-to-many |\n", + "\n", + "```{card} Key Principle\n", + "**Solid lines** mean the parent's identity becomes part of the child's identity.\n", + "**Dashed lines** mean the child maintains independent identity.\n", + "```\n", + "\n", + "**Visual Indicators:**\n", + "- **Underlined table name**: Independent entity introducing a new schema dimension\n", + "- **Non-underlined name**: Dependent entity whose identity derives from parent(s)\n", + "- **Orange dots**: Renamed foreign keys (via `.proj()`)\n", + "- **Table colors**: Green (Manual), Blue (Imported), Red (Computed), Gray (Lookup)" + ] + }, + { + "cell_type": "markdown", + "id": "cell-2", + "metadata": {}, + "source": [ + "# Setup\n", + "\n", + "First, we import DataJoint and create a schema for our examples:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "cell-3", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[2025-12-18 16:08:53,551][INFO]: DataJoint 0.14.6 connected to dev@db:3306\n" + ] + } + ], + "source": [ + "# create a new tutorial schema\n", + "import datajoint as dj\n", + "schema = dj.Schema('diagrams_tutorial')" + ] + }, + { + "cell_type": "markdown", + "id": "cell-4", + "metadata": {}, + "source": [ + 
"# The Three Line Styles\n", + "\n", + "Line styles convey the **semantic relationship** between parent and child tables.\n", + "The choice is determined by where the foreign key appears in the child's definition.\n", + "\n", + "## Thick Solid Line: Extension (One-to-One)\n", + "\n", + "The foreign key **is** the entire primary key of the child table.\n", + "\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class Customer(dj.Manual):\n", + " definition = \"\"\"\n", + " customer_id : int unsigned\n", + " ---\n", + " customer_name : varchar(60)\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class CustomerPreferences(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Customer # This IS the entire primary key\n", + " ---\n", + " theme : varchar(20)\n", + " notifications : enum('on', 'off')\n", + " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE customer (\n", + " customer_id INT UNSIGNED NOT NULL,\n", + " customer_name VARCHAR(60) NOT NULL,\n", + " PRIMARY KEY (customer_id)\n", + ");\n", + "\n", + "CREATE TABLE customer_preferences (\n", + " customer_id INT UNSIGNED NOT NULL,\n", + " theme VARCHAR(20) NOT NULL,\n", + " notifications ENUM('on', 'off') NOT NULL,\n", + " PRIMARY KEY (customer_id),\n", + " FOREIGN KEY (customer_id) REFERENCES customer(customer_id)\n", + ");\n", + "```\n", + "````\n", + "`````\n", + "\n", + "**Semantics**: The child *extends* or *specializes* the parent.\n", + "They share the same identity—at most one child exists for each parent.\n", + "\n", + "**Use cases**: Workflow sequences (Order → Shipment → Delivery), optional extensions (Customer → CustomerPreferences), modular data organization.\n", + "\n", + "**In diagrams**: Notice that `CustomerPreferences` is **not underlined**—it doesn't introduce a new dimension." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "cell-5", + "metadata": {}, + "outputs": [], + "source": [ + "# Define tables for thick solid line example\n", + "@schema\n", + "class Customer(dj.Manual):\n", + " definition = \"\"\"\n", + " customer_id : int unsigned\n", + " ---\n", + " customer_name : varchar(60)\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class CustomerPreferences(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Customer # This IS the entire primary key\n", + " ---\n", + " theme : varchar(20)\n", + " notifications : enum('on', 'off')\n", + " \"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "cell-6", + "metadata": {}, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "Customer\n", + "\n", + "\n", + "Customer\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "CustomerPreferences\n", + "\n", + "\n", + "CustomerPreferences\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Customer->CustomerPreferences\n", + "\n", + "\n", + "\n", + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# View the diagram - notice the thick solid line\n", + "dj.Diagram(Customer) + dj.Diagram(CustomerPreferences)" + ] + }, + { + "cell_type": "markdown", + "id": "cell-7", + "metadata": {}, + "source": [ + "## Thin Solid Line: Containment (One-to-Many)\n", + "\n", + "The foreign key is **part of** (but not all of) the child's primary key.\n", + "\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class Customer(dj.Manual):\n", + " definition = \"\"\"\n", + " customer_id : int unsigned\n", + " ---\n", + " customer_name : varchar(60)\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class Account(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Customer # Part of primary key\n", + " account_num : int unsigned # Additional PK component\n", + " 
---\n", + " balance : decimal(10,2)\n", + " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE customer (\n", + " customer_id INT UNSIGNED NOT NULL,\n", + " customer_name VARCHAR(60) NOT NULL,\n", + " PRIMARY KEY (customer_id)\n", + ");\n", + "\n", + "CREATE TABLE account (\n", + " customer_id INT UNSIGNED NOT NULL,\n", + " account_num INT UNSIGNED NOT NULL,\n", + " balance DECIMAL(10,2) NOT NULL,\n", + " PRIMARY KEY (customer_id, account_num),\n", + " FOREIGN KEY (customer_id) REFERENCES customer(customer_id)\n", + ");\n", + "```\n", + "````\n", + "`````\n", + "\n", + "**Semantics**: The child *belongs to* or *is contained within* the parent.\n", + "Multiple children can exist for each parent, each identified within the parent's context.\n", + "\n", + "**Use cases**: Hierarchies (Study → Subject → Session), ownership (Customer → Account), containment (Order → OrderItem).\n", + "\n", + "**In diagrams**: `Account` is **underlined** because `account_num` introduces a new dimension." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "cell-8", + "metadata": {}, + "outputs": [], + "source": [ + "# Define tables for thin solid line example\n", + "@schema\n", + "class Account(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Customer # Part of primary key\n", + " account_num : int unsigned # Additional PK component\n", + " ---\n", + " balance : decimal(10,2)\n", + " \"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "cell-9", + "metadata": {}, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "Account\n", + "\n", + "\n", + "Account\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Customer\n", + "\n", + "\n", + "Customer\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Customer->Account\n", + "\n", + "\n", + "\n", + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# View the diagram - notice the thin solid line\n", + "dj.Diagram(Customer) + dj.Diagram(Account)" + ] + }, + { + "cell_type": "markdown", + "id": "cell-10", + "metadata": {}, + "source": [ + "## Dashed Line: Reference (One-to-Many)\n", + "\n", + "The foreign key is a **secondary attribute** (below the `---` line).\n", + "\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class Department(dj.Manual):\n", + " definition = \"\"\"\n", + " dept_id : int unsigned\n", + " ---\n", + " dept_name : varchar(60)\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class Employee(dj.Manual):\n", + " definition = \"\"\"\n", + " employee_id : int unsigned # Own independent PK\n", + " ---\n", + " -> Department # Secondary attribute\n", + " employee_name : varchar(60)\n", + " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE department (\n", + " dept_id INT UNSIGNED NOT NULL,\n", + " dept_name VARCHAR(60) NOT 
NULL,\n", + " PRIMARY KEY (dept_id)\n", + ");\n", + "\n", + "CREATE TABLE employee (\n", + " employee_id INT UNSIGNED NOT NULL,\n", + " dept_id INT UNSIGNED NOT NULL,\n", + " employee_name VARCHAR(60) NOT NULL,\n", + " PRIMARY KEY (employee_id),\n", + " FOREIGN KEY (dept_id) REFERENCES department(dept_id)\n", + ");\n", + "```\n", + "````\n", + "`````\n", + "\n", + "**Semantics**: The child *references* or *associates with* the parent but maintains independent identity.\n", + "The parent is just one attribute describing the child.\n", + "\n", + "**Use cases**: Loose associations (Product → Category), references that might change (Employee → Department), when child has independent identity." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "cell-11", + "metadata": {}, + "outputs": [], + "source": [ + "# Define tables for dashed line example\n", + "@schema\n", + "class Department(dj.Manual):\n", + " definition = \"\"\"\n", + " dept_id : int unsigned\n", + " ---\n", + " dept_name : varchar(60)\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class Employee(dj.Manual):\n", + " definition = \"\"\"\n", + " employee_id : int unsigned # Own independent PK\n", + " ---\n", + " -> Department # Secondary attribute\n", + " employee_name : varchar(60)\n", + " \"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "cell-12", + "metadata": {}, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "Employee\n", + "\n", + "\n", + "Employee\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Department\n", + "\n", + "\n", + "Department\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Department->Employee\n", + "\n", + "\n", + "\n", + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# View the diagram - notice the dashed line\n", + "dj.Diagram(Department) + dj.Diagram(Employee)" + ] + }, + { + "cell_type": "markdown", + "id": 
"cell-13", + "metadata": {}, + "source": [ + "# What Diagrams Show and Don't Show\n", + "\n", + "Understanding the limitations of diagram notation is crucial for accurate schema interpretation.\n", + "\n", + "```{admonition} Diagrams Do NOT Reflect Foreign Key Modifiers\n", + ":class: warning\n", + "\n", + "The `nullable` and `unique` modifiers on foreign keys are **not visible** in diagrams.\n", + "\n", + "- A **dashed line** could represent:\n", + " - A required one-to-many relationship (default)\n", + " - An optional one-to-many relationship (`nullable`)\n", + " - A required one-to-one relationship (`unique`)\n", + " - An optional one-to-one relationship (`nullable, unique`)\n", + "\n", + "**Always check the table definition** to determine if modifiers are applied.\n", + "```\n", + "\n", + "## Clearly Indicated in Diagrams\n", + "\n", + "| Feature | How It's Shown |\n", + "|---------|----------------|\n", + "| Foreign key in primary key | Solid line (thick or thin) |\n", + "| Foreign key as secondary attribute | Dashed line |\n", + "| One-to-one via shared identity | Thick solid line |\n", + "| One-to-many via containment | Thin solid line |\n", + "| Independent entity (new dimension) | Underlined table name |\n", + "| Dependent entity (shared dimension) | Non-underlined table name |\n", + "| Table tier | Colors (Green/Blue/Red/Gray) |\n", + "| Many-to-many patterns | Converging solid lines into association table |\n", + "| Renamed foreign keys | Orange dots on connection |\n", + "\n", + "## NOT Visible in Diagrams\n", + "\n", + "| Feature | Must Check Table Definition |\n", + "|---------|----------------------------|\n", + "| `nullable` foreign keys | Definition shows `-> [nullable] Parent` |\n", + "| `unique` foreign keys | Definition shows `-> [unique] Parent` |\n", + "| Combined modifiers | Definition shows `-> [nullable, unique] Parent` |\n", + "| Secondary unique indexes | Definition shows `unique index(...)` |\n", + "| CHECK constraints | Definition shows 
constraint |\n", + "| Attribute names and types | Hover tooltip or inspect definition |\n", + "| Default values | Definition shows `= value` |" + ] + }, + { + "cell_type": "markdown", + "id": "cell-14", + "metadata": {}, + "source": [ + "## Example: Hidden Uniqueness\n", + "\n", + "Consider these two schemas, which produce **identical diagrams**:\n", + "\n", + "`````{tab-set}\n", + "````{tab-item} Many Spots per Employee\n", + "```python\n", + "@schema\n", + "class ParkingSpot(dj.Manual):\n", + " definition = \"\"\"\n", + " spot_id : int unsigned\n", + " ---\n", + " -> Employee # many spots per employee allowed\n", + " location : varchar(30)\n", + " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} One Spot per Employee\n", + "```python\n", + "@schema\n", + "class ParkingSpot(dj.Manual):\n", + " definition = \"\"\"\n", + " spot_id : int unsigned\n", + " ---\n", + " -> [unique] Employee # only one spot per employee\n", + " location : varchar(30)\n", + " \"\"\"\n", + "```\n", + "````\n", + "`````\n", + "\n", + "Both show a dashed line from `Employee` (the parent) to `ParkingSpot` (the child).\n", + "Only by examining the definition can you see the `unique` constraint.\n", + "\n", + "```{admonition} Interactive Tip\n", + ":class: tip\n", + "\n", + "In Jupyter notebooks, **hover over table nodes** in the diagram to see the complete table definition, including any modifiers and constraints.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "cell-15", + "metadata": {}, + "source": [ + "# Association Tables and Many-to-Many\n", + "\n", + "Many-to-many relationships appear as tables with **converging foreign keys**: multiple solid lines pointing into a single table.\n", + "\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class Student(dj.Manual):\n", + " definition = \"\"\"\n", + " student_id : int unsigned\n", + " ---\n", + " student_name : varchar(60)\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class 
Course(dj.Manual):\n", + " definition = \"\"\"\n", + " course_code : char(8)\n", + " ---\n", + " course_title : varchar(100)\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class Enrollment(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Student\n", + " -> Course\n", + " ---\n", + " grade : enum('A', 'B', 'C', 'D', 'F')\n", + " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE student (\n", + " student_id INT UNSIGNED NOT NULL,\n", + " student_name VARCHAR(60) NOT NULL,\n", + " PRIMARY KEY (student_id)\n", + ");\n", + "\n", + "CREATE TABLE course (\n", + " course_code CHAR(8) NOT NULL,\n", + " course_title VARCHAR(100) NOT NULL,\n", + " PRIMARY KEY (course_code)\n", + ");\n", + "\n", + "CREATE TABLE enrollment (\n", + " student_id INT UNSIGNED NOT NULL,\n", + " course_code CHAR(8) NOT NULL,\n", + " grade ENUM('A', 'B', 'C', 'D', 'F') NOT NULL,\n", + " PRIMARY KEY (student_id, course_code),\n", + " FOREIGN KEY (student_id) REFERENCES student(student_id),\n", + " FOREIGN KEY (course_code) REFERENCES course(course_code)\n", + ");\n", + "```\n", + "````\n", + "`````\n", + "\n", + "**Reading the diagram:**\n", + "- `Student` and `Course` are independent entities (underlined, at top)\n", + "- `Enrollment` has two thin solid lines converging into it\n", + "- Its primary key is `(student_id, course_code)`—the combination of both parents\n", + "- This creates a many-to-many: each student can take multiple courses, each course can have multiple students" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "cell-16", + "metadata": {}, + "outputs": [], + "source": [ + "# Define tables for many-to-many example\n", + "@schema\n", + "class Student(dj.Manual):\n", + " definition = \"\"\"\n", + " student_id : int unsigned\n", + " ---\n", + " student_name : varchar(60)\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class Course(dj.Manual):\n", + " definition = \"\"\"\n", + " course_code : char(8)\n", + " ---\n", 
+ " course_title : varchar(100)\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class Enrollment(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Student\n", + " -> Course\n", + " ---\n", + " grade : enum('A', 'B', 'C', 'D', 'F')\n", + " \"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "cell-17", + "metadata": {}, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "Student\n", + "\n", + "\n", + "Student\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Enrollment\n", + "\n", + "\n", + "Enrollment\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Student->Enrollment\n", + "\n", + "\n", + "\n", + "\n", + "Course\n", + "\n", + "\n", + "Course\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Course->Enrollment\n", + "\n", + "\n", + "\n", + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# View the many-to-many diagram\n", + "dj.Diagram(Enrollment) - 1" + ] + }, + { + "cell_type": "markdown", + "id": "cell-18", + "metadata": {}, + "source": [ + "# Renamed Foreign Keys and Orange Dots\n", + "\n", + "When you reference the same parent table multiple times, or need semantic clarity, use `.proj()` to rename foreign key attributes.\n", + "\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class Neuron(dj.Manual):\n", + " definition = \"\"\"\n", + " neuron_id : int unsigned\n", + " ---\n", + " neuron_type : enum('excitatory', 'inhibitory')\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class Synapse(dj.Manual):\n", + " definition = \"\"\"\n", + " synapse_id : int unsigned\n", + " ---\n", + " -> Neuron.proj(presynaptic='neuron_id')\n", + " -> Neuron.proj(postsynaptic='neuron_id')\n", + " strength : float\n", + " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE neuron (\n", + " neuron_id INT UNSIGNED NOT 
NULL,\n", + " neuron_type ENUM('excitatory', 'inhibitory') NOT NULL,\n", + " PRIMARY KEY (neuron_id)\n", + ");\n", + "\n", + "CREATE TABLE synapse (\n", + " synapse_id INT UNSIGNED NOT NULL,\n", + " presynaptic INT UNSIGNED NOT NULL,\n", + " postsynaptic INT UNSIGNED NOT NULL,\n", + " strength FLOAT NOT NULL,\n", + " PRIMARY KEY (synapse_id),\n", + " FOREIGN KEY (presynaptic) REFERENCES neuron(neuron_id),\n", + " FOREIGN KEY (postsynaptic) REFERENCES neuron(neuron_id)\n", + ");\n", + "```\n", + "````\n", + "`````\n", + "\n", + "**Orange dots** appear between `Neuron` and `Synapse`, indicating:\n", + "- A projection has renamed the foreign key attributes\n", + "- Two distinct foreign keys connect the same pair of tables\n", + "- `presynaptic` and `postsynaptic` both reference `Neuron.neuron_id`\n", + "\n", + "In interactive Jupyter notebooks, hovering over orange dots reveals the projection expression.\n", + "\n", + "**Common patterns using renamed foreign keys:**\n", + "- Neural networks: presynaptic and postsynaptic neurons\n", + "- Organizational hierarchies: employee and manager\n", + "- Transportation: origin and destination airports" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "cell-19", + "metadata": {}, + "outputs": [], + "source": [ + "# Define tables for renamed foreign key example\n", + "@schema\n", + "class Neuron(dj.Manual):\n", + " definition = \"\"\"\n", + " neuron_id : int unsigned\n", + " ---\n", + " neuron_type : enum('excitatory', 'inhibitory')\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class Synapse(dj.Manual):\n", + " definition = \"\"\"\n", + " synapse_id : int unsigned\n", + " ---\n", + " -> Neuron.proj(presynaptic='neuron_id')\n", + " -> Neuron.proj(postsynaptic='neuron_id')\n", + " strength : float\n", + " \"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "cell-20", + "metadata": {}, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "0\n", + 
"\n", + "0\n", + "\n", + "\n", + "\n", + "Synapse\n", + "\n", + "\n", + "Synapse\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "0->Synapse\n", + "\n", + "\n", + "\n", + "\n", + "1\n", + "\n", + "1\n", + "\n", + "\n", + "\n", + "1->Synapse\n", + "\n", + "\n", + "\n", + "\n", + "Neuron\n", + "\n", + "\n", + "Neuron\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Neuron->0\n", + "\n", + "\n", + "\n", + "\n", + "Neuron->1\n", + "\n", + "\n", + "\n", + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# View the diagram - notice the orange dots indicating renamed foreign keys\n", + "dj.Diagram(Neuron) + dj.Diagram(Synapse)" + ] + }, + { + "cell_type": "markdown", + "id": "cell-21", + "metadata": {}, + "source": [ + "# Diagram Operations\n", + "\n", + "DataJoint provides operators to filter and combine diagrams for exploring large schemas:\n", + "\n", + "```python\n", + "# Show entire schema\n", + "dj.Diagram(schema)\n", + "\n", + "# Show specific tables\n", + "dj.Diagram(Table1) + dj.Diagram(Table2)\n", + "\n", + "# Show table and N levels of upstream dependencies\n", + "dj.Diagram(Table) - N\n", + "\n", + "# Show table and N levels of downstream dependents\n", + "dj.Diagram(Table) + N\n", + "\n", + "# Combine operations\n", + "(dj.Diagram(Table1) - 2) + (dj.Diagram(Table2) + 1)\n", + "\n", + "# Intersection: show only common nodes between two diagrams\n", + "dj.Diagram(Table1) * dj.Diagram(Table2)\n", + "```\n", + "\n", + "## Finding Paths Between Tables\n", + "\n", + "The intersection operator `*` is particularly useful for finding connection paths between two tables in a large schema.\n", + "By expanding one table downstream and another upstream, then taking the intersection, you reveal only the tables that form the path(s) between them:\n", + "\n", + "```python\n", + "# Find all paths connecting table1 to table2 (where table2 is downstream from table1)\n", + 
"(dj.Diagram(table1) + 100) * (dj.Diagram(table2) - 100)\n", + "```\n", + "\n", + "This works because:\n", + "- `dj.Diagram(table1) + 100` includes table1 and up to 100 levels of downstream dependents\n", + "- `dj.Diagram(table2) - 100` includes table2 and up to 100 levels of upstream dependencies\n", + "- The intersection `*` shows only tables that appear in **both** diagrams—the connecting path(s)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "cell-22", + "metadata": {}, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "4\n", + "\n", + "4\n", + "\n", + "\n", + "\n", + "Synapse\n", + "\n", + "\n", + "Synapse\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "4->Synapse\n", + "\n", + "\n", + "\n", + "\n", + "5\n", + "\n", + "5\n", + "\n", + "\n", + "\n", + "5->Synapse\n", + "\n", + "\n", + "\n", + "\n", + "Student\n", + "\n", + "\n", + "Student\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Enrollment\n", + "\n", + "\n", + "Enrollment\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Student->Enrollment\n", + "\n", + "\n", + "\n", + "\n", + "Neuron\n", + "\n", + "\n", + "Neuron\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Neuron->4\n", + "\n", + "\n", + "\n", + "\n", + "Neuron->5\n", + "\n", + "\n", + "\n", + "\n", + "Employee\n", + "\n", + "\n", + "Employee\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Department\n", + "\n", + "\n", + "Department\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Department->Employee\n", + "\n", + "\n", + "\n", + "\n", + "CustomerPreferences\n", + "\n", + "\n", + "CustomerPreferences\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Customer\n", + "\n", + "\n", + "Customer\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Customer->CustomerPreferences\n", + "\n", + "\n", + "\n", + "\n", + "Account\n", + "\n", + "\n", + "Account\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Customer->Account\n", + "\n", + "\n", + "\n", + "\n", + "Course\n", + "\n", + "\n", + "Course\n", + 
"\n", + "\n", + "\n", + "\n", + "\n", + "Course->Enrollment\n", + "\n", + "\n", + "\n", + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# View the entire schema we've built\n", + "dj.Diagram(schema)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "tpx88p488y", + "metadata": {}, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "Enrollment\n", + "\n", + "\n", + "Enrollment\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Student\n", + "\n", + "\n", + "Student\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Student->Enrollment\n", + "\n", + "\n", + "\n", + "" + ], + "text/plain": [ + "" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Find the path from Student to Enrollment using intersection\n", + "# Expand Student downstream and Enrollment upstream, then intersect\n", + "(dj.Diagram(Student) + 100) * (dj.Diagram(Enrollment) - 100)" + ] + }, + { + "cell_type": "markdown", + "id": "cell-23", + "metadata": {}, + "source": [ + "# Diagrams and Queries\n", + "\n", + "The diagram structure directly informs query patterns.\n", + "\n", + "**Solid line paths enable direct joins:**\n", + "```python\n", + "# If A → B → C are connected by solid lines:\n", + "A * C # Valid—primary keys cascade through solid lines\n", + "```\n", + "\n", + "**Dashed lines require intermediate tables:**\n", + "```python\n", + "# If A ---> B (dashed), B → C (solid):\n", + "A * B * C # Must include B to connect A and C\n", + "```\n", + "\n", + "This is why solid lines are preferred when the relationship supports it—they simplify queries by allowing you to join non-adjacent tables directly." 
+ ] + }, + { + "cell_type": "markdown", + "id": "cell-24", + "metadata": {}, + "source": [ + "# Comparison to Other Notations\n", + "\n", + "DataJoint's notation differs significantly from traditional database diagramming:\n", + "\n", + "| Feature | Chen's ER | Crow's Foot | DataJoint |\n", + "|---------|-----------|-------------|-----------|\n", + "| **Cardinality** | Numbers near entities | Symbols at line ends | Line thickness/style |\n", + "| **Direction** | No inherent direction | No inherent direction | Top-to-bottom (DAG) |\n", + "| **Cycles allowed** | Yes | Yes | No |\n", + "| **Entity vs. relationship** | Distinct symbols | Not distinguished | Not distinguished |\n", + "| **Primary key cascade** | Not shown | Not shown | Solid lines show this |\n", + "| **Identity sharing** | Not indicated | Not indicated | Thick solid line |\n", + "\n", + "**Why DataJoint differs:**\n", + "\n", + "1. **DAG structure**: No cycles means schemas are readable as workflows (top-to-bottom execution order)\n", + "2. **Line style semantics**: Immediately reveals relationship type without reading labels\n", + "3. **Primary key cascade visibility**: Solid lines show which tables can be joined directly\n", + "4. **Unified entity treatment**: No artificial distinction between \"entities\" and \"relationships\"\n", + "\n", + "```{seealso}\n", + "The [Relational Workflows](../20-concepts/05-workflows.md) chapter covers the three database paradigms in depth, including how DataJoint's workflow-centric approach compares to Codd's mathematical model and Chen's Entity-Relationship model.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "cell-25", + "metadata": {}, + "source": [ + "# Best Practices\n", + "\n", + "## Reading Diagrams\n", + "\n", + "1. **Start at the top**: Identify independent entities (underlined)\n", + "2. **Follow solid lines**: Trace primary key cascades downward\n", + "3. 
**Spot convergence patterns**: Multiple lines into a table indicate associations\n", + "4. **Check line thickness**: Thick = one-to-one, Thin = one-to-many containment\n", + "5. **Note dashed lines**: These don't cascade identity\n", + "6. **Check definitions**: For `nullable`, `unique`, and other constraints not visible in diagrams\n", + "\n", + "## Designing with Diagrams\n", + "\n", + "1. **Choose solid lines when**:\n", + " - Building hierarchies (Study → Subject → Session)\n", + " - Creating workflow sequences (Order → Ship → Deliver)\n", + " - You want direct joins across levels\n", + "\n", + "2. **Choose dashed lines when**:\n", + " - Child has independent identity from parent\n", + " - Reference might change or is optional\n", + " - You don't need primary key cascade\n", + "\n", + "3. **Choose thick lines when**:\n", + " - Extending entities with optional information\n", + " - Modeling workflow steps (one output per input)\n", + " - Creating true one-to-one relationships\n", + "\n", + "## Interactive Tips\n", + "\n", + "- **Hover over tables** to see complete definitions (works in Jupyter and SVG exports)\n", + "- **Hover over orange dots** to see projection expressions\n", + "- **Use `+` and `-` operators** to focus on specific parts of large schemas" + ] + }, + { + "cell_type": "markdown", + "id": "cell-26", + "metadata": {}, + "source": [ + "# Summary\n", + "\n", + "| Concept | Key Points |\n", + "|---------|------------|\n", + "| **Line Styles** | Thick solid (extension), thin solid (containment), dashed (reference) |\n", + "| **Underlined Names** | Indicate tables introducing new schema dimensions |\n", + "| **Orange Dots** | Indicate renamed foreign keys via `.proj()` |\n", + "| **Diagram Shows** | Foreign key placement, relationship structure, table tiers |\n", + "| **Diagram Doesn't Show** | `nullable`, `unique` modifiers, secondary indexes |\n", + "\n", + "```{admonition} Key Principle\n", + ":class: tip\n", + "\n", + "DataJoint diagrams show 
**structural relationships** based on where foreign keys appear in table definitions.\n", + "They do **not** show constraint modifiers (`nullable`, `unique`) that alter the cardinality of secondary foreign keys.\n", + "\n", + "**Always check the table definition** for the complete picture.\n", + "```\n", + "\n", + "```{admonition} Remember\n", + ":class: note\n", + "\n", + "In DataJoint, diagrams and implementation are unified.\n", + "There's no separate design document that can drift out of sync—the diagram **is** generated from the actual schema.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "azttpu2mrwk", + "source": "```{admonition} Next Steps\n:class: note\n\nNow that you can read and interpret schema diagrams:\n- **[Relationships](050-relationships.ipynb)** — Explore relationship patterns: one-to-one, one-to-many, many-to-many, hierarchies, and more\n- **[Master-Part Tables](053-master-part.ipynb)** — Special pattern for composite entities\n```", + "metadata": {} + }, + { + "cell_type": "markdown", + "id": "cell-27", + "metadata": {}, + "source": [ + "# Cleanup\n", + "\n", + "Optionally drop the tutorial schema when done:" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "cell-28", + "metadata": {}, + "outputs": [], + "source": [ + "# drop the schema\n", + "schema.drop(force=True)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.2" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/book/30-design/050-relationships.ipynb b/book/30-design/050-relationships.ipynb index affb482..944d67c 100644 --- a/book/30-design/050-relationships.ipynb +++ 
b/book/30-design/050-relationships.ipynb @@ -2,2071 +2,2738 @@ "cells": [ { "cell_type": "markdown", + "id": "cell-0", "metadata": {}, - "source": "# Relationships\n\nIn this chapter, we'll explore how to build complex relationships between entities using a combination of uniqueness constraints and referential constraints. Understanding these patterns is essential for designing schemas that accurately represent business rules and data dependencies.\n\n## Uniqueness Constraints\n\nUniqueness constraints are typically set through primary keys, but tables can also have additional unique indexes beyond the primary key. These constraints ensure that specific combinations of attributes remain unique across all rows in a table.\n\n## Referential Constraints\n\nReferential constraints establish relationships between tables and are enforced by [foreign keys](030-foreign-keys.ipynb). They ensure that references between tables remain valid and prevent orphaned records.\n\nIn DataJoint, foreign keys also participate in the **relational workflow model** introduced earlier: each dependency not only enforces referential integrity but also prescribes the order of operations in a workflow. When table `B` references table `A`, `A` must be populated before `B`, and deleting from `A` cascades through all dependent workflow steps. The resulting schema is a directed acyclic graph (DAG) whose arrows describe both data relationships and workflow execution order (see [Relational Workflows](../20-concepts/05-workflows.md)).\n\n## Foreign Keys Establish 1:N or 1:1 Relationships\n\nWhen a child table defines a foreign key constraint to a parent table, it creates a relationship between the entities in the parent and child tables. 
The cardinality of this relationship is always **1 on the parent side**: each entry in the child table must refer to a single entity in the parent table.\n\nOn the child side, the relationship can have different cardinalities:\n\n- **0–1 (optional one-to-one)**: if the foreign key field in the child table has a unique constraint\n- **1 (mandatory one-to-one)**: if the foreign key is the entire primary key of the child table\n- **N (one-to-many)**: if no uniqueness constraint is applied to the foreign key field\n\n## What We'll Cover\n\nThis chapter explores these key relationship patterns:\n\n* **One-to-Many Relationships**: The most common pattern, using foreign keys in secondary attributes\n* **One-to-One Relationships**: Using primary key foreign keys or unique constraints\n* **Many-to-Many Relationships**: Using association tables with composite primary keys\n* **Sequences**: Cascading one-to-one relationships for workflows\n* **Hierarchies**: Cascading one-to-many relationships for nested data structures\n* **Parameterization**: Association tables where the association itself is the primary entity\n* **Directed Graphs**: Self-referencing relationships with renamed foreign keys\n* **Complex Constraints**: Using nullable enums with unique indexes for special requirements\n\nLet's begin by illustrating these possibilities with examples of bank customers and their accounts." 
+ "source": [ + "---\n", + "title: Relationships\n", + "---\n", + "\n", + "# Relationships: Modeling Entity Connections\n", + "\n", + "Relational databases model the real world by representing entities and the **relationships** between them.\n", + "The [Foreign Keys](030-foreign-keys.md) chapter covered referential integrity and foreign key modifiers, and the [Diagramming](040-diagrams.ipynb) chapter introduced the visual notation.\n", + "This chapter builds on both and focuses on the **patterns** that emerge from different combinations of uniqueness and referential constraints.\n", + "\n", + "Understanding these patterns is essential for designing schemas that accurately represent business rules, data dependencies, and workflow structures.\n", + "Throughout this chapter, we'll use diagrams to visualize each pattern, reinforcing the connection between table definitions and their visual representation.\n", + "\n", + "```{admonition} Learning Objectives\n", + ":class: note\n", + "\n", + "By the end of this chapter, you will be able to:\n", + "- Understand how foreign key placement determines relationship cardinality\n", + "- Design one-to-one, one-to-many, and many-to-many relationships\n", + "- Use association tables for complex relationships\n", + "- Build hierarchies and sequences using cascading foreign keys\n", + "- Apply the parameterization pattern for method/parameter combinations\n", + "- Model directed graphs using renamed foreign keys\n", + "```" + ] }, { - "cell_type": "code", - "execution_count": 1, + "cell_type": "markdown", + "id": "cell-1", "metadata": {}, - "outputs": [], "source": [ - "import datajoint as dj" + "# Relationship Fundamentals\n", + "\n", + "Relationships between tables are established through **referential constraints** (foreign keys) combined with **uniqueness constraints** (primary keys and unique indexes).\n", + "\n", + "```{card} Relationship Building Blocks\n", + "**Foreign keys** establish connections between tables:\n", + "- A foreign key in 
the child references the primary key of the parent\n", + "- The parent side always has cardinality of exactly **one**\n", + "\n", + "**Uniqueness constraints** determine cardinality on the child side:\n", + "- No constraint on foreign key: **many** children per parent (one-to-many)\n", + "- Unique constraint on foreign key: **at most one** child per parent (one-to-one reference)\n", + "- Foreign key is entire primary key: **at most one** child per parent (one-to-one extension)\n", + "```\n", + "\n", + "In DataJoint, foreign keys also participate in the **relational workflow model**: each dependency not only enforces referential integrity but also prescribes the order of operations.\n", + "When table B references table A, A must be populated before B.\n", + "The resulting schema is a directed acyclic graph (DAG) whose arrows describe both data relationships and workflow execution order.\n", + "\n", + "```{seealso}\n", + "See [Relational Workflows](../20-concepts/05-workflows.md) for the theoretical foundation of workflow dependencies.\n", + "```" + ] }, { "cell_type": "markdown", + "id": "cell-2", "metadata": {}, "source": [ - "# One-to-Many Relationships\n", + "# Setup\n", "\n", - "In the first example, let the rule be that customers are independent entities and accounts have exactly one owner but customers can have any number of accounts.\n", - "This is an example of an 1:N relationship between customers and their accounts.\n", - "\n", - "Then the foreign key is declared in the `Account` table." 
+ "First, we import DataJoint and create a schema for our examples:" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 1, + "id": "cell-3", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ - "[2025-10-08 14:27:33,185][INFO]: DataJoint 0.14.6 connected to dev@db:3306\n" + "[2025-12-18 17:14:49,928][INFO]: DataJoint 0.14.6 connected to dev@db:3306\n" ] - }, - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "Account1\n", - "\n", - "\n", - "Account1\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer1\n", - "\n", - "\n", - "Customer1\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer1->Account1\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" } ], "source": [ - "schema1 = dj.Schema('bank1')\n", - "@schema1\n", - "class Customer1(dj.Manual):\n", + "import datajoint as dj\n", + "\n", + "schema = dj.Schema('relationships_tutorial')" + ] + }, + { + "cell_type": "markdown", + "id": "cell-4", + "metadata": {}, + "source": [ + "# One-to-Many Relationships\n", + "\n", + "The **one-to-many** relationship is the most common pattern: one parent entity can have multiple associated child entities, but each child belongs to exactly one parent.\n", + "\n", + "## Pattern 1: Foreign Key as Secondary Attribute\n", + "\n", + "When the foreign key is a **secondary attribute** (below the `---` line), the child table has its own independent primary key.\n", + "\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class Customer(dj.Manual):\n", " definition = \"\"\"\n", " customer_id : int unsigned\n", " ---\n", - " full_name : varchar(30)\n", - " ssn = null : int unsigned\n", - " unique index(ssn)\n", + " full_name : varchar(60)\n", " \"\"\"\n", "\n", - "@schema1\n", - "class Account1(dj.Manual):\n", + "@schema\n", + "class 
Account(dj.Manual):\n", " definition = \"\"\"\n", - " account : int unsigned\n", + " account_id : int unsigned # account's own identity\n", " ---\n", - " -> Customer1\n", + " -> Customer # foreign key as secondary attribute\n", " open_date : date\n", + " balance : decimal(10,2)\n", " \"\"\"\n", - "\n", - "dj.Diagram(schema1)" + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE customer (\n", + " customer_id INT UNSIGNED NOT NULL,\n", + " full_name VARCHAR(60) NOT NULL,\n", + " PRIMARY KEY (customer_id)\n", + ");\n", + "\n", + "CREATE TABLE account (\n", + " account_id INT UNSIGNED NOT NULL COMMENT 'account own identity',\n", + " customer_id INT UNSIGNED NOT NULL,\n", + " open_date DATE NOT NULL,\n", + " balance DECIMAL(10,2) NOT NULL,\n", + " PRIMARY KEY (account_id),\n", + " FOREIGN KEY (customer_id) REFERENCES customer(customer_id)\n", + ");\n", + "```\n", + "````\n", + "`````\n", + "\n", + "**Characteristics:**\n", + "- Each account has its own independent identity (`account_id`)\n", + "- Each account belongs to exactly one customer\n", + "- Each customer can have multiple accounts (or none)\n", + "- In diagrams: shown as a **dashed line**" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 2, + "id": "ae45f4df", "metadata": {}, + "outputs": [], "source": [ - "In this design:\n", - "\n", - "* Each account is linked to a single customer through a foreign key referencing the primary key in Customer1, ensuring that each account has one, and only one, owner.\n", - "* This setup allows each customer to own multiple accounts, as there is no unique constraint on the foreign key in Account1.\n", - "* Customers may have zero or more accounts, as there’s no requirement for every customer to have an associated account.\n", - "* Every account must have an owner, since the foreign key reference to Customer1 is mandatory (non-nullable).\n", - "\n", - "This structure establishes a one-to-many relationship 
between customers and accounts: one customer can own multiple accounts, but each account belongs to only one customer.\n", - "\n", - "To allow some accounts without an assigned owner, we can modify the design to make the foreign key nullable:" + "# create a new tutorial schema\n", + "import datajoint as dj\n", + "schema = dj.Schema('relationships_tutorial')" ] }, { "cell_type": "code", "execution_count": 3, + "id": "cell-5", + "metadata": {}, + "outputs": [], + "source": [ + "# Pattern 1: Foreign key as secondary attribute (dashed line)\n", + "@schema\n", + "class Customer(dj.Manual):\n", + " definition = \"\"\"\n", + " customer_id : int unsigned\n", + " ---\n", + " full_name : varchar(60)\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class AccountIndependent(dj.Manual):\n", + " definition = \"\"\"\n", + " account_id : int unsigned # account's own identity\n", + " ---\n", + " -> Customer # foreign key as secondary attribute\n", + " open_date : date\n", + " balance : decimal(10,2)\n", + " \"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "cell-6", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ - "\n", + "\n", "\n", - "\n", - "\n", + "\n", + "\n", "\n", - "Customer2\n", - "\n", - "\n", - "Customer2\n", + "AccountIndependent\n", + "\n", + "\n", + "AccountIndependent\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Account2\n", - "\n", - "\n", - "Account2\n", + "Customer\n", + "\n", + "\n", + "Customer\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Customer2->Account2\n", - "\n", + "Customer->AccountIndependent\n", + "\n", "\n", "\n", "" ], "text/plain": [ - "" + "" ] }, - "execution_count": 3, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "schema2 = dj.Schema('bank2')\n", - "@schema2\n", - "class Customer2(dj.Manual):\n", + "# View the diagram - notice the dashed line (secondary FK)\n", + "dj.Diagram(schema)" + ] + }, + { + "cell_type": "markdown", + "id": "cell-7", + "metadata": {}, + 
"source": [ + "## Pattern 2: Foreign Key in Composite Primary Key\n", + "\n", + "When the foreign key is **part of the primary key** (above `---`), the child's identity includes the parent's identity.\n", + "\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class Customer(dj.Manual):\n", " definition = \"\"\"\n", " customer_id : int unsigned\n", " ---\n", - " full_name : varchar(30)\n", - " ssn = null : int unsigned\n", - " unique index(ssn)\n", + " full_name : varchar(60)\n", " \"\"\"\n", "\n", - "@schema2\n", - "class Account2(dj.Manual):\n", + "@schema\n", + "class Account(dj.Manual):\n", " definition = \"\"\"\n", - " account : int unsigned\n", + " -> Customer # foreign key in primary key\n", + " account_num : int unsigned # account number within customer\n", " ---\n", - " -> [nullable] Customer2\n", " open_date : date\n", + " balance : decimal(10,2)\n", " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE customer (\n", + " customer_id INT UNSIGNED NOT NULL,\n", + " full_name VARCHAR(60) NOT NULL,\n", + " PRIMARY KEY (customer_id)\n", + ");\n", + "\n", + "CREATE TABLE account (\n", + " customer_id INT UNSIGNED NOT NULL,\n", + " account_num INT UNSIGNED NOT NULL COMMENT 'account number within customer',\n", + " open_date DATE NOT NULL,\n", + " balance DECIMAL(10,2) NOT NULL,\n", + " PRIMARY KEY (customer_id, account_num),\n", + " FOREIGN KEY (customer_id) REFERENCES customer(customer_id)\n", + ");\n", + "```\n", + "````\n", + "`````\n", "\n", - "dj.Diagram(schema2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this modified design:\n", + "**Characteristics:**\n", + "- Accounts are identified **within the context of their customer**\n", + "- Account #3 for Customer A is different from Account #3 for Customer B\n", + "- The primary key `(customer_id, account_num)` cascades through foreign keys\n", + "- In 
diagrams: shown as a **thin solid line**\n", "\n", - "* Accounts without owners are allowed by setting the foreign key to Customer2 as nullable.\n", - "* The schema diagram does not visually distinguish between required and optional dependencies, so the nullable nature of the foreign key is not visible in the diagram.\n", - "* This configuration supports cases where accounts may or may not be assigned to a customer, adding flexibility to the data model." + "**Key difference**: With a dashed line, `account_id` must be globally unique.\n", + "With a thin solid line, `account_num` only needs to be unique within each customer." ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 5, + "id": "cell-8", "metadata": {}, + "outputs": [], "source": [ - "Consider a third design where the foreign key is part of a composite primary key:" + "# Pattern 2: Foreign key in composite primary key (thin solid line)\n", + "@schema\n", + "class AccountContained(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Customer # foreign key in primary key\n", + " account_num : int unsigned # account number within customer\n", + " ---\n", + " open_date : date\n", + " balance : decimal(10,2)\n", + " \"\"\"" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 6, + "id": "cell-9", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ - "\n", + "\n", "\n", - "\n", - "\n", + "\n", + "\n", "\n", - "Customer3\n", - "\n", - "\n", - "Customer3\n", + "AccountContained\n", + "\n", + "\n", + "AccountContained\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Account3\n", - "\n", - "\n", - "Account3\n", + "AccountIndependent\n", + "\n", + "\n", + "AccountIndependent\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Customer\n", + "\n", + "\n", + "Customer\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Customer3->Account3\n", - "\n", + "Customer->AccountContained\n", + "\n", + "\n", + "\n", + "\n", + "Customer->AccountIndependent\n", + "\n", "\n", "\n", "" ], 
"text/plain": [ - "" + "" ] }, - "execution_count": 4, + "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "schema3 = dj.Schema('bank3')\n", + "# View the diagram - notice the thin solid line (FK in primary key)\n", + "dj.Diagram(schema)" + ] + }, + { + "cell_type": "markdown", + "id": "cell-10", + "metadata": {}, + "source": [ + "## Optional One-to-Many with Nullable Foreign Key\n", "\n", - "@schema3\n", - "class Customer3(dj.Manual):\n", - " definition = \"\"\"\n", - " customer_id : int unsigned\n", - " ---\n", - " full_name : varchar(30)\n", - " ssn = null : int unsigned # Optional SSN with unique constraint\n", - " unique index(ssn)\n", - " \"\"\"\n", + "To allow children without a parent reference, use the `nullable` modifier:\n", "\n", - "@schema3\n", - "class Account3(dj.Manual):\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class Account(dj.Manual):\n", " definition = \"\"\"\n", - " -> Customer3\n", - " account : int unsigned\n", + " account_id : int unsigned\n", " ---\n", + " -> [nullable] Customer # optional owner\n", " open_date : date\n", + " balance : decimal(10,2)\n", " \"\"\"\n", "\n", - "dj.Diagram(schema3)" + "# Accounts can exist without an assigned customer\n", + "Account.insert1({\n", + " 'account_id': 9999,\n", + " 'customer_id': None, # no owner yet\n", + " 'open_date': '2024-01-01',\n", + " 'balance': 0.00\n", + "})\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE account (\n", + " account_id INT UNSIGNED NOT NULL,\n", + " customer_id INT UNSIGNED NULL, -- allows NULL\n", + " open_date DATE NOT NULL,\n", + " balance DECIMAL(10,2) NOT NULL,\n", + " PRIMARY KEY (account_id),\n", + " FOREIGN KEY (customer_id) REFERENCES customer(customer_id)\n", + ");\n", + "```\n", + "````\n", + "`````\n", + "\n", + "```{admonition} Nullable Foreign Keys in Primary Key\n", + ":class: warning\n", + 
"\n", + "Foreign keys that are part of the primary key **cannot be made nullable**.\n", + "Primary key attributes must always have values.\n", + "The `nullable` modifier only applies to secondary attributes (below `---`).\n", + "```" ] }, { "cell_type": "markdown", + "id": "cell-11", "metadata": {}, "source": [ - "In this design:\n", + "# One-to-One Relationships\n", + "\n", + "A **one-to-one** relationship ensures that each parent has **at most one** associated child.\n", + "There are several ways to achieve this.\n", "\n", - "* Composite Primary Key: The primary key for `Account3` is a combination of `customer_id` and `account`, meaning each account is uniquely identified by both the customer and account number together and neither of the two fields separately has to be unique across accounts.\n", - "* One-to-Many Relationship: Since `customer_id` is only part of the primary key (not the entire primary key), it doesn’t need to be unique within `Account3`. This allows each customer to have multiple accounts, preserving the one-to-many relationship between Customer3 and Account3.\n", - "* Foreign Key Reference: The foreign key to `Customer3` establishes the relationship, ensuring that each entry in Account3 references a valid customer in Customer3.\n", + "## Pattern 1: Foreign Key as Entire Primary Key (Extension)\n", "\n", - "This setup maintains the one-to-many relationship while allowing for each account to be uniquely identified by a combination of customer and account identifiers.\n", + "The strongest one-to-one relationship occurs when the foreign key **is** the entire primary key of the child table.\n", + "\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class Customer(dj.Manual):\n", + " definition = \"\"\"\n", + " customer_id : int unsigned\n", + " ---\n", + " full_name : varchar(60)\n", + " \"\"\"\n", "\n", - "In the diagram, solid lines indicate a dependency where the foreign key is 
part of the primary key, signifying a stronger relationship than a secondary reference. This stronger relationship ensures that any foreign keys pointing to `Account3` will also transitively reference `Customer3`." + "@schema\n", + "class CustomerPreferences(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Customer # foreign key IS the primary key\n", + " ---\n", + " theme : varchar(20)\n", + " notification_email : varchar(100)\n", + " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE customer (\n", + " customer_id INT UNSIGNED NOT NULL,\n", + " full_name VARCHAR(60) NOT NULL,\n", + " PRIMARY KEY (customer_id)\n", + ");\n", + "\n", + "CREATE TABLE customer_preferences (\n", + " customer_id INT UNSIGNED NOT NULL,\n", + " theme VARCHAR(20) NOT NULL,\n", + " notification_email VARCHAR(100) NOT NULL,\n", + " PRIMARY KEY (customer_id),\n", + " FOREIGN KEY (customer_id) REFERENCES customer(customer_id)\n", + ");\n", + "```\n", + "````\n", + "`````\n", + "\n", + "**Characteristics:**\n", + "- `CustomerPreferences` shares the same identity as `Customer`\n", + "- At most one preferences record per customer\n", + "- In diagrams: shown as a **thick solid line**\n", + "- Table name is **not underlined** (no new dimension introduced)\n", + "\n", + "**Why use separate tables for one-to-one?**\n", + "- **Modularity**: Separate optional data from required data\n", + "- **Access control**: Different permissions for different data\n", + "- **Avoiding NULL columns**: Instead of nullable columns, use a separate table\n", + "- **Workflow stages**: Each table represents a processing step" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 7, + "id": "cell-12", "metadata": {}, + "outputs": [], "source": [ - "### Exercise: Analyzing Bank Design\n", - "\n", - "Review the database design below and consider how this structure might reflect the bank’s operations." 
+ "# Pattern 1: Foreign key as entire primary key (thick solid line)\n", + "@schema\n", + "class CustomerPreferences(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Customer # foreign key IS the primary key\n", + " ---\n", + " theme : varchar(20)\n", + " notification_email : varchar(100)\n", + " \"\"\"" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 8, + "id": "cell-13", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ - "\n", + "\n", "\n", - "\n", - "\n", + "\n", + "\n", "\n", - "Account4\n", - "\n", - "\n", - "Account4\n", + "AccountContained\n", + "\n", + "\n", + "AccountContained\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Customer4\n", - "\n", - "\n", - "Customer4\n", + "AccountIndependent\n", + "\n", + "\n", + "AccountIndependent\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Customer\n", + "\n", + "\n", + "Customer\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Account4->Customer4\n", - "\n", + "Customer->AccountContained\n", + "\n", + "\n", + "\n", + "\n", + "Customer->AccountIndependent\n", + "\n", + "\n", + "\n", + "\n", + "CustomerPreferences\n", + "\n", + "\n", + "CustomerPreferences\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Customer->CustomerPreferences\n", + "\n", "\n", "\n", "" ], "text/plain": [ - "" + "" ] }, - "execution_count": 5, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "schema4 = dj.Schema('bank4')\n", + "# View the diagram - notice the thick solid line (extension/one-to-one)\n", + "dj.Diagram(schema)" + ] + }, + { + "cell_type": "markdown", + "id": "cell-14", + "metadata": {}, + "source": [ + "## Pattern 2: Unique Foreign Key (Reference)\n", + "\n", + "Adding the `unique` modifier to a secondary foreign key creates a one-to-one relationship while maintaining independent identity:\n", "\n", - "@schema4\n", - "class Account4(dj.Manual):\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", 
+ "class Employee(dj.Manual):\n", " definition = \"\"\"\n", - " account : int unsigned\n", + " employee_id : int unsigned\n", " ---\n", - " open_date : date\n", + " full_name : varchar(60)\n", " \"\"\"\n", "\n", - "\n", - "@schema4\n", - "class Customer4(dj.Manual):\n", + "@schema\n", + "class ParkingSpot(dj.Manual):\n", " definition = \"\"\"\n", - " customer_id : int unsigned\n", + " spot_id : int unsigned # spot has its own identity\n", " ---\n", - " full_name : varchar(30)\n", - " ssn = null : int unsigned # Optional SSN with unique constraint\n", - " unique index(ssn)\n", - " -> Account4\n", + " -> [unique] Employee # at most one spot per employee\n", + " location : varchar(30)\n", " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE employee (\n", + " employee_id INT UNSIGNED NOT NULL,\n", + " full_name VARCHAR(60) NOT NULL,\n", + " PRIMARY KEY (employee_id)\n", + ");\n", + "\n", + "CREATE TABLE parking_spot (\n", + " spot_id INT UNSIGNED NOT NULL,\n", + " employee_id INT UNSIGNED NOT NULL,\n", + " location VARCHAR(30) NOT NULL,\n", + " PRIMARY KEY (spot_id),\n", + " UNIQUE KEY (employee_id),\n", + " FOREIGN KEY (employee_id) REFERENCES employee(employee_id)\n", + ");\n", + "```\n", + "````\n", + "`````\n", "\n", - "\n", - "dj.Diagram(schema4)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Interpretation\n", - "Consider the implications of this setup:\n", - "\n", - "* Each customer entry contains a reference to a single account in `Account4`, suggesting that each customer is linked to one specific account.\n", - "* Since there is no constraint on the number of customers who can point to the same account, this design may allow multiple customers to be associated with a single account, indicating the possibility of shared accounts.\n", - "* However, the structure does not allow an customer to exist without being associated with an account, as each customer record must 
reference an existing account.\n", - "\n", - "These choices might reflect the bank’s operations and policies, such as whether joint accounts are supported, and how account ownership is managed." + "**Characteristics:**\n", + "- Each parking spot has its own identity (`spot_id`)\n", + "- The `unique` constraint ensures at most one spot per employee\n", + "- In diagrams: shown as a **dashed line** (the uniqueness constraint is not visible)" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 9, + "id": "cell-15", "metadata": {}, + "outputs": [], "source": [ - "# One-to-One Relationships\n", - "\n", - "A one-to-one relationship is created when a foreign key in the child table is also designated as either the primary key or a unique index. This ensures that each entry in the child table corresponds to a single, unique entry in the parent table, and no parent entry is linked to more than one child entry.\n", - "\n", - "In a one-to-one relationship, the connection is always optional on the child side: a child entry is not required for every parent entry. 
Therefore, the cardinality on the child side is 0..1—each parent may have either zero or one associated child entry.\n", + "# Pattern 2: Unique foreign key (dashed line, uniqueness not visible)\n", + "@schema\n", + "class Employee(dj.Manual):\n", + " definition = \"\"\"\n", + " employee_id : int unsigned\n", + " ---\n", + " full_name : varchar(60)\n", + " \"\"\"\n", "\n", - "In the following example, the foreign key in `Account` is also its primary key, resulting in a one-to-one relationship:" + "@schema\n", + "class ParkingSpot(dj.Manual):\n", + " definition = \"\"\"\n", + " spot_id : int unsigned # spot has its own identity\n", + " ---\n", + " -> [unique] Employee # at most one spot per employee\n", + " location : varchar(30)\n", + " \"\"\"" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 10, + "id": "cell-16", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ - "\n", + "\n", "\n", - "\n", - "\n", + "\n", + "\n", "\n", - "Customer5\n", - "\n", - "\n", - "Customer5\n", + "Employee\n", + "\n", + "\n", + "Employee\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Account5\n", - "\n", - "\n", - "Account5\n", + "ParkingSpot\n", + "\n", + "\n", + "ParkingSpot\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Customer5->Account5\n", - "\n", + "Employee->ParkingSpot\n", + "\n", "\n", "\n", "" ], "text/plain": [ - "" + "" ] }, - "execution_count": 6, + "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "schema5 = dj.Schema('bank5')\n", + "# View the diagram - notice the dashed line (unique constraint NOT visible!)\n", + "dj.Diagram(Employee) + 1" + ] + }, + { + "cell_type": "markdown", + "id": "cell-17", + "metadata": {}, + "source": [ + "## Pattern 3: Optional One-to-One\n", "\n", - "@schema5\n", - "class Customer5(dj.Manual):\n", - " definition = \"\"\"\n", - " customer_id : int unsigned\n", - " ---\n", - " full_name : varchar(30)\n", - " ssn = null : int unsigned # Optional SSN with unique 
constraint\n", - " unique index(ssn)\n", - " \"\"\"\n", + "Combine `nullable` and `unique` for an optional one-to-one relationship:\n", "\n", - "@schema5\n", - "class Account5(dj.Manual):\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class ParkingSpot(dj.Manual):\n", " definition = \"\"\"\n", - " -> Customer5\n", + " spot_id : int unsigned\n", " ---\n", - " open_date : date\n", + " -> [nullable, unique] Employee # optional, but exclusive\n", + " location : varchar(30)\n", " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE parking_spot (\n", + " spot_id INT UNSIGNED NOT NULL,\n", + " employee_id INT UNSIGNED NULL, -- can be unassigned\n", + " location VARCHAR(30) NOT NULL,\n", + " PRIMARY KEY (spot_id),\n", + " UNIQUE KEY (employee_id), -- but exclusive when assigned\n", + " FOREIGN KEY (employee_id) REFERENCES employee(employee_id)\n", + ");\n", + "```\n", + "````\n", + "`````\n", "\n", - "dj.Diagram(schema5)" + "Multiple spots can be unassigned (NULL), but each employee can be assigned at most one spot." 
] }, { "cell_type": "markdown", + "id": "cell-18", "metadata": {}, "source": [ - "The diagramming notation represents this relationship with a thick solid line, which indicates the strongest type of dependency between two entities.\n", - "In this setup, `Customer5` and `Account5` share the same identity because `Account5` inherits its primary key from `Customer5`.\n", - "This setup creates a strict one-to-one relationship between Customer5 and Account5, where each account is uniquely and exclusively linked to a single customer.\n", - "\n", - "### Characteristics of This Structure\n", - "* **Unified Identity:** Since `Account5` shares the primary key with `Customer5`, each `Account5` record is uniquely identified by the same key as `Customer5`.\n", - "This enforces the rule that each account cannot exist without an associated customer.\n", + "# Many-to-Many Relationships\n", "\n", - "* **Conflated Entities:** In the diagram, the name `Account5` is no longer underscored, indicating it has ceased to function as a separate “dimension” or independent entity. `Account5` is now fully conflated with `Customer5`, meaning it effectively serves as an extension of the `Customer5` entity, rather than an independent table with its own identity.\n", + "A **many-to-many** relationship allows entities from both sides to be connected to multiple entities on the other side.\n", + "This requires an **association table** (also called junction table or bridge table).\n", "\n", - "### Why Keep Separate Tables?\n", - "Although this design could allow us to simply merge all account-related data into the `Customer5` table, there are reasons we may choose to keep `Account5` as a separate table:\n", + "## Basic Association Table\n", "\n", - "1. 
**Modularity and Clarity**: Separating `Account5` from `Customer5` keeps the structure modular, which can clarify different aspects of customer and account data in queries and during development.\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class Student(dj.Manual):\n", + " definition = \"\"\"\n", + " student_id : int unsigned\n", + " ---\n", + " student_name : varchar(60)\n", + " \"\"\"\n", "\n", - "2. **Data Management**: By keeping account information in a separate table, we can manage and update account-related data independently from customer information. This separation can be beneficial for tasks such as auditing, logging, or updating fields associated with only account data.\n", + "@schema\n", + "class Course(dj.Manual):\n", + " definition = \"\"\"\n", + " course_code : char(8)\n", + " ---\n", + " course_title : varchar(100)\n", + " \"\"\"\n", "\n", - "3. **Avoiding Optional Fields**: In cases where certain fields are only relevant to accounts (e.g., open_date, account-specific details), keeping them in a separate table prevents having unused or irrelevant fields in the main `Customer5` table.\n", - "\n", - "4. **Access Control**: When account information is sensitive or needs restricted access, placing it in a separate table can simplify access control, allowing finer-grained security policies around account data.\n", - "\n", - "5. **Scalability and Maintenance**: Over time, this separation can support scalability as customer and account data expand. If we anticipate adding extensive account-specific data or if account records will be managed differently from customer records, the separate tables facilitate maintenance and future-proof the structure.\n", - "\n", - "6. **Schema Evolution**: Separate tables provide flexibility to adapt or expand either the Customer5 or Account5 table independently, without altering the other table. 
This flexibility is especially useful if the schema is expected to evolve over time." + "@schema\n", + "class Enrollment(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Student # part of composite primary key\n", + " -> Course # part of composite primary key\n", + " ---\n", + " enrollment_date : date\n", + " grade : enum('A', 'B', 'C', 'D', 'F', 'IP')\n", + " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE student (\n", + " student_id INT UNSIGNED NOT NULL,\n", + " student_name VARCHAR(60) NOT NULL,\n", + " PRIMARY KEY (student_id)\n", + ");\n", + "\n", + "CREATE TABLE course (\n", + " course_code CHAR(8) NOT NULL,\n", + " course_title VARCHAR(100) NOT NULL,\n", + " PRIMARY KEY (course_code)\n", + ");\n", + "\n", + "CREATE TABLE enrollment (\n", + " student_id INT UNSIGNED NOT NULL,\n", + " course_code CHAR(8) NOT NULL,\n", + " enrollment_date DATE NOT NULL,\n", + " grade ENUM('A', 'B', 'C', 'D', 'F', 'IP') NOT NULL,\n", + " PRIMARY KEY (student_id, course_code),\n", + " FOREIGN KEY (student_id) REFERENCES student(student_id),\n", + " FOREIGN KEY (course_code) REFERENCES course(course_code)\n", + ");\n", + "```\n", + "````\n", + "`````\n", + "\n", + "**Characteristics:**\n", + "- `Enrollment` has a composite primary key from both parents\n", + "- Each student can enroll in many courses\n", + "- Each course can have many students\n", + "- Each student-course combination appears at most once\n", + "- Association table can have its own attributes (grade, enrollment_date)" ] }, { - "cell_type": "markdown", - "metadata": {}, - "source": [] - }, - { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 11, + "id": "cell-19", "metadata": {}, + "outputs": [], "source": [ - "Another approach to establishing a one-to-one relationship is to use a secondary foreign key with an additional uniqueness constraint. 
DataJoint’s foreign key syntax supports both `unique` and `nullable` modifiers on foreign keys, providing flexibility in how relationships are structured.\n", - "\n", - "* **`unique` Modifier**: Adding the unique modifier to a foreign key converts a one-to-many relationship into a one-to-one relationship. This ensures that each entry in the child table corresponds to only one entry in the parent table and vice versa, enforcing a strict one-to-one link.\n", + "# Many-to-many with association table\n", + "@schema\n", + "class Student(dj.Manual):\n", + " definition = \"\"\"\n", + " student_id : int unsigned\n", + " ---\n", + " student_name : varchar(60)\n", + " \"\"\"\n", "\n", - "* **`nullable` Modifier**: The `nullable` modifier allows the relationship to be optional on the child side, meaning that not every child entry must reference a parent entry. (Relationships are already optional on the parent side, as parent entries don’t depend on children.)\n", + "@schema\n", + "class Course(dj.Manual):\n", + " definition = \"\"\"\n", + " course_code : char(8)\n", + " ---\n", + " course_title : varchar(100)\n", + " \"\"\"\n", "\n", - "## One-to-One Relationship with Unique and Nullable Modifiers\n", - "The following example demonstrates how to model a one-to-one relationship using a secondary unique constraint:" + "@schema\n", + "class Enrollment(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Student # part of composite primary key\n", + " -> Course # part of composite primary key\n", + " ---\n", + " enrollment_date : date\n", + " grade : enum('A', 'B', 'C', 'D', 'F', 'IP')\n", + " \"\"\"" ] }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 12, + "id": "cell-20", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ - "\n", + "\n", "\n", - "\n", - "\n", + "\n", + "\n", "\n", - "Account6\n", - "\n", - "\n", - "Account6\n", + "Enrollment\n", + "\n", + "\n", + "Enrollment\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Customer6\n", - "\n", - 
"\n", - "Customer6\n", + "Course\n", + "\n", + "\n", + "Course\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Customer6->Account6\n", - "\n", + "Course->Enrollment\n", + "\n", + "\n", + "\n", + "\n", + "Student\n", + "\n", + "\n", + "Student\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Student->Enrollment\n", + "\n", "\n", "\n", "" ], "text/plain": [ - "" + "" ] }, - "execution_count": 7, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ + "# View the many-to-many diagram\n", + "dj.Diagram(Student) + dj.Diagram(Course) + dj.Diagram(Enrollment)" + ] + }, + { + "cell_type": "markdown", + "id": "xa3lj79iple", + "metadata": {}, + "source": [ + "## Enums vs. Lookup Tables for Valid Values\n", "\n", - "schema6 = dj.Schema('bank6')\n", + "In the `Enrollment` example above, we used an **enum** to restrict `grade` to valid values: `enum('A', 'B', 'C', 'D', 'F', 'IP')`.\n", + "An alternative approach uses a **lookup table** to define valid grades.\n", "\n", - "@schema6\n", - "class Customer6(dj.Manual):\n", + "Both approaches enforce data integrity, but they have different trade-offs:\n", + "\n", + "`````{tab-set}\n", + "````{tab-item} Using Enum\n", + ":sync: enum\n", + "```python\n", + "@schema\n", + "class Enrollment(dj.Manual):\n", " definition = \"\"\"\n", - " customer_id : int unsigned\n", + " -> Student\n", + " -> Course\n", + " ---\n", + " enrollment_date : date\n", + " grade : enum('A', 'B', 'C', 'D', 'F', 'IP')\n", + " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} Using Lookup Table\n", + ":sync: lookup\n", + "```python\n", + "@schema\n", + "class LetterGrade(dj.Lookup):\n", + " definition = \"\"\"\n", + " grade : char(2)\n", " ---\n", - " full_name : varchar(30)\n", - " ssn = null : int unsigned # Optional SSN with unique constraint\n", - " unique index(ssn)\n", + " grade_point = null : decimal(3,2)\n", + " description : varchar(30)\n", " \"\"\"\n", + " contents = [\n", + " ('A', 4.00, 'Excellent'),\n", + " ('B', 
3.00, 'Good'),\n", + " ('C', 2.00, 'Satisfactory'),\n", + " ('D', 1.00, 'Passing'),\n", + " ('F', 0.00, 'Failing'),\n", + " ('IP', None, 'In Progress'),\n", + " ]\n", "\n", - "@schema6\n", - "class Account6(dj.Manual):\n", + "@schema\n", + "class EnrollmentWithLookup(dj.Manual):\n", " definition = \"\"\"\n", - " account : int unsigned\n", + " -> Student\n", + " -> Course\n", " ---\n", - " -> [unique, nullable] Customer6\n", - " open_date : date\n", + " enrollment_date : date\n", + " -> LetterGrade\n", " \"\"\"\n", + "```\n", + "````\n", + "`````\n", + "\n", + "```{list-table} Enum vs. Lookup Table Comparison\n", + ":header-rows: 1\n", + ":widths: 15 42 43\n", + "\n", + "* - Aspect\n", + " - Enum\n", + " - Lookup Table\n", + "* - **Associated data**\n", + " - Cannot store additional attributes (e.g., grade points)\n", + " - Can include related data like grade points, descriptions\n", + "* - **Modifications**\n", + " - Requires `ALTER TABLE` to add/remove values\n", + " - Add or remove rows without schema changes\n", + "* - **Querying**\n", + " - Values are inline; no join needed\n", + " - Key value is stored inline as the foreign key; a join is needed only for related data\n", + "* - **Integrity enforcement**\n", + " - Enforced by column type\n", + " - Enforced by foreign key constraint\n", + "* - **Complexity**\n", + " - Simple, self-contained\n", + " - Additional table to manage\n", + "* - **Reuse**\n", + " - Must repeat enum definition in each table\n", + " - Single source of truth; multiple tables can reference it\n", + "* - **UI integration**\n", + " - Values must be hardcoded in application\n", + " - Query the table to populate dropdown menus dynamically\n", + "```\n", "\n", - "dj.Diagram(schema6)" + "```{admonition} When to Use Each Approach\n", + ":class: tip\n", + "\n", + "**Use enums when:**\n", + "- The set of values is small and unlikely to change\n", + "- No additional attributes are associated with each value\n", + "- You want to minimize schema complexity\n", + "\n", + "**Use lookup 
tables when:**\n", + "- Values need associated attributes (e.g., grade points, descriptions)\n", + "- The set of values may change without requiring schema migration\n", + "- Multiple tables reference the same set of values\n", + "- You need to query or report on the valid values themselves\n", + "- Graphical interfaces or dashboards need to populate dropdown menus—querying a lookup table provides the options dynamically without hardcoding values in the application\n", + "```\n", + "\n", + "```{seealso}\n", + "See the [Lookup Tables](020-lookup-tables.ipynb) chapter for more details on creating and using lookup tables.\n", + "```" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 13, + "id": "bmty821kwm5", "metadata": {}, + "outputs": [], "source": [ - "In this design:\n", - "\n", - "* The `Account6` table includes a foreign key reference to `Customer6`, modified with both `unique` and `nullable` modifications.\n", - "* The `unique` constraint ensures that each `Account6` entry is exclusively linked to a single `Customer6` entry, establishing a one-to-one relationship.\n", - "* The `nullable` constraint allows accounts to exist without necessarily being tied to a customer, making the relationship optional from the child’s perspective.\n", - "\n", - "\n", - "### When to Use this Approach\n", - "Using a secondary unique constraint on a foreign key is helpful when:\n", - "\n", - "* **Optional Relationships**: You want flexibility to create child entries without always requiring a parent reference.\n", - "* **Separate, Modular Tables**: Keeping entities modular and maintaining a strict one-to-one relationship without merging the tables or merging the entity identities in the child table with those in the parent.\n", + "# Lookup table approach for grades\n", + "@schema\n", + "class LetterGrade(dj.Lookup):\n", + " definition = \"\"\"\n", + " grade : char(2)\n", + " ---\n", + " grade_point = null : decimal(3,2)\n", + " description : 
varchar(30)\n", + " \"\"\"\n", + " contents = [\n", + " ('A', 4.00, 'Excellent'),\n", + " ('B', 3.00, 'Good'),\n", + " ('C', 2.00, 'Satisfactory'),\n", + " ('D', 1.00, 'Passing'),\n", + " ('F', 0.00, 'Failing'),\n", + " ('IP', None, 'In Progress'),\n", + " ]\n", "\n", - "This method provides flexibility and maintains clear separation between entities while enforcing a one-to-one association, even if the relationship isn’t visually highlighted in the diagram." + "@schema\n", + "class EnrollmentWithLookup(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Student\n", + " -> Course\n", + " ---\n", + " enrollment_date : date\n", + " -> LetterGrade\n", + " \"\"\"" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 14, + "id": "c4ecykge118", "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

grade | grade_point | description\n", + "A | 4.00 | Excellent\n", + "B | 3.00 | Good\n", + "C | 2.00 | Satisfactory\n", + "D | 1.00 | Passing\n", + "F | 0.00 | Failing\n", + "IP | None | In Progress\n", + "Total: 6

\n", + " " + ], + "text/plain": [ + "*grade grade_point description \n", + "+-------+ +------------+ +------------+\n", + "A 4.00 Excellent \n", + "B 3.00 Good \n", + "C 2.00 Satisfactory \n", + "D 1.00 Passing \n", + "F 0.00 Failing \n", + "IP None In Progress \n", + " (Total: 6)" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "## Diagram Representation Limitations\n", - "\n", - "This *dependency-and-constraint* pattern doesn’t visually convey the close, exclusive association created by the unique and nullable modifiers. The diagram will show a basic line for the foreign key, lacking any specific notation to indicate that the relationship is both unique and optional.\n", - "\n", - "The diagram only reflect the relationships formed through the the structure of primary keys and foreign keys, without taking into account the additional constraints imposed by secondary unique indexes. While solid think lines indicate a one-to-one relationship, additional uniqueness constraints may be in force that are not evident from the diagram alone.\n", - "\n", - "Consider all the diagrams side-by-side and recall which ones are one-to-one and which are one-to-many:" + "# View the lookup table contents - notice the associated grade points\n", + "LetterGrade()" ] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 15, + "id": "zwvkqwauqip", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ - "\n", + "\n", "\n", - "\n", - "\n", + "\n", + "\n", "\n", - "Customer1\n", - "\n", - "\n", - "Customer1\n", + "EnrollmentWithLookup\n", + "\n", + "\n", + "EnrollmentWithLookup\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Account1\n", - "\n", - "\n", - "Account1\n", + "Course\n", + "\n", + "\n", + "Course\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Customer1->Account1\n", - "\n", + "Course->EnrollmentWithLookup\n", + "\n", "\n", - "\n", + "\n", "\n", - "Customer2\n", - "\n", - "\n", - "Customer2\n", - 
"\n", - "\n", - "\n", - "\n", - "\n", - "Account2\n", - "\n", - "\n", - "Account2\n", + "LetterGrade\n", + "\n", + "\n", + "LetterGrade\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Customer2->Account2\n", - "\n", - "\n", - "\n", - "\n", - "Customer3\n", - "\n", - "\n", - "Customer3\n", - "\n", + "LetterGrade->EnrollmentWithLookup\n", + "\n", "\n", - "\n", - "\n", - "\n", - "Account3\n", - "\n", - "\n", - "Account3\n", + "\n", + "\n", + "Student\n", + "\n", + "\n", + "Student\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Customer3->Account3\n", - "\n", - "\n", - "\n", - "\n", - "Account4\n", - "\n", - "\n", - "Account4\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer4\n", - "\n", - "\n", - "Customer4\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Account4->Customer4\n", - "\n", - "\n", - "\n", - "\n", - "Customer5\n", - "\n", - "\n", - "Customer5\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Account5\n", - "\n", - "\n", - "Account5\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer5->Account5\n", - "\n", - "\n", - "\n", - "\n", - "Customer6\n", - "\n", - "\n", - "Customer6\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Account6\n", - "\n", - "\n", - "Account6\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer6->Account6\n", - "\n", + "Student->EnrollmentWithLookup\n", + "\n", "\n", "\n", "" ], "text/plain": [ - "" + "" ] }, - "execution_count": 8, + "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "dj.Diagram(schema1) + dj.Diagram(schema2) + dj.Diagram(schema3) + dj.Diagram(schema4) + dj.Diagram(schema5) + dj.Diagram(schema6)" + "# Compare the diagrams: EnrollmentWithLookup has an additional dependency on LetterGrade\n", + "dj.Diagram(EnrollmentWithLookup) - 1" ] }, { "cell_type": "markdown", + "id": "cell-21", "metadata": {}, "source": [ - "### How to Read Relationships from Diagrams\n", - "\n", - "By examining the diagrams above, only one (`schema5`) clearly shows a one-to-one relationship through its 
**thick solid line**. Here's how to interpret the diagramming notation:\n", + "## Constrained Many-to-Many\n", "\n", - "**Line Styles:**\n", - "* **Thick solid line**: One-to-one relationship where the foreign key is the entire primary key of the child table\n", - "* **Thin solid line**: One-to-many relationship where the foreign key is part of (but not all of) the primary key \n", - "* **Dashed line**: One-to-many relationship where the foreign key is a secondary attribute (not part of primary key)\n", + "Moving one foreign key of an association table below `---` restricts each entity on the primary-key side to a single partner; adding the `unique` modifier to that secondary foreign key restricts the other side as well:\n", "\n", - "**What the Diagram Cannot Show:**\n", - "* Whether a foreign key is nullable (allows zero or one instead of exactly one)\n", - "* Secondary unique constraints that convert one-to-many into one-to-one\n", - "* The diagram only reflects the structure of primary keys and foreign keys\n", - "\n", - "**Pro Tip**: In Jupyter notebooks, you can hover over a diagram element to view its full table definition, including any secondary uniqueness constraints and nullable modifiers.\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "# Each customer has at most one account, but accounts can be shared\n", + "@schema\n", + "class CustomerAccount(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Customer # primary key\n", + " ---\n", + " -> Account # each customer links to one account\n", + " \"\"\"\n", "\n", - "**Best Practice**: DataJoint users generally avoid secondary unique constraints when the primary key structure can enforce uniqueness. Making the foreign key part of the primary key (creating solid lines in diagrams) provides two benefits:\n", - "1. **Visual clarity**: The relationship type is immediately obvious from the diagram\n", - "2. 
**Query simplicity**: Primary keys cascade through foreign keys, enabling direct joins between distant tables in the hierarchy" + "# With unique constraint: each account belongs to at most one customer\n", + "@schema\n", + "class AccountOwnership(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Account # primary key\n", + " ---\n", + " -> [unique] Customer # each customer owns at most one account\n", + " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "-- Each customer has at most one account\n", + "CREATE TABLE customer_account (\n", + " customer_id INT UNSIGNED NOT NULL,\n", + " account_id INT UNSIGNED NOT NULL,\n", + " PRIMARY KEY (customer_id),\n", + " FOREIGN KEY (customer_id) REFERENCES customer(customer_id),\n", + " FOREIGN KEY (account_id) REFERENCES account(account_id)\n", + ");\n", + "\n", + "-- Each customer owns at most one account (bidirectional constraint)\n", + "CREATE TABLE account_ownership (\n", + " account_id INT UNSIGNED NOT NULL,\n", + " customer_id INT UNSIGNED NOT NULL,\n", + " PRIMARY KEY (account_id),\n", + " UNIQUE KEY (customer_id),\n", + " FOREIGN KEY (account_id) REFERENCES account(account_id),\n", + " FOREIGN KEY (customer_id) REFERENCES customer(customer_id)\n", + ");\n", + "```\n", + "````\n", + "`````" ] }, { "cell_type": "markdown", + "id": "cell-22", "metadata": {}, "source": [ - "# Many-to-Many Relationships\n", + "# Hierarchies\n", + "\n", + "**Hierarchies** are cascading one-to-many relationships that create tree structures.\n", + "Each level adds a new dimension to the composite primary key.\n", + "\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class Study(dj.Manual):\n", + " definition = \"\"\"\n", + " study : varchar(8) # study code\n", + " ---\n", + " investigator : varchar(60)\n", + " study_description : varchar(255)\n", + " \"\"\"\n", "\n", - "In relational databases, a single foreign key between two 
tables can only establish one-to-many or one-to-one relationships.\n", - "To create a many-to-many (M:N) relationship between two entities, a third table is required, with each entry in this table linking one instance from each of the two related tables.\n", - "This third table is commonly referred to as an association table or join table.\n", + "@schema\n", + "class Subject(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Study\n", + " subject_id : varchar(12) # subject within study\n", + " ---\n", + " species : enum('human', 'primate', 'rodent')\n", + " date_of_birth = null : date\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class Session(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Subject\n", + " session : smallint unsigned # session within subject\n", + " ---\n", + " session_date : date\n", + " operator : varchar(60)\n", + " \"\"\"\n", "\n", - "## Structure of Many-to-Many Relationships\n", - "An M:N relationship can be visualized as two one-to-many (1:N and 1:M) relationships with the association table.\n", + "@schema\n", + "class Scan(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Session\n", + " scan : smallint unsigned # scan within session\n", + " ---\n", + " scan_time : time\n", + " scan_type : varchar(30)\n", + " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE study (\n", + " study VARCHAR(8) NOT NULL,\n", + " investigator VARCHAR(60) NOT NULL,\n", + " study_description VARCHAR(255) NOT NULL,\n", + " PRIMARY KEY (study)\n", + ");\n", + "\n", + "CREATE TABLE subject (\n", + " study VARCHAR(8) NOT NULL,\n", + " subject_id VARCHAR(12) NOT NULL,\n", + " species ENUM('human', 'primate', 'rodent') NOT NULL,\n", + " date_of_birth DATE NULL,\n", + " PRIMARY KEY (study, subject_id),\n", + " FOREIGN KEY (study) REFERENCES study(study)\n", + ");\n", + "\n", + "CREATE TABLE session (\n", + " study VARCHAR(8) NOT NULL,\n", + " subject_id VARCHAR(12) NOT NULL,\n", + " session SMALLINT UNSIGNED 
NOT NULL,\n", + " session_date DATE NOT NULL,\n", + " operator VARCHAR(60) NOT NULL,\n", + " PRIMARY KEY (study, subject_id, session),\n", + " FOREIGN KEY (study, subject_id) REFERENCES subject(study, subject_id)\n", + ");\n", + "\n", + "CREATE TABLE scan (\n", + " study VARCHAR(8) NOT NULL,\n", + " subject_id VARCHAR(12) NOT NULL,\n", + " session SMALLINT UNSIGNED NOT NULL,\n", + " scan SMALLINT UNSIGNED NOT NULL,\n", + " scan_time TIME NOT NULL,\n", + " scan_type VARCHAR(30) NOT NULL,\n", + " PRIMARY KEY (study, subject_id, session, scan),\n", + " FOREIGN KEY (study, subject_id, session)\n", + " REFERENCES session(study, subject_id, session)\n", + ");\n", + "```\n", + "````\n", + "`````\n", "\n", - "The association table contains:\n", - "* **A foreign key** referencing each of the two related entities, establishing connections to instances of both tables.\n", - "* **Composite primary** key or a secondary unique constraint on the two foreign keys to ensure each combination of entities is unique.\n", + "**Key features of hierarchies:**\n", + "- Primary keys **cascade** through the hierarchy\n", + "- `Scan`'s primary key is `(study, subject_id, session, scan)`\n", + "- Direct joins work across any levels: `Study * Scan` is valid\n", + "- In diagrams: chain of **thin solid lines**\n", "\n", - "This structure allows each entity to link to multiple instances of the other entity through the association table." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Example: Bank Customers and Bank Accounts\n", - "Consider a bank system where customers can have multiple accounts, and accounts can be jointly owned by multiple customers. To represent this many-to-many relationship, an association table is used to link `Customer` and `Account`:" + "This pattern is common in scientific data organization (BIDS, NWB) where data is structured as Study → Subject → Session → Data." 
] }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 16, + "id": "cell-23", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer7\n", - "\n", - "\n", - "Customer7\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Subject\n", + "\n", + "\n", + "Subject\n", "\n", "\n", "\n", - "\n", - "\n", - "CustomerAccount7\n", - "\n", - "\n", - "CustomerAccount7\n", + "\n", + "\n", + "Session\n", + "\n", + "\n", + "Session\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Customer7->CustomerAccount7\n", - "\n", + "Subject->Session\n", + "\n", "\n", - "\n", - "\n", - "Account7\n", - "\n", - "\n", - "Account7\n", + "\n", + "\n", + "Scan\n", + "\n", + "\n", + "Scan\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Account7->CustomerAccount7\n", - "\n", + "Session->Scan\n", + "\n", + "\n", + "\n", + "\n", + "Study\n", + "\n", + "\n", + "Study\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Study->Subject\n", + "\n", "\n", "\n", "" ], "text/plain": [ - "" + "" ] }, - "execution_count": 9, + "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "schema7 = dj.Schema('bank7')\n", + "# Hierarchy pattern: cascading one-to-many relationships\n", + "@schema\n", + "class Study(dj.Manual):\n", + " definition = \"\"\"\n", + " study : varchar(8) # study code\n", + " ---\n", + " investigator : varchar(60)\n", + " study_description : varchar(255)\n", + " \"\"\"\n", "\n", - "@schema7\n", - "class Customer7(dj.Manual):\n", + "@schema\n", + "class Subject(dj.Manual):\n", " definition = \"\"\"\n", - " customer_id : int unsigned\n", + " -> Study\n", + " subject_id : varchar(12) # subject within study\n", " ---\n", - " full_name : varchar(30)\n", + " species : enum('human', 'primate', 'rodent')\n", + " date_of_birth = null : date\n", " \"\"\"\n", "\n", - "@schema7\n", - "class Account7(dj.Manual):\n", + "@schema\n", + "class Session(dj.Manual):\n", " definition = \"\"\"\n", - " account_id : int 
unsigned\n", + " -> Subject\n", + " session : smallint unsigned # session within subject\n", " ---\n", - " open_date : date\n", + " session_date : date\n", + " operator : varchar(60)\n", " \"\"\"\n", "\n", - "@schema7\n", - "class CustomerAccount7(dj.Manual):\n", + "@schema\n", + "class Scan(dj.Manual):\n", " definition = \"\"\"\n", - " -> Customer7\n", - " -> Account7\n", + " -> Session\n", + " scan : smallint unsigned # scan within session\n", + " ---\n", + " scan_time : time\n", + " scan_type : varchar(30)\n", " \"\"\"\n", "\n", - "dj.Diagram(schema7)" + "# View the hierarchy diagram - chain of thin solid lines\n", + "dj.Diagram(Study) + 10" ] }, { "cell_type": "markdown", + "id": "cell-25", "metadata": {}, "source": [ - "DataJoint’s diagramming language does not use special notation for association tables; they appear identical to other tables.\n", - "By contrast, other diagramming styles, such as **Chen’s Entity-Relationship (ER) notation**, represent associations—often called \"relationship sets\"—with diamond shapes to distinguish them from entity sets.\n", + "# Sequences\n", + "\n", + "**Sequences** are cascading one-to-one relationships representing workflow steps.\n", + "Each step extends the identity of the previous step.\n", + "\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class Order(dj.Manual):\n", + " definition = \"\"\"\n", + " order_id : int unsigned\n", + " ---\n", + " order_date : date\n", + " customer : varchar(60)\n", + " \"\"\"\n", "\n", - "DataJoint purposefully avoids this strict conceptual distinction between entities and relationships, as the boundary between them is often blurred.\n", - "For instance, a synapse between two neurons could be considered an entity, storing specific data about the synapse itself, or it might be viewed as an association linking two neurons.\n", - "Additionally, some relationships can even link other relationships, a complexity not easily 
captured in Chen’s notation.\n", + "@schema\n", + "class Shipment(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Order # same identity as Order\n", + " ---\n", + " ship_date : date\n", + " carrier : varchar(30)\n", + " \"\"\"\n", "\n", - "In DataJoint, you can often recognize an association table by its converging pattern of foreign keys, which reference multiple tables to form a many-to-many relationship. This flexible approach supports various interpretations of relationships, making DataJoint schemas particularly adaptable for complex scientific data, where associations may themselves hold meaningful attributes." + "@schema\n", + "class Delivery(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Shipment # same identity as Shipment (and Order)\n", + " ---\n", + " delivery_date : date\n", + " signature : varchar(60)\n", + " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE order_ (\n", + " order_id INT UNSIGNED NOT NULL,\n", + " order_date DATE NOT NULL,\n", + " customer VARCHAR(60) NOT NULL,\n", + " PRIMARY KEY (order_id)\n", + ");\n", + "\n", + "CREATE TABLE shipment (\n", + " order_id INT UNSIGNED NOT NULL,\n", + " ship_date DATE NOT NULL,\n", + " carrier VARCHAR(30) NOT NULL,\n", + " PRIMARY KEY (order_id),\n", + " FOREIGN KEY (order_id) REFERENCES order_(order_id)\n", + ");\n", + "\n", + "CREATE TABLE delivery (\n", + " order_id INT UNSIGNED NOT NULL,\n", + " delivery_date DATE NOT NULL,\n", + " signature VARCHAR(60) NOT NULL,\n", + " PRIMARY KEY (order_id),\n", + " FOREIGN KEY (order_id) REFERENCES shipment(order_id)\n", + ");\n", + "```\n", + "````\n", + "`````\n", + "\n", + "**Key features of sequences:**\n", + "- All tables share the same primary key (`order_id`)\n", + "- Each step is optional—not every order is shipped, not every shipment is delivered\n", + "- Direct queries across steps: `Order * Delivery` works without including `Shipment`\n", + "- In diagrams: chain of **thick solid 
lines**" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 17, + "id": "cell-26", "metadata": {}, + "outputs": [], "source": [ - "Association tables are primarily used to establish many-to-many relationships, but they also offer the flexibility to model one-to-many and even one-to-one relationships by applying additional uniqueness constraints. By controlling the uniqueness on the foreign keys within the association table, you can fine-tune the type of relationship between entities.\n", + "# Sequence pattern: cascading one-to-one relationships\n", + "@schema\n", + "class Order(dj.Manual):\n", + " definition = \"\"\"\n", + " order_id : int unsigned\n", + " ---\n", + " order_date : date\n", + " customer : varchar(60)\n", + " \"\"\"\n", "\n", - "### Example, enforcing One-to-Many with Shared Accounts \n", + "@schema\n", + "class Shipment(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Order # same identity as Order\n", + " ---\n", + " ship_date : date\n", + " carrier : varchar(30)\n", + " \"\"\"\n", "\n", - "In the following example, we model a scenario where each customer can have only one account, but each account may be shared among multiple customers.\n", - "This structure enforces a one-to-many relationship between `Customer8` and `Account8` via the `CustomerAccount8` association table." 
+ "@schema\n", + "class Delivery(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Shipment # same identity as Shipment (and Order)\n", + " ---\n", + " delivery_date : date\n", + " signature : varchar(60)\n", + " \"\"\"" ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 18, + "id": "cell-27", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", + "\n", + "\n", + "\n", + "\n", "\n", - "Customer8\n", - "\n", - "\n", - "Customer8\n", + "Order\n", + "\n", + "\n", + "Order\n", "\n", "\n", "\n", - "\n", - "\n", - "CustomerAccount8\n", - "\n", - "\n", - "CustomerAccount8\n", + "\n", + "\n", + "Shipment\n", + "\n", + "\n", + "Shipment\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Customer8->CustomerAccount8\n", - "\n", + "Order->Shipment\n", + "\n", "\n", - "\n", - "\n", - "Account8\n", - "\n", - "\n", - "Account8\n", + "\n", + "\n", + "Delivery\n", + "\n", + "\n", + "Delivery\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Account8->CustomerAccount8\n", - "\n", + "Shipment->Delivery\n", + "\n", "\n", "\n", "" ], "text/plain": [ - "" + "" ] }, - "execution_count": 10, + "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "schema8 = dj.Schema('bank8')\n", + "# View the sequence diagram - chain of thick solid lines\n", + "dj.Diagram(Order) + 10" + ] + }, + { + "cell_type": "markdown", + "id": "cell-28", + "metadata": {}, + "source": [ + "# Parameterization\n", "\n", - "@schema8\n", - "class Customer8(dj.Manual):\n", + "The **parameterization pattern** applies different methods, algorithms, or parameters to the same entities.\n", + "The association table itself becomes the entity of interest.\n", + "\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class Image(dj.Manual):\n", " definition = \"\"\"\n", - " customer_id : int unsigned\n", + " image_id : int unsigned\n", " ---\n", - " full_name : varchar(30)\n", 
+ " raw_image : longblob\n", " \"\"\"\n", "\n", - "@schema8\n", - "class Account8(dj.Manual):\n", + "@schema\n", + "class EnhanceMethod(dj.Lookup):\n", " definition = \"\"\"\n", - " account_id : int unsigned\n", + " method_id : int unsigned\n", " ---\n", - " open_date : date\n", + " method_name : varchar(30)\n", + " method_description : varchar(255)\n", " \"\"\"\n", + " contents = [\n", + " (1, 'sharpen', 'Sharpen edges using unsharp mask'),\n", + " (2, 'denoise', 'Remove noise using median filter'),\n", + " (3, 'contrast', 'Enhance contrast using histogram equalization'),\n", + " ]\n", "\n", - "@schema8\n", - "class CustomerAccount8(dj.Manual):\n", + "@schema\n", + "class EnhancedImage(dj.Computed):\n", " definition = \"\"\"\n", - " -> Customer8\n", + " -> Image\n", + " -> EnhanceMethod\n", " ---\n", - " -> Account8\n", + " enhanced_image : longblob\n", + " processing_time : float\n", " \"\"\"\n", - "\n", - "dj.Diagram(schema8)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Explanation of the Design\n", - "* **Association Table (`CustomerAccount8`)**: The `CustomerAccount8` association table links `Customer8` and `Account8` through foreign keys. Although it resembles a many-to-many structure, by making the foreign key `-> Customer8` unique (it's the primary key), we ensure that each customer is associated with only one account. However, we leave the `-> Account8` foreign key unconstrained, allowing multiple customers to link to the same account, which enables account sharing.\n", - "\n", - "## Versatility of Association Tables\n", - "While association tables are necessary for modeling many-to-many relationships, they can also model one-to-many and even one-to-one relationships.\n", - "\n", - "This is accomplished by by altering their primary key or adding additional uniqueness constraints. 
If the association table links tables `A` and `B`, then:\n", - "\n", - "* One-to-Many: \n", - "\n", - "```\n", - "-> B\n", - "--- \n", - "-> A\n", - "```\n", - "\n", - "Any number of `B`s are each matched to at most one `A`.\n", - "\n", - "\n", - "* One-to-One: \n", - "```\n", - "-> A\n", - "---\n", - "-> [unique] B\n", - "```\n", - "\n", - "With uniqueness constraints on both `A` and `B`, each entry in `A` is matched to at most one entry in `B` and vice versa.\n", - "\n", - "* Many-to-Many\n", - "\n", "```\n", - "-> A\n", - "-> B\n", - "---\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE image (\n", + " image_id INT UNSIGNED NOT NULL,\n", + " raw_image LONGBLOB NOT NULL,\n", + " PRIMARY KEY (image_id)\n", + ");\n", + "\n", + "CREATE TABLE enhance_method (\n", + " method_id INT UNSIGNED NOT NULL,\n", + " method_name VARCHAR(30) NOT NULL,\n", + " method_description VARCHAR(255) NOT NULL,\n", + " PRIMARY KEY (method_id)\n", + ");\n", + "\n", + "CREATE TABLE enhanced_image (\n", + " image_id INT UNSIGNED NOT NULL,\n", + " method_id INT UNSIGNED NOT NULL,\n", + " enhanced_image LONGBLOB NOT NULL,\n", + " processing_time FLOAT NOT NULL,\n", + " PRIMARY KEY (image_id, method_id),\n", + " FOREIGN KEY (image_id) REFERENCES image(image_id),\n", + " FOREIGN KEY (method_id) REFERENCES enhance_method(method_id)\n", + ");\n", "```\n", - " Leave both foreign keys in the primary key, allowing each entity to associate freely with multiple instances of the other.\n", - "\n", - "This approach makes association tables a powerful tool for defining relationships of varying cardinality, adding flexibility and adaptability to DataJoint schemas. By managing uniqueness constraints directly in the association table, you can model complex relationships while keeping the primary entities’ structures simple and intuitive." 
+ "````\n", + "`````\n", + "\n", + "**Characteristics:**\n", + "- Same image processed with multiple methods\n", + "- Same method applied to multiple images\n", + "- Results stored with composite key `(image_id, method_id)`\n", + "- Typical in computational workflows with parameter sweeps" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 19, + "id": "cell-29", "metadata": {}, + "outputs": [], "source": [ - "The schema diagram indicates the cardinality of these associations with thick lines corresponding to one-to-one relationship and thin lines indicating one-to-many:" + "# Parameterization pattern\n", + "@schema\n", + "class Image(dj.Manual):\n", + " definition = \"\"\"\n", + " image_id : int unsigned\n", + " ---\n", + " raw_image : longblob\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class EnhanceMethod(dj.Lookup):\n", + " definition = \"\"\"\n", + " method_id : int unsigned\n", + " ---\n", + " method_name : varchar(30)\n", + " method_description : varchar(255)\n", + " \"\"\"\n", + " contents = [\n", + " (1, 'sharpen', 'Sharpen edges using unsharp mask'),\n", + " (2, 'denoise', 'Remove noise using median filter'),\n", + " (3, 'contrast', 'Enhance contrast using histogram equalization'),\n", + " ]\n", + "\n", + "@schema\n", + "class EnhancedImage(dj.Computed):\n", + " definition = \"\"\"\n", + " -> Image\n", + " -> EnhanceMethod\n", + " ---\n", + " enhanced_image : longblob\n", + " processing_time : float\n", + " \"\"\"" ] }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 20, + "id": "cell-30", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", + "\n", + "\n", + "\n", + "\n", "\n", - "Customer8\n", - "\n", - "\n", - "Customer8\n", + "Image\n", + "\n", + "\n", + "Image\n", "\n", "\n", "\n", - "\n", - "\n", - "CustomerAccount8\n", - "\n", - "\n", - "CustomerAccount8\n", + "\n", + "\n", + "EnhancedImage\n", + "\n", + "\n", + "EnhancedImage\n", "\n", "\n", "\n", - "\n", + 
"\n", "\n", - "Customer8->CustomerAccount8\n", - "\n", - "\n", - "\n", - "\n", - "Customer7\n", - "\n", - "\n", - "Customer7\n", - "\n", - "\n", + "Image->EnhancedImage\n", + "\n", "\n", - "\n", - "\n", - "CustomerAccount7\n", - "\n", - "\n", - "CustomerAccount7\n", + "\n", + "\n", + "EnhanceMethod\n", + "\n", + "\n", + "EnhanceMethod\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Customer7->CustomerAccount7\n", - "\n", - "\n", - "\n", - "\n", - "Account8\n", - "\n", - "\n", - "Account8\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Account8->CustomerAccount8\n", - "\n", - "\n", - "\n", - "\n", - "Account7\n", - "\n", - "\n", - "Account7\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Account7->CustomerAccount7\n", - "\n", + "EnhanceMethod->EnhancedImage\n", + "\n", "\n", "\n", "" ], "text/plain": [ - "" + "" ] }, - "execution_count": 11, + "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "dj.Diagram(schema7) + dj.Diagram(schema8)" + "# View the parameterization diagram\n", + "dj.Diagram(Image) + dj.Diagram(EnhanceMethod) + dj.Diagram(EnhancedImage)" ] }, { "cell_type": "markdown", + "id": "cell-31", "metadata": {}, "source": [ - "# More Design Patterns\n", + "# Directed Graphs\n", + "\n", + "**Directed graphs** model relationships where entities of the same type connect to each other.\n", + "Use **renamed foreign keys** to reference the same parent table multiple times.\n", + "\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", + "@schema\n", + "class Neuron(dj.Manual):\n", + " definition = \"\"\"\n", + " neuron_id : int unsigned\n", + " ---\n", + " neuron_type : enum('excitatory', 'inhibitory')\n", + " layer : tinyint unsigned\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class Synapse(dj.Manual):\n", + " definition = \"\"\"\n", + " synapse_id : int unsigned\n", + " ---\n", + " -> Neuron.proj(presynaptic='neuron_id')\n", + " -> Neuron.proj(postsynaptic='neuron_id')\n", + " 
strength : float\n", + " synapse_type : varchar(30)\n", + " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE neuron (\n", + " neuron_id INT UNSIGNED NOT NULL,\n", + " neuron_type ENUM('excitatory', 'inhibitory') NOT NULL,\n", + " layer TINYINT UNSIGNED NOT NULL,\n", + " PRIMARY KEY (neuron_id)\n", + ");\n", + "\n", + "CREATE TABLE synapse (\n", + " synapse_id INT UNSIGNED NOT NULL,\n", + " presynaptic INT UNSIGNED NOT NULL,\n", + " postsynaptic INT UNSIGNED NOT NULL,\n", + " strength FLOAT NOT NULL,\n", + " synapse_type VARCHAR(30) NOT NULL,\n", + " PRIMARY KEY (synapse_id),\n", + " FOREIGN KEY (presynaptic) REFERENCES neuron(neuron_id),\n", + " FOREIGN KEY (postsynaptic) REFERENCES neuron(neuron_id)\n", + ");\n", + "```\n", + "````\n", + "`````\n", + "\n", + "The `.proj()` operator renames the foreign key attribute:\n", + "- `presynaptic` references `Neuron.neuron_id`\n", + "- `postsynaptic` references `Neuron.neuron_id`\n", "\n", - "Here we will consider several other common patterns that make use of uniqueness constraints (primary keys and unique indexes) and referential constraints (foreign keys) to design more complex relationships." 
+ "In diagrams, **orange dots** indicate renamed foreign keys.\n", + "\n", + "**Other examples:**\n", + "- Employees and managers (both are employees)\n", + "- Cities connected by flights\n", + "- Users following other users" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "## Sequences\n\nAs discussed in the [Relational Workflows](../20-concepts/05-workflows.md) chapter, DataJoint schemas are directional: dependencies form a *directed-acyclic graph* (DAG) representing sequences of steps or operations.\nThe diagrams are plotted with all the dependencies pointing in the same direction (top-to-bottom or left-to-right), so that a schema diagram can be understood as an operational workflow.\n\nLet's model a simple sequence of operations such as placing an order, shipping, and delivery.\nThe three entities: `Order`, `Shipment`, and `Delivery` form a sequence of one-to-one relationships:" - }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 21, + "id": "cell-32", "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "Shipment\n", - "\n", - "\n", - "Shipment\n", - "\n", - "\n", + "outputs": [], + "source": [ + "# Directed graph pattern with renamed foreign keys\n", + "@schema\n", + "class Neuron(dj.Manual):\n", + " definition = \"\"\"\n", + " neuron_id : int unsigned\n", + " ---\n", + " neuron_type : enum('excitatory', 'inhibitory')\n", + " layer : tinyint unsigned\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class Synapse(dj.Manual):\n", + " definition = \"\"\"\n", + " synapse_id : int unsigned\n", + " ---\n", + " -> Neuron.proj(presynaptic='neuron_id')\n", + " -> Neuron.proj(postsynaptic='neuron_id')\n", + " strength : float\n", + " synapse_type : varchar(30)\n", + " \"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "cell-33", + "metadata": {}, + "outputs": [ + { + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "0\n", + 
"\n", + "0\n", "\n", - "\n", - "\n", - "Delivery\n", - "\n", - "\n", - "Delivery\n", + "\n", + "\n", + "Synapse\n", + "\n", + "\n", + "Synapse\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Shipment->Delivery\n", - "\n", - "\n", - "\n", - "\n", - "Confirmation\n", - "\n", - "\n", - "Confirmation\n", - "\n", + "0->Synapse\n", + "\n", "\n", + "\n", + "\n", + "1\n", + "\n", + "1\n", "\n", - "\n", + "\n", "\n", - "Delivery->Confirmation\n", - "\n", + "1->Synapse\n", + "\n", "\n", - "\n", + "\n", "\n", - "Order\n", - "\n", - "\n", - "Order\n", + "Neuron\n", + "\n", + "\n", + "Neuron\n", "\n", "\n", "\n", - "\n", + "\n", "\n", - "Order->Shipment\n", - "\n", + "Neuron->0\n", + "\n", + "\n", + "\n", + "\n", + "Neuron->1\n", + "\n", "\n", "\n", "" ], "text/plain": [ - "" + "" ] }, - "execution_count": 13, + "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "schema = dj.Schema('orders')\n", + "# View the diagram - notice the orange dots indicating renamed foreign keys\n", + "dj.Diagram(Neuron) + dj.Diagram(Synapse)" + ] + }, + { + "cell_type": "markdown", + "id": "cell-34", + "metadata": {}, + "source": [ + "# Design Puzzle: Unique Designation\n", + "\n", + "A common challenge: how to designate exactly one special item among many?\n", + "For example: each state has many cities, but exactly one capital.\n", + "\n", + "**Requirements:**\n", + "1. Each city belongs to exactly one state\n", + "2. Each state has exactly one capital\n", + "3. 
A capital must be a city in that state\n", "\n", + "`````{tab-set}\n", + "````{tab-item} DataJoint\n", + ":sync: datajoint\n", + "```python\n", "@schema\n", - "class Order(dj.Manual):\n", + "class State(dj.Manual):\n", " definition = \"\"\"\n", - " order_number : int\n", + " state : char(2) # two-letter state code\n", " ---\n", - " order_date : date\n", + " state_name : varchar(30)\n", " \"\"\"\n", "\n", "@schema\n", - "class Shipment(dj.Manual):\n", + "class City(dj.Manual):\n", " definition = \"\"\"\n", - " -> Order\n", + " -> State\n", + " city_name : varchar(60)\n", " ---\n", - " ship_date : date\n", + " population : int unsigned\n", + " capital = null : enum('YES') # nullable enum for designation\n", + " unique index(state, capital) # only one 'YES' per state\n", " \"\"\"\n", + "```\n", + "````\n", + "````{tab-item} SQL\n", + ":sync: sql\n", + "```sql\n", + "CREATE TABLE state (\n", + " state CHAR(2) NOT NULL,\n", + " state_name VARCHAR(30) NOT NULL,\n", + " PRIMARY KEY (state)\n", + ");\n", + "\n", + "CREATE TABLE city (\n", + " state CHAR(2) NOT NULL,\n", + " city_name VARCHAR(60) NOT NULL,\n", + " population INT UNSIGNED NOT NULL,\n", + " capital ENUM('YES') NULL,\n", + " PRIMARY KEY (state, city_name),\n", + " UNIQUE KEY (state, capital),\n", + " FOREIGN KEY (state) REFERENCES state(state)\n", + ");\n", + "```\n", + "````\n", + "`````\n", "\n", + "**How it works:**\n", + "- `capital` is NULL for non-capitals (most cities)\n", + "- `capital = 'YES'` for the capital city\n", + "- `unique index(state, capital)` ensures only one 'YES' per state\n", + "- NULL values don't violate uniqueness (multiple NULLs allowed)\n", "\n", + "This pattern works for team captains, default addresses, primary contacts, etc." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "cell-35", + "metadata": {}, + "outputs": [], + "source": [ + "# Unique designation pattern\n", "@schema\n", - "class Delivery(dj.Manual):\n", + "class State(dj.Manual):\n", " definition = \"\"\"\n", - " -> Shipment\n", + " state : char(2) # two-letter state code\n", " ---\n", - " delivery_date : date\n", + " state_name : varchar(30)\n", " \"\"\"\n", "\n", "@schema\n", - "class Confirmation(dj.Manual):\n", + "class City(dj.Manual):\n", " definition = \"\"\"\n", - " -> Delivery\n", + " -> State\n", + " city_name : varchar(60)\n", " ---\n", - " confirmation_date : date\n", - " \"\"\"\n", - "\n", - "\n", - "dj.Diagram(schema)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this design, `Order`, `Shipment`, `Delivery`, and `Confirmation` use the same primary key that cascades through the sequence. This makes it straightforward to perform relationonal operations that skip steps. For example, joining information from `Order` and `Confirmation` does not require the inclusion of `Shipment` and `Deliverty` in the query: `Order * Confirmation` is a well-formed query:" + " population : int unsigned\n", + " capital = null : enum('YES') # nullable enum for designation\n", + " unique index(state, capital) # only one 'YES' per state\n", + " \"\"\"" ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 24, + "id": "cell-36", "metadata": {}, "outputs": [ { "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
[HTML table output; markup lost in extraction. Columns: order_number, order_date, confirmation_date. Total: 0]
\n", - " " + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "City\n", + "\n", + "\n", + "City\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "State\n", + "\n", + "\n", + "State\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "State->City\n", + "\n", + "\n", + "\n", + "" ], "text/plain": [ - "*order_number order_date confirmation_d\n", - "+------------+ +------------+ +------------+\n", - "\n", - " (Total: 0)" + "" ] }, - "execution_count": 14, + "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "Order * Confirmation" + "# View the diagram\n", + "dj.Diagram(State) + dj.Diagram(City)" ] }, { "cell_type": "markdown", + "id": "cell-37", "metadata": {}, "source": [ - "## Hierarchies\n", - "\n", - "Several 1:N relationships in sequence form a *hierarchy*. Many data standards are defined as hierarchies.\n", - "\n", - "Filesystems of files and folders with standardized naming conventions are examples of hierarchical structures. Many file formats such as HDF5 are also hierarchical.\n", - "\n", - "### Example: Brain Imaging Database \n", - "\n", - "Consider the hierarchical structure of the *[Brain Imaging Data Standard](https://bids.neuroimaging.io/)* — [@10.1038/sdata.2016.44], which is used for brain imaging data.\n", - "\n", - "In BIDS, a neuroimaging study is organized around experiment subjects, imaging sessions for each subject, and then specific types of brain scans within each session: anatomical scans, diffusion-weighted imaging (DWI) scans, and functional imaging.\n", + "# Relationship Summary\n", + "\n", + "| Pattern | Foreign Key Position | Constraint | Cardinality | Diagram Line |\n", + "|---------|---------------------|------------|-------------|--------------|\n", + "| One-to-many (reference) | Secondary | None | 1:N | Dashed |\n", + "| One-to-many (containment) | Part of PK | None | 1:N | Thin solid |\n", + "| One-to-one (extension) | Entire PK | Inherent | 1:1 | Thick solid |\n", + "| One-to-one (reference) 
| Secondary | `unique` | 1:1 | Dashed |\n", + "| Optional relationship | Secondary | `nullable` | 1:0..N | Dashed |\n", + "| Optional one-to-one | Secondary | `nullable, unique` | 1:0..1 | Dashed |\n", + "| Many-to-many | Both in PK | None | M:N | Two thin solids |\n", + "\n", + "```{admonition} Design Guidelines\n", + ":class: tip\n", + "\n", + "1. **Prefer solid lines** when appropriate—they enable direct joins across levels\n", + "2. **Use composite primary keys** for hierarchies to cascade identity\n", + "3. **Use association tables** for many-to-many relationships\n", + "4. **Use thick solid lines** (extension) when child has no independent meaning\n", + "5. **Check table definitions** for `nullable` and `unique`—diagrams don't show them\n", + "```\n", "\n", - "The hierarchy `Study` → `Subject` → `Session` → `Func` represents this organization, where:\n", - "- Each study has multiple subjects\n", - "- Each subject has multiple imaging sessions \n", - "- Each session contains multiple functional imaging scans\n", + "```{admonition} Next Steps\n", + ":class: note\n", "\n", - "This is a classic hierarchical design where the primary key cascades through the foreign keys. Notice that the primary key of `Func` contains four attributes: `study`, `subject_id`, `session`, and `func`. 
This allows direct joins between any tables in the hierarchy without requiring intermediate tables.\n", + "Now that you understand relationship patterns:\n", + "- **[Master-Part Tables](053-master-part.ipynb)** — Special pattern for composite entities\n", + "- **[Normalization](055-normalization.ipynb)** — Principles for organizing attributes\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "cell-38", + "metadata": {}, + "source": [ + "# Complete Schema Diagram\n", "\n", - "Let's design a relational schema for the BIDS hierarchical format:" + "View all the tables we've created in this tutorial:" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 25, + "id": "cell-39", "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ - "\n", + "\n", "\n", - "\n", - "\n", + "\n", + "\n", "\n", - "Func\n", - "\n", - "\n", - "Func\n", + "8\n", + "\n", + "8\n", + "\n", + "\n", + "\n", + "Synapse\n", + "\n", + "\n", + "Synapse\n", "\n", "\n", "\n", - "\n", + "\n", + "\n", + "8->Synapse\n", + "\n", + "\n", + "\n", "\n", + "9\n", + "\n", + "9\n", + "\n", + "\n", + "\n", + "9->Synapse\n", + "\n", + "\n", + "\n", + "\n", "Subject\n", - "\n", - "\n", - "Subject\n", + "\n", + "\n", + "Subject\n", "\n", "\n", "\n", "\n", - "\n", + "\n", "Session\n", - "\n", - "\n", - "Session\n", + "\n", + "\n", + "Session\n", "\n", "\n", "\n", "\n", - "\n", + "\n", "Subject->Session\n", - "\n", + "\n", "\n", "\n", - "\n", + "\n", "Study\n", - "\n", - "\n", - "Study\n", + "\n", + "\n", + "Study\n", "\n", "\n", "\n", "\n", - "\n", + "\n", "Study->Subject\n", - "\n", + "\n", "\n", - "\n", - "\n", - "Session->Func\n", - "\n", + "\n", + "\n", + "Student\n", + "\n", + "\n", + "Student\n", + "\n", "\n", "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "schema = dj.Schema('bids')\n", - "\n", - "@schema\n", - "class Study(dj.Manual):\n", - " definition = \"\"\"\n", - " 
study : varchar(6) # study unique code\n", - " ---\n", - " investigator : varchar(60) # primary investigator\n", - " study_description : varchar(255)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class Subject(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Study\n", - " subject_id : varchar(8)\n", - " --- \n", - " subject_species : enum('human', 'primate', 'rodent')\n", - " date_of_birth = null : date\n", - " subject_notes : varchar(2000)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class Session(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Subject\n", - " session : smallint unsigned\n", - " --- \n", - " session_date : date\n", - " operator : varchar(60)\n", - " aim : varchar(255)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class Func(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Session\n", - " func : smallint unsigned\n", - " --- \n", - " func_filepath : varchar(255)\n", - " scan_params : varchar(500)\n", - " \"\"\"\n", - "\n", - "dj.Diagram(schema)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Key Features of Hierarchical Design\n", - "\n", - "In this hierarchical design:\n", - "\n", - "* **Cascading Primary Keys**: The primary key from `Study` propagates all the way down to `Func`, creating solid lines in the diagram. This is indicated by the solid (not dashed) connections.\n", - "\n", - "* **Direct Joins**: Because the primary keys cascade, you can join any two tables in the hierarchy directly. For example, `Study * Func` is a valid join that doesn't require including `Subject` or `Session` in the query.\n", - "\n", - "* **One-to-Many at Each Level**: Each foreign key creates a one-to-many relationship:\n", - " - One study has many subjects\n", - " - One subject has many sessions\n", - " - One session has many functional scans\n", - "\n", - "* **Composite Primary Keys**: Lower tables have composite primary keys that include all ancestor identifiers. 
For example, a functional scan is uniquely identified by `(study, subject_id, session, func)`.\n", - "\n", - "This design pattern is extremely common in scientific workflows and hierarchical data organization systems.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Parameterization\n", - "\n", - "The parameterization pattern is used when you want to apply different methods, algorithms, or parameters to the same set of entities. This creates a many-to-many relationship, but the association table itself becomes the entity of interest.\n", - "\n", - "## Example: Image Enhancement\n", - "\n", - "Consider a system where you have images and want to apply various enhancement methods to them:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "EnhancedImage\n", - "\n", - "\n", - "EnhancedImage\n", + "\n", + "\n", + "EnrollmentWithLookup\n", + "\n", + "\n", + "EnrollmentWithLookup\n", "\n", "\n", "\n", - "\n", - "\n", - "Image\n", - "\n", - "\n", - "Image\n", + "\n", + "\n", + "Student->EnrollmentWithLookup\n", + "\n", + "\n", + "\n", + "\n", + "Enrollment\n", + "\n", + "\n", + "Enrollment\n", "\n", "\n", "\n", - "\n", - "\n", - "Image->EnhancedImage\n", - "\n", + "\n", + "\n", + "Student->Enrollment\n", + "\n", "\n", - "\n", - "\n", - "EnhanceMethod\n", - "\n", - "\n", - "EnhanceMethod\n", + "\n", + "\n", + "State\n", + "\n", + "\n", + "State\n", "\n", "\n", "\n", - "\n", - "\n", - "EnhanceMethod->EnhancedImage\n", - "\n", + "\n", + "\n", + "City\n", + "\n", + "\n", + "City\n", + "\n", "\n", "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "schema9 = dj.Schema('params')\n", - "\n", - "@schema9\n", - "class Image(dj.Manual):\n", - " definition = \"\"\"\n", - " image_id : int\n", - " ---\n", - " image : longblob # image 
data\n", - " \"\"\"\n", - "\n", - "@schema9\n", - "class EnhanceMethod(dj.Lookup):\n", - " definition = \"\"\"\n", - " enhance_method : int\n", - " ---\n", - " method_name : varchar(16)\n", - " method_description : varchar(255)\n", - " \"\"\"\n", - " contents = [\n", - " (1, 'sharpen', 'Sharpen edges in the image'),\n", - " (2, 'contrast', 'Increase contrast'),\n", - " (3, 'denoise', 'Remove noise from image')\n", - " ]\n", - "\n", - "@schema9\n", - "class EnhancedImage(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Image\n", - " -> EnhanceMethod\n", - " ---\n", - " enhanced_image : longblob\n", - " processing_timestamp : timestamp\n", - " \"\"\"\n", - "\n", - "dj.Diagram(schema9)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Understanding the Parameterization Pattern\n", - "\n", - "In this design:\n", - "\n", - "* **`Image`**: Stores the original images, each with a unique ID\n", - "* **`EnhanceMethod`**: A lookup table defining available enhancement methods\n", - "* **`EnhancedImage`**: The association table that stores the results of applying each method to each image\n", - "\n", - "The key feature is that `EnhancedImage` has a **composite primary key** consisting of both `image_id` and `enhance_method`. This allows:\n", - "- The same image to be processed with multiple enhancement methods\n", - "- The same enhancement method to be applied to multiple images\n", - "- Each combination is stored as a unique result\n", - "\n", - "This pattern is called \"parameterization\" because you're essentially parameterizing the Image entity by the EnhanceMethod. The `EnhancedImage` table is the entity of primary interest—it contains the actual processed results.\n", - "\n", - "**Design Choice**: Because both foreign keys are part of the primary key, this creates a many-to-many relationship. If you moved `EnhanceMethod` below the line (as a secondary attribute), each image could only be enhanced using one method." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Directed Graphs\n", - "\n", - "Directed graphs represent relationships where there is a directional connection between nodes. In databases, this often appears when entities of the same type can be related to each other in a specific direction.\n", - "\n", - "Common examples include:\n", - "- Employees and their managers (both are employees)\n", - "- Neurons and synapses (connections between neurons)\n", - "- Social media follows (users following other users)\n", - "- File systems (folders containing other folders)\n", - "\n", - "## Example: Neural Connectivity\n", - "\n", - "In neuroscience, we often need to model connections between neurons. Each synapse is a directed connection from a presynaptic neuron to a postsynaptic neuron:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "0\n", - "\n", - "0\n", + "\n", + "\n", + "State->City\n", + "\n", "\n", - "\n", - "\n", - "Synapse\n", - "\n", - "\n", - "Synapse\n", + "\n", + "\n", + "Shipment\n", + "\n", + "\n", + "Shipment\n", "\n", "\n", "\n", - "\n", - "\n", - "0->Synapse\n", - "\n", + "\n", + "\n", + "Delivery\n", + "\n", + "\n", + "Delivery\n", + "\n", "\n", - "\n", - "\n", - "1\n", - "\n", - "1\n", "\n", - "\n", - "\n", - "1->Synapse\n", - "\n", + "\n", + "\n", + "Shipment->Delivery\n", + "\n", + "\n", + "\n", + "\n", + "Scan\n", + "\n", + "\n", + "Scan\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Session->Scan\n", + "\n", + "\n", + "\n", + "\n", + "ParkingSpot\n", + "\n", + "\n", + "ParkingSpot\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Order\n", + "\n", + "\n", + "Order\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Order->Shipment\n", + "\n", "\n", "\n", - "\n", + "\n", "Neuron\n", - "\n", - "\n", - "Neuron\n", + "\n", + "\n", + "Neuron\n", "\n", "\n", "\n", - "\n", - "\n", - "Neuron->0\n", - "\n", + 
"\n", + "\n", + "Neuron->8\n", + "\n", "\n", - "\n", - "\n", - "Neuron->1\n", - "\n", + "\n", + "\n", + "Neuron->9\n", + "\n", "\n", + "\n", + "\n", + "Image\n", + "\n", + "\n", + "Image\n", + "\n", "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "schema10 = dj.Schema('neuro')\n", - "\n", - "@schema10\n", - "class Neuron(dj.Manual):\n", - " definition = \"\"\"\n", - " neuron_id : int\n", - " ---\n", - " neuron_type : enum('excitatory', 'inhibitory')\n", - " layer : int\n", - " \"\"\"\n", - "\n", - "@schema10\n", - "class Synapse(dj.Manual):\n", - " definition = \"\"\"\n", - " synapse_id : int\n", - " ---\n", - " -> Neuron.proj(presynaptic='neuron_id')\n", - " -> Neuron.proj(postsynaptic='neuron_id')\n", - " strength : float # synaptic weight\n", - " \"\"\"\n", - "\n", - "dj.Diagram(schema10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Key Features of Directed Graphs\n", - "\n", - "In this neural connectivity example:\n", - "\n", - "* **Acyclic Dependencies**: The `Synapse` table references the `Neuron` table twice, creating connections between neurons without the need for a cyclic dependency (Neuron -> Synapse -> Neuron).\n", - "* **Renamed foreign keys**: We use `.proj()` to rename the foreign keys to `presynaptic` and `postsynaptic`, making the relationship clear\n", - "* **Multigraph**: Multiple synapses can connect the same pair of neurons (since `synapse_id` is the primary key)\n", - "* **Directionality**: The relationship has a clear direction from presynaptic to postsynaptic neuron\n", - "\n", - "## Example: Employee Management Hierarchy\n", - "\n", - "Another common directed graph is an organizational hierarchy where employees report to managers (who are also employees):" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - 
"\n", - "\n", - "\n", - "7\n", - "\n", - "7\n", "\n", - "\n", - "\n", - "ReportsTo\n", - "\n", - "\n", - "ReportsTo\n", + "\n", + "\n", + "EnhancedImage\n", + "\n", + "\n", + "EnhancedImage\n", "\n", "\n", "\n", - "\n", - "\n", - "7->ReportsTo\n", - "\n", + "\n", + "\n", + "Image->EnhancedImage\n", + "\n", "\n", "\n", - "\n", + "\n", "Employee\n", - "\n", - "\n", - "Employee\n", + "\n", + "\n", + "Employee\n", "\n", "\n", "\n", - "\n", - "\n", - "Employee->7\n", - "\n", + "\n", + "\n", + "Employee->ParkingSpot\n", + "\n", "\n", - "\n", - "\n", - "Employee->ReportsTo\n", - "\n", + "\n", + "\n", + "CustomerPreferences\n", + "\n", + "\n", + "CustomerPreferences\n", + "\n", "\n", "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "schema11 = dj.Schema('org')\n", - "\n", - "@schema11\n", - "class Employee(dj.Manual):\n", - " definition = \"\"\"\n", - " employee_id : int\n", - " ---\n", - " full_name : varchar(60)\n", - " hire_date : date\n", - " \"\"\"\n", - "\n", - "@schema11\n", - "class ReportsTo(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Employee\n", - " ---\n", - " -> Employee.proj(manager_id='employee_id')\n", - " \"\"\"\n", - "\n", - "dj.Diagram(schema11)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this organizational hierarchy:\n", - "\n", - "* **`Employee`** is the base entity table containing all employees\n", - "* **`ReportsTo`** creates the management relationships, where:\n", - " - The primary key is `employee_id` (the subordinate)\n", - " - The foreign key references `manager_id` (also an employee)\n", - "* **One manager, many reports**: Each employee reports to exactly one manager (or none if they're the CEO), but one manager can have many direct reports\n", - "\n", - "**Why not a self-referencing foreign key?** In classical SQL design, you might add a `manager_id` column directly to the `Employee` table. 
In DataJoint, we avoid self-referencing tables to maintain the acyclic property of the schema. Instead, we create a separate association table. This also makes it easier to:\n", - "- Add attributes to the relationship (e.g., start date of reporting relationship)\n", - "- Query the reporting structure\n", - "- Handle employees without managers (no nullable foreign keys in primary keys)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Design Puzzle: States, Cities, and Capitals\n", - "\n", - "Here's an interesting design challenge that combines one-to-many relationships with additional uniqueness constraints:\n", - "\n", - "**Requirements:**\n", - "1. Each city belongs to exactly one state\n", - "2. Each state has exactly one capital city\n", - "3. A capital must be a city\n", - "4. A capital must be in the same state\n", - "\n", - "This is a classic puzzle in database design that's similar to modeling teams and team captains. Let's explore the solution:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "City\n", - "\n", - "\n", - "City\n", + "\n", + "\n", + "Customer\n", + "\n", + "\n", + "Customer\n", "\n", "\n", "\n", - "\n", - "\n", - "State\n", - "\n", - "\n", - "State\n", + "\n", + "\n", + "Customer->CustomerPreferences\n", + "\n", + "\n", + "\n", + "\n", + "AccountIndependent\n", + "\n", + "\n", + "AccountIndependent\n", "\n", "\n", "\n", - "\n", - "\n", - "State->City\n", - "\n", + "\n", + "\n", + "Customer->AccountIndependent\n", + "\n", + "\n", + "\n", + "\n", + "AccountContained\n", + "\n", + "\n", + "AccountContained\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Customer->AccountContained\n", + "\n", + "\n", + "\n", + "\n", + "Course\n", + "\n", + "\n", + "Course\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Course->EnrollmentWithLookup\n", + "\n", + "\n", + "\n", + "\n", + "Course->Enrollment\n", + 
"\n", + "\n", + "\n", + "\n", + "LetterGrade\n", + "\n", + "\n", + "LetterGrade\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "LetterGrade->EnrollmentWithLookup\n", + "\n", + "\n", + "\n", + "\n", + "EnhanceMethod\n", + "\n", + "\n", + "EnhanceMethod\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "EnhanceMethod->EnhancedImage\n", + "\n", "\n", "\n", "" ], "text/plain": [ - "" + "" ] }, - "execution_count": 20, + "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "schema12 = dj.Schema('geography')\n", - "\n", - "@schema12\n", - "class State(dj.Manual):\n", - " definition = \"\"\"\n", - " state : char(2) # two-letter state code\n", - " ---\n", - " state_name : varchar(30)\n", - " \"\"\"\n", - "\n", - "@schema12\n", - "class City(dj.Manual):\n", - " definition = \"\"\"\n", - " -> State\n", - " city_name : varchar(30)\n", - " ---\n", - " capital = null : enum('YES')\n", - " unique index(state, capital)\n", - " \"\"\"\n", - "\n", - "dj.Diagram(schema12)" + "# View the entire schema\n", + "dj.Diagram(schema)" ] }, { "cell_type": "markdown", + "id": "cell-40", "metadata": {}, - "source": "### Understanding the Solution\n\nThis design elegantly handles all four requirements:\n\n1. **Cities belong to states**: The foreign key `-> State` in the `City` table's primary key ensures each city belongs to exactly one state\n\n2. **Unique capitals**: The `unique index(state, capital)` constraint ensures that for each state, there can be at most one city with `capital='YES'`\n\n3. **Capitals are cities**: Since the capital designation is just an attribute of the `City` table, capitals are by definition cities\n\n4. **Capitals in the same state**: The `state` field is part of both the foreign key and the unique index, ensuring the capital must be in that state\n\n**The Key Insight**: The nullable enum field `capital = null : enum('YES')` combined with `unique index(state, capital)` is the clever trick. 
Since the unique index only applies to non-null values, you can have multiple cities with `capital=null` in the same state, but only one city can have `capital='YES'` for each state.\n\nThis is the same pattern you would use for modeling teams and team captains, where:\n- `State` → `Team`\n- `City` → `Player`\n- `capital='YES'` → `is_captain='YES'`\n\n## Summary\n\nWe've explored several key relationship patterns:\n\n* **One-to-Many**: Using foreign keys in secondary attributes (dashed lines) or using foreign keys in a composite primary key (solid thin lines). Or use an association table with a unique constraint on one of the foreign keys.\n* **One-to-One**: Using foreign keys as primary keys (thick solid lines) or in secondary attributes with unique constraints (thin dashed lines) \n* **Many-to-Many**: Using association tables with composite primary keys\n* **Sequences**: Cascading one-to-one relationships for workflows\n* **Hierarchies**: Cascading one-to-many relationships with expanding primary keys\n* **Parameterization**: Association tables where the association itself is the entity of interest\n* **Directed Graphs**: Self-referencing relationships with renamed foreign keys\n* **Constrained Relationships**: Using nullable enums with unique indexes for special cases\n\nUnderstanding these patterns allows you to design schemas that accurately represent complex business rules and data relationships.\n\n:::{seealso}\nFor a complete business database demonstrating these relationship patterns in a realistic context, see the [Classic Sales](../80-examples/010-classic-sales.ipynb) example, which models offices, employees, customers, orders, and products as an integrated workflow with hierarchies, sequences, and association patterns.\n:::" + "source": [ + "# Cleanup\n", + "\n", + "Optionally drop the tutorial schema when done:" + ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": 26, + "id": "cell-41", "metadata": {}, - "source": [] + 
"outputs": [], + "source": [ + "# drop the schema\n", + "schema.drop(force=True)" + ] } ], "metadata": { @@ -2089,5 +2756,5 @@ } }, "nbformat": 4, - "nbformat_minor": 4 -} \ No newline at end of file + "nbformat_minor": 5 +} diff --git a/book/30-design/053-master-part.ipynb b/book/30-design/053-master-part.ipynb index 91283f9..6d936c8 100644 --- a/book/30-design/053-master-part.ipynb +++ b/book/30-design/053-master-part.ipynb @@ -112,7 +112,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "As seen in this example, DataJoint provides special syntax for defining master-part relationships:\n\n1. **Master tables are declared normally** – The master entity is declared as any regular table by subclassing `dj.Manual`/`dj.Lookup`/`dj.Imported`/`dj.Computed`. Thus a table becomes a master table by virtue of having part tables.\n\n2. **Nested class definition** – Parts are declared as a nested class inside its master class, subclassing `dj.Part`. Thus the part tables are referred to by their full class name such as `Polygon.Vertex`. Their classes do not need to be wrapped with the `@schema` decorator: the decorator of the master class is responsible for declaring all of its parts. \n\n3. **Foreign key from part to master** – The part tables declare a foreign key to its master directly or transitively through other parts. Inside the namespace of the master class, a special object named `master` can be used to reference the master table. Thus the definition of the `Vertex` table can declare the foreign key `-> master` as an equivalent alias to `-> Polygon`—either will form a valid foreign key. \n\n4. **Diagram notation** – In schema diagrams, part tables are rendered without colored blocks. They appear as labels attached to the master node, emphasizing that they do not stand on their own. The absence of color also highlights that other tables rarely reference parts directly; the master represents the entity identity.\n\n5. 
**Workflow semantics** – For computed and imported tables, the master's `make()` method is responsible for inserting both the master row and all its parts within a single ACID transaction. This ensures compositional integrity is maintained automatically." + "source": "As seen in this example, DataJoint provides special syntax for defining master-part relationships:\n\n1. **Master tables are declared normally** – The master entity is declared as any regular table by subclassing `dj.Manual`/`dj.Lookup`/`dj.Imported`/`dj.Computed`. Thus a table becomes a master table by virtue of having part tables.\n\n2. **Nested class definition** – Parts are declared as a nested class inside its master class, subclassing `dj.Part`. Thus the part tables are referred to by their full class name such as `Polygon.Vertex`. Their classes do not need to be wrapped with the `@schema` decorator: the decorator of the master class is responsible for declaring all of its parts. \n\n3. **Foreign key from part to master** – The part tables declare a foreign key to its master directly or transitively through other parts. Inside the namespace of the master class, a special object named `master` can be used to reference the master table. Thus the definition of the `Vertex` table can declare the foreign key `-> master` as an equivalent alias to `-> Polygon`—either will form a valid foreign key. \n\n4. **Part tables can introduce new schema dimensions** – Unlike auto-populated master tables which cannot introduce new dimensions (see [Primary Keys](018-primary-key.md)), part tables *can* define new primary key attributes. In the example above, `Vertex` introduces the `vertex_id` dimension to identify individual vertices within each polygon. This is the mechanism by which computations can produce multiple output entities from a single input.\n\n5. **Diagram notation** – In schema diagrams, part tables are rendered without colored blocks. 
They appear as labels attached to the master node, emphasizing that they do not stand on their own. Part table names that introduce new dimensions are underlined, following the standard convention for dimension-defining tables.\n\n6. **Workflow semantics** – For computed and imported tables, the master's `make()` method is responsible for inserting both the master row and all its parts within a single ACID transaction. This ensures compositional integrity is maintained automatically." }, { "cell_type": "markdown", @@ -122,7 +122,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "## Master-Part in Computed Tables\n\nMaster-part relationships are most powerful in auto-computed tables (`dj.Computed` or `dj.Imported`). \nThe master is responsible for populating all its parts within a single `make` call.\n\n### ACID Transactions\n\nWhen `populate` is called, DataJoint executes each `make()` method inside an **ACID transaction**:\n\n- **Atomicity** – The entire `make` call is all-or-nothing. Either the master row and all its parts are inserted together, or none of them are. If any error occurs—whether in computing results, inserting the master, or inserting any part—the entire transaction is rolled back. No partial results are ever committed to the database.\n\n- **Consistency** – The transaction moves the database from one valid state to another. The master-part relationship ensures that every master entry has its complete set of parts. Referential integrity constraints are satisfied at commit time.\n\n- **Isolation** – The transaction operates on a consistent snapshot of the database. Other concurrent transactions cannot see the partially inserted data until the transaction commits. This means other processes querying the database will never observe a master without its parts.\n\n- **Durability** – Once the transaction commits successfully, the data is permanently stored. 
Even if the system crashes immediately after, the master and all its parts will be present when the database restarts.\n\n### The Master's Responsibility\n\nThe master's `make` method is responsible for:\n1. Fetching all necessary input data\n2. Performing all computations\n3. Inserting the master row\n4. Inserting all part rows\n\nThis design ensures that the entire computation for one entity is self-contained within a single transactional boundary.\n\n### Example: Blob Detection\n\nConsider the [Blob Detection](../80-examples/075-blob-detection.ipynb) pipeline where `Detection` (master) and `Detection.Blob` (part) work together:\n\n```python\n@schema\nclass Detection(dj.Computed):\n definition = \"\"\"\n -> Image\n -> BlobParamSet\n ---\n nblobs : int\n \"\"\"\n\n class Blob(dj.Part):\n definition = \"\"\"\n -> master\n blob_id : int\n ---\n x : float\n y : float\n r : float\n \"\"\"\n\n def make(self, key):\n # fetch inputs\n img = (Image & key).fetch1(\"image\")\n params = (BlobParamSet & key).fetch1()\n\n # compute results\n blobs = blob_doh(\n img, \n min_sigma=params['min_sigma'], \n max_sigma=params['max_sigma'], \n threshold=params['threshold'])\n\n # insert master and parts (within one transaction)\n self.insert1(dict(key, nblobs=len(blobs)))\n self.Blob.insert(\n (dict(key, blob_id=i, x=x, y=y, r=r)\n for i, (x, y, r) in enumerate(blobs)))\n```\n\nIn this example:\n- The `make` method is called once per `(image_id, blob_paramset)` combination\n- Each call runs inside its own ACID transaction\n- The master row (`Detection`) stores the aggregate blob count\n- The part rows (`Detection.Blob`) store the coordinates of each detected blob\n- If `blob_doh` raises an exception or any insert fails, nothing is committed\n- An image with 200 detected blobs results in 1 master row + 200 part rows, all inserted atomically\n\nThis transactional guarantee means that any downstream table depending on `Detection` can trust that all `Detection.Blob` rows for that 
detection are present." + "source": "## Master-Part in Computed Tables\n\nMaster-part relationships are most powerful in auto-computed tables (`dj.Computed` or `dj.Imported`). \nThe master is responsible for populating all its parts within a single `make` call.\n\n### Schema Dimensions in Computed Tables\n\nAuto-populated tables have a fundamental constraint: **they cannot introduce new schema dimensions directly**. Their primary key must be fully determined by foreign keys to their upstream dependencies. This ensures that the key source (the set of entities to be computed) is well-defined.\n\nHowever, computations often produce multiple output entities from a single input—detecting multiple cells in an image, extracting multiple spikes from a recording, or identifying multiple vertices in a polygon. **Part tables solve this by being allowed to introduce new dimensions**.\n\nIn the blob detection example below, `Detection` (the master) inherits its primary key entirely from `Image` and `BlobParamSet`. It cannot add new dimensions. But `Detection.Blob` (the part) introduces the `blob_id` dimension to identify individual blobs within each detection.\n\n### ACID Transactions\n\nWhen `populate` is called, DataJoint executes each `make()` method inside an **ACID transaction**:\n\n- **Atomicity** – The entire `make` call is all-or-nothing. Either the master row and all its parts are inserted together, or none of them are. If any error occurs—whether in computing results, inserting the master, or inserting any part—the entire transaction is rolled back. No partial results are ever committed to the database.\n\n- **Consistency** – The transaction moves the database from one valid state to another. The master-part relationship ensures that every master entry has its complete set of parts. Referential integrity constraints are satisfied at commit time.\n\n- **Isolation** – The transaction operates on a consistent snapshot of the database. 
Other concurrent transactions cannot see the partially inserted data until the transaction commits. This means other processes querying the database will never observe a master without its parts.\n\n- **Durability** – Once the transaction commits successfully, the data is permanently stored. Even if the system crashes immediately after, the master and all its parts will be present when the database restarts.\n\n### The Master's Responsibility\n\nThe master's `make` method is responsible for:\n1. Fetching all necessary input data\n2. Performing all computations\n3. Inserting the master row\n4. Inserting all part rows\n\nThis design ensures that the entire computation for one entity is self-contained within a single transactional boundary.\n\n### Example: Blob Detection\n\nConsider the [Blob Detection](../80-examples/075-blob-detection.ipynb) pipeline where `Detection` (master) and `Detection.Blob` (part) work together:\n\n```python\n@schema\nclass Detection(dj.Computed):\n definition = \"\"\"\n -> Image\n -> BlobParamSet\n ---\n nblobs : int\n \"\"\"\n\n class Blob(dj.Part):\n definition = \"\"\"\n -> master\n blob_id : int # NEW DIMENSION: identifies blobs within detection\n ---\n x : float\n y : float\n r : float\n \"\"\"\n\n def make(self, key):\n # fetch inputs\n img = (Image & key).fetch1(\"image\")\n params = (BlobParamSet & key).fetch1()\n\n # compute results\n blobs = blob_doh(\n img, \n min_sigma=params['min_sigma'], \n max_sigma=params['max_sigma'], \n threshold=params['threshold'])\n\n # insert master and parts (within one transaction)\n self.insert1(dict(key, nblobs=len(blobs)))\n self.Blob.insert(\n (dict(key, blob_id=i, x=x, y=y, r=r)\n for i, (x, y, r) in enumerate(blobs)))\n```\n\nIn this example:\n- The `make` method is called once per `(image_id, blob_paramset)` combination\n- Each call runs inside its own ACID transaction\n- `Detection` cannot introduce new dimensions—its primary key is fully inherited\n- `Detection.Blob` introduces the `blob_id` 
dimension to identify each detected blob\n- The master row stores the aggregate blob count; the part rows store individual coordinates\n- If `blob_doh` raises an exception or any insert fails, nothing is committed\n- An image with 200 detected blobs results in 1 master row + 200 part rows, all inserted atomically\n\nThis transactional guarantee means that any downstream table depending on `Detection` can trust that all `Detection.Blob` rows for that detection are present." }, { "cell_type": "markdown", @@ -137,7 +137,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "## Summary\n\nMaster-part relationships provide a structured way to model entities that own subordinate detail rows. Key principles:\n\n1. **Compositional integrity** – A master and its parts form an indivisible unit. They are inserted and deleted together, never partially.\n\n2. **ACID transactions** – Each `make()` call runs inside a transaction guaranteeing atomicity, consistency, isolation, and durability. If any step fails, the entire operation is rolled back.\n\n3. **Master's responsibility** – The master's `make()` method is solely responsible for populating itself and all its parts. This keeps the transactional boundary clear and self-contained.\n\n4. **Implicit part dependency** – A foreign key to the master implies a dependency on all its parts. Downstream tables can safely assume that when the master exists, all its parts are present and complete.\n\n5. **Clean separation** – Masters hold aggregate/summary data while parts hold per-item details. 
Downstream tables reference the master; queries join with parts when details are needed.\n\nDataJoint's nested class syntax and transactional populate mechanism make this pattern easy to express and safe to use in relational workflows.\n\n:::{seealso}\nFor a complete working example demonstrating these concepts, see the [Blob Detection](../80-examples/075-blob-detection.ipynb) pipeline, where `Detection` (master) and `Detection.Blob` (part) illustrate atomic population and downstream dependency through `SelectDetection`.\n:::" + "source": "## Summary\n\nMaster-part relationships provide a structured way to model entities that own subordinate detail rows. Key principles:\n\n1. **Compositional integrity** – A master and its parts form an indivisible unit. They are inserted and deleted together, never partially.\n\n2. **ACID transactions** – Each `make()` call runs inside a transaction guaranteeing atomicity, consistency, isolation, and durability. If any step fails, the entire operation is rolled back.\n\n3. **Master's responsibility** – The master's `make()` method is solely responsible for populating itself and all its parts. This keeps the transactional boundary clear and self-contained.\n\n4. **Schema dimensions** – Auto-populated master tables cannot introduce new dimensions; their primary key is fully inherited through foreign keys. Part tables *can* introduce new dimensions, enabling computations to produce multiple output entities from a single input.\n\n5. **Implicit part dependency** – A foreign key to the master implies a dependency on all its parts. Downstream tables can safely assume that when the master exists, all its parts are present and complete.\n\n6. **Clean separation** – Masters hold aggregate/summary data while parts hold per-item details. 
Downstream tables reference the master; queries join with parts when details are needed.\n\nDataJoint's nested class syntax and transactional populate mechanism make this pattern easy to express and safe to use in relational workflows.\n\n:::{seealso}\n- [Primary Keys](018-primary-key.md) — Schema dimensions and their constraints on auto-populated tables\n- [Blob Detection](../80-examples/075-blob-detection.ipynb) — Complete working example where `Detection` (master) and `Detection.Blob` (part) illustrate atomic population and downstream dependency through `SelectDetection`\n:::" } ], "metadata": { diff --git a/book/30-design/060-diagrams.ipynb b/book/30-design/060-diagrams.ipynb deleted file mode 100644 index db7cbfd..0000000 --- a/book/30-design/060-diagrams.ipynb +++ /dev/null @@ -1,610 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Diagramming\n", - "\n", - "Schema diagrams are essential tools for understanding and designing DataJoint pipelines.\n", - "They provide a visual representation of tables and their dependencies, making complex workflows comprehensible at a glance.\n", - "\n", - "As introduced in [Relational Workflows](../20-concepts/05-workflows.md), DataJoint schemas form **Directed Acyclic Graphs (DAGs)** where:\n", - "\n", - "- **Nodes** represent tables (workflow steps)\n", - "- **Edges** represent foreign key dependencies\n", - "- **Direction** flows from parent (referenced) to child (referencing) tables\n", - "\n", - "This DAG structure embodies a core principle of the Relational Workflow Model: **the schema is an executable specification**.\n", - "Tables at the top are independent entities; tables below depend on tables above them.\n", - "Reading the diagram top-to-bottom reveals the workflow execution order.\n", - "\n", - "DataJoint's diagramming notation differs from traditional notations (Chen's ER, Crow's Foot, UML) in one critical way: **line styles encode semantic relationship types**, not just 
cardinality.\n", - "This makes the diagram immediately informative about how entities relate—whether they share identity, belong to each other, or merely reference each other." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Quick Reference\n", - "\n", - "| Line Style | Appearance | Relationship | Child's Primary Key | Cardinality |\n", - "|------------|------------|--------------|--------------------|--------------|\n", - "| **Thick Solid** | ━━━ | Extension | Parent PK only | One-to-one |\n", - "| **Thin Solid** | ─── | Containment | Parent PK + own field(s) | One-to-many |\n", - "| **Dashed** | ┄┄┄ | Reference | Own independent PK | One-to-many |\n", - "\n", - "**Key Principle**: Solid lines mean the parent's identity becomes part of the child's identity.\n", - "Dashed lines mean the child maintains independent identity.\n", - "\n", - "**Visual Indicators**:\n", - "- **Underlined table name**: Independent entity with its own primary key\n", - "- **Non-underlined name**: Dependent entity whose identity derives from parent\n", - "- **Orange dots**: Renamed foreign keys (see [Renamed Foreign Keys](#renamed-foreign-keys-and-orange-dots))\n", - "- **Table colors**: Green (Manual), Blue (Imported), Red (Computed), Gray (Lookup)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## The Three Line Styles\n", - "\n", - "Line styles convey the **semantic relationship** between parent and child tables.\n", - "The choice of line style is determined by where the foreign key appears in the child's definition.\n", - "\n", - "### Thick Solid Line: Extension (One-to-One)\n", - "\n", - "The foreign key **is** the entire primary key of the child table.\n", - "\n", - "**Semantics**: The child *extends* or *specializes* the parent.\n", - "They share the same identity—at most one child exists for each parent.\n", - "\n", - "```python\n", - "@schema\n", - "class Customer(dj.Manual):\n", - " definition = \"\"\"\n", - " customer_id 
: int\n", - " ---\n", - " name : varchar(50)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class CustomerPreferences(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Customer # This IS the entire primary key\n", - " ---\n", - " theme : varchar(20)\n", - " \"\"\"\n", - "```\n", - "\n", - "**Use cases**: Workflow sequences (Order → Shipment → Delivery), optional extensions (Customer → CustomerPreferences), modular data splits.\n", - "\n", - "### Thin Solid Line: Containment (One-to-Many)\n", - "\n", - "The foreign key is **part of** (but not all of) the child's primary key.\n", - "\n", - "**Semantics**: The child *belongs to* or *is contained within* the parent.\n", - "Multiple children can exist for each parent, each identified within the parent's context.\n", - "\n", - "```python\n", - "@schema\n", - "class Customer(dj.Manual):\n", - " definition = \"\"\"\n", - " customer_id : int\n", - " ---\n", - " name : varchar(50)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class Account(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Customer # Part of primary key\n", - " account_number : int # Additional PK component\n", - " ---\n", - " balance : decimal(10,2)\n", - " \"\"\"\n", - "```\n", - "\n", - "**Use cases**: Hierarchies (Study → Subject → Session), ownership (Customer → Account), containment (Order → OrderItem).\n", - "\n", - "### Dashed Line: Reference (One-to-Many)\n", - "\n", - "The foreign key is a **secondary attribute** (below the `---` line).\n", - "\n", - "**Semantics**: The child *references* or *associates with* the parent but maintains independent identity.\n", - "The parent is just one attribute describing the child.\n", - "\n", - "```python\n", - "@schema\n", - "class Bank(dj.Manual):\n", - " definition = \"\"\"\n", - " bank_id : int\n", - " ---\n", - " bank_name : varchar(100)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class Account(dj.Manual):\n", - " definition = \"\"\"\n", - " account_number : int # Own independent PK\n", - " ---\n", - " -> 
Bank # Secondary attribute\n", - " balance : decimal(10,2)\n", - " \"\"\"\n", - "```\n", - "\n", - "**Use cases**: Loose associations (Product → Manufacturer), references that might change (Employee → Department), when child has independent identity." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Visual Examples\n", - "\n", - "Let's see each line style in action with live diagrams." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import datajoint as dj\n", - "dj.conn()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Dashed Line Example" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "schema_dashed = dj.Schema('diagram_dashed')\n", - "\n", - "@schema_dashed\n", - "class Customer(dj.Manual):\n", - " definition = \"\"\"\n", - " customer_id : int\n", - " ---\n", - " name : varchar(50)\n", - " \"\"\"\n", - "\n", - "@schema_dashed \n", - "class Account(dj.Manual):\n", - " definition = \"\"\"\n", - " account_number : int\n", - " ---\n", - " -> Customer\n", - " balance : decimal(10,2)\n", - " \"\"\"\n", - "\n", - "dj.Diagram(schema_dashed)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Dashed line**: `Account` has its own independent identity (`account_number`).\n", - "The `customer_id` foreign key is secondary—it references `Customer` but doesn't define the account's identity." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Thin Solid Line Example" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "schema_thin = dj.Schema('diagram_thin')\n", - "\n", - "@schema_thin\n", - "class Customer(dj.Manual):\n", - " definition = \"\"\"\n", - " customer_id : int\n", - " ---\n", - " name : varchar(50)\n", - " \"\"\"\n", - "\n", - "@schema_thin\n", - "class Account(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Customer\n", - " account_number : int\n", - " ---\n", - " balance : decimal(10,2)\n", - " \"\"\"\n", - "\n", - "dj.Diagram(schema_thin)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Thin solid line**: `Account`'s primary key is `(customer_id, account_number)`.\n", - "Accounts *belong to* customers—Account #3 means \"Account #3 of Customer X.\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Thick Solid Line Example" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "schema_thick = dj.Schema('diagram_thick')\n", - "\n", - "@schema_thick\n", - "class Customer(dj.Manual):\n", - " definition = \"\"\"\n", - " customer_id : int\n", - " ---\n", - " name : varchar(50)\n", - " \"\"\"\n", - "\n", - "@schema_thick\n", - "class Account(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Customer\n", - " ---\n", - " balance : decimal(10,2)\n", - " \"\"\"\n", - "\n", - "dj.Diagram(schema_thick)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Thick solid line**: `Account`'s primary key *is* `customer_id` (inherited from `Customer`).\n", - "Each customer can have at most one account—they share identity.\n", - "Note that `Account` is no longer underlined, indicating it's not an independent dimension." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Association Tables and Many-to-Many Relationships\n", - "\n", - "Many-to-many relationships appear as tables with **converging foreign keys**—multiple thin solid lines pointing into a single table." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "schema_assoc = dj.Schema(\"projects\")\n", - "\n", - "@schema_assoc\n", - "class Employee(dj.Manual):\n", - " definition = \"\"\"\n", - " employee_id : int\n", - " ---\n", - " employee_name : varchar(60)\n", - " \"\"\"\n", - "\n", - "@schema_assoc\n", - "class Project(dj.Manual):\n", - " definition = \"\"\"\n", - " project_code : varchar(8)\n", - " ---\n", - " project_title : varchar(50)\n", - " start_date : date\n", - " end_date : date\n", - " \"\"\"\n", - " \n", - "@schema_assoc\n", - "class Assignment(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Employee\n", - " -> Project\n", - " ---\n", - " percent_effort : decimal(4,1) unsigned\n", - " \"\"\"\n", - "\n", - "dj.Diagram(schema_assoc)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Reading this diagram**:\n", - "- `Employee` and `Project` are independent entities (underlined, at top)\n", - "- `Assignment` has two thin solid lines converging into it\n", - "- Its primary key is `(employee_id, project_code)`—the combination of both parents\n", - "- This creates a many-to-many relationship: each employee can work on multiple projects, and each project can have multiple employees" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Renamed Foreign Keys and Orange Dots\n", - "\n", - "DataJoint foreign keys always reference the parent's **primary key**.\n", - "Usually, the foreign key attribute keeps the same name as in the parent.\n", - "However, sometimes you need different names:\n", - "\n", - "- **Multiple references to the same table** (e.g., presynaptic and 
postsynaptic neurons)\n", - "- **Semantic clarity** (e.g., `manager_id` instead of `employee_id`)\n", - "- **Avoiding name conflicts**\n", - "\n", - "Use `.proj()` to rename foreign key attributes:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "schema_graph = dj.Schema('directed_graph')\n", - "\n", - "@schema_graph\n", - "class Neuron(dj.Manual):\n", - " definition = \"\"\"\n", - " neuron_id : int\n", - " ---\n", - " neuron_type : enum('excitatory', 'inhibitory')\n", - " layer : int\n", - " \"\"\"\n", - "\n", - "@schema_graph\n", - "class Synapse(dj.Manual):\n", - " definition = \"\"\"\n", - " synapse_id : int\n", - " ---\n", - " -> Neuron.proj(presynaptic='neuron_id')\n", - " -> Neuron.proj(postsynaptic='neuron_id')\n", - " strength : float\n", - " \"\"\"\n", - "\n", - "dj.Diagram(schema_graph)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Orange dots** appear between `Neuron` and `Synapse`, indicating:\n", - "- A projection has renamed the foreign key attribute\n", - "- Two distinct foreign keys connect the same pair of tables\n", - "- In the `Synapse` table: `presynaptic` and `postsynaptic` both reference `Neuron.neuron_id`\n", - "\n", - "In interactive Jupyter notebooks, hovering over orange dots reveals the projection expression.\n", - "\n", - "**Common patterns** using renamed foreign keys:\n", - "- **Neural networks**: Presynaptic and postsynaptic neurons\n", - "- **Organizational hierarchies**: Employee and manager (both reference `Employee`)\n", - "- **Transportation**: Origin and destination airports" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "## Real-World Example: Classic Sales Database\n\nLet's examine a real database—the [MySQL tutorial sample database](https://www.mysqltutorial.org/getting-started-with-mysql/mysql-sample-database/).\n\n### Traditional ER Diagram\n\nHere is the classic Entity-Relationship diagram from the 
MySQL tutorial:\n\n![Classic Sales ER Diagram](../images/mysql-classic-sales-ERD.png)\n\nThis diagram uses Crow's Foot notation, where:\n- Lines with crow's feet indicate \"many\" relationships\n- Single lines indicate \"one\" relationships\n- The diagram shows cardinality but not the semantic nature of relationships\n\n### DataJoint Diagram\n\nNow let's see how the same database appears in DataJoint notation:" - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "schema = dj.Schema(\"classic_sales\")\n", - "schema.spawn_missing_classes()\n", - "\n", - "dj.Diagram(schema)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "### Comparing the Two Diagrams\n\n**Reading the DataJoint diagram**:\n1. **Independent entities at top**: `Productline`, `Office`, `Customer` (underlined)\n2. **Follow solid lines down**: Track how primary keys cascade through the hierarchy\n3. **Identify association tables**: Look for converging lines (e.g., `Orderdetail` links `Order` and `Product`)\n4. **Dashed lines**: Reference relationships that don't cascade identity\n\n**Key differences from the ER diagram**:\n\n| Aspect | Traditional ER (Crow's Foot) | DataJoint |\n|--------|------------------------------|-----------|\n| **Layout** | Arbitrary arrangement | Top-to-bottom workflow order |\n| **Line meaning** | Cardinality only (one vs. 
many) | Semantic relationship type |\n| **Primary key cascade** | Not visible | Solid lines show direct join paths |\n| **Workflow sequence** | Must read documentation | Clear from vertical structure |\n\nThe vertical layout reveals the workflow: create product lines and offices first, then products and employees, then customers and orders, and finally order details and payments.\n\n:::{seealso}\nFor the complete schema with data and example queries, see the [Classic Sales](../80-examples/010-classic-sales.ipynb) example.\n:::" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## What Diagrams Show and Don't Show\n", - "\n", - "### Clearly Indicated\n", - "\n", - "| Feature | How It's Shown |\n", - "|---------|---------------|\n", - "| Relationship type | Line style (thick/thin/dashed) |\n", - "| Dependency direction | Arrows from parent to child |\n", - "| Independent entities | Underlined table names |\n", - "| Table tiers | Colors (Green/Blue/Red/Gray) |\n", - "| Many-to-many | Converging lines into association table |\n", - "| Renamed foreign keys | Orange dots |\n", - "\n", - "### Not Visible\n", - "\n", - "| Feature | Must Check |\n", - "|---------|------------|\n", - "| Nullable foreign keys | Table definition |\n", - "| Secondary unique constraints | Table definition |\n", - "| Attribute names and types | Hover or inspect definition |\n", - "| CHECK constraints | Table definition |\n", - "\n", - "**Design principle**: DataJoint users generally avoid secondary unique constraints.\n", - "Making foreign keys part of the primary key (creating solid lines) provides visual clarity and enables direct joins across multiple levels." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Diagram Operations\n", - "\n", - "DataJoint provides operators to filter and combine diagrams:\n", - "\n", - "```python\n", - "# Show entire schema\n", - "dj.Diagram(schema)\n", - "\n", - "# Show specific tables\n", - "dj.Diagram(Table1) + dj.Diagram(Table2)\n", - "\n", - "# Show table and N levels of upstream dependencies\n", - "dj.Diagram(Table) - N\n", - "\n", - "# Show table and N levels of downstream dependents\n", - "dj.Diagram(Table) + N\n", - "\n", - "# Combine operations\n", - "(dj.Diagram(Table1) - 2) + (dj.Diagram(Table2) + 1)\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Diagrams and Queries\n", - "\n", - "The diagram structure directly informs query patterns.\n", - "\n", - "**Solid line paths enable direct joins**:\n", - "```python\n", - "# If A → B → C are connected by solid lines:\n", - "A * C # Valid—primary keys cascade through solid lines\n", - "```\n", - "\n", - "**Dashed lines require intermediate tables**:\n", - "```python\n", - "# If A ---> B (dashed), B → C (solid):\n", - "A * B * C # Must include B\n", - "```\n", - "\n", - "This is why solid lines are preferred when appropriate—they simplify queries by allowing you to skip intermediate tables." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Comparison to Other Notations\n", - "\n", - "DataJoint's notation differs significantly from traditional database diagramming:\n", - "\n", - "| Feature | Chen's ER | Crow's Foot | DataJoint |\n", - "|---------|-----------|-------------|----------|\n", - "| **Cardinality** | Numbers near entities | Symbols at line ends | Line thickness/style |\n", - "| **Direction** | No inherent direction | No inherent direction | Always top-to-bottom (DAG) |\n", - "| **Cycles allowed** | Yes | Yes | No |\n", - "| **Entity vs. relationship** | Distinct (rect vs. 
diamond) | Not distinguished | Not distinguished |\n", - "| **Primary key cascade** | Not shown | Not shown | Solid lines show this |\n", - "| **Identity sharing** | Not indicated | Not indicated | Thick solid line |\n", - "\n", - "**Why DataJoint differs**:\n", - "\n", - "1. **DAG structure**: No cycles means schemas are readable as workflows (top-to-bottom execution order)\n", - "2. **Line style semantics**: Immediately reveals relationship type without reading labels\n", - "3. **Primary key cascade visibility**: Solid lines show which tables can be joined directly\n", - "4. **Unified entity treatment**: No artificial distinction between \"entities\" and \"relationships\"—associations are just tables with converging foreign keys\n", - "\n", - ":::{seealso}\n", - "The [Relational Workflows](../20-concepts/05-workflows.md) chapter covers the three database paradigms in depth, including how DataJoint's workflow-centric approach compares to Codd's mathematical model and Chen's Entity-Relationship model.\n", - ":::" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Best Practices\n", - "\n", - "### Reading Diagrams\n", - "\n", - "1. **Start at the top**: Identify independent entities (underlined)\n", - "2. **Follow solid lines**: Trace primary key cascades downward\n", - "3. **Spot convergence patterns**: Multiple lines into a table indicate associations\n", - "4. **Check line thickness**: Thick = one-to-one, Thin = one-to-many containment\n", - "5. **Note dashed lines**: These don't cascade identity\n", - "\n", - "### Designing with Diagrams\n", - "\n", - "1. **Choose solid lines when**:\n", - " - Building hierarchies (Study → Subject → Session)\n", - " - Creating workflow sequences (Order → Ship → Deliver)\n", - " - You want direct joins across levels\n", - "\n", - "2. 
**Choose dashed lines when**:\n", - " - Child has independent identity from parent\n", - " - Reference might change or is optional\n", - " - You don't need primary key cascade\n", - "\n", - "3. **Choose thick lines when**:\n", - " - Extending entities with optional information\n", - " - Modeling workflow steps (one output per input)\n", - " - Creating one-to-one relationships\n", - "\n", - "### Interactive Tips\n", - "\n", - "- **Hover over tables** to see complete definitions (works in Jupyter and SVG exports)\n", - "- **Hover over orange dots** to see projection expressions\n", - "- **Use `+` and `-` operators** to focus on specific parts of large schemas" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Summary\n", - "\n", - "DataJoint diagrams are more than documentation—they are **live views** of your schema that:\n", - "\n", - "- Reveal workflow structure through top-to-bottom layout\n", - "- Show relationship semantics through line styles\n", - "- Guide query design through primary key cascade visibility\n", - "- Stay synchronized because they're generated from the actual schema\n", - "\n", - "The key insight: in DataJoint, diagrams and implementation are unified.\n", - "There's no separate design document that can drift out of sync—the diagram **is** the schema." 
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "name": "python", - "version": "3.11.0" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} \ No newline at end of file diff --git a/book/30-design/082-indexes.ipynb b/book/30-design/082-indexes.ipynb index ab892e7..a0becef 100644 --- a/book/30-design/082-indexes.ipynb +++ b/book/30-design/082-indexes.ipynb @@ -4,44 +4,13 @@ "cell_type": "markdown", "id": "cell-0", "metadata": {}, - "source": [ - "---\n", - "title: Indexes\n", - "authors:\n", - " - name: Dimitri Yatsenko\n", - "date: 2024-10-22\n", - "---\n", - "\n", - "# Indexes: Accelerating Data Lookups\n", - "\n", - "As tables grow to thousands or millions of records, query performance becomes critical. **Indexes** are data structures that enable fast lookups by specific attributes, dramatically reducing query times from scanning every row to near-instantaneous retrieval.\n", - "\n", - "Think of an index like the index at the back of a textbook: instead of reading every page to find a topic, you look it up in the index and jump directly to the relevant pages. Database indexes work the same way—they create organized lookup structures that point directly to matching records.\n", - "\n", - "```{admonition} Learning Objectives\n", - ":class: note\n", - "\n", - "By the end of this chapter, you will:\n", - "- Understand how indexes accelerate database queries\n", - "- Recognize the three mechanisms that create indexes in DataJoint\n", - "- Declare explicit secondary indexes for frequently queried attributes\n", - "- Understand composite index ordering and its impact on queries\n", - "- Know when to use regular vs. unique indexes\n", - "```" - ] + "source": "---\ntitle: Indexes\n---\n\n# Indexes: Accelerating Data Lookups\n\nAs tables grow to thousands or millions of records, query performance becomes critical. 
**Indexes** are data structures that enable fast lookups by specific attributes, dramatically reducing query times from scanning every row to near-instantaneous retrieval.\n\nThink of an index like the index at the back of a textbook: instead of reading every page to find a topic, you look it up in the index and jump directly to the relevant pages. Database indexes work the same way—they create organized lookup structures that point directly to matching records.\n\n```{admonition} Learning Objectives\n:class: note\n\nBy the end of this chapter, you will:\n- Understand how indexes accelerate database queries\n- Recognize the three mechanisms that create indexes in DataJoint\n- Declare explicit secondary indexes for frequently queried attributes\n- Understand composite index ordering and its impact on queries\n- Know when to use regular vs. unique indexes\n```" }, { "cell_type": "markdown", "id": "cell-1", "metadata": {}, - "source": [ - "## Prerequisites\n", - "\n", - "This chapter assumes familiarity with:\n", - "- [Primary Keys](020-primary-key.md) — Understanding unique entity identification\n", - "- [Foreign Keys](030-foreign-keys.ipynb) — Understanding table relationships\n", - "- [Create Tables](015-table.ipynb) — Basic table declaration syntax" - ] + "source": "## Prerequisites\n\nThis chapter assumes familiarity with:\n- [Primary Keys](018-primary-key.md) — Understanding unique entity identification\n- [Foreign Keys](030-foreign-keys.md) — Understanding table relationships\n- [Tables](015-table.ipynb) — Basic table declaration syntax" }, { "cell_type": "markdown", diff --git a/book/30-design/090-pipeline-project.md b/book/30-design/090-pipeline-project.md index dbbc842..8265d23 100644 --- a/book/30-design/090-pipeline-project.md +++ b/book/30-design/090-pipeline-project.md @@ -1,7 +1,5 @@ --- title: Pipeline Projects -authors: - - name: Dimitri Yatsenko --- # Pipeline Projects diff --git a/book/40-operations/050-populate.ipynb 
b/book/40-operations/050-populate.ipynb index 2137b47..f6560e0 100644 --- a/book/40-operations/050-populate.ipynb +++ b/book/40-operations/050-populate.ipynb @@ -3,7 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": "# Populate\n\nThe `populate` operation is the engine of workflow automation in DataJoint.\nWhile [insert](010-insert.ipynb), [delete](020-delete.ipynb), and [update](030-updates.ipynb) are operations for Manual tables, `populate` automates data entry for **Imported** and **Computed** tables based on the dependencies defined in the schema.\n\nAs introduced in [Workflow Operations](000-workflow-operations.md), the distinction between external and automatic data entry maps directly to table tiers:\n\n| Table Tier | Data Entry Method |\n|------------|-------------------|\n| Lookup | `contents` property (part of schema) |\n| Manual | `insert` from external sources |\n| **Imported** | **Automatic `populate`** |\n| **Computed** | **Automatic `populate`** |\n\nThis chapter shows how `populate` transforms the schema's dependency graph into executable computations.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. 
In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Data from external systems or human entry |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. **Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. **Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. 
**Handles errors gracefully** — A failed `make()` call is rolled back; by default the exception stops `populate()`, while `suppress_errors=True` records the failure and continues with the remaining keys\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## The `make` Method\n\nThe `make()` method defines the computational logic for each entry.\nIt receives a **key** dictionary identifying which entity to compute and must **fetch** inputs, **compute** results, and **insert** them into the table.\n\nSee the dedicated [make Method](055-make.ipynb) chapter for:\n- The three-part anatomy (fetch, compute, insert)\n- Restrictions on auto-populated tables\n- The three-part pattern for long-running computations\n- Transaction handling strategies\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. See the [Master-Part](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection](../80-examples/075-blob-detection.ipynb) example demonstrates these concepts in a compact image-analysis workflow:\n\n1. 
**Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. **Parameter space** — `BlobParamSet` (lookup) defines detection configurations via `contents`\n3. **Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. When `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. 
Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [The `make` Method](055-make.ipynb) — Anatomy, constraints, and patterns\n- [Blob Detection](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces both summary and detail rows, structure them as master and parts to keep them in the same transaction" + "source": "# Populate\n\nThe `populate` operation is the engine of workflow automation in DataJoint.\nWhile [insert](010-insert.ipynb), [delete](020-delete.ipynb), and [update](030-updates.ipynb) are operations for Manual tables, `populate` automates data entry for **Imported** and **Computed** tables based on the dependencies defined in the schema.\n\nAs introduced in [Workflow Operations](000-workflow-operations.md), the distinction between external and automatic data entry maps directly to table tiers:\n\n| Table 
Tier | Data Entry Method |\n|------------|-------------------|\n| Lookup | `contents` property (part of schema) |\n| Manual | `insert` from external sources |\n| **Imported** | **Automatic `populate`** |\n| **Computed** | **Automatic `populate`** |\n\nThis chapter shows how `populate` transforms the schema's dependency graph into executable computations.\n\n## The Relational Workflow Model in Action\n\nRecall that the **Relational Workflow Model** is built on four fundamental concepts:\n\n1. **Workflow Entity** — Each table represents an entity type created at a specific workflow step\n2. **Workflow Dependencies** — Foreign keys prescribe the order of operations\n3. **Workflow Steps** — Distinct phases where entity types are created (manual or automated)\n4. **Directed Acyclic Graph (DAG)** — The schema forms a graph structure ensuring valid execution sequences\n\nThe Relational Workflow Model defines a new class of databases: **Computational Databases**, where computational transformations are first-class citizens of the data model. In a computational database, the schema is not merely a passive data structure—it is an executable specification of the workflow itself.\n\n## From Declarative Schema to Executable Pipeline\n\nA DataJoint schema uses **table tiers** to distinguish different workflow roles:\n\n| Tier | Color | Role in Workflow |\n|------|-------|------------------|\n| **Lookup** | Gray | Static reference data and configuration parameters |\n| **Manual** | Green | Data from external systems or human entry |\n| **Imported** | Blue | Data acquired automatically from instruments or files |\n| **Computed** | Red | Derived data produced by computational transformations |\n\nBecause dependencies are explicit through foreign keys, DataJoint's `populate()` method can explore the DAG top-down: for every upstream key that has not been processed, it executes the table's `make()` method inside an atomic transaction. 
If anything fails, the transaction is rolled back, preserving **computational validity**—the guarantee that all derived data remains consistent with its upstream dependencies.\n\nThis is the essence of **workflow automation**: each table advertises what it depends on, and `populate()` runs only the computations that are still missing.\n\n## The `populate` Method\n\nThe `populate()` method is the engine of workflow automation. When called on a computed or imported table, it:\n\n1. **Identifies missing work** — Queries the key source (the join of all upstream dependencies) and subtracts keys already present in the table\n2. **Iterates over pending keys** — For each missing key, calls the table's `make()` method\n3. **Wraps each `make()` in a transaction** — Ensures atomicity: either all inserts succeed or none do\n4. **Handles errors gracefully** — Failed jobs are logged but do not stop the remaining work\n\n```python\n# Process all pending work\nDetection.populate(display_progress=True)\n\n# Process a specific subset\nDetection.populate(Image & \"image_id < 10\")\n\n# Distribute across workers\nDetection.populate(reserve_jobs=True)\n```\n\nThe `reserve_jobs=True` option enables parallel execution across multiple processes or machines by using the database itself for job coordination.\n\n## The `make` Method\n\nThe `make()` method defines the computational logic for each entry.\nIt receives a **key** dictionary identifying which entity to compute and must **fetch** inputs, **compute** results, and **insert** them into the table.\n\nSee the dedicated [make Method](055-make.ipynb) chapter for:\n- The three-part anatomy (fetch, compute, insert)\n- Restrictions on auto-populated tables\n- The three-part pattern for long-running computations\n- Transaction handling strategies\n\n## Schema Dimensions and the Key Source\n\nAuto-populated tables have a fundamental constraint: **they cannot introduce new schema dimensions**. 
A schema dimension is created when a table defines a new primary key attribute directly (see [Primary Keys](../30-design/018-primary-key.md)). For Computed and Imported tables, the primary key must be fully determined by foreign keys to upstream dependencies.\n\nThis constraint is what makes the **key source** well-defined. The key source is computed as:\n\n```\nkey_source = (join of all primary-key dependencies).proj() - Table\n```\n\nIn other words: take the join of all upstream tables referenced in the primary key (a Cartesian product when those tables share no attributes), project to just the primary key attributes, and subtract the entries already present in the table. The result is the set of pending work items.\n\nBecause auto-populated tables cannot add new dimensions, each key in the key source corresponds to exactly one `make()` call. The computation receives a fully specified key and produces results for that key.\n\n**What if a computation produces multiple outputs?** Use [part tables](../30-design/053-master-part.ipynb). Part tables *can* introduce new dimensions. For example, a blob detection algorithm might find 200 blobs in one image—the `Detection` master cannot introduce a `blob_id` dimension, but `Detection.Blob` (the part table) can.\n\n## Transactional Integrity\n\nEach `make()` call executes inside an **ACID transaction**. This provides critical guarantees for computational workflows:\n\n- **Atomicity** — The entire computation either commits or rolls back as a unit\n- **Isolation** — Partial results are never visible to other processes\n- **Consistency** — The database moves from one valid state to another\n\nWhen a computed table has [part tables](../30-design/053-master-part.ipynb), the transaction boundary encompasses both the master and all its parts. The master's `make()` method is responsible for inserting everything within a single transactional scope. 
See the [Master-Part](../30-design/053-master-part.ipynb) chapter for detailed coverage of ACID semantics and the master's responsibility pattern.\n\n## Case Study: Blob Detection\n\nThe [Blob Detection](../80-examples/075-blob-detection.ipynb) example demonstrates these concepts in a compact image-analysis workflow:\n\n1. **Source data** — `Image` (manual) stores NumPy arrays as `longblob` fields\n2. **Parameter space** — `BlobParamSet` (lookup) defines detection configurations via `contents`\n3. **Computation** — `Detection` (computed) depends on both upstream tables\n\nThe `Detection` table uses a master-part structure: the master row stores an aggregate (blob count), while `Detection.Blob` parts store per-feature coordinates. Notice how schema dimensions work here:\n\n- `Detection` inherits its primary key entirely from `Image` and `BlobParamSet`—it cannot introduce new dimensions\n- `Detection.Blob` introduces the `blob_id` dimension to identify individual blobs within each detection\n\nWhen `populate()` runs:\n\n- Each `(image_id, blob_paramset)` combination triggers one `make()` call\n- The `make()` method fetches inputs, runs detection, and inserts both master and parts\n- The transaction ensures all blob coordinates appear atomically with their count\n\n```python\nDetection.populate(display_progress=True)\n# Detection: 100%|██████████| 6/6 [00:01<00:00, 4.04it/s]\n```\n\nThis pattern—automation exploring combinatorics, then human curation—is common in scientific workflows. After reviewing results, the `SelectDetection` manual table records the preferred parameter set for each image. 
Because `SelectDetection` depends on `Detection`, it implicitly has access to all `Detection.Blob` parts for the selected detection.\n\n:::{seealso}\n- [Primary Keys](../30-design/018-primary-key.md) — Schema dimensions and their constraints\n- [The `make` Method](055-make.ipynb) — Anatomy, constraints, and patterns\n- [Blob Detection](../80-examples/075-blob-detection.ipynb) — Complete working example\n- [Master-Part](../30-design/053-master-part.ipynb) — Transaction semantics and dependency implications\n:::\n\n## Why Computational Databases Matter\n\nThe Relational Workflow Model provides several key benefits:\n\n| Benefit | Description |\n|---------|-------------|\n| **Reproducibility** | Rerunning `populate()` regenerates derived tables from raw inputs |\n| **Dependency-aware scheduling** | DataJoint infers job order from foreign keys (the DAG structure) |\n| **Computational validity** | Transactions ensure downstream results stay consistent with upstream inputs |\n| **Provenance tracking** | The schema documents what was computed from what |\n\n## Practical Tips\n\n- **Develop incrementally** — Test `make()` logic with restrictions (e.g., `Table.populate(restriction)`) before processing all data\n- **Monitor progress** — Use `display_progress=True` for visibility during development\n- **Distribute work** — Use `reserve_jobs=True` when running multiple workers\n- **Use master-part for multi-row results** — When a computation produces multiple output entities, use part tables to introduce new dimensions while keeping everything in a single transaction" } ], "metadata": { diff --git a/book/80-examples/000-example-designs.ipynb b/book/80-examples/000-example-designs.ipynb index 0158299..02c7bfa 100644 --- a/book/80-examples/000-example-designs.ipynb +++ b/book/80-examples/000-example-designs.ipynb @@ -3,16 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "---\n", - "title: Schema Examples\n", - "date: 2025-01-11\n", - "authors:\n", - " - name: 
Dimitri Yatsenko\n", - "---\n", - "\n", - "In this section, we present several well-designed schemas, populated with data that are used in examples throughout the book." - ] + "source": "---\ntitle: Schema Examples\n---\n\nIn this section, we present several well-designed schemas, populated with data that are used in examples throughout the book." }, { "cell_type": "markdown", @@ -125,4 +116,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file diff --git a/book/80-examples/044-generations.ipynb b/book/80-examples/044-generations.ipynb index 5ff6d53..1b26049 100644 --- a/book/80-examples/044-generations.ipynb +++ b/book/80-examples/044-generations.ipynb @@ -61,91 +61,11 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "id": "710623d9", "metadata": {}, "outputs": [], - "source": [ - "@schema\n", - "class Generation(dj.Lookup):\n", - " definition = \"\"\"\n", - " generation: varchar(16)\n", - " ---\n", - " dob_start: date\n", - " dob_end: date\n", - " characteristic: varchar(255)\n", - " archetype: varchar(16)\n", - " life_principle: varchar(255)\n", - " symbol: varchar(64)\n", - " \"\"\"\n", - "\n", - " contents = [\n", - " (\n", - " 'Lost Generation',\n", - " '1883-01-01', '1900-12-31',\n", - " 'Disillusioned by World War I; sought meaning through art, modernism, and expatriate life.',\n", - " 'Nomad',\n", - " 'Search for authenticity and self-expression amid disillusionment.',\n", - " 'The Great Gatsby',\n", - " ),\n", - " (\n", - " 'Greatest',\n", - " '1901-01-01', '1927-12-31',\n", - " 'Grew up through the Great Depression and World War II; defined by duty and sacrifice.',\n", - " 'Hero',\n", - " 'Duty, unity, and collective purpose.',\n", - " 'Rosie the Riveter',\n", - " ),\n", - " (\n", - " 'Silent',\n", - " '1928-01-01', '1945-12-31',\n", - " 'Conformist yet hardworking; valued stability, discipline, and civic responsibility.',\n", - " 'Artist',\n", - " 'Discipline, craftsmanship, and harmony.',\n", - " 
'Grey Flannel Suit',\n", - " ),\n", - " (\n", - " 'Baby Boomers',\n", - " '1946-01-01', '1964-12-31',\n", - " 'Prosperous postwar generation; shaped modern culture, civil rights, and consumerism.',\n", - " 'Prophet',\n", - " 'Purpose, moral vision, and self-expression.',\n", - " 'Woodstock Dove',\n", - " ),\n", - " (\n", - " 'Gen X',\n", - " '1965-01-01', '1980-12-31',\n", - " 'Independent and skeptical; adapted to globalization and the digital revolution.',\n", - " 'Nomad',\n", - " 'Self-reliance, adaptability, and realism.',\n", - " 'MTV Logo',\n", - " ),\n", - " (\n", - " 'Gen Y',\n", - " '1981-01-01', '1996-12-31',\n", - " 'Millennials; tech-savvy, idealistic, collaborative, and shaped by the internet age.',\n", - " 'Hero',\n", - " 'Collaboration, inclusion, and empowerment.',\n", - " 'iPhone',\n", - " ),\n", - " (\n", - " 'Gen Z',\n", - " '1997-01-01', '2012-12-31',\n", - " 'Digital natives; diverse, socially conscious, and fluent in online culture.',\n", - " 'Artist',\n", - " 'Authenticity, empathy, and self-identity.',\n", - " 'TikTok Logo',\n", - " ),\n", - " (\n", - " 'Gen Alpha',\n", - " '2013-01-01', '2025-12-31',\n", - " 'Born into AI and automation; hyper-connected and globally aware from birth.',\n", - " 'Prophet',\n", - " 'Innovation, stewardship, and global vision.',\n", - " 'AI Assistant',\n", - " ),\n", - " ]\n" - ] + "source": "@schema\nclass Generation(dj.Lookup):\n definition = \"\"\"\n generation: varchar(16)\n ---\n dob_start: date\n dob_end: date\n characteristic: varchar(255)\n archetype: varchar(16)\n life_principle: varchar(255)\n symbol: varchar(64)\n \"\"\"\n\n contents = [\n (\n 'Lost Generation',\n '1883-01-01', '1900-12-31',\n 'Disillusioned by WWI; sought meaning through art and modernism.',\n 'Nomad',\n 'Search for authenticity amid disillusionment.',\n 'The Great Gatsby',\n ),\n (\n 'Greatest',\n '1901-01-01', '1927-12-31',\n 'Grew up through Depression and WWII; defined by sacrifice.',\n 'Hero',\n 'Duty, unity, and collective 
purpose.',\n 'Rosie the Riveter',\n ),\n (\n 'Silent',\n '1928-01-01', '1945-12-31',\n 'Conformist yet hardworking; valued stability and discipline.',\n 'Artist',\n 'Discipline, craftsmanship, and harmony.',\n 'Grey Flannel Suit',\n ),\n (\n 'Baby Boomers',\n '1946-01-01', '1964-12-31',\n 'Prosperous postwar generation; shaped culture and consumerism.',\n 'Prophet',\n 'Purpose, moral vision, and self-expression.',\n 'Woodstock Dove',\n ),\n (\n 'Gen X',\n '1965-01-01', '1980-12-31',\n 'Independent and skeptical; adapted to digital revolution.',\n 'Nomad',\n 'Self-reliance, adaptability, and realism.',\n 'MTV Logo',\n ),\n (\n 'Gen Y',\n '1981-01-01', '1996-12-31',\n 'Millennials; tech-savvy, idealistic, shaped by internet age.',\n 'Hero',\n 'Collaboration, inclusion, and empowerment.',\n 'iPhone',\n ),\n (\n 'Gen Z',\n '1997-01-01', '2012-12-31',\n 'Digital natives; diverse, socially conscious, online fluent.',\n 'Artist',\n 'Authenticity, empathy, and self-identity.',\n 'TikTok Logo',\n ),\n (\n 'Gen Alpha',\n '2013-01-01', '2025-12-31',\n 'Born into AI; hyper-connected and globally aware from birth.',\n 'Prophet',\n 'Innovation, stewardship, and global vision.',\n 'AI Assistant',\n ),\n ]" }, { "cell_type": "code", @@ -935,4 +855,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file diff --git a/book/80-examples/050-languages.ipynb b/book/80-examples/050-languages.ipynb index 07a762d..90d6f13 100644 --- a/book/80-examples/050-languages.ipynb +++ b/book/80-examples/050-languages.ipynb @@ -1,2048 +1,1983 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Languages \n", - "\n", - "This example demonstrates a classic many-to-many relationship design using an association table. We'll track people and the languages they speak, along with their proficiency levels using the internationally recognized CEFR standard.\n", - "\n", - "## Database Schema\n", - "\n", - "The database consists of four tables:\n", - "1. 
**Language** - A lookup table containing ISO 639-1 language codes and names\n", - "2. **CEFRLevel** - A lookup table defining the six CEFR proficiency levels (A1-C2)\n", - "3. **Person** - Individual people with basic information\n", - "4. **Proficiency** - An association table linking people, languages, and proficiency levels\n", - "\n", - "This design demonstrates:\n", - "- **Many-to-many relationship**: Each person can speak multiple languages, and each language is spoken by multiple people\n", - "- **Lookup tables**: Both Language and CEFRLevel are lookup tables with predefined, standardized content\n", - "- **Association table with multiple foreign keys**: Proficiency references Person, Language, and CEFRLevel\n", - "- **Normalization**: CEFR levels are stored in their own table with additional metadata (descriptions, categories)\n", - "- **International standards**: Uses ISO 639-1 codes for languages and CEFR levels for proficiency\n", - "\n", - "## Language Codes: ISO 639-1 Standard\n", - "\n", - "We use **ISO 639-1** language codes, which are the international standard for representing languages. These codes provide:\n", - "\n", - "### Background\n", - "- **ISO 639-1** is part of the ISO 639 series of standards for language codes\n", - "- Established by the International Organization for Standardization (ISO)\n", - "- Provides two-letter codes for major world languages\n", - "- Used globally in software, databases, and international systems\n", - "\n", - "### Benefits of ISO 639-1 Codes\n", - "1. **International Standard**: Recognized worldwide across industries\n", - "2. **Consistent**: Two-letter format ensures uniform representation\n", - "3. **Comprehensive**: Covers major languages with official status\n", - "4. **Future-proof**: Maintained and updated by ISO\n", - "5. 
**Integration**: Compatible with web standards (HTML lang attributes, etc.)\n", - "\n", - "### Examples of ISO 639-1 Codes\n", - "- `en` - English\n", - "- `es` - Spanish \n", - "- `fr` - French\n", - "- `de` - German\n", - "- `ja` - Japanese\n", - "- `zh` - Chinese\n", - "- `ar` - Arabic\n", - "- `hi` - Hindi\n", - "\n", - "### Database Design Considerations\n", - "Using standardized codes in our database ensures:\n", - "- **Data consistency** across different systems\n", - "- **Easy integration** with external APIs and services\n", - "- **Future compatibility** with international standards\n", - "- **Reduced ambiguity** compared to custom codes" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Table Definition" - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Exception reporting mode: Minimal\n" - ] - } - ], - "source": [ - "%xmode minimal\n", - "import datajoint as dj\n", - "dj.config['display.limit'] = 6 # keep output concise\n", - "\n", - "# Create schema\n", - "schema = dj.Schema('languages_example')\n", - "\n", - "@schema\n", - "class Language(dj.Lookup):\n", - " definition = \"\"\"\n", - " lang_code : char(2) # ISO 639-1 language code (e.g., 'en', 'es', 'ja')\n", - " ---\n", - " language : varchar(30) # Full language name\n", - " native_name : varchar(50) # Language name in its native script\n", - " \"\"\"\n", - " contents = [\n", - " # Format: (code, language, native_name, family)\n", - " ('ar', 'Arabic', 'العربية'), ('da', 'Danish', 'Dansk'),\n", - " ('de', 'German', 'Deutsch'), ('el', 'Greek', 'Ελληνικά'),\n", - " ('en', 'English', 'English'), ('es', 'Spanish', 'Español'),\n", - " ('fi', 'Finnish', 'Suomi'), ('fr', 'French', 'Français'),\n", - " ('he', 'Hebrew', 'עברית'), ('hi', 'Hindi', 'हिन्दी'),\n", - " ('id', 'Indonesian', 'Bahasa Indonesia'),\n", - " ('it', 'Italian', 'Italiano'), ('ja', 'Japanese', '日本語'),\n", 
- " ('ko', 'Korean', '한국어'), ('ms', 'Malay', 'Bahasa Melayu'),\n", - " ('nl', 'Dutch', 'Nederlands'), ('no', 'Norwegian', 'Norsk'),\n", - " ('ph', 'Filipino', 'Tagalog'), ('pl', 'Polish', 'Polski'),\n", - " ('pt', 'Portuguese', 'Português'), ('ru', 'Russian', 'Русский'),\n", - " ('sa', 'Sanskrit', 'संस्कृतम्'), ('sv', 'Swedish', 'Svenska'),\n", - " ('th', 'Thai', 'ไทย'), ('tr', 'Turkish', 'Türkçe'),\n", - " ('uk', 'Ukrainian', 'Українська'),\n", - " ('vi', 'Vietnamese', 'Tiếng Việt'), ('zh', 'Chinese', '中文')]\n", - "\n", - "@schema\n", - "class CEFRLevel(dj.Lookup):\n", - " definition = \"\"\"\n", - " cefr_level : char(2) # CEFR proficiency level code (A1, A2, B1, B2, C1, C2)\n", - " ---\n", - " level_name : varchar(30) # Full name of the level\n", - " category : enum('Basic', 'Independent', 'Proficient') # User category\n", - " description : varchar(255) # Brief description of abilities at this level\n", - " \"\"\"\n", - " contents = [\n", - " ('A1', 'Beginner', 'Basic', \n", - " 'Can understand and use familiar everyday expressions and very basic phrases'),\n", - " ('A2', 'Elementary', 'Basic',\n", - " 'Can communicate in simple routine tasks requiring direct exchange of information'),\n", - " ('B1', 'Intermediate', 'Independent',\n", - " 'Can deal with most situations while traveling and produce simple connected text'),\n", - " ('B2', 'Upper Intermediate', 'Independent',\n", - " 'Can interact with fluency and spontaneity and produce clear, detailed text'),\n", - " ('C1', 'Advanced', 'Proficient',\n", - " 'Can express ideas fluently and use language flexibly for social and professional purposes'),\n", - " ('C2', 'Mastery', 'Proficient',\n", - " 'Can understand virtually everything and express themselves with precision')\n", - " ]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Benefits of Using ISO 639-1 Codes\n", - "\n", - "### 1. 
**International Compatibility**\n", - "```python\n", - "# These codes work with web standards and international APIs\n", - "web_lang_attr = f'' # e.g., \n", - "api_request = f'https://api.translate.com?lang={lang_code}' # e.g., lang=en\n", - "```\n", - "\n", - "### 2. **Consistent Data Representation**\n", - "```python\n", - "# All systems recognize these codes\n", - "browser_detection = {'en': 'English', 'es': 'Spanish', 'ja': 'Japanese'}\n", - "database_lookup = Language & {'lang_code': 'en'} # Always works\n", - "```\n", - "\n", - "### 3. **Future-Proof Design**\n", - "```python\n", - "# New languages can be added following the same standard\n", - "# ISO maintains and updates the standard regularly\n", - "new_languages = [\n", - " ('sw', 'Swahili', 'Kiswahili'),\n", - " ('th', 'Thai', 'ไทย'),\n", - " ('vi', 'Vietnamese', 'Tiếng Việt')\n", - "]\n", - "```\n", - "\n", - "### 4. **Integration with External Services**\n", - "```python\n", - "# Compatible with translation services, content management systems\n", - "translation_api = f'https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl={lang_code}'\n", - "content_management = f''\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Language Proficiency: CEFR Levels\n", - "\n", - "For measuring language proficiency, we use the **CEFR** (Common European Framework of Reference for Languages), which is the international standard for describing language ability.\n", - "\n", - "### Background\n", - "\n", - "The **Common European Framework of Reference for Languages (CEFR)** was developed by the Council of Europe and published in 2001. 
It provides:\n", - "\n", - "- **Standardized proficiency descriptors** recognized worldwide\n", - "- **Six proficiency levels** from beginner to mastery\n", - "- **Can-do statements** describing practical abilities at each level\n", - "- **Common reference** for language teaching, testing, and certification\n", - "\n", - "### The Six CEFR Levels\n", - "\n", - "The CEFR defines six levels, grouped into three broad categories:\n", - "\n", - "#### **A - Basic User**\n", - "- **A1 (Beginner)**: Can understand and use familiar everyday expressions and very basic phrases. Can introduce themselves and ask and answer simple personal questions.\n", - "- **A2 (Elementary)**: Can understand sentences and frequently used expressions. Can communicate in simple routine tasks requiring direct exchange of information.\n", - "\n", - "#### **B - Independent User**\n", - "- **B1 (Intermediate)**: Can understand main points on familiar matters. Can deal with most situations while traveling. Can produce simple connected text on familiar topics.\n", - "- **B2 (Upper Intermediate)**: Can understand main ideas of complex text. Can interact with native speakers with fluency and spontaneity. Can produce clear, detailed text on various subjects.\n", - "\n", - "#### **C - Proficient User**\n", - "- **C1 (Advanced)**: Can understand a wide range of demanding texts. Can express ideas fluently and spontaneously. Can use language flexibly for social, academic, and professional purposes.\n", - "- **C2 (Mastery)**: Can understand virtually everything heard or read. Can summarize information from different sources. Can express themselves with precision and subtle distinction of meaning.\n", - "\n", - "### Benefits of Using CEFR in Our Database\n", - "\n", - "1. **International Recognition**: CEFR is used by universities, employers, and governments worldwide\n", - "2. **Precise Measurement**: Six levels provide nuanced proficiency assessment\n", - "3. 
**Practical Focus**: Levels describe what learners can actually do with the language\n", - "4. **Career Relevance**: Many job postings and educational programs reference CEFR levels\n", - "5. **Standardized Testing**: Major language tests (TOEFL, IELTS, DELE, etc.) map to CEFR levels\n", - "\n", - "### CEFR Level Equivalencies\n", - "\n", - "Common language tests map to CEFR as follows:\n", - "- **TOEFL iBT**: 42-71 (B1), 72-94 (B2), 95-120 (C1-C2)\n", - "- **IELTS**: 4.0-5.0 (B1), 5.5-6.5 (B2), 7.0-9.0 (C1-C2)\n", - "- **Cambridge**: PET (B1), FCE (B2), CAE (C1), CPE (C2)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Using Language and Proficiency Tables \n", - "\n", - "Let's use the Language table to create a set of persons with different languages they speak." - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "metadata": {}, - "outputs": [], - "source": [ - "@schema\n", - "class Person(dj.Manual):\n", - " definition = \"\"\"\n", - " person_id : int # Unique identifier for each person\n", - " ---\n", - " name : varchar(60) # Person's name\n", - " date_of_birth : date # Date of birth\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class Proficiency(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Person\n", - " -> Language\n", - " ---\n", - " -> CEFRLevel\n", - " \"\"\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Complete Schema Diagram" - ] - }, + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Languages \n", + "\n", + "This example demonstrates a classic many-to-many relationship design using an association table. We'll track people and the languages they speak, along with their proficiency levels using the internationally recognized CEFR standard.\n", + "\n", + "## Database Schema\n", + "\n", + "The database consists of four tables:\n", + "1. **Language** - A lookup table containing ISO 639-1 language codes and names\n", + "2. 
**CEFRLevel** - A lookup table defining the six CEFR proficiency levels (A1-C2)\n", + "3. **Person** - Individual people with basic information\n", + "4. **Proficiency** - An association table linking people, languages, and proficiency levels\n", + "\n", + "This design demonstrates:\n", + "- **Many-to-many relationship**: Each person can speak multiple languages, and each language is spoken by multiple people\n", + "- **Lookup tables**: Both Language and CEFRLevel are lookup tables with predefined, standardized content\n", + "- **Association table with multiple foreign keys**: Proficiency references Person, Language, and CEFRLevel\n", + "- **Normalization**: CEFR levels are stored in their own table with additional metadata (descriptions, categories)\n", + "- **International standards**: Uses ISO 639-1 codes for languages and CEFR levels for proficiency\n", + "\n", + "## Language Codes: ISO 639-1 Standard\n", + "\n", + "We use **ISO 639-1** language codes, which are the international standard for representing languages. These codes provide:\n", + "\n", + "### Background\n", + "- **ISO 639-1** is part of the ISO 639 series of standards for language codes\n", + "- Established by the International Organization for Standardization (ISO)\n", + "- Provides two-letter codes for major world languages\n", + "- Used globally in software, databases, and international systems\n", + "\n", + "### Benefits of ISO 639-1 Codes\n", + "1. **International Standard**: Recognized worldwide across industries\n", + "2. **Consistent**: Two-letter format ensures uniform representation\n", + "3. **Comprehensive**: Covers major languages with official status\n", + "4. **Future-proof**: Maintained and updated by ISO\n", + "5. 
**Integration**: Compatible with web standards (HTML lang attributes, etc.)\n", + "\n", + "### Examples of ISO 639-1 Codes\n", + "- `en` - English\n", + "- `es` - Spanish \n", + "- `fr` - French\n", + "- `de` - German\n", + "- `ja` - Japanese\n", + "- `zh` - Chinese\n", + "- `ar` - Arabic\n", + "- `hi` - Hindi\n", + "\n", + "### Database Design Considerations\n", + "Using standardized codes in our database ensures:\n", + "- **Data consistency** across different systems\n", + "- **Easy integration** with external APIs and services\n", + "- **Future compatibility** with international standards\n", + "- **Reduced ambiguity** compared to custom codes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Table Definition" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "%xmode minimal\nimport datajoint as dj\ndj.config['display.limit'] = 6 # keep output concise\n\n# Create schema\nschema = dj.Schema('languages_example')\n\n@schema\nclass Language(dj.Lookup):\n definition = \"\"\"\n lang_code : char(2) # ISO 639-1 language code (e.g., 'en', 'es', 'ja')\n ---\n language : varchar(30) # Full language name\n native_name : varchar(50) # Language name in its native script\n \"\"\"\n contents = [\n ('ar', 'Arabic', 'العربية'),\n ('da', 'Danish', 'Dansk'),\n ('de', 'German', 'Deutsch'),\n ('el', 'Greek', 'Ελληνικά'),\n ('en', 'English', 'English'),\n ('es', 'Spanish', 'Español'),\n ('fi', 'Finnish', 'Suomi'),\n ('fr', 'French', 'Français'),\n ('he', 'Hebrew', 'עברית'),\n ('hi', 'Hindi', 'हिन्दी'),\n ('id', 'Indonesian', 'Bahasa Indonesia'),\n ('it', 'Italian', 'Italiano'),\n ('ja', 'Japanese', '日本語'),\n ('ko', 'Korean', '한국어'),\n ('ms', 'Malay', 'Bahasa Melayu'),\n ('nl', 'Dutch', 'Nederlands'),\n ('no', 'Norwegian', 'Norsk'),\n ('ph', 'Filipino', 'Tagalog'),\n ('pl', 'Polish', 'Polski'),\n ('pt', 'Portuguese', 'Português'),\n ('ru', 'Russian', 'Русский'),\n ('sa', 'Sanskrit', 'संस्कृतम्'),\n 
('sv', 'Swedish', 'Svenska'),\n ('th', 'Thai', 'ไทย'),\n ('tr', 'Turkish', 'Türkçe'),\n ('uk', 'Ukrainian', 'Українська'),\n ('vi', 'Vietnamese', 'Tiếng Việt'),\n ('zh', 'Chinese', '中文'),\n ]\n\n@schema\nclass CEFRLevel(dj.Lookup):\n definition = \"\"\"\n cefr_level : char(2) # CEFR proficiency level (A1, A2, B1, B2, C1, C2)\n ---\n level_name : varchar(30) # Full name of the level\n category : enum('Basic', 'Independent', 'Proficient') # User category\n description : varchar(255) # Brief description of abilities\n \"\"\"\n contents = [\n ('A1', 'Beginner', 'Basic',\n 'Can understand and use familiar everyday expressions'),\n ('A2', 'Elementary', 'Basic',\n 'Can communicate in simple routine tasks'),\n ('B1', 'Intermediate', 'Independent',\n 'Can deal with most situations while traveling'),\n ('B2', 'Upper Intermediate', 'Independent',\n 'Can interact with fluency and spontaneity'),\n ('C1', 'Advanced', 'Proficient',\n 'Can express ideas fluently for professional purposes'),\n ('C2', 'Mastery', 'Proficient',\n 'Can understand virtually everything heard or read'),\n ]" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Benefits of Using ISO 639-1 Codes\n", + "\n", + "### 1. **International Compatibility**\n", + "```python\n", + "# These codes work with web standards and international APIs\n", + "web_lang_attr = f'' # e.g., \n", + "api_request = f'https://api.translate.com?lang={lang_code}' # e.g., lang=en\n", + "```\n", + "\n", + "### 2. **Consistent Data Representation**\n", + "```python\n", + "# All systems recognize these codes\n", + "browser_detection = {'en': 'English', 'es': 'Spanish', 'ja': 'Japanese'}\n", + "database_lookup = Language & {'lang_code': 'en'} # Always works\n", + "```\n", + "\n", + "### 3. 
**Future-Proof Design**\n", + "```python\n", + "# New languages can be added following the same standard\n", + "# ISO maintains and updates the standard regularly\n", + "new_languages = [\n", + " ('sw', 'Swahili', 'Kiswahili'),\n", + " ('th', 'Thai', 'ไทย'),\n", + " ('vi', 'Vietnamese', 'Tiếng Việt')\n", + "]\n", + "```\n", + "\n", + "### 4. **Integration with External Services**\n", + "```python\n", + "# Compatible with translation services, content management systems\n", + "translation_api = f'https://translate.googleapis.com/translate_a/single?client=gtx&sl=en&tl={lang_code}'\n", + "content_management = f''\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Language Proficiency: CEFR Levels\n", + "\n", + "For measuring language proficiency, we use the **CEFR** (Common European Framework of Reference for Languages), which is the international standard for describing language ability.\n", + "\n", + "### Background\n", + "\n", + "The **Common European Framework of Reference for Languages (CEFR)** was developed by the Council of Europe and published in 2001. It provides:\n", + "\n", + "- **Standardized proficiency descriptors** recognized worldwide\n", + "- **Six proficiency levels** from beginner to mastery\n", + "- **Can-do statements** describing practical abilities at each level\n", + "- **Common reference** for language teaching, testing, and certification\n", + "\n", + "### The Six CEFR Levels\n", + "\n", + "The CEFR defines six levels, grouped into three broad categories:\n", + "\n", + "#### **A - Basic User**\n", + "- **A1 (Beginner)**: Can understand and use familiar everyday expressions and very basic phrases. Can introduce themselves and ask and answer simple personal questions.\n", + "- **A2 (Elementary)**: Can understand sentences and frequently used expressions. 
Can communicate in simple routine tasks requiring direct exchange of information.\n", + "\n", + "#### **B - Independent User**\n", + "- **B1 (Intermediate)**: Can understand main points on familiar matters. Can deal with most situations while traveling. Can produce simple connected text on familiar topics.\n", + "- **B2 (Upper Intermediate)**: Can understand main ideas of complex text. Can interact with native speakers with fluency and spontaneity. Can produce clear, detailed text on various subjects.\n", + "\n", + "#### **C - Proficient User**\n", + "- **C1 (Advanced)**: Can understand a wide range of demanding texts. Can express ideas fluently and spontaneously. Can use language flexibly for social, academic, and professional purposes.\n", + "- **C2 (Mastery)**: Can understand virtually everything heard or read. Can summarize information from different sources. Can express themselves with precision and subtle distinction of meaning.\n", + "\n", + "### Benefits of Using CEFR in Our Database\n", + "\n", + "1. **International Recognition**: CEFR is used by universities, employers, and governments worldwide\n", + "2. **Precise Measurement**: Six levels provide nuanced proficiency assessment\n", + "3. **Practical Focus**: Levels describe what learners can actually do with the language\n", + "4. **Career Relevance**: Many job postings and educational programs reference CEFR levels\n", + "5. **Standardized Testing**: Major language tests (TOEFL, IELTS, DELE, etc.) 
map to CEFR levels\n", + "\n", + "### CEFR Level Equivalencies\n", + "\n", + "Common language tests map to CEFR as follows:\n", + "- **TOEFL iBT**: 42-71 (B1), 72-94 (B2), 95-120 (C1-C2)\n", + "- **IELTS**: 4.0-5.0 (B1), 5.5-6.5 (B2), 7.0-9.0 (C1-C2)\n", + "- **Cambridge**: PET (B1), FCE (B2), CAE (C1), CPE (C2)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Using Language and Proficiency Tables \n", + "\n", + "Let's use the Language and CEFRLevel tables to record which languages each person speaks and at what proficiency level." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [], + "source": [ + "@schema\n", + "class Person(dj.Manual):\n", + " definition = \"\"\"\n", + " person_id : int # Unique identifier for each person\n", + " ---\n", + " name : varchar(60) # Person's name\n", + " date_of_birth : date # Date of birth\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class Proficiency(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Person\n", + " -> Language\n", + " ---\n", + " -> CEFRLevel\n", + " \"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Complete Schema Diagram" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 36, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "CEFRLevel\n", - "\n", - "\n", - "CEFRLevel\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Proficiency\n", - "\n", - "\n", - "Proficiency\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "CEFRLevel->Proficiency\n", - "\n", - "\n", - "\n", - "\n", - "Language\n", - "\n", - "\n", - "Language\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Language->Proficiency\n", - "\n", - "\n", - "\n", - "\n", - "Person\n", - "\n", - "\n", - "Person\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Person->Proficiency\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - 
}, - "execution_count": 36, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "CEFRLevel\n", + "\n", + "\n", + "CEFRLevel\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Proficiency\n", + "\n", + "\n", + "Proficiency\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "CEFRLevel->Proficiency\n", + "\n", + "\n", + "\n", + "\n", + "Language\n", + "\n", + "\n", + "Language\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Language->Proficiency\n", + "\n", + "\n", + "\n", + "\n", + "Person\n", + "\n", + "\n", + "Person\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Person->Proficiency\n", + "\n", + "\n", + "\n", + "" ], - "source": [ - "dj.Diagram(schema)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Populating the Person and Proficiency data\n", - "\n", - "Let's use Faker to generate realistic sample data for our Person table:\n" + "text/plain": [ + "" ] - }, + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dj.Diagram(schema)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Populating the Person and Proficiency data\n", + "\n", + "Let's use Faker to generate realistic sample data for our Person table:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 37, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

person_id

\n", - " Unique identifier for each person\n", - "
\n", - "

name

\n", - " Person's name\n", - "
\n", - "

date_of_birth

\n", - " Date of birth\n", - "
0Allison Hill1958-11-03
1Megan Mcclain1950-04-03
2Allen Robinson1976-08-12
3Cristian Santos1979-02-09
4Kevin Pacheco1945-03-09
5Melissa Peterson1954-07-28
\n", - "

...

\n", - "

Total: 500

\n", - " " - ], - "text/plain": [ - "*person_id name date_of_birth \n", - "+-----------+ +------------+ +------------+\n", - "0 Allison Hill 1958-11-03 \n", - "1 Megan Mcclain 1950-04-03 \n", - "2 Allen Robinson 1976-08-12 \n", - "3 Cristian Santo 1979-02-09 \n", - "4 Kevin Pacheco 1945-03-09 \n", - "5 Melissa Peters 1954-07-28 \n", - " ...\n", - " (Total: 500)" - ] - }, - "execution_count": 37, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

person_id

\n", + " Unique identifier for each person\n", + "
\n", + "

name

\n", + " Person's name\n", + "
\n", + "

date_of_birth

\n", + " Date of birth\n", + "
0Allison Hill1958-11-03
1Megan Mcclain1950-04-03
2Allen Robinson1976-08-12
3Cristian Santos1979-02-09
4Kevin Pacheco1945-03-09
5Melissa Peterson1954-07-28
\n", + "

...

\n", + "

Total: 500

\n", + " " ], - "source": [ - "# Generate sample people data using Faker\n", - "import numpy as np\n", - "from faker import Faker\n", - "\n", - "fake = Faker()\n", - "\n", - "# Set seed for reproducible results\n", - "np.random.seed(42)\n", - "fake.seed_instance(42)\n", - "\n", - "# Generate n people with diverse backgrounds\n", - "n = 500 # number of people to generate\n", - "Person.insert(\n", - " {\n", - " 'person_id': i,\n", - " 'name': fake.name(),\n", - " 'date_of_birth': fake.date_of_birth(minimum_age=18, maximum_age=80)\n", - " } for i in range(n))\n", - "\n", - "Person()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now let's create random language fluency data by assigning people various language skills:" + "text/plain": [ + "*person_id name date_of_birth \n", + "+-----------+ +------------+ +------------+\n", + "0 Allison Hill 1958-11-03 \n", + "1 Megan Mcclain 1950-04-03 \n", + "2 Allen Robinson 1976-08-12 \n", + "3 Cristian Santo 1979-02-09 \n", + "4 Kevin Pacheco 1945-03-09 \n", + "5 Melissa Peters 1954-07-28 \n", + " ...\n", + " (Total: 500)" ] - }, + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Generate sample people data using Faker\n", + "import numpy as np\n", + "from faker import Faker\n", + "\n", + "fake = Faker()\n", + "\n", + "# Set seed for reproducible results\n", + "np.random.seed(42)\n", + "fake.seed_instance(42)\n", + "\n", + "# Generate n people with diverse backgrounds\n", + "n = 500 # number of people to generate\n", + "Person.insert(\n", + " {\n", + " 'person_id': i,\n", + " 'name': fake.name(),\n", + " 'date_of_birth': fake.date_of_birth(minimum_age=18, maximum_age=80)\n", + " } for i in range(n))\n", + "\n", + "Person()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's create random language fluency data by assigning people various language skills:" + ] + }, + { + "cell_type": "code", + "execution_count": 
38, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 38, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

person_id

\n", - " Unique identifier for each person\n", - "
\n", - "

lang_code

\n", - " ISO 639-1 language code (e.g., 'en', 'es', 'ja')\n", - "
\n", - "

cefr_level

\n", - " CEFR proficiency level code (A1, A2, B1, B2, C1, C2)\n", - "
0koC1
0saC2
0trB2
0viA2
1enC2
1hiC2
\n", - "

...

\n", - "

Total: 1262

\n", - " " - ], - "text/plain": [ - "*person_id *lang_code cefr_level \n", - "+-----------+ +-----------+ +------------+\n", - "0 ko C1 \n", - "0 sa C2 \n", - "0 tr B2 \n", - "0 vi A2 \n", - "1 en C2 \n", - "1 hi C2 \n", - " ...\n", - " (Total: 1262)" - ] - }, - "execution_count": 38, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

person_id

\n", + " Unique identifier for each person\n", + "
\n", + "

lang_code

\n", + " ISO 639-1 language code (e.g., 'en', 'es', 'ja')\n", + "
\n", + "

cefr_level

\n", + " CEFR proficiency level code (A1, A2, B1, B2, C1, C2)\n", + "
0koC1
0saC2
0trB2
0viA2
1enC2
1hiC2
\n", + "

...

\n", + "

Total: 1262

\n", + " " ], - "source": [ - "lang_keys = Language.fetch(\"KEY\")\n", - "cefr_keys = CEFRLevel.fetch(\"KEY\")\n", - "# Weight probabilities: more people at intermediate levels than extremes\n", - "cefr_probabilities = [0.08, 0.12, 0.13, 0.17, 0.20, 0.30]\n", - "average_languages = 2.5\n", - "\n", - "for person_key in Person.fetch(\"KEY\"):\n", - " num_languages = np.random.poisson(average_languages)\n", - " Proficiency.insert(\n", - " {\n", - " **person_key,\n", - " **lang_key,\n", - " **np.random.choice(cefr_keys, p=cefr_probabilities)\n", - " } for lang_key in np.random.choice(lang_keys, num_languages, replace=False)\n", - " )\n", - "\n", - "Proficiency()" + "text/plain": [ + "*person_id *lang_code cefr_level \n", + "+-----------+ +-----------+ +------------+\n", + "0 ko C1 \n", + "0 sa C2 \n", + "0 tr B2 \n", + "0 vi A2 \n", + "1 en C2 \n", + "1 hi C2 \n", + " ...\n", + " (Total: 1262)" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Why CEFR Levels Improve Queries\n", - "\n", - "Using CEFR levels in our database enables more meaningful queries:\n", - "\n", - "**Granular Filtering**: Query for specific proficiency ranges\n", - "```python\n", - "# Find people with working proficiency (B1-B2)\n", - "Proficiency & 'cefr_level in (\"B1\", \"B2\")'\n", - "\n", - "# Find advanced speakers (C1-C2)\n", - "Proficiency & 'cefr_level in (\"C1\", \"C2\")'\n", - "```\n", - "\n", - "**Comparable Across Languages**: CEFR provides consistent measurement\n", - "```python\n", - "# Find people who are intermediate (B1) or better in ANY language\n", - "Person & (Proficiency & 'cefr_level >= \"B1\"')\n", - "```\n", - "\n", - "**Career-Relevant**: Matches real-world job requirements\n", - "```python\n", - "# Many job postings require \"B2 or higher\"\n", - "# Easy to query candidates meeting this criteria\n", - "```\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Sample Queries with Populated Data\n", - "\n", - "Now that 
we have data in all three tables, let's run some example queries:\n" - ] - }, + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "lang_keys = Language.fetch(\"KEY\")\n", + "cefr_keys = CEFRLevel.fetch(\"KEY\")\n", + "# Weight probabilities: higher levels are more likely, with C2 the most common\n", + "cefr_probabilities = [0.08, 0.12, 0.13, 0.17, 0.20, 0.30]\n", + "average_languages = 2.5\n", + "\n", + "for person_key in Person.fetch(\"KEY\"):\n", + " num_languages = np.random.poisson(average_languages)\n", + " Proficiency.insert(\n", + " {\n", + " **person_key,\n", + " **lang_key,\n", + " **np.random.choice(cefr_keys, p=cefr_probabilities)\n", + " } for lang_key in np.random.choice(lang_keys, num_languages, replace=False)\n", + " )\n", + "\n", + "Proficiency()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Why CEFR Levels Improve Queries\n", + "\n", + "Using CEFR levels in our database enables more meaningful queries:\n", + "\n", + "**Granular Filtering**: Query for specific proficiency ranges\n", + "```python\n", + "# Find people with working proficiency (B1-B2)\n", + "Proficiency & 'cefr_level in (\"B1\", \"B2\")'\n", + "\n", + "# Find advanced speakers (C1-C2)\n", + "Proficiency & 'cefr_level in (\"C1\", \"C2\")'\n", + "```\n", + "\n", + "**Comparable Across Languages**: CEFR provides consistent measurement\n", + "```python\n", + "# Find people who are intermediate (B1) or better in ANY language\n", + "Person & (Proficiency & 'cefr_level >= \"B1\"')\n", + "```\n", + "\n", + "**Career-Relevant**: Matches real-world job requirements\n", + "```python\n", + "# Many job postings require \"B2 or higher\"\n", + "# Easy to query candidates meeting these criteria\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Sample Queries with Populated Data\n", + "\n", + "Now that we have data in all four tables, let's run some example queries:\n" + ] + }, + 
{ + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 39, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

person_id

\n", - " Unique identifier for each person\n", - "
\n", - "

name

\n", - " Person's name\n", - "
1Megan Mcclain
7Lindsey Roman
61James Powers
70Taylor Mathis Jr.
78Brittany Spears
97April Mitchell
\n", - "

...

\n", - "

Total: 24

\n", - " " - ], - "text/plain": [ - "*person_id name \n", - "+-----------+ +------------+\n", - "1 Megan Mcclain \n", - "7 Lindsey Roman \n", - "61 James Powers \n", - "70 Taylor Mathis \n", - "78 Brittany Spear\n", - "97 April Mitchell\n", - " ...\n", - " (Total: 24)" - ] - }, - "execution_count": 39, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

person_id

\n", + " Unique identifier for each person\n", + "
\n", + "

name

\n", + " Person's name\n", + "
1Megan Mcclain
7Lindsey Roman
61James Powers
70Taylor Mathis Jr.
78Brittany Spears
97April Mitchell
\n", + "

...

\n", + "

Total: 24

\n", + " " ], - "source": [ - "# Query 1: Find the names of all proficient English speakers (C1 or C2)\n", - "Person.proj('name') & (Proficiency & {'lang_code': 'en'} & 'cefr_level in (\"C1\", \"C2\")')" + "text/plain": [ + "*person_id name \n", + "+-----------+ +------------+\n", + "1 Megan Mcclain \n", + "7 Lindsey Roman \n", + "61 James Powers \n", + "70 Taylor Mathis \n", + "78 Brittany Spear\n", + "97 April Mitchell\n", + " ...\n", + " (Total: 24)" ] - }, + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 1: Find the names of all proficient English speakers (C1 or C2)\n", + "Person.proj('name') & (Proficiency & {'lang_code': 'en'} & 'cefr_level in (\"C1\", \"C2\")')" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

person_id

\n", - " Unique identifier for each person\n", - "
\n", - "

name

\n", - " Person's name\n", - "
1Megan Mcclain
2Allen Robinson
7Lindsey Roman
10Amber Perez
11David Garcia
14Nicholas Martin
\n", - "

...

\n", - "

Total: 92

\n", - " " - ], - "text/plain": [ - "*person_id name \n", - "+-----------+ +------------+\n", - "1 Megan Mcclain \n", - "2 Allen Robinson\n", - "7 Lindsey Roman \n", - "10 Amber Perez \n", - "11 David Garcia \n", - "14 Nicholas Marti\n", - " ...\n", - " (Total: 92)" - ] - }, - "execution_count": 40, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

person_id

\n", + " Unique identifier for each person\n", + "
\n", + "

name

\n", + " Person's name\n", + "
1Megan Mcclain
2Allen Robinson
7Lindsey Roman
10Amber Perez
11David Garcia
14Nicholas Martin
\n", + "

...

\n", + "

Total: 92

\n", + " " ], - "source": [ - "# Query 2: Names of people who speak English or Spanish at any level \n", - "Person.proj('name') & (Proficiency & 'lang_code in (\"en\", \"es\")')" + "text/plain": [ + "*person_id name \n", + "+-----------+ +------------+\n", + "1 Megan Mcclain \n", + "2 Allen Robinson\n", + "7 Lindsey Roman \n", + "10 Amber Perez \n", + "11 David Garcia \n", + "14 Nicholas Marti\n", + " ...\n", + " (Total: 92)" ] - }, + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 2: Names of people who speak English or Spanish at any level \n", + "Person.proj('name') & (Proficiency & 'lang_code in (\"en\", \"es\")')" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 41, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

person_id

\n", - " Unique identifier for each person\n", - "
\n", - "

name

\n", - " Person's name\n", - "
200Jordan Morris
262Bobby Franklin
300Sabrina Briggs
334Joseph Burch
416William Becker
484Christopher Medina
\n", - " \n", - "

Total: 6

\n", - " " - ], - "text/plain": [ - "*person_id name \n", - "+-----------+ +------------+\n", - "200 Jordan Morris \n", - "262 Bobby Franklin\n", - "300 Sabrina Briggs\n", - "334 Joseph Burch \n", - "416 William Becker\n", - "484 Christopher Me\n", - " (Total: 6)" - ] - }, - "execution_count": 41, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

person_id

\n", + " Unique identifier for each person\n", + "
\n", + "

name

\n", + " Person's name\n", + "
200Jordan Morris
262Bobby Franklin
300Sabrina Briggs
334Joseph Burch
416William Becker
484Christopher Medina
\n", + " \n", + "

Total: 6

\n", + " " ], - "source": [ - "# Query 3: Names of people who speak English AND Spanish at any level \n", - "Person.proj('name') & (Proficiency & {'lang_code': 'en'}) & (Proficiency & {'lang_code': 'es'}) " + "text/plain": [ + "*person_id name \n", + "+-----------+ +------------+\n", + "200 Jordan Morris \n", + "262 Bobby Franklin\n", + "300 Sabrina Briggs\n", + "334 Joseph Burch \n", + "416 William Becker\n", + "484 Christopher Me\n", + " (Total: 6)" ] - }, + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 3: Names of people who speak English AND Spanish at any level \n", + "Person.proj('name') & (Proficiency & {'lang_code': 'en'}) & (Proficiency & {'lang_code': 'es'}) " + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 42, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

person_id

\n", - " Unique identifier for each person\n", - "
\n", - "

name

\n", - " Person's name\n", - "
\n", - "

nlanguages

\n", - " calculated attribute\n", - "
\n", - "

languages

\n", - " calculated attribute\n", - "
0Allison Hill4ko,sa,tr,vi
1Megan Mcclain4en,hi,nl,no
8Valerie Gray5fi,hi,ja,ko,ru
9Lisa Hensley5he,hi,ph,ru,tr
17Daniel Hahn4el,pt,sv,th
18Matthew Foster4da,el,id,nl
\n", - "

...

\n", - "

Total: 130

\n", - " " - ], - "text/plain": [ - "*person_id name nlanguages languages \n", - "+-----------+ +------------+ +------------+ +------------+\n", - "0 Allison Hill 4 ko,sa,tr,vi \n", - "1 Megan Mcclain 4 en,hi,nl,no \n", - "8 Valerie Gray 5 fi,hi,ja,ko,ru\n", - "9 Lisa Hensley 5 he,hi,ph,ru,tr\n", - "17 Daniel Hahn 4 el,pt,sv,th \n", - "18 Matthew Foster 4 da,el,id,nl \n", - " ...\n", - " (Total: 130)" - ] - }, - "execution_count": 42, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

person_id

\n", + " Unique identifier for each person\n", + "
\n", + "

name

\n", + " Person's name\n", + "
\n", + "

nlanguages

\n", + " calculated attribute\n", + "
\n", + "

languages

\n", + " calculated attribute\n", + "
0Allison Hill4ko,sa,tr,vi
1Megan Mcclain4en,hi,nl,no
8Valerie Gray5fi,hi,ja,ko,ru
9Lisa Hensley5he,hi,ph,ru,tr
17Daniel Hahn4el,pt,sv,th
18Matthew Foster4da,el,id,nl
\n", + "

...

\n", + "

Total: 130

\n", + " " ], - "source": [ - "# Query 4: Show the peole who speak at least four languages\n", - "Person.aggr(Proficiency, 'name',\n", - " nlanguages='count(lang_code)', languages='GROUP_CONCAT(lang_code)'\n", - " ) & 'nlanguages >= 4'" + "text/plain": [ + "*person_id name nlanguages languages \n", + "+-----------+ +------------+ +------------+ +------------+\n", + "0 Allison Hill 4 ko,sa,tr,vi \n", + "1 Megan Mcclain 4 en,hi,nl,no \n", + "8 Valerie Gray 5 fi,hi,ja,ko,ru\n", + "9 Lisa Hensley 5 he,hi,ph,ru,tr\n", + "17 Daniel Hahn 4 el,pt,sv,th \n", + "18 Matthew Foster 4 da,el,id,nl \n", + " ...\n", + " (Total: 130)" ] - }, + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 4: Show the people who speak at least four languages\n", + "Person.aggr(Proficiency, 'name',\n", + " nlanguages='count(lang_code)', languages='GROUP_CONCAT(lang_code)'\n", + " ) & 'nlanguages >= 4'" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 43, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

person_id

\n", - " Unique identifier for each person\n", - "
\n", - "

name

\n", - " Person's name\n", - "
\n", - "

nlanguages

\n", - " calculated attribute\n", - "
200Jordan Morris7
251Andrea Hubbard7
333Victoria Murray7
\n", - " \n", - "

Total: 3

\n", - " " - ], - "text/plain": [ - "*person_id name nlanguages \n", - "+-----------+ +------------+ +------------+\n", - "200 Jordan Morris 7 \n", - "251 Andrea Hubbard 7 \n", - "333 Victoria Murra 7 \n", - " (Total: 3)" - ] - }, - "execution_count": 43, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

person_id

\n", + " Unique identifier for each person\n", + "
\n", + "

name

\n", + " Person's name\n", + "
\n", + "

nlanguages

\n", + " calculated attribute\n", + "
200Jordan Morris7
251Andrea Hubbard7
333Victoria Murray7
\n", + " \n", + "

Total: 3

\n", + " " ], - "source": [ - "# Query 5: Show the top 3 people by number of languages spoken\n", - "Person.aggr(Proficiency, 'name',\n", - " nlanguages='count(lang_code)'\n", - " ) & dj.Top(3, order_by='nlanguages desc')" + "text/plain": [ + "*person_id name nlanguages \n", + "+-----------+ +------------+ +------------+\n", + "200 Jordan Morris 7 \n", + "251 Andrea Hubbard 7 \n", + "333 Victoria Murra 7 \n", + " (Total: 3)" ] - }, + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 5: Show the top 3 people by number of languages spoken\n", + "Person.aggr(Proficiency, 'name',\n", + " nlanguages='count(lang_code)'\n", + " ) & dj.Top(3, order_by='nlanguages desc')" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 45, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

person_id

\n", - " Unique identifier for each person\n", - "
\n", - "

name

\n", - " Person's name\n", - "
\n", - "

date_of_birth

\n", - " Date of birth\n", - "
0Allison Hill1958-11-03
1Megan Mcclain1950-04-03
2Allen Robinson1976-08-12
7Lindsey Roman1990-10-01
9Lisa Hensley1981-02-23
10Amber Perez1963-01-03
\n", - "

...

\n", - "

Total: 114

\n", - " " - ], - "text/plain": [ - "*person_id name date_of_birth \n", - "+-----------+ +------------+ +------------+\n", - "0 Allison Hill 1958-11-03 \n", - "1 Megan Mcclain 1950-04-03 \n", - "2 Allen Robinson 1976-08-12 \n", - "7 Lindsey Roman 1990-10-01 \n", - "9 Lisa Hensley 1981-02-23 \n", - "10 Amber Perez 1963-01-03 \n", - " ...\n", - " (Total: 114)" - ] - }, - "execution_count": 45, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

person_id

\n", + " Unique identifier for each person\n", + "
\n", + "

name

\n", + " Person's name\n", + "
\n", + "

date_of_birth

\n", + " Date of birth\n", + "
0Allison Hill1958-11-03
1Megan Mcclain1950-04-03
2Allen Robinson1976-08-12
7Lindsey Roman1990-10-01
9Lisa Hensley1981-02-23
10Amber Perez1963-01-03
\n", + "

...

\n", + "

Total: 114

\n", + " " ], - "source": [ - "# Query 6: Show all the people Lindsay Roman (person_id=7) can communicate wtih\n", - "\n", - "Person & (\n", - " Proficiency * Proficiency.proj(other_person='person_id') & {'other_person': 7}\n", - " )\n" + "text/plain": [ + "*person_id name date_of_birth \n", + "+-----------+ +------------+ +------------+\n", + "0 Allison Hill 1958-11-03 \n", + "1 Megan Mcclain 1950-04-03 \n", + "2 Allen Robinson 1976-08-12 \n", + "7 Lindsey Roman 1990-10-01 \n", + "9 Lisa Hensley 1981-02-23 \n", + "10 Amber Perez 1963-01-03 \n", + " ...\n", + " (Total: 114)" ] - }, + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 6: Show all the people Lindsey Roman (person_id=7) can communicate with\n", + "\n", + "Person & (\n", + " Proficiency * Proficiency.proj(other_person='person_id') & {'other_person': 7}\n", + " )\n" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 46, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

person_id

\n", - " Unique identifier for each person\n", - "
\n", - "

name

\n", - " Person's name\n", - "
14Nicholas Martin
15Margaret Hawkins DDS
33Angelica Tucker
39Crystal Robinson
41David Caldwell
47Javier Ramirez
\n", - "

...

\n", - "

Total: 44

\n", - " " - ], - "text/plain": [ - "*person_id name \n", - "+-----------+ +------------+\n", - "14 Nicholas Marti\n", - "15 Margaret Hawki\n", - "33 Angelica Tucke\n", - "39 Crystal Robins\n", - "41 David Caldwell\n", - "47 Javier Ramirez\n", - " ...\n", - " (Total: 44)" - ] - }, - "execution_count": 46, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

person_id

\n", + " Unique identifier for each person\n", + "
\n", + "

name

\n", + " Person's name\n", + "
14Nicholas Martin
15Margaret Hawkins DDS
33Angelica Tucker
39Crystal Robinson
41David Caldwell
47Javier Ramirez
\n", + "

...

\n", + "

Total: 44

\n", + " " ], - "source": [ - "# Query 7: Find people with at least intermediate proficiency (B1+) in Spanish\n", - "Person.proj('name') & (Proficiency & {'lang_code': 'es'} & 'cefr_level >= \"B1\"')\n" + "text/plain": [ + "*person_id name \n", + "+-----------+ +------------+\n", + "14 Nicholas Marti\n", + "15 Margaret Hawki\n", + "33 Angelica Tucke\n", + "39 Crystal Robins\n", + "41 David Caldwell\n", + "47 Javier Ramirez\n", + " ...\n", + " (Total: 44)" ] - }, + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 7: Find people with at least intermediate proficiency (B1+) in Spanish\n", + "Person.proj('name') & (Proficiency & {'lang_code': 'es'} & 'cefr_level >= \"B1\"')\n" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 47, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

lang_code

\n", - " ISO 639-1 language code (e.g., 'en', 'es', 'ja')\n", - "
\n", - "

language

\n", - " Full language name\n", - "
\n", - "

nspeakers

\n", - " calculated attribute\n", - "
arArabic43
daDanish41
deGerman41
elGreek37
enEnglish48
esSpanish50
\n", - "

...

\n", - "

Total: 28

\n", - " " - ], - "text/plain": [ - "*lang_code language nspeakers \n", - "+-----------+ +----------+ +-----------+\n", - "ar Arabic 43 \n", - "da Danish 41 \n", - "de German 41 \n", - "el Greek 37 \n", - "en English 48 \n", - "es Spanish 50 \n", - " ...\n", - " (Total: 28)" - ] - }, - "execution_count": 47, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

lang_code

\n", + " ISO 639-1 language code (e.g., 'en', 'es', 'ja')\n", + "
\n", + "

language

\n", + " Full language name\n", + "
\n", + "

nspeakers

\n", + " calculated attribute\n", + "
arArabic43
daDanish41
deGerman41
elGreek37
enEnglish48
esSpanish50
\n", + "

...

\n", + "

Total: 28

\n", + " " ], - "source": [ - "# Query 8: Show all languages and the number of people who speak them\n", - "Language.aggr(Proficiency, 'language', nspeakers='count(person_id)')" + "text/plain": [ + "*lang_code language nspeakers \n", + "+-----------+ +----------+ +-----------+\n", + "ar Arabic 43 \n", + "da Danish 41 \n", + "de German 41 \n", + "el Greek 37 \n", + "en English 48 \n", + "es Spanish 50 \n", + " ...\n", + " (Total: 28)" ] - }, + }, + "execution_count": 47, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 8: Show all languages and the number of people who speak them\n", + "Language.aggr(Proficiency, 'language', nspeakers='count(person_id)')" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 48, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

cefr_level

\n", - " CEFR proficiency level code (A1, A2, B1, B2, C1, C2)\n", - "
\n", - "

nspeakers

\n", - " calculated attribute\n", - "
A16
A24
B14
B210
C17
C217
\n", - " \n", - "

Total: 6

\n", - " " - ], - "text/plain": [ - "*cefr_level nspeakers \n", - "+------------+ +-----------+\n", - "A1 6 \n", - "A2 4 \n", - "B1 4 \n", - "B2 10 \n", - "C1 7 \n", - "C2 17 \n", - " (Total: 6)" - ] - }, - "execution_count": 48, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

cefr_level

\n", + " CEFR proficiency level code (A1, A2, B1, B2, C1, C2)\n", + "
\n", + "

nspeakers

\n", + " calculated attribute\n", + "
A16
A24
B14
B210
C17
C217
\n", + " \n", + "

Total: 6

\n", + " " ], - "source": [ - "# Query 9: Count people at each CEFR level for English\n", - "(CEFRLevel).aggr(Proficiency & {'lang_code': 'en'}, nspeakers='count(person_id)')\n" + "text/plain": [ + "*cefr_level nspeakers \n", + "+------------+ +-----------+\n", + "A1 6 \n", + "A2 4 \n", + "B1 4 \n", + "B2 10 \n", + "C1 7 \n", + "C2 17 \n", + " (Total: 6)" ] - }, + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 9: Count people at each CEFR level for English\n", + "(CEFRLevel).aggr(Proficiency & {'lang_code': 'en'}, nspeakers='count(person_id)')\n" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 50, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

lang_code

\n", - " ISO 639-1 language code (e.g., 'en', 'es', 'ja')\n", - "
\n", - "

language

\n", - " Full language name\n", - "
\n", - "

nspeakers

\n", - " calculated attribute\n", - "
frFrench57
noNorwegian55
thThai53
\n", - " \n", - "

Total: 3

\n", - " " - ], - "text/plain": [ - "*lang_code language nspeakers \n", - "+-----------+ +-----------+ +-----------+\n", - "fr French 57 \n", - "no Norwegian 55 \n", - "th Thai 53 \n", - " (Total: 3)" - ] - }, - "execution_count": 50, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

lang_code

\n", + " ISO 639-1 language code (e.g., 'en', 'es', 'ja')\n", + "
\n", + "

language

\n", + " Full language name\n", + "
\n", + "

nspeakers

\n", + " calculated attribute\n", + "
frFrench57
noNorwegian55
thThai53
\n", + " \n", + "

Total: 3

\n", + " " ], - "source": [ - "# Query 10: Show the top 3 languages by number of speakers\n", - "Language.aggr(Proficiency, 'language', nspeakers='count(person_id)') & dj.Top(3, order_by='nspeakers desc')" + "text/plain": [ + "*lang_code language nspeakers \n", + "+-----------+ +-----------+ +-----------+\n", + "fr French 57 \n", + "no Norwegian 55 \n", + "th Thai 53 \n", + " (Total: 3)" ] - }, + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 10: Show the top 3 languages by number of speakers\n", + "Language.aggr(Proficiency, 'language', nspeakers='count(person_id)') & dj.Top(3, order_by='nspeakers desc')" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 54, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

person_id

\n", - " Unique identifier for each person\n", - "
\n", - "

lang_code

\n", - " ISO 639-1 language code (e.g., 'en', 'es', 'ja')\n", - "
\n", - "

cefr_level

\n", - " CEFR proficiency level code (A1, A2, B1, B2, C1, C2)\n", - "
\n", - "

name

\n", - " Person's name\n", - "
\n", - "

language

\n", - " Full language name\n", - "
\n", - "

level_name

\n", - " Full name of the level\n", - "
\n", - "

category

\n", - " User category\n", - "
\n", - "

description

\n", - " Brief description of abilities at this level\n", - "
0koC1Allison HillKoreanAdvancedProficientCan express ideas fluently and use language flexibly for social and professional purposes
0saC2Allison HillSanskritMasteryProficientCan understand virtually everything and express themselves with precision
0trB2Allison HillTurkishUpper IntermediateIndependentCan interact with fluency and spontaneity and produce clear, detailed text
0viA2Allison HillVietnameseElementaryBasicCan communicate in simple routine tasks requiring direct exchange of information
\n", - " \n", - "

Total: 4

\n", - " " - ], - "text/plain": [ - "*person_id *lang_code *cefr_level name language level_name category description \n", - "+-----------+ +-----------+ +------------+ +------------+ +------------+ +------------+ +------------+ +------------+\n", - "0 ko C1 Allison Hill Korean Advanced Proficient Can express id\n", - "0 sa C2 Allison Hill Sanskrit Mastery Proficient Can understand\n", - "0 tr B2 Allison Hill Turkish Upper Intermed Independent Can interact w\n", - "0 vi A2 Allison Hill Vietnamese Elementary Basic Can communicat\n", - " (Total: 4)" - ] - }, - "execution_count": 54, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

person_id

\n", + " Unique identifier for each person\n", + "
\n", + "

lang_code

\n", + " ISO 639-1 language code (e.g., 'en', 'es', 'ja')\n", + "
\n", + "

cefr_level

\n", + " CEFR proficiency level code (A1, A2, B1, B2, C1, C2)\n", + "
\n", + "

name

\n", + " Person's name\n", + "
\n", + "

language

\n", + " Full language name\n", + "
\n", + "

level_name

\n", + " Full name of the level\n", + "
\n", + "

category

\n", + " User category\n", + "
\n", + "

description

\n", + " Brief description of abilities at this level\n", + "
0koC1Allison HillKoreanAdvancedProficientCan express ideas fluently and use language flexibly for social and professional purposes
0saC2Allison HillSanskritMasteryProficientCan understand virtually everything and express themselves with precision
0trB2Allison HillTurkishUpper IntermediateIndependentCan interact with fluency and spontaneity and produce clear, detailed text
0viA2Allison HillVietnameseElementaryBasicCan communicate in simple routine tasks requiring direct exchange of information
\n", + " \n", + "

Total: 4

\n", + " " ], - "source": [ - "# Query 12: Show language skills with full CEFR level descriptions for person 0\n", - "# This demonstrates the benefit of having CEFRLevel as a lookup table\n", - "(Person * Proficiency * Language * CEFRLevel).proj(\n", - " 'name', 'language', 'level_name', 'category', 'description'\n", - ") & {'person_id': 0}\n" + "text/plain": [ + "*person_id *lang_code *cefr_level name language level_name category description \n", + "+-----------+ +-----------+ +------------+ +------------+ +------------+ +------------+ +------------+ +------------+\n", + "0 ko C1 Allison Hill Korean Advanced Proficient Can express id\n", + "0 sa C2 Allison Hill Sanskrit Mastery Proficient Can understand\n", + "0 tr B2 Allison Hill Turkish Upper Intermed Independent Can interact w\n", + "0 vi A2 Allison Hill Vietnamese Elementary Basic Can communicat\n", + " (Total: 4)" ] - }, + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 12: Show language skills with full CEFR level descriptions for person 0\n", + "# This demonstrates the benefit of having CEFRLevel as a lookup table\n", + "(Person * Proficiency * Language * CEFRLevel).proj(\n", + " 'name', 'language', 'level_name', 'category', 'description'\n", + ") & {'person_id': 0}\n" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 55, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

person_id

\n", - " Unique identifier for each person\n", - "
\n", - "

name

\n", - " Person's name\n", - "
\n", - "

num_mastered

\n", - " calculated attribute\n", - "
0Allison Hill2
1Megan Mcclain4
7Lindsey Roman2
8Valerie Gray2
9Lisa Hensley3
14Nicholas Martin2
\n", - "

...

\n", - "

Total: 184

\n", - " " - ], - "text/plain": [ - "*person_id name num_mastered \n", - "+-----------+ +------------+ +------------+\n", - "0 Allison Hill 2 \n", - "1 Megan Mcclain 4 \n", - "7 Lindsey Roman 2 \n", - "8 Valerie Gray 2 \n", - "9 Lisa Hensley 3 \n", - "14 Nicholas Marti 2 \n", - " ...\n", - " (Total: 184)" - ] - }, - "execution_count": 55, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

person_id

\n", + " Unique identifier for each person\n", + "
\n", + "

name

\n", + " Person's name\n", + "
\n", + "

num_mastered

\n", + " calculated attribute\n", + "
0Allison Hill2
1Megan Mcclain4
7Lindsey Roman2
8Valerie Gray2
9Lisa Hensley3
14Nicholas Martin2
\n", + "

...

\n", + "

Total: 184

\n", + " " ], - "source": [ - "# Query 11: Find polyglots with C1 and C2 (mastery) level in multiple languages\n", - "Person.aggr(\n", - " Proficiency & 'cefr_level>=\"C1\"',\n", - " 'name',\n", - " num_mastered='count(*)'\n", - ") & 'num_mastered >= 2'\n" - ] - }, - { - "cell_type": "code", - "execution_count": 56, - "metadata": {}, - "outputs": [], - "source": [ - "# If you need to re-run this example, you can drop the schema by uncommenting the following line:\n", - "\n", - "# schema.drop() " + "text/plain": [ + "*person_id name num_mastered \n", + "+-----------+ +------------+ +------------+\n", + "0 Allison Hill 2 \n", + "1 Megan Mcclain 4 \n", + "7 Lindsey Roman 2 \n", + "8 Valerie Gray 2 \n", + "9 Lisa Hensley 3 \n", + "14 Nicholas Marti 2 \n", + " ...\n", + " (Total: 184)" ] + }, + "execution_count": 55, + "metadata": {}, + "output_type": "execute_result" } - ], - "metadata": { - "kernelspec": { - "display_name": "base", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.2" - } + ], + "source": [ + "# Query 11: Find polyglots with C1 and C2 (mastery) level in multiple languages\n", + "Person.aggr(\n", + " Proficiency & 'cefr_level>=\"C1\"',\n", + " 'name',\n", + " num_mastered='count(*)'\n", + ") & 'num_mastered >= 2'\n" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": {}, + "outputs": [], + "source": [ + "# If you need to re-run this example, you can drop the schema by uncommenting the following line:\n", + "\n", + "# schema.drop() " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" }, - "nbformat": 4, - "nbformat_minor": 2 -} + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + 
"file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/book/80-examples/060-management.ipynb b/book/80-examples/060-management.ipynb index 49b9549..4a35e4e 100644 --- a/book/80-examples/060-management.ipynb +++ b/book/80-examples/060-management.ipynb @@ -1,1290 +1,1224 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Management Hierarchy\n", - "\n", - "This example demonstrates self-referencing tables and hierarchical data structures, showing how to model organizational relationships where employees manage other employees. It also showcases the benefits of using lookup tables for normalized data design.\n", - "\n", - "## Database Schema\n", - "\n", - "The database consists of four tables:\n", - "1. **Department** - A lookup table containing department information (codes, names, budgets, locations)\n", - "2. **Employee** - Individual employees with basic information and department assignment\n", - "3. **ReportsTo** - An association table linking employees to their managers\n", - "4. 
**DepartmentChair** - A table linking departments to their chairs\n", - "\n", - "This design allows:\n", - "- Each employee to have at most one manager (many-to-one relationship)\n", - "- Each manager to have multiple direct reports (one-to-many relationship)\n", - "- Each employee to belong to exactly one department (many-to-one relationship)\n", - "- Each department to have exactly one chair (one-to-one relationship)\n", - "- Each employee to be chair of at most one department (one-to-zero-or-one relationship)\n", - "- Modeling of organizational hierarchies with normalized department data\n", - "- Extended department metadata (budget, location) without data duplication\n" - ] - }, - { - "cell_type": "code", - "execution_count": 52, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Exception reporting mode: Minimal\n" - ] - } - ], - "source": [ - "%xmode minimal\n", - "import datajoint as dj\n", - "\n", - "# Create schema\n", - "schema = dj.Schema('management_example')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Table Definitions\n" - ] - }, - { - "cell_type": "code", - "execution_count": 53, - "metadata": {}, - "outputs": [], - "source": [ - "@schema\n", - "class Department(dj.Lookup):\n", - " definition = \"\"\"\n", - " dept_code : char(3) # Department code (e.g., 'ENG', 'MKT', 'SAL')\n", - " ---\n", - " department_name : varchar(50) # Full department name\n", - " location : varchar(30) # Office location\n", - " \"\"\"\n", - " contents = [\n", - " ('ENG', 'Engineering','Building A'),\n", - " ('MKT', 'Marketing', 'Building B'),\n", - " ('SAL', 'Sales', 'Building C'),\n", - " ('HR', 'Human Resources', 'Building A'),\n", - " ('FIN', 'Finance', 'Building A'),\n", - " ]\n", - "\n", - "@schema\n", - "class Employee(dj.Manual):\n", - " definition = \"\"\"\n", - " employee_id : int # Unique identifier for each employee\n", - " ---\n", - " name : varchar(60) # Employee's name\n", - " -> 
Department \n", - " hire_date : date # When they were hired\n", - " \"\"\"\n" - ] - }, - { - "cell_type": "code", - "execution_count": 54, - "metadata": {}, - "outputs": [], - "source": [ - "@schema\n", - "class ReportsTo(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Employee\n", - " ---\n", - " -> Employee.proj(manager_id='employee_id')\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class DepartmentChair(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Department\n", - " ---\n", - " -> [unique]Employee\n", - " appointed_date : date # When they became department char\n", - " \"\"\"\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Populating the Database\n", - "\n", - "Let's create a realistic organizational hierarchy with sample data.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 55, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Inserted 10 employees\n", - "\n", - "Department lookup table:\n", - "*dept_code department_nam location \n", - "+-----------+ +------------+ +------------+\n", - "ENG Engineering Building A \n", - "FIN Finance Building A \n", - "HR Human Resource Building A \n", - "MKT Marketing Building B \n", - "SAL Sales Building C \n", - " (Total: 5)\n", - "\n" - ] - } - ], - "source": [ - "# Insert employees with realistic hierarchy using Department lookup table\n", - "employee_data = [\n", - " # Top-level executives\n", - " {'employee_id': 1, 'name': 'Alice Johnson', 'dept_code': 'ENG', 'hire_date': '2015-01-15'},\n", - " {'employee_id': 2, 'name': 'Bob Smith', 'dept_code': 'MKT', 'hire_date': '2016-03-20'},\n", - " \n", - " # Mid-level managers\n", - " {'employee_id': 3, 'name': 'Carol Davis', 'dept_code': 'ENG', 'hire_date': '2017-06-10'},\n", - " {'employee_id': 4, 'name': 'David Wilson', 'dept_code': 'ENG', 'hire_date': '2018-02-14'},\n", - " {'employee_id': 5, 'name': 'Eva Brown', 'dept_code': 'MKT', 'hire_date': '2018-09-05'},\n", - " \n", - " 
# Individual contributors\n", - " {'employee_id': 6, 'name': 'Frank Miller', 'dept_code': 'ENG', 'hire_date': '2019-04-12'},\n", - " {'employee_id': 7, 'name': 'Grace Lee', 'dept_code': 'ENG', 'hire_date': '2019-07-18'},\n", - " {'employee_id': 8, 'name': 'Henry Taylor', 'dept_code': 'ENG', 'hire_date': '2020-01-08'},\n", - " {'employee_id': 9, 'name': 'Ivy Chen', 'dept_code': 'MKT', 'hire_date': '2020-03-25'},\n", - " {'employee_id': 10, 'name': 'Jack Anderson', 'dept_code': 'SAL', 'hire_date': '2020-06-30'},\n", - "]\n", - "\n", - "Employee.insert(employee_data)\n", - "print(f\"Inserted {len(employee_data)} employees\")\n", - "\n", - "# Display Department lookup table\n", - "print(\"\\nDepartment lookup table:\")\n", - "print(Department())\n" - ] - }, - { - "cell_type": "code", - "execution_count": 56, - "metadata": {}, - "outputs": [], - "source": [ - "# Create reporting relationships\n", - "reports_data = [\n", - " # Alice Johnson (1) manages Carol Davis (3) and David Wilson (4)\n", - " {'employee_id': 3, 'manager_id': 1}, # Carol reports to Alice\n", - " {'employee_id': 4, 'manager_id': 1}, # David reports to Alice\n", - " \n", - " # Bob Smith (2) manages Eva Brown (5)\n", - " {'employee_id': 5, 'manager_id': 2}, # Eva reports to Bob\n", - " \n", - " # Carol Davis (3) manages Frank Miller (6) and Grace Lee (7)\n", - " {'employee_id': 6, 'manager_id': 3}, # Frank reports to Carol\n", - " {'employee_id': 7, 'manager_id': 3}, # Grace reports to Carol\n", - " \n", - " # David Wilson (4) manages Henry Taylor (8)\n", - " {'employee_id': 8, 'manager_id': 4}, # Henry reports to David\n", - " \n", - " # Eva Brown (5) manages Ivy Chen (9)\n", - " {'employee_id': 9, 'manager_id': 5}, # Ivy reports to Eva\n", - " \n", - " # Jack Anderson (10) has no manager (top-level in Sales)\n", - "]\n" - ] - }, - { - "cell_type": "code", - "execution_count": 57, - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "ReportsTo.insert(reports_data)\n", - "\n", - "# Create 
department chair s\n", - "department_chairs_data = [\n", - " {'dept_code': 'ENG', 'employee_id': 1, 'appointed_date': '2015-01-15'}, # Alice Johnson heads Engineering\n", - " {'dept_code': 'MKT', 'employee_id': 2, 'appointed_date': '2016-03-20'}, # Bob Smith heads Marketing\n", - " {'dept_code': 'SAL', 'employee_id': 10, 'appointed_date': '2020-06-30'}, # Jack Anderson heads Sales\n", - " {'dept_code': 'HR', 'employee_id': 11, 'appointed_date': '2021-01-15'}, # We'll add HR head\n", - " {'dept_code': 'FIN', 'employee_id': 12, 'appointed_date': '2021-02-01'}, # We'll add Finance head\n", - "]\n", - "\n", - "# Add the missing employees for HR and Finance chairs\n", - "additional_employees = [\n", - " {'employee_id': 11, 'name': 'Sarah Wilson', 'dept_code': 'HR', 'hire_date': '2020-08-15'},\n", - " {'employee_id': 12, 'name': 'Michael Brown', 'dept_code': 'FIN', 'hire_date': '2020-09-01'},\n", - "]\n", - "\n", - "Employee.insert(additional_employees)\n", - "DepartmentChair.insert(department_chairs_data)\n" - ] - }, + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Management Hierarchy\n", + "\n", + "This example demonstrates self-referencing tables and hierarchical data structures, showing how to model organizational relationships where employees manage other employees. It also showcases the benefits of using lookup tables for normalized data design.\n", + "\n", + "## Database Schema\n", + "\n", + "The database consists of four tables:\n", + "1. **Department** - A lookup table containing department information (codes, names, budgets, locations)\n", + "2. **Employee** - Individual employees with basic information and department assignment\n", + "3. **ReportsTo** - An association table linking employees to their managers\n", + "4. 
**DepartmentChair** - A table linking departments to their chairs\n", + "\n", + "This design allows:\n", + "- Each employee to have at most one manager (many-to-one relationship)\n", + "- Each manager to have multiple direct reports (one-to-many relationship)\n", + "- Each employee to belong to exactly one department (many-to-one relationship)\n", + "- Each department to have exactly one chair (one-to-one relationship)\n", + "- Each employee to be chair of at most one department (one-to-zero-or-one relationship)\n", + "- Modeling of organizational hierarchies with normalized department data\n", + "- Extended department metadata (budget, location) without data duplication\n" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [ { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Database Diagram" - ] - }, + "name": "stdout", + "output_type": "stream", + "text": [ + "Exception reporting mode: Minimal\n" + ] + } + ], + "source": [ + "%xmode minimal\n", + "import datajoint as dj\n", + "\n", + "# Create schema\n", + "schema = dj.Schema('management_example')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Table Definitions\n" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [], + "source": [ + "@schema\n", + "class Department(dj.Lookup):\n", + " definition = \"\"\"\n", + " dept_code : char(3) # Department code (e.g., 'ENG', 'MKT', 'SAL')\n", + " ---\n", + " department_name : varchar(50) # Full department name\n", + " location : varchar(30) # Office location\n", + " \"\"\"\n", + " contents = [\n", + " ('ENG', 'Engineering','Building A'),\n", + " ('MKT', 'Marketing', 'Building B'),\n", + " ('SAL', 'Sales', 'Building C'),\n", + " ('HR', 'Human Resources', 'Building A'),\n", + " ('FIN', 'Finance', 'Building A'),\n", + " ]\n", + "\n", + "@schema\n", + "class Employee(dj.Manual):\n", + " definition = \"\"\"\n", + " employee_id : int # Unique 
identifier for each employee\n", + " ---\n", + " name : varchar(60) # Employee's name\n", + " -> Department \n", + " hire_date : date # When they were hired\n", + " \"\"\"\n" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [], + "source": [ + "@schema\n", + "class ReportsTo(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Employee\n", + " ---\n", + " -> Employee.proj(manager_id='employee_id')\n", + " \"\"\"\n", + "\n", + "@schema\n", + "class DepartmentChair(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Department\n", + " ---\n", + " -> [unique]Employee\n", + " appointed_date : date # When they became department chair\n", + " \"\"\"\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Populating the Database\n", + "\n", + "Let's create a realistic organizational hierarchy with sample data.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "# Insert employees with realistic hierarchy using Department lookup table\nemployee_data = [\n # Top-level executives\n {'employee_id': 1, 'name': 'Alice Johnson',\n 'dept_code': 'ENG', 'hire_date': '2015-01-15'},\n {'employee_id': 2, 'name': 'Bob Smith',\n 'dept_code': 'MKT', 'hire_date': '2016-03-20'},\n \n # Mid-level managers\n {'employee_id': 3, 'name': 'Carol Davis',\n 'dept_code': 'ENG', 'hire_date': '2017-06-10'},\n {'employee_id': 4, 'name': 'David Wilson',\n 'dept_code': 'ENG', 'hire_date': '2018-02-14'},\n {'employee_id': 5, 'name': 'Eva Brown',\n 'dept_code': 'MKT', 'hire_date': '2018-09-05'},\n \n # Individual contributors\n {'employee_id': 6, 'name': 'Frank Miller',\n 'dept_code': 'ENG', 'hire_date': '2019-04-12'},\n {'employee_id': 7, 'name': 'Grace Lee',\n 'dept_code': 'ENG', 'hire_date': '2019-07-18'},\n {'employee_id': 8, 'name': 'Henry Taylor',\n 'dept_code': 'ENG', 'hire_date': '2020-01-08'},\n {'employee_id': 9, 'name': 'Ivy Chen',\n 'dept_code': 'MKT', 'hire_date':
'2020-03-25'},\n {'employee_id': 10, 'name': 'Jack Anderson',\n 'dept_code': 'SAL', 'hire_date': '2020-06-30'},\n]\n\nEmployee.insert(employee_data)\nprint(f\"Inserted {len(employee_data)} employees\")\n\n# Display Department lookup table\nprint(\"\\nDepartment lookup table:\")\nprint(Department())" + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": {}, + "outputs": [], + "source": [ + "# Create reporting relationships\n", + "reports_data = [\n", + " # Alice Johnson (1) manages Carol Davis (3) and David Wilson (4)\n", + " {'employee_id': 3, 'manager_id': 1}, # Carol reports to Alice\n", + " {'employee_id': 4, 'manager_id': 1}, # David reports to Alice\n", + " \n", + " # Bob Smith (2) manages Eva Brown (5)\n", + " {'employee_id': 5, 'manager_id': 2}, # Eva reports to Bob\n", + " \n", + " # Carol Davis (3) manages Frank Miller (6) and Grace Lee (7)\n", + " {'employee_id': 6, 'manager_id': 3}, # Frank reports to Carol\n", + " {'employee_id': 7, 'manager_id': 3}, # Grace reports to Carol\n", + " \n", + " # David Wilson (4) manages Henry Taylor (8)\n", + " {'employee_id': 8, 'manager_id': 4}, # Henry reports to David\n", + " \n", + " # Eva Brown (5) manages Ivy Chen (9)\n", + " {'employee_id': 9, 'manager_id': 5}, # Ivy reports to Eva\n", + " \n", + " # Jack Anderson (10) has no manager (top-level in Sales)\n", + "]\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": "ReportsTo.insert(reports_data)\n\n# Create department chairs\ndepartment_chairs_data = [\n # Alice Johnson heads Engineering\n {'dept_code': 'ENG', 'employee_id': 1, 'appointed_date': '2015-01-15'},\n # Bob Smith heads Marketing\n {'dept_code': 'MKT', 'employee_id': 2, 'appointed_date': '2016-03-20'},\n # Jack Anderson heads Sales\n {'dept_code': 'SAL', 'employee_id': 10, 'appointed_date': '2020-06-30'},\n # We'll add HR head\n {'dept_code': 'HR', 'employee_id': 11, 'appointed_date': '2021-01-15'},\n # We'll add Finance head\n 
{'dept_code': 'FIN', 'employee_id': 12, 'appointed_date': '2021-02-01'},\n]\n\n# Add the missing employees for HR and Finance chairs\nadditional_employees = [\n {'employee_id': 11, 'name': 'Sarah Wilson',\n 'dept_code': 'HR', 'hire_date': '2020-08-15'},\n {'employee_id': 12, 'name': 'Michael Brown',\n 'dept_code': 'FIN', 'hire_date': '2020-09-01'},\n]\n\nEmployee.insert(additional_employees)\nDepartmentChair.insert(department_chairs_data)" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Database Diagram" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 58, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "2\n", - "\n", - "2\n", - "\n", - "\n", - "\n", - "ReportsTo\n", - "\n", - "\n", - "ReportsTo\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "2->ReportsTo\n", - "\n", - "\n", - "\n", - "\n", - "Employee\n", - "\n", - "\n", - "Employee\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Employee->2\n", - "\n", - "\n", - "\n", - "\n", - "Employee->ReportsTo\n", - "\n", - "\n", - "\n", - "\n", - "DepartmentChair\n", - "\n", - "\n", - "DepartmentChair\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Employee->DepartmentChair\n", - "\n", - "\n", - "\n", - "\n", - "Department\n", - "\n", - "\n", - "Department\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Department->Employee\n", - "\n", - "\n", - "\n", - "\n", - "Department->DepartmentChair\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 58, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "image/svg+xml": [ + "\n", + "\n", + "\n", + "\n", + "\n", + "2\n", + "\n", + "2\n", + "\n", + "\n", + "\n", + "ReportsTo\n", + "\n", + "\n", + "ReportsTo\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "2->ReportsTo\n", + "\n", + "\n", + "\n", + "\n", + "Employee\n", + "\n", + "\n", + "Employee\n", + 
"\n", + "\n", + "\n", + "\n", + "\n", + "Employee->2\n", + "\n", + "\n", + "\n", + "\n", + "Employee->ReportsTo\n", + "\n", + "\n", + "\n", + "\n", + "DepartmentChair\n", + "\n", + "\n", + "DepartmentChair\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Employee->DepartmentChair\n", + "\n", + "\n", + "\n", + "\n", + "Department\n", + "\n", + "\n", + "Department\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Department->Employee\n", + "\n", + "\n", + "\n", + "\n", + "Department->DepartmentChair\n", + "\n", + "\n", + "\n", + "" ], - "source": [ - "dj.Diagram(schema)\n" + "text/plain": [ + "" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Hierarchical Queries\n", - "\n", - "Now let's explore various queries that demonstrate hierarchical data analysis.\n" - ] - }, + }, + "execution_count": 58, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dj.Diagram(schema)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Hierarchical Queries\n", + "\n", + "Now let's explore various queries that demonstrate hierarchical data analysis.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 59, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

employee_id

\n", - " Unique identifier for each employee\n", - "
\n", - "

name

\n", - " Employee's name\n", - "
\n", - "

dept_code

\n", - " Department code (e.g., 'ENG', 'MKT', 'SAL')\n", - "
\n", - "

hire_date

\n", - " When they were hired\n", - "
1Alice JohnsonENG2015-01-15
2Bob SmithMKT2016-03-20
10Jack AndersonSAL2020-06-30
11Sarah WilsonHR2020-08-15
12Michael BrownFIN2020-09-01
\n", - " \n", - "

Total: 5

\n", - " " - ], - "text/plain": [ - "*employee_id name dept_code hire_date \n", - "+------------+ +------------+ +-----------+ +------------+\n", - "1 Alice Johnson ENG 2015-01-15 \n", - "2 Bob Smith MKT 2016-03-20 \n", - "10 Jack Anderson SAL 2020-06-30 \n", - "11 Sarah Wilson HR 2020-08-15 \n", - "12 Michael Brown FIN 2020-09-01 \n", - " (Total: 5)" - ] - }, - "execution_count": 59, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

employee_id

\n", + " Unique identifier for each employee\n", + "
\n", + "

name

\n", + " Employee's name\n", + "
\n", + "

dept_code

\n", + " Department code (e.g., 'ENG', 'MKT', 'SAL')\n", + "
\n", + "

hire_date

\n", + " When they were hired\n", + "
1Alice JohnsonENG2015-01-15
2Bob SmithMKT2016-03-20
10Jack AndersonSAL2020-06-30
11Sarah WilsonHR2020-08-15
12Michael BrownFIN2020-09-01
\n", + " \n", + "

Total: 5

\n", + " " ], - "source": [ - "# Query 1: Show information about department chairs\n", - "\n", - "Employee & DepartmentChair " + "text/plain": [ + "*employee_id name dept_code hire_date \n", + "+------------+ +------------+ +-----------+ +------------+\n", + "1 Alice Johnson ENG 2015-01-15 \n", + "2 Bob Smith MKT 2016-03-20 \n", + "10 Jack Anderson SAL 2020-06-30 \n", + "11 Sarah Wilson HR 2020-08-15 \n", + "12 Michael Brown FIN 2020-09-01 \n", + " (Total: 5)" ] - }, + }, + "execution_count": 59, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 1: Show information about department chairs\n", + "\n", + "Employee & DepartmentChair " + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 60, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

dept_code

\n", - " Department code (e.g., 'ENG', 'MKT', 'SAL')\n", - "
\n", - "

employee_id

\n", - " Unique identifier for each employee\n", - "
\n", - "

department_name

\n", - " Full department name\n", - "
\n", - "

location

\n", - " Office location\n", - "
\n", - "

chair

\n", - " Employee's name\n", - "
ENG1EngineeringBuilding AAlice Johnson
ENG2EngineeringBuilding ABob Smith
ENG3EngineeringBuilding ACarol Davis
ENG4EngineeringBuilding ADavid Wilson
ENG5EngineeringBuilding AEva Brown
ENG6EngineeringBuilding AFrank Miller
ENG7EngineeringBuilding AGrace Lee
ENG8EngineeringBuilding AHenry Taylor
ENG9EngineeringBuilding AIvy Chen
ENG10EngineeringBuilding AJack Anderson
ENG11EngineeringBuilding ASarah Wilson
ENG12EngineeringBuilding AMichael Brown
\n", - "

...

\n", - "

Total: 60

\n", - " " - ], - "text/plain": [ - "*dept_code *employee_id department_nam location chair \n", - "+-----------+ +------------+ +------------+ +------------+ +------------+\n", - "ENG 1 Engineering Building A Alice Johnson \n", - "ENG 2 Engineering Building A Bob Smith \n", - "ENG 3 Engineering Building A Carol Davis \n", - "ENG 4 Engineering Building A David Wilson \n", - "ENG 5 Engineering Building A Eva Brown \n", - "ENG 6 Engineering Building A Frank Miller \n", - "ENG 7 Engineering Building A Grace Lee \n", - "ENG 8 Engineering Building A Henry Taylor \n", - "ENG 9 Engineering Building A Ivy Chen \n", - "ENG 10 Engineering Building A Jack Anderson \n", - "ENG 11 Engineering Building A Sarah Wilson \n", - "ENG 12 Engineering Building A Michael Brown \n", - " ...\n", - " (Total: 60)" - ] - }, - "execution_count": 60, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

dept_code

\n", + " Department code (e.g., 'ENG', 'MKT', 'SAL')\n", + "
\n", + "

employee_id

\n", + " Unique identifier for each employee\n", + "
\n", + "

department_name

\n", + " Full department name\n", + "
\n", + "

location

\n", + " Office location\n", + "
\n", + "

chair

\n", + " Employee's name\n", + "
ENG1EngineeringBuilding AAlice Johnson
ENG2EngineeringBuilding ABob Smith
ENG3EngineeringBuilding ACarol Davis
ENG4EngineeringBuilding ADavid Wilson
ENG5EngineeringBuilding AEva Brown
ENG6EngineeringBuilding AFrank Miller
ENG7EngineeringBuilding AGrace Lee
ENG8EngineeringBuilding AHenry Taylor
ENG9EngineeringBuilding AIvy Chen
ENG10EngineeringBuilding AJack Anderson
ENG11EngineeringBuilding ASarah Wilson
ENG12EngineeringBuilding AMichael Brown
\n", + "

...

\n", + "

Total: 60

\n", + " " ], - "source": [ - "# Query 2: Show department info including the names of their chairs\n", - "\n", - "(Department * DepartmentChair.proj() * Employee.proj(chair='name'))" + "text/plain": [ + "*dept_code *employee_id department_nam location chair \n", + "+-----------+ +------------+ +------------+ +------------+ +------------+\n", + "ENG 1 Engineering Building A Alice Johnson \n", + "ENG 2 Engineering Building A Bob Smith \n", + "ENG 3 Engineering Building A Carol Davis \n", + "ENG 4 Engineering Building A David Wilson \n", + "ENG 5 Engineering Building A Eva Brown \n", + "ENG 6 Engineering Building A Frank Miller \n", + "ENG 7 Engineering Building A Grace Lee \n", + "ENG 8 Engineering Building A Henry Taylor \n", + "ENG 9 Engineering Building A Ivy Chen \n", + "ENG 10 Engineering Building A Jack Anderson \n", + "ENG 11 Engineering Building A Sarah Wilson \n", + "ENG 12 Engineering Building A Michael Brown \n", + " ...\n", + " (Total: 60)" ] - }, + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 2: Show department info including the names of their chairs\n", + "\n", + "(Department * DepartmentChair.proj() * Employee.proj(chair='name'))" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 61, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

manager_id

\n", - " Unique identifier for each employee\n", - "
\n", - "

name

\n", - " Employee's name\n", - "
\n", - "

dept_code

\n", - " Department code (e.g., 'ENG', 'MKT', 'SAL')\n", - "
\n", - "

hire_date

\n", - " When they were hired\n", - "
1Alice JohnsonENG2015-01-15
2Bob SmithMKT2016-03-20
3Carol DavisENG2017-06-10
4David WilsonENG2018-02-14
5Eva BrownMKT2018-09-05
\n", - " \n", - "

Total: 5

\n", - " " - ], - "text/plain": [ - "*manager_id name dept_code hire_date \n", - "+------------+ +------------+ +-----------+ +------------+\n", - "1 Alice Johnson ENG 2015-01-15 \n", - "2 Bob Smith MKT 2016-03-20 \n", - "3 Carol Davis ENG 2017-06-10 \n", - "4 David Wilson ENG 2018-02-14 \n", - "5 Eva Brown MKT 2018-09-05 \n", - " (Total: 5)" - ] - }, - "execution_count": 61, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

manager_id

\n", + " Unique identifier for each employee\n", + "
\n", + "

name

\n", + " Employee's name\n", + "
\n", + "

dept_code

\n", + " Department code (e.g., 'ENG', 'MKT', 'SAL')\n", + "
\n", + "

hire_date

\n", + " When they were hired\n", + "
1Alice JohnsonENG2015-01-15
2Bob SmithMKT2016-03-20
3Carol DavisENG2017-06-10
4David WilsonENG2018-02-14
5Eva BrownMKT2018-09-05
\n", + " \n", + "

Total: 5

\n", + " " ], - "source": [ - "# Query 2: Show information for managers (employees who have others reporting to them)\n", - "Employee.proj(..., manager_id='employee_id') & ReportsTo\n" + "text/plain": [ + "*manager_id name dept_code hire_date \n", + "+------------+ +------------+ +-----------+ +------------+\n", + "1 Alice Johnson ENG 2015-01-15 \n", + "2 Bob Smith MKT 2016-03-20 \n", + "3 Carol Davis ENG 2017-06-10 \n", + "4 David Wilson ENG 2018-02-14 \n", + "5 Eva Brown MKT 2018-09-05 \n", + " (Total: 5)" ] - }, + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 2: Show information for managers (employees who have others reporting to them)\n", + "Employee.proj(..., manager_id='employee_id') & ReportsTo\n" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 62, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

employee_id

\n", - " Unique identifier for each employee\n", - "
\n", - "

name

\n", - " Employee's name\n", - "
\n", - "

dept_code

\n", - " Department code (e.g., 'ENG', 'MKT', 'SAL')\n", - "
\n", - "

hire_date

\n", - " When they were hired\n", - "
1Alice JohnsonENG2015-01-15
2Bob SmithMKT2016-03-20
10Jack AndersonSAL2020-06-30
11Sarah WilsonHR2020-08-15
12Michael BrownFIN2020-09-01
\n", - " \n", - "

Total: 5

\n", - " " - ], - "text/plain": [ - "*employee_id name dept_code hire_date \n", - "+------------+ +------------+ +-----------+ +------------+\n", - "1 Alice Johnson ENG 2015-01-15 \n", - "2 Bob Smith MKT 2016-03-20 \n", - "10 Jack Anderson SAL 2020-06-30 \n", - "11 Sarah Wilson HR 2020-08-15 \n", - "12 Michael Brown FIN 2020-09-01 \n", - " (Total: 5)" - ] - }, - "execution_count": 62, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

employee_id

\n", + " Unique identifier for each employee\n", + "
\n", + "

name

\n", + " Employee's name\n", + "
\n", + "

dept_code

\n", + " Department code (e.g., 'ENG', 'MKT', 'SAL')\n", + "
\n", + "

hire_date

\n", + " When they were hired\n", + "
1Alice JohnsonENG2015-01-15
2Bob SmithMKT2016-03-20
10Jack AndersonSAL2020-06-30
11Sarah WilsonHR2020-08-15
12Michael BrownFIN2020-09-01
\n", + " \n", + "

Total: 5

\n", + " " ], - "source": [ - "# Query 3: Show top managers, i.e. employees who do not have managers\n", - "Employee - ReportsTo" + "text/plain": [ + "*employee_id name dept_code hire_date \n", + "+------------+ +------------+ +-----------+ +------------+\n", + "1 Alice Johnson ENG 2015-01-15 \n", + "2 Bob Smith MKT 2016-03-20 \n", + "10 Jack Anderson SAL 2020-06-30 \n", + "11 Sarah Wilson HR 2020-08-15 \n", + "12 Michael Brown FIN 2020-09-01 \n", + " (Total: 5)" ] - }, + }, + "execution_count": 62, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 3: Show top managers, i.e. employees who do not have managers\n", + "Employee - ReportsTo" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 63, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

employee_id

\n", - " Unique identifier for each employee\n", - "
\n", - "

name

\n", - " Employee's name\n", - "
\n", - "

dept_code

\n", - " Department code (e.g., 'ENG', 'MKT', 'SAL')\n", - "
\n", - "

hire_date

\n", - " When they were hired\n", - "
6Frank MillerENG2019-04-12
7Grace LeeENG2019-07-18
8Henry TaylorENG2020-01-08
9Ivy ChenMKT2020-03-25
10Jack AndersonSAL2020-06-30
11Sarah WilsonHR2020-08-15
12Michael BrownFIN2020-09-01
\n", - " \n", - "

Total: 7

\n", - " " - ], - "text/plain": [ - "*employee_id name dept_code hire_date \n", - "+------------+ +------------+ +-----------+ +------------+\n", - "6 Frank Miller ENG 2019-04-12 \n", - "7 Grace Lee ENG 2019-07-18 \n", - "8 Henry Taylor ENG 2020-01-08 \n", - "9 Ivy Chen MKT 2020-03-25 \n", - "10 Jack Anderson SAL 2020-06-30 \n", - "11 Sarah Wilson HR 2020-08-15 \n", - "12 Michael Brown FIN 2020-09-01 \n", - " (Total: 7)" - ] - }, - "execution_count": 63, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

employee_id

\n", + " Unique identifier for each employee\n", + "
\n", + "

name

\n", + " Employee's name\n", + "
\n", + "

dept_code

\n", + " Department code (e.g., 'ENG', 'MKT', 'SAL')\n", + "
\n", + "

hire_date

\n", + " When they were hired\n", + "
6Frank MillerENG2019-04-12
7Grace LeeENG2019-07-18
8Henry TaylorENG2020-01-08
9Ivy ChenMKT2020-03-25
10Jack AndersonSAL2020-06-30
11Sarah WilsonHR2020-08-15
12Michael BrownFIN2020-09-01
\n", + " \n", + "

Total: 7

\n", + " " ], - "source": [ - "# Query 4: Show individual contributors, i.e. employees who do not have any direct reports \n", - "\n", - "# Re-interpretation: show all employees whose employee_id does not appear \n", - "# in the manager_id column of the ReportsTo table\n", - "Employee - ReportsTo.proj(employee_id='manager_id', x='employee_id')" + "text/plain": [ + "*employee_id name dept_code hire_date \n", + "+------------+ +------------+ +-----------+ +------------+\n", + "6 Frank Miller ENG 2019-04-12 \n", + "7 Grace Lee ENG 2019-07-18 \n", + "8 Henry Taylor ENG 2020-01-08 \n", + "9 Ivy Chen MKT 2020-03-25 \n", + "10 Jack Anderson SAL 2020-06-30 \n", + "11 Sarah Wilson HR 2020-08-15 \n", + "12 Michael Brown FIN 2020-09-01 \n", + " (Total: 7)" ] - }, + }, + "execution_count": 63, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 4: Show individual contributors, i.e. employees who do not have any direct reports \n", + "\n", + "# Re-interpretation: show all employees whose employee_id does not appear \n", + "# in the manager_id column of the ReportsTo table\n", + "Employee - ReportsTo.proj(employee_id='manager_id', x='employee_id')" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 64, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

employee_id

\n", - " Unique identifier for each employee\n", - "
\n", - "

name

\n", - " Employee's name\n", - "
\n", - "

dept_code

\n", - " Department code (e.g., 'ENG', 'MKT', 'SAL')\n", - "
\n", - "

hire_date

\n", - " When they were hired\n", - "
3Carol DavisENG2017-06-10
4David WilsonENG2018-02-14
5Eva BrownMKT2018-09-05
\n", - " \n", - "

Total: 3

\n", - " " - ], - "text/plain": [ - "*employee_id name dept_code hire_date \n", - "+------------+ +------------+ +-----------+ +------------+\n", - "3 Carol Davis ENG 2017-06-10 \n", - "4 David Wilson ENG 2018-02-14 \n", - "5 Eva Brown MKT 2018-09-05 \n", - " (Total: 3)" - ] - }, - "execution_count": 64, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

employee_id

\n", + " Unique identifier for each employee\n", + "
\n", + "

name

\n", + " Employee's name\n", + "
\n", + "

dept_code

\n", + " Department code (e.g., 'ENG', 'MKT', 'SAL')\n", + "
\n", + "

hire_date

\n", + " When they were hired\n", + "
3Carol DavisENG2017-06-10
4David WilsonENG2018-02-14
5Eva BrownMKT2018-09-05
\n", + " \n", + "

Total: 3

\n", + " " ], - "source": [ - "# Query 5: Show middle managers\n", - "\n", - "Employee & ReportsTo & ReportsTo.proj(employee_id='manager_id', x='employee_id')" + "text/plain": [ + "*employee_id name dept_code hire_date \n", + "+------------+ +------------+ +-----------+ +------------+\n", + "3 Carol Davis ENG 2017-06-10 \n", + "4 David Wilson ENG 2018-02-14 \n", + "5 Eva Brown MKT 2018-09-05 \n", + " (Total: 3)" ] - }, + }, + "execution_count": 64, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Query 5: Show middle managers\n", + "\n", + "Employee & ReportsTo & ReportsTo.proj(employee_id='manager_id', x='employee_id')" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 65, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " \n", - " \n", - " \n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "
\n", - "

manager_id

\n", - " Unique identifier for each employee\n", - "
\n", - "

name

\n", - " Employee's name\n", - "
\n", - "

dept_code

\n", - " Department code (e.g., 'ENG', 'MKT', 'SAL')\n", - "
\n", - "

hire_date

\n", - " When they were hired\n", - "
\n", - "

direct_reports

\n", - " calculated attribute\n", - "
1Alice JohnsonENG2015-01-152
2Bob SmithMKT2016-03-201
3Carol DavisENG2017-06-102
4David WilsonENG2018-02-141
5Eva BrownMKT2018-09-051
6Frank MillerENG2019-04-120
7Grace LeeENG2019-07-180
8Henry TaylorENG2020-01-080
9Ivy ChenMKT2020-03-250
10Jack AndersonSAL2020-06-300
11Sarah WilsonHR2020-08-150
12Michael BrownFIN2020-09-010
\n", - " \n", - "

Total: 12

\n", - " " - ], - "text/plain": [ - "*manager_id name dept_code hire_date direct_reports\n", - "+------------+ +------------+ +-----------+ +------------+ +------------+\n", - "1 Alice Johnson ENG 2015-01-15 2 \n", - "2 Bob Smith MKT 2016-03-20 1 \n", - "3 Carol Davis ENG 2017-06-10 2 \n", - "4 David Wilson ENG 2018-02-14 1 \n", - "5 Eva Brown MKT 2018-09-05 1 \n", - "6 Frank Miller ENG 2019-04-12 0 \n", - "7 Grace Lee ENG 2019-07-18 0 \n", - "8 Henry Taylor ENG 2020-01-08 0 \n", - "9 Ivy Chen MKT 2020-03-25 0 \n", - "10 Jack Anderson SAL 2020-06-30 0 \n", - "11 Sarah Wilson HR 2020-08-15 0 \n", - "12 Michael Brown FIN 2020-09-01 0 \n", - " (Total: 12)" - ] - }, - "execution_count": 65, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
\n", + "

manager_id

\n", + " Unique identifier for each employee\n", + "
\n", + "

name

\n", + " Employee's name\n", + "
\n", + "

dept_code

\n", + " Department code (e.g., 'ENG', 'MKT', 'SAL')\n", + "
\n", + "

hire_date

\n", + " When they were hired\n", + "
\n", + "

direct_reports

\n", + " calculated attribute\n", + "
1Alice JohnsonENG2015-01-152
2Bob SmithMKT2016-03-201
3Carol DavisENG2017-06-102
4David WilsonENG2018-02-141
5Eva BrownMKT2018-09-051
6Frank MillerENG2019-04-120
7Grace LeeENG2019-07-180
8Henry TaylorENG2020-01-080
9Ivy ChenMKT2020-03-250
10Jack AndersonSAL2020-06-300
11Sarah WilsonHR2020-08-150
12Michael BrownFIN2020-09-010
\n", + " \n", + "

Total: 12

\n", + " " ], - "source": [ - "# Show all employees and the number of direct reports they have\n", - "Employee.proj(..., manager_id='employee_id').aggr(\n", - " ReportsTo, ..., direct_reports='count(employee_id)', \n", - " keep_all_rows=True)\n" + "text/plain": [ + "*manager_id name dept_code hire_date direct_reports\n", + "+------------+ +------------+ +-----------+ +------------+ +------------+\n", + "1 Alice Johnson ENG 2015-01-15 2 \n", + "2 Bob Smith MKT 2016-03-20 1 \n", + "3 Carol Davis ENG 2017-06-10 2 \n", + "4 David Wilson ENG 2018-02-14 1 \n", + "5 Eva Brown MKT 2018-09-05 1 \n", + "6 Frank Miller ENG 2019-04-12 0 \n", + "7 Grace Lee ENG 2019-07-18 0 \n", + "8 Henry Taylor ENG 2020-01-08 0 \n", + "9 Ivy Chen MKT 2020-03-25 0 \n", + "10 Jack Anderson SAL 2020-06-30 0 \n", + "11 Sarah Wilson HR 2020-08-15 0 \n", + "12 Michael Brown FIN 2020-09-01 0 \n", + " (Total: 12)" ] + }, + "execution_count": 65, + "metadata": {}, + "output_type": "execute_result" } - ], - "metadata": { - "kernelspec": { - "display_name": "base", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.2" - } + ], + "source": [ + "# Show all employees and the number of direct reports they have\n", + "Employee.proj(..., manager_id='employee_id').aggr(\n", + " ReportsTo, ..., direct_reports='count(employee_id)', \n", + " keep_all_rows=True)\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" }, - "nbformat": 4, - "nbformat_minor": 2 -} + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.2" + 
} + }, + "nbformat": 4, + "nbformat_minor": 2 +} \ No newline at end of file diff --git a/book/85-special-topics/025-uuid.ipynb b/book/85-special-topics/025-uuid.ipynb index 0cbced9..7a05fb3 100644 --- a/book/85-special-topics/025-uuid.ipynb +++ b/book/85-special-topics/025-uuid.ipynb @@ -3,98 +3,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "# UUIDs\n", - "\n", - "This chapter demonstrates how to use **Universally Unique Identifiers (UUIDs)** in DataJoitn tables. \n", - "\n", - "For the conceptual foundation on primary keys and when to use surrogate keys like UUIDs, see [Primary Keys](020-primary-key.md). That chapter covers:\n", - "- What primary keys are and why they matter\n", - "- Natural keys vs. surrogate keys\n", - "- When UUIDs are appropriate\n", - "- UUID types (UUID1, UUID3, UUID4, UUID5) and their characteristics\n", - "- Choosing the right UUID type for your use case\n", - "\n", - "Here, we focus on **practical implementation** of UUIDs in DataJoint schemas.\n", - "\n", - "## Quick Reference: UUID Types\n", - "\n", - "- **UUID1**: Time-based, sortable (good for primary keys where ordering matters)\n", - "- **UUID4**: Random, anonymous (good for security tokens, session IDs)\n", - "- **UUID3/5**: Deterministic, hierarchical (good for content-addressable systems)\n", - "\n", - "See [Primary Keys](020-primary-key.md) for detailed characteristics and use cases.\n", - "\n", - "## Alternative Unique Identifier Systems\n", - "\n", - "While UUIDs are the most common standardized unique identifier system, there are other alternatives that may be better suited for specific use cases:\n", - "\n", - "### ULID (Universally Unique Lexicographically Sortable Identifier)\n", - "\n", - "**ULID** provides a lexicographically sortable unique identifier that combines timestamp and randomness. 
Unlike UUIDs, ULIDs are guaranteed to be sortable by creation time.\n", - "\n", - "**Key characteristics:**\n", - "- **48-bit timestamp** (millisecond precision) + **80-bit randomness**\n", - "- **Lexicographically sortable**: Can be sorted as strings to maintain chronological order\n", - "- **URL-safe**: Uses Crockford's Base32 encoding (no special characters)\n", - "- **Case-insensitive**: Designed for human readability\n", - "- **26 characters**: More compact than UUIDs (36 characters)\n", - "\n", - "**Use cases:**\n", - "- Database primary keys where chronological sorting is important\n", - "- Log entries that need to maintain temporal order\n", - "- Distributed systems requiring sortable IDs without coordination\n", - "- When you need to sort by creation time without additional timestamp columns\n", - "\n", - "**Example**: `01ARZ3NDEKTSV4RRFFQ69G5FAV`\n", - "\n", - "**Resources:**\n", - "- [ULID Specification](https://github.com/ulid/spec)\n", - "- [ULID Generator/Calculator](https://zelark.github.io/nano-id-cc/)\n", - "\n", - "### NANOID\n", - "\n", - "**NANOID** is a tiny, URL-safe, unique string ID generator that uses a cryptographically strong random generator.\n", - "\n", - "**Key characteristics:**\n", - "- **Configurable length**: Default is 21 characters (adjustable)\n", - "- **URL-safe**: Uses URL-safe characters (A-Za-z0-9_-)\n", - "- **Fast**: 4x faster than UUID\n", - "- **Smaller size**: 21 characters vs UUID's 36 characters\n", - "- **Collision-resistant**: Uses cryptographically strong random generator\n", - "\n", - "**Use cases:**\n", - "- Short URLs or slugs\n", - "- Compact identifiers where size matters\n", - "- High-performance applications requiring fast ID generation\n", - "- User-facing identifiers where brevity improves UX\n", - "\n", - "**Example**: `V1StGXR8_Z5jdHi6B-myT` (21 characters)\n", - "\n", - "**Resources:**\n", - "- [NANOID GitHub](https://github.com/ai/nanoid)\n", - "- [NANOID Collision 
Calculator](https://zelark.github.io/nano-id-cc/)\n", - "\n", - "### Choosing Between UUIDs, ULIDs, and NANOIDs\n", - "\n", - "| Feature | UUID | ULID | NANOID |\n", - "|---------|------|------|--------|\n", - "| **Standard** | RFC 9562 | Community spec | Community spec |\n", - "| **Length** | 36 chars | 26 chars | 21 chars (default) |\n", - "| **Sortable** | UUID1/UUID6 only | ✅ Always | ❌ No |\n", - "| **Time-ordered** | UUID1/UUID6 only | ✅ Always | ❌ No |\n", - "| **URL-safe** | ✅ | ✅ | ✅ |\n", - "| **Database support** | ✅ Native in many | Limited | Limited |\n", - "| **Best for** | Standard compliance, interoperability | Sortable IDs, logs | Compact IDs, URLs |\n", - "\n", - "**Note**: DataJoint natively supports UUIDs. For ULID or NANOID, you would store them as `varchar` or `char` or `binary` types.\n", - "\n", - "## Python's UUID Module\n", - "\n", - "Python provides the [UUID module](https://docs.python.org/3/library/uuid.html) as part of its standard library for generating UUIDs.\n", - "\n", - "**Note**: UUIDs are standardized by [RFC 9562](https://www.rfc-editor.org/rfc/rfc9562.html) (which obsoletes RFC 4122). The Python `uuid` module implements the standard UUID formats defined in the specification." - ] + "source": "---\ntitle: UUIDs\n---\n\n# UUIDs: Universally Unique Identifiers\n\nThis chapter demonstrates how to use **Universally Unique Identifiers (UUIDs)** in DataJoint tables as surrogate keys.\n\nFor the conceptual foundation on primary keys and when to use surrogate keys like UUIDs, see [Primary Keys](../30-design/018-primary-key.md). That chapter covers:\n- What primary keys are and why they matter\n- Natural keys vs. 
surrogate keys\n- Why DataJoint requires explicit key values (no auto-increment)\n- When surrogate keys like UUIDs are appropriate\n\nThis chapter focuses on **practical implementation** of UUIDs and related unique identifier systems.\n\n## When to Use UUIDs\n\nUUIDs are appropriate when you need surrogate keys that are:\n- **Globally unique** without coordination between systems\n- **Generated client-side** before database insertion (required by DataJoint)\n- **Collision-resistant** even across distributed systems\n- **Not exposed to users** (internal identifiers only)\n\n## UUID Types and Characteristics\n\nUUIDs are standardized by [RFC 9562](https://www.rfc-editor.org/rfc/rfc9562.html). The most commonly used types are:\n\n### UUID1 (Time-based)\n\n**Generated from**: Timestamp + MAC address (or random node ID)\n\n**Characteristics**:\n- **Time-derived, not string-sortable** — The timestamp's low-order bits come first, so lexicographic order does not follow creation time (RFC 9562 adds UUID6 and UUID7 for sortable IDs)\n- **Contains temporal information** — Timestamp can be extracted\n- **Contains hardware identifier** — May expose MAC address (privacy concern)\n- **Sequential when generated rapidly** — Reduces index fragmentation\n\n**Best for**: Primary keys where a recoverable creation time matters, audit logs, distributed systems needing timestamped IDs.\n\n### UUID4 (Random)\n\n**Generated from**: Cryptographically secure random numbers\n\n**Characteristics**:\n- **No temporal information** — Cannot determine when generated\n- **No hardware identifier** — Privacy-preserving\n- **Uniformly distributed** — May cause index fragmentation in large tables\n- **Highest entropy** — Most unpredictable\n\n**Best for**: Security tokens, session IDs, cases where predictability is a concern, privacy-sensitive contexts.\n\n### UUID3/UUID5 (Deterministic)\n\n**Generated from**: Namespace UUID + name string (MD5 for UUID3, SHA-1 for UUID5)\n\n**Characteristics**:\n- **Deterministic** — Same inputs always produce the same UUID\n- **Hierarchical** — Can create nested 
namespaces\n- **Content-addressable** — UUID identifies the content\n- **Reproducible** — No need to store generated UUIDs\n\n**Best for**: Content-addressable storage, deduplication, creating stable IDs from existing identifiers, hierarchical categorization systems.\n\n```{admonition} Choosing the Right UUID Type\n:class: tip\n\n| Requirement | Recommended Type |\n|-------------|------------------|\n| Chronological ordering needed | UUID1 (order by its embedded timestamp) |\n| Privacy is important | UUID4 |\n| Predictability is a concern | UUID4 |\n| Need deterministic IDs from names | UUID5 (preferred) or UUID3 |\n| Distributed system, no coordination | UUID1 or UUID4 |\n| Content-addressable storage | UUID5 |\n```\n\n## Alternative Unique Identifier Systems\n\nWhile UUIDs are the most common standardized unique identifier system, alternatives may be better suited for specific use cases:\n\n### ULID (Universally Unique Lexicographically Sortable Identifier)\n\n**ULID** provides a lexicographically sortable unique identifier that combines timestamp and randomness. 
Unlike UUID1 and UUID4, ULIDs sort by creation time when compared as strings, to millisecond precision.\n\n**Characteristics:**\n- **48-bit timestamp** (millisecond precision) + **80-bit randomness**\n- **Lexicographically sortable**: Can be sorted as strings to maintain chronological order\n- **URL-safe**: Uses Crockford's Base32 encoding (no special characters)\n- **Case-insensitive**: Designed for human readability\n- **26 characters**: More compact than UUIDs (36 characters)\n\n**Best for**: Database primary keys where chronological sorting is important, log entries, distributed systems requiring sortable IDs without coordination.\n\n**Example**: `01ARZ3NDEKTSV4RRFFQ69G5FAV`\n\n**Resources:**\n- [ULID Specification](https://github.com/ulid/spec)\n\n### NANOID\n\n**NANOID** is a tiny, URL-safe, unique string ID generator using a cryptographically strong random generator.\n\n**Characteristics:**\n- **Configurable length**: Default is 21 characters (adjustable)\n- **URL-safe**: Uses URL-safe characters (A-Za-z0-9_-)\n- **Fast**: Claimed to be faster than UUID generation; the margin depends on the implementation and runtime\n- **Smaller size**: 21 characters vs UUID's 36 characters\n- **Collision-resistant**: Uses cryptographically strong random generator\n\n**Best for**: Short URLs or slugs, compact identifiers, high-performance applications, user-facing identifiers where brevity improves UX.\n\n**Example**: `V1StGXR8_Z5jdHi6B-myT` (21 characters)\n\n**Resources:**\n- [NANOID GitHub](https://github.com/ai/nanoid)\n- [NANOID Collision Calculator](https://zelark.github.io/nano-id-cc/)\n\n### Comparison Table\n\n| Feature | UUID | ULID | NANOID |\n|---------|------|------|--------|\n| **Standard** | RFC 9562 | Community spec | Community spec |\n| **Length** | 36 chars | 26 chars | 21 chars (default) |\n| **Sortable** | UUID6/UUID7 only | Always | No |\n| **Time-ordered** | UUID1 via embedded timestamp; UUID6/UUID7 as stored | Always | No |\n| **URL-safe** | Yes | Yes | Yes |\n| **Database support** | Native in many | Limited | Limited |\n| **Best for** | Standard compliance | Sortable IDs | Compact IDs |\n\n```{admonition} 
DataJoint Support\n:class: note\n\nDataJoint natively supports UUIDs with the `uuid` data type. For ULID or NANOID, store them as `char(26)` or `char(21)` respectively.\n```\n\n## Python's UUID Module\n\nPython provides the [uuid module](https://docs.python.org/3/library/uuid.html) as part of its standard library:" }, { "cell_type": "code", @@ -225,13 +134,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "## Using UUIDs in DataJoint Tables\n", - "\n", - "DataJoint supports UUIDs as primary key types. When you declare a column as `uuid`, DataJoint automatically stores it as `BINARY(16)` in MySQL for efficient storage.\n", - "\n", - "Let's create a simple example to demonstrate UUID usage in DataJoint:" - ] + "source": "## Using UUIDs in DataJoint Tables\n\nDataJoint natively supports UUIDs as a data type. When you declare an attribute as `uuid`, DataJoint automatically stores it as `BINARY(16)` in MySQL for efficient storage and indexing.\n\n```{admonition} Why BINARY(16)?\n:class: note\n\nUUIDs are 128-bit values. 
Storing them as `BINARY(16)` (16 bytes) is more efficient than storing the 36-character string representation, and allows for faster comparisons and indexing.\n```\n\nLet's create examples demonstrating UUID usage in DataJoint:" }, { "cell_type": "code", @@ -1513,11 +1416,11 @@ ] }, { - "cell_type": "code", + "cell_type": "markdown", "execution_count": null, "metadata": {}, "outputs": [], - "source": [] + "source": "## Summary\n\nUUIDs provide a robust solution for generating surrogate keys in DataJoint:\n\n| Aspect | Key Points |\n|--------|------------|\n| **UUID1** | Time-based, sortable, good for primary keys needing order |\n| **UUID4** | Random, privacy-preserving, good for security tokens |\n| **UUID5** | Deterministic, good for content-addressable systems |\n| **Storage** | DataJoint stores UUIDs as `BINARY(16)` for efficiency |\n| **Alternatives** | ULID (sortable, compact) and NANOID (very compact) available |\n\n```{admonition} Next Steps\n:class: tip\n\n- Review [Primary Keys](../30-design/018-primary-key.md) for the conceptual foundation on when to use surrogate keys\n- See [Foreign Keys](../30-design/030-foreign-keys.md) for how UUIDs work in table relationships\n```" } ], "metadata": { @@ -1541,4 +1444,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file diff --git a/book/85-special-topics/083-attach.ipynb b/book/85-special-topics/083-attach.ipynb index 295a592..d85a736 100644 --- a/book/85-special-topics/083-attach.ipynb +++ b/book/85-special-topics/083-attach.ipynb @@ -1293,41 +1293,10 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "@schema\n", - "class WebImage(dj.Lookup):\n", - " definition = \"\"\"\n", - " # A reference to a web image\n", - " image_number : int\n", - " ---\n", - " image_name : varchar(30)\n", - " image_description : varchar(1024)\n", - " image_url : varchar(1024)\n", - " \n", - " unique index(image_name)\n", - " \"\"\"\n", - 
" contents = [\n", - " (0, \"pyramidal\", \n", - " \n", - " 'Coronal section containing the chronically imaged pyramidal neuron \"dow\" '\\\n", - " '(visualized by green GFP) does not stain for GABA (visualized by antibody staining in red). '\\\n", - " 'Confocal image stack, overlay of GFP and GABA channels. Scale bar: 100 um',\n", - " \n", - " \"https://upload.wikimedia.org/wikipedia/commons/d/dc/PLoSBiol4.e126.Fig6fNeuron.jpg\"\n", - " ),\n", - " (1, \"striatal\", \n", - " \n", - " \"Mouse spiny striatal projection neuron expressing a transgenic fluorescent protein \"\\\n", - " \"(colored yellow) delivered by a recombinant virus (AAV). \"\\\n", - " \"The striatal interneuron are stainerd in green for the neurokinin-1 receptor.\",\n", - " \n", - " \"https://upload.wikimedia.org/wikipedia/commons/e/e8/Striatal_neuron_in_an_interneuron_cage.jpg\"\n", - " )\n", - " ]" - ] + "source": "@schema\nclass WebImage(dj.Lookup):\n definition = \"\"\"\n # A reference to a web image\n image_number : int\n ---\n image_name : varchar(30)\n image_description : varchar(1024)\n image_url : varchar(1024)\n \n unique index(image_name)\n \"\"\"\n contents = [\n (0, \"pyramidal\", \n 'Coronal section containing the chronically imaged pyramidal '\n 'neuron \"dow\" (visualized by green GFP) does not stain for GABA '\n '(visualized by antibody staining in red). '\n 'Confocal image stack, overlay of GFP and GABA channels. '\n 'Scale bar: 100 um',\n \"https://upload.wikimedia.org/wikipedia/commons/d/dc/\"\n \"PLoSBiol4.e126.Fig6fNeuron.jpg\"\n ),\n (1, \"striatal\", \n \"Mouse spiny striatal projection neuron expressing a transgenic \"\n \"fluorescent protein (colored yellow) delivered by a recombinant \"\n \"virus (AAV). 
The striatal interneuron are stained in green for \"\n \"the neurokinin-1 receptor.\",\n \"https://upload.wikimedia.org/wikipedia/commons/e/e8/\"\n \"Striatal_neuron_in_an_interneuron_cage.jpg\"\n )\n ]" }, { "cell_type": "markdown", @@ -2755,4 +2724,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file diff --git a/book/85-special-topics/084-filepath.ipynb b/book/85-special-topics/084-filepath.ipynb index 25a8500..e0ac1b0 100644 --- a/book/85-special-topics/084-filepath.ipynb +++ b/book/85-special-topics/084-filepath.ipynb @@ -93,20 +93,10 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "# Step 1: Find a bunch of images on the web\n", - "logos = dict(\n", - " ucsd='https://upload.wikimedia.org/wikipedia/commons/f/f6/UCSD_logo.png',\n", - " datajoint='https://datajoint.io/static/images/DJiotitle.png',\n", - " utah='https://umc.utah.edu/wp-content/uploads/sites/15/2015/01/Ulogo_400p.png',\n", - " bcm='https://upload.wikimedia.org/wikipedia/commons/5/5d/Baylor_College_of_Medicine_Logo.png',\n", - " pydata='https://pydata.org/wp-content/uploads/2018/10/pydata-logo.png',\n", - " python='https://www.python.org/static/community_logos/python-logo-master-v3-TM.png',\n", - " pni='https://vathes.com/2018/05/24/Princeton-Neuroscience-Institute-Partners-with-Vathes-to-Support-the-Adoption-of-DataJoint/PNI%20logo.png')" - ] + "source": "# Step 1: Find a bunch of images on the web\nlogos = dict(\n ucsd='https://upload.wikimedia.org/wikipedia/commons/f/f6/UCSD_logo.png',\n datajoint='https://datajoint.io/static/images/DJiotitle.png',\n utah='https://umc.utah.edu/wp-content/uploads/sites/15/2015/01/'\n 'Ulogo_400p.png',\n bcm='https://upload.wikimedia.org/wikipedia/commons/5/5d/'\n 'Baylor_College_of_Medicine_Logo.png',\n pydata='https://pydata.org/wp-content/uploads/2018/10/pydata-logo.png',\n python='https://www.python.org/static/community_logos/'\n 
'python-logo-master-v3-TM.png',\n pni='https://vathes.com/2018/05/24/'\n 'Princeton-Neuroscience-Institute-Partners-with-Vathes-to-Support-'\n 'the-Adoption-of-DataJoint/PNI%20logo.png'\n)" }, { "cell_type": "code", @@ -254,33 +244,10 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "import requests \n", - "\n", - "@schema\n", - "class Logo(dj.Imported):\n", - " definition = \"\"\"\n", - " -> Organization\n", - " ---\n", - " logo_image : filepath@remote\n", - " \"\"\"\n", - " \n", - " path = os.path.join(dj.config['stores']['remote']['stage'], 'organizations', 'logos')\n", - " \n", - " def make(self, key):\n", - " # create the subfolder and download the logo into local_file \n", - " os.makedirs(self.path, exist_ok=True)\n", - " url = (Organization & key).fetch1('logo_url')\n", - " local_file = os.path.join(self.path, key['organization'] + os.path.splitext(url)[1])\n", - " print(local_file)\n", - " with open(local_file, 'wb') as f:\n", - " f.write(requests.get(url).content)\n", - " # sync up\n", - " self.insert1(dict(key, logo_image=local_file)) " - ] + "source": "import requests \n\n@schema\nclass Logo(dj.Imported):\n definition = \"\"\"\n -> Organization\n ---\n logo_image : filepath@remote\n \"\"\"\n \n path = os.path.join(\n dj.config['stores']['remote']['stage'], 'organizations', 'logos')\n \n def make(self, key):\n # create the subfolder and download the logo into local_file \n os.makedirs(self.path, exist_ok=True)\n url = (Organization & key).fetch1('logo_url')\n ext = os.path.splitext(url)[1]\n local_file = os.path.join(self.path, key['organization'] + ext)\n print(local_file)\n with open(local_file, 'wb') as f:\n f.write(requests.get(url).content)\n # sync up\n self.insert1(dict(key, logo_image=local_file))" }, { "cell_type": "code", @@ -1099,30 +1066,10 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - 
"@schema\n", - "class StateBird(dj.Imported):\n", - " definition = \"\"\"\n", - " -> State\n", - " ---\n", - " bird_image : filepath@remote \n", - " \"\"\"\n", - " path = os.path.join(dj.config['stores']['remote']['stage'], 'states', 'birds')\n", - " \n", - " \n", - " def make(self, key):\n", - " os.makedirs(self.path, exist_ok=True)\n", - " state = (State & key).fetch1('state')\n", - " url = \"http://www.theus50.com/images/state-birds/{state}-bird.jpg\".format(state=state.lower())\n", - " local_file = os.path.join(self.path, state.lower() + os.path.splitext(url)[1])\n", - " print(local_file)\n", - " with open(local_file, 'wb') as f:\n", - " f.write(requests.get(url).content)\n", - " self.insert1(dict(key, bird_image=local_file)) \n" - ] + "source": "@schema\nclass StateBird(dj.Imported):\n definition = \"\"\"\n -> State\n ---\n bird_image : filepath@remote \n \"\"\"\n path = os.path.join(\n dj.config['stores']['remote']['stage'], 'states', 'birds')\n \n def make(self, key):\n os.makedirs(self.path, exist_ok=True)\n state = (State & key).fetch1('state')\n url = (\"http://www.theus50.com/images/state-birds/\"\n \"{state}-bird.jpg\".format(state=state.lower()))\n ext = os.path.splitext(url)[1]\n local_file = os.path.join(self.path, state.lower() + ext)\n print(local_file)\n with open(local_file, 'wb') as f:\n f.write(requests.get(url).content)\n self.insert1(dict(key, bird_image=local_file))" }, { "cell_type": "code", @@ -1192,30 +1139,10 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "@schema\n", - "class StateFlower(dj.Imported):\n", - " definition = \"\"\"\n", - " -> State\n", - " ---\n", - " flower_image : filepath@remote \n", - " \"\"\"\n", - " path = os.path.join(dj.config['stores']['remote']['stage'],'states', 'flowers')\n", - " \n", - " \n", - " def make(self, key):\n", - " os.makedirs(self.path, exist_ok=True)\n", - " state = (State & key).fetch1('state')\n", - " url = 
\"http://www.theus50.com/images/state-birds/{state}-flower.jpg\".format(state=state.lower())\n", - " local_file = os.path.join(self.path, state.lower() + os.path.splitext(url)[1])\n", - " print(local_file)\n", - " with open(local_file, 'wb') as f:\n", - " f.write(requests.get(url).content)\n", - " self.insert1(dict(key, flower_image=local_file)) " - ] + "source": "@schema\nclass StateFlower(dj.Imported):\n definition = \"\"\"\n -> State\n ---\n flower_image : filepath@remote \n \"\"\"\n path = os.path.join(\n dj.config['stores']['remote']['stage'], 'states', 'flowers')\n \n def make(self, key):\n os.makedirs(self.path, exist_ok=True)\n state = (State & key).fetch1('state')\n url = (\"http://www.theus50.com/images/state-birds/\"\n \"{state}-flower.jpg\".format(state=state.lower()))\n ext = os.path.splitext(url)[1]\n local_file = os.path.join(self.path, state.lower() + ext)\n print(local_file)\n with open(local_file, 'wb') as f:\n f.write(requests.get(url).content)\n self.insert1(dict(key, flower_image=local_file))" }, { "cell_type": "code", @@ -1930,4 +1857,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file diff --git a/book/95-reference/SPECS_2_0.md b/book/95-reference/SPECS_2_0.md index b9fe5b3..f81ec50 100644 --- a/book/95-reference/SPECS_2_0.md +++ b/book/95-reference/SPECS_2_0.md @@ -804,13 +804,15 @@ Example: ```python Student.insert([ {'student_id': 1000, 'first_name': 'Rebecca', 'last_name': 'Sanchez', - 'sex': 'F', 'date_of_birth': '1997-09-13', 'home_address': '6604 Gentry Turnpike Suite 513', - 'home_city': 'Andreaport', 'home_state': 'MN', 'home_zipcode': '29376', - 'home_phone': '(250)428-1836'}, + 'sex': 'F', 'date_of_birth': '1997-09-13', + 'home_address': '6604 Gentry Turnpike Suite 513', + 'home_city': 'Andreaport', 'home_state': 'MN', + 'home_zipcode': '29376', 'home_phone': '(250)428-1836'}, {'student_id': 1001, 'first_name': 'Matthew', 'last_name': 'Gonzales', - 'sex': 'M', 'date_of_birth': '1997-05-17', 'home_address': 
'1432 Jessica Freeway Apt. 545', - 'home_city': 'Frazierberg', 'home_state': 'NE', 'home_zipcode': '60485', - 'home_phone': '(699)755-6306x996'} + 'sex': 'M', 'date_of_birth': '1997-05-17', + 'home_address': '1432 Jessica Freeway Apt. 545', + 'home_city': 'Frazierberg', 'home_state': 'NE', + 'home_zipcode': '60485', 'home_phone': '(699)755-6306x996'} ]) ``` @@ -997,10 +999,14 @@ The restriction operator filters the rows of a table based on specified conditio To combine conditions using logical AND (conjunction), conditions MAY be applied sequentially or by using a `dj.AndList` object. ```python # Select young students from outside California (sequential application). - young_non_ca_students = Student & "home_state <> 'CA'" & "date_of_birth >= '2010-01-01'" + young_non_ca_students = (Student + & "home_state <> 'CA'" + & "date_of_birth >= '2010-01-01'") # Equivalent using dj.AndList. - young_non_ca_students_alt = Student & dj.AndList(["home_state <> 'CA'", "date_of_birth >= '2010-01-01'"]) + young_non_ca_students_alt = Student & dj.AndList([ + "home_state <> 'CA'", "date_of_birth >= '2010-01-01'" + ]) ``` 5. **Restriction by a Subquery:** @@ -1051,7 +1057,8 @@ The projection operator selects a subset of attributes from a table. It can also # Compute 'full_name' and 'age'. student_derived_info = Student.proj( full_name='CONCAT(first_name, " ", last_name)', - age='TIMESTAMPDIFF(CURDATE(), date_of_birth) / 365.25' # Example, exact function varies by SQL dialect + # Example age calculation (exact function varies by SQL dialect) + age='TIMESTAMPDIFF(CURDATE(), date_of_birth) / 365.25' ) # Result includes primary key attributes, full_name, and age. ``` @@ -1181,10 +1188,12 @@ Universal sets, denoted by `dj.U(...)`, are symbolic constructs representing the 3. **Aggregation by arbitrary groupings:** `dj.U()` creates a new grouping entity with an arbitrary primary key for use in aggregations for which no existing entity type fits that purpose.
```python - # count how many students were born in each year and month + # count how many students were born in each year and month student_counts = dj.U('year_of_birth', 'month_of_birth').aggr( - Student.proj(year_of_birth='YEAR(date_of_birth)', month_of_birth='MONTH(date_of_birth)'), - n_students='COUNT(*)' + Student.proj( + year_of_birth='YEAR(date_of_birth)', + month_of_birth='MONTH(date_of_birth)'), + n_students='COUNT(*)' ) ``` In this case, the rules of semantic matching are lifted. diff --git a/book/README.md b/book/README.md index 6693441..8bda812 100644 --- a/book/README.md +++ b/book/README.md @@ -1,7 +1,5 @@ --- title: The DataJoint Book -authors: - - name: Dimitri Yatsenko --- © DataJoint Inc., 2024-2025. All rights reserved.