diff --git a/book/30-design/060-diagrams.ipynb b/book/30-design/060-diagrams.ipynb index e95321a..db7cbfd 100644 --- a/book/30-design/060-diagrams.ipynb +++ b/book/30-design/060-diagrams.ipynb @@ -6,1607 +6,169 @@ "source": [ "# Diagramming\n", "\n", - "Schema diagrams are essential tools for understanding and designing relational databases. They provide a visual representation of tables and their relationships, making complex schemas easier to comprehend at a glance.\n", + "Schema diagrams are essential tools for understanding and designing DataJoint pipelines.\n", + "They provide a visual representation of tables and their dependencies, making complex workflows comprehensible at a glance.\n", "\n", - "Several diagramming notations have been used for designing relational schemas, as already introduced in the section on [Relational Theory](../20-concepts/01-relational.md). Common notations include:\n", + "As introduced in [Relational Workflows](../20-concepts/05-workflows.md), DataJoint schemas form **Directed Acyclic Graphs (DAGs)** where:\n", "\n", - "* **Chen's Entity-Relationship (ER) Notation**: Uses rectangles for entities and diamonds for relationships. [@10.1201/9781003314455], https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model, [@doi.org/10.1145/320434.320440]\n", - "* **Crow's Foot Notation**, **IE (Information Engineering) Notation**, [**Barker's Notation**](http://www.entitymodelling.org/): Uses symbols at line endpoints to indicate cardinality (one, many, optional) and does not distinguish between entity sets and relationship sets.\n", - "* **UML Class Diagrams**: Adapted from object-oriented modeling.\n", + "- **Nodes** represent tables (workflow steps)\n", + "- **Edges** represent foreign key dependencies\n", + "- **Direction** flows from parent (referenced) to child (referencing) tables\n", "\n", - "DataJoint uses its own diagramming notation that is specifically designed for working with directed acyclic graphs (DAGs) and makes relationship types immediately visible through line styles.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Comparison to Other Diagramming Notations\n", - "\n", - "DataJoint's diagramming approach differs significantly from traditional database diagramming notations. Understanding these differences helps you appreciate the design philosophy behind DataJoint and how to read diagrams effectively.\n", - "\n", - "### Chen's Entity-Relationship (ER) Notation\n", - "\n", - "**Chen's notation**, developed by Peter Chen in 1976, makes a strict distinction between:\n", - "- **Entities** (rectangles): Things that exist independently (e.g., Customer, Product)\n", - "- **Relationships** (diamonds): Associations between entities (e.g., \"purchases\", \"owns\")\n", - "- **Attributes** (ovals): Properties of entities\n", - "\n", - "**Example**: A many-to-many relationship between `Customer` and `Product` would be drawn as:\n", - "```\n", - "[Customer] ──── ──── [Product]\n", - " 1 M:N 1\n", - "```\n", - "\n", - "**Key Differences from DataJoint**:\n", - "- DataJoint **does not distinguish** between entity tables and relationship tables—both are just tables\n", - "- In DataJoint, association tables (like `CustomerProduct`) appear as regular tables with converging foreign keys\n", - "- **Why DataJoint differs**: The boundary between entities and relationships is often blurred. For example, a synapse between neurons could be viewed as either an entity (with properties like strength, neurotransmitter type) or as a relationship (connecting two neurons). DataJoint treats everything uniformly as tables.\n", - "\n", - "### Crow's Foot Notation\n", - "\n", - "**Crow's Foot notation** (also called Information Engineering notation) uses symbols at line endpoints to indicate cardinality:\n", - "\n", - "```\n", - "Customer ||────o{ Order\n", - " ||────o{ indicates \"one to zero-or-many\"\n", - " \n", - "Symbols:\n", - "|| = exactly one\n", - "|o = zero or one \n", - "}o = zero or many\n", - "}| = one or many\n", - "```\n", - "\n", - "**Key Differences from DataJoint**:\n", - "- **Endpoint symbols vs. line styles**: Crow's Foot puts cardinality information at line endpoints; DataJoint uses line thickness and style (thick/thin/dashed)\n", - "- **No directionality**: Crow's Foot diagrams don't inherently have a top-to-bottom flow; DataJoint diagrams always show dependencies flowing downward\n", - "- **Nullable indication**: Crow's Foot explicitly shows optional relationships with the 'o' symbol; DataJoint doesn't show nullable foreign keys in the diagram\n", - "- **Primary key cascade**: Crow's Foot doesn't distinguish between foreign keys in primary keys vs. secondary attributes; this is DataJoint's most important visual feature\n", - "\n", - "### UML Class Diagrams\n", - "\n", - "**UML (Unified Modeling Language)** was designed for object-oriented programming but is sometimes adapted for database design:\n", - "\n", - "```\n", - "┌─────────────┐ ┌─────────────┐\n", - "│ Customer │1 *│ Order │\n", - "│─────────────│◆────────│─────────────│\n", - "│ customerId │ │ orderId │\n", - "│ name │ │ orderDate │\n", - "└─────────────┘ └─────────────┘\n", - "```\n", - "\n", - "**Key Differences from DataJoint**:\n", - "- **Composition vs. aggregation**: UML uses filled vs. hollow diamonds to show ownership strength; DataJoint uses solid vs. dashed lines\n", - "- **Object-oriented focus**: UML emphasizes inheritance and methods; DataJoint focuses purely on data relationships\n", - "- **Multiplicity notation**: UML uses `1`, `*`, `0..1` at line ends; DataJoint uses line styles\n", - "\n", - "### DataJoint's Unique Features\n", - "\n", - "What makes DataJoint diagrams distinctive:\n", - "\n", - "1. **Directed Acyclic Graph (DAG) Requirement**\n", - " - **No cycles allowed**: Foreign keys cannot form loops\n", - " - **Automatic layout**: All dependencies point in one direction (top-to-bottom)\n", - " - **Workflow interpretation**: Diagrams naturally represent operational sequences\n", - " - Other notations allow cycles (e.g., Employee → Department → Manager → Employee)\n", - "\n", - "2. **Line Style Encodes Semantic Information**\n", - " - **Thick solid**: Extension/specialization (one-to-one, shared identity)\n", - " - **Thin solid**: Containment/belonging (one-to-many, cascading identity)\n", - " - **Dashed**: Reference/association (one-to-many, independent identity)\n", - " - Other notations use line endpoints or labels, not line style itself\n", - "\n", - "3. **Primary Key Cascade Visibility**\n", - " - **Solid lines** (thick or thin) mean parent's primary key becomes part of child's identity\n", - " - This is critical for understanding which tables can be joined directly\n", - " - No other notation makes this distinction visible\n", - "\n", - "4. **No Distinction Between Entities and Relationships**\n", - " - Association tables look like any other table\n", - " - Reflects reality: relationships often have attributes and can participate in other relationships\n", - " - Example: A `Synapse` table connecting neurons could be an entity (with properties) or a relationship\n", - "\n", - "5. **Underlined Names for Independent Entities**\n", - " - **Underlined**: Table has its own independent primary key (a \"dimension\")\n", - " - **Not underlined**: Table's primary key is inherited from parent (an \"extension\")\n", - " - Helps identify which tables are starting points vs. elaborations\n", - "\n", - "### Comparison Table\n", - "\n", - "| Feature | Chen's ER | Crow's Foot | UML | DataJoint |\n", - "|---------|-----------|-------------|-----|-----------|\n", - "| **Entities vs. Relationships** | Distinct (rect vs. diamond) | No distinction | Classes vs. associations | No distinction |\n", - "| **Cardinality Display** | Numbers near entities | Symbols at line ends | Multiplicity at ends | Line thickness/style |\n", - "| **Direction** | No inherent direction | No inherent direction | Optional arrows | Always directed (DAG) |\n", - "| **Cycles Allowed** | Yes | Yes | Yes | **No** (acyclic only) |\n", - "| **Primary Key Cascade** | Not shown | Not shown | Not emphasized | **Solid lines show this** |\n", - "| **Optional Relationships** | Shown in labels | 'o' symbol | `0..1` notation | Not visible in diagram |\n", - "| **Identity Sharing** | Not indicated | Not indicated | Composition diamond | **Thick solid line** |\n", - "| **Containment** | Not distinguished | Not distinguished | Aggregation diamond | **Thin solid line** |\n", - "\n", - "### Why DataJoint's Approach Matters\n", - "\n", - "The DAG constraint and line style notation are not just aesthetic choices—they reflect a **design philosophy**:\n", - "\n", - "1. **Schemas as Workflows**: Because there are no cycles, you can read any DataJoint schema as a sequence of operations or transformations. Start at the top (independent entities) and work your way down (derived results).\n", - "\n", - "2. **Query Simplicity**: Solid lines tell you which tables can be joined directly. If you can trace a path of solid lines between two tables, you can join them without including intermediate tables.\n", - "\n", - "3. **Identity Semantics**: The line style immediately tells you whether entities share identity (thick), have contextualized identity (thin), or maintain independent identity (dashed).\n", - "\n", - "4. **Avoiding Self-References**: The no-cycles rule means you never have a table with a foreign key to itself. Instead, you create association tables (like `ReportsTo` connecting employees to managers), which makes relationships explicit and queryable.\n", - "\n", - "### Working with Non-DataJoint Schemas\n", - "\n", - "When you encounter schemas designed in other notations:\n", - "\n", - "- **Look for cycles**: If present, you'll need to break them into association tables for DataJoint\n", - "- **Identify many-to-many relationships**: These always need association tables in DataJoint (and in actual SQL)\n", - "- **Check foreign key placement**: Determine whether they should be in primary keys (solid lines) or secondary attributes (dashed lines)\n", - "- **Use `spawn_missing_classes()`**: DataJoint can automatically generate Python classes from existing MySQL/MariaDB databases" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "Consider the following diagram of the classic sales database, which is a popular example of a relational database schema from the [MySQL tutorial](https://www.mysqltutorial.org/getting-started-with-mysql/mysql-sample-database/), reproduced here for convenience:\n\n```{image} ../images/mysql-classic-sales-ERD.png\n---\nalt: Sales Database ERD\nwidth: 600px\n---\n```" - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "2\n", - "\n", - "2\n", - "\n", - "\n", - "\n", - "EmployeePosition\n", - "\n", - "\n", - "EmployeePosition\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "2->EmployeePosition\n", - "\n", - "\n", - "\n", - "\n", - "3\n", - "\n", - "3\n", - "\n", - "\n", - "\n", - "Customer\n", - "\n", - "\n", - "Customer\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "3->Customer\n", - "\n", - "\n", - "\n", - "\n", - "Order.Details\n", - "\n", - "\n", - "Order.Details\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Order\n", - "\n", - "\n", - "Order\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Order->Order.Details\n", - "\n", - "\n", - "\n", - "\n", - "Payment\n", - "\n", - "\n", - "Payment\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer->Order\n", - "\n", - "\n", - "\n", - "\n", - "Customer->Payment\n", - "\n", - "\n", - "\n", - "\n", - "Employee\n", - "\n", - "\n", - "Employee\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Employee->2\n", - "\n", - "\n", - "\n", - "\n", - "Employee->3\n", - "\n", - "\n", - "\n", - "\n", - "Employee->EmployeePosition\n", - "\n", - "\n", - "\n", - "\n", - "Office\n", - "\n", - "\n", - "Office\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Office->EmployeePosition\n", - "\n", - "\n", - "\n", - "\n", - "Product\n", - "\n", - "\n", - "Product\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Product->Order.Details\n", - "\n", - "\n", - "\n", - "\n", - "ProductLine\n", - "\n", - "\n", - "ProductLine\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "ProductLine->Product\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import datajoint as dj\n", - "\n", - "schema = dj.Schema(\"classic_sales\")\n", - "schema.spawn_missing_classes()\n", - "\n", - "dj.Diagram(schema)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Comparing the Two Representations\n", - "\n", - "The MySQL tutorial ERD (above) and the DataJoint diagram (above) represent the same database, but there are significant visual and conceptual differences:\n", - "\n", - "#### 1. Direction and Layout\n", - "\n", - "**Crow's Foot (MySQL ERD)**:\n", - "- No inherent direction—tables are arranged for visual balance\n", - "- Relationships flow in all directions (left, right, up, down)\n", - "- Must read cardinality symbols at both ends of each line to understand relationships\n", - "\n", - "**DataJoint Diagram**:\n", - "- Strict top-to-bottom flow (directed acyclic graph)\n", - "- Dependencies always point downward (parent above, child below)\n", - "- Can read the schema as a workflow from top to bottom\n", - "- **Insight**: The vertical position tells you dependency order—tables at the top can be populated first\n", - "\n", - "#### 2. Cardinality Indication\n", - "\n", - "**Crow's Foot Notation**:\n", - "- Uses symbols at line endpoints: `||` (one), `o` (zero/optional), `{` (many)\n", - "- Example: `||────o{` means \"one to zero-or-many\"\n", - "- Shows both sides of the relationship explicitly\n", - "- **Advantage**: Makes optional relationships (nullable foreign keys) visible\n", - "\n", - "**DataJoint Notation**:\n", - "- Uses line thickness and style: thick solid, thin solid, dashed\n", - "- Line style indicates whether foreign key is in primary key or not\n", - "- Does **not** show nullable foreign keys (must check table definition)\n", - "- **Advantage**: Line style conveys semantic meaning (extension vs. containment vs. reference)\n", - "\n", - "#### 3. Entity vs. Relationship Distinction\n", - "\n", - "**Crow's Foot / Chen's ER**:\n", - "- In Chen's notation, relationships are drawn as diamonds\n", - "- Crow's Foot treats all tables uniformly but conceptually distinguishes entities from relationships\n", - "- Many-to-many relationships are often shown with special notation\n", - "\n", - "**DataJoint**:\n", - "- **No distinction** between entity tables and association tables—all are tables\n", - "- Association tables (like `CustomerAccount`) look identical to entity tables\n", - "- Recognize associations by their converging foreign key pattern\n", - "- **Philosophy**: \"A synapse between neurons is both an entity (with properties) and a relationship (connecting neurons)—why force a distinction?\"\n", - "\n", - "#### 4. Primary Key Propagation\n", - "\n", - "**Crow's Foot**:\n", - "- Does not visually distinguish between foreign keys in primary keys vs. secondary attributes\n", - "- Cannot tell from the diagram whether you can join tables directly or need intermediate tables\n", - "\n", - "**DataJoint**:\n", - "- **Solid lines** (thick or thin) mean foreign key is in primary key → primary key cascades\n", - "- **Dashed lines** mean foreign key is secondary → no cascade\n", - "- **Critical advantage**: You can see at a glance which tables can be joined directly\n", - "- Example: If `A → B → C` are connected by solid lines, then `A * C` is a valid join\n", - "\n", - "#### 5. Handling Cycles\n", - "\n", - "**Traditional Notations**:\n", - "- Allow cyclic relationships (e.g., Employee → Department → Manager → Employee)\n", - "- Self-referencing tables are common (Employee table with manager_id field)\n", - "\n", - "**DataJoint**:\n", - "- **No cycles allowed**—all schemas must be directed acyclic graphs (DAGs)\n", - "- Self-referencing tables are avoided\n", - "- Instead, use association tables: `Employee` → `ReportsTo` → `Employee` (renamed)\n", - "- **Benefit**: Schemas can always be laid out top-to-bottom as workflows\n", - "\n", - "#### 6. Table Independence (Dimensions)\n", - "\n", - "**Traditional Notations**:\n", - "- Don't visually distinguish between independent entities and dependent entities\n", - "\n", - "**DataJoint**:\n", - "- **Underlined table names**: Independent entities (dimensions) with their own primary key\n", - "- **Non-underlined names**: Dependent entities whose primary key is inherited from parent\n", - "- Helps identify which tables are \"starting points\" vs. \"elaborations\"\n", - "\n", - "### Visual Comparison Example\n", - "\n", - "Consider a simple scenario: Customers placing Orders containing Items.\n", - "\n", - "**Crow's Foot would show**:\n", - "```\n", - "Customer ||────o{ Order o{────|| OrderItem }o────|| Product\n", - " 1 M M M 1\n", - "```\n", - "\n", - "**DataJoint shows**:\n", - "\n", - "```\n", - "Customer Product\n", - " │ │\n", - " │ (dashed) │ (dashed)\n", - " ↓ ↓\n", - " Order ───────┘\n", - " │ (thin solid)\n", - " ↓\n", - "OrderItem\n", - "```\n", - "\n", - "**Reading the DataJoint diagram**:\n", - "- Dashed lines to `Order`: Customer and Product references are secondary (Order has its own ID)\n", - "- Thin solid line to `OrderItem`: Items belong to orders, identified as (order_id, item_number)\n", - "- All solid lines point down: Top-to-bottom workflow (create customers/products, then orders, then items)\n", - "\n", - "### Advantages of Each Notation\n", - "\n", - "**Chen's ER Notation**:\n", - "- ✅ Clear conceptual modeling (entities vs. relationships explicit)\n", - "- ✅ Good for initial design discussions with non-technical stakeholders\n", - "- ❌ Awkward for implementation (relationship diamonds don't map directly to tables)\n", - "\n", - "**Crow's Foot Notation**:\n", - "- ✅ Shows nullable relationships clearly\n", - "- ✅ Explicit cardinality on both sides\n", - "- ✅ Widely recognized in industry\n", - "- ❌ No directionality makes complex schemas harder to read\n", - "- ❌ Doesn't show primary key cascade\n", - "\n", - "**DataJoint Notation**:\n", - "- ✅ Line style conveys semantic meaning (extension, containment, reference)\n", - "- ✅ Solid lines reveal which tables can be joined directly\n", - "- ✅ DAG structure makes schemas readable as workflows\n", - "- ✅ Underlined names identify independent entities\n", - "- ❌ Nullable foreign keys not visible\n", - "- ❌ Secondary unique constraints not visible\n", - "- ❌ Less familiar to those trained in traditional notations\n", - "\n", - "### Best Practices: When to Use Each\n", - "\n", - "- **Use Chen's ER** for: Initial conceptual modeling with stakeholders\n", - "- **Use Crow's Foot** for: Documentation of existing databases, focusing on cardinality\n", - "- **Use DataJoint** for: Scientific workflows, hierarchical data, when query patterns matter\n", - "- **Use UML** for: Object-relational mapping in application development\n", - "\n", - "### Converting Between Notations\n", - "\n", - "When converting from Crow's Foot or ER diagrams to DataJoint:\n", - "\n", - "1. **Identify independent entities** → These become top-level tables (underlined)\n", - "2. **Find many-to-many relationships** → Create association tables with both foreign keys in primary key\n", - "3. **Determine foreign key placement**:\n", - " - Does child belong to parent? → Foreign key in primary key (solid line)\n", - " - Does child reference parent? → Foreign key in secondary attributes (dashed line)\n", - "4. **Break any cycles** → Create association tables to maintain acyclic property\n", - "5. **Check for self-references** → Replace with association tables using renamed foreign keys" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Side-by-Side Notation Comparison\n", - "\n", - "Let's see how the same simple schema (Authors writing Books) would be represented in each notation:\n", - "\n", - "#### The Schema Requirements:\n", - "- Authors are identified by author_id\n", - "- Books are identified by ISBN\n", - "- One book can have multiple authors (many-to-many relationship)\n", - "- Authors can write multiple books\n", - "\n", - "#### Chen's Entity-Relationship Notation\n", - "\n", - "```\n", - "┌─────────┐ ┌──────────┐ ┌─────────┐\n", - "│ AUTHOR │ │ writes │ │ BOOK │\n", - "├─────────┤ │ │ ├─────────┤\n", - "│author_id│──────M───────│ M:N │──────N───────│ ISBN │\n", - "│ name │ │ │ │ title │\n", - "└─────────┘ └──────────┘ └─────────┘\n", - " (entity) (relationship) (entity)\n", - "```\n", - "\n", - "**Features**:\n", - "- Rectangles for entities, diamond for relationship\n", - "- M:N explicitly labeled in the relationship diamond\n", - "- Relationship \"writes\" is a named concept\n", - "- Three visual elements for what SQL implements as three tables\n", - "\n", - "#### Crow's Foot Notation\n", - "\n", - "```\n", - "┌─────────────┐ ┌─────────────┐\n", - "│ Author │ │ Book │\n", - "├─────────────┤ ├─────────────┤\n", - "│ author_id PK│o{────────writes────────}o│ ISBN PK │\n", - "│ name │ │ title │\n", - "└─────────────┘ └─────────────┘\n", - " │ │\n", - " └───────o{ AuthorBook {o────────┘\n", - " ├─────────────────┤\n", - " │ author_id PK,FK │\n", - " │ ISBN PK,FK │\n", - " │ order │\n", - " └─────────────────┘\n", - "```\n", - "\n", - "**Features**:\n", - "- `o{` symbols show \"zero or many\" at line endpoints\n", - "- PK and FK labels on attributes\n", - "- Association table explicitly shown\n", - "- No inherent top-to-bottom flow\n", - "\n", - "#### DataJoint Notation (What You Saw Above)\n", - "\n", - "```\n", - " Author Book\n", - " │ │\n", - " │ (thin solid) │ (thin solid)\n", - " ↓ ↓\n", - " AuthorBook\n", - "```\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conceptual Design vs. Implementation: A Key Philosophical Difference\n", - "\n", - "Database design is traditionally taught as a **two-phase process**:\n", - "\n", - "1. **Conceptual Design Phase**: Create ER diagrams to model entities and relationships\n", - "2. **Implementation Phase**: Translate the conceptual model into SQL CREATE TABLE statements\n", - "\n", - "This separation reflects a workflow where design and implementation are distinct activities, often performed by different people or at different times.\n", - "\n", - "### Traditional Two-Phase Approach\n", - "\n", - "In most database textbooks and courses, the process looks like this:\n", - "\n", - "```\n", - "Step 1: Conceptual Design\n", - "├─ Use Chen's ER diagrams or Crow's Foot notation\n", - "├─ Focus on entities, relationships, cardinalities\n", - "├─ Design without worrying about implementation details\n", - "└─ Create diagrams for discussion and approval\n", - "\n", - " ↓ (Manual Translation)\n", - "\n", - "Step 2: Implementation\n", - "├─ Write SQL CREATE TABLE statements\n", - "├─ Define primary keys and foreign keys\n", - "├─ Implement constraints and indexes\n", - "└─ Hope the implementation matches the design!\n", - "\n", - " ↓ (Potential Divergence)\n", - "\n", - "Problem: Diagrams and Implementation Can Drift Apart\n", - "├─ Diagrams updated → SQL not updated (documentation out of sync)\n", - "├─ SQL updated → Diagrams not updated (design drift)\n", - "└─ Requires discipline to keep them synchronized\n", - "```\n", - "\n", - "**Characteristics**:\n", - "- **Two separate artifacts**: Diagram (conceptual) and SQL code (implementation)\n", - "- **Manual synchronization**: Changes must be made in both places\n", - "- **Documentation debt**: Over time, diagrams often become outdated\n", - "- **Waterfall-oriented**: Design must be \"complete\" before implementation\n", - "- **Communication gap**: Designers and implementers may be different people\n", - "\n", - "### DataJoint's Unified Approach\n", - "\n", - "DataJoint fundamentally changes this by **merging conceptual design and implementation**:\n", - "\n", - "```\n", - "Single Step: Unified Design-Implementation\n", - "├─ Write Python class definitions (or SQL if preferred)\n", - "├─ DataJoint automatically creates tables in database\n", - "├─ DataJoint automatically generates diagrams from live schema\n", - "└─ Diagram and implementation are ALWAYS in sync\n", - "\n", - " ↓ (No Translation Needed)\n", - "\n", - "Result: Diagrams ARE the Implementation\n", - "├─ Change the code → Diagram updates automatically\n", - "├─ Diagram always reflects actual database structure\n", - "└─ Zero documentation debt\n", - "```\n", - "\n", - "**Characteristics**:\n", - "- **Single source of truth**: The code IS the design\n", - "- **Automatic synchronization**: Diagrams generated from actual database schema\n", - "- **Always current**: Diagrams cannot become outdated\n", - "- **Agile-friendly**: Can iterate on design rapidly\n", - "- **Executable documentation**: Diagrams are generated from running code\n", - "\n", - "### Practical Implications\n", - "\n", - "#### Traditional Approach Example:\n", - "\n", - "**Phase 1 - Conceptual Design** (ER Diagram):\n", - "```\n", - "[Student] ──enrolls in─── [Course]\n", - " 1 M:N 1\n", - "```\n", - "\n", - "**Phase 2 - Implementation** (Manual SQL):\n", - "```sql\n", - "CREATE TABLE student (\n", - " student_id INT PRIMARY KEY,\n", - " name VARCHAR(100)\n", - ");\n", - "\n", - "CREATE TABLE course (\n", - " course_id INT PRIMARY KEY,\n", - " title VARCHAR(100)\n", - ");\n", - "\n", - "CREATE TABLE enrollment (\n", - " student_id INT,\n", - " course_id INT,\n", - " PRIMARY KEY (student_id, course_id),\n", - " FOREIGN KEY (student_id) REFERENCES student(student_id),\n", - " FOREIGN KEY (course_id) REFERENCES course(course_id)\n", - ");\n", - "```\n", - "\n", - "**Problem**: If you later add a `grade` field to enrollment, you must:\n", - "1. Update the SQL code\n", - "2. Update the ER diagram manually\n", - "3. Update all documentation\n", - "4. Risk: Steps 2-3 often get skipped\n", - "\n", - "#### DataJoint Unified Approach:\n", - "\n", - "**Single Definition** (Code + Diagram in one):\n", - "```python\n", - "@schema\n", - "class Student(dj.Manual):\n", - " definition = \"\"\"\n", - " student_id : int\n", - " ---\n", - " name : varchar(100)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class Course(dj.Manual):\n", - " definition = \"\"\"\n", - " course_id : int\n", - " ---\n", - " title : varchar(100)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class Enrollment(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Student\n", - " -> Course\n", - " ---\n", - " grade : char(1) # Added later\n", - " \"\"\"\n", - "\n", - "# Diagram is automatically generated\n", - "dj.Diagram(schema)\n", - "```\n", - "\n", - "**Advantage**: \n", - "- Add `grade` field → Save file → Diagram updates automatically\n", - "- **Impossible** for diagram to be out of sync with implementation\n", - "- Code review catches design changes (they're in the same artifact)\n", - "\n", - "### Enabling Agile Database Design\n", - "\n", - "This unified approach enables an **agile, iterative workflow**:\n", - "\n", - "**Traditional Approach** (Waterfall):\n", - "```\n", - "Design → Review → Approve → Implement → Test → Deploy\n", - " ↑ |\n", - " └────────────── Difficult to go back ───────────┘\n", - "```\n", - "\n", - "**DataJoint Approach** (Agile):\n", - "```\n", - "Design+Implement → Test → Iterate → Deploy\n", - " ↓ ↑\n", - " └──── Easy iteration ──┘\n", - "```\n", - "\n", - "Benefits:\n", - "1. **Rapid prototyping**: Define a table, see the diagram immediately\n", - "2. **Safe experimentation**: Change foreign keys, instantly see impact on diagram\n", - "3. **Continuous refinement**: Iterate on design as you learn more about your domain\n", - "4. **Team collaboration**: Everyone works with the same code that generates diagrams\n", - "5. **Version control**: Git tracks both design and implementation (they're the same file)\n", - "\n", - "### The Bi-Directional Property\n", - "\n", - "DataJoint's approach is **bi-directional**:\n", - "\n", - "**Code → Diagram** (Normal workflow):\n", - "```python\n", - "# Write Python class definition\n", - "@schema\n", - "class MyTable(dj.Manual):\n", - " definition = \"...\"\n", - "\n", - "# Generate diagram\n", - "dj.Diagram(schema) # Automatically reflects code\n", - "```\n", - "\n", - "**Database → Code → Diagram** (Reverse engineering):\n", - "```python\n", - "# Connect to existing database\n", - "schema = dj.Schema('existing_db')\n", - "\n", - "# Spawn Python classes from tables\n", - "schema.spawn_missing_classes()\n", - "\n", - "# Generate diagram\n", - "dj.Diagram(schema) # Reflects actual database structure\n", - "```\n", - "\n", - "This means you can:\n", - "- Import existing databases and immediately visualize them\n", - "- Start from either code or database and get the diagram\n", - "- Ensure documentation always matches reality\n", - "\n", - "### Comparison Summary\n", - "\n", - "| Aspect | Traditional Two-Phase | DataJoint Unified |\n", - "|--------|----------------------|-------------------|\n", - "| **Design artifact** | ER/Crow's Foot diagram | Python/SQL code |\n", - "| **Implementation artifact** | SQL statements | Same as design |\n", - "| **Diagram generation** | Manual (tools like Visio) | Automatic from code |\n", - "| **Synchronization** | Manual discipline | Automatic |\n", - "| **Change process** | Update both separately | Update code once |\n", - "| **Version control** | Separate files | Single source |\n", - "| **Agility** | Waterfall-oriented | Iteration-friendly |\n", - "| **Documentation debt** | Accumulates over time | Impossible to accrue |\n", - "| **Learning curve** | Learn notation, then SQL | Learn one syntax |\n", - "\n", - "### Implications for This Chapter\n", - "\n", - "Because DataJoint diagrams are automatically generated from implementation, this chapter teaches you:\n", - "\n", - "1. **How to read** what the diagram tells you about the actual database\n", - "2. **How to design** by choosing appropriate line styles (which determines implementation)\n", - "3. **How to think** about the semantic meaning of relationships (not just cardinality)\n", - "\n", - "When you learn to read DataJoint diagrams, you're simultaneously learning:\n", - "- How the database is structured (implementation)\n", - "- How entities relate to each other (conceptual model)\n", - "- How to query the data (query patterns follow diagram structure)\n", - "\n", - "**The bottom line**: In DataJoint, the diagram is not a separate design document—it's a **live view** of your implemented schema. This makes diagrams more trustworthy, more useful, and more integral to the development process." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## DataJoint Diagram Structure\n", - "\n", - "A DataJoint schema is depicted as a **Directed Acyclic Graph (DAG)**, where:\n", - "\n", - "* **Nodes** represent tables in the database\n", - "* **Edges** represent foreign key constraints between tables\n", - "* **Direction** always flows from parent (referenced) to child (referencing) tables\n", - "\n", - "The key constraint in DataJoint is that **foreign keys cannot form cycles** - you cannot have a chain of foreign keys that loops back to a starting table. This constraint ensures that:\n", - "\n", - "1. Schemas can always be visualized as top-to-bottom workflows\n", - "2. Data dependencies are clear and unambiguous\n", - "3. Queries can follow predictable patterns\n", - "\n", - "Tables at the top of the diagram are independent entities (no foreign keys to other tables), while tables lower in the diagram depend on tables above them." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Foreign Keys Always Reference Primary Keys\n", - "\n", - "DataJoint enforces a simplifying convention that distinguishes it from raw SQL and other database frameworks: **foreign keys always reference the primary key of the parent table**.\n", - "\n", - "### SQL's Flexibility vs. DataJoint's Constraint\n", - "\n", - "**In SQL**, foreign keys can reference any unique key (not just the primary key):\n", - "\n", - "```sql\n", - "-- SQL allows this:\n", - "CREATE TABLE employee (\n", - " employee_id INT PRIMARY KEY,\n", - " ssn VARCHAR(11) UNIQUE,\n", - " name VARCHAR(100)\n", - ");\n", - "\n", - "CREATE TABLE access_log (\n", - " log_id INT PRIMARY KEY,\n", - " employee_ssn VARCHAR(11), -- References SSN, not employee_id!\n", - " timestamp DATETIME,\n", - " FOREIGN KEY (employee_ssn) REFERENCES employee(ssn)\n", - ");\n", - "```\n", - "\n", - "**In DataJoint**, this is not allowed:\n", - "\n", - "```python\n", - "# DataJoint only allows references to primary keys\n", - "@schema\n", - "class Employee(dj.Manual):\n", - " definition = \"\"\"\n", - " employee_id : int\n", - " ---\n", - " ssn : varchar(11)\n", - " unique index(ssn)\n", - " name : varchar(100)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class AccessLog(dj.Manual):\n", - " definition = \"\"\"\n", - " log_id : int\n", - " ---\n", - " -> Employee # MUST reference employee_id (the primary key), not ssn\n", - " timestamp : datetime\n", - " \"\"\"\n", - "```\n", - "\n", - "### Why This Constraint?\n", - "\n", - "This restriction provides several benefits:\n", - "\n", - "1. **Simplicity**: No ambiguity about what foreign keys reference\n", - "2. **Consistency**: All relationships work the same way\n", - "3. **Diagrammatic clarity**: Every edge in the diagram represents a primary key reference\n", - "4. **Identity-based relationships**: Relationships are between entities (identified by primary keys), not between arbitrary attributes\n", - "5. **Query optimization**: Primary keys always have indexes; other unique keys may not\n", - "\n", - "**Design principle**: If you need to reference a secondary unique key, that often indicates the entity should be redesigned with that key as the primary key, or the relationship should go through a different table.\n", - "\n", - "### Renamed Foreign Keys: When Names Don't Match\n", - "\n", - "Most of the time, foreign key attributes have the **same names** as the primary key attributes they reference:\n", - "\n", - "```python\n", - "@schema\n", - "class Customer(dj.Manual):\n", - " definition = \"\"\"\n", - " customer_id : int # Primary key\n", - " ---\n", - " name : varchar(100)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class Order(dj.Manual):\n", - " definition = \"\"\"\n", - " order_id : int\n", - " ---\n", - " -> Customer # Creates foreign key: customer_id → customer_id\n", - " order_date : date\n", - " \"\"\"\n", - "```\n", - "\n", - "In `Order`, the foreign key attribute is automatically named `customer_id` (same as in `Customer`).\n", - "\n", - "However, sometimes you need **different names**. Common scenarios:\n", - "\n", - "1. **Multiple references to the same table** (e.g., presynaptic and postsynaptic neurons)\n", - "2. **Semantic clarity** (e.g., `manager_id` instead of `employee_id` for a manager reference)\n", - "3. **Avoiding name conflicts** (when two parents have identically named primary keys)\n", - "\n", - "### Renamed Foreign Keys Using `.proj()`\n", - "\n", - "DataJoint uses **projection** to rename foreign key attributes:\n", - "\n", - "```python\n", - "@schema\n", - "class Neuron(dj.Manual):\n", - " definition = \"\"\"\n", - " neuron_id : int\n", - " ---\n", - " neuron_type : varchar(20)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class Synapse(dj.Manual):\n", - " definition = \"\"\"\n", - " synapse_id : int\n", - " ---\n", - " -> Neuron.proj(presynaptic='neuron_id') # Renames to 'presynaptic'\n", - " -> Neuron.proj(postsynaptic='neuron_id') # Renames to 'postsynaptic'\n", - " strength : float\n", - " \"\"\"\n", - "```\n", - "\n", - "**What happens**:\n", - "- Instead of creating `neuron_id`, creates `presynaptic` and `postsynaptic`\n", - "- Both still reference the `neuron_id` primary key in the `Neuron` table\n", - "- Allows two foreign keys from the same child to the same parent\n", - "\n", - "### Orange Dots in Diagrams: Visualizing Renamed References\n", - "\n", - "When foreign keys are renamed, DataJoint diagrams show them as **orange dots** (intermediate nodes):\n", - "\n", - "```\n", - " Neuron\n", - " / / (●) (●) ← Orange dots represent renamed references\n", - " | |\n", - " ↓ ↓\n", - " Synapse\n", - "```\n", - "\n", - "The orange dots indicate:\n", - "- A **projection** has been applied to rename the foreign key\n", - "- The actual foreign key attribute name differs from the parent's primary key name\n", - "- Multiple edges can connect the same pair of tables\n", - "\n", - "**In the actual diagram**, hovering over the orange dot reveals:\n", - "- The parent table being referenced (`Neuron`)\n", - "- The projection expression (e.g., `presynaptic='neuron_id'`)\n", - "- The renamed attribute name\n", - "\n", - "### Multiple Edges Between Tables\n", - "\n", - "This naming mechanism enables **multigraphs**—where two tables can be connected by multiple distinct foreign keys:\n", - "\n", - "#### Example 1: Neural Connectivity\n", - "\n", - "```python\n", - "@schema\n", - "class Neuron(dj.Manual):\n", - " definition = \"\"\"\n", - " neuron_id : int\n", - " ---\n", - " layer : int\n", - " neuron_type : enum('excitatory', 'inhibitory')\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class Synapse(dj.Manual):\n", - " definition = \"\"\"\n", - " synapse_id : int\n", - " ---\n", - " -> Neuron.proj(presynaptic='neuron_id')\n", - " -> Neuron.proj(postsynaptic='neuron_id')\n", - " strength : float\n", - " \"\"\"\n", - "```\n", - "\n", - "Result: Two foreign keys from `Synapse` to `Neuron`, creating a directed graph of neural connections.\n", - "\n", - "#### Example 2: Employee Reporting Structure\n", - "\n", - "```python\n", - "@schema\n", - "class Employee(dj.Manual):\n", - " definition = \"\"\"\n", - " employee_id : int\n", - " ---\n", - " name : varchar(100)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class ReportsTo(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Employee # subordinate\n", - " ---\n", - " -> Employee.proj(manager_id='employee_id') # manager\n", - " \"\"\"\n", - "```\n", - "\n", - "Result: Two foreign keys from `ReportsTo` to `Employee`, representing the management hierarchy.\n", - "\n", - "#### Example 3: Flight Routes\n", - "\n", - "```python\n", - "@schema\n", - "class Airport(dj.Manual):\n", - " definition = \"\"\"\n", - " airport_code : char(3)\n", - " ---\n", - " airport_name : varchar(100)\n", - " city : varchar(50)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class Flight(dj.Manual):\n", - " definition = \"\"\"\n", - " flight_number : varchar(10)\n", - " ---\n", - " -> Airport.proj(origin='airport_code')\n", - " -> Airport.proj(destination='airport_code')\n", - " departure_time : time\n", - " \"\"\"\n", - "```\n", - "\n", - "Result: Two foreign keys from `Flight` to `Airport`, representing origin and destination.\n", - "\n", - "### When to Use Renamed Foreign Keys\n", - "\n", - "**Use renamed foreign keys when**:\n", - "- ✅ You need multiple references to the same parent table\n", - "- ✅ The semantic role of the reference is important (presynaptic vs. postsynaptic)\n", - "- ✅ You want to avoid name conflicts\n", - "- ✅ The relationship is self-referential (table references itself indirectly)\n", - "\n", - "**Avoid renaming when**:\n", - "- ❌ There's only one reference to the parent\n", - "- ❌ The default name is clear and unambiguous\n", - "- ❌ Simpler is better—don't rename unnecessarily\n", - "\n", - "### Understanding Orange Dots in Diagrams\n", - "\n", - "When you see orange dots in a DataJoint diagram:\n", - "\n", - "1. **Count the dots**: Each dot represents a renamed foreign key reference\n", - "2. **Follow the path**: Dot connects the child table to the parent through a renamed projection\n", - "3. **Check for multiple edges**: If you see multiple orange dots connecting the same tables, those are multiple distinct foreign keys\n", - "4. **Hover for details**: In interactive diagrams, hovering reveals the projection expression\n", - "\n", - "**Visual interpretation**:\n", - "```\n", - "Table A\n", - " |\n", - " ├─── (●) ──→ Table B (first foreign key, renamed)\n", - " |\n", - " └─── (●) ──→ Table B (second foreign key, renamed)\n", - "```\n", - "\n", - "This means Table A has two different foreign keys both referencing Table B's primary key, but with different attribute names.\n", - "\n", - "### Comparison to Other Notations\n", - "\n", - "| Feature | SQL (Standard) | Traditional ERD | DataJoint |\n", - "|---------|---------------|-----------------|-----------|\n", - "| **FK can reference** | Any unique key | Any unique key | Primary key only |\n", - "| **Renamed FK** | Column name differs | Not standardized | `.proj()` syntax |\n", - "| **Multiple FK to same table** | Yes | Yes, but notation varies | Yes, via orange dots |\n", - "| **Self-referencing** | Common (FK to own table) | Shown as loop | Avoided; use association table |\n", - "| **Visual indication of rename** | N/A | Not standard | Orange dot nodes |\n", - "\n", - "### Design Patterns with Renamed Foreign Keys\n", - "\n", - "Renamed foreign keys are essential for several common patterns:\n", - "\n", - "**Pattern 1: Directed Graphs**\n", - "- Nodes: Single table (Neuron, Person, Airport)\n", - "- Edges: Association table with two renamed FKs to the same parent\n", - "- Use: Neural networks, social networks, transportation networks\n", - "\n", - "**Pattern 2: Hierarchical Relationships**\n", - "- Parent/child in same table type (Employee → Manager, Category → ParentCategory)\n", - "- Association table connects with renamed references\n", - "\n", - "**Pattern 3: Comparison Tables**\n", - "- Comparing entities of the same type\n", - "- Example: ProductComparison with product_a and product_b both referencing Product\n", - "\n", - "### The Bottom Line\n", - "\n", - "DataJoint's convention of **foreign keys always referencing primary keys** simplifies the conceptual model. When you see an edge in a DataJoint diagram, you know it represents a primary key reference. \n", - "\n", - "When you see **orange dots**, you know:\n", - "- Foreign key attributes are renamed for clarity or necessity\n", - "- Multiple relationships may exist between the same tables\n", - "- The diagram explicitly shows these special cases\n", - "\n", - "This makes DataJoint diagrams both simpler (one type of reference) and more expressive (orange dots highlight special relationships) than traditional notations." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "schema_graph = dj.Schema('directed_graph')\n", - "\n", - "@schema_graph\n", - "class Neuron(dj.Manual):\n", - " definition = \"\"\"\n", - " neuron_id : int\n", - " ---\n", - " neuron_type : enum('excitatory', 'inhibitory')\n", - " layer : int\n", - " \"\"\"\n", - "\n", - "@schema_graph\n", - "class Synapse(dj.Manual):\n", - " definition = \"\"\"\n", - " synapse_id : int\n", - " ---\n", - " -> Neuron.proj(presynaptic='neuron_id')\n", - " -> Neuron.proj(postsynaptic='neuron_id')\n", - " strength : float\n", - " \"\"\"\n", - "\n", - "dj.Diagram(schema_graph)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Observing Orange Dots in Action\n", - "\n", - "In the diagram above, notice:\n", - "\n", - "* **Two orange dots** appear between `Neuron` and `Synapse`\n", - "* Each orange dot represents a renamed foreign key reference\n", - "* One dot represents the `presynaptic` reference\n", - "* The other represents the `postsynaptic` reference\n", - "* Both ultimately reference `Neuron.neuron_id` (the primary key)\n", - "\n", - "**This is how DataJoint visualizes multigraphs**: When two tables are connected by multiple foreign keys, each foreign key appears as a separate edge with its own orange dot (if renamed) or direct line (if not renamed).\n", - "\n", - "**Interactive tip**: In Jupyter notebooks, hover over the orange dots to see:\n", - "- Which table is being referenced\n", - "- The projection expression showing the rename\n", - "- The attribute name in the child table" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "## Database Schemas as Workflows: A Different Philosophy\n\n```{seealso}\nThe workflow approach to database design is covered in depth in [Normalization](055-normalization.ipynb), which presents workflow normalization as one of three normalization approaches. This section focuses on how to **read diagrams** as workflow representations.\n```\n\nDataJoint espouses a fundamentally different view of database schemas compared to traditional Entity-Relationship modeling: **database schemas represent workflows**, not just static collections of entities and their relationships.\n\n### Traditional ERD View: Static Entity-Relationship Model\n\nTraditional ER diagrams focus on:\n- **Entities**: Things that exist (Customer, Product, Order)\n- **Relationships**: How entities relate (customer \"places\" order, order \"contains\" product)\n- **Cardinality**: How many of each entity participate in relationships\n\n**Conceptual Model**: The database is a collection of related entities\n\n**No workflow concept**: ERDs don't inherently suggest an order of operations. You can't look at an ERD and know:\n- Which tables to populate first\n- What sequence of operations the business follows\n- How data flows through the system\n\n```\nTraditional ERD (no inherent direction):\n\n Customer ←──places──→ Order ←──contains──→ Product\n ↕ ↕ ↕\n Department Shipment Inventory\n```\n\n### DataJoint View: Schemas as Operational Workflows\n\nDataJoint schemas represent **sequences of steps, operations, or transformations**:\n\n**Conceptual Model**: The database is a workflow—a directed sequence of data transformations and dependencies\n\n**Built-in workflow concept**: Every DataJoint diagram shows:\n- Which entities can be created first (top of diagram)\n- What depends on what (arrows show dependencies)\n- The operational sequence (read top-to-bottom)\n\n```\nDataJoint Schema (directional workflow):\n\n Customer Product ← Independent entities (populate first)\n ↓ ↓\n Order ← Depends on customers and products\n ↓\n OrderItem ← Depends on orders\n ↓\n Shipment ← Depends on items being ready\n ↓\n Delivery ← Final step in workflow\n```\n\n**Reading the workflow**: \n1. Start by creating customers and products\n2. Customers place orders referencing products\n3. Orders are broken into items\n4. Items are collected and shipped\n5. Shipments are delivered to customers\n\n### The DAG Structure Enables Workflow Interpretation\n\nThe **Directed Acyclic Graph (DAG)** structure is not just a technical constraint—it's a fundamental design choice that enables workflow thinking:\n\n**Direction**:\n- All foreign keys point \"upstream\" in the workflow\n- Dependencies flow from top to bottom\n- Schemas naturally represent operational sequences\n\n**Acyclic (No Loops)**:\n- Prevents circular dependencies\n- Ensures there's always a valid starting point\n- Makes the workflow execution order unambiguous\n\n**Implications**:\n- You can read any DataJoint schema as an operational manual\n- The vertical position tells you when things happen in the workflow\n- Understanding the schema = understanding the business process\n\n### Relationships Through Converging Edges\n\nIn DataJoint, **relationships are established by converging edges**—when a table has foreign keys to multiple upstream tables:\n\n```\n TableA TableB\n ↓ ↓\n TableC ← Converging edges create relationship\n```\n\n**What this means**:\n- `TableC` requires matching entities from both `TableA` and `TableB`\n- To create an entry in `TableC`, you must find compatible entities upstream\n- The relationship is defined by the **matching** of upstream entities\n\n### Design for Efficient Queries\n\nDataJoint schemas are designed with **query efficiency** in mind:\n\n#### Solid Lines = Direct Join Paths\n\n```\n Customer\n ↓ (solid)\n Order\n ↓ (solid)\n OrderItem\n```\n\n**Query benefit**: Can join `Customer * OrderItem` directly because the primary key cascades through solid lines.\n\n#### Dashed Lines = Must Include Intermediate Tables\n\n```\n Product\n ↓ (dashed)\n Order\n ↓ (solid)\n OrderItem\n```\n\n**Query requirement**: To join `Product` and `OrderItem`, must include `Order` because dashed lines don't cascade primary keys.\n\n**The diagram guides query design**: Follow the solid line paths for efficient joins.\n\n### Summary: Workflow Thinking\n\nThe workflow paradigm means:\n- **Design phase**: Think about your workflow first—what are the steps, what depends on what?\n- **Schema structure**: Map workflow to tables—independent entities at top, dependent steps cascade down\n- **Execution**: Follow the workflow—populate independent entities first, computed tables execute automatically\n- **Understanding**: Read the schema as a process map—vertical position = when in the workflow" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Quick Reference: Line Styles\n", - "\n", - "Before diving into details, here's a quick reference for interpreting DataJoint diagrams:\n", - "\n", - "| Line Style | Symbol | Meaning | Semantic Relationship | Example |\n", - "|------------|--------|---------|----------------------|---------|\n", - "| **Thick Solid** | ━━━ | One-to-one | **Extension**: Child extends parent | Customer ━━━ CustomerNotes |\n", - "| **Thin Solid** | ─── | One-to-many | **Containment**: Child belongs to parent | Customer ─── Account |\n", - "| **Dashed** | ┄┄┄ | One-to-many | **Reference**: Child references parent | Account ┄┄┄ Bank |\n", - "\n", - "**Key Principle**: Solid lines mean the parent's identity becomes part of the child's identity. Dashed lines mean the child maintains independent identity.\n", - "\n", - "**Critical Distinction**: \n", - "- **Thin solid** → Many children can belong to one parent (containment)\n", - "- **Thick solid** → Only one child can extend each parent (extension)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Example: Project Assignment Schema\n", - "\n", - "Let's start with a simple schema called \"projects\" depicting employees, projects, and assignments of projects to employees. This is a classic many-to-many relationship:\n", - "\n", - "* **Employee** and **Project** are independent entities (no foreign keys)\n", - "* **Assignment** is an association table that links employees to projects\n", - "* One employee can be assigned to multiple projects\n", - "* One project can have multiple employees assigned to it\n" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[2025-10-08 19:08:07,939][INFO]: DataJoint 0.14.6 connected to dev@db:3306\n" - ] - } - ], - "source": [ - "import datajoint as dj\n", - "\n", - "dj.conn()\n", - "\n", - "\n", - "schema = dj.Schema(\"projects\")\n", - "\n", - "@schema\n", - "class Employee(dj.Manual):\n", - " definition = \"\"\"\n", - " employee_id : int\n", - " ---\n", - " employee_name : varchar(60)\n", - " \"\"\"\n", - "\n", - "@schema\n", - "class Project(dj.Manual):\n", - " definition = \"\"\"\n", - " project_code : varchar(8)\n", - " ---\n", - " project_title : varchar(50)\n", - " start_date : date\n", - " end_date : date\n", - " \"\"\"\n", - " \n", - "@schema\n", - "class Assignment(dj.Manual):\n", - " definition = \"\"\"\n", - " -> Employee\n", - " -> Project\n", - " ---\n", - " percent_effort : decimal(4,1) unsigned\n", - " \"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "Assignment\n", - "\n", - "\n", - "Assignment\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Employee\n", - "\n", - "\n", - "Employee\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Employee->Assignment\n", - "\n", - "\n", - "\n", - "\n", - "Project\n", - "\n", - "\n", - "Project\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Project->Assignment\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "dj.Diagram(schema)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Interpreting the Diagram\n", - "\n", - "In this diagram, observe the following features:\n", + "This DAG structure embodies a core principle of the Relational Workflow Model: **the schema is an executable specification**.\n", + "Tables at the top are independent entities; tables below depend on tables above them.\n", + "Reading the diagram top-to-bottom reveals the workflow execution order.\n", "\n", - "* **Green boxes with underlined names**: These are the tables. The underline indicates these are independent entities (dimensions) with their own primary keys.\n", - "* **Thin solid lines**: Both edges use thin solid lines, indicating that the foreign keys from `Assignment` are part of its primary key.\n", - "* **Converging pattern**: The `Assignment` table has two foreign keys converging into it, which is the visual signature of an association table creating a many-to-many relationship.\n", - "* **Top-to-bottom layout**: Independent entities (`Employee` and `Project`) are at the top, and the dependent table (`Assignment`) is at the bottom.\n", - "\n", - "**Interactive Features**: In Jupyter notebooks, you can hover over any table in the diagram to see its complete definition, including all attributes and constraints.\n" + "DataJoint's diagramming notation differs from traditional notations (Chen's ER, Crow's Foot, UML) in one critical way: **line styles encode semantic relationship types**, not just cardinality.\n", + "This makes the diagram immediately informative about how entities relate—whether they share identity, belong to each other, or merely reference each other." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Understanding Line Styles\n", - "\n", - "The most important feature of DataJoint diagrams is how **line styles indicate relationship types**. There are three types of lines, each conveying specific information about how tables are related.\n", - "\n", - "## The Three Line Styles\n", - "\n", - "### 1. Thick Solid Line (━━━)\n", - "\n", - "**Meaning**: One-to-one relationship where the foreign key constitutes the **entire primary key** of the child table.\n", - "\n", - "**Conceptual Model**: The child entity is a **specialized extension or elaboration** of the parent entity. The child doesn't have its own independent identity—it IS the parent, just with additional information attached.\n", - "\n", - "**Characteristics**:\n", - "- Child table shares the exact same primary key as the parent\n", - "- Creates the strongest form of dependency and unified identity\n", - "- Child inherits the parent's complete identity\n", - "- Perfect for adding optional or modular information to an entity\n", - "- Enables direct joins across multiple levels of hierarchy\n", - "\n", - "**Example Use Cases**:\n", - "- Workflow sequences: Order → Shipment → Delivery (each step extends the order)\n", - "- Optional entity extensions: Customer → CustomerPreferences (preferences extend customer)\n", - "- Modular data splits: Experiment → ExperimentNotes (notes extend experiment)\n", - "- One-to-one specializations where child adds detail without changing identity\n", - "\n", - "**Think of it as**: \"Child extends parent\" or \"Child specializes parent\"\n", - "\n", - "### 2. Thin Solid Line (───)\n", - "\n", - "**Meaning**: One-to-many relationship where the foreign key is **part of** (but not all of) the child's primary key.\n", - "\n", - "**Conceptual Model**: The child entity **belongs to or is contained within** the parent entity. The child has its own identity, but only within the context of its parent. Multiple children can exist for each parent, each identified by the parent's key plus additional distinguishing attributes.\n", - "\n", - "**Characteristics**:\n", - "- Child table has a composite primary key: parent's PK + additional field(s)\n", - "- Creates hierarchical ownership or containment structures\n", - "- Child's identity is contextualized by parent (e.g., \"Account #3 of Customer #42\")\n", - "- Parent's primary key \"cascades\" down, becoming part of child's identity\n", - "- Enables direct joins to ancestors without intermediate tables\n", - "\n", - "**Example Use Cases**:\n", - "- Hierarchies: Study → Subject → Session → Scan (sessions belong to subjects)\n", - "- Ownership: Customer → Account (accounts belong to customers)\n", - "- Containment: Folder → File (files are contained in folders)\n", - "- Parts-of: Order → OrderItem (items are parts of orders)\n", - "\n", - "**Think of it as**: \"Child belongs to parent\" or \"Child is contained in parent\"\n", - "\n", - "### 3. Dashed Line (- - - -)\n", - "\n", - "**Meaning**: One-to-many relationship where the foreign key is a **secondary attribute** (not part of the primary key).\n", - "\n", - "**Conceptual Model**: The child entity **references or associates with** the parent but maintains complete independence. The child has its own identity that is unrelated to the parent, and the parent is just one of many attributes describing the child.\n", - "\n", - "**Characteristics**:\n", - "- Child table has its own independent primary key\n", - "- Foreign key appears below the line (in secondary attributes)\n", - "- Relationship is \"looser\" - no identity cascade\n", - "- Cannot skip intermediate tables in joins\n", - "- Relationship can be more easily changed or made optional\n", - "\n", - "**Example Use Cases**:\n", - "- Optional associations: Product → Manufacturer (product exists independently)\n", - "- References that might change: Employee → Department (employee might transfer)\n", - "- Loose couplings: Document → Author (document has independent identity)\n", - "- When child entity has independent identity from parent\n", - "\n", - "**Think of it as**: \"Child references parent\" or \"Child is loosely associated with parent\"\n", - "\n", - "## Summary: The Conceptual Framework\n", - "\n", - "Understanding the semantic difference between line types:\n", - "\n", - "| Line Type | Semantic Relationship | Identity | Cardinality |\n", - "|-----------|----------------------|----------|-------------|\n", - "| **Thick Solid** | Extension/Specialization | Shared | One-to-one |\n", - "| **Thin Solid** | Containment/Belonging | Contextualized | One-to-many |\n", - "| **Dashed** | Reference/Association | Independent | One-to-many |\n", - "\n", - "**Key Insight**: Solid lines (thick or thin) indicate that the parent's identity becomes part of the child's identity, either completely (thick) or partially (thin). Dashed lines indicate the child maintains its own independent identity." + "## Quick Reference\n", + "\n", + "| Line Style | Appearance | Relationship | Child's Primary Key | Cardinality |\n", + "|------------|------------|--------------|--------------------|--------------|\n", + "| **Thick Solid** | ━━━ | Extension | Parent PK only | One-to-one |\n", + "| **Thin Solid** | ─── | Containment | Parent PK + own field(s) | One-to-many |\n", + "| **Dashed** | ┄┄┄ | Reference | Own independent PK | One-to-many |\n", + "\n", + "**Key Principle**: Solid lines mean the parent's identity becomes part of the child's identity.\n", + "Dashed lines mean the child maintains independent identity.\n", + "\n", + "**Visual Indicators**:\n", + "- **Underlined table name**: Independent entity with its own primary key\n", + "- **Non-underlined name**: Dependent entity whose identity derives from parent\n", + "- **Orange dots**: Renamed foreign keys (see [Renamed Foreign Keys](#renamed-foreign-keys-and-orange-dots))\n", + "- **Table colors**: Green (Manual), Blue (Imported), Red (Computed), Gray (Lookup)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Key Conceptual Distinction: Containment vs. Extension\n", - "\n", - "The choice between thin and thick solid lines reflects a fundamental design decision about the **semantic relationship** between entities:\n", - "\n", - "### Thin Lines → Containment/Belonging (One-to-Many)\n", - "\n", - "When you use a thin solid line, you're saying:\n", - "- *\"The child belongs to the parent\"*\n", - "- *\"Multiple children can exist within each parent\"*\n", - "- *\"The child's identity includes the parent's identity\"*\n", - "\n", - "**Example**: A `Session` belongs to a `Subject`. You can have Session 1, Session 2, Session 3 all belonging to the same subject. The sessions are identified as \"Subject 42, Session 1\" and \"Subject 42, Session 2\".\n", - "\n", - "```\n", - "-> Subject\n", - "session : int\n", - "```\n", - "\n", - "**Result**: Composite primary key `(subject_id, session)` → Thin solid line\n", - "\n", - "### Thick Lines → Extension/Specialization (One-to-One)\n", + "## The Three Line Styles\n", "\n", - "When you use a thick solid line, you're saying:\n", - "- *\"The child extends the parent with additional information\"*\n", - "- *\"There can be at most one child for each parent\"*\n", - "- *\"The child and parent share the same identity\"*\n", + "Line styles convey the **semantic relationship** between parent and child tables.\n", + "The choice of line style is determined by where the foreign key appears in the child's definition.\n", "\n", - "**Example**: A `Shipment` extends an `Order`. Once an order is placed, it may or may not be shipped. If it's shipped, there's exactly one shipment record that shares the order's identity.\n", + "### Thick Solid Line: Extension (One-to-One)\n", "\n", - "```\n", - "-> Order\n", - "```\n", + "The foreign key **is** the entire primary key of the child table.\n", "\n", - "**Result**: Primary key is just `order_id` (inherited) → Thick solid line\n", + "**Semantics**: The child *extends* or *specializes* the parent.\n", + "They share the same identity—at most one child exists for each parent.\n", "\n", - "### Visual Summary\n", + "```python\n", + "@schema\n", + "class Customer(dj.Manual):\n", + " definition = \"\"\"\n", + " customer_id : int\n", + " ---\n", + " name : varchar(50)\n", + " \"\"\"\n", "\n", + "@schema\n", + "class CustomerPreferences(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Customer # This IS the entire primary key\n", + " ---\n", + " theme : varchar(20)\n", + " \"\"\"\n", "```\n", - "Parent Table\n", - " │\n", - " ├─── (thin solid) ──→ Child has: Parent PK + own fields → \"belongs to\"\n", - " │\n", - " └═══ (thick solid) ══→ Child has: Parent PK only → \"extends\"\n", - "```\n", - "\n", - "### When to Choose Which\n", "\n", - "Ask yourself: \"Can there be multiple children for each parent?\"\n", - "\n", - "- **Yes** → Use thin solid line (child belongs to parent)\n", - "- **No** → Use thick solid line (child extends parent)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Concrete Examples: Containment vs. Extension\n", + "**Use cases**: Workflow sequences (Order → Shipment → Delivery), optional extensions (Customer → CustomerPreferences), modular data splits.\n", "\n", - "Let's illustrate the distinction with two concrete scenarios:\n", + "### Thin Solid Line: Containment (One-to-Many)\n", "\n", - "### Scenario 1: Experiment Sessions (Containment - Thin Line)\n", + "The foreign key is **part of** (but not all of) the child's primary key.\n", "\n", - "In a lab, an experiment has multiple sessions. Each session belongs to its experiment:\n", + "**Semantics**: The child *belongs to* or *is contained within* the parent.\n", + "Multiple children can exist for each parent, each identified within the parent's context.\n", "\n", "```python\n", - "class Experiment(dj.Manual):\n", + "@schema\n", + "class Customer(dj.Manual):\n", " definition = \"\"\"\n", - " experiment_id : int\n", + " customer_id : int\n", " ---\n", - " experiment_name : varchar(100)\n", + " name : varchar(50)\n", " \"\"\"\n", "\n", - "class Session(dj.Manual):\n", + "@schema\n", + "class Account(dj.Manual):\n", " definition = \"\"\"\n", - " -> Experiment # Part of primary key\n", - " session : int # Additional PK component\n", + " -> Customer # Part of primary key\n", + " account_number : int # Additional PK component\n", " ---\n", - " session_date : date\n", + " balance : decimal(10,2)\n", " \"\"\"\n", "```\n", "\n", - "**Relationship**: Session **belongs to** Experiment\n", - "- Can have Session 1, Session 2, Session 3... all for the same experiment\n", - "- Session identity: \"Experiment 5, Session 2\"\n", - "- **Thin solid line** in diagram\n", - "- Think: \"sessions are contained within experiments\"\n", + "**Use cases**: Hierarchies (Study → Subject → Session), ownership (Customer → Account), containment (Order → OrderItem).\n", "\n", - "### Scenario 2: Experiment Notes (Extension - Thick Line)\n", + "### Dashed Line: Reference (One-to-Many)\n", "\n", - "Sometimes you want to add optional notes to an experiment without cluttering the main table:\n", + "The foreign key is a **secondary attribute** (below the `---` line).\n", + "\n", + "**Semantics**: The child *references* or *associates with* the parent but maintains independent identity.\n", + "The parent is just one attribute describing the child.\n", "\n", "```python\n", - "class Experiment(dj.Manual):\n", + "@schema\n", + "class Bank(dj.Manual):\n", " definition = \"\"\"\n", - " experiment_id : int\n", + " bank_id : int\n", " ---\n", - " experiment_name : varchar(100)\n", + " bank_name : varchar(100)\n", " \"\"\"\n", "\n", - "class ExperimentNotes(dj.Manual):\n", + "@schema\n", + "class Account(dj.Manual):\n", " definition = \"\"\"\n", - " -> Experiment # This IS the primary key\n", + " account_number : int # Own independent PK\n", " ---\n", - " notes : varchar(4000)\n", - " added_date : timestamp\n", + " -> Bank # Secondary attribute\n", + " balance : decimal(10,2)\n", " \"\"\"\n", "```\n", "\n", - "**Relationship**: ExperimentNotes **extends** Experiment\n", - "- Can have at most one notes entry per experiment\n", - "- Notes identity: same as experiment (shares `experiment_id`)\n", - "- **Thick solid line** in diagram\n", - "- Think: \"notes are an optional extension of experiment\"\n", - "\n", - "### The Decision Tree\n", - "\n", - "```\n", - "Do multiple children exist for each parent?\n", - "│\n", - "├─ YES ──→ Thin Solid Line (Containment/Belonging)\n", - "│ Child PK: parent_pk + own_field(s)\n", - "│ Example: One experiment has many sessions\n", - "│\n", - "└─ NO ───→ Thick Solid Line (Extension/Specialization)\n", - " Child PK: parent_pk only\n", - " Example: One experiment has zero or one notes entry\n", - "```" + "**Use cases**: Loose associations (Product → Manufacturer), references that might change (Employee → Department), when child has independent identity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Visual Examples of Each Line Style\n", + "## Visual Examples\n", "\n", - "Let's create schemas that demonstrate each type of line. We'll use variations of a simple customer-account relationship.\n" + "Let's see each line style in action with live diagrams." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import datajoint as dj\n", + "dj.conn()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Example 1: Dashed Line (Secondary Foreign Key)\n" + "### Dashed Line Example" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "Account\n", - "\n", - "\n", - "Account\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer\n", - "\n", - "\n", - "Customer\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer->Account\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "schema_dashed = dj.Schema('diagram_dashed')\n", "\n", @@ -1627,71 +189,29 @@ " balance : decimal(10,2)\n", " \"\"\"\n", "\n", - "dj.Diagram(schema_dashed)\n" + "dj.Diagram(schema_dashed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**Dashed line** from `Customer` to `Account`: The foreign key `customer_id` is in the secondary attributes of `Account` (below the `---` line). Each account has its own independent ID (`account_number`), and the relationship to customer is secondary. \n", - "\n", - "This represents a **reference or loose association** - the account maintains its own identity independent of the customer. This is a one-to-many relationship (one customer can have many accounts), and accounts could theoretically be reassigned to different customers by updating the foreign key (though in practice, we prefer delete-and-insert over updates)." + "**Dashed line**: `Account` has its own independent identity (`account_number`).\n", + "The `customer_id` foreign key is secondary—it references `Customer` but doesn't define the account's identity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Example 2: Thin Solid Line (Composite Primary Key)\n" + "### Thin Solid Line Example" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "Account\n", - "\n", - "\n", - "Account\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer\n", - "\n", - "\n", - "Customer\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer->Account\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "schema_thin = dj.Schema('diagram_thin')\n", "\n", @@ -1712,71 +232,29 @@ " balance : decimal(10,2)\n", " \"\"\"\n", "\n", - "dj.Diagram(schema_thin)\n" + "dj.Diagram(schema_thin)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**Thin solid line** from `Customer` to `Account`: The foreign key `customer_id` is part of `Account`'s primary key (above the `---` line). The primary key of `Account` is the composite `(customer_id, account_number)`. \n", - "\n", - "This represents **containment or belonging** - each account belongs to and is identified within the context of its customer. Account #3 means nothing on its own; it's \"Account #3 of Customer #42\". This is a one-to-many relationship with identity cascade, creating a hierarchical structure where accounts are owned by customers." + "**Thin solid line**: `Account`'s primary key is `(customer_id, account_number)`.\n", + "Accounts *belong to* customers—Account #3 means \"Account #3 of Customer X.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Example 3: Thick Solid Line (Primary Key is Foreign Key)\n" + "### Thick Solid Line Example" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer\n", - "\n", - "\n", - "Customer\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Account\n", - "\n", - "\n", - "Account\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "Customer->Account\n", - "\n", - "\n", - "\n", - "" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "schema_thick = dj.Schema('diagram_thick')\n", "\n", @@ -1796,216 +274,337 @@ " balance : decimal(10,2)\n", " \"\"\"\n", "\n", - "dj.Diagram(schema_thick)\n" + "dj.Diagram(schema_thick)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "**Thick solid line** from `Customer` to `Account`: The foreign key IS the entire primary key of `Account`. The primary key of `Account` is just `customer_id` (inherited from `Customer`). \n", - "\n", - "This represents an **extension or specialization** - the account extends the customer entity with additional financial information. Customer and account share the same identity; they are one-to-one. This is useful when you want to modularize data (separate customer info from account info) while maintaining that each customer can have at most one account. Note that `Account` is no longer underlined in the diagram, indicating it's not an independent dimension but rather an extension of `Customer`." + "**Thick solid line**: `Account`'s primary key *is* `customer_id` (inherited from `Customer`).\n", + "Each customer can have at most one account—they share identity.\n", + "Note that `Account` is no longer underlined, indicating it's not an independent dimension." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# What Diagrams Show and Don't Show\n", - "\n", - "Understanding the limitations of diagram notation is just as important as understanding what they do show.\n", - "\n", - "## What Diagrams Clearly Indicate\n", - "\n", - "✅ **Line Style (Thick/Thin/Dashed)**:\n", - "- Whether foreign keys are in primary key or secondary attributes\n", - "- Whether relationships are one-to-one (thick) or one-to-many (thin/dashed)\n", - "- Whether primary keys cascade through relationships (solid lines)\n", + "## Association Tables and Many-to-Many Relationships\n", "\n", - "✅ **Direction of Dependencies**:\n", - "- Which table depends on which (arrows point from parent to child)\n", - "- Workflow order (top to bottom)\n", - "- Which tables are independent vs. dependent\n", - "\n", - "✅ **Table Types**:\n", - "- Underlined names = independent entities (dimensions)\n", - "- Non-underlined names = dependent entities (no independent identity)\n", - "- Color coding for table tiers (Manual, Lookup, Imported, Computed, etc.)\n", - "\n", - "✅ **Association Patterns**:\n", - "- Converging lines indicate association tables\n", - "- Many-to-many relationships (two foreign keys in primary key)\n", + "Many-to-many relationships appear as tables with **converging foreign keys**—multiple thin solid lines pointing into a single table." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "schema_assoc = dj.Schema(\"projects\")\n", "\n", - "## What Diagrams Don't Show\n", + "@schema_assoc\n", + "class Employee(dj.Manual):\n", + " definition = \"\"\"\n", + " employee_id : int\n", + " ---\n", + " employee_name : varchar(60)\n", + " \"\"\"\n", "\n", - "❌ **Nullable Foreign Keys**:\n", - "- Whether a foreign key is nullable (allows NULL values)\n", - "- Cannot distinguish between mandatory and optional relationships\n", - "- Must examine table definition to see nullable modifiers\n", + "@schema_assoc\n", + "class Project(dj.Manual):\n", + " definition = \"\"\"\n", + " project_code : varchar(8)\n", + " ---\n", + " project_title : varchar(50)\n", + " start_date : date\n", + " end_date : date\n", + " \"\"\"\n", + " \n", + "@schema_assoc\n", + "class Assignment(dj.Manual):\n", + " definition = \"\"\"\n", + " -> Employee\n", + " -> Project\n", + " ---\n", + " percent_effort : decimal(4,1) unsigned\n", + " \"\"\"\n", "\n", - "❌ **Secondary Unique Constraints**:\n", - "- Unique indexes on secondary attributes\n", - "- Could convert a one-to-many into a one-to-one relationship\n", - "- Not visible in the diagram at all\n", + "dj.Diagram(schema_assoc)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Reading this diagram**:\n", + "- `Employee` and `Project` are independent entities (underlined, at top)\n", + "- `Assignment` has two thin solid lines converging into it\n", + "- Its primary key is `(employee_id, project_code)`—the combination of both parents\n", + "- This creates a many-to-many relationship: each employee can work on multiple projects, and each project can have multiple employees" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Renamed Foreign Keys and Orange Dots\n", "\n", - "❌ **Actual Attribute Names**:\n", - "- Only table names are shown (hover to see attributes)\n", - "- Foreign key field names might be renamed via projection\n", - "- Must inspect definition to see exact field names\n", + "DataJoint foreign keys always reference the parent's **primary key**.\n", + "Usually, the foreign key attribute keeps the same name as in the parent.\n", + "However, sometimes you need different names:\n", "\n", - "❌ **Data Types and Constraints**:\n", - "- Cannot see CHECK constraints, default values, etc.\n", - "- Must examine table definition for these details\n", + "- **Multiple references to the same table** (e.g., presynaptic and postsynaptic neurons)\n", + "- **Semantic clarity** (e.g., `manager_id` instead of `employee_id`)\n", + "- **Avoiding name conflicts**\n", "\n", - "❌ **Composite Unique Constraints**:\n", - "- Complex uniqueness rules beyond the primary key\n", - "- May fundamentally change relationship semantics\n", + "Use `.proj()` to rename foreign key attributes:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "schema_graph = dj.Schema('directed_graph')\n", "\n", - "## Best Practice\n", + "@schema_graph\n", + "class Neuron(dj.Manual):\n", + " definition = \"\"\"\n", + " neuron_id : int\n", + " ---\n", + " neuron_type : enum('excitatory', 'inhibitory')\n", + " layer : int\n", + " \"\"\"\n", "\n", - "**Design Principle**: DataJoint users generally avoid secondary unique constraints when the primary key structure can enforce uniqueness. Making foreign keys part of the primary key (creating solid lines) provides two major benefits:\n", + "@schema_graph\n", + "class Synapse(dj.Manual):\n", + " definition = \"\"\"\n", + " synapse_id : int\n", + " ---\n", + " -> Neuron.proj(presynaptic='neuron_id')\n", + " -> Neuron.proj(postsynaptic='neuron_id')\n", + " strength : float\n", + " \"\"\"\n", "\n", - "1. **Visual Clarity**: The relationship type is immediately obvious from the diagram\n", - "2. **Query Simplicity**: Primary keys cascade through foreign keys, enabling direct joins between distant tables\n" + "dj.Diagram(schema_graph)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Practical Tips for Working with Diagrams\n", + "**Orange dots** appear between `Neuron` and `Synapse`, indicating:\n", + "- A projection has renamed the foreign key attribute\n", + "- Two distinct foreign keys connect the same pair of tables\n", + "- In the `Synapse` table: `presynaptic` and `postsynaptic` both reference `Neuron.neuron_id`\n", "\n", - "## How to Read a Schema Quickly\n", + "In interactive Jupyter notebooks, hovering over orange dots reveals the projection expression.\n", "\n", - "When encountering a new schema diagram, follow this systematic approach:\n", + "**Common patterns** using renamed foreign keys:\n", + "- **Neural networks**: Presynaptic and postsynaptic neurons\n", + "- **Organizational hierarchies**: Employee and manager (both reference `Employee`)\n", + "- **Transportation**: Origin and destination airports" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "## Real-World Example: Classic Sales Database\n\nLet's examine a real database—the [MySQL tutorial sample database](https://www.mysqltutorial.org/getting-started-with-mysql/mysql-sample-database/).\n\n### Traditional ER Diagram\n\nHere is the classic Entity-Relationship diagram from the MySQL tutorial:\n\n![Classic Sales ER Diagram](../images/mysql-classic-sales-ERD.png)\n\nThis diagram uses Crow's Foot notation, where:\n- Lines with crow's feet indicate \"many\" relationships\n- Single lines indicate \"one\" relationships\n- The diagram shows cardinality but not the semantic nature of relationships\n\n### DataJoint Diagram\n\nNow let's see how the same database appears in DataJoint notation:" + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "schema = dj.Schema(\"classic_sales\")\n", + "schema.spawn_missing_classes()\n", "\n", - "1. **Identify Independent Entities** (top of diagram, underlined)\n", - " - These are your starting points for data entry\n", - " - No dependencies on other tables\n", - " \n", - "2. **Trace the Solid Lines** (thick or thin)\n", - " - Follow the cascading primary keys\n", - " - Understand which tables can be joined directly\n", - " \n", - "3. **Spot Association Tables** (converging patterns)\n", - " - Look for tables with multiple foreign keys in their primary key\n", - " - These represent many-to-many relationships\n", - " \n", - "4. **Check Line Thickness**\n", - " - Thick lines = one-to-one relationships (workflow steps, optional extensions)\n", - " - Thin lines = one-to-many hierarchies (ownership, containment)\n", - " - Dashed lines = one-to-many loose associations\n", + "dj.Diagram(schema)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": "### Comparing the Two Diagrams\n\n**Reading the DataJoint diagram**:\n1. **Independent entities at top**: `Productline`, `Office`, `Customer` (underlined)\n2. **Follow solid lines down**: Track how primary keys cascade through the hierarchy\n3. **Identify association tables**: Look for converging lines (e.g., `Orderdetail` links `Order` and `Product`)\n4. **Dashed lines**: Reference relationships that don't cascade identity\n\n**Key differences from the ER diagram**:\n\n| Aspect | Traditional ER (Crow's Foot) | DataJoint |\n|--------|------------------------------|-----------|\n| **Layout** | Arbitrary arrangement | Top-to-bottom workflow order |\n| **Line meaning** | Cardinality only (one vs. many) | Semantic relationship type |\n| **Primary key cascade** | Not visible | Solid lines show direct join paths |\n| **Workflow sequence** | Must read documentation | Clear from vertical structure |\n\nThe vertical layout reveals the workflow: create product lines and offices first, then products and employees, then customers and orders, and finally order details and payments.\n\n:::{seealso}\nFor the complete schema with data and example queries, see the [Classic Sales](../80-examples/010-classic-sales.ipynb) example.\n:::" + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What Diagrams Show and Don't Show\n", + "\n", + "### Clearly Indicated\n", + "\n", + "| Feature | How It's Shown |\n", + "|---------|---------------|\n", + "| Relationship type | Line style (thick/thin/dashed) |\n", + "| Dependency direction | Arrows from parent to child |\n", + "| Independent entities | Underlined table names |\n", + "| Table tiers | Colors (Green/Blue/Red/Gray) |\n", + "| Many-to-many | Converging lines into association table |\n", + "| Renamed foreign keys | Orange dots |\n", + "\n", + "### Not Visible\n", + "\n", + "| Feature | Must Check |\n", + "|---------|------------|\n", + "| Nullable foreign keys | Table definition |\n", + "| Secondary unique constraints | Table definition |\n", + "| Attribute names and types | Hover or inspect definition |\n", + "| CHECK constraints | Table definition |\n", + "\n", + "**Design principle**: DataJoint users generally avoid secondary unique constraints.\n", + "Making foreign keys part of the primary key (creating solid lines) provides visual clarity and enables direct joins across multiple levels." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Diagram Operations\n", "\n", - "## Designing with Diagrams in Mind\n", + "DataJoint provides operators to filter and combine diagrams:\n", "\n", - "When designing a new schema, consider:\n", + "```python\n", + "# Show entire schema\n", + "dj.Diagram(schema)\n", "\n", - "**Use Solid Lines When**:\n", - "- Building hierarchical structures (Study → Subject → Session)\n", - "- Creating workflow sequences (Order → Ship → Deliver)\n", - "- You want to enable direct joins across levels\n", + "# Show specific tables\n", + "dj.Diagram(Table1) + dj.Diagram(Table2)\n", "\n", - "**Use Dashed Lines When**:\n", - "- Child has independent identity from parent\n", - "- Relationship might change frequently\n", - "- You don't want primary key cascade\n", + "# Show table and N levels of upstream dependencies\n", + "dj.Diagram(Table) - N\n", "\n", - "**Use Thick Lines When**:\n", - "- Creating one-to-one relationships\n", - "- Extending entities with optional information\n", - "- Modeling sequential workflows\n", + "# Show table and N levels of downstream dependents\n", + "dj.Diagram(Table) + N\n", "\n", - "## Diagram-Driven Queries\n", + "# Combine operations\n", + "(dj.Diagram(Table1) - 2) + (dj.Diagram(Table2) + 1)\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Diagrams and Queries\n", "\n", - "The diagram structure directly informs query patterns:\n", + "The diagram structure directly informs query patterns.\n", "\n", - "**Solid Line Paths** (direct joins possible):\n", + "**Solid line paths enable direct joins**:\n", "```python\n", - "# If Study → Subject → Session are connected by solid lines:\n", - "Study * Session # Valid join, no need to include Subject\n", + "# If A → B → C are connected by solid lines:\n", + "A * C # Valid—primary keys cascade through solid lines\n", "```\n", "\n", - "**Dashed Line Paths** (must include intermediate tables):\n", + "**Dashed lines require intermediate tables**:\n", "```python\n", - "# If Study ---> Subject (dashed), Subject → Session (solid):\n", - "Study * Subject * Session # Must include Subject\n", + "# If A ---> B (dashed), B → C (solid):\n", + "A * B * C # Must include B\n", "```\n", "\n", - "## Common Patterns at a Glance\n", + "This is why solid lines are preferred when appropriate—they simplify queries by allowing you to skip intermediate tables." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Comparison to Other Notations\n", "\n", - "| Pattern | Line Type | Primary Key Structure | Use Case |\n", - "|---------|-----------|----------------------|----------|\n", - "| Independent entities | None (top level) | Own PK | Starting points |\n", - "| Hierarchy | Thin solid | Parent PK + own field(s) | Containment, ownership |\n", - "| Sequence | Thick solid | Parent PK only | Workflows, one-to-one |\n", - "| Secondary reference | Dashed | Own PK | Association |\n", - "| Association table | Multiple thin solid | Multiple parent PKs | Many-to-many |\n", + "DataJoint's notation differs significantly from traditional database diagramming:\n", "\n", - "## Interactive Exploration\n", + "| Feature | Chen's ER | Crow's Foot | DataJoint |\n", + "|---------|-----------|-------------|----------|\n", + "| **Cardinality** | Numbers near entities | Symbols at line ends | Line thickness/style |\n", + "| **Direction** | No inherent direction | No inherent direction | Always top-to-bottom (DAG) |\n", + "| **Cycles allowed** | Yes | Yes | No |\n", + "| **Entity vs. relationship** | Distinct (rect vs. diamond) | Not distinguished | Not distinguished |\n", + "| **Primary key cascade** | Not shown | Not shown | Solid lines show this |\n", + "| **Identity sharing** | Not indicated | Not indicated | Thick solid line |\n", "\n", - "**In Jupyter Notebooks**:\n", - "- **Hover** over tables to see complete definitions (also works if you save the diagram as a SVG file)\n", + "**Why DataJoint differs**:\n", "\n", - "**Filtering Diagrams**:\n", - "```python\n", - "# Show only specific tables\n", - "dj.Diagram(Table1) + dj.Diagram(Table2)\n", + "1. **DAG structure**: No cycles means schemas are readable as workflows (top-to-bottom execution order)\n", + "2. **Line style semantics**: Immediately reveals relationship type without reading labels\n", + "3. **Primary key cascade visibility**: Solid lines show which tables can be joined directly\n", + "4. **Unified entity treatment**: No artificial distinction between \"entities\" and \"relationships\"—associations are just tables with converging foreign keys\n", "\n", - "# Show table and its immediate dependencies\n", - "dj.Diagram(Table1) - 1\n", + ":::{seealso}\n", + "The [Relational Workflows](../20-concepts/05-workflows.md) chapter covers the three database paradigms in depth, including how DataJoint's workflow-centric approach compares to Codd's mathematical model and Chen's Entity-Relationship model.\n", + ":::" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Best Practices\n", "\n", - "# Show table and its immediate up to seven layears up in the graph\n", - "dj.Diagram(Table1) - 7\n", + "### Reading Diagrams\n", "\n", - "# Show table and what depends on it \n", - "dj.Diagram(Table1) + 1\n", + "1. **Start at the top**: Identify independent entities (underlined)\n", + "2. **Follow solid lines**: Trace primary key cascades downward\n", + "3. **Spot convergence patterns**: Multiple lines into a table indicate associations\n", + "4. **Check line thickness**: Thick = one-to-one, Thin = one-to-many containment\n", + "5. **Note dashed lines**: These don't cascade identity\n", "\n", - "# Show table and what depends on it up to seven layers down in the graph\n", - "dj.Diagram(Table1) + 7\n", + "### Designing with Diagrams\n", "\n", - "# Show table and what depends on it up to seven layers down in the graph\n", - "dj.Diagram(Table1) + 7\n", + "1. **Choose solid lines when**:\n", + " - Building hierarchies (Study → Subject → Session)\n", + " - Creating workflow sequences (Order → Ship → Deliver)\n", + " - You want direct joins across levels\n", "\n", - "```\n", + "2. **Choose dashed lines when**:\n", + " - Child has independent identity from parent\n", + " - Reference might change or is optional\n", + " - You don't need primary key cascade\n", "\n", - "## Summary\n", + "3. **Choose thick lines when**:\n", + " - Extending entities with optional information\n", + " - Modeling workflow steps (one output per input)\n", + " - Creating one-to-one relationships\n", "\n", - "DataJoint diagrams are powerful tools for:\n", - "- **Understanding** existing schemas quickly\n", - "- **Communicating** design decisions visually\n", - "- **Planning** query strategies\n", - "- **Validating** relationship structures\n", + "### Interactive Tips\n", "\n", - "The key is understanding that **line style encodes relationship semantics**: thick solid lines for one-to-one, thin solid lines for cascading one-to-many, and dashed lines for non-cascading one-to-many relationships. This visual language makes complex schemas comprehensible at a glance.\n" + "- **Hover over tables** to see complete definitions (works in Jupyter and SVG exports)\n", + "- **Hover over orange dots** to see projection expressions\n", + "- **Use `+` and `-` operators** to focus on specific parts of large schemas" ] }, { "cell_type": "markdown", "metadata": {}, - "source": [] + "source": [ + "## Summary\n", + "\n", + "DataJoint diagrams are more than documentation—they are **live views** of your schema that:\n", + "\n", + "- Reveal workflow structure through top-to-bottom layout\n", + "- Show relationship semantics through line styles\n", + "- Guide query design through primary key cascade visibility\n", + "- Stay synchronized because they're generated from the actual schema\n", + "\n", + "The key insight: in DataJoint, diagrams and implementation are unified.\n", + "There's no separate design document that can drift out of sync—the diagram **is** the schema." + ] } ], "metadata": { "kernelspec": { - "display_name": "base", + "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.13.2" - }, - "orig_nbformat": 4 + "version": "3.11.0" + } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } \ No newline at end of file