Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(guides): update json database guide #1086

Closed
wants to merge 2 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -2,27 +2,105 @@
title: Why JSON is unsuitable as a database
description: The many reasons why you shouldn't use JSON as a database, and instead opt for SQL.
relevant_links:
Tips on Storing Data: https://tutorial.vco.sh/tips/storage/
Tips on Storing Data: https://tutorial.vco.sh/tips/storage/
---

JSON, quite simply, is not a database. It's not designed to be a data storage format,
rather a wayof transmitting data over a network. It's also often used as a way of doing configuration files for programs.

There is no redundancy built in to JSON. JSON is just a format, and Python has libraries for it
like json and ujson that let you load and dump it, sometimes to files, but that's all it does, write data to a file.
There is no sort of DBMS (Database Management System), which means no sort of sophistication in how the data is stored,
or built in ways to keep it safe and backed up, there's no built in encryption either - bear in mind
in larger applications encryption may be necessary for GDPR/relevant data protection regulations compliance.

JSON, unlike relational databases, has no way to store relational data,
which is a very commonly needed way of storing data.
Relational data, as the name may suggest, is data that relates to other data.
For example if you have a table of users and a table of servers, the server table will probably have an owner field,
where you'd reference a user from the users table. (**This is only relevant for relational data**).

JSON is primarily a KV (key-value) format, for example `{"a": "b"}` where `a` is the key and `b` is the value,
but what if you want to search not by that key but by a sub-key? Well, instead of being able to quickly use `var[key]`,
which in a Python dictionary has a constant return time (for more info look up hash tables),
you now have to iterate through every object in the dictionary and compare to find what you're looking for.
Most relational database systems, like MySQL, MariaDB, and PostgreSQL have ways of indexing secondary fields
apart from the primary key so that you can easily search by multiple attributes.
[JSON (JavaScript Object Notation)](https://wikipedia.org/wiki/JSON) is commonly used for data interchange, but it's not a database solution. [SQL (Structured Query Language)](https://wikipedia.org/wiki/SQL) offers better alternatives due to the following reasons:

## Data Integrity and Validation:

JSON lacks predefined schemas and robust validation checks, which can result in inconsistent and potentially invalid data.

In contrast, SQL databases uphold data integrity through meticulously structured schemas and stringent data type constraints.

Let's illustrate how SQL effectively enforces data integrity:

**Data Type Constraints:** In SQL, you can specify the data type for each column. For example, you can define the "age" column as a small integer. This means it can only store whole numbers within a specific range, ensuring that you can't input "581 years" as someone's age.

**Check Constraints:** SQL allows you to set check constraints on columns to further refine the data that can be inserted. For the "age" column, you can add a check constraint to ensure that the age falls within a logical range, such as between 0 and 120. If someone tries to insert an age outside this range, the database will reject the entry.

Example of a check constraint in SQL:

```sql
CREATE TABLE Users (
UserID INT PRIMARY KEY,
UserName VARCHAR(50),
Age SMALLINT CHECK (Age >= 0 AND Age <= 120)
);
```

We can emphasize the significance of these SQL features by comparing them to JSON. In JSON, unless you write your own validation routine (which can be time-consuming and error-prone), you may inadvertently omit a key or set someone's age to "581 years" without any immediate feedback. Issues like these may only become apparent when your application crashes. However, in an SQL database, you define the data type for age and add a check constraint, which serves as a safeguard against such bad data, enhancing data quality and application reliability.

## Querying and Indexing:

JSON databases lack efficient querying and indexing mechanisms, making data retrieval slow.

In contrast, SQL databases excel at quick data retrieval with optimized indexing. Let's delve deeper into why SQL databases are well-suited for efficient data retrieval and how they outperform JSON databases in complex queries:

**Optimized Indexing:** SQL databases offer sophisticated indexing mechanisms that enable rapid data retrieval. By creating indexes on specific columns, SQL databases can significantly reduce the time it takes to search and retrieve data. For example, if you have a table of customer information and you frequently search for customers by their last name, you can create an index on the "last_name" column. This allows the database to quickly locate the relevant data without scanning the entire table.

**Query Complexity:** JSON, or Python's representation of it (a dictionary), indeed allows efficient access to single keys but falls short when it comes to more complex queries. For instance, consider the scenario where you need to find any user whose account is more than 1 year old. In a JSON database, you would need to write a custom function that iterates through all the users, checks their account creation date, and filters out the relevant users. This process can be time-consuming and resource-intensive, especially with a large dataset. In SQL, on the other hand, this query can be expressed as a simple one-liner:

```sql
SELECT * FROM Users WHERE DATE_DIFF(NOW(), account_creation_date, YEAR) > 1;
```

SQL databases can further optimize this query using indexing, resulting in lightning-fast results.

In summary, while JSON databases are efficient for key-based access, they struggle with complex queries and lack the indexing capabilities that SQL databases offer. SQL databases shine in scenarios where advanced querying and efficient data retrieval are essential, providing a significant performance advantage over JSON databases.

## Complex Queries:

JSON databases struggle with complex queries, lacking features like JOINs and aggregations.

In contrast, SQL databases support advanced querying, enabling complex data operations. Let's explore why SQL databases are better equipped for handling intricate queries and dive into the significance of secondary indexes:

**Complex Queries:** SQL databases offer a wide array of powerful features for complex queries. Features like JOINs, which allow you to combine data from multiple tables based on related columns, and aggregations, which enable you to summarize and analyze data, are fundamental to many data-intensive applications. For instance, you can use SQL to effortlessly retrieve data from multiple related tables, calculate sums, averages, or other statistics, and generate complex reports. This level of query complexity is challenging to achieve with JSON databases.

**Secondary Indexes:** One of the key advantages of SQL databases is their support for secondary indexes. These indexes are separate data structures that enhance query performance by allowing you to efficiently access data based on columns other than the primary key. For instance, consider a scenario where you have a large dataset of products, and you frequently need to search for products by their category. In an SQL database, you can create a secondary index on the "category" column. This index significantly accelerates queries that involve filtering or sorting products by category.

**Query Optimization:** SQL databases leverage secondary indexes to optimize query execution. When you perform a complex query involving filtering or sorting by columns with secondary indexes, the database can use these indexes to quickly locate the relevant data, resulting in faster query performance. This optimization is a game-changer when dealing with large datasets and intricate queries.

In contrast, JSON databases lack the native support for secondary indexes, which means that queries involving non-key fields can be considerably slower and less efficient. To achieve similar query performance in a JSON database, you would often need to implement custom indexing mechanisms, which can be complex and resource-intensive.

In summary, SQL databases excel in handling complex queries due to their support for features like JOINs and aggregations, and the presence of secondary indexes greatly enhances their efficiency. JSON databases, lacking these features and indexing capabilities, struggle to match the query performance and advanced querying capabilities of SQL databases.

## ACID Transactions:

JSON databases often lack proper transaction support, compromising data consistency.

In contrast, SQL databases follow ACID principles, ensuring reliable transactions even during failures. Let's explore why transaction support is crucial and what benefits it provides, particularly in scenarios where data integrity is paramount:

**Data Consistency and Reliability:** Transactions in databases ensure data consistency and reliability. A transaction represents a series of database operations (such as inserts, updates, and deletes) that are treated as a single, atomic unit. Either all the operations within a transaction are completed successfully, or none of them are. This guarantees that the database remains in a consistent state, even in the face of errors or failures.

**Example Scenario:** To illustrate the importance of transactions, consider an application that stores data in a large JSON file. Imagine this application is in the process of writing out a new version of the file when it crashes due to a power outage or other fatal issue. Without proper transaction support, you might end up with various undesirable outcomes: the old data, a half-overwritten file, a half-written half-empty file, or even an empty file. These inconsistencies can lead to data loss, corruption, and operational headaches.

**Well-Defined Behavior:** In contrast, SQL databases with transaction support offer well-defined behavior in such scenarios. When a database with ACID compliance encounters a failure during a transaction, it can automatically roll back the transaction, ensuring that no changes are permanently applied to the database. This means that in the event of a crash or error during a write operation, the database remains in a consistent state, and the previously committed data is not lost.

**Isolation and Durability:** Additionally, transactions provide isolation between concurrent operations, preventing data conflicts and ensuring that multiple users or processes can work with the database simultaneously without interfering with each other. Transactions also guarantee durability, meaning that once a transaction is committed, its changes are permanent and will survive database restarts or system crashes.

In summary, transaction support in SQL databases plays a crucial role in maintaining data consistency, reliability, and integrity, especially in scenarios where failures or errors can occur. It ensures that even in adverse conditions, such as application crashes or power losses, the database remains in a well-defined, consistent state, preventing data loss or corruption. In contrast, JSON databases often lack these fundamental transactional guarantees, making them less suitable for applications where data integrity is paramount.

## Scalability:

JSON databases face scalability challenges due to limited indexing and querying capabilities.

In contrast, SQL databases offer better scalability options, including horizontal scaling. Let's explore how horizontal scaling plays a critical role in the scalability of SQL databases and how it outperforms JSON databases:

**Horizontal Scaling in SQL Databases:** Horizontal scaling involves distributing the workload of a database across multiple servers or nodes to handle increasing data volumes and query loads. SQL databases have an advantage in this context due to their efficient memory management and indexing capabilities.

**Efficient Memory Management:** SQL databases can efficiently manage memory resources. They have the ability to keep frequently accessed indices and hot data in main memory. This feature allows database administrators to scale up the main memory size as needed, leading to substantial performance gains. As data volumes grow, being able to hold crucial data structures in memory significantly reduces the need for disk I/O, resulting in faster query execution.

**Built-In Indexing:** SQL databases excel at indexing, which is a crucial component of efficient querying and scaling. When a database scales horizontally, it can distribute data across multiple servers or nodes while maintaining indexing structures. This means that even as the database grows, the distributed indexing ensures that queries can be processed efficiently across the cluster.

**JSON Databases and Memory Management:** On the other hand, JSON databases, which are often implemented as flat files or document stores, lack built-in indexing mechanisms and efficient memory management. When dealing with large JSON documents, the primary option is to load the entire document into memory for processing. This approach becomes increasingly inefficient as data volumes expand because it consumes substantial memory resources and can lead to performance bottlenecks.

In summary, horizontal scaling in SQL databases offers superior scalability compared to JSON databases due to their efficient memory management and indexing capabilities. SQL databases can distribute data and indices across multiple servers while maintaining performance by keeping essential data structures in memory. In contrast, JSON databases lack the inherent ability to efficiently manage memory and index data, making them less suitable for handling large-scale data and complex query loads.

### Conclusion:

JSON's flexibility suits data exchange, but its shortcomings in data integrity, querying efficiency, transactions, and scalability make it unsuitable for robust databases. SQL databases, with structured schemas, powerful queries, ACID transactions, and scalability, provide better solutions for data-intensive applications. When choosing a database solution, consider your project's needs and the limitations of JSON, favoring SQL where appropriate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One point we could add: with the default naive way of writing JSON (and I'm not aware of any trivial, more efficient alternatives), you need to read in and write out the entire JSON file every time you make a change. Grossly inefficient and prone to fatal problems when your app crashes at the wrong time.


**Inefficient JSON Updates:** Another notable limitation of JSON as a data storage format is its inefficiency in updating data. With the default, naive way of writing JSON (and the lack of widely accepted, more efficient alternatives), you often need to read in and write out the entire JSON file every time you make a change. This practice can be grossly inefficient, especially for large datasets, and it becomes prone to fatal problems when your application crashes or experiences issues at the wrong time. In contrast, SQL databases allow for precise, atomic updates to specific records without the need to rewrite the entire dataset, enhancing both efficiency and data safety.

When evaluating your database needs, it's essential to take into account these limitations of JSON and recognize the benefits of SQL databases in terms of data integrity, efficiency, and reliability, particularly for data-intensive applications.