Json decoder error when reading DB from multiple processes #537

Open
Shubhranshu153 opened this issue Sep 15, 2023 · 3 comments

@Shubhranshu153

I was doing some testing with TinyDB. It's awesome software. I found that it's not recommended for multi-process use cases like Flask, etc., but since I only had parallel read operations, I thought it would work. However, when multiple processes try to read the DB, I get a JSON error:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "ptyhon_db_test_2.py", line 15, in search_data
    result = db.search(Query()['workflowId'] == 'm9imoy9b')
  File "/usr/local/lib/python3.8/dist-packages/tinydb/table.py", line 254, in search
    for doc_id, doc in self._read_table().items()
  File "/usr/local/lib/python3.8/dist-packages/tinydb/table.py", line 704, in _read_table
    tables = self._storage.read()
  File "/usr/local/lib/python3.8/dist-packages/tinydb/storages.py", line 136, in read
    return json.load(self._handle)
  File "/usr/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

To reproduce it (you'll need to adapt the DB JSON and the search query):

import multiprocessing
import time
from tinydb import TinyDB, Query

# Create a TinyDB database
db = TinyDB('my_database.json')

# Define the ID you want to search for
search_id = 'm9imoy9b'  # Replace with your desired ID

# Define a function to perform the search operation
def search_data():
    while True:
        # Perform a search for the specific ID
        result = db.search(Query()['workflowId'] == search_id)
        print(f"Search result for ID {search_id}")

# Define the number of processes you want for simultaneous searches
num_processes = 100

# Create and start processes for searching
processes = []
for _ in range(num_processes):
    process = multiprocessing.Process(target=search_data)
    process.start()
    processes.append(process)

try:
    # Keep the processes running in the background
    for process in processes:
        process.join()

except KeyboardInterrupt:
    # Terminate the processes gracefully on Ctrl+C
    for process in processes:
        process.terminate()

# Close the database
db.close()
@SpiralAPI

I have been experiencing the same thing. For me, it seems to clone the JSON file's contents at the end of the file (basically as if you copied everything and then pasted it again at the end of the JSON file).

@MrPigss
Contributor

MrPigss commented Oct 3, 2023

The problem is that you don't use locks while reading and writing. Because of this, it's possible that one thread or process does a 'seek' operation just before another thread or process wants to read.

For example, below is the code for a read using the JSONStorage.

def read(self) -> Optional[Dict[str, Dict[str, Any]]]:
    # Get the file size by moving the cursor to the file end and reading
    # its location
    self._handle.seek(0, os.SEEK_END)
    size = self._handle.tell()

    if not size:
        # File is empty, so we return ``None`` so TinyDB can properly
        # initialize the database
        return None
    else:
        # Return the cursor to the beginning of the file
        self._handle.seek(0)

        # Load the JSON contents of the file
        return json.load(self._handle)
1. Process one starts a read.
   1.1 Process one does self._handle.seek(0, os.SEEK_END); the cursor is now at the end of the file.
   1.2 Process one does self._handle.seek(0); the cursor is now at the beginning of the file.
2. Process two starts a read.
   2.1 Process two does self._handle.seek(0, os.SEEK_END); the cursor is now at the end of the file.
3. Process one does json.load(self._handle) -> it reads from the end of the file -> the file seems empty.

Process one set the cursor to the beginning, but process two moved it to the end just before process one wanted to read, resulting in an empty string. This is one way things can go wrong, but you can imagine that there are a lot of ways this can fail. Like @SpiralAPI saw, there might be a process that does a self._handle.seek(0, os.SEEK_END) just before a write, resulting in all the data being appended instead of overwritten.

It's not very useful to do a search using multiple processes or threads over a single file. Since you would need to use locks every time you read or write, you would basically turn it into a synchronous operation. A minimal sketch of that locking pattern is below.
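For illustration only, here's a rough sketch of what that would look like: a shared multiprocessing.Lock created before the workers start and passed to each one (the file name and query are placeholders from the example above; TinyDB does not do any of this for you):

import multiprocessing

from tinydb import TinyDB, Query

def search_data(lock):
    # Each worker opens its own handle, and the shared lock ensures only
    # one process touches the file at a time -- the reads are serialized.
    db = TinyDB('my_database.json')
    with lock:
        result = db.search(Query()['workflowId'] == 'm9imoy9b')
    db.close()
    print(f"Found {len(result)} document(s)")

if __name__ == '__main__':
    lock = multiprocessing.Lock()
    processes = [multiprocessing.Process(target=search_data, args=(lock,))
                 for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()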

If you need to do a CPU-intensive task, it's better to read everything ahead of time (or at least in chunks) and then pass the data to the different processes or threads, as sketched below.
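As a rough sketch of that approach (process_doc is a hypothetical stand-in for whatever the actual CPU-bound work is):

import multiprocessing

from tinydb import TinyDB

def process_doc(doc):
    # Hypothetical placeholder for the CPU-intensive work; it operates
    # on plain dicts, never on the database itself.
    return doc.get('workflowId')

if __name__ == '__main__':
    # Read everything once, in the parent process...
    db = TinyDB('my_database.json')
    docs = [dict(doc) for doc in db.all()]  # plain dicts pickle cleanly
    db.close()

    # ...then fan the data out to a process pool.
    with multiprocessing.Pool() as pool:
        results = pool.map(process_doc, docs)
    print(results)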

@msiemens
Owner

msiemens commented Oct 5, 2024

> But since I only had parallel read operations, I thought it would work. However, when multiple processes try to read the DB, I get a JSON error

I tried to execute your code (and had to make some small changes), but when I got it working, I didn't get any of the errors you had. As @MrPigss mentioned, your error seems to indicate that you had some process doing both reads and writes from multiple threads within the same program. With multiprocessing, reading should work as long as each instance has its own file handle to the database file.


For reference, here's the code I used:

import multiprocessing

from tinydb import TinyDB, Query

# Define the value you want to search for
search_id = [{"name": "bar1"}]  # Replace with your desired value


# Define a function to perform the search operation
def search_data():
    # Create a TinyDB database
    db = TinyDB('test.json')

    while True:
        # Perform a search for the specific ID
        result = db.search(Query()['foo'] == search_id)
        print(f"Search result for ID {search_id}")

    # Close the database
    db.close()

if __name__ == '__main__':
    # Define the number of processes you want for simultaneous searches
    num_processes = 100

    # Create and start processes for searching
    processes = []
    for _ in range(num_processes):
        process = multiprocessing.Process(target=search_data)
        process.start()
        processes.append(process)

    try:
        # Keep the processes running in the background
        for process in processes:
            process.join()

    except KeyboardInterrupt:
        # Terminate the processes gracefully on Ctrl+C
        for process in processes:
            process.terminate()

And this is the test.json content:

{"_default": {"1": {"foo": [{"name": "bar1"}]}, "2": {"foo": [{"name": "bar2"}]}, "3": {"foo": [{"name": "bar3"}]}, "4": {"foo": [{"name": "bar4"}]}, "5": {"foo": [{"name": "bar5"}]}, "6": {"foo": [{"name": "bar6"}]}}}
