
Performance issue and suggested fix #92

Closed
HaukurPall opened this issue Jun 27, 2024 · 2 comments · Fixed by #93

Comments

@HaukurPall
Contributor

HaukurPall commented Jun 27, 2024

Long story short

I was using this library and it was extremely slow: reading a ~100 MB JSONL file took about an hour in my environment. I want to read the file line by line and yield one line at a time so I can decode the JSON and process it, without loading everything into memory at once.

Expected behavior

I expected this read to take no more than a few seconds.

Actual behavior

It took around an hour.

Steps to reproduce

Just run either of these functions on any JSONL file of similar size.

import json
from collections.abc import AsyncIterable

import aiofile


async def read_jsonl_file(file_like, chunk_size: int = 4192) -> AsyncIterable[dict[str, str]]:
    """Read an uploaded JSONL file and yield each line as a dictionary."""
    async with aiofile.AIOFile(file_like) as aio_file:
        async for line in aiofile.LineReader(aio_file=aio_file, chunk_size=chunk_size):
            if line.strip():  # Skip empty lines
                yield json.loads(line)


async def read_jsonl_file_direct_interface(file_like) -> AsyncIterable[dict[str, str]]:
    """Read an uploaded JSONL file and yield each line as a dictionary."""
    async with aiofile.async_open(file_like) as aio_file:
        async for line in aio_file:
            if line.strip():  # Skip empty lines
                yield json.loads(line)
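
For reference, a minimal driver for either generator looks like this (assuming file_like is the path to a local file; the name data.jsonl is just a placeholder):

import asyncio

async def main() -> None:
    count = 0
    async for record in read_jsonl_file("data.jsonl"):
        count += 1
    print(f"decoded {count} records")

asyncio.run(main())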

Suggested fix

I looked through the code and found a bad pattern repeated in several places. The pattern is roughly (a simplified sketch follows the list):

  1. Do a system call to read a chunk.
  2. If the chunk contains no separator, append it to a buffer and continue.
  3. Otherwise, return the line and keep the remainder in the buffer.
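
For illustration, this is roughly what that looks like; it is a simplified sketch of the pattern, not the library's actual code:

async def naive_readline(reader, buffer: bytearray, linesep: bytes = b"\n") -> bytes:
    """Sketch of the pattern above: every call starts with a fresh read,
    even when the buffer already holds complete lines from the last chunk."""
    while True:
        chunk = await reader.read_chunk()   # one system call per readline()
        if not chunk:                       # EOF: return whatever is left
            line = bytes(buffer)
            buffer.clear()
            return line
        buffer.extend(chunk)
        if linesep in buffer:               # split off one line, keep the tail
            line, _, rest = bytes(buffer).partition(linesep)
            buffer[:] = rest
            return line + linesep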

This pattern overlooks the fact that a chunk may contain more than one separator, so readline keeps issuing new reads even when the buffer already holds complete lines. For testing I replaced the LineReader implementation with this:

from io import BytesIO, StringIO
from typing import cast


async def fixed_readline(self) -> str | bytes:
    self._buffer = cast(StringIO | BytesIO, self._buffer)
    while True:
        self._buffer.seek(0)
        line = self._buffer.readline()
        if line and line.endswith(self.linesep):
            tail = self._buffer.read()
            self._buffer.seek(0)
            self._buffer.truncate(0)
            self._buffer.write(tail)
            return line
        # No line in buffer, read more data
        chunk = await self._LineReader__reader.read_chunk()
        if not chunk:
            # No more data to read, return any remaining content in the buffer
            self._buffer.seek(0)
            remaining_content = self._buffer.read()
            # Clear the buffer so we don't return the same content again or leak memory
            self._buffer.truncate(0)
            return remaining_content
        # We have more data to read, write it to the buffer and handle it in the next iteration
        self._buffer.seek(0, 2)  # Seek to the end of the buffer
        self._buffer.write(chunk)

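The name-mangled self._LineReader__reader access is there because the function lives outside the class; one way to try the change without touching the installed package is to monkey-patch it in (this is only a quick-test sketch, not a proper patch, and whether the async_open iteration path also goes through LineReader depends on the library internals):

import aiofile

# swap the readline implementation in for a quick test
aiofile.LineReader.readline = fixed_readline
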
I did a speed test on my file and this fix really improves the performance:

Reading the file without async:
no_async 0.39684295654296875 seconds

Using the aiofiles library:
aiofiles 2.119969367980957 seconds

Both of the functions above, with the fix applied:
manual_interface 0.5424532890319824 seconds
direct_interface 0.5431270599365234 seconds
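
A wall-clock measurement along these lines is enough to reproduce the comparison (the file path and label are placeholders):

import asyncio
import time

async def measure(label: str, agen) -> None:
    start = time.perf_counter()
    async for _ in agen:
        pass
    print(label, time.perf_counter() - start, "seconds")

asyncio.run(measure("direct_interface", read_jsonl_file_direct_interface("big.jsonl")))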

The crux is that the fixed code does not always read more data into memory, so it makes far fewer system calls. Always reading more data also becomes progressively worse as the tail kept in the buffer grows.

I hope that this new pattern gets adopted as it really makes the library usable in modern async Python environments.

@gnzsnz

gnzsnz commented Oct 7, 2024

I couldn't find any reason why this PR has not been merged, or was it discussed somewhere?

I'm facing a similar issue: when I use aiofile, my code is 5 to 6 times slower than the synchronous version.

Is there anything I can help with to move this forward?

@HaukurPall
Contributor Author

There is nothing pending on my side as far as I know.
