Long story short
I was using this library and it was extremely slow: reading a ~100 MB JSONL file took about an hour in my environment. I want to read the file line by line and yield one line at a time so I can decode the JSON and process it, without loading everything into memory at once.
Expected behavior
I expected reading the file to take a few seconds at most.
Actual behavior
It took around an hour.
Steps to reproduce
Just run either of these functions on any JSONL file of similar size.
import json
from typing import AsyncIterable

import aiofile


async def read_jsonl_file(file_like, chunk_size: int = 4192) -> AsyncIterable[dict[str, str]]:
    """Read an uploaded JSONL file and yield each line as a dictionary."""
    async with aiofile.AIOFile(file_like) as aio_file:
        async for line in aiofile.LineReader(aio_file=aio_file, chunk_size=chunk_size):
            if line.strip():  # Skip empty lines
                yield json.loads(line)


async def read_jsonl_file_direct_interface(file_like) -> AsyncIterable[dict[str, str]]:
    """Read an uploaded JSONL file and yield each line as a dictionary."""
    async with aiofile.async_open(file_like) as aio_file:
        async for line in aio_file:
            if line.strip():  # Skip empty lines
                yield json.loads(line)
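A minimal driver to reproduce the timings could look like the sketch below; the file path, the timing code, and the helper name are assumptions of this write-up, not part of the report or the library:

import asyncio
import time


async def time_reader(reader, path: str) -> None:
    """Consume the async generator and report how long it took."""
    start = time.perf_counter()
    count = 0
    async for _record in reader(path):
        count += 1
    print(f"{reader.__name__}: {count} lines in {time.perf_counter() - start:.2f} s")


# Hypothetical usage on a local ~100 MB JSONL file:
asyncio.run(time_reader(read_jsonl_file, "data.jsonl"))
asyncio.run(time_reader(read_jsonl_file_direct_interface, "data.jsonl"))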
Suggested fix
I looked through the code and found a problematic pattern repeated in several places. The bad pattern is roughly (a simplified sketch follows the list):
Do a system call to read a chunk.
If the chunk contains no separator, append it to the buffer and continue reading.
Otherwise, return the line and keep the remainder in the buffer.
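To make the problem concrete, here is a deliberately simplified sketch of that control flow; it is an illustration only, not the library's actual code, and the helper names are made up:

async def naive_readline(self):
    # Illustration of the pattern above: every call starts with a read,
    # even when an earlier chunk already left complete lines in the buffer.
    while True:
        chunk = await self._read_chunk()   # hypothetical helper: one syscall per call
        if not chunk:
            return self._drain_buffer()    # hypothetical helper: return whatever is left
        if self.linesep not in chunk:
            self._buffer += chunk          # no separator in this chunk: buffer it and read again
            continue
        data = self._buffer + chunk
        line, _, tail = data.partition(self.linesep)
        self._buffer = tail                # only the first line is returned; any
        return line + self.linesep         # further complete lines stay in the tail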
This pattern overlooks the fact that a single chunk can contain more than one separator, so complete lines pile up in the buffer while the reader keeps issuing new reads. For testing I changed the LineReader implementation:
from io import BytesIO, StringIO
from typing import cast


async def fixed_readline(self) -> str | bytes:
    self._buffer = cast(StringIO | BytesIO, self._buffer)
    while True:
        # First, try to serve a complete line straight from the buffer
        self._buffer.seek(0)
        line = self._buffer.readline()
        if line and line.endswith(self.linesep):
            tail = self._buffer.read()
            self._buffer.seek(0)
            self._buffer.truncate(0)
            self._buffer.write(tail)
            return line
        # No complete line in the buffer, read more data
        chunk = await self._LineReader__reader.read_chunk()
        if not chunk:
            # No more data to read, return any remaining content in the buffer
            self._buffer.seek(0)
            remaining_content = self._buffer.read()
            # Clear the buffer so we don't return the same content again or leak memory
            self._buffer.truncate(0)
            return remaining_content
        # We have more data to read, write it to the buffer and handle it in the next iteration
        self._buffer.seek(0, 2)  # Seek to the end of the buffer
        self._buffer.write(chunk)
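For the benchmark below I assume the replacement is wired in by monkey-patching the method onto the class; this is a quick hack for testing, not a proper patch to the library:

import aiofile

# Assumption on my part: LineReader.readline is the coroutine being replaced,
# so overriding it on the class makes code that iterates via LineReader use the fix.
aiofile.LineReader.readline = fixed_readline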
I ran a speed test on my file, and this fix improves the performance considerably:
Reading the file without async:
no_async 0.39684295654296875 seconds
Using the aiofiles library:
aiofiles 2.119969367980957 seconds
Either of the functions above, with the fix applied:
manual_interface 0.5424532890319824 seconds
direct_interface 0.5431270599365234 seconds
The crux is that the fixed code does not always read more data into memory; it only issues a read when the buffer holds no complete line, so it makes far fewer system calls. Always reading more data also becomes progressively worse as the unconsumed tail keeps growing, because each call has to work through a larger and larger buffer.
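As a rough back-of-envelope (the numbers here are assumptions for illustration, not measurements): with roughly one million ~100-byte lines in a 100 MB file, a pattern that grows and reprocesses the buffered tail on every call ends up touching on the order of tens of terabytes of memory, while a single linear pass touches the 100 MB only once:

# Purely illustrative arithmetic with assumed numbers, not measured values.
n_lines = 1_000_000            # assume ~100-byte lines in a ~100 MB file
file_size = 100 * 1024 * 1024  # bytes

# Fixed pattern: roughly one pass over the file.
linear_work = file_size

# Growing-tail pattern: each of the n_lines calls reworks a tail that averages
# around half the file once it has grown, so the cost is roughly quadratic.
quadratic_work = n_lines * (file_size // 2)

print(f"linear pass:  ~{linear_work / 1e6:.0f} MB touched")
print(f"growing tail: ~{quadratic_work / 1e12:.0f} TB touched")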
I hope that this new pattern gets adopted as it really makes the library usable in modern async Python environments.