Importing this CSV from a URL only gets 13 rows, not 596 #23

Open
simonw opened this issue Nov 11, 2021 · 11 comments
Labels: bug (Something isn't working)

Comments


simonw commented Nov 11, 2021

https://raw.githubusercontent.com/okfn/dataportals.org/master/data/portals.csv

The "Open CSV from URL..." menu option only produced 13 rows - but using sqlite-utils insert portals.db portals portals.csv --csv on the command-line got all 596.

simonw added the bug label on Nov 11, 2021

simonw commented Jul 13, 2022

Still a bug against latest Datasette Desktop release.


simonw commented Jul 13, 2022

Here's how that CSV file starts:

[screenshot of the first rows of the CSV file]

And in Datasette the data cuts off here:

[screenshot of the table in Datasette, ending after 13 rows]

Which is right where the first double-newline paragraph break in that CSV file occurs.


simonw commented Jul 13, 2022

This is a datasette-app-support problem, moving the issue there.

simonw transferred this issue from simonw/datasette-app on Jul 13, 2022

simonw commented Jul 13, 2022

I'm suspicious of this code:

```python
async with httpx.AsyncClient() as client:
    async with client.stream("GET", url, follow_redirects=True) as response:
        reader = AsyncDictReader(response.aiter_lines())
```

Maybe that AsyncDictReader(response.aiter_lines()) pattern can't cope with CSV files that contain newlines inside double-quoted fields?
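Here's a minimal illustration of that suspicion (made-up data, not the real portals.csv): the csv module handles quoted newlines fine when it sees the whole document, but splitting on physical lines first hands it broken fragments:

```python
import csv
import io

# Made-up two-record CSV where the first record's "description" field
# contains a blank line inside double quotes
data = 'name,description\nExample,"First line\n\nSecond paragraph"\nOther,plain\n'

# Parsing the whole text at once: the quoted newlines are handled correctly
rows = list(csv.DictReader(io.StringIO(data)))
print(len(rows))  # 2

# Splitting into physical lines first (roughly what aiter_lines() does)
# breaks the quoted field across three separate "lines"
for line in data.splitlines(keepends=True):
    print(repr(line))
```

If the reader only ever sees one physical line at a time, the record containing the paragraph break can never be reassembled.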


simonw commented Jul 13, 2022

This code is also relevant:

```python
class AsyncDictReader:
    def __init__(self, async_line_iterator):
        self.async_line_iterator = async_line_iterator
        self.buffer = io.StringIO()
        self.reader = DictReader(self.buffer)
        self.line_num = 0

    def __aiter__(self):
        return self

    async def __anext__(self):
        if self.line_num == 0:
            # First call: copy the header line into the buffer too
            header = await self.async_line_iterator.__anext__()
            self.buffer.write(header)
        line = await self.async_line_iterator.__anext__()
        if not line:
            raise StopAsyncIteration
        self.buffer.write(line)
        self.buffer.seek(0)
        try:
            result = next(self.reader)
        except StopIteration as e:
            raise StopAsyncIteration from e
        # Clear the buffer ready for the next physical line
        self.buffer.seek(0)
        self.buffer.truncate(0)
        self.line_num = self.reader.line_num
        return result
```


simonw commented Jul 13, 2022

Here are my notes from when I wrote that AsyncDictReader class: #14 (comment)


simonw commented Jul 13, 2022

Maybe AsyncDictReader.__anext__() needs to be smart enough to watch out for unbalanced double quotes and consume another line if it spots one?
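A rough, untested sketch of that idea - it assumes escaped quotes inside fields are always doubled (""), so a record with an odd number of double-quote characters must still have a quoted field open:

```python
# Untested sketch: accumulate physical lines into one logical CSV record
async def next_record(async_line_iterator):
    record = await async_line_iterator.__anext__()
    while record.count('"') % 2 == 1:
        # Odd quote count: a quoted field is still open across the
        # line break, so pull in the next physical line as well
        record += await async_line_iterator.__anext__()
    return record
```

AsyncDictReader.__anext__() would then call something like this instead of reading from the line iterator directly.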


simonw commented Jul 13, 2022

https://github.com/MKuranowski/aiocsv may be able to handle this for me.


simonw commented Jul 13, 2022

aiocsv is designed to work with an aiofiles object that exposes a .read() coroutine - I'm not sure how best to map that onto an httpx streaming response.
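One possible bridge - untested, and assuming aiocsv only needs an object exposing an async read() coroutine that returns text - would be a small adapter over the streaming response:

```python
class AsyncTextResponseWrapper:
    """Hypothetical adapter giving an httpx streaming response an async read()."""

    def __init__(self, response):
        self._chunks = response.aiter_text()
        self._buffer = ""
        self._exhausted = False

    async def read(self, size=-1):
        # Pull chunks until we have `size` characters, or the response ends
        while not self._exhausted and (size < 0 or len(self._buffer) < size):
            try:
                self._buffer += await self._chunks.__anext__()
            except StopAsyncIteration:
                self._exhausted = True
        if size < 0:
            data, self._buffer = self._buffer, ""
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data
```

Then something like aiocsv.AsyncDictReader(AsyncTextResponseWrapper(response)) might work, though I haven't tried it.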


simonw commented Jul 14, 2022

I'm beginning to think it would be better for the app either to suck the entire CSV file into memory or to save it to a temporary file on disk, then read it into a table. Much simpler that way - this problem with newlines has made me very suspicious of importers that don't use the csv module as it was intended to be used.


simonw commented Jul 14, 2022

I'm going to go with the memory option. Datasette Desktop runs on Macs with a decent amount of RAM, and with swap.
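A minimal sketch of the in-memory approach (not the final code, just the shape of it):

```python
import csv
import io

import httpx


async def fetch_csv_rows(url):
    # Fetch the entire CSV body, then let the csv module handle quoting
    # and embedded newlines itself
    async with httpx.AsyncClient(follow_redirects=True) as client:
        response = await client.get(url)
        response.raise_for_status()
    return list(csv.DictReader(io.StringIO(response.text)))
```

The whole file ends up in memory twice (the raw text plus the parsed rows), but for files of this size that's fine.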
