Import CSV files using existing APIs #260
base: main
Conversation
@@ -43,6 +43,7 @@
    "lodash": "^4.17.21",
    "maplibre-gl": "^4.4.1",
    "next": "^14.2.7",
    "papaparse": "^5.5.1",
H: The build is failing because the types for papaparse come from a separate package -- running `npm i -D @types/papaparse` or adding `@types/papaparse` to dev dependencies should resolve this
👍
Overall looking really good and working nicely! One moderate refactor and a couple of non-blocking comments, but then I think this is good to merge.
setProgress(rowCursor + assignments.length);
rowCursor += ROWS_PER_BATCH;
if (rowCursor > results.data.length) {
  setMapLinks([...mapLinks, {document_id, name: file.name}]);
PP: On complete, it could be nice to add the upload to the user's recent maps, something like:

const upsertUserMap = useMapStore(state => state.upsertUserMap)
...
upsertUserMap({
  mapDocument: response,
})
};

return (
  <div className="h-screen w-screen flex items-center justify-center bg-gray-100">
Medium (?): Use Radix UI Themes components (`<Flex>`, `<Heading>`, `<Box>`, etc.) when possible to ensure styling consistency
+1
name: string;
};

export default function Uploader() {
Q: Should this function be directly available in the dropdown menu on the map page?
I think so. And the uploader is in a modal?
Agreed!
Let's discuss this.
The approach of uploading blocks as assignments and letting the FE un-break the entire document in serial poses a number of challenges:
1. Very high DB load

(screenshot of database load omitted)
This DB load is from a single block map being loaded for Kansas (an export from Districtr itself).

2. Slow user experience

(screen recording omitted)
I think that writing an import endpoint would be much more performant and support many more concurrent loads.
@raphaellaude Agreed on all points for block-level imports, which are the core use case. After a little more thought, we may want to wait until we have fuller functionality to merge -- supporting only VTD imports may be confusing.
Agreed! Maybe it would be nice to have it in a modal on the map page too, and then directly load the assignments you uploaded.
It looks like it's possible to use psycopg to COPY from a CSV string.
Yeah, that sounds good for loading the data. Once it's in Postgres, I think something like the following pseudo-query should then allow us to unbreak the blocks:

with edge_assignments as (
    select
        doc.document_id,
        edges.parent_path,
        edges.child_path,
        doc.zone
    from parentchildedges edges
    left join document doc
        on doc.geoid = edges.child_path
),
unbroken_parents as (
    select
        e.parent_path as path,
        e.zone
    from edge_assignments e
    left join (
        select parent_path, count(*) as parent_count
        from edge_assignments
        group by parent_path
    ) pc using (parent_path)
    group by e.parent_path, e.zone, pc.parent_count
    having count(*) = pc.parent_count
)
select
    '{document id}' as document_id,
    unbroken_parents.path,
    unbroken_parents.zone
from unbroken_parents
union all
select
    edges.document_id,
    edges.child_path as path,
    edges.zone
from edge_assignments edges
where edges.parent_path not in (
    select path from unbroken_parents
)
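The collapse logic in the pseudo-query above can also be sketched in plain Python, which may help when writing tests for it. This is a rough sketch, not the actual implementation; the `edges` and `assignments` shapes are hypothetical stand-ins for the `parentchildedges` and document assignment tables.

```python
from collections import defaultdict

def unbreak_assignments(edges, assignments):
    """Collapse child-level assignments up to their parent wherever every
    child of that parent is assigned to the same zone.

    edges: list of (parent_path, child_path) pairs
    assignments: dict mapping child_path -> zone
    Returns a dict mapping geoid -> zone, using the parent geoid where the
    parent is "unbroken" and child geoids otherwise.
    """
    children_by_parent = defaultdict(list)
    for parent, child in edges:
        children_by_parent[parent].append(child)

    result = {}
    for parent, children in children_by_parent.items():
        zones = {assignments.get(child) for child in children}
        if len(zones) == 1 and None not in zones:
            # Every child is assigned to the same zone: keep the parent only.
            result[parent] = zones.pop()
        else:
            # Mixed or missing zones: keep the child-level assignments.
            for child in children:
                if child in assignments:
                    result[child] = assignments[child]
    return result
```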
backend/app/main.py (outdated)
    geo_id TEXT,
    zone INT
)"""))
session.connection().connection.cursor().copy_expert(
H: This method does not appear to exist on cursor
2025-02-10T16:42:02.894 app[48e236dc3e9e58] ewr [info] File "/app/app/main.py", line 285, in upload_assignments
2025-02-10T16:42:02.894 app[48e236dc3e9e58] ewr [info] session.connection().connection.cursor().copy_expert(
2025-02-10T16:42:02.894 app[48e236dc3e9e58] ewr [info] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-10T16:42:02.894 app[48e236dc3e9e58] ewr [info] AttributeError: 'Cursor' object has no attribute 'copy_expert'
This is maybe a psycopg 2 / 3 issue, as I did end up with a new table? I will try to get a blocks import -> shattered map example working, use a TEMP TABLE instead of this current system, and then figure out what function belongs here
Ah, yes I think you're right. My local build (via docker-compose) and the dev preview should be on psycopg3
Does this latest commit work? I think it's the right way for psycopg3; it looks like it isn't a bulk upload, but it's the same concept of a single COPY command.
References: https://www.psycopg.org/psycopg3/docs/basic/copy.html
https://www.psycopg.org/articles/2020/11/15/psycopg3-copy/
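The API difference behind the error above: psycopg2 exposed `cursor.copy_expert(sql, file_obj)`, while psycopg3 replaces it with the `cursor.copy(sql)` context manager and `write_row`. A sketch of the psycopg3-style loop, factored so the row-writing logic is testable without a database (`write_assignments` is a hypothetical helper name; the `temploader` table name comes from this PR):

```python
def write_assignments(copy, rows):
    """Stream (geo_id, zone) rows through a psycopg3 COPY context.

    `copy` is the object yielded by
    `cursor.copy("COPY temploader (geo_id, zone) FROM STDIN")`;
    any object with a write_row method works, which keeps this
    unit-testable with a stub. Returns the number of rows written.
    """
    count = 0
    for geo_id, zone in rows:
        copy.write_row((geo_id, zone))
        count += 1
    return count
```

In the endpoint this would be called as `with cursor.copy(...) as copy: write_assignments(copy, rows)`.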
This does work to populate the table with block assignments -- both locally and on the preview deploy!
backend/app/main.py (outdated)
# find items to unshatter
results = session.execute(text("""
    SELECT DISTINCT(SUBSTR(geo_id, 1, 11)) AS vtd, zone
👍 that makes sense. I also realized that I was doing this in reverse (un-shattering blocks, when on a fresh map I need to be shattering the VTDs which need block-level resolution)
Looking good! I think we need a few extra pieces before this is ready for prod:
- Block geoids are not validated. All the residual blocks in the temp table (after parented blocks are removed and replaced by VTDs) are assumed to be valid. We should not assume that is the case.
- Input assumes the CSV has at least two columns representing the geoids and zone. Invalid inputs are not handled by the endpoint since the data object is loosely typed as `list[list[str]]`.
- Backend tests are needed for the new endpoint, including different failure states for bad inputs. The backend should be a better friend to the client, providing informative error messages rather than 500s.
- Return value is misleading.

Otherwise, I still have a preference for the single-query insert, but this approach will work.
@@ -268,6 +269,91 @@ async def reset_map(document_id: str, session: Session = Depends(get_session)):
    return {"message": "Assignments partition reset", "document_id": document_id}


@app.patch("/api/upload_assignments")
PP: Slight preference for the more RESTful `@app.patch("/api/update_assignments/{document_id}/upload")`, which matches our style elsewhere.
name: string;
};

export default function Uploader() {
I think so. And the uploader is in a modal?
setProgress(0);
setTotalRows(results.data.length);

createMapDocument({
I: If we're tying uploads directly to document creation, then the endpoint should probably create the document and populate it in one step. The current API implies you can upload assignments at any time, which is really not the case without potentially erroring on conflicting geoids.
+1
};

return (
  <div className="h-screen w-screen flex items-center justify-center bg-gray-100">
+1
session.commit()

return {"assignments_upserted": 1}
H: I'd think we want a more informative return value. It's a bit misleading to say only a single assignment was inserted (also we're not upserting).
+1
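A more informative payload could report what actually happened. A sketch only; all field names here are suggestions, not an agreed contract (`build_upload_response` is a hypothetical helper):

```python
def build_upload_response(document_id, inserted_rows, collapsed_parents, skipped_rows):
    """Summarize the upload instead of returning a hard-coded
    {"assignments_upserted": 1}. Field names are illustrative."""
    return {
        "document_id": document_id,
        "assignments_inserted": inserted_rows,
        "parents_collapsed": collapsed_parents,
        "rows_skipped": skipped_rows,
    }
```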
vtds = []
with cursor.copy("COPY temploader (geo_id, zone) FROM STDIN") as copy:
    for row in results:
        vtd, zone = row
        vtds.append(vtd)
        copy.write_row([vtd, zone])
logger.info("uniform parents")
logger.info(vtds)
I: Once we've loaded the assignments in via the temp table, I don't love performing another copy. Did you try adapting the query I previously suggested which performs all steps in a single query?
@@ -268,6 +269,91 @@ async def reset_map(document_id: str, session: Session = Depends(get_session)):
    return {"message": "Assignments partition reset", "document_id": document_id}


@app.patch("/api/upload_assignments")
async def upload_assignments(
H: Please add tests
with cursor.copy("COPY temploader (geo_id, zone) FROM STDIN") as copy:
    for record in csv_rows:
        if record[1] == "":
            copy.write_row([record[0], None])
        else:
            copy.write_row([record[0], int(record[1])])
H: add handling for bad inputs / types.
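One way to handle bad inputs: validate each row before it reaches COPY and raise an error the endpoint can turn into a 4xx instead of a 500. A sketch under the assumption that rows arrive as `list[str]` (`parse_assignment_row` is a hypothetical helper name):

```python
def parse_assignment_row(record):
    """Validate one CSV row before it is written via COPY.

    Returns (geo_id, zone), where zone is None for blank zones (which are
    ignored downstream). Raises ValueError with a client-friendly message
    for malformed rows.
    """
    if len(record) < 2:
        raise ValueError(
            f"Expected at least two columns (geo_id, zone), got {len(record)}"
        )
    geo_id, zone = record[0].strip(), record[1].strip()
    if not geo_id:
        raise ValueError("Missing geo_id")
    if zone == "":
        return geo_id, None  # blank zone: unassigned
    try:
        return geo_id, int(zone)
    except ValueError:
        raise ValueError(f"Zone must be an integer, got {zone!r}") from None
```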
# insert into actual assignments table
session.execute(text("""
    INSERT INTO document.assignments (geo_id, zone, document_id)
H: Loading the rest of the assignments this way assumes that all the geo_ids are valid. I think at some stage we need to join the blocks against the parent_child_edges in order to determine that they are valid.
PP: When bulk loading, specifying the partition in the table name (`INSERT INTO document.assignments_{document_id} ...`) will give a slight performance boost over inserting without the partition specified.
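Since the partition name has to be built by string formatting, the `document_id` ends up inside an identifier and must be validated strictly first. A sketch; the partition naming scheme and the SELECT body are assumptions about this codebase, not confirmed details:

```python
import re

def partitioned_insert_sql(document_id):
    """Build an INSERT that targets the per-document partition directly.

    document_id becomes part of a SQL identifier, so reject anything that
    is not a plain alphanumeric/underscore token: never interpolate
    untrusted input into SQL.
    """
    if not re.fullmatch(r"[0-9a-zA-Z_]+", document_id):
        raise ValueError(f"Unsafe document_id: {document_id!r}")
    return (
        f"INSERT INTO document.assignments_{document_id} (geo_id, zone, document_id) "
        f"SELECT geo_id, zone, '{document_id}' FROM temploader"
    )
```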
H: About geoid validation, should we drop invalid/old geoids? What if someone uploads 2010 geoids?
Thank you for all your work! This is working pretty nicely now! +1 to @raphaellaude's suggestions, and I added a few more, but this is moving in a great direction!
name: string;
};

export default function Uploader() {
Agreed!
const processFile = (file: File) => {
  if (!file) {
    throw new Error('No file selected');
    return;
H: At least some file type validation would be good -- I accidentally dropped in a parquet and only got an internal server error response.
Additionally, some simple field / geoid validation would be good -- e.g. if there are more than two columns, should we ask the user? Checking that GEOIDs all start with the expected state FIPS code?
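Those pre-flight checks are cheap to express. A sketch in Python (the uploader itself is TypeScript, so this is illustrative only; function names are hypothetical, and "20" is the state FIPS code for Kansas, used in the earlier example):

```python
def validate_upload(filename, header_row):
    """Pre-flight checks before parsing: reject non-CSV files and flag
    unexpected column counts. Returns a list of problems; empty means
    the file looks plausible."""
    problems = []
    if not filename.lower().endswith(".csv"):
        problems.append(f"Expected a .csv file, got {filename!r}")
    if len(header_row) != 2:
        problems.append(
            f"Expected 2 columns (geoid, zone), found {len(header_row)}"
        )
    return problems

def geoids_match_state(geo_ids, state_fips):
    """Check every GEOID starts with the expected two-digit state FIPS code."""
    return all(g.startswith(state_fips) for g in geo_ids)
```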
setProgress(0);
setTotalRows(results.data.length);

createMapDocument({
+1
# insert into actual assignments table
session.execute(text("""
    INSERT INTO document.assignments (geo_id, zone, document_id)
H: About geoid validation, should we drop invalid/old geoids? What if someone uploads 2010 geoids?
session.commit()

return {"assignments_upserted": 1}
+1
};

export default function Uploader() {
  const [progress, setProgress] = useState<number>(0);
Q: The progress bar doesn't seem to work in the current implementation, although the response time to process the query isn't very long. Do we need a progress bar?
P.s. For code style, I think this also needs the pre-commit hook installed/run!
As discussed in #254, this PR adds a page at /uploader where a user can select a CSV file, or drop a CSV file into the browser window, and create a map based on an existing gerrydb table. Since it currently contains only frontend changes, it works as a separate PR. Blank or non-numeric zones are ignored. Big CSVs are uploaded at 2,000 rows per batch.
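The 2,000-rows-per-batch upload described above can be sketched as a simple chunking generator (a sketch in Python; the frontend does this with Papa Parse results and `ROWS_PER_BATCH`):

```python
ROWS_PER_BATCH = 2000  # matches the batch size used by the uploader

def batch_rows(rows, batch_size=ROWS_PER_BATCH):
    """Yield successive slices of `rows` so large CSVs can be uploaded in
    chunks, with the UI updating progress after each batch."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]
```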
Potential issues