Import CSV files using existing APIs #260

Open · wants to merge 13 commits into main
Conversation

@mapmeld (Member) commented Jan 23, 2025

As discussed in #254, this PR adds a page at /uploader where a user can select a CSV file, or drop one into the browser window, and create a map based on an existing gerrydb table. Since it currently contains only frontend changes, it works as a separate PR.

Blank or non-numeric zones are ignored. Large CSVs are uploaded in batches of 2,000 rows.

Potential issues

  • supporting and validating more CSV types (TSV, extra headers)
  • possible to check on server or client that the first geoids have matches in the selected gerrydb table?
  • moving finished map links to a sidebar, allowing multiple CSV uploads
  • future auth integration

@@ -43,6 +43,7 @@
"lodash": "^4.17.21",
"maplibre-gl": "^4.4.1",
"next": "^14.2.7",
"papaparse": "^5.5.1",
Collaborator:

H: The build is failing because the types for papaparse come from a separate package -- running npm i -D @types/papaparse or adding @types/papaparse to dev dependencies should resolve this

mapmeld (Member Author):

👍

@mapmeld (Member Author) commented Feb 3, 2025

OK, I did some testing and think this is ready for review.

  • Allows user to upload multiple files (one at a time)
  • Links appear in the right column, with the original CSV's filename
  • Papaparse handles TSV properly
  • Unusual rows (where the right column is blank or not a numeric value) are skipped
[Screenshot 2025-02-02 at 9:40:59 PM]

@nofurtherinformation (Collaborator) left a comment:

Overall this is looking really good and working nicely! One moderate refactor and a couple of non-blocking comments, but then I think this is good to merge.

setProgress(rowCursor + assignments.length);
rowCursor += ROWS_PER_BATCH;
if (rowCursor > results.data.length) {
setMapLinks([...mapLinks, {document_id, name: file.name}]);
Collaborator:

PP: On complete, it could be nice to add the upload to the user's recent maps, something like:

const upsertUserMap = useMapStore(state => state.upsertUserMap)
...
 upsertUserMap({
   mapDocument: response,
 })

};

return (
<div className="h-screen w-screen flex items-center justify-center bg-gray-100">
Collaborator:

Medium (?): Use Radix UI Themes components (<Flex>, <Heading>, <Box>, etc.) when possible to ensure styling consistency

Collaborator:

+1

name: string;
};

export default function Uploader() {
Collaborator:

Q: Should this function be directly available in the dropdown menu on the map page?

Collaborator:

I think so. And the uploader is in a modal?

Collaborator:

Agreed!

@raphaellaude (Collaborator) left a comment:

Let's discuss this.

The approach of uploading blocks as assignments and letting the FE un-break the entire document in serial poses a number of challenges:

1. Very high DB load

[screenshot: DB load chart]

This db load is from a single block map being loaded for Kansas (an export from Districtr itself).

2. Slow user experience

[screenshot]

I think that writing an import endpoint would be much more performant and support many more concurrent loads.

@nofurtherinformation (Collaborator):

@raphaellaude Agreed on all points for block-level imports, which are the core use case.

After a little more thought, we may want to wait until we have fuller functionality before merging -- supporting only VTD imports may be confusing.

@raphaellaude (Collaborator):

Oh and we should definitely link /uploader to the map page, right? Maybe under a new menu group in select map

[screenshot]

@nofurtherinformation (Collaborator):

> Oh and we should definitely link /uploader to the map page, right? Maybe under a new menu group in select map

Agreed! Maybe nice to have it in a modal on the map page too, and then directly load the assignments you uploaded

@mapmeld (Member Author) commented Feb 4, 2025

It looks like it's possible to use psycopg to COPY from a CSV string (with cur.copy("COPY tmp_data(id, bla) FROM stdin (format csv, delimiter '|', quote '\"')"), as in psycopg/psycopg#139).
The frontend here could validate the file and then pass it to the server in a standardized format.
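A minimal sketch of that approach with psycopg3 (the connection string, table, and column names here are illustrative, not from this PR):

import psycopg

# Raw CSV text as received from the frontend, header row included.
csv_text = "geo_id,zone\n200459701001001,1\n200459701001002,2\n"

with psycopg.connect("dbname=districtr") as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE TEMP TABLE tmp_data (geo_id TEXT, zone INT)")
        # psycopg3's copy() context manager accepts raw CSV via copy.write()
        with cur.copy(
            "COPY tmp_data (geo_id, zone) FROM STDIN (FORMAT csv, HEADER true)"
        ) as copy:
            copy.write(csv_text)
    # the connection context manager commits on clean exit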

@raphaellaude (Collaborator):

> It looks like it's possible to use psycopg to COPY from a CSV string (with cur.copy("COPY tmp_data(id, bla) FROM stdin (format csv, delimiter '|', quote '\"')"), as in psycopg/psycopg#139).
> The frontend here could validate the file and then pass it to the server in a standardized format.

Yeah, that sounds good for loading the data. Once it's in postgres, I think something like the following pseudo-query should then allow us to unbreak the blocks:

with edge_assignments as (
  select
    doc.document_id,
    edges.parent_path,
    edges.child_path,
    doc.zone
  from parentchildedges edges
  left join document doc
    on doc.geoid = edges.child_path ),
unbroken_parents as (
  select
    e.parent_path as path,
    e.zone
  from edge_assignments e
  left join (
    select parent_path, count(*) as parent_count
    from edge_assignments
    group by parent_path ) pc
  using (parent_path)
  group by e.parent_path, e.zone, pc.parent_count
  having count(*) = pc.parent_count )
select
  '{document id}' as document_id,
  unbroken_parents.path,
  unbroken_parents.zone
from
  unbroken_parents
union all
select
  edges.document_id,
  edges.child_path as path,
  edges.zone
from
  edge_assignments edges
where edges.parent_path not in (
  select parent_path
  from unbroken_parents )

geo_id TEXT,
zone INT
)"""))
session.connection().connection.cursor().copy_expert(
Collaborator:

H: This method does not appear to exist on cursor

2025-02-10T16:42:02.894 app[48e236dc3e9e58] ewr [info] File "/app/app/main.py", line 285, in upload_assignments

2025-02-10T16:42:02.894 app[48e236dc3e9e58] ewr [info] session.connection().connection.cursor().copy_expert(

2025-02-10T16:42:02.894 app[48e236dc3e9e58] ewr [info] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

2025-02-10T16:42:02.894 app[48e236dc3e9e58] ewr [info] AttributeError: 'Cursor' object has no attribute 'copy_expert'

mapmeld (Member Author):

This is maybe a psycopg 2 / 3 issue, as I did end up with a new table? I will try to get a blocks import -> shattered map example working, use a TEMP TABLE instead of the current system, and then figure out what function belongs here.

Collaborator:

Ah, yes I think you're right. My local build (via docker-compose) and the dev preview should be on psycopg3

mapmeld (Member Author):

Does this latest commit work? I think it's the right way for psycopg3; it looks like it isn't a bulk upload, but it's the same concept of a single COPY command.
References: https://www.psycopg.org/psycopg3/docs/basic/copy.html
https://www.psycopg.org/articles/2020/11/15/psycopg3-copy/

Collaborator:

This does work to populate the table with block assignments -- both locally and on the preview deploy!


# find items to unshatter
results = session.execute(text("""
SELECT DISTINCT(SUBSTR(geo_id, 1, 11)) AS vtd, zone
@nofurtherinformation (Collaborator) commented Feb 11, 2025:

H: Unfortunately, VTD GEOIDs are not perfectly hierarchical with blocks. They diverge at the county level, so we can't rely on slicing the IDs to identify VTDs to heal/unshatter.

Diagram from Peter:

[diagram]

mapmeld (Member Author):

👍 that makes sense. I also realized that I was doing this in reverse (un-shattering blocks, when on a fresh map I need to be shattering the VTDs which need block-level resolution)

@mapmeld (Member Author) commented Feb 12, 2025

Alright, the latest change does a few extra queries (maybe something could be consolidated here), but it does successfully import a mix of VTDs and blocks from a block-level CSV.
[Screenshot 2025-02-11 at 7:57:15 PM]

@raphaellaude (Collaborator) left a comment:

Looking good! I think we need a few extra pieces before this is ready for prod:

  • Block geoids are not validated. All the residual blocks in the temp table (after parented blocks are removed/replaced by VTDs) are assumed to be valid. We should not assume that is the case.
  • Input assumes the CSV has at least two columns which represent the geoids and zone. Invalid inputs are not handled by the endpoint since the data object is loosely typed as list[list[str]].
  • Backend tests are needed for the new endpoint, including different failure states for bad inputs. The backend should be a better friend to the client, providing informative error messages rather than 500s.
  • Return value is misleading.

Otherwise, I still have a preference for the single-query insert, but this approach will work.

@@ -268,6 +269,91 @@ async def reset_map(document_id: str, session: Session = Depends(get_session)):
return {"message": "Assignments partition reset", "document_id": document_id}


@app.patch("/api/upload_assignments")
Collaborator:

PP: Slight preference for the more RESTful @app.patch("/api/update_assignments/{document_id}/upload") which matches our style elsewhere.


setProgress(0);
setTotalRows(results.data.length);

createMapDocument({
Collaborator:

I: if we're tying uploads directly to document upload, then the endpoint should probably create the document and populate it in one step rather than two. The current API implies you can upload assignments at any time, which is really not the case w/o potentially erroring on conflicting geoids.

Collaborator:

+1


session.commit()

return {"assignments_upserted": 1}
Collaborator:

H: I'd think we want a more informative return value. It's a bit misleading to say only a single assignment was inserted (also we're not upserting).
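For instance, a sketch of a more informative response, assuming the result of the INSERT above is captured in a result variable (the field names here are illustrative, not the PR's actual schema):

# Illustrative only: report how many rows the INSERT actually wrote.
inserted = result.rowcount  # CursorResult.rowcount from session.execute(...)
session.commit()
return {
    "assignments_inserted": inserted,
    "document_id": document_id,
}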

Collaborator:

+1

Comment on lines +314 to +321
vtds = []
with cursor.copy("COPY temploader (geo_id, zone) FROM STDIN") as copy:
for row in results:
vtd, zone = row
vtds.append(vtd)
copy.write_row([vtd, zone])
logger.info("uniform parents")
logger.info(vtds)
Collaborator:

I: Once we've loaded the assignments in via the temp table, I don't love performing another copy. Did you try adapting the query I previously suggested which performs all steps in a single query?

@@ -268,6 +269,91 @@ async def reset_map(document_id: str, session: Session = Depends(get_session)):
return {"message": "Assignments partition reset", "document_id": document_id}


@app.patch("/api/upload_assignments")
async def upload_assignments(
Collaborator:

H: Please add tests
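For example, a minimal pytest sketch; the payload field names and the seeded test document are assumptions, not the PR's actual request schema:

from fastapi.testclient import TestClient

from app.main import app

client = TestClient(app)

def test_upload_assignments_rejects_non_numeric_zone():
    # A non-numeric zone should yield an informative 400, not a 500.
    payload = {
        "document_id": "test-document",  # assumes a seeded test document
        "assignments": [["200459701001001", "not-a-zone"]],
    }
    response = client.patch("/api/upload_assignments", json=payload)
    assert response.status_code == 400
    assert "zone" in response.json()["detail"]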

Comment on lines +287 to +292
with cursor.copy("COPY temploader (geo_id, zone) FROM STDIN") as copy:
for record in csv_rows:
if record[1] == "":
copy.write_row([record[0], None])
else:
copy.write_row([record[0], int(record[1])])
Collaborator:

H: add handling for bad inputs / types.
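One possible shape for that handling, as a sketch (the status codes and error messages are illustrative):

from fastapi import HTTPException

with cursor.copy("COPY temploader (geo_id, zone) FROM STDIN") as copy:
    for i, record in enumerate(csv_rows):
        if len(record) < 2 or not record[0]:
            raise HTTPException(status_code=400, detail=f"Row {i} is missing a geoid or zone column")
        geo_id, zone = record[0], record[1].strip()
        if zone == "":
            copy.write_row([geo_id, None])
        else:
            try:
                zone_value = int(zone)
            except ValueError:
                raise HTTPException(status_code=400, detail=f"Row {i}: zone '{zone}' is not an integer")
            copy.write_row([geo_id, zone_value])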


# insert into actual assignments table
session.execute(text("""
INSERT INTO document.assignments (geo_id, zone, document_id)
Collaborator:

H: Loading the rest of the assignments this way assumes that all the geo_ids are valid. I think at some stage we need to join the blocks against the parent_child_edges in order to determine that they are valid.
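A hedged sketch of that check, reusing the parentchildedges naming from the pseudo-query above (the real schema may differ):

from fastapi import HTTPException
from sqlalchemy import text

# Any temploader geo_id that never appears as a child edge is unknown/invalid.
invalid_count = session.execute(text("""
    SELECT COUNT(*)
    FROM temploader t
    LEFT JOIN parentchildedges e ON e.child_path = t.geo_id
    WHERE e.child_path IS NULL
""")).scalar()
if invalid_count:
    raise HTTPException(status_code=400, detail=f"{invalid_count} geoids not found in the selected gerrydb table")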

Collaborator:

PP: When bulk loading, specifying the partition in the table name INSERT INTO document.assignments_{document_id} ... will have a slight performance boost over inserting w/o the partition specified.
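As a sketch, assuming partitions are named assignments_{document_id} as described (document_id must be validated before being interpolated into SQL):

from sqlalchemy import text

# Writing to the partition directly avoids the parent table's tuple routing.
session.execute(
    text(f"""
        INSERT INTO document.assignments_{document_id} (geo_id, zone, document_id)
        SELECT geo_id, zone, :doc_id FROM temploader
    """),
    {"doc_id": document_id},
)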

Collaborator:

H: About geoid validation, should we drop invalid/old geoids? What if someone uploads 2010 geoids?

@nofurtherinformation (Collaborator) left a comment:

Thank you for all your work! This is working pretty nicely now! +1 to @raphaellaude's suggestions, and I added a few more, but this is moving in a great direction!


const processFile = (file: File) => {
if (!file) {
throw new Error('No file selected');
return;
Collaborator:

H: At least some file type validation would be good -- I accidentally dropped in a parquet file and only got an internal server error response.

Additionally, some simple field / geoid validation would be good -- e.g. if there are more than two columns, should we ask users? Checking whether GEOIDs all start with the expected state FIPS code?






};

export default function Uploader() {
const [progress, setProgress] = useState<number>(0);
Collaborator:

Q: The progress bar doesn't seem to work in the current implementation, although the response time to process the query isn't very long. Do we need a progress bar?

@nofurtherinformation (Collaborator):

P.S. For code style, I think this also needs the pre-commit hook installed/run!
