Import CSV files using existing APIs #260

Open · wants to merge 13 commits into main
Conversation

@mapmeld (Member) commented Jan 23, 2025

As discussed in #254, this PR adds a page at /uploader where a user can select a CSV file, or drop one into the browser window, and create a map based on an existing gerrydb table. Since it currently contains only frontend changes, it works as a separate PR.

Blank or non-numeric zones are ignored. Large CSVs are uploaded in batches of 2,000 rows.

Potential issues

  • supporting and validating more CSV types (TSV, extra headers)
  • possible to check on server or client that the first geoids have matches in the selected gerrydb table?
  • moving finished map links to a sidebar, allowing multiple CSV uploads
  • future auth integration

@@ -43,6 +43,7 @@
"lodash": "^4.17.21",
"maplibre-gl": "^4.4.1",
"next": "^14.2.7",
"papaparse": "^5.5.1",
Collaborator:

H: The build is failing because the types for papaparse come from a separate package -- running npm i -D @types/papaparse or adding @types/papaparse to dev dependencies should resolve this

mapmeld (Member Author):

👍

@mapmeld (Member Author) commented Feb 3, 2025

OK, I did some testing and think this is ready for review.

  • Allows user to upload multiple files (one at a time)
  • Links appear in the right column, with the original CSV's filename
  • Papaparse handles TSV properly
  • Unusual rows (where the right column is blank or not a numeric value) are skipped
[Screenshot 2025-02-02 at 9:40:59 PM]

@nofurtherinformation (Collaborator) left a comment:

Overall this is looking really good and working nicely! One moderate refactor and a couple of non-blocking comments, but then I think this is good to merge.

setProgress(rowCursor + assignments.length);
rowCursor += ROWS_PER_BATCH;
if (rowCursor > results.data.length) {
setMapLinks([...mapLinks, {document_id, name: file.name}]);
Collaborator:

PP: On complete, it could be nice to add the upload to the user's recent maps, something like:

const upsertUserMap = useMapStore(state => state.upsertUserMap)
...
 upsertUserMap({
   mapDocument: response,
 })

};

return (
<div className="h-screen w-screen flex items-center justify-center bg-gray-100">
Collaborator:

Medium (?): Use Radix UI Themes components (<Flex>, <Heading>, <Box>, etc.) when possible to ensure styling consistency

Collaborator:

+1

name: string;
};

export default function Uploader() {
Collaborator:

Q: Should this function be directly available in the dropdown menu on the map page?

Collaborator:

I think so. And the uploader is in a modal?

Collaborator:

Agreed!

@raphaellaude (Collaborator) left a comment:

Let's discuss this.

The approach of uploading blocks as assignments and letting the FE un-break the entire document in serial poses a number of challenges:

1. Very high DB load

[screenshot: DB load chart]

This db load is from a single block map being loaded for Kansas (an export from Districtr itself).

2. Slow user experience

[screenshot]

I think that writing an import endpoint would be much more performant and support many more concurrent loads.

@nofurtherinformation (Collaborator):

@raphaellaude Agreed on all points for block-level imports, which are the core use case.

After a little more thought, we may want to wait until we have fuller functionality before merging -- supporting only VTD imports may be confusing.

@raphaellaude (Collaborator):

Oh and we should definitely link /uploader to the map page, right? Maybe under a new menu group in select map

[screenshot]

@nofurtherinformation (Collaborator):

> Oh and we should definitely link /uploader to the map page, right? Maybe under a new menu group in select map

Agreed! Maybe nice to have it in a modal on the map page too, and then directly load the assignments you uploaded

@mapmeld (Member Author) commented Feb 4, 2025

It looks like it's possible to use psycopg to COPY from a CSV string (with cur.copy("COPY tmp_data(id, bla) FROM stdin (format csv, delimiter '|', quote '\"')"), as in psycopg/psycopg#139).
The frontend here could validate the file and then pass it to the server in a standardized format.
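A minimal sketch of that approach with psycopg3 (the connection string, table, and column names here are illustrative, not from this PR):

import psycopg

# Raw CSV text as received from the frontend, header row included.
csv_text = "geo_id,zone\n200459701001001,1\n200459701001002,2\n"

with psycopg.connect("dbname=districtr") as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE TEMP TABLE tmp_data (geo_id TEXT, zone INT)")
        # psycopg3's copy() context manager accepts raw CSV via copy.write()
        with cur.copy(
            "COPY tmp_data (geo_id, zone) FROM STDIN (FORMAT csv, HEADER true)"
        ) as copy:
            copy.write(csv_text)
    # the connection context manager commits on clean exit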

@raphaellaude (Collaborator):

> It looks like it's possible to use psycopg to COPY from a CSV string (with cur.copy("COPY tmp_data(id, bla) FROM stdin (format csv, delimiter '|', quote '\"')"), as in psycopg/psycopg#139).
> The frontend here could validate the file and then pass it to the server in a standardized format.

Yeah, that sounds good for loading the data. Once it's in postgres, I think something like the following pseudo-query should then allow us to unbreak the blocks:

with edge_assignments as (
  select
    doc.document_id,
    edges.parent_path,
    edges.child_path,
    doc.zone
  from parentchildedges edges
  left join document doc
    on doc.geoid = edges.child_path ),
unbroken_parents as (
  select
    e.parent_path as path,
    e.zone
  from edge_assignments e
  left join (
    select parent_path, count(*) as parent_count
    from edge_assignments
    group by parent_path ) pc
  using (parent_path)
  group by e.parent_path, e.zone, pc.parent_count
  having count(*) = pc.parent_count )
select
  '{document id}' as document_id,
  unbroken_parents.path,
  unbroken_parents.zone
from
  unbroken_parents
union all
select
  edges.document_id,
  edges.child_path as path,
  edges.zone
from
  edge_assignments edges
where edges.parent_path not in (
  select parent_path
  from unbroken_parents )

geo_id TEXT,
zone INT
)"""))
session.connection().connection.cursor().copy_expert(
Collaborator:

H: This method does not appear to exist on cursor

2025-02-10T16:42:02.894 app[48e236dc3e9e58] ewr [info] File "/app/app/main.py", line 285, in upload_assignments

2025-02-10T16:42:02.894 app[48e236dc3e9e58] ewr [info] session.connection().connection.cursor().copy_expert(

2025-02-10T16:42:02.894 app[48e236dc3e9e58] ewr [info] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

2025-02-10T16:42:02.894 app[48e236dc3e9e58] ewr [info] AttributeError: 'Cursor' object has no attribute 'copy_expert'

mapmeld (Member Author):

This is maybe a psycopg 2 / 3 issue, as I did end up with a new table? I will try to get a blocks import -> shattered map example working, use a TEMP TABLE instead of the current system, and then figure out what function belongs here.

Collaborator:

Ah, yes I think you're right. My local build (via docker-compose) and the dev preview should be on psycopg3

mapmeld (Member Author):

Does this latest commit work? I think it's the right way for psycopg3; it looks like it isn't a bulk upload, but it's the same concept of a single COPY command.
References: https://www.psycopg.org/psycopg3/docs/basic/copy.html
https://www.psycopg.org/articles/2020/11/15/psycopg3-copy/

Collaborator:

This does work to populate the table with block assignments -- both locally and on the preview deploy!


# find items to unshatter
results = session.execute(text("""
SELECT DISTINCT(SUBSTR(geo_id, 1, 11)) AS vtd, zone
@nofurtherinformation (Collaborator) commented Feb 11, 2025:

H: Unfortunately, VTD GEOIDs are not perfectly hierarchical with blocks. They diverge at the county level, so we can't rely on slicing the IDs to identify VTDs to heal/unshatter.

Diagram from Peter:

[diagram]

mapmeld (Member Author):

👍 that makes sense. I also realized that I was doing this in reverse (un-shattering blocks, when on a fresh map I need to be shattering the VTDs which need block-level resolution)

@mapmeld (Member Author) commented Feb 12, 2025

Alright, the latest change does a few extra queries (maybe something could be consolidated here), but it does successfully import a mix of VTDs and blocks from a block-level CSV.
[Screenshot 2025-02-11 at 7:57:15 PM]

@raphaellaude (Collaborator) left a comment:

Looking good! I think we need a few extra pieces before this is ready for prod:

  • Block geoids are not validated. All the residual blocks in the temp table (after parented blocks are removed/replaced by VTDs) are assumed to be valid. We should not assume that is the case.
  • Input assumes the CSV has at least two columns which represent the geoids and zone. Invalid inputs are not handled by the endpoint since the data object is loosely typed as list[list[str]].
  • Backend tests are needed for the new endpoint, including different failure states for bad inputs. The backend should be a better friend to the client, providing informative error messages rather than 500s.
  • Return value is misleading.

Otherwise, I still have a preference for the single-query insert, but this approach will work.

@@ -268,6 +269,91 @@ async def reset_map(document_id: str, session: Session = Depends(get_session)):
return {"message": "Assignments partition reset", "document_id": document_id}


@app.patch("/api/upload_assignments")
Collaborator:

PP: Slight preference for the more RESTful @app.patch("/api/update_assignments/{document_id}/upload") which matches our style elsewhere.


setProgress(0);
setTotalRows(results.data.length);

createMapDocument({
Collaborator:

I: if we're tying uploads directly to document upload, then the endpoint should probably create the document and populate it in one step rather than two. The current API implies you can upload assignments at any time, which is really not the case w/o potentially erroring on conflicting geoids.

Collaborator:

+1


session.commit()

return {"assignments_upserted": 1}
Collaborator:

H: I'd think we want a more informative return value. It's a bit misleading to say only a single assignment was inserted (also we're not upserting).
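For instance, a sketch of a more informative response, assuming the result of the INSERT above is captured in a result variable (the field names here are illustrative, not the PR's actual schema):

# Illustrative only: report how many rows the INSERT actually wrote.
inserted = result.rowcount  # CursorResult.rowcount from session.execute(...)
session.commit()
return {
    "assignments_inserted": inserted,
    "document_id": document_id,
}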

Collaborator:

+1

Comment on lines +314 to +321
vtds = []
with cursor.copy("COPY temploader (geo_id, zone) FROM STDIN") as copy:
for row in results:
vtd, zone = row
vtds.append(vtd)
copy.write_row([vtd, zone])
logger.info("uniform parents")
logger.info(vtds)
Collaborator:

I: Once we've loaded the assignments in via the temp table, I don't love performing another copy. Did you try adapting the query I previously suggested which performs all steps in a single query?

@@ -268,6 +269,91 @@ async def reset_map(document_id: str, session: Session = Depends(get_session)):
return {"message": "Assignments partition reset", "document_id": document_id}


@app.patch("/api/upload_assignments")
async def upload_assignments(
Collaborator:

H: Please add tests
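For example, a minimal pytest sketch; the payload field names and the seeded test document are assumptions, not the PR's actual request schema:

from fastapi.testclient import TestClient

from app.main import app

client = TestClient(app)

def test_upload_assignments_rejects_non_numeric_zone():
    # A non-numeric zone should yield an informative 400, not a 500.
    payload = {
        "document_id": "test-document",  # assumes a seeded test document
        "assignments": [["200459701001001", "not-a-zone"]],
    }
    response = client.patch("/api/upload_assignments", json=payload)
    assert response.status_code == 400
    assert "zone" in response.json()["detail"]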

Comment on lines +287 to +292
with cursor.copy("COPY temploader (geo_id, zone) FROM STDIN") as copy:
for record in csv_rows:
if record[1] == "":
copy.write_row([record[0], None])
else:
copy.write_row([record[0], int(record[1])])
Collaborator:

H: add handling for bad inputs / types.
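One possible shape for that handling, as a sketch (the status codes and error messages are illustrative):

from fastapi import HTTPException

with cursor.copy("COPY temploader (geo_id, zone) FROM STDIN") as copy:
    for i, record in enumerate(csv_rows):
        if len(record) < 2 or not record[0]:
            raise HTTPException(status_code=400, detail=f"Row {i} is missing a geoid or zone column")
        geo_id, zone = record[0], record[1].strip()
        if zone == "":
            copy.write_row([geo_id, None])
        else:
            try:
                zone_value = int(zone)
            except ValueError:
                raise HTTPException(status_code=400, detail=f"Row {i}: zone '{zone}' is not an integer")
            copy.write_row([geo_id, zone_value])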


# insert into actual assignments table
session.execute(text("""
INSERT INTO document.assignments (geo_id, zone, document_id)
Collaborator:

H: Loading the rest of the assignments this way assumes that all the geo_ids are valid. I think at some stage we need to join the blocks against the parent_child_edges in order to determine that they are valid.
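A hedged sketch of that check, reusing the parentchildedges naming from the pseudo-query above (the real schema may differ):

from fastapi import HTTPException
from sqlalchemy import text

# Any temploader geo_id that never appears as a child edge is unknown/invalid.
invalid_count = session.execute(text("""
    SELECT COUNT(*)
    FROM temploader t
    LEFT JOIN parentchildedges e ON e.child_path = t.geo_id
    WHERE e.child_path IS NULL
""")).scalar()
if invalid_count:
    raise HTTPException(status_code=400, detail=f"{invalid_count} geoids not found in the selected gerrydb table")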

Collaborator:

PP: When bulk loading, specifying the partition in the table name INSERT INTO document.assignments_{document_id} ... will have a slight performance boost over inserting w/o the partition specified.
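As a sketch, assuming partitions are named assignments_{document_id} as described (document_id must be validated before being interpolated into SQL):

from sqlalchemy import text

# Writing to the partition directly avoids the parent table's tuple routing.
session.execute(
    text(f"""
        INSERT INTO document.assignments_{document_id} (geo_id, zone, document_id)
        SELECT geo_id, zone, :doc_id FROM temploader
    """),
    {"doc_id": document_id},
)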

Collaborator:

H: About geoid validation, should we drop invalid/old geoids? What if someone uploads 2010 geoids?

@nofurtherinformation (Collaborator) left a comment:

Thank you for all your work! This is working pretty nicely now! +1 to @raphaellaude's suggestions, and I added a few more, but this is moving in a great direction!


const processFile = (file: File) => {
if (!file) {
throw new Error('No file selected');
return;
Collaborator:

H: At least some file type validation would be good -- I accidentally dropped in a parquet file and only got an internal server error response.

Additionally, some simple field / geoid validation would be good -- e.g. if there are more than two columns, should we ask users? Checking whether GEOIDs all start with the expected state FIPS code?






};

export default function Uploader() {
const [progress, setProgress] = useState<number>(0);
Collaborator:

Q: The progress bar doesn't seem to work in the current implementation, although the response time to process the query isn't very long. Do we need a progress bar?

@nofurtherinformation (Collaborator):

P.S. For code style, I think this also needs the pre-commit hook installed/run!
