
Data Week: Internal Batch Scoring MBD API Endpoint (Part 2) #2680

Jkd-eth (Contributor) commented Jul 9, 2024

User Story:
As a data engineer, I want to set up an internal batch scoring MBD API endpoint, so that I can process large datasets efficiently for the data team and provide results in a downloadable CSV file.

Acceptance Criteria:
GIVEN the internal API endpoint,
WHEN the data team submits a list of addresses with their API key,
THEN the API should provide an estimated processing time and a job ID, allow status checks via a separate endpoint, and return an S3 bucket link to download the CSV file with the results when the job is completed.
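
For illustration, a minimal client-side sketch of this flow; the endpoint paths, header name, and response field names (job_id, estimated_seconds, download_url) are placeholders, not the final API:

```python
# Hypothetical client-side flow for the batch scoring API described above.
# The endpoint paths, header name, and response field names are placeholders.
import time

import requests

API_BASE = "https://internal.example.com"       # placeholder base URL
HEADERS = {"X-API-Key": "<data-team-api-key>"}  # API key issued by the scorer app

# 1. Submit a list of addresses; the response carries a job ID and an estimate.
job = requests.post(
    f"{API_BASE}/batch-score",
    json={"addresses": ["0xabc...", "0xdef..."]},
    headers=HEADERS,
).json()
print(job["job_id"], job["estimated_seconds"])

# 2. Poll the separate status endpoint until the job is finished.
while True:
    status = requests.get(
        f"{API_BASE}/batch-score/{job['job_id']}", headers=HEADERS
    ).json()
    if status["status"] == "DONE":
        break
    time.sleep(30)

# 3. The completed job exposes an S3 link to download the CSV with the results.
print("Download results:", status["download_url"])
```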

Tech Details:

  • Dependency on Part 1 being completed
  • Operate the compute in a dedicated EC2 instance, automatically triggered by input bucket uploads.
  • Create an endpoint to submit a list of addresses, which replies with an estimated time and a job ID.
  • Implement a separate endpoint to check the status of the job using the job ID.
  • Implement a database to store the data from the batch job
  • Once the job is completed, provide an S3 link to download the CSV file (or Parquet file) containing the addresses and their scores.
  • Hard-code conservative processing-time estimates based on historical data (see the sketch after this list).
  • Test the endpoint for performance and accuracy.
  • Should include the MBD result(s); the transaction data will be stored in DynamoDB.
    The rest of the story is covered in Part 2.
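
A rough sketch of the hard-coded, conservative estimate mentioned in the list above; the throughput figure and safety margin are placeholder values, not measured numbers:

```python
# Minimal sketch of the hard-coded, conservative processing-time estimate.
# The throughput figure and safety margin below are placeholders, not measured values.
HISTORICAL_ADDRESSES_PER_SECOND = 50  # assumed historical throughput
SAFETY_FACTOR = 1.5                   # pad the estimate so it stays conservative

def estimate_processing_seconds(num_addresses: int) -> int:
    """Return a conservative estimate of how long a batch job will take."""
    raw = num_addresses / HISTORICAL_ADDRESSES_PER_SECOND
    return int(raw * SAFETY_FACTOR) + 1
```

The submit endpoint would return this estimate alongside the job ID.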

Open Questions:

  • Are there any specific performance metrics to consider?
  • What specific format should the CSV file follow?

Notes/Assumptions:

  • Ensure the endpoint can handle large datasets efficiently.
  • Restrict access to the endpoint to the data team only.
  • Authenticate with API keys from the scorer app (see the sketch after this list).
  • Assume the API infrastructure is already in place.
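
A possible shape for the access restriction noted above, assuming a simple header check against API keys issued by the scorer app; the header name and the DATA_TEAM_API_KEYS setting are hypothetical, not the scorer app's actual auth mechanism:

```python
# Hypothetical access check for the internal batch endpoint.
# The header name, the DATA_TEAM_API_KEYS setting, and the error shape
# are illustrative placeholders, not the scorer app's real auth mechanism.
from functools import wraps

from django.conf import settings
from django.http import JsonResponse

def require_data_team_key(view_func):
    """Reject requests whose API key is not an authorized data-team key."""
    @wraps(view_func)
    def wrapper(request, *args, **kwargs):
        api_key = request.headers.get("X-API-Key", "")
        # DATA_TEAM_API_KEYS is an assumed setting listing keys issued by the scorer app
        if api_key not in getattr(settings, "DATA_TEAM_API_KEYS", []):
            return JsonResponse({"detail": "Forbidden"}, status=403)
        return view_func(request, *args, **kwargs)
    return wrapper
```
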
@Jkd-eth Jkd-eth added this to Passport Jul 9, 2024
@Jkd-eth Jkd-eth moved this to Prioritized in Passport Jul 9, 2024
@Jkd-eth Jkd-eth changed the title from "MBD: Internal Batch Scoring MBD API Endpoint (Part 2)" to "Data Week: Internal Batch Scoring MBD API Endpoint (Part 2)" Jul 9, 2024
@erichfi erichfi mentioned this issue Jul 15, 2024
@erichfi erichfi moved this to Prioritized in Passport New Aug 2, 2024
lucianHymer (Collaborator) commented Aug 5, 2024

The info for the task can be found in the new BatchModelScoringRequest object.

You can query the db for objects with a status of PENDING. I didn't add an IN-PROGRESS status, but maybe we should.

The record includes a filename field. This is the name of the address list file. It can be found in S3 at bulk-score-requests<-abc123>/address-lists/<filename>

(The <-abc123> suffix changes for each environment. Review doesn't have one; the values for staging and prod are set in 1P to be loaded into the environment.)

The record also includes a model_list field that can be forwarded to the analysis endpoint.
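
A minimal sketch of this pickup step, assuming BatchModelScoringRequest is a regular Django model with the status, filename, and model_list fields described above; the settings names and import path below are assumptions, not the real ones:

```python
# Sketch only: pick up pending batch requests and locate their address lists in S3.
# Settings names and the import path are assumptions, not the real ones.
import boto3
from django.conf import settings

from registry.models import BatchModelScoringRequest  # assumed import path

def pick_up_pending_requests():
    s3 = boto3.client("s3")
    for req in BatchModelScoringRequest.objects.filter(status="PENDING"):
        # Address lists live at bulk-score-requests<-abc123>/address-lists/<filename>
        key = f"{settings.BULK_SCORE_ADDRESS_LIST_FOLDER}/{req.filename}"   # assumed setting
        local_path = f"/tmp/{req.filename}"
        s3.download_file(settings.BULK_SCORE_REQUESTS_BUCKET, key, local_path)  # assumed setting
        # req.model_list is forwarded to the analysis endpoint for scoring
        yield req, local_path, req.model_list
```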

Once complete, the results should be saved to bulk-score-requests<-abc123>/model-score-results/<filename>, using the same filename as the address list, so the Django admin can easily render a link. The link is a signed URL that is valid for 1 hour (generated whenever the record is loaded in the Django admin) and lets a user download the file even though it's in a private bucket. The BatchModelScoringRequest status should then be updated to DONE.

The bucket and folder names are saved as settings in the Django app.
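
And the completion step, continuing the same assumptions (settings names are placeholders); the signed link is generated by the Django admin when the record is loaded, roughly like the generate_presigned_url call below:

```python
# Sketch of the completion step: upload the results with the same filename,
# mark the request DONE, and show roughly how a 1-hour signed link is produced.
# The settings names below are placeholders, not the real configuration keys.
import boto3
from django.conf import settings

def finish_batch_request(req, local_results_path: str) -> str:
    s3 = boto3.client("s3")
    bucket = settings.BULK_SCORE_REQUESTS_BUCKET                          # assumed setting
    results_key = f"{settings.BULK_SCORE_RESULTS_FOLDER}/{req.filename}"  # assumed setting

    # Results live next to the address list, under model-score-results/<filename>
    s3.upload_file(local_results_path, bucket, results_key)

    req.status = "DONE"
    req.save()

    # The Django admin renders a signed URL like this whenever the record is loaded:
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": results_key},
        ExpiresIn=3600,  # valid for 1 hour
    )
```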

@tim-schultz tim-schultz self-assigned this Aug 5, 2024
@tim-schultz tim-schultz moved this from Prioritized to In Progress (WIP) in Passport Aug 5, 2024
@tim-schultz tim-schultz moved this from Prioritized to In Progress (WIP) in Passport New Aug 6, 2024
@tim-schultz tim-schultz moved this from In Progress (WIP) to Product/UX Review in Passport New Aug 14, 2024
@tim-schultz tim-schultz moved this from Code Complete to Product/UX Review in Passport New Aug 14, 2024
tim-schultz (Collaborator) commented:

@erichfi @stefi-says @NadjibBenlaldj Here is the documentation on how to upload, process, and access the results of bulk address processing: https://github.com/passportxyz/passport-scorer/blob/main/api/registry/bulk-mbd-analysis.md. Just lmk if you have any questions
