
Data Week: Internal Batch Scoring MBD API Endpoint (Part 2) #2680

Jkd-eth (Contributor) commented Jul 9, 2024

User Story:
As a data engineer, I want to set up an internal batch scoring MBD API endpoint, so that I can process large datasets efficiently for the data team and provide results in a downloadable CSV file.

Acceptance Criteria:
GIVEN the internal API endpoint,
WHEN the data team submits a list of addresses with their API key,
THEN the API should provide an estimated processing time and a job ID, allow status checks via a separate endpoint, and return an S3 bucket link to download the CSV file with the results when the job is completed.
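
For illustration, a minimal client-side sketch of this flow; the endpoint paths, header name, and response field names (job_id, estimated_seconds, download_url) are placeholders, not the final API:

```python
# Hypothetical client-side flow for the batch scoring API described above.
# The endpoint paths, header name, and response field names are placeholders.
import time

import requests

API_BASE = "https://internal.example.com"       # placeholder base URL
HEADERS = {"X-API-Key": "<data-team-api-key>"}  # API key issued by the scorer app

# 1. Submit a list of addresses; the response carries a job ID and an estimate.
job = requests.post(
    f"{API_BASE}/batch-score",
    json={"addresses": ["0xabc...", "0xdef..."]},
    headers=HEADERS,
).json()
print(job["job_id"], job["estimated_seconds"])

# 2. Poll the separate status endpoint until the job is finished.
while True:
    status = requests.get(
        f"{API_BASE}/batch-score/{job['job_id']}", headers=HEADERS
    ).json()
    if status["status"] == "DONE":
        break
    time.sleep(30)

# 3. The completed job exposes an S3 link to download the CSV with the results.
print("Download results:", status["download_url"])
```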

Tech Details:

  • Dependency on Part 1 being completed
  • Operate the compute in a dedicated EC2 instance, automatically triggered by input bucket uploads.
  • Create an endpoint to submit a list of addresses, which replies with an estimated time and a job ID.
  • Implement a separate endpoint to check the status of the job using the job ID.
  • Implement a database to store the data from the batch job
  • Once the job is completed, provide an S3 link to download the CSV file (or Parquet file) containing the addresses and their scores.
  • Hard-code conservative processing-time estimates based on historical data (see the sketch after this list).
  • Test the endpoint for performance and accuracy.
  • Should include the MBD result(s); the transaction data will be stored in DynamoDB.
    The rest of the story is covered in Part 2.
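
A rough sketch of the hard-coded, conservative estimate mentioned in the list above; the throughput figure and safety margin are placeholder values, not measured numbers:

```python
# Minimal sketch of the hard-coded, conservative processing-time estimate.
# The throughput figure and safety margin below are placeholders, not measured values.
HISTORICAL_ADDRESSES_PER_SECOND = 50  # assumed historical throughput
SAFETY_FACTOR = 1.5                   # pad the estimate so it stays conservative

def estimate_processing_seconds(num_addresses: int) -> int:
    """Return a conservative estimate of how long a batch job will take."""
    raw = num_addresses / HISTORICAL_ADDRESSES_PER_SECOND
    return int(raw * SAFETY_FACTOR) + 1
```

The submit endpoint would return this estimate alongside the job ID.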

Open Questions:

  • Are there any specific performance metrics to consider?
  • What specific format should the CSV file follow?

Notes/Assumptions:

  • Ensure the endpoint can handle large datasets efficiently.
  • Restrict access to the endpoint to the data team only.
  • Authenticate with API keys from the scorer app (see the sketch after this list).
  • Assume the API infrastructure is already in place.
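
A possible shape for the access restriction noted above, assuming a simple header check against API keys issued by the scorer app; the header name and the DATA_TEAM_API_KEYS setting are hypothetical, not the scorer app's actual auth mechanism:

```python
# Hypothetical access check for the internal batch endpoint.
# The header name, the DATA_TEAM_API_KEYS setting, and the error shape
# are illustrative placeholders, not the scorer app's real auth mechanism.
from functools import wraps

from django.conf import settings
from django.http import JsonResponse

def require_data_team_key(view_func):
    """Reject requests whose API key is not an authorized data-team key."""
    @wraps(view_func)
    def wrapper(request, *args, **kwargs):
        api_key = request.headers.get("X-API-Key", "")
        # DATA_TEAM_API_KEYS is an assumed setting listing keys issued by the scorer app
        if api_key not in getattr(settings, "DATA_TEAM_API_KEYS", []):
            return JsonResponse({"detail": "Forbidden"}, status=403)
        return view_func(request, *args, **kwargs)
    return wrapper
```
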
@Jkd-eth Jkd-eth added this to Passport Jul 9, 2024
@Jkd-eth Jkd-eth moved this to Prioritized in Passport Jul 9, 2024
@Jkd-eth Jkd-eth changed the title from "MBD: Internal Batch Scoring MBD API Endpoint (Part 2)" to "Data Week: Internal Batch Scoring MBD API Endpoint (Part 2)" Jul 9, 2024
@erichfi erichfi mentioned this issue Jul 15, 2024
@erichfi erichfi moved this to Prioritized in Passport New Aug 2, 2024
lucianHymer (Collaborator) commented Aug 5, 2024

The info for the task can be found in the new BatchModelScoringRequest object.

You can query the db for objects with a status of PENDING. I didn't add an IN-PROGRESS status, but maybe we should.

The record includes a filename field. This is the name of the address list file. It can be found in S3 at bulk-score-requests<-abc123>/address-lists/<filename>

(The <-abc123> suffix changes for each environment. Review doesn't have one; the values for staging and prod are set in 1P to be loaded into the environment.)

The record also includes a model_list field that can be forwarded to the analysis endpoint.
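
A minimal sketch of this pickup step, assuming BatchModelScoringRequest is a regular Django model with the status, filename, and model_list fields described above; the settings names and import path below are assumptions, not the real ones:

```python
# Sketch only: pick up pending batch requests and locate their address lists in S3.
# Settings names and the import path are assumptions, not the real ones.
import boto3
from django.conf import settings

from registry.models import BatchModelScoringRequest  # assumed import path

def pick_up_pending_requests():
    s3 = boto3.client("s3")
    for req in BatchModelScoringRequest.objects.filter(status="PENDING"):
        # Address lists live at bulk-score-requests<-abc123>/address-lists/<filename>
        key = f"{settings.BULK_SCORE_ADDRESS_LIST_FOLDER}/{req.filename}"   # assumed setting
        local_path = f"/tmp/{req.filename}"
        s3.download_file(settings.BULK_SCORE_REQUESTS_BUCKET, key, local_path)  # assumed setting
        # req.model_list is forwarded to the analysis endpoint for scoring
        yield req, local_path, req.model_list
```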

Once complete, the results should be saved to bulk-score-requests<-abc123>/model-score-results/<filename>, using the same filename as the address list, so the Django admin can easily render a link. The link is a signed URL that is valid for 1 hour (generated whenever the record is loaded in the Django admin) and lets a user download the file even though it's in a private bucket. The BatchModelScoringRequest status should then be updated to DONE.

The bucket and folder names are saved as settings in the Django app.
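
And the completion step, continuing the same assumptions (settings names are placeholders); the signed link is generated by the Django admin when the record is loaded, roughly like the generate_presigned_url call below:

```python
# Sketch of the completion step: upload the results with the same filename,
# mark the request DONE, and show roughly how a 1-hour signed link is produced.
# The settings names below are placeholders, not the real configuration keys.
import boto3
from django.conf import settings

def finish_batch_request(req, local_results_path: str) -> str:
    s3 = boto3.client("s3")
    bucket = settings.BULK_SCORE_REQUESTS_BUCKET                          # assumed setting
    results_key = f"{settings.BULK_SCORE_RESULTS_FOLDER}/{req.filename}"  # assumed setting

    # Results live next to the address list, under model-score-results/<filename>
    s3.upload_file(local_results_path, bucket, results_key)

    req.status = "DONE"
    req.save()

    # The Django admin renders a signed URL like this whenever the record is loaded:
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": results_key},
        ExpiresIn=3600,  # valid for 1 hour
    )
```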

@tim-schultz tim-schultz self-assigned this Aug 5, 2024
@tim-schultz tim-schultz moved this from Prioritized to In Progress (WIP) in Passport Aug 5, 2024
@tim-schultz tim-schultz moved this from Prioritized to In Progress (WIP) in Passport New Aug 6, 2024
@tim-schultz tim-schultz moved this from In Progress (WIP) to Product/UX Review in Passport New Aug 14, 2024
@tim-schultz tim-schultz moved this from Code Complete to Product/UX Review in Passport New Aug 14, 2024
tim-schultz (Collaborator) commented:

@erichfi @stefi-says @NadjibBenlaldj Here is the documentation on how to upload, process, and access the results of bulk address processing: https://github.com/passportxyz/passport-scorer/blob/main/api/registry/bulk-mbd-analysis.md. Just lmk if you have any questions
