Spike - Data Dumps: Availability, Access Controls, and Tracking #3012

Open
erichfi opened this issue Oct 29, 2024 · 5 comments · May be fixed by passportxyz/passport-scorer#733

@erichfi
Contributor

erichfi commented Oct 29, 2024

Objective: Investigate and enhance the availability, access controls, and tracking mechanisms for Passport XYZ's data dumps to ensure security, accountability, and efficiency.

Key Areas to Explore:

  1. Architecture Review:
  • Examine current data dump processes and infrastructure.
  • Identify areas for improved availability and access.
  2. Access Controls:
  • Assess existing access control policies and their implementation.
  • Evaluate need for role-based access, multi-factor authentication, and other security measures.
  3. Tracking and Auditing:
  • Analyze current tracking systems for who accesses the data and when.
  • Propose methods to improve audit trails for accountability.
  4. Technology Solutions:
  • Explore tools and technologies to enhance data management.
  • Recommend solutions for scalability and future-proofing.

Expected Outcome:

A comprehensive report containing a SWOT analysis (strengths, weaknesses, opportunities, and threats) of our current data dump management, along with actionable recommendations for enhancement, a roadmap for implementation, and projections of the impact on operations and security.

@erichfi erichfi moved this to Prioritized in Passport New Oct 29, 2024
@tim-schultz tim-schultz self-assigned this Nov 13, 2024
@tim-schultz tim-schultz moved this from Prioritized to In Progress (WIP) in Passport New Nov 13, 2024
@tim-schultz
Collaborator

Architecture Review

Areas for Improvement

1. Data Format Optimization

  • Exports currently use multiple data formats, even though most are already Parquet
  • Recommendation: Transition to Parquet format for all exports
    • More storage efficient
    • Better compression
    • Optimized for analytics workloads

2. AWS Native Export Methods

  • AWS provides native methods for exporting data from RDS for PostgreSQL to S3
  • Current Status: Not prioritized because:
    • Existing system is stable and reliable
    • Current filtering capabilities meet requirements
    • Scheduled tasks framework is effective
    • No performance issues reported

3. Environment Configuration

  • Export jobs currently running across all environments
  • Recommendation: Limit exports to production environment only
    • Would conserve resources
    • Affects scheduled tasks defined in infrastructure code
    • Example location: infra/aws/index.ts:1385
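
A minimal sketch of the production-only gating logic, for illustration. The actual scheduled tasks are defined in the infrastructure code referenced above, so this Python snippet only shows the decision itself. The environment names, the `EXPORT_ENVIRONMENTS` constant, and the `export_jobs_for` helper are all assumptions, not names from the codebase.

```python
# Hypothetical: the set of environments in which export jobs should be scheduled.
EXPORT_ENVIRONMENTS = frozenset({"production"})


def export_jobs_for(environment, jobs):
    """Return the export jobs to schedule for the given environment.

    Environments outside EXPORT_ENVIRONMENTS get no export jobs at all.
    """
    if environment not in EXPORT_ENVIRONMENTS:
        return []
    return list(jobs)


# Hypothetical job names, used only for this example.
all_jobs = ["export-registry-passport", "export-registry-score"]

staging_jobs = export_jobs_for("staging", all_jobs)
production_jobs = export_jobs_for("production", all_jobs)
```

Keeping the allowed environments in one constant means a later decision to export from another environment is a one-line change.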

@tim-schultz
Collaborator

Waiting on confirmation of which exports to keep

@tim-schultz tim-schultz moved this from Blocked to In Progress (WIP) in Passport New Nov 15, 2024
@tim-schultz tim-schultz moved this from In Progress (WIP) to Prioritized in Passport New Nov 18, 2024
@tim-schultz tim-schultz moved this from In Progress (WIP) to Blocked in Passport New Nov 19, 2024
@Jkd-eth
Contributor

Jkd-eth commented Dec 1, 2024

@nutrina It looks like we're blocked on this one. Can we connect on Monday about what needs to be done to unblock it?

@Jkd-eth Jkd-eth moved this from Blocked to Prioritized in Passport New Dec 10, 2024
@Jkd-eth Jkd-eth changed the title from "Data Dumps: Availability, Access Controls, and Tracking" to "Spike - Data Dumps: Availability, Access Controls, and Tracking" Dec 16, 2024
@nutrina
Collaborator

nutrina commented Dec 16, 2024

We should only keep:

  • registry_passport.parquet
  • registry_score.parquet

@NadjibBenlaldj @stefi-says can you please confirm that we do not need to keep the others?

Also, would it help to add an updated_at column to the table structures?
This would allow delta data pulls: by filtering on the updated_at value, you can pull only the data that was updated since the last pull.
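
The delta pull idea can be sketched roughly as follows. This is an illustration only: it uses an in-memory SQLite database as a stand-in for the real PostgreSQL tables, and the `id` column, the sample rows, the ISO 8601 text timestamps, and the `pull_delta` helper are all assumptions.

```python
import sqlite3

# Tables the sketch is willing to query; table names cannot be bound as
# SQL parameters, so they are checked against this set instead.
ALLOWED_TABLES = {"registry_passport", "registry_score"}


def pull_delta(conn, table, last_pull):
    """Return (id, updated_at) rows updated strictly after last_pull."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table: {table}")
    cur = conn.execute(
        f"SELECT id, updated_at FROM {table} "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_pull,),
    )
    return cur.fetchall()


# In-memory stand-in for the real database, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE registry_score (id INTEGER PRIMARY KEY, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO registry_score (id, updated_at) VALUES (?, ?)",
    [
        (1, "2024-12-01T00:00:00Z"),
        (2, "2024-12-10T00:00:00Z"),
        (3, "2024-12-15T00:00:00Z"),
    ],
)

last_pull = "2024-12-05T00:00:00Z"
delta = pull_delta(conn, "registry_score", last_pull)

# The newest updated_at seen becomes the checkpoint for the next pull.
next_checkpoint = delta[-1][1] if delta else last_pull
```

The consumer has to store the checkpoint between pulls. Note that a filter on updated_at does not capture deleted rows, so deletions would need separate handling.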

@larisa17 larisa17 assigned larisa17 and unassigned tim-schultz Dec 18, 2024
@larisa17 larisa17 moved this from Prioritized to In Progress (WIP) in Passport New Dec 18, 2024
@larisa17 larisa17 moved this from In Progress (WIP) to Blocked in Passport New Dec 19, 2024
@larisa17
Collaborator

This depends on #3106

Labels
None yet
Projects
Status: Blocked

5 participants