Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset Modularization pt.2 #1

Closed
wants to merge 10 commits into from
Closed

Commits on Apr 14, 2023

  1. Datasets refactoring

    Moving datasets profiling to datasets modules
    nikpodsh committed Apr 14, 2023
    Configuration menu
    Copy the full SHA
    cb3800a View commit details
    Browse the repository at this point in the history
  2. Datasets refactoring

    Renaming profiling
    nikpodsh committed Apr 14, 2023
    Configuration menu
    Copy the full SHA
    dd8e597 View commit details
    Browse the repository at this point in the history
  3. Datasets refactoring

    Renaming table_column_model to models to easier import other models
    nikpodsh committed Apr 14, 2023
    Configuration menu
    Copy the full SHA
    8ca7bea View commit details
    Browse the repository at this point in the history
  4. Datasets refactoring

    Moving DatasetProfilingRun model
    nikpodsh committed Apr 14, 2023
    Configuration menu
    Copy the full SHA
    e36ab3b View commit details
    Browse the repository at this point in the history
  5. Datasets refactoring

    Moving dataset profiling service and renaming it
    nikpodsh committed Apr 14, 2023
    Configuration menu
    Copy the full SHA
    31720c2 View commit details
    Browse the repository at this point in the history
  6. Datasets refactoring

    Extracted glue_profiling_handler
    nikpodsh committed Apr 14, 2023
    Configuration menu
    Copy the full SHA
    8a907df View commit details
    Browse the repository at this point in the history
  7. Datasets refactoring

    Deleted DatasetTableProfilingJob since could not find any usage of it
    nikpodsh committed Apr 14, 2023
    Configuration menu
    Copy the full SHA
    561da72 View commit details
    Browse the repository at this point in the history

Commits on Apr 17, 2023

  1. Datasets refactoring

    Returned the name to model after renaming the service
    nikpodsh committed Apr 17, 2023
    Configuration menu
    Copy the full SHA
    47a38cc View commit details
    Browse the repository at this point in the history

Commits on Apr 24, 2023

  1. Dataset Modularization pt.1 (data-dot-all#413)

    ### Feature or Bugfix
    - Refactoring (Modularization)
    
    ### Relates
    - Related issues data-dot-all#295 and data-dot-all#412 
    
    ### Short Summary
    First part of migration of `Dataset` (`DatasetTableColumn`) TL;DR :) 
    
    ### Long Summary 
    Datasets are huge. It's one of the central modules that's spread
    everywhere across the application. Migrating the entire Dataset piece
    would be very difficult task and, more importantly, even more difficult
    to review. Therefore, I decided to break down this work into "small"
    steps to make it more convenient to review.
    Dataset's API consist of the following items:
    * `Dataset`
    * `DatasetTable`
    * `DatasetTableColumn`
    * `DatasetLocation`
    * `DatasetProfiling`
    
    In this PR, there is only creation of `Dataset` module and migration of
    `DatasetTableColumn` (and some related to it pieces). Why? Because the
    plan was to migrate it, to see what issues would come up along with it
    and to address them here. The refactoring of `DatasetTableColumn` will
    be in other PR.
    The issues: 
    1) Glossaries
    2) Feed
    3) Long tasks for datasets
    4) Redshift
    
    Glossaries rely on GraphQL UNION of different type (including datasets).
    Created an abstraction for glossary registration. There was an idea to
    change frontend, but it'd require a lot of work to do this
    
    Feed: same as glossaries. Solved the similar way. For feed, changing
    frontend API is more feasible, but I wanted it to be consistent with
    glossaries
    
    Long tasks for datasets. They migrated into tasks folder and doesn't
    require a dedicated loading for its code (at least for now). But there
    are two concerns:
    1) The deployment uses a direct module folder references to run them
    (e.g. `dataall.modules.datasets....`, so basically when a module is
    deactivated, then we shouldn't deploy this tasks as well). I left a TODO
    for it to address in future (when we migrate all modules), but we should
    bear in mind that it might lead to inconsistencies.
    2) There is a reference to `redshift` from long-running tasks = should
    be address in `redshift` module
    
    Redshift: it has some references to `datasets`. So there will be either
    dependencies among modules or small code duplication (if `redshift`
    doesn't rely hard on `datasets`) = will be addressed in `redshift`
    module
    
    Other changes:
    Fixed and improved some tests 
    Extracted glue handler code that related to `DatasetTableColumn`
    Renamed import mode from tasks to handlers for async lambda.
    A few hacks that will go away with next refactoring :) 
    
    Next steps:
    [Part2 ](#1) in preview :) 
    Extract rest of datasets functionality (perhaps, in a few steps)
    Refactor extractor modules the same way as notebooks
    Extract tests to follow the same structure.
    
    
    
    By submitting this pull request, I confirm that my contribution is made
    under the terms of the Apache 2.0 license.
    nikpodsh authored Apr 24, 2023
    Configuration menu
    Copy the full SHA
    3c4ab2d View commit details
    Browse the repository at this point in the history
  2. Resolved code conflict

    nikpodsh committed Apr 24, 2023
    Configuration menu
    Copy the full SHA
    2ac3ae7 View commit details
    Browse the repository at this point in the history