Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SIP-92] Proposal for restructuring the Python code base #20630

Closed
john-bodley opened this issue Jul 6, 2022 · 12 comments
Closed

[SIP-92] Proposal for restructuring the Python code base #20630

john-bodley opened this issue Jul 6, 2022 · 12 comments
Assignees
Labels
sip Superset Improvement Proposal

Comments

@john-bodley
Copy link
Member

john-bodley commented Jul 6, 2022

[SIP-92] Proposal for restructuring the Python code base

Motivation

Superset has evolved somewhat organically over time which is reflected—somewhat apparently—in the how the Python code—which resides solely in the top level superset folder—is organized. Initially Superset used a Model View Controlled (MVC) pattern combined with the notion of database connectors whereas now we've adopted the Data Access Object (DAO) pattern (SIP-35), which when coupled with commands and the API, helps to decouple the business layer from the persistence layer.

Due to partial refactors and years of creep the code organization is fragmented. This has negatively impacted both the code quality and developer experience.

Proposed Change

The TL;DR is the Superset application is primarily composed on a few major functional components:

  • APIs: v1 RESTful API (current) and API view endpoints (legacy)
  • CLI: Suite of command line tools
  • Commands: Used by both the APIs and the CLI
  • DAOs: Used by the commands to interface with the SQLAlchemy models
  • Models: Thin layer reflecting SQLAlchemy's declarative mapping and mixins
  • SQL Engine: Comprising of the engine specs, templating, pre-/post-processing, execution, etc.
  • Tasks: Asynchronous tasks/schedules
  • Views: Non-API endpoints, i.e., rendering HTML templates

The proposed change would be to refactor the code into more functional rather than business top level folders, which has become somewhat bloated and results in a fragmented core—something which has plagued Superset since its inception with its feature driven mindset at the cost of architectural integrity.

Below is the before/after enumeration of the current (as of 10/28/2022) top-level folders and files, where "N/A" denotes that the folder/files will no longer exist in its current form. The additional sub-sections outline more specifics.

Current Proposed Notes
advanced_data_type N/A See APIs, Commands, DAOs, and Models
annotation_layers N/A See APIs, Commands, DAOs, and Models
async_events N/A See APIs, Commands, DAOs, and Models
available_domains N/A See APIs, Commands, DAOs, and Models
cachekeys N/A See APIs, Commands, DAOs, and Models
charts N/A See APIs, Commands, DAOs, and Models
cli cli Unchanged
columns N/A See APIs, Commands, DAOs, and Models
commands commands See APIs, Commands, DAOs, and Models
common N/A See SQL Engine
connectors N/A See APIs, Commands, DAOs, and Models
css_templates N/A See APIs, Commands, DAOs, and Models
dao daos See APIs, Commands, DAOs, and Models
dashboards N/A See APIs, Commands, DAOs, and Models
databases N/A See APIs, Commands, DAOs, and Models
datasets N/A See APIs, Commands, DAOs, and Models
datasource N/A Unclear why we have both datasets and datasource DAOs
db_engine_specs engine/specs See SQL Engine
db_engines N/A Unused. See #20631
embedded N/A See APIs, Commands, DAOs, and Models
embedded_dashboard N/A See APIs, Commands, DAOs, and Models
examples examples Unchanged
explore N/A See APIs, Commands, DAOs, and Models
extensions extensions Unchanged
importexport N/A See APIs, Commands, DAOs, and Models
initialization blueprints See APIs, Commands, DAOs, and Models
key_value N/A See APIs, Commands, DAOs, and Models
migrations migrations Unchanged
models N/A See APIs, Commands, DAOs, and Models
queries N/A See APIs, Commands, DAOs, and Models
reports N/A See APIs, Commands, DAOs, and Models
security security See APIs, Commands, DAOs, and Models
sql_validators engine/query/validators See SQL Engine
sqllab N/A See APIs, Commands, DAOs, and Models
tables N/A See APIs, Commands, DAOs, and Models
tags N/A See APIs, Commands, DAOs, and Models
tasks tasks Unchanged
templates blueprints/templates See APIs, Commands, DAOs, and Models
temporary_cache N/A See APIs, Commands, DAOs, and Models
translations translations Unchanged
utils ? Unclear
views N/A See APIs, Commands, DAOs, and Models
__init__.py __init__.py Existing logic migrated elsewhere
app.py app.py Unchanged
config.py config.py Unchanged
constants.py N/A See APIs, Commands, DAOs, and Modelsand SQL Engine
dataframe.py N/A See SQL Engine
errors.py N/A See APIs, Commands, DAOs, and Models
exceptions.py N/A See APIs, Commands, DAOs, and Models
jinja_context.py N/A Split into components
result_set.py N/A See SQL Engine
schemas.py blueprints/api/base.py See APIs, Commands, DAOs, and Models
sql_lab.py N/A See SQL Engine
sql_parse.py ? See SQL Engine
stats_logger.py ?
superset_typing.py N/A Split into components
viz.py ? Legacy visualization types

APIs, Commands, DAOs, and Models

APIs historically were mostly defined in an ad-hoc manner, i.e., in a non-RESTful way, as views which mostly reside in within the superset/views/ folder. These "legacy" APIs now coexist alongside RESTful APIs which leverage the DAO model which reside in the component specific folder, i.e., superset/datasets/. Furthermore commands are either defined within the superset/commands/ folder or component specific folder, i.e., superset/datasets/commands.

As a developer it isn't overly apparent where an API endpoint resides. The proposed solution is move to a directory structure which more clearly illustrates that the APIs, DAOs, and commands are decoupled (as illustrated below for the datasets components). The API—which leverages blueprints—is comprised both of the v1 RESTful API (current) and the legacy API. This demarcation also helps developers identify which API endpoints need to be migrated to v1.

superset/
├─ blueprints/
│  ├─ api/
|  |  ├─ legacy/         # Previously defined in superset/connectors/*/views.py, superset/views/*, etc.
|  |  ├─ v1/
|  |  |  ├─ datasets.py  # Previously superset/datasets/api.py
|  |  |  ├─ ...
│  ├─ views/             # Previously defined in superset/connectors/*/views.py, superset/views/*, etc.
├─ commands/
│  ├─ datasets/          # Previously superset/datasets/commands
|  |  ├─ ...
│  ├─ base.py
|  ├─ ...
├─ daos/
│  ├─ datasets/          # Previously superset/datasets/
|  |  ├─ ...
│  ├─ base.py            # Previously superset/dao/base.py
│  ├─ exceptions.py      # Previously superset/dao/exceptions.py
|  ├─ ...
├─ models/            # Previously superset/connectors/*/models.py, superset/datasets/models.py, etc.
|  ├─ base.py
|  ├─ ...

Note currently views are a combination of legacy API endpoints and non-API endpoints. The concept of views will remain but should only contain non-API endpoints, i.e., rendering HTML templates.

SQL Engine

Though not as well flushed out as APIs, commands, and DAOs, the actual SQL engine—responsible for preparing, executing, fetching result sets—would be colocated within the broad superset/engine/ subfolder. This would comprise of the engine specifications, templating, SQL parsing, query objects, etc.

superset/
├─ engine/
│  ├─ specs/
│  | ├─ base.py          # Previously superset/db_engine_specs/base.py
|  | ├─ ...
|  ├─ query/
|  |  ├─ validators/
|  |  |  ├─ base.py      # Previously sql_validators/base.py
|  |  |  ├─ ...
│  |  ├─ context.py      # Previously common/query_context.py
|  |  ├─ executor.py     # Previously sql_lab.py et al.
|  |  ├─ results.py      # Previously result_set.py et al.
|  |  ├─ ...
|  ├─ ...

New or Changed Public Interfaces

N/A.

New dependencies

N/A.

Migration Plan and Compatibility

The code restructure can be piecemeal. The general steps are:

  1. Flush out the base APIs, commands, DAOs, and models
  2. Migrate—piece by piece—each of the functional components into the appropriate sub-folders.

Rejected Alternatives

Regarding the engine directory structure I'm unsure whether the current proposal makes the most sense. An alternative could be based more on the flow/path of a query from pre-processing (including construction and SQL parsing), to execution, then fetching, and finally post-processing of the result set.

I presume that calling out the engine as a first class entity rather than treating it as a command likely makes sense. I think this is open for debate/discussion.

@john-bodley john-bodley changed the title [WIP] [SIP-88] Proposal for restructuring the Python code base [WIP] [SIP-92] Proposal for restructuring the Python code base Oct 28, 2022
@john-bodley john-bodley added the sip Superset Improvement Proposal label Oct 28, 2022
@ktmud
Copy link
Member

ktmud commented Oct 31, 2022

I like this! Changing superset/{datasets,dashboard...}/{api,command,dao} to superset/{api,commands,daos}/{dataset,dashboard,...} would at least reduce the amount of top-level folders and make it easier to navigate for people who are just working on the APIs. And more often than not, you'd probably work on APIs utilizing multiple models rather than working on the models AND the API of a single entity type at the same time.

I also like that the core SQL engine gets its own folder. It's long overdue.

One thing I'm not sure about is can we make it faster to find all models that correspond to a db table? I assume they will all be moved to dao/..., but is the nesting under dao/{entity_type}/models.py too deep still?

Another thing that has been bothering me is---there seems to be too many boilerplates and parameters passing around between the API, the commands and the DAO. Also, what is the boundary between a command and a DAO method anyway? Currently it seems a very thick validation layer can be added to either the DAO or a command.

So I wonder if we can further simplify things down to something like this:

superset
├── api 
│   ├── legacy
│   └── v1
├── views
│   └── legacy 
│       ├── core.py
│       └── ...
├── tasks
│   └── ...
├── commands
│   ├── datasets
│   ├── dashboards
│   │   ├── create_dashboard.py
│   │   ├── get_dashboard_charts.py
│   │   └── ...
│   ├── reports
│   └── ...
├── engine
│   ├── specs
│   ├── query
│   │   ├── builder
│   │   ├── executor
│   │   └── results
├── models
│   ├── base.py
│   ├── chart.py
│   ├── dashboard.py
│   ├── dataset.py
│   ├── query.py
│   ├── report.py
│   └── ...
└── utils

A couple of things to note:

  1. Input validation that does not require pulling data from db (e.g.) would stay with the API---ideally done by the API framework directly via custom validators (using Marshmallow or any other validation library).
  2. commands will include any data operations (query or mutation) that require more than one model. Classes in models should try their best to not import any other models unless the dependency is strictly one directional (i.e., it's a 1-many relationship instead of many-to-many, e.g., dataset -> columns).
  3. Validations before data writes should happen privately within the data writer. There should be no need to export the validation methods---so we can avoid passing parameters around for too many layers.

@john-bodley
Copy link
Member Author

@ktmud thanks the the comments. I believe views—which are part of the MVC model; alongside models—are likely deemed legacy, i.e., should be migrated to RESTful API endpoints, and thus likely shouldn't be explicitly called out.

Regarding models, I'm not overly versed with the DAO pattern, though per this post DAOs and models could/should coexist and thus having a top-level superset/models/ directory seems viable. The one aspect I'm not entirely sure about is whether the existing model definitions are actually DAO-esque in nature and would need to be refactored. As an example the Dashboard model contains numerous properties/methods which reach beyond what the original intent of a Flask-SQLAlchemy model is.

@ktmud
Copy link
Member

ktmud commented Oct 31, 2022

The reason I wanted to keep a views folder is exactly to call-out the legacy code so they can be more proactively cleaned up. The views folder can be removed once everything is migrated to client-side rendering.

The one aspect I'm not entirely sure about is whether the existing model definitions are actually DAO-esque in nature and would need to be refactored

Yes, I was thinking of the same thing. models should just be a thin layer of SQLAlchemy's declaritive mapping + some shared classes & mixins. We can have an explicit DAO layer, or, just stay simple and put more things in "commands" instead.

@john-bodley
Copy link
Member Author

@ktmud I've updated the original description which explicitly calls out the models. Regarding views—per here—I alway perceived these as solely defining the legacy API as they merely contain routes hence why they're grouped under the superset/blueprints/api/legacy folder.

@john-bodley john-bodley changed the title [WIP] [SIP-92] Proposal for restructuring the Python code base [SIP-92] Proposal for restructuring the Python code base Nov 1, 2022
@john-bodley
Copy link
Member Author

john-bodley commented Dec 1, 2022

@ktmud I spoke with @hughhhh and @rusackas briefly about this and they were in agreement with you about keeping the concept of views and thus I've updated the directory schematic.

As it currently stands views contain both API endpoints as well as non-API endpoints, i.e., those which render HTML templates, and thus the non-API endpoints would be housed under the superset/blueprints/views subfolder.

@ktmud
Copy link
Member

ktmud commented Dec 1, 2022

Cool. Thanks for the update!

@michael-s-molina
Copy link
Member

Thanks for the SIP @john-bodley. I went through a similar process when writing SIP-61 - Improve the organization of front-end folders. When researching best practices for organizing projects, the concept of a feature-based organization appeared in many articles that described how large codebases were organized. You'll find many of the reasoning behind this model in SIP-61 and its references. The basic idea is that files related to a feature should belong together in a structured way. This allows you to easily switch between feature implementations, facilitates the use of feature flags, and also promotes better-defined dependencies. Maybe we can apply some of the concepts here too 😉

@craig-rueda
Copy link
Member

Thanks for this, @john-bodley! I +1 to @ktmud 's structure. I have noticed recently that models are mixed all around the code and should be considered top-level. Ideally they would be somewhat isolated and should not depend directly on layers above, so this change would be great. I'd be glad to help out with the refactor :)

My view of what each layer "does" (open to discussion):

MVC Views / API
...
...
Commands
...
...
DAO
...
...
Models

MVC Views / API

  • Responsible for handling interactions with external actors (REST, views, etc)
  • Should perform validation from an API point of view
  • Translates errors emitted from mid-tier

Commands

  • Where the business logic lives!
  • Performs its own validations and potentially raises when things don't look right
  • Might chain to other commands
  • Interfaces with DAO's in order to query the DB

DAOs

  • Interfaces with the DB (via SQLA in our case)
  • Exposes low-level methods like get_by_id(), etc.
  • SHOULD NOT CONTAIN BUSINESS LOGIC (keeps things reusable and non-crosscutting)
  • Centralizes query logic, making mid-tier easy to test

Models

@rusackas
Copy link
Member

rusackas commented Oct 3, 2023

Closing the issue since the vote has passed. @john-bodley please move the issue to the "Implemented/Done" column when you think that's appropriate :)

@rusackas rusackas closed this as completed Oct 3, 2023
@rusackas rusackas moved this from [RESULT][VOTE] APPROVED to In Development in SIPs (Superset Improvement Proposals) Oct 4, 2023
@rusackas
Copy link
Member

rusackas commented Feb 7, 2024

@john-bodley is there a way to provide linting rules around this so we can (a) enforce it, and (b) track the tech debt here?

@john-bodley
Copy link
Member Author

@rusackas I'm not sure, though I have started working with custom Pylint checkers. I suspect the "enforcement" likely happens implicitly with developers following the new patterns. That said the state of our current backend code is rather troubling as we have things strewn all over the place.

@rusackas
Copy link
Member

Any updates from anyone following this on implementation status/plans?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sip Superset Improvement Proposal
Projects
Development

No branches or pull requests

5 participants