
epic: Cortex Updater can migrate data structure changes #1184


Closed · 1 task
dan-menlo opened this issue Sep 11, 2024 · 12 comments
@dan-menlo
Contributor

dan-menlo commented Sep 11, 2024

Goal

Scope

Discussion

  • Adapter for DB layer
  • Down migration - additional considerations?
    • Recovery (restore column vs restore data)

Success Criteria:

  • Successful upgrade and db migration from 1.0.2 to 1.0.X (when /model/sources is merged)

Tasklist

  • ?
@dan-menlo dan-menlo added this to Menlo Sep 11, 2024
@dan-menlo dan-menlo converted this from a draft issue Sep 11, 2024
@dan-menlo dan-menlo changed the title epic: Cortex Migrations epic: Cortex Migrations for Data Structures Sep 11, 2024
@dan-menlo dan-menlo changed the title epic: Cortex Migrations for Data Structures epic: Cortex Updater can migrate data structure changes Sep 29, 2024
@dan-menlo dan-menlo moved this to Investigating in Menlo Oct 13, 2024
@vansangpfiev
Contributor

vansangpfiev commented Nov 12, 2024

Some thoughts on data migration:

  1. Database Migration Strategy
  • Versioning: Create a table within the SQLite database to track the current schema version. This table will have a single row containing the version number:
CREATE TABLE IF NOT EXISTS schema_version (
    version INTEGER NOT NULL
);
  • Migration Scripts:
    • We can create SQL scripts for each migration version. These scripts can be executed after running cortex update. Not sure if we can run SQL scripts from a post-install script cc: @hiento09
    • Pros: directly writing SQL scripts for migrations is straightforward and gives us full control over the SQL commands executed during migrations.
    • Cons: we must manually organize, track, and apply scripts, which can become cumbersome as the number of migrations grows.
  • Migration Framework:
    • Develop a C++ component that reads the current schema version from the database and applies the necessary migration scripts to reach the target version (see the sketch after this list).
    • Pros: provides a consistent approach to managing migrations across different environments.
    • Cons: not easy to implement.
  • Data Preservation Techniques
    • Data Migration: Before dropping any columns or tables, migrate the existing data to a safe location (e.g., a backup table or external storage) so it can be restored if needed.
    • Column Preservation: Instead of dropping columns, consider marking them as deprecated or renaming them.
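
A minimal sketch of what such a C++ migration component could look like, using the schema_version table above and the SQLite C API. The per-version SQL statements and table/column names in kMigrations are placeholders, not the actual Cortex schema:

#include <sqlite3.h>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical per-version migration SQL; entry v upgrades the schema from version v to v + 1.
static const std::vector<std::string> kMigrations = {
    /* 0 -> 1 */ "ALTER TABLE models ADD COLUMN source TEXT;",
    /* 1 -> 2 */ "CREATE TABLE IF NOT EXISTS engines (name TEXT PRIMARY KEY);",
};

// Read the single row of schema_version; treat a missing row as version 0.
int GetSchemaVersion(sqlite3* db) {
    int version = 0;
    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, "SELECT version FROM schema_version", -1, &stmt, nullptr) == SQLITE_OK &&
        sqlite3_step(stmt) == SQLITE_ROW) {
        version = sqlite3_column_int(stmt, 0);
    }
    sqlite3_finalize(stmt);
    return version;
}

// Apply every migration between the stored version and the target version.
bool MigrateTo(sqlite3* db, int target) {
    for (int v = GetSchemaVersion(db); v < target; ++v) {
        if (sqlite3_exec(db, kMigrations[v].c_str(), nullptr, nullptr, nullptr) != SQLITE_OK) {
            std::cerr << "Migration to schema v" << v + 1 << " failed" << std::endl;
            return false;
        }
        // Assumes the schema_version row was inserted when the table was created.
        std::string bump = "UPDATE schema_version SET version = " + std::to_string(v + 1);
        sqlite3_exec(db, bump.c_str(), nullptr, nullptr, nullptr);
    }
    return true;
}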

Question: should we use migration scripts or implement a C++ database migration component?
  2. Cortex Data Structure
Similar to the database migration, we can use a script to rearrange data into the new data structure.
Note: We store the YAML file paths in the database, so if the model structure changes, we also need to update the models database.

What do you think? @dan-homebrew @louis-jan @janhq/cortex

@dan-menlo
Contributor Author

dan-menlo commented Nov 13, 2024

@vansangpfiev Overall, I think we should do what allows us to ship something quickly this Friday.

Up/Down Migrations

/migrations
    1.0.2.cpp
    1.0.3.cpp
// 1.0.2.cpp
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

void up() {
    // SQL for this version (placeholder; the real statements would live here or in a
    // bundled script). Cortex's database is SQLite, so in practice this would be run
    // via the sqlite3 CLI rather than psql.
    std::string sqlScript = "...";

    // Run the SQL script
    std::string runSqlCmd = "psql -c \"" + sqlScript + "\"";
    int runSqlResult = system(runSqlCmd.c_str());

    if (runSqlResult == 0) {
        std::cout << "SQL script executed successfully." << std::endl;
    } else {
        std::cerr << "Error: Failed to execute the SQL script." << std::endl;
    }

    // Define the directories
    std::string rootDir = "/";
    std::string targetDir = "/path/to/target/directory";

    // Define the file to be moved
    std::string fileToMove = rootDir + ".cortexrc";

    // Check if the file exists
    std::ifstream file(fileToMove);
    if (!file) {
        std::cerr << "Error: The file .cortexrc does not exist in the root directory." << std::endl;
        return;
    }

    // Move the file to the target directory
    std::string moveCmd = "mv " + fileToMove + " " + targetDir;
    int moveResult = system(moveCmd.c_str());

    if (moveResult == 0) {
        std::cout << "The .cortexrc file has been moved to the target directory." << std::endl;
    } else {
        std::cerr << "Error: Failed to move the .cortexrc file." << std::endl;
    }
}

void down() {
    ...
}

Please push back on this @janhq/cortex - as you can tell from the RoR reference, my approach may be very outdated.

Database migration strategy

  1. I think SQL migration scripts make the most sense
  2. Data Preservation: I think marking columns as deprecated is ok (for now)

Cortex Data Structure

  1. See up/down methods above

@vansangpfiev vansangpfiev moved this from Investigating to In Progress in Menlo Nov 13, 2024
@louis-menlo
Contributor

We likely need to design for both local execution and cloud deployment. Constraining migrations to code would block running them from the deployment layer, which would likely have to wait for an instance to spin up and pass a health check, which is inefficient. There could also be a scenario where we want to share the database instance across services like cortex.py and cortex.cpp.

However, one advantage is that they can embed cortex.cpp as a standalone binary without having to worry about other services, which is particularly useful when embedding into Jan.

I’d prefer built-in migrations as suggested above for cortex.cpp to ensure embedding in Jan won’t cause issues.

@vansangpfiev
Contributor

vansangpfiev commented Nov 13, 2024

We will use SQL scripts for up/down migrations and try to minimize changes between versions to ensure smooth transitions. For complex down migrations, we will create a PR to address any issues.

  • Script Execution: Migrations will run when the server starts for the first time, to support the Jan use case (see the sketch after the folder structure below)
  • Folder Structure Changes: Ideally managed via scripts, but may require code adjustments.
  • Version Management: We will only guarantee up/down migrations between versions.

Structure

cortexcpp
    |__ models/
    |__ engines/
    |__ migrations/
           |__ v1/
           |    |__ v1.sql
           |    |__ data_structure_v1.sh
           |__ v2/
                |__ v2.sql
                |__ data_structure_v2.sh
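
As a rough sketch of the script-execution step at server start, assuming the folder layout above, that scripts are applied with the sqlite3 CLI and a shell, and hypothetical paths and helper names:

#include <algorithm>
#include <cstdlib>
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Run every migration folder newer than current_version, in ascending order.
// Assumes directories named v1, v2, ... each containing vN.sql and data_structure_vN.sh.
void RunMigrations(const fs::path& migrations_dir, const fs::path& db_path, int current_version) {
    std::vector<int> versions;
    for (const auto& entry : fs::directory_iterator(migrations_dir)) {
        if (entry.is_directory())
            versions.push_back(std::stoi(entry.path().filename().string().substr(1)));  // strip 'v'
    }
    std::sort(versions.begin(), versions.end());

    for (int v : versions) {
        if (v <= current_version) continue;
        fs::path dir = migrations_dir / ("v" + std::to_string(v));
        // Schema changes first, then data/folder-structure rearrangement.
        std::string sql_cmd = "sqlite3 " + db_path.string() + " < " +
                              (dir / ("v" + std::to_string(v) + ".sql")).string();
        std::string sh_cmd = "sh " + (dir / ("data_structure_v" + std::to_string(v) + ".sh")).string();
        if (std::system(sql_cmd.c_str()) != 0 || std::system(sh_cmd.c_str()) != 0) {
            std::cerr << "Migration v" << v << " failed, stopping." << std::endl;
            return;
        }
    }
}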
          

Question: How do we ship the migration scripts?

  • Option 1: host them on a static server and have cortex pull the scripts
  • Option 2: pack the migration scripts into the installer
  • Option 3: option 1 + option 2; if users only download the cortex-server binary, they can fall back to option 1

cc: @dan-homebrew @hiento09 @janhq/cortex

@dan-menlo
Contributor Author

@vansangpfiev Our discussion:

  • We will have a pure C++ updater
    • Remote .sql and .sh scripts may represent a security/backdoor risk in the future
    • We will trade off flexibility (e.g. hotfixing) for verifiability
  • We will align the data structure version with the Cortex version
    • Simplifies things
    • Most migrations will be empty (i.e. empty up/down); a sketch follows the tree below
cortexcpp
    |__ models/
    |__ engines/
    |__ migrations/
           |__ v1.0.2.cpp
           |__ v1.0.3.cpp
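
A rough sketch of how such compiled-in migrations could be organized, assuming hypothetical up/down function pairs registered per version (v1.0.2.cpp and v1.0.3.cpp would each contribute one entry); names and bodies are placeholders, not the agreed implementation:

#include <functional>
#include <iostream>
#include <map>
#include <string>

struct Migration {
    std::function<void()> up;
    std::function<void()> down;
};

// One entry per Cortex version; empty lambdas mean "no data change in this release".
static const std::map<std::string, Migration> kMigrations = {
    {"1.0.2", {[] { std::cout << "1.0.2 up\n"; }, [] { std::cout << "1.0.2 down\n"; }}},
    {"1.0.3", {[] { /* empty up */ }, [] { /* empty down */ }}},
};

// Apply every up() between the installed data version and the target Cortex version.
// (Comparison relies on lexicographic ordering, which holds for these version strings.)
void Upgrade(const std::string& from, const std::string& to) {
    for (const auto& [version, m] : kMigrations) {
        if (version > from && version <= to)
            m.up();
    }
}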

@louis-menlo
Contributor

Oops. This wouldn’t work for nightly migration support or switching between commits.

@dan-menlo
Contributor Author

dan-menlo commented Nov 14, 2024

Hmm, can we solve this via running migrations manually for Nightly?

Nightly

  • Otherwise, I think this is an edge case and we should trigger migrations manually for nightly
    • For nightly, we trigger migrations manually
    • e.g. cortex migrations up 1.0.3, cortex migrations down 1.0.3
  • A more complicated approach is to run down and up in migrator between nightly versions
    • e.g. Updater for nightly runs down and then up
    • I don't like this, too complicated

Switching between commits

  • Same as Nightly

@vansangpfiev
Contributor

vansangpfiev commented Nov 18, 2024

@dan-homebrew IMO, it is necessary to distinguish between the data version and the software version. Since all migration actions occur within the software, the application may not know its current version during modifications (it can be stable, beta, nightly, etc.).

To ensure clarity and consistency, we can increment the data version each time a change is made. In the event of a conflict, we will revert to the previous version of the data. This separation allows for more effective management of both software and data changes, enhancing stability and reliability throughout the migration process.
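
To make that separation concrete, a small sketch of what it could look like, with a data (schema) version embedded in the source and stored in the database, bumped independently of the app version string; the constant and function names are hypothetical:

#include <iostream>

// Bumped only when the data layout changes, independent of the Cortex release
// version (which may be a stable, beta, or nightly build string).
constexpr int kSchemaVersion = 3;  // placeholder value

// stored_version would be read from the schema_version table at startup.
void CheckDataVersion(int stored_version) {
    if (stored_version < kSchemaVersion) {
        std::cout << "Upgrading data from v" << stored_version
                  << " to v" << kSchemaVersion << std::endl;
        // run up migrations here
    } else if (stored_version > kSchemaVersion) {
        // Data was written by a newer build; revert the data to the previous
        // version (or refuse to start) rather than guessing.
        std::cout << "Data version is newer than this binary supports" << std::endl;
    }
}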

@vansangpfiev vansangpfiev mentioned this issue Nov 18, 2024
3 tasks
@luke-nguyen990
Contributor

luke-nguyen990 commented Nov 18, 2024

@dan-homebrew IMO, It is necessary to distinguish between the data version and the software version. Since all migration actions occur within the software, the application may not know its current version during modifications (can be stable, beta, nightly, etc).

To ensure clarity and consistency, we can increment the data version each time a change is made. In the event of a conflict, we will revert to the previous version of the data. This separation allows for more effective management of both software and data changes, enhancing stability and reliability throughout the migration process.

  • I agree with you @vansangpfiev; software/release/app versioning is a separate aspect, more related to business/product management than engineering
  • Then, we have source code versioning, a.k.a. Git commits. In today's practices, the relationship between source code versioning and data/migration versioning is one-to-one, as migration files or entity/model files are part of the code.
  • If we separate migration/data versioning from source code versioning (Git commits), we still need a mechanism to map them back. The migration/data for a specific Git commit should be deterministic—meaning there should be one and only one possibility. While it's possible for multiple commits to share the same migration/data version, the reverse should not be true.

TL;DR: If we roll back to a specific commit abc123, we must know exactly how the data schema looks in a deterministic way.

@vansangpfiev vansangpfiev moved this from In Review to Review + QA in Menlo Nov 27, 2024
@gabrielle-ong gabrielle-ong added this to the v1.0.4 milestone Nov 27, 2024
@gabrielle-ong gabrielle-ong modified the milestones: v1.0.4, v1.0.5 Dec 6, 2024
@vansangpfiev
Contributor

Expected result:

  • Migration happens at server start
  • No change to the model structure for v1.0.5
  • Cortex creates new tables if they do not exist: models, engines, hardware (sketched below)
  • All data is migrated to the new database schema version.
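
For reference, the idempotent table creation at server start could look roughly like this; the column lists are placeholders for illustration, not the actual Cortex schema:

#include <sqlite3.h>
#include <iostream>

// Create the tables if they do not exist yet (column definitions are hypothetical).
void EnsureTables(sqlite3* db) {
    const char* ddl =
        "CREATE TABLE IF NOT EXISTS models   (model_id TEXT PRIMARY KEY, path_to_model_yaml TEXT);"
        "CREATE TABLE IF NOT EXISTS engines  (engine_name TEXT PRIMARY KEY, version TEXT);"
        "CREATE TABLE IF NOT EXISTS hardware (uuid TEXT PRIMARY KEY, type TEXT);";
    char* err = nullptr;
    if (sqlite3_exec(db, ddl, nullptr, nullptr, &err) != SQLITE_OK) {
        std::cerr << "Table creation failed: " << (err ? err : "") << std::endl;
        sqlite3_free(err);
    }
}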

@TC117

TC117 commented Dec 17, 2024

  • Migration runs on server start (screenshots attached)
  • The hardwares table is removed on update

@TC117 TC117 moved this from QA to Completed in Menlo Dec 26, 2024
@dan-menlo
Contributor Author

dan-menlo commented Dec 29, 2024

@TC117 Please close this ticket once QA has signed off and it is completed

@github-project-automation github-project-automation bot moved this from Completed to QA in Menlo Dec 29, 2024
@TC117 TC117 moved this from QA to Completed in Menlo Dec 30, 2024