Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MariaDB to PostgreSQL data variance tracker #81

Open
amyfromandi opened this issue Aug 13, 2024 · 1 comment · Fixed by #61
Open

MariaDB to PostgreSQL data variance tracker #81

amyfromandi opened this issue Aug 13, 2024 · 1 comment · Fixed by #61
Assignees
Labels
documentation Improvements or additions to documentation

Comments

@amyfromandi
Copy link
Collaborator

amyfromandi commented Aug 13, 2024

Following up from Issue #60, we migrated the MariaDB data into a new PostgreSQL macrostrat_temp database. Next, we counted and compared rows between macrostrat_temp and the already existent macrostrat database in PostgreSQL. Below is the summary of variances as well as their solutions. Our goal is to retain and not delete any data across both macrostrat (PostgreSQL) and macrostrat_temp (MariaDB) db's.

1. macrostrat tables not found in macrostrat_temp

['strat_name_footprints', 'grainsize', 'pbdb_collections', 'pbdb_collections_strat_names', 'temp_rocks', 'temp_names', 'unit_lith_atts']

  • We have decided ignore retaining temp_rocks and temp_names since they are irrelevant tables. Additionally, macrostrat.unit_lith_atts is equivalent to the already existent macrostrat_temp.unit_liths_atts so we ignored this table as well.
  • Solution: From macrostrat into macrostrat_temp, we copied over tables: strat_name_footprints, grainsize, pbdb_collections, and pbdb_collections_strat_names within the preserve-macrostrat-data script.

2. macrostrat columns not found in macrostrat_temp

table: {'column_name'}: lookup_unit_intervals: {'best_interval_id'}, strat_tree: {'child', 'parent'}, units: {'notes'}, col_areas: {'wkt'}, cols: {'poly_geom', 'wkt', 'notes'}, intervals: {'rank'}, measuremeta: {'geometry'}

  • lookup_unit_intervals: {'best_interval_id'}

    • This column was created during the initial migration to PostgreSQL and populated with data using the schlep process script.
    • Solution: We added this column into macrostrat_temp and seeded it with data here.
  • strat_tree: {'child', 'parent'}

    • These columns were renamed during the initial migration to PostgreSQL using the schlep dump script
    • Solution: We renamed the columns in macrostrat_temp within the pre-script code that is executed before the migration into PostgreSQL. this_name column is renamed to parent and that_name column to child respectively.
    • We repointed the v2/v3 API to use these new column names.
  • units: {'notes'}

    • This column is empty and macrostrat_temp already has a unit_notes table.
    • Solution: We're ignoring this column.
  • col_areas: {'wkt'}

    • Solution: added a wkt text column within macrostrat_temp and seeded it with data.
  • cols: {'poly_geom', 'wkt', 'notes'}

    • This column was created during the initial migration to PostgreSQL and populated with data using the schlep scripts.
    • Solution: We added a poly_geom and wkt column into macrostrat_temp and seeded it with data using the schlep process script. Additionally, the macrostrat_temp db already has a cols_notes table and the notes column is empty, so we are ignoring this column.
  • intervals: {'rank'}

    • This column was created during the initial migration to PostgreSQL and populated with data using the schlep scripts.
    • Solution: We added a rank column into macrostrat_temp and seeded it with data using the schlep process script into our preserve-macrostrat-data script.
  • measuremeta: {'geometry'}

    • Solution: We added a geometry column into macrostrat_temp and seeded it with data using the table's lat/lng columns.
@amyfromandi amyfromandi added the documentation Improvements or additions to documentation label Aug 13, 2024
@amyfromandi amyfromandi linked a pull request Aug 13, 2024 that will close this issue
10 tasks
@amyfromandi amyfromandi reopened this Sep 4, 2024
@amyfromandi
Copy link
Collaborator Author

These are the final results of the MariaDB migration within development!

Comparing macrostrat_temp (MariaDB) to macrostrat_temp.


Checking table counts...

SUCCESS: 0 All tables in macrostrat_temp (MariaDB) exist in macrostrat_temp.

SUCCESS: 4 macrostrat (PostgreSQL) tables succesfully copied into macrostrat_temp to retain data!
['strat_name_footprints', 'grainsize', 'pbdb_collections', 'pbdb_collections_strat_names']

Checking row counts...

SUCCESS: All row counts in all tables are the same in macrostrat_temp (MariaDB) and macrostrat_temp!

Checking column counts...

SUCCESS: Columns for 5 tables successfully copied over from macrostrat (PostgreSQL) into macrostrat_temp, to retain data!
lookup_unit_intervals                 19        20       +1
col_areas                              4         5       +1
cols                                  13        15       +2
intervals                              8         9       +1
measuremeta                           14        15       +1



Comparing macrostrat_temp to macrostrat (PostgreSQL).


Checking table counts...

SUCCESS: 40 macrostrat_temp tables copied over from MariaDB that do not exist in macrostrat (PostgreSQL). This confirms data retention!
[
    'canada_lexicon_dump',
    'interval_boundaries_scratch',
    'lookup_measurements',
    'minerals',
    'offshore_fossils',
    'pbdb_liths',
    'uniquedatafiles2',
    'unit_equiv',
    'unit_dates',
    'unit_liths_atts',
    'colors',
    'col_areas_6april2016',
    'col_equiv',
    'col_notes',
    'interval_boundaries',
    'measuremeta_cols',
    'offshore_baggage',
    'offshore_baggage_units',
    'offshore_hole_ages',
    'offshore_sections',
    'offshore_sites',
    'pbdb_intervals',
    'pbdb_matches',
    'rockd_features',
    'ronov_sediment',
    'stats',
    'strat_names_lookup',
    'structures',
    'structure_atts',
    'tectonics',
    'temp_areas',
    'units_datafiles',
    'unit_boundaries_backup',
    'unit_boundaries_scratch',
    'unit_boundaries_scratch_old',
    'unit_contacts',
    'unit_measures_pbdb',
    'unit_notes',
    'unit_seq_strat',
    'unit_tectonics'
]

SUCCESS: 3 macrostrat (PostgreSQL) tables did not copy into macrostrat_temp. These tables are irrelevant and do not need to be retained.
['temp_rocks', 'temp_names', 'unit_lith_atts']

Checking row counts...

SUCCESS: Row counts are greater in macrostrat_temp rather than macrostrat (PostgreSQL) for 26 tables, indicating data retention from Mariadb!
autocomplete                       58599     58488     +111
col_areas                           5357      5351       +6
lookup_strat_names                 51229     51214      +15
lookup_units                      127229    119462    +7767
lookup_unit_attrs_api             127228    119461    +7767
strat_names                        51229     51210      +19
unit_strat_names                   28322     28316       +6
units                             127229    119508    +7721
unit_econs                          3157      3148       +9
unit_environs                     134617    125138    +9479
unit_liths                        189907    173513   +16394
intervals                           1715      1626      +89
refs                                 213       210       +3
cols                                5654      5651       +3
col_groups                           352       350       +2
col_refs                            5716      5645      +71
liths                                212       207       +5
lookup_unit_intervals             127229    119462    +7767
strat_names_places                 50321     50288      +33
sections                           12813     12713     +100
strat_tree                         29467     17519   +11948
timescales                            37        31       +6
timescales_intervals                2194      1970     +224
units_sections                    127481    119762    +7719
unit_boundaries                   135708     53799   +81909
unit_measures                     105307    105048     +259

Checking column counts...

SUCCESS: Columns for 12 are greater in macrostrat_temp rather than macrostrat (PostgreSQL). This indicates data retention!
col_areas                              5         4       +1
lookup_strat_names                    28        22       +6
strat_names                            9         5       +4
unit_strat_names                       4         3       +1
units                                 15        13       +2
unit_environs                          7         5       +2
intervals                              9         8       +1
cols                                  15        16       -1
environs                               6         5       +1
lith_atts                              5         4       +1
measures                              13        12       +1
sections                               6         2       +4
strat_tree                             6         4       +2
{'places': (0, 0)}
places                                 0         0       +0

SUCCESS: Notes column exists macrostrat but NOT in macrostrat_temp for units. macrostrat_temp.units.notes is its own table from Mariadb.

SUCCESS: Notes column exists macrostrat but NOT in macrostrat_temp for cols. macrostrat_temp.cols.notes is its own table from Mariadb.

SUCCESS: All columns in macrostrat exist in macrostrat_temp for these tables:
['col_refs', 'lookup_unit_attrs_api', 'lookup_unit_intervals', 'strat_names_meta', 'sections', 'unit_econs', 'lookup_strat_names', 'measures', 'projects', 'timescales', 'strat_tree', 'refs', 'unit_liths', 'lookup_units', 'measurements', 'units', 'autocomplete', 'col_areas', 'unit_strat_names', 'unit_environs', 'cols', 'intervals', 'lith_atts', 'timescales_intervals', 'unit_boundaries', 'econs', 'environs', 'units_sections', 'unit_measures', 'strat_names', 'lookup_unit_liths', 'liths', 'concepts_places', 'strat_names_places', 'col_groups', 'measuremeta', 'places']

check-data completed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants