Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not truncate aggregate tables on each ingest #841

Merged
merged 8 commits into from
Mar 18, 2019

Conversation

chakrabortyr
Copy link
Contributor

@chakrabortyr chakrabortyr commented Mar 6, 2019

Description

Motivation and Context

We don't want to be constantly truncating + reingesting potentially millions of rows as cloud usage scales.

Tests performed

Made sure existing tests passed + added new set of anonymised data. This anonymised data is then validated in a post ingest script. I manually did some additional validation with the following queries I couldn't think of a holistic way to operationalise:

This is an old instance with new events in the post ingest. Ran this before and after post_ingest_test.sh to ensure that it got updated with a new end_event/last_modified time:
SELECT * FROM modw_cloud.cloud_events_transient where instance = 'DESIRED_ID';

Then ran this to ensure that we only had two rows where the last_modified timestamp had changed:
SELECT * FROM modw_cloud.cloud_events_transient order by end_time DESC, last_modified DESC;

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project as found in the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@@ -0,0 +1,8 @@
-- Since we are adding a new 'last_modified' column to event tables, we need to set the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do not need to do this. You can re-aggregate by setting the last-modifed time when you run the etl_overseer.

shippable.yml Outdated Show resolved Hide resolved
REF_SOURCE=`realpath $BASEDIR/../../tests/artifacts/xdmod-test-artifacts/xdmod/referencedata`
REF_DIR=/var/tmp/referencedata

if [ "$XDMOD_TEST_MODE" = "upgrade" ];
Copy link
Member

@jpwhite4 jpwhite4 Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should run the test that adds new data to the system on both the upgrade and fresh install builds.

then
yum -y install ~/rpmbuild/RPMS/*/*.rpm
~/bin/services start
expect $BASEDIR/xdmod-upgrade.tcl | col -b
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should not reinstall the RPM or start the servers again or run the upgrade again

@jpwhite4
Copy link
Member

jpwhite4 commented Mar 7, 2019

Don't forget to update xdmod-ingestor so that it actually sets the last-modified time.

PHP Notice:  Undefined property: ETL\Ingestor\CloudStateReconstructorTransformIngestor::$_last_modified in /usr/share/xdmod/classes/ETL/Ingestor/CloudStateReconstructorTransformIngestor.php on line 135

@jpwhite4
Copy link
Member

jpwhite4 commented Mar 7, 2019

There is something amiss with the openstack_staging_event table definition. The overseer is Altering it every time it runs:

2019-03-07 19:38:30 [notice] Altering table `modw_cloud`.`openstack_staging_event`

@jpwhite4
Copy link
Member

jpwhite4 commented Mar 7, 2019

There is something wrong with the jobs-cloud-impor-users-generic pipeline:

2019-03-07 19:29:33 [notice] Start processing section 'xdmod.jobs-cloud-import-users-generic'
2019-03-07 19:29:33 [error] xdmod.jobs-cloud-import-users-generic.GenericCloudStagingUserIngestor (ETL\Ingestor\DatabaseIngestor): SQLSTATE[42S02]: Base table or view not found: 1146 Table 'modw_cloud.generic_cloud_raw_volume' doesn't exist Exception: 'SQLSTATE[42S02]: Base table or view not found: 1146 Table 'modw_cloud.generic_cloud_raw_volume' doesn't exist'
2019-03-07 19:29:33 [warning] Stopping ETL due to exception in xdmod.jobs-cloud-import-users-generic.GenericCloudStagingUserIngestor (ETL\Ingestor\DatabaseIngestor)

@eiffel777 may want to look into this.

@chakrabortyr chakrabortyr force-pushed the RC_DontTruncateAggregates branch 2 times, most recently from af60ab9 to ef7709e Compare March 7, 2019 21:05
@@ -93,7 +99,8 @@
"name": "increment_key",
"columns": [
"resource_id",
"event_id"
"event_id",
"last_modified"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No.

@@ -58,6 +59,7 @@ public function __construct(aOptions $options, EtlConfiguration $etlConfig, Log
$this->_start_event_ids = array(self::START, self::RESUME, self::STATE_REPORT, self::UNSHELVE, self::UNPAUSE, self::UNSUSPEND, self::POWER_ON);
$this->_all_event_ids = array_merge($this->_start_event_ids, $this->_stop_event_ids);
$this->_end_time = $etlConfig->getVariableStore()->endDate ? date('Y-m-d H:i:s', strtotime($etlConfig->getVariableStore()->endDate)) : null;
$this->_last_modified = date('Y-m-d H:i:s', strtotime('now - 3 minutes'));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope.

$this->_last_modified = date('Y-m-d H:i:s',
strtotime($this->getEtlOverseerOptions()->getLastModifiedStartDate()));
$this->_last_modified = date(
'Y-m-d H:i:s',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nay.

@jpwhite4
Copy link
Member

jpwhite4 commented Mar 9, 2019

diff --git a/classes/ETL/Ingestor/CloudStateReconstructorTransformIngestor.php b/classes/ETL/Ingestor/CloudStateReconstructorTransformIngestor.php
index 761dccb2..c3a3afde 100644
--- a/classes/ETL/Ingestor/CloudStateReconstructorTransformIngestor.php
+++ b/classes/ETL/Ingestor/CloudStateReconstructorTransformIngestor.php
@@ -46,7 +46,6 @@ class CloudStateReconstructorTransformIngestor extends pdoIngestor implements iA
     private $_start_event_ids;
     private $_instance_state;
     private $_end_time;
-    private $_last_modified;

     /**
      * @see ETL\Ingestor\pdoIngestor::__construct()
@@ -59,7 +58,6 @@ class CloudStateReconstructorTransformIngestor extends pdoIngestor implements iA
         $this->_start_event_ids = array(self::START, self::RESUME, self::STATE_REPORT, self::UNSHELVE, self::UNPAUSE, self::UNSUSPEND, self::POWER_ON);
         $this->_all_event_ids = array_merge($this->_start_event_ids, $this->_stop_event_ids);
         $this->_end_time = $etlConfig->getVariableStore()->endDate ? date('Y-m-d H:i:s', strtotime($etlConfig->getVariableStore()->endDate)) : null;
-        $this->_last_modified = null;

         $this->resetInstance();
     }
@@ -129,18 +127,12 @@ class CloudStateReconstructorTransformIngestor extends pdoIngestor implements iA
     protected function getSourceQueryString()
     {
         $sql = parent::getSourceQueryString();
-        if($this->_last_modified === null) {
-            $this->_last_modified = date(
-                'Y-m-d H:i:s',
-                $this->getEtlOverseerOptions()->getLastModifiedStartDate()
-            );
-        }

         // Due to the way the Finite State Machine handles the rows in event reconstruction, the last row
         // is lost. To work around this we add a dummy row filled with zeroes.
         $colCount = count($this->etlSourceQuery->records);
         $unionValues = array_fill(0, $colCount, 0);
-        $subSelect = "(SELECT DISTINCT instance_id from modw_cloud.event WHERE last_modified > \"" . $this->_last_modified . "\")";
+        $subSelect = "(SELECT DISTINCT instance_id from modw_cloud.event WHERE last_modified > \"" . $this->getEtlOverseerOptions()->getLastModifiedStartDate() . "\")";

         $sql = "$sql WHERE instance_id IN " . $subSelect . " AND event_type_id IN (" . implode(',', $this->_all_event_ids) . ")\nUNION ALL\nSELECT " . implode(',', $unionValues) . "\nORDER BY 1 DESC, 2 DESC, 3 ASC, 4 DESC";

@@ -11,7 +11,7 @@ sudo -u xdmod xdmod-shredder -r openstack -d $REF_DIR/openstack_upgrade -f opens
sudo -u xdmod xdmod-ingestor --datatype=openstack --last-modified-start-date "$last_modified_start_date"
sudo -u xdmod xdmod-ingestor --aggregate=cloud --last-modified-start-date "$last_modified_start_date"

phpunit="$(readlink -f ../../../../../vendor/bin/phpunit)"
phpunit="$(readlink -f ../../../../vendor/bin/phpunit)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please test these things locally before committing if at all possible. Otherwise, you tie up limited shippable resources.

@@ -58,6 +58,7 @@ public function __construct(aOptions $options, EtlConfiguration $etlConfig, Log
$this->_start_event_ids = array(self::START, self::RESUME, self::STATE_REPORT, self::UNSHELVE, self::UNPAUSE, self::UNSUSPEND, self::POWER_ON);
$this->_all_event_ids = array_merge($this->_start_event_ids, $this->_stop_event_ids);
$this->_end_time = $etlConfig->getVariableStore()->endDate ? date('Y-m-d H:i:s', strtotime($etlConfig->getVariableStore()->endDate)) : null;
$this->_last_modified = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not add an unused private member variable to the class.

@@ -35,7 +35,7 @@
"name": "cloud-transient",
"class": "DatabaseIngestor",
"definition_file": "cloud_common/cloud_transient.json",
"truncate_destination": true,
"truncate_destination": false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

truncate_destination is set to false by default in lint 4. You do not need to has a tunrcate_destination: false here or on line 49.


last_modified_start_date=$(date +'%F %T')

sudo -u xdmod xdmod-shredder -r openstack -d $REF_DIR/openstack_upgrade -f openstack --last-modified-start-date "$last_modified_start_date"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The xdmod-shredder command does not accept a --last-modified-start-date commandline parameter. Please explain why you added the parameter.

@jpwhite4
Copy link
Member

The new test data that you added has overlapping instance_ids but the users and projects for those instances are different in the new data vs the old data. As far as I am aware, the current code is not able to handle this scenario to correctly track user / project changes for an instance. Please can you confirm whether this is/isn't the case.

@@ -23,7 +23,8 @@
"wallduration": "FLOOR(e.end_time_ts) - FLOOR(e.start_time_ts)",
"person_id": "ev.person_id",
"systemaccount_id": "ev.systemaccount_id",
"submission_venue_id": "ev.submission_venue_id"
"submission_venue_id": "ev.submission_venue_id",
"last_modified": "ev.last_modified"
Copy link
Member

@jpwhite4 jpwhite4 Mar 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should not set the last_modified time from ev.last_modified since this will correspond to the modification time of the start event for the session. This start event will not have been modified. Instead you should not be setting last_modified explicitly at all and should rely on the database setting it.

@jpwhite4
Copy link
Member

Don't forget that you should add tests to cover two different scenarios:

  1. an existing session that was active in the original data and has new events in the new data.
  2. a brand new session (that was not active in the original data).

You then need to confirm that

  1. the existing row in the cloud_events_transient table corresponding to the existing session (1) has been updated correctly with the new end_time_ts value
  2. a new row was added to the cloud_events_transient table for session scenario (2)
  3. only the days/months/quarters/years that have new data are getting re-aggregated in the aggregation step.

I suggest that it will be much more manageable to use a small set of test data (you only need one event for scenario (1) and two events for scenario (2)).

@chakrabortyr chakrabortyr force-pushed the RC_DontTruncateAggregates branch 2 times, most recently from e78ac76 to 808e01b Compare March 14, 2019 22:06
@chakrabortyr chakrabortyr force-pushed the RC_DontTruncateAggregates branch from cf15f76 to fc9e8f0 Compare March 15, 2019 18:40
@chakrabortyr chakrabortyr force-pushed the RC_DontTruncateAggregates branch from bea9d11 to 55aed11 Compare March 15, 2019 18:43
@jpwhite4
Copy link
Member

The tests you have added do not actually confirm that only the rows that need to change are being updated - i.e. these tests would pass if the whole database were reingested. Please can you explain what testing you did to confirm that only the subset of rows are being changed.

@jpwhite4
Copy link
Member

There are errors like this in the logs:

PHP Warning:  Missing argument 1 for OpenXdmod\DataWarehouseInitializer::ingestCloudDataOpenStack(), called in /usr/share/xdmod/classes/OpenXdmod/DataWarehouseInitializer.php on line 130 and defined in /usr/share/xdmod/classes/OpenXdmod/DataWarehouseInitializer.php on line 205
2019-03-15 19:03:50 [notice] Ingesting OpenStack event log data
PHP Notice:  Undefined variable: lastModifiedStartDate in /usr/share/xdmod/classes/OpenXdmod/DataWarehouseInitializer.php on line 213

@jpwhite4
Copy link
Member

There is something wrong with the cloud_events_transient table definition:

2019-03-15 19:06:00 [notice] Altering table `modw_cloud`.`cloud_events_transient`

the ETL is trying to alter the table even though it does not need altering.

@chakrabortyr
Copy link
Contributor Author

chakrabortyr commented Mar 15, 2019

Github is being weird and not letting me edit my PR description. To ensure it was doing what I expected I ran the following in my Docker:

This is an old instance with new events in the post ingest. Ran this before and after post_ingest_test.sh to ensure that it got updated with a new end_event/last_modified time:
SELECT * FROM modw_cloud.cloud_events_transient where instance = 'DESIRED_ID';

Then ran this to ensure that we only had two rows where the last_modified timestamp had changed:
SELECT * FROM modw_cloud.cloud_events_transient order by end_time DESC, last_modified DESC;

@smgallo
Copy link
Contributor

smgallo commented Mar 15, 2019

Turn on debug and it should tell you why it thinks the table should be altered as per #807

@chakrabortyr
Copy link
Contributor Author

The Alter table issue looks like a preexisting condition:

2019-03-15 20:17:48 [notice] Altering table `modw_cloud`.`cloud_events_transient`
2019-03-15 20:17:48 [debug] Alter table SQL ('Cloud DB', class=ETL\DataEndpoint\Mysql, config=datawarehouse, schema=modw_cloud, host=localhost:3306, user=xdmod):
ALTER TABLE `modw_cloud`.`cloud_events_transient`
CHANGE COLUMN `start_time_ts` `start_time_ts` bigint(18) NULL ;

Copy link
Member

@jpwhite4 jpwhite4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@smgallo
Copy link
Contributor

smgallo commented Mar 15, 2019

What is the reason provided by the ETL as to why it's altering the transient table? You should see something like this in the logs:

2019-02-19 20:38:03 [debug] Discover table 'modw.job_tasks'
2019-02-19 20:38:03 [debug] Column local_job_id_raw: values for "type" differ ("int(11)" != "int(11) unsigned")

@chakrabortyr
Copy link
Contributor Author

I'd have to go check the logs to find the specifics; but start_time_ts is already defined as a nullable bigint of size 18. It's odd, but fairly certain it's not related to the PR so I was hoping to take care of it in its own request.

@smgallo
Copy link
Contributor

smgallo commented Mar 16, 2019

@chakrabortyr Check the logs and provide the requested information. We can address it later but I want to know what the issue is.

@chakrabortyr
Copy link
Contributor Author

2019-03-18 13:31:33 [info] start (action_name: xdmod.cloud-state-pipeline.cloud-transient, action: xdmod.cloud-state-pipeline.cloud-transient (ETL\Ingestor\DatabaseIngestor), start_date: , end_date: )
2019-03-18 13:31:33 [debug] Discover table 'modw_cloud.cloud_events_transient'
2019-03-18 13:31:33 [debug] Column start_time_ts: values for "default" differ ("null" != "0")
2019-03-18 13:31:33 [notice] Altering table `modw_cloud`.`cloud_events_transient`

@chakrabortyr chakrabortyr merged commit eed464d into ubccr:xdmod8.1 Mar 18, 2019
@chakrabortyr chakrabortyr deleted the RC_DontTruncateAggregates branch March 18, 2019 13:35
jtpalmer pushed a commit to jtpalmer/xdmod that referenced this pull request Mar 19, 2019
* Do not truncate aggregate table
@plessbd plessbd added this to the 8.1.0 milestone Mar 23, 2019
@chakrabortyr chakrabortyr added Category:Cloud Cloud Realm enhancement Enhancement of the functionality of an existing feature labels Mar 26, 2019
@jtpalmer jtpalmer mentioned this pull request Apr 2, 2019
6 tasks
@smgallo smgallo added the Category:ETL Extract Transform Load label Apr 10, 2019
@smgallo smgallo changed the title No longer truncate aggregate tables Do not truncate aggregate tables on each ingest Apr 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Category:Cloud Cloud Realm Category:ETL Extract Transform Load enhancement Enhancement of the functionality of an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants