-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not truncate aggregate tables on each ingest #841
Do not truncate aggregate tables on each ingest #841
Conversation
@@ -0,0 +1,8 @@ | |||
-- Since we are adding a new 'last_modified' column to event tables, we need to set the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You do not need to do this. You can re-aggregate by setting the last-modifed time when you run the etl_overseer.
tests/artifacts/xdmod-test-artifacts/xdmod/referencedata/names.csv
Outdated
Show resolved
Hide resolved
REF_SOURCE=`realpath $BASEDIR/../../tests/artifacts/xdmod-test-artifacts/xdmod/referencedata` | ||
REF_DIR=/var/tmp/referencedata | ||
|
||
if [ "$XDMOD_TEST_MODE" = "upgrade" ]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should run the test that adds new data to the system on both the upgrade and fresh install builds.
then | ||
yum -y install ~/rpmbuild/RPMS/*/*.rpm | ||
~/bin/services start | ||
expect $BASEDIR/xdmod-upgrade.tcl | col -b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should not reinstall the RPM or start the servers again or run the upgrade again
Don't forget to update xdmod-ingestor so that it actually sets the last-modified time.
|
There is something amiss with the openstack_staging_event table definition. The overseer is Altering it every time it runs:
|
There is something wrong with the jobs-cloud-impor-users-generic pipeline:
@eiffel777 may want to look into this. |
af60ab9
to
ef7709e
Compare
@@ -93,7 +99,8 @@ | |||
"name": "increment_key", | |||
"columns": [ | |||
"resource_id", | |||
"event_id" | |||
"event_id", | |||
"last_modified" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No.
@@ -58,6 +59,7 @@ public function __construct(aOptions $options, EtlConfiguration $etlConfig, Log | |||
$this->_start_event_ids = array(self::START, self::RESUME, self::STATE_REPORT, self::UNSHELVE, self::UNPAUSE, self::UNSUSPEND, self::POWER_ON); | |||
$this->_all_event_ids = array_merge($this->_start_event_ids, $this->_stop_event_ids); | |||
$this->_end_time = $etlConfig->getVariableStore()->endDate ? date('Y-m-d H:i:s', strtotime($etlConfig->getVariableStore()->endDate)) : null; | |||
$this->_last_modified = date('Y-m-d H:i:s', strtotime('now - 3 minutes')); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope.
$this->_last_modified = date('Y-m-d H:i:s', | ||
strtotime($this->getEtlOverseerOptions()->getLastModifiedStartDate())); | ||
$this->_last_modified = date( | ||
'Y-m-d H:i:s', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nay.
diff --git a/classes/ETL/Ingestor/CloudStateReconstructorTransformIngestor.php b/classes/ETL/Ingestor/CloudStateReconstructorTransformIngestor.php
index 761dccb2..c3a3afde 100644
--- a/classes/ETL/Ingestor/CloudStateReconstructorTransformIngestor.php
+++ b/classes/ETL/Ingestor/CloudStateReconstructorTransformIngestor.php
@@ -46,7 +46,6 @@ class CloudStateReconstructorTransformIngestor extends pdoIngestor implements iA
private $_start_event_ids;
private $_instance_state;
private $_end_time;
- private $_last_modified;
/**
* @see ETL\Ingestor\pdoIngestor::__construct()
@@ -59,7 +58,6 @@ class CloudStateReconstructorTransformIngestor extends pdoIngestor implements iA
$this->_start_event_ids = array(self::START, self::RESUME, self::STATE_REPORT, self::UNSHELVE, self::UNPAUSE, self::UNSUSPEND, self::POWER_ON);
$this->_all_event_ids = array_merge($this->_start_event_ids, $this->_stop_event_ids);
$this->_end_time = $etlConfig->getVariableStore()->endDate ? date('Y-m-d H:i:s', strtotime($etlConfig->getVariableStore()->endDate)) : null;
- $this->_last_modified = null;
$this->resetInstance();
}
@@ -129,18 +127,12 @@ class CloudStateReconstructorTransformIngestor extends pdoIngestor implements iA
protected function getSourceQueryString()
{
$sql = parent::getSourceQueryString();
- if($this->_last_modified === null) {
- $this->_last_modified = date(
- 'Y-m-d H:i:s',
- $this->getEtlOverseerOptions()->getLastModifiedStartDate()
- );
- }
// Due to the way the Finite State Machine handles the rows in event reconstruction, the last row
// is lost. To work around this we add a dummy row filled with zeroes.
$colCount = count($this->etlSourceQuery->records);
$unionValues = array_fill(0, $colCount, 0);
- $subSelect = "(SELECT DISTINCT instance_id from modw_cloud.event WHERE last_modified > \"" . $this->_last_modified . "\")";
+ $subSelect = "(SELECT DISTINCT instance_id from modw_cloud.event WHERE last_modified > \"" . $this->getEtlOverseerOptions()->getLastModifiedStartDate() . "\")";
$sql = "$sql WHERE instance_id IN " . $subSelect . " AND event_type_id IN (" . implode(',', $this->_all_event_ids) . ")\nUNION ALL\nSELECT " . implode(',', $unionValues) . "\nORDER BY 1 DESC, 2 DESC, 3 ASC, 4 DESC";
|
@@ -11,7 +11,7 @@ sudo -u xdmod xdmod-shredder -r openstack -d $REF_DIR/openstack_upgrade -f opens | |||
sudo -u xdmod xdmod-ingestor --datatype=openstack --last-modified-start-date "$last_modified_start_date" | |||
sudo -u xdmod xdmod-ingestor --aggregate=cloud --last-modified-start-date "$last_modified_start_date" | |||
|
|||
phpunit="$(readlink -f ../../../../../vendor/bin/phpunit)" | |||
phpunit="$(readlink -f ../../../../vendor/bin/phpunit)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please test these things locally before committing if at all possible. Otherwise, you tie up limited shippable resources.
@@ -58,6 +58,7 @@ public function __construct(aOptions $options, EtlConfiguration $etlConfig, Log | |||
$this->_start_event_ids = array(self::START, self::RESUME, self::STATE_REPORT, self::UNSHELVE, self::UNPAUSE, self::UNSUSPEND, self::POWER_ON); | |||
$this->_all_event_ids = array_merge($this->_start_event_ids, $this->_stop_event_ids); | |||
$this->_end_time = $etlConfig->getVariableStore()->endDate ? date('Y-m-d H:i:s', strtotime($etlConfig->getVariableStore()->endDate)) : null; | |||
$this->_last_modified = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not add an unused private member variable to the class.
@@ -35,7 +35,7 @@ | |||
"name": "cloud-transient", | |||
"class": "DatabaseIngestor", | |||
"definition_file": "cloud_common/cloud_transient.json", | |||
"truncate_destination": true, | |||
"truncate_destination": false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
truncate_destination is set to false by default in lint 4. You do not need to has a tunrcate_destination: false here or on line 49.
|
||
last_modified_start_date=$(date +'%F %T') | ||
|
||
sudo -u xdmod xdmod-shredder -r openstack -d $REF_DIR/openstack_upgrade -f openstack --last-modified-start-date "$last_modified_start_date" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The xdmod-shredder command does not accept a --last-modified-start-date
commandline parameter. Please explain why you added the parameter.
The new test data that you added has overlapping instance_ids but the users and projects for those instances are different in the new data vs the old data. As far as I am aware, the current code is not able to handle this scenario to correctly track user / project changes for an instance. Please can you confirm whether this is/isn't the case. |
@@ -23,7 +23,8 @@ | |||
"wallduration": "FLOOR(e.end_time_ts) - FLOOR(e.start_time_ts)", | |||
"person_id": "ev.person_id", | |||
"systemaccount_id": "ev.systemaccount_id", | |||
"submission_venue_id": "ev.submission_venue_id" | |||
"submission_venue_id": "ev.submission_venue_id", | |||
"last_modified": "ev.last_modified" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should not set the last_modified time from ev.last_modified since this will correspond to the modification time of the start event for the session. This start event will not have been modified. Instead you should not be setting last_modified explicitly at all and should rely on the database setting it.
Don't forget that you should add tests to cover two different scenarios:
You then need to confirm that
I suggest that it will be much more manageable to use a small set of test data (you only need one event for scenario (1) and two events for scenario (2)). |
e78ac76
to
808e01b
Compare
cf15f76
to
fc9e8f0
Compare
bea9d11
to
55aed11
Compare
The tests you have added do not actually confirm that only the rows that need to change are being updated - i.e. these tests would pass if the whole database were reingested. Please can you explain what testing you did to confirm that only the subset of rows are being changed. |
There are errors like this in the logs:
|
There is something wrong with the cloud_events_transient table definition:
the ETL is trying to alter the table even though it does not need altering. |
Github is being weird and not letting me edit my PR description. To ensure it was doing what I expected I ran the following in my Docker: This is an old instance with new events in the post ingest. Ran this before and after post_ingest_test.sh to ensure that it got updated with a new end_event/last_modified time: Then ran this to ensure that we only had two rows where the last_modified timestamp had changed: |
Turn on debug and it should tell you why it thinks the table should be altered as per #807 |
configuration/etl/etl_tables.d/cloud_common/cloud_transient.json
Outdated
Show resolved
Hide resolved
…bortyr/xdmod into RC_DontTruncateAggregates
The Alter table issue looks like a preexisting condition:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
What is the reason provided by the ETL as to why it's altering the transient table? You should see something like this in the logs:
|
I'd have to go check the logs to find the specifics; but start_time_ts is already defined as a nullable bigint of size 18. It's odd, but fairly certain it's not related to the PR so I was hoping to take care of it in its own request. |
@chakrabortyr Check the logs and provide the requested information. We can address it later but I want to know what the issue is. |
|
* Do not truncate aggregate table
Description
Motivation and Context
We don't want to be constantly truncating + reingesting potentially millions of rows as cloud usage scales.
Tests performed
Made sure existing tests passed + added new set of anonymised data. This anonymised data is then validated in a post ingest script. I manually did some additional validation with the following queries I couldn't think of a holistic way to operationalise:
This is an old instance with new events in the post ingest. Ran this before and after post_ingest_test.sh to ensure that it got updated with a new end_event/last_modified time:
SELECT * FROM modw_cloud.cloud_events_transient where instance = 'DESIRED_ID';
Then ran this to ensure that we only had two rows where the last_modified timestamp had changed:
SELECT * FROM modw_cloud.cloud_events_transient order by end_time DESC, last_modified DESC;
Types of changes
Checklist: