Metric importer refactor #1035
Conversation
… / run_id run_id is now added in the segmented helper metric_importer. Run ID is a concept the metric providers should not be aware of, so it is now handled externally. Furthermore, the processing mechanism in the providers no longer allows overriding of the externally callable read_metrics(); new private sub-functions handle the separate steps of reading, parsing, validating, and adding data.
41 file(s) reviewed, 34 comment(s)
* main:
  - Removed the email column from the runs table
  - (change): Changed default resolution of powermetrics to a non-underflow resolution
  - Bump uvicorn-worker from 0.2.0 to 0.3.0 (#1034)
  - (fix): lm-sensors was not reacting to skip_check
  - (Tests): pytest running as default with -vv
…bout if provider can or cannot return empty rows
42 file(s) reviewed, 13 comment(s)
for _, row in detail_names.iterrows():
    measurement_metric_id = DB().fetch_one('''
        INSERT INTO measurement_metrics (run_id, metric, detail_name, unit)
        VALUES (%s, %s, %s, %s)
        RETURNING id
        ''', params=(run_id, metric_name, row['detail_name'], row['unit']))[0]
    df.loc[(df['detail_name'] == row['detail_name']) & (df['unit'] == row['unit']), 'measurement_metric_id'] = measurement_metric_id
style: Consider batching the INSERT statements for better performance with many detail_names
@greptileai: but how do I do that, when I need to return the ID for every one?
You can use a VALUES clause with multiple rows and still get the returned IDs using unnest. Here's how:
INSERT INTO measurement_metrics (run_id, metric, detail_name, unit)
VALUES %s
RETURNING id, detail_name, unit
Then construct the list of value tuples in Python and use psycopg2's execute_values(). This will return all IDs matched with their detail_name and unit, which you can then use to update the dataframe in one go.
Would you like me to show you the specific code changes needed to implement this?
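To make the batching concrete, here is a hedged sketch of the multi-row INSERT ... RETURNING approach. The function name batch_insert_metrics and the cursor-style interface are hypothetical; in practice you would reach for psycopg2's execute_values() with fetch=True, but the sketch builds the VALUES clause by hand so it stays dependency-free:

```python
def batch_insert_metrics(cur, run_id, metric_name, detail_rows):
    # Build a single multi-row INSERT ... RETURNING statement instead of
    # one round-trip per detail_name. `cur` is any DB-API-style cursor.
    placeholders = ','.join(['(%s, %s, %s, %s)'] * len(detail_rows))
    sql = ('INSERT INTO measurement_metrics (run_id, metric, detail_name, unit) '
           f'VALUES {placeholders} RETURNING id, detail_name, unit')
    # Flatten the per-row parameters into one sequence matching the placeholders.
    params = [p for r in detail_rows
              for p in (run_id, metric_name, r['detail_name'], r['unit'])]
    cur.execute(sql, params)
    # Each returned tuple (id, detail_name, unit) can then be mapped back
    # onto the dataframe with a single .loc assignment per metric.
    return cur.fetchall()
```

The RETURNING rows carry detail_name and unit alongside each generated id, so the dataframe update loop from the original snippet still works unchanged on the result.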
* main: (improvement): Added ICMP, Etherframe, QUIC and TCP-Streams to TCPDump [skip ci] (#1036)
44 file(s) reviewed, 12 comment(s)
resolution_avg DOUBLE PRECISION NOT NULL,
resolution_max DOUBLE PRECISION NOT NULL,
resolution_95p DOUBLE PRECISION NOT NULL,
style: Consider adding indexes on resolution columns if they will be frequently queried for analysis or filtering
f = StringIO(df[['measurement_metric_id', 'value', 'time']]
             .to_csv(index=False, header=False))
DB().copy_from(file=f, table='measurement_values', columns=['measurement_metric_id', 'value', 'time'], sep=',')
style: Consider using a context manager (with statement) for the StringIO object to ensure proper cleanup
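A minimal sketch of the context-manager suggestion, assuming a DB wrapper exposing copy_from like the one in the PR (the copy_values function name is hypothetical):

```python
from io import StringIO

def copy_values(db, df):
    # Scope the StringIO buffer in a `with` block so it is closed
    # even if copy_from raises midway through the COPY.
    with StringIO(df[['measurement_metric_id', 'value', 'time']]
                  .to_csv(index=False, header=False)) as f:
        db.copy_from(file=f, table='measurement_values',
                     columns=['measurement_metric_id', 'value', 'time'],
                     sep=',')
```

Since StringIO buffers live in memory the practical benefit is small, but the `with` block documents the buffer's lifetime and protects against leaks if the function later grows.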
df['value'] = df['received_bytes_intervals'] + df['transmitted_bytes_intervals']
df['value'] = df.value.astype(int)
style: consider using df.loc[] for assignment to avoid SettingWithCopyWarning in pandas
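A short sketch of the .loc suggestion, assuming df may be a slice of a larger frame (the sample column values are made up):

```python
import pandas as pd

df = pd.DataFrame({'received_bytes_intervals': [100, 200],
                   'transmitted_bytes_intervals': [10, 20]})

# Take an explicit copy so the assignment below targets an owned frame,
# then assign via .loc; together this avoids SettingWithCopyWarning
# when df originated as a slice of another DataFrame.
df = df.copy()
df.loc[:, 'value'] = (df['received_bytes_intervals']
                      + df['transmitted_bytes_intervals']).astype(int)
```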
df['detail_name'] = df.psys_id
df = df.drop('psys_id', axis=1)
style: Consider adding error handling in case psys_id column is missing from the dataframe
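One way to add that guard, as a hedged sketch (the rename_psys_column helper name and the error message are hypothetical, not from the PR):

```python
import pandas as pd

def rename_psys_column(df):
    # Fail fast with an explicit message instead of letting pandas raise
    # an opaque AttributeError/KeyError deep inside the provider.
    if 'psys_id' not in df.columns:
        raise ValueError("Expected column 'psys_id' missing from dataframe")
    df = df.copy()
    df['detail_name'] = df['psys_id']
    return df.drop('psys_id', axis=1)
```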
if not self.cpu_chips:
    raise MetricProviderConfigurationError(
        'Please set the CPUChips config option for PsuEnergyAcSdiaMachineProvider in the config.yml')
if not self.tdp:
    raise MetricProviderConfigurationError('Please set the TDP config option for PsuEnergyAcSdiaMachineProvider in the config.yml')
style: Consider moving configuration validation to check_system() to maintain consistency with other validation logic there
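A sketch of what moving the validation into check_system() could look like. The class body here is a simplified stand-in, not the real provider, and MetricProviderConfigurationError is redefined locally only to keep the example self-contained:

```python
class MetricProviderConfigurationError(Exception):
    """Stand-in for the exception class used in the PR."""

class PsuEnergyAcSdiaMachineProvider:
    def __init__(self, cpu_chips=None, tdp=None):
        self.cpu_chips = cpu_chips
        self.tdp = tdp

    def check_system(self):
        # Centralizing config validation here keeps it next to the other
        # system checks providers already run, instead of in __init__.
        for attr, option in (('cpu_chips', 'CPUChips'), ('tdp', 'TDP')):
            if not getattr(self, attr):
                raise MetricProviderConfigurationError(
                    f'Please set the {option} config option for '
                    'PsuEnergyAcSdiaMachineProvider in the config.yml')
```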
assert data[1]['metric'] == 'psu_energy_ac_mcp_machine'
assert data[1]['detail_name'] == '[machine]'
assert data[1]['unit'] == 'mJ'
assert data[1]['value'] == 13175452
style: hardcoded value 13175452 should be defined as a constant or pulled from test data for better maintainability
run_id = Tests.insert_run()
Tests.import_machine_energy(run_id)

sci = {"I":436,"R":0,"EL":4,"RS":1,"TE":181000,"R_d":"page request"}
style: consider moving sci configuration to a test constants file for reuse across tests
def test_powermetrics():
    obj = PowermetricsProvider(499, skip_check=True)
    obj._filename = os.path.join(GMT_ROOT_DIR, './tests/data/metrics/powermetrics.log')

    df = obj.read_metrics()

    assert list(df.metric.unique()) == ['cpu_time_powermetrics_vm', 'disk_io_bytesread_powermetrics_vm', 'disk_io_byteswritten_powermetrics_vm', 'energy_impact_powermetrics_vm', 'cores_energy_powermetrics_component', 'gpu_energy_powermetrics_component', 'ane_energy_powermetrics_component']

    assert math.isclose(df[df.metric == 'energy_impact_powermetrics_vm'].value.mean(), 430.823529, rel_tol=1e-5)
logic: test_powermetrics() should verify all expected metrics are present and have valid values, not just energy_impact_powermetrics_vm
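A sketch of the broader assertion this comment asks for. The helper name assert_metrics_complete is hypothetical; the metric names are the ones the existing test already expects:

```python
import pandas as pd

EXPECTED_VM_METRICS = [
    'cpu_time_powermetrics_vm',
    'disk_io_bytesread_powermetrics_vm',
    'disk_io_byteswritten_powermetrics_vm',
    'energy_impact_powermetrics_vm',
]

def assert_metrics_complete(df, expected):
    # Verify every expected metric is present and carries plausible values,
    # rather than sampling only energy_impact_powermetrics_vm.
    for metric in expected:
        subset = df[df.metric == metric]
        assert not subset.empty, f'missing metric: {metric}'
        assert (subset.value >= 0).all(), f'negative values in {metric}'
```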
def import_cpu_utilization(run_id):

    obj = CpuUtilizationCgroupContainerProvider(99, skip_check=True)
style: The hardcoded resolution value of 99 should be defined as a constant at the top of the file with other test constants for better maintainability
phases = [
    {"start": TEST_MEASUREMENT_START_TIME-8, "name": "[BASELINE]", "end": TEST_MEASUREMENT_START_TIME-7},
    {"start": TEST_MEASUREMENT_START_TIME-6, "name": "[INSTALL]", "end": TEST_MEASUREMENT_START_TIME-5},
    {"start": TEST_MEASUREMENT_START_TIME-4, "name": "[BOOT]", "end": TEST_MEASUREMENT_START_TIME-3},
    {"start": TEST_MEASUREMENT_START_TIME-2, "name": "[IDLE]", "end": TEST_MEASUREMENT_START_TIME-1},
    {"start": TEST_MEASUREMENT_START_TIME, "name": "[RUNTIME]", "end": TEST_MEASUREMENT_END_TIME},
    {"start": TEST_MEASUREMENT_END_TIME+1, "name": "[REMOVE]", "end": TEST_MEASUREMENT_END_TIME+2},
]
logic: Phase timestamps use simple arithmetic (-8, -7 etc) which could cause edge cases if TEST_MEASUREMENT_START_TIME is too small. Consider adding validation
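One way to add the validation this comment suggests, as a hedged sketch (the build_phases function is hypothetical, wrapping the literal list from the test):

```python
def build_phases(start, end, margin=8):
    # The phase offsets subtract up to `margin` ticks from `start`, so
    # reject timestamps too small for that arithmetic to stay positive.
    if start < margin:
        raise ValueError(f'start timestamp {start} smaller than offset margin {margin}')
    if end <= start:
        raise ValueError('end must be after start')
    return [
        {"start": start - 8, "name": "[BASELINE]", "end": start - 7},
        {"start": start - 6, "name": "[INSTALL]", "end": start - 5},
        {"start": start - 4, "name": "[BOOT]", "end": start - 3},
        {"start": start - 2, "name": "[IDLE]", "end": start - 1},
        {"start": start, "name": "[RUNTIME]", "end": end},
        {"start": end + 1, "name": "[REMOVE]", "end": end + 2},
    ]
```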
Rework metric provider processing mechanism and removed storing to DB / run_id
Greptile Summary
Major refactoring of the metric provider processing mechanism to improve separation of concerns by centralizing database operations and removing run_id handling from individual providers.
- Created lib/metric_importer.py to centralize database operations and run_id handling previously scattered across providers
- Added resolution_avg, resolution_max, and resolution_95p columns to the measurements table for better timing statistics
- Created _read_metrics(), _parse_metrics(), and _add_unit_and_metric() in the base provider class for better encapsulation
- Made read_metrics() final in the base provider class to prevent overriding and ensure a consistent processing flow
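The summary's "final read_metrics() driving private steps" describes a template-method pattern. Here is a minimal sketch of that shape; the class bodies are simplified stand-ins, not the real GMT provider code, and typing.final is only enforced by static checkers:

```python
from typing import final

class BaseMetricProvider:
    @final  # subclasses customize the private steps, never this entry point
    def read_metrics(self):
        rows = self._read_metrics()        # read raw provider output
        rows = self._parse_metrics(rows)   # parse / validate
        return self._add_unit_and_metric(rows)  # attach unit + metric name

    def _read_metrics(self):
        raise NotImplementedError

    def _parse_metrics(self, rows):
        return rows

    def _add_unit_and_metric(self, rows):
        return rows

class DummyProvider(BaseMetricProvider):
    def _read_metrics(self):
        return [{'value': 42}]

    def _add_unit_and_metric(self, rows):
        return [{**r, 'unit': 'mJ', 'metric': 'dummy'} for r in rows]
```

Because callers only ever invoke read_metrics(), every provider is guaranteed to run the same read → parse → annotate pipeline, which is the consistency the refactor is after.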