Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database io additional tags & fields #7103

Merged

Conversation

Trovalo
Copy link
Collaborator

@Trovalo Trovalo commented Mar 3, 2020

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

Resolves #7073
The PR covers both Azure SQL DB and "Other versions", by adding new tags and fields that are better if calculated on a per-point basis and might be unhandy to calculate in influxDB but also removes some fields.
Below a recap

Changes

For SQL Server On-Prem (or better all except Azure SQL DB)

New Old New vs Old Note
measurement measurement
sql_instance sql_instance
database_name database_name
physical_filename physical_filename
logical_filename logical_filename
file_type file_type edited the value is kept as is
volume_mount_point added provides by disk metrics
read_latency_ms read_latency_ms
reads reads
read_bytes read_bytes
write_latency_ms write_latency_ms
writes writes
write_bytes write_bytes
read_latency_ms_per_read added
write_latency_ms_per_read added
overall_latency_ms_per_transfer added
avg_bytes_per_read added
avg_bytes_per_write added
avg_bytes_per_transfer added
current_size_mb removed misleading value updated only on service restart
space_used_mb removed fixed value at -1 in the query
rg_read_stall_ms removed not compatible for SQL Server 2012 11.x
rg_write_stall_ms removed not compatible for SQL Server 2012 11.x

Edited

  • file_type - previously set to "DATA" or "LOG", but it can have also other values, now the value is kept as is (ROWS | LOG | FILESTREAM)

Removed

  • rg_read_stall_ms & rg_write_stall_ms - I don't think it's worth breaking compatibility with SQL server 2012 11.x and previous just to add those two fields
  • space_used_mb - the value is fixed at -1 in the query itself, also, for reasons I do not know the values does not reach influxdb anyway
  • current_size_mb - the value is misleading. It gets updated on instance restart, which does not happen often, meaning that the real value can be totally different, also, for reasons I do not know the values does not reach influxdb anyway

Added

  • volume_mount_point (tag)- provides "per-disk" data, in fact, multiple databases usually share the same disk, if something is wrong with the disk then all the databases on it will incur in performance issues
  • various fields - nothing to say, just some fields to get the actual disk latency, useful to monitor disk-related performance issues.
  • also, a set of new rows (points) has been added in order to have the total by volume_mount_point, in these rows, the other tags have values like 'All Databases'

For SQL Azure DB

New Old New vs Old
measurement measurement
sql_instance sql_instance
database_name database_name
read_latency_ms read_latency_ms
reads reads
read_bytes read_bytes
write_latency_ms write_latency_ms
writes writes
write_bytes write_bytes
rg_read_stall_ms rg_read_stall_ms
rg_write_stall_ms rg_write_stall_ms
read_latency_ms_per_read added
write_latency_ms_per_read added
overall_latency_ms_per_transfer added
avg_bytes_per_read added
avg_bytes_per_write added
avg_bytes_per_transfer added
logical_filename logical_filename
physical_filename physical_filename
file_type file_type
current_size_mb current_size_mb
space_used_mb space_used_mb

Added

  • various fields - nothing to say, just some fields to get the actual disk latency, useful to monitor disk-related performance issues. The precision problem applies also here.

Tests

  • The query itself has been tested on SQL Server 2012 and later versions.
  • Tests have been made running telegraf with those changes

Known Issues

missing precision
In the new the calculated fields, the division is made between integers, which causes the result to be an integer (the decimal part gets truncated). If the values get converted to float, the final result will be converted to a string after being serialized and becomes useless.
Sample:

SELECT
  ([io_stall_read_ms] / [num_of_reads]) AS [avg_latency_ms_per_read_int],
  ([io_stall_read_ms] *1.0 / [num_of_reads]) AS [avg_latency_ms_per_read_float] --implicit conversion
FROM [sys].[dm_io_virtual_file_stats](NULL,NULL)
When this gets serialized by the plugin, the int will be saved as integer, but the float gets converted to a string

Personally, I'm not able to understand why a returned value like 0.55 is seen as as string and outputtent in line protocol as value="0.55", therefore maiing it unusable.
Can someone help me understand this? (changing it afterwards will cause errors due to field type conflict)

@danielnelson
Copy link
Contributor

I kind of dislike adding read_latency_ms / reads since most TSDBs, including InfluxDB, can perform this type of calculation in the query. However, I recognize that Chronograf and Grafana require switching to text query mode in order to do so...

rg_read_stall_ms & rg_write_stall_ms - I don't think it's worth breaking compatibility with SQL server 2012 11.x and previous just to add those two fields

We could special case these in the query perhaps?

space_used_mb - the value is fixed at -1 in the query itself
current_size_mb - the value is misleading

What do you think about these @denzilribeiro

also, a set of new rows (points) has been added in order to have the total by volume_mount_point, in these rows, the other tags have values like 'All Databases'

I don't think we should do this change, instead it should be done at query time with a sum aggregation. While it can be tempting to add totals they add extra unneeded metrics and can cause confusion when applying the normal aggregation patterns, for example sum() could return double the correct value if the total tag isn't excluded.

Personally, I'm not able to understand why a returned value like 0.55 is seen as as string and outputtent in line protocol as value="0.55", therefore maiing it unusable.

Maybe it is being returned as a Decimal type, similar to #6397

@danielnelson danielnelson added area/sqlserver feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin labels Mar 4, 2020
@denzilribeiro
Copy link
Contributor

My opinion is that the more complex the queries the more overhead on the server you are monitoring, People use this to monitor instances with few cores the query this PR submits is on the heavy side for sure.

  • Virtual filestats is per file reporting, not per disk reporting, I don't like calculating all per volume metrics, you can can derive that from the file at the reporting layer if needed or from appropriate Volume level disk counters at the OS. If the DMV doesn't give you metrics per volume, I wouldn't calculate those per volume.

  • I also am of opinion all averages aka all of these as it shouldn't be done at the collection layer rather at the reporting layer.
    read_latency_ms_per_read
    write_latency_ms_per_read |   | added
    overall_latency_ms_per_transfer |   | added
    avg_bytes_per_read |   | added
    avg_bytes_per_write |   | added
    avg_bytes_per_transfer |   | added

  • We cannot just remove these. These values are critical to all cloud versions ( SQL DB, Hyperscale, SQL managed instance) and are indicative of throttling due to resource governance and you have no other metrics (unlike on-prem where can look at disks queuing counters). We can special case and have a 0 value for on-prem versions that don't have this.

  | rg_read_stall_ms | removed | not compatible for SQL Server 2012 11.x
  | rg_write_stall_ms | removed | not compatible for SQL Server 2012 11.x

  • space_used_mb has to be changed for on-prem for sure, I will try to submit a PR. I think I mistakenly was under impression that all measurements have to be the same, space used only returns right values within context of the database, should be able to not collect for on-prem.
  • My opinion is if volumestats is a need then submit it as a separate collector that can be disabled as you are either looking at metrics based on file or volume not both.

@Trovalo
Copy link
Collaborator Author

Trovalo commented Mar 4, 2020

  • additional calculated fields - (read_latency_ms_per_read, etc), Not having those "out of the box" is not a big problem since they can be calculated later, the point is that the source columns by themselves are of no use (ie: why should I care to know that since the last restart the I had N millisecond of wait and N reads?), the useful information is the ratio between the two, adding a simple division in the query won't add any burden on SQL Server. (the union all can be removed though, see next point)

  • Total Rows - I see the point in the rejection of the "total" rows, which gives the total by disk, with other tags valued as "All Database" and similar, since, as you said, those rows may break current reports and cause confusion if not treated properly. That's not a problem since the totals can be calculated in other ways.

  • volume_mount_point - I would like to have this tag anyway, it won't increase the cardinality or cause additional issues since, in the end, it is a subpart of the current "physical_filename" tag.
    It is just additional information that allows the calculation of metrics on a wider level, by volume (even if a subquery will be needed to do so). It differs from the Windows plugin since it considers only the volumes used by SQL Server.

  • DB Size for SQL On-Prem - (current_size_mb, space_used_mb) as @denzilribeiro said, those make sense only in the context of the single database, therefore to get meaningful values you must loop on each DB. Doing so in a query is not difficult, but running it with high frequency is not that good. It could be added as a new specific measurement, something like "sqlserver_db_size", so it's easier to schedule it with a different "interval".
    Regarding the measurement structure, it is not required to have all the columns (tag & fields) on all the series, in this case, some columns might exists in the measurement, but are only used by the "Azure SQL DB" databases, if no "Azure SQL DB" is monitored then the columns won't exist at all, if it's added later, then the columns will be created.

I will edit the PR soon to remove the "total rows" and maybe simplify the query.

@Trovalo
Copy link
Collaborator Author

Trovalo commented Mar 4, 2020

I've just updated the code, below a quick recap (I'm not sure if it's better to edit the PR text instead of adding comments. if yes I will update it later)

  • Fields for SQL Srv 2012 vs Other version - the two fields [rg_read_stall_ms] and [rg_write_stall_ms] are back. now the query use dynamic SQL, if the version is SQL Server 2012 those columns are not used, for the other versions (2012 +) the column are used
  • removed "Total" rows - the rows with the total per disk and 'all databases' have been removed. It can be calculated if needed

@denzilribeiro
Copy link
Contributor

I am still not a fan of any of the calculations that can be done outside for most part.
Isn't calculating latency with simple division fundamentally flawed as you are calculating average latency from server restart? In my opinion totally is as gives you wrong results.
Lets say write latency: aka vfs.[io_stall_write_ms] / vfs.[num_of_writes]
io_stall_write_ms is a time from server restart So you are dividing latency from server restart to reads from server restart.
Ideally when calculating latency between point at time T1 and time T2
Latency = (io_stall_write_ms at T2 - io_stall_write_ms at T1) / (num_of_writes at T2 - num_of_writes at T1)
aka non_negative_derivative("write_latency_ms", 1s) / non_negative_derivative("writes", 1s)

That is true for all of these:

read_latency_ms_per_read |   | added
write_latency_ms_per_read |   | added
overall_latency_ms_per_transfer |   | added
avg_bytes_per_read |   | added
avg_bytes_per_write |   | added
avg_bytes_per_transfer |   | added

@danielnelson thoughts?

@danielnelson
Copy link
Contributor

I expect the the per_read style values will be less flexible when it comes to querying, since they introduce a hidden time unit: the collection interval. The units become latency ms / (reads/10s). They also don't introduce any new information but add additional storage requirements. Normally we wouldn't add these rate values to plugins.

However, I do see why you would want this since it does make querying easier, especially with Grafana/Chronograf, if you can precompute the results you are interested in ahead of time.

What I think might be a better solution for the division is to hold tight for Telegraf 1.15, we are planning a processor that will allow you to compute these rates at ingestion time. It would be a little more setup in Telegraf, but will be more flexible overall.

I can't promise it for 1.15 yet, right now its a weekend project for me, but we are also working on a query plugin that will let you do ad-hoc queries as well, so there are going to be more options available for power users and we can keep this plugin lean for performance and simplicity.

@denzilribeiro
Copy link
Contributor

There is the correctness issue too that many people will miss. A straight division is an average since server startup - which means if server started 2 days ago and you had a bad time period 22 hours ago, it will still affect the average that you take now. Hope am making sense :)

@Trovalo
Copy link
Collaborator Author

Trovalo commented Mar 4, 2020

@denzilribeiro is right on the precision, by calculating it in the frontend (Grafana/Chronograf) you are more accurate.
In fact, an anomaly that causes a peak will somehow have repercussion on any values that comes later, the more time passes the more it will be mitigated (by using the precomputed metric), on the other side the result achievable by calculating in frontend is way more precise even if a bit "harder" to manage (at least form Grafana/Chronograf).
I didn't see that until @denzilribeiro made an example, thanks.

@Trovalo
Copy link
Collaborator Author

Trovalo commented Mar 6, 2020

The code has been edited (hopefully for the last time).

  • Calculated columns have been removed

In the end, this is what the whole PR does:

Only for SQL Server on-prem

New Old New vs Old Note
file_type file_type edited the value is kept as is
volume_mount_point added allows by volume calculation
current_size_mb removed misleading value updated only on service restart
space_used_mb removed fixed value at -1 in the query
rg_read_stall_ms removed not fetched on SQL 2012
rg_write_stall_ms removed not fetched on SQL 2012

For Azure SQL DB nothing has changed (not even "file_type")

@denzilribeiro
Copy link
Contributor

Sorry to ask you to possibly change again to make query shorter :) You could coalesce the on-prem scenario to make query shorter :) can you test this on SQL 2012/below - Should generate a statement that is correct for both ( but I haven't tested with 2012)

DECLARE @sqlstatement AS nvarchar(max);
SELECT @sqlstatement = N'
SELECT
''sqlserver_database_io'' AS [measurement]
,REPLACE(@@ServerName,'''','':'') AS [sql_instance]
,DB_NAME(vfs.[database_id]) AS [database_name]
,COALESCE(mf.[physical_name],''RBPEX'') AS [physical_filename] --RPBEX = Resilient Buffer Pool Extension
,COALESCE(mf.[name],''RBPEX'') AS [logical_filename] --RPBEX = Resilient Buffer Pool Extension
,mf.[type_desc] AS [file_type]
,IIF( RIGHT(vs.[volume_mount_point],1) = '''' /*Tag value cannot end with \ */
,LEFT(vs.[volume_mount_point],LEN(vs.[volume_mount_point])-1)
,vs.[volume_mount_point]
) AS [volume_mount_point]
,vfs.[io_stall_read_ms] AS [read_latency_ms]
,vfs.[num_of_reads] AS [reads]
,vfs.[num_of_bytes_read] AS [read_bytes]
,vfs.[io_stall_write_ms] AS [write_latency_ms]
,vfs.[num_of_writes] AS [writes]
,vfs.[num_of_bytes_written] AS [write_bytes] '
+ CASE
WHEN (SERVERPROPERTY('ProductMajorVersion') <=11) THEN ''
ELSE ',vfs.io_stall_queued_read_ms AS [rg_read_stall_ms],vfs.io_stall_queued_write_ms AS [rg_write_stall_ms]'
END

  • N' FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    INNER JOIN sys.master_files AS mf WITH (NOLOCK)
    ON vfs.[database_id] = mf.[database_id] AND vfs.[file_id] = mf.[file_id]
    CROSS APPLY sys.dm_os_volume_stats(vfs.[database_id], vfs.[file_id]) AS vs
    '
    EXEC sp_executesql @sqlstatement

@Trovalo
Copy link
Collaborator Author

Trovalo commented Mar 6, 2020

This reduces code duplication, also using 'ProductMajorVersion' is easier to use (I had to cast it anyway to avoid "strange" results)...
The query itself works perfectly, but sadly an error occurs in telegraf. (using --test)

E! [inputs.sqlserver] Error in plugin: sql: expected 16 destination arguments in Scan, not 15

The cast is needed otherwise the result is always true see test below

-- using a SQL 2017
SELECT 
	SERVERPROPERTY('ProductMajorVersion') as Version,	--type sql_variant
	IIF( CAST(SERVERPROPERTY('ProductMajorVersion') AS int) <= 11,  1 , 0) as checkWithCast,	--Correct
	IIF( SERVERPROPERTY('ProductMajorVersion') <= 11,  1 , 0) as CheckWithoutCast	--unexpected result

Here is the whole query (it works)

SET DEADLOCK_PRIORITY -10;
IF SERVERPROPERTY('EngineEdition') = 5
BEGIN
SELECT
	 'sqlserver_database_io' As [measurement]
	,REPLACE(@@SERVERNAME,'\',':') AS [sql_instance]
	,DB_NAME([vfs].[database_id]) AS [database_name]
	,vfs.io_stall_read_ms AS read_latency_ms
	,vfs.num_of_reads AS reads
	,vfs.num_of_bytes_read AS read_bytes
	,vfs.io_stall_write_ms AS write_latency_ms
	,vfs.num_of_writes AS writes
	,vfs.num_of_bytes_written AS write_bytes
	,vfs.io_stall_queued_read_ms as rg_read_stall_ms
	,ISNULL(b.name ,'RBPEX') as logical_filename
	,ISNULL(b.physical_name, 'RBPEX') as physical_filename
	,CASE WHEN vfs.file_id = 2 THEN 'LOG' ELSE 'DATA' END AS file_type
	,ISNULL(size,0)/128 AS current_size_mb
	,ISNULL(FILEPROPERTY(b.name,'SpaceUsed')/128,0) as space_used_mb
	,vfs.io_stall_queued_read_ms AS [rg_read_stall_ms]
	,vfs.io_stall_queued_write_ms AS [rg_write_stall_ms]
FROM [sys].[dm_io_virtual_file_stats](NULL,NULL) AS vfs
LEFT OUTER join sys.database_files b 
	ON b.file_id = vfs.file_id
END
ELSE

BEGIN
DECLARE @SqlStatement AS nvarchar(max);

	SET @SqlStatement = N'
	SELECT
		''sqlserver_database_io'' AS [measurement]
		,REPLACE(@@SERVERNAME,''\'','':'') AS [sql_instance]
		,DB_NAME(vfs.[database_id]) AS [database_name]
		,COALESCE(mf.[physical_name],''RBPEX'') AS [physical_filename]	--RPBEX = Resilient Buffer Pool Extension
		,COALESCE(mf.[name],''RBPEX'') AS [logical_filename]	--RPBEX = Resilient Buffer Pool Extension
		,mf.[type_desc] AS [file_type]
		,vs.[volume_mount_point]
		,IIF( RIGHT(vs.[volume_mount_point],1) = ''\''	/*Tag value cannot end with \ */
			,LEFT(vs.[volume_mount_point],LEN(vs.[volume_mount_point])-1)
			,vs.[volume_mount_point]
		) AS [volume_mount_point]
		,vfs.[io_stall_read_ms] AS [read_latency_ms]
		,vfs.[num_of_reads] AS [reads]
		,vfs.[num_of_bytes_read] AS [read_bytes]
		,vfs.[io_stall_write_ms] AS [write_latency_ms]
		,vfs.[num_of_writes] AS [writes]
		,vfs.[num_of_bytes_written] AS [write_bytes]
		'
		+ 
		CASE
			WHEN CAST(SERVERPROPERTY('ProductMajorVersion') AS int) <= 11
				/*SQL Server 2012 (ver 11.x) does not have [io_stall_queued_read_ms] and [io_stall_queued_write_ms]*/
				THEN ''
				ELSE N',vfs.io_stall_queued_read_ms AS [rg_read_stall_ms] ,vfs.io_stall_queued_write_ms AS [rg_write_stall_ms]'
		END 
		+
	N'FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
	INNER JOIN sys.master_files AS mf WITH (NOLOCK)
		ON vfs.[database_id] = mf.[database_id] AND vfs.[file_id] = mf.[file_id]
	CROSS APPLY sys.dm_os_volume_stats(vfs.[database_id], vfs.[file_id]) AS vs
	'
	EXEC sp_executesql @SqlStatement

END

I will push the broken code anyway and try to understand what goes wrong later. (I'm still don't get that much when reading GO but I'm working on it)

@Trovalo
Copy link
Collaborator Author

Trovalo commented Mar 9, 2020

I figured it out, the last time I copy/pasted some code I didn't notice that there where 2 columns with the same name. (one has been removed, it was there by error).
Now everything works again. I will run some more tests to confirm that everything works as expected.

@Trovalo
Copy link
Collaborator Author

Trovalo commented Mar 9, 2020

I've run some tests and made some changes.

  • To maintain the widest compatibility, the SERVERPROPERY 'ProductVersion' is used to choose the query, because 'ProductMajorVersion' is missing in the earlier version of SQL 2012 (is available after 2015 patches).
  • The query used by azure SQL DB is now declared as a string and then executed. Otherwise, SQL 2012 tries to compile the code and fails because of the missing columns.

I've tested the whole plugin with multiple instances of different versions, from 2012 to 2019 and everything works fine. Some more testing is welcomed though.

Copy link
Contributor

@denzilribeiro denzilribeiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason these 2 were moved for the Azure case ? They are definitely useful so would leave them as is from original. I am good with all the other cahnces.

    ,ISNULL(size,0)/128 AS current_size_mb
,ISNULL(FILEPROPERTY(b.name,''SpaceUsed'')/128,0) as space_used_mb

@Trovalo
Copy link
Collaborator Author

Trovalo commented Mar 9, 2020

If those are missing then I've made an error, in the Azure SQL DB nothing has changed, not even the file type (which now is kept as is in the on-prem queries).

I've looked at the file and those two fields are still there. Are you looking at the last version?

Copy link
Contributor

@denzilribeiro denzilribeiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, was looking at wrong version

@danielnelson danielnelson added this to the 1.14.0 milestone Mar 10, 2020
@danielnelson danielnelson merged commit 1601a06 into influxdata:master Mar 10, 2020
@danielnelson danielnelson added the fix pr to fix corresponding bug label Mar 10, 2020
Trovalo added a commit to Trovalo/telegraf that referenced this pull request Mar 11, 2020
@denzilribeiro
Copy link
Contributor

Seems like this broke something - pulled latest and getting errors in this plugin - Did you test?
./telegraf --input-filter sqlserver --test > output.txt

2020-03-11T20:11:47Z E! [inputs.sqlserver] Error in plugin: sql: expected 17 destination arguments in Scan, not 16
2020-03-11T20:11:48Z E! [inputs.sqlserver] Error in plugin: sql: expected 17 destination arguments in Scan, not 16
2020-03-11T20:11:49Z E! [inputs.sqlserver] Error in plugin: sql: expected 17 destination arguments in Scan, not 16

@Trovalo
Copy link
Collaborator Author

Trovalo commented Mar 11, 2020 via email

@denzilribeiro
Copy link
Contributor

This is on SQL DB but doesn't matter on what - run the test irrespective and it throws an error.
./telegraf --input-filter sqlserver --test > output.txt
Currently this is happening as the on-prem query has 16 columns, the SQL DB query has 17 columns returned -

@Trovalo
Copy link
Collaborator Author

Trovalo commented Mar 11, 2020

It's happening because the azure version gets 2 times rg_read_stall_ms which causes this error.

Also I tested with on prem with 2012 and 2019, on the same plugin configuration, one returns 14 columns and the other 16 and it worked perfectly

@denzilribeiro
Copy link
Contributor

Good catch this worked before this PR - I missed catching it when you - I am making a PR anyways here will fix it.

@Trovalo
Copy link
Collaborator Author

Trovalo commented Mar 11, 2020

If you can submit a PR I will be glad, as I said It's hard for me to test this right now (I'm almost offline... the internet is so slow in those days in Italy I am barely able access GitHub, also I need someone to turn on the Azure SQL Db instance which won't happen until tomorrow).

I think the root problem is that the keys (col names) are kept in distinct while the values are not (and it makes sense). In this case you will have 17 columns, with 16 distinct names but 17 values

@denzilribeiro
Copy link
Contributor

I will submit it soon - I think you moved columns around for your PR , aka was working before.
Also am removing the coalesce part for RBPEX for the on-prem query - there is no RBPEX for on-prem neither will you get that if you do an inner join ( row is DBID 0, file ID 0).

@Trovalo Trovalo mentioned this pull request Mar 12, 2020
3 tasks
athoune pushed a commit to bearstech/telegraf that referenced this pull request Apr 17, 2020
veorlo pushed a commit to GlobalNOC/telegraf that referenced this pull request May 4, 2020
HarshitOnGitHub pushed a commit to HarshitOnGitHub/telegraf that referenced this pull request May 8, 2020
idohalevi pushed a commit to idohalevi/telegraf that referenced this pull request Sep 29, 2020
arstercz pushed a commit to arstercz/telegraf that referenced this pull request Mar 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/sqlserver feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin fix pr to fix corresponding bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SqlServer input - Change to "Database IO" measurement
3 participants