
Add a method to provide statistics about CRUD operations #224

Closed
ligurio opened this issue Oct 15, 2021 · 5 comments · Fixed by #244
Labels: feature (A new functionality)

Comments

@ligurio
Member

ligurio commented Oct 15, 2021

Description

This is a raw idea, but I hope it will be useful for CRUD users.

CRUD performs operations on storage servers, and sometimes we need to know how many operations were done on each storage (in tests for #222 and #166). Probably (it's a hypothesis for now) our users need such information too. However, we cannot use box.stat on storage servers, because CRUD is not the only module that uses them: at least vshard can move buckets between servers in the background.

In this ticket we need to understand:

  • the priority of the task,
  • the set of counters CRUD users are interested in,
  • whether to integrate with the metrics module.

Stat counters in Tarantool

CRUD implements the same interface as Tarantool (create, replace, update, delete, etc.), so we can probably inherit the stat interface from Tarantool too. See the box.stat module:

tarantool> box.stat() -- returns 11 tables
---
- DELETE:
    total: 1873949
    rps: 123
  SELECT:
    total: 1237723
    rps: 4099
  INSERT:
    total: 0
    rps: 0
  EVAL:
    total: 0
    rps: 0
  CALL:
    total: 0
    rps: 0
  REPLACE:
    total: 1239123
    rps: 7849
  UPSERT:
    total: 0
    rps: 0
  AUTH:
    total: 0
    rps: 0
  ERROR:
    total: 0
    rps: 0
  EXECUTE:
    total: 0
    rps: 0
  UPDATE:
    total: 0
    rps: 0
...

Stat counters in graphql

graphql.0 also has a method to provide statistics:

function statistics.new(opts)
    local opts = opts or {}
    local resulting_object_cnt_max = opts.resulting_object_cnt_max
    local fetched_object_cnt_max = opts.fetched_object_cnt_max

    return setmetatable({
        resulting_object_cnt = 0,          -- retire
        fetches_cnt = 0,                   -- fetch
        fetched_object_cnt = 0,            -- fetch
        full_scan_cnt = 0,                 -- fetch
        index_lookup_cnt = 0,              -- fetch
        cache_hits_cnt = 0,                -- cache lookup
        cache_hit_objects_cnt = 0,         -- cache lookup
        limits = {
            resulting_object_cnt_max = resulting_object_cnt_max, -- retire limit
            fetched_object_cnt_max = fetched_object_cnt_max,     -- fetch limit
        }
    }, {
        __index = {
            objects_fetched = objects_fetched,
            objects_retired = objects_retired,
            cache_lookup = cache_lookup,
        }
    })
end
@Totktonada added the feature (A new functionality) and teamE labels Oct 15, 2021
Totktonada added a commit that referenced this issue Oct 18, 2021
I don't want to lean on box.stat() information here, because I don't
control all iproto calls to storages: say, the vshard rebalancer may
perform them in the background.

Instead, I wrapped the particular storage function I'm interested in.

The goal is to be able to determine how many storages are involved in
a select/pairs request.

It is implemented as a helper for testing, but hopefully we'll implement
some nice statistics as part of the module in the future (see #224).

Part of #220
@Totktonada
Member

NB: Whatever we implement, add it to the default metrics.

@Totktonada
Member

Raw thoughts about implementation.

We can decouple the statistics implementation from the main crud code for readability. We should provide the ability to set a callback on the storage and on the router that is called before/after executing a function that serves insert/delete/select/etc. (or one callback for all of them, with the operation type passed to it). The return value of the 'before' callback should be passed to the 'after' one, so we can store the start time and calculate the request duration. Alternatively, a context table could be created per request and passed to both the 'before' and 'after' callbacks.

After that we can place all statistics code in its own module within crud, or even implement it entirely in metrics. I would prefer the former, because it may be useful to inspect crud's statistics interactively from the tarantool console (say, if monitoring is not configured for the application).

Do we need to pass statistics information from storage to router together with a response? If all nodes are monitored, it does not look necessary.
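The before/after callback scheme with a shared per-request context table could be sketched in plain Lua like this (all names here are hypothetical illustrations, not from the crud codebase):

```lua
-- Hypothetical sketch: wrap an operation handler with 'before'/'after'
-- callbacks that share a context table, so 'after' can compute the
-- request duration from the start time stored by 'before'.
local function wrap_with_stats(op_name, handler, before_cb, after_cb)
    return function(...)
        local ctx = { op = op_name, start = os.clock() }
        before_cb(ctx)
        local result = handler(...)
        after_cb(ctx)
        return result
    end
end

-- Usage: collect per-operation call counts and total execution time.
stats = {}
local function before_cb(ctx) end
local function after_cb(ctx)
    local entry = stats[ctx.op] or { count = 0, time = 0 }
    entry.count = entry.count + 1
    entry.time = entry.time + (os.clock() - ctx.start)
    stats[ctx.op] = entry
end

local insert = wrap_with_stats('insert', function(t) return t end,
    before_cb, after_cb)
insert({1, 'a'})
-- stats.insert.count is now 1
```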

@DifferentialOrange
Member

DifferentialOrange commented Nov 19, 2021

One of the main questions is how crud statistics will relate to the metrics module.

There are two possible options:

  1. the crud rock depends on metrics,
  2. the crud rock does not depend on metrics.

The plan is to support crud statistics such as request counts and request latency. While a request count can be maintained with a plain number variable, latency statistics are much more complicated. To aggregate execution times into something useful, we need a tool like a histogram or a summary with quantiles. It would be unwise to reimplement that here, because we already have it in metrics; in particular, there is a high-load summary collector with aging, based on ffi and non-trivial algorithms, which has been tested in production. The drawback of the "crud rock depends on metrics" approach speaks for itself: it is an extra dependency between two widely used Tarantool packages. Many projects already depend on both crud and metrics, and adding this dependency complicates things. A "pessimistic" or "greater or equal" metrics version constraint in the rockspec (~> or >=) may help, but since the current metrics version policy is 0.Y.Z with a new non-breaking 0.Y+1.0 released every couple of months, it may not.

So I think the answer is

  • the crud rock depends on metrics.
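For illustration, the two constraint styles mentioned above would appear in a rockspec roughly like this (an assumed fragment, not the actual crud rockspec):

```lua
-- Illustrative rockspec fragment. A '>=' constraint accepts any newer
-- metrics release; a pessimistic '~>' constraint pins the 0.Y series,
-- which is restrictive when a new 0.Y+1.0 ships every couple of months.
dependencies = {
    'lua >= 5.1',
    'metrics >= 0.10.0',   -- or 'metrics ~> 0.10' for a pessimistic pin
}
```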

@DifferentialOrange
Member

The other question is how to enable/disable metrics for crud. I think it is a reasonable requirement to have such a handle.

metrics since 0.10.0 (tarantool/metrics#222) supports enabling/disabling groups of metrics. This mechanism works both with vanilla Tarantool and has a convenient Cartridge integration through configuration. Currently there is no ability to add custom groups of default metrics from the outside. I see two solutions:

  1. add a feature to support external groups of default metrics,
  2. add a group of crud metrics to the list of default metrics in metrics.

Since there are already "optional" groups of default metrics (like cartridge metrics or luajit metrics) that work when they can (i.e. the required cartridge/Tarantool version is available), I don't expect any problems with the second one. Both solutions should work fine, but either integration should be discussed with the metrics maintainers.

As for the handles, we can describe them in crud code and then run them from the metrics enable/disable process. That way we will be able to enable them even without metrics, and at the same time integrate them into the common pipeline. But it will not make much sense if we use metrics collectors and thus depend on the metrics package anyway.
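Such a handle could be sketched as a small registry that describes the observation callbacks once and switches them on or off as a group, so an external module could drive the same switch (a hypothetical sketch, not crud's actual API):

```lua
-- Hypothetical sketch of an enable/disable handle for a group of
-- statistics callbacks. While disabled, observations are no-ops.
registry = { enabled = false, callbacks = {} }

function registry.register(name, cb)
    registry.callbacks[name] = cb
end

function registry.enable()
    registry.enabled = true
end

function registry.disable()
    registry.enabled = false
end

function registry.observe(name, ...)
    if registry.enabled and registry.callbacks[name] then
        registry.callbacks[name](...)
    end
end

-- Usage: counting is a no-op until the group is enabled.
count = 0
registry.register('insert', function() count = count + 1 end)
registry.observe('insert')  -- disabled: ignored
registry.enable()
registry.observe('insert')  -- counted
-- count == 1 here
```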

@DifferentialOrange
Member

metrics collector values can be inspected in the console without any monitoring system attached, though the output may not be very human-friendly (but still readable). However, if we use metrics collectors, there will be no built-in way to see the current rps as in the box.* response, only summary details (quantiles with aging will show the current situation, but a total request count may not be very useful).
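A box.stat()-style rps could in principle be derived from a monotonically growing total counter by sampling it twice, as this small sketch shows:

```lua
-- Sketch: derive a current rate from two samples of a monotonically
-- growing 'total' counter taken dt seconds apart.
function rps(total_prev, total_now, dt)
    return (total_now - total_prev) / dt
end

-- Usage: SELECT total grew from 1237723 to 1241822 over one second,
-- i.e. 4099 requests per second for that window.
print(rps(1237723, 1241822, 1))
```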

As for sending info between storages and routers, I don't think it is necessary. We can measure the full request execution time on the router without getting info from the storage, since each request starts and finishes on a router. I see the following model:

  • the router collects info on requests and their latency on its side,
  • each storage collects info on requests and their latency on its side.

This info could then be visualized separately, so we may analyze

  • how requests are distributed between storages,
  • whether requests are stuck on a storage or on a router/on the way between router and storage (the latter two will be inseparable).

DifferentialOrange added a commit that referenced this issue Dec 3, 2021
Add statistics module for collecting metrics of CRUD operations on
router. Wrap all CRUD operation calls in statistics collector.
Statistics may be disabled and re-enabled. Some internal methods of
select/pairs were reworked or extended to provide statistics info.
`cursor` returned from storage on select/pairs now contains stats of
tuple count and lookup count. All changes are backward-compatible and
should work even with older versions of crud routers and storages.

Part of #224
DifferentialOrange added a commit that referenced this issue Dec 3, 2021
Use the built-in `crud.stats()` info instead of the `storage_stat`
helper in tests to track map reduce calls.

Part of #224
DifferentialOrange added a commit that referenced this issue Dec 6, 2021
If `metrics` [1] is found, metrics collectors are used to store
statistics. `metrics >= 0.5.0` is required, while at least `0.9.0`
is recommended to support age buckets in summary. The metrics are part
of the global registry and can be exported together (e.g. to Prometheus)
with default tools without any additional configuration. Disabling
stats destroys the collectors.

If `metrics` is found, `latency` statistics are changed to the 0.99
quantile of request execution time (with aging).

Add CI matrix to run tests with `metrics` installed.

1. https://github.com/tarantool/metrics

Closes #224
DifferentialOrange added a commit that referenced this issue Dec 6, 2021
Since the introduction of the CI matrix for the `metrics` install, more
than one pipeline must be taken into consideration when computing
coverage. See more in the documentation [1].

1. https://docs.coveralls.io/parallel-build-webhook

Follows up #224
DifferentialOrange added a commit that referenced this issue Feb 24, 2022
Before this patch, performance tests ran together with the unit and
integration tests with the `--coverage` flag. Coverage analysis slowed
the performance test results down by a factor of 10-15. For the metrics
integration it resulted in timeout errors and a performance drop that
is not reproduced with coverage disabled. Moreover, before this patch
log capture was disabled and performance tests did not display any
results after a run. Now performance tests also run in a separate CI
job.

After this patch, `make -C build coverage` will run a lightweight
version of the performance test. `make -C build performance` will run
the real performance tests.

You can paste the output table to GitHub [1].

This patch also reworks the current performance test. It adds new cases
to compare module performance with and without statistics and statistic
wrappers, compares different metrics drivers, and reports new info:
average call time and max call time.

Performance test result: the overhead is 3-10% in case of the `local`
driver and 5-15% in case of the `metrics` driver, up to 20% for
`metrics` with quantiles. Based on several runs on an HP ProBook 440 G7
i7/16Gb/256SSD.

1. https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/organizing-information-with-tables

Closes #233, follows up #224
DifferentialOrange added a commit that referenced this issue Feb 24, 2022
If `metrics` [1] is found, you can use metrics collectors to store
statistics. `metrics >= 0.10.0` is required to use the metrics driver.
(`metrics >= 0.9.0` is required to use summary quantiles with
age buckets. `metrics >= 0.5.0, < 0.9.0` is unsupported
due to a quantile overflow bug [2]. `metrics == 0.9.0` has a bug that
does not permit creating a summary collector without quantiles [3].
In fact, a user may use `metrics >= 0.5.0`, `metrics != 0.9.0`
to use metrics without quantiles, and `metrics >= 0.9.0`
to use metrics with quantiles. But this is confusing,
so let's use a single restriction for both cases.)

The metrics are part of global registry and can be exported together
(e.g. to Prometheus) with default tools without any additional
configuration. Disabling stats destroys the collectors.

Metrics collectors are used by default if supported. To explicitly set
driver, call `crud.cfg{ stats = true, stats_driver = driver }`
('local' or 'metrics'). To enable quantiles, call
```
crud.cfg{
    stats = true,
    stats_driver = 'metrics',
    stats_quantiles = true,
}
```
With quantiles, `latency` statistics are changed to the 0.99 quantile
of request execution time (with aging). Quantile computation increases
performance overhead by up to 10% when used in statistics.

Add CI matrix to run tests with `metrics` installed. To get full
coverage on coveralls, #248 must be resolved.

1. https://github.com/tarantool/metrics
2. tarantool/metrics#235
3. tarantool/metrics#262

Closes #224
DifferentialOrange added a commit that referenced this issue Feb 25, 2022
After this patch, statistics `select` section additionally contains
`details` collectors.

```
crud.stats('my_space').select.details
---
- map_reduces: 4
  tuples_fetched: 10500
  tuples_lookup: 238000
...
```

`map_reduces` is the count of planned map reduces (including those not
executed successfully). `tuples_fetched` is the count of tuples fetched
from storages during execution, and `tuples_lookup` is the count of
tuples looked up on storages while collecting responses for calls
(including scrolls for multibatch requests). Details data is updated as
part of the request process, so you may see updated details before a
`select`/`pairs` call has finished and been observed by the count,
latency and time collectors.

Part of #224
DifferentialOrange added a commit that referenced this issue Feb 25, 2022
`crud.len` supports using a space id instead of a name. After this
patch, the stats wrapper supports mapping id to name.

Since using space id is a questionable pattern (see #255), this commit
may be reverted later.

Part of #224
DifferentialOrange added a commit that referenced this issue Feb 25, 2022
If `metrics` [1] found, you can use metrics collectors to store
statistics. `metrics >= 0.10.0` is required to use metrics driver.
(`metrics >= 0.9.0` is required to use summary quantiles with
age buckets. `metrics >= 0.5.0, < 0.9.0` is unsupported
due to quantile overflow bug [2]. `metrics == 0.9.0` has bug that do
not permits to create summary collector without quantiles [3].
In fact, user may use `metrics >= 0.5.0`, `metrics != 0.9.0`
if he wants to use metrics without quantiles, and `metrics >= 0.9.0`
if he wants to use metrics with quantiles. But this is confusing,
so let's use a single restriction for both cases.)

The metrics are part of global registry and can be exported together
(e.g. to Prometheus) with default tools without any additional
configuration. Disabling stats destroys the collectors.

Metrics collectors are used by default if supported. To explicitly set
driver, call `crud.cfg{ stats = true, stats_driver = driver }`
('local' or 'metrics'). To enable quantiles, call
```
crud.cfg{
    stats = true,
    stats_driver = 'metrics',
    stats_quantiles = true,
}
```
With quantiles, `latency` statistics are changed to 0.99 quantile
of request execution time (with aging). Quantiles computations increases
performance overhead up to 10% when used in statistics.

Add CI matrix to run tests with `metrics` installed. To get full
coverage on coveralls, #248 must be resolved.

1. https://github.com/tarantool/metrics
2. tarantool/metrics#235
3. tarantool/metrics#262

Closes #224
DifferentialOrange added a commit that referenced this issue Feb 25, 2022
Before this patch, performance tests ran together with unit and
integration with `--coverage` flag. Coverage analysis cropped the
result of performance tests to 10-15 times. For metrics integration
it resulted in timeout errors and drop of performance which is not
reproduces with coverage disabled. Moreover, before this patch log
capture was disabled and performance tests did not displayed any
results after run. Now performance tests also run is separate CI job.

After this patch, `make -C build coverage` will run lightweight
version of performance test. `make -C build performance` will run real
performance tests.

You can paste output table to GitHub [1].

This path also reworks current performance test. It adds new cases to
compare module performance with or without statistics, statistic
wrappers and compare different metrics drivers and reports new info:
average call time and max call time.

Performance test result: overhead is 3-10% in case of `local` driver and
5-15% in case of `metrics` driver, up to 20% for `metrics` with
quantiles. Based on several runs on HP ProBook 440 G7 i7/16Gb/256SSD.

1. https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/organizing-information-with-tables

Closes #233, follows up #224
DifferentialOrange added a commit that referenced this issue Feb 25, 2022
After this patch, statistics `select` section additionally contains
`details` collectors.

```
crud.stats('my_space').select.details
---
- map_reduces: 4
  tuples_fetched: 10500
  tuples_lookup: 238000
...
```

`map_reduces` is the count of planned map reduces (including those not
executed successfully). `tuples_fetched` is the count of tuples fetched
from storages during execution, `tuples_lookup` is the count of tuples
looked up on storages while collecting responses for calls (including
scrolls for multibatch requests). Details data is updated as part of
the request process, so you may get new details before `select`/`pairs`
call is finished and observed with count, latency and time collectors.

Part of #224
DifferentialOrange added a commit that referenced this issue Feb 25, 2022
Use in-built `crud.stats()` info instead on `storage_stat` helper
in tests to track map reduce calls.

Part of #224
DifferentialOrange added a commit that referenced this issue Feb 25, 2022
`crud.len` supports using space id instead of name. After this patch,
stats wrapper support mapping id to name.

Since using space id is a questionable pattern (see #255), this commit
may be reverted later.

Part of #224
DifferentialOrange added a commit that referenced this issue Feb 25, 2022
If `metrics` [1] found, you can use metrics collectors to store
statistics. `metrics >= 0.10.0` is required to use metrics driver.
(`metrics >= 0.9.0` is required to use summary quantiles with
age buckets. `metrics >= 0.5.0, < 0.9.0` is unsupported
due to quantile overflow bug [2]. `metrics == 0.9.0` has bug that do
not permits to create summary collector without quantiles [3].
In fact, user may use `metrics >= 0.5.0`, `metrics != 0.9.0`
if he wants to use metrics without quantiles, and `metrics >= 0.9.0`
if he wants to use metrics with quantiles. But this is confusing,
so let's use a single restriction for both cases.)

The metrics are part of global registry and can be exported together
(e.g. to Prometheus) with default tools without any additional
configuration. Disabling stats destroys the collectors.

Metrics collectors are used by default if supported. To explicitly set
driver, call `crud.cfg{ stats = true, stats_driver = driver }`
('local' or 'metrics'). To enable quantiles, call
```
crud.cfg{
    stats = true,
    stats_driver = 'metrics',
    stats_quantiles = true,
}
```
With quantiles, `latency` statistics are changed to 0.99 quantile
of request execution time (with aging). Quantiles computations increases
performance overhead up to 10% when used in statistics.

Add CI matrix to run tests with `metrics` installed. To get full
coverage on coveralls, #248 must be resolved.

1. https://github.com/tarantool/metrics
2. tarantool/metrics#235
3. tarantool/metrics#262

Closes #224
DifferentialOrange added a commit that referenced this issue Feb 25, 2022
Before this patch, performance tests ran together with unit and
integration with `--coverage` flag. Coverage analysis cropped the
result of performance tests to 10-15 times. For metrics integration
it resulted in timeout errors and drop of performance which is not
reproduces with coverage disabled. Moreover, before this patch log
capture was disabled and performance tests did not displayed any
results after run. Now performance tests also run is separate CI job.

After this patch, `make -C build coverage` will run lightweight
version of performance test. `make -C build performance` will run real
performance tests.

You can paste output table to GitHub [1].

This patch also reworks the current performance test. It adds new cases
to compare module performance with and without statistics and statistics
wrappers, compares different metrics drivers, and reports new info:
average call time and max call time.

Performance test results: the overhead is 3-10% with the `local` driver,
5-15% with the `metrics` driver, and up to 20% for `metrics` with
quantiles. Based on several runs on an HP ProBook 440 G7 (i7/16Gb/256SSD).

1. https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/organizing-information-with-tables

Closes #233, follows up #224
DifferentialOrange added a commit that referenced this issue Feb 25, 2022
Add statistics module for collecting metrics of CRUD operations on
router. Wrap all CRUD operation calls in the statistics collector.
Statistics must be enabled manually with `crud.cfg`. They can be
disabled, restarted or re-enabled later.

This patch introduces `crud.cfg`. `crud.cfg` is a tool to set module
configuration. It is similar to Tarantool's `box.cfg`, although we don't
need to call it to bootstrap the module -- it is used only to change
configuration. `crud.cfg` is a callable table. To change configuration,
call it: `crud.cfg{ stats = true }`. You can read table contents as
with an ordinary table, but do not change them directly -- use a call
instead. Table contents are immutable and use a proxy approach
(see [1, 2]). Iterating through `crud.cfg` with `pairs` is not supported
yet, refer to #265.
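
The configuration flow above can be sketched as follows (the values in
comments are illustrative):

```lua
local crud = require('crud')

-- Enable statistics: the callable table acts like a function.
crud.cfg{ stats = true }

-- Read configuration like an ordinary table.
print(crud.cfg.stats)          -- true

-- Direct writes go through the read-only proxy and are rejected;
-- change configuration only through a call:
-- crud.cfg.stats = false      -- error
crud.cfg{ stats = false }      -- the correct way to disable
```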

`crud.stats()` returns

```
---
- spaces:
    my_space:
      insert:
        ok:
          latency: 0.002
          count: 19800
          time: 39.6
        error:
          latency: 0.000001
          count: 4
          time: 0.000004
...
```

The `spaces` section contains statistics for each observed space.
If an operation has never been called for a space, the corresponding
field will be empty. If no requests have been made for a
space, it will not be represented. Space data is based on
client requests rather than the storage schema, so requests
for non-existing spaces are also collected.

Possible statistics operation labels are
`insert` (for `insert` and `insert_object` calls),
`get`, `replace` (for `replace` and `replace_object` calls), `update`,
`upsert` (for `upsert` and `upsert_object` calls), `delete`,
`select` (for `select` and `pairs` calls), `truncate`, `len`, `count`
and `borders` (for `min` and `max` calls).

Each operation section consists of separate collectors
for successful calls and errors (both thrown errors and `nil, err`
returns). `count` is the total request count since instance start
or stats restart. `latency` is the average request execution time,
and `time` is the total request execution time.
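
The sample output above is internally consistent with these
definitions, since latency is total time divided by count:

```lua
-- Checking latency = time / count against the sample figures above.
local function close(a, b) return math.abs(a - b) < 1e-9 end

assert(close(39.6 / 19800, 0.002))        -- insert.ok
assert(close(0.000004 / 4, 0.000001))     -- insert.error
```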

Since the `pairs` request behavior differs from any other CRUD request,
its statistics collection also behaves specifically. Statistics (the
`select` section) are updated after the `pairs` cycle is finished: either
you have iterated through all records or an error was thrown.
If your `pairs` cycle was interrupted with `break`, statistics will
be collected when the `pairs` objects are cleaned up by the Lua garbage
collector.
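
A hedged illustration of that edge case (assuming stats are enabled and
a space named `my_space` exists):

```lua
local crud = require('crud')

for _, tuple in crud.pairs('my_space') do
    break  -- interrupt the iteration early
end

-- The select statistics are not updated yet: the pairs object is
-- still alive. Forcing a GC cycle triggers the collection.
collectgarbage('collect')
-- crud.stats('my_space').select now reflects the interrupted call.
```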

Statistics are preserved between package reloads. Statistics are
preserved between Tarantool Cartridge role reloads [3] if CRUD Cartridge
roles are used.

1. http://lua-users.org/wiki/ReadOnlyTables
2. tarantool/tarantool#2867
3. https://www.tarantool.io/en/doc/latest/book/cartridge/cartridge_api/modules/cartridge.roles/#reload

Part of #224
DifferentialOrange added a commit that referenced this issue Feb 25, 2022
In some cases LuaJIT optimizes the use of the gc_observer table that
handles `pairs` object gc. It led to incorrect behavior (ignoring in
stats some `pairs` requests interrupted with `break`) and test failures
in some cases (for example, if you run only the stats unit tests).

Part of #224
DifferentialOrange added a commit that referenced this issue Feb 25, 2022
After this patch, statistics `select` section additionally contains
`details` collectors.

```
crud.stats('my_space').select.details
---
- map_reduces: 4
  tuples_fetched: 10500
  tuples_lookup: 238000
...
```

`map_reduces` is the count of planned map-reduces (including those not
executed successfully). `tuples_fetched` is the count of tuples fetched
from storages during execution; `tuples_lookup` is the count of tuples
looked up on storages while collecting responses for calls (including
scrolls for multibatch requests). Details data is updated as part of
the request process, so you may get new details before a `select`/`pairs`
call is finished and observed by the count, latency and time collectors.

Part of #224
DifferentialOrange added a commit that referenced this issue Feb 25, 2022
Use the built-in `crud.stats()` info instead of the `storage_stat` helper
in tests to track map-reduce calls.

Part of #224
DifferentialOrange added a commit that referenced this issue Feb 25, 2022
`crud.len` supports using a space id instead of a name. After this patch,
the stats wrapper supports mapping id to name.

Since using space id is a questionable pattern (see #255), this commit
may be reverted later.

Part of #224