Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create financial-focused OHLC aggregation #445

Closed
ryanbooz opened this issue Jun 10, 2022 · 13 comments · Fixed by #596
Closed

Create financial-focused OHLC aggregation #445

ryanbooz opened this issue Jun 10, 2022 · 13 comments · Fixed by #596
Labels
feature-request And area of analysis that could be made easier

Comments

@ryanbooz
Copy link

Is your feature request related to a problem? Please describe.
Many users of TimescaleDB utilize financial time-series data. Many of our examples and tutorials show how to do basic OHLC bucketing, saving at least four columns using MAX/MIN/FIRST/LAST as the functions. It's a repetitive processes and some functions don't work well with re-aggregation - and even when they do (MAX/MIN), most of our documentation/context warns users against using functions to re-aggregate unless they were designed to do so.

Describe the solution you'd like
Saving an aggregate that stores the necessary data that can then be accessed at query time to return the appropriate value for financial charts. At a minimum, users would have to be able to easily identify the Open/High/Low/Closing value within a bucket. Allowing re-aggregation over smaller buckets would be powerful, which means that users could now satisfy multiple bucket widths from one CAGG, rather than having to create various granularities. (1-min, 5-min, 10-min, 60-min).

Describe alternatives you've considered
Currently, the only solution we really have to offer people is to follow our schema examples for OHLC data using separate aggregate functions and creating multiple CAGGs. The big challenge here, is that developing an application to view multiple "zoom levels" isn't easy without more complicated UNION examples that pull from different CAGGs depending on the "zoom level" (overall query length). Being able to using the same SQL query and simply re-aggregate to a different time bucket would be 🔥

Additional context
@davidkohn88 suggested I jump-start the conversation by creating the issue 😉. We have so many users that do this kind of aggregation. If we could simplify their storage (one CAGG) and query (only have to change the time bucket interval value) - this could simplify how users query data significantly.

@ryanbooz ryanbooz added the feature-request And area of analysis that could be made easier label Jun 10, 2022
@jerryxwu
Copy link
Contributor

jerryxwu commented Jun 15, 2022

@ryanbooz @davidkohn88 One obvious UX design that I can think of looks like this,

Create a 1-minute CAGG from a raw trading data table called “raw_trades”.

CREATE MATERIALIZED VIEW ohlcv_1m
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS time,
       symbol, 
       ts_ohlcv_agg(ts, price, volume) AS ohlcv
FROM raw_trades
GROUP BY time, symbol

Rollup the 1-minute ohlcv aggregates to the desired time interval at query time and get open, high, low, close and volume data through accessor functions ts_open, ts_high, and etc.

SELECT time_bucket('1 hour', time) AS hour,
       ts_open(rollup(ohlcv)) AS open,
       ts_high(rollup(ohlcv)) AS high,
       ts_low(rollup(ohlcv)) AS low,
       ts_close(rollup(ohlcv) AS close,
       ts_volume(rollup(ohlcv) AS volume
FROM ohlcv_1m
WHERE time > $time_from AND
      symbol = $symbol
GROUP BY hour

This design follows our established two-step aggregation design pattern. Do you feel that's sufficient UX improvement worth pursuing?

@davidkohn88
Copy link
Contributor

I think we can remove the ts_ part of these, but otherwise, that's exactly the sort of thing I think we were thinking about. I think the UX improvement is pretty significant in terms of the way it works with continuous aggregates.

I think also that making this specific to the financial use case and using their terminology would be great here as well and act as an entrance point for folks where we might get more feedback on other things we could be doing with something like this.

@ryanbooz
Copy link
Author

Yep. Overall I think this could be a huge win. The one challenge will be volume as we're finding that different APIs provide it in different ways. The provider we chose (allowed us to use their data for free!) provides an incremental value with each trade, but only during trading hours.

Other APIs provide the volume of each trade and the user has to determine it (with a SUM or similar).

So, knowing the basic UX of how most APIs tend to provide data might be helpful here. If it's only two or three ways, maybe we have different accessors depending on what's being stored in volume?

@davidkohn88
Copy link
Contributor

Leaving out the volume part might also be easier here, though creating a weighted average price could be a useful metric we could add on to the typical that would be useful there if we had volume.

I think we'd need an argument to agg about how the volume works, though we might be able to just store the results for both things, ie, store the sum and the min/max of volume and depending on the accessor either output the sum or the max -min depending on the type...hmmm...could end up overflowing in some cases, so would need to account for that...but not crazy...

@ryanbooz
Copy link
Author

I like the weighted average price idea!

The argument agg/accessor is an interesting idea. I suppose one of the benefits of storing both could be that we future proof the data as an API might change how they'd present the data over time, but users wouldn't (necessarily) have to change the CAGG to accommodate that... just the query. 🤔

@jerryxwu
Copy link
Contributor

I also like the idea of volume-weighted average. This is how Investopedia defines it:

The volume-weighted average price (VWAP) is a technical analysis indicator used on intraday charts that resets at the start of every new trading session.

It appears to me that we may want to think this problem in a broader stock/security trading context. For example, once a CAGG is defined, we need to have a way to deal with stock splits?

@zseta
Copy link

zseta commented Jun 28, 2022

I think this feature would greatly help users who deal with financial data. Additional values that we could calculate within the aggregation are technical indicators. AFAIK, most if not all technical indicators can be calculated from the raw tick data. On Twelve Data Docs you can find a list of technical indicators that we could consider. You can also see which are in high demand that could be prioritized to implement first.
An example would look like this (extending the previous SQL example):

SELECT time_bucket('1 hour', time) AS hour,
       open(rollup(ohlcv)) AS open,
       high(rollup(ohlcv)) AS high,
       low(rollup(ohlcv)) AS low,
       close(rollup(ohlcv) AS close,
       rsi(rollup(ohlcv) AS rsi_indicator, --calculates the magnitude of a price change
       macd(rollup(ohlcv) AS macd_indicator, --Moving Average Convergence Divergence
       ema(rollup(ohlcv) AS ema --Exponential Moving Average, places greater importance on recent data points
FROM ohlcv_1m
WHERE time > $time_from AND symbol = $symbol
GROUP BY hour

@zhenyakovalyov
Copy link

as a related question to technical indicator functionality - is there a lower level API that one can use to define a custom function (such as a volume weighted average price calculation)?

@rtwalker
Copy link
Contributor

@zhenyakovalyov we're planning to release an OHLC aggregate (without volume taken into account) in the very near future. This was a more straightforward implementation and gives us some more time to think about how we'd like to treat volume data and calculations.

We're aware of volume being reported as either

  • per trade volume
  • per day running/adjusted/incremental volume

Do you have an expectation as one of these being the default case instead of the other? For example, we're considering a finance_agg (placeholder name) aggregate that might do OHLC, moving average, and other basic technical indicator calculations with the timestamp, price, volume inputs and an optional, default parameter to indicate the volume format that could look like:

CREATE MATERIALIZED VIEW ohlcv
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS time
     , symbol
     , finance_agg(ts, price, volume)
  FROM raw_trades
 GROUP BY time, symbol

or

CREATE MATERIALIZED VIEW ohlcv
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) AS time
     , symbol
     , finance_agg(ts, price, volume, incremental_volume => true)
  FROM raw_trades
 GROUP BY time, symbol

We would love feedback on this kind of API while we're still in the early stages of designing it! For example, is what you had in mind more like another argument, e.g. finance_agg(..., weight_price_by_volume => true) or something else?

@zseta
Copy link

zseta commented Aug 11, 2022

I like this API so far! 💯
Some feedback if it helps: About volume, in my experience there can be three ways to calculate volume within the aggregate function, depending on the raw data:

  • scenario 1: raw data contains no volume field at all (eg.: one of the polygon APIs works this way AFAIK), in this case it should be aggregated by doing a count(*) or similar (and so volume needs to be an optional parameter for the agg function because in this case it doesn't exist)
  • scenario 2: raw data contains volume in an incremental way, in this case we need to find the last(volume, time) value to get the aggregated volume for the chosen time_bucket() interval
  • scenario 3: raw data contains volume and other fields in an already aggregated form, this means that the raw data is already aggregated (there are already open-high-low-close-volume fields) and user just wants to use a larger time_bucket interval (eg to go from 1m to 1d). In this case the current API could not be used. Do we want to accommodate this use case as well? (or let's just focus on non-aggreagted raw data)

bors bot added a commit that referenced this issue Aug 16, 2022
496: Create OHLC aggregate and accompanying accessors r=rtwalker a=rtwalker

Begins to address #445. 

Introduces OHLC aggregate called as `toolkit_experimental.ohlc(timestamp, price)`. 

The following accessors are provided for use with the `OpenHighLowClose` aggregate:
* `open`, `high`, `low`, `close`

along with the timestamp-returning versions:
* `open_at`, `high_at`, `low_at`, `close_at`

Additionally, there is a new `toolkit_experimental.rollup()` aggregate that takes `OpenHighLowClose` aggregates as input.

In the event that multiple rows contain the highest or lowest value, only the earliest occurrence is retained. Similarly, should multiple rows contain the opening or closing timestamp, this implementation only updates the open/close points with a new value when the new timestamp is strictly earlier/later than current timestamp.

Additional work on #445 remains to address handling volume (and possibly creating technical indicators).

Co-authored-by: Ryan Walker <rwalker@timescale.com>
@SuperDaveOsbourne
Copy link

If there were two columns, one for period volume and one for aggregated volume as many data companies deliver it, then less of an issue. If only one then I would look to have just the period volume as it's a lot easier (maybe less costly?) to sum the volume column than have a window function where you subtract the previous from the current volume column. Could that be a switch, if only one column would deliver the difference between the two previous current) as an option? A kind of built-in window function for volume?

@rtwalker
Copy link
Contributor

rtwalker commented Nov 8, 2022

closed by #596

@rtwalker rtwalker closed this as completed Nov 8, 2022
@rnbokade
Copy link

Is there a scope to add another function to calculate proce change as (current close - previous close) by using lag() and accounting for resets as such...
Same can also be done for change in volume where volume is given as cumulative volume.

Also there could be extension to include another value called "open interest" which is generally a very imp indicator in derivative market.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request And area of analysis that could be made easier
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants