From fb9d4e7afe68a21e65d3f4407e675cab28974654 Mon Sep 17 00:00:00 2001
From: Pavithra Eswaramoorthy
Date: Mon, 13 May 2024 11:38:39 +0200
Subject: [PATCH] Documentation Restructure
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fixes #111

The structure has diverged a fair bit from the initial proposal:

- The Getting Started tutorials will cover a lot of csp.baselib usage details, hence it's not a huge focus in the concepts section (→ already tracking in a separate issue)
- The "How-to" guides only have the migrated content from the old docs right now; we'll be updating all the pages to follow a "how-to" format (→ opened a new issue)
- The docs authoring workflow will change a little with the new GitHub sidebar; we'll add relevant docs to make this easier

Note: This is a manual squash and rewrite of 7894f05069dbd0fa15446d95cf2701aa0a40454d
---
 docs/wiki/0.-Introduction.md                  |  871 ----
 docs/wiki/5.-Adapters.md                      | 1517 -----------------
 docs/wiki/6.-Dynamic-Graphs.md                |  110 --
 docs/wiki/9.-Caching.md                       |    3 -
 docs/wiki/Home.md                             |   84 +-
 docs/wiki/_Footer.md                          |    1 +
 docs/wiki/_Sidebar.md                         |   61 +
 docs/wiki/api-references/Base-Adapters-API.md |  110 ++
 .../Base-Nodes-API.md}                        |  284 +--
 .../api-references/Functional-Methods-API.md  |   64 +
 .../Input-Output-Adapters-API.md              |  360 ++++
 .../Math-and-Logic-Nodes-API.md}              |   16 +-
 .../Random-Time-Series-Generators-API.md}     |   17 +-
 .../Statistical-Nodes-API.md}                 |  698 +-------
 .../csp.Struct-API.md}                        |   10 +-
 docs/wiki/api-references/csp.dynamic-API.md   |   49 +
 .../csp.profiler-API.md}                      |   63 +-
 docs/wiki/concepts/Adapters.md                |   15 +
 docs/wiki/concepts/CSP-Graph.md               |  114 ++
 docs/wiki/concepts/CSP-Node.md                |  271 +++
 docs/wiki/concepts/Execution-Modes.md         |  243 +++
 docs/wiki/concepts/Historical-Buffers.md      |  133 ++
 .../Build-CSP-from-Source.md}                 |  149 +-
 docs/wiki/dev-guides/Contribute.md            |    9 +
 docs/wiki/dev-guides/GitHub-Conventions.md    |   73 +
 .../dev-guides/Local-Development-Setup.md     |   87 +
 .../Release-Process.md}                       |  192 +--
 docs/wiki/dev-guides/Roadmap.md               |   17 +
 docs/wiki/get-started/First-Steps.md          |   48 +
 docs/wiki/get-started/Installation.md         |   20 +
 docs/wiki/how-tos/Add-Cycles-in-Graphs.md     |   52 +
 docs/wiki/how-tos/Create-Dynamic-Baskets.md   |   58 +
 docs/wiki/how-tos/Profile-CSP-Code.md         |   77 +
 docs/wiki/how-tos/Use-Statistical-Nodes.md    |  433 +++++
 .../Write-Historical-Input-Adapters.md        |  415 +++++
 docs/wiki/how-tos/Write-Output-Adapters.md    |  317 ++++
 .../how-tos/Write-Realtime-Input-Adapters.md  |  407 +++++
 docs/wiki/references/Examples.md              |    7 +
 docs/wiki/references/Glossary.md              |  142 ++
 39 files changed, 3864 insertions(+), 3733 deletions(-)
 delete mode 100644 docs/wiki/0.-Introduction.md
 delete mode 100644 docs/wiki/5.-Adapters.md
 delete mode 100644 docs/wiki/6.-Dynamic-Graphs.md
 delete mode 100644 docs/wiki/9.-Caching.md
 create mode 100644 docs/wiki/_Footer.md
 create mode 100644 docs/wiki/_Sidebar.md
 create mode 100644 docs/wiki/api-references/Base-Adapters-API.md
 rename docs/wiki/{1.-Generic-Nodes-(csp.baselib).md => api-references/Base-Nodes-API.md} (54%)
 create mode 100644 docs/wiki/api-references/Functional-Methods-API.md
 create mode 100644 docs/wiki/api-references/Input-Output-Adapters-API.md
 rename docs/wiki/{2.-Math-Nodes-(csp.math).md => api-references/Math-and-Logic-Nodes-API.md} (85%)
 rename docs/wiki/{4.-Random-Time-Series-Generation-(csp.random).md => api-references/Random-Time-Series-Generators-API.md} (92%)
 rename docs/wiki/{3.-Statistics-Nodes-(csp.stats).md => api-references/Statistical-Nodes-API.md} (76%)
 rename docs/wiki/{7.-csp.Struct.md => api-references/csp.Struct-API.md} (87%)
 create mode 100644 docs/wiki/api-references/csp.dynamic-API.md
 rename docs/wiki/{8.-Profiler.md => api-references/csp.profiler-API.md} (67%)
 create mode 100644 docs/wiki/concepts/Adapters.md
 create mode 100644 docs/wiki/concepts/CSP-Graph.md
 create mode 100644 docs/wiki/concepts/CSP-Node.md
 create mode 100644 docs/wiki/concepts/Execution-Modes.md
 create mode 100644 docs/wiki/concepts/Historical-Buffers.md
 rename docs/wiki/{98.-Building-From-Source.md => dev-guides/Build-CSP-from-Source.md} (64%)
 create mode 100644 docs/wiki/dev-guides/Contribute.md
 create mode 100644 docs/wiki/dev-guides/GitHub-Conventions.md
 create mode 100644 docs/wiki/dev-guides/Local-Development-Setup.md
 rename docs/wiki/{99.-Developer.md => dev-guides/Release-Process.md} (52%)
 create mode 100644 docs/wiki/dev-guides/Roadmap.md
 create mode 100644 docs/wiki/get-started/First-Steps.md
 create mode 100644 docs/wiki/get-started/Installation.md
 create mode 100644 docs/wiki/how-tos/Add-Cycles-in-Graphs.md
 create mode 100644 docs/wiki/how-tos/Create-Dynamic-Baskets.md
 create mode 100644 docs/wiki/how-tos/Profile-CSP-Code.md
 create mode 100644 docs/wiki/how-tos/Use-Statistical-Nodes.md
 create mode 100644 docs/wiki/how-tos/Write-Historical-Input-Adapters.md
 create mode 100644 docs/wiki/how-tos/Write-Output-Adapters.md
 create mode 100644 docs/wiki/how-tos/Write-Realtime-Input-Adapters.md
 create mode 100644 docs/wiki/references/Examples.md
 create mode 100644 docs/wiki/references/Glossary.md

diff --git a/docs/wiki/0.-Introduction.md b/docs/wiki/0.-Introduction.md
deleted file mode 100644
index ea23d3a3..00000000
--- a/docs/wiki/0.-Introduction.md
+++ /dev/null
@@ -1,871 +0,0 @@

# Graph building concepts

When writing csp code there will be runtime components in the form of `csp.node` methods, as well as graph-building components in the form of `csp.graph` components.

It is important to understand that `csp.graph` components are only executed once, at application startup, in order to construct the graph.
Once the graph is constructed, `csp.graph` code is no longer needed.
Once the graph is run, only inputs, csp.nodes and outputs are active as data flows through the graph, driven by input ticks.
For example, this is a simple bit of graph code:

```python
import csp
from csp import ts
from datetime import datetime


@csp.node
def spread(bid: ts[float], ask: ts[float]) -> ts[float]:
    if csp.valid(bid, ask):
        return ask - bid


@csp.graph
def my_graph():
    bid = csp.const(1.0)
    ask = csp.const(2.0)
    bid = csp.multiply(bid, csp.const(4))
    ask = csp.multiply(ask, csp.const(3))
    s = spread(bid, ask)

    csp.print('spread', s)
    csp.print('bid', bid)
    csp.print('ask', ask)


if __name__ == '__main__':
    csp.run(my_graph, starttime=datetime.utcnow())
```

In this simple example `my_graph` is defined as a `csp.graph` component.
This method will be called once by `csp.run` in order to construct the graph.
`csp.const` defines a constant value as a timeseries which will tick once upon startup (this is effectively an input).

`bid = csp.multiply(bid, csp.const(4))` inserts a `csp.multiply` node to perform timeseries multiplication.
`bid` and `ask` are then connected to the user-defined `csp.node` `spread`.
`bid`/`ask` and the calculated `spread` are then linked to the `csp.print` output to print the results.
In order to help visualize this graph, you can call `csp.show_graph`:

![359407708](https://github.com/Point72/csp/assets/3105306/8cc50ad4-68f9-4199-9695-11c136e3946c)

The result of this would be:

```
2020-04-02 15:33:38.256724 bid:4.0
2020-04-02 15:33:38.256724 ask:6.0
2020-04-02 15:33:38.256724 spread:2.0
```

## Anatomy of a csp.node

At the heart of a calculation graph are the csp.nodes that run the computations.
`csp.node` methods can take any number of scalar and timeseries arguments, and can return 0 → N timeseries outputs.
Timeseries inputs/outputs should be thought of as the edges that connect components of the graph.
These "edges" can tick whenever they have a new value.
Every tick is associated with a value and the time of the tick.
csp.nodes have various other features; here is an example of a csp.node that demonstrates many of them.
Keep in mind that nodes will execute repeatedly as inputs tick with new data.
They may (or may not) generate an output as a result of an input tick.

```python
from datetime import timedelta

@csp.node                                                                # 1
def demo_node(n: int, xs: ts[float], ys: ts[float]) -> ts[float]:       # 2
    with csp.alarms():                                                   # 3
        # Define an alarm time-series of type bool                      # 4
        alarm = csp.alarm(bool)                                          # 5
                                                                         # 6
    with csp.state():                                                    # 7
        # Create a state variable bound to the node                     # 8
        s_sum = 0.0                                                      # 9
                                                                         # 10
    with csp.start():                                                    # 11
        # Code block that executes once on start of the engine          # 12
        # one can set timeseries properties here as well, such as       # 13
        # csp.set_buffering_policy(xs, tick_count=5)                    # 14
        # csp.set_buffering_policy(xs, tick_history=timedelta(minutes=1)) # 15
        # csp.make_passive(xs)                                          # 16
        csp.schedule_alarm(alarm, timedelta(seconds=1), True)            # 17
                                                                         # 18
    with csp.stop():                                                     # 19
        pass  # code block to execute when the engine is done           # 20
                                                                         # 21
    if csp.ticked(xs, ys) and csp.valid(xs, ys):                         # 22
        s_sum += xs * ys                                                 # 23
                                                                         # 24
    if csp.ticked(alarm):                                                # 25
        csp.schedule_alarm(alarm, timedelta(seconds=1), True)            # 26
        return s_sum                                                     # 27
```

Let's review it line by line:

1\) Every csp node must start with the **`@csp.node`** decorator.

2\) `csp` nodes are fully typed and type-checking is strictly enforced.
All arguments must be typed, as well as all outputs.
Outputs are typed using function annotation syntax.

Single outputs can be unnamed; multiple outputs must be named.
When using multiple outputs, annotate the type using **`def my_node(inputs) → csp.Outputs(name1=ts[T], name2=ts[V])`** where `T` and `V` are the respective types of `name1` and `name2`.

Note the syntax of timeseries inputs: they are denoted by **`ts[type]`**.
Scalars can be passed in as regular types; in this example we pass in `n` which expects a type of `int`.

3\) **`with csp.alarms()`**: nodes can (optionally) declare internal alarms; every instance of the node will get its own alarm that can be scheduled and acts just like a timeseries input.
All alarms must be declared within the alarms context.

5\) Instantiate an alarm in the alarms context using the `csp.alarm(typ)` function. This creates an alarm which is a time-series of type `typ`.

7\) **`with csp.state()`**: optional state variables can be defined under the state context.
Note that variables declared in state will live across invocations of the method.

9\) An example declaration and initialization of state variable `s_sum`.
It is good practice to name state variables prefixed with `s_`, which is the convention in the `csp` codebase.
11\) **`with csp.start()`**: an optional block to execute code at the start of the engine.
Generally this is used to set up initial timers, set input timeseries properties such as buffer sizes, or make inputs passive.

14-15) **`csp.set_buffering_policy`**: nodes can request a certain amount of history be kept on the incoming time series; this can be denoted in number of ticks or in time.
By setting a buffering policy, nodes can access historical values of the timeseries (by default only the last value is kept).

16\) **`csp.make_passive`** / **`csp.make_active`**: Nodes may not need to react to all of their inputs; they may just need their latest value.
For performance purposes the node can mark an input as passive to avoid triggering the node unnecessarily.
`make_active` can be called to reactivate an input.

17\) **`csp.schedule_alarm`**: schedules a one-shot tick on the given alarm input.
The values given are the timedelta before the alarm triggers and the value it will have when it triggers.
Note that `schedule_alarm` can be called multiple times on the same alarm to schedule multiple triggers.

19\) **`with csp.stop()`**: an optional block that executes when the engine is done running.

22\) All nodes will have if conditions to react to different inputs.
**`csp.ticked()`** takes any number of inputs and returns true if **any** of the inputs ticked.
**`csp.valid`** similarly takes any number of inputs, however it only returns true if **all** inputs are valid.
Valid means that an input has had at least one tick and so it has a "current value".

23\) One of the benefits of `csp` is that you always have easy access to the latest value of all inputs.
`xs` and `ys` on lines 22-23 will always have the latest value of both inputs, even if only one of them just ticked.

25\) This demonstrates how an alarm can be treated like any other input.

27\) We tick our running "sum" as an output here every second.

## Basket inputs

In addition to single time-series inputs, a node can also accept a **basket** of time series as an argument.
A basket is essentially a collection of timeseries which can be passed in as a single argument.
Baskets can either be list baskets or dict baskets.
Individual timeseries in a basket can tick independently, and they can be looked at and reacted to individually or as a collection.

For example:

```python
@csp.node                                        # 1
def demo_basket_node(                            # 2
    list_basket: [ts[int]],                      # 3
    dict_basket: {str: ts[int]}                  # 4
) -> ts[float]:                                  # 5
                                                 # 6
    if csp.ticked(list_basket):                  # 7
        return sum(list_basket.validvalues())    # 8
                                                 # 9
    if csp.ticked(list_basket[3]):               # 10
        return list_basket[3]                    # 11
                                                 # 12
    if csp.ticked(dict_basket):                  # 13
        # can iterate over ticked key,items      # 14
        # for k,v in dict_basket.tickeditems():  # 15
        #     ...                                # 16
        return sum(dict_basket.tickedvalues())   # 17
```

3\) Note the syntax of basket inputs.
List baskets are denoted as `[ts[type]]` (a list of time series) and dict baskets as `{key_type: ts[ts_type]}` (a dictionary of timeseries keyed by type `key_type`).

7\) Just like single timeseries, we can react to a basket if it ticked.
The convention is the same as passing multiple inputs to `csp.ticked`: `csp.ticked` is true if **any** basket input ticked.
`csp.valid` is true if **all** basket inputs are valid.
8\) Baskets have various iterators to access their inputs:

- **`tickedvalues`**: iterator of values of all ticked inputs
- **`tickedkeys`**: iterator of keys of all ticked inputs (keys are list indices for list baskets)
- **`tickeditems`**: iterator of (key, value) tuples of ticked inputs
- **`validvalues`**: iterator of values of all valid inputs
- **`validkeys`**: iterator of keys of all valid inputs
- **`validitems`**: iterator of (key, value) tuples of valid inputs
- **`keys`**: list of keys on the basket (**dictionary baskets only**)

10-11) This demonstrates the ability to access an individual element of a basket and react to it, as well as access its current value.

## Node Outputs

Nodes can return any number of outputs (including no outputs, in which case it is considered an "output" or sink node,
see [Graph Pruning](https://github.com/Point72/csp/wiki/0.-Introduction#graph-pruning)).
Nodes with single outputs can return the output as an unnamed output.
Nodes returning multiple outputs must have them be named.
When a node is called at graph building time, if it has a single unnamed output the return value is an edge representing the output, which can be passed into other nodes.
If the outputs are named, the return value is an object with the outputs available as attributes.
For example (the examples below also demonstrate various ways to output the data):

```python
@csp.node
def single_unnamed_outputs(n: ts[int]) -> ts[int]:
    # can either do
    return n
    # or
    # csp.output(n) to continue processing after the output


@csp.node
def multiple_named_outputs(n: ts[int]) -> csp.Outputs(y=ts[int], z=ts[float]):
    # can do
    # csp.output(y=n, z=n+1.) to output to multiple outputs
    # or separate the outputs to tick out at separate points:
    # csp.output(y=n)
    # ...
    # csp.output(z=n+1.)
    # or can return multiple values with:
    return csp.output(y=n, z=n+1.)

@csp.graph
def my_graph(n: ts[int]):
    x = single_unnamed_outputs(n)
    # x represents the output edge of single_unnamed_outputs,
    # we can pass it as a time series input to other nodes
    csp.print('x', x)


    result = multiple_named_outputs(n)
    # result holds all the outputs of multiple_named_outputs, which can be accessed as attributes
    csp.print('y', result.y)
    csp.print('z', result.z)
```

## Basket Outputs

Similarly to inputs, a node can also produce a basket of timeseries as an output.
For example:

```python
class MyStruct(csp.Struct):           # 1
    symbol: str                       # 2
    index: int                        # 3
    value: float                      # 4
                                      # 5
@csp.node                             # 6
def demo_basket_output_node(          # 7
    in_: ts[MyStruct],                # 8
    symbols: [str],                   # 9
    num_symbols: int                  # 10
) -> csp.Outputs(                     # 11
    dict_basket=csp.OutputBasket(     # 12
        Dict[str, ts[float]],         # 13
        shape="symbols",              # 14
    ),                                # 15
    list_basket=csp.OutputBasket(     # 16
        List[ts[float]],              # 17
        shape="num_symbols"           # 18
    ),                                # 19
):                                    # 20
                                      # 21
    if csp.ticked(in_):               # 22
        # output to dict basket       # 23
        csp.output(dict_basket[in_.symbol], in_.value)
        # alternate output syntax, can output multiple keys at once
        # csp.output(dict_basket={in_.symbol: in_.value})
        # output to list basket
        csp.output(list_basket[in_.index], in_.value)
        # alternate output syntax, can output multiple keys at once
        # csp.output(list_basket={in_.index: in_.value})
```

11-20) Note the output declaration syntax.
A basket output can be either named or unnamed (both examples here are named), and its shape can be specified two ways.
The `shape` parameter takes either a scalar value that defines the shape of the basket, or the name of a scalar argument to take the shape from (a dict basket expects `shape` to be a list of keys; a list basket expects `shape` to be an `int`).
`shape_of` is used to take the shape of an input basket and apply it to the output basket.

23+) There are several choices for output syntax.
The following work for both list and dict baskets:

- `csp.output(basket={key: value, key2: value2, ...})`
- `csp.output(basket[key], value)`
- `csp.output({key: value})  # only works if the basket is the only output`

## Generic Types

`csp` supports syntax for generic types as well.
To denote a generic type we use a string (typically `'T'`) as the type variable.
When a node is called, the type of the argument will get bound to the given type variable, and further inputs / outputs will be checked and bound to said typevar.
Note that the string syntax `'~T'` denotes that the argument expects the *value* of a type, rather than a type itself:

```python
@csp.node
def sample(trigger: ts[object], x: ts['T']) -> ts['T']:
    '''will return current value of x on trigger ticks'''
    with csp.start():
        csp.make_passive(x)

    if csp.ticked(trigger) and csp.valid(x):
        return x


@csp.node
def const(value: '~T') -> ts['T']:
    ...
```

`sample` takes a timeseries of type `'T'` as an input, and returns a timeseries of type `'T'`.
This allows us to pass in a `ts[int]`, for example, and get a `ts[int]` as an output, or `ts[bool]` → `ts[bool]`.

`const` takes value as an *instance* of type `T`, and returns a timeseries of type `T`.
So we can call `const(5)` and get a `ts[int]` output, or `const('hello!')` and get a `ts[str]` output, etc.

## Engine Time

The `csp` engine always maintains its current view of time.
The current time of the engine can be accessed at any time within a csp.node by calling `csp.now()`.

## Graph Propagation and Single-dispatch

The `csp` graph propagation algorithm ensures that all nodes are executed *once* per engine cycle, and in the correct order.
Correct order means that all input dependencies of a given node are guaranteed to have been evaluated before the node is executed.
Take this graph for example:

![359407953](https://github.com/Point72/csp/assets/3105306/d9416353-6755-4e37-8467-01da516499cf)

On a given cycle, let's say the `bid` input ticks.
The `csp` engine will ensure that **`mid`** is executed, followed by **`spread`**, and only once **`spread`**'s output is updated will **`quote`** be called.
When **`quote`** executes it will have the latest values of the `mid` and `spread` calcs for this cycle.

## Graph Pruning

One should note a subtle optimization technique in `csp` graphs.
Any part of a graph that is created at graph building time but is NOT connected to any output nodes will be pruned from the graph and will not exist during runtime.
An output is defined as either an output adapter or a `csp.node` without any outputs of its own.
The idea here is that we can avoid doing work if it doesn't result in any output being generated.
In general it is best practice for all csp.nodes to be **side-effect free**; in other words, they shouldn't mutate any state outside of the node.
Assuming all nodes are side-effect free, pruning the graph will not have any noticeable effects.
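As a small sketch of pruning in action (the node and variable names here are made up for illustration), the graph below wires a node whose output is never consumed, so it is pruned and never executes:

```python
import csp
from csp import ts
from datetime import datetime


@csp.node
def expensive_calc(x: ts[int]) -> ts[int]:
    if csp.ticked(x):
        return x * 2


@csp.graph
def pruned_graph():
    x = csp.const(1)
    y = expensive_calc(x)  # this edge is never wired to an output adapter
                           # or sink node, so the node is pruned at build time
    csp.print('x', x)      # csp.print is a sink, so this path survives


if __name__ == '__main__':
    csp.run(pruned_graph, starttime=datetime.utcnow())
```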
## Anatomy of a `csp.graph`

To reiterate, csp.graph methods are called in order to construct the graph and are only executed before the engine is run.
csp.graph methods don't do anything special; they are essentially regular python methods, but they can be defined to accept inputs and generate outputs similar to csp.nodes.
These declarations are used solely for type checking.
csp.graph methods can be created to encapsulate components of a graph, and can be called from other csp.graph methods in order to help facilitate graph building.

Simple example:

```python
@csp.graph
def calc_symbol_pnl(symbol: str, trades: ts[Trade]) -> ts[float]:
    # sub-graph code needed to compute pnl for given symbol and symbol's trades
    # sub-graph can subscribe to market data for the symbol as needed
    ...


@csp.graph
def calc_portfolio_pnl(symbols: [str]) -> ts[float]:
    symbol_pnl = []
    for symbol in symbols:
        symbol_trades = trade_adapter.subscribe(symbol)
        symbol_pnl.append(calc_symbol_pnl(symbol, symbol_trades))

    return csp.sum(symbol_pnl)
```

In this simple example we have a csp.graph component `calc_symbol_pnl` which encapsulates computing pnl for a single symbol.
`calc_portfolio_pnl` is a graph that computes portfolio-level pnl; it invokes the symbol-level pnl calc for every symbol, then sums up the results for the portfolio-level pnl.

# Historical Buffers

`csp` provides access to historical input data as well.
By default only the last value of an input is kept in memory, however one can request history to be kept on an input, either by number of ticks or by time, using **`csp.set_buffering_policy`**.

The methods **`csp.value_at`**, **`csp.time_at`** and **`csp.item_at`** can be used to retrieve historical input values.
Each node should call **`csp.set_buffering_policy`** to make sure that its inputs are configured to store a sufficiently long history for a correct implementation.
For example, let's assume that we have a stream of data and we want to create equally sized buckets from the data.
A possible implementation of such a node would be:

```python
@csp.node
def data_bin_generator(bin_size: int, input: ts['T']) -> ts[['T']]:
    with csp.start():
        assert bin_size > 0
        # This makes sure that input stores at least bin_size entries
        csp.set_buffering_policy(input, tick_count=bin_size)
    if csp.ticked(input) and (csp.num_ticks(input) % bin_size == 0):
        return [csp.value_at(input, -i) for i in range(bin_size)]
```

In this example, we use **`csp.set_buffering_policy(input, tick_count=bin_size)`** to ensure that the buffer history contains at least **`bin_size`** elements.
Note that an input can be shared by multiple nodes; if multiple nodes provide size requirements, the buffer size will be resolved to the maximum size needed to support all requests.

Alternatively, **`csp.set_buffering_policy`** supports a **`timedelta`** parameter **`tick_history`** instead of **`tick_count`**.
If **`tick_history`** is provided, the buffer will scale dynamically to ensure that any period of length **`tick_history`** will fit into the history buffer.

To identify when there are enough samples to construct a bin we use **`csp.num_ticks(input) % bin_size == 0`**.
The function **`csp.num_ticks`** returns the total number of ticks for a given time series.
NOTE: The actual size of the history buffer is usually less than **`csp.num_ticks`**, as the buffer is dynamically truncated to satisfy the set policy.

The past values in this example are accessed using **`csp.value_at`**.
The various historical access methods take the same arguments and return the value, time, and tuple of `(time, value)` respectively:

- **`csp.value_at`**`(ts, index_or_time, duplicate_policy=DuplicatePolicy.LAST_VALUE, default=UNSET)`: returns the **value** at the requested `index_or_time`
- **`csp.time_at`**`(ts, index_or_time, duplicate_policy=DuplicatePolicy.LAST_VALUE, default=UNSET)`: returns the **datetime** at the requested `index_or_time`
- **`csp.item_at`**`(ts, index_or_time, duplicate_policy=DuplicatePolicy.LAST_VALUE, default=UNSET)`: returns a tuple of `(datetime, value)` at the requested `index_or_time`
  - **`ts`**: the name of the input
  - **`index_or_time`**:
    - If providing an **index**, this represents how many ticks back to retrieve **and should be \<= 0**.
      0 indicates the current value, -1 is the previous value, etc.
    - If providing a **time**, one can either provide a datetime for absolute time, or a timedelta for how far back to access.
      **NOTE** that the timedelta must be negative to represent time in the past.
  - **`duplicate_policy`**: when requesting history by datetime or timedelta, it's possible that there could be multiple values that match the given time.
    **`duplicate_policy`** can be provided to control the behavior of what to return in this case.
    The default policy is to return the `LAST_VALUE` that exists at the given time.
  - **`default`**: value to be returned if the requested time is out of the history bounds (if default is not provided and a request is out of bounds, an exception will be raised).

To illustrate history access using **timedelta** indexing, consider a possible implementation of a function that sums up samples taken every second for each period of **`n_seconds`** of the input time series.
If the value ticks slower than every second then this implementation could sample the same value more than once (this is just an illustration; it's NOT recommended to use such an implementation in a real application, as it could be implemented more efficiently):

```python
@csp.node
def sample_sum(n_seconds: int, input: ts[int], default_sample_value: int = 0) -> ts[int]:
    with csp.alarms():
        a = csp.alarm(bool)
    with csp.start():
        assert n_seconds > 0
        # This makes sure that input stores at least n_seconds seconds
        csp.set_buffering_policy(input, tick_history=timedelta(seconds=n_seconds))
        # Flag the input as passive since we don't need to react to its ticks
        csp.make_passive(input)
        # Schedule the first sample in n_seconds-1 from start, to also capture the initial value
        csp.schedule_alarm(a, timedelta(seconds=n_seconds - 1), True)
    if csp.ticked(a):
        # Schedule the next sample in n_seconds from start
        csp.schedule_alarm(a, timedelta(seconds=n_seconds), True)
        res = 0
        for i in range(n_seconds):
            res += csp.value_at(input, timedelta(seconds=-i), default=default_sample_value)
        return res
```

## Historical Range Access

In similar fashion, the methods **`csp.values_at`**, **`csp.times_at`** and **`csp.items_at`** can be used to retrieve a range of historical input values as numpy arrays.
The bin generator example above can be accomplished more efficiently with range access:

```python
@csp.node
def data_bin_generator(bin_size: int, input: ts['T']) -> ts[['T']]:
    with csp.start():
        assert bin_size > 0
        # This makes sure that input stores at least bin_size entries
        csp.set_buffering_policy(input, tick_count=bin_size)
    if csp.ticked(input) and (csp.num_ticks(input) % bin_size == 0):
        return csp.values_at(input, -bin_size + 1, 0).tolist()
```

The past values in this example are accessed using **`csp.values_at`**.
The various range access methods take the same arguments and return the values, times, and tuple of `(times, values)` respectively:

- **`csp.values_at`**`(ts, start_index_or_time, end_index_or_time, start_index_policy=TimeIndexPolicy.INCLUSIVE, end_index_policy=TimeIndexPolicy.INCLUSIVE)`:
  returns values in the specified range as a numpy array
- **`csp.times_at`**`(ts, start_index_or_time, end_index_or_time, start_index_policy=TimeIndexPolicy.INCLUSIVE, end_index_policy=TimeIndexPolicy.INCLUSIVE)`:
  returns times in the specified range as a numpy array
- **`csp.items_at`**`(ts, start_index_or_time, end_index_or_time, start_index_policy=TimeIndexPolicy.INCLUSIVE, end_index_policy=TimeIndexPolicy.INCLUSIVE)`:
  returns a tuple of (times, values) numpy arrays
  - **`ts`**: the name of the input
  - **`start_index_or_time`**:
    - If providing an **index**, this represents how many ticks back to retrieve **and should be \<= 0**.
      0 indicates the current value, -1 is the previous value, etc.
    - If providing a **time**, one can either provide a datetime for absolute time, or a timedelta for how far back to access.
      **NOTE that the timedelta must be negative** to represent time in the past.
    - If **None** is provided, the range will begin "from the beginning", i.e., the oldest tick in the buffer.
  - **`end_index_or_time`**: same as `start_index_or_time`.
    - If **None** is provided, the range will go "until the end", i.e., the newest tick in the buffer.
  - **`start_index_policy`**: only for use with datetime/timedelta as the start and end parameters.
    - **`TimeIndexPolicy.INCLUSIVE`**: if there is a tick exactly at the requested time, include it
    - **`TimeIndexPolicy.EXCLUSIVE`**: if there is a tick exactly at the requested time, exclude it
    - **`TimeIndexPolicy.EXTRAPOLATE`**: if there is a tick at the beginning timestamp, include it.
      Otherwise, if there is a tick before the beginning timestamp, force a tick at the beginning timestamp with the prevailing value at the time.
  - **`end_index_policy`**: only for use with datetime/timedelta as the start and end parameters.
    - **`TimeIndexPolicy.INCLUSIVE`**: if there is a tick exactly at the requested time, include it
    - **`TimeIndexPolicy.EXCLUSIVE`**: if there is a tick exactly at the requested time, exclude it
    - **`TimeIndexPolicy.EXTRAPOLATE`**: if there is a tick at the end timestamp, include it.
      Otherwise, if there is a tick before the end timestamp, force a tick at the end timestamp with the prevailing value at the time.

Range access is optimized at the C++ layer, and for this reason it is far more efficient than calling the single-value access methods in a loop; it should be substituted in where possible.

Below is a rolling average example to illustrate the use of timedelta indexing.
Note that `timedelta(seconds=-n_seconds)` is equivalent to `csp.now() - timedelta(seconds=n_seconds)`, since datetime indexing is supported.
```python
@csp.node
def rolling_average(x: ts[float], n_seconds: int) -> ts[float]:
    with csp.start():
        assert n_seconds > 0
        csp.set_buffering_policy(x, tick_history=timedelta(seconds=n_seconds))
    if csp.ticked(x):
        avg = np.mean(csp.values_at(x, timedelta(seconds=-n_seconds), timedelta(seconds=0),
                                    csp.TimeIndexPolicy.INCLUSIVE, csp.TimeIndexPolicy.INCLUSIVE))
        csp.output(avg)
```

When accessing all elements within the buffering policy window like this, it would be more succinct to pass None as the start and end time, but datetime/timedelta allows for more general use (e.g. a rolling average between 5 seconds and 1 second ago, or an average specifically between 9:30:00 and 10:00:00).

# Cyclical graph - `csp.feedback`

By definition of the graph building code, csp graph building can only produce acyclic graphs.
However, there are many occasions where a cycle may be required.
For example, let's say you want part of your graph to simulate an exchange.
That part of the graph would need to accept new orders and return acks and executions.
However, the acks / executions would likely need to *feed back* into the same part of the graph that generated the orders.
For this reason, the `csp.feedback` construct exists.
Using `csp.feedback` one can wire a feedback as an input to a node, and effectively bind the actual edge that feeds it later in the graph.
Note that internally the graph is still acyclic.
Internally `csp.feedback` creates a pair of output and input adapters that are bound together.
When a timeseries that is bound to a feedback ticks, it is fed to the feedback, which then schedules the tick on its bound input to be executed on the **next engine cycle**.
The next engine cycle will execute with the same engine time as the cycle that generated it, but it will be evaluated in a subsequent cycle.

- **`csp.feedback(ts_type)`**: `ts_type` is the type of the timeseries (e.g. `int`, `str`).
  This returns an instance of a feedback object.
  - **`out()`**: this method returns the timeseries edge which can be passed as an input to your node
  - **`bind(ts)`**: this method is called to bind an edge as the source of the feedback after the fact

A simple example should help demonstrate a possible usage.
Let's say we want to simulate acking orders that are generated from a node called `my_algo`.
In addition to generating the orders, `my_algo` also needs to receive the execution reports (this is demonstrated in example `e_13_feedback.py`).

The graph code would look something like this:

```python
# Simulate acking an order
@csp.node
def my_exchange(order: ts[Order]) -> ts[ExecReport]:
    # ... impl details ...

@csp.node
def my_algo(exec_report: ts[ExecReport]) -> ts[Order]:
    # ... impl details ...

@csp.graph
def my_graph():
    # create the feedback first so that we can refer to it later
    exec_report_fb = csp.feedback(ExecReport)

    # generate orders, passing feedback out() which isn't bound yet
    orders = my_algo(exec_report_fb.out())

    # get exec_reports from "simulator"
    exec_report = my_exchange(orders)

    # now bind the exec reports to the feedback, finishing the "loop"
    exec_report_fb.bind(exec_report)
```

The graph would end up looking like this.
It remains acyclic, but the `FeedbackOutputDef` is bound to the `FeedbackInputDef` here; any tick to the output will push the tick to the input on the next cycle:

![366521848](https://github.com/Point72/csp/assets/3105306/c4f920ff-49f9-4a52-8404-7c1989768da7)

# Collecting Graph Outputs

If the `csp.graph` passed to `csp.run` has outputs, the full timeseries will be returned from `csp.run` like so:

**outputs example**

```python
import csp
from datetime import datetime, timedelta

@csp.graph
def my_graph() -> ts[int]:
    return csp.merge(csp.const(1), csp.const(2, timedelta(seconds=1)))

if __name__ == '__main__':
    res = csp.run(my_graph, starttime=datetime(2021,11,8))
    print(res)
```

result:

```raw
{0: [(datetime.datetime(2021, 11, 8, 0, 0), 1), (datetime.datetime(2021, 11, 8, 0, 0, 1), 2)]}
```

Note that the result is a list of `(datetime, value)` tuples.

You can also use `csp.add_graph_output` to add outputs.
These do not need to be in the top-level graph called directly from `csp.run`.

This gives the same output, keyed by the output name `'a'` rather than `0`:

**add_graph_output example**

```python
@csp.graph
def my_graph():
    csp.add_graph_output('a', csp.merge(csp.const(1), csp.const(2, timedelta(seconds=1))))
```

In addition to python outputs like above, you can set the optional `csp.run` argument `output_numpy` to `True` to get outputs as numpy arrays:

**numpy outputs**

```python
result = csp.run(my_graph, starttime=datetime(2021,11,8), output_numpy=True)
```

result:

```raw
{0: (array(['2021-11-08T00:00:00.000000000', '2021-11-08T00:00:01.000000000'], dtype='datetime64[ns]'), array([1, 2], dtype=int64))}
```

Note that the result there is a tuple per output, containing two numpy arrays, one with the datetimes and one with the values.

# Realtime / Simulation Modes

The `csp` engine can be run in two flavors: realtime and simulation.

In simulation mode, the engine is always run at full speed, pulling in time-based data from its input adapters and running it through the graph.
All inputs in simulation are driven off the provided timestamped data of its inputs.

In realtime mode, the engine runs in wallclock time as of "now".
Realtime engines can get data from realtime adapters which source data on separate threads and pass it through to the engine (i.e., think of ActiveMQ events happening on an ActiveMQ thread and being passed along to the engine in "realtime").

Since engines can run in both simulated and realtime mode, users should **always** use **`csp.now()`** to get the current time in csp.nodes.

## Simulation Mode

Simulation mode is the default mode of the engine.
As stated above, simulation mode is used when you want your engine to crunch through historical data as fast as possible.
In simulation mode, the engine runs on some historical data that is fed in through various adapters.
The adapters provide events by time, and they are streamed into the engine via the adapter timeseries in time order.
csp.timer and csp.node alarms are scheduled and executed in "historical time" as well.
Note that there is no strict requirement for simulated runs to run on historical dates.
As long as the engine is not in realtime mode, it remains in simulation mode until the provided endtime, even if endtime is in the future.

## Realtime Mode

Realtime mode is opted into by passing `realtime=True` to `csp.run(...)`.
When run in realtime mode, the engine will run in simulation mode from the provided starttime → wallclock "now" as of the time of calling run.
Once the simulation run is done, the engine switches into realtime mode.
Under realtime mode, external realtime adapters will be able to send data into the engine thread.
All time-based inputs such as csp.timer and alarms will switch to executing in wallclock time as well.

As always, `csp.now()` should still be used in csp.node code, even when running in realtime mode.
`csp.now()` will be the time assigned to the current engine cycle.

## csp.PushMode

When consuming data from input adapters there are three choices on how one can consume the data:

| PushMode | EngineMode | Description |
| :------- | :--------- | :---------- |
| **LAST_VALUE** | Simulation | all ticks from the input source with duplicate timestamps (on the same timeseries) will tick once with the last value on a given timestamp |
|   | Realtime | all ticks that occurred since the previous engine cycle will collapse / conflate to the latest value |
| **NON_COLLAPSING** | Simulation | all ticks from the input source with duplicate timestamps (on the same timeseries) will tick once per engine cycle; subsequent cycles will execute with the same time |
|   | Realtime | all ticks that occurred since the previous engine cycle will be ticked across subsequent engine cycles as fast as possible |
| **BURST** | Simulation | all ticks from the input source with duplicate timestamps (on the same timeseries) will tick once with a list of all values |
|   | Realtime | all ticks that occurred since the previous engine cycle will tick once with a list of all the values |
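To make the PushMode behaviors concrete, here is a minimal simulation-mode sketch (it assumes `csp.curve` accepts the `push_mode` argument like other input adapters; the values and times are made up for illustration):

```python
from datetime import datetime, timedelta

import csp


@csp.graph
def push_mode_demo():
    t0 = datetime(2021, 1, 1)
    # two values share the same timestamp, followed by one a second later
    data = [(t0, 1), (t0, 2), (t0 + timedelta(seconds=1), 3)]

    csp.print('last_value', csp.curve(int, data, push_mode=csp.PushMode.LAST_VALUE))          # ticks 2, then 3
    csp.print('non_collapsing', csp.curve(int, data, push_mode=csp.PushMode.NON_COLLAPSING))  # ticks 1 and 2 on consecutive cycles, then 3
    csp.print('burst', csp.curve(int, data, push_mode=csp.PushMode.BURST))                    # ticks [1, 2], then [3]


if __name__ == '__main__':
    csp.run(push_mode_demo, starttime=datetime(2021, 1, 1))
```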
## Realtime Group Event Synchronization

The `csp` framework supports properly synchronizing events across multiple timeseries that are sourced from the same realtime adapter.
A classical example of this is a market data feed.
Say you consume bid, ask and trade as 3 separate time series for the same product / exchange.
Since the data flows in asynchronously from a separate thread, bid, ask and trade events could end up executing in the engine at arbitrary slices of time, leading to crossed books and trades that are out of range of the bid/ask.
The engine can properly provide a correct synchronous view of all the inputs, regardless of their PushModes.
It's up to adapter implementations to determine which inputs are part of a synchronous "PushGroup".

Here's a classical example.
An application wants to consume conflating bid/ask as LAST_VALUE, but it doesn't want to conflate trades, so they are consumed as NON_COLLAPSING.

Let's say we have this sequence of events on the actual market data feed's thread, coming in on the wire in this order.
The columns denote the time the callbacks come in off the market data thread.

| Event | T      | T+1    | T+2    | T+3   | T+4   | T+5   | T+6    |
| :---- | :----- | :----- | :----- | :---- | :---- | :---- | :----- |
| BID   | 100.00 | 100.01 |        | 99.97 | 99.98 | 99.99 |        |
| ASK   | 100.02 |        | 100.03 |       |       |       | 100.00 |
| TRADE |        |        | 100.02 |       |       |       | 100.03 |

Without any synchronization you can end up with nonsensical views based on random timing.
Here's one such possibility (bid/ask are still LAST_VALUE, trade is NON_COLLAPSING).

Here, ET is engine time.
Let's assume the engine had a huge delay and hasn't processed any of the data submitted above yet.
Without any synchronization, bid/ask would completely conflate, and trade would unroll over multiple engine cycles:
| Event | ET     | ET+1   |
| :---- | :----- | :----- |
| BID   | 99.99  |        |
| ASK   | 100.00 |        |
| TRADE | 100.02 | 100.03 |

However, since market data adapters will group bid/ask/trade inputs together, the engine won't let bid/ask events advance ahead of trade events, since trade is NON_COLLAPSING.
NON_COLLAPSING inputs will essentially act as a barrier, not allowing events ahead of the barrier to tick before the barrier is complete.
Let's assume again that the engine had a huge delay and hasn't processed any data submitted above.
With proper barrier synchronization, the engine cycles would look like this under the same conditions:
| Event | ET     | ET+1   | ET+2   |
| :---- | :----- | :----- | :----- |
| BID   | 100.01 | 99.99  |        |
| ASK   | 100.03 |        | 100.00 |
| TRADE | 100.02 | 100.03 |        |

Note how the last ask tick of 100.00 got held up to a separate cycle (ET+2) so that trade could tick with the correct view of bid/ask at the time of the second trade (ET+1).

As another example, let's say the engine got delayed briefly at wire time T, so it was able to process T+1 data.
Similarly, it got briefly delayed at time T+4 until after T+6. The engine would be able to process all data at times T+1, T+2, T+3 and T+6, leading to this sequence of engine cycles.
The equivalent "wire time" is denoted in parentheses.
| Event | ET (T+1) | ET+1 (T+2) | ET+2 (T+3) | ET+3 (T+5) | ET+4 (T+6) |
| :---- | :------- | :--------- | :--------- | :--------- | :--------- |
| BID   | 100.01   |            | 99.97      | 99.99      |            |
| ASK   | 100.02   | 100.03     |            |            | 100.00     |
| TRADE |          | 100.02     |            |            | 100.03     |
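Tying the two execution modes together, here is a minimal realtime-mode sketch (the node and graph names are made up; the run processes roughly 5 seconds of wallclock time):

```python
from datetime import datetime, timedelta

import csp
from csp import ts


@csp.node
def stamp(trigger: ts[bool]) -> ts[str]:
    if csp.ticked(trigger):
        # csp.now() is the correct clock in both simulation and realtime mode
        return f"engine time is {csp.now()}"


@csp.graph
def realtime_graph():
    csp.print('tick', stamp(csp.timer(timedelta(seconds=1))))


if __name__ == '__main__':
    csp.run(realtime_graph, starttime=datetime.utcnow(),
            endtime=timedelta(seconds=5), realtime=True)
```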
diff --git a/docs/wiki/5.-Adapters.md b/docs/wiki/5.-Adapters.md
deleted file mode 100644
index 7b7b97f7..00000000
--- a/docs/wiki/5.-Adapters.md
+++ /dev/null
@@ -1,1517 +0,0 @@

# Intro

To get various data sources into and out of the graph, a number of input and output adapters are available, such as CSV, Parquet, and database adapters (amongst others).
Users can also write their own input and output adapters, as explained below.

There are two types of input adapters: **Historical** (aka Simulated) adapters and **Realtime** adapters.
Historical adapters are used to feed historical timeseries data into the graph from some data source which has timeseries data.
Realtime adapters are used to feed in live event-based data in realtime, generally events created from external sources on separate threads.

There is no distinction between historical and realtime output adapters, since outputs need not care whether the timeseries data wired into them is generated from realtime or historical inputs.

In CSP terminology, a single adapter corresponds to a single timeseries edge in the graph.
There are common cases where a single data source may be used to provide data to multiple adapter (timeseries) instances; for example, a single CSV file with price data for many stocks can be read once but used to provide data to many individual adapters, one per stock.
In such cases an AdapterManager is used to coordinate management of the single source (CSV file, database, Kafka connection, etc.) and provide data to individual adapters.

Note that adapters can be quickly written and prototyped in python, and if needed can be moved to a C++ implementation for more efficiency.

# Kafka

The Kafka adapter is a user adapter to stream data from a Kafka bus as a reactive time series. It leverages the [librdkafka](https://github.com/confluentinc/librdkafka) C/C++ library internally.

The `KafkaAdapterManager` instance represents a single connection to a broker.
A single connection can subscribe and/or publish to multiple topics.

## API

```python
KafkaAdapterManager(
    broker,
    start_offset: typing.Union[KafkaStartOffset,timedelta,datetime] = None,
    group_id: str = None,
    group_id_prefix: str = '',
    max_threads=100,
    max_queue_size=1000000,
    auth=False,
    security_protocol='SASL_SSL',
    sasl_kerberos_keytab='',
    sasl_kerberos_principal='',
    ssl_ca_location='',
    sasl_kerberos_service_name='kafka',
    rd_kafka_conf_options=None,
    debug: bool = False,
    poll_timeout: timedelta = timedelta(seconds=1)
):
```

- **`broker`**: name of the Kafka broker, such as `protocol://host:port`

- **`start_offset`**: signifies where to start the stream playback from (defaults to `KafkaStartOffset.LATEST`).
  Can be one of the `KafkaStartOffset` enum types or:

  - `datetime`: to replay from the given absolute time
  - `timedelta`: this will be taken as an absolute offset from starttime to playback from

- **`group_id`**: if set, this adapter will behave as a consume-once consumer.
  `start_offset` may not be set in this case, since the adapter will always replay from the last consumed offset.

- **`group_id_prefix`**: when not passing an explicit group_id, a prefix can be supplied that will be used to prefix the UUID generated for the group_id

- **`max_threads`**: maximum number of threads to create for consumers.
  The topics are round-robin'd onto threads to balance the load.
  The adapter won't create more threads than topics.
- **`max_queue_size`**: maximum size of the (internal to Kafka) message queue.
  If the queue is full, messages can be dropped, so the default is very large.

## MessageMapper

In order to publish or subscribe, you need to define a MsgMapper.
These are the supported message types:

- **`JSONTextMessageMapper(datetime_type = DateTimeType.UNKNOWN)`**
- **`ProtoMessageMapper(datetime_type = DateTimeType.UNKNOWN)`**

You should choose the `DateTimeType` based on how you want (when publishing) or expect (when subscribing) your datetimes to be represented on the wire.
The supported options are:

- `UINT64_NANOS`
- `UINT64_MICROS`
- `UINT64_MILLIS`
- `UINT64_SECONDS`

The enum is defined in [csp/adapters/utils.py](https://github.com/Point72/csp/blob/main/csp/adapters/utils.py#L5).

Note the `JSONTextMessageMapper` currently does not have support for lists.
To subscribe to json data with lists, simply subscribe using the `RawTextMessageMapper` and process the text into json (e.g. via `json.loads`).

## Subscribing and Publishing

Once you have a `KafkaAdapterManager` object and a `MsgMapper` object, you can subscribe to topics using the following method:

```python
KafkaAdapterManager.subscribe(
    ts_type: type,
    msg_mapper: MsgMapper,
    topic: str,
    key=None,
    field_map: typing.Union[dict,str] = None,
    meta_field_map: dict = None,
    push_mode: csp.PushMode = csp.PushMode.LAST_VALUE,
    adjust_out_of_order_time: bool = False
):
```

- **`ts_type`**: the timeseries type you want to get the data on. This can be a `csp.Struct` or basic timeseries type
- **`msg_mapper`**: the `MsgMapper` object discussed above
- **`topic`**: the topic to subscribe to
- **`key`**: the key to subscribe to. If `None`, then this will subscribe to all messages on the topic. Note that in this "wildcard" mode, all messages will tick as "live", since replay in engine time cannot be supported
- **`field_map`**: dictionary of `{message_field: struct_field}` to define how the subscribed message gets mapped onto the struct
- **`meta_field_map`**: to extract meta information from the kafka message, provide a meta_field_map dictionary of meta field info → struct field name to place it into.
  The following meta fields are currently supported:
  - **`"partition"`**: which partition the message came from
  - **`"offset"`**: the kafka offset of the given message
  - **`"live"`**: whether this message is "live" and not being replayed
  - **`"timestamp"`**: timestamp of the kafka message
  - **`"key"`**: key of the message
- **`push_mode`**: `csp.PushMode` (LAST_VALUE, NON_COLLAPSING, BURST)
- **`adjust_out_of_order_time`**: in some cases it has been seen that kafka can produce out-of-order messages, even for the same key.
  This allows the adapter to be more lax and let such messages through by forcing time to max(time, prev time)

Similarly, you can publish on topics using the following method:

```python
KafkaAdapterManager.publish(
    msg_mapper: MsgMapper,
    topic: str,
    key: str,
    x: ts['T'],
    field_map: typing.Union[dict,str] = None
):
```

- **`msg_mapper`**: same as above
- **`topic`**: same as above
- **`key`**: key to publish to
- **`x`**: the timeseries to publish
- **`field_map`**: dictionary of `{struct_field: message_field}` to define how the struct gets mapped onto the published message.
  Note this dictionary is the opposite of the field_map in subscribe()
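Putting subscribe and publish together, here is a minimal sketch (the broker address, topic and key are placeholders, and the mapper import locations are assumed to be `csp.adapters.kafka` / `csp.adapters.utils`):

```python
from datetime import datetime, timedelta

import csp
from csp.adapters.kafka import KafkaAdapterManager
from csp.adapters.utils import DateTimeType, JSONTextMessageMapper


class Quote(csp.Struct):
    price: float
    size: int


@csp.graph
def kafka_graph():
    kafka = KafkaAdapterManager(broker='localhost:9092')  # placeholder broker
    msg_mapper = JSONTextMessageMapper(datetime_type=DateTimeType.UINT64_MICROS)

    # subscribe to JSON quote messages for one key on a topic
    quotes = kafka.subscribe(Quote, msg_mapper, topic='quotes', key='AAPL',
                             push_mode=csp.PushMode.NON_COLLAPSING)
    csp.print('quote', quotes)

    # echo the quotes back out to another topic
    kafka.publish(msg_mapper, topic='quotes_echo', key='AAPL', x=quotes)

    # log adapter status to help diagnose connectivity issues
    csp.print('status', kafka.status())


if __name__ == '__main__':
    csp.run(kafka_graph, starttime=datetime.utcnow(),
            endtime=timedelta(seconds=30), realtime=True)
```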
## Known Issues

If you are having issues, such as not getting any output or the application simply locking up, start by ensuring that you are logging the adapter's `status()` with a `csp.print`/`csp.log` call and set `debug=True`.
Then check the known issues below.

- `GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (No Kerberos credentials available)`

  - **Resolution**: Kafka uses Kerberos tickets for authentication. A Kerberos token needs to be set up first.

- `Message received on unknown topic: errcode: Broker: Group authorization failed error: FindCoordinator response error: Group authorization failed.`

  - **Resolution**: Kafka brokers running on Windows are case sensitive to the Kerberos token. When creating a Kerberos token with kinit, make sure to use a principal name with a case-sensitive user id.

- `authentication: SASL handshake failed (start (-4)): SASL(-4): no mechanism available: No worthy mechs found (after 0ms in state AUTH_REQ)`

  - **Resolution**: cyrus-sasl-gssapi needs to be installed on the box for Kafka Kerberos authentication.

- `Message error on topic "an-example-topic". errcode: Broker: Topic authorization failed error: Subscribed topic not available: an-example-topic: Broker: Topic authorization failed)`

  - **Resolution**: The user account does not have access to the topic.

# Parquet

## ParquetReader

The `ParquetReader` adapter is a generic user adapter to stream data from [Apache Parquet](https://parquet.apache.org/) files as a CSP time series.
The `ParquetReader` adapter supports only flat (non-hierarchical) parquet files with all the primitive types that are supported by the CSP framework.

### API

```python
ParquetReader(
    self,
    filename_or_list,
    symbol_column=None,
    time_column=None,
    tz=None
):
    """
    :param filename_or_list: The specifier of the file/files to be read. Can be either:
        - Instance of str, in which case it's interpreted as the path of a single file to be read
        - A callable, in which case it's interpreted as a generator function that will be called like f(starttime, endtime) where starttime and endtime
          are the start and end times of the current engine run. It's expected to generate a sequence of filenames to read.
        - Iterable container, for example a list of files to read
    :param symbol_column: An optional parameter that specifies the name of the symbol column in the file, if there is any
    :param time_column: A mandatory specification of the time column name in the parquet files. This column will be used to inject the row values
        from parquet at the given timestamps.
    :param tz: The pytz timezone of the timestamp column, should only be provided if the time_column in the parquet file doesn't have tz info.
    """
```

### Subscription

```python
def subscribe(
    self,
    symbol,
    typ,
    field_map=None,
    push_mode: csp.PushMode = csp.PushMode.NON_COLLAPSING
):
    """Subscribe to the rows corresponding to a given symbol.
    This form of subscription can be used only if a non-empty symbol_column was supplied during ParquetReader construction.
    :param symbol: The symbol to subscribe to, for example 'AAPL'
    :param typ: The type of the CSP time series subscription. Can either be a primitive type like int or alternatively a type
        that inherits from csp.Struct, in which case each instance of the struct will be constructed from the matching file columns.
    :param field_map: A map of the fields from parquet columns for the CSP time series. If typ is a primitive, then field_map should be
        a string specifying the column name. If typ is a csp Struct then field_map should be a str->str dictionary of the form
        {column_name: struct_field_name}. For structs, field_map can be omitted, in which case we expect a one-to-one match between the given Struct
        fields and the parquet file columns.
    :param push_mode: A push mode for the output adapter
    """

def subscribe_all(
    self,
    typ,
    field_map=None,
    push_mode: csp.PushMode = csp.PushMode.NON_COLLAPSING
):
    """Subscribe to all rows of the input files.
    :param typ: The type of the CSP time series subscription. Can either be a primitive type like int or alternatively a type
        that inherits from csp.Struct, in which case each instance of the struct will be constructed from the matching file columns.
    :param field_map: A map of the fields from parquet columns for the CSP time series. If typ is a primitive, then field_map should be
        a string specifying the column name. If typ is a csp Struct then field_map should be a str->str dictionary of the form
        {column_name: struct_field_name}. For structs, field_map can be omitted, in which case we expect a one-to-one match between the given Struct
        fields and the parquet file columns.
    :param push_mode: A push mode for the output adapter
    """
```

Parquet reader provides two subscription methods.
**`subscribe`** produces a time series only of the rows that correspond to the given symbol; **`subscribe_all`** produces a time series of all rows in the parquet files.

## ParquetWriter

The ParquetWriter adapter is a generic user adapter to stream data from CSP time series to [Apache Parquet](https://parquet.apache.org/) files.
The `ParquetWriter` adapter supports only flat (non-hierarchical) parquet files with all the primitive types that are supported by the `csp` framework.
Any time series of Struct objects will be flattened to multiple columns.

### Construction

```python
ParquetWriter(
    self,
    file_name: Optional[str],
    timestamp_column_name,
    config: Optional[ParquetOutputConfig] = None,
    filename_provider: Optional[csp.ts[str]] = None
):
    """
    :param file_name: The path of the output parquet file name. Must be provided if no filename_provider is specified. If both file_name and filename_provider are specified then file_name will be used as the initial output file name until filename_provider provides a new file name.
    :param timestamp_column_name: Required field; if None is provided then no timestamp will be written.
    :param config: Optional configuration of how the file should be written (such as compression, block size, ...).
    :param filename_provider: An optional time series of file paths. When the filename_provider time series provides a new file path, the previously open file will be closed and all subsequent data will be written to the new file at the given path. This enables partitioning and splitting the data based on time.
    """
```

### Publishing

```python
def publish_struct(
    self,
    value: ts[csp.Struct],
    field_map: Dict[str, str] = None
):
    """Publish a time series of csp.Struct objects to file

    :param value: The time series of Struct objects that should be published.
    :param field_map: An optional dict str->str of the form {struct_field_name: column_name} that maps the names of the
        structure fields to the column names to which the values should be written.
        If the field_map is not None, then only
        the fields that are specified in the field_map will be written to file. If field_map is not provided then all fields
        of a structure will be written to columns that match exactly the field_name.
    """

def publish(
    self,
    column_name,
    value: ts[object]
):
    """Publish a time series of primitive type to file
    :param column_name: The name of the parquet file column to which the data should be written
    :param value: The time series that should be published
    """
```

Parquet writer provides two publishing methods.
**`publish_struct`** is used to publish time series of **`csp.Struct`** objects, while **`publish`** is used to publish primitive time series.
The set of columns in the written parquet file is the union of all columns that were published (the order is preserved).
A new row is written to the parquet file whenever any of the inputs ticks.
For the given row, any column that corresponds to a time series that didn't tick will have null values.

### Example of using ParquetReader and ParquetWriter

```python
import tempfile
from datetime import datetime, timedelta

import csp
from csp.adapters.parquet import ParquetOutputConfig, ParquetReader, ParquetWriter


class Dummy(csp.Struct):
    int_val: int
    float_val: float


@csp.graph
def write_struct(file_name: str):
    st = datetime(2020, 1, 1)

    curve = csp.curve(Dummy, [(st + timedelta(seconds=1), Dummy(int_val=1, float_val=1.0)),
                              (st + timedelta(seconds=2), Dummy(int_val=2, float_val=2.0)),
                              (st + timedelta(seconds=3), Dummy(int_val=3, float_val=3.0))])
    writer = ParquetWriter(file_name=file_name, timestamp_column_name='csp_time',
                           config=ParquetOutputConfig(allow_overwrite=True))
    writer.publish_struct(curve)


@csp.graph
def write_series(file_name: str):
    st = datetime(2020, 1, 1)

    curve_int = csp.curve(int, [(st + timedelta(seconds=i), i * 5) for i in range(10)])
    curve_str = csp.curve(str, [(st + timedelta(seconds=i), f'str_{i}') for i in range(10)])
    writer = ParquetWriter(file_name=file_name, timestamp_column_name='csp_time',
                           config=ParquetOutputConfig(allow_overwrite=True))
    writer.publish('int_vals', curve_int)
    writer.publish('str_vals', curve_str)


@csp.graph
def writer_graph(struct_file_name: str, series_file_name: str):
    write_struct(struct_file_name)
    write_series(series_file_name)


@csp.graph
def reader_graph(struct_file_name: str):
    reader = ParquetReader(struct_file_name, time_column='csp_time')
    csp.print('Read as struct', reader.subscribe_all(Dummy))
    csp.print('Read as single int column', reader.subscribe_all(int, 'int_val'))
    csp.print('Read as single float column', reader.subscribe_all(float, 'float_val'))


if __name__ == '__main__':
    with tempfile.NamedTemporaryFile(suffix='.parquet') as struct_file:
        struct_file.file.close()
        with tempfile.NamedTemporaryFile(suffix='.parquet') as series_file:
            series_file.file.close()
            g = csp.run(writer_graph, struct_file.name, series_file.name,
                        starttime=datetime(2020, 1, 1), endtime=timedelta(minutes=1))
            g = csp.run(reader_graph, struct_file.name,
                        starttime=datetime(2020, 1, 1), endtime=timedelta(minutes=1))
```

# DBReader

The DBReader adapter is a generic user adapter to stream data from a database as a reactive time series.
It leverages sqlalchemy internally in order to be able to access various DB backends.
-
-Please refer to the [SQLAlchemy Docs](https://docs.sqlalchemy.org/en/13/core/tutorial.html) for information on how to create sqlalchemy connections.
-
-The DBReader instance represents a single connection to a database.
-From a single reader you can subscribe to various streams: either the entire stream of data (which would basically represent the result of a single join) or, if a symbol column is declared, subscribe by symbol, which will then demultiplex rows to the right adapter.
-
-## API
-
-```python
-DBReader(self, connection, time_accessor, table_name=None, schema_name=None, query=None, symbol_column=None, constraint=None):
-    """
-    :param connection: sqlalchemy engine or (already connected) connection object.
-    :param time_accessor: TimeAccessor object
-    :param table_name: name of table in database as a string
-    :param query: either string query or sqlalchemy query object. Ex: "select * from users"
-    :param symbol_column: name of symbol column in table as a string
-    :param constraint: additional sqlalchemy constraints for query. Ex: constraint = db.text('PRICE>:price').bindparams(price=100.0)
-    """
-```
-
-- **connection**: sqlalchemy engine or existing connection object.
-- **time_accessor**: see below
-- **table_name**: either table or query is required.
-  If passing a table_name then this table will be queried against for subscribe calls.
-- **query**: (optional) if a table isn't supplied, the user can provide a direct query string or sqlalchemy query object.
-  This is useful if you want to run a join call.
-  For basic single-table queries passing table_name is preferred.
-- **symbol_column**: (optional) in order to be able to demux rows by some column, pass `symbol_column`.
-  An example case for this is if the database has data stored for many symbols in a single table, and you want to have a timeseries tick per symbol.
-- **constraint**: (optional) additional sqlalchemy constraints for the query. Ex: `constraint = db.text('PRICE>:price').bindparams(price=100.0)`
-
-## TimeAccessor
-
-All data fed into `csp` must be time based.
-`TimeAccessor` is a helper class that defines how to extract timestamp information from the results of the data.
-Users can define their own `TimeAccessor` implementation or use pre-canned ones:
-
-- `TimestampAccessor(self, time_column, tz=None)`: use this if there already exists a single datetime column.
-  Provide the column name and optionally the timezone of the column (if it's timezone-less in the db)
-- `DateTimeAccessor(self, date_column, time_column, tz=None)`: use this if there are two separate columns for date and time; this accessor will combine the two columns to create a single datetime.
-  Optionally pass tz if the time column is timezone-less in the db
-
-User implementations would have to extend the `TimeAccessor` interface.
-In addition to defining how to convert db columns to timestamps, accessors are also used to augment the query to limit the data to the graph's start and end times.
-
-Once you have a DBReader object created, you can subscribe to time series from it using the following methods:
-
-- `subscribe(self, symbol, typ, field_map=None)`
-- `subscribe_all(self, typ, field_map=None)`
-
-Both of these calls expect `typ` to be a `csp.Struct` type.
-`field_map` is a dictionary of `{ db_column : struct_column }` mappings that define how to map the database column names to the fields on the struct.
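-
-Putting these pieces together, a minimal sketch of a subscription might look like this (the `csp.adapters.db` import path is assumed, and the table/column names are hypothetical):
-
-```python
-import sqlalchemy as db
-import csp
-from csp.adapters.db import DBReader, TimestampAccessor
-
-
-class Trade(csp.Struct):
-    price: float
-    size: int
-
-
-engine = db.create_engine('sqlite:///trades.db')  # hypothetical database
-reader = DBReader(engine,
-                  time_accessor=TimestampAccessor(time_column='TIME'),
-                  table_name='trades', symbol_column='SYMBOL')
-# field_map maps db columns -> struct fields
-aapl = reader.subscribe('AAPL', Trade, field_map={'PRICE': 'price', 'SIZE': 'size'})
-```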
-
-`subscribe` is used to subscribe to a stream for the given symbol (symbol_column is required when creating the DBReader).
-
-`subscribe_all` is used to retrieve all the data resulting from the request as a single timeseries.
-
-# Symphony
-
-The Symphony adapter allows for reading and writing of messages from the [Symphony](https://symphony.com/) message platform using [`requests`](https://requests.readthedocs.io/en/latest/) and the [Symphony SDK](https://docs.developers.symphony.com/).
-
-# Slack
-
-The Slack adapter allows for reading and writing of messages from the [Slack](https://slack.com) message platform using the [Slack Python SDK](https://slack.dev/python-slack-sdk/).
-
-# Writing Input and Output Adapters
-
-## Input Adapters
-
-There are two main categories of input adapters: historical and realtime.
-When writing historical adapters you will need to implement a "pull" adapter, which pulls data from a historical data source in time order, one event at a time.
-There are also ManagedSimAdapters for feeding multiple "managed" pull adapters from a single source (more on that below).
-When writing realtime adapters, you will need to implement a "push" adapter, which will get data from a separate thread that drives external events and "pushes" them into the engine as they occur.
-
-When writing input adapters it is also very important to understand the difference between "graph building time" and "runtime" versions of your adapter.
-For example, `csp.adapters.csv` has a `CSVReader` class that is used at graph building time.
-**Graph build time components** solely *describe* the adapter.
-They are meant to do little else than keep track of the type of adapter and its parameters, which will then be used to construct the actual adapter implementation when the engine is constructed from the graph description.
-It is the runtime implementation that actually runs during the engine execution phase to process data.
-
-For clarity of this distinction, in the descriptions below we will denote graph build time components with *--graph--* and runtime implementations with *--impl--*.
-
-### Historical Adapters
-
-There are two flavors of historical input adapters that can be written.
-The simplest one is a PullInputAdapter.
-A PullInputAdapter can be used to convert a single source into a single timeseries.
-The csp.curve implementation is a good example of this.
-Single source to single timeseries adapters are of limited use however, and the more typical use case is for AdapterManager based input adapters to service multiple InputAdapters from a single source.
-For this one would use an AdapterManager to coordinate processing of the data source, and ManagedSimInputAdapter as the individual timeseries providers.
-
-#### PullInputAdapter - Python
-
-To write a Python based PullInputAdapter one must write a class that derives from csp.impl.pulladapter.PullInputAdapter.
-The derived type should then define two methods:
-
-- `def start(self, start_time, end_time)`: this will be called at the start of the engine with the start/end times of the engine.
-  start_time and end_time will be tz-unaware datetime objects in UTC time.
-  At this point the adapter should open its resource and seek to the requested starttime.
-- `def next(self)`: this method will be repeatedly called by the engine.
-  The adapter should return the next event as a (time, value) tuple.
-  If there are no more events, then the method should return None.
-
-The PullInputAdapter that you define will be used as the runtime *--impl--*.
-You also need to define a *--graph--* time representation of the time series edge.
-In order to do this you should define a csp.impl.wiring.py_pull_adapter_def.
-The py_pull_adapter_def creates a *--graph--* time representation of your adapter:
-
-```python
-def py_pull_adapter_def(name, adapterimpl, out_type, **kwargs)
-```
-
-- **`name`**: string name for the adapter
-- **`adapterimpl`**: a derived implementation of csp.impl.pulladapter.PullInputAdapter
-- **`out_type`**: the type of the output, should be a `ts[]` type. Note this can use tvar types if a subsequent argument defines the tvar
-- **`kwargs`**: \*\*kwargs here will be passed through as arguments to the PullInputAdapter implementation
-
-Note that the \*\*kwargs passed to py_pull_adapter_def should be the names and types of the variables, like arg1=type1, arg2=type2.
-These are the names of the kwargs that the returned input adapter will take and pass through to the PullInputAdapter implementation, and the types expected for the values of those args.
-
-csp.curve is a good simple example of this:
-
-```python
-import copy
-from csp.impl.pulladapter import PullInputAdapter
-from csp.impl.wiring import py_pull_adapter_def
-from csp import ts
-from datetime import timedelta
-
-
-class Curve(PullInputAdapter):
-    def __init__(self, typ, data):
-        ''' data should be a list of tuples of (datetime, value) or (timedelta, value)'''
-        self._data = data
-        self._index = 0
-        super().__init__()
-
-    def start(self, start_time, end_time):
-        if isinstance(self._data[0][0], timedelta):
-            self._data = copy.copy(self._data)
-            for idx, data in enumerate(self._data):
-                self._data[idx] = (start_time + data[0], data[1])
-
-        while self._index < len(self._data) and self._data[self._index][0] < start_time:
-            self._index += 1
-
-        super().start(start_time, end_time)
-
-    def next(self):
-        if self._index < len(self._data):
-            time, value = self._data[self._index]
-            if time <= self._end_time:
-                self._index += 1
-                return time, value
-        return None
-
-
-curve = py_pull_adapter_def('curve', Curve, ts['T'], typ='T', data=list)
-```
-
-Now curve can be called in graph code to create a curve input adapter:
-
-```python
-x = csp.curve(int, [ (t1, v1), (t2, v2), .. ])
-csp.print('x', x)
-```
-
-See example "e_14_user_adapters_01_pullinput.py" for a complete working example.
-
-#### PullInputAdapter - C++
-
-**Step 1)** PullInputAdapter impl
-
-The C++ API is similar to the Python PullInputAdapter API and can be leveraged to improve the performance of an adapter implementation.
-The *--impl--* is very similar to the python pull adapter.
-One should derive from `PullInputAdapter`, a templatized base class (templatized on the type of the timeseries), and define these methods:
-
-- **`start(DateTime start, DateTime end)`**: similar to the python API start, called when the engine starts.
-  Open the resource and seek to the start time here
-- **`stop()`**: called on engine shutdown, clean up the resource
-- **`bool next(DateTime & t, T & value)`**: if there is data to provide, sets the next time and value for the adapter and returns true.
-  Otherwise, return false
-
-**Step 2)** Expose creator func to python
-
-Now that we have a C++ impl defined, we need to expose a python creator for it.
-Define a method that conforms to the signature:
-
-```cpp
-csp::InputAdapter * create_my_adapter(
-    csp::AdapterManager * manager,
-    PyEngine * pyengine,
-    PyTypeObject * pyType,
-    PushMode pushMode,
-    PyObject * args)
-```
-
-- **`manager`**: will be nullptr for pull adapters
-- **`pyengine`**: PyEngine engine wrapper object
-- **`pyType`**: this is the type of the timeseries input adapter to be created as a PyTypeObject.
-  One can switch on this type using switchPyType to create the properly typed instance
-- **`pushMode`**: the csp PushMode for the adapter (pass through to the base InputAdapter)
-- **`args`**: arguments to pass to the adapter impl
-
-Then simply register the creator method:
-
-**`REGISTER_INPUT_ADAPTER(_my_adapter, create_my_adapter)`**
-
-This will register `_my_adapter` onto your python module, to be accessed as your_module._my_adapter.
-Note this uses csp/python/InitHelpers which is used in the \_cspimpl module.
-To do this in a separate python module, you need to register InitHelpers in that module.
-
-**Step 3)** Define your *--graph--* time adapter
-
-It is now a one-liner to wrap your impl in a graph time construct using csp.impl.wiring.input_adapter_def:
-
-```python
-my_adapter = input_adapter_def('my_adapter', my_module._my_adapter, ts[int], arg1=int, arg2={str:'foo'})
-```
-
-my_adapter can now be called with arg1, arg2 to create adapters in your graph.
-Note that the arguments are typed using `v=t` syntax; `v=(t, default)` is used to define arguments with defaults.
-
-Also note that all input adapters implicitly get a push_mode argument that is defaulted to csp.PushMode.LAST_VALUE.
-
-#### ManagedSimInputAdapter - Python
-
-In most cases you will likely want to expose a single source of data into multiple input adapters.
-For this use case your adapter should define an AdapterManager *--graph--* time component and an AdapterManagerImpl *--impl--* runtime component.
-The AdapterManager *--graph--* time component just represents the parameters needed to create the *--impl--* AdapterManager.
-It's the *--impl--* that will have the actual implementation that will open the data source, parse the data and provide it to individual adapters.
-
-Similarly you will need to define a derived ManagedSimInputAdapter *--impl--* component to handle events directed at an individual time series adapter.
-
-**NOTE** It is highly recommended not to open any resources in the *--graph--* time component.
-Graph time components can be pruned and/or memoized into a single instance, so opening resources at graph time shouldn't be necessary.
-
-#### AdapterManager - **--graph-- time**
-
-The graph time AdapterManager doesn't need to derive from any interface.
-It should be initialized with any information the impl needs in order to open/process the data source (e.g. csv file, time column, db connection information, etc.).
-It should also have an API to create individual timeseries adapters.
-These adapters will then get passed the adapter manager *--impl--* as an argument when they are created, so that they can register themselves for processing.
-The AdapterManager also needs to define a **\_create** method.
-The **\_create** method is the bridge between the *--graph--* time AdapterManager representation and the runtime *--impl--* object.
-**\_create** will be called on the *--graph--* time AdapterManager which will in turn create the *--impl--* instance.
-\_create will get two arguments: engine (this represents the runtime engine object that will run the graph) and a memo dict which can optionally be used for any memoization that one might want.
-
-Let's take a look at CSVReader as an example:
-
-```python
-# GRAPH TIME
-class CSVReader:
-    def __init__(self, filename, time_converter, delimiter=',', symbol_column=None):
-        self._filename = filename
-        self._symbol_column = symbol_column
-        self._delimiter = delimiter
-        self._time_converter = time_converter
-
-    def subscribe(self, symbol, typ, field_map=None):
-        return CSVReadAdapter(self, symbol, typ, field_map)
-
-    def _create(self, engine, memo):
-        return CSVReaderImpl(engine, self)
-```
-
-- **`__init__`**: as you can see, all `__init__` does is keep the parameters that the impl will need.
-- **`subscribe`**: API to create an individual timeseries / edge from this file for the given symbol.
-  typ denotes the type of the timeseries to create (i.e. `ts[int]`) and field_map is used for mapping columns onto csp.Struct types.
-  Note that subscribe returns a CSVReadAdapter instance.
-  CSVReadAdapter is the *--graph--* time representation of the edge (similar to how we defined csp.curve above).
-  We pass it `self` as its first argument, which will be used to create the AdapterManager *--impl--*
-- **`_create`**: the method to create the *--impl--* object from the given *--graph--* time representation of the manager
-
-The CSVReader would then be used in graph building code like so:
-
-```python
-reader = CSVReader('my_data.csv', time_formatter, symbol_column='SYMBOL', delimiter='|')
-# aapl will represent a ts[PriceQuantity] edge that will tick with rows from
-# the csv file matching on SYMBOL column AAPL
-aapl = reader.subscribe('AAPL', PriceQuantity)
-```
-
-##### AdapterManager - **--impl-- runtime**
-
-The AdapterManager *--impl--* is responsible for opening the data source, parsing and processing through all the data and managing all the adapters it needs to feed.
-The impl class should derive from csp.impl.adaptermanager.AdapterManagerImpl and implement the following methods:
-
-- **`start(self, starttime, endtime)`**: this is called when the engine starts up.
-  At this point the impl should open the resource providing the data and seek to starttime.
-  starttime/endtime will be tz-unaware datetime objects in UTC time
-- **`stop(self)`**: this is called at the end of the run; resources should be cleaned up at this point
-- **`process_next_sim_timeslice(self, now)`**: this method will be called multiple times through the run.
-  The initial call will provide now with starttime.
-  The impl's responsibility is to process all data at the given timestamp (more on how to do this below).
-  The method should return the next time in the data source, or None if there is no more data to process.
-  The method will be called again with the provided timestamp as "now" in the next iteration.
-  **NOTE** that process_next_sim_timeslice is required to move ahead in time.
-  In most cases the resource data can be supplied in time order; if not, it would have to be sorted up front.
-
-process_next_sim_timeslice should parse the data for a given time/row and then push it through to any registered ManagedSimInputAdapter that matches on the given row.
-
-##### ManagedSimInputAdapter - **--impl-- runtime**
-
-Users will need to define ManagedSimInputAdapter derived types to represent the individual timeseries adapter *--impl--* objects.
-Objects should derive from csp.impl.adaptermanager.ManagedSimInputAdapter.
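-
-As a minimal sketch (the manager impl and its `register_input_adapter` method are assumed here; a full CSVReader-based version appears below), a derived adapter typically just self-registers with its manager and defers to the base class:
-
-```python
-from csp.impl.adaptermanager import ManagedSimInputAdapter
-
-
-class MyManagedAdapterImpl(ManagedSimInputAdapter):
-    def __init__(self, manager_impl, symbol, typ, field_map):
-        # self-register so the manager knows to feed this adapter rows for the given symbol
-        manager_impl.register_input_adapter(symbol, self)
-        super().__init__(typ, field_map)
-```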
-
-ManagedSimInputAdapter.`__init__` takes two arguments:
-
-- **`typ`**: this is the type of the timeseries, i.e. int for a `ts[int]`
-- **`field_map`**: Optional, field_map is a dictionary used to map source column names → csp.Struct field names.
-
-ManagedSimInputAdapter defines a method `push_tick()` which takes the value to feed the input for the given timeslice (as defined by "now" at the adapter manager level).
-There is also a convenience method called `process_dict()` which will take a dictionary of `{column : value}` entries and convert it properly into the right value based on the given **field_map**.
-
-##### ManagedSimInputAdapter - **--graph-- time**
-
-As with the csp.curve example, we need to define a graph-time construct that represents a ManagedSimInputAdapter edge.
-In order to define this we use py_managed_adapter_def.
-py_managed_adapter_def is AdapterManager-"aware" and will properly create the AdapterManager *--impl--* the first time it's encountered.
-It will then pass the manager impl as an argument to the ManagedSimInputAdapter.
-
-```python
-def py_managed_adapter_def(name, adapterimpl, out_type, manager_type, **kwargs):
-    """
-    Create a graph representation of a python managed sim input adapter.
-    :param name: string name for the adapter
-    :param adapterimpl: a derived implementation of csp.impl.adaptermanager.ManagedSimInputAdapter
-    :param out_type: the type of the output, should be a ts[] type. Note this can use tvar types if a subsequent argument defines the tvar
-    :param manager_type: the type of the graph time representation of the AdapterManager that will manage this adapter
-    :param kwargs: **kwargs will be passed through as arguments to the ManagedSimInputAdapter implementation;
-    the first argument to the implementation will be the adapter manager impl instance
-    """
-```
-
-##### Example - CSVReader
-
-Putting this all together, let's take a look at a CSVReader implementation
-and step through what's going on:
-
-```python
-import csv as pycsv
-from datetime import datetime
-
-from csp import ts
-from csp.impl.adaptermanager import AdapterManagerImpl, ManagedSimInputAdapter
-from csp.impl.wiring import py_managed_adapter_def
-
-# GRAPH TIME
-class CSVReader:
-    def __init__(self, filename, time_converter, delimiter=',', symbol_column=None):
-        self._filename = filename
-        self._symbol_column = symbol_column
-        self._delimiter = delimiter
-        self._time_converter = time_converter
-
-    def subscribe(self, symbol, typ, field_map=None):
-        return CSVReadAdapter(self, symbol, typ, field_map)
-
-    def _create(self, engine, memo):
-        return CSVReaderImpl(engine, self)
-```
-
-Here we define CSVReader, our AdapterManager *--graph--* time representation.
-It holds the parameters that will be used for the impl, it implements a `subscribe()` call for users to create timeseries, and it defines a `_create` method to create a runtime *--impl--* instance from the graph-time representation.
-Note how in `subscribe()` we pass `self` to the CSVReadAdapter; this is what binds the input adapter to this AdapterManager.
-
-```python
-# RUN TIME
-class CSVReaderImpl(AdapterManagerImpl):                     # 1
-    def __init__(self, engine, adapterRep):                  # 2
-        super().__init__(engine)                             # 3
-                                                             # 4
-        self._rep = adapterRep                               # 5
-        self._inputs = {}                                    # 6
-        self._csv_reader = None                              # 7
-        self._next_row = None                                # 8
-                                                             # 9
-    def start(self, starttime, endtime):                     # 10
-        self._csv_reader = pycsv.DictReader(                 # 11
-            open(self._rep._filename, 'r'),                  # 12
-            delimiter=self._rep._delimiter                   # 13
-        )                                                    # 14
-        self._next_row = None                                # 15
-                                                             # 16
-        for row in self._csv_reader:                         # 17
-            time = self._rep._time_converter(row)            # 18
-            self._next_row = row                             # 19
-            if time >= starttime:                            # 20
-                break                                        # 21
-                                                             # 22
-    def stop(self):                                          # 23
-        self._csv_reader = None                              # 24
-                                                             # 25
-    def register_input_adapter(self, symbol, adapter):       # 26
-        if symbol not in self._inputs:                       # 27
-            self._inputs[symbol] = []                        # 28
-        self._inputs[symbol].append(adapter)                 # 29
-                                                             # 30
-    def process_next_sim_timeslice(self, now):               # 31
-        if not self._next_row:                               # 32
-            return None                                      # 33
-                                                             # 34
-        while True:                                          # 35
-            time = self._rep._time_converter(self._next_row) # 36
-            if time > now:                                   # 37
-                return time                                  # 38
-            self.process_row(self._next_row)                 # 39
-            try:                                             # 40
-                self._next_row = next(self._csv_reader)      # 41
-            except StopIteration:                            # 42
-                return None                                  # 43
-                                                             # 44
-    def process_row(self, row):                              # 45
-        symbol = row[self._rep._symbol_column]               # 46
-        if symbol in self._inputs:                           # 47
-            for input in self._inputs.get(symbol, []):       # 48
-                input.process_dict(row)                      # 49
-```
-
-CSVReaderImpl is the runtime *--impl--*.
-It gets created when the engine is being built from the described graph.
-
-- **lines 10-21 - start()**: this is the start method that gets called with the time range the graph will be run against.
-  Here we open our resource (pycsv.DictReader) and scan through the data until we reach the requested starttime.
-
-- **lines 23-24 - stop()**: this is the stop call that gets called when the engine is done running and is shut down; we free our resource here
-
-- **lines 26-29**: the CSVReader allows one to subscribe to many symbols from one file.
-  Symbols are keyed by a provided SYMBOL column.
-  The individual adapters will self-register with the CSVReaderImpl when they are created with the requested symbol.
-  CSVReaderImpl keeps track of which adapters have been registered for which symbols in its self.\_inputs map
-
-- **lines 31-43**: this is the main method that gets invoked repeatedly throughout the run.
-  For every distinct timestamp in the file, this method will get invoked once; the method is expected to go through the resource data for all points at time "now", process the rows and push the data to any matching adapters.
-  The method returns the next timestamp when it's done processing all data for "now", or None if there is no more data.
-  **NOTE** that the csv impl expects the data to be in time order.
-  process_next_sim_timeslice must advance time forward.
-
-- **lines 45-49**: this method takes a row of data (provided as a dict from DictReader), extracts the symbol and pushes the row through to all input adapters that match.
-
-```python
-class CSVReadAdapterImpl(ManagedSimInputAdapter):            # 1
-    def __init__(self, managerImpl, symbol, typ, field_map): # 2
-        managerImpl.register_input_adapter(symbol, self)     # 3
-        super().__init__(typ, field_map)                     # 4
-                                                             # 5
-CSVReadAdapter = py_managed_adapter_def(                     # 6
-    'csvadapter',
-    CSVReadAdapterImpl,
-    ts['T'],
-    CSVReader,
-    symbol=str,
-    typ='T',
-    field_map=(object, None)
-)
-```
-
-- **line 3**: this is where the instance of an adapter *--impl--* registers itself with the CSVReaderImpl.
-- **line 6+**: this is where we define CSVReadAdapter, the *--graph--* time representation of a CSV adapter, returned from CSVReader.subscribe.
-
-See example "e_14_user_adapters_02_adaptermanager_siminput" for another example of how to write a managed sim adapter manager.
-
-### Realtime Adapters
-
-#### PushInputAdapter - Python
-
-To write a Python based PushInputAdapter one must write a class that derives from csp.impl.pushadapter.PushInputAdapter.
-The derived type should then define two methods:
-
-- `def start(self, start_time, end_time)`: this will be called at the start of the engine with the start/end times of the engine.
-  start_time and end_time will be tz-unaware datetime objects in UTC time (generally these aren't needed for realtime adapters).
-  At this point the adapter should open its resource / connect to the data source / start any driver threads that are needed.
-- `def stop(self)`: this method will be called when the engine is done running.
-  At this point any open threads should be stopped and resources cleaned up.
-
-The PushInputAdapter that you define will be used as the runtime *--impl--*.
-You also need to define a *--graph--* time representation of the time series edge.
-In order to do this you should define a csp.impl.wiring.py_push_adapter_def.
-The py_push_adapter_def creates a *--graph--* time representation of your adapter:
-
-**def py_push_adapter_def(name, adapterimpl, out_type, \*\*kwargs)**
-
-- **`name`**: string name for the adapter
-- **`adapterimpl`**: a derived implementation of
-  csp.impl.pushadapter.PushInputAdapter
-- **`out_type`**: the type of the output, should be a ts\[\] type.
-  Note this can use tvar types if a subsequent argument defines the
-  tvar
-- **`kwargs`**: \*\*kwargs here will be passed through as arguments to the
-  PushInputAdapter implementation
-
-Note that the \*\*kwargs passed to py_push_adapter_def should be the names and types of the variables, like arg1=type1, arg2=type2.
-These are the names of the kwargs that the returned input adapter will take and pass through to the PushInputAdapter implementation, and the types expected for the values of those args.
-
-Example e_14_user_adapters_03_pushinput.py demonstrates a simple example of this:
-
-```python
-from csp.impl.pushadapter import PushInputAdapter
-from csp.impl.wiring import py_push_adapter_def
-import csp
-from csp import ts
-from datetime import datetime, timedelta
-import threading
-import time
-
-
-# The Impl object is created at runtime when the graph is converted into the runtime engine
-# it does not exist at graph building time!
-class MyPushAdapterImpl(PushInputAdapter):
-    def __init__(self, interval):
-        print("MyPushAdapterImpl::__init__")
-        self._interval = interval
-        self._thread = None
-        self._running = False
-
-    def start(self, starttime, endtime):
-        """ start will get called at the start of the engine, at which point the push
-        input adapter should start its thread that will push the data onto the adapter. Note
-        that push adapters will ALWAYS have a separate thread driving ticks into the csp engine thread
-        """
-        print("MyPushAdapterImpl::start")
-        self._running = True
-        self._thread = threading.Thread(target=self._run)
-        self._thread.start()
-
-    def stop(self):
-        """ stop will get called at the end of the run, at which point resources should
-        be cleaned up
-        """
-        print("MyPushAdapterImpl::stop")
-        if self._running:
-            self._running = False
-            self._thread.join()
-
-    def _run(self):
-        counter = 0
-        while self._running:
-            self.push_tick(counter)
-            counter += 1
-            time.sleep(self._interval.total_seconds())
-
-
-# MyPushAdapter is the graph-building time construct. This is simply a representation of what the
-# input adapter is and how to create it, including the Impl to create and arguments to pass into it
-MyPushAdapter = py_push_adapter_def('MyPushAdapter', MyPushAdapterImpl, ts[int], interval=timedelta)
-```
-
-Note how the `_run` loop calls **self.push_tick**.
-This is the call that gets data from the adapter thread ticking into the csp engine.
-
-Now MyPushAdapter can be called in graph code to create a timeseries that is sourced by MyPushAdapterImpl:
-
-```python
-@csp.graph
-def my_graph():
-    # At this point we create the graph-time representation of the input adapter. This will be converted
-    # into the impl once the graph is done constructing and the engine is created in order to run
-    data = MyPushAdapter(timedelta(seconds=1))
-    csp.print('data', data)
-```
-
-#### GenericPushAdapter
-
-If you don't need as much control as PushInputAdapter provides, or if you have some existing source of data on a thread you can't control, another option is to use the higher-level abstraction csp.GenericPushAdapter.
-csp.GenericPushAdapter wraps a csp.PushInputAdapter implementation internally and provides a simplified interface.
-The downside of csp.GenericPushAdapter is that you lose some control of when the input feed starts and stops.
-
-Let's take a look at the example found in "e_14_generic_push_adapter":
-
-```python
-# This is an example of some separate thread providing data
-class Driver:
-    def __init__(self, adapter : csp.GenericPushAdapter):
-        self._adapter = adapter
-        self._active = False
-        self._thread = None
-
-    def start(self):
-        self._active = True
-        self._thread = threading.Thread(target=self._run)
-        self._thread.start()
-
-    def stop(self):
-        if self._active:
-            self._active = False
-            self._thread.join()
-
-    def _run(self):
-        print("driver thread started")
-        counter = 0
-        # Optionally, we can wait for the adapter to start before proceeding
-        # Alternatively we can start pushing data, but push_tick may fail and return False if
-        # the csp engine isn't ready yet
-        self._adapter.wait_for_start()
-
-        while self._active and not self._adapter.stopped():
-            self._adapter.push_tick(counter)
-            counter += 1
-            time.sleep(1)
-
-@csp.graph
-def my_graph():
-    adapter = csp.GenericPushAdapter(int)
-    driver = Driver(adapter)
-    # Note that the driver thread starts *before* the engine is started here, which means some ticks may potentially get dropped if the
-    # data source doesn't wait for the adapter to start.
-    # This may be ok for some feeds, but not others
-    driver.start()
-
-    # Let's be nice and shut down the driver thread when the engine is done
-    csp.schedule_on_engine_stop(driver.stop)
-```
-
-In this example we have this dummy Driver class which simply represents some external source of data which arrives on a thread that's completely independent of the engine.
-We pass along a csp.GenericPushAdapter instance to this thread, which can then call adapter.push_tick to get data into the engine (see the `_run` method).
-
-The `wait_for_start` call in `_run` demonstrates an optional feature which allows the unrelated thread to wait for the adapter to be ready to accept data before ticking data onto it.
-If push_tick is called before the engine starts / the adapter is ready to receive data, it will simply drop the data.
-Note that GenericPushAdapter.push_tick will return a bool to indicate whether the data was successfully pushed to the engine or not.
-
-### Realtime AdapterManager
-
-In most cases you will likely want to expose a single source of data into multiple input adapters.
-For this use case your adapter should define an AdapterManager *--graph--* time component and an AdapterManagerImpl *--impl--* runtime component.
-The AdapterManager *--graph--* time component just represents the parameters needed to create the *--impl--* AdapterManager.
-It's the *--impl--* that will have the actual implementation that will open the data source, parse the data and provide it to individual adapters.
-
-Similarly you will need to define a derived PushInputAdapter *--impl--* component to handle events directed at an individual time series adapter.
-
-**NOTE** It is highly recommended not to open any resources in the *--graph--* time component.
-Graph time components can be pruned and/or memoized into a single instance, so opening resources at graph time shouldn't be necessary.
-
-#### AdapterManager - **--graph-- time**
-
-The graph time AdapterManager doesn't need to derive from any interface.
-It should be initialized with any information the impl needs in order to open/process the data source (e.g. ActiveMQ connection information, server host/port, multicast channels, config files, etc.).
-It should also have an API to create individual timeseries adapters.
-These adapters will then get passed the adapter manager *--impl--* as an argument when they are created, so that they can register themselves for processing.
-The AdapterManager also needs to define a **\_create** method.
-The **\_create** method is the bridge between the *--graph--* time AdapterManager representation and the runtime *--impl--* object.
-**\_create** will be called on the *--graph--* time AdapterManager which will in turn create the *--impl--* instance.
-\_create will get two arguments: engine (this represents the runtime engine object that will run the graph) and a memo dict which can optionally be used for any memoization that one might want.
-
-Let's take a look at the example found in
-"e_14_user_adapters_04_adaptermanager_pushinput":
-
-```python
-# This object represents our AdapterManager at graph time.
-# It describes the manager's properties
-# and will be used to create the actual impl when it's time to build the engine
-class MyAdapterManager:
-    def __init__(self, interval: timedelta):
-        """
-        Normally one would pass properties of the manager here, i.e. filename,
-        message bus, etc
-        """
-        self._interval = interval
-
-    def subscribe(self, symbol, push_mode=csp.PushMode.NON_COLLAPSING):
-        """ User facing API to subscribe to a timeseries stream from this adapter manager """
-        # This will return a graph-time timeseries edge representing an edge from this
-        # adapter manager for the given symbol / arguments
-        return MyPushAdapter(self, symbol, push_mode=push_mode)
-
-    def _create(self, engine, memo):
-        """ This method will get called at engine build time, at which point the graph time manager representation
-        will create the actual impl that will be used for runtime
-        """
-        # Normally you would pass the arguments down into the impl here
-        return MyAdapterManagerImpl(engine, self._interval)
-```
-
-- **\_\_init\_\_** - as you can see, all \_\_init\_\_ does is keep the parameters that the impl will need.
-- **subscribe** - API to create an individual timeseries / edge from this manager for the given symbol.
-  The interface defined here is up to the adapter writer, but generally "subscribe" is recommended, and it should take any number of arguments needed to define a single stream of data.
-  *MyPushAdapter* is the *--graph--* time representation of the edge, which will be described below.
-  We pass it *self* as its first argument, which will be used to create the AdapterManager *--impl--*
-- **\_create** - the method to create the *--impl--* object from the given *--graph--* time representation of the manager
-
-MyAdapterManager would then be used in graph building code like so:
-
-```python
-adapter_manager = MyAdapterManager(timedelta(seconds=0.75))
-data = adapter_manager.subscribe('AAPL', push_mode=csp.PushMode.LAST_VALUE)
-csp.print('AAPL last_value', data)
-```
-
-#### AdapterManager - **--impl-- runtime**
-
-The AdapterManager *--impl--* is responsible for opening the data source, parsing and processing all the data and managing all the adapters it needs to feed.
-The impl class should derive from csp.impl.adaptermanager.AdapterManagerImpl and implement the following methods:
-
-- **`start(self, starttime, endtime)`**: this is called when the engine starts up.
-  At this point the impl should open the resource providing the data and start up any thread(s) needed to listen and react to external data.
-  starttime/endtime will be tz-unaware datetime objects in UTC time, though typically these aren't needed for realtime adapters
-- **`stop(self)`**: this is called at the end of the run; resources should be cleaned up at this point
-- **`process_next_sim_timeslice(self, now)`**: this is used by sim adapters; for realtime adapter managers we simply return None
-
-In the example manager, we spawn a processing thread in the `start()` call.
-This thread runs in a loop until it is shut down, and will generate random data to tick out to the registered input adapters.
-Data is passed to a given adapter by calling **push_tick()**.
-
-#### PushInputAdapter - **--impl-- runtime**
-
-Users will need to define PushInputAdapter derived types to represent the individual timeseries adapter *--impl--* objects.
-Objects should derive from csp.impl.pushadapter.PushInputAdapter.
-
-PushInputAdapter defines a method `push_tick()` which takes the value to feed the input timeseries.
-
-#### PushInputAdapter - **--graph-- time**
-
-Similar to the standalone PushInputAdapter described above, we need to define a graph-time construct that represents a PushInputAdapter edge.
-In order to define this we use py_push_adapter_def again, but this time we pass the adapter manager *--graph--* time type so that it gets constructed properly.
-When the PushInputAdapter instance is created it will also receive an instance of the adapter manager *--impl--*, on which it can then self-register.
-
-```python
-def py_push_adapter_def(name, adapterimpl, out_type, manager_type=None, memoize=True, force_memoize=False, **kwargs):
-    """
-    Create a graph representation of a python push input adapter.
-    :param name: string name for the adapter
-    :param adapterimpl: a derived implementation of csp.impl.pushadapter.PushInputAdapter
-    :param out_type: the type of the output, should be a ts[] type. Note this can use tvar types if a subsequent argument defines the tvar
-    :param manager_type: the type of the graph time representation of the AdapterManager that will manage this adapter
-    :param kwargs: **kwargs will be passed through as arguments to the PushInputAdapter implementation;
-    the first argument to the implementation will be the adapter manager impl instance
-    """
-```
-
-#### Example
-
-Continuing with the --graph-- time AdapterManager described above, we now define the impl:
-
-```python
-# This is the actual manager impl that will be created and executed during runtime
-class MyAdapterManagerImpl(AdapterManagerImpl):
-    def __init__(self, engine, interval):
-        super().__init__(engine)
-
-        # These are just used to simulate a data source
-        self._interval = interval
-        self._counter = 0
-
-        # We will keep track of requested input adapters here
-        self._inputs = {}
-
-        # Our driving thread, all realtime adapters will need a separate thread of execution that
-        # drives data into the engine thread
-        self._running = False
-        self._thread = None
-
-    def start(self, starttime, endtime):
-        """ start will get called at the start of the engine run.
-        At this point
-        one would start up the realtime data source / spawn the driving thread(s) and
-        subscribe to the needed data """
-        self._running = True
-        self._thread = threading.Thread(target=self._run)
-        self._thread.start()
-
-    def stop(self):
-        """ This will be called at the end of the engine run, at which point resources should be
-        closed and cleaned up """
-        if self._running:
-            self._running = False
-            self._thread.join()
-
-    def register_input_adapter(self, symbol, adapter):
-        """ Actual PushInputAdapters will self register when they are created as part of the engine
-        This is the place we gather all requested input adapters and their properties
-        """
-        if symbol not in self._inputs:
-            self._inputs[symbol] = []
-        # Keep a list of adapters by key in case we get duplicate adapters (should be memoized in reality)
-        self._inputs[symbol].append(adapter)
-
-    def process_next_sim_timeslice(self, now):
-        """ This method is only used by simulated / historical adapters, for realtime we just return None """
-        return None
-
-    def _run(self):
-        """ Our driving thread, in reality this will be reacting to external events, parsing the data and
-        pushing it into the respective adapter
-        """
-        symbols = list(self._inputs.keys())
-        while self._running:
-            # Let's pick a random symbol from the requested symbols
-            symbol = symbols[random.randint(0, len(symbols) - 1)]
-            adapters = self._inputs[symbol]
-            data = MyData(symbol=symbol, value=self._counter)
-            self._counter += 1
-            for adapter in adapters:
-                adapter.push_tick(data)
-
-            time.sleep(self._interval.total_seconds())
-```
-
-Then we define our PushInputAdapter --impl--, which basically just
-self-registers with the adapter manager --impl-- upon construction. We
-also define our PushInputAdapter *--graph--* time construct using `py_push_adapter_def`.
-
-```python
-# The Impl object is created at runtime when the graph is converted into the runtime engine
-# it does not exist at graph building time.
-# A managed push adapter impl will get the
-# adapter manager runtime impl as its first argument
-class MyPushAdapterImpl(PushInputAdapter):
-    def __init__(self, manager_impl, symbol):
-        print(f"MyPushAdapterImpl::__init__ {symbol}")
-        manager_impl.register_input_adapter(symbol, self)
-        super().__init__()
-
-
-MyPushAdapter = py_push_adapter_def('MyPushAdapter', MyPushAdapterImpl, ts[MyData], MyAdapterManager, symbol=str)
-```
-
-And then we can run our adapter in a csp graph:
-
-```python
-@csp.graph
-def my_graph():
-    print("Start of graph building")
-
-    adapter_manager = MyAdapterManager(timedelta(seconds=0.75))
-    symbols = ['AAPL', 'IBM', 'TSLA', 'GS', 'JPM']
-    for symbol in symbols:
-        # your data source might tick faster than the engine thread can consume it
-        # push_mode determines how buffered-up tick events will get processed
-        # LAST_VALUE will conflate and only tick the latest value since the last cycle
-        data = adapter_manager.subscribe(symbol, csp.PushMode.LAST_VALUE)
-        csp.print(symbol + " last_value", data)
-
-        # BURST will change the timeseries type from ts[T] to ts[[T]] (list of ticks)
-        # that will tick with all values that have buffered since the last engine cycle
-        data = adapter_manager.subscribe(symbol, csp.PushMode.BURST)
-        csp.print(symbol + " burst", data)
-
-        # NON_COLLAPSING will tick all events without collapsing, unrolling the events
-        # over multiple engine cycles
-        data = adapter_manager.subscribe(symbol, csp.PushMode.NON_COLLAPSING)
-        csp.print(symbol + " non_collapsing", data)
-
-    print("End of graph building")
-
-
-csp.run(my_graph, starttime=datetime.utcnow(), endtime=timedelta(seconds=10), realtime=True)
-```
-
-Do note that realtime adapters will only run in realtime engines (note the `realtime=True` argument to `csp.run`).
-
-## Output Adapters
-
-Output adapters are used to define graph outputs, and they differ from input adapters in a number of important ways.
-Output adapters also differ from terminal nodes, i.e. regular `csp.node` instances that do not define outputs and instead consume their inputs inside `csp.ticked` blocks.
-
-For many use cases, it will be sufficient to omit writing an output adapter entirely.
-Consider the following example of a terminal node that writes an input dictionary timeseries to a file.
-
-```python
-@csp.node
-def write_to_file(x: ts[Dict], filename: str):
-    if csp.ticked(x):
-        with open(filename, "a") as fp:
-            fp.write(json.dumps(x))
-```
-
-This is a perfectly fine node, and serves its purpose.
-Unlike input adapters, output adapters do not need to differentiate between *historical* and *realtime* mode.
-Input adapters drive the execution of the graph, whereas output adapters are reactive to their input nodes and subject to the graph's execution.
-
-However, there are a number of reasons why you might want to define an output adapter instead of using a vanilla node.
-The most important of these is when you want to share resources across a number of output adapters (e.g. with a Manager), or between an input and an output node, e.g. reading data from a websocket, routing it through your csp graph, and publishing data *to the same websocket connection*.
-For most use cases, a vanilla csp node will suffice, but let's explore some anyway.
-
-### OutputAdapter - Python
-
-To write a Python based OutputAdapter one must write a class that derives from `csp.impl.outputadapter.OutputAdapter`.
-The derived type should define the method:
-
-- `def on_tick(self, time: datetime, value: object)`: this will be called when the input to the output adapter ticks.
-
-The OutputAdapter that you define will be used as the runtime *--impl--*. You also need to define a *--graph--* time representation of the time series edge.
-In order to do this you should define a csp.impl.wiring.py_output_adapter_def.
-The py_output_adapter_def creates a *--graph--* time representation of your adapter:
-
-**def py_output_adapter_def(name, adapterimpl, \*\*kwargs)**
-
-- **`name`**: string name for the adapter
-- **`adapterimpl`**: a derived implementation of `csp.impl.outputadapter.OutputAdapter`
-- **`kwargs`**: \*\*kwargs here will be passed through as arguments to the OutputAdapter implementation
-
-Note that the `**kwargs` passed to py_output_adapter_def should be the names and types of the variables, like `arg1=type1, arg2=type2`.
-These are the names of the kwargs that the returned output adapter will take and pass through to the OutputAdapter implementation, and the types expected for the values of those args.
-
-Here is a simple example of the same filewriter from above:
-
-```python
-from csp.impl.outputadapter import OutputAdapter
-from csp.impl.wiring import py_output_adapter_def
-from csp import ts
-import csp
-from json import dumps
-from datetime import datetime, timedelta
-
-
-class MyFileWriterAdapterImpl(OutputAdapter):
-    def __init__(self, filename: str):
-        super().__init__()
-        self._filename = filename
-
-    def start(self):
-        self._fp = open(self._filename, "a")
-
-    def stop(self):
-        self._fp.close()
-
-    def on_tick(self, time, value):
-        self._fp.write(dumps(value) + "\n")
-
-
-MyFileWriterAdapter = py_output_adapter_def(
-    name='MyFileWriterAdapter',
-    adapterimpl=MyFileWriterAdapterImpl,
-    input=ts['T'],
-    filename=str,
-)
-```
-
-Now our adapter can be called in graph code:
-
-```python
-@csp.graph
-def my_graph():
-    curve = csp.curve(
-        data=[
-            (timedelta(seconds=0), {"a": 1, "b": 2, "c": 3}),
-            (timedelta(seconds=1), {"a": 1, "b": 2, "c": 3}),
-            (timedelta(seconds=1), {"a": 1, "b": 2, "c": 3}),
-        ],
-        typ=object,
-    )
-
-    MyFileWriterAdapter(curve, filename="testfile.jsonl")
-```
-
-As explained above, we could also do this via a single node (this is probably the best of the three versions):
-
-```python
-@csp.node
-def dump_json(data: ts['T'], filename: str):
-    with csp.state():
-        s_file = None
-    with csp.start():
-        s_file = open(filename, "w")
-    with csp.stop():
-        s_file.close()
-    if csp.ticked(data):
-        s_file.write(json.dumps(data) + "\n")
-        s_file.flush()
-```
-
-### OutputAdapter - C++
-
-TODO
-
-### OutputAdapter with Manager
-
-Adapter managers function the same way for output adapters as for input adapters, i.e. they manage a single shared resource from the manager across a variety of discrete output adapters.
-
-### InputOutputAdapter - Python
-
-As a last example, let's tie everything together and implement a managed push input adapter combined with a managed output adapter.
-This example is available in `e_14_user_adapters_05_adaptermanager_inputoutput`.
-
-First, we will define our adapter manager.
-In this example, we're going to cheat a little bit and combine our adapter manager (graph time) and our adapter manager impl (run time).
-
-```python
-class MyAdapterManager(AdapterManagerImpl):
-    '''
-    This example adapter will generate random `MyData` structs every `interval`. This simulates an upstream
-    data feed, which we "connect" to only a single time.
-    We then multiplex the results to an arbitrary
-    number of subscribers via the `subscribe` method.
-
-    We can also receive messages via the `publish` method from an arbitrary number of publishers. These messages
-    are demultiplexed into a number of outputs, simulating sharing a connection to a downstream feed or responses
-    to the upstream feed.
-    '''
-    def __init__(self, interval: timedelta):
-        self._interval = interval
-        self._counter = 0
-        self._subscriptions = {}
-        self._publications = {}
-        self._running = False
-        self._thread = None
-
-    def subscribe(self, symbol):
-        '''This method creates a new input adapter implementation via the manager.'''
-        return _my_input_adapter(self, symbol, push_mode=csp.PushMode.NON_COLLAPSING)
-
-    def publish(self, data: ts['T'], symbol: str):
-        '''This method creates a new output adapter implementation via the manager.'''
-        return _my_output_adapter(self, data, symbol)
-
-    def _create(self, engine, memo):
-        # We'll avoid having a second class and make our AdapterManager and AdapterManagerImpl the same
-        super().__init__(engine)
-        return self
-
-    def start(self, starttime, endtime):
-        self._running = True
-        self._thread = threading.Thread(target=self._run)
-        self._thread.start()
-
-    def stop(self):
-        if self._running:
-            self._running = False
-            self._thread.join()
-
-        # print closing of the resources
-        for name in self._publications.values():
-            print("closing asset {}".format(name))
-
-    def register_subscription(self, symbol, adapter):
-        if symbol not in self._subscriptions:
-            self._subscriptions[symbol] = []
-        self._subscriptions[symbol].append(adapter)
-
-    def register_publication(self, symbol):
-        if symbol not in self._publications:
-            self._publications[symbol] = "publication_{}".format(symbol)
-
-    def _run(self):
-        '''This method runs in a background thread and generates random input events to push to the corresponding adapter'''
-        symbols = list(self._subscriptions.keys())
-        while self._running:
-            # Let's pick a random symbol from the requested symbols
-            symbol = symbols[random.randint(0, len(symbols) - 1)]
-
-            data = MyData(symbol=symbol, value=self._counter)
-
-            self._counter += 1
-
-            for adapter in self._subscriptions[symbol]:
-                # push to all the subscribers
-                adapter.push_tick(data)
-
-            time.sleep(self._interval.total_seconds())
-
-    def _on_tick(self, symbol, value):
-        '''This method just writes the data to the appropriate outbound "channel"'''
-        print("{}:{}".format(self._publications[symbol], value))
-```
-
-This adapter manager is a bit of a silly example, but it demonstrates the core concepts.
-The adapter manager will demultiplex a shared stream (in this case, the stream defined in `_run` is a random sequence of `MyData` structs) between all the input adapters it manages.
-The input adapter itself will do nothing more than let the adapter manager know that it exists:
-
-```python
-class MyInputAdapterImpl(PushInputAdapter):
-    '''Our input adapter is a very simple implementation, and just
-    defers its work back to the manager who is expected to deal with
-    sharing a single connection.
-    '''
-    def __init__(self, manager, symbol):
-        manager.register_subscription(symbol, self)
-        super().__init__()
-```
-
-Similarly, the adapter manager will multiplex the output adapter streams, in this case combining them into streams of print statements.
-And similar to the input adapter, the output adapter does little more than let the adapter manager know that it has work available, using its triggered `on_tick` method to call the adapter manager's `_on_tick` method.
-
-```python
-class MyOutputAdapterImpl(OutputAdapter):
-    '''Similarly, our output adapter is simple as well, deferring
-    its functionality to the manager
-    '''
-    def __init__(self, manager, symbol):
-        manager.register_publication(symbol)
-        self._manager = manager
-        self._symbol = symbol
-        super().__init__()
-
-    def on_tick(self, time, value):
-        self._manager._on_tick(self._symbol, value)
-```
-
-As a last step, we need to ensure that the runtime adapter implementations are registered with our graph:
-
-```python
-_my_input_adapter = py_push_adapter_def(name='MyInputAdapter', adapterimpl=MyInputAdapterImpl, out_type=ts[MyData], manager_type=MyAdapterManager, symbol=str)
-_my_output_adapter = py_output_adapter_def(name='MyOutputAdapter', adapterimpl=MyOutputAdapterImpl, manager_type=MyAdapterManager, input=ts['T'], symbol=str)
-```
-
-To test this example, we will:
-
-- instantiate our manager
-- subscribe to a certain number of input adapter "streams" (which the adapter manager will demultiplex out of a single random node)
-- print the data
-- sink each stream into a smaller number of output adapters (which the adapter manager will multiplex into print statements)
-
-```python
-@csp.graph
-def my_graph():
-    adapter_manager = MyAdapterManager(timedelta(seconds=0.75))
-
-    data_1 = adapter_manager.subscribe("data_1")
-    data_2 = adapter_manager.subscribe("data_2")
-    data_3 = adapter_manager.subscribe("data_3")
-
-    csp.print("data_1", data_1)
-    csp.print("data_2", data_2)
-    csp.print("data_3", data_3)
-
-    # pump two streams into 1 output and 1 stream into another
-    adapter_manager.publish(data_1, "data_1")
-    adapter_manager.publish(data_2, "data_1")
-    adapter_manager.publish(data_3, "data_3")
-```
-
-Here is the result of a single run:
-
-```
-2023-02-15 19:14:53.859951 data_1:MyData(symbol=data_1, value=0)
-publication_data_1:MyData(symbol=data_1, value=0)
-2023-02-15 19:14:54.610281 data_3:MyData(symbol=data_3, value=1)
-publication_data_3:MyData(symbol=data_3, value=1)
-2023-02-15 19:14:55.361157 data_3:MyData(symbol=data_3, value=2)
-publication_data_3:MyData(symbol=data_3, value=2)
-2023-02-15 19:14:56.112030 data_2:MyData(symbol=data_2, value=3)
-publication_data_1:MyData(symbol=data_2, value=3)
-2023-02-15 19:14:56.862881 data_2:MyData(symbol=data_2, value=4)
-publication_data_1:MyData(symbol=data_2, value=4)
-2023-02-15 19:14:57.613775 data_1:MyData(symbol=data_1, value=5)
-publication_data_1:MyData(symbol=data_1, value=5)
-2023-02-15 19:14:58.364408 data_3:MyData(symbol=data_3, value=6)
-publication_data_3:MyData(symbol=data_3, value=6)
-2023-02-15 19:14:59.115290 data_2:MyData(symbol=data_2, value=7)
-publication_data_1:MyData(symbol=data_2, value=7)
-2023-02-15 19:14:59.866160 data_2:MyData(symbol=data_2, value=8)
-publication_data_1:MyData(symbol=data_2, value=8)
-2023-02-15 19:15:00.617068 data_1:MyData(symbol=data_1, value=9)
-publication_data_1:MyData(symbol=data_1, value=9)
-2023-02-15 19:15:01.367955 data_2:MyData(symbol=data_2, value=10)
-publication_data_1:MyData(symbol=data_2, value=10)
-2023-02-15 19:15:02.118259 data_3:MyData(symbol=data_3, value=11)
-publication_data_3:MyData(symbol=data_3, value=11)
-2023-02-15 19:15:02.869170 data_2:MyData(symbol=data_2, value=12)
-publication_data_1:MyData(symbol=data_2, value=12)
-2023-02-15 19:15:03.620047 data_1:MyData(symbol=data_1, value=13)
-publication_data_1:MyData(symbol=data_1, value=13)
-closing asset publication_data_1
-closing asset publication_data_3
-```
-
-Although simple, this example demonstrates the utility of the adapters and adapter managers.
-An input resource is managed by one entity, distributed across a variety of downstream subscribers.
-Then a collection of streams is piped back into a single entity.
diff --git a/docs/wiki/6.-Dynamic-Graphs.md b/docs/wiki/6.-Dynamic-Graphs.md
deleted file mode 100644
index d9c188ac..00000000
--- a/docs/wiki/6.-Dynamic-Graphs.md
+++ /dev/null
@@ -1,110 +0,0 @@
-`csp` graphs are somewhat limiting in that they cannot change shape once the process starts up.
-`csp` dynamic graphs address this issue by introducing a construct that allows applications to dynamically add / remove sub-graphs from a running graph.
-
-# csp.DynamicBasket
-
-`csp` dynamic baskets are a pre-requisite construct needed for dynamic graphs.
-csp.DynamicBaskets work just like regular static `csp` baskets; however, dynamic baskets can change their shape over time.
-csp.DynamicBaskets can only be created from either `csp` nodes or from csp.dynamic calls, as described below.
-A node can take a csp.DynamicBasket as an input or generate a dynamic basket as an output.
-Dynamic baskets are always dictionary-style baskets, where time series can be added by key.
-Note that timeseries can also be removed from dynamic baskets.
-
-## Syntax
-
-Dynamic baskets are denoted by the type `csp.DynamicBasket[key_type, ts_type]`, so for example `csp.DynamicBasket[str,int]` would be a dynamic basket that will have keys of type str, and timeseries of type int.
-One can also use the non-python shorthand `{ ts[str] : ts[int] }` to signify the same.
-
-## Generating dynamic basket output
-
-Nodes that generate dynamic basket output use the same interface as regular basket outputs.
-The difference is that if you output a key that hasn't been seen before, it will automatically be added to the dynamic basket.
-In order to remove a key from a dynamic basket output, you would use the csp.remove_dynamic_key method.
-**NOTE** that it is illegal to add and remove a key in the same cycle:
-
-```python
-@csp.node
-def dynamic_demultiplex_example(data : ts['T'], key : ts['K']) -> csp.DynamicBasket['K', 'T']:
-    if csp.ticked(data) and csp.valid(key):
-        csp.output({key: data})
-
-
-    ## To remove a key, which wouldn't be done in this example node:
-    ## csp.remove_dynamic_key(key)
-```
-
-To remove a key one would use `csp.remove_dynamic_key`.
-For a single unnamed output, the method expects the key.
-For named outputs, the arguments would be `csp.remove_dynamic_key(output_name, key)`.
-
-## Consuming dynamic basket input
-
-Taking dynamic baskets as input is exactly the same as static baskets.
-There is one additional bit of information available on dynamic basket inputs though, which is the .shape property.
-As keys are added or removed, the `basket.shape` property will tick with the change events.
-The `.shape` property behaves effectively as a `ts[csp.DynamicBasketEvents]`: - -```python -@csp.node -def consume_dynamic_basket(data : csp.DynamicBasket[str,int]): - if csp.ticked(data.shape): - for key in data.shape.added: - print(f'key {key} was added') - for key in data.shape.removed: - print(f'key {key} was removed') - - - if csp.ticked(data): - for key,value in data.tickeditems(): - #...regular basket access here -``` - -# csp.dynamic - -- **`csp.dynamic(trigger, sub_graph, graph_args...) → csp.DynamicBasket[ ... ]`** - - **`trigger`**: a csp.DynamicBasket input. - As new keys are added to the basket, they will trigger sub_graph instances to be created. - As keys are removed, they will shutdown their respective sub-graph - - **`sub_graph`** - a regular csp.graph method that will be wired as new keys are added on trigger - - **`graph_args`**: these are the args passed to the sub_graph at the time of creation. - Note the special semantics of argument passing to dynamic sub-graphs: - - **`scalars`**: can be passed as is, assuming they are known at main graph build time - - **`timeseries`** - can be passed as is, assuming they are known at main graph build time - - **`csp.snap(ts)`**: this will convert a timeseries input to a **`scalar`** at the time of graph creation, allowing you to get a "dynamic" scalar value to use at sub_graph build time - - **`csp.snapkey()`**: this will pass through the key that was added which triggered this dynamic sub-graph. - One can use this to get the key triggering the sub-graph. - - **`csp.attach()`**: this will pass through the timeseries of the input trigger for the key which triggered this dynamic sub-graph. - For example, say we have a dynamic basket of `{ symbol : ts[orders ]}` as our input trigger. - As a new symbol is added, we will trigger a sub-graph to process this symbol. - Say we also want to feed in the `ts[orders]` for the given symbol into our sub_graph, we would pass `csp.attach()` as the argument. - - **`output`**: every output of sub_graph (if there are any) will be returned as a member of a csp.DynamicBasket output. - As new keys are added to the trigger, which generates sub-graphs, keys will be added to the output dynamic basket - (Note, output keys will only generate on first tick of some output data, not upon instantiation of the sub-graph, since csp.DynamicBasket requires all keys to have valid values) - -```python -@csp.graph -def my_sub_graph(symbol : str, orders : ts[ Orders ], portfolio_position : ts[int], some_scalar : int) -> ts[Fill]: - ... regular csp.graph code ... 
- - -@csp.graph -def main(): - # position as ts[int] - portfolio_position = get_portfolio_position() - - - all_orders = get_orders() - # demux fat-pipe of orders into a dynamic basket keyed by symbol - demuxed_orders = csp.dynamic_demultiplex(all_orders, all_orders.symbol) - - - result = csp.dynamic(demuxed_orders, my_sub_graph, - csp.snap(all_orders.symbol), # Grab scalar value of all_orders.symbol at time of instantiation - #csp.snapkey(), # Alternative way to grab the key that instantiated the sub-graph - csp.attach(), # extract the demuxed_orders[symbol] time series of the symbol being created in the sub_graph - portfolio_position, # pass in regular ts[] - 123) # pass in some scalar - - - # process result.fills which will be a csp.DynamicBasket of { symbol : ts[Fill] } -``` diff --git a/docs/wiki/9.-Caching.md b/docs/wiki/9.-Caching.md deleted file mode 100644 index e9351960..00000000 --- a/docs/wiki/9.-Caching.md +++ /dev/null @@ -1,3 +0,0 @@ -`csp` provides a caching layer of graph outputs. The caching layer is generally a parquet writer/reader wrapper of graph outputs. The system automatically manages resolving the run time of the engine and resolving whether the data can be read from cache or isn't available in cache (in which case data will be written to cache). Future runs can then read the data from cache and avoid calculations of the same data. Goals of the caching layer: - -More documentation to follow! diff --git a/docs/wiki/Home.md b/docs/wiki/Home.md index 8901882e..9d7ae95f 100644 --- a/docs/wiki/Home.md +++ b/docs/wiki/Home.md @@ -1,70 +1,38 @@ -`csp` ("Composable Stream Processing") is a functional-like reactive -language that makes time-series stream processing simple to do.  The -main reactive engine is a C++ based engine which has been exposed to -python ( other languages may optionally be extended in future versions -). `csp` applications define a connected graph of components using a -declarative language (which is essentially python).  Once a graph is -constructed it can be run using the C++ engine. Graphs are composed of -some number of "input" adapters, a set of connected calculation "nodes" -and at the end sent off to "output" adapters. Inputs as well as the -engine can be seamlessly run in simulation mode using historical input -adapters or in realtime mode using realtime input adapters. + + + + CSP logo mark - text will be black in light color mode and white in dark color mode. + -# Contents +CSP (Composable Stream Processing) is a library for high-performance real-time event stream processing in Python. -- [0. Introduction](https://github.com/Point72/csp/wiki/0.-Introduction) -- [1. Generic Nodes (csp.baselib)]() -- [2. Math Nodes (csp.math)]() -- [3. Statistics Nodes (csp.stats)]() -- [4. Random Time Series Generation]() -- [5. Adapters](https://github.com/Point72/csp/wiki/5.-Adapters) -- [6. Dynamic Graphs](https://github.com/Point72/csp/wiki/6.-Dynamic-Graphs) -- [7. csp.Struct](https://github.com/Point72/csp/wiki/7.-csp.Struct) -- [8. Profiler](https://github.com/Point72/csp/wiki/8.-Profiler) -- [9. Caching](https://github.com/Point72/csp/wiki/9.-Caching) +## Key Features -# Installation +- **Powerful C++ Engine:** Execute the graph using CSP's C++ Graph Processing Engine +- **Simulation (i.e., offline) mode:** Test workflows on historical data and quickly move to real-time data in deployment +- **Infrastructure-agnostic:** Connect to any data format or storage database, using built-in (Parquet, Kafka, etc.) 
or custom adapters +- **Highly-customizable:** Write your own input and output adapters for any data/storage formats, and real-time adapters for specific workflows +- **PyData interoperability:** Use your favorite libraries from the Scientific Python Ecosystem for numerical and statistical computations +- **Functional/declarative style:** Write concise and composable code for stream processing by building graphs in Python -We ship binary wheels to install `csp` on MacOS and Linux via `pip`: + -```bash -pip install csp -``` +## Get Started -Other platforms will need to see the instructions to [build `csp` from -source](https://github.com/Point72/csp/wiki/98.-Building-From-Source). +- [Install CSP](Installation) and [write your first CSP program](First-Steps) +- Learn more about [nodes](CSP-Node), [graphs](CSP-Graph), and [execution modes](Execution-Modes) +- Learn to extend CSP with [adapters](Adapters) -We plan to create conda packages on conda-forge and ship binaries for Windows in -the near future. + -# Contributing +> \[!TIP\] +> Find relevant docs with GitHub’s search function, use `repo:Point72/csp type:wiki ` to search the documentation Wiki Pages. -Contributions are welcome on this project. We distribute under the terms of the [Apache 2.0 license](https://github.com/Point72/csp/blob/main/LICENSE). +## Community -For **bug reports** or **small feature requests**, please open an issue on our [issues page](https://github.com/Point72/csp/issues). +- [Contribute](Contribute) to CSP and help improve the project +- Read about future plans in the [project roadmap](Roadmap) -For **questions** or to discuss **larger changes or features**, please use our [discussions page](https://github.com/Point72/csp/discussions). +## License -For **contributions**, please see our [developer documentation](https://github.com/Point72/csp/wiki/99.-Developer). We have `help wanted` and `good first issue` tags on our issues page, so these are a great place to start. - -For **documentation updates**, make PRs that update the pages in `/docs/wiki`. The documentation is pushed to the GitHub wiki automatically through a GitHub workflow. Note that direct updates to this wiki will be overwritten. - -# Roadmap - -We do not have a formal roadmap, but we're happy to discuss features, improvements, new adapters, etc, in our [discussions area](https://github.com/Point72/csp/discussions). Here are some high level items we hope to accomplish in the next few months: - -- Support `clang` compiler and full MacOS support ([#33](https://github.com/Point72/csp/issues/33) / [#132](https://github.com/Point72/csp/pull/132)) -- Support `msvc` compiler and full Windows support ([#109](https://github.com/Point72/csp/issues/109)) -- Establish a better pattern for adapters ([#165](https://github.com/Point72/csp/discussions/165)) - -## Adapters and Extensions - -- Redis Pub/Sub Adapter with [Redis-plus-plus](https://github.com/sewenew/redis-plus-plus) ([#61](https://github.com/Point72/csp/issues/61)) -- C++-based websocket adapter - - Client adapter in [#152](https://github.com/Point72/csp/pull/152) -- C++-based HTTP/SSE adapter -- Add support for other graph viewers, including interactive / standalone / Jupyter - -## Other Open Source Projects - -- `csp-gateway`: Application development framework, built with [FastAPI](https://fastapi.tiangolo.com) and [Perspective](https://github.com/finos/perspective). This is a library we have built internally at Point72 on top of `csp` that we hope to open source later in 2024. 
It allows for easier construction of modular `csp` applications, along with a pluggable REST/WebSocket API and interactive UI.
+CSP is licensed under the Apache 2.0 license. See the [LICENSE](https://github.com/Point72/csp/blob/main/LICENSE) file for details.
diff --git a/docs/wiki/_Footer.md b/docs/wiki/_Footer.md
new file mode 100644
index 00000000..602a2550
--- /dev/null
+++ b/docs/wiki/_Footer.md
@@ -0,0 +1 @@
+_This wiki is autogenerated. To make updates, open a PR against the original source file in [`docs/wiki`](https://github.com/Point72/csp/tree/main/docs/wiki)._
diff --git a/docs/wiki/_Sidebar.md b/docs/wiki/_Sidebar.md
new file mode 100644
index 00000000..cd137edf
--- /dev/null
+++ b/docs/wiki/_Sidebar.md
@@ -0,0 +1,61 @@
+
+
+**[Home](Home)**
+
+**Get Started (Tutorials)**
+
+- [Installation](Installation)
+- [First steps](First-Steps)
+
+
+
+**Concepts**
+
+- [CSP Node](CSP-Node)
+- [CSP Graph](CSP-Graph)
+- [Historical Buffers](Historical-Buffers)
+- [Execution Modes](Execution-Modes)
+- [Adapters](Adapters)
+
+**How-to guides**
+
+- [Use Statistical Nodes](Use-Statistical-Nodes)
+- Use Adapters (coming soon)
+- [Add Cycles in Graphs](Add-Cycles-in-Graphs)
+- [Create Dynamic Baskets](Create-Dynamic-Baskets)
+- Write Adapters:
+  - [Write Historical Input Adapters](Write-Historical-Input-Adapters)
+  - [Write Realtime Input Adapters](Write-Realtime-Input-Adapters)
+  - [Write Output Adapters](Write-Output-Adapters)
+- [Profile CSP Code](Profile-CSP-Code)
+
+**References**
+
+- API Reference
+  - [Base Nodes API](Base-Nodes-API)
+  - [Base Adapters API](Base-Adapters-API)
+  - [Math and Logic Nodes API](Math-and-Logic-Nodes-API)
+  - [Statistical Nodes API](Statistical-Nodes-API)
+  - [Functional Methods API](Functional-Methods-API)
+  - [Adapters (Kafka, Parquet, DBReader) API](Input-Output-Adapters-API)
+  - [Random Time Series Generators API](Random-Time-Series-Generators-API)
+  - [`csp.Struct` API](csp.Struct-API)
+  - [`csp.dynamic` API](csp.dynamic-API)
+  - [`csp.profiler` API](csp.profiler-API)
+- [Examples](Examples)
+- [Glossary of Terms](Glossary)
+
+**Developer Guide**
+
+- [Contributing](Contribute)
+- [Development Setup](Local-Development-Setup)
+- [Build CSP from Source](Build-CSP-from-Source)
+- [GitHub Conventions (for maintainers)](GitHub-Conventions)
+- [Release Process (for maintainers)](Release-Process)
+- [Roadmap](Roadmap)
diff --git a/docs/wiki/api-references/Base-Adapters-API.md b/docs/wiki/api-references/Base-Adapters-API.md
new file mode 100644
index 00000000..a72820cc
--- /dev/null
+++ b/docs/wiki/api-references/Base-Adapters-API.md
@@ -0,0 +1,110 @@
+`csp.baselib` defines some generally useful adapters, which are also imported directly into the CSP namespace when importing CSP.
+
+These are all graph-time constructs.
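+
+As a quick illustration, here is a minimal, hypothetical sketch that wires two of these adapters together (`csp.timer` and `csp.const`, with `csp.print` from the base nodes used to show the ticks):
+
+```python
+import csp
+from datetime import datetime, timedelta
+
+
+@csp.graph
+def example():
+    # csp.timer ticks True every second; csp.const ticks 42 once at engine start
+    csp.print('timer', csp.timer(timedelta(seconds=1)))
+    csp.print('const', csp.const(42))
+
+
+if __name__ == '__main__':
+    csp.run(example, starttime=datetime(2020, 1, 1), endtime=datetime(2020, 1, 1, 0, 0, 3))
+```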
+
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [`csp.timer`](#csptimer)
+- [`csp.const`](#cspconst)
+- [`csp.curve`](#cspcurve)
+- [`csp.add_graph_output`](#cspadd_graph_output)
+- [`csp.feedback`](#cspfeedback)
+
+## `csp.timer`
+
+```python
+csp.timer(
+    interval: timedelta,
+    value: '~T' = True,
+    allow_deviation: bool = False
+)
+```
+
+This will create a repeating timer edge that will tick on the given `timedelta` with the given value (value defaults to `True`, returning a `ts[bool]`)
+
+Args:
+
+- **`interval`**: how often to tick value
+- **`value`**: the actual value that will tick every interval (defaults to the value `True`)
+- **`allow_deviation`**: When running in realtime the engine will ensure timers execute exactly when requested on their intervals.
+  If your engine begins to lag, timers will still execute at the expected time "in the past" as the engine catches up
+  (imagine having a `csp.timer` fire every 1/2 second but the engine becomes delayed for 1 second.
+  By default the half seconds will still execute until time catches up to wallclock).
+  When `allow_deviation` is `True`, and the engine is in realtime mode, subsequent timers will always be scheduled from the current wallclock + interval,
+  so they won't end up lagging behind, at the expense of some timer skew.
+
+## `csp.const`
+
+```python
+csp.const(
+    value: '~T',
+    delay: timedelta = timedelta()
+)
+```
+
+This will create an edge that ticks one time with the value provided.
+By default this will tick at the start of the engine; a `delay` can be provided to delay the tick.
+
+## `csp.curve`
+
+```python
+csp.curve(
+    typ: 'T',
+    data: typing.Union[list, tuple]
+)
+```
+
+This allows you to convert a list of non-CSP data into a ticking edge in CSP
+
+Args:
+
+- **`typ`**: is the type of the value of the data of this edge
+- **`data`**: is either a list of tuples of `(datetime, value)`, or a tuple of two equal-length numpy ndarrays, the first with datetimes and the second with values.
+  In either case, that will tick on the returned edge into the engine, and the data must be in time order.
+  Note that for the list of tuples case, you can also provide tuples of (timedelta, value) where timedelta will be the offset from the engine's start time.
+
+## `csp.add_graph_output`
+
+```python
+csp.add_graph_output(
+    key: object,
+    input: ts['T'],
+    tick_count: int = -1,
+    tick_history: timedelta = timedelta()
+)
+```
+
+This allows you to connect an edge as a "graph output".
+All edges added as outputs will be returned to the caller from `csp.run` as a dictionary of `key: [(datetime, value)]`
+(list of datetime, values that ticked on the edge) or if `csp.run` is passed `output_numpy=True`, as a dictionary of
+`key: (array, array)` (tuple of two numpy arrays, one with datetimes and one with values).
+See [Collecting Graph Outputs](CSP-Graph#collecting-graph-outputs)
+
+Args:
+
+- **`key`**: key to return the results as from `csp.run`
+- **`input`**: edge to connect
+- **`tick_count`**: number of ticks to keep in the buffer (defaults to -1, meaning all ticks)
+- **`tick_history`**: time window of ticks to keep (defaults to keeping all history)
+
+## `csp.feedback`
+
+```python
+csp.feedback(typ)
+```
+
+`csp.feedback` is a construct that can be used to create artificial loops in the graph.
+Use feedbacks to delay-bind an input to a node, making it possible to create a loop
+(think of writing a simulated exchange that takes orders in and needs to feed responses back to the originating node).
+
+`csp.feedback` itself is not an edge; it's a construct that allows you to access the delayed edge / bind a delayed input.
+
+Args:
+
+- **`typ`**: type of the edge's data to be bound
+
+Methods:
+
+- **`out()`**: call this method on the feedback object to get the edge which can be wired as an input
+- **`bind(x: ts[object])`**: call this to bind an edge to the feedback
diff --git a/docs/wiki/1.-Generic-Nodes-(csp.baselib).md b/docs/wiki/api-references/Base-Nodes-API.md
similarity index 54%
rename from docs/wiki/1.-Generic-Nodes-(csp.baselib).md
rename to docs/wiki/api-references/Base-Nodes-API.md
index 913b39e6..81acf4b8 100644
--- a/docs/wiki/1.-Generic-Nodes-(csp.baselib).md
+++ b/docs/wiki/api-references/Base-Nodes-API.md
@@ -1,114 +1,43 @@
-# Intro
-
 CSP comes with some basic constructs readily available and commonly used.
-The latest set of baselib nodes / adapters can be found in the csp.baselib module.
-
-All of the nodes / adapters noted here are imported directly into the csp namespace when importing csp.
-These are all graph-time constructs.
-
-# Adapters
-
-## `timer`
-
-```python
-csp.timer(
-    interval: timedelta,
-    value: '~T' = True,
-    allow_deviation: bool = False
-)
-```
-
-This will create a repeating timer edge that will tick on the given `timedelta` with the given value (value defaults to `True`, returning a `ts[bool]`)
-
-Args:
-
-- **`interval`**: how often to tick value
-- **`value`**: the actual value that will tick every interval (defaults to the value `True`)
-- **`allow_deviation`**: When running in realtime the engine will ensure timers execute exactly when they requested on their intervals.
-  If your engine begins to lag, timers will still execute at the expected time "in the past" as the engine catches up
-  (imagine having a `csp.timer` fire every 1/2 second but the engine becomes delayed for 1 second.
-  By default the half seconds will still execute until time catches up to wallclock).
-  When `allow_deviation` is `True`, and the engine is in realtime mode, subsequent timers will always be scheduled from the current wallclock + interval,
-  so they won't end up lagging behind at the expensive of the timer skewing.
-
-## `const`
-
-```python
-csp.const(
-    value: '~T',
-    delay: timedelta = timedelta()
-)
-```
-
-This will create an edge that ticks one time with the value provided.
-By default this will tick at the start of the engine, delta can be provided to delay the tick
-
-## `curve`
-
-```python
-csp.curve(
-    typ: 'T',
-    data: typing.Union[list, tuple]
-)
-```
-
-This allows you to convert a list of non-csp data into a ticking edge in csp
-
-Args:
-
-- **`typ`**: is the type of the value of the data of this edge
-- **`data`**: is either a list of tuples of `(datetime, value)`, or a tuple of two equal-length numpy ndarrays, the first with datetimes and the second with values.
-  In either case, that will tick on the returned edge into the engine, and the data must be in time order.
-  Note that for the list of tuples case, you can also provide tuples of (timedelta, value) where timedelta will be the offset from the engine's start time.
- -## `add_graph_output` - -```python -csp.add_graph_output( - key: object, - input: ts['T'], - tick_count: int = -1, - tick_history: timedelta = timedelta() -) -``` - -This allows you to connect an edge as a "graph output". -All edges added as outputs will be returned to the caller from `csp.run` as a dictionary of `key: [(datetime, value)]` -(list of datetime, values that ticked on the edge) or if `csp.run` is passed `output_numpy=True`, as a dictionary of -`key: (array, array)` (tuple of two numpy arrays, one with datetimes and one with values). -See [Collecting Graph Outputs](https://github.com/Point72/csp/wiki/0.-Introduction#collecting-graph-outputs) - -Args: - -- **`key`**: key to return the results as from csp.run -- **`input`**: edge to connect -- **`tick_count`**: number of ticks to keep in the buffer (defaults to -1 - all ticks) -- **`tick_history`**: amount of ticks to keep by time window (defaults to keeping all history) - -## `feedback` - -```python -csp.feedback(typ) -``` - -`csp.feedback` is a construct that can be used to create artificial loops in the graph. -Use feedbacks in order to delay bind an input to a node in order to be able to create a loop -(think of writing a simulated exchange that takes orders in and needs to feed responses back to the originating node). - -`csp.feedback` itself is not an edge, its a construct that allows you to access the delayed edge / bind a delayed input. - -Args: - -- **`typ`**: type of the edge's data to be bound - -Methods: +The latest set of base nodes can be found in the `csp.baselib` module. -- **`out()`**: call this method on the feedback object to get the edge which can be wired as an input -- **`bind(x: ts[object])`**: call this to bind an edge to the feedback +All of the nodes noted here are imported directly into the CSP namespace when importing CSP. -# Basic Nodes +These are all graph-time constructs. -## `print` +## Table of Contents + +- [Table of Contents](#table-of-contents) +- [`csp.print`](#cspprint) +- [`csp.log`](#csplog) +- [`csp.sample`](#cspsample) +- [`csp.firstN`](#cspfirstn) +- [`csp.count`](#cspcount) +- [`csp.delay`](#cspdelay) +- [`csp.diff`](#cspdiff) +- [`csp.merge`](#cspmerge) +- [`csp.split`](#cspsplit) +- [`csp.filter`](#cspfilter) +- [`csp.drop_dups`](#cspdrop_dups) +- [`csp.unroll`](#cspunroll) +- [`csp.collect`](#cspcollect) +- [`csp.flatten`](#cspflatten) +- [`csp.default`](#cspdefault) +- [`csp.gate`](#cspgate) +- [`csp.apply`](#cspapply) +- [`csp.null_ts`](#cspnull_ts) +- [`csp.stop_engine`](#cspstop_engine) +- [`csp.multiplex`](#cspmultiplex) +- [`csp.demultiplex`](#cspdemultiplex) +- [`csp.dynamic_demultiplex`](#cspdynamic_demultiplex) +- [`csp.dynamic_collect`](#cspdynamic_collect) +- [`csp.drop_nans`](#cspdrop_nans) +- [`csp.times`](#csptimes) +- [`csp.times_ns`](#csptimes_ns) +- [`csp.accum`](#cspaccum) +- [`csp.exprtk`](#cspexprtk) + +## `csp.print` ```python csp.print( @@ -118,7 +47,7 @@ csp.print( This node will print (using python `print()`) the time, tag and value of `x` for every tick of `x` -## `log` +## `csp.log` ```python csp.log( @@ -132,7 +61,7 @@ csp.log( ``` Similar to `csp.print`, this will log ticks using the logger on the provided level. -The default 'csp' logger is used if none is provided to the node. +The default CSP logger is used if none is provided to the node. Args: @@ -141,7 +70,7 @@ Args: This can be useful when printing large strings in log calls. 
If individual time-series values are subject to modification *after* the log call, then the user must pass in a copy of the time-series if they wish to have proper threaded logging. -## `sample` +## `csp.sample` ```python csp.sample( @@ -154,7 +83,7 @@ Use this to down-sample an input. `csp.sample` will return the current value of `x` any time trigger ticks. This can be combined with `csp.timer` to sample the input on a time interval. -## `firstN` +## `csp.firstN` ```python csp.firstN( @@ -165,15 +94,15 @@ csp.firstN( Only output the first `N` ticks of the input. -## `count` +## `csp.count` ```python -csp.count(x: ts[object]) → ts[int] +csp.count(x: ts[object]) → ts[int] ``` Returns the ticking count of ticks of the input -## `delay` +## `csp.delay` ```python csp.delay( @@ -184,7 +113,7 @@ csp.delay( This will delay all ticks of the input `x` by the given `delay`, which can be given as a `timedelta` to delay a specified amount of time, or as an int to delay a specified number of ticks (delay must be positive) -## `diff` +## `csp.diff` ```python csp.diff( @@ -195,7 +124,7 @@ csp.diff( When `x` ticks, output difference between current tick and value time or ticks ago (once that exists) -## `merge` +## `csp.merge` ```python csp.merge( x: ts['T'], y: ts['T']) → ts['T'] @@ -203,9 +132,9 @@ csp.merge( x: ts['T'], y: ts['T']) → ts['T'] Merges the two timeseries `x` and `y` into a single series. If both tick on the same cycle, the first input (`x`) wins and the value of `y` is dropped. -For loss-less merging see `csp.flatten` +For loss-less merging see `csp.flatten` -## `split` +## `csp.split` ```python csp.split( @@ -219,7 +148,7 @@ If `flag` is `True` when `x` ticks, output 'true' will tick with the value of `x If `flag` is `False` at the time of the input tick, then 'false' will tick. Note that if flag is not valid at the time of the input tick, the input will be dropped. -## `filter` +## `csp.filter` ```python csp.filter(flag: ts[bool], x: ts['T']) → ts['T'] @@ -228,7 +157,7 @@ csp.filter(flag: ts[bool], x: ts['T']) → ts['T'] Will only tick out input ticks of `x` if the current value of `flag` is `True`. If flag is `False`, or if flag is not valid (hasn't ticked yet) then `x` is suppressed. -## `drop_dups` +## `csp.drop_dups` ```python csp.drop_dups(x: ts['T']) → ts['T'] @@ -236,34 +165,34 @@ csp.drop_dups(x: ts['T']) → ts['T'] Will drop consecutive duplicate values from the input. -## `unroll` +## `csp.unroll` ```python csp.unroll(x: ts[['T']]) → ts['T'] ``` -Given a timeseries of a *list* of values, unroll will "unroll" the values in the list into a timeseries of the elements. +Given a timeseries of a *list* of values, unroll will "unroll" the values in the list into a timeseries of the elements. `unroll` will ensure to preserve the order across all list ticks. Ticks will be unrolled in subsequent engine cycles. 
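+
+As a minimal sketch of this behavior (a hypothetical example, assuming `csp.const` infers the list type for its value):
+
+```python
+import csp
+from csp import ts
+from datetime import datetime
+
+
+@csp.graph
+def example() -> ts[int]:
+    # a single tick carrying the list [1, 2, 3] is unrolled into
+    # three int ticks over consecutive engine cycles
+    return csp.unroll(csp.const([1, 2, 3]))
+
+
+if __name__ == '__main__':
+    csp.run(example, starttime=datetime(2020, 1, 1), endtime=datetime(2020, 1, 1, 0, 0, 1))
+```
+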
-## `collect` +## `csp.collect` ```python csp.collect(x: [ts['T']]) → ts[['T']] ``` -Given a basket of inputs, return a timeseries of a *list* of all values that ticked +Given a basket of inputs, return a timeseries of a *list* of all values that ticked -## `flatten` +## `csp.flatten` ```python csp.flatten(x: [ts['T']]) → ts['T'] ``` Given a basket of inputs, return all ticks across all inputs as a single timeseries of type 'T' -(This is similar to `csp.merge` except that it can take more than two inputs, and is lossless) +(This is similar to `csp.merge` except that it can take more than two inputs, and is lossless) -## `default` +## `csp.default` ```python csp.default( @@ -276,7 +205,7 @@ csp.default( Defaults the input series to the value of `default` at start of the engine, or after `delay` if `delay` is provided. If `x` ticks right at the start of the engine, or before `delay` if `delay` is provided, `default` value will be discarded. -## `gate` +## `csp.gate` ```python csp.gate( @@ -290,7 +219,7 @@ csp.gate( While open, the input will tick out as a single value burst. While closed, input ticks will buffer up until they can be released. -## `apply` +## `csp.apply` ```python csp.apply( @@ -302,7 +231,7 @@ csp.apply( Applies the provided callable `f` on every tick of the input and returns the result of the callable. -## `null_ts` +## `csp.null_ts` ```python csp.null_ts(typ: 'T') @@ -310,7 +239,7 @@ csp.null_ts(typ: 'T') Returns a "null" timeseries of the given type which will never tick. -## `stop_engine` +## `csp.stop_engine` ```python csp.stop_engine(x: ts['T']) @@ -318,7 +247,7 @@ csp.stop_engine(x: ts['T']) Forces the engine to stop if `x` ticks -## `multiplex` +## `csp.multiplex` ```python csp.multiplex( @@ -339,7 +268,7 @@ Args: the input basket whenever the key ticks (defaults to `False`) - **`raise_on_bad_key`**: if `True` an exception will be raised if key ticks with an unrecognized key (defaults to `False`) -## `demultiplex` +## `csp.demultiplex` ```python csp.demultiplex( @@ -350,18 +279,18 @@ csp.demultiplex( ) → {key: ts['T']} ``` -Given a single timeseries input, a key timeseries to demultiplex on and a set of expected keys, will output the given input onto the corresponding basket output of the current value of `key`. +Given a single timeseries input, a key timeseries to demultiplex on and a set of expected keys, will output the given input onto the corresponding basket output of the current value of `key`. A good example use case of this is demultiplexing a timeseries of trades by account. -Assuming your trade struct has an account field, you can `demultiplex(trades, trades.account, [ 'acct1', 'acct2', ... ])`. +Assuming your trade struct has an account field, you can `demultiplex(trades, trades.account, [ 'acct1', 'acct2', ... ])`. Args: - **`x`**: the input timeseries to demultiplex - **`key`**: a ticking timeseries of the current key to output to -- **`keys`**: a list of expected keys that will define the shape of the output basket.  The list of keys must be known at graph building time +- **`keys`**: a list of expected keys that will define the shape of the output basket. 
The list of keys must be known at graph building time - **`raise_on_bad_key`**: if `True` an exception will be raised of key ticks with an unrecognized key (defaults to `False`) -## `dynamic_demultiplex` +## `csp.dynamic_demultiplex` ```python csp.dynamic_demultiplex( @@ -372,7 +301,7 @@ csp.dynamic_demultiplex( Similar to `csp.demultiplex`, this version will return a [Dynamic Basket](https://github.com/Point72/csp/wiki/6.-Dynamic-Graphs) output that will dynamically add new keys as they are seen. -## `dynamic_collect` +## `csp.dynamic_collect` ```python csp.dynamic_collect( @@ -382,7 +311,7 @@ csp.dynamic_collect( Similar to `csp.collect`, this function takes a [Dynamic Basket](https://github.com/Point72/csp/wiki/6.-Dynamic-Graphs) input and returns a dictionary of the key-value pairs corresponding to the values that ticked. -## `drop_nans` +## `csp.drop_nans` ```python csp.drop_nans(x: ts[float]) → ts[float] @@ -390,7 +319,7 @@ csp.drop_nans(x: ts[float]) → ts[float] Filters nan (Not-a-number) values out of the time series. -## `times` +## `csp.times` ```python csp.times(x: ts['T']) → ts[datetime] @@ -398,7 +327,7 @@ csp.times(x: ts['T']) → ts[datetime] Given a timeseries, returns the time at which that series ticks -## `times_ns` +## `csp.times_ns` ```python csp.times_ns(x: ts['T']) → ts[int] @@ -406,7 +335,7 @@ csp.times_ns(x: ts['T']) → ts[int] Given a timeseries, returns the epoch time in nanoseconds at which that series ticks -## `accum` +## `csp.accum` ```python csp.accum(x: ts["T"], start: "~T" = 0) -> ts["T"] @@ -414,72 +343,7 @@ csp.accum(x: ts["T"], start: "~T" = 0) -> ts["T"] Given a timeseries, accumulate via `+=` with starting value `start`. -# Math and Logic nodes - -See [Math Nodes](). - -# Functional Methods - -Edges in csp contain some methods to serve as syntactic sugar for stringing nodes together in a pipeline. This makes it easier to read/modify workflows and avoids the need for nested brackets. - -## `apply` - -```python -Edge.apply(self, func, *args, **kwargs) -``` - -Calls `csp.apply` on the edge with the provided python `func`. - -Args: - -- **`func`**: A scalar function that will be applied on each value of the Edge. If a different output type is returned, pass a tuple `(f, typ)`, where `typ` is the output type of f -- **`args`**: Positional arguments passed into `func` -- **`kwargs`**: Dictionary of keyword arguments passed into func - -## `pipe` - -```python -Edge.pipe(self, node, *args, **kwargs) -``` - -Calls the `node` on the edge. - -Args: - -- **`node`**: A graph node that will be applied to the Edge, which is passed into node as the first argument. - Alternatively, a `(node, edge_keyword)` tuple where `edge_keyword` is a string indicating the keyword of node that expects the edge. 
-- **`args`**: Positional arguments passed into `node` -- **`kwargs`**: Dictionary of keyword arguments passed into `node` - -## `run` - -```python -Edge.run(self, node, *args, **kwargs) -``` - -Alias for `csp.run(self, *args, **kwargs)` - -## Example of functional methods - -```python -import csp -from datetime import datetime, timedelta -import math - -(csp.timer(timedelta(minutes=1)) - .pipe(csp.count) - .pipe(csp.delay, timedelta(seconds=1)) - .pipe((csp.sample, 'x'), trigger=csp.timer(timedelta(minutes=2))) - .apply((math.sin, float)) - .apply(math.pow, 3) - .pipe(csp.firstN, 10) - .run(starttime=datetime(2000,1,1), endtime=datetime(2000,1,2))) - -``` - -# Other nodes - -## `exprtk` +## `csp.exprtk` ```python csp.exprtk( @@ -498,8 +362,8 @@ Args: - **`expression_str`**: an expression, as per the [C++ Mathematical Expression Library](http://www.partow.net/programming/exprtk/) (see [readme](http://www.partow.net/programming/exprtk/code/readme.txt) - **`inputs`**: a dict basket of timeseries. The keys will correspond to the variables in the expression. The timeseries can be of float or string -- **`state_vars`**: an optional dictionary of variables to be held in state between executions, and assignable within the expression.  Keys are the variable names and values are the starting values +- **`state_vars`**: an optional dictionary of variables to be held in state between executions, and assignable within the expression. Keys are the variable names and values are the starting values - **`trigger`**: an optional trigger for when to calculate. By default will calculate on any input tick - **`functions`**: an optional dictionary whose keys are function names that can be used in the expression, and whose values are of the form `(("arg1", ..), "function body")`, for example `{"foo": (("x","y"), "x\*y")}` -- **`constants`**: an optional dictionary of constants.  Keys are constant names and values are their values +- **`constants`**: an optional dictionary of constants. Keys are constant names and values are their values - **`output_ndarray`**: if `True`, output ndarray (1D) instead of float. Note that to output `ndarray`, the expression needs to use return like `return [a, b, c]`. The length of the array can vary between ticks. diff --git a/docs/wiki/api-references/Functional-Methods-API.md b/docs/wiki/api-references/Functional-Methods-API.md new file mode 100644 index 00000000..3f2447fa --- /dev/null +++ b/docs/wiki/api-references/Functional-Methods-API.md @@ -0,0 +1,64 @@ +Edges in CSP contain some methods to serve as syntactic sugar for stringing nodes together in a pipeline. This makes it easier to read/modify workflows and avoids the need for nested brackets. + +## Table of Contents + +- [Table of Contents](#table-of-contents) +- [`apply`](#apply) +- [`pipe`](#pipe) +- [`run`](#run) +- [Example of functional methods](#example-of-functional-methods) + +## `apply` + +```python +Edge.apply(self, func, *args, **kwargs) +``` + +Calls `csp.apply` on the edge with the provided python `func`. + +Args: + +- **`func`**: A scalar function that will be applied on each value of the Edge. If a different output type is returned, pass a tuple `(f, typ)`, where `typ` is the output type of f +- **`args`**: Positional arguments passed into `func` +- **`kwargs`**: Dictionary of keyword arguments passed into func + +## `pipe` + +```python +Edge.pipe(self, node, *args, **kwargs) +``` + +Calls the `node` on the edge. 
+
+Args:
+
+- **`node`**: A graph node that will be applied to the Edge, which is passed into node as the first argument.
+  Alternatively, a `(node, edge_keyword)` tuple where `edge_keyword` is a string indicating the keyword of node that expects the edge.
+- **`args`**: Positional arguments passed into `node`
+- **`kwargs`**: Dictionary of keyword arguments passed into `node`
+
+## `run`
+
+```python
+Edge.run(self, node, *args, **kwargs)
+```
+
+Alias for `csp.run(self, *args, **kwargs)`
+
+## Example of functional methods
+
+```python
+import csp
+from datetime import datetime, timedelta
+import math
+
+(csp.timer(timedelta(minutes=1))
+    .pipe(csp.count)
+    .pipe(csp.delay, timedelta(seconds=1))
+    .pipe((csp.sample, 'x'), trigger=csp.timer(timedelta(minutes=2)))
+    .apply((math.sin, float))
+    .apply(math.pow, 3)
+    .pipe(csp.firstN, 10)
+    .run(starttime=datetime(2000,1,1), endtime=datetime(2000,1,2)))
+```
diff --git a/docs/wiki/api-references/Input-Output-Adapters-API.md b/docs/wiki/api-references/Input-Output-Adapters-API.md
new file mode 100644
index 00000000..7dcd70d7
--- /dev/null
+++ b/docs/wiki/api-references/Input-Output-Adapters-API.md
@@ -0,0 +1,360 @@
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Kafka](#kafka)
+  - [API](#api)
+  - [MessageMapper](#messagemapper)
+  - [Subscribing and Publishing](#subscribing-and-publishing)
+  - [Known Issues](#known-issues)
+- [Parquet](#parquet)
+  - [ParquetReader](#parquetreader)
+    - [API](#api-1)
+    - [Subscription](#subscription)
+  - [ParquetWriter](#parquetwriter)
+    - [Construction](#construction)
+    - [Publishing](#publishing)
+- [DBReader](#dbreader)
+  - [TimeAccessor](#timeaccessor)
+- [Symphony](#symphony)
+- [Slack](#slack)
+
+## Kafka
+
+The Kafka adapter is a user adapter to stream data from a Kafka bus as a reactive time series. It leverages the [librdkafka](https://github.com/confluentinc/librdkafka) C/C++ library internally.
+
+The `KafkaAdapterManager` instance represents a single connection to a broker.
+A single connection can subscribe and/or publish to multiple topics.
+
+### API
+
+```python
+KafkaAdapterManager(
+    broker,
+    start_offset: typing.Union[KafkaStartOffset,timedelta,datetime] = None,
+    group_id: str = None,
+    group_id_prefix: str = '',
+    max_threads=100,
+    max_queue_size=1000000,
+    auth=False,
+    security_protocol='SASL_SSL',
+    sasl_kerberos_keytab='',
+    sasl_kerberos_principal='',
+    ssl_ca_location='',
+    sasl_kerberos_service_name='kafka',
+    rd_kafka_conf_options=None,
+    debug: bool = False,
+    poll_timeout: timedelta = timedelta(seconds=1)
+):
+```
+
+- **`broker`**: name of the Kafka broker, such as `protocol://host:port`
+
+- **`start_offset`**: signify where to start the stream playback from (defaults to `KafkaStartOffset.LATEST`).
+  Can be one of the `KafkaStartOffset` enum types or:
+
+  - `datetime`: to replay from the given absolute time
+  - `timedelta`: this will be taken as an absolute offset from starttime to playback from
+
+- **`group_id`**: if set, this adapter will behave as a consume-once consumer.
+  `start_offset` may not be set in this case since the adapter will always replay from the last consumed offset.
+
+- **`group_id_prefix`**: when not passing an explicit group_id, a prefix can be supplied that will be used to prefix the UUID generated for the group_id
+
+- **`max_threads`**: maximum number of threads to create for consumers.
+  The topics are round-robin'd onto threads to balance the load.
+  The adapter won't create more threads than topics.
+
+- **`max_queue_size`**: maximum size of the (internal to Kafka) message queue.
+  If the queue is full, messages can be dropped, so the default is very large.
+
+### MessageMapper
+
+In order to publish or subscribe, you need to define a MsgMapper.
+These are the supported message types:
+
+- **`JSONTextMessageMapper(datetime_type = DateTimeType.UNKNOWN)`**
+- **`ProtoMessageMapper(datetime_type = DateTimeType.UNKNOWN)`**
+
+You should choose the `DateTimeType` based on how you want (when publishing) or expect (when subscribing) your datetimes to be represented on the wire.
+The supported options are:
+
+- `UINT64_NANOS`
+- `UINT64_MICROS`
+- `UINT64_MILLIS`
+- `UINT64_SECONDS`
+
+The enum is defined in [csp/adapters/utils.py](https://github.com/Point72/csp/blob/main/csp/adapters/utils.py#L5).
+
+Note the `JSONTextMessageMapper` currently does not have support for lists.
+To subscribe to json data with lists, simply subscribe using the `RawTextMessageMapper` and process the text into json (e.g. via json.loads).
+
+### Subscribing and Publishing
+
+Once you have a `KafkaAdapterManager` object and a `MsgMapper` object, you can subscribe to topics using the following method:
+
+```python
+KafkaAdapterManager.subscribe(
+    ts_type: type,
+    msg_mapper: MsgMapper,
+    topic: str,
+    key=None,
+    field_map: typing.Union[dict,str] = None,
+    meta_field_map: dict = None,
+    push_mode: csp.PushMode = csp.PushMode.LAST_VALUE,
+    adjust_out_of_order_time: bool = False
+):
+```
+
+- **`ts_type`**: the timeseries type you want to get the data on. This can be a `csp.Struct` or basic timeseries type
+- **`msg_mapper`**: the `MsgMapper` object discussed above
+- **`topic`**: the topic to subscribe to
+- **`key`**: The key to subscribe to. If `None`, then this will subscribe to all messages on the topic. Note that in this "wildcard" mode, all messages will tick as "live", as replay in engine time cannot be supported
+- **`field_map`**: dictionary of `{message_field: struct_field}` to define how the subscribed message gets mapped onto the struct
+- **`meta_field_map`**: to extract meta information from the kafka message, provide a meta_field_map dictionary of meta field info → struct field name to place it into.
+  The following meta fields are currently supported:
+  - **`"partition"`**: which partition the message came from
+  - **`"offset"`**: the kafka offset of the given message
+  - **`"live"`**: whether this message is "live" and not being replayed
+  - **`"timestamp"`**: timestamp of the kafka message
+  - **`"key"`**: key of the message
+- **`push_mode`**: `csp.PushMode` (LAST_VALUE, NON_COLLAPSING, BURST)
+- **`adjust_out_of_order_time`**: in some cases it has been seen that kafka can produce out of order messages, even for the same key.
+  This allows the adapter to be more lax and allow such messages through by forcing time to max(time, prev time)
+
+Similarly, you can publish on topics using the following method:
+
+```python
+KafkaAdapterManager.publish(
+    msg_mapper: MsgMapper,
+    topic: str,
+    key: str,
+    x: ts['T'],
+    field_map: typing.Union[dict,str] = None
+):
+```
+
+- **`msg_mapper`**: same as above
+- **`topic`**: same as above
+- **`key`**: key to publish to
+- **`x`**: the timeseries to publish
+- **`field_map`**: dictionary of {struct_field: message_field} to define how the struct gets mapped onto the published message.
+  Note this dictionary is the opposite of the field_map in subscribe()
+
+### Known Issues
+
+If you are having issues, such as not getting any output or the application simply locking up, start by ensuring that you are logging the adapter's `status()` with a `csp.print`/`log` call and set `debug=True`.
+Then check the known issues below.
+
+- Reason: `GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (No Kerberos credentials available)`

+  - **Resolution**: Kafka uses Kerberos tickets for authentication. You need to set up a Kerberos token first

+- `Message received on unknown topic: errcode: Broker: Group authorization failed error: FindCoordinator response error: Group authorization failed.`

+  - **Resolution**: Kafka brokers running on Windows are case-sensitive to the Kerberos token. When creating a Kerberos token with kinit, make sure to use a principal name with a case-sensitive user id.

+- `authentication: SASL handshake failed (start (-4)): SASL(-4): no mechanism available: No worthy mechs found (after 0ms in state AUTH_REQ)`

+  - **Resolution**: cyrus-sasl-gssapi needs to be installed on the box for Kafka Kerberos authentication

+- `Message error on topic "an-example-topic". errcode: Broker: Topic authorization failed error: Subscribed topic not available: an-example-topic: Broker: Topic authorization failed)`

+  - **Resolution**: The user account does not have access to the topic
+
+## Parquet
+
+### ParquetReader
+
+The `ParquetReader` adapter is a generic user adapter to stream data from [Apache Parquet](https://parquet.apache.org/) files as a CSP time series.
+The `ParquetReader` adapter supports only flat (non-hierarchical) parquet files with all the primitive types that are supported by the CSP framework.
+
+#### API
+
+```python
+ParquetReader(
+    self,
+    filename_or_list,
+    symbol_column=None,
+    time_column=None,
+    tz=None
+):
+    """
+    :param filename_or_list: The specifier of the file/files to be read. Can be either:
+        - Instance of str, in which case it's interpreted as the path of a single file to be read
+        - A callable, in which case it's interpreted as a generator function that will be called like f(starttime, endtime) where starttime and endtime
+          are the start and end times of the current engine run. It's expected to generate a sequence of filenames to read.
+        - Iterable container, for example a list of files to read
+    :param symbol_column: An optional parameter that specifies the name of the symbol column in the file, if there is any
+    :param time_column: A mandatory specification of the time column name in the parquet files. This column will be used to inject the row values
+        from parquet at the given timestamps.
+    :param tz: The pytz timezone of the timestamp column, should only be provided if the time_column in the parquet file doesn't have tz info.
+"""
+```
+
+#### Subscription
+
+```python
+def subscribe(
+    self,
+    symbol,
+    typ,
+    field_map=None,
+    push_mode: csp.PushMode = csp.PushMode.NON_COLLAPSING
+):
+    """Subscribe to the rows corresponding to a given symbol
+    This form of subscription can be used only if a non-empty symbol_column was supplied during ParquetReader construction.
+    :param symbol: The symbol to subscribe to, for example 'AAPL'
+    :param typ: The type of the CSP time series subscription. Can either be a primitive type like int or alternatively a type
+        that inherits from csp.Struct, in which case each instance of the struct will be constructed from the matching file columns.
+    :param field_map: A map of the fields from parquet columns for the CSP time series. If typ is a primitive, then field_map should be
+        a string specifying the column name; if typ is a csp.Struct then field_map should be a str->str dictionary of the form
+        {column_name:struct_field_name}. For structs field_map can be omitted, in which case we expect a one to one match between the given Struct
+        fields and the parquet files columns.
+    :param push_mode: A push mode for the output adapter
+    """

+def subscribe_all(
+    self,
+    typ,
+    field_map=None,
+    push_mode: csp.PushMode = csp.PushMode.NON_COLLAPSING
+):
+    """Subscribe to all rows of the input files.
+    :param typ: The type of the CSP time series subscription. Can either be a primitive type like int or alternatively a type
+        that inherits from csp.Struct, in which case each instance of the struct will be constructed from the matching file columns.
+    :param field_map: A map of the fields from parquet columns for the CSP time series. If typ is a primitive, then field_map should be
+        a string specifying the column name; if typ is a csp.Struct then field_map should be a str->str dictionary of the form
+        {column_name:struct_field_name}. For structs field_map can be omitted, in which case we expect a one to one match between the given Struct
+        fields and the parquet files columns.
+    :param push_mode: A push mode for the output adapter
+    """
+```
+
+The Parquet reader provides two subscription methods.
+**`subscribe`** produces a time series of only the rows that correspond to the given symbol, while
+**`subscribe_all`** produces a time series of all rows in the parquet files.
+
+### ParquetWriter
+
+The `ParquetWriter` adapter is a generic user adapter to stream data from CSP time series to [Apache Parquet](https://parquet.apache.org/) files.
+The `ParquetWriter` adapter supports only flat (non-hierarchical) parquet files with all the primitive types that are supported by the CSP framework.
+Any time series of Struct objects will be flattened to multiple columns.
+
+#### Construction
+
+```python
+ParquetWriter(
+    self,
+    file_name: Optional[str],
+    timestamp_column_name,
+    config: Optional[ParquetOutputConfig] = None,
+    filename_provider: Optional[csp.ts[str]] = None
+):
+    """
+    :param file_name: The path of the output parquet file name. Must be provided if no filename_provider specified. If both file_name and filename_provider are specified then file_name will be used as the initial output file name until filename_provider provides a new file name.
+    :param timestamp_column_name: Required argument; if None is provided then no timestamp column will be written.
+    :param config: Optional configuration of how the file should be written (such as compression, block size, ...).
+    :param filename_provider: An optional time series that provides a time series of file paths. When a filename_provider time series provides a new file path, the previously open file will be closed and all subsequent data will be written to the new file provided by the path. This enables partitioning and splitting the data based on time.
+    """
+```
+
+#### Publishing
+
+```python
+def publish_struct(
+    self,
+    value: ts[csp.Struct],
+    field_map: Dict[str, str] = None
+):
+    """Publish a time series of csp.Struct objects to file
+
+    :param value: The time series of Struct objects that should be published.
+    :param field_map: An optional dict str->str of the form {struct_field_name:column_name} that maps the names of the
+        structure fields to the column names to which the values should be written. If the field_map is not None, then only
+        the fields that are specified in the field_map will be written to file. If field_map is not provided then all fields
+        of a structure will be written to columns that match exactly the field_name.
+    """

+def publish(
+    self,
+    column_name,
+    value: ts[object]
+):
+    """Publish a time series of primitive type to file
+    :param column_name: The name of the parquet file column to which the data should be written
+    :param value: The time series that should be published
+    """
+```
+
+The Parquet writer provides two publishing methods.
+**`publish_struct`** is used to publish time series of **`csp.Struct`** objects, while **`publish`** is used to publish primitive time series.
+The columns in the written parquet file are the union of all columns that were published (the order is preserved).
+A new row is written to the parquet file whenever any of the inputs ticks.
+For the given row, any column that corresponds to a time series that didn't tick will have null values.
+
+## DBReader
+
+The DBReader adapter is a generic user adapter to stream data from a database as a reactive time series.
+It leverages sqlalchemy internally in order to be able to access various DB backends.
+
+Please refer to the [SQLAlchemy Docs](https://docs.sqlalchemy.org/en/13/core/tutorial.html) for information on how to create sqlalchemy connections.
+
+The DBReader instance represents a single connection to a database.
+From a single reader you can subscribe to various streams: either the entire stream of data (which would basically represent the result of a single join) or, if a symbol column is declared, subscribe by symbol, which will then demultiplex rows to the right adapter.
+
+```python
+DBReader(self, connection, time_accessor, table_name=None, schema_name=None, query=None, symbol_column=None, constraint=None):
+    """
+    :param connection: sqlalchemy engine or (already connected) connection object.
+    :param time_accessor: TimeAccessor object
+    :param table_name: name of table in database as a string
+    :param query: either string query or sqlalchemy query object. Ex: "select * from users"
+    :param symbol_column: name of symbol column in table as a string
+    :param constraint: additional sqlalchemy constraints for query. Ex: constraint = db.text('PRICE>:price').bindparams(price = 100.0)
+    """
+```
+
+- **connection**: sqlalchemy engine or existing connection object.
+- **time_accessor**: see below
+- **table_name**: either table or query is required.
+  If passing a table_name then this table will be queried against for subscribe calls
+- **query**: (optional) if table isn't supplied, user can provide a direct query string or sqlalchemy query object.
+  This is useful if you want to run a join call.
+  For basic single-table queries passing table_name is preferred
+- **symbol_column**: (optional) in order to be able to demux rows by some column, pass `symbol_column`.
+  An example case for this is if the database has data stored for many symbols in a single table, and you want to have a timeseries tick per symbol.
+- **constraint**: (optional) additional sqlalchemy constraints for query. Ex: `constraint = db.text('PRICE>:price').bindparams(price=100.0)`
+
+### TimeAccessor
+
+All data fed into CSP must be time based.
+`TimeAccessor` is a helper class that defines how to extract timestamp information from the results of the data.
+Users can define their own `TimeAccessor` implementation or use pre-canned ones:
+
+- `TimestampAccessor(self, time_column, tz=None)`: use this if there exists a single datetime column already.
+  Provide the column name and optionally the timezone of the column (if it's timezone-less in the db)
+- `DateTimeAccessor(self, date_column, time_column, tz=None)`: use this if there are two separate columns for date and time; this accessor will combine the two columns to create a single datetime.
+  Optionally pass tz if the time column is timezone-less in the db
+
+User implementations would have to extend the `TimeAccessor` interface.
+In addition to defining how to convert db columns to timestamps, accessors are also used to augment the query to limit the data for the graph's start and end times.
+
+Once you have a DBReader object created, you can subscribe to time_series from it using the following methods:
+
+- `subscribe(self, symbol, typ, field_map=None)`
+- `subscribe_all(self, typ, field_map=None)`
+
+Both of these calls expect `typ` to be a `csp.Struct` type.
+`field_map` is a dictionary of `{ db_column : struct_column }` mappings that define how to map the database column names to the fields on the struct.
+
+`subscribe` is used to subscribe to a stream for the given symbol (symbol_column is required when creating DBReader)
+
+`subscribe_all` is used to retrieve all the data resulting from the request as a single timeseries.
+
+## Symphony
+
+The Symphony adapter allows for reading and writing of messages from the [Symphony](https://symphony.com/) message platform using [`requests`](https://requests.readthedocs.io/en/latest/) and the [Symphony SDK](https://docs.developers.symphony.com/).
+
+## Slack
+
+The Slack adapter allows for reading and writing of messages from the [Slack](https://slack.com) message platform using the [Slack Python SDK](https://slack.dev/python-slack-sdk/).
diff --git a/docs/wiki/2.-Math-Nodes-(csp.math).md b/docs/wiki/api-references/Math-and-Logic-Nodes-API.md
similarity index 85%
rename from docs/wiki/2.-Math-Nodes-(csp.math).md
rename to docs/wiki/api-references/Math-and-Logic-Nodes-API.md
index 4da20dda..0f0d7d78 100644
--- a/docs/wiki/2.-Math-Nodes-(csp.math).md
+++ b/docs/wiki/api-references/Math-and-Logic-Nodes-API.md
@@ -1,18 +1,18 @@
-# Math and Logic nodes
-
-In an effort not to bloat the wiki, the following boolean and mathematical operations are available which should be self explanatory.
+The following boolean and mathematical operations are available, which should be self-explanatory.
 
 Also note that there is syntactic sugar in place when wiring a graph.
 Edges have most operators overloaded includes `+`, `-`, `*`, `/`, `**`, `>`, `>=`, `<`, `<=`, `==`, `!=`, so you can have code like `csp.const(1) + csp.const(2)` work properly.
 Right hand side values will also automatically be upgraded to `csp.const()` if its detected that its not an edge, so something like `x = csp.const(1) + 2` will work as well.
 
-## Binary logical operators
+## Table of Contents
+
+1. Binary logical operators
 
 - **`csp.not_(ts[bool]) → ts[bool]`**
 - **`csp.and_(x: [ts[bool]]) → ts[bool]`**
 - **`csp.or_(x: [ts[bool]]) → ts[bool]`**
 
-## Binary mathematical operators
+2. Binary mathematical operators
 
 - **`csp.add(x: ts['T'], y: ts['T']) → ts['T']`**
 - **`csp.sub(x: ts['T'], y: ts['T']) → ts['T']`**
@@ -22,7 +22,7 @@ Right hand side values will also automatically be upgraded to `csp.const( ts[Fill]: + ... regular csp.graph code ...
+ + +@csp.graph +def main(): + # position as ts[int] + portfolio_position = get_portfolio_position() + + + all_orders = get_orders() + # demux fat-pipe of orders into a dynamic basket keyed by symbol + demuxed_orders = csp.dynamic_demultiplex(all_orders, all_orders.symbol) + + + result = csp.dynamic(demuxed_orders, my_sub_graph, + csp.snap(all_orders.symbol), # Grab scalar value of all_orders.symbol at time of instantiation + #csp.snapkey(), # Alternative way to grab the key that instantiated the sub-graph + csp.attach(), # extract the demuxed_orders[symbol] time series of the symbol being created in the sub_graph + portfolio_position, # pass in regular ts[] + 123) # pass in some scalar + + + # process result.fills which will be a csp.DynamicBasket of { symbol : ts[Fill] } +``` diff --git a/docs/wiki/8.-Profiler.md b/docs/wiki/api-references/csp.profiler-API.md similarity index 67% rename from docs/wiki/8.-Profiler.md rename to docs/wiki/api-references/csp.profiler-API.md index 3a485cce..e0e5c0c3 100644 --- a/docs/wiki/8.-Profiler.md +++ b/docs/wiki/api-references/csp.profiler-API.md @@ -1,6 +1,4 @@ -The `csp.profiler` library allows users to time cycle/node executions during a graph run. There are two available utilities. - -# Profiler: runtime profiling +## `csp.profiler()` Users can simply run graphs under a `Profiler()` context to extract profiling information. The code snippet below runs a graph in profile mode and extracts the profiling data by calling `results()`. @@ -44,64 +42,7 @@ ProfilerInfo additionally comes with some useful utilities. These are: - **`ProfilerInfo.max_exec_node(self)`** - Returns the node type which had the most total executions as a tuple: `(name, node_stat)` where node_stat is a dictionary with the same keys as `node_stats[elem]` -One can use these metrics to identify bottlenecks/inefficiencies in their graphs. - -## Profiling a real-time csp.graph - -The `csp.profiler` library provides a GUI for profiling real-time csp graphs. -One can access this GUI by adding a `http_port`  argument to their profiler call. - -```python -with profiler.Profiler(http_port=8888) as p: - results = csp.run(graph, starttime=st, endtime=et) # run the graph normally -``` - -This will open up the GUI on `localhost:8888` (as http_port=8888) which will display real-time node timing, cycle timing and memory snapshots. -Profiling stats will be calculated whenever you refresh the page or call a GET request. -Additionally, you can add the `format=json`argument (`localhost:8888?format=json`) to your request to receive the ProfilerInfo as a `JSON`  object rather than the `HTML` display. - -Users can add the `display_graphs=True` flag to include bar/pie charts of node execution times in the web UI. -The matplotlib package is required to use the flag. - -```python -with profiler.Profiler(http_port=8888, display_graphs=True) as p: - ... -``` - -new_profiler - -## Saving raw profiling data to a file - -Users can save individual node execution times and individual cycle execution times to a `.csv` file if they desire. -This is useful if you want to apply your own analysis e.g. calculate percentiles. -To do this, simply add the flags `node_file=` or `cycle_file=` - -```python -with profiler.Profiler(cycle_file="cycle_data.csv", node_file="node_data.csv") as p: - ... -``` - -After the graph is run, the file `node_data.csv`  contains: - -``` -Node Type,Execution Time -count,1.9814e-05 -cast_int_to_float,1.2791e-05 -_time_window_updates,4.759e-06 -... 
-```
-
-After the graph is run, the file `cycle_data.csv`  contains:
-
-```
-Execution Time
-9.4757e-05
-4.5205e-05
-2.2873e-05
-...
-```
-
-# graph_info: build-time information
+## `profiler.graph_info()`
 
 Users can also extract build-time information about the graph without running it by calling profiler.graph_info.
 The code snippet below shows how to call graph_info.
diff --git a/docs/wiki/concepts/Adapters.md b/docs/wiki/concepts/Adapters.md
new file mode 100644
index 00000000..fc4e5490
--- /dev/null
+++ b/docs/wiki/concepts/Adapters.md
@@ -0,0 +1,15 @@
+To get various data sources into and out of the graph, various Input and Output Adapters are available, such as CSV, Parquet, and database adapters (amongst others).
+Users can also write their own input and output adapters, as explained below.
+
+There are two types of Input Adapters: **Historical** (aka Simulated) adapters and **Realtime** Adapters.
+
+Historical adapters are used to feed historical timeseries data into the graph from some data source which has timeseries data.
+Realtime Adapters are used to feed in live event-based data in realtime, generally events created from external sources on separate threads.
+
+There is no distinction of Historical vs Realtime output adapters, since outputs need not care whether the timeseries data wired into them is generated from realtime or historical inputs.
+
+In CSP terminology, a single adapter corresponds to a single timeseries edge in the graph.
+There are common cases where a single data source may be used to provide data to multiple adapter (timeseries) instances; for example, a single CSV file with price data for many stocks can be read once but used to provide data to many individual adapters, one per stock.
+In such cases an AdapterManager is used to coordinate management of the single source (CSV file, database, Kafka connection, etc) and provide data to individual adapters.
+
+Note that adapters can be quickly written and prototyped in python, and if needed can be moved to a C++ implementation for more efficiency.
diff --git a/docs/wiki/concepts/CSP-Graph.md b/docs/wiki/concepts/CSP-Graph.md
new file mode 100644
index 00000000..c84b6542
--- /dev/null
+++ b/docs/wiki/concepts/CSP-Graph.md
@@ -0,0 +1,114 @@
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Anatomy of a `csp.graph`](#anatomy-of-a-cspgraph)
+- [Graph Propagation and Single-dispatch](#graph-propagation-and-single-dispatch)
+- [Graph Pruning](#graph-pruning)
+- [Collecting Graph Outputs](#collecting-graph-outputs)
+
+## Anatomy of a `csp.graph`
+
+To reiterate, `csp.graph` methods are called in order to construct the graph and are only executed before the engine is run.
+`csp.graph` methods don't do anything special; they are essentially regular python methods, but they can be defined to accept inputs and generate outputs similar to `csp.nodes`.
+This is solely used for type checking.
+`csp.graph` methods can be created to encapsulate components of a graph, and can be called from other `csp.graph` methods in order to help facilitate graph building.
+
+Simple example:
+
+```python
+@csp.graph
+def calc_symbol_pnl(symbol: str, trades: ts[Trade]) -> ts[float]:
+    # sub-graph code needed to compute pnl for given symbol and symbol's trades
+    # sub-graph can subscribe to market data for the symbol as needed
+    ...
+
+
+@csp.graph
+def calc_portfolio_pnl(symbols: [str]) -> ts[float]:
+    symbol_pnl = []
+    for symbol in symbols:
+        symbol_trades = trade_adapter.subscribe(symbol)
+        symbol_pnl.append(calc_symbol_pnl(symbol, symbol_trades))
+
+    return csp.sum(symbol_pnl)
+```
+
+In this simple example we have a `csp.graph` component `calc_symbol_pnl` which encapsulates computing pnl for a single symbol.
+`calc_portfolio_pnl` is a graph that computes portfolio-level pnl; it invokes the symbol-level pnl calc for every symbol, then sums up the results for the portfolio-level pnl.
+
+## Graph Propagation and Single-dispatch
+
+The CSP graph propagation algorithm ensures that all nodes are executed *once* per engine cycle, and in the correct order.
+Correct order means that all input dependencies of a given node are guaranteed to have been evaluated before the node itself is executed.
+Take this graph for example:
+
+![359407953](https://github.com/Point72/csp/assets/3105306/d9416353-6755-4e37-8467-01da516499cf)
+
+On a given cycle let's say the `bid` input ticks.
+The CSP engine will ensure that **`mid`** is executed, followed by **`spread`**, and only once **`spread`**'s output is updated will **`quote`** be called.
+When **`quote`** executes it will have the latest values of the `mid` and `spread` calc for this cycle.
+
+## Graph Pruning
+
+One should note a subtle optimization technique in CSP graphs.
+Any part of a graph that is created at graph building time, but is NOT connected to any output nodes, will be pruned from the graph and will not exist during runtime.
+An output is defined as either an output adapter or a `csp.node` without any outputs of its own.
+The idea here is that we can avoid doing work if it doesn't result in any output being generated.
+In general it's best practice for all `csp.nodes` to be **side-effect free**; in other words, they shouldn't mutate any state outside of the node.
+Assuming all nodes are side-effect free, pruning the graph would not have any noticeable effects.
+
+## Collecting Graph Outputs
+
+If the `csp.graph` passed to `csp.run` has outputs, the full timeseries will be returned from `csp.run` like so:
+
+**outputs example**
+
+```python
+import csp
+from datetime import datetime, timedelta
+
+@csp.graph
+def my_graph() -> ts[int]:
+    return csp.merge(csp.const(1), csp.const(2, timedelta(seconds=1)))
+
+if __name__ == '__main__':
+    res = csp.run(my_graph, starttime=datetime(2021,11,8))
+    print(res)
+```
+
+result:
+
+```raw
+{0: [(datetime.datetime(2021, 11, 8, 0, 0), 1), (datetime.datetime(2021, 11, 8, 0, 0, 1), 2)]}
+```
+
+Note that the result is a list of `(datetime, value)` tuples.
+
+You can also use `csp.add_graph_output` to add outputs.
+These do not need to be in the top-level graph called directly from `csp.run`.
+
+This gives the same data, keyed by the output name (`'a'`) instead of `0`:
+
+**add_graph_output example**
+
+```python
+@csp.graph
+def my_graph():
+    csp.add_graph_output('a', csp.merge(csp.const(1), csp.const(2, timedelta(seconds=1))))
+```
+
+In addition to python outputs like above, you can set the optional `csp.run` argument `output_numpy` to `True` to get outputs as numpy arrays:
+
+**numpy outputs**
+
+```python
+result = csp.run(my_graph, starttime=datetime(2021,11,8), output_numpy=True)
+```
+
+result:
+
+```raw
+{0: (array(['2021-11-08T00:00:00.000000000', '2021-11-08T00:00:01.000000000'], dtype='datetime64[ns]'), array([1, 2], dtype=int64))}
+```
+
+Note that the result there is a tuple per output, containing two numpy arrays, one with the datetimes and one with the values.
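+
+Putting the pieces above together, here is a small end-to-end sketch of collecting multiple named outputs (the graph and output names here are illustrative, not from the original docs).
+Note that `data * 2` relies on the operator overloading described in the math/logic API reference: the scalar `2` is upgraded to a `csp.const` automatically.
+
+```python
+import csp
+from datetime import datetime, timedelta
+
+
+@csp.graph
+def my_graph():
+    data = csp.merge(csp.const(1), csp.const(2, timedelta(seconds=1)))
+    # register two named outputs; neither needs to be returned from the graph
+    csp.add_graph_output('raw', data)
+    csp.add_graph_output('doubled', data * 2)
+
+
+if __name__ == '__main__':
+    res = csp.run(my_graph, starttime=datetime(2021, 11, 8))
+    print(res['raw'])      # [(datetime(...), 1), (datetime(...), 2)]
+    print(res['doubled'])  # [(datetime(...), 2), (datetime(...), 4)]
+```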
diff --git a/docs/wiki/concepts/CSP-Node.md b/docs/wiki/concepts/CSP-Node.md
new file mode 100644
index 00000000..229bfdc3
--- /dev/null
+++ b/docs/wiki/concepts/CSP-Node.md
@@ -0,0 +1,271 @@
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Anatomy of a `csp.node`](#anatomy-of-a-cspnode)
+- [Basket inputs](#basket-inputs)
+- [Node Outputs](#node-outputs)
+- [Basket Outputs](#basket-outputs)
+- [Generic Types](#generic-types)
+
+## Anatomy of a `csp.node`
+
+At the heart of a calculation graph are the `csp.nodes` that run the computations.
+`csp.node` methods can take any number of scalar and timeseries arguments, and can return 0 → N timeseries outputs.
+Timeseries inputs/outputs should be thought of as the edges that connect components of the graph.
+These "edges" can tick whenever they have a new value.
+Every tick is associated with a value and the time of the tick.
+`csp.nodes` can have various other features; here is an example of a `csp.node` that demonstrates many of them.
+Keep in mind that nodes will execute repeatedly as inputs tick with new data.
+They may (or may not) generate an output as a result of an input tick.
+
+```python
+from datetime import timedelta
+
+@csp.node                                                              # 1
+def demo_node(n: int, xs: ts[float], ys: ts[float]) -> ts[float]:     # 2
+    with csp.alarms():                                                 # 3
+        # Define an alarm time-series of type bool                    # 4
+        alarm = csp.alarm(bool)                                        # 5
+                                                                       # 6
+    with csp.state():                                                  # 7
+        # Create a state variable bound to the node                   # 8
+        s_sum = 0.0                                                    # 9
+                                                                       # 10
+    with csp.start():                                                  # 11
+        # Code block that executes once on start of the engine        # 12
+        # one can set timeseries properties here as well, such as     # 13
+        # csp.set_buffering_policy(xs, tick_count=5)                  # 14
+        # csp.set_buffering_policy(xs, tick_history=timedelta(minutes=1))  # 15
+        # csp.make_passive(xs)                                        # 16
+        csp.schedule_alarm(alarm, timedelta(seconds=1), True)         # 17
+                                                                       # 18
+    with csp.stop():                                                   # 19
+        pass  # code block to execute when the engine is done         # 20
+                                                                       # 21
+    if csp.ticked(xs, ys) and csp.valid(xs, ys):                       # 22
+        s_sum += xs * ys                                               # 23
+                                                                       # 24
+    if csp.ticked(alarm):                                              # 25
+        csp.schedule_alarm(alarm, timedelta(seconds=1), True)          # 26
+        return s_sum                                                   # 27
+```
+
+Let's review it line by line:
+
+1\) Every CSP node must start with the **`@csp.node`** decorator
+
+2\) CSP nodes are fully typed and type-checking is strictly enforced.
+All arguments must be typed, as well as all outputs.
+Outputs are typed using function annotation syntax.
+
+Single outputs can be unnamed; for multiple outputs they must be named.
+When using multiple outputs, annotate the type using **`def my_node(inputs) → csp.Outputs(name1=ts['T'], name2=ts['V'])`** where `T` and `V` are the respective types of `name1` and `name2`.
+
+Note the syntax of timeseries inputs: they are denoted by **`ts[type]`**.
+Scalars can be passed in as regular types; in this example we pass in `n`, which expects a type of `int`.
+
+3\) **`with csp.alarms()`**: nodes can (optionally) declare internal alarms; every instance of the node will get its own alarm that can be scheduled and act just like a timeseries input.
+All alarms must be declared within the alarms context.
+
+5\) Instantiate an alarm in the alarms context using the `csp.alarm(typ)` function. This creates an alarm which is a time-series of type `typ`.
+
+7\) **`with csp.state()`**: optional state variables can be defined under the state context.
+Note that variables declared in state will live across invocations of the method.
+
+9\) An example declaration and initialization of state variable `s_sum`.
+It is good practice to name state variables prefixed with `s_`, which is the convention in the CSP codebase.
+
+11\) **`with csp.start()`**: an optional block to execute code at the start of the engine.
+Generally this is used to set up initial timers or set input timeseries properties such as buffer sizes, or to make inputs passive.
+
+14-15) **`csp.set_buffering_policy`**: nodes can request a certain amount of history be kept on the incoming time series; this can be denoted in number of ticks or in time.
+By setting a buffering policy, nodes can access historical values of the timeseries (by default only the last value is kept).
+
+16\) **`csp.make_passive`** / **`csp.make_active`**: Nodes may not need to react to all of their inputs; they may just need their latest value.
+For performance purposes the node can mark an input as passive to avoid triggering the node unnecessarily.
+`make_active` can be called to reactivate an input.
+
+17\) **`csp.schedule_alarm`**: schedules a one-shot tick on the given alarm input.
+The values given are the timedelta before the alarm triggers and the value it will have when it triggers.
+Note that `schedule_alarm` can be called multiple times on the same alarm to schedule multiple triggers.
+
+19\) **`with csp.stop()`** is an optional block that executes when the engine is done running.
+
+22\) All nodes will have if conditions to react to different inputs.
+**`csp.ticked()`** takes any number of inputs and returns true if **any** of the inputs ticked.
+**`csp.valid()`** similarly takes any number of inputs; however, it only returns true if **all** inputs are valid.
+Valid means that an input has had at least one tick and so it has a "current value".
+
+23\) One of the benefits of CSP is that you always have easy access to the latest value of all inputs.
+`xs` and `ys` on lines 22-23 will always have the latest value of both inputs, even if only one of them just ticked.
+
+25\) This demonstrates how an alarm can be treated like any other input.
+
+27\) We tick our running "sum" as an output here every second.
+
+## Basket inputs
+
+In addition to single time-series inputs, a node can also accept a **basket** of time series as an argument.
+A basket is essentially a collection of timeseries which can be passed in as a single argument.
+Baskets can either be list baskets or dict baskets.
+Individual timeseries in a basket can tick independently, and they can be looked at and reacted to individually or as a collection.
+
+For example:
+
+```python
+@csp.node                                       # 1
+def demo_basket_node(                           # 2
+    list_basket: [ts[int]],                     # 3
+    dict_basket: {str: ts[int]}                 # 4
+) -> ts[float]:                                 # 5
+                                                # 6
+    if csp.ticked(list_basket):                 # 7
+        return sum(list_basket.validvalues())   # 8
+                                                # 9
+    if csp.ticked(list_basket[3]):              # 10
+        return list_basket[3]                   # 11
+                                                # 12
+    if csp.ticked(dict_basket):                 # 13
+        # can iterate over ticked key,items     # 14
+        # for k,v in dict_basket.tickeditems(): # 15
+        #     ...                               # 16
+        return sum(dict_basket.tickedvalues())  # 17
+```
+
+3\) Note the syntax of basket inputs.
+List baskets are noted as `[ts[type]]` (a list of time series) and dict baskets are `{key_type: ts[ts_type]}` (a dictionary of timeseries keyed by type `key_type`). It is also possible to use the `List[ts[int]]` and `Dict[str, ts[int]]` typing notation.
+
+7\) Just like single timeseries, we can react to a basket if it ticked.
+The convention is the same as passing multiple inputs to `csp.ticked`: it is true if **any** basket input ticked.
+`csp.valid` is true if **all** basket inputs are valid.
+
+8\) Baskets have various iterators to access their inputs:
+
+- **`tickedvalues`**: iterator of values of all ticked inputs
+- **`tickedkeys`**: iterator of keys of all ticked inputs (keys are list index for list baskets)
+- **`tickeditems`**: iterator of (key,value) tuples of ticked inputs
+- **`validvalues`**: iterator of values of all valid inputs
+- **`validkeys`**: iterator of keys of all valid inputs
+- **`validitems`**: iterator of (key,value) tuples of valid inputs
+- **`keys`**: list of keys on the basket (**dictionary baskets only**)
+
+10-11) This demonstrates the ability to access an individual element of a
+basket and react to it as well as access its current value.
+
+## Node Outputs
+
+Nodes can return any number of outputs (including no outputs, in which case it is considered an "output" or sink node,
+see [Graph Pruning](CSP-Graph#graph-pruning)).
+Nodes with single outputs can return the output as an unnamed output.
+Nodes returning multiple outputs must name them.
+When a node is called at graph building time, if it has a single unnamed output, the return value is an edge representing the output which can be passed into other nodes.
+An output timeseries cannot be ticked more than once in a given node invocation.
+If the outputs are named, the return value is an object with the outputs available as attributes.
+For example (the examples below demonstrate various ways to output the data as well):
+
+```python
+@csp.node
+def single_unnamed_outputs(n: ts[int]) -> ts[int]:
+    # can either do
+    return n
+    # or
+    # csp.output(n) to continue processing after the output
+
+
+@csp.node
+def multiple_named_outputs(n: ts[int]) -> csp.Outputs(y=ts[int], z=ts[float]):
+    # can do
+    # csp.output(y=n, z=n+1.) to output to multiple outputs
+    # or separate the outputs to tick out at separate points:
+    # csp.output(y=n)
+    # ...
+    # csp.output(z=n+1.)
+    # or can return multiple values with:
+    return csp.output(y=n, z=n+1.)
+
+@csp.graph
+def my_graph(n: ts[int]):
+    x = single_unnamed_outputs(n)
+    # x represents the output edge of single_unnamed_outputs,
+    # we can pass it as a time series input to other nodes
+    csp.print('x', x)
+
+
+    result = multiple_named_outputs(n)
+    # result holds all the outputs of multiple_named_outputs, which can be accessed as attributes
+    csp.print('y', result.y)
+    csp.print('z', result.z)
+```
+
+## Basket Outputs
+
+Similarly to inputs, a node can also produce a basket of timeseries as an output.
+For example:
+
+```python
+class MyStruct(csp.Struct):                 # 1
+    symbol: str                             # 2
+    index: int                              # 3
+    value: float                            # 4
+                                            # 5
+@csp.node                                   # 6
+def demo_basket_output_node(                # 7
+    in_: ts[MyStruct],                      # 8
+    symbols: [str],                         # 9
+    num_symbols: int                        # 10
+) -> csp.Outputs(                           # 11
+    dict_basket=csp.OutputBasket({str: ts[float]}, shape="symbols"),   # 15
+    list_basket=csp.OutputBasket([ts[float]], shape="num_symbols"),    # 16
+):                                          # 17
+                                            # 18
+    if csp.ticked(in_):                     # 19
+        # output to dict basket             # 20
+        csp.output(dict_basket[in_.symbol], in_.value)    # 21
+        # alternate output syntax, can output multiple keys at once    # 22
+        # csp.output(dict_basket={in_.symbol: in_.value}) # 23
+        # output to list basket             # 24
+        csp.output(list_basket[in_.index], in_.value)     # 25
+        # alternate output syntax, can output multiple keys at once    # 26
+        # csp.output(list_basket={in_.index: in_.value})  # 27
+```
+
+11-17) Note the output declaration syntax.
+A basket output can be either named or unnamed (both examples here are named), and its shape can be specified two ways.
+The `shape` parameter is used with a scalar value that defines the shape of the basket, or the name of the scalar argument (a dict basket expects `shape` to be a list of keys; a list basket expects `shape` to be an `int`).
+`shape_of` is used to take the shape of an input basket and apply it to the output basket.
+
+20+) There are several choices for output syntax.
+The following work for both list and dict baskets:
+
+- `csp.output(basket={key: value, key2: value2, ...})`
+- `csp.output(basket[key], value)`
+- `csp.output({key: value})  # only works if the basket is the only output`
+
+## Generic Types
+
+CSP supports syntax for generic types as well.
+To denote a generic type we use a string, typically `'T'`.
+When a node is called the type of the argument will get bound to the given type variable, and further inputs / outputs will be checked and bound to said typevar.
+Note that the string syntax `'~T'` denotes the argument expects the *value* of a type, rather than a type itself:
+
+```python
+@csp.node
+def sample(trigger: ts[object], x: ts['T']) -> ts['T']:
+    '''will return current value of x on trigger ticks'''
+    with csp.start():
+        csp.make_passive(x)
+
+    if csp.ticked(trigger) and csp.valid(x):
+        return x
+
+
+@csp.node
+def const(value: '~T') -> ts['T']:
+    ...
+```
+
+`sample` takes a timeseries of type `'T'` as an input, and returns a timeseries of type `'T'`.
+This allows us to pass in a `ts[int]` for example, and get a `ts[int]` as an output, or `ts[bool]` → `ts[bool]`.
+
+`const` takes `value` as an *instance* of type `T`, and returns a timeseries of type `T`.
+So we can call `const(5)` and get a `ts[int]` output, or `const('hello!')` and get a `ts[str]` output, etc...
diff --git a/docs/wiki/concepts/Execution-Modes.md b/docs/wiki/concepts/Execution-Modes.md
new file mode 100644
index 00000000..46902a82
--- /dev/null
+++ b/docs/wiki/concepts/Execution-Modes.md
@@ -0,0 +1,243 @@
+The CSP engine can be run in two flavors: realtime and simulation.
+
+In simulation mode, the engine is always run at full speed, pulling in time-based data from its input adapters and running it through the graph.
+All inputs in simulation are driven off the provided timestamped data of its inputs.
+
+In realtime mode, the engine runs in wallclock time as of "now".
+Realtime engines can get data from realtime adapters which source data on separate threads and pass them through to the engine (i.e. think of ActiveMQ events happening on an ActiveMQ thread and being passed along to the engine in "realtime").
+
+Since engines can run in both simulated and realtime mode, users should **always** use **`csp.now()`** to get the current time in `csp.node`s.
+
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Simulation Mode](#simulation-mode)
+- [Realtime Mode](#realtime-mode)
+- [csp.PushMode](#csppushmode)
+- [Realtime Group Event Synchronization](#realtime-group-event-synchronization)
+
+## Simulation Mode
+
+Simulation mode is the default mode of the engine.
+As stated above, simulation mode is used when you want your engine to crunch through historical data as fast as possible.
+In simulation mode, the engine runs on some historical data that is fed in through various adapters.
+The adapters provide events by time, and they are streamed into the engine via the adapter timeseries in time order.
+`csp.timer` and `csp.node` alarms are scheduled and executed in "historical time" as well.
+Note that there is no strict requirement for simulated runs to run on historical dates.
+As long as the engine is not in realtime mode, it remains in simulation mode until the provided endtime, even if endtime is in the future.
+
+## Realtime Mode
+
+Realtime mode is opted into by passing `realtime=True` to `csp.run(...)`.
+When run in realtime mode, the engine will run in simulation mode from the provided starttime → wallclock "now" as of the time of calling run.
+Once the simulation run is done, the engine switches into realtime mode.
+Under realtime mode, external realtime adapters will be able to send data into the engine thread.
+All time-based inputs such as `csp.timer` and alarms will switch to executing in wallclock time as well.
+
+As always, `csp.now()` should still be used in `csp.node` code, even when running in realtime mode.
+`csp.now()` will be the time assigned to the current engine cycle.
+
+## csp.PushMode
+
+When consuming data from input adapters there are three choices on how one can consume the data:
+
+| PushMode | EngineMode | Description |
+| :------- | :--------- | :---------- |
+| **LAST_VALUE** | Simulation | all ticks from input source with duplicate timestamps (on the same timeseries) will tick once with the last value on a given timestamp |
+| | Realtime | all ticks that occurred since previous engine cycle will collapse / conflate to the latest value |
+| **NON_COLLAPSING** | Simulation | all ticks from input source with duplicate timestamps (on the same timeseries) will tick once per engine cycle. Subsequent cycles will execute with the same time |
+| | Realtime | all ticks that occurred since previous engine cycle will be ticked across subsequent engine cycles as fast as possible |
+| **BURST** | Simulation | all ticks from input source with duplicate timestamps (on the same timeseries) will tick once with a list of all values |
+| | Realtime | all ticks that occurred since previous engine cycle will tick once with a list of all the values |
+
+## Realtime Group Event Synchronization
+
+The CSP framework supports properly synchronizing events across multiple timeseries that are sourced from the same realtime adapter.
+A classical example of this is a market data feed.
+Say you consume bid, ask and trade as 3 separate time series for the same product / exchange.
+Since the data flows in asynchronously from a separate thread, bid, ask and trade events could end up executing in the engine at arbitrary slices of time, leading to crossed books and trades that are out of range of the bid/ask.
+The engine can properly provide a correct synchronous view of all the inputs, regardless of their PushModes.
+It's up to adapter implementations to determine which inputs are part of a synchronous "PushGroup".
+
+Here's a classical example.
+An application wants to consume conflating bid/ask as LAST_VALUE, but it doesn't want to conflate trades, so they are consumed as NON_COLLAPSING.
+
+Let's say we have this sequence of events on the actual market data feed's thread, coming in on the wire in this order.
+The columns denote the time the callbacks come in off the market data thread.
+
+| Event | T | T+1 | T+2 | T+3 | T+4 | T+5 | T+6 |
+| :---- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+| **BID** | 100.00 | 100.01 | | 99.97 | 99.98 | 99.99 | |
+| **ASK** | 100.02 | | 100.03 | | | | 100.00 |
+| **TRADE** | | | 100.02 | | | | 100.03 |
+
+Without any synchronization you can end up with nonsensical views based on random timing.
+Here's one such possibility (bid/ask are still LAST_VALUE, trade is NON_COLLAPSING).
+
+Here, ET is engine time.
+Let's assume the engine had a huge delay and hasn't processed any data submitted above yet.
+Without any synchronization, bid/ask would completely conflate, and trade would unroll over multiple engine cycles:
+
+| Event | ET | ET+1 |
+| :---- | :--- | :--- |
+| **BID** | 99.99 | |
+| **ASK** | 100.00 | |
+| **TRADE** | 100.02 | 100.03 |
+
+However, since market data adapters will group bid/ask/trade inputs together, the engine won't let bid/ask events advance ahead of trade events since trade is NON_COLLAPSING.
+NON_COLLAPSING inputs will essentially act as a barrier, not allowing events ahead of the barrier to tick before the barrier is complete.
+Let's assume again that the engine had a huge delay and hasn't processed any data submitted above.
+With proper barrier synchronization the engine cycles would look like this under the same conditions:
+
+| Event | ET | ET+1 | ET+2 |
+| :---- | :--- | :--- | :--- |
+| **BID** | 100.01 | 99.99 | |
+| **ASK** | 100.03 | | 100.00 |
+| **TRADE** | 100.02 | 100.03 | |
+
+Note how the last ask tick of 100.00 got held up to a separate cycle (ET+2) so that trade could tick with the correct view of bid/ask at the time of the second trade (ET+1).
+
+As another example, let's say the engine got delayed briefly at wire time T, so it was able to process T+1 data.
+Similarly, it got briefly delayed at time T+4 until after T+6. The engine would be able to process all data at time T+1, T+2, T+3 and T+6, leading to this sequence of engine cycles.
+The equivalent "wire time" is denoted in parentheses:
+
+| Event | ET (T+1) | ET+1 (T+2) | ET+2 (T+3) | ET+3 (T+5) | ET+4 (T+6) |
+| :---- | :--- | :--- | :--- | :--- | :--- |
+| **BID** | 100.01 | | 99.97 | 99.99 | |
+| **ASK** | 100.02 | 100.03 | | | 100.00 |
+| **TRADE** | | 100.02 | | | 100.03 |
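+
+Finally, to tie the two modes together, below is a minimal, self-contained sketch of a realtime run (the graph itself is hypothetical; `csp.timer`, `csp.print` and `csp.run` are standard csp calls).
+It ticks a timer once per second for five seconds of wallclock time; dropping `realtime=True` would run the identical wiring in simulation mode instead:
+
+```python
+import csp
+from datetime import datetime, timedelta
+
+
+@csp.graph
+def heartbeat():
+    # a timer input that ticks the value True once per second, in engine time
+    tick = csp.timer(timedelta(seconds=1), True)
+    csp.print('tick', tick)
+
+
+if __name__ == '__main__':
+    # realtime=True: simulate from starttime up to "now", then run in wallclock
+    # time; an endtime given as a timedelta is an offset from the start time
+    csp.run(heartbeat, starttime=datetime.utcnow(), endtime=timedelta(seconds=5), realtime=True)
+```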
diff --git a/docs/wiki/concepts/Historical-Buffers.md b/docs/wiki/concepts/Historical-Buffers.md
new file mode 100644
index 00000000..8eafc3c2
--- /dev/null
+++ b/docs/wiki/concepts/Historical-Buffers.md
@@ -0,0 +1,133 @@
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Historical Buffers](#historical-buffers)
+- [Historical Range Access](#historical-range-access)
+
+## Historical Buffers
+
+CSP can provide access to historical input data as well.
+By default only the last value of an input is kept in memory; however, one can request history to be kept on an input either by number of ticks or by time using **`csp.set_buffering_policy`**.
+
+The methods **`csp.value_at`**, **`csp.time_at`** and **`csp.item_at`** can be used to retrieve historical input values.
+Each node should call **`csp.set_buffering_policy`** to make sure that its inputs are configured to store sufficiently long history for correct implementation.
+For example, let's assume that we have a stream of data and we want to create equally sized buckets from the data.
+A possible implementation of such a node would be:
+
+```python
+@csp.node
+def data_bin_generator(bin_size: int, input: ts['T']) -> ts[['T']]:
+    with csp.start():
+        assert bin_size > 0
+        # This makes sure that input stores at least bin_size entries
+        csp.set_buffering_policy(input, tick_count=bin_size)
+    if csp.ticked(input) and (csp.num_ticks(input) % bin_size == 0):
+        return [csp.value_at(input, -i) for i in range(bin_size)]
+```
+
+In this example, we use **`csp.set_buffering_policy(input, tick_count=bin_size)`** to ensure that the buffer history contains at least **`bin_size`** elements.
+Note that an input can be shared by multiple nodes; if multiple nodes provide size requirements, the buffer size would be resolved to the maximum size to support all requests.
+
+Alternatively, **`csp.set_buffering_policy`** supports a **`timedelta`** parameter **`tick_history`** instead of **`tick_count`**.
+If **`tick_history`** is provided, the buffer will scale dynamically to ensure that any period of length **`tick_history`** will fit into the history buffer.
+
+To identify when there are enough samples to construct a bin we use **`csp.num_ticks(input) % bin_size == 0`**.
+The function **`csp.num_ticks`** returns the total number of ticks for a given time series.
+NOTE: The actual size of the history buffer is usually less than **`csp.num_ticks`** as the buffer is dynamically truncated to satisfy the set policy.
+
+The past values in this example are accessed using **`csp.value_at`**.
+The various historical access methods take the same arguments and return the value, time and tuple of `(time,value)` respectively:
+
+- **`csp.value_at`**`(ts, index_or_time, duplicate_policy=DuplicatePolicy.LAST_VALUE, default=UNSET)`: returns **value** of the timeseries at requested `index_or_time`
+- **`csp.time_at`**`(ts, index_or_time, duplicate_policy=DuplicatePolicy.LAST_VALUE, default=UNSET)`: returns **datetime** of the timeseries at requested `index_or_time`
+- **`csp.item_at`**`(ts, index_or_time, duplicate_policy=DuplicatePolicy.LAST_VALUE, default=UNSET)`: returns tuple of `(datetime,value)` of the timeseries at requested `index_or_time`
+  - **`ts`**: the name of the input
+  - **`index_or_time`**:
+    - If providing an **index**, this represents how many ticks back to retrieve **and should be \<= 0**.
+      0 indicates the current value, -1 is the previous value, etc.
+    - If providing **time** one can either provide a datetime for absolute time, or a timedelta for how far back to access.
+      **NOTE** that timedelta must be negative to represent time in the past.
+  - **`duplicate_policy`**: when requesting history by datetime or timedelta, it's possible that there could be multiple values that match the given time.
+    **`duplicate_policy`** can be provided to control the behavior of what to return in this case.
+    The default policy is to return the LAST_VALUE that exists at the given time.
+  - **`default`**: value to be returned if the requested time is out of the history bounds (if default is not provided and a request is out of bounds an exception will be raised).
+
+The following demonstrates a possible way to compute a rolling sum for the past N ticks. Please note that this is for demonstration purposes only and is not efficient. A more efficient
+vectorized version is shown below, though even that would not be recommended for a rolling sum, since `csp.stats.sum` computes its result in-line with a C++ implementation and would be more efficient still.
+
+```python
+@csp.node
+def rolling_sum(x:ts[float], tick_count: int) -> ts[float]:
+    with csp.start():
+        csp.set_buffering_policy(x, tick_count=tick_count)
+
+    if csp.ticked(x):
+        return sum(csp.value_at(x, -i) for i in range(min(csp.num_ticks(x), tick_count)))
+```
+
+## Historical Range Access
+
+In similar fashion, the methods **`csp.values_at`**, **`csp.times_at`** and **`csp.items_at`** can be used to retrieve a range of historical input values as numpy arrays.
+The `rolling_sum` example above can be accomplished more efficiently with range access:
+
+```python
+@csp.node
+def rolling_sum(x:ts[float], tick_count: int) -> ts[float]:
+    with csp.start():
+        csp.set_buffering_policy(x, tick_count=tick_count)
+
+    if csp.ticked(x):
+        return csp.values_at(x).sum()
+```
+
+The past values in this example are accessed using **`csp.values_at`**.
+The various historical access methods take the same arguments and return the value, time and tuple of `(times,values)` respectively:
+
+- **`csp.values_at`**`(ts, start_index_or_time, end_index_or_time, start_index_policy=TimeIndexPolicy.INCLUSIVE, end_index_policy=TimeIndexPolicy.INCLUSIVE)`:
+  returns values in specified range as a numpy array
+- **`csp.times_at`**`(ts, start_index_or_time, end_index_or_time, start_index_policy=TimeIndexPolicy.INCLUSIVE, end_index_policy=TimeIndexPolicy.INCLUSIVE)`:
+  returns times in specified range as a numpy array
+- **`csp.items_at`**`(ts, start_index_or_time, end_index_or_time, start_index_policy=TimeIndexPolicy.INCLUSIVE, end_index_policy=TimeIndexPolicy.INCLUSIVE)`:
+  returns a tuple of (times, values) numpy arrays
+  - **`ts`**: the name of the input
+  - **`start_index_or_time`**:
+    - If providing an **index**, this represents how many ticks back to retrieve **and should be \<= 0**.
+      0 indicates the current value, -1 is the previous value, etc.
+    - If providing **time** one can either provide a datetime for absolute time, or a timedelta for how far back to access.
+      **NOTE that timedelta must be negative** to represent time in the past.
+    - If **None** is provided, the range will begin "from the beginning" - i.e., the oldest tick in the buffer.
+  - **`end_index_or_time`**: same as `start_index_or_time`
+    - If **None** is provided, the range will go "until the end" - i.e., the newest tick in the buffer.
+  - **`start_index_policy`**: only for use with datetime/timedelta as the start and end parameters.
+ - **\`TimeIndexPolicy.INCLUSIVE**: if there is a tick exactly at the requested time, include it + - **TimeIndexPolicy.EXCLUSIVE**: if there is a tick exactly at the requested time, exclude it + - **TimeIndexPolicy.EXTRAPOLATE**: if there is a tick at the beginning timestamp, include it. + Otherwise, if there is a tick before the beginning timestamp, force a tick at the beginning timestamp with the prevailing value at the time. + - **end_index_policy** only for use with datetime/timedelta and the start and end parameters. + - **TimeIndexPolicy.INCLUSIVE**: if there is a tick exactly at the requested time, include it + - **TimeIndexPolicy.EXCLUSIVE**: if there is a tick exactly at the requested time, exclude it + - **TimeIndexPolicy.EXTRAPOLATE**: if there is a tick at the end timestamp, include it. + Otherwise, if there is a tick before the end timestamp, force a tick at the end timestamp with the prevailing value at the time + +Range access is optimized at the C++ layer and for this reason its far more efficient than calling the single value access methods in a loop, and they should be substituted in where possible. + +Below is a rolling average example to illustrate the use of timedelta indexing. +Note that `timedelta(seconds=-n_seconds)` is equivalent to `csp.now() - timedelta(seconds=n_seconds)`, since datetime indexing is supported. + +```python +@csp.node +def rolling_average(x: ts[float], n_seconds: int) -> ts[float]: + with csp.start(): + assert n_seconds > 0 + csp.set_buffering_policy(x, tick_history=timedelta(seconds=n_seconds)) + if csp.ticked(x): + avg = np.mean(csp.values_at(x, timedelta(seconds=-n_seconds), timedelta(seconds=0), + csp.TimeIndexPolicy.INCLUSIVE, csp.TimeIndexPolicy.INCLUSIVE)) + csp.output(avg) +``` + +When accessing all elements within the buffering policy window like +this, it would be more succinct to pass None as the start and end time, +but datetime/timedelta allows for more general use (e.g. rolling average +between 5 seconds and 1 second ago, or average specifically between +9:30:00 and 10:00:00) diff --git a/docs/wiki/98.-Building-From-Source.md b/docs/wiki/dev-guides/Build-CSP-from-Source.md similarity index 64% rename from docs/wiki/98.-Building-From-Source.md rename to docs/wiki/dev-guides/Build-CSP-from-Source.md index cc39d82b..6aeda422 100644 --- a/docs/wiki/98.-Building-From-Source.md +++ b/docs/wiki/dev-guides/Build-CSP-from-Source.md @@ -1,6 +1,34 @@ -`csp` is written in Python and C++ with Python and C++ build dependencies. While prebuilt wheels are provided for end users, it is also straightforward to build `csp` from either the Python [source distribution](https://packaging.python.org/en/latest/specifications/source-distribution-format/) or the GitHub repository. - -As a convenience, `csp` uses a `Makefile` for commonly used commands. You can print the main available commands by running `make` with no arguments +CSP is written in Python and C++ with Python and C++ build dependencies. While prebuilt wheels are provided for end users, it is also straightforward to build CSP from either the Python [source distribution](https://packaging.python.org/en/latest/specifications/source-distribution-format/) or the GitHub repository. 
+ +## Table of Contents + +- [Table of Contents](#table-of-contents) +- [Make commands](#make-commands) +- [Prerequisites](#prerequisites) +- [Building with Conda on Linux](#building-with-conda-on-linux) + - [Install conda](#install-conda) + - [Clone](#clone) + - [Install build dependencies](#install-build-dependencies) + - [Build](#build) +- [Building with a system package manager](#building-with-a-system-package-manager) + - [Clone](#clone-1) + - [Install build dependencies](#install-build-dependencies-1) + - [Linux](#linux) + - [MacOS](#macos) + - [Install Python dependencies](#install-python-dependencies) + - [Build](#build-1) + - [Building on `aarch64` Linux](#building-on-aarch64-linux) +- [Lint and Autoformat](#lint-and-autoformat) +- [Testing](#testing) +- [Troubleshooting](#troubleshooting) + - [MacOS](#macos-1) + - [vcpkg install failed](#vcpkg-install-failed) + - [Building thrift:arm64-osx/thrift:x64-osx failed](#building-thriftarm64-osxthriftx64-osx-failed) + - [CMake was unable to find a build program corresponding to "Unix Makefiles".](#cmake-was-unable-to-find-a-build-program-corresponding-to-unix-makefiles) + +## Make commands + +As a convenience, CSP uses a `Makefile` for commonly used commands. You can print the main available commands by running `make` with no arguments ```bash > make @@ -13,27 +41,40 @@ lint run lints test run the tests ``` -# Prerequisites +## Prerequisites -`csp` has a few system-level dependencies which you can install from your machine package manager. Other package managers like `conda`, `nix`, etc, should also work fine. Currently, `csp` relies on the `GNU` compiler toolchain only. +CSP has a few system-level dependencies which you can install from your machine package manager. Other package managers like `conda`, `nix`, etc, should also work fine. Currently, CSP relies on the `GNU` compiler toolchain only. -# Building with Conda on Linux +## Building with Conda on Linux The easiest way to get started on a Linux machine is by installing the necessary dependencies in a self-contained conda environment. -Tweak this script to create a conda environment, install the build dependencies, build, and install a development version of `csp` into the environment. Note that we use [micromamba](https://mamba.readthedocs.io/en/latest/index.html) in this example, but [Anaconda](https://www.anaconda.com/download), [Miniconda](https://docs.anaconda.com/free/miniconda/index.html), [Miniforge](https://github.com/conda-forge/miniforge), etc, should all work fine. +Tweak this script to create a conda environment, install the build dependencies, build, and install a development version of CSP into the environment. -## Install Conda +### Install conda ```bash -# download and install micromamba for Linux/Mac -"${SHELL}" <(curl -L micro.mamba.pm/install.sh) +mkdir ~/github +cd ~/github + +# this downloads a Linux x86_64 build, change your architecture to match your development machine +# see https://conda-forge.org/miniforge/ for alternate download links + +wget https://github.com/conda-forge/miniforge/releases/download/23.3.1-1/Mambaforge-23.3.1-1-Linux-x86_64.sh +chmod 755 Mambaforge-23.3.1-1-Linux-x86_64.sh +./Mambaforge-23.3.1-1-Linux-x86_64.sh -b -f -u -p csp_venv + +. 
~/github/csp_venv/etc/profile.d/conda.sh
-# on windows powershell
-# Invoke-Expression ((Invoke-WebRequest -Uri https://micro.mamba.pm/install.ps1).Content)
+# optionally, run this if you want to set up conda in your .bashrc
+# conda init bash
+
+conda config --add channels conda-forge
+conda config --set channel_priority strict
+conda activate base
 ```
 
-## Clone
+### Clone
 
 ```bash
 git clone https://github.com/Point72/csp.git
@@ -41,7 +82,7 @@ cd csp
 git submodule update --init --recursive
 ```
 
-## Install build dependencies
+### Install build dependencies
 
 ```bash
 # Note the operating system, change as needed
@@ -50,22 +91,32 @@ micromamba create -n csp -f conda/dev-environment-unix.yml
 micromamba activate csp
 ```
 
-## Build
+### Build
 
 ```bash
-make build-conda
+make build
+
+# on aarch64 linux, comment the above command and use this instead
+# VCPKG_FORCE_SYSTEM_BINARIES=1 make build
 
-# finally install into the csp conda environment
+# finally install into the csp_venv conda environment
 make develop
 ```
 
-## A note about dependencies
+If you didn’t do `conda init bash` you’ll need to re-add conda to your shell environment and activate the `csp` environment to use it:
+
+```bash
+. ~/github/csp_venv/etc/profile.d/conda.sh
+conda activate csp
 
-In Conda, we pull our dependencies from the Conda environment by setting the environment variable `CSP_USE_VCPKG=0`. This will force the build to not pull dependencies from vcpkg. This may or may not work in other environments or with packages provided by other package managers or built from source, but there is too much variability for us to support alternative patterns.
+# make sure everything works
+cd ~/github/csp
+make test
+```
 
-# Building with a system package manager
+## Building with a system package manager
 
-## Clone
+### Clone
 
 Clone the repo and submodules with:
 
@@ -75,9 +126,9 @@ cd csp
 git submodule update --init --recursive
 ```
 
-## Install build dependencies
+### Install build dependencies
 
-### Linux
+#### Linux
 
 **Debian/Ubuntu/etc**
 
@@ -103,7 +154,7 @@ sudo make dependencies-fedora
 sudo dnf group install "Development Tools"
 ```
 
-### MacOS
+#### MacOS
 
 **Homebrew**
 
@@ -114,7 +165,7 @@ make dependencies-mac
 # brew install bison cmake flex make ninja
 ```
 
-## Install Python dependencies
+### Install Python dependencies
 
 Python build and develop dependencies are specified in the `pyproject.toml`, but you can manually install them:
 
@@ -129,16 +180,13 @@ make requirements
 Note that these dependencies would otherwise be installed normally as part of [PEP517](https://peps.python.org/pep-0517/) / [PEP518](https://peps.python.org/pep-0518/).
 
-## Build
+### Build
 
 Build the python project in the usual manner:
 
 ```bash
 make build
 
-# on aarch64 linux, comment the above command and use this instead
-# VCPKG_FORCE_SYSTEM_BINARIES=1 make build
-
 # or
 # python setup.py build build_ext --inplace
 ```
@@ -151,20 +199,15 @@ On `aarch64` Linux the VCPKG_FORCE_SYSTEM_BINARIES environment variable must be
 VCPKG_FORCE_SYSTEM_BINARIES=1 make build
 ```
 
-## Using System Dependencies
-
-By default, we pull and build dependencies with [vcpkg](https://vcpkg.io/en/). We only support non-vendored dependencies via Conda (see [A note about dependencies](#A-note-about-dependencies) above).
+## Lint and Autoformat
 
-# Lint and Autoformat
-
-`csp` has listing and auto formatting.
+CSP has linting and autoformatting.
| Language | Linter | Autoformatter | Description |
 | :------- | :----- | :------------ | :---------- |
 | C++ | `clang-format` | `clang-format` | Style |
 | Python | `ruff` | `ruff` | Style |
 | Python | `isort` | `isort` | Imports |
-| Markdown | `mdformat` / `codespell` | `mdformat` / `codespell` | Style/Spelling |
 
 **C++ Linting**
 
@@ -188,7 +231,7 @@ make fix-cpp
 make lint-py
 # or
 # python -m isort --check csp/ setup.py
-# python -m ruff check csp/ setup.py
+# python -m ruff csp/ setup.py
 ```
 
 **Python Autoformatting**
 
@@ -200,27 +243,9 @@ make fix-py
 # python -m ruff format csp/ setup.py
 ```
 
-**Documentation Linting**
-
-```bash
-make lint-docs
-# or
-# python -m mdformat --check docs/wiki/ README.md examples/README.md
-# python -m codespell_lib docs/wiki/ README.md examples/README.md
-```
-
-**Documentation Autoformatting**
+## Testing
 
-```bash
-make fix-docs
-# or
-# python -m mdformat docs/wiki/ README.md examples/README.md
-# python -m codespell_lib --write docs/wiki/ README.md examples/README.md
-```
-
-# Testing
-
-`csp` has both Python and C++ tests. The bulk of the functionality is tested in Python, which can be run via `pytest`. First, install the Python development dependencies with
+CSP has both Python and C++ tests. The bulk of the functionality is tested in Python, which can be run via `pytest`. First, install the Python development dependencies with
 
 ```bash
 make develop
@@ -289,17 +314,17 @@ There are a few test flags available:
 
 - **`CSP_TEST_KAFKA`**
 - **`CSP_TEST_SKIP_EXAMPLES`**: skip tests of examples folder
 
-# Troubleshooting
+## Troubleshooting
 
-## MacOS
+### MacOS
 
-### vcpkg install failed
+#### vcpkg install failed
 
 Check the `vcpkg-manifest-install.log` files, and install the
 corresponding packages if needed.
 
 For example, you may need to `brew install pkg-config`.
 
-### Building thrift:arm64-osx/thrift:x64-osx failed
+#### Building thrift:arm64-osx/thrift:x64-osx failed
 
 ```
 Thrift requires bison > 2.5, but the default `/usr/bin/bison` is version 2.3.
@@ -311,7 +336,7 @@
 On ARM: `export PATH="/opt/homebrew/opt/bison/bin:$PATH"`
 
 On Intel: `export PATH="/usr/local/opt/bison/bin:$PATH"`
 
-### CMake was unable to find a build program corresponding to "Unix Makefiles".
+#### CMake was unable to find a build program corresponding to "Unix Makefiles".
 
 Complete error message:
diff --git a/docs/wiki/dev-guides/Contribute.md b/docs/wiki/dev-guides/Contribute.md
new file mode 100644
index 00000000..7de8b8d3
--- /dev/null
+++ b/docs/wiki/dev-guides/Contribute.md
@@ -0,0 +1,9 @@
+Contributions are welcome on this project. We distribute under the terms of the [Apache 2.0 license](https://github.com/Point72/csp/blob/main/LICENSE).
+
+For **bug reports** or **small feature requests**, please open an issue on our [issues page](https://github.com/Point72/csp/issues).
+
+For **questions** or to discuss **larger changes or features**, please use our [discussions page](https://github.com/Point72/csp/discussions).
+
+For **contributions**, please see our [developer documentation](Local-Development-Setup). We have `help wanted` and `good first issue` tags on our issues page, so these are a great place to start.
+
+For **documentation updates**, make PRs that update the pages in `/docs/wiki`. The documentation is pushed to the GitHub wiki automatically through a GitHub workflow. Note that direct updates to this wiki will be overwritten.
diff --git a/docs/wiki/dev-guides/GitHub-Conventions.md b/docs/wiki/dev-guides/GitHub-Conventions.md
new file mode 100644
index 00000000..a60d6453
--- /dev/null
+++ b/docs/wiki/dev-guides/GitHub-Conventions.md
@@ -0,0 +1,73 @@
+## Triaging Issues
+
+The bug tracker is both a venue for users to communicate to the
+developers about defects and a database of known defects. It's up to the
+maintainers to ensure issues are high quality.
+
+We have a number of labels that can be applied to sort issues into
+categories. If you notice newly created issues that are poorly labeled,
+consider adding labels that apply or removing ones that do not.
+
+The issue template encourages users to write bug reports that clearly
+describe the problem they are having and include steps to reproduce the
+issue. However, users sometimes ignore the template or are not used to
+GitHub and make mistakes in formatting or communication.
+
+If you are able to infer what they meant and are able to understand the
+issue, feel free to edit their issue description to fix formatting or
+correct issues with a script demonstrating the issue.
+
+If there is still not enough information or if the issue is unclear,
+request more information from the submitter. If they do not respond or
+do not clarify sufficiently, close the issue. Try to be polite and have
+empathy for inexperienced issue authors.
+
+## How to check out a PR locally
+
+This workflow is described in the [GitHub
+docs](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally#modifying-an-inactive-pull-request-locally).
+
+1. Identify the pull request ID. This is the number of the pull request
+   in the GitHub UI, which shows up in the URL for the pull request. For
+   example, https://github.com/Point72/csp/pull/98 has PR ID 98.
+
+1. Fetch the pull request ref and assign it to a local branch name.
+
+   ```bash
+   git fetch upstream pull/<PR ID>/HEAD:LOCAL_BRANCH_NAME
+   ```
+
+   where `<PR ID>` is the PR ID number and `LOCAL_BRANCH_NAME` is a name
+   chosen for the PR branch in your local checkout of CSP.
+
+1. Switch to the PR branch
+
+   ```bash
+   git switch LOCAL_BRANCH_NAME
+   ```
+
+1. Rebuild CSP
+
+## Pushing Fixups to Pull Requests
+
+Sometimes pull requests don't quite make it across the finish line. In
+cases where only a small fixup is required to make a PR mergeable and
+the author of the pull request is unresponsive to requests, the best
+course of action is often to push to the pull request directly to
+resolve the issues.
+
+To do this, check out the pull request locally using the above
+instructions. Then make the changes needed for the pull request and push
+the local branch back to GitHub:
+
+```bash
+git push upstream LOCAL_BRANCH_NAME
+```
+
+Where `LOCAL_BRANCH_NAME` is the name you gave to the PR branch when you
+fetched it from GitHub.
+
+Note that if the user who created the pull request selected the option
+to forbid pushes to their pull request, you will instead need to
+recreate the pull request by pushing the PR branch to your fork and
+making a pull request like normal.
diff --git a/docs/wiki/dev-guides/Local-Development-Setup.md b/docs/wiki/dev-guides/Local-Development-Setup.md
new file mode 100644
index 00000000..15de91cc
--- /dev/null
+++ b/docs/wiki/dev-guides/Local-Development-Setup.md
@@ -0,0 +1,87 @@
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Step 1: Build CSP from Source](#step-1-build-csp-from-source)
+- [Step 2: Configuring Git and GitHub for Development](#step-2-configuring-git-and-github-for-development)
+  - [Create your fork](#create-your-fork)
+  - [Configure remotes](#configure-remotes)
+  - [Authenticating with GitHub](#authenticating-with-github)
+  - [Configure commit signing](#configure-commit-signing)
+- [Guidelines](#guidelines)
+
+## Step 1: Build CSP from Source
+
+To work on CSP, you are going to need to build it from source. See
+[Build CSP from Source](Build-CSP-from-Source) for
+detailed build instructions.
+
+Once you've built CSP from a `git` clone, you will also need to
+configure `git` and your GitHub account for CSP development.
+
+## Step 2: Configuring Git and GitHub for Development
+
+### Create your fork
+
+The first step is to create a personal fork of CSP. To do so, click
+the "fork" button at https://github.com/Point72/csp, or just navigate
+[here](https://github.com/Point72/csp/fork) in your browser. Set the
+owner of the repository to your personal GitHub account if it is not
+already set that way and click "Create fork".
+
+### Configure remotes
+
+Next, you should set some names for the `git` remotes corresponding to
+the main Point72 repository and your fork. If you started with a clone of
+the main `Point72` repository, you could do something like:
+
+```bash
+cd csp
+git remote rename origin upstream
+
+# for SSH authentication
+git remote add origin git@github.com:<your username>/csp.git
+
+# for HTTP authentication
+git remote add origin https://github.com/<your username>/csp.git
+```
+
+### Authenticating with GitHub
+
+If you have not already configured `ssh` access to GitHub, you can find
+instructions to do so
+[here](https://docs.github.com/en/authentication/connecting-to-github-with-ssh),
+including instructions to create an SSH key if you have not done
+so. Authenticating with SSH is usually the easiest route. If you are working in
+an environment that does not allow SSH connections to GitHub, you can look into
+[configuring a hardware
+passkey](https://docs.github.com/en/authentication/authenticating-with-a-passkey/about-passkeys)
+or adding a [personal access
+token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens)
+to avoid the need to type in your password every time you push to your fork.
+
+### Configure commit signing
+
+Additionally, you will need to configure your local `git` setup and
+GitHub account to use [commit
+signing](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification). All
+commits to the `csp` repository must be signed to increase the
+difficulty of a supply-chain attack against the CSP codebase. The
+easiest way to do this is to [configure `git` to sign commits with your
+SSH
+key](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification#ssh-commit-signature-verification). You
+can also use a [GPG
+key](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification#gpg-commit-signature-verification)
+to sign commits.
+
+In either case, you must also add your public key to your GitHub account
+as a signing key. Note that if you have already added an SSH key as an
+authentication key, you will need to add it again [as a signing
+key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account).
+
+## Guidelines
+
+After developing a change locally, ensure that both [lints](Build-CSP-from-Source#lint-and-autoformat) and [tests](Build-CSP-from-Source#testing) pass. Commits should be squashed into logical units, and all commits must be signed (e.g. with the `-s` git flag). CSP requires [Developer Certificate of Origin](https://en.wikipedia.org/wiki/Developer_Certificate_of_Origin) for all contributions.
+
+If your work is still in-progress, open a [draft pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests#draft-pull-requests). Otherwise, open a normal pull request. It might take a few days for a maintainer to review and provide feedback, so please be patient. If a maintainer asks for changes, please make said changes and squash your commits if necessary. If everything looks good to go, a maintainer will approve and merge your changes for inclusion in the next release.
+
+Please note that non-substantive changes, large changes without prior discussion, etc, are not accepted and pull requests may be closed.
diff --git a/docs/wiki/99.-Developer.md b/docs/wiki/dev-guides/Release-Process.md
similarity index 52%
rename from docs/wiki/99.-Developer.md
rename to docs/wiki/dev-guides/Release-Process.md
index 1c37d8fc..94866e57 100644
--- a/docs/wiki/99.-Developer.md
+++ b/docs/wiki/dev-guides/Release-Process.md
@@ -1,157 +1,16 @@
-# tl;dr
-
-After developing a change locally, ensure that both [lints](https://github.com/Point72/csp/wiki/98.-Building-From-Source#lint-and-autoformat) and [tests](https://github.com/Point72/csp/wiki/98.-Building-From-Source#testing) pass. Commits should be squashed into logical units, and all commits must be signed (e.g. with the `-s` git flag). `csp` requires [Developer Certificate of Origin](https://en.wikipedia.org/wiki/Developer_Certificate_of_Origin) for all contributions.
-
-If your work is still in-progress, open a [draft pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests#draft-pull-requests). Otherwise, open a normal pull request. It might take a few days for a maintainer to review and provide feedback, so please be patient. If a maintainer asks for changes, please make said changes and squash your commits if necessary. If everything looks good to go, a maintainer will approve and merge your changes for inclusion in the next release.
-
-**Please note that non substantive changes, large changes without prior discussion, etc, are not accepted and pull requests may be closed.**
-
-# Setting up a development environment
-
-To work on `csp`, you are going to need to build it from source. See
-https://github.com/Point72/csp/wiki/98.-Building-From-Source for
-detailed build instructions.
-
-Once you've built `csp` from a `git` clone, you will also need to
-configure `git` and your GitHub account for `csp` development.
-
-## Configuring Git and GitHub for Development
-
-### Create your fork
-
-The first step is to create a personal fork of `csp`. To do so, click
-the "fork" button at https://github.com/Point72/csp, or just navigate
-[here](https://github.com/Point72/csp/fork) in your browser. Set the
-owner of the repository to your personal GitHub account if it is not
-already set that way and click "Create fork".
-
-### Configure remotes
-
-Next, you should set some names for the `git` remotes corresponding to
-main Point72 repository and your fork. If you started with a clone of
-the main `Point72` repository, you could do something like:
-
-```bash
-cd csp
-git remote rename origin upstream
-
-# for SSH authentication
-git remote add origin git@github.com:<your username>/csp.git
-
-# for HTTP authentication
-git remote add origin https://github.com/<your username>/csp.git
-```
-
-### Authenticating with GitHub
-
-If you have not already configured `ssh` access to GitHub, you can find
-instructions to do so
-[here](https://docs.github.com/en/authentication/connecting-to-github-with-ssh),
-including instructions to create an SSH key if you have not done
-so. Authenticating with SSH is usually the easiest route. If you are working in
-an environment that does not allow SSH connections to GitHub, you can look into
-[configuring a hardware
-passkey](https://docs.github.com/en/authentication/authenticating-with-a-passkey/about-passkeys)
-or adding a [personal access
-token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens)
-to avoid the need to type in your password every time you push to your fork.
-
-### Configure commit signing
-
-Additionally, you will need to configure your local `git` setup and
-GitHub account to use [commit
-signing](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification). All
-commits to the `csp` repository must be signed to increase the
-difficulty of a supply-chain attack against the `csp` codebase. The
-easiest way to do this is to [configure `git` to sign commits with your
-SSH
-key](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification#ssh-commit-signature-verification). You
-can also use a [GPG
-key](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification#gpg-commit-signature-verification)
-to sign commits.
-
-In either case, you must also add your public key to your github account
-as a signing key. Note that if you have already added an SSH key as an
-authentication key, you will need to add it again [as a signing
-key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account).
-
-# Github for maintainers
-
-## Triaging Issues
-
-The bug tracker is both a venue for users to communicate to the
-developers about defects and a database of known defects. It's up to the
-maintainers to ensure issues are high quality.
-
-We have a number of labels that can be applied to sort issues into
-categories. If you notice newly created issues that are poorly labeled,
-consider adding or removing some labels that do not apply to the issue.
-
-The issue template encourages users to write bug reports that clearly
-describe the problem they are having and include steps to reproduce the
-issue. However, users sometimes ignore the template or are not used to
- -If you are able to infer what they meant and are able to understand the -issue, feel free to edit their issue description to fix formatting or -correct issues with a script demonstrating the issue. - -If there is still not enough information or if the issue is unclear, -request more information from the submitter. If they do not respond or -do not clarify sufficiently, close the issue. Try to be polite and have -empathy for inexperienced issue authors. - -## How to check out a PR locally - -This workflow is described in the [GitHub -docs](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally#modifying-an-inactive-pull-request-locally). - -1. Identify the pull request ID. This is the number of the pull request - in the GitHub UI, which shows up in the URL for the pull request. For - example, https://github.com/Point72/csp/pull/98 has PR ID 98. - -1. Fetch the pull request ref and assign it to a local branch name. - - ```bash - git fetch upstream pull//HEAD/:LOCAL_BRANCH_NAME - ``` - - where `` is the PR ID number and `LOCAL_BRANCH_NAME` is a name - chosen for the PR branch in your local checkout of `csp`. - -1. Switch to the PR branch - - ```bash - git switch LOCAL_BRANCH_NAME - ``` - -1. Rebuild `csp` - -## Pushing Fixups to Pull Requests - -Sometimes pull requests don't quite make it across the finish line. In -cases where only a small fixup is required to make a PR mergeable and -the author of the pull request is unresponsive to requests, the best -course of action is often to push to the pull request directly to -resolve the issues. - -To do this, check out the pull request locally using the above -instructions. Then make the changes needed for the pull request and push -the local branch back to GitHub: - -```bash -git push upstream LOCAL_BRANCH_NAME -``` - -Where `LOCAL_BRANCH_NAME` is the name you gave to the PR branch when you -fetched it from GitHub. - -Note that if the user who created the pull request selected the option -to forbid pushes to their pull request, you will instead need to -recreate the pull request by pushing the PR branch to your fork and -making a pull request like normal. - -# Release Manual +## Table of Contents + +- [Table of Contents](#table-of-contents) +- [Doing a "normal" release](#doing-a-normal-release) + - [Choosing a version number](#choosing-a-version-number) + - [Preparing and tagging a release](#preparing-and-tagging-a-release) + - [Releasing to PyPI](#releasing-to-pypi) + - [A developer's first release](#a-developers-first-release) + - [Doing the release](#doing-the-release) + - [Download release artifacts from github actions](#download-release-artifacts-from-github-actions) + - [Optionally upload to testpypi to test "pip install"](#optionally-upload-to-testpypi-to-test-pip-install) + - [Upload to pypi](#upload-to-pypi) +- [Dealing with release mistakes](#dealing-with-release-mistakes) ## Doing a "normal" release @@ -176,7 +35,7 @@ different potential impact on users. This is the most common kind of release. A patch release should only include fixes for bugs or other changes that cannot impact code a user writes with the `csp` package. A user should be able to safely - upgrade `csp` from the previous version to a new patch release with + upgrade CSP from the previous version to a new patch release with no changes to the output of their code and no new errors being raised, except for fixed bugs. 
Whether or not a bug fix is sufficiently impactful to break backward compatibility is a
@@ -215,6 +74,13 @@
 Follow these steps when it's time to tag a new release.
 Before doing this, you will need to ensure `bump2version` is installed
 into your development environment.
 
+> \[!NOTE\]
+> The following steps assume you have a personal fork of csp.
+> If you are working from the main `Point72/csp` repo, use `origin`
+> instead of `upstream` in the git commands. Specifically,
+> `git pull origin main --tags` in step 1,
+> and `git push origin main --follow-tags` in step 7.
+
 1. Ensure your local clone of `csp` is synced up with GitHub, including
    any tags that have been pushed since you last synced:
 
@@ -283,18 +149,20 @@
 actions running, one for the push to `main` and one for the new tag.
 You want to inspect the action running for the new tag. Once the run
 finishes, there should be a new release on the ["Releases"
 page](https://github.com/Point72/csp/releases).
+If the release is in "Draft" state, click on the pencil icon to
+"Edit" and publish it with the "Publish release" button.
 
 ### Releasing to PyPI
 
 #### A developer's first release
 
 If this is your first release, you will need an account on pypi.org and
-your account will need to be added as a maintainer to the `csp` project
-on pypi. You will also need to have two factor authentication enabled on
+your account will need to be added as a maintainer to the CSP project
+on PyPI. You will also need to have two-factor authentication enabled on
 your PyPI account.
 
 Once that is set up, navigate to the API token page in your PyPI
-settings and generate an API token scoped to the `csp` project. **Do not**
+settings and generate an API token scoped to the CSP project. **Do not**
 navigate away from the page displaying the API token before the next
 step.
 
@@ -321,7 +189,7 @@ content:
 
 #### Doing the release
 
-##### Download release artifacts from github actions
+#### Download release artifacts from github actions
 
 Make sure you are in the root of the `csp` repository and execute the
 following commands.
@@ -347,7 +215,7 @@ twine check --strict dist/*
 
 This happens as part of the CI so this should only be a double-check.
 
-##### Optionally upload to testpypi to test "pip install"
+#### Optionally upload to testpypi to test "pip install"
 
 ```
 twine upload --repository testpypi dist/*
@@ -363,12 +231,12 @@ pip install --index-url https://test.pypi.org/simple --extra-index-url https://p
 
 Note that `extra-index-url` is necessary to ensure downloading
 dependencies succeeds.
 
-##### Upload to pypi
+#### Upload to pypi
 
 If you are sure the release is ready, you can upload to pypi like so:
 
 ```bash
-twine upload --repository csp dist/*`
+twine upload --repository csp dist/*
 ```
 
 Note that this assumes you have a `.pypirc` set up as explained above.
diff --git a/docs/wiki/dev-guides/Roadmap.md b/docs/wiki/dev-guides/Roadmap.md
new file mode 100644
index 00000000..25bb34c4
--- /dev/null
+++ b/docs/wiki/dev-guides/Roadmap.md
@@ -0,0 +1,17 @@
+We do not have a formal roadmap, but we're happy to discuss features, improvements, new adapters, etc., in our [discussions area](https://github.com/Point72/csp/discussions).
+ +Here are some high level items we hope to accomplish in the next few months: + +- Support `msvc` compiler and full Windows support ([#109](https://github.com/Point72/csp/issues/109)) +- Establish a better pattern for adapters ([#165](https://github.com/Point72/csp/discussions/165)) +- Parallelization to improve runtime, for historical/offline distributions +- Support for cross-process communication in realtime distributions + +## Adapters and Extensions + +- C++-based HTTP/SSE adapter +- Add support for other graph viewers, including interactive / standalone / Jupyter + +## Other Open Source Projects + +- `csp-gateway`: Application development framework, built with [FastAPI](https://fastapi.tiangolo.com) and [Perspective](https://github.com/finos/perspective). This is a library we have built internally at Point72 on top of `csp` that we hope to open source later in 2024. It allows for easier construction of modular `csp` applications, along with a pluggable REST/WebSocket API and interactive UI. diff --git a/docs/wiki/get-started/First-Steps.md b/docs/wiki/get-started/First-Steps.md new file mode 100644 index 00000000..eb9bc0ec --- /dev/null +++ b/docs/wiki/get-started/First-Steps.md @@ -0,0 +1,48 @@ +When writing CSP code there will be runtime components in the form of `csp.node` methods, as well as graph-building components in the form of `csp.graph` components. + +It is important to understand that `csp.graph` components will only be executed once at application startup in order to construct the graph. +Once the graph is constructed, `csp.graph` code is no longer needed. +Once the graph is run, only inputs, `csp.node`s and outputs will be active as data flows through the graph, driven by input ticks. + +For example, this is a simple bit of graph code: + +```python +import csp +from csp import ts +from datetime import datetime + + +@csp.node +def spread(bid: ts[float], ask: ts[float]) -> ts[float]: + if csp.valid(bid, ask): + return ask - bid + + +@csp.graph +def my_graph(): + bid = csp.const(1.0) + ask = csp.const(2.0) + bid = csp.multiply(bid, csp.const(4)) + ask = csp.multiply(ask, csp.const(3)) + s = spread(bid, ask) + + csp.print('spread', s) + csp.print('bid', bid) + csp.print('ask', ask) + + +if __name__ == '__main__': + csp.run(my_graph, starttime=datetime.utcnow()) +``` + +In order to help visualize this graph, you can call `csp.show_graph`: + +![359407708](https://github.com/Point72/csp/assets/3105306/8cc50ad4-68f9-4199-9695-11c136e3946c) + +The result of this would be: + +``` +2020-04-02 15:33:38.256724 bid:4.0 +2020-04-02 15:33:38.256724 ask:6.0 +2020-04-02 15:33:38.256724 spread:2.0 +``` diff --git a/docs/wiki/get-started/Installation.md b/docs/wiki/get-started/Installation.md new file mode 100644 index 00000000..f7104347 --- /dev/null +++ b/docs/wiki/get-started/Installation.md @@ -0,0 +1,20 @@ +## `pip` + +We ship binary wheels to install CSP on MacOS and Linux via `pip`: + +```bash +pip install csp +``` + +## `conda` + +CSP is available on `conda` for Linux and Mac: + +```bash +conda install csp -c conda-forge +``` + +## Source installation + +For other platforms, follow the instructions to [build CSP from +source](Build-CSP-from-Source). diff --git a/docs/wiki/how-tos/Add-Cycles-in-Graphs.md b/docs/wiki/how-tos/Add-Cycles-in-Graphs.md new file mode 100644 index 00000000..d8fba431 --- /dev/null +++ b/docs/wiki/how-tos/Add-Cycles-in-Graphs.md @@ -0,0 +1,52 @@ +By definition of the graph building code, CSP graphs can only produce acyclical graphs. 
+However, there are many occasions where a cycle may be required.
+For example, let's say you want part of your graph to simulate an exchange.
+That part of the graph would need to accept new orders and return acks and executions.
+However, the acks / executions would likely need to *feed back* into the same part of the graph that generated the orders.
+For this reason, the `csp.feedback` construct exists.
+Using `csp.feedback` one can wire a feedback as an input to a node, and effectively bind the actual edge that feeds it later in the graph.
+Note that internally the graph is still acyclical.
+Internally `csp.feedback` creates a pair of output and input adapters that are bound together.
+When a timeseries that is bound to a feedback ticks, it is fed to the feedback, which then schedules the tick on its bound input to be executed on the **next engine cycle**.
+The next engine cycle will execute with the same engine time as the cycle that generated it, but it will be evaluated in a subsequent cycle.
+
+- **`csp.feedback(ts_type)`**: `ts_type` is the type of the timeseries (i.e. int, str).
+  This returns an instance of a feedback object.
+  - **`out()`**: this method returns the timeseries edge which can be passed as an input to your node
+  - **`bind(ts)`**: this method is called to bind an edge as the source of the feedback after the fact
+
+A simple example should help demonstrate a possible usage.
+Let's say we want to simulate acking orders that are generated from a node called `my_algo`.
+In addition to generating the orders, `my_algo` also needs to receive the execution reports (this is demonstrated in example `e_13_feedback.py`).
+
+The graph code would look something like this:
+
+```python
+# Simulate acking an order
+@csp.node
+def my_exchange(order: ts[Order]) -> ts[ExecReport]:
+    ...  # impl details
+
+@csp.node
+def my_algo(exec_report: ts[ExecReport]) -> ts[Order]:
+    ...  # impl details
+
+@csp.graph
+def my_graph():
+    # create the feedback first so that we can refer to it later
+    exec_report_fb = csp.feedback(ExecReport)
+
+    # generate orders, passing feedback out() which isn't bound yet
+    orders = my_algo(exec_report_fb.out())
+
+    # get exec_reports from "simulator"
+    exec_report = my_exchange(orders)
+
+    # now bind the exec reports to the feedback, finishing the "loop"
+    exec_report_fb.bind(exec_report)
+```
+
+The graph would end up looking like this: it remains acyclical, but the `FeedbackOutputDef` is bound to the `FeedbackInputDef`, so any tick to `out` will push the tick to `in` on the next cycle:
+
+![366521848](https://github.com/Point72/csp/assets/3105306/c4f920ff-49f9-4a52-8404-7c1989768da7)
diff --git a/docs/wiki/how-tos/Create-Dynamic-Baskets.md b/docs/wiki/how-tos/Create-Dynamic-Baskets.md
new file mode 100644
index 00000000..9457d68b
--- /dev/null
+++ b/docs/wiki/how-tos/Create-Dynamic-Baskets.md
@@ -0,0 +1,58 @@
+CSP graphs are somewhat limiting in that they cannot change shape once the process starts up.
+CSP dynamic graphs address this issue by introducing a construct to allow applications to dynamically add / remove sub-graphs from a running graph.
+
+`csp.DynamicBasket`s are a prerequisite construct needed for dynamic graphs.
+`csp.DynamicBasket`s work just like regular static CSP baskets; however, dynamic baskets can change their shape over time.
+`csp.DynamicBasket`s can only be created from either CSP nodes or from `csp.dynamic` calls, as described below.
+A node can take a `csp.DynamicBasket` as an input or generate a dynamic basket as an output.
+Dynamic baskets are always dictionary-style baskets, where time series can be added by key.
+Note that timeseries can also be removed from dynamic baskets.
+
+## Syntax
+
+Dynamic baskets are denoted by the type `csp.DynamicBasket[key_type, ts_type]`, so for example `csp.DynamicBasket[str,int]` would be a dynamic basket that will have keys of type `str` and timeseries of type `int`.
+One can also use the non-Python shorthand `{ ts[str] : ts[int] }` to signify the same.
+
+## Generating dynamic basket output
+
+Nodes that generate dynamic basket output use the same interface as regular basket outputs.
+The difference is that if you output a key that hasn't been seen before, it will automatically be added to the dynamic basket.
+In order to remove a key from a dynamic basket output, you would use the `csp.remove_dynamic_key` method.
+**NOTE** that it is illegal to add and remove a key in the same cycle:
+
+```python
+@csp.node
+def dynamic_demultiplex_example(data: ts['T'], key: ts['K']) -> csp.DynamicBasket['K', 'T']:
+    if csp.ticked(data) and csp.valid(key):
+        csp.output({key: data})
+
+
+        ## To remove a key, which wouldn't be done in this example node:
+        ## csp.remove_dynamic_key(key)
+```
+
+To remove a key one would use `csp.remove_dynamic_key`.
+For a single unnamed output, the method expects the key.
+For named outputs, the arguments would be `csp.remove_dynamic_key(output_name, key)`.
+
+## Consuming dynamic basket input
+
+Taking dynamic baskets as input works exactly the same as with static baskets.
+There is one additional bit of information available on dynamic basket inputs though, which is the `.shape` property.
+As keys are added or removed, the `basket.shape` property will tick with the change events.
+The `.shape` property behaves effectively as a `ts[csp.DynamicBasketEvents]`:
+
+```python
+@csp.node
+def consume_dynamic_basket(data: csp.DynamicBasket[str, int]):
+    if csp.ticked(data.shape):
+        for key in data.shape.added:
+            print(f'key {key} was added')
+        for key in data.shape.removed:
+            print(f'key {key} was removed')
+
+
+    if csp.ticked(data):
+        for key, value in data.tickeditems():
+            ...  # regular basket access here
+```
diff --git a/docs/wiki/how-tos/Profile-CSP-Code.md b/docs/wiki/how-tos/Profile-CSP-Code.md
new file mode 100644
index 00000000..15e7a5c3
--- /dev/null
+++ b/docs/wiki/how-tos/Profile-CSP-Code.md
@@ -0,0 +1,77 @@
+The `csp.profiler` library allows users to time cycle/node executions during a graph run. There are two available utilities: `profiler.Profiler`, which profiles a graph as it runs, and `profiler.graph_info`, which inspects a graph at build time.
+
+You can use these metrics to identify bottlenecks and inefficiencies in your graphs.
+
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Profiling a real-time csp.graph](#profiling-a-real-time-cspgraph)
+- [Saving raw profiling data to a file](#saving-raw-profiling-data-to-a-file)
+- [graph_info: build-time information](#graph_info-build-time-information)
+
+## Profiling a real-time `csp.graph`
+
+The `csp.profiler` library provides a GUI for profiling real-time CSP graphs.
+One can access this GUI by adding an `http_port` argument to the profiler call.
+
+```python
+with profiler.Profiler(http_port=8888) as p:
+    results = csp.run(graph, starttime=st, endtime=et)  # run the graph normally
+```
+
+This will open up the GUI on `localhost:8888` (since `http_port=8888`) which will display real-time node timing, cycle timing and memory snapshots.
+Profiling stats will be calculated whenever you refresh the page or call a GET request.
+Additionally, you can add the `format=json` argument (`localhost:8888?format=json`) to your request to receive the `ProfilerInfo` as a JSON object rather than the HTML display.
+
+Users can add the `display_graphs=True` flag to include bar/pie charts of node execution times in the web UI.
+The `matplotlib` package is required to use this flag.
+
+```python
+with profiler.Profiler(http_port=8888, display_graphs=True) as p:
+    ...
+```
+
+*\[Screenshot: profiler web UI\]*
+
+## Saving raw profiling data to a file
+
+Users can save individual node execution times and individual cycle execution times to a `.csv` file if they desire.
+This is useful if you want to apply your own analysis, e.g. calculating percentiles.
+To do this, simply add the `node_file=` or `cycle_file=` arguments:
+
+```python
+with profiler.Profiler(cycle_file="cycle_data.csv", node_file="node_data.csv") as p:
+    ...
+```
+
+After the graph is run, the file `node_data.csv` contains:
+
+```
+Node Type,Execution Time
+count,1.9814e-05
+cast_int_to_float,1.2791e-05
+_time_window_updates,4.759e-06
+...
+```
+
+After the graph is run, the file `cycle_data.csv` contains:
+
+```
+Execution Time
+9.4757e-05
+4.5205e-05
+2.2873e-05
+...
+```
+
+## graph_info: build-time information
+
+Users can also extract build-time information about the graph without running it by calling `profiler.graph_info`.
+
+The code snippet below shows how to call `graph_info`.
+
+```python
+from csp import profiler
+
+info = profiler.graph_info(graph)
+```
diff --git a/docs/wiki/how-tos/Use-Statistical-Nodes.md b/docs/wiki/how-tos/Use-Statistical-Nodes.md
new file mode 100644
index 00000000..a5e0f008
--- /dev/null
+++ b/docs/wiki/how-tos/Use-Statistical-Nodes.md
@@ -0,0 +1,433 @@
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Introduction](#introduction)
+- [Working with a single-valued time series](#working-with-a-single-valued-time-series)
+- [Working with a NumPy time series](#working-with-a-numpy-time-series)
+- [Working with a basket of time series](#working-with-a-basket-of-time-series)
+- [Cross-sectional statistics](#cross-sectional-statistics)
+- [Expanding window statistics](#expanding-window-statistics)
+- [Common user options](#common-user-options)
+  - [Intervals](#intervals)
+  - [Triggers, samplers and resets](#triggers-samplers-and-resets)
+  - [Data validity](#data-validity)
+  - [NaN handling](#nan-handling)
+  - [Weighted statistics](#weighted-statistics)
+- [Numerical stability](#numerical-stability)
+  - [The `recalc` parameter](#the-recalc-parameter)
+
+## Introduction
+
+The `csp.stats` library provides rolling window calculations on time series data in CSP.
+The goal of the library is to provide a uniform, robust interface for statistical calculations in CSP.
+Each computation is a `csp.graph` which consists of one or more nodes that perform the calculation.
+Users can treat these graphs as a "black box" with specified inputs and outputs as provided in the API reference.
+Example statistics graphs for *mean* and *standard deviation* are provided below to give a rough idea of how the graphs work.
+
+**Mean using a tick-specified interval**
+![437686747](https://github.com/Point72/csp/assets/3105306/5586a355-e405-45c3-aa6d-c64754fd6c26)
+
+**Standard deviation using a tick-specified interval**
+![437686748](https://github.com/Point72/csp/assets/3105306/8ae2ab7a-413d-4175-89d5-5b252401a83e)
+
+Rolling windows can either be specified by the number of ticks in the window or the time duration of the window.
+Users can specify minimum window sizes for results as well as the minimum number of data points for a valid computation.
+Standard NaN handling is provided with two different options.
+Weighting is available for relevant stats functions such as sums, mean, covariance, and skew.
+
+## Working with a single-valued time series
+
+Time series of float and int types can be used for all stats functions, except those listed as "NumPy Specific".
+Internally, all values are cast to float-type.
+`NaN` values in the series (if applicable) are allowed and will be handled as specified by the `ignore_na` flag.
+
+If you are performing the same calculation on many different series, **it is highly recommended that you use a NumPy array.**
+NumPy array inputs result in a much smaller CSP graph, which can drastically improve performance.
+If different series tick asynchronously, single-input calculations sometimes cannot be avoided.
+However, you can consider sampling your data at regularly specified intervals, and then using the sampled values to create a NumPy array which is provided to the calculation.
+
+## Working with a NumPy time series
+
+All statistics functions work on both single-input time series and time series of NumPy arrays.
+NumPy arrays provide the ability to perform the same calculation on many different elements within the same `csp.node`, and therefore drastically reduce the overall size of the CSP graph.
+Per benchmarking, large-scale computations (i.e. thousands of symbols) are orders of magnitude faster with NumPy arrays.
+To convert a list of individual series into a NumPy array, use the `csp.stats.list_to_numpy` conversion node.
+To convert back to a basket of series, use the `csp.stats.numpy_to_list` converter.
+
+All calculations on NumPy arrays are performed element-wise, with the exception of `cov_matrix` and `corr_matrix`, which are defined in the statistical sense.
+Arrays of arbitrary dimension are supported, as well as array views such as transposes and slices.
+The data type of the arrays must be float, not int.
+If your data is integer valued, convert the array to a float-type using the `astype` function in the NumPy library.
+Basic mathematical operations (such as addition, multiplication etc.) are defined on NumPy array time series using NumPy's built-in functions, which allow for proper broadcasting rules.
+
+## Working with a basket of time series
+
+There are two ways that users can run stats functions on a listbasket of time series.
+If the data in the time series ticks together (or *relatively* together), then users can convert their listbasket data into a NumPy array time series
+using the `list_to_numpy` node, run the calculations they want, and then convert back to a listbasket using the `numpy_to_list` node.
+Since NumPy arrays only require one node per computation, whereas a list of `N` time series will require `N` nodes, this method is highly efficient even for small graphs.
+Below is a diagram of the workflow for a listbasket with 2 elements.
+
+**A sum over a listbasket with 2 elements**
+![437687654](https://github.com/Point72/csp/assets/3105306/0e12b9ff-9461-497c-895d-3b1c33669235)
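+
+To make this round trip concrete, here is a minimal sketch. The series values are illustrative, and the final argument to `numpy_to_list` is assumed here to be the output basket size; consult the `csp.stats` API reference for the exact signatures.
+
+```python
+import csp
+import csp.stats
+from datetime import datetime, timedelta
+
+
+@csp.graph
+def basket_means():
+    # two series that tick together, once per second
+    a = csp.curve(float, [(timedelta(seconds=i), float(i)) for i in range(5)])
+    b = csp.curve(float, [(timedelta(seconds=i), float(2 * i)) for i in range(5)])
+
+    # pack the listbasket into a single ts of NumPy arrays ...
+    arr = csp.stats.list_to_numpy([a, b])
+
+    # ... so that a single mean node serves both series
+    m = csp.stats.mean(arr, interval=3)
+
+    # unpack back into a listbasket of 2 series
+    means = csp.stats.numpy_to_list(m, 2)
+    csp.print('mean_a', means[0])
+    csp.print('mean_b', means[1])
+
+
+csp.run(basket_means, starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=10))
+```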
+
+If the data does not tick (or is sampled) at the same time, or the computations are fundamentally different in nature (i.e. different intervals), then the NumPy method will not provide the desired functionality.
+Instead, if users wish to store all their individual time series in a listbasket, then they must use single-input stats with standard CSP listbasket syntax.
+This method is significantly slower than using NumPy arrays, since the graphs must be much larger.
+However, depending on your use case, this may be unavoidable.
+If possible, it is highly recommended that you consider transformations to your data that allow it to be stored in NumPy arrays, such as sampling at given intervals.
+
+## Cross-sectional statistics
+
+The `stats` library also exposes an option to compute cross-sectional statistics.
+Cross-sectional statistics are statistics which are computed using every value in the window at each iteration.
+These computations are less efficient than rolling window functions that employ smart updating.
+However, some computations may have to be applied cross-sectionally, and some users may want to apply cross-sectional statistics for small window calculations that require high numerical stability.
+
+To use cross-sectional statistics, use the `csp.stats.cross_sectional` utility to receive all data in the current window.
+Then, use `csp.apply` to use your own function on the cross-sectional data.
+The `cross_sectional` function allows for the same user options as standard stats functions (such as triggering and sampling).
+An example of using `csp.stats.cross_sectional` is shown below:
+
+```python
+# Starttime: 2020-01-01 00:00:00
+x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3, '2020-01-04': 4, '2020-01-05': 5}
+cs = cross_sectional(x, interval=3, min_window=2)
+cs
+```
+
+```python
+{'2020-01-02': [1,2], '2020-01-03': [1,2,3], '2020-01-04': [2,3,4], '2020-01-05': [3,4,5]}
+```
+
+```python
+# Calculate a cross-sectional mean
+cs_mean = csp.apply(cs, lambda v: sum(v)/len(v), float)
+cs_mean
+```
+
+```python
+{'2020-01-02': 1.5, '2020-01-03': 2.0, '2020-01-04': 3.0, '2020-01-05': 4.0}
+```
+
+## Expanding window statistics
+
+An expanding window holds all ticks of its underlying time series; in other words, the window grows unbounded as you receive more data points.
+To use an expanding window, either don't specify an interval or set `interval=None`.
+An example of an expanding window sum is shown below:
+
+```python
+# Starttime: 2020-01-01 00:00:00
+x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3, '2020-01-04': 4, '2020-01-05': 5}
+sum(x)
+```
+
+```python
+{'2020-01-01': 1, '2020-01-02': 3, '2020-01-03': 6, '2020-01-04': 10, '2020-01-05': 15}
+```
+
+## Common user options
+
+### Intervals
+
+Intervals can be specified as a tick window or a time window.
+Tick windows are int arguments while time windows are timedelta arguments.
+For example,
+
+- `csp.stats.mean(x, interval=4)` will calculate a rolling mean over the last 4 ticks of data.
+- `csp.stats.mean(x, interval=timedelta(seconds=4))` will calculate a rolling mean over the last 4 seconds of data.
+
+Time intervals are inclusive at the right endpoint but **exclusive** at the left endpoint.
+For example, if `x` ticks every one second with a value of `1`, and you call `csp.stats.sum(x, timedelta(seconds=1))`, then the output will be `1` at all times.
+It will not be `2`, since the left endpoint value (which ticked *exactly* one second ago) is not included.
+
+Tick intervals include `NaN` values.
+For example, a tick interval of size `10` with `9` `NaN` values in the interval will only use the single non-NaN value for computations.
+For more information on `NaN` handling, see the "NaN handling" section.
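+
+As a runnable illustration of the two interval flavors (the graph and variable names are illustrative), the first call below windows by tick count and the second by elapsed time:
+
+```python
+import csp
+import csp.stats
+from datetime import datetime, timedelta
+
+
+@csp.graph
+def interval_demo():
+    # one float tick per second: 0.0, 1.0, 2.0, ...
+    x = csp.curve(float, [(timedelta(seconds=i), float(i)) for i in range(10)])
+
+    # rolling mean over the last 4 ticks of data
+    csp.print('tick_mean', csp.stats.mean(x, interval=4))
+
+    # rolling mean over the last 4 seconds of data (left endpoint excluded)
+    csp.print('time_mean', csp.stats.mean(x, interval=timedelta(seconds=4)))
+
+
+csp.run(interval_demo, starttime=datetime(2020, 1, 1), endtime=timedelta(seconds=30))
+```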
+
+If no interval is specified, then the calculation will be treated as an expanding window statistic and all data will be cumulative (see the above section on Expanding Window Statistics).
+
+### Triggers, samplers and resets
+
+**Triggers** are optional arguments which *trigger* a computation of the statistic.
+If no trigger is provided as an argument, the statistic will be computed every time `x` ticks, i.e. `x` becomes the trigger.
+
+```python
+x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3}
+trigger = {'2020-01-02': True}
+
+sum(x, interval=2)
+```
+
+```python
+{'2020-01-02': 3, '2020-01-03': 5}
+```
+
+```python
+sum(x, interval=2, trigger=trigger)
+```
+
+```python
+# No result at day 3
+{'2020-01-02': 3}
+```
+
+**Samplers** are optional arguments which *sample* the data.
+Samplers are used to signify when the data, `x`, *should* tick.
+If no sampler is provided, the data is sampled whenever `x` ticks, i.e. `x` becomes the sampler.
+
+- If the sampler ticks and `x` does as well, then the tick is treated as valid data
+- If the sampler ticks but `x` does not, then the tick is treated as `NaN` data
+- If the sampler does not tick but `x` does, then the tick is ignored completely
+
+```python
+x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3}
+sampler = {'2020-01-01': True, '2020-01-03': True}
+
+sum(x, interval=2)
+```
+
+```python
+{'2020-01-02': 3, '2020-01-03': 5}
+```
+
+```python
+sum(x, interval=2, sampler=sampler)
+```
+
+```python
+# Tick on day 2 is ignored
+{'2020-01-03': 4}
+```
+
+**Resets** are optional arguments which *reset* the interval, clearing all existing data.
+Whenever reset ticks, the data is cleared.
+If no reset is provided, then the data is never reset.
+
+```python
+x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3}
+reset = {'2020-01-02 12:00:00': True}
+
+sum(x, interval=2)
+```
+
+```python
+{'2020-01-02': 3, '2020-01-03': 5}
+```
+
+```python
+sum(x, interval=2, reset=reset)
+```
+
+```python
+# Data is reset after day 2
+{'2020-01-02': 3, '2020-01-03': 3}
+```
+
+**Important:** the order of operations between all three actions is as follows: reset, sample, trigger.
+If all three series were to tick at the same time: the data is first reset, then sampled, and then a computation is triggered.
+
+```python
+x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3}
+reset = {'2020-01-03': True}
+
+# Trigger = sampler = x. Reset, trigger and sampler therefore all tick at 2020-01-03
+
+sum(x, interval=2, reset=reset)
+```
+
+```python
+# the data is first reset, then 3 is sampled, and then the sum is computed
+{'2020-01-02': 3, '2020-01-03': 3}
+```
+
+### Data validity
+
+**Minimum window size** (`min_window`) is the smallest allowable window before returning a computation.
+If a time window interval is used, then `min_window` must also be a `timedelta`.
+If a tick interval is used, then `min_window` must also be an `int`.
+Minimum window is a startup condition: once the minimum window size is reached, it will never go away.
+For example, if you have a minimum window of 5 ticks with a 10 tick interval, once 5 ticks of data have occurred, computations will always be returned when triggered.
+By *default*, the minimum window size is equal to the interval itself.
+
+```python
+x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3}
+sum(x, interval=2, min_window=1)
+```
+
+```python
+{'2020-01-01': 1, '2020-01-02': 3, '2020-01-03': 5}
+```
+
+```python
+sum(x, interval=timedelta(days=2), min_window=timedelta(days=1))
+```
+
+```python
+# Assuming graph start time is 2020-01-01
+{'2020-01-02': 3, '2020-01-03': 5}
+```
+
+**Minimum data points** (`min_data_points`) is the number of *valid* (non-NaN) data points that must exist in the current window for a valid computation.
+By default, `min_data_points` is 0.
+However, if you are dealing with frequently-NaN data, you may want to ensure that stats computations only return meaningful results.
+If the interval has fewer than `min_data_points` valid values, the computation is considered too noisy and NaN is returned instead.
+
+```python
+x = {'2020-01-01': 1, '2020-01-02': nan, '2020-01-03': 3}
+
+sum(x, interval=2)
+```
+
+```python
+{'2020-01-02': 1, '2020-01-03': 3}
+```
+
+```python
+sum(x, interval=2, min_data_points=2)
+```
+
+```python
+# We only have 1 valid data point
+{'2020-01-02': nan, '2020-01-03': nan}
+```
+
+### NaN handling
+
+The stats library provides a uniform interface for NaN handling.
+Functions have an `ignore_na` parameter which is a bool argument (default value is `True`).
+
+- If `ignore_na=True`, then NaN values are "ignored" in the computation but still included in the interval
+- If `ignore_na=False`, then NaN values make the whole computation NaN ("poison" the interval) as long as they are present in the interval
+
+```python
+x = {'2020-01-01': 1, '2020-01-02': nan, '2020-01-03': 3, '2020-01-04': 4}
+
+sum(x, interval=2, ignore_na=True)
+```
+
+```python
+{'2020-01-02': 1, '2020-01-03': 3, '2020-01-04': 7}
+```
+
+```python
+sum(x, interval=2, ignore_na=False)
+```
+
+```python
+# The NaN at t=2 only drops out of the interval by t=4
+{'2020-01-02': nan, '2020-01-03': nan, '2020-01-04': 7}
+```
+
+For exponential moving calculations, **EMA NaN handling** is slightly different.
+If `ignore_na=True`, then NaN values are completely discarded.
+If `ignore_na=False`, then NaN values do not poison the interval, but rather count as a tick with no data.
+This affects the reweighting of past data points when the next tick with valid data is added.
+For a detailed explanation, see the EMA section.
+
+### Weighted statistics
+
+**Weights** is an optional time-series which gives a relative weight to each data point.
+Weighted statistics are available for: *sum(), mean(), var(), cov(), stddev(), sem(), corr(), skew(), kurt(), cov_matrix()* and *corr_matrix()*.
+Since weights are relative, they do not need to be normalized by the user.
+Weights also do not necessarily need to tick at the same time as the data: the weights are *sampled* whenever the data sampler ticks.
+For higher-order statistics such as variance, covariance, correlation, standard deviation, standard error, skewness and kurtosis, weights are interpreted as *frequency weights*.
+This means that a weight of 1 corresponds to that observation occurring once and a weight of 2 signifies that observation occurring twice.
+
+If either the data *or* its corresponding weight is NaN, then the weighted data point is collectively treated as NaN.
+ +```python +# Single valued time series + +x = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-03': 3, '2020-01-04': 4} +weights = {'2020-01-01': 1, '2020-01-02': 2, '2020-01-04': 1} + +sum(x, interval=2, weights=weights) +``` + +```python +# Weight of 2 applied to x=3, as it is sampled +{'2020-01-02': 5, '2020-01-03': 10, '2020-01-04': 10} +``` + +```python +mean(x, interval=2, weights=weights) +``` + +```python +# Weighted mean +{'2020-01-02': 1.667, '2020-01-03': 2.5, '2020-01-04': 3.333} +``` + +If the time-series is of type `float`, then the weights series is also of type `float`. +If the time-series is of type `np.ndarray`, then the weights series is sometimes of type `np.ndarray` and sometimes of type `float`. +For element-wise statistics *sum(), mean(), var(), stddev(), sem(), skew(), kurt()* the weights are element-wise as well. +For *cov_matrix()* and *corr_matrix(),* the weights are of type float since they apply to the data vector collectively. +Consult the individual function references for more details. + +```python +# NumPy applied element-wise + +x = {'2020-01-01': [1,1], '2020-01-02': [2,2], '2020-01-03': [3,3]} +weights = {'2020-01-01': [1,2], '2020-01-02': [2,1], '2020-01-03': [1,1]} + +sum(x, interval=2, weights=weights) + +``` + +```python +{'2020-01-02': [5,4], '2020-01-03': [7,5]} +``` + +```python +mean(x, interval=2, weights=weights) +``` + +```python +# Weighted mean +{'2020-01-02': [1.667, 1.333], '2020-01-03': [2.333, 2.5]} +``` + +## Numerical stability + +Stats functions are not guaranteed to be numerically stable due to the nature of a rolling window calculation. +These functions implement online algorithms which have increased risk of floating point precision errors, especially when the data is ill-conditioned. +**Users are recommended to apply their own data cleaning** before calling these functions. +Data cleaning may include clipping large, erroneous values to be NaN or normalizing data based on historical ranges. +Cleaning can be implemented using the `csp.apply` node (see baselib documentation) with your cleaning pipeline expressed within a callable object (function). +If numerical stability is paramount, then cross-sectional calculations can be used at the cost of efficiency (see the section below on Cross-Sectional Statistics). + +Where possible, `csp.stats` algorithms are chosen to maximize stability while maintaining their online efficiency. +For example, rolling variance is calculated using Welford's online algorithm and rolling sums are calculated using Kahan's algorithm if `precise=True` is set. +Floating-point error can still accumulate when the functions are used on large data streams, especially if the interval used is small in comparison to the quantity of data. +Each stats method that is prone to floating-point error exposes a **recalc parameter** which is an optional time-series argument to trigger a clean recalculation of the statistic. +The recalculation clears any accumulated floating-point error up to that point. + +### The `recalc` parameter + +The `recalc` parameter is an optional time-series argument designed to stop unbounded floating-point error accumulation in rolling `csp.stats` functions. +When `recalc` ticks, the next calculation of the desired statistic will be computed with all data in the window. +This clears any accumulated error from prior intervals. 
+The parameter is meant to be used heuristically for use cases involving large data streams and small interval sizes, causing values to be continuously added and removed from the window.
+Periodically triggering a recalculation will limit the floating-point error accumulation caused by these updates; for example, a user could set `recalc` to tick every 100 intervals of their data.
+The cost of triggering a recalculation is efficiency: since all data in the window must be processed, it is not as fast as doing the calculation in the standard online fashion.
+
+A basic example using the `recalc` parameter is provided below.
+
+```python
+x = {'2020-01-01': 0.1, '2020-01-02': 0.2, '2020-01-03': 0, '2020-01-04': 0}
+sum(x, interval=2)
+```
+
+```python
+# floating-point error has caused the sum to not perfectly go to zero
+{'2020-01-02': 0.3, '2020-01-03': 0.19999999, '2020-01-04': -0.00000001}
+```
+
+```python
+recalc = {'2020-01-04': True}
+sum(x, interval=2, recalc=recalc)
+```
+
+```python
+# at day 4, a clean recalculation clears the floating-point error from the previous data
+{'2020-01-02': 0.3, '2020-01-03': 0.19999999, '2020-01-04': 0}
+```
diff --git a/docs/wiki/how-tos/Write-Historical-Input-Adapters.md b/docs/wiki/how-tos/Write-Historical-Input-Adapters.md
new file mode 100644
index 00000000..ac5435c5
--- /dev/null
+++ b/docs/wiki/how-tos/Write-Historical-Input-Adapters.md
@@ -0,0 +1,415 @@
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Introduction](#introduction)
+- [Types of Historical Adapters](#types-of-historical-adapters)
+- [PullInputAdapter](#pullinputadapter)
+  - [PullInputAdapter - Python](#pullinputadapter---python)
+  - [PullInputAdapter - C++](#pullinputadapter---c)
+- [AdapterManager and ManagedSimInputAdapter - Python](#adaptermanager-and-managedsiminputadapter---python)
+  - [AdapterManager - **--graph-- time**](#adaptermanager-----graph---time)
+  - [AdapterManager - **--impl-- runtime**](#adaptermanager-----impl---runtime)
+  - [ManagedSimInputAdapter - **--impl-- runtime**](#managedsiminputadapter-----impl---runtime)
+  - [ManagedSimInputAdapter - **--graph-- time**](#managedsiminputadapter-----graph---time)
+  - [Example - CSVReader](#example---csvreader)
+
+## Introduction
+
+There are two main categories of input adapters: historical and realtime.
+
+When writing historical adapters you will need to implement a "pull" adapter, which pulls data from a historical data source in time order, one event at a time.
+
+There are also `ManagedSimInputAdapter`s for feeding multiple "managed" pull adapters from a single source (more on that below).
+
+When writing input adapters it is also very important to denote the difference between "graph building time" and "runtime" versions of your adapter.
+For example, `csp.adapters.csv` has a `CSVReader` class that is used at graph building time.
+
+**Graph build time components** solely *describe* the adapter.
+They are meant to do little else than keep track of the type of adapter and its parameters, which will then be used to construct the actual adapter implementation when the engine is constructed from the graph description.
+It is the runtime implementation that actually runs during the engine execution phase to process data.
+
+For clarity of this distinction, in the descriptions below we will denote graph build time components with *--graph--* and runtime implementations with *--impl--*.
+
+## Types of Historical Adapters
+
+There are two flavors of historical input adapters that can be written.
+The simplest one is a `PullInputAdapter`.
+A `PullInputAdapter` can be used to convert a single source into a single timeseries.
+The `csp.curve` implementation is a good example of this.
+Single-source to single-timeseries adapters are of limited use, however; the more typical use case is for `AdapterManager`-based input adapters to service multiple input adapters from a single source.
+For this, one would use an `AdapterManager` to coordinate processing of the data source, and `ManagedSimInputAdapter`s as the individual timeseries providers.
+
+## PullInputAdapter
+
+### PullInputAdapter - Python
+
+To write a Python based `PullInputAdapter` one must write a class that derives from `csp.impl.pulladapter.PullInputAdapter`.
+The derived type should then define two methods:
+
+- `def start(self, start_time, end_time)`: this will be called at the start of the engine with the start/end times of the engine.
+  `start_time` and `end_time` will be tz-unaware datetime objects in UTC time.
+  At this point the adapter should open its resource and seek to the requested starttime.
+- `def next(self)`: this method will be repeatedly called by the engine.
+  The adapter should return the next event as a (time, value) tuple.
+  If there are no more events, then the method should return `None`.
+
+The `PullInputAdapter` that you define will be used as the runtime *--impl--*.
+You also need to define a *--graph--* time representation of the time series edge.
+In order to do this you should define a `csp.impl.wiring.py_pull_adapter_def`.
+The `py_pull_adapter_def` creates a *--graph--* time representation of your adapter:
+
+```python
+def py_pull_adapter_def(name, adapterimpl, out_type, **kwargs)
+```
+
+- **`name`**: string name for the adapter
+- **`adapterimpl`**: a derived implementation of `csp.impl.pulladapter.PullInputAdapter`
+- **`out_type`**: the type of the output, should be a `ts[]` type. Note this can use tvar types if a subsequent argument defines the tvar
+- **`kwargs`**: these will be passed through as arguments to the `PullInputAdapter` implementation
+
+Note that the \*\*kwargs passed to `py_pull_adapter_def` should be the names and types of the variables, like `arg1=type1, arg2=type2`.
+These are the names of the kwargs that the returned input adapter will take and pass through to the `PullInputAdapter` implementation, and the types expected for the values of those args.
+
+`csp.curve` is a good simple example of this:
+
+```python
+import copy
+from csp.impl.pulladapter import PullInputAdapter
+from csp.impl.wiring import py_pull_adapter_def
+from csp import ts
+from datetime import timedelta
+
+
+class Curve(PullInputAdapter):
+    def __init__(self, typ, data):
+        ''' data should be a list of tuples of (datetime, value) or (timedelta, value)'''
+        self._data = data
+        self._index = 0
+        super().__init__()
+
+    def start(self, start_time, end_time):
+        if isinstance(self._data[0][0], timedelta):
+            self._data = copy.copy(self._data)
+            for idx, data in enumerate(self._data):
+                self._data[idx] = (start_time + data[0], data[1])
+
+        while self._index < len(self._data) and self._data[self._index][0] < start_time:
+            self._index += 1
+
+        super().start(start_time, end_time)
+
+    def next(self):
+        if self._index < len(self._data):
+            time, value = self._data[self._index]
+            if time <= self._end_time:
+                self._index += 1
+                return time, value
+        return None
+
+
+curve = py_pull_adapter_def('curve', Curve, ts['T'], typ='T', data=list)
+```
+
+Now curve can be called in graph code to create a curve input adapter:
+
+```python
+x = csp.curve(int, [(t1, v1), (t2, v2), ...])
+csp.print('x', x)
+```
+
+See example [e_14_user_adapters_01_pullinput.py](https://github.com/Point72/csp/blob/main/examples/4_writing_adapters/e_14_user_adapters_01_pullinput.py) for more details.
+
+### PullInputAdapter - C++
+
+**Step 1)** `PullInputAdapter` impl
+
+The C++ API is similar to the Python `PullInputAdapter` API, and one can leverage it to improve the performance of an adapter implementation.
+The *--impl--* is very similar to the Python pull adapter.
+One should derive from `PullInputAdapter`, a templatized base class (templatized on the type of the timeseries), and define these methods:
+
+- **`start(DateTime start, DateTime end)`**: similar to the Python API `start`, called when the engine starts.
+  Open the resource and seek to the start time here
+- **`stop()`**: called on engine shutdown, clean up the resource
+- **`bool next(DateTime & t, T & value)`**: if there is data to provide, sets the next time and value for the adapter and returns true.
+  Otherwise, return false
+
+**Step 2)** Expose creator func to Python
+
+Now that we have a C++ impl defined, we need to expose a Python creator for it.
+Define a method that conforms to this signature:
+
+```cpp
+csp::InputAdapter * create_my_adapter(
+    csp::AdapterManager * manager,
+    PyEngine * pyengine,
+    PyTypeObject * pyType,
+    PushMode pushMode,
+    PyObject * args)
+```
+
+- **`manager`**: will be nullptr for pull adapters
+- **`pyengine`**: PyEngine engine wrapper object
+- **`pyType`**: this is the type of the timeseries input adapter to be created as a `PyTypeObject`.
+  One can switch on this type using `switchPyType` to create the properly typed instance
+- **`pushMode`**: the CSP PushMode for the adapter (pass through to base InputAdapter)
+- **`args`**: arguments to pass to the adapter impl
+
+Then simply register the creator method:
+
+**`REGISTER_INPUT_ADAPTER(_my_adapter, create_my_adapter)`**
+
+This will register the method onto your Python module, to be accessed as `module.methodname`.
+Note this uses `csp/python/InitHelpers` which is used in the `_cspimpl` module.
+To do this in a separate Python module, you need to register `InitHelpers` in that module.
+
+**Step 3)** Define your *--graph--* time adapter
+
+It is now a one-liner to wrap your impl in a graph time construct using `csp.impl.wiring.input_adapter_def`:
+
+```python
+my_adapter = input_adapter_def('my_adapter', my_module._my_adapter, ts[int], arg1=int, arg2={str:'foo'})
+```
+
+`my_adapter` can now be called with `arg1, arg2` to create adapters in your graph.
+Note that the arguments are typed using `v=t` syntax. `v=(t,default)` is used to define arguments with defaults.
+
+Also note that all input adapters implicitly get a `push_mode` argument that is defaulted to `csp.PushMode.LAST_VALUE`.
+
+## AdapterManager and ManagedSimInputAdapter - Python
+
+In most cases you will likely want to expose a single source of data into multiple input adapters.
+For this use case your adapter should define an AdapterManager *--graph--* time component and an AdapterManagerImpl *--impl--* runtime component.
+The AdapterManager *--graph--* time component just represents the parameters needed to create the *--impl--* AdapterManager.
+It's the *--impl--* that will have the actual implementation that will open the data source, parse the data and provide it to individual adapters.
+
+Similarly you will need to define a derived ManagedSimInputAdapter *--impl--* component to handle events directed at an individual time series adapter.
+
+**NOTE** It is highly recommended not to open any resources in the *--graph--* time component.
+Graph time components can be pruned and/or memoized into a single instance, so opening resources at graph time shouldn't be necessary.
+
+### AdapterManager - **--graph-- time**
+
+The graph time AdapterManager doesn't need to derive from any interface.
+It should be initialized with any information the impl needs in order to open/process the data source (i.e. CSV file, time column, DB connection information, etc.).
+It should also have an API to create individual timeseries adapters.
+These adapters will then get passed the adapter manager *--impl--* as an argument when they are created, so that they can register themselves for processing.
+The AdapterManager also needs to define a **\_create** method.
+**\_create** is the bridge between the *--graph--* time AdapterManager representation and the runtime *--impl--* object.
+**\_create** will be called on the *--graph--* time AdapterManager, which will in turn create the *--impl--* instance.
+\_create will get two arguments: engine (this represents the runtime engine object that will run the graph) and a memo dict which can optionally be used for any memoization that one might want.
+
+Let's take a look at [`CSVReader`](https://github.com/Point72/csp/blob/main/csp/adapters/csv.py) as an example:
+
+```python
+# GRAPH TIME
+class CSVReader:
+    def __init__(self, filename, time_converter, delimiter=',', symbol_column=None):
+        self._filename = filename
+        self._symbol_column = symbol_column
+        self._delimiter = delimiter
+        self._time_converter = time_converter
+
+    def subscribe(self, symbol, typ, field_map=None):
+        return CSVReadAdapter(self, symbol, typ, field_map)
+
+    def _create(self, engine, memo):
+        return CSVReaderImpl(engine, self)
+```
+
+- **`__init__`**: as you can see, all `__init__` does is keep the parameters that the impl will need.
+- **`subscribe`**: API to create an individual timeseries / edge from this file for the given symbol.
+  `typ` denotes the type of the timeseries to create (i.e. `ts[int]`) and `field_map` is used for mapping columns onto `csp.Struct` types.
+  Note that `subscribe` returns a `CSVReadAdapter` instance.
+  `CSVReadAdapter` is the *--graph--* time representation of the edge (similar to how we defined `csp.curve` above).
+  We pass it `self` as its first argument, which will be used to create the AdapterManager *--impl--*.
+- **`_create`**: the method to create the *--impl--* object from the given *--graph--* time representation of the manager
+
+The `CSVReader` would then be used in graph building code like so:
+
+```python
+reader = CSVReader('my_data.csv', time_formatter, symbol_column='SYMBOL', delimiter='|')
+# aapl will represent a ts[PriceQuantity] edge that will tick with rows from
+# the csv file matching on SYMBOL column AAPL
+aapl = reader.subscribe('AAPL', PriceQuantity)
+```
+
+### AdapterManager - **--impl-- runtime**
+
+The AdapterManager *--impl--* is responsible for opening the data source, parsing and processing through all the data, and managing all the adapters it needs to feed.
+The impl class should derive from `csp.impl.adaptermanager.AdapterManagerImpl` and implement the following methods:
+
+- **`start(self, starttime, endtime)`**: this is called when the engine starts up.
+  At this point the impl should open the resource providing the data and seek to starttime.
+  starttime/endtime will be tz-unaware datetime objects in UTC time.
+- **`stop(self)`**: this is called at the end of the run; resources should be cleaned up at this point.
+- **`process_next_sim_timeslice(self, now)`**: this method will be called multiple times through the run.
+  The initial call will provide `now` equal to starttime.
+  The impl's responsibility is to process all data at the given timestamp (more on how to do this below).
+  The method should return the next time in the data source, or None if there is no more data to process.
+  The method will be called again with the provided timestamp as "now" in the next iteration.
+  **NOTE** that `process_next_sim_timeslice` is required to move ahead in time.
+  In most cases the resource data can be supplied in time order; if not, it would have to be sorted up front.
+
+`process_next_sim_timeslice` should parse data for a given time/row of data and then push it through to any registered `ManagedSimInputAdapter` that matches on the given row.
+
+### ManagedSimInputAdapter - **--impl-- runtime**
+
+Users will need to define `ManagedSimInputAdapter` derived types to represent the individual timeseries adapter *--impl--* objects.
+Objects should derive from `csp.impl.adaptermanager.ManagedSimInputAdapter`.
+
+`ManagedSimInputAdapter.__init__` takes two arguments:
+
+- **`typ`**: this is the type of the timeseries, i.e. int for a `ts[int]`
+- **`field_map`**: Optional; a dictionary used to map source column names → `csp.Struct` field names.
+
+`ManagedSimInputAdapter` defines a method `push_tick()` which takes the value to feed the input for a given timeslice (as defined by "now" at the adapter manager level).
+There is also a convenience method called `process_dict()` which will take a dictionary of `{column : value}` entries and convert it properly into the right value based on the given **field_map**.
+
+### ManagedSimInputAdapter - **--graph-- time**
+
+As with the `csp.curve` example, we need to define a graph-time construct that represents a `ManagedSimInputAdapter` edge.
+In order to define this we use `py_managed_adapter_def`.
+`py_managed_adapter_def` is AdapterManager-"aware" and will properly create the AdapterManager *--impl--* the first time it's encountered.
+It will then pass the manager impl as an argument to the `ManagedSimInputAdapter`.
+
+```python
+def py_managed_adapter_def(name, adapterimpl, out_type, manager_type, **kwargs):
+    """
+    Create a graph representation of a python managed sim input adapter.
+
+    :param name: string name for the adapter
+    :param adapterimpl: a derived implementation of csp.impl.adaptermanager.ManagedSimInputAdapter
+    :param out_type: the type of the output, should be a ts[] type. Note this can use tvar types if a subsequent argument defines the tvar
+    :param manager_type: the type of the graph time representation of the AdapterManager that will manage this adapter
+    :param kwargs: **kwargs will be passed through as arguments to the ManagedSimInputAdapter implementation;
+        the first argument to the implementation will be the adapter manager impl instance
+    """
+```
+
+### Example - CSVReader
+
+Putting this all together, let's take a look at a `CSVReader` implementation and step through what's going on:
+
+```python
+import csv as pycsv
+from datetime import datetime
+
+from csp import ts
+from csp.impl.adaptermanager import AdapterManagerImpl, ManagedSimInputAdapter
+from csp.impl.wiring import py_managed_adapter_def
+
+# GRAPH TIME
+class CSVReader:
+    def __init__(self, filename, time_converter, delimiter=',', symbol_column=None):
+        self._filename = filename
+        self._symbol_column = symbol_column
+        self._delimiter = delimiter
+        self._time_converter = time_converter
+
+    def subscribe(self, symbol, typ, field_map=None):
+        return CSVReadAdapter(self, symbol, typ, field_map)
+
+    def _create(self, engine, memo):
+        return CSVReaderImpl(engine, self)
+```
+
+Here we define `CSVReader`, our AdapterManager *--graph--* time representation.
+It holds the parameters that will be used for the impl, implements a `subscribe()` call for users to create timeseries, and defines a `_create` method to create a runtime *--impl--* instance from the graph time representation.
+Note how in `subscribe()` we pass `self` to the `CSVReadAdapter`; this is what binds the input adapter to this AdapterManager.
+
+```python
+# RUN TIME
+class CSVReaderImpl(AdapterManagerImpl):                      # 1
+    def __init__(self, engine, adapterRep):                   # 2
+        super().__init__(engine)                              # 3
+                                                              # 4
+        self._rep = adapterRep                                # 5
+        self._inputs = {}                                     # 6
+        self._csv_reader = None                               # 7
+        self._next_row = None                                 # 8
+                                                              # 9
+    def start(self, starttime, endtime):                      # 10
+        self._csv_reader = pycsv.DictReader(                  # 11
+            open(self._rep._filename, 'r'),                   # 12
+            delimiter=self._rep._delimiter                    # 13
+        )                                                     # 14
+        self._next_row = None                                 # 15
+                                                              # 16
+        for row in self._csv_reader:                          # 17
+            time = self._rep._time_converter(row)             # 18
+            self._next_row = row                              # 19
+            if time >= starttime:                             # 20
+                break                                         # 21
+                                                              # 22
+    def stop(self):                                           # 23
+        self._csv_reader = None                               # 24
+                                                              # 25
+    def register_input_adapter(self, symbol, adapter):        # 26
+        if symbol not in self._inputs:                        # 27
+            self._inputs[symbol] = []                         # 28
+        self._inputs[symbol].append(adapter)                  # 29
+                                                              # 30
+    def process_next_sim_timeslice(self, now):                # 31
+        if not self._next_row:                                # 32
+            return None                                       # 33
+                                                              # 34
+        while True:                                           # 35
+            time = self._rep._time_converter(self._next_row)  # 36
+            if time > now:                                    # 37
+                return time                                   # 38
+            self.process_row(self._next_row)                  # 39
+            try:                                              # 40
+                self._next_row = next(self._csv_reader)       # 41
+            except StopIteration:                             # 42
+                return None                                   # 43
+                                                              # 44
+    def process_row(self, row):                               # 45
+        symbol = row[self._rep._symbol_column]                # 46
+        if symbol in self._inputs:                            # 47
+            for input in self._inputs.get(symbol, []):        # 48
+                input.process_dict(row)                       # 49
+```
+
+`CSVReaderImpl` is the runtime *--impl--*.
+It gets created when the engine is being built from the described graph.
+
+- **lines 10-21 - start()**: this is the start method that gets called with the time range the graph will be run against.
+  Here we open our resource (`pycsv.DictReader`) and scan through the data until we reach the requested starttime.
+
+- **lines 23-24 - stop()**: this is the stop call that gets called when the engine is done running and is shut down; we free our resource here.
+
+- **lines 26-29**: the `CSVReader` allows one to subscribe to many symbols from one file.
+  Symbols are keyed by a provided `SYMBOL` column.
+  The individual adapters will self-register with the `CSVReaderImpl` when they are created with the requested symbol.
+  `CSVReaderImpl` keeps track of what adapters have been registered for what symbol in its `self._inputs` map.
+
+- **lines 31-43**: this is the main method that gets invoked repeatedly throughout the run.
+  For every distinct timestamp in the file, this method will get invoked once; the method is expected to go through the resource data for all points with time "now", process the rows and push the data to any matching adapters.
+  The method returns the next timestamp when it's done processing all data for "now", or None if there is no more data.
+  **NOTE** that the csv impl expects the data to be in time order.
+  `process_next_sim_timeslice` must advance time forward.
+
+- **lines 45-49**: this method takes a row of data (provided as a dict from `DictReader`), extracts the symbol and pushes the row through to all input adapters that match.
+
+```python
+class CSVReadAdapterImpl(ManagedSimInputAdapter):             # 1
+    def __init__(self, managerImpl, symbol, typ, field_map):  # 2
+        managerImpl.register_input_adapter(symbol, self)      # 3
+        super().__init__(typ, field_map)                      # 4
+                                                              # 5
+CSVReadAdapter = py_managed_adapter_def(                      # 6
+    'csvadapter',
+    CSVReadAdapterImpl,
+    ts['T'],
+    CSVReader,
+    symbol=str,
+    typ='T',
+    field_map=(object, None)
+)
+```
+
+- **line 3**: this is where the instance of an adapter *--impl--* registers itself with the `CSVReaderImpl`.
+- **line 6+**: this is where we define `CSVReadAdapter`, the *--graph--* time representation of a CSV adapter, returned from `CSVReader.subscribe`.
+
+See example [e_14_user_adapters_02_adaptermanager_siminput.py](https://github.com/Point72/csp/blob/main/examples/4_writing_adapters/e_14_user_adapters_02_adaptermanager_siminput.py) for another example of how to write a managed sim adapter manager.
diff --git a/docs/wiki/how-tos/Write-Output-Adapters.md b/docs/wiki/how-tos/Write-Output-Adapters.md
new file mode 100644
index 00000000..8fb92d0f
--- /dev/null
+++ b/docs/wiki/how-tos/Write-Output-Adapters.md
@@ -0,0 +1,317 @@
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Output Adapters](#output-adapters)
+  - [OutputAdapter - Python](#outputadapter---python)
+  - [OutputAdapter - C++](#outputadapter---c)
+  - [OutputAdapter with Manager](#outputadapter-with-manager)
+  - [InputOutputAdapter - Python](#inputoutputadapter---python)
+
+## Output Adapters
+
+Output adapters are used to define graph outputs, and they differ from input adapters in a number of important ways.
+Output adapters also differ from terminal nodes, i.e. regular `csp.node` instances that do not define outputs and instead consume their inputs inside their `csp.ticked` blocks.
+
+For many use cases, it will be sufficient to omit writing an output adapter entirely.
+Consider the following example of a terminal node that writes an input dictionary timeseries to a file.
+
+```python
+import json
+
+import csp
+from csp import ts
+
+
+@csp.node
+def write_to_file(x: ts[dict], filename: str):
+    if csp.ticked(x):
+        with open(filename, "a") as fp:
+            fp.write(json.dumps(x))
+```
+
+This is a perfectly fine node, and serves its purpose.
+Unlike input adapters, output adapters do not need to differentiate between *historical* and *realtime* mode.
+Input adapters drive the execution of the graph, whereas output adapters are reactive to their input nodes and subject to the graph's execution.
+
+However, there are a number of reasons why you might want to define an output adapter instead of using a vanilla node.
+The most important of these is when you want to share resources across a number of output adapters (e.g. with a Manager), or between an input and an output node, e.g. reading data from a websocket, routing it through your CSP graph, and publishing data *to the same websocket connection*.
+For most use cases, a vanilla CSP node will suffice, but let's explore some output adapters anyway.
+
+### OutputAdapter - Python
+
+To write a Python based OutputAdapter one must write a class that derives from `csp.impl.outputadapter.OutputAdapter`.
+The derived type should define the method:
+
+- `def on_tick(self, time: datetime, value: object)`: this will be called when the input to the output adapter ticks.
+
+The OutputAdapter that you define will be used as the runtime *--impl--*.
+You also need to define a *--graph--* time representation of the time series edge.
+In order to do this you should define a `csp.impl.wiring.py_output_adapter_def`.
+The `py_output_adapter_def` creates a *--graph--* time representation of your adapter:
+
+**def py_output_adapter_def(name, adapterimpl, \*\*kwargs)**
+
+- **`name`**: string name for the adapter
+- **`adapterimpl`**: a derived implementation of `csp.impl.outputadapter.OutputAdapter`
+- **`kwargs`**: \*\*kwargs here will be passed through as arguments to the OutputAdapter implementation
+
+Note that the `**kwargs` passed to `py_output_adapter_def` should be the names and types of the variables, like `arg1=type1, arg2=type2`.
+These are the names of the kwargs that the returned output adapter will take and pass through to the OutputAdapter implementation, and the types expected for the values of those args.
+
+Here is a simple example of the same filewriter from above:
+
+```python
+from csp.impl.outputadapter import OutputAdapter
+from csp.impl.wiring import py_output_adapter_def
+from csp import ts
+import csp
+from json import dumps
+from datetime import datetime, timedelta
+
+
+class MyFileWriterAdapterImpl(OutputAdapter):
+    def __init__(self, filename: str):
+        super().__init__()
+        self._filename = filename
+
+    def start(self):
+        self._fp = open(self._filename, "a")
+
+    def stop(self):
+        self._fp.close()
+
+    def on_tick(self, time, value):
+        self._fp.write(dumps(value) + "\n")
+
+
+MyFileWriterAdapter = py_output_adapter_def(
+    name='MyFileWriterAdapter',
+    adapterimpl=MyFileWriterAdapterImpl,
+    input=ts['T'],
+    filename=str,
+)
+```
+
+Now our adapter can be called in graph code:
+
+```python
+@csp.graph
+def my_graph():
+    curve = csp.curve(
+        data=[
+            (timedelta(seconds=0), {"a": 1, "b": 2, "c": 3}),
+            (timedelta(seconds=1), {"a": 1, "b": 2, "c": 3}),
+            (timedelta(seconds=1), {"a": 1, "b": 2, "c": 3}),
+        ],
+        typ=object,
+    )
+
+    MyFileWriterAdapter(curve, filename="testfile.jsonl")
+```
+
+As explained above, we could also do this via a single node (this is probably the best of the three versions):
+
+```python
+import json
+
+
+@csp.node
+def dump_json(data: ts['T'], filename: str):
+    with csp.state():
+        s_file = None
+    with csp.start():
+        s_file = open(filename, "w")
+    with csp.stop():
+        s_file.close()
+    if csp.ticked(data):
+        s_file.write(json.dumps(data) + "\n")
+        s_file.flush()
+```
+
+### OutputAdapter - C++
+
+TODO
+
+### OutputAdapter with Manager
+
+Adapter managers function the same way for output adapters as for input adapters, i.e. to manage a single shared resource from the manager across a variety of discrete output adapters.
+
+### InputOutputAdapter - Python
+
+As a last example, let's tie everything together and implement a managed push input adapter combined with a managed output adapter.
+This example is available in [e_14_user_adapters_06_adaptermanager_inputoutput.py](https://github.com/Point72/csp/blob/main/examples/4_writing_adapters/e_14_user_adapters_06_adaptermanager_inputoutput.py).
+
+First, we will define our adapter manager.
+In this example, we're going to cheat a little bit and combine our adapter manager (graph time) and our adapter manager impl (run time).
+
+```python
+class MyAdapterManager(AdapterManagerImpl):
+    '''
+    This example adapter will generate random `MyData` structs every `interval`. This simulates an upstream
+    data feed, which we "connect" to only a single time. We then multiplex the results to an arbitrary
+    number of subscribers via the `subscribe` method.
+
+    We can also receive messages via the `publish` method from an arbitrary number of publishers. These messages
+    are demultiplexed into a number of outputs, simulating sharing a connection to a downstream feed or responses
+    to the upstream feed.
+    '''
+    def __init__(self, interval: timedelta):
+        self._interval = interval
+        self._counter = 0
+        self._subscriptions = {}
+        self._publications = {}
+        self._running = False
+        self._thread = None
+
+    def subscribe(self, symbol):
+        '''This method creates a new input adapter implementation via the manager.'''
+        return _my_input_adapter(self, symbol, push_mode=csp.PushMode.NON_COLLAPSING)
+
+    def publish(self, data: ts['T'], symbol: str):
+        '''This method creates a new output adapter implementation via the manager.'''
+        return _my_output_adapter(self, data, symbol)
+
+    def _create(self, engine, memo):
+        # We'll avoid having a second class and make our AdapterManager and AdapterManagerImpl the same
+        super().__init__(engine)
+        return self
+
+    def start(self, starttime, endtime):
+        self._running = True
+        self._thread = threading.Thread(target=self._run)
+        self._thread.start()
+
+    def stop(self):
+        if self._running:
+            self._running = False
+            self._thread.join()
+
+        # print closing of the resources
+        for name in self._publications.values():
+            print("closing asset {}".format(name))
+
+    def register_subscription(self, symbol, adapter):
+        if symbol not in self._subscriptions:
+            self._subscriptions[symbol] = []
+        self._subscriptions[symbol].append(adapter)
+
+    def register_publication(self, symbol):
+        if symbol not in self._publications:
+            self._publications[symbol] = "publication_{}".format(symbol)
+
+    def _run(self):
+        '''This method runs in a background thread and generates random input events to push to the corresponding adapter'''
+        symbols = list(self._subscriptions.keys())
+        while self._running:
+            # Let's pick a random symbol from the requested symbols
+            symbol = symbols[random.randint(0, len(symbols) - 1)]
+
+            data = MyData(symbol=symbol, value=self._counter)
+
+            self._counter += 1
+
+            for adapter in self._subscriptions[symbol]:
+                # push to all the subscribers
+                adapter.push_tick(data)
+
+            time.sleep(self._interval.total_seconds())
+
+    def _on_tick(self, symbol, value):
+        '''This method just writes the data to the appropriate outbound "channel"'''
+        print("{}:{}".format(self._publications[symbol], value))
+```
+
+This adapter manager is a bit of a silly example, but it demonstrates the core concepts.
+The adapter manager will demultiplex a shared stream (in this case, the stream defined in `_run` is a random sequence of `MyData` structs) between all the input adapters it manages.
+The input adapter itself will do nothing more than let the adapter manager know that it exists:
+
+```python
+class MyInputAdapterImpl(PushInputAdapter):
+    '''Our input adapter is a very simple implementation, and just
+    defers its work back to the manager who is expected to deal with
+    sharing a single connection.
+    '''
+    def __init__(self, manager, symbol):
+        manager.register_subscription(symbol, self)
+        super().__init__()
+```
+
+Similarly, the adapter manager will multiplex the output adapter streams, in this case combining them into streams of print statements.
+And similar to the input adapter, the output adapter does little more than let the adapter manager know that it has work available, using its triggered `on_tick` method to call the adapter manager's `_on_tick` method.
+
+```python
+class MyOutputAdapterImpl(OutputAdapter):
+    '''Similarly, our output adapter is simple as well, deferring
+    its functionality to the manager
+    '''
+    def __init__(self, manager, symbol):
+        manager.register_publication(symbol)
+        self._manager = manager
+        self._symbol = symbol
+        super().__init__()
+
+    def on_tick(self, time, value):
+        self._manager._on_tick(self._symbol, value)
+```
+
+As a last step, we need to ensure that the runtime adapter implementations are registered with our graph:
+
+```python
+_my_input_adapter = py_push_adapter_def(name='MyInputAdapter', adapterimpl=MyInputAdapterImpl, out_type=ts[MyData], manager_type=MyAdapterManager, symbol=str)
+_my_output_adapter = py_output_adapter_def(name='MyOutputAdapter', adapterimpl=MyOutputAdapterImpl, manager_type=MyAdapterManager, input=ts['T'], symbol=str)
+```
+
+To test this example, we will:
+
+- instantiate our manager
+- subscribe to a certain number of input adapter "streams" (which the adapter manager will demultiplex out of a single random node)
+- print the data
+- sink each stream into a smaller number of output adapters (which the adapter manager will multiplex into print statements)
+
+```python
+@csp.graph
+def my_graph():
+    adapter_manager = MyAdapterManager(timedelta(seconds=0.75))
+
+    data_1 = adapter_manager.subscribe("data_1")
+    data_2 = adapter_manager.subscribe("data_2")
+    data_3 = adapter_manager.subscribe("data_3")
+
+    csp.print("data_1", data_1)
+    csp.print("data_2", data_2)
+    csp.print("data_3", data_3)
+
+    # pump two streams into 1 output and 1 stream into another
+    adapter_manager.publish(data_1, "data_1")
+    adapter_manager.publish(data_2, "data_1")
+    adapter_manager.publish(data_3, "data_3")
+```
+
+Here is the result of a single run:
+
+```
+2023-02-15 19:14:53.859951 data_1:MyData(symbol=data_1, value=0)
+publication_data_1:MyData(symbol=data_1, value=0)
+2023-02-15 19:14:54.610281 data_3:MyData(symbol=data_3, value=1)
+publication_data_3:MyData(symbol=data_3, value=1)
+2023-02-15 19:14:55.361157 data_3:MyData(symbol=data_3, value=2)
+publication_data_3:MyData(symbol=data_3, value=2)
+2023-02-15 19:14:56.112030 data_2:MyData(symbol=data_2, value=3)
+publication_data_1:MyData(symbol=data_2, value=3)
+2023-02-15 19:14:56.862881 data_2:MyData(symbol=data_2, value=4)
+publication_data_1:MyData(symbol=data_2, value=4)
+2023-02-15 19:14:57.613775 data_1:MyData(symbol=data_1, value=5)
+publication_data_1:MyData(symbol=data_1, value=5)
+2023-02-15 19:14:58.364408 data_3:MyData(symbol=data_3, value=6)
+publication_data_3:MyData(symbol=data_3, value=6)
+2023-02-15 19:14:59.115290 data_2:MyData(symbol=data_2, value=7)
+publication_data_1:MyData(symbol=data_2, value=7)
+2023-02-15 19:14:59.866160 data_2:MyData(symbol=data_2, value=8)
+publication_data_1:MyData(symbol=data_2, value=8)
+2023-02-15 19:15:00.617068 data_1:MyData(symbol=data_1, value=9)
+publication_data_1:MyData(symbol=data_1, value=9)
+2023-02-15 19:15:01.367955 data_2:MyData(symbol=data_2, value=10)
+publication_data_1:MyData(symbol=data_2, value=10)
+2023-02-15 19:15:02.118259 data_3:MyData(symbol=data_3, value=11)
+publication_data_3:MyData(symbol=data_3, value=11)
+2023-02-15 19:15:02.869170 data_2:MyData(symbol=data_2, value=12)
+publication_data_1:MyData(symbol=data_2, value=12)
+2023-02-15 19:15:03.620047 data_1:MyData(symbol=data_1, value=13)
+publication_data_1:MyData(symbol=data_1, value=13)
+closing asset publication_data_1
+closing asset publication_data_3
+```
+
+Although simple, this example demonstrates the utility of adapters and adapter managers.
+An input resource is managed by one entity and distributed across a variety of downstream subscribers.
+Then a collection of streams is piped back into a single entity.
diff --git a/docs/wiki/how-tos/Write-Realtime-Input-Adapters.md b/docs/wiki/how-tos/Write-Realtime-Input-Adapters.md
new file mode 100644
index 00000000..10eedc4c
--- /dev/null
+++ b/docs/wiki/how-tos/Write-Realtime-Input-Adapters.md
@@ -0,0 +1,407 @@
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Introduction](#introduction)
+- [PushInputAdapter - Python](#pushinputadapter---python)
+- [GenericPushAdapter](#genericpushadapter)
+- [Realtime AdapterManager](#realtime-adaptermanager)
+  - [AdapterManager - **graph-- time**](#adaptermanager---graph---time)
+  - [AdapterManager - **impl-- runtime**](#adaptermanager---impl---runtime)
+  - [PushInputAdapter - **--impl-- runtime**](#pushinputadapter-----impl---runtime)
+  - [PushInputAdapter - **--graph-- time**](#pushinputadapter----graph---time)
+  - [Example](#example)
+
+## Introduction
+
+There are two main categories of input adapters: historical and realtime.
+
+When writing realtime adapters, you will need to implement a "push" adapter, which will get data from a separate thread that drives external events and "pushes" them into the engine as they occur.
+
+When writing input adapters it is also very important to denote the difference between "graph building time" and "runtime" versions of your adapter.
+For example, `csp.adapters.csv` has a `CSVReader` class that is used at graph building time.
+**Graph build time components** solely *describe* the adapter.
+They are meant to do little else than keep track of the type of adapter and its parameters, which will then be used to construct the actual adapter implementation when the engine is constructed from the graph description.
+It is the runtime implementation that actually runs during the engine execution phase to process data.
+
+For clarity of this distinction, in the descriptions below we will denote graph build time components with *--graph--* and runtime implementations with *--impl--*.
+
+## PushInputAdapter - Python
+
+To write a Python based `PushInputAdapter` one must write a class that derives from `csp.impl.pushadapter.PushInputAdapter`.
+The derived type should define two methods:
+
+- `def start(self, start_time, end_time)`: this will be called at the start of the engine with the start/end times of the engine.
+  start_time and end_time will be tz-unaware datetime objects in UTC time (generally these aren't needed for realtime adapters).
+  At this point the adapter should open its resource / connect the data source / start any driver threads that are needed.
+- `def stop(self)`: This method will be called when the engine is done running.
+  At this point any open threads should be stopped and resources cleaned up.
+
+The `PushInputAdapter` that you define will be used as the runtime *--impl--*.
+You also need to define a *--graph--* time representation of the time series edge.
+In order to do this you should define a `csp.impl.wiring.py_push_adapter_def`.
+The `py_push_adapter_def` creates a *--graph--* time representation of your adapter:
+
+**def py_push_adapter_def(name, adapterimpl, out_type, \*\*kwargs)**
+
+- **`name`**: string name for the adapter
+- **`adapterimpl`**: a derived implementation of
+  `csp.impl.pushadapter.PushInputAdapter`
+- **`out_type`**: the type of the output, should be a `ts[]` type.
+  Note this can use tvar types if a subsequent argument defines the
+  tvar.
+- **`kwargs`**: \*\*kwargs here will be passed through as arguments to the
+  PushInputAdapter implementation
+
+Note that the \*\*kwargs passed to `py_push_adapter_def` should be the names and types of the variables, like `arg1=type1, arg2=type2`.
+These are the names of the kwargs that the returned input adapter will take and pass through to the `PushInputAdapter` implementation, and the types expected for the values of those args.
+
+Example [e_14_user_adapters_03_pushinput.py](https://github.com/Point72/csp/blob/main/examples/4_writing_adapters/e_14_user_adapters_03_pushinput.py) demonstrates a simple use of this.
+
+```python
+from csp.impl.pushadapter import PushInputAdapter
+from csp.impl.wiring import py_push_adapter_def
+import csp
+from csp import ts
+from datetime import datetime, timedelta
+import threading
+import time
+
+
+# The Impl object is created at runtime when the graph is converted into the runtime engine
+# it does not exist at graph building time!
+class MyPushAdapterImpl(PushInputAdapter):
+    def __init__(self, interval):
+        print("MyPushAdapterImpl::__init__")
+        self._interval = interval
+        self._thread = None
+        self._running = False
+
+    def start(self, starttime, endtime):
+        """ start will get called at the start of the engine, at which point the push
+        input adapter should start its thread that will push the data onto the adapter. Note
+        that push adapters will ALWAYS have a separate thread driving ticks into the csp engine thread
+        """
+        print("MyPushAdapterImpl::start")
+        self._running = True
+        self._thread = threading.Thread(target=self._run)
+        self._thread.start()
+
+    def stop(self):
+        """ stop will get called at the end of the run, at which point resources should
+        be cleaned up
+        """
+        print("MyPushAdapterImpl::stop")
+        if self._running:
+            self._running = False
+            self._thread.join()
+
+    def _run(self):
+        counter = 0
+        while self._running:
+            self.push_tick(counter)
+            counter += 1
+            time.sleep(self._interval.total_seconds())
+
+
+# MyPushAdapter is the graph-building time construct. This is simply a representation of what the
+# input adapter is and how to create it, including the Impl to create and arguments to pass into it
+MyPushAdapter = py_push_adapter_def('MyPushAdapter', MyPushAdapterImpl, ts[int], interval=timedelta)
+```
+
+Note how line 41 calls **self.push_tick**.
+This is the call to get data from the adapter thread ticking into the CSP engine.
+
+Now `MyPushAdapter` can be called in graph code to create a timeseries that is sourced by `MyPushAdapterImpl`:
+
+```python
+@csp.graph
+def my_graph():
+    # At this point we create the graph-time representation of the input adapter. This will be converted
+    # into the impl once the graph is done constructing and the engine is created in order to run
+    data = MyPushAdapter(timedelta(seconds=1))
+    csp.print('data', data)
+```
+
+## GenericPushAdapter
+
+If you don't need as much control as `PushInputAdapter` provides, or if you have some existing source of data on a thread you can't control, another option is to use the higher-level abstraction `csp.GenericPushAdapter`.
+`csp.GenericPushAdapter` wraps a `csp.PushInputAdapter` implementation internally and provides a simplified interface.
+The downside of `csp.GenericPushAdapter` is that you lose some control of when the input feed starts and stops.
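+
+The adapter's timeseries edge is retrieved with its `out()` method, and a driver thread created elsewhere calls `push_tick` to feed it.
+As a minimal wiring sketch (a hypothetical graph for illustration; the driving thread is elided here and shown in the full example below):
+
+```python
+import csp
+
+
+@csp.graph
+def wiring_sketch():
+    # Construct the adapter with the type of the timeseries it will carry
+    adapter = csp.GenericPushAdapter(int)
+    # out() returns the ts[int] edge, usable like any other graph input;
+    # the external thread would call adapter.push_tick(value) to tick it
+    csp.print('data', adapter.out())
+```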
+
+Let's take a look at the example found in [e_14_generic_push_adapter.py](https://github.com/Point72/csp/blob/main/examples/4_writing_adapters/e_14_generic_push_adapter.py):
+
+```python
+# This is an example of some separate thread providing data
+class Driver:
+    def __init__(self, adapter : csp.GenericPushAdapter):
+        self._adapter = adapter
+        self._active = False
+        self._thread = None
+
+    def start(self):
+        self._active = True
+        self._thread = threading.Thread(target=self._run)
+        self._thread.start()
+
+    def stop(self):
+        if self._active:
+            self._active = False
+            self._thread.join()
+
+    def _run(self):
+        print("driver thread started")
+        counter = 0
+        # Optionally, we can wait for the adapter to start before proceeding
+        # Alternatively we can start pushing data, but push_tick may fail and return False if
+        # the csp engine isn't ready yet
+        self._adapter.wait_for_start()
+
+        while self._active and not self._adapter.stopped():
+            self._adapter.push_tick(counter)
+            counter += 1
+            time.sleep(1)
+
+@csp.graph
+def my_graph():
+    adapter = csp.GenericPushAdapter(int)
+    driver = Driver(adapter)
+    # Note that the driver thread starts *before* the engine is started here, which means some ticks may potentially get dropped if the
+    # data source doesn't wait for the adapter to start. This may be ok for some feeds, but not others
+    driver.start()
+
+    # Let's be nice and shut down the driver thread when the engine is done
+    csp.schedule_on_engine_stop(driver.stop)
+```
+
+In this example we have a dummy `Driver` class which simply represents some external source of data arriving on a thread that's completely independent of the engine.
+We pass along a `csp.GenericPushAdapter` instance to this thread, which can then call `adapter.push_tick` to get data into the engine (see line 27).
+
+On line 24 we can also see an optional feature which allows the unrelated thread to wait for the adapter to be ready to accept data before ticking data onto it.
+If `push_tick` is called before the engine starts / the adapter is ready to receive data, it will simply drop the data.
+Note that `GenericPushAdapter.push_tick` will return a bool to indicate whether the data was successfully pushed to the engine or not.
+
+## Realtime `AdapterManager`
+
+In most cases you will likely want to expose a single source of data into multiple input adapters.
+For this use case your adapter should define an `AdapterManager` *--graph--* time component and an `AdapterManagerImpl` *--impl--* runtime component.
+The `AdapterManager` *--graph--* time component just represents the parameters needed to create the *--impl--* `AdapterManager`.
+It's the *--impl--* that will have the actual implementation that will open the data source, parse the data and provide it to individual Adapters.
+
+Similarly you will need to define a derived `PushInputAdapter` *--impl--* component to handle events directed at an individual time series adapter.
+
+**NOTE** It is highly recommended not to open any resources in the *--graph--* time component.
+Graph time components can be pruned and/or memoized into a single instance, so opening resources at graph time shouldn't be necessary.
+
+### AdapterManager - **graph-- time**
+
+The graph time `AdapterManager` doesn't need to derive from any interface.
+It should be initialized with any information the impl needs in order to open/process the data source (e.g. ActiveMQ connection information, server host/port, multicast channels, config files, etc.).
+It should also have an API to create individual timeseries adapters.
+These adapters will then get passed the adapter manager *--impl--* as an argument when they are created, so that they can register themselves for processing.
+The `AdapterManager` also needs to define a **\_create** method.
+**\_create** is the bridge between the *--graph--* time `AdapterManager` representation and the runtime *--impl--* object.
+**\_create** will be called on the *--graph--* time `AdapterManager`, which will in turn create the *--impl--* instance.
+\_create will get two arguments: engine (this represents the runtime engine object that will run the graph) and a memo dict, which can optionally be used for any memoization one might want.
+
+Let's take a look at the example found in [e_14_user_adapters_04_adaptermanager_pushinput.py](https://github.com/Point72/csp/blob/main/examples/4_writing_adapters/e_14_user_adapters_04_adaptermanager_pushinput.py):
+
+```python
+# This object represents our AdapterManager at graph time. It describes the manager's properties
+# and will be used to create the actual impl when it's time to build the engine
+class MyAdapterManager:
+    def __init__(self, interval: timedelta):
+        """
+        Normally one would pass properties of the manager here, ie filename,
+        message bus, etc
+        """
+        self._interval = interval
+
+    def subscribe(self, symbol, push_mode=csp.PushMode.NON_COLLAPSING):
+        """ User facing API to subscribe to a timeseries stream from this adapter manager """
+        # This will return a graph-time timeseries edge representing an edge from this
+        # adapter manager for the given symbol / arguments
+        return MyPushAdapter(self, symbol, push_mode=push_mode)
+
+    def _create(self, engine, memo):
+        """ This method will get called at engine build time, at which point the graph time manager representation
+        will create the actual impl that will be used for runtime
+        """
+        # Normally you would pass the arguments down into the impl here
+        return MyAdapterManagerImpl(engine, self._interval)
+```
+
+- **\_\_init\_\_** - as you can see, all \_\_init\_\_ does is keep the parameters that the impl will need.
+- **subscribe** - API to create an individual timeseries / edge from this manager for the given symbol.
+  The interface defined here is up to the adapter writer, but generally "subscribe" is recommended, and it should take any number of arguments needed to define a single stream of data.
+  *MyPushAdapter* is the *--graph--* time representation of the edge, which will be described below.
+  We pass it *self* as its first argument, which will be used to create the `AdapterManager` *--impl--*.
+- **\_create** - the method to create the *--impl--* object from the given *--graph--* time representation of the manager.
+
+`MyAdapterManager` would then be used in graph building code like so:
+
+```python
+adapter_manager = MyAdapterManager(timedelta(seconds=0.75))
+data = adapter_manager.subscribe('AAPL', push_mode=csp.PushMode.LAST_VALUE)
+csp.print('AAPL last_value', data)
+```
+
+### AdapterManager - **impl-- runtime**
+
+The `AdapterManager` *--impl--* is responsible for opening the data source, parsing and processing all the data and managing all the adapters it needs to feed.
+The impl class should derive from `csp.impl.adaptermanager.AdapterManagerImpl` and implement the following methods:
+
+- **`start(self, starttime, endtime)`**: this is called when the engine starts up.
+  At this point the impl should open the resource providing the data and start up any thread(s) needed to listen to and react to external data.
+  starttime/endtime will be tz-unaware datetime objects in UTC time, though typically these aren't needed for realtime adapters.
+- **`stop(self)`**: this is called at the end of the run; resources should be cleaned up at this point.
+- **`process_next_sim_timeslice(self, now)`**: this is used by sim adapters; for realtime adapter managers we simply return None.
+
+In the example manager, we spawn a processing thread in the `start()` call.
+This thread runs in a loop until it is shut down, and will generate random data to tick out to the registered input adapters.
+Data is passed to a given adapter by calling `push_tick()`.
+
+### PushInputAdapter - **--impl-- runtime**
+
+Users will need to define `PushInputAdapter` derived types to represent the individual timeseries adapter *--impl--* objects.
+Objects should derive from `csp.impl.pushadapter.PushInputAdapter`.
+
+`PushInputAdapter` defines a method `push_tick()` which takes the value to feed the input timeseries.
+
+### PushInputAdapter - **--graph-- time**
+
+Similar to the standalone `PushInputAdapter` described above, we need to define a graph-time construct that represents a `PushInputAdapter` edge.
+In order to define this we use `py_push_adapter_def` again, but this time we pass the adapter manager *--graph--* time type so that it gets constructed properly.
+When the `PushInputAdapter` instance is created it will also receive an instance of the adapter manager *--impl--*, which it can then self-register on.
+
+```python
+def py_push_adapter_def(name, adapterimpl, out_type, manager_type=None, memoize=True, force_memoize=False, **kwargs):
+    """
+    Create a graph representation of a python push input adapter.
+    :param name: string name for the adapter
+    :param adapterimpl: a derived implementation of csp.impl.pushadapter.PushInputAdapter
+    :param out_type: the type of the output, should be a ts[] type. Note this can use tvar types if a subsequent argument defines the tvar
+    :param manager_type: the type of the graph time representation of the AdapterManager that will manage this adapter
+    :param kwargs: **kwargs will be passed through as arguments to the PushInputAdapter implementation;
+                   the first argument to the implementation will be the adapter manager impl instance
+    """
+```
+
+### Example
+
+Continuing with the *--graph--* time `AdapterManager` described above, we
+now define the impl:
+
+```python
+# This is the actual manager impl that will be created and executed during runtime
+class MyAdapterManagerImpl(AdapterManagerImpl):
+    def __init__(self, engine, interval):
+        super().__init__(engine)
+
+        # These are just used to simulate a data source
+        self._interval = interval
+        self._counter = 0
+
+        # We will keep track of requested input adapters here
+        self._inputs = {}
+
+        # Our driving thread, all realtime adapters will need a separate thread of execution that
+        # drives data into the engine thread
+        self._running = False
+        self._thread = None
+
+    def start(self, starttime, endtime):
+        """ start will get called at the start of the engine run. At this point
+        one would start up the realtime data source / spawn the driving thread(s) and
+        subscribe to the needed data """
+        self._running = True
+        self._thread = threading.Thread(target=self._run)
+        self._thread.start()
+
+    def stop(self):
+        """ This will be called at the end of the engine run, at which point resources should be
+        closed and cleaned up """
+        if self._running:
+            self._running = False
+            self._thread.join()
+
+    def register_input_adapter(self, symbol, adapter):
+        """ Actual PushInputAdapters will self-register when they are created as part of the engine.
+        This is the place we gather all requested input adapters and their properties
+        """
+        if symbol not in self._inputs:
+            self._inputs[symbol] = []
+        # Keep a list of adapters by key in case we get duplicate adapters (should be memoized in reality)
+        self._inputs[symbol].append(adapter)
+
+    def process_next_sim_timeslice(self, now):
+        """ This method is only used by simulated / historical adapters; for realtime we just return None """
+        return None
+
+    def _run(self):
+        """ Our driving thread; in reality this will be reacting to external events, parsing the data and
+        pushing it into the respective adapter
+        """
+        symbols = list(self._inputs.keys())
+        while self._running:
+            # Let's pick a random symbol from the requested symbols
+            symbol = symbols[random.randint(0, len(symbols) - 1)]
+            adapters = self._inputs[symbol]
+            data = MyData(symbol=symbol, value=self._counter)
+            self._counter += 1
+            for adapter in adapters:
+                adapter.push_tick(data)
+
+            time.sleep(self._interval.total_seconds())
+```
+
+Then we define our `PushInputAdapter` *--impl--*, which basically just
+self-registers with the adapter manager *--impl--* upon construction. We
+also define our `PushInputAdapter` *--graph--* time construct using `py_push_adapter_def`.
+
+```python
+# The Impl object is created at runtime when the graph is converted into the runtime engine;
+# it does not exist at graph building time. A managed push adapter impl will get the
+# adapter manager runtime impl as its first argument
+class MyPushAdapterImpl(PushInputAdapter):
+    def __init__(self, manager_impl, symbol):
+        print(f"MyPushAdapterImpl::__init__ {symbol}")
+        manager_impl.register_input_adapter(symbol, self)
+        super().__init__()
+
+
+MyPushAdapter = py_push_adapter_def('MyPushAdapter', MyPushAdapterImpl, ts[MyData], MyAdapterManager, symbol=str)
+```
+
+And then we can run our adapter in a CSP graph:
+
+```python
+@csp.graph
+def my_graph():
+    print("Start of graph building")
+
+    adapter_manager = MyAdapterManager(timedelta(seconds=0.75))
+    symbols = ['AAPL', 'IBM', 'TSLA', 'GS', 'JPM']
+    for symbol in symbols:
+        # your data source might tick faster than the engine thread can consume it;
+        # push_mode can be used to control how buffered-up tick events get processed
+        # LAST_VALUE will conflate and only tick the latest value since the last cycle
+        data = adapter_manager.subscribe(symbol, csp.PushMode.LAST_VALUE)
+        csp.print(symbol + " last_value", data)
+
+        # BURST will change the timeseries type from ts[T] to ts[[T]] (list of ticks)
+        # that will tick with all values that have buffered since the last engine cycle
+        data = adapter_manager.subscribe(symbol, csp.PushMode.BURST)
+        csp.print(symbol + " burst", data)
+
+        # NON_COLLAPSING will tick all events without collapsing, unrolling the events
+        # over multiple engine cycles
+        data = adapter_manager.subscribe(symbol, csp.PushMode.NON_COLLAPSING)
+        csp.print(symbol + " non_collapsing", data)
+
+    print("End of graph building")
+
+
+csp.run(my_graph, starttime=datetime.utcnow(), endtime=timedelta(seconds=10), realtime=True)
+```
+
+Do note that realtime adapters will only run in realtime engines (note the `realtime=True` argument to `csp.run`).
diff --git a/docs/wiki/references/Examples.md b/docs/wiki/references/Examples.md
new file mode 100644
index 00000000..41229319
--- /dev/null
+++ b/docs/wiki/references/Examples.md
@@ -0,0 +1,7 @@
+> \[!WARNING\]
+> This page is a work in progress.
+
+
diff --git a/docs/wiki/references/Glossary.md b/docs/wiki/references/Glossary.md
new file mode 100644
index 00000000..d66dd472
--- /dev/null
+++ b/docs/wiki/references/Glossary.md
@@ -0,0 +1,142 @@
+> \[!WARNING\]
+> This page is a work in progress.
+
+## Table of Contents
+
+- [Table of Contents](#table-of-contents)
+- [Terms](#terms)
+  - [Engine time](#engine-time)
+  - [Event streaming](#event-streaming)
+  - [Time series](#time-series)
+  - [Tick](#tick)
+  - [Node](#node)
+  - [Graph](#graph)
+  - [Alarm](#alarm)
+  - [Adapter](#adapter)
+  - [Realtime](#realtime)
+  - [Wiring (or graph building time)](#wiring-or-graph-building-time)
+  - [Graph run time](#graph-run-time)
+  - [Ticked (as in csp.ticked)](#ticked-as-in-cspticked)
+  - [Valid (as in csp.valid)](#valid-as-in-cspvalid)
+  - [Push mode](#push-mode)
+  - [Edge](#edge)
+  - [Delayed edge](#delayed-edge)
+  - [Feedback](#feedback)
+  - [Struct](#struct)
+  - [List basket](#list-basket)
+  - [Dict basket](#dict-basket)
+  - [Dynamic graph](#dynamic-graph)
+  - [Push input adapter](#push-input-adapter)
+  - [Pull input adapter](#pull-input-adapter)
+  - [Output adapter](#output-adapter)
+  - [Managed sim adapter](#managed-sim-adapter)
+  - [Adapter manager](#adapter-manager)
+
+## Terms
+
+
+
+### Engine time
+
+The CSP engine always maintains its current view of time.
+The current time of the engine can be accessed at any time within a `csp.node` by calling `csp.now()`.
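+
+For instance, a node can timestamp each tick it receives (a hypothetical example for illustration):
+
+```python
+import csp
+from csp import ts
+
+
+@csp.node
+def stamp(x: ts[int]) -> ts[str]:
+    if csp.ticked(x):
+        # csp.now() returns the engine's current time as a datetime
+        return f"{csp.now()}: {x}"
+```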
+
+### Event streaming
+
+
+
+### Time series
+
+
+
+### Tick
+
+
+
+### Node
+
+
+
+### Graph
+
+
+
+### Alarm
+
+
+
+### Adapter
+
+
+
+### Realtime
+
+
+
+### Wiring (or graph building time)
+
+
+
+### Graph run time
+
+
+
+### Ticked (as in csp.ticked)
+
+
+
+### Valid (as in csp.valid)
+
+
+
+### Push mode
+
+
+
+### Edge
+
+
+
+### Delayed edge
+
+
+
+### Feedback
+
+
+
+### Struct
+
+
+
+### List basket
+
+
+
+### Dict basket
+
+
+
+### Dynamic graph
+
+
+
+### Push input adapter
+
+
+
+### Pull input adapter
+
+
+
+### Output adapter
+
+
+
+### Managed sim adapter
+
+
+
+### Adapter manager
+