This issue is about finding new use-cases for arc-script.
Problem
Applications are needed to motivate the design of arc-script. For arc-script we have multiple levels of abstraction:
Applications (e.g., anomaly detection, data cleaning). These are very specialized abstractions built using libraries.
Libraries (e.g., custom operators, composite data types). These are more generalized abstractions built using language constructs.
Language (e.g., tasks, native data types). These are abstractions built using Rust and runtime primitives. Runtime primitives are not exposed at the arc-script level (only inside Rust).
Runtime (e.g., channels, streams, components). These are abstractions built using Kompact and Rust.
In the ideal case an application is just about implementing an algorithm (such as PageRank). Generally, however, data must also be pre- and post-processed. Where things get hairy is when we also need to interact with other systems. For example, Alibaba's use case requires interaction with MySQL/HBase data sources and Druid data sinks. Zalando requires interaction with Kafka/S3 data sources.
Application requirements
TODO: List of requirements which our runtime/language must offer
Application non-requirements
TODO: List of requirements which our runtime/language cannot offer at the moment (and likely never will)
List of Applications
This is a continuously updated list of applications in data analytics which can help motivate the design of Arc-Script. I will fill in some details to explain the idea and requirements of the use cases. Some use cases share similar patterns in their implementation, and some use cases are building blocks for others. Data analytics pipelines may be composed of many different algorithms, which makes it important to distinguish between what the language should provide out of the box and what should be implementable using the language.
Streaming Algorithms
Streaming algorithms have a well-established theory behind them (known bounds on time and space). However, their applications are very specific... too specific to cover the area of BigData analytics. BigData analytics can however sometimes involve the use of streaming algorithms.
Requirements
Requirements of such algorithms are that they:
Must process data in a single-pass
Must process data in the order it arrives
Must process data with limited memory
Must process each event within a bounded time (i.e., keep up with the inbound throughput)
Examples
Some examples from Wikipedia are Frequency Moments, Frequent Elements, Event Detection, Counting Distinct Elements, Feature Hashing, Stochastic Gradient Descent, Bloom filter, Count-min sketch, Locality-sensitive hashing, MinHash, SimHash, w-shingling. We should not delve into individual algorithms since these algorithms are rarely useful on their own in the broader picture.
BigData Streaming
BigData streaming is about processing massive amounts of information in real-time. Analytics are both continuous and deep, and for this reason pose harder requirements than general streaming algorithms.
Requirements
Requirements of continuous deep data analytics are as follows:
Applications are generally represented as pipelines which involve multiple stages of processing and analysis. A stage may, for example, represent a streaming algorithm.
Data is timestamped (algorithms often revolve around time)
Data items arrive out-of-order with respect to their timestamps
Data can come from multiple sources, each with a different throughput
Data often needs to be pre-processed and enriched
Data can be partitioned by-key
Data is continuously generated (algorithms never terminate)
Applications can contain iterative loops which feed data back for further refinement
Results should ideally be reproducible
Examples
The following is a list of examples of BigData streaming:
PageRank. Specific requirements: 1) Ability to parse strings into adjacency lists (vectors of integers). 2) Ability to join two datasets (ranks with adjacency lists).
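As a rough illustration of these two requirements, here is a minimal sketch in plain Rust (not arc-script) of parsing whitespace-separated vertex lines into adjacency lists and joining them with a table of ranks. The input format, identifiers, and the uniform initial ranks are assumptions made for the example, not part of the use case.

```rust
use std::collections::HashMap;

/// Parse a line such as "1 2 3" into (source vertex, adjacency list).
/// The whitespace-separated format is an assumption made for this sketch.
fn parse_adjacency(line: &str) -> Option<(u64, Vec<u64>)> {
    let mut ids = line.split_whitespace().map(|s| s.parse::<u64>().ok());
    let src = ids.next()??;
    let dsts: Option<Vec<u64>> = ids.collect();
    Some((src, dsts?))
}

fn main() {
    let adjacency: Vec<(u64, Vec<u64>)> = ["1 2 3", "2 3", "3 1"]
        .iter()
        .filter_map(|line| parse_adjacency(line))
        .collect();

    // Join the current rank of each vertex with its adjacency list.
    let ranks: HashMap<u64, f64> = adjacency.iter().map(|(v, _)| (*v, 1.0)).collect();
    for (v, dsts) in &adjacency {
        let share = ranks[v] / dsts.len() as f64;
        for d in dsts {
            println!("vertex {} contributes {:.3} to {}", v, share, d);
        }
    }
}
```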
“Whenever the sum of the accumulated payment amount from the same payer to the same beneficiary within the duration of a week is greater than 1 000 000 $ - fire an alert.”
“Whenever the sum of payments from the same payer to the same beneficiary within a 24 hour period is greater than 200 000 $ - trigger an alert.”
Specific requirements: 1) Ability to broadcast events to all operator instances, which is used to dynamically update the rules for deciding whether a transaction is fraudulent without recompilation; the rules are, in other words, an input stream of control events relayed to the system from a Kafka topic. 2) Ability to partition events by key, where the key is the payer and beneficiary. 3) Ability to forward events (reusing the same key). 4) Sliding windows. 5) Managed state.
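A minimal sketch in plain Rust (not arc-script) of requirements 2)-5): keyed state holding the payments of the last 24 hours per (payer, beneficiary) pair, with a threshold check on the windowed sum. It assumes events arrive in timestamp order and leaves out the broadcast rule updates of requirement 1); the field names are illustrative.

```rust
use std::collections::HashMap;

/// A single payment event; field names are illustrative.
struct Payment {
    payer: String,
    beneficiary: String,
    amount: u64,
    ts: u64, // event time in seconds
}

const WINDOW: u64 = 24 * 60 * 60; // 24-hour window
const THRESHOLD: u64 = 200_000;

fn process(events: Vec<Payment>) {
    // Keyed state: per (payer, beneficiary), the payments still inside the window.
    let mut state: HashMap<(String, String), Vec<(u64, u64)>> = HashMap::new();
    for e in events {
        let key = (e.payer.clone(), e.beneficiary.clone());
        let window = state.entry(key.clone()).or_default();
        window.push((e.ts, e.amount));
        // Evict payments older than 24 hours relative to the newest event.
        window.retain(|(ts, _)| e.ts.saturating_sub(*ts) <= WINDOW);
        let sum: u64 = window.iter().map(|(_, a)| a).sum();
        if sum > THRESHOLD {
            println!("ALERT: {} -> {} transferred {} within 24h", key.0, key.1, sum);
        }
    }
}

fn main() {
    process(vec![
        Payment { payer: "A".into(), beneficiary: "B".into(), amount: 150_000, ts: 0 },
        Payment { payer: "A".into(), beneficiary: "B".into(), amount: 80_000, ts: 3_600 },
    ]);
}
```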
Arterial Blood Pressure (ABP) pulse detection [O’Rourke 1971; Zong et al. 2003] is a complex streaming computation, and is difficult to express with existing languages for stream processing. The use of a streaming query language for medical monitoring applications has been considered in [Abbas et al. 2018, 2019].
The ABP signal is collected from the MIT-BIH Polysomnographic database [Ichimaru and Moody 1999]. The signal measurements are of type VT = {val: V, ts: T}, where val is the value of the signal and ts is the timestamp. The signal is uniformly sampled at a frequency of 250Hz. (...) The ABP waveform contains rich information about the cardiovascular system (e.g., heart rate, systolic, mean, and diastolic arterial pressures). Reliable ABP pulse detection is crucial for extracting this information.
First, the algorithm preprocesses the signal stream using a low-pass IIR filter and a slope sum function (SSF), and then it performs the detection of the pulse onset.
The low-pass IIR filter suppresses high-frequency noise and is defined by y(n) = 2y(n−1) − y(n−2) + x(n) − 2x(n−5) + x(n−10). The SSF is defined by z(n) = Σ_{0 ≤ i ≤ 31} max(0, d(n−i)), where d(n) = y(n) − y(n−1); it enhances the up-slope of the ABP pulse and restrains the remainder of the pressure waveform. The query getVTP : Q(VT, VTP) annotates each item {val, ts} of the input stream with an additional component pval, which is the result of the preprocessing. The type VTP = {val: V, ts: T, pval: V} extends VT with this additional component. These preprocessed values have a phase shift of 20 ms (5 samples), introduced by the low-pass filtering.
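For concreteness, here is how the two preprocessing steps could look as an offline sketch in plain Rust (not arc-script), assuming zero-padding for samples before the start of the signal; a streaming version would keep the last filter outputs and slopes as operator state instead of indexing into a slice.

```rust
/// Offline sketch of the preprocessing: the low-pass IIR filter followed by
/// the 32-sample slope sum function. Out-of-range samples are treated as 0,
/// which is an assumption made to keep the sketch short.
fn lowpass_ssf(x: &[f64]) -> Vec<f64> {
    let n = x.len();
    // y(n) = 2y(n-1) - y(n-2) + x(n) - 2x(n-5) + x(n-10)
    let mut y = vec![0.0; n];
    for i in 0..n {
        let x_at = |k: usize| if i >= k { x[i - k] } else { 0.0 };
        let y1 = if i >= 1 { y[i - 1] } else { 0.0 };
        let y2 = if i >= 2 { y[i - 2] } else { 0.0 };
        y[i] = 2.0 * y1 - y2 + x_at(0) - 2.0 * x_at(5) + x_at(10);
    }
    // z(n) = sum over i in 0..=31 of max(0, d(n-i)), where d(n) = y(n) - y(n-1)
    (0..n)
        .map(|i| {
            (0..=31)
                .filter(|&k| i >= k + 1)
                .map(|k| (y[i - k] - y[i - k - 1]).max(0.0))
                .sum::<f64>()
        })
        .collect()
}

fn main() {
    // A tiny synthetic signal; the real ABP signal is sampled at 250 Hz.
    let x: Vec<f64> = (0..100).map(|i| (i as f64 / 10.0).sin().max(0.0)).collect();
    let z = lowpass_ssf(&x);
    println!("{:?}", &z[..5]);
}
```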
The detection of ABP onset is described by the following rules:
R1. In intervals where the SSF value exceeds a threshold Thred (i.e. a tentative pulse), the algorithm selects the first and the maximum SSF values.
R2. The pulse detection is accepted only if the difference between the first and the maximum SSF values exceeds 100.
R3. When the pulse is accepted, the algorithm chooses the first sample that crosses the threshold as the onset point. The detected onset is adjusted by 20ms (5 samples) to compensate for the phase shift of low-pass filtering.
R4. After an onset is detected, to avoid double detection of the same pulse, the detection falls silent for 300ms.
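One way to read rules R1-R4 is as a small state machine over the SSF stream. The sketch below, in plain Rust rather than arc-script, is only an illustration of that reading: the threshold Thred is not given in the excerpt, so the constant used here is made up, while the 100-unit margin, the 5-sample phase-shift correction, and the 75-sample (300 ms at 250 Hz) refractory period follow the rules above.

```rust
/// Detector state, following rules R1-R4.
enum State {
    /// Looking for the first SSF value above the threshold (start of R1).
    Searching,
    /// Inside a tentative pulse: the index and value of the first crossing,
    /// and the maximum SSF value seen so far (R1).
    InPulse { onset: usize, first: f64, max: f64 },
    /// Refractory period after an accepted onset (R4).
    Silent { remaining: usize },
}

const THRED: f64 = 120.0;     // threshold from R1; the actual value is not given
const MARGIN: f64 = 100.0;    // R2
const PHASE_SHIFT: usize = 5; // R3: 20 ms at 250 Hz
const REFRACTORY: usize = 75; // R4: 300 ms at 250 Hz

fn detect_onsets(ssf: &[f64]) -> Vec<usize> {
    let mut onsets = Vec::new();
    let mut state = State::Searching;
    for (i, &z) in ssf.iter().enumerate() {
        state = match state {
            State::Searching if z > THRED => State::InPulse { onset: i, first: z, max: z },
            State::Searching => State::Searching,
            State::InPulse { onset, first, max } if z > THRED => {
                State::InPulse { onset, first, max: max.max(z) }
            }
            // The tentative pulse ended: accept it only if the rise is large enough (R2).
            State::InPulse { onset, first, max } => {
                if max - first > MARGIN {
                    onsets.push(onset.saturating_sub(PHASE_SHIFT)); // adjust for phase shift (R3)
                    State::Silent { remaining: REFRACTORY }
                } else {
                    State::Searching
                }
            }
            State::Silent { remaining } if remaining > 1 => State::Silent { remaining: remaining - 1 },
            State::Silent { .. } => State::Searching,
        };
    }
    onsets
}

fn main() {
    // Synthetic SSF trace: one clear pulse starting at index 1.
    let ssf = [0.0, 130.0, 180.0, 260.0, 240.0, 90.0, 0.0, 0.0];
    println!("onsets at {:?}", detect_onsets(&ssf));
}
```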
"Assume that the input stream consists of signal measurements of type V (integer type) which are collected at a fixed frequency. We will consider a computation that is the composition of a smoothing filter and calculating the derivative. We use a low-pass filter to smooth the input into results f : F (floating point type), where f = (v1 + 2*v2 + 4*v3 + 2*v4 + v5)/10 for each five consecutive input items v1, v2, ..., v5. Then, we compute the derivative d : D (floating point type) where d = f2 − f1 for every two consecutive smoothed values."
"Now, let us consider an algorithm for detect peaks in a stream of numerical values (suppose they are of type V). The algorithm searches for the first value that exceeds the threshold THRESH. Then, it search for the maximum over the next #PEAK_CNT elements, which is considered a peak. After that, the algorithm silences detection for #SILENCE_CNT elements to avoid a duplicate detection. This process is repeated indefinitely in order to detect all peaks."
Complex Event Processing Algorithms
In the words of Esper, "Complex Event Processing" is about analysing events to find situations of interest. CEP detects and derives information, which can be reacted to by deciding and doing an action. This is known as the 4D model (Detect-Derive-Decide-Do).
An example situation to be detected is: A suspicious account is derived whenever there are at least three large cash deposits in the last 15 days.
The "Detect" is about the raw event, for example a cash deposit event.
The "Derive" is about the situation, i.e. "did something happen?", for example there is a suspicious account.
The "Decide" is about the decision of what to do, for example the decision to determine a risk score or determine another course of action
The "Do" is the action, for example an action that opens an investigation
Resources