Set conditions using the PAR model #1320

tosador · 2023-03-20T13:14:51Z

Problem Description

I'm using the PAR model to generate synthetic timeseries data, and it does not seem possible to set conditions when generating more than one timeseries together.
Given 3 timeseries, for example, where the first must always be lower than the second and the second lower than the third, I noticed that the boundaries are sometimes not satisfied, and I am not sure whether increasing the number of epochs is a feasible solution.

Expected behavior

When simulating timeseries as in the example described above, I expect every simulated point of the 3 timeseries ts1, ts2 and ts3 to satisfy the condition below:

ts1[i] < ts2[i] < ts3[i]

Additional context

npatki · 2023-03-21T14:18:47Z

Hi @tosador, thanks for the feature request.

It would be helpful if you could describe what you mean by 3 timeseries. Are ts1, ts2 and ts3 different columns -- or are they the same column but for different entities? Does your dataset contain an entity_column? Maybe a snippet of your code would be helpful, showing how you are creating the PAR model and any relevant metadata.

tosador · 2023-03-22T16:30:41Z

Hi @npatki, thanks for your reply and sorry for the lack of clarity.

ts1, ts2 and ts3 are 3 different columns. The dataset does not contain any entity_column.

Please find below a snippet of the code used to build this example:

import numpy as np
import pandas as pd
from sdv.timeseries import PAR

# GBM parameters
mu, n, T, M, S0, sigma = 0.1, 100, 1, 1, 100, 0.3
dt = T/n
# GBM simulation
St = np.exp(
    (mu - sigma ** 2 / 2) * dt
    + sigma * np.random.normal(0, np.sqrt(dt), size=(M, n)).T
)
St = np.vstack([np.ones(M), St])
St = S0 * St.cumprod(axis=0)

# sample from uniform distribution [0, 1]
x1 = np.random.uniform(low=0, high=1, size=(n+1, 1))
x2 = 2 * x1

# creating 2 additional time series: St_low < St < St_high for each i
St_high = St + x1
St_low = St - x2

df = pd.DataFrame(data=np.hstack((St_low, St, St_high)), columns=['ts1', 'ts2', 'ts3'])

model = PAR(epochs=10000, verbose=True)
model.fit(df)

synthetic_data = model.sample(1)

When I run the code, the synthetic data do not satisfy the constraint used to build the dataset:

ts1[i] < ts2[i] < ts3[i] for each i
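One way to quantify how often the ordering is broken is to count the violating rows in the sampled table. The sketch below assumes the synthetic data is a pandas DataFrame with the columns ts1, ts2 and ts3 from the snippet above; the helper name count_order_violations is made up for illustration and is not part of SDV.

```python
import pandas as pd


def count_order_violations(data, columns=("ts1", "ts2", "ts3")):
    """Count rows whose values are not strictly increasing from left to right."""
    values = data[list(columns)].to_numpy()
    # A row is valid only if every adjacent pair satisfies left < right.
    valid = (values[:, :-1] < values[:, 1:]).all(axis=1)
    return int((~valid).sum())


# Toy table standing in for model output: the second row breaks ts2 < ts3.
example = pd.DataFrame({
    "ts1": [1.0, 2.0, 3.0],
    "ts2": [1.5, 2.5, 3.5],
    "ts3": [2.0, 2.4, 4.0],
})
print(count_order_violations(example))
```

Calling the helper on synthetic_data and dividing by len(synthetic_data) would give the violation rate, which can then be compared across different epoch settings.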

npatki · 2023-04-10T20:12:57Z

Hi @tosador, the PAR model is suited for data that has multiple sequences within a single table. I think if you have no entity_column then it means you only have a single sequence of data.

I wonder if you'll be better off applying a tabular model -- such as CTGAN or GaussianCopula? You can then apply constraints to hardcode logic that ts1 < ts2 < ts3.

BTW if you haven't already, I'd recommend upgrading to the new SDV 1.0 releases, as it fixes some bugs, offers a cleaner API and more functionality. Some relevant docs:

PARSynthesizer: More information about single vs multi-sequence data
Single table models
Demo Notebooks

tosador · 2023-04-19T08:11:11Z

Hi @npatki, thanks for your reply.

I think that if I apply a tabular model, the time dependency of each timeseries will be lost. I will be able to set the constraints, but the correlation between, for example, ts1[i] and ts1[i-1] will be different.
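The concern can be illustrated without SDV. A model that treats rows as independent does not preserve row order, which is mimicked here, as an assumption, by shuffling a series. The sketch compares the lag-1 autocorrelation of a random walk with that of its shuffled copy, using pandas' Series.autocorr; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A random walk: each value depends strongly on the previous one.
walk = pd.Series(rng.normal(size=500).cumsum())

# Shuffling keeps the marginal distribution but discards the ordering,
# roughly what happens when rows are sampled independently of each other.
shuffled = walk.sample(frac=1, random_state=0).reset_index(drop=True)

original_autocorr = walk.autocorr(lag=1)
shuffled_autocorr = shuffled.autocorr(lag=1)
print(f"lag-1 autocorrelation, original order: {original_autocorr:.3f}")
print(f"lag-1 autocorrelation, shuffled rows:  {shuffled_autocorr:.3f}")
```

The same autocorr call can be applied to a real column and its synthetic counterpart to check how much of the step-to-step dependence a given model retains.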

What I would like to generate is synthetic financial OHLC bars.

So, the dataset consists of 4 timeseries where:

high >= open, low, close
low <= open, high, close

In addition, consecutive time steps are correlated.

In my understanding the PAR model should be the right one, but sometimes the above constraints are not satisfied by the synthetic data.
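The two OHLC conditions above can be checked row by row. This is a minimal sketch, assuming the bars are stored in a pandas DataFrame with columns named open, high, low and close; find_invalid_bars is a hypothetical helper, not an SDV function.

```python
import pandas as pd


def find_invalid_bars(bars):
    """Return rows violating high >= open, low, close or low <= open, high, close."""
    prices = bars[["open", "high", "low", "close"]]
    # high must equal the row maximum and low the row minimum.
    high_ok = bars["high"] >= prices.max(axis=1)
    low_ok = bars["low"] <= prices.min(axis=1)
    return bars[~(high_ok & low_ok)]


# Toy bars: the second row has open above high, so it is reported.
bars = pd.DataFrame({
    "open": [10.0, 11.0],
    "high": [12.0, 10.5],
    "low": [9.0, 10.0],
    "close": [11.0, 10.2],
})
print(find_invalid_bars(bars))
```

An empty result means every bar satisfies both conditions.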

Thanks, I will upgrade to the new SDV 1.0 since I have not done it yet!

tosador · 2023-04-19T10:04:54Z

Hi @npatki, I have upgraded SDV to 1.0 and I have really appreciated the cleaner API and the new functionalities. Using SDV 1.0 I wrote the code to generate what I need.

However, I am getting the UserWarning below:
UserWarning: The PARSynthesizer does not yet support constraints. This model will ignore any constraints in the metadata.

I think that once the PARSynthesizer handles constraints, it will be possible to generate synthetic financial bars.

Do you know when it will be possible to handle constraints using the PAR model?

Below, for reference, is the code that creates the timeseries and constraints and produces the UserWarning:

import numpy as np
import pandas as pd
import datetime

from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

# GBM parameters
mu, n, T, M, S0, sigma = 0.1, 100, 1, 1, 100, 0.3
dt = T/n
# GBM simulation
St = np.exp(
    (mu - sigma ** 2 / 2) * dt
    + sigma * np.random.normal(0, np.sqrt(dt), size=(M, n)).T
)
St = np.vstack([np.ones(M), St])
St = S0 * St.cumprod(axis=0)

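# sample from uniform distribution [0, 1]
x1 = np.random.uniform(low=0, high=1, size=(n+1, 1))
x2 = 2 * x1
# offset for the open price, drawn within [-x2, x1] so that low <= open <= high
# (an offset drawn from [-1, 1] could push open outside the [low, high] range)
x3 = np.random.uniform(low=-x2, high=x1)

# creating 3 additional time series around St (the close): St_low <= St_open, St <= St_high for each i
St_high = St + x1
St_low = St - x2
St_open = St + x3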


df = pd.DataFrame(data=np.hstack((St_open, St_low, St, St_high)), columns=['open', 'low', 'close', 'high'])
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(n+1)]

df['Dates'] = date_list
df.insert(0, 'Underlying ID', 'ID_000')

my_constraint_hl = {
    'constraint_class': 'Inequality',
    'table_name': 'guests',
    'constraint_parameters': {
        'low_column_name': 'low',
        'high_column_name': 'high',
        'strict_boundaries': True
    }
}

my_constraint_ho = {
    'constraint_class': 'Inequality',
    'table_name': 'guests',
    'constraint_parameters': {
        'low_column_name': 'open',
        'high_column_name': 'high',
        'strict_boundaries': False
    }
}

my_constraint_hc = {
    'constraint_class': 'Inequality',
    'table_name': 'guests',
    'constraint_parameters': {
        'low_column_name': 'close',
        'high_column_name': 'high',
        'strict_boundaries': False
    }
}

my_constraint_lo = {
    'constraint_class': 'Inequality',
    'table_name': 'guests',
    'constraint_parameters': {
        'low_column_name': 'low',
        'high_column_name': 'open',
        'strict_boundaries': False
    }
}

my_constraint_lc = {
    'constraint_class': 'Inequality',
    'table_name': 'guests',
    'constraint_parameters': {
        'low_column_name': 'low',
        'high_column_name': 'close',
        'strict_boundaries': False
    }
}

constraints = [my_constraint_hl, my_constraint_ho, my_constraint_hc, my_constraint_lo, my_constraint_lc]

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)
metadata.update_column(column_name='Underlying ID', sdtype='id')
metadata.set_sequence_key(column_name='Underlying ID')
metadata.set_sequence_index(column_name='Dates')

s = PARSynthesizer(metadata=metadata,
                   enforce_rounding=False,
                   enforce_min_max_values=False)

s.add_constraints(constraints=constraints)
s.fit(df)
synthetic_data = s.sample(num_sequences=1)

npatki · 2023-05-03T19:30:42Z

Hi @tosador, I'm glad you're finding the new API more clear and useful. That was our main goal 😄

With SDV 1.0, certain "constraints" are met automatically, such as enforcing that the min and max values in the synthetic data stay within the appropriate ranges. But more complex constraints like Inequality are not yet supported.

We have an open issue #570 for tracking constraints on the PAR model. While it has not yet been prioritized, seeing more usages and demand for this will definitely help us add it to our roadmap. So if you want to add your use case to that issue (including how you want to ultimately use the synthetic data), that will be helpful.
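Until such constraints are supported, one possible stopgap is to repair the sampled rows after the fact. This is a rough sketch and not an SDV feature: it assumes the synthetic table has open, high, low and close columns, and repair_bars is a hypothetical helper. Overwriting values this way changes the distribution the model learned, so it may not suit every use.

```python
import pandas as pd


def repair_bars(bars):
    """Force high to the row maximum and low to the row minimum of the four prices."""
    repaired = bars.copy()
    prices = bars[["open", "high", "low", "close"]]
    repaired["high"] = prices.max(axis=1)
    repaired["low"] = prices.min(axis=1)
    return repaired


# Toy sampled rows: the second one has open above high and close below low.
sampled = pd.DataFrame({
    "open": [10.0, 11.0],
    "high": [12.0, 10.5],
    "low": [9.0, 10.0],
    "close": [11.0, 9.8],
})
fixed = repair_bars(sampled)
print(fixed)
```

After the repair, high >= open, low, close and low <= open, high, close hold for every row, while open and close are left untouched. A strict low < high is not guaranteed when all four prices in a row are equal.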

npatki · 2023-05-12T21:13:45Z

Hi @tosador, I'm going to close this issue off as a duplicate of #570 (adding constraints to PAR). If there is still more to discuss, feel free to reply and we can reopen the issue to investigate.

tosador added the "feature request" and "new" labels on Mar 20, 2023

npatki added the "data:sequential" and "under discussion" labels and removed the "new" label on Mar 21, 2023

tosador mentioned this issue on May 4, 2023 in "Support constraints for the PAR Synthesizer #570" (open)

npatki closed this as completed on May 12, 2023

npatki added the "resolution:duplicate" label and removed the "under discussion" label on May 12, 2023
