Set conditions using the PAR model #1320

Closed
tosador opened this issue Mar 20, 2023 · 7 comments
Labels: data:sequential, feature request, resolution:duplicate

Comments

@tosador

tosador commented Mar 20, 2023

Problem Description

I'm using the PAR model to generate synthetic timeseries data, and it does not seem possible to set conditions when generating more than one timeseries together.
For example, given 3 timeseries where the first must always be lower than the second and the second lower than the third, the synthetic data sometimes violates these bounds, and I am not sure whether increasing the number of epochs is a feasible solution.

Expected behavior

When simulating timeseries as in the example above, I expect every simulated data point of the 3 timeseries ts1, ts2 and ts3 to satisfy the condition below:

  • ts1[i] < ts2[i] < ts3[i]
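
The expected behavior can be stated as a check on the output. The snippet below is a minimal sketch, not part of SDV: `count_order_violations` is a hypothetical helper name, and it assumes the three series are available as equal-length sequences of numbers.

```python
def count_order_violations(ts1, ts2, ts3):
    """Return the number of indices i where ts1[i] < ts2[i] < ts3[i] fails."""
    if not (len(ts1) == len(ts2) == len(ts3)):
        raise ValueError("all series must have the same length")
    # Python chains comparisons, so `a < b < c` means `a < b and b < c`.
    return sum(1 for a, b, c in zip(ts1, ts2, ts3) if not (a < b < c))


# Illustrative values only: the second point breaks ts1 < ts2.
violations = count_order_violations(
    [1.0, 2.0, 3.0],
    [1.5, 1.9, 3.5],
    [2.0, 2.5, 4.0],
)
print(violations)
```

A result of zero on the synthetic output would mean the condition holds at every index.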


@tosador tosador added feature request Request for a new feature new Automatic label applied to new issues labels Mar 20, 2023
@npatki
Contributor

npatki commented Mar 21, 2023

Hi @tosador, thanks for the feature request.

It would be helpful if you could describe what you mean by 3 timeseries. Are ts1, ts2 and ts3 different columns -- or are they the same column but for different entities? Does your dataset contain an entity_column? Maybe a snippet of your code would be helpful, showing how you are creating the PAR model and any relevant metadata.

@npatki npatki added data:sequential Related to timeseries datasets under discussion Issue is currently being discussed and removed new Automatic label applied to new issues labels Mar 21, 2023
@tosador
Author

tosador commented Mar 22, 2023

Hi @npatki, thanks for your reply and sorry for the lack of clarity.

ts1, ts2 and ts3 are 3 different columns. The dataset does not contain any entity_column.

Please find below a snippet of the code used to build this example:

import numpy as np
import pandas as pd
from sdv.timeseries import PAR

# GBM parameters
mu, n, T, M, S0, sigma = 0.1, 100, 1, 1, 100, 0.3
dt = T/n
# GBM simulation
St = np.exp(
    (mu - sigma ** 2 / 2) * dt
    + sigma * np.random.normal(0, np.sqrt(dt), size=(M, n)).T
)
St = np.vstack([np.ones(M), St])
St = S0 * St.cumprod(axis=0)

# sample from uniform distribution [0, 1]
x1 = np.random.uniform(low=0, high=1, size=(n+1, 1))
x2 = 2 * x1

# creating 2 additional time series: St_low < St < St_high for each i
St_high = St + x1
St_low = St - x2

df = pd.DataFrame(data=np.hstack((St_low, St, St_high)), columns=['ts1', 'ts2', 'ts3'])

model = PAR(epochs=10000, verbose=True)
model.fit(df)

synthetic_data = model.sample(1)


When I run the code, the synthetic data do not satisfy the constraint used to build the dataset:

  • ts1[i] < ts2[i] < ts3[i] for each i

@npatki
Contributor

npatki commented Apr 10, 2023

Hi @tosador, the PAR model is suited for data that has multiple sequences within a single table. I think if you have no entity_column then it means you only have a single sequence of data.

I wonder if you'll be better off applying a tabular model -- such as CTGAN or GaussianCopula? You can then apply constraints to hardcode logic that ts1 < ts2 < ts3.

BTW if you haven't already, I'd recommend upgrading to the new SDV 1.0 releases, as it fixes some bugs, offers a cleaner API and more functionality. Some relevant docs:

@tosador
Author

tosador commented Apr 19, 2023

Hi @npatki, thanks for your reply.

I think that if I apply a tabular model, the time dependency of each timeseries will be lost. I will be able to set the constraints, but the correlation between, for example, ts1[i] and ts1[i-1] will be different.
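
To make the concern concrete, the correlation between ts1[i] and ts1[i-1] can be measured as a lag-1 autocorrelation. The sketch below uses only the Python standard library. The random walk is illustrative data rather than SDV output, and shuffling it is only a rough stand-in for a model that samples rows independently of their order.

```python
import random


def lag1_autocorr(series):
    """Sample lag-1 autocorrelation of a sequence of numbers."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[i] - mean) * (series[i - 1] - mean) for i in range(1, n))
    den = sum((x - mean) ** 2 for x in series)
    return num / den


random.seed(0)

# A random walk: each value depends strongly on the previous one.
walk = [100.0]
for _ in range(499):
    walk.append(walk[-1] + random.gauss(0, 1))

# Same values, but with the time ordering destroyed.
shuffled = walk[:]
random.shuffle(shuffled)

print(lag1_autocorr(walk))      # close to 1 for a random walk
print(lag1_autocorr(shuffled))  # close to 0 once the order is lost
```

Comparing this statistic between the real and synthetic series is one way to check whether the time dependency survived.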

What I would like to generate is synthetic financial OHLC bars.

The dataset consists of 4 time series where:

  • high >= open, low, close
  • low <= open, high, close

In addition, consecutive time steps are correlated.

In my understanding the PAR model should be the right one, but sometimes the above constraints are not satisfied by the synthetic data.
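
One possible workaround, independent of SDV, is to re-express each bar so that the inequalities hold by construction: model the open, the open-to-close move, and the two wick lengths, then rebuild high and low from them. This is only a sketch under assumptions. `encode_bar` and `decode_bar` are hypothetical helper names, taking the absolute value of the wicks is one arbitrary way to force them to be non-negative, and whether a sequential model learns the transformed columns well is untested here.

```python
def encode_bar(open_, high, low, close):
    """Turn a valid OHLC bar into (open, body, upper_wick, lower_wick)."""
    body_top = max(open_, close)
    body_bottom = min(open_, close)
    return (open_, close - open_, high - body_top, body_bottom - low)


def decode_bar(open_, body, upper_wick, lower_wick):
    """Rebuild (open, high, low, close); the bounds hold for any inputs."""
    close = open_ + body
    # abs() forces the wicks to be non-negative, so high >= open, close
    # and low <= open, close regardless of what the model produced.
    high = max(open_, close) + abs(upper_wick)
    low = min(open_, close) - abs(lower_wick)
    return (open_, high, low, close)


# Round trip on a valid bar (illustrative values).
encoded = encode_bar(100.0, 103.0, 98.5, 101.5)
decoded = decode_bar(*encoded)

# Even a "bad" model output with a negative wick decodes to a valid bar.
o, h, l, c = decode_bar(50.0, -2.0, -0.7, 0.3)
print(decoded, (o, h, l, c))
```

The encoded columns would be the ones passed to the model, with `decode_bar` applied to each sampled row afterwards.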

Thanks, I will upgrade to the new SDV 1.0 since I have not done it yet!

@tosador
Author

tosador commented Apr 19, 2023

Hi @npatki, I have upgraded SDV to 1.0 and I have really appreciated the cleaner API and the new functionalities. Using SDV 1.0 I wrote the code to generate what I need.

However I am getting the below UserWarning:
UserWarning: The PARSynthesizer does not yet support constraints. This model will ignore any constraints in the metadata. warnings.warn(

I think that once the PARSynthesizer handles constraints, it will be possible to generate synthetic financial bars.

Do you know when it will be possible to handle constraints using the PAR model?

For reference, below is the code that creates the timeseries and constraints and produces the UserWarning:

import numpy as np
import pandas as pd
import datetime

from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

# GBM parameters
mu, n, T, M, S0, sigma = 0.1, 100, 1, 1, 100, 0.3
dt = T/n
# GBM simulation
St = np.exp(
    (mu - sigma ** 2 / 2) * dt
    + sigma * np.random.normal(0, np.sqrt(dt), size=(M, n)).T
)
St = np.vstack([np.ones(M), St])
St = S0 * St.cumprod(axis=0)

# sample from uniform distribution [0, 1]
x1 = np.random.uniform(low=0, high=1, size=(n+1, 1))
x2 = 2 * x1
# open offset sampled between the low and high offsets, so that low <= open <= high
x3 = np.random.uniform(low=-x2, high=x1)

# creating 3 additional time series around St (the close): St_low <= St_open <= St_high
St_high = St + x1
St_low = St - x2
St_open = St + x3


df = pd.DataFrame(data=np.hstack((St_open, St_low, St, St_high)), columns=['open', 'low', 'close', 'high'])
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(n+1)]

df['Dates'] = date_list
df.insert(0, 'Underlying ID', 'ID_000')

my_constraint_hl = {
    'constraint_class': 'Inequality',
    'table_name': 'guests',
    'constraint_parameters': {
        'low_column_name': 'low',
        'high_column_name': 'high',
        'strict_boundaries': True
    }
}

my_constraint_ho = {
    'constraint_class': 'Inequality',
    'table_name': 'guests',
    'constraint_parameters': {
        'low_column_name': 'open',
        'high_column_name': 'high',
        'strict_boundaries': False
    }
}

my_constraint_hc = {
    'constraint_class': 'Inequality',
    'table_name': 'guests',
    'constraint_parameters': {
        'low_column_name': 'close',
        'high_column_name': 'high',
        'strict_boundaries': False
    }
}

my_constraint_lo = {
    'constraint_class': 'Inequality',
    'table_name': 'guests',
    'constraint_parameters': {
        'low_column_name': 'low',
        'high_column_name': 'open',
        'strict_boundaries': False
    }
}

my_constraint_lc = {
    'constraint_class': 'Inequality',
    'table_name': 'guests',
    'constraint_parameters': {
        'low_column_name': 'low',
        'high_column_name': 'close',
        'strict_boundaries': False
    }
}

constraints = [my_constraint_hl, my_constraint_ho, my_constraint_hc, my_constraint_lo, my_constraint_lc]

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)
metadata.update_column(column_name='Underlying ID', sdtype='id')
metadata.set_sequence_key(column_name='Underlying ID')
metadata.set_sequence_index(column_name='Dates')

s = PARSynthesizer(metadata=metadata,
                   enforce_rounding=False,
                   enforce_min_max_values=False)

s.add_constraints(constraints=constraints)
s.fit(df)
synthetic_data = s.sample(num_sequences=1)
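
Until constraints are supported, one stopgap is to validate each sampled sequence and resample when a bar violates the inequalities. The sketch below is an assumption-laden illustration, not an SDV feature. `bar_is_valid`, `sample_until_valid` and `fake_sampler` are hypothetical names, and `fake_sampler` is a stand-in for the real synthesizer so that the example runs without SDV.

```python
import random


def bar_is_valid(bar):
    """Check low < high and low <= open, close <= high for one bar."""
    return (bar['low'] < bar['high']
            and bar['low'] <= bar['open'] <= bar['high']
            and bar['low'] <= bar['close'] <= bar['high'])


def sample_until_valid(sample_fn, max_tries=100):
    """Call sample_fn() until every bar in the sequence is valid, or give up."""
    for attempt in range(1, max_tries + 1):
        sequence = sample_fn()
        if all(bar_is_valid(bar) for bar in sequence):
            return sequence, attempt
    raise RuntimeError(f"no valid sequence in {max_tries} tries")


rng = random.Random(0)


def fake_sampler():
    """Stand-in for the synthesizer: 5 bars, some with open outside [low, high]."""
    bars = []
    for _ in range(5):
        close = 100 + rng.uniform(-1, 1)
        bars.append({
            'open': close + rng.uniform(-1, 1),
            'low': close - 0.8,
            'close': close,
            'high': close + 0.8,
        })
    return bars


sequence, attempts = sample_until_valid(fake_sampler)
print(attempts, len(sequence))
```

With the code above, `sample_fn` could presumably wrap `s.sample(num_sequences=1)` and convert the result with pandas' `to_dict('records')`. Rejecting sequences changes the distribution of what is kept and may never succeed if violations are frequent, so this is a diagnostic aid more than a fix.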

@npatki
Contributor

npatki commented May 3, 2023

Hi @tosador, I'm glad you're finding the new API more clear and useful. That was our main goal 😄

With SDV 1.0, certain "constraints" are automatically met, such as enforcing that the min and max values in the synthetic data are within the appropriate ranges. But more complex constraints like Inequality are not yet supported.

We have an open issue #570 for tracking constraints on the PAR model. While it has not yet been prioritized, seeing more usages and demand for this will definitely help us add it to our roadmap. So if you want to add your use case to that issue (including how you want to ultimately use the synthetic data), that will be helpful.

@npatki
Contributor

npatki commented May 12, 2023

Hi @tosador, I'm going to close this issue off as a duplicate of #570 (adding constraints to PAR). If there is still more to discuss, feel free to reply and we can reopen the issue to investigate.

@npatki npatki closed this as completed May 12, 2023
@npatki npatki added resolution:duplicate This issue or pull request already exists and removed under discussion Issue is currently being discussed labels May 12, 2023