Resources for ICDE 2024 Submission
DeepMapping is packaged as a Python library. Run the following commands to install it:

```bash
cd DeepMapping
pip install -e ./
```
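If the installation succeeded, the package should be importable. A minimal smoke test (the module name `DeepMapping` is an assumption based on the directory name and may differ):

```python
# Hypothetical smoke test: the module name is assumed from the
# repository's directory name.
import DeepMapping

# For an editable install, this path should point into the repository.
print(DeepMapping.__file__)
```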
We wrapped the feature extraction up as a C function for better performance. Run the following command to compile it into a shared library:

```bash
cc -fPIC -shared -o shared_utils.so shared_utils.c
```
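The compiled library can then be loaded from Python, e.g. via `ctypes`. A minimal sketch; the exported function names and signatures are defined in `shared_utils.c`, so the commented declaration below is purely illustrative:

```python
import ctypes

# Load the compiled shared library from the repository root.
shared_utils = ctypes.CDLL("./shared_utils.so")

# Hypothetical example of declaring an exported function before use;
# consult shared_utils.c for the real names and signatures.
# shared_utils.extract_features.restype = ctypes.c_int
```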
Our experiments cover synthetic datasets (low/high correlation, at 100MB, 1GB, and 10GB scales) and the TPC-H and TPC-DS benchmark datasets with scale factors of 1 and 10. We removed all string/continuous columns and uploaded our pre-generated datasets to HERE. After downloading, unzip the archive into the root folder of this GitHub repository; you will then see a `dataset` folder there.
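To sanity-check the download, you can load one table and inspect its columns. A sketch only: the file name, path, and format below are hypothetical and should be adjusted to the actual layout of the `dataset` folder:

```python
import pandas as pd

# Hypothetical path and format; adjust to the downloaded layout.
df = pd.read_csv("dataset/tpch_s1/customer.csv")

# All remaining columns should be integer-typed, since string and
# continuous columns were removed from the published datasets.
print(df.dtypes)
```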
List of datasets:

- TPC-H (S1/S10): customer, lineitem, orders, part, supplier.
- TPC-DS (S1/S10): catalog_page, catalog_returns, catalog_sales, customer_address, customer_demographics, customer, item, store_returns, web_returns.
- Synthetic Dataset (100MB, 1GB, 10GB): single_value_column_low_correlation, single_value_column_high_correlation, multiple_value_column_low_correlation, multiple_value_column_high_correlation.
To search and train a model:

1. Run `python run_search_model.py` to perform a neural architecture search (NAS) with a given dataset. You can configure the NAS by editing `run_search_model.py` accordingly; the searched result will be printed out.
2. Modify `SEARCH_MODEL_STRUCTURE` in `run_train_searched_model.py` with the output from step 1 (a hypothetical example is sketched below), then run `python run_train_searched_model.py` to train a model.
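Step 2 amounts to pasting the printed search result into the training script. A hypothetical illustration; the actual format of the structure is whatever `run_search_model.py` prints:

```python
# In run_train_searched_model.py: replace the placeholder with the
# structure printed by step 1. The value below is purely illustrative.
SEARCH_MODEL_STRUCTURE = [256, 256, 128]
```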
We provide some demo models for the following 2 tasks; please go HERE to download them. After downloading, unzip the archive into the root folder of this GitHub repository; you will then see a `models` folder there.
Note: to optimize performance for each method, including the baselines and DeepMapping, it is recommended to tune the hyperparameters in your local environment and use the tuned values. Run `run_benchmark_tune.py` to perform a grid search.
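Conceptually, the grid search exhaustively evaluates every combination of candidate hyperparameter values and keeps the best one. A generic sketch; the parameter names, ranges, and the `run_benchmark` helper below are illustrative, not the actual tunables in `run_benchmark_tune.py`:

```python
import itertools

def run_benchmark(cfg):
    """Hypothetical stand-in: run the method under test with `cfg`
    and return its measured latency in seconds."""
    raise NotImplementedError

# Illustrative hyperparameter grid.
grid = {
    "batch_size": [1024, 2048, 4096],
    "partition_size": [1 << 14, 1 << 16],
}

best_cfg, best_latency = None, float("inf")
for values in itertools.product(*grid.values()):
    cfg = dict(zip(grid, values))
    latency = run_benchmark(cfg)
    if latency < best_latency:
        best_cfg, best_latency = cfg, latency
print(best_cfg, best_latency)
```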
These experiments measure overall storage overhead and end-to-end query latency on the benchmark datasets, i.e. TPC-H and TPC-DS. Run `python run_benchmark_data_query.py` to benchmark. To benchmark a different dataset, modify the file accordingly, following the instructions provided in the Python file.
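If you add your own latency measurements around the benchmarked queries, `time.perf_counter` is the standard wall-clock timer in Python. A minimal, self-contained helper sketch (not part of the benchmark scripts):

```python
import time

def timed(fn, *args, repeats=5):
    """Best-of-N wall-clock time of fn(*args), to smooth out noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best
```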
These experiments measure overall storage overhead and end-to-end query latency on the synthetic datasets with data manipulation, i.e. INSERT/UPDATE/DELETE. Run `python run_benchmark_data_manipulation.py` to benchmark. To benchmark a different dataset, modify the file accordingly, following the instructions provided in the Python file.