Touchstone-plus is a supplementary version of Touchstone designed to address its insufficient support for matching operators.
Here is our technical report, which is an extension of our submitted paper.
- In Section 2, we give the proof for Proposition 1.
- In Section 4, we give the proof for Theorem 1.
Touchstone-plus's workflow is divided into two steps, computation and data generation, both of which can be executed directly with the command lines given below.
The main task of the computation phase is to extract the table and column information related to the input queries (including table names, column names, and column cardinalities), as well as the cardinality of each query. Based on this information, a Constraint Programming (CP) model is constructed, and the solver's results are written to an output file.
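As a rough illustration of what this step computes (a hedged sketch, not Touchstone-plus's actual implementation), the required statistics could be gathered over JDBC. The table, column, and predicate below are hypothetical TPC-H-style examples, and the connection values mirror the sample tool.json shown later.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StatsSketch {
    public static void main(String[] args) throws SQLException {
        // Connection settings taken from the sample tool.json (illustrative only).
        String url = "jdbc:postgresql://127.0.0.1:5432/tpch1";
        try (Connection conn = DriverManager.getConnection(url, "postgres", "mima123")) {
            // Cardinality of a column referenced by an input query: its number of distinct values.
            long columnCard = count(conn, "SELECT COUNT(DISTINCT o_orderstatus) FROM orders");
            // Cardinality of the input query itself: the number of rows it returns on the target database.
            long queryCard = count(conn, "SELECT COUNT(*) FROM orders WHERE o_orderstatus = 'F'");
            System.out.println("column cardinality = " + columnCard
                    + ", query cardinality = " + queryCard);
        }
    }

    private static long count(Connection conn, String sql) throws SQLException {
        try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            rs.next();
            return rs.getLong(1);
        }
    }
}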
The configuration file path is ./conf/tool.json. The file tool.json contains the database connection information and the directory settings listed below.
- databaseConnectorConfig: The database connection information, i.e., the configuration used to connect to the target database.
- inputDirectory: The directory containing the input queries to be simulated.
- outputDirectory: The directory for storing the parsed results and the solver's computation results.
- newsqlDirectory: The directory for the simulated queries produced in the generation phase.
- dataDirectory: The directory for the simulated data produced in the generation phase.
An example is shown below.
{
"databaseConnectorConfig": {
"databaseIp": "127.0.0.1", //database IP
"databaseName": "tpch1", //database name
"databasePort": "5432", //database port
"databasePwd": "mima123", //database password
"databaseUser": "postgres" //database username
},
"inputDirectory": "conf/inputTest.txt", //directory where the query is located
"outputDirectory": "conf/output.txt", //execution result storage directory
"newsqlDirectory": "conf/newsql.txt", //directory where the simulated query is located
"dataDirectory": "conf/data.txt", //directory where the simulated query is located
}
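A minimal sketch of reading this configuration from Java is shown below, assuming the Jackson library is available; comment parsing is enabled because the annotated example above is not strict JSON. The class name and the printed fields are illustrative, not the tool's actual code.

import java.io.File;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ConfigSketch {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // The annotated example uses // comments, which strict JSON forbids.
        mapper.configure(JsonParser.Feature.ALLOW_COMMENTS, true);
        JsonNode conf = mapper.readTree(new File("conf/tool.json"));
        JsonNode db = conf.path("databaseConnectorConfig");
        String jdbcUrl = "jdbc:postgresql://" + db.path("databaseIp").asText()
                + ":" + db.path("databasePort").asText()
                + "/" + db.path("databaseName").asText();
        System.out.println("target database: " + jdbcUrl);
        System.out.println("input queries:   " + conf.path("inputDirectory").asText());
        System.out.println("solver output:   " + conf.path("outputDirectory").asText());
    }
}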
The command for executing the computation phase is:
java -jar multiStringMatching-${version}.jar solve -c conf/tool.json -t ${thread number} -e ${computation error allowed} -s ${scale error}
The specific parameters are shown below:
-t, --The number of threads used by the solver.
-e, --The maximum allowable error $\epsilon$ (corresponding to optimization method 2).
-s, --The parameter $\rho$ for scaling the value range (corresponding to optimization method 2).
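For example, with four solver threads, a maximum allowable error of 0.01, and a scaling parameter of 0.1 (these values, like the version number, are purely illustrative), the invocation looks like:

java -jar multiStringMatching-1.0.jar solve -c conf/tool.json -t 4 -e 0.01 -s 0.1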
The main task of the data generation phase is to generate simulated data and simulated queries based on the results obtained through solver computation.
The command for executing the data generation phase is:
java -jar multiStringMatching-${version}.jar generate -c ${outputDirectory} -d ${dataDirectory}
The specific parameters are shown below:
-c, --The execution result storage directory (outputDirectory).
-d, --The directory where the simulated data is stored (dataDirectory).
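For example, using the outputDirectory and dataDirectory values from the sample tool.json above (the version number is again illustrative):

java -jar multiStringMatching-1.0.jar generate -c conf/output.txt -d conf/data.txt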
After the generation phase is complete, a script can be used to build the simulated database.
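The format of the generated data files is not specified here, so the following is only a hypothetical sketch: assuming the generation phase writes one CSV file per table and the target is the PostgreSQL instance from the sample tool.json, each table could be loaded with psql's \copy command, for example:

psql -h 127.0.0.1 -p 5432 -U postgres -d tpch1 -c "\copy orders FROM 'orders.csv' WITH (FORMAT csv)"

The table and file name (orders, orders.csv) are illustrative; one such command would be issued per generated table.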