
inputChecker

Alex Bettinardi edited this page Jan 8, 2020 · 46 revisions

Input Checker Tool

Background

Activity-based travel models rely on data from a variety of sources (zonal data, highway networks, transit networks, synthetic population, etc.). A problem in any of these inputs can affect the accuracy of model outputs or cause run-time errors during the model run. It is very important that the analyst carefully prepare and review all inputs before running the model. However, even with the best of efforts, errors in input data sometimes remain undetected. To aid the analyst in the input checking process, an automated Input Checker Tool was developed for use with the ABM. The following sections describe the setup and application of this tool.

inputChecker Implementation

The Input Checker Tool (inputChecker) was implemented in Python and makes heavy use of the pandas and numpy packages. The main inputs to inputChecker are a list of ABM input tables, a list of QA/QC checks to be performed on those tables, and the actual ABM inputs in CSV format. All CSV inputs are read as pandas DataFrames (2-dimensional data tables). The input checks are specified by the user as pandas expressions, which inputChecker evaluates on the input DataFrames. inputChecker generates a LOG file summarizing the results of all input checks.

The inputChecker setup is described in the table below:

| Directory/File | Description |
| --- | --- |
| config directory | Contains the list of inputs, the list of checks, and a settings file |
| inputs directory | All inputs specified in the inputs list are exported or copied to this directory |
| logs directory | Log and summary files from different runs are written to this directory |
| scripts directory | Contains the main inputChecker Python script |
| RunInputChecker.bat | The batch file used to run inputChecker |

The RunInputChecker.bat DOS batch file is called by the RunModel.bat DOS batch file to run inputChecker at the beginning of each ABM run. The user can also launch inputChecker independently by simply double-clicking the RunInputChecker.bat DOS batch file. However, the inputChecker working directory must be inside the ABM working directory so that inputs can be read from the appropriate input sub-directories.

Process Overview

inputChecker executes the following steps:

1. Read Inputs:

First, inputChecker reads all the inputs specified in the list of inputs and copies them to the inputChecker/inputs directory. Once all inputs are assembled there, they are loaded as pandas DataFrames.
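The loading step can be sketched roughly as follows. This is a minimal illustration, not the actual implementation; the file contents, table names, and column names below are hypothetical stand-ins for config/inputs_list.csv and the CSVs assembled in inputChecker/inputs:

```python
import io

import pandas as pd

# Hypothetical stand-in for config/inputs_list.csv (tokens per the table below)
inputs_list = pd.read_csv(io.StringIO(
    "Table,Directory,Input_ID_Column\n"
    "households,inputs,hhid\n"
))

# Stands in for the CSV files copied to inputChecker/inputs
csv_files = {
    "households": io.StringIO("hhid,np\n1,2\n2,4\n"),
}

# Load every listed input as a pandas DataFrame keyed by its Table name
tables = {row.Table: pd.read_csv(csv_files[row.Table])
          for row in inputs_list.itertuples()}
```

The resulting dictionary of DataFrames is what the subsequent check expressions operate on.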

2. Run Checks

Next, the list of input checks is read. inputChecker loops through the list and evaluates each check. The result of each check is sent to the logging module. The user must specify the severity level of each check as Fatal, Logical, or Warning.
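A minimal sketch of how a single check row might be evaluated; the table contents and the expression are hypothetical, and the actual tool's internals may differ:

```python
import pandas as pd

# Hypothetical loaded input table
households = pd.DataFrame({"hhid": [1, 2, 3], "np": [2, 0, 4]})

# One row of checks_list.csv, reduced to the field used here (value hypothetical)
expression = "households.np > 0"

# inputChecker evaluates the expression string against the loaded DataFrames;
# the result is then handed to the logging module
result = eval(expression, {"households": households})
```

Here the second household fails the check, so the logging module would record a failure at the check's specified severity level.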

3. Run Self Diagnostics

Besides the checks specified by the user, inputChecker also performs self-diagnostics to check for missing values in inputs. The severity level for the automated missing value checks is set via the config/settings.csv file.
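A missing-value diagnostic along these lines can be sketched with pandas; the table contents are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical input table with one missing value
households = pd.DataFrame({"hhid": [1, 2, 3], "np": [2.0, np.nan, 4.0]})

# Self-diagnostic: flag any column containing missing values, and collect
# the IDs of the offending rows
cols_with_missing = households.columns[households.isna().any()].tolist()
failed_ids = households.loc[households["np"].isna(), "hhid"].tolist()
```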

4. Generate LOG File and Return Error Status

The final step is to generate the inputChecker log file, which includes the results of all checks. Failed checks are listed first, ordered by the severity level specified for each test. A summary of inputChecker results is also generated; it is read by the RunModel.bat DOS batch file to produce a reminder message for the user at the end of the SOABM run. An appropriate exit code is returned depending on the outcome of the inputChecker run. The table below describes the possible outcomes and the associated exit codes:

| inputChecker End State | Exit Code |
| --- | --- |
| inputChecker ran successfully with no fatal check failures | 0 |
| inputChecker did not run successfully due to errors | 1 |
| inputChecker ran successfully with at least one fatal check failure | 2 |

With a return code of 0, the RunModel.bat DOS batch file resumes the SOABM run, and a reminder message is generated at the end to check the inputChecker log file. If inputChecker errors out (return code 1), the model run is aborted. If inputChecker completes with at least one fatal check failure (return code 2), the RunModel.bat DOS batch file aborts the SOABM run and the user is directed to check the inputChecker log file.
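The exit-code mapping can be summarized in a small sketch; the function name is illustrative and not part of the actual tool:

```python
def input_checker_exit_code(ran_ok: bool, n_fatal_fails: int) -> int:
    """Map an inputChecker outcome to the exit codes described above."""
    if not ran_ok:
        return 1                      # errored out: RunModel.bat aborts the run
    return 2 if n_fatal_fails else 0  # fatal failures also abort; 0 resumes
```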

Configuring inputChecker

Configuring inputChecker involves specifying both the inputs and the checks to be performed on them. This section describes the configuration details of the two configuration files: config/inputs_list.csv and config/checks_list.csv.

Specifying inputs

Inputs on which QA/QC checks are to be performed are specified in the config/inputs_list.csv file. Each row in inputs_list.csv represents an ABM input. The attributes that the user must specify for each input are described in the table below:

| Attribute | Description |
| --- | --- |
| Table | The name of the input table. The inputs are loaded into inputChecker memory as DataFrames under this name. For CSV inputs, this must match the CSV file name. |
| Directory | The location of the CSV input file: the SOABM inputs directory or the SOABM uec directory |
| Visum_Object | The name of the Visum object whose attributes must be exported. Must be specified as 'NA' for CSV inputs |
| Input_ID_Column | The name of the unique ID column. inputChecker creates an ID column with the specified name if the column is missing from the input table |
| Fields | The list of attributes to be exported from the Visum network object. All fields are read for CSV inputs |
| Column_Map | A column map can be specified if some columns must be renamed for easy reference |
| Input_Description | The description of the input file |

All inputs must be in CSV format. Some ABM inputs may not be available in CSV format; in particular, network-related inputs are usually embedded in a transportation modeling software database. For the Visum-based SOABM, the Visum version file, SOABM.ver, contains the zone system geography, all zonal attributes, the highway network, and the transit network. The export_csv module of inputChecker loads the model version file and exports attributes of the specified Visum network objects to the inputChecker/inputs directory in CSV format. inputChecker assumes that the model version file exists within the input sub-directory of the SOABM working directory. The name of the version file is specified in the inputChecker/config/settings.csv file next to the input_version_file token.

The user must specify each input either as a Visum object (e.g., Visum.Net.Links) or a CSV file in the inputs or uec sub-directories. The CSV inputs are copied from the specified sub-directory to the inputChecker/inputs directory. Columns are renamed as per user specification, and an ID column is generated if not specified.

The user has the option to comment out inputs that should not be loaded. To comment out a line in inputs_list.csv, add a "#" in front of the table name. All inputs whose table name starts with a "#" are ignored by inputChecker.
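The comment filter might look like the following sketch; the file contents are hypothetical:

```python
import io

import pandas as pd

# Hypothetical inputs_list.csv with one commented-out row
inputs_list = pd.read_csv(io.StringIO(
    "Table,Directory\n"
    "households,inputs\n"
    "#persons,inputs\n"
))

# Ignore any input whose table name starts with "#"
active = inputs_list[~inputs_list["Table"].str.startswith("#")]
```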

Specifying checks

The QA/QC checks to be performed on the ABM inputs are specified in the config/checks_list.csv file. Each row in checks_list.csv represents a specific operation to be performed on a specific input listed in inputs_list.csv. The operations are evaluated in the same order as they are listed in checks_list.csv. Each operation is classified as a Test or a Calculation. For Test operations, the pandas expression is evaluated and the result is sent to the logging module of inputChecker for logging. For Calculation operations, the pandas expression is evaluated and the result is stored as a Python object to be referenced by subsequent operations. The table below describes the various tokens that the user must specify for each Test or Calculation operation:

| Attribute | Description |
| --- | --- |
| Test | The name of the QA/QC check. Check results are referenced using this name in the log file. For Calculation operations, this becomes the name of the resulting object |
| Input_Table | The name of the input table on which the check is to be performed. Must match the name specified under the Table token in inputs_list.csv |
| ID_Column | The unique ID column name. Must match the name specified under the Input_ID_Column token in inputs_list.csv |
| Severity | The severity level of the test: Fatal, Logical or Warning |
| Type | The type of operation: Test or Calculation |
| Expression | The pandas expression to be evaluated |
| Test_Vals | A comma-separated list of values on which the test is to be repeated. The test for each value is logged separately |
| Report_Statistic | Any additional statistic from the test that must be reported to the log file |
| Test_Description | The description of the check being performed |
Severity levels

An important step in specifying checks is assigning a severity level to each check. inputChecker allows the user to assign one of three severity levels to each QA/QC check: Fatal, Logical, or Warning. Careful thought must be given when assigning a severity level to each check. Some general principles to help decide the severity level of a check are described below:

Fatal

If inputChecker fails a fatal check, it returns an exit code of 2 to the main ABM procedure, causing the ABM run to halt. Therefore, the analyst should assign a severity level of Fatal only to checks that must pass in order to proceed with a model run.

Logical

The failure of these checks indicates logical inconsistencies in the inputs. With logical errors in inputs, the ABM outputs may not be very meaningful.

Warnings

The failure of warning checks indicates issues in the input data that are not significant enough to cause a run-time error or to affect model outputs. However, these checks might reveal other problems related to data processing or data quality.

Expressions

At the heart of an input data check is the pandas expression that is evaluated on an input data table. Each Test expression must evaluate to a single logical value (TRUE or FALSE) or a vector of logical values. Therefore, the Test expression must be a logical test. For most applications, this involves creating logical relationships such as equalities, inequalities and ranges using standard logical operators (AND, OR, EQUAL, GREATER THAN, LESS THAN, IN, etc.). The length of the result vector must be equal to the length of the input on which the check was performed. The result of a Calculation expression can be any Python data type to be used by a subsequent expression.

The success or failure of a check is decided based on the test result. In case of a single value result, the check fails if the result is FALSE. In case of a vector result, the test is declared as failed if any value in the vector is FALSE. Therefore, the expression must be designed to evaluate to TRUE if there are no problems in the input data.
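This pass/fail rule can be sketched as follows; the helper name is illustrative, not part of the actual tool:

```python
import pandas as pd

def check_passed(result) -> bool:
    """Scalar results pass if True; vector results pass only if all values are True."""
    if isinstance(result, pd.Series):
        return bool(result.all())
    return bool(result)
```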

Conventions for writing expressions

Rules and conventions for writing inputChecker expressions are summarized below:

  • Each expression must be a valid Python/pandas expression
  • Expressions must be designed to evaluate to FALSE to indicate any errors in the data
  • Each expression must evaluate to logical value(s)
  • Each expression must be applied to a valid input table specified in inputs_list.csv or make use of intermediate tables created by preceding Calculation expressions
  • Expressions must use the same table names as specified in inputs_list.csv or the Test name of the Calculation object
  • Expressions must use the same field names as specified in inputs_list.csv. If a column map was specified, then the new names must be used
  • Expressions can be looped over a list of Test_Vals to reduce the number of expressions
  • The Report_Statistic must also be a valid Python/pandas expression and must evaluate to a single numeric value
  • Expressions can be commented out by adding a "#" in front of the Test name. All checks whose test name starts with a "#" are ignored by inputChecker
Example expressions

Below are some example expressions for different types of checks

Data completeness checks

Check if household income field exists in the input synthetic population
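The expression for this check might look like the sketch below; the income field name hhincome is an assumption, not necessarily the actual SOABM column name:

```python
import pandas as pd

# 'hhincome' is a hypothetical column name for household income
households = pd.DataFrame({"hhid": [1], "hhincome": [52000]})
result = "hhincome" in households.columns
```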

To perform this check for multiple fields, write the expression once and specify the list of field names under the Test_Vals token (comma separated):
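The repeated test might behave like this sketch; the field names in test_vals are hypothetical, and the actual Test_Vals substitution mechanism is handled internally by inputChecker:

```python
import pandas as pd

households = pd.DataFrame({"hhid": [1], "hhincome": [52000], "np": [2]})

# Hypothetical Test_Vals entries; the existence test is repeated once per
# listed field name, and each result is logged separately
test_vals = ["hhincome", "np", "nwrkrs"]
results = {field: field in households.columns for field in test_vals}
```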

Boundary checks

Check if household size ('np') is greater than zero for each household

households.np>0
Predefined value checks

Check if each person's occupation code ('occp') matches the pre-defined occupation codes

persons.occp.apply(lambda x: True if x in [1,2,3,4,5,6,999] else False)

It is possible that all person records pass the above test, but one of the occupation codes may have no person records at all. To check for such cases, the following expression can be used:

set(persons.occp)=={1,2,3,4,5,6,999}
Consistency checks

Check if total employment across occupation categories sums to total employment for each MAZ. Since this may result in a complex expression, it can be done in two steps. First, employment across all occupation types is summed using a Calculation expression:

maz_data[[col for col in maz_data if (col.startswith('EMP')) and not (col.endswith('TOTAL'))]].sum(axis=1)

The result of the above expression is a MAZ-level vector, maz_total_employment. Next, the total employment field can be compared against maz_total_employment:

maz_data.EMP_TOTAL==maz_total_employment
Order checks

Check if household IDs start from 1 and are sequential

(min(households.hhid)==1) & (max(households.hhid)==len(set(households.hhid)))
Logical checks

To ensure that ABM outputs are meaningful, it is important to perform logical checks on input data. One such check is to compare the number of workers against the available jobs in each industry. While they may not match exactly, the difference must not exceed 10%. For this check, the number of workers and jobs by industry type must first be calculated. This can be achieved by a series of Calculation operations.
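These Calculation steps might look like the following sketch; the table contents and column names (occp, EMP_OCC*) are illustrative stand-ins, with the jobs expression following the EMP column pattern used in the consistency check above:

```python
import pandas as pd

# Hypothetical stand-ins for the persons and MAZ input tables
persons = pd.DataFrame({"perid": [1, 2, 3, 4], "occp": [1, 2, 1, 3]})
maz_data = pd.DataFrame({"EMP_OCC1": [5, 3], "EMP_OCC2": [2, 1],
                         "EMP_OCC3": [1, 0], "EMP_TOTAL": [8, 4]})

# Calculation 1: workers by occupation type (indexed by occp code, 1..n)
person_occ_workers = persons.groupby("occp").size()

# Calculation 2: regional jobs by occupation (indexed by array position, 0..n-1)
maz_occ_jobs = maz_data[[c for c in maz_data
                         if c.startswith("EMP") and not c.endswith("TOTAL")]].sum()
```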

Next, the check can be performed for each industry type separately. Note that in the above example the indexing between the two arrays is off by one: the maz_occ_jobs array is indexed on array position (starting from 0), whereas the person_occ_workers array is indexed on occupation type code, which runs from 1 to 6. Consistent indexing should be used wherever possible to avoid coding errors.
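The 10% comparison for one occupation type can be sketched as follows, with hypothetical tallies standing in for the preceding Calculation results:

```python
import pandas as pd

# Hypothetical jobs/workers tallies. maz_occ_jobs is indexed by array position
# (0..n-1), while person_occ_workers is keyed by occupation code (1..n),
# hence the off-by-one indexing.
maz_occ_jobs = pd.Series([100.0, 60.0])
person_occ_workers = pd.Series([95.0, 57.0], index=[1, 2])

# Jobs and workers for occupation type 1 must agree within 10%
result = (abs(maz_occ_jobs.iloc[0] - person_occ_workers[1])
          / person_occ_workers[1] <= 0.10)
```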

In addition to the result of this test, an analyst might be interested in knowing the actual ratio of jobs to workers. Therefore, a Report_Statistic can be specified for this test, such as maz_occ_jobs[0]/person_occ_workers[1].

Network checks

While most of the above checks apply to link and node level attributes, some checks might be unique to some other network objects such as transit routes. In Visum, the transit line route names must be unique. This requires performing a check on transit line route data as follows:

len(set(lineroute_data.NAME)) == len(lineroute_data.NAME)

The design of network level checks will depend on the transportation modeling software being used.

Running inputChecker

inputChecker is launched by the RunModel.bat DOS batch file. The user also has an option to run inputChecker independent of the ABM run. In order to run inputChecker by itself, run the inputChecker/RunInputChecker.bat file.

Analyzing inputChecker Log

The final output from inputChecker is a log file, which is written to the inputChecker/logs directory. The log file is named inputCheckerLog[RUN_DATE].LOG and can be opened in any text editor. The results of all checks are summarized in this log file. The following sections describe the organization and details of the log file.

Organization

The log file summarizes results from all checks. However, the order in which they are presented depends upon the severity level and the output of the check. inputChecker organizes the check results under the following headings:

  • IMMEDIATE ACTION REQUIRED: All failed FATAL checks are logged under this heading
  • ACTION REQUIRED: All failed LOGICAL checks are logged under this heading
  • WARNINGS: All failed WARNING checks are logged under this heading
  • LOG OF ALL PASSED CHECKS: A complete LOG of all passed checks
  • MISSING VALUE DIAGNOSTICS ON ALL INPUTS: All failed missing value self-diagnostics tests are logged under this section

Check LOG

A standard check log is generated for each check. The table below shows the elements of a check LOG:

| Attribute | Description |
| --- | --- |
| Input File Name | The name of the input file on which the check was evaluated |
| Input File Location | Path to the location of the input file |
| Visum Object | The name of the Visum object, if applicable |
| Input Description | The description of the input as specified in inputs_list.csv |
| Test Name | The name of the test as specified in checks_list.csv |
| Test Description | The description of the test |
| Test Severity | The severity level of the test |
| TEST RESULT | The result of the test: PASSED or FAILED |
| TEST results for Test_Vals | The test result for each Test_Vals entry on which the test was repeated |
| Test Statistics | The value of the expression specified under the Report_Statistic token of checks_list.csv. The first 25 values are printed for a vector result |
| ID Column | The name of the unique ID column of the input data table |
| List of failed IDs | The first 25 IDs for which the test failed. Generated for a vector result |
| Number of failures | The total number of failures for a vector result |

Summary file

In addition to the log file, inputChecker also produces a text file (inputCheckerSummary.txt) containing a summary of the number of inputChecker failures by severity level. This file is read by the main ABM batch script to present the summary at the end of the model run.
