
ricci-colasanti/SyntheticPopulationPapers


SyntheticPopulationsPapers

Synthetic Population Papers

This is a list of interesting papers on generating synthetic populations. For a more detailed exploration of some of the methods, please visit our webpage: Synthetic Population Generation.

Alfonso_Comparison of Iterative Proportional Fitting and Simulated Annealing as synthetic population generation techniques: Importance of the rounding method

The text discusses the use of synthetic population generation techniques to model individual-level decision-making and interactions, as actual population data is often unavailable due to confidentiality and cost constraints. Two common techniques are Iterative Proportional Fitting (IPF) coupled with integerization and Simulated Annealing (SA).

A methodology is presented to compare these techniques using statistical analysis, evaluating performance through the percentage classification error against reference populations. Two case studies are analyzed: the 2001 census microdata in Andalusia (Spain) and the 2000 Swiss Public Use Sample, each including socio-demographic attributes and geographic location.

Results show that IPF with marginal distributions-controlled rounding outperforms SA in all scenarios, highlighting the importance of the integerization process in IPF. However, SA generally outperforms IPF with commonly used Monte Carlo rounding.

Chapuis_22_Generation of Synthetic Populations - review

Generation of Synthetic Populations in Social Simulations: A Review of Methods and Practices

The paper reviews advanced methodologies and theories for creating realistic synthetic populations for agent-based simulation models, focusing on social simulations. It emphasizes the importance of data on social agents' attributes, which are crucial for model outcomes but often challenging to collect and integrate into simulations. The review identifies discrepancies between theoretical approaches and practical applications in the field. By analyzing work published in the Journal of Artificial Societies and Social Simulation (JASSS) from 2011 to 2021, the paper highlights the challenges in bridging the gap between theory and practice. It concludes with several recommendations to help modelers adopt best practices for generating synthetic populations.

Choupani_16_Pop synth us IPF - A rev and future res

Population synthesis using iterative proportional fitting (IPF): A review and future research

Review concludes that integerisation and treatment of zero-value cells are pressing extant issues in IPF.

Figueira_22_Survey on Synthetic Data Generation, Evaluation Methods and GANs

The abstract presents a survey that combines the topics of synthetic data generation and Generative Adversarial Networks (GANs), aiming to serve as a comprehensive starting point for new researchers in the field. The survey reviews state-of-the-art methods by analyzing data from four major databases: Web of Science (WoS), Scopus, IEEE Xplore, and ACM Digital Library. Key insights include identifying relevant authors, journals, cited papers, research areas, institutions, and GAN architectures. The review covers common training problems, significant breakthroughs, and GAN architectures for tabular data. It also discusses algorithms for generating synthetic data, their applications, and methods for evaluating the quality of synthetic data, with a focus on tabular data. The survey provides a schematic overview of the presented information to offer a clear and structured understanding of the field.

Harland_12_Creating Realistic Synthetic Populations

Creating Realistic Synthetic Populations at Varying Spatial Scales: A Comparative Critique of Population Synthesis Techniques

The text discusses various methodologies for generating synthetic populations, including deterministic reweighting, conditional probability (Monte Carlo simulation), and simulated annealing. Each method has its limitations, such as the geographical scale it can be applied to or the number of real population characteristics it can replicate. The research evaluates the performance of these methods across different spatial scales and finds that the simulated annealing algorithm produces the most consistent and accurate synthetic populations across all scales. The relative strengths and weaknesses of each method are further discussed.
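To make the simulated annealing idea concrete, here is a minimal Python sketch of a combinatorial-optimisation loop of this kind. It is not the algorithm from the paper: the toy sample, the target counts, the swap move, the geometric cooling schedule and every parameter value (`n_steps`, `t_start`, `cooling`) are assumptions chosen for illustration only.

```python
import numpy as np

def total_absolute_error(selection, sample, targets):
    """Sum of |synthetic count - target count| over every constraint category."""
    error = 0
    for column, target in enumerate(targets):
        counts = np.bincount(sample[selection, column], minlength=len(target))
        error += np.abs(counts - target).sum()
    return int(error)

def anneal(sample, targets, n_people, n_steps=5000, t_start=5.0, cooling=0.999, seed=0):
    """Pick n_people rows of `sample` (with replacement) to match the targets."""
    rng = np.random.default_rng(seed)
    current = rng.integers(0, len(sample), size=n_people)
    current_err = total_absolute_error(current, sample, targets)
    best, best_err = current.copy(), current_err
    temperature = t_start
    for _ in range(n_steps):
        if best_err == 0:
            break
        # Propose a move: swap one selected individual for a random one.
        candidate = current.copy()
        candidate[rng.integers(n_people)] = rng.integers(len(sample))
        cand_err = total_absolute_error(candidate, sample, targets)
        delta = cand_err - current_err
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature falls.
        if delta <= 0 or rng.random() < np.exp(-delta / temperature):
            current, current_err = candidate, cand_err
            if current_err < best_err:
                best, best_err = current.copy(), current_err
        temperature *= cooling  # assumed geometric cooling schedule
    return best, best_err

# Hypothetical sample: column 0 is an age band (0-2), column 1 is sex (0-1).
sample = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]])
# Hypothetical small-area constraints: counts per age band and per sex.
targets = [np.array([4, 3, 3]), np.array([5, 5])]
best, best_err = anneal(sample, targets, n_people=10)
print(best, best_err)
```

Accepting occasional worse moves is what distinguishes this from plain hill climbing, which can stall in a local optimum.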

Kim_A Reproducibility Analysis of Synthetic Population Generation

This paper focuses on developing a synthetic population generation method for agent-based traffic simulation using the Simulated Annealing (SA) algorithm. Accurate population synthesis is essential for reliable simulation outcomes, but existing parameter estimation rules for SA are lacking, and finding optimal solutions can be time-consuming. The study incorporates hill climbing and a cooling schedule into the SA approach, using the Metropolis-Hastings algorithm to manage follow-up distributions and calculating total absolute error to avoid hill climbing issues. Scenario analysis was conducted to assess the stability of the algorithm based on iteration and temperature settings. The results were compared with the Iterative Proportional Fitting (IPF) method using household travel diary and census data to evaluate the effectiveness of the proposed algorithm for creating synthetic populations.

Lomax_16_Est pop attr val in table - Get Me Started in IPF

Estimating Population Attribute Values in a Table: “Get Me Started in” Iterative Proportional Fitting

A practical description of IPF, with lots of case studies, including a worked example of the iteration process. Also uses the R package mipfp in an example (Case Study 3), which uses three variables (age, ethnicity and health status).
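For readers who want to see the iteration process in code, the following is a minimal two-dimensional IPF sketch in Python with NumPy. It is not taken from the paper: the seed table and target totals are made-up numbers, and the function name `ipf_2d` and its parameters are hypothetical. It assumes a strictly positive seed and row and column targets with the same grand total.

```python
import numpy as np

def ipf_2d(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Fit a 2-D seed table to row and column totals by alternate scaling."""
    table = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Scale each row so that the row sums match the row targets.
        table *= (row_targets / table.sum(axis=1))[:, None]
        # Scale each column so that the column sums match the column targets.
        table *= col_targets / table.sum(axis=0)
        # Columns now match exactly; stop once the rows also match.
        if np.abs(table.sum(axis=1) - row_targets).max() < tol:
            break
    return table

# Made-up seed (e.g. a cross-tabulated sample) and made-up marginal totals.
seed = np.array([[1.0, 2.0, 1.0],
                 [3.0, 5.0, 5.0]])
fitted = ipf_2d(seed, row_targets=[40, 60], col_targets=[30, 45, 25])
print(fitted.round(2))
print(fitted.sum(axis=1), fitted.sum(axis=0))
```

The fitted cells are generally fractional, which is why integerisation is a separate step in the literature listed here.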

Lomax_22_Microsimulation - chapter in Rey and Franklin

This chapter offers an overview of microsimulation techniques used to estimate the distribution and evolution of phenomena over time and across different population groups. The discussion is divided into three main research areas:

  1. Generation of Synthetic Population Data: This involves creating detailed, attribute-rich individual-level population data. These data are used to assess distributional differences or as inputs for other models.

  2. Static Models: These models evaluate short-term changes in a system, such as the immediate effects of policy changes on the distribution of outcomes.

  3. Dynamic Models: These models incorporate time to assess the long-term impacts of changes on individuals as they transition through a system of interest.

When individuals in these models have geographical identifiers, the methods are often referred to as spatial microsimulation, allowing for comparisons across different areas.

Lovelace_13_Truncate replicate sample - A meth for creat int weights for spat MS

‘Truncate, replicate, sample’: A method for creating integer weights for spatial microsimulation

Describes a method for ensuring that the population produced by synthesis via IPF has integer weights, i.e. corresponds to whole individuals, which improves on other, existing methods. Mentioned in Lovelace_18 and implemented in R there; also implemented in Python at our repository here.
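As a rough illustration of the three steps named in the title, here is a short Python sketch. It reflects one reading of "truncate, replicate, sample" and is neither the paper's R code nor the repository's Python implementation; the function name `trs_integerise` and the example weights are hypothetical.

```python
import numpy as np

def trs_integerise(weights, rng=None):
    """Turn fractional weights into integer weights with the same total."""
    rng = np.random.default_rng() if rng is None else rng
    weights = np.asarray(weights, dtype=float)
    # Truncate: keep the integer part of each weight.
    integer_weights = np.floor(weights).astype(int)
    remainders = weights - integer_weights
    # Number of individuals still missing after truncation.
    deficit = int(round(weights.sum())) - int(integer_weights.sum())
    if deficit > 0:
        # Sample: top up without replacement, with probability proportional
        # to the fractional remainder of each weight.
        chosen = rng.choice(len(weights), size=deficit, replace=False,
                            p=remainders / remainders.sum())
        integer_weights[chosen] += 1
    return integer_weights

weights = np.array([1.7, 0.4, 2.3, 0.6])  # made-up IPF weights for one zone
ints = trs_integerise(weights, rng=np.random.default_rng(42))
# Replicate: expand integer weights into a list of individual indices.
individuals = np.repeat(np.arange(len(weights)), ints)
print(ints, individuals)
```

Each integer weight ends up equal to the truncated weight or one more, and the zone total is preserved.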

Lovelace_15_Eval perf IPF for spat MS

Evaluating the Performance of Iterative Proportional Fitting for Spatial Microsimulation: New Tests for an Established Technique

Intended as a practical guide for users, to avoid "starting from scratch". Multiple examples are given; experiments investigate issues such as modification of initial weights, treatment of empty cells and integerisation. The authors find that initial weights have little influence on results, whereas removing empty cells has the greatest effect of all, especially for a large number of constraints and small datasets, which "could potentially cause problems" for small-area applications.

Lovelace_18_Spatial Microsimulation with R

Available online here. Lots of examples of population synthesis, including the SimpleWorld dataset, plus implementation of IPF via multiple R packages (ipfp and mipfp). Contains a very accessible description of the motivation and application of population synthesis methods and is not limited to IPF.

Minh La_25_Pop syn - a prob-based review

Population synthesis: a problem-based review

A mainly narrative review (i.e. no numerical examples, etc.) with several interesting sections on the issue of reconciling multiple kinds of unit, e.g. individuals and households (see 4.3.2 "Population hierarchical connection" and 4.3.3 "Relationships amongst household members").

Norman_99_Putting IPF on researchers desk

Putting Iterative Proportional Fitting on the researcher’s desk

An introductory paper intended for users, with multiple worked UK-based examples, e.g. 1996 electoral ward population estimates and working with the electoral register.

Pritchard_12_Adv in pop synth - fitt many att per agent fitt to hh pers margins simul

Advances in population synthesis: fitting many attributes per agent and fitting to household and person margins simultaneously

The topic of particular interest in this paper is the issue of constraining at multiple levels, e.g. individual and household. The authors present detailed examples using Canadian census data and assess the results using goodness of fit.

Some points of particular relevance:

  • Multiple methods reviewed for combining different agent types (see "Synthesizing agent relationships", p. 693). Bottom-up, top-down and incidental methods are available. Any method "must produce credible groupings of persons (into families and/or households)", e.g. a mother must be older than her child; age differences between spouses must not exceed (say) 50 years.
  • N.b. the goodness-of-fit measure used is SRMSE (standardised RMS error; see Eq. 6), which might be a good all-purpose metric, although see Zaloznik (2011), 8.4 "Measuring goodness-of-fit" for alternatives.
  • Method overview in Figure 2; uses Monte Carlo.
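For reference, one common way to compute SRMSE is the root-mean-square error over all table cells divided by the mean observed cell value. The Python sketch below uses that definition as an assumption; it has not been checked term by term against Eq. 6 of the paper, and the example counts are made up.

```python
import numpy as np

def srmse(observed, estimated):
    """RMS error over all cells, divided by the mean observed cell value."""
    observed = np.asarray(observed, dtype=float).ravel()
    estimated = np.asarray(estimated, dtype=float).ravel()
    rmse = np.sqrt(np.mean((estimated - observed) ** 2))
    return rmse / observed.mean()

observed = [10, 20, 30]   # made-up reference counts
estimated = [12, 18, 30]  # made-up synthetic counts
print(srmse(observed, estimated))
```

A value of zero means the two tables agree cell by cell; larger values indicate a worse fit.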

Templ_17_Sim synth complex data - r Pack simPop

Simulation of Synthetic Complex Data: The R Package simPop

Publication linked to the R package simPop, which contains the calibPop simulated annealing function, the aim of which is "to find, given a population, a combination of different households which optimally satisfy, in the sense of an acceptable error, a given table of specific known marginals"; more details here.

Zaloznik_11_IPF - theor synth pract limit

Iterative Proportional Fitting - Theoretical Synthesis and Practical Limitations.

Not uploaded to GitHub as the file is too large (100MB+), but available from ResearchGate here. In this thesis, Zaloznik goes into great detail on the history and mathematical/statistical basis of IPF, and brings together literature from multiple fields into a single synthesis.

Some important topics:

  • Very nice diagrammatic representation of contingency tables, e.g. Figure 3.2, Figure 6.2.
  • Concept of odds ratio (OR) as represented in IPF input data. This is the practical meaning of seed tables, e.g. created by cross-tabulation.
  • Relationship of IPF to log-linear models, including tau and lambda notation for elements of contingency data (see 4.1 and e.g. Equation 4.2).
  • "Uniform prior", i.e. setting initial estimates to a constant value ("usually one"), which corresponds to no association between categories. Counter-intuitively, using sample data may give poorer results than just using a uniform prior (see 6.5).
  • Review of multiple bivariate association metrics given in 8.3 - chi-squared, PRE; also goodness of fit; distance-based measures.
  • Review of various goodness of fit metrics given in 8.4: distance-based scores, Z-scores, Pearson's chi-squared and power divergence, exact permutation distributions.
  • Graph-theoretical representation of possible models in terms of combinations of constraints, see e.g. Figure 10.30.
  • Issue of empty cells, i.e. zeroes in input data. Very important to note there are two kinds of zero: (1) "structural zeroes", i.e. values that should be zero, like "children under 10 who can drive"; and (2) "sampling zeroes", i.e. those due to sampling. Several points: first, IPF cannot differentiate between types (1) and (2), and they may prevent convergence. An experiment (see Figure 11.1) in which empty cells are filled with values between 0.00005 and 1000 shows that adding a value of 1 gives the best results, whereas adding 0.00005 or 1000 gives equally bad results.
  • Another important thing to note, from the above: in another example (see Figure 11.2 and accompanying text), the author notes that the results are correct in that the constraints are matched when empty cells are filled with large values (1,000), but the results are poor in terms of classification, since "the errors only stem from the interaction between the variables being wrong" (p. 210).
  • Analysis of goodness of fit vs. sample size used for input data is given for multiple models in (e.g.) Figure 11.13.
  • Three geographical problems discussed: (1) the modifiable areal unit problem (MAUP, e.g. see 9.2.2); (2) the ecological fallacy (e.g. see 9.3.1); and (3) Simpson's paradox (e.g. see 9.3.2). How are these relevant to our problem?
  • When working with UK local authority-level data, using larger areas (regional or "supergroup") for samples often yielded better results (see e.g. and p. 234).
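The points above about odds ratios and the uniform prior can be checked numerically. The Python sketch below is an illustration under assumed numbers, not an example from the thesis: the 2x2 seed, the margins and the helper names `odds_ratio` and `ipf_2x2` are all hypothetical. It relies on the fact that IPF only multiplies whole rows and whole columns by constants, which leaves the cross-product ratio of the seed unchanged.

```python
import numpy as np

def odds_ratio(t):
    """Cross-product ratio of a 2x2 table."""
    return (t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0])

def ipf_2x2(seed, rows, cols, n_iter=200):
    """Plain IPF: alternately rescale rows and columns to the target totals."""
    t = np.asarray(seed, dtype=float).copy()
    rows = np.asarray(rows, dtype=float)
    cols = np.asarray(cols, dtype=float)
    for _ in range(n_iter):
        t *= (rows / t.sum(axis=1))[:, None]
        t *= cols / t.sum(axis=0)
    return t

rows, cols = [70, 30], [40, 60]                   # made-up target margins
sample_seed = np.array([[8.0, 2.0], [1.0, 4.0]])  # made-up seed, odds ratio 16
uniform_seed = np.ones((2, 2))                    # "uniform prior"

fit_sample = ipf_2x2(sample_seed, rows, cols)
fit_uniform = ipf_2x2(uniform_seed, rows, cols)

print(odds_ratio(sample_seed), odds_ratio(fit_sample))  # association is carried over
print(fit_uniform, odds_ratio(fit_uniform))             # no association
```

With the uniform seed the fitted table is simply the product of the margins divided by the grand total, which is what "no association between categories" means in practice.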
