
Data Structure

Quan Le edited this page Sep 5, 2017 · 1 revision

Every dataset we store must have a valid (non-missing, unique) key.
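
As a rough illustration, a key check might look like the Python sketch below. It assumes the dataset is held as a list of dictionaries. The function name `assert_valid_key` and the sample rows are hypothetical and not part of any existing tooling.

```python
def assert_valid_key(rows, key_vars):
    """Raise ValueError unless key_vars form a non-missing, unique key for rows."""
    seen = set()
    for i, row in enumerate(rows):
        key = tuple(row.get(var) for var in key_vars)
        # Treat None and empty strings as missing values (an assumption).
        if any(value is None or value == "" for value in key):
            raise ValueError(f"row {i}: missing key value in {key_vars}")
        if key in seen:
            raise ValueError(f"row {i}: duplicate key {key}")
        seen.add(key)


# Made-up rows for illustration only.
counties = [
    {"state": "AA", "county": "001", "employment": 120},
    {"state": "AA", "county": "002", "employment": 80},
]
assert_valid_key(counties, ["state", "county"])  # passes silently
```

Running a check like this every time a dataset is saved would catch a broken key before it propagates down the pipeline.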

Keep datasets normalized (containing only variables at the same logical level as the key) as far down the data pipeline as is feasible. E.g., keep state-level variables in a state file and county-level variables in a county file. Of course, at some point in the pipeline it will often be necessary to de-normalize the data (e.g., to combine both state and county-level variables in the same file for analysis).
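
The sketch below illustrates the idea in Python, under the assumption that each table is a list of dictionaries. The variable names (`min_wage`, `employment`), the values, and the helper `merge_state_onto_county` are all made up for illustration. The state and county tables stay separate, and the de-normalized table is built only at the analysis step.

```python
# Made-up values for illustration only.
states = [
    {"state": "AA", "min_wage": 7.25},
    {"state": "BB", "min_wage": 9.00},
]
counties = [
    {"state": "AA", "county": "001", "employment": 120},
    {"state": "AA", "county": "002", "employment": 80},
    {"state": "BB", "county": "001", "employment": 50},
]


def merge_state_onto_county(counties, states):
    """Many-to-one merge: attach each state's variables to its counties."""
    state_lookup = {row["state"]: row for row in states}
    merged = []
    for county in counties:
        state_row = state_lookup[county["state"]]  # KeyError if the state is absent
        merged.append({**state_row, **county})
    return merged


# De-normalize only when the analysis needs both levels in one table.
analysis = merge_state_onto_county(counties, states)
```

The state-level value is stored once in `states` and repeated only in the derived `analysis` table, so a correction to a state variable has to be made in just one place.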

One of the main features of modern programming languages (object-oriented or not) is that they allow users to define a rich array of data structures.

Data structures affect efficiency. Choosing between arrays, stacks, hash tables, binary trees, and so forth is a key part of algorithm design. These considerations are second order in much of what we do, however, because most of the statistical analysis we do naturally operates on arrays.

Data structures also affect clarity. For example, this

param = estimate_model(y1, z, x1, x2, x3, clist, plist, alg, ///
  x4, x5, verbosity)

is much harder to make sense of than this

param = estimate_model(lhs_var, rhs_vars, options)
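
One way to get the shorter signature is to bundle the settings into a single structure. Below is a minimal Python sketch of that pattern. The `EstimationOptions` fields and the body of `estimate_model` are placeholders invented for illustration, not the actual routine.

```python
from dataclasses import dataclass, field


@dataclass
class EstimationOptions:
    # Hypothetical settings; a real routine would define its own.
    algorithm: str = "newton"
    verbosity: int = 0
    constraints: list = field(default_factory=list)


def estimate_model(lhs_var, rhs_vars, options):
    """Placeholder: describes the call instead of estimating anything."""
    return {
        "lhs": lhs_var,
        "rhs": list(rhs_vars),
        "algorithm": options.algorithm,
        "verbosity": options.verbosity,
    }


options = EstimationOptions(algorithm="bfgs", verbosity=1)
param = estimate_model("y1", ["x1", "x2", "x3"], options)
```

Adding a new setting then means adding a field with a default, and existing calls keep working unchanged.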

Likewise, this

[div_est,div_all_est,multi_est] = compute_diversity_statistics...
    (Xsim,Ysim_est,input.maxnum,eq_est,simind);
[div_rand,div_all_rand,multi_rand] = compute_diversity_statistics...
    (Xsim,Ysim_rand,input.maxnum,eq_rand,simind);
[div_repshare50,div_all_repshare50,multi_repshare50] =...
     compute_diversity_statistics...
    (Xsim,Ysim_repshare50,input.maxnum,eq_repshare50,simind);

is harder to read than this

dstat.est = compute_diversity_statistics...
    (Xsim,Ysim_est,input.maxnum,eq_est,simind);
dstat.rand = compute_diversity_statistics...
    (Xsim,Ysim_rand,input.maxnum,eq_rand,simind);
dstat.repshare50 = compute_diversity_statistics...
    (Xsim,Ysim_repshare50,input.maxnum,eq_repshare50,simind);

especially as the number of parallel calls gets large.
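
When the calls differ only by scenario name, storing inputs and results in keyed structures also lets a single loop replace the repeated calls. The Python sketch below shows the pattern. The stand-in `compute_diversity_statistics`, its two arguments, and the input values are simplified assumptions, not the real function.

```python
def compute_diversity_statistics(y_sim, eq):
    """Stand-in for the real routine; returns made-up summary fields."""
    return {"div": sum(y_sim) / len(y_sim), "n_eq": len(eq)}


# Hypothetical inputs, one entry per scenario.
y_sim = {"est": [1, 2, 3], "rand": [2, 4, 6], "repshare50": [0, 0, 3]}
eq = {"est": [0], "rand": [0, 1], "repshare50": [0, 1, 2]}

# One call per scenario, with results stored under the scenario name.
dstat = {
    name: compute_diversity_statistics(y_sim[name], eq[name])
    for name in y_sim
}
```

Adding a scenario then means adding one entry to each input dictionary rather than copying another call.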

The right data structure also greatly improves clarity when creating tables in Matlab. This

tabletext = ...
[div_est, div_all_ex_est, div_all_est;...
 div_coll_p, div_all_ex_coll_p, div_all_coll_p;...
 div_rand, div_all_ex_rand, div_all_rand;...
 div_coll_ad, div_all_ex_coll_ad, div_all_coll_ad;...
 div_joint_op, div_all_ex_joint_op, div_all_joint_op;...
 div_repshare50, div_all_ex_repshare50, div_all_repshare50;...
 div_coll_rd, div_all_ex_coll_rd, div_all_coll_rd;...
 ];

is not as clear as, and much harder to maintain than, this

rows = {'est','coll_p','rand','coll_ad','joint_op','repshare50','coll_rd'};
cols = {'div','div_all_ex','div_all'};
tabletext = zeros(length(rows),length(cols));
for i = 1:length(rows)
    for j = 1:length(cols)
        tabletext(i,j) = colstat{j}.(rows{i}).(cols{j});
    end
end
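
The same pattern carries over to other languages. Here is a minimal Python sketch, assuming the statistics sit in a nested dictionary indexed first by scenario and then by statistic. The name `stats` and all of the numbers are made up for illustration.

```python
rows = ["est", "rand", "repshare50"]
cols = ["div", "div_all_ex", "div_all"]

# Made-up numbers: stats[scenario][statistic].
stats = {
    "est": {"div": 0.1, "div_all_ex": 0.2, "div_all": 0.3},
    "rand": {"div": 0.4, "div_all_ex": 0.5, "div_all": 0.6},
    "repshare50": {"div": 0.7, "div_all_ex": 0.8, "div_all": 0.9},
}

# Row and column order are controlled entirely by the two lists above.
table = [[stats[r][c] for c in cols] for r in rows]
```

Reordering or dropping a row requires editing only the `rows` list, not every cell of the table.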