Data Structure
Every dataset we store must have a valid (non-missing, unique) key.
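One way to enforce this rule is to run a small check before saving a dataset. The sketch below is only an illustration: the helper name `check_key`, the row-as-dictionary layout, and the sample records are assumptions made here, not an existing tool. It treats `None` and the empty string as missing.

```python
from collections import Counter


def check_key(rows, key):
    """Raise ValueError unless the columns in `key` are non-missing and jointly unique."""
    values = [tuple(row.get(col) for col in key) for row in rows]

    # Assumption: None and the empty string are what "missing" looks like here.
    n_missing = sum(any(v is None or v == "" for v in value) for value in values)
    if n_missing:
        raise ValueError(f"{n_missing} row(s) have a missing value in key {key}")

    duplicates = [value for value, count in Counter(values).items() if count > 1]
    if duplicates:
        raise ValueError(f"key {key} is not unique; repeated values: {duplicates}")


# Made-up county-level records, keyed by (state, county).
counties = [
    {"state": "AA", "county": "c1", "population": 100},
    {"state": "AA", "county": "c2", "population": 250},
    {"state": "BB", "county": "c1", "population": 80},
]

check_key(counties, key=("state", "county"))  # passes silently
```

Note that `county` alone would fail the check in this sample, because `c1` appears in two states; the key is the pair.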
Keep datasets normalized (containing only variables at the same logical level as the key) as far down the data pipeline as is feasible. For example, keep state-level variables in a state file and county-level variables in a county file. At some point in the pipeline it will often be necessary to de-normalize the data, e.g., to combine state- and county-level variables in the same file for analysis.
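As a rough sketch of what this looks like in practice, the example below keeps a state file and a county file separate and merges them only at the analysis step. The variable names, the values, and the helper `merge_state_onto_county` are all invented for illustration.

```python
# Normalized storage: each file holds only variables at the level of its key.
state_file = [  # key: state
    {"state": "AA", "tax_rate": 0.05},
    {"state": "BB", "tax_rate": 0.07},
]
county_file = [  # key: (state, county)
    {"state": "AA", "county": "c1", "population": 100},
    {"state": "AA", "county": "c2", "population": 250},
    {"state": "BB", "county": "c1", "population": 80},
]


def merge_state_onto_county(county_rows, state_rows):
    """De-normalize: attach state-level variables to every county row."""
    state_by_key = {row["state"]: row for row in state_rows}
    if len(state_by_key) != len(state_rows):
        raise ValueError("state file key is not unique")
    merged = []
    for county in county_rows:
        state = state_by_key[county["state"]]  # KeyError if the state is absent
        merged.append({**state, **county})
    return merged


# Only the analysis file repeats state-level values across counties.
analysis_file = merge_state_onto_county(county_file, state_file)
```

Storing `tax_rate` once per state means a correction is made in one place; the repeated copies exist only in the derived analysis file.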
One of the main features of modern programming languages (object-oriented or not) is that they allow users to define a rich array of data structures.
Data structures affect efficiency. Choosing between arrays, stacks, hash tables, binary trees, and so forth is a key part of algorithm design. These considerations are second order in much of what we do, however, because most of the statistical analysis we do naturally operates on arrays.
Data structures also affect clarity. For example, this
param = estimate_model(y1, z, x1, x2, x3, clist, plist, alg, ///
x4, x5, verbosity)
is much harder to make sense of than this
param = estimate_model(lhs_var, rhs_vars, options)
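A minimal sketch of the second form, written in Python, might bundle the settings into one structure. The field names in `EstimationOptions` and the body of `estimate_model` are placeholders assumed for illustration; the function only validates its inputs and echoes what it would estimate.

```python
from dataclasses import dataclass, field


@dataclass
class EstimationOptions:
    """Hypothetical bundle of settings that would otherwise be loose arguments."""
    algorithm: str = "newton"
    constraints: list = field(default_factory=list)
    verbosity: int = 0


def estimate_model(lhs_var, rhs_vars, options=None):
    """Stand-in for an estimator: checks inputs and reports what it received."""
    options = options or EstimationOptions()
    n_obs = len(lhs_var)
    for name, values in rhs_vars.items():
        if len(values) != n_obs:
            raise ValueError(f"{name} has {len(values)} observations, expected {n_obs}")
    if options.verbosity > 0:
        print(f"estimating with {options.algorithm} on {n_obs} observations")
    return {
        "n_obs": n_obs,
        "regressors": list(rhs_vars),
        "algorithm": options.algorithm,
    }


# Made-up data: the right-hand-side variables travel together under their names.
rhs_vars = {"x1": [1.0, 2.0, 3.0], "x2": [0.5, 0.1, 0.9]}
param = estimate_model([2.0, 4.1, 5.9], rhs_vars, EstimationOptions(algorithm="bfgs"))
```

Adding a regressor or an option changes the contents of `rhs_vars` or `EstimationOptions`, not the signature of every function that passes them along.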
Likewise, this
[div_est,div_all_est,multi_est] = compute_diversity_statistics...
(Xsim,Ysim_est,input.maxnum,eq_est,simind);
[div_rand,div_all_rand,multi_rand] = compute_diversity_statistics...
(Xsim,Ysim_rand,input.maxnum,eq_rand,simind);
[div_repshare50,div_all_repshare50,multi_repshare50] =...
compute_diversity_statistics...
(Xsim,Ysim_repshare50,input.maxnum,eq_repshare50,simind);
is harder to read than this
dstat.est = compute_diversity_statistics...
(Xsim,Ysim_est,input.maxnum,eq_est,simind);
dstat.rand = compute_diversity_statistics...
(Xsim,Ysim_rand,input.maxnum,eq_rand,simind);
dstat.repshare50 = compute_diversity_statistics...
(Xsim,Ysim_repshare50,input.maxnum,eq_repshare50,simind);
especially as the number of parallel calls gets large.
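Once the results live in one structure, the parallel calls can collapse into a loop if the inputs are keyed by the same names. The Python sketch below assumes this; `compute_diversity_statistics` here is a toy stand-in (the share of distinct values) and the simulated outcomes are made up, so it is not the function called above.

```python
def compute_diversity_statistics(y_sim):
    """Toy stand-in: share of distinct values in the simulated outcomes."""
    return {"div": len(set(y_sim)) / len(y_sim)}


# Made-up simulated outcomes, keyed by scenario name instead of variable suffix.
y_sim = {
    "est": [1, 2, 2, 3],
    "rand": [1, 1, 1, 1],
    "repshare50": [1, 2, 3, 4],
}

# One line replaces a hand-written call per scenario.
dstat = {name: compute_diversity_statistics(values) for name, values in y_sim.items()}
```

Adding a scenario then means adding one entry to `y_sim`, with no new call to write or mistype.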
Another example of how using the right data structure greatly improves clarity is when creating tables in Matlab. This
tabletext = ...
[div_est, div_all_ex_est, div_all_est;...
div_coll_p, div_all_ex_coll_p, div_all_coll_p;...
div_rand, div_all_ex_rand, div_all_rand;...
div_coll_ad, div_all_ex_coll_ad, div_all_coll_ad;...
div_joint_op, div_all_ex_joint_op, div_all_joint_op;...
div_repshare50, div_all_ex_repshare50, div_all_repshare50;...
div_coll_rd, div_all_ex_coll_rd, div_all_coll_rd;...
];
is not as clear as, and much harder to maintain than, this
rows = {'est','coll_p','rand','coll_ad','joint_op','repshare50','coll_rd'};
cols = {'div','div_all_ex','div_all'};
tabletext = zeros(length(rows),length(cols));
for i = 1:length(rows)
for j = 1:length(cols)
tabletext(i,j) = colstat{j}.(rows{i}).(cols{j});
end
end
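The same idea carries over to other languages. As a sketch in Python, with a nested dictionary standing in for the Matlab struct and numbers that are made up for illustration, the table body is generated from the row and column labels:

```python
rows = ["est", "rand", "repshare50"]
cols = ["div", "div_all_ex", "div_all"]

# Made-up statistics, stored as scenario -> statistic -> value.
dstat = {
    "est": {"div": 0.10, "div_all_ex": 0.20, "div_all": 0.30},
    "rand": {"div": 0.40, "div_all_ex": 0.50, "div_all": 0.60},
    "repshare50": {"div": 0.70, "div_all_ex": 0.80, "div_all": 0.90},
}

# The table layout is driven entirely by the two label lists.
tabletext = [[dstat[row][col] for col in cols] for row in rows]
```

Reordering or dropping a row is an edit to `rows`, and a misspelled label fails with a `KeyError` rather than silently placing the wrong number in a cell.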