[
{
"objectID": "about.html",
"href": "about.html",
"title": "About",
"section": "",
"text": "This GitHub Pages site is for technical notes for the Team Public Health blog."
},
{
"objectID": "about.html#about-drtomasaragon.github.io",
"href": "about.html#about-drtomasaragon.github.io",
"title": "About",
"section": "",
"text": "This GitHub Pages site is for technical notes for the Team Public Health blog."
},
{
"objectID": "about.html#about-tomás",
"href": "about.html#about-tomás",
"title": "About",
"section": "About Tomás",
"text": "About Tomás\nDr. Tomás Aragón serves as the director of the California Department of Public Health and the State Public Health Officer, since January 4, 2021. Prior to coming to CDPH, he was the health officer for the City and County of San Francisco and director of the public health division.\nDr. Aragón has served in public health leadership roles for more than 25 years (communicable disease controller, deputy health officer, health officer, community health and chronic disease epidemiologist), including directing a public health emergency preparedness and response research and training center at the University of California, Berkeley School of Public Health.\nAs CDPH director, striving to embody and promote the universal values of dignity, equity, compassion and belonging, he works through collaborative partnerships to mobilize communities and institutions to transform policies and systems towards a culture of equity, antiracism, healing and health for all people and our planet. As State Public Health Officer, he exercises leadership and legal authority to protect health and prevent disease and injury.\nDr. Aragón graduated from University of California, Berkeley (B.A., Molecular Biology; Dr.P.H., Epidemiology), Harvard Medical School (M.D.), and Harvard School of Public Health (M.P.H.). He received certification in Strategic Decision and Risk Management in Healthcare from Stanford University. He completed his clinical and research training at University of California, San Francisco (San Francisco General Hospital Primary Care Internal Medicine; Clinical Infectious Diseases; and Traineeship in AIDS Prevention Studies)."
},
{
"objectID": "posts/2024-07-14-simple-method-for-calculating-decision-criteria-weights/index.html",
"href": "posts/2024-07-14-simple-method-for-calculating-decision-criteria-weights/index.html",
"title": "Simple method for calculating decision criteria weights",
"section": "",
"text": "Note\n\n\n\nThe first part of this article is posted on Team Public Health at Substack."
},
{
"objectID": "posts/2024-07-14-simple-method-for-calculating-decision-criteria-weights/index.html#introduction",
"href": "posts/2024-07-14-simple-method-for-calculating-decision-criteria-weights/index.html#introduction",
"title": "Simple method for calculating decision criteria weights",
"section": "1 Introduction",
"text": "1 Introduction\nHuman decision intelligence (HDI)1 is applying ethics, science, and technology to support team and individual decisions to solve problems, achieve objectives, and improve and innovate in the face of time constraints, uncertainty, and trade-offs. Think of human decision intelligence in the same way you might think about human emotional intelligence. A foundational component of HDI is decision quality: the six requirements of good decision making. At a minimum, a decision quality checklist (DQ) (Table 1) improves the quality of decisions at any stage of problem solving. A good decision is only as strong as its weakest link.\n\n\n\nTable 1: Key questions to answer for decision quality (DQ)\n\n\n\n\n\n\n\n\n\n\nNo.\nRequirement\nKey question to ask\n\n\n\n\n1\nFrame\nAre we clear on the problem we are solving?\n\n\n2\nValues\nHave we identified what we truly want?\n\n\n3\nAlternatives\nDo we have a good set of alternatives?\n\n\n4\nInformation\nHave we gathered the relevant information?\n\n\n5\nReasoning\nHow will we evaluate alternatives to find the one that gets us the most of what we truly want?\n\n\n6\nCommitment\nAre we committed to follow through on our choice?\n\n\n\n\n\n\n\n\n\n\n\n\nTip\n\n\n\nWhat should I read to improve my decision making? Visit TeamPublicHealth at Substack.com."
},
{
"objectID": "posts/2024-07-14-simple-method-for-calculating-decision-criteria-weights/index.html#decision-making",
"href": "posts/2024-07-14-simple-method-for-calculating-decision-criteria-weights/index.html#decision-making",
"title": "Simple method for calculating decision criteria weights",
"section": "2 Decision making",
"text": "2 Decision making\nHiring an employee, selecting a contractor, or ranking a set of proposals uses a common team approach. Team members usually rate the alternatives using pre-determined criteria that have been weighted based on importance. Alternatives with high scores on the most important criteria (ie, higher weight) will be ranked at or near the top.\nTo summarize:\n\ndevelop criteria\nweight criteria\nrate alternatives using the weighted criteria\nrank the alternatives\n\nIdeally, the criteria should be weighted without any knowledge of the alternatives. This is to prevent evaluators from biasing the criteria weights in favor of their favorite alternative.\nIn this blog posting I show how to weight criteria using a simple ranking method. In a future blog post, I will show how to apply weighted criteria to rank and select alternatives. This first step, weighting criteria, is very powerful and practical. We will use a trivial example to nail down the concepts.\nNow, suppose we wish to buy a car and our choices are a Honda Civic, a Subaru Impreza, or a Toyota Corolla. We have data on the following attributes: safety (S), mileage (M), reliability (R), color (C), price (P), and resale value (V). 
Table 2 summarizes the DQ requirements for buying our car.\n\n\n\nTable 2: Decision quality requirements for buying a car\n\n\n\n\n\n\n\n\n\n\n\nNo.\nRequirement\nKey question to ask\nAnswer\n\n\n\n\n1\nFrame\nAre we clear on the problem we are solving?\nNeed personal transportation.\n\n\n2\nValues\nHave we identified what we truly want?\nCar within my budget.\n\n\n3\nAlternatives\nDo we have a good set of alternatives?\nCivic, Corolla, or Impreza\n\n\n4\nInformation\nHave we gathered the relevant information?\nColor, mileage, price, reliability, safety, resale value\n\n\n5\nReasoning\nHow will we evaluate alternatives to find the one that gets us the most of what we truly want?\nRanking algorithm using weight calculations\n\n\n6\nCommitment\nAre we committed to follow through on our choice?\nYes, my spouse approves."
},
{
"objectID": "posts/2024-07-14-simple-method-for-calculating-decision-criteria-weights/index.html#calculating-criteria-weights-the-easy-way",
"href": "posts/2024-07-14-simple-method-for-calculating-decision-criteria-weights/index.html#calculating-criteria-weights-the-easy-way",
"title": "Simple method for calculating decision criteria weights",
"section": "3 Calculating criteria weights — the easy way",
"text": "3 Calculating criteria weights — the easy way\nGroup deliberative decision-making is cognitively exhausting. So, anything you can do to make the process easier will keep team members engaged. Do not let “perfection become the enemy of the good.” The easiest way to generate criteria weights from a team of evaluators is to use a rank ordinal method [1].\nGive evaluators small pieces of paper with one criterion printed on each. If we have five criteria, they get five small pieces of paper. Have them rank the criteria from top to bottom. Once they have ranked them, they tape their ranking onto an 8.5in x 11in sheet of paper and hand it to the facilitator. This ranking is entered into the computer for analysis (see below).\n\n3.1 Ratio ordinal method in Julia\nI will demonstrate this method using the Julia language. This method can also be implemented using R, Python, or Microsoft Excel.\nFor rating the cars we have six criteria (attributes) for which we need to calculate weights:\n\nColor (C)\nMileage (M)\nPrice (P)\nReliability (R)\nSafety (S)\nValue, resale (V)\n\nWe have five evaluators who will rank the criteria based on their knowledge, experience, expertise, and wisdom. It is much better for them to rank the criteria independently and without thinking about specific cars; otherwise, they may game (bias) the weighting.\nHere are the steps:\n\nSelect a ranking method to calculate weights for a specific number of criteria; in this case, we have six criteria. We will write and use a Julia function that implements the SR method from [1].\nHave each evaluator independently rank the criteria.\nUse Julia to calculate the final criteria weights. 
We will use the split-apply-combine workflow that I introduced in a previous blog post and book review [2].\n\n\n3.1.1 Step 1: The SR method for generating criteria weights\nHere is the formula2 where \\(N\\) is the number of criteria, and \\(w_i^{SR}\\) is the weight for the \\(i^{th}\\) criterion [1].\n\\[\nw_i^{SR} = \\frac{1/i + \\frac{N+1-i}{N}} {\\sum_{j=1}^N\n \\left(1/j + \\frac{N+1-j}{N}\\right)}\n\\]\nFor this calculation I use the Julia Language. Julia is as simple to program as Python but with the speed of C++. These calculations can also be completed in R.\n\nusing DataFrames\nusing StatsBase\nfunction calculate_rank_weights(n::Int64)\n num = zeros(n)\n for i in 1:n\n num[i] = (1/i) + ( (n + 1 - i) / n )\n end\n wi = num / sum(num)\n return wi\nend\nwi = calculate_rank_weights(6)\n\n6-element Vector{Float64}:\n 0.3361344537815126\n 0.22408963585434175\n 0.1680672268907563\n 0.12605042016806722\n 0.0896358543417367\n 0.05602240896358543\n\n\nThe weights sum to 1, as expected.\n\nround(sum(wi))\n\n1.0\n\n\n\n\n3.1.2 Step 2: Evaluators rank the criteria\nFive evaluators rank the criteria based on their expertise.\n\neval1 = [\"Mileage\", \"Color\", \"Price\", \"Safety\", \"Reliability\", \"Value\"]\neval2 = [\"Mileage\", \"Color\", \"Safety\", \"Reliability\", \"Value\", \"Price\"]\neval3 = [\"Color\", \"Value\", \"Price\", \"Mileage\", \"Reliability\", \"Safety\"]\neval4 = [\"Mileage\", \"Value\", \"Color\", \"Safety\", \"Reliability\", \"Price\"]\neval5 = [\"Safety\", \"Price\", \"Color\", \"Reliability\", \"Mileage\", \"Value\"]\n\n6-element Vector{String}:\n \"Safety\"\n \"Price\"\n \"Color\"\n \"Reliability\"\n \"Mileage\"\n \"Value\"\n\n\nNext, we organize the evaluator criteria rankings and the SR method criteria weights into a data frame with three columns:\n\nevaluators (optional)\nevaluator_rankings (required)\nweights (required)\n\n\nne = 5; # number of evaluators\nnc = 6; # number of criteria\nevaluators = repeat([\"eval\" .* string.(1:ne)...], 
inner=repeat([nc]))\nevaluator_rankings = vcat( eval1, eval2, eval3, eval4, eval5 )\nweights = repeat( wi, ne )\ndf = DataFrame(\n hcat(evaluators, evaluator_rankings, weights), \n [\"evaluator\", \"criteria\", \"weight\"]\n )\n\n30×3 DataFrame5 rows omitted\n\n\n\nRow\nevaluator\ncriteria\nweight\n\n\n\nAny\nAny\nAny\n\n\n\n\n1\neval1\nMileage\n0.336134\n\n\n2\neval1\nColor\n0.22409\n\n\n3\neval1\nPrice\n0.168067\n\n\n4\neval1\nSafety\n0.12605\n\n\n5\neval1\nReliability\n0.0896359\n\n\n6\neval1\nValue\n0.0560224\n\n\n7\neval2\nMileage\n0.336134\n\n\n8\neval2\nColor\n0.22409\n\n\n9\neval2\nSafety\n0.168067\n\n\n10\neval2\nReliability\n0.12605\n\n\n11\neval2\nValue\n0.0896359\n\n\n12\neval2\nPrice\n0.0560224\n\n\n13\neval3\nColor\n0.336134\n\n\n⋮\n⋮\n⋮\n⋮\n\n\n19\neval4\nMileage\n0.336134\n\n\n20\neval4\nValue\n0.22409\n\n\n21\neval4\nColor\n0.168067\n\n\n22\neval4\nSafety\n0.12605\n\n\n23\neval4\nReliability\n0.0896359\n\n\n24\neval4\nPrice\n0.0560224\n\n\n25\neval5\nSafety\n0.336134\n\n\n26\neval5\nPrice\n0.22409\n\n\n27\neval5\nColor\n0.168067\n\n\n28\neval5\nReliability\n0.12605\n\n\n29\neval5\nMileage\n0.0896359\n\n\n30\neval5\nValue\n0.0560224\n\n\n\n\n\n\n\n\n3.1.3 Step 3: Calculate mean criteria weights\nCalculate mean criteria weights using split-apply-combine workflow (Figure 1). 
In other words, stratifying by one or more criteria, what is the mean weight for each strata (attribute)?\n\n\n\n\n\n\nFigure 1: The split-apply-combine is a common workflow in data science.\n\n\n\n\n## split\ngdf = groupby(df, :criteria)\n\nGroupedDataFrame with 6 groups based on key: criteriaFirst Group (5 rows): criteria = \"Mileage\"\n\n\n\nRow\nevaluator\ncriteria\nweight\n\n\n\nAny\nAny\nAny\n\n\n\n\n1\neval1\nMileage\n0.336134\n\n\n2\neval2\nMileage\n0.336134\n\n\n3\neval3\nMileage\n0.12605\n\n\n4\neval4\nMileage\n0.336134\n\n\n5\neval5\nMileage\n0.0896359\n\n\n\n⋮Last Group (5 rows): criteria = \"Value\"\n\n\n\nRow\nevaluator\ncriteria\nweight\n\n\n\nAny\nAny\nAny\n\n\n\n\n1\neval1\nValue\n0.0560224\n\n\n2\neval2\nValue\n0.0896359\n\n\n3\neval3\nValue\n0.22409\n\n\n4\neval4\nValue\n0.22409\n\n\n5\neval5\nValue\n0.0560224\n\n\n\n\n\n\n\n## combine and apply \ncrit_weights = combine(gdf, :weight => mean)\nsort!(crit_weights, :weight_mean, rev = true)\n\n6×2 DataFrame\n\n\n\nRow\ncriteria\nweight_mean\n\n\n\nAny\nFloat64\n\n\n\n\n1\nMileage\n0.244818\n\n\n2\nColor\n0.22409\n\n\n3\nSafety\n0.162465\n\n\n4\nPrice\n0.134454\n\n\n5\nValue\n0.129972\n\n\n6\nReliability\n0.104202\n\n\n\n\n\n\nThese are the final criteria weights and, as expected, they sum to 1.\n\nround(sum(crit_weights.weight_mean))\n\n1.0"
},
{
"objectID": "posts/2024-07-14-simple-method-for-calculating-decision-criteria-weights/index.html#appendix",
"href": "posts/2024-07-14-simple-method-for-calculating-decision-criteria-weights/index.html#appendix",
"title": "Simple method for calculating decision criteria weights",
"section": "4 Appendix",
"text": "4 Appendix\n\n4.1 Function to automate calculating mean weights\nWe created the calculate_rank_weights function to calculate the weights for \\(n\\) criteria using the SR method. Then we used Julia to calculate the mean weights for five evaluators.\nNow we create a final function to automate calculating the mean weights. This function will handle an arbitrary number of vectors with rankings, or a matrix created from those vectors beforehand. Notice that this new function will call our previous calculate_rank_weights function.\n\nfunction calculate_mean_weights(x...)\n x = hcat(x...)\n ncrit, neval = size(x, 1), size(x, 2)\n wts_i = calculate_rank_weights(ncrit)\n weights = repeat( wts_i, neval )\n evaluators = repeat( \n [\"eval\" .* string.(1:neval)...], \n inner = repeat([ncrit])\n )\n evaluator_rankings = reshape(x, ncrit * neval)\n df = DataFrame( \n hcat(evaluators, evaluator_rankings, weights), \n [\"evaluator\", \"criteria\", \"weight\"] \n )\n gdf = groupby(df, :criteria)\n mean_wts = sort!(\n combine(gdf, :weight => mean), \n :weight_mean, rev = true\n ) \n return (\n weights = mean_wts,\n data = df\n )\nend\n\ncalculate_mean_weights (generic function with 1 method)\n\n\nHere we test passing vectors as arguments to the calculate_mean_weights function.\n\nr1 = calculate_mean_weights(eval1, eval2, eval3, eval4, eval5);\n\nThe results are saved in r1 and the semi-colon (;) suppresses the output. We use the typeof function to check the type of r1.\n\ntypeof(r1)\n\n@NamedTuple{weights::DataFrame, data::DataFrame}\n\n\nWe see that r1 is a NamedTuple and it contains two data frames named weights and data. 
We can index each separately.\n\nr1.weights\n\n6×2 DataFrame\n\n\n\nRow\ncriteria\nweight_mean\n\n\n\nAny\nFloat64\n\n\n\n\n1\nMileage\n0.244818\n\n\n2\nColor\n0.22409\n\n\n3\nSafety\n0.162465\n\n\n4\nPrice\n0.134454\n\n\n5\nValue\n0.129972\n\n\n6\nReliability\n0.104202\n\n\n\n\n\n\n\nr1.data\n\n30×3 DataFrame5 rows omitted\n\n\n\nRow\nevaluator\ncriteria\nweight\n\n\n\nAny\nAny\nAny\n\n\n\n\n1\neval1\nMileage\n0.336134\n\n\n2\neval1\nColor\n0.22409\n\n\n3\neval1\nPrice\n0.168067\n\n\n4\neval1\nSafety\n0.12605\n\n\n5\neval1\nReliability\n0.0896359\n\n\n6\neval1\nValue\n0.0560224\n\n\n7\neval2\nMileage\n0.336134\n\n\n8\neval2\nColor\n0.22409\n\n\n9\neval2\nSafety\n0.168067\n\n\n10\neval2\nReliability\n0.12605\n\n\n11\neval2\nValue\n0.0896359\n\n\n12\neval2\nPrice\n0.0560224\n\n\n13\neval3\nColor\n0.336134\n\n\n⋮\n⋮\n⋮\n⋮\n\n\n19\neval4\nMileage\n0.336134\n\n\n20\neval4\nValue\n0.22409\n\n\n21\neval4\nColor\n0.168067\n\n\n22\neval4\nSafety\n0.12605\n\n\n23\neval4\nReliability\n0.0896359\n\n\n24\neval4\nPrice\n0.0560224\n\n\n25\neval5\nSafety\n0.336134\n\n\n26\neval5\nPrice\n0.22409\n\n\n27\neval5\nColor\n0.168067\n\n\n28\neval5\nReliability\n0.12605\n\n\n29\neval5\nMileage\n0.0896359\n\n\n30\neval5\nValue\n0.0560224\n\n\n\n\n\n\nNext, we create the matrix of evaluator data and then pass it to the calculate_mean_weights function.\n\neval_rankings_tab = hcat(eval1, eval2, eval3, eval4, eval5)\nr2 = calculate_mean_weights(eval_rankings_tab);\n\nThe results are saved in r2 and the semi-colon (;) suppresses the output. We use the typeof function to check the type of r2.\n\ntypeof(r2)\n\n@NamedTuple{weights::DataFrame, data::DataFrame}\n\n\nWe see that r2 is a NamedTuple and it contains two data frames named weights and data. 
We can index each separately.\n\nr2.weights\n\n6×2 DataFrame\n\n\n\nRow\ncriteria\nweight_mean\n\n\n\nAny\nFloat64\n\n\n\n\n1\nMileage\n0.244818\n\n\n2\nColor\n0.22409\n\n\n3\nSafety\n0.162465\n\n\n4\nPrice\n0.134454\n\n\n5\nValue\n0.129972\n\n\n6\nReliability\n0.104202\n\n\n\n\n\n\n\nr2.data\n\n30×3 DataFrame5 rows omitted\n\n\n\nRow\nevaluator\ncriteria\nweight\n\n\n\nAny\nAny\nAny\n\n\n\n\n1\neval1\nMileage\n0.336134\n\n\n2\neval1\nColor\n0.22409\n\n\n3\neval1\nPrice\n0.168067\n\n\n4\neval1\nSafety\n0.12605\n\n\n5\neval1\nReliability\n0.0896359\n\n\n6\neval1\nValue\n0.0560224\n\n\n7\neval2\nMileage\n0.336134\n\n\n8\neval2\nColor\n0.22409\n\n\n9\neval2\nSafety\n0.168067\n\n\n10\neval2\nReliability\n0.12605\n\n\n11\neval2\nValue\n0.0896359\n\n\n12\neval2\nPrice\n0.0560224\n\n\n13\neval3\nColor\n0.336134\n\n\n⋮\n⋮\n⋮\n⋮\n\n\n19\neval4\nMileage\n0.336134\n\n\n20\neval4\nValue\n0.22409\n\n\n21\neval4\nColor\n0.168067\n\n\n22\neval4\nSafety\n0.12605\n\n\n23\neval4\nReliability\n0.0896359\n\n\n24\neval4\nPrice\n0.0560224\n\n\n25\neval5\nSafety\n0.336134\n\n\n26\neval5\nPrice\n0.22409\n\n\n27\neval5\nColor\n0.168067\n\n\n28\neval5\nReliability\n0.12605\n\n\n29\neval5\nMileage\n0.0896359\n\n\n30\neval5\nValue\n0.0560224\n\n\n\n\n\n\nThe calculate_mean_weights function worked for both vectors or a matrix. This was possible because of the splat operator (...) as in the two lines below.\n\nfunction calculate_mean_weights(x...)\n x = hcat(x...)"
},
{
"objectID": "posts/2024-07-14-simple-method-for-calculating-decision-criteria-weights/index.html#footnotes",
"href": "posts/2024-07-14-simple-method-for-calculating-decision-criteria-weights/index.html#footnotes",
"title": "Simple method for calculating decision criteria weights",
"section": "Footnotes",
"text": "Footnotes\n\n\nI focus on human decision intelligence (HDI), in contrast to “decision intelligence” (DI). “Decision intelligence is an engineering discipline that augments data science with theory from social science, decision theory, and managerial science. Its application provides a framework for best practices in organizational decision-making and processes for applying machine learning at scale. The basic idea is that decisions are based on our understanding of how actions lead to outcomes. Decision intelligence is a discipline for analyzing this chain of cause and effect, and decision modeling is a visual language for representing these chains.”↩︎\nThe SR method was selected because it was the best performing.↩︎"
},
{
"objectID": "posts/2023-02-20-julia-for-data-analysis/index.html",
"href": "posts/2023-02-20-julia-for-data-analysis/index.html",
"title": "Julia for Data Analysis",
"section": "",
"text": "Julia is an open source, general purpose programming language for scientific computing and is well suited for data science. I taught R programming for several years at the UC Berkeley School of Public Health. I recently posted a blog entry on why I switched from R to Julia (“My Journey from R to Julia”). In this blog entry I review the book “Julia for Data Analysis” by Bogumił Kamiński [1]. In short, this is an outstanding book that I highly recommend without any reservations (5/5 stars). This review is a working blog that I will be updating with highlights from the book.\nFor data science, Julia is exploding in popularity and there are numerous outstanding resources for learning Julia, including books, videos, and blog postings. I purchased the print and eBook option, which includes an online “liveBook” that is easy to read. However, nothing beats having a book in hand to read and mark up. The author provides a GitHub repository with code and data files.\nThe author, Bogumił Kamiński, is a core developer of the DataFrames.jl package. 
He is an associate professor and head of the Decision Support and Analysis Unit at the SGH Warsaw School of Economics, as well as adjunct professor at the data science laboratory, Ryerson University, Toronto.\nThis book is perfect for population health data scientists who are already familiar with R or Python, or who already have basic proficiency with Julia but need an in-depth and systematic introduction to Julia for data science.\nHere is the table of contents.\n\nIntroduction\nGetting started with Julia\nJulia’s support for scaling projects\nWorking with collections in Julia\nAdvanced topics on handling collections\nWorking with strings\nHandling time-series data and missing values\nFirst steps with data frames\nGetting data from a data frame\nCreating data frame objects\nConverting and grouping data frames\nMutating and transforming data frames\nAdvanced transformations of data frames\nCreating web services for sharing data analysis results"
},
{
"objectID": "posts/2023-02-20-julia-for-data-analysis/index.html#overview",
"href": "posts/2023-02-20-julia-for-data-analysis/index.html#overview",
"title": "Julia for Data Analysis",
"section": "",
"text": "Julia is an open source, general purpose programming language for scientific computing and is well suited for data science. I taught R programming for several years at the UC Berkeley School of Public Health. I recently posted a blog entry on why I switched from R to Julia (“My Journey from R to Julia”). In this blog entry I review the book “Julia for Data Analysis” by Bogumił Kamiński [1]. In short, this is an outstanding book that I highly recommend without any reservations (5/5 stars). This review is a working blog that I will be updating with highlights from the book.\nFor data science, Julia is exploding in popularity and there are numerous outstanding resources for learning Julia, including books, videos, and blog postings. I purchased the print and eBook option, which includes an online “liveBook” that is easy to read. However, nothing beats having a book in hand to read and mark up. The author provides a GitHub repository with code and data files.\nThe author, Bogumił Kamiński, is a core developer of the DataFrames.jl package. 
He is an associate professor and head of the Decision Support and Analysis Unit at the SGH Warsaw School of Economics, as well as adjunct professor at the data science laboratory, Ryerson University, Toronto.\nThis book is perfect for population health data scientists who are already familiar with R or Python, or who already have basic proficiency with Julia but need an in-depth and systematic introduction to Julia for data science.\nHere is the table of contents.\n\nIntroduction\nGetting started with Julia\nJulia’s support for scaling projects\nWorking with collections in Julia\nAdvanced topics on handling collections\nWorking with strings\nHandling time-series data and missing values\nFirst steps with data frames\nGetting data from a data frame\nCreating data frame objects\nConverting and grouping data frames\nMutating and transforming data frames\nAdvanced transformations of data frames\nCreating web services for sharing data analysis results"
},
{
"objectID": "posts/2023-02-20-julia-for-data-analysis/index.html#book-highlights",
"href": "posts/2023-02-20-julia-for-data-analysis/index.html#book-highlights",
"title": "Julia for Data Analysis",
"section": "2 Book highlights",
"text": "2 Book highlights\n\n2.1 Chapter 1 Introduction\nMy blog posting “My Journey from R to Julia” is a good summary of what is covered in the Introduction. I will cover just one item, execution speed, and compare Julia to R.\nWe will construct a for loop summation of a random sequence of integers from 1 to 1,000,000,000 (1 billion) that are sampled without replacement.1 Here is the correct answer as a reference:\n\n## In Julia\njulia> sum(1:1_000_000_000)\n500000000500000000\n\n## In R\n> options(digits=20)\n> sum(1:1000000000)\n[1] 5.000000005e+17\n\nBy default, R would give the wrong answer (not shown) because it uses 64-bit floating point numbers.2 To get the correct answer we need 64-bit integers. For this I used the bit64 R package to get the correct answer (below).\n\n\n> require(\"bit64\")\n> n = 1000000000\n> samp = sample(1:n, n, replace=FALSE)\n> sum_n = function(x){\n+ s = as.integer64(0)\n+ for (i in x){\n+ s = s + i\n+ }\n+ s\n+ }\n> system.time(x <- sum_n(samp))\n user system elapsed \n11094.792 101.378 11201.145 \n> x\ninteger64\n[1] 500000000500000000\n\nTo get the correct answer in R, the execution time was about 11,095 seconds. Okay, let’s try Julia.\n\n\njulia> using StatsBase, BenchmarkTools\njulia> n = 1_000_000_000;\njulia> samp = sample(1:n, n, replace=false);\njulia> function sum_n(x)\n s = 0\n for i in x\n s = s + i\n end\n return s\n end\nsum_n (generic function with 1 method)\n\njulia> @btime sum_n(samp)\n 158.118 ms (1 allocation: 16 bytes)\n500000000500000000\n\nIn Julia, it took about 158 milliseconds. R is about 70,168 times slower than Julia!3\nConclusion: Compared to R, Julia can handle large for loops for summation and give an accurate answer fast. To add integers correctly using a for loop, R requires using the bit64 package but is about 70,168 times slower than Julia. With R, we are taught to avoid for loops; now you know why. 
However, a for loop is a workhorse tool that we want available to us; hence, this is a huge advantage of Julia.\nTo see more benchmarks visit Which programming language is fastest?.\n\n\n2.2 Chapter 2 Getting started with Julia\n\n2.2.1 Basic data types\nIf you are familiar with R or Python, you will feel comfortable with Julia. Here is a character:\n\n## character; notice single quotation marks\n'a'\n\n'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)\n\n\nA string is a collection of characters:\n\n## string; notice double quotation marks\n\"Hello World\"\n\n\"Hello World\"\n\n\nAn array is a collection in brackets. A vector is a 1-dimensional array. Commas are used to create a vector. By default, the vector is displayed vertically, but it is not a “column vector”.\n\n## vector\n[1, 2, 3]\n\n3-element Vector{Int64}:\n 1\n 2\n 3\n\n\nTo display horizontally use the show function.4\n\n## vector\nshow([1, 2, 3])\n\n[1, 2, 3]\n\n\nA matrix is a two-dimensional array. Numbers separated by spaces create one row of a matrix. This is not a vector.\n\n## matrix\n[1 2 3 4]\n\n1×4 Matrix{Int64}:\n 1 2 3 4\n\n\nSeparating matrix rows with a semicolon creates a multi-row matrix:\n\n## matrix\n[1 2 3 4; 6 7 8 9]\n\n2×4 Matrix{Int64}:\n 1 2 3 4\n 6 7 8 9\n\n\nThe size function returns the dimensions of an array. Previously, we learned that a vector is a 1-dimensional array.\n\nsize([1, 2, 3])\n\n(3,)\n\n\nWe see this vector is 1-dimensional ((3,)); in contrast,\n\nsize([1 2 3])\n\n(1, 3)\n\n\nthis is a 1×3 matrix, which is 2-dimensional.\nA tuple is a collection in parentheses, and commas separate the elements. Above, the size function returned a tuple. Tuples are immutable; their elements cannot be changed. Immutable objects increase computational speed.\n\n## tuple\n(1, 2, 3)\n\n(1, 2, 3)\n\n\nTuple elements can be named.\n\n## named tuple\n(four = 4, five = 5, six = 6)\n\n(four = 4, five = 5, six = 6)\n\n\nThe assignment operation (=) is used to bind data values to a variable. 
Binding in Julia is similar to binding in Python, but not in R.\n\nx = [1, 2]\ny = x\ny\n\n2-element Vector{Int64}:\n 1\n 2\n\n\nIf we change an element in x, the same change occurs in y because the same array is bound to both variables, x and y.\n\nx[2] = 99\n## notice that y changes also\ny\n\n2-element Vector{Int64}:\n 1\n 99\n\n\nThis is not the case in R:\n\n> x = c(1, 2)\n> y = x\n> x[2] = 99\n> x\n[1] 1 99\n> y\n[1] 1 2\n\nTo replicate the R experience in Julia we assign a copy of x to y.\n\nx = [1, 2]\ny = copy(x)\nx[2] = 99\nx\n\n2-element Vector{Int64}:\n 1\n 99\n\n\n\ny\n\n2-element Vector{Int64}:\n 1\n 2\n\n\n\n\n2.2.2 Basic functions\nIn my blog posting “My Journey from R to Julia” I demonstrated how to create a simple function to calculate the odds ratio using 3 methods with multiple dispatch.5 I will use the function that calculates the odds ratio using the cross-product of 4 integers, and build a more useful function to illustrate some features of Julia.\nFor an appropriately structured table, for example,\n\n\n\nExposure\nDisease\nNo disease\n\n\n\n\nYes\na\nb\n\n\nNo\nc\nd\n\n\n\nthe odds ratio is the cross-product ratio:\n\\[\nOR = \\frac{a d}{b c}\n\\]\nHere is a simple Julia function to calculate the odds ratio:\n\nfunction oddsratio(a, b, c, d)\n or = (a * d) / (b * c)\n return or\nend\n\noddsratio (generic function with 1 method)\n\n\nBecause this is a simple function, it can also be created in an abbreviated form:\n\noddsratio(a, b, c, d) = (a * d) / (b * c)\n\noddsratio (generic function with 1 method)\n\n\nHere are data from a case-control study [2]:\n\n\n\nExposure\nCase\nControl\n\n\n\n\nHighest\n12\n6\n\n\nLowest\n2\n29\n\n\n\nLet’s test the oddsratio function by passing four integers from our 2x2 table.\n\noddsratio(12, 6, 2, 29)\n\n29.0\n\n\nThe function arguments a, b, c, and d are called positional arguments and are always required in the correct order.\nWe now add a keyword argument which is optional. 
Keyword arguments are separated from the positional arguments by a semicolon (;). Any argument, positional or keyword, can be assigned a default value. We’ll create an oddsratio2 function to calculate a confidence interval using the Normal approximation. The keyword argument will have a default confidence level of 0.95.\n\nusing Distributions # to access standard normal distribution \nfunction oddsratio2(a, b, c, d; level = 0.95)\n zv = quantile(Normal(), 0.5*(1 + level))\n est = (a * d) / (b * c)\n log_or = log(est)\n se_log_or = sqrt((1/a) + (1/b) + (1/c) + (1/d))\n lcl = exp(log_or - zv * se_log_or)\n ucl = exp(log_or + zv * se_log_or)\n return (\n or = est, \n confint = (lcl, ucl), \n level = level\n )\nend\n\noddsratio2 (generic function with 1 method)\n\n\nBy default, oddsratio2 will calculate the 95% confidence interval:\n\noddsratio2(12, 6, 2, 29)\n\n(or = 29.0, confint = (5.110695577009899, 164.55685675804662), level = 0.95)\n\n\nI can also calculate 99% confidence intervals:\n\nresults = oddsratio2(12, 6, 2, 29; level = 0.99)\n\n(or = 29.0, confint = (2.9619778898301936, 283.93189661797715), level = 0.99)\n\n\nAnd we can index elements of the named tuple.\n\nresults.confint\n\n(2.9619778898301936, 283.93189661797715)\n\n\n\n\n2.2.3 Anonymous functions\nFunctions can be arguments to functions. 
For example, we will create a times_two function and pass it to other functions.\n\ntimes_two(x) = 2 * x\nmap(times_two, [1, 2, 8])\n\n3-element Vector{Int64}:\n 2\n 4\n 16\n\n\n\nsum(times_two, [1, 2, 8])\n\n22\n\n\n\nusing StatsBase\nmean(times_two, [1, 2, 8])\n\n7.333333333333333\n\n\nAlternatively, we can pass the same operation as an anonymous function; that is, a function without a name.\n\nmap(x -> 2 * x, [1, 2, 8])\n\n3-element Vector{Int64}:\n 2\n 4\n 16\n\n\n\nsum(x -> 2 * x, [1, 2, 8])\n\n22\n\n\n\nmean(x -> 2 * x, [1, 2, 8])\n\n7.333333333333333\n\n\nWe can even calculate the odds ratio as an anonymous function.\n\nmap((a, b, c, d) -> (a * d)/(b * c), (12, 6, 2, 29)...)\n\n29.0\n\n\nNotice I used a trick. By default, the map function applies a function to each element of a collection (eg, [1, 2, 8]). However, for the odds ratio calculation, I need to map the arguments (a, b, c, d) to their values (12, 6, 2, 29), then calculate the odds ratio. Therefore, I used the splat operator (...) to break up the collection of values so that they could be mapped to their arguments first. Although I used a tuple (12, 6, 2, 29), an array, [12, 6, 2, 29], also works.\n\nmap((a, b, c, d) -> (a * d)/(b * c), [12, 6, 2, 29]...)\n\n29.0\n\n\nThe splat operator (...) converts [12, 6, 2, 29] to 12, 6, 2, 29, which is very convenient if the vector is very large. In this case, with only four integers, I could have passed the integers without the splat operator.\n\nmap((a, b, c, d) -> (a * d)/(b * c), 12, 6, 2, 29)\n\n29.0\n\n\nChapter 1 also covers other topics including loops, conditional expressions, and scoping. Chapters 3 to 7 cover practical tools for processing and managing data in Julia. I will focus the remainder of this book review on data frames.\n\n\n\n2.3 Chapters 8–13: Working with data frames\nThe author is a lead developer of the DataFrames.jl Julia package. So in these chapters he covers this package thoroughly and you will not be disappointed. 
I will cover the following:\n\nDownloading a CSV data file from a website\nReading a CSV file into a data frame\nConducting a common analytic workflow\n\nThe data set we will use is from the National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study (NHEFS). The version of the NHEFS data set we will use is from Professor Miguel Hernán’s textbook “Causal Inference: What If.”\n\nimport Downloads\nusing CSV, DataFrames\n\nDownloads.download(\"https://www.hsph.harvard.edu/miguel-hernan/wp-content\" *\n \"/uploads/sites/1268/2019/03/nhefs.csv\", \"nhefs.csv\")\nreadlines(\"nhefs.csv\")[1:6]\n\n6-element Vector{String}:\n \"seqn,qsmk,death,yrdth,modth,dad\" ⋯ 465 bytes ⋯ \"ax71,tax82,price71_82,tax71_82\"\n \"233,0,0,,,,175,96,0,42,1,19,2,7\" ⋯ 172 bytes ⋯ \"0977,0.4437866211,0.6403808594\"\n \"235,0,0,,,,123,80,0,36,0,18,2,9\" ⋯ 169 bytes ⋯ \"994141,0.5493164063,0.79296875\"\n \"244,0,0,,,,115,75,1,56,1,15,3,1\" ⋯ 170 bytes ⋯ \"5488,0.0561981201,0.3202514648\"\n \"245,0,1,85,2,14,148,78,0,68,1,1\" ⋯ 174 bytes ⋯ \"7031,0.0547943115,0.3049926758\"\n \"252,0,0,,,,118,77,0,40,0,18,2,1\" ⋯ 168 bytes ⋯ \"994141,0.5493164063,0.79296875\"\n\n\nWe are using the Downloads, CSV, and DataFrames modules.6\nThe data file has a header (variable names). If there is no header see p. 
191 of book.\n\nnhefs = CSV.read(\"nhefs.csv\", DataFrame)\nnhefs[1:6,:]\n\n6×64 DataFrame\n\n\n\nRow\nseqn\nqsmk\ndeath\nyrdth\nmodth\ndadth\nsbp\ndbp\nsex\nage\nrace\nincome\nmarital\nschool\neducation\nht\nwt71\nwt82\nwt82_71\nbirthplace\nsmokeintensity\nsmkintensity82_71\nsmokeyrs\nasthma\nbronch\ntb\nhf\nhbp\npepticulcer\ncolitis\nhepatitis\nchroniccough\nhayfever\ndiabetes\npolio\ntumor\nnervousbreak\nalcoholpy\nalcoholfreq\nalcoholtype\nalcoholhowmuch\npica\nheadache\notherpain\nweakheart\nallergies\nnerves\nlackpep\nhbpmed\nboweltrouble\nwtloss\ninfection\nactive\nexercise\nbirthcontrol\npregnancies\ncholesterol\nhightax82\nprice71\nprice82\ntax71\ntax82\nprice71_82\ntax71_82\n\n\n\nInt64\nInt64\nInt64\nInt64?\nInt64?\nInt64?\nInt64?\nInt64?\nInt64\nInt64\nInt64\nInt64?\nInt64\nInt64\nInt64\nFloat64\nFloat64\nFloat64?\nFloat64?\nInt64?\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64?\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64\nInt64?\nInt64?\nInt64?\nFloat64?\nFloat64?\nFloat64?\nFloat64?\nFloat64?\nFloat64?\n\n\n\n\n1\n233\n0\n0\nmissing\nmissing\nmissing\n175\n96\n0\n42\n1\n19\n2\n7\n1\n174.188\n79.04\n68.946\n-10.094\n47\n30\n-10\n29\n0\n0\n0\n0\n1\n1\n0\n0\n0\n0\n1\n0\n0\n0\n1\n1\n3\n7\n0\n1\n0\n0\n0\n0\n0\n1\n0\n0\n0\n0\n2\n2\nmissing\n197\n0\n2.18359\n1.73999\n1.10229\n0.461975\n0.443787\n0.640381\n\n\n2\n235\n0\n0\nmissing\nmissing\nmissing\n123\n80\n0\n36\n0\n18\n2\n9\n2\n159.375\n58.63\n61.235\n2.60497\n42\n20\n-10\n24\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n1\n0\n1\n4\n0\n1\n0\n0\n0\n0\n0\n0\n0\n0\n1\n0\n0\n2\nmissing\n301\n0\n2.34668\n1.79736\n1.36499\n0.571899\n0.549316\n0.792969\n\n\n3\n244\n0\n0\nmissing\nmissing\nmissing\n115\n75\n1\n56\n1\n15\n3\n11\n2\n168.5\n56.81\n66.2245\n9.41449\n51\n20\n-14\n26\n0\n0\n0\n0\n0\n0\n0\n0\n0\n1\n0\n0\n1\n0\n1\n3\n4\nmissing\n0\n1\n1\n0\n0\n1\n0\n0\n0\n0
\n0\n0\n2\n0\n2\n157\n0\n1.56958\n1.51343\n0.55127\n0.230988\n0.0561981\n0.320251\n\n\n4\n245\n0\n1\n85\n2\n14\n148\n78\n0\n68\n1\n15\n3\n5\n1\n170.188\n59.42\n64.4101\n4.99012\n37\n3\n4\n53\n0\n0\n0\n0\n1\n0\n0\n0\n0\n0\n0\n0\n0\n0\n1\n2\n3\n4\n0\n0\n1\n1\n0\n0\n0\n0\n0\n0\n0\n1\n2\n2\nmissing\n174\n0\n1.50659\n1.4519\n0.524902\n0.219971\n0.0547943\n0.304993\n\n\n5\n252\n0\n0\nmissing\nmissing\nmissing\n118\n77\n0\n40\n0\n18\n2\n11\n2\n181.875\n87.09\n92.0793\n4.98925\n42\n20\n0\n19\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n1\n2\n1\n2\n0\n1\n0\n0\n0\n0\n0\n0\n1\n0\n0\n1\n1\n2\nmissing\n216\n0\n2.34668\n1.79736\n1.36499\n0.571899\n0.549316\n0.792969\n\n\n6\n257\n0\n0\nmissing\nmissing\nmissing\n141\n83\n1\n43\n1\n11\n4\n9\n2\n162.188\n99.0\n103.419\n4.41906\n34\n10\n10\n21\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n0\n1\n3\n2\n1\n0\n1\n0\n0\n0\n0\n0\n0\n0\n0\n0\n1\n1\n0\n1\n212\n1\n2.20996\n2.02588\n1.15479\n0.747925\n0.184082\n0.406982\n\n\n\n\n\n\nTo see all the variable names we will take the vector of 64 names and reshape them into a 16 by 4 matrix for display purposes.\n\nvn = names(nhefs)\nreshape(vn, :, 4)\n\n16×4 Matrix{String}:\n \"seqn\" \"wt71\" \"hayfever\" \"hbpmed\"\n \"qsmk\" \"wt82\" \"diabetes\" \"boweltrouble\"\n \"death\" \"wt82_71\" \"polio\" \"wtloss\"\n \"yrdth\" \"birthplace\" \"tumor\" \"infection\"\n \"modth\" \"smokeintensity\" \"nervousbreak\" \"active\"\n \"dadth\" \"smkintensity82_71\" \"alcoholpy\" \"exercise\"\n \"sbp\" \"smokeyrs\" \"alcoholfreq\" \"birthcontrol\"\n \"dbp\" \"asthma\" \"alcoholtype\" \"pregnancies\"\n \"sex\" \"bronch\" \"alcoholhowmuch\" \"cholesterol\"\n \"age\" \"tb\" \"pica\" \"hightax82\"\n \"race\" \"hf\" \"headache\" \"price71\"\n \"income\" \"hbp\" \"otherpain\" \"price82\"\n \"marital\" \"pepticulcer\" \"weakheart\" \"tax71\"\n \"school\" \"colitis\" \"allergies\" \"tax82\"\n \"education\" \"hepatitis\" \"nerves\" \"price71_82\"\n \"ht\" \"chroniccough\" \"lackpep\" \"tax71_82\"\n\n\n\n\n2.4 The 
split-apply-combine workflow\nWe will apply a common workflow (Figure 1):\n\nsplit (stratify) the data by one or more variables\napply a function to each stratum\ncombine the results into a table\n\n\n\n\n\n\n\nFigure 1: Split-apply-combine is a common workflow in data science.\n\n\n\nFor split-apply-combine we will conduct the following analysis:\n\nStratified by sex and race, what is the proportion of deaths?\nStratified by sex and race, what is the mean age?\n\nWe start by creating a smaller data set with these four variables, and describing them.\n\nnhefs4 = nhefs[:,[:death,:sex,:age,:race]]\ndescribe(nhefs4)\n\n4×7 DataFrame\n\n\n\nRow\nvariable\nmean\nmin\nmedian\nmax\nnmissing\neltype\n\n\n\nSymbol\nFloat64\nInt64\nFloat64\nInt64\nInt64\nDataType\n\n\n\n\n1\ndeath\n0.195212\n0\n0.0\n1\n0\nInt64\n\n\n2\nsex\n0.509515\n0\n1.0\n1\n0\nInt64\n\n\n3\nage\n43.9153\n25\n44.0\n74\n0\nInt64\n\n\n4\nrace\n0.131983\n0\n0.0\n1\n0\nInt64\n\n\n\n\n\n\nFor “split” we use the groupby function, and for “apply” and “combine” we use the combine function. I also insert a column with a description of each stratum (eg, “white female”).\n\ngdf = groupby(nhefs4, [:sex, :race])\nresults = combine(gdf,\n :death => mean,\n :age => mean)\ninsertcols!(results, \n :strata => [\"white male\", \"black male\", \"white female\", \"black female\"])\n\n4×5 DataFrame\n\n\n\nRow\nsex\nrace\ndeath_mean\nage_mean\nstrata\n\n\n\nInt64\nInt64\nFloat64\nFloat64\nString\n\n\n\n\n1\n0\n0\n0.239716\n44.5262\nwhite male\n\n\n2\n0\n1\n0.276596\n45.0638\nblack male\n\n\n3\n1\n0\n0.141044\n43.3808\nwhite female\n\n\n4\n1\n1\n0.190083\n42.595\nblack female\n\n\n\n\n\n\nConclusions:\n\nIn 1971, the mean age of males was slightly higher than the mean age of females.\nBy 1992, a higher proportion of males than females had died.\nBy 1992, within sex strata, a higher proportion of Blacks had died compared to whites."
},
{
"objectID": "posts/2023-02-20-julia-for-data-analysis/index.html#appendix-data-dictionary-for-nhefs-data-file",
"href": "posts/2023-02-20-julia-for-data-analysis/index.html#appendix-data-dictionary-for-nhefs-data-file",
"title": "Julia for Data Analysis",
"section": "3 Appendix: Data dictionary for NHEFS data file",
"text": "3 Appendix: Data dictionary for NHEFS data file\nDownload data dictionary which is an Microsoft Excel XLSX file.\n\nDownloads.download(\"https://www.hsph.harvard.edu/miguel-hernan/wp-content\" *\n \"/uploads/sites/1268/2012/10/NHEFS_Codebook.xls\",\n \"NHEFS_Codebook.xls\")\n\n\"NHEFS_Codebook.xls\"\n\n\nThen, open up in you MS Excel and manually save as a CSV file. Although the Julia XLSX.jl package can read .xlsx files, it cannot read old .xls file (yet). So this is a work around.\n\ncb = CSV.read(\"NHEFS_Codebook.csv\", DataFrame)\ninsertcols!(cb, 1, :row => 1:64)\ncb[1:20,:]\n\n20×3 DataFrame\n\n\n\nRow\nrow\nVariable name\nDescription\n\n\n\nInt64\nString31\nString\n\n\n\n\n1\n1\nactive\nIN YOUR USUAL DAY, HOW ACTIVE ARE YOU? IN 1971, 0:very active, 1:moderately active, 2:inactive\n\n\n2\n2\nage\nAGE IN 1971\n\n\n3\n3\nalcoholfreq\nHOW OFTEN DO YOU DRINK? IN 1971 0: Almost every day, 1: 2-3 times/week, 2: 1-4 times/month, 3: < 12 times/year, 4: No alcohol last year, 5: Unknown\n\n\n4\n4\nalcoholhowmuch\nWHEN YOU DRINK, HOW MUCH DO YOU DRINK? IN 1971\n\n\n5\n5\nalcoholpy\nHAVE YOU HAD 1 DRINK PAST YEAR? IN 1971, 1:EVER, 0:NEVER; 2:MISSING\n\n\n6\n6\nalcoholtype\nWHICH DO YOU MOST FREQUENTLY DRINK? IN 1971 1: BEER, 2: WINE, 3: LIQUOR, 4: OTHER/UNKNOWN\n\n\n7\n7\nallergies\nUSE ALLERGIES MEDICATION IN 1971, 1:EVER, 0:NEVER\n\n\n8\n8\nasthma\nDX ASTHMA IN 1971, 1:EVER, 0:NEVER\n\n\n9\n9\nbithcontrol\nBIRTH CONTROL PILLS PAST 6 MONTHS? 
IN 1971 1:YES, 0:NO, 2:MISSING\n\n\n10\n10\nbirthplace\nCHECK STATE CODE - SECOND PAGE\n\n\n11\n11\nboweltrouble\nUSE BOWEL TROUBLE MEDICATION IN 1971, 1:EVER, 0:NEVER, ; 2:MISSING\n\n\n12\n12\nbronch\nDX CHRONIC BRONCHITIS/EMPHYSEMA IN 1971, 1:EVER, 0:NEVER\n\n\n13\n13\ncholesterol\nSERUM CHOLESTEROL (MG/100ML) IN 1971\n\n\n14\n14\nchroniccough\nDX CHRONIC COUGH IN 1971, 1:EVER, 0:NEVER\n\n\n15\n15\ncolitis\nDX COLITIS IN 1971, 1:EVER, 0:NEVER\n\n\n16\n16\ndadth\nDAY OF DEATH\n\n\n17\n17\ndbp\nDIASTOLIC BLOOD PRESSURE IN 1982\n\n\n18\n18\ndeath\nDEATH BY 1992, 1:YES, 0:NO\n\n\n19\n19\ndiabetes\nDX DIABETES IN 1971, 1:EVER, 0:NEVER, 2:MISSING\n\n\n20\n20\neducation\nAMOUNT OF EDUCATION BY 1971: 1: 8TH GRADE OR LESS, 2: HS DROPOUT, 3: HS, 4:COLLEGE DROPOUT, 5: COLLEGE OR MORE\n\n\n\n\n\n\n\ncb[21:40,:]\n\n20×3 DataFrame\n\n\n\nRow\nrow\nVariable name\nDescription\n\n\n\nInt64\nString31\nString\n\n\n\n\n1\n21\nexercise\nIN RECREATION, HOW MUCH EXERCISE? IN 1971, 0:much exercise,1:moderate exercise,2:little or no exercise\n\n\n2\n22\nhayfever\nDX HAY FEVER IN 1971, 1:EVER, 0:NEVER\n\n\n3\n23\nhbp\nDX HIGH BLOOD PRESSURE IN 1971, 1:EVER, 0:NEVER, 2:MISSING\n\n\n4\n24\nhbpmed\nUSE HIGH BLOOD PRESSURE MEDICATION IN 1971, 1:EVER, 0:NEVER, ; 2:MISSING\n\n\n5\n25\nheadache\nUSE HEADACHE MEDICATION IN 1971, 1:EVER, 0:NEVER\n\n\n6\n26\nhepatitis\nDX HEPATITIS IN 1971, 1:EVER, 0:NEVER\n\n\n7\n27\nhf\nDX HEART FAILURE IN 1971, 1:EVER, 0:NEVER\n\n\n8\n28\nhightax82\nLIVING IN A HIGHLY TAXED STATE IN 1982, High taxed state of residence=1, 0 otherwise\n\n\n9\n29\nht\nHEIGHT IN CENTIMETERS IN 1971\n\n\n10\n30\nincome\nTOTAL FAMILY INCOME IN 1971 11:<$1000, 12: 1000-1999, 13: 2000-2999, 14: 3000-3999, 15: 4000-4999, 16: 5000-5999, 17: 6000-6999, 18: 7000-9999, 19: 10000-14999, 20: 15000-19999, 21: 20000-24999, 22: 25000+\n\n\n11\n31\ninfection\nUSE INFECTION MEDICATION IN 1971, 1:EVER, 0:NEVER\n\n\n12\n32\nlackpep\nUSELACK OF PEP MEDICATION IN 1971, 1:EVER, 
0:NEVER\n\n\n13\n33\nmarital\nMARITAL STATUS IN 1971 1: Under 17, 2: Married, 3: Widowed, 4: Never married, 5: Divorced, 6: Separated, 8: Unknown\n\n\n14\n34\nmodth\nMONTH OF DEATH\n\n\n15\n35\nnerves\nUSE NERVES MEDICATION IN 1971, 1:EVER, 0:NEVER\n\n\n16\n36\nnervousbreak\nDX NERVOUS BREAKDOWN IN 1971, 1:EVER, 0:NEVER\n\n\n17\n37\notherpain\nUSE OTHER PAINS MEDICATION IN 1971, 1:EVER, 0:NEVER\n\n\n18\n38\npepticulcer\nDX PEPTIC ULCER IN 1971, 1:EVER, 0:NEVER\n\n\n19\n39\npica\nDO YOU EAT DIRT OR CLAY, STARCH OR OTHER NON STANDARD FOOD? IN 1971 1:EVER, 0:NEVER; 2:MISSING\n\n\n20\n40\npolio\nDX POLIO IN 1971, 1:EVER, 0:NEVER\n\n\n\n\n\n\n\ncb[41:64,:]\n\n24×3 DataFrame\n\n\n\nRow\nrow\nVariable name\nDescription\n\n\n\nInt64\nString31\nString\n\n\n\n\n1\n41\npregnancies\nTOTAL NUMBER OF PREGNANCIES? IN 1971\n\n\n2\n42\nprice71\nAVG TOBACCO PRICE IN STATE OF RESIDENCE 1971 (US$2008)\n\n\n3\n43\nprice71_82\nDIFFERENCE IN AVG TOBACCO PRICE IN STATE OF RESIDENCE 1971-1982 (US$2008)\n\n\n4\n44\nprice82\nAVG TOBACCO PRICE IN STATE OF RESIDENCE 1982 (US$2008)\n\n\n5\n45\nqsmk\nQUIT SMOKING BETWEEN 1ST QUESTIONNAIRE AND 1982, 1:YES, 0:NO\n\n\n6\n46\nrace\n0: WHITE 1: BLACK OR OTHER IN 1971\n\n\n7\n47\nsbp\nSYSTOLIC BLOOD PRESSURE IN 1982\n\n\n8\n48\nschool\nHIGHEST GRADE OF REGULAR SCHOOL EVER IN 1971\n\n\n9\n49\nseqn\nUNIQUE PERSONAL IDENTIFIER\n\n\n10\n50\nsex\n0: MALE 1: FEMALE\n\n\n11\n51\nsmokeintensity\nNUMBER OF CIGARETTES SMOKED PER DAY IN 1971\n\n\n12\n52\nsmkintensity 82_71\nINCREASE IN NUMBER OF CIGARETTES/DAY BETWEEN 1971 and 1982\n\n\n13\n53\nsmokeyrs\nYEARS OF SMOKING\n\n\n14\n54\ntax71\nTOBACCO TAX IN STATE OF RESIDENCE 1971 (US$2008)\n\n\n15\n55\ntax71_82\nDIFFERENCE IN TOBACCO TAX IN STATE OF RESIDENCE 1971-1982 (US$2008)\n\n\n16\n56\ntax82\nTOBACCO TAX IN STATE OF RESIDENCE 1971 (US$2008)\n\n\n17\n57\ntb\nDX TUBERCULOSIS IN 1971, 1:EVER, 0:NEVER\n\n\n18\n58\ntumor\nDX MALIGNANT TUMOR/GROWTH IN 1971, 1:EVER, 0:NEVER\n\n\n19\n59\nweakheart\nUSE WEAK HEART 
MEDICATION IN 1971, 1:EVER, 0:NEVER\n\n\n20\n60\nwt71\nWEIGHT IN KILOGRAMS IN 1971\n\n\n21\n61\nwt82\nWEIGHT IN KILOGRAMS IN 1982\n\n\n22\n62\nwt82_71\nWEIGHT CHANGE IN KILOGRAMS\n\n\n23\n63\nwtloss\nUSE WEIGHT LOSS MEDICATION IN 1971, 1:EVER, 0:NEVER\n\n\n24\n64\nyrdth\nYEAR OF DEATH"
},
{
"objectID": "posts/2023-02-20-julia-for-data-analysis/index.html#updates-from-book-author",
"href": "posts/2023-02-20-julia-for-data-analysis/index.html#updates-from-book-author",
"title": "Julia for Data Analysis",
"section": "4 Updates from book author",
"text": "4 Updates from book author\n\nWhat is new in DataFrames.jl 1.5. Mar 24, 2023. Available from: https://bkamins.github.io/julialang/2023/03/24/df15.html\nHunting for bugs in Julia for Data Analysis. Medium. March 3, 2023. Available from: https://medium.com/bkamins/hunting-for-bugs-in-julia-for-data-analysis-ed6f4d1ce6bd\nErrata and source code for book is available from https://github.com/bkamins/JuliaForDataAnalysis"
},
{
"objectID": "posts/2023-02-20-julia-for-data-analysis/index.html#footnotes",
"href": "posts/2023-02-20-julia-for-data-analysis/index.html#footnotes",
"title": "Julia for Data Analysis",
"section": "Footnotes",
"text": "Footnotes\n\n\nUsed Rstudio on 2021 MacBook Pro M1 Max with 32 GB RAM under Ventura 13.2.1↩︎\n“The reason for the difference is that Julia uses 64-bit integers and R uses 64-bit floats by default to do these computations.” Source: https://twitter.com/BogumilKaminski/status/1629968902456311818↩︎\nNotice that in Julia I can use 1000000000 or 1_000_000_000 for the number 1 billion.↩︎\nIn the Pluto.jl package, vectors are displayed horizontally.↩︎\nTo learn more visit https://freecontent.manning.com/using-multiple-dispatch-in-julia/.↩︎\nhttps://stackoverflow.com/questions/27086159/what-is-the-difference-between-using-and-import-in-julia-when-building-a-mod↩︎"
},
{
"objectID": "posts/2023-01-14-my-journey-from-r-to-julia/index.html",
"href": "posts/2023-01-14-my-journey-from-r-to-julia/index.html",
"title": "My journey from R to Julia",
"section": "",
"text": "For 15 years, as volunteer adjunct faculty, I taught “Applied Epidemiology using R” at the UC Berkeley School of Public Health. I started teaching this course in the early 2000s when most people were not interested in R. Eventually R’s popularity grew, and so did my course enrollment. I taught basic R programming applied to epidemiologic concepts. Students worked on a project of their choosing. Because I had students from multiple disciplines, their projects were often very innovative and I learned a lot from them.\nOver the 15 years I witnessed the emergence of “data science” and students’ ingenuity and creativeness in their projects. Naturally, the course evolved too. I introduced Bayesian networks as a unifying framework to introduce probablistic dependence, causal graphs, and decision networks (for decision analysis) [1].\nI started dabbling in Python. Unfortunately, the COVID-19 pandemic interrupted my teaching because I was the health officer of San Francisco and director of the Population Health Division. Working on the pandemic response left me little time to conduct analyses or to learn Python.1\nI cannot remember when, but eventually I discovered Julia—a programming language designed for scientific computing with the intuition of Python or R, but with the speed of C++. I fell in love with Julia and I gave up on learning Python. I did not, and do not, have the time to maintain core competency in more than 1.5 programming languages.2 As I learned more Julia, I became convinced, that for me, learning Julia was a better long term investment than sticking with R.\nI use programming to explore or test my epidemiologic intuition, to learn new methods, and to visualize and analyze data. 
I have a personal interest in Bayesian networks, decision networks, causal inference, Markov decision processes, and agent-based modeling.3\nHere are some key Julia features that I believe epidemiologists will value:\n\nmultiple dispatch (see below)\ncomposite types\njust-in-time compiling\nspeed (very very fast)\ncommunity (mostly computer science, mathematics, engineering)\n1-based indexing like R and MATLAB (Python is 0-based indexing)\nCapabilities similar to dpylr in R 4, 5\nPipe operator similar to %>% in R\ngeneral programming language (like Python)\n\nJulia enables more intuitive programming. For example, in R, we try to avoid loops because they are very inefficient. In Julia, loops are efficient because they compile before execution. This promotes programming that is more natural.\nBelow I demonstrate multiple dispatch with a trivial example. R is single dispatch."
},
{
"objectID": "posts/2023-01-14-my-journey-from-r-to-julia/index.html#my-history-with-the-r-language",
"href": "posts/2023-01-14-my-journey-from-r-to-julia/index.html#my-history-with-the-r-language",
"title": "My journey from R to Julia",
"section": "",
"text": "For 15 years, as volunteer adjunct faculty, I taught “Applied Epidemiology using R” at the UC Berkeley School of Public Health. I started teaching this course in the early 2000s when most people were not interested in R. Eventually R’s popularity grew, and so did my course enrollment. I taught basic R programming applied to epidemiologic concepts. Students worked on a project of their choosing. Because I had students from multiple disciplines, their projects were often very innovative and I learned a lot from them.\nOver the 15 years I witnessed the emergence of “data science” and students’ ingenuity and creativeness in their projects. Naturally, the course evolved too. I introduced Bayesian networks as a unifying framework to introduce probablistic dependence, causal graphs, and decision networks (for decision analysis) [1].\nI started dabbling in Python. Unfortunately, the COVID-19 pandemic interrupted my teaching because I was the health officer of San Francisco and director of the Population Health Division. Working on the pandemic response left me little time to conduct analyses or to learn Python.1\nI cannot remember when, but eventually I discovered Julia—a programming language designed for scientific computing with the intuition of Python or R, but with the speed of C++. I fell in love with Julia and I gave up on learning Python. I did not, and do not, have the time to maintain core competency in more than 1.5 programming languages.2 As I learned more Julia, I became convinced, that for me, learning Julia was a better long term investment than sticking with R.\nI use programming to explore or test my epidemiologic intuition, to learn new methods, and to visualize and analyze data. 
I have a personal interest in Bayesian networks, decision networks, causal inference, Markov decision processes, and agent-based modeling.3\nHere are some key Julia features that I believe epidemiologists will value:\n\nmultiple dispatch (see below)\ncomposite types\njust-in-time compiling\nspeed (very very fast)\ncommunity (mostly computer science, mathematics, engineering)\n1-based indexing like R and MATLAB (Python is 0-based indexing)\nCapabilities similar to dpylr in R 4, 5\nPipe operator similar to %>% in R\ngeneral programming language (like Python)\n\nJulia enables more intuitive programming. For example, in R, we try to avoid loops because they are very inefficient. In Julia, loops are efficient because they compile before execution. This promotes programming that is more natural.\nBelow I demonstrate multiple dispatch with a trivial example. R is single dispatch."
},
{
"objectID": "posts/2023-01-14-my-journey-from-r-to-julia/index.html#multiple-dispatch-example-in-julia",
"href": "posts/2023-01-14-my-journey-from-r-to-julia/index.html#multiple-dispatch-example-in-julia",
"title": "My journey from R to Julia",
"section": "Multiple dispatch example in Julia",
"text": "Multiple dispatch example in Julia\nI was the founding developer of the R ‘epitools’ package.6 I developed functions for basic epidemiologic analyses (eg, 2x2 tables), several from Rothman’s “Modern Epidemiology” textbook. For example, if I wanted to create a function to calculate an odds ratio for a 2x2 contingency table, the data could be provided in several ways:\n\nfour integer counts,\ntwo proportions, or\na 2x2 table (matrix)\ntwo vectors with categorical data\n\nIf I wanted to write one function to handle these possible data types as arguments, I would have to do a lot of processing and checking of data types in order to call the next function (nested or external). This requires much more work than is necessary. Let’s do this is Julia using multiple dispatch.\nIn 2003, we published a study that provided evidence that drinking unfiltered municipal tap water was associated with developing cryptosporidiosis among patients with advanced HIV disease [2]. Here is a contingency table from this paper.\n\n\n\nExposure\nCase\nControl\n\n\n\n\nHighest\n12\n6\n\n\nIntermediate\n35\n64\n\n\nLowest\n2\n29\n\n\n\nLet’s calculate the unadjusted odds ratio comparing the highest exposure category to the lowest exposure category. Here is the 2x2 table for this calculation.\n\n\n\n\n\n\n\n\nExposure\nCase\nControl\n\n\n\n\nHighest\n12\n6\n\n\nLowest\n2\n29\n\n\n\nFor an appropriately structured table, for example,\n\n\n\n\n\n\n\n\nExposure\nCase\nControl\n\n\n\n\nHighest\na\nb\n\n\nLowest\nc\nd\n\n\n\nthe odds ratio is the cross-product:\n\\[\nOR = \\frac{a d}{b c}\n\\]\nFor a case-control design, the odds ratio is the ratio of the exposure odds.\n\\[\nOR = \\frac{p_1/(1 - p_1)}{p_0/(1 - p_0)}\n\\]\nWe will create three functions that can receive three different types of arguments to calculate an odds ratio.\n\nfour integer counts,\ntwo proportions, or\na 2x2 table (matrix)\n\nThe catch is that the three functions will have the same name: oddsratio. 
This is possible in Julia because of multiple dispatch. In contrast, R is single dispatch.\n\n## Function 1\nfunction oddsratio(a::Int, b::Int, c::Int, d::Int)\n or = (a * d) / (b * c)\n return or\nend\n\noddsratio (generic function with 1 method)\n\n\nLet’s test the oddsratio function by passing four integers from our 2x2 table.\n\noddsratio(12, 6, 2, 29)\n\n29.0\n\n\nHere is the second function to handle arguments that are proportions; for example, the exposure odds comparing cases to controls.\n\n## Function 2\nfunction oddsratio(p1::Float64, p0::Float64)\n or = ((p1)/(1 - p1)) / ((p0)/(1 - p0))\n return or\nend\n\noddsratio (generic function with 2 methods)\n\n\nLet’s test the oddsratio function by passing two proportions from our 2x2 table.\n\nprop1 = 12 / (12 + 2) # probability of exposure among cases\nprop0 = 6 / (6 + 29) # probability of exposure among controls\noddsratio(prop1, prop0)\n\n28.999999999999982\n\n\nFinally, here is the third function to handle an argument that is a 2x2 table (matrix).\n\n## Function 3\nfunction oddsratio(x::Matrix{Int})\n or = (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1])\n return or\nend\n\noddsratio (generic function with 3 methods)\n\n\nLet’s test the oddsratio function by passing our 2x2 table.\n\ntab = [12 6; 2 29]\n\n2×2 Matrix{Int64}:\n 12 6\n 2 29\n\n\n\noddsratio(tab)\n\n29.0\n\n\nThis is called multiple dispatch. The oddsratio function has three methods and can handle multiple data types. Notice how easy that was.\n\nmethods(oddsratio)\n\n# 3 methods for generic function oddsratio from Main.Notebook:\n\n- oddsratio(x::Matrix{Int64}) in Main.Notebook at /Users/tja/myblog/posts/2023-01-14-my-journey-from-r-to-julia/index.qmd:236\n- oddsratio(p1::Float64, p0::Float64) in Main.Notebook at /Users/tja/myblog/posts/2023-01-14-my-journey-from-r-to-julia/index.qmd:207\n- oddsratio(a::Int64, b::Int64, c::Int64, d::Int64) in Main.Notebook at /Users/tja/myblog/posts/2023-01-14-my-journey-from-r-to-julia/index.qmd:180\nWow! 
Now anyone, including me, can add new methods to the oddsratio function without disrupting previous methods. This is very powerful. Other languages accomplish this using object-oriented programming. Julia is not an object-oriented programming language."
},
{
"objectID": "posts/2023-01-14-my-journey-from-r-to-julia/index.html#summary",
"href": "posts/2023-01-14-my-journey-from-r-to-julia/index.html#summary",
"title": "My journey from R to Julia",
"section": "Summary",
"text": "Summary\nIn this blog entry I summarized why I switched from R to Julia. I illustrated how multiple dispatch works with functions. Notice how easy it was for me to create an oddsratio function with three methods to handle different argument data types (integer counts, proportions, and a matrix).\nI enjoy Julia and you will too. You can also run R or Python from Julia without skipping a beat.\nI will be posting simple examples that highlight Julia features applied to basic epidemiology or epidemiologic programming."
},
{
"objectID": "posts/2023-01-14-my-journey-from-r-to-julia/index.html#appendix",
"href": "posts/2023-01-14-my-journey-from-r-to-julia/index.html#appendix",
"title": "My journey from R to Julia",
"section": "Appendix",
"text": "Appendix\nJulia is very flexible in the creation of functions. Here are the same three functions in an abbreviated form.\n\n## Function 1\noddsratio(a::Int, b::Int, c::Int, d::Int) = (a * d) / (b * c)\n\noddsratio (generic function with 3 methods)\n\n\n\n## Function 2\noddsratio(p1::Float64, p0::Float64) = ((p1)/(1 - p1)) / ((p0)/(1 - p0))\n\noddsratio (generic function with 3 methods)\n\n\n\n## Function 3\noddsratio(x::Matrix{Int}) = (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1])\n\noddsratio (generic function with 3 methods)\n\n\nFrom my biased perspective, Julia has a simplicity and elegance that is lacking in R."
},
{
"objectID": "posts/2023-01-14-my-journey-from-r-to-julia/index.html#footnotes",
"href": "posts/2023-01-14-my-journey-from-r-to-julia/index.html#footnotes",
"title": "My journey from R to Julia",
"section": "Footnotes",
"text": "Footnotes\n\n\nI did not enjoy Python.↩︎\nI still use R occassionally; that is the “.5”.↩︎\nThe more I study the more I am humbled by the enormous talent out there. I am only stratching the surface.↩︎\nSee https://bkamins.github.io/julialang/2020/07/03/dplyr-vs-df.html↩︎\nSee https://juliadata.github.io/DataFramesMeta.jl/stable/dplyr/↩︎\nSee https://cran.r-project.org/web/packages/epitools/index.html↩︎"
},
{
"objectID": "posts/2024-07-07-moved-to-teampublichealth-substack/index.html",
"href": "posts/2024-07-07-moved-to-teampublichealth-substack/index.html",
"title": "Moved to Team Public Health at Substack!",
"section": "",
"text": "This GitHub Pages site is for technical notes for the Team Public Health blog."
},
{
"objectID": "index.html",
"href": "index.html",
"title": "Team Public Health Technical notes",
"section": "",
"text": "Moved to Team Public Health at Substack!\n\n\n\n\n\n\nPublic health\n\n\nLeadership\n\n\nJulia language\n\n\n\nThis web site is for technical notes. \n\n\n\n\n\nJul 14, 2024\n\n\nTomás Aragón\n\n\n\n\n\n\n\n\n\n\n\n\nSimple method for calculating decision criteria weights\n\n\n\n\n\n\nJulia\n\n\nData science\n\n\nProgramming\n\n\nEpidemiology\n\n\nDecision intelligence\n\n\n\n“Improving human decision intelligence with Julia code” \n\n\n\n\n\nJul 13, 2024\n\n\nTomás Aragón\n\n\n\n\n\n\n\n\n\n\n\n\nJulia for Data Analysis\n\n\n\n\n\n\nBook review\n\n\nJulia\n\n\nR\n\n\nPython\n\n\nData science\n\n\nProgramming\n\n\nEpidemiology\n\n\nScientific computing\n\n\n\n“A book review for population health data scientists” \n\n\n\n\n\nFeb 20, 2023\n\n\nTomás Aragón\n\n\n\n\n\n\n\n\n\n\n\n\nMy journey from R to Julia\n\n\n\n\n\n\nJulia language\n\n\nR language\n\n\nPython\n\n\nData science\n\n\nProgramming\n\n\nEpidemiology\n\n\nScientific computing\n\n\n\nA very brief introduction to Julia for epidemiologists \n\n\n\n\n\nJan 14, 2023\n\n\nTomás Aragón\n\n\n\n\n\n\nNo matching items"
}
]