*! dtalink.ado
*! Probabilistic record linkage or deduplication
*
* dtalink implements probabilistic record linkage (a.k.a. probabilistic matching) for two cases:
* - deduplicating records in one data file
* - linking records in two data files (requires using or source())
*
* For each matching variable, you can use two different methods to compare two observations in a potential match pair:
* - "Exact" matching awards positive weights if (X for observation 1)==(X for observation 2), and awards negative points otherwise.
* - "Caliper matching" awards positive weights if | (X for observation 1)-(X for observation 2)|<=threshold, and awards negative points otherwise.
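*
* A minimal sketch of how these weights combine, assuming a pair's score is the sum of the weights
* awarded for each matching variable (the variable names, weights, and cutoff here are hypothetical):
*   . dtalink lastname 7 -3 birthyear 4 -2 1, block(zipcode) cutoff(5)
* lastname is compared with exact matching (+7 if equal, -3 otherwise).
* birthyear is compared with caliper matching (+4 if the values differ by 1 or less, -2 otherwise).
*   same lastname, birthyear 1980 vs. 1981:       7 + 4 = 11  (kept; 11 >= 5)
*   same lastname, birthyear 1980 vs. 1985:       7 - 2 =  5  (kept; 5 >= 5)
*   different lastname, birthyear 1980 vs. 1985: -3 - 2 = -5  (dropped; -5 < 5)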
*
* The project's main goal was speed: to quickly implement "standard" linkage techniques with large
* datasets. The computationally-heavy parts of the program are implemented in Mata subroutines.
*
*! By Keith Kranker
*
* dtalink.sthlp includes a full description of the command, the command's syntax, and a description of each outcome.
*
* See dtalink_example.do for examples.
*
* Copyright (C) Mathematica Policy Research, Inc. This code cannot be copied, distributed or used without the express written permission of Mathematica Policy Research, Inc.
program define dtalink, rclass
version 15.1
return clear
return local cmd "dtalink"
return local cmdline `"dtalink `0'"'
syntax anything(id="matching criteria") /// list of exact-matching variables or (mvar1 #1 #2 [#3]) [(mvar2 #1 #2 [#3]) [(mvar3 #1 #2 [#3]) [...]]]
[if] [in] /// standard data restrictions
[using] [, /// appends the using dataset onto the data in memory and creates a dummy for source()
///
/// Matching Options
Source(varname) /// identifies the source file for record linkage, that is, cases where you are going to link records in file A to file B. varname must be a dummy = 0/1.
id(varname) /// Variable to identify unique observations. there may be more than one record (row) per observation (e.g., more than one record [row] per person [observation])
/// The default is to treat each row as a unique observation, by creating an id variable with this command: . generate _id = _n
/// If the variable specified is missing, the record will not be included.
CUToff(real 0) /// drops potential matched pairs if the score is below the cutoff. The default is cutoff(0).
Block(string) /// declares blocking variables. If multiple variables are listed, each unique combination of the variables is considered a block. (No variable name abbreviations allowed.)
/// To specify multiple sets of blocks, separate blocking variables with "|", such as block(bvar1 | bvar2 bvar3 | bvar4)
CALcweights /// calculate weights
BESTmatch /// drop 2nd-best matches. See notes below.
SRCBESTmatch(integer -1) /// drop 2nd-best matches for source=0 observations or source=1 observations (but not both). See notes below.
TIEs /// modifies the bestmatch and srcbestmatch() options to keep ties (when it would otherwise break ties arbitrarily)
COMBINEsets /// creates extra large groups that may contain more than one id(). See notes below.
ALLScores /// keeps all scores for a pair, not just the maximum score. By default, the program keeps only the maximum score for a matched pair. This option has an effect on the results only if id() is not unique. (Implies nomerge.)
///
/// OPTIONS TO FORMAT OUTPUT
WIde /// do not reshape file from "wide" into "long" format (implies nomerge).
noMErge /// do not merge (the "long" file) back onto original data
MISSing /// treats missing strings ("") in match variables that are strings and in block variables as their own group. Use with caution. See remarks below.
FILLunmatched /// fills the _matchID variable with a unique identifier for unmatched observations (this option is ignored when nomerge is specified.)
///
/// Using options
noLabel nonotes /// options for appending the using dataset; options are ignored if using file not provided
///
/// Display options
noWEIghttable /// suppresses the table with the matching weights
DESCribe /// show a list of variables in the new dataset
examples(numlist integer max=1 >0) /// print # examples; the default is examples(0) (no examples)
///
/// undocumented
noMISSCheck /// checks for missing data once (instead of checking for each block)
debug ///
]
// Missing data (that is, "" for string variables or . for numeric variables) do not count as a match or a non-match.
// Likewise, missing data in the block variables keeps observations from being compared.
// If one or both observations in a potential match pair have missing data, neither a positive nor a negative weight is applied.
// However, for numeric variables, special missing codes do count as a match (e.g., two observations with .a are considered a match, but an observation with 5 does not match another observation with .a).
// The missing option overrides this behavior for string matching variables and blocking variables.
// With the missing option, two observations are compared if both are missing data for one or more block variables, and they are considered a match on a variable if it is missing for both observations. USE THIS OPTION WITH CAUTION.
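// A worked example under hypothetical weights: with "lastname 7 -3 birthyear 4 -2 1", a pair in which
// one observation has lastname=="" and the birthyear values are 1980 and 1981 gets no weight for
// lastname (neither +7 nor -3) and +4 for birthyear. Assuming the score is the sum of the weights, the pair scores 4.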
// using option
if (`"`using'"'!="") {
if (`"`source'"'!="") {
di as error "source() option not allowed with " in smcl "{help using}"
error 184
}
confirm new variable _file
local source _file
append `using', generate(`source') `label' `notes'
cap label define `source' 0 "master" 1 "using", add
label val `source' `source'
label var `source' "0=master; 1=using"
return local using `"`using'"'
}
// identify sample using `if' `in' from standard syntax
marksample touse, novarlist
confirm new variable _id _id0 _id1 _score _matchID _source
// no ID provided
if ("`id'"=="") {
gen _id = _n
label var _id "Row number in original data file"
local id _id
}
// handle string IDs
else {
return local idvar = "`id'"
cap confirm numeric var `id'
if _rc {
if ("`wide'"=="wide" | "`merge'"=="nomerge") {
nois di as error "String id() variable not allowed with -wide- and -nomerge- options. Try converting `id' to a numeric variable."
error 108
}
qui egen _id = group( `id' ) if `touse' & !mi(`id')
if !mi(`"`: var label `id''"') label var _id `"`: var label `id''"'
else label var _id "`id'"
local idvar : copy local id
local id _id
}
}
local id_label : var label `id'
// with 2 files, check/prepare the `source' variable
sort `source' `id', stable
if ("`source'"!="") {
// check `source' is a dummy
cap assert inlist(`source',0,1) if `touse'
if _rc {
di as error "`source' must equal 0 or 1 for all records."
error 459
}
return scalar numfiles = 2
// see which file has fewer unique IDs
tempvar idtag
qui egen byte `idtag'= tag(`source' `id') if `touse'
qui count if `idtag' & !`source'
local n_ids_0 = r(N)
qui count if `idtag' & `source'
local n_ids_1 = r(N)
drop `idtag'
// throw error if `source' is always 0 or always 1
if (inlist(`n_ids_0',.,0) | inlist(`n_ids_1',.,0)) {
di as error "source(`source') must equal 0 for at least one record and equal 1 for at least one record."
error 459
}
// if source=1 is smaller than source=0, swap the two values so that file "0" has fewer IDs
// in the typical case, this should cause more computations to happen in parallel
else if (`n_ids_1'<`n_ids_0') {
// di as txt " (`source'=0 has more unique IDs than `source'=1. Temporarily creating new source variable = (!`source')."
qui gen byte _source = !`source' if `touse'
local sourcevar : copy local source
local source _source
}
}
else {
return scalar numfiles = 1
}
// for "merge" option, save a temporary copy of the file
if ("`wide'"=="wide" | "`allscores'"=="allscores") local merge nomerge
if ("`merge'"!="nomerge" ) {
tempfile sourcefile
sort `source' `id', stable
qui save `sourcefile'
}
// As `anything' gets parsed, I will create the table I want to display (with tabdisp) in these temporary variables
if ("`weighttable'"!="noweighttable") {
tempvar mv1 mv2 mv3 mv4 mv5
qui gen `mv1' = ""
qui gen `mv2' = .
qui gen `mv3' = .
qui gen `mv4' = .
qui gen `mv5' = ""
label var `mv1' "variable name"
label var `mv2' "Match weight"
label var `mv3' "No match weight"
label var `mv4' "Caliper"
label var `mv5' "Usage type"
local tablerow = 0
}
// Parse `anything'
tokenize `anything'
while (`"`1'"' != "") {
confirm variable `1'
confirm number `2'
confirm number `3'
cap assert inrange(`2',0,.) & (`3'<=0)
if _rc {
di as error `"`1' `2' `3' invalid"'
error 198
}
if ("`weighttable'"!="noweighttable") {
if c(N)<=`++tablerow' qui set obs `tablerow'
qui replace `mv1' = trim(`"`1'"') in `tablerow'
qui replace `mv2' = `2' in `tablerow'
qui replace `mv3' = `3' in `tablerow'
}
cap confirm number `4'
if !_rc {
confirm numeric variable `1'
cap assert (`1'<=.) if `touse'
if _rc {
misstable summarize `1'
di as error `"Special missing codes (.a, .b, ... , .z) not allowed with caliper matching variables. Consider changing these observations to ."'
error 416
}
cap assert inrange(`4',0,.)
if _rc {
di as error `"`1' `2' `3' `4' invalid"'
error 198
}
local calipvars `calipvars' `1'
local calipposwgt `calipposwgt' `2'
local calipnegwgt `calipnegwgt' `3'
local calipers `calipers' `4'
if ("`weighttable'"!="noweighttable") {
qui replace `mv4' = `4' in `tablerow'
qui replace `mv5' = "Caliper matching variables" in `tablerow'
}
mac shift 4
}
else {
local varlist `varlist' `1'
local posweights `posweights' `2'
local negweights `negweights' `3'
if ("`weighttable'"!="noweighttable") {
qui replace `mv4' = 0 in `tablerow'
qui replace `mv5' = "Exact matching variables" in `tablerow'
}
mac shift 3
}
}
// return key inputs in r()
return scalar cutoff = `cutoff'
return scalar misscheck = ("`misscheck'"!="nomisscheck")
if ("`varlist'"!="") {
return local mtcvars = "`varlist'"
return local mtcposwgt = "`posweights'"
return local mtcnegwgt = "`negweights'"
}
if ("`calipvars'"!="") {
return local dstvars = "`calipvars'"
return local dstradii = "`calipers'"
return local dstposwgt = "`calipposwgt'"
return local dstnegwgt = "`calipnegwgt'"
}
if ("`block'"!="") {
return local blockvars = "`block'"
}
// add blocking variables to output table, but don't do anything else (yet)
if ("`weighttable'"!="noweighttable") {
if (`"`block'"'!="") {
tokenize `"`block'"', parse("|")
while (`"`1'"' != "") {
if (trim(`"`1'"') != "|") {
if c(N)<=`++tablerow' qui set obs `tablerow'
qui replace `mv1' = trim(`"`1'"') in `tablerow'
qui replace `mv4' = 0 in `tablerow'
qui replace `mv5' = "Blocking variables" in `tablerow'
}
mac shift
}
}
else {
if c(N)<=`++tablerow' qui set obs `tablerow'
qui replace `mv1' = `"(None)"' in `tablerow'
qui replace `mv4' = 0 in `tablerow'
qui replace `mv5' = "Blocking variables" in `tablerow'
}
}
// check for invalid options or combinations of options
if ("`ties'"=="ties") {
if (1 != ("`bestmatch'"=="bestmatch") + inlist(`srcbestmatch',0,1)) {
di as error "With the `ties' option, you are required to select one of the following options: bestmatch, srcbestmatch()."
error 184
}
else if ("`allscores'"=="allscores") {
di as error "With the `ties' option, allscores is not allowed."
error 184
}
}
else if (1<(("`combinesets'"=="combinesets") + ("`bestmatch'"=="bestmatch") + inlist(`srcbestmatch',0,1) + ("`allscores'"=="allscores"))) {
di as error "Only one of the following options is allowed at a time: combinesets, bestmatch, srcbestmatch(), and allscores."
error 184
}
else return local options = trim(`"`combinesets' `bestmatch' `ties' `=cond(inlist(`srcbestmatch',0,1),"srcbestmatch(`srcbestmatch')","")' `allscores'"')
if !inlist(`srcbestmatch',-1) {
if !inlist(`srcbestmatch',0,1) {
di as error "srcbestmatch() option must be 0 or 1"
error 198
}
if ("`source'"=="") {
di as error "srcbestmatch() option only allowed for record linkage (two files)"
error 184
}
}
local allvars : list uniq varlist
if (!`: list allvars === varlist') {
di as error `"`: list dups varlist' listed in varlist more than once"'
error 198
}
if (`"`block'"'!="") {
local blockvars : subinstr local block "|" " ", all
cap confirm var `blockvars', exact
if _rc {
di as error "One or more variables listed in block() were not found in the data." _n "(Note that variable abbreviations are not allowed in the block() option.)"
confirm var `blockvars', exact
}
}
local calcweights = ("`calcweights'"=="calcweights") // switch to dummy
// print matching variables to screen
if ("`weighttable'"!="noweighttable") {
sort `mv1' `mv5' `mv4' `mv3' `mv2', stable
qui by `mv1' (`mv5' `mv4' `mv3' `mv2'): replace `mv1' = `mv1' + " (" + strofreal(_n) + ")" if _N > 1 & !mi(`mv1')
tabdisp `mv1' if !mi(`mv1'), cellvar(`mv2' `mv3' `mv4') by(`mv5') concise
}
// drop duplicates
unab allvars : `id' `varlist' `calipvars' `blockvars' `source' `touse'
qui keep `allvars'
qui keep if `touse'
qui duplicates drop
// convert any strings to (temporary) numeric variables
// `mtcvarlist' is the same as `varlist' but we replace the old variable name with the temporary variable's name
// `numblock' is the same as `block' but we replace the old variable name with the temporary variable's name
// `calipvars' doesn't need to be checked since we already know the variable is numeric (see above)
local mtcvarlist: copy local varlist
local numblock: copy local block
local allvars : list uniq allvars
cap confirm numeric var `allvars', exact
if _rc {
foreach v of local allvars {
cap confirm numeric var `v', exact
if !_rc continue
tempvar _`v'
qui egen `_`v'' = group( `v' ) , `missing'
local mtcvarlist : subinstr local mtcvarlist "`v'" "`_`v''" , word all
local numblock : subinstr local numblock "`v'" "`_`v''" , word
}
}
// do not compare records where the id variable is missing
qui count if `touse' & mi(`id')
if (r(N)) {
di as res =r(N) as txt " records dropped because missing value(s) in `id'"
qui replace `touse' = 0 if mi(`id') & `touse'
}
// one copy of `touse' for each file
if ("`source'"!="") {
tempvar touse_f0 touse_f1
qui gen byte `touse_f0' = `touse' & !`source'
qui gen byte `touse_f1' = `touse' & `source'
qui count if `touse_f0'
if (!r(N)) error 2000
return scalar N0 = r(N)
qui count if `touse_f1'
if (!r(N)) error 2000
return scalar N1 = r(N)
}
else {
local touse_f0 : copy local touse
qui count if `touse'
if (!r(N)) error 2000
return scalar N = r(N)
}
// create empty variables to hold the results
// _id0 and _id1 have the same type as the ID; _score and _matchID are double
qui gen double _matchID = .
label var _matchID "Matched set identifier"
qui compress `id'
qui clonevar _id0 = `id' if 0
qui clonevar _id1 = `id' if 0
qui gen double _score = .
format _score %9.2f
label var _score "Probabilistic matching score"
// when de-duping, sort larger groups to the top
sort `source' `id', stable
if ("`source'"=="") {
tempvar id_n rownum
gen `rownum' = _n
qui by `id': gen byte `id_n'= _N if `touse'
gsort `source' -`id_n' `id' `rownum'
drop `id_n' `rownum'
}
// -- -block- option --
// setup the blocking commands by creating a series of numeric variables with the egen() command
// (if no blocking variables, then `block_id_list' is empty)
local b=0
if (`"`block'"' != "") {
while (`"`block'"'!="") {
// get the blocking set (until we run out)
local ++b
gettoken blockvars block : block, parse("|")
gettoken waste block : block, parse("|")
confirm var `blockvars'
gettoken numblockvars numblock : numblock, parse("|")
gettoken waste numblock : numblock, parse("|")
local blockvars = trim(`"`blockvars'"')
local numblockvars = trim(`"`numblockvars'"')
confirm var `blockvars', exact
confirm var `numblockvars', exact
// setup variable to block on; find number of blocks
if (`: list sizeof numblockvars'==1) {
local block_id_`b': copy local numblockvars
}
else {
tempvar block_id_`b'
qui egen `block_id_`b'' = group( `numblockvars' ) if `touse' , `missing'
}
local `block_id_`b''_label = `"`"`blockvars'"'"'
local block_id_list : list block_id_list | block_id_`b'
local block_lab_list : list block_lab_list | `block_id_`b''_label
mac drop _`block_id_`b''_label
} // end of loop through block variables
} // end of -block- option setup
// MAIN ALGORITHM IS IMPLEMENTED IN MATA
// get a class instance
tempname D
// copy data and other matching parameters to mata
if ("`source'"=="") {
mata: `D' = dtalink()
}
else {
mata: `D' = dtalink2()
}
// move local macros and data from Stata into Mata
mata: `D'.load()
keep _id0 _id1 _score _matchID
qui keep in 1
// run the linkage
mata: `D'.probabilisticlink(`calcweights')
// remove duplicates, sort, assign IDs
mata: `D'.dedup("`allscores'"=="")
mata: `D'.assign()
// Recalculate weights using the pairs that were found,
// and then re-run matching and recalculate weights.
// Repeat until no matches are found or the maximum number of loops is reached.
if (`calcweights') {
tempname wtab
local lastN = r(pairsnum)
di as txt _n "Suggested matching weights:"
mata: `D'.newweights()
matrix `wtab' = r(new_weights)
return add
_matrix_table `wtab', format(%9.3f %9.3f `=cond(trim("`calipvars'")!="","%9.3f","")')
}
// deal with case of no matches found
if (r(pairsnum)==0) {
if ("`merge'"!="nomerge" ) {
nois di as txt "(Restoring the data.)"
qui use `sourcefile',clear
cap drop _id
if ("`sourcevar'"!="") {
drop _source
local source : copy local sourcevar
}
}
exit
}
// The bestmatch option deals with case where an `id' is assigned to multiple _matchIDs.
// After running this subroutine, each `id' will be assigned to exactly one _matchID.
if ("`bestmatch'"=="bestmatch" & "`ties'"=="ties") {
mata: `D'.dropinferior()
}
else if ("`bestmatch'"=="bestmatch" & "`ties'"=="") {
mata: `D'.bestmatch()
}
// The srcbestmatch() option deals with case where an `id' in one file (0 or 1) is assigned to multiple _matchIDs.
// For each `id' in file=`srcbestmatch', we keep the _matchID(s) with the highest score.
else if (inlist(`srcbestmatch',0,1) & "`ties'"=="ties") {
local adj_srcbestmatch = cond("`sourcevar'"!="",1-`srcbestmatch',`srcbestmatch') // we might have switched left/right above
mata: `D'.dropinferior(`adj_srcbestmatch')
}
else if (inlist(`srcbestmatch',0,1) & "`ties'"=="") {
local adj_srcbestmatch = cond("`sourcevar'"!="",1-`srcbestmatch',`srcbestmatch') // we might have switched left/right above
mata: `D'.bestmatch(`adj_srcbestmatch')
}
// combinesets is a subroutine to deal with case where an ID is assigned to multiple _matchIDs.
// After running this subroutine, _matchID will be updated to include all IDs that were ever matched together
if ("`combinesets'"=="combinesets") {
mata: `D'.combinesets()
}
// move results back into Stata
mata: `D'.extract("_id0 _id1 _score _matchID")
mata: st_local("pairsnum",strofreal(`D'.pairsnum))
return scalar pairsnum = `pairsnum'
qui compress _score _matchID
// drop the class instance
mata: mata drop `D'
// summary stats on scores
di as txt _n "Distribution of matched pair scores, among pairs with score >=`cutoff':"
qui inspect _score
if (r(N_unique)<30) {
tab _score, plot
qui sum _score
}
else {
summ _score, det
}
return scalar scores_mean = r(mean)
return scalar scores_sd = r(sd)
return scalar scores_min = r(min)
return scalar scores_max = r(max)
if ("`combinesets'"=="combinesets") {
qui {
gen row = _n
if ("`source'"=="") {
reshape long _id, i(row)
drop row _j
}
else {
reshape long _id, i(row) j(`source')
drop row
}
collapse (max) _score, by(_matchID _id `source')
rename _id `id'
label var `id' `"`id_label'"'
sort _matchID `source' `id', stable
order _matchID `source' `id', first
}
di as txt "The current configuration of the data is: one row per record."
}
// if `wide' option, just leave in wide format
else if ("`wide'"=="wide") {
if ("`sourcevar'"!="") {
rename (_id0 _id1) (_id1 _id0)
order _id0, before(_id1)
}
di as txt "The current configuration of the data is: one row per matched pair."
}
// reshape into long format
else {
qui {
if ("`source'"=="") {
reshape long _id, i(_matchID)
drop _j
}
else {
reshape long _id, i(_matchID) j(`source')
}
rename _id `id'
label var `id' `"`id_label'"'
sort _matchID `source' `id', stable
order _matchID `source' `id', first
}
di as txt "The current configuration of the data is: one row per record."
}
// merge option -- 1:n merge with the original file
if ("`merge'"!="nomerge") {
sort `source' `id', stable
gen byte `touse'=1
qui joinby `source' `id' `touse' using `sourcefile', _merge(_matchflagtemp) unmatched(both)
gen byte _matchflag = _matchflagtemp==3
drop _matchflagtemp
if ("`sourcevar'"!="") {
drop _source
local source : copy local sourcevar
}
if ("`idvar'"!="") {
drop _id
local id : copy local idvar
}
sort _matchID `source' `id' `calipvars' `varlist', stable
order _matchID `source' `id' _score _matchflag, first
label var _matchID "Matched set identifier"
if (`"`using'"'!="") label var _file "0=master dataset; 1=using dataset"
cap assert (_matchflag==!missing(_matchID)) // at this point, only matched observations have _matchID filled in
if _rc {
di as error "Programming error 1"
list if (_matchflag!=!missing(_matchID))
assert (_matchflag==!missing(_matchID))
}
// `fillunmatched' fills the _matchID variable with a unique identifier for unmatched observations (this option is ignored when nomerge is specified.)
if ("`fillunmatched'"=="fillunmatched") {
tempvar groupid
egen `groupid' = group(`source' `id') if !missing(`id') & `touse'
summ _matchID, mean
qui replace _matchID = r(max) + `groupid' if _matchflag!=1 & !missing(`id') & `touse'
drop `groupid'
sort _matchID `source' `id' `calipvars' `varlist', stable
qui count if missing(_matchID)
if r(N) di as error r(N) " observations have missing _matchID." as txt " This is typically due to missing values in the ID variable (`id')."
}
drop `touse'
}
else {
gen byte _matchflag=1
}
cap label define _mtchflg 1 "Matched" 0 "Not matched"
label val _matchflag _mtchflg
label var _matchflag "Match indicator"
// describe output file (a little)
if ("`describe'"=="describe") {
di _n(2) as text "Description of the new dataset:"
desc
}
// print examples
if ("`examples'"!="") {
di as txt _n "Examples (up to `examples' rows):"
if "`wide'"=="wide" list if _matchID<=_matchID[`=min(`examples',c(N))'] & `=cond("`merge'"=="nomerge","!missing(_matchID)","_matchflag==1")', sepby(_id0)
else list if _matchID<=_matchID[`=min(`examples',c(N))'] & `=cond("`merge'"=="nomerge","!missing(_matchID)","_matchflag==1")', sepby(_matchID)
}
end // end of dtalink program definition
*! Mata Source Code
*! Defines classes which are called by dtalink.ado
*! dtalink is the parent class and includes shared functions
*! dtalink is used for deduplicating 1 file
*! dtalink2 extends dtalink for the case of linking 2 files
*! some functions in dtalink are replaced in the derived class, dtalink2
version 15.1
mata:
mata set matastrict on
mata set matafavor speed
class dtalink
{
// moving between Stata and Mata
public:
void load()
void updateweights()
void extract()
// linking criteria setup
protected:
string scalar id_var
string vector mtc_vars
string vector mtc_vars_num
string vector dst_vars
real scalar mtc_num
real scalar dst_num
real colvector mtc_poswgt
real colvector mtc_negwgt
real rowvector dst_radii
real colvector dst_poswgt
real colvector dst_negwgt
real scalar cutoff
// file0
protected:
string scalar selectrows_0
real scalar num_0
real colvector ids_0
real matrix mtc_0
real matrix dst_0
real matrix block_0
// blocking setup
protected:
string vector block_vars
string vector block_labs
real scalar num_block_vars
// functions and variables used in matching
public:
real matrix pairs
real scalar pairsnum
void clearpairs()
void probabilisticlink()
void dedup()
void assign()
void bestmatch()
void dropinferior()
void combinesets()
real matrix tall()
protected:
void new()
virtual void link_one_block_var()
virtual void link_one_block()
transmorphic colvector intersect()
void store_pairs()
real colvector mtc_score()
real colvector dts_score()
real scalar check_miss, mtc_miss, dst_miss, pairmatnum, initrows
real matrix mtc_match, mtc_ijnonmiss, dst_match, dst_ijnonmiss
// variables to perform EM calculations
public:
void clearsums()
void newweights()
real colvector new_mtc_poswgt, new_mtc_negwgt, new_dst_poswgt, new_dst_negwgt
protected:
real scalar runsum_1_N, runsum_0_N
real rowvector mtc_runsum_1, mtc_runsum_0, dst_runsum_1, dst_runsum_0
//DEBUG// // dummy to get extra output when debugging
//DEBUG// protected:
//DEBUG// real scalar debug
}
class dtalink2 extends dtalink
{
public:
void load()
void bestmatch()
void dropinferior()
void combinesets()
protected:
virtual void link_one_block_var()
virtual void link_one_block()
// file 1
protected:
string scalar selectrows_1
real scalar num_1
real colvector ids_1
real matrix mtc_1
real matrix dst_1
real matrix block_1
}
// dtalink::new() initializes the matrices and scalars that hold the results
void dtalink::new()
{
clearpairs()
}
// dtalink::clearpairs() clears the matrices and scalars that hold the results, without removing any of the inputs
void dtalink::clearpairs() {
pairs = J(0,3,.)
pairsnum = pairmatnum = 0
}
// resets matrices that are used to calculate running sums for computing EM weights
void dtalink::clearsums() {
new_mtc_poswgt = new_mtc_negwgt = J(mtc_num,1,.)
new_dst_poswgt = new_dst_negwgt = J(dst_num,1,.)
mtc_runsum_1 = mtc_runsum_0 = J(1,mtc_num,0)
dst_runsum_1 = dst_runsum_0 = J(1,dst_num,0)
runsum_1_N = runsum_0_N = 0
}
// dtalink::load() will
// 1) copy relevant Stata local macros into class variables and
// 2) copy the data for file 0 into the class instance.
// It assumes locals are set up the same way as in the .ado file
void dtalink::load()
{
//DEBUG// debug = (st_local("debug")=="debug")
//DEBUG// if (debug) "+++ debug is [on]"
//DEBUG// if (debug) "+++ beginning of dtalink::load()"
// basic setup
id_var = tokens(st_local("id"))
cutoff = strtoreal(st_local("cutoff"))
check_miss = st_local("misscheck")!="nomisscheck"
// exact matching setup
mtc_vars = tokens(st_local("varlist")) // this has string variables
mtc_vars_num = tokens(st_local("mtcvarlist")) // string variables converted to numeric
mtc_num = length(mtc_vars)
if (mtc_num) {
mtc_poswgt = strtoreal(tokens(st_local("posweights")))'
mtc_negwgt = strtoreal(tokens(st_local("negweights")))'
}
else mtc_vars = J(1,0,"")
// caliper matching setup
dst_vars = tokens(st_local("calipvars"))
dst_num = length(dst_vars)
if (dst_num) {
dst_radii = strtoreal(tokens(st_local("calipers" )))
dst_poswgt = strtoreal(tokens(st_local("calipposwgt")))'
dst_negwgt = strtoreal(tokens(st_local("calipnegwgt")))'
}
else dst_vars = J(1,0,"")
// blocking setup
block_vars = tokens(st_local("block_id_list"))
block_labs = tokens(st_local("block_lab_list"))
num_block_vars = (length(block_vars))
//DEBUG// if (debug) {
//DEBUG// "id_var="; id_var
//DEBUG// "cutoff="; cutoff
//DEBUG// "mtc_num="; mtc_num
//DEBUG// if (mtc_num) {
//DEBUG// "mtc_vars="; mtc_vars
//DEBUG// "mtc_vars_num="; mtc_vars_num
//DEBUG// "mtc_poswgt="; mtc_poswgt
//DEBUG// "mtc_negwgt="; mtc_negwgt
//DEBUG// }
//DEBUG// "dst_num="; dst_num
//DEBUG// if (dst_num) {
//DEBUG// "dst_vars="; dst_vars
//DEBUG// "dst_radii="; dst_radii
//DEBUG// "dst_poswgt="; dst_poswgt
//DEBUG// "dst_negwgt="; dst_negwgt
//DEBUG// }
//DEBUG// "num_block_vars="; num_block_vars
//DEBUG// if (num_block_vars) {
//DEBUG// "block_vars="; block_vars
//DEBUG// }
//DEBUG// }
// load file 0
selectrows_0 = tokens(st_local("touse_f0"))
ids_0 = st_data(., id_var, selectrows_0)
num_0 = rows(ids_0)
if (mtc_num) mtc_0 = st_data(., mtc_vars_num, selectrows_0)
else mtc_0 = J(num_0,0,.)
if (dst_num) dst_0 = st_data(., dst_vars, selectrows_0)
else dst_0 = J(num_0,0,.)
if (num_block_vars) block_0 = st_data(., block_vars, selectrows_0)
else block_0 = J(num_0,0,.)
initrows = length(ids_0)*5
// dummy for having missing data (calculations are quicker if data is never missing)
if (mtc_num) mtc_miss = hasmissing(mtc_0)
if (dst_num) dst_miss = hasmissing(dst_0)
// set up running sum matrices now that mtc_num and dst_num are set
clearpairs()
clearsums()
//DEBUG// if (debug) {
//DEBUG// "selectrows_0="; selectrows_0
//DEBUG// "mtc_0 is " + strofreal(rows(mtc_0 )) + " by " + strofreal(cols(mtc_0 ))
//DEBUG// "dst_0 is " + strofreal(rows(dst_0 )) + " by " + strofreal(cols(dst_0 ))
//DEBUG// "block_0 is " + strofreal(rows(block_0)) + " by " + strofreal(cols(block_0))
//DEBUG// "pairs is " + strofreal(rows(pairs )) + " by " + strofreal(cols(pairs))
//DEBUG// "pairsnum="; pairsnum
//DEBUG// "pairmatnum="; pairmatnum
//DEBUG// "+++ end of dtalink::load()"
//DEBUG// }
}
// dtalink::updateweights() allows you to update the class variables
// mtc_poswgt, mtc_negwgt, dst_poswgt, and dst_negwgt.
// Inputs must have the same shape as the original weight vectors.
// Pass . or J(0,1,.) for any weight vector that should be left unchanged
// (e.g., the caliper weights if only using exact matching, or vice versa).
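// Illustrative sketch only (not from the original documentation): D is a hypothetical,
// already-loaded dtalink instance, assumed here to have two exact-match variables
// (mtc_num==2) and no caliper variables (dst_num==0).
//   D.updateweights((5 \ 3), (-2 \ -1), ., .)
// Under those assumptions the two exact-match weight vectors are replaced,
// and passing . leaves the caliper weights untouched.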
void dtalink::updateweights(real colvector new_mtc_poswgt,
real colvector new_mtc_negwgt,
real colvector new_dst_poswgt,
real colvector new_dst_negwgt)
{
if (rows(new_mtc_poswgt) & new_mtc_poswgt!=.) {
if (rows(new_mtc_poswgt)!=mtc_num) _error(3200)
this.mtc_poswgt = new_mtc_poswgt
}
if (rows(new_mtc_negwgt) & new_mtc_negwgt!=.) {
if (rows(new_mtc_negwgt)!=mtc_num) _error(3200)
this.mtc_negwgt = new_mtc_negwgt
}
if (rows(new_dst_poswgt) & new_dst_poswgt!=.) {
if (rows(new_dst_poswgt)!=dst_num) _error(3200)
this.dst_poswgt = new_dst_poswgt
}
if (rows(new_dst_negwgt) & new_dst_negwgt!=.) {
if (rows(new_dst_negwgt)!=dst_num) _error(3200)
this.dst_negwgt = new_dst_negwgt
}
}
// dtalink2::load() will
// 1) copy relevant Stata local macros into class variables.
// 2) copy the data for file 0 into the class instance (by calling dtalink::load()) and
// 3) copy the data for file 1 into the class instance.
// It assumes locals are set up the same way as in the .ado file
void dtalink2::load()
{
// load file 0 and most of the macros
super.load()
//DEBUG// if (debug) "+++ beginning of dtalink2::load() (after calling super.load)"
// file 1 touse variable
selectrows_1 = st_local("touse_f1")
// load file 1
ids_1 = st_data(., id_var, selectrows_1)
num_1 = rows(ids_1)
if (mtc_num) mtc_1 = st_data(., mtc_vars_num, selectrows_1)
else mtc_1 = J(num_1,0,.)
if (dst_num) dst_1 = st_data(., dst_vars, selectrows_1)
else dst_1 = J(num_1,0,.)
if (num_block_vars) block_1 = st_data(., block_vars, selectrows_1)
else block_1 = J(num_1,0,.)
initrows = min((initrows,length(ids_1)*5))
// dummy for having missing data (calculations are quicker if data is never missing)
if (mtc_num) mtc_miss = ( mtc_miss | hasmissing(mtc_1) )
if (dst_num) dst_miss = ( dst_miss | hasmissing(dst_1) )
//DEBUG// if (debug) {
//DEBUG// "selectrows_1="; selectrows_1
//DEBUG// "mtc_1 is " + strofreal(rows(mtc_1 )) + " by " + strofreal(cols(mtc_1 ))
//DEBUG// "dst_1 is " + strofreal(rows(dst_1 )) + " by " + strofreal(cols(dst_1 ))
//DEBUG// "block_1 is " + strofreal(rows(block_1)) + " by " + strofreal(cols(block_1))
//DEBUG// "+++ end of dtalink2::load()"
//DEBUG// }
}
// dtalink::probabilisticlink() performs probabilistic linkage
// however all the action is in the subroutines that it calls
// this function is just a high level wrapper to loop over any blocking variables
// optionally (if em), running sums are computed for calculating newweights
void dtalink::probabilisticlink(| real scalar em)
{
//DEBUG// if (debug) "+++ beginning of dtalink::probabilisticlink()"
if (args()<1) em = 0
if (num_block_vars) {
real scalar B
for (B=1; B<=num_block_vars; B++) {
(void) link_one_block_var(B,em)
}
}
else {
(void) link_one_block(em)
}
if (em) (void) newweights()
//DEBUG// if (debug) {
//DEBUG// "pairs is " + strofreal(rows(pairs)) + " by " + strofreal(cols(pairs))
//DEBUG// "pairsnum="; pairsnum
//DEBUG// "pairmatnum="; pairmatnum
//DEBUG// "first 40 rows of pairs:"; if (pairmatnum) pairs[|1,1\min((40,rows(pairs))),.|];
//DEBUG// "+++ end of dtalink::probabilisticlink()"
//DEBUG// }
}
// intersect() gives the intersection of two column vectors
// That is, if you type C=intersect(A,B), then the column vector C contains the elements common to A and B, sorted, with no repetitions.
// If you have two rowvectors, intersect(A',B')' will give the corresponding result, with a little loss of speed
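// Illustrative sketch (values are made up for this example), written as it would be
// called from inside a class member function:
//   A = (3 \ 1 \ 2 \ 3)
//   B = (2 \ 3 \ 5)
//   C = intersect(A, B)   // C is (2 \ 3)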
transmorphic colvector dtalink::intersect(transmorphic colvector v_0, transmorphic colvector v_1)
{
if (eltype(v_0)!=eltype(v_1)) _error("the 2 vectors have different eltypes")
if (!rows(v_0) | !rows(v_1)) return(J(0,0,missingof(v_0)))
transmorphic colvector u_0, u_1; real colvector idx; real scalar r, R
u_0=uniqrows(v_0)
u_1=uniqrows(v_1)
idx=J(rows(u_0), 1, .)
R = rows(u_0)
for (r=1; r<=R; r++) {
idx[r] = anyof(u_1, u_0[r])
}
return(select(u_0, idx))
}
// the two link_one_block_var() programs perform the probabilistic linkage for the blocking variable in column B of the blocking matrix
// again, most of the action is in the subroutines
// optionally (if em), running sums are computed for calculating newweights
// the dtalink::link_one_block_var() version handles the case of de-duping one file, with or without missing data
void dtalink::link_one_block_var(real scalar B, | real scalar em)
{
real colvector sortindex_0
real matrix info_0, blockrange_0
real scalar i0, I0, dotsize
if (args()<2) em = 0
// re-sort the data
//DEBUG// if (debug) "sorting on "+block_vars[B]
sortindex_0 = order((block_0[.,B],ids_0), (1,2))
_collate(block_0, sortindex_0)
_collate(ids_0, sortindex_0)
if (mtc_num) _collate(mtc_0, sortindex_0)
if (dst_num) _collate(dst_0, sortindex_0)
// use panelsetup() to identify rows in each block
// this only keeps blocks with >=2 rows; if there is only 1 row in the block, there cannot be a match
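// Illustrative sketch (made-up data, X is a hypothetical matrix): if column 1 of X holds
// (1 \ 1 \ 2 \ 3 \ 3 \ 3), then panelsetup(X, 1, 2) returns (1,2 \ 4,6);
// each row gives the first and last row of a block, and the single-row block for value 2 is dropped.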
info_0 = panelsetup(block_0, B, 2)