---
title: "Accounting Anomaly Detector"
output:
  pdf_document: default
  html_notebook:
    toc: true
    toc_depth: 2
    self_contained: true
---
# Executive Summary #
The company is authorised by the Financial Conduct Authority, which requires it to operate regular processes to identify potential fraud. The accounting system holds millions of records, far too many to review manually. An autoencoder is developed which reliably sifts unusual transactions from those which are commonplace: it flags 100 intentionally unusual transactions among 250,000 test transactions with almost no false positives.
The autoencoder also indicates which component of a record causes it to be flagged as abnormal, i.e. it provides some degree of explanation.
This work is inspired by a paper by PwC (https://arxiv.org/pdf/1709.05254.pdf) and adapted for the company's data.
# Introduction #
The company has an accounting system with approximately 250,000 debits and credits made per year. Detecting fraud or error within this mass of data is not straightforward. Traditional solutions involve querying the data for 'red flags', for example transactions at odd times of day. A more recent approach is the autoencoder, a deep learning tool which attempts to compress the information within a record and then reconstitute it into a copy of the original record. Since the vast majority of records in a system are 'business as usual', the autoencoder learns to reconstitute them well. It is less well trained on unusual records, i.e. fraud or error, so these records are reconstituted less well. The delta between the original record and its reconstruction is therefore a measure of its unusualness.
The company's accounting system has been in place for 10 years, meaning there are now over 2,000,000 records in the database. This should easily be enough to build a robust deep learning model.
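To make the scoring idea concrete, here is a minimal sketch of how a per-record reconstruction error could be computed once a trained model has produced its reconstructions. The objects `original` and `reconstructed` are hypothetical matrices of identical shape (one row per posting) and are not created elsewhere in this notebook; the squared-error score is just one simple choice of delta.
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
# Hypothetical illustration: 'original' and 'reconstructed' are numeric matrices
# of the same shape, one row per posting.
reconstruction_error <- rowMeans((original - reconstructed)^2)
# The postings the model reconstructs worst are the candidates for manual review
suspicious_rows <- order(reconstruction_error, decreasing = TRUE)[1:100]
```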
```{r, eval=FALSE, echo=FALSE, message=FALSE, cache=FALSE}
library(RODBC)
# Create a connection to the database using ODBC. This password does not show in the pdf
connection <- odbcConnect(dsn = "RODBC",
uid = "sa",
pwd = "CauliG00gle")
```
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE, fig.width=8}
library(dplyr)
library(tidyr)
library(RODBC)
library(reticulate)
library(knitr)
# Lots of packages, ensure dplyr gets priority
select <- dplyr::select
filter <- dplyr::filter
summarise <- dplyr::summarise
# Extract the required cash flow data.
# The SQL used in the stored procedure is in an appendix at the end of the notebook
srcData <- sqlQuery(connection, "EXEC [sp_GetAllNominalTransactions]", stringsAsFactors=FALSE)
# Close connection
odbcClose(connection)
#Let's see how big this data is
paste0(format(object.size(srcData), units = "auto"))
```
Let's take a look at the data, which has already been randomly sorted...
```{r, eval=TRUE, echo=TRUE, message=FALSE, cache=FALSE, fig.width=8}
str(srcData)
```
So each row is a posting to a nominal account, either a debit (+ve) or a credit (-ve). The system groups a transaction into its component debits and credits, which normally balance to zero, so there may be multiple rows (postings) per transaction. The key by which we group the postings into a transaction is the DOCUMENT_REF. This data has only the first digit of the DOCUMENT_REF, which holds information: for example, 'P' is the first digit of a purchase, while sales are prefixed with the number associated with a depot (1 to 5). Subsequent digits are simply a sequential number, so it is no loss to have them omitted from the data.
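As an aside, the grouping described above can be sanity-checked: postings sharing a DOCUMENT_REF form one transaction and their signed values should net to zero. A sketch of that check, assuming an extract in which the full DOCUMENT_REF and the signed BC_TRANS_VALUE are available (illustrative only, not part of the modelling pipeline):
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
library(dplyr)
# Within a document, debits (+ve) and credits (-ve) should sum to (approximately) zero
srcData %>%
  group_by(DOCUMENT_REF) %>%
  summarise(balance = sum(BC_TRANS_VALUE)) %>%
  filter(abs(balance) > 0.01)   # any documents failing to balance would be listed here
```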
The nominal account number is strictly a categorical field, although it appears numerical. The first digit gives the type of the account (1 = Asset, 2 = Liability, 3 = Sale, etc.) and the last digit denotes the depot of the transaction (0 = not depot specific, 1 = Telford, 2 = Hereford, etc.). These digits are separated into their own columns.
There are thousands of nominal accounts, which would lead to a one-hot vector thousands of columns wide. Since each digit has some meaning, we instead create a one-hot vector for each of the 7 digits in a nominal account number, meaning at most 10 x 7 = 70 columns to one-hot encode an entire nominal account number.
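Purely as an illustration of that per-digit encoding (the account number below is made up), a single nominal account number splits into seven one-digit categorical columns like so:
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
# Made-up account number "3100012": digit 1 = 3 (a Sale), digit 7 = 2 (the Hereford depot)
acct <- "3100012"
digits <- as.data.frame(t(as.integer(strsplit(acct, "")[[1]])))
names(digits) <- paste0("ACCOUNT_NO_Digit", 1:7)
digits
```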
This data has been randomly sorted so the postings of each transaction are not listed together.
## Data Cleaning & Prep ##
Much of the data comprises categories; these need to be factorised appropriately and a record kept of the factor levels.
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
# Column specification will be used frequently during this analysis, so best presented as a table
Col_Spec <- cbind.data.frame(Name = colnames(srcData),
Type_init = as.vector(sapply(srcData, class)),
Type_final= as.character(NA),
Min = as.numeric(NA),
Max = as.numeric(NA),
MinMaxed = FALSE,
Logged = FALSE,
stringsAsFactors = FALSE)
# get a copy of the source data, we will adjust only this copy, not the original
srcData_adj <- srcData
# This is the kind of data where it helps to set NA to zero
srcData_adj[is.na(srcData_adj)] <- 0
#Create factors where data is text
srcData_adj$ACCOUNT_NO_Digit1<- as.factor(srcData_adj$ACCOUNT_NO_Digit1)
srcData_adj$ACCOUNT_NO_Digit2<- as.factor(srcData_adj$ACCOUNT_NO_Digit2)
srcData_adj$ACCOUNT_NO_Digit3<- as.factor(srcData_adj$ACCOUNT_NO_Digit3)
srcData_adj$ACCOUNT_NO_Digit4<- as.factor(srcData_adj$ACCOUNT_NO_Digit4)
srcData_adj$ACCOUNT_NO_Digit5<- as.factor(srcData_adj$ACCOUNT_NO_Digit5)
srcData_adj$ACCOUNT_NO_Digit6<- as.factor(srcData_adj$ACCOUNT_NO_Digit6)
srcData_adj$ACCOUNT_NO_Digit7<- as.factor(srcData_adj$ACCOUNT_NO_Digit7)
srcData_adj$TRANS_TYPE <- as.factor(srcData_adj$TRANS_TYPE)
srcData_adj$OPERATOR <- as.factor(srcData_adj$OPERATOR)
srcData_adj$DOCUMENT_REF_L1 <- as.factor(srcData_adj$DOCUMENT_REF_L1)
srcData_adj$SUPP_REF <- as.factor(srcData_adj$SUPP_REF)
srcData_adj$CUST_REF <- as.factor(srcData_adj$CUST_REF)
## get the levels for future reference
ACCOUNT_NO_Digit1_Levels <- levels(srcData_adj$ACCOUNT_NO_Digit1)
ACCOUNT_NO_Digit2_Levels <- levels(srcData_adj$ACCOUNT_NO_Digit2)
ACCOUNT_NO_Digit3_Levels <- levels(srcData_adj$ACCOUNT_NO_Digit3)
ACCOUNT_NO_Digit4_Levels <- levels(srcData_adj$ACCOUNT_NO_Digit4)
ACCOUNT_NO_Digit5_Levels <- levels(srcData_adj$ACCOUNT_NO_Digit5)
ACCOUNT_NO_Digit6_Levels <- levels(srcData_adj$ACCOUNT_NO_Digit6)
ACCOUNT_NO_Digit7_Levels <- levels(srcData_adj$ACCOUNT_NO_Digit7)
TRANS_TYPE_Levels <- levels(srcData_adj$TRANS_TYPE)
OPERATOR_Levels <- levels(srcData_adj$OPERATOR)
DOCUMENT_REF_L1_Levels <- levels(srcData_adj$DOCUMENT_REF_L1)
SUPP_REF_Levels <- levels(srcData_adj$SUPP_REF)
CUST_REF_Levels <- levels(srcData_adj$CUST_REF)
# identify the columns which are factors
ColIsFactor <- which(sapply(srcData_adj, is.factor))
# add this knowledge to our column specification table
Col_Spec[ColIsFactor,]$Type_final <- "factor"
Col_Spec[-ColIsFactor,]$Type_final <- "numerical"
# collate levels into one object, for future use
factorlevels <- list(as.numeric(ACCOUNT_NO_Digit1_Levels), as.numeric(ACCOUNT_NO_Digit2_Levels),
as.numeric(ACCOUNT_NO_Digit3_Levels), as.numeric(ACCOUNT_NO_Digit4_Levels),
as.numeric(ACCOUNT_NO_Digit5_Levels), as.numeric(ACCOUNT_NO_Digit6_Levels),
as.numeric(ACCOUNT_NO_Digit7_Levels), DOCUMENT_REF_L1_Levels,
TRANS_TYPE_Levels, OPERATOR_Levels, SUPP_REF_Levels, CUST_REF_Levels)
# apply names to the object storing our levels
names(factorlevels) <- colnames(srcData_adj)[ColIsFactor]
## convert each categorical field to the numeric value of the levels, as required by the deep learning model
srcData_adj$ACCOUNT_NO_Digit1<- as.numeric(srcData_adj$ACCOUNT_NO_Digit1)
srcData_adj$ACCOUNT_NO_Digit2<- as.numeric(srcData_adj$ACCOUNT_NO_Digit2)
srcData_adj$ACCOUNT_NO_Digit3<- as.numeric(srcData_adj$ACCOUNT_NO_Digit3)
srcData_adj$ACCOUNT_NO_Digit4<- as.numeric(srcData_adj$ACCOUNT_NO_Digit4)
srcData_adj$ACCOUNT_NO_Digit5<- as.numeric(srcData_adj$ACCOUNT_NO_Digit5)
srcData_adj$ACCOUNT_NO_Digit6<- as.numeric(srcData_adj$ACCOUNT_NO_Digit6)
srcData_adj$ACCOUNT_NO_Digit7<- as.numeric(srcData_adj$ACCOUNT_NO_Digit7)
srcData_adj$TRANS_TYPE <- as.numeric(srcData_adj$TRANS_TYPE)
srcData_adj$OPERATOR <- as.numeric(srcData_adj$OPERATOR)
srcData_adj$DOCUMENT_REF_L1 <- as.numeric(srcData_adj$DOCUMENT_REF_L1)
# EXCEPT CUST_REF and SUPP_REF, dealt with in next codechunk
#Set dates to numeric
srcData_adj$TRANS_DATE <- as.numeric(srcData_adj$TRANS_DATE)
srcData_adj$SYSTEM_DATE <- as.numeric(srcData_adj$SYSTEM_DATE)
# NOTE, do nothing with DOCUMENT_REF. It will NOT be fed to the model, is unique to each transaction
# It will be used in testing, to link transactions back to documents in the accts system
# Thus enabling us to investigate candidate records for error and even fraud
# For now, we will simply record its function in our Col_Spec table.
# Note Type_final = "exclude", ie it will be excluded from the model.
#Remove the automatically created entry
Col_Spec <- Col_Spec %>% filter(Name != "DOCUMENT_REF")
#Replace with this entry
Col_Spec[nrow(Col_Spec)+1,] <- list("DOCUMENT_REF", "character", "exclude",
as.numeric(NA), as.numeric(NA), FALSE, FALSE)
```
## CUST_REF and SUPP_REF ##
These two fields are factors with thousands of levels, which would lead to enormous one-hot vectors later in the analysis. So we will keep only the top 100 levels and lump the rest together into 'Other'. The tidyverse package 'forcats' has tools for this job.
The top 100 will capture a large proportion of the data; let's see the count per level after the lumping together...
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
library(forcats)
#100 levels, 1 'other' of lumped together rarest factors, and 1 N/A (-1)
srcData_adj$CUST_REF <- fct_lump(srcData_adj$CUST_REF, n = 100) #101 levels = 100 factors + 'other'
srcData_adj$SUPP_REF <- fct_lump(srcData_adj$SUPP_REF, n = 100) #101 levels = 100 factors + 'other'
#record the new levels
CUST_REF_Levels <- levels(srcData_adj$CUST_REF)
SUPP_REF_Levels <- levels(srcData_adj$SUPP_REF)
#Let's see the count
fct_count(srcData_adj$CUST_REF) %>% arrange(desc(n))
## convert each categorical field to the numeric value of the levels, as required by the deep learning model
srcData_adj$SUPP_REF <- as.numeric(srcData_adj$SUPP_REF)
srcData_adj$CUST_REF <- as.numeric(srcData_adj$CUST_REF)
```
## New Field for Debits vs Credits ##
Currently the BC_TRANS_VALUE is +ve for a debit and -ve for a credit. It will later prove useful to separate the sign of the transaction from the value. Before we do, we should see the distribution of debits vs credits:
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
debit_count <- sum(srcData_adj$BC_TRANS_VALUE >= 0)
credit_count <- sum(srcData_adj$BC_TRANS_VALUE < 0)
```
```{r, eval=TRUE, echo=TRUE, message=FALSE, cache=FALSE, fig.width=8}
print(paste0("For each debit there are ",round(credit_count/debit_count,2), " credits"))
```
A slight skew to credits.
Now we split BC_TRANS_VALUE into two columns: a BC_TRANS_VALUE which is always +ve, and a separate column for the debit or credit sign, where debit (+ve) = 1 and credit (-ve) = 0.
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
srcData_adj <- srcData_adj %>% mutate(isDebit = ifelse(BC_TRANS_VALUE >= 0,1,0))
srcData_adj <- srcData_adj %>% mutate(BC_TRANS_VALUE = abs(BC_TRANS_VALUE))
#We treat the isDebit field as binary, we won't need to scale it later.
Col_Spec[nrow(Col_Spec)+1,] <- list("isDebit", "binary", "binary",
as.numeric(NA),as.numeric(NA), FALSE, FALSE)
```
## Create Dummy Data ##
We currently don't know whether our data contains any true frauds; we're not aware of any!
So how can we test whether the model properly identifies unusual records?
We can create 1,000 records whose values are simply generated at random. Since they will make no sense, we'd hope the autoencoder can find them.
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
library(lubridate)
dummyqty <- 1000
get_random <- function(range, isFactor, isTransValue = FALSE, isPostingDelay = FALSE,
dummy_qty = dummyqty, seed = seed){
#convert levels to values
if(isFactor){
range <- seq_along(range)
}
#get max and min
max_in_range <- max(range)
min_in_range <- min(range)
#if we're dealing with the transaction value, try not to take more than £50k, bit too obvious!
if(isTransValue){
max_in_range <- 50000
min_in_range <- 0
}
# if we're dealing with the posting delay then it should be within -2 to +30 days
if(isPostingDelay){
max_in_range <- 30
min_in_range <- -2
}
#use uniform probability
set.seed(seed)
randoms <- runif(dummy_qty, min_in_range, max_in_range)
#round all values to integer, EXCEPT TransValue which is rounded to 2 places
if(isTransValue){
randoms <- round(randoms,2)
}else{
randoms <- round(randoms)
}
return(randoms)
}
dummy_rows <- cbind.data.frame(
ACCOUNT_NO_Digit1 = get_random(range=ACCOUNT_NO_Digit1_Levels,
isFactor=TRUE, seed=20181208),
ACCOUNT_NO_Digit2 = get_random(ACCOUNT_NO_Digit2_Levels,
isFactor=TRUE, seed=20181209),
ACCOUNT_NO_Digit3 = get_random(ACCOUNT_NO_Digit3_Levels,
isFactor=TRUE, seed=20181210),
ACCOUNT_NO_Digit4 = get_random(ACCOUNT_NO_Digit4_Levels,
isFactor=TRUE, seed=20181211),
ACCOUNT_NO_Digit5 = get_random(ACCOUNT_NO_Digit5_Levels,
isFactor=TRUE, seed=20181212),
ACCOUNT_NO_Digit6 = get_random(ACCOUNT_NO_Digit6_Levels,
isFactor=TRUE, seed=20181213),
ACCOUNT_NO_Digit7 = get_random(ACCOUNT_NO_Digit7_Levels,
isFactor=TRUE, seed=20181214),
TRANS_DATE = get_random(srcData_adj$TRANS_DATE,
isFactor=FALSE,seed=20181215),
# These TRANS_DATE fields are dictated by the TRANS_DATE, so calculated later
TRANS_DATE_Mth = 0,
TRANS_DATE_DoM = 0,
TRANS_DATE_DoW = 0,
# These SYSTEM_DATE fields are dictated by the PostingDelayDays, so calculated later
SYSTEM_DATE = 0,
SYSTEM_DATE_Mth = 0,
SYSTEM_DATE_DoM = 0,
SYSTEM_DATE_DoW = 0,
PostingDelayDays = get_random(srcData_adj$PostingDelayDays,isFactor=FALSE,
isPostingDelay=TRUE, seed=20181216),
DOCUMENT_REF_L1 = get_random(DOCUMENT_REF_L1_Levels, isFactor=TRUE,
seed=20181217),
BC_TRANS_VALUE = get_random(srcData_adj$BC_TRANS_VALUE, isFactor=FALSE,
isTransValue=TRUE, seed=20181218),
TRANS_TYPE = get_random(TRANS_TYPE_Levels, isFactor=TRUE,
seed=20181219),
OPERATOR = get_random(OPERATOR_Levels, isFactor=TRUE,
seed=20181220),
CUST_REF = get_random(CUST_REF_Levels, isFactor=TRUE,
seed=20181221),
SUPP_REF = get_random(SUPP_REF_Levels, isFactor=TRUE,
seed=20181222),
# NOTE: DOCUMENT_REF will not be fed to the model, so leave as NA
DOCUMENT_REF = NA,
# Flag that the record is randomly created for testing, not a real record.
isRandom = TRUE,
# Flag for debit or Credit. Debit = 1, Credit = 0. Debit half as likely as credit.
isDebit = rbinom(n=dummyqty, size=1, prob=0.3333333333)
)
#get the calculated date fields...
TRANS_DATE_InDateFormat <- as.Date("1970-01-01") + dummy_rows$TRANS_DATE
dummy_rows$TRANS_DATE_Mth <- month(TRANS_DATE_InDateFormat)
dummy_rows$TRANS_DATE_DoM <- day(TRANS_DATE_InDateFormat)
dummy_rows$TRANS_DATE_DoW <- wday(TRANS_DATE_InDateFormat)
dummy_rows$SYSTEM_DATE <- dummy_rows$PostingDelayDays + dummy_rows$TRANS_DATE
SYSTEM_DATE_InDateFormat <- as.Date("1970-01-01") + dummy_rows$SYSTEM_DATE
dummy_rows$SYSTEM_DATE_Mth <- month(SYSTEM_DATE_InDateFormat)
dummy_rows$SYSTEM_DATE_DoM <- day(SYSTEM_DATE_InDateFormat)
dummy_rows$SYSTEM_DATE_DoW <- wday(SYSTEM_DATE_InDateFormat)
```
```{r, eval=TRUE, echo=TRUE, message=FALSE, cache=FALSE}
#clear unused data
rm(TRANS_DATE_InDateFormat, SYSTEM_DATE_InDateFormat)
#create the flag to identify which records are random and which real
srcData_adj <- srcData_adj %>% mutate(isRandom = FALSE)
#We treat the IsRandom field as a logical field, we exclude it from the final model
Col_Spec[nrow(Col_Spec)+1,] <- list("isRandom", "logical", "exclude",
as.numeric(NA), as.numeric(NA), FALSE, FALSE)
#append our dummy records to the real data
srcData_adj <- rbind(srcData_adj, dummy_rows)
#randomly sort
srcData_adj <- srcData_adj[sample(nrow(srcData_adj)),]
# take a copy for testing later
srcData_b4_ohot <- srcData_adj
```
## Log of Transaction Value ##
The transaction value (BC_TRANS_VALUE) has a long tail, meaning most transactions are small and only a few are very large. Deep learning works better with roughly normally distributed data, so we apply the log(x+1) transform to the transaction values. This does not result in normally distributed data, but it's a lot better than the raw data.
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
b4_log <- srcData_adj$BC_TRANS_VALUE
af_log <- log(srcData_adj$BC_TRANS_VALUE+1)
```
```{r, eval=TRUE, echo=TRUE, message=FALSE, cache=FALSE,fig.height=8}
par(mfrow = c(1,2))
hist(b4_log, breaks = 1000, main="Hist: Trans Value")
hist(af_log, breaks = 1000, main="Hist: log(Trans Value + 1)")
```
Let's apply the log
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
# log the data, don't forget to add 1 because log(0) = -Inf
srcData_adj$BC_TRANS_VALUE <- log(srcData_adj$BC_TRANS_VALUE + 1)
# record the fact that it has been logged
Col_Spec$Logged[which(Col_Spec$Name == "BC_TRANS_VALUE")] <- TRUE
```
## Scale other Numerical Fields ##
Furthermore, deep learning works best when all the numerical (as opposed to categorical) data is at a similar scale and with low values, preferably less than 1. The usual way to achieve this is to standardise each field by subtracting its mean and dividing by its standard deviation, which also brings most of the data below one.
However, the paper by PwC (Schreyer et al., 2018) reports good results with a different approach: min-max scaling, which re-expresses each figure as a proportion of the range of its field. This guarantees a value between 0 and 1 for every field, prioritising a strict 0-to-1 range over true standardisation.
Note that after this process the transaction value (BC_TRANS_VALUE) will have been first logged and then min-max scaled.
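For contrast, here is a minimal sketch of the two options on a small made-up vector (illustrative only, not part of the pipeline): z-score standardisation centres the data but can produce values outside [0, 1], whereas min-max scaling keeps every value inside [0, 1].
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
# Made-up values, purely to show the difference between the two scalings
x <- c(12, 250, 3100, 47, 980)
z_scaled  <- (x - mean(x)) / sd(x)            # mean 0, sd 1, but values fall outside [0, 1]
mm_scaled <- (x - min(x)) / (max(x) - min(x)) # everything lands in [0, 1]
rbind(z_scaled = round(z_scaled, 2), mm_scaled = round(mm_scaled, 2))
```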
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
# Scale the non-factor columns
# We save the min and max for each numerical column, so that we can 'unscale' later
for (i in 1:nrow(Col_Spec)) {
if(Col_Spec$Type_final[i] == "numerical"){
# get the relevant column's data
col_num <- which(colnames(srcData_adj) == Col_Spec$Name[i])
col_dat <- srcData_adj[,col_num]
# record the min and max
Col_Spec$Max[i] <- max(col_dat)
Col_Spec$Min[i] <- min(col_dat)
# minmax scale the data
srcData_adj[,col_num] <- (col_dat - Col_Spec$Min[i]) / (Col_Spec$Max[i] - Col_Spec$Min[i])
# record the fact that the data has been minmaxed
Col_Spec$MinMaxed[i] <- TRUE
} else {
Col_Spec$Max[i] <- NA
Col_Spec$Min[i] <- NA
Col_Spec$MinMaxed[i] <- FALSE
}
}
```
## Convert Categorical Fields to One Hot ##
For feeding into a neural network, categorical fields need to be one-hot vectors. We will do this in R, using dplyr and tidyr. Alternatively, this could have been left until entry to the model and done in Python for Keras, using `keras.utils.to_categorical(y, num_classes=None, dtype='float32')`.
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
# convert each factor column to one-hot vectors; results in multiple fields per factor column
onehot <- function(dataframe, column_to_one_hot){
#requires one column of unique values, column RowNum in this instance
#also requires a column of 1's, column 'i' in this instance
output <- dataframe %>%
select(eval(column_to_one_hot)) %>%
mutate(RowNum=row_number(), i=1) %>%
spread(key=eval(column_to_one_hot), value=i, fill=0) %>%
select(-RowNum)
# append the col name to all new col names
names(output) <- paste0(column_to_one_hot, "_", names(output))
return(output)
}
TRANS_TYPE_ohot <- onehot(srcData_adj, quote("TRANS_TYPE"))
OPERATOR_ohot <- onehot(srcData_adj, quote("OPERATOR"))
DOCUMENT_REF_L1_ohot <- onehot(srcData_adj, quote("DOCUMENT_REF_L1"))
CUST_REF_ohot <- onehot(srcData_adj, quote("CUST_REF"))
SUPP_REF_ohot <- onehot(srcData_adj, quote("SUPP_REF"))
ACCOUNT_NO_Digit1_ohot <- onehot(srcData_adj, quote("ACCOUNT_NO_Digit1"))
ACCOUNT_NO_Digit2_ohot <- onehot(srcData_adj, quote("ACCOUNT_NO_Digit2"))
ACCOUNT_NO_Digit3_ohot <- onehot(srcData_adj, quote("ACCOUNT_NO_Digit3"))
ACCOUNT_NO_Digit4_ohot <- onehot(srcData_adj, quote("ACCOUNT_NO_Digit4"))
ACCOUNT_NO_Digit5_ohot <- onehot(srcData_adj, quote("ACCOUNT_NO_Digit5"))
ACCOUNT_NO_Digit6_ohot <- onehot(srcData_adj, quote("ACCOUNT_NO_Digit6"))
ACCOUNT_NO_Digit7_ohot <- onehot(srcData_adj, quote("ACCOUNT_NO_Digit7"))
# Bring all columns together
# get list of non factor columns
col_refs <- which(!Col_Spec$Type_final %in% c("factor", "exclude"))
srcData_prepped <- cbind.data.frame(# not factor and not excluded
srcData_adj[,col_refs],
# factors
TRANS_TYPE_ohot,
OPERATOR_ohot,
DOCUMENT_REF_L1_ohot,
ACCOUNT_NO_Digit1_ohot,
ACCOUNT_NO_Digit2_ohot,
ACCOUNT_NO_Digit3_ohot,
ACCOUNT_NO_Digit4_ohot,
ACCOUNT_NO_Digit5_ohot,
ACCOUNT_NO_Digit6_ohot,
ACCOUNT_NO_Digit7_ohot,
CUST_REF_ohot,
SUPP_REF_ohot,
# excluded from model
DOCUMENT_REF = srcData_adj$DOCUMENT_REF,
isRandom = srcData_adj$isRandom,
stringsAsFactors = FALSE
)
# record the order in which these columns appear
# we need this order for matrix operations later
Col_Spec_Sequence <- c( # not factor and not excluded
colnames(srcData_adj[,col_refs]),
# factors
"TRANS_TYPE",
"OPERATOR",
"DOCUMENT_REF_L1",
"ACCOUNT_NO_Digit1",
"ACCOUNT_NO_Digit2",
"ACCOUNT_NO_Digit3",
"ACCOUNT_NO_Digit4",
"ACCOUNT_NO_Digit5",
"ACCOUNT_NO_Digit6",
"ACCOUNT_NO_Digit7",
"CUST_REF",
"SUPP_REF",
# excluded from model
"DOCUMENT_REF",
"isRandom")
Col_Spec_Sequence <- data.frame(Name = Col_Spec_Sequence,
Sequence = seq(from=1, to=length(Col_Spec_Sequence)),
stringsAsFactors = FALSE)
Col_Spec <- Col_Spec %>%
left_join(Col_Spec_Sequence, by="Name") %>%
arrange(Sequence)
```
```{r, eval=TRUE, echo=TRUE, message=FALSE, cache=FALSE, fig.height=8}
#Let's see how big this data is after we converted columns to one-hot...
paste0(format(object.size(srcData_prepped), units = "auto"))
```
A lot of data, mostly due to the one-hot encoding. Let's see the dimensions of the table; these will indicate the size of the input layer of the autoencoder.
```{r, eval=TRUE, echo=TRUE, message=FALSE, cache=FALSE}
dim(srcData_prepped)
```
Each record is 393 columns wide, mostly due to the one-hot conversion, and those fields are mostly zeros: a very sparse matrix.
The final two columns will not be submitted to the model, so input_dim = 391.
# Train, Validation and Test Sets #
Autoencoders don't use separate inputs and labels: there is no 'y', 'label' or 'target', because the correct output for each record is simply the record itself. However, it is still useful to have a validation and a test set. The training and test sets should be certain to include dummy data. The validation set helps us understand how model training is progressing and identifies overtraining; the test set helps build confidence in how effective the model is. Taking 67% for training and then 67% of the remainder for validation gives roughly a 67/22/11 split.
```{r, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
getdatasplit <- function(dataset, ProportionInSetA){
## Training set = proportion of the total data
sample_size <- floor(ProportionInSetA * nrow(dataset))
## set the seed to make partition reproducible
set.seed(20181212)
SetA_indices <- sample(seq_len(nrow(dataset)), size = sample_size, replace = F)
SetA <- dataset[ SetA_indices, ]
SetB <- dataset[-SetA_indices, ]
return(list(SetA, SetB))
}
#first separate out the training set (the isRandom field is dropped later, just before conversion to matrix)
srcData_split <- getdatasplit(srcData_prepped, 0.67)
srcData_Train <- srcData_split[[1]]
srcData_TestAndValid <- srcData_split[[2]]
#now separate out the validation and test sets
srcData_split <- getdatasplit(srcData_TestAndValid, 0.67)
srcData_Valid <- srcData_split[[1]]
srcData_Test <- srcData_split[[2]]
rm(srcData_split, srcData_TestAndValid)
# report results
print(paste0("Rows in Training Set : ",nrow(srcData_Train),
'. As Propn:', round(nrow(srcData_Train)/nrow(srcData_prepped),2)))
print(paste0("Rows in Validation Set: ",nrow(srcData_Valid),
'. As Propn:', round(nrow(srcData_Valid)/nrow(srcData_prepped),2)))
print(paste0("Rows in Testing Set : ",nrow(srcData_Test),
'. As Propn:', round(nrow(srcData_Test)/nrow(srcData_prepped),2)))
print(paste0("Dummy data in Training Set : ", nrow(srcData_Train %>% filter(isRandom==TRUE))))
print(paste0("Dummy data in Validation Set: ", nrow(srcData_Valid %>% filter(isRandom==TRUE))))
print(paste0("Dummy data in Testing Set : ", nrow(srcData_Test %>% filter(isRandom==TRUE))))
# finally, remove the isRandom field and convert to matrix format for use in Keras.
srcData_Train_py <- as.matrix(srcData_Train %>% select (-c(isRandom, DOCUMENT_REF)))
srcData_Valid_py <- as.matrix(srcData_Valid %>% select (-c(isRandom, DOCUMENT_REF)))
srcData_Test_py <- as.matrix(srcData_Test %>% select (-c(isRandom, DOCUMENT_REF)))
# For the test set we will want to know the row references for the dummy records
srcData_Test_ID <- srcData_Test %>%
mutate(ID = row_number()) %>%
select(ID, isRandom, DOCUMENT_REF)
#input_dim for model
input_dim = ncol(srcData_Train_py)
```
# Transfer Data to Python #
All fields except 'isRandom' and 'DOCUMENT_REF' are fed into the model.
```{python, transfer, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
import pandas as pd
import numpy as np
import os as os
#dplyr for pandas in python
from dfply import *
# pull the matrices across from R via reticulate (they arrive as numpy arrays)
Train_py = r.srcData_Train_py
Testi_py = r.srcData_Test_py
Valid_py = r.srcData_Valid_py
input_dim = int(r.input_dim)  # comes across from R as a float; Keras expects an integer
```
# Helper Function to Create Models #
The model will be very similar to the PwC model version AE7:
1. 7 encoding layers (excluding the input)
2. 1 bottleneck layer of 3 neurons (which makes it easy to plot the data in 3D)
3. 7 decoding layers
4. Leaky ReLU activation, alpha = 0.4
5. dropout at each layer, except the bottleneck
Here are the layer sizes:
391(Input)-256-128-64-32-16-8-4-3-4-8-16-32-64-128-256-391(Output)
A helper function has been written to create the model, with options to vary key components so we can grid search for the optimal model.
The code includes options for setting pretrained weights on the encoder. This will be used to view embeddings after we have trained the model. See F. Chollet on the approach to doing this: https://github.com/keras-team/keras/issues/41
If we had a single record and wanted to get activations (embeddings) for all layers in the model, then we could use an API like https://github.com/philipperemy/keract, but that is not our situation here. For embeddings we will want the activations at the bottleneck layer for all records.
```{python, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
from keras.layers import Input, Dense, LeakyReLU, Dropout, BatchNormalization, concatenate
from keras.models import Model
import numpy as np
#from keras import backend as K
def create_model_basic(
input_dim = input_dim,
apply_dropout = True,
apply_batchnorm = False,
dropout_rate = 0.2,
leaky_alpha = 0.4,
# we may choose to see embeddings from pretrained models, rather than train from scratch...
# for applying the pretrained weights of an existing model, we need encoding layer numbers
# note, these may skip a layer between layers (ie 1,3,5 not 2,3),
# this is because there may be a dropout layer between each dense layer
get_embeddings = False,
pretrained_model = None,
# we have a dense layer and dropout layer for each encoding step.
encode_layer_nodes = [256, 128, 64, 32, 16, 8, 4],
# the bottleneck, or throat
throat_layer_nodes = 3,
# we need decode layers too
decode_layer_nodes = [ 4, 8, 16, 32, 64, 128, 256]):
#Kernel initialiser
kinit = 'glorot_normal'
#INPUT
the_input = Input(shape=(input_dim,))
####################################################################
# Encode Layers
####################################################################
encoded = the_input
for nodes in encode_layer_nodes:
if(apply_dropout):
encoded = Dropout(dropout_rate)(encoded)
if(get_embeddings):
# in the pretrained model layer[0] is the input layer
# if we apply dropout then that's layer[1]
# subsequently, this dense layer is layer[2] (or layer[1] without dropout)
# and then the LeakyRelu is layer[3] (or layer[2] without dropout)
lyr_idx = np.where(np.array(encode_layer_nodes) == nodes) # starts with 1
lyr_idx = lyr_idx[0]*(2+apply_dropout+apply_batchnorm)+(1+apply_dropout)
encoded = Dense(nodes, weights = pretrained_model.layers[lyr_idx[0]].get_weights())(encoded)
else:
encoded = Dense(nodes, kernel_initializer=kinit)(encoded)
if(apply_batchnorm):
encoded = BatchNormalization()(encoded)
encoded = LeakyReLU(alpha=leaky_alpha)(encoded)
####################################################################
# Bottleneck (aka Throat)
####################################################################
# Typically 3 nodes for easy plotting of embeddings in 3D
# Note: no dropout into or out of the bottleneck
encoded = Dense(throat_layer_nodes, kernel_initializer=kinit)(encoded)
encoded = LeakyReLU(alpha=leaky_alpha)(encoded)
####################################################################
# Decode Layers
####################################################################
decoded = encoded
for nodes in decode_layer_nodes:
if(apply_dropout):
decoded = Dropout(dropout_rate)(decoded)
decoded = Dense(nodes, kernel_initializer=kinit)(decoded)
if(apply_batchnorm):
decoded = BatchNormalization()(decoded)
decoded = LeakyReLU(alpha=leaky_alpha)(decoded)
#########################################################
# Reconstruct
#########################################################
#Split the output into one head per original field
# the first 10 columns are numerical values (minmax scaled to the 0-1 range), no activation required
decode_TRANS_DATE = Dense(1)(decoded) #intentionally no activation
decode_TRANS_DATE_Mth = Dense(1)(decoded) #intentionally no activation
decode_TRANS_DATE_DoM = Dense(1)(decoded) #intentionally no activation
decode_TRANS_DATE_DoW = Dense(1)(decoded) #intentionally no activation
decode_SYSTEM_DATE = Dense(1)(decoded) #intentionally no activation
decode_SYSTEM_DATE_Mth = Dense(1)(decoded) #intentionally no activation
decode_SYSTEM_DATE_DoM = Dense(1)(decoded) #intentionally no activation
decode_SYSTEM_DATE_DoW = Dense(1)(decoded) #intentionally no activation
decode_PostingDelayDays = Dense(1)(decoded) #intentionally no activation
decode_BC_TRANS_VALUE = Dense(1)(decoded) #intentionally no activation
# this is a binary field, requires sigmoid activation
decode_isDebit = Dense(1, activation='sigmoid')(decoded)
# the subsequent columns are all categories, requiring softmax activation
decode_TransType = Dense( 18, kernel_initializer=kinit, activation='softmax')(decoded)
decode_OPERATOR = Dense( 49, kernel_initializer=kinit, activation='softmax')(decoded)
decode_DOCUMENT_REF_L1 = Dense( 46, kernel_initializer=kinit, activation='softmax')(decoded)
decode_ACCOUNT_NO_Digit1 = Dense( 7, kernel_initializer=kinit, activation='softmax')(decoded)
decode_ACCOUNT_NO_Digit2 = Dense( 10, kernel_initializer=kinit, activation='softmax')(decoded)
decode_ACCOUNT_NO_Digit3 = Dense( 10, kernel_initializer=kinit, activation='softmax')(decoded)
decode_ACCOUNT_NO_Digit4 = Dense( 10, kernel_initializer=kinit, activation='softmax')(decoded)
decode_ACCOUNT_NO_Digit5 = Dense( 10, kernel_initializer=kinit, activation='softmax')(decoded)
decode_ACCOUNT_NO_Digit6 = Dense( 9, kernel_initializer=kinit, activation='softmax')(decoded)
decode_ACCOUNT_NO_Digit7 = Dense( 8, kernel_initializer=kinit, activation='softmax')(decoded)
decode_CUST_REF = Dense(101, kernel_initializer=kinit, activation='softmax')(decoded)
decode_SUPP_REF = Dense(100, kernel_initializer=kinit, activation='softmax')(decoded)
#########################################################
# Concatenate into one output
#########################################################
the_output = concatenate([decode_TRANS_DATE,
decode_TRANS_DATE_Mth,
decode_TRANS_DATE_DoM,
decode_TRANS_DATE_DoW,
decode_SYSTEM_DATE,
decode_SYSTEM_DATE_Mth,
decode_SYSTEM_DATE_DoM,
decode_SYSTEM_DATE_DoW,
decode_PostingDelayDays,
decode_BC_TRANS_VALUE,
decode_isDebit,
decode_TransType,
decode_OPERATOR,
decode_DOCUMENT_REF_L1,
decode_ACCOUNT_NO_Digit1,
decode_ACCOUNT_NO_Digit2,
decode_ACCOUNT_NO_Digit3,
decode_ACCOUNT_NO_Digit4,
decode_ACCOUNT_NO_Digit5,
decode_ACCOUNT_NO_Digit6,
decode_ACCOUNT_NO_Digit7,
decode_CUST_REF,
decode_SUPP_REF])
if get_embeddings:
#AUTOENCODER = ENCODE only
model = Model(the_input, encoded)
else:
#AUTOENCODER = ENCODE + DECODE
model = Model(the_input, the_output)
return(model)
```
# Instantiate the Model #
The model creation function, above, allows many permutations, but for now we need to see an example, just to confirm it is reasonable. This example uses the default 7 encoding layers, with the dropout rate set to zero and no batch normalisation.
We expect 391 inputs, which is the 393 columns in the data LESS the excluded fields: 'isRandom', which would give the game away, and DOCUMENT_REF, which is just an identifier and adds no information to the model.
```{python, example, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
# create a baseline model
autoencoder_example = create_model_basic(input_dim = input_dim,
dropout_rate = 0.0,
apply_batchnorm= False)
```
```{python, eval=TRUE, echo=TRUE, message=FALSE, cache=FALSE, fig.width=8}
autoencoder_example.summary()
```
290k parameters to train; this should be feasible with 1.3 million training records.
# Helper Function to Compile and Fit Models #
We will do a grid scan of the model options such as:

- batch size, e.g. [64, 256, 2048]
- dropout rate, e.g. [0.0, 0.1, 0.2]

We could also scan options such as the number of layers or the number of nodes within each layer, but time is limited.
At the heart of the grid scan is a function to create the model, compile it, then fit the data.
This function may also make use of Keras callbacks (TensorFlow is the backend to Keras on this rig).
```{python, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
from keras import callbacks
# hook up to Tensorboard for keeping records of progress
# to use the Tensorboard, go to command line, set cd "current project directory", then:
# tensorboard --logdir Tensorboard_ae_basic
# then
# http://localhost:6006/
## Tensorboard callback
tbCallback_ae_basic = callbacks.TensorBoard(log_dir = './Tensorboard_ae_basic',
histogram_freq = 0,
write_graph = True)
## Checkpoint callback
cpCallback_ae_basic = callbacks.ModelCheckpoint('./Checkpoints/ae_basic_{epoch:02d}.hdf5',
                                                monitor='val_loss', verbose=1,  # no accuracy metric is compiled, so monitor val_loss
                                                save_best_only=True, mode='min')
##Earlystopping, stop training when validation accuracy ceases to improve
esCallback = callbacks.EarlyStopping(monitor = 'val_loss', min_delta = 0.00005,
patience = 200, verbose = 0,
mode = 'auto', baseline = None)
```
```{python, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
os.chdir('c:\\Users\\User\\OneDrive\\Oliver\\0_OM\\Training\\DeepLearning_Udacity\\LSTM\\AcctgAnomalyDetection\\SaveModels')
def create_compile_fit(input_dim = 391,
dropout_rate = 0.1,
apply_batchnorm = True,
batch_size = 2048,
epochs = 500,
Train_data = Train_py,
Valid_data = Valid_py,
callbacks = None):
## Create model
if(dropout_rate == 0):
apply_dropout = False
else:
apply_dropout = True
model = create_model_basic(input_dim = input_dim,
apply_dropout = apply_dropout,
dropout_rate = dropout_rate,
apply_batchnorm= apply_batchnorm)
## Compile Model
model.compile(optimizer = 'adadelta',
loss = 'binary_crossentropy'
# metrics = ['accuracy']
)
## Fit model
model.fit(x = Train_data,
# note for autoencoders the x is the same as the y, output=input
y = Train_data,
epochs = epochs,
batch_size = batch_size,
shuffle = True,
validation_data = (Valid_data, Valid_data),
callbacks = callbacks)
return model
```
## Helper Function to Save Results ##
As we're creating a number of models we will need to save them to file.
```{python, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
def save_model(keras_model, filename):
import numpy as np
#set working directory
os.chdir('c:\\Users\\User\\OneDrive\\Oliver\\0_OM\\Training\\DeepLearning_Udacity\\LSTM\\AcctgAnomalyDetection\\SaveModels')
# save model
keras_model.save(filename+'.h5')
# save training history
#import pickle
#with open('./SaveModels/autoencoder_basic_history.pickle', 'wb') as file_pi:
# pickle.dump(autoencoder_basic.history.history, file_pi)
# save training history
l_loss = np.array(keras_model.history.history['loss'])
l_loss = np.reshape(l_loss, (l_loss.shape[0],1))
l_vloss = np.array(keras_model.history.history['val_loss'])
l_vloss = np.reshape(l_vloss, (l_vloss.shape[0],1))
np.save(file = filename+'_history',
arr = np.concatenate((l_loss, l_vloss), axis=1))
```
## Grid Scan the Model Parameters ##
All data is now either binary (precisely 1 or 0) or numerical (between 0 and 1). So we can use 'binary_crossentropy' for our loss calculation, as per the PwC paper. If the numerical fields had simply been standardised, with some values greater than 1, we would have been tempted to use 'mean_squared_error', but this would not have suited the categorical fields. This highlights an advantage of the minmax scaling: it enables us to use one loss calculation for both categorical and numerical fields.
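For reference, the per-record binary cross-entropy between an input $x$ and its reconstruction $\hat{x}$, averaged over the $d$ input columns, is:
$$
L(x, \hat{x}) = -\frac{1}{d}\sum_{j=1}^{d}\Big[\,x_j \log \hat{x}_j + (1 - x_j)\log(1 - \hat{x}_j)\,\Big]
$$
which is only meaningful when every $x_j$ lies in [0, 1], hence the importance of the minmax scaling above.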
The following hyperparameters remain to be identified:

- the optimal batch size
- whether regularisation via dropout is helpful
- whether regularisation via batch normalisation is helpful

A grid scan of the various hyperparameter permutations can be carried out to find the optimal arrangement. We simply train the various instances of the model and see which performs best.
```{python, eval=FALSE, echo=TRUE, message=FALSE, cache=FALSE}
import pandas as pd
from keras import callbacks
#dplyr for pandas in python
from dfply import *
# set FP16, else training is far too slow for grid search (24hrs per model, even with RTX2070)
#from keras.backend.common import set_floatx
#set_floatx('float16')
os.chdir('c:\\Users\\User\\OneDrive\\Oliver\\0_OM\\Training\\DeepLearning_Udacity\\LSTM\\AcctgAnomalyDetection\\SaveModels')
params = {'dropout_rate' : [ 0.0, 0.1],
'batch_size' : [ 256, 1024, 2048],
'batch_norm' : [False, True]
}
#prep results table
results = None
results = pd.DataFrame(columns=['dropout_rate', 'batch_size', 'batch_norm',
'val_loss', 'loss', 'epoch'])
loop_count = -1
for dropout_rate in params.get('dropout_rate'):
for batch_size in params.get('batch_size'):
for batch_norm in params.get('batch_norm'):
loop_count += 1
params_current = {'dropout_rate' : dropout_rate,
'batch_size' : batch_size}
# filename for model instance
filename = ('DropoutRate_'+ str(dropout_rate) +
'_BatchSize_' + str(batch_size) +
'_BatchNorm_' + str(batch_norm))
# create csv logger callback
csvCallback = callbacks.CSVLogger(filename+'_Log.log')
# create model
model_instance = create_compile_fit(input_dim = input_dim,
dropout_rate = dropout_rate,
apply_batchnorm = batch_norm,
batch_size = batch_size,
epochs = 400,
Train_data = Train_py,
Valid_data = Valid_py,
callbacks = [csvCallback])
# save the model
save_model(model_instance, filename)
# get result, i.e. min validation error and train error for same epoch
best_valid = min(model_instance.history.history['val_loss'])
best_valid_epc = np.where(model_instance.history.history['val_loss'] == best_valid)
best_valid_epc = best_valid_epc[0].item() #tie breaker, take first. need item else rtn array
matching_train = model_instance.history.history['loss'][best_valid_epc]
# save results to table for comparison
results.loc[len(results)] = [dropout_rate, batch_size, batch_norm,
best_valid, matching_train, best_valid_epc]
print("loop: ", loop_count)
print('\n')
print(results.loc[len(results)-1])
print('\n')
# save to file because this took a long time (approx 10hrs) to complete!
results.to_csv('GridScan_Results.csv')