
Refactor getDbCohortMethodData() #134

Open

schuemie opened this issue Mar 17, 2023 · 10 comments
@schuemie (Member)

The getDbCohortMethodData() is quite long and hard to read. It could use some refactoring. Some ideas:

  1. Separate code for fetching all the data, and constructing the attrition table (e.g. using a private function call).

  2. This code looks like it could be written to be shorter and clearer. All it is trying to do is make sure counts are 0 when a cohort is empty.

  3. In case the study calendar time is restricted, make sure there's a top row in the attrition object that just counts the cohorts prior to any restriction (in the current code the top row already restricts to calendar time)
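For idea 2, one way the zero-count handling could be made shorter is to fill in missing treatment arms declaratively. This is only an illustrative sketch, not the package's actual code; the column names (`treatment`, `personCount`, `rowCount`) are taken from snippets later in this thread, and `completeCounts` is a hypothetical helper:

```r
# Illustrative sketch: ensure counts are 0 for any treatment arm that is
# absent from the `counts` table, using tidyr::complete().
completeCounts <- function(counts) {
  tidyr::complete(
    counts,
    treatment = c(0, 1),
    fill = list(personCount = 0, rowCount = 0)
  )
}

counts <- data.frame(treatment = 1, personCount = 100, rowCount = 120)
completeCounts(counts)
# treatment 0 now appears with personCount = 0 and rowCount = 0
```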

@schuemie schuemie added the enhancement New functionality that could be added label Mar 17, 2023
@mvankessel-EMC (Collaborator)

I've added the following functions in the DataLoadingSaving.R:

  1. checkParameters L288:L330
  2. downSample L332:L379
  3. removeDupes (working title) L381:L445
  4. getPresampleCount L447:L457

The way getDbCohortMethodData is currently set up makes it difficult to offload bits of functionality to separate functions, because certain key objects (counts, metaData, preSampleCounts) are altered along the way. Every new function would take 5-10 parameters, which are copied into that function's own environment, and each function would also need to return an updated copy of all the outputs that are used later in getDbCohortMethodData.

Instead of all that, I've opted to set up an environment at the start of getDbCohortMethodData (L140) and reference all required objects through it. This prevents copying of data, and all functions defined in scope can access this environment.
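For illustration, the shared-environment pattern described above looks roughly like this (function and variable names are made up; the real code is in DataLoadingSaving.R):

```r
# Sketch of the shared-environment approach (illustrative names only): helper
# functions modify `env` by reference instead of returning updated copies.
getDbCohortMethodDataSketch <- function(targetId, comparatorId) {
  env <- new.env()
  env$metaData <- list(targetId = targetId, comparatorId = comparatorId)
  env$attrition <- character(0)

  addAttrition <- function(env, description) {
    # Side effect: mutates env$attrition; nothing is copied or returned.
    env$attrition <- c(env$attrition, description)
    invisible(NULL)
  }

  addAttrition(env, "Original cohorts")
  addAttrition(env, "Restricted to study period")
  list(metaData = env$metaData, attrition = env$attrition)
}

getDbCohortMethodDataSketch(1, 2)$attrition
# "Original cohorts" "Restricted to study period"
```

Because environments have reference semantics in R, the helpers can update shared state without the 5-10 parameter round-trips, which is the trade-off being debated in the rest of this thread.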

@schuemie (Member, Author)

schuemie commented Apr 4, 2023

I must admit I'm not sure this is really an improvement. Functions are normally a good way to reduce complexity because the only way a function can interact with your environment is through its arguments and return value.

For example, I think this still works really well with your checkParameters() function, which simply checks all the inputs and its only possible effect is to throw an error. (Although ideally I would explicitly limit its input to the getDbCohortMethodData() argument, for example using match.call()).

But it breaks down for example in the downSample() function. This function has all sorts of side-effects that are not obvious by just looking at the function call. For example, the function sets the sampled variable, leaving a developer scratching his head why the value set here isn't necessarily the value going in here. To understand the code, you'd have to read all sub-functions as well.

Instead, the downSample() function could have these arguments:

  • connection (you can call DatabaseConnector::dbms() to get the dbms)
  • tempEmulationSchema
  • targetId
  • comparatorId
  • maxCohortSize

and return a list with two items:

  • sampled
  • preSampleCounts
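As a sketch, the proposed interface might look like this. Only the argument list and return value follow the proposal above; the body is a stub (the real function would run the counting and sampling SQL through DatabaseConnector):

```r
# Hedged sketch of the proposed downSample() interface; counts are stubbed.
downSample <- function(connection, tempEmulationSchema, targetId,
                       comparatorId, maxCohortSize) {
  # Real code: count target/comparator exposures in the cohort temp table.
  preSampleCounts <- data.frame(
    targetExposures = 150000,
    comparatorExposures = 80000
  )
  sampled <- maxCohortSize > 0 &&
    (preSampleCounts$targetExposures > maxCohortSize ||
       preSampleCounts$comparatorExposures > maxCohortSize)
  # Real code: when sampled is TRUE, sample each arm down to maxCohortSize.
  list(sampled = sampled, preSampleCounts = preSampleCounts)
}

result <- downSample(NULL, NULL, targetId = 1, comparatorId = 2,
                     maxCohortSize = 100000)
result$sampled  # TRUE, because the stubbed target arm exceeds maxCohortSize
```

With this shape, everything the function affects is visible at the call site: the caller explicitly receives `sampled` and `preSampleCounts` back.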

@mvankessel-EMC (Collaborator)

I must admit I'm not sure this is really an improvement. Functions are normally a good way to reduce complexity because the only way a function can interact with your environment is through its arguments and return value.

I agree; an object like metaData is altered up to 8 times across 125 lines of code, which makes it difficult to refactor without significant alterations. I'll revert the implementations that use an environment for now.

For example, I think this still works really well with your checkParameters() function, which simply checks all the inputs and its only possible effect is to throw an error. (Although ideally I would explicitly limit its input to the getDbCohortMethodData() argument, for example using match.call()).

Cool! I will update the functionality to use match.call() instead of an environment.

But it breaks down for example in the downSample() function. This function has all sorts of side-effects that are not obvious by just looking at the function call. For example, the function sets the sampled variable, leaving a developer scratching his head why the value set here isn't necessarily the value going in here. To understand the code, you'd have to read all sub-functions as well.

Instead, the downSample() function could have these arguments:

  • connection (you can call DatabaseConnector::dbms() to get the dbms)
  • tempEmulationSchema
  • targetId
  • comparatorId
  • maxCohortSize

and return a list with two items:

  • sampled
  • preSampleCounts

I have actually done exactly that, but never pushed that variant, mostly for consistency's sake in how the sub-functions are implemented. I'll change downSample to that variant.

@mvankessel-EMC (Collaborator)

(Although ideally I would explicitly limit its input to the getDbCohortMethodData() argument, for example using match.call()).

Cool! I will update the functionality to use match.call() instead of an environment.

I ran into the issue that match.call() only captures the arguments that are explicitly supplied in the function call.

If I specify a dummy version of getDbCohortMethodData with its default parameters:

getDbCohortMethodData <- function(
    connectionDetails,
    cdmDatabaseSchema,
    tempEmulationSchema = getOption("sqlRenderTempEmulationSchema"),
    targetId,
    comparatorId,
    outcomeIds,
    studyStartDate = "",
    studyEndDate = "",
    exposureDatabaseSchema = cdmDatabaseSchema,
    exposureTable = "drug_era",
    outcomeDatabaseSchema = cdmDatabaseSchema,
    outcomeTable = "condition_occurrence",
    cdmVersion = "5",
    firstExposureOnly = FALSE,
    removeDuplicateSubjects = FALSE,
    restrictToCommonPeriod = FALSE,
    washoutPeriod = 0,
    maxCohortSize = 0,
    covariateSettings) {
  match.call()
  # Do getDbCohortMethodData things
  # ...
}

getDbCohortMethodData(
  connectionDetails = Eunomia::getEunomiaConnectionDetails(),
  cdmDatabaseSchema = "main",
  targetId = 1,
  comparatorId = 2,
  outcomeIds = c(3,4)
)

match.call() will return the following:

getDbCohortMethodData(connectionDetails = Eunomia::getEunomiaConnectionDetails(), 
    cdmDatabaseSchema = "main", targetId = 1, comparatorId = 2, 
    outcomeIds = c(3, 4))

ignoring the default parameters. This is similar to this issue, where the following is the accepted answer:

mget(names(formals()),sys.frame(sys.nframe()))

This fetches all the variables named in formals() from the current evaluation frame (the nth frame from sys.frame()), resulting in a named list of all parameters with either the passed or the default values, which can then be passed to checkParameters. I'm not particularly fond of this implementation, as it's not at all clear what it does, but it works.
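A self-contained illustration of the difference, using a dummy function rather than the real getDbCohortMethodData():

```r
# match.call() captures only the arguments explicitly supplied in the call;
# mget(names(formals()), sys.frame(sys.nframe())) also picks up the defaults.
f <- function(x, y = 10, z = "a") {
  suppliedArgs <- names(as.list(match.call())[-1])
  allArgs <- mget(names(formals()), sys.frame(sys.nframe()))
  list(supplied = suppliedArgs, all = allArgs)
}

res <- f(x = 1)
res$supplied    # "x"
names(res$all)  # "x" "y" "z"
res$all$y       # 10, the default value that match.call() would have dropped
```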

So my question is, what do you prefer:

  1. Pass all parameters manually to checkParameters.
  2. Create an environment and pass it. Clean up the environment after the checks have passed.
  3. Pass the list of parameters using mget, formals and sys.frame.

@schuemie (Member, Author)

schuemie commented Apr 7, 2023

I think I actually prefer option 4:

  1. Keep the parameter checks in the main function.

Let me try to explain why:

  • This code is actually not adding complexity to the function, since it is not changing any variables. All it does is throw an error if something doesn't meet expectations, so as a developer I can easily ignore this whole block if I want.
  • I like having the checks on the arguments right below the function header, and therefore also close to the argument documentation. Together these define the interface of this function.
  • This is an extreme case, where the function has lots of parameters. In general we're talking about a few lines of code at the top of a function (and I would like to follow the same pattern in every function in this package, and ideally in HADES).

The only thing I don't like is that we have code above the parameter check block. This should ideally not happen. I think we can achieve that by no longer allowing removeDuplicateSubjects to be a boolean, and by moving the NULL checks on studyStartDate and studyEndDate to below the check block.
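For illustration, a check block kept right below the function header typically looks like this in HADES-style code, using the checkmate package (the arguments shown are only a subset, and the function name is a stand-in):

```r
# Sketch of a parameter-check block at the top of the main function; all it
# can do is throw an error, so a reader can skip past it safely.
library(checkmate)

getDbCohortMethodDataChecked <- function(targetId,
                                         comparatorId,
                                         studyStartDate = "",
                                         maxCohortSize = 0) {
  errorMessages <- makeAssertCollection()
  assertInt(targetId, add = errorMessages)
  assertInt(comparatorId, add = errorMessages)
  assertCharacter(studyStartDate, len = 1, add = errorMessages)
  assertInt(maxCohortSize, lower = 0, add = errorMessages)
  reportAssertions(errorMessages)
  # The body of the function starts only after all checks pass.
  invisible(TRUE)
}

getDbCohortMethodDataChecked(targetId = 1, comparatorId = 2)  # passes silently
```

Collecting the assertions before reporting them means the caller sees every violated expectation at once instead of one error per run.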

@mvankessel-EMC (Collaborator)

mvankessel-EMC commented Apr 11, 2023

I think I actually prefer option 4:

  1. Keep the parameter checks in the main function.

Let me try to explain why:

  • This code is actually not adding complexity to the function, since it is not changing any variables. All it does is throw an error if something doesn't meet expectations, so as a developer I can easily ignore this whole block if I want.
  • I like having the checks on the arguments right below the function header, and therefore also close to the argument documentation. Together these define the interface of this function.
  • This is an extreme case, where the function has lots of parameters. In general we're talking about a few lines of code at the top of a function (and I would like to follow the same pattern in every function in this package, and ideally in HADES).

The only thing I don't like is that we have code above the parameter check block. This should ideally not happen. I think we can achieve that by no longer allowing removeDuplicateSubjects to be a boolean, and by moving the NULL checks on studyStartDate and studyEndDate to below the check block.

I've pushed the following updates:

  1. Reverted all changes except downSample.
  2. Deprecated the boolean support for removeDuplicateSubjects, which now defaults to "keep all", and updated test-simulation.R to accommodate this change.
  3. Moved the studyStartDate and studyEndDate NULL updates after the assertions.
  4. I've moved the following code into a function so it can be reused:
getDbCohortMethodData <- function(...) {
  ...
  idx <- which(counts$treatment == 1)
  if (length(idx) == 0) {
    preSampleCounts$targetPersons <- 0
    preSampleCounts$targetExposures <- 0
  } else {
    preSampleCounts$targetPersons <- counts$personCount[idx]
    preSampleCounts$targetExposures <- counts$rowCount[idx]
  }
  idx <- which(counts$treatment == 0)
  if (length(idx) == 0) {
    preSampleCounts$comparatorPersons <- 0
    preSampleCounts$comparatorExposures <- 0
  } else {
    preSampleCounts$comparatorPersons <- counts$personCount[idx]
    preSampleCounts$comparatorExposures <- counts$rowCount[idx]
  }
  ...
}

to:

getDbCohortMethodData <- function(...) {
  ...
  idx <- which(counts$treatment == 1)
  preSampleCounts <- preSample(idx, "target", counts, preSampleCounts)
  idx <- which(counts$treatment == 0)
  preSampleCounts <- preSample(idx, "comparator", counts, preSampleCounts)
  ...
}

preSample <- function(idx, colType, counts, preSampleCounts) {
  personCol <- paste0(colType, "Persons")
  exposuresCol <- paste0(colType, "Exposures")
  if (length(idx) == 0) {
    preSampleCounts[personCol] <- 0
    preSampleCounts[exposuresCol] <- 0
  } else {
    preSampleCounts[personCol] <- counts$personCount[idx]
    preSampleCounts[exposuresCol] <- counts$rowCount[idx]
  }
  return(preSampleCounts)
}

@schuemie (Member, Author)

Sounds good! Could you create a PR? That makes it easier to view all the changes.

@mvankessel-EMC (Collaborator)

mvankessel-EMC commented Apr 17, 2023

Hi @schuemie, as per this comment:

Looks great! As discussed, further refactoring would probably require turning the meta-data into some nice object that can be passed around by reference. But I'll merge what you've done so far, and leave it to you if you want to work on that.

I implemented a version of getDbCohortMethodData using R6. For now both implementations are available within the branch DataLoadingSaving-R6.

I split the functionality in two classes:

  1. CohortMethodData: Class that does the computation within R.
  2. CohortDbInterface: Class that exclusively interfaces with the database.

Both classes are called within the new getDbCohortMethodData function. The function interface is exactly the same, i.e. the input is the same, as well as the output. The old version of the function is still available as getDbCohortMethodData_DEP.
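A minimal sketch of the split (method names and bodies here are illustrative stand-ins; the actual classes live on the DataLoadingSaving-R6 branch):

```r
library(R6)

# Stand-in for CohortDbInterface: everything that touches the database.
CohortDbInterface <- R6Class("CohortDbInterface",
  public = list(
    connectionDetails = NULL,
    initialize = function(connectionDetails) {
      self$connectionDetails <- connectionDetails
    },
    getCohorts = function(targetId, comparatorId) {
      # Real class: query the cohort table via DatabaseConnector.
      data.frame(treatment = c(1, 0), subjectId = c(101, 202))
    }
  )
)

# Stand-in for the R6 CohortMethodData class: computation within R.
CohortMethodDataR6 <- R6Class("CohortMethodDataR6",
  public = list(
    dbInterface = NULL,
    initialize = function(dbInterface) {
      self$dbInterface <- dbInterface
    },
    getDbCohortMethodData = function(targetId, comparatorId) {
      cohorts <- self$dbInterface$getCohorts(targetId, comparatorId)
      # Real class: attrition, down-sampling, covariate extraction, etc.
      cohorts
    }
  )
)

cmd <- CohortMethodDataR6$new(CohortDbInterface$new(NULL))
cmd$getDbCohortMethodData(targetId = 1, comparatorId = 2)
```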

I also benchmarked the new implementation and compared it with the old one, using microbenchmark and profvis, to check that speed and memory usage are at least similar.

Benchmark

Dummy Setup

connectionDetails <- Eunomia::getEunomiaConnectionDetails()
Eunomia::createCohorts(connectionDetails)

packageRoot <- normalizePath(system.file("..", package = "CohortMethod"))
R.utils::createLink(link = file.path(packageRoot, "sql"), system.file("sql", package = "CohortMethod"))
options("use.devtools.sql_shim" = TRUE)

covarSettings <- FeatureExtraction::createCovariateSettings(
  useDemographicsGender = TRUE,
  useDemographicsAge = TRUE,
  useDemographicsAgeGroup = TRUE,
  useDemographicsIndexMonth = TRUE,
  useDemographicsPriorObservationTime = TRUE,
  useDemographicsPostObservationTime = TRUE,
  useDemographicsTimeInCohort = TRUE,
  useDemographicsIndexYearMonth = TRUE,
  useMeasurementValueLongTerm = TRUE,
  useMeasurementValueMediumTerm = TRUE,
  useMeasurementValueShortTerm = TRUE,
  useMeasurementRangeGroupAnyTimePrior = TRUE,
  useMeasurementRangeGroupLongTerm = TRUE,
  useMeasurementRangeGroupMediumTerm = TRUE,
  useMeasurementRangeGroupShortTerm = TRUE,
  useObservationAnyTimePrior = TRUE,
  useObservationLongTerm = TRUE,
  useObservationMediumTerm = TRUE,
  useObservationShortTerm = TRUE,
  endDays = 180
)

options(sqlRenderTempEmulationSchema = "main")

microbenchmark

microbenchmark::microbenchmark(
  Original = {
    cohortMethodData_DEP <- CohortMethod::getDbCohortMethodData_DEP(
      connectionDetails = connectionDetails,
      cdmDatabaseSchema = "main",
      exposureTable = "cohort",
      outcomeTable = "cohort",
      targetId = 1,
      comparatorId = 2,
      outcomeIds = c(3, 4),
      cdmVersion = "5",
      washoutPeriod = 183,
      firstExposureOnly = TRUE,
      removeDuplicateSubjects = "keep all",
      restrictToCommonPeriod = TRUE,
      maxCohortSize = 100000,
      covariateSettings = covarSettings
    )
  },
  R6 = {
    cohortMethodData <- CohortMethod::getDbCohortMethodData(
      connectionDetails = connectionDetails,
      cdmDatabaseSchema = "main",
      exposureTable = "cohort",
      outcomeTable = "cohort",
      targetId = 1,
      comparatorId = 2,
      outcomeIds = c(3, 4),
      cdmVersion = "5",
      washoutPeriod = 183,
      firstExposureOnly = TRUE,
      removeDuplicateSubjects = "keep all",
      restrictToCommonPeriod = TRUE,
      maxCohortSize = 100000,
      covariateSettings = covarSettings
    )
  },
  times = 10
)

There seems to be no significant speed difference between the two implementations. The minimum and maximum execution times of the R6 version are, however, slightly shorter.

Unit: seconds
     expr      min       lq     mean   median       uq      max neval cld
 Original 2.226537 2.352759 2.596895 2.466261 2.724559 3.488108    10   a
       R6 2.204499 2.481361 2.734244 2.801655 2.861608 3.315819    10   a

Profvis

# R6 ----
profvis::profvis({
  cohortMethodData <- CohortMethod::getDbCohortMethodData(
    connectionDetails = connectionDetails,
    cdmDatabaseSchema = "main",
    exposureTable = "cohort",
    outcomeTable = "cohort",
    targetId = 1,
    comparatorId = 2,
    outcomeIds = c(3, 4),
    cdmVersion = "5",
    washoutPeriod = 183,
    firstExposureOnly = TRUE,
    removeDuplicateSubjects = "keep all",
    restrictToCommonPeriod = TRUE,
    maxCohortSize = 100000,
    covariateSettings = covarSettings
  )
})

# Original ----
profvis::profvis({
  cohortMethodData <- CohortMethod::getDbCohortMethodData_DEP(
    connectionDetails = connectionDetails,
    cdmDatabaseSchema = "main",
    exposureTable = "cohort",
    outcomeTable = "cohort",
    targetId = 1,
    comparatorId = 2,
    outcomeIds = c(3, 4),
    cdmVersion = "5",
    washoutPeriod = 183,
    firstExposureOnly = TRUE,
    removeDuplicateSubjects = "keep all",
    restrictToCommonPeriod = TRUE,
    maxCohortSize = 100000,
    covariateSettings = covarSettings
  )
})

Memory usage seems to be nearly identical as well.

Implementation  Memory freed (-)  Memory used (+)  Net usage
Original        14.4              16.3             1.9
R6              14.2              16.4             2.2

@schuemie (Member, Author)

Thanks! Could you help me understand your overall design here? You have

  • a getDbCohortMethodData() function
  • a CohortDbInterface object (R6)
  • a CohortMethodDataR6 object (R6)
  • the old CohortMethodData object (S4)

The getDbCohortMethodData() creates a CohortDbInterface object, and then uses that as an argument when creating a CohortMethodDataR6 object. The getDbCohortMethodData() then continues to call the CohortMethodDataR6, which under the hood calls CohortDbInterface, ultimately constructing an old-fashioned CohortMethodData object.

Any reason for having the two R6 objects? Code like this makes me wonder if that is adding anything. (The new solution also seems to have more than doubled the lines of code, making it hard to argue it has simplified things ;-) )

@mvankessel-EMC (Collaborator)

Thanks! Could you help me understand your overall design here? You have

  • a getDbCohortMethodData() function
  • a CohortDbInterface object (R6)
  • a CohortMethodDataR6 object (R6)
  • the old CohortMethodData object (S4)

The getDbCohortMethodData() creates a CohortDbInterface object, and then uses that as an argument when creating a CohortMethodDataR6 object. The getDbCohortMethodData() then continues to call the CohortMethodDataR6, which under the hood calls CohortDbInterface, ultimately constructing an old-fashioned CohortMethodData object.

Yes, so my goal for this implementation was to remain compatible with the current implementation of CohortMethod. This means that the input and output of the implementation should remain the same, so functionality that depends on it doesn't fail.

Any reason for having the two R6 objects? Code like this makes me wonder if that is adding anything. (The new solution also seems to have more than doubled the lines of code, making it hard to argue it has simplified things ;-) )

Initially I created one massive R6 class. While doing that, I realized that half the parameters:

  1. connectionDetails
  2. tempEmulationSchema
  3. cdmDatabaseSchema
  4. exposureDatabaseSchema
  5. exposureTable
  6. cdmVersion
  7. outcomeTable
  8. outcomeDatabaseSchema

are exclusively used for interfacing with the database, which is why I split the class in two. I do agree that this implementation is redundant. I initially only wanted to call methods from CohortMethodDataR6, but that plan fell through when I needed to disconnect from the database, so I will change that.

As to simplifying things: I guess it depends on how you define simplified. I agree that the code is scattered more across the two classes, but the trade-off is that the cyclomatic complexity of each function drastically decreases.

Current implementation

Function                       Cyclomatic Complexity
getDbCohortMethodData          28
downSample                     4
countPreSample                 2
handleCohortCovariateBuilders  9

R6 implementation

CohortMethodData

Function                       Cyclomatic Complexity
initialize                     1
getDbCohortMethodData          2
createCohorts                  1
downSample                     4
getCohorts                     1
buildCovariateData             2
validate                       6
countPreSample                 2
checkCohorts                   4
constructLabel                 6
filterCohorts                  10
handleCohortCovariateBuilders  9

CohortDbInterface

Function         Cyclomatic Complexity
initialize       1
disconnect       1
createCohorts    1
getCohorts       1
countCohorts     1
sampleCohorts    1
countRaw         1
getOutcomes      1
rmTempTables     1
extractCovarDat  1
validate         1
