
Refactor getDbCohortMethodData() #134

Open

schuemie opened this issue Mar 17, 2023 · 10 comments
@schuemie (Member)

The getDbCohortMethodData() is quite long and hard to read. It could use some refactoring. Some ideas:

  1. Separate code for fetching all the data, and constructing the attrition table (e.g. using a private function call).

  2. This code looks like it could be written to be shorter and clearer. All it is trying to do is make sure counts are 0 when a cohort is empty.

  3. In case the study calendar time is restricted, make sure there's a top row in the attrition object that just counts the cohorts prior to any restriction (in the current code the top row already restricts to calendar time)
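For idea 2, one way the zero-count handling could be made shorter is to fill in missing treatment arms declaratively. This is only an illustrative sketch, not the package's actual code; the column names (`treatment`, `personCount`, `rowCount`) are taken from snippets later in this thread, and `completeCounts` is a hypothetical helper:

```r
# Illustrative sketch: ensure counts are 0 for any treatment arm that is
# absent from the `counts` table, using tidyr::complete().
completeCounts <- function(counts) {
  tidyr::complete(
    counts,
    treatment = c(0, 1),
    fill = list(personCount = 0, rowCount = 0)
  )
}

counts <- data.frame(treatment = 1, personCount = 100, rowCount = 120)
completeCounts(counts)
# treatment 0 now appears with personCount = 0 and rowCount = 0
```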

@schuemie schuemie added the enhancement New functionality that could be added label Mar 17, 2023
@mvankessel-EMC (Collaborator)

I've added the following functions in the DataLoadingSaving.R:

  1. checkParameters L288:L330
  2. downSample L332:L379
  3. removeDupes (working title) L381:L445
  4. getPresampleCount L447:L457

The way getDbCohortMethodData is currently set up makes it difficult to offload bits of functionality to separate functions, because certain key objects (counts, metaData, preSampleCounts) are altered along the way. Every new function would take 5-10 parameters, which are copied into that function's own environment, and each function would also need to return an updated copy of all the outputs that are used later in getDbCohortMethodData.

Instead of all that, I've opted to set up an environment at the start of getDbCohortMethodData (L140) and reference all required objects through it. This prevents copying of data, and all functions defined in scope can access this environment.
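For illustration, the shared-environment pattern described above looks roughly like this (function and variable names are made up; the real code is in DataLoadingSaving.R):

```r
# Sketch of the shared-environment approach (illustrative names only): helper
# functions modify `env` by reference instead of returning updated copies.
getDbCohortMethodDataSketch <- function(targetId, comparatorId) {
  env <- new.env()
  env$metaData <- list(targetId = targetId, comparatorId = comparatorId)
  env$attrition <- character(0)

  addAttrition <- function(env, description) {
    # Side effect: mutates env$attrition; nothing is copied or returned.
    env$attrition <- c(env$attrition, description)
    invisible(NULL)
  }

  addAttrition(env, "Original cohorts")
  addAttrition(env, "Restricted to study period")
  list(metaData = env$metaData, attrition = env$attrition)
}

getDbCohortMethodDataSketch(1, 2)$attrition
# "Original cohorts" "Restricted to study period"
```

Because environments have reference semantics in R, the helpers can update shared state without the 5-10 parameter round-trips, which is the trade-off being debated in the rest of this thread.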

@schuemie (Member, Author)

schuemie commented Apr 4, 2023

I must admit I'm not sure this is really an improvement. Functions are normally a good way to reduce complexity because the only way a function can interact with your environment is through its arguments and return value.

For example, I think this still works really well with your checkParameters() function, which simply checks all the inputs and its only possible effect is to throw an error. (Although ideally I would explicitly limit its input to the getDbCohortMethodData() argument, for example using match.call()).

But it breaks down for example in the downSample() function. This function has all sorts of side-effects that are not obvious by just looking at the function call. For example, the function sets the sampled variable, leaving a developer scratching his head why the value set here isn't necessarily the value going in here. To understand the code, you'd have to read all sub-functions as well.

Instead, the downSample() function could have these arguments:

  • connection (you can call DatabaseConnector::dbms() to get the dbms)
  • tempEmulationSchema
  • targetId
  • comparatorId
  • maxCohortSize

and return a list with two items:

  • sampled
  • preSampleCounts
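As a sketch, the proposed interface might look like this. Only the argument list and return value follow the proposal above; the body is a stub (the real function would run the counting and sampling SQL through DatabaseConnector):

```r
# Hedged sketch of the proposed downSample() interface; counts are stubbed.
downSample <- function(connection, tempEmulationSchema, targetId,
                       comparatorId, maxCohortSize) {
  # Real code: count target/comparator exposures in the cohort temp table.
  preSampleCounts <- data.frame(
    targetExposures = 150000,
    comparatorExposures = 80000
  )
  sampled <- maxCohortSize > 0 &&
    (preSampleCounts$targetExposures > maxCohortSize ||
       preSampleCounts$comparatorExposures > maxCohortSize)
  # Real code: when sampled is TRUE, sample each arm down to maxCohortSize.
  list(sampled = sampled, preSampleCounts = preSampleCounts)
}

result <- downSample(NULL, NULL, targetId = 1, comparatorId = 2,
                     maxCohortSize = 100000)
result$sampled  # TRUE, because the stubbed target arm exceeds maxCohortSize
```

With this shape, everything the function affects is visible at the call site: the caller explicitly receives `sampled` and `preSampleCounts` back.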

@mvankessel-EMC (Collaborator)

I must admit I'm not sure this is really an improvement. Functions are normally a good way to reduce complexity because the only way a function can interact with your environment is through its arguments and return value.

I agree; an object like metaData is altered up to 8 times across 125 lines of code, which makes it difficult to refactor without significant alterations. I'll revert the implementations that use an environment for now.

For example, I think this still works really well with your checkParameters() function, which simply checks all the inputs and its only possible effect is to throw an error. (Although ideally I would explicitly limit its input to the getDbCohortMethodData() argument, for example using match.call()).

Cool! I will update the functionality to use match.call() instead of an environment.

But it breaks down for example in the downSample() function. This function has all sorts of side-effects that are not obvious by just looking at the function call. For example, the function sets the sampled variable, leaving a developer scratching his head why the value set here isn't necessarily the value going in here. To understand the code, you'd have to read all sub-functions as well.

Instead, the downSample() function could have these arguments:

  • connection (you can call DatabaseConnector::dbms() to get the dbms)
  • tempEmulationSchema
  • targetId
  • comparatorId
  • maxCohortSize

and return a list with two items:

  • sampled
  • preSampleCounts

I have actually done exactly that, but never pushed that variant, mostly for consistency's sake in how the sub-functions are implemented. I'll change downSample to that variant.

@mvankessel-EMC (Collaborator)

(Although ideally I would explicitly limit its input to the getDbCohortMethodData() argument, for example using match.call()).

Cool! I will update the functionality to use match.call() instead of an environment.

I ran into the issue that match.call() only captures the arguments that are explicitly supplied in the function call.

If I specify a dummy version of getDbCohortMethodData with its default parameters:

getDbCohortMethodData <- function(
    connectionDetails,
    cdmDatabaseSchema,
    tempEmulationSchema = getOption("sqlRenderTempEmulationSchema"),
    targetId,
    comparatorId,
    outcomeIds,
    studyStartDate = "",
    studyEndDate = "",
    exposureDatabaseSchema = cdmDatabaseSchema,
    exposureTable = "drug_era",
    outcomeDatabaseSchema = cdmDatabaseSchema,
    outcomeTable = "condition_occurrence",
    cdmVersion = "5",
    firstExposureOnly = FALSE,
    removeDuplicateSubjects = FALSE,
    restrictToCommonPeriod = FALSE,
    washoutPeriod = 0,
    maxCohortSize = 0,
    covariateSettings) {
  match.call()
  # Do getDbCohortMethodData things
  # ...
}

getDbCohortMethodData(
  connectionDetails = Eunomia::getEunomiaConnectionDetails(),
  cdmDatabaseSchema = "main",
  targetId = 1,
  comparatorId = 2,
  outcomeIds = c(3,4)
)

match.call() will return the following:

getDbCohortMethodData(connectionDetails = Eunomia::getEunomiaConnectionDetails(), 
    cdmDatabaseSchema = "main", targetId = 1, comparatorId = 2, 
    outcomeIds = c(3, 4))

ignoring the default parameters. This is similar to this issue, where the following is the accepted answer:

mget(names(formals()),sys.frame(sys.nframe()))

This fetches all the variables named in formals() from the current evaluation frame (the nth frame from sys.frame()), resulting in a named list of all parameters with either the passed or the default values, which can then be passed to checkParameters. I'm not particularly fond of this implementation, as it's not at all clear what it does, but it works.
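A self-contained illustration of the difference, using a dummy function rather than the real getDbCohortMethodData():

```r
# match.call() captures only the arguments explicitly supplied in the call;
# mget(names(formals()), sys.frame(sys.nframe())) also picks up the defaults.
f <- function(x, y = 10, z = "a") {
  suppliedArgs <- names(as.list(match.call())[-1])
  allArgs <- mget(names(formals()), sys.frame(sys.nframe()))
  list(supplied = suppliedArgs, all = allArgs)
}

res <- f(x = 1)
res$supplied    # "x"
names(res$all)  # "x" "y" "z"
res$all$y       # 10, the default value that match.call() would have dropped
```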

So my question is, what do you prefer:

  1. Pass all parameters manually to checkParameters.
  2. Create an environment and pass it. Clean up the environment after the checks have passed.
  3. Pass the list of parameters using mget, formals and sys.frame.

@schuemie (Member, Author)

schuemie commented Apr 7, 2023

I think I actually prefer option 4:

  1. Keep the parameter checks in the main function.

Let me try to explain why:

  • This code is actually not adding complexity to the function, since it is not changing any variables. All it does is throw an error if something doesn't meet expectations, so as a developer I can easily ignore this whole block if I want.
  • I like having the checks on the arguments right below the function header, and therefore also close to the argument documentation. Together these define the interface of this function.
  • This is an extreme case, where the function has lots of parameters. In general we're talking about a few lines of code at the top of a function (and I would like to follow the same pattern in every function in this package, and ideally in HADES).

The only thing I don't like is that we have code above the parameter check block. This should ideally not happen. I think we can achieve that by no longer allowing removeDuplicateSubjects to be a boolean, and by moving the NULL checks on studyStartDate and studyEndDate to below the check block.
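For illustration, a check block kept right below the function header typically looks like this in HADES-style code, using the checkmate package (the arguments shown are only a subset, and the function name is a stand-in):

```r
# Sketch of a parameter-check block at the top of the main function; all it
# can do is throw an error, so a reader can skip past it safely.
library(checkmate)

getDbCohortMethodDataChecked <- function(targetId,
                                         comparatorId,
                                         studyStartDate = "",
                                         maxCohortSize = 0) {
  errorMessages <- makeAssertCollection()
  assertInt(targetId, add = errorMessages)
  assertInt(comparatorId, add = errorMessages)
  assertCharacter(studyStartDate, len = 1, add = errorMessages)
  assertInt(maxCohortSize, lower = 0, add = errorMessages)
  reportAssertions(errorMessages)
  # The body of the function starts only after all checks pass.
  invisible(TRUE)
}

getDbCohortMethodDataChecked(targetId = 1, comparatorId = 2)  # passes silently
```

Collecting the assertions before reporting them means the caller sees every violated expectation at once instead of one error per run.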

@mvankessel-EMC (Collaborator)

mvankessel-EMC commented Apr 11, 2023

I think I actually prefer option 4:

  1. Keep the parameter checks in the main function.

Let me try to explain why:

  • This code is actually not adding complexity to the function, since it is not changing any variables. All it does is throw an error if something doesn't meet expectations, so as a developer I can easily ignore this whole block if I want.
  • I like having the checks on the arguments right below the function header, and therefore also close to the argument documentation. Together these define the interface of this function.
  • This is an extreme case, where the function has lots of parameters. In general we're talking about a few lines of code at the top of a function (and I would like to follow the same pattern in every function in this package, and ideally in HADES).

The only thing I don't like is that we have code above the parameter check block. This should ideally not happen. I think we can achieve that by no longer allowing removeDuplicateSubjects to be a boolean, and by moving the NULL checks on studyStartDate and studyEndDate to below the check block.

I've pushed the following updates:

  1. Reverted all changes except downSample.
  2. Deprecated the boolean support for removeDuplicateSubjects, which now defaults to "keep all", and updated test-simulation.R to accommodate this change.
  3. Moved the studyStartDate and studyEndDate NULL updates after the assertions.
  4. I've moved the following code into a function so it can be reused:
getDbCohortMethodData <- function(...) {
  ...
  idx <- which(counts$treatment == 1)
  if (length(idx) == 0) {
    preSampleCounts$targetPersons <- 0
    preSampleCounts$targetExposures <- 0
  } else {
    preSampleCounts$targetPersons <- counts$personCount[idx]
    preSampleCounts$targetExposures <- counts$rowCount[idx]
  }
  idx <- which(counts$treatment == 0)
  if (length(idx) == 0) {
    preSampleCounts$comparatorPersons <- 0
    preSampleCounts$comparatorExposures <- 0
  } else {
    preSampleCounts$comparatorPersons <- counts$personCount[idx]
    preSampleCounts$comparatorExposures <- counts$rowCount[idx]
  }
  ...
}

to:

getDbCohortMethodData <- function(...) {
  ...
  idx <- which(counts$treatment == 1)
  preSampleCounts <- preSample(idx, "target", counts, preSampleCounts)
  idx <- which(counts$treatment == 0)
  preSampleCounts <- preSample(idx, "comparator", counts, preSampleCounts)
  ...
}

preSample <- function(idx, colType, counts, preSampleCounts) {
  personCol <- paste0(colType, "Persons")
  exposuresCol <- paste0(colType, "Exposures")
  if (length(idx) == 0) {
    preSampleCounts[personCol] <- 0
    preSampleCounts[exposuresCol] <- 0
  } else {
    preSampleCounts[personCol] <- counts$personCount[idx]
    preSampleCounts[exposuresCol] <- counts$rowCount[idx]
  }
  return(preSampleCounts)
}

@schuemie (Member, Author)

Sounds good! Could you create a PR? That makes it easier to view all the changes.

@mvankessel-EMC (Collaborator)

mvankessel-EMC commented Apr 17, 2023

Hi @schuemie, as per this comment:

Looks great! As discussed, further refactoring would probably require turning the meta-data into some nice object that can be passed around by reference. But I'll merge what you've done so far, and leave it to you if you want to work on that.

I implemented a version of getDbCohortMethodData using R6. For now both implementations are available within the branch DataLoadingSaving-R6.

I split the functionality in two classes:

  1. CohortMethodData: Class that does the computation within R.
  2. CohortDbInterface: Class that exclusively interfaces with the database.

Both classes are called within the new getDbCohortMethodData function. The function interface is exactly the same, i.e. the input is the same, as well as the output. The old version of the function is still available as getDbCohortMethodData_DEP.
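A minimal sketch of the split (method names and bodies here are illustrative stand-ins; the actual classes live on the DataLoadingSaving-R6 branch):

```r
library(R6)

# Stand-in for CohortDbInterface: everything that touches the database.
CohortDbInterface <- R6Class("CohortDbInterface",
  public = list(
    connectionDetails = NULL,
    initialize = function(connectionDetails) {
      self$connectionDetails <- connectionDetails
    },
    getCohorts = function(targetId, comparatorId) {
      # Real class: query the cohort table via DatabaseConnector.
      data.frame(treatment = c(1, 0), subjectId = c(101, 202))
    }
  )
)

# Stand-in for the R6 CohortMethodData class: computation within R.
CohortMethodDataR6 <- R6Class("CohortMethodDataR6",
  public = list(
    dbInterface = NULL,
    initialize = function(dbInterface) {
      self$dbInterface <- dbInterface
    },
    getDbCohortMethodData = function(targetId, comparatorId) {
      cohorts <- self$dbInterface$getCohorts(targetId, comparatorId)
      # Real class: attrition, down-sampling, covariate extraction, etc.
      cohorts
    }
  )
)

cmd <- CohortMethodDataR6$new(CohortDbInterface$new(NULL))
cmd$getDbCohortMethodData(targetId = 1, comparatorId = 2)
```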

I also benchmarked the new implementation and compared it with the old one, using microbenchmark and profvis, to check that speed and memory usage are at least similar.

Benchmark

Dummy Setup

connectionDetails <- Eunomia::getEunomiaConnectionDetails()
Eunomia::createCohorts(connectionDetails)

packageRoot <- normalizePath(system.file("..", package = "CohortMethod"))
R.utils::createLink(link = file.path(packageRoot, "sql"), system.file("sql", package = "CohortMethod"))
options("use.devtools.sql_shim" = TRUE)

covarSettings <- FeatureExtraction::createCovariateSettings(
  useDemographicsGender = TRUE,
  useDemographicsAge = TRUE,
  useDemographicsAgeGroup = TRUE,
  useDemographicsIndexMonth = TRUE,
  useDemographicsPriorObservationTime = TRUE,
  useDemographicsPostObservationTime = TRUE,
  useDemographicsTimeInCohort = TRUE,
  useDemographicsIndexYearMonth = TRUE,
  useMeasurementValueLongTerm = TRUE,
  useMeasurementValueMediumTerm = TRUE,
  useMeasurementValueShortTerm = TRUE,
  useMeasurementRangeGroupAnyTimePrior = TRUE,
  useMeasurementRangeGroupLongTerm = TRUE,
  useMeasurementRangeGroupMediumTerm = TRUE,
  useMeasurementRangeGroupShortTerm = TRUE,
  useObservationAnyTimePrior = TRUE,
  useObservationLongTerm = TRUE,
  useObservationMediumTerm = TRUE,
  useObservationShortTerm = TRUE,
  endDays = 180
)

options(sqlRenderTempEmulationSchema = "main")

microbenchmark

microbenchmark::microbenchmark(
  Original = {
    cohortMethodData_DEP <- CohortMethod::getDbCohortMethodData_DEP(
      connectionDetails = connectionDetails,
      cdmDatabaseSchema = "main",
      exposureTable = "cohort",
      outcomeTable = "cohort",
      targetId = 1,
      comparatorId = 2,
      outcomeIds = c(3, 4),
      cdmVersion = "5",
      washoutPeriod = 183,
      firstExposureOnly = TRUE,
      removeDuplicateSubjects = "keep all",
      restrictToCommonPeriod = TRUE,
      maxCohortSize = 100000,
      covariateSettings = covarSettings
    )
  },
  R6 = {
    cohortMethodData <- CohortMethod::getDbCohortMethodData(
      connectionDetails = connectionDetails,
      cdmDatabaseSchema = "main",
      exposureTable = "cohort",
      outcomeTable = "cohort",
      targetId = 1,
      comparatorId = 2,
      outcomeIds = c(3, 4),
      cdmVersion = "5",
      washoutPeriod = 183,
      firstExposureOnly = TRUE,
      removeDuplicateSubjects = "keep all",
      restrictToCommonPeriod = TRUE,
      maxCohortSize = 100000,
      covariateSettings = covarSettings
    )
  },
  times = 10
)

There seems to be no significant speed difference between the two implementations. The minimum and maximum execution times of the R6 version are, however, slightly shorter.

Unit: seconds
     expr      min       lq     mean   median       uq      max neval cld
 Original 2.226537 2.352759 2.596895 2.466261 2.724559 3.488108    10   a
       R6 2.204499 2.481361 2.734244 2.801655 2.861608 3.315819    10   a

Profvis

# R6 ----
profvis::profvis({
  cohortMethodData <- CohortMethod::getDbCohortMethodData(
    connectionDetails = connectionDetails,
    cdmDatabaseSchema = "main",
    exposureTable = "cohort",
    outcomeTable = "cohort",
    targetId = 1,
    comparatorId = 2,
    outcomeIds = c(3, 4),
    cdmVersion = "5",
    washoutPeriod = 183,
    firstExposureOnly = TRUE,
    removeDuplicateSubjects = "keep all",
    restrictToCommonPeriod = TRUE,
    maxCohortSize = 100000,
    covariateSettings = covarSettings
  )
})

# Original ----
profvis::profvis({
  cohortMethodData <- CohortMethod::getDbCohortMethodData_DEP(
    connectionDetails = connectionDetails,
    cdmDatabaseSchema = "main",
    exposureTable = "cohort",
    outcomeTable = "cohort",
    targetId = 1,
    comparatorId = 2,
    outcomeIds = c(3, 4),
    cdmVersion = "5",
    washoutPeriod = 183,
    firstExposureOnly = TRUE,
    removeDuplicateSubjects = "keep all",
    restrictToCommonPeriod = TRUE,
    maxCohortSize = 100000,
    covariateSettings = covarSettings
  )
})

Memory usage seems to be nearly identical as well.

Implementation  Memory freed (-)  Memory used (+)  Net usage
Original        14.4              16.3             1.9
R6              14.2              16.4             2.2

@schuemie (Member, Author)

Thanks! Could you help me understand your overall design here? You have

  • a getDbCohortMethodData() function
  • a CohortDbInterface object (R6)
  • a CohortMethodDataR6 object (R6)
  • the old CohortMethodData object (S4)

The getDbCohortMethodData() creates a CohortDbInterface object, and then uses that as an argument when creating a CohortMethodDataR6 object. The getDbCohortMethodData() then continues to call the CohortMethodDataR6, which under the hood calls CohortDbInterface, ultimately constructing an old-fashioned CohortMethodData object.

Any reason for having the two R6 objects? Code like this makes me wonder if that is adding anything. (The new solution also seems to have more than doubled the lines of code, making it hard to argue it has simplified things ;-) )

@mvankessel-EMC (Collaborator)

Thanks! Could you help me understand your overall design here? You have

  • a getDbCohortMethodData() function
  • a CohortDbInterface object (R6)
  • a CohortMethodDataR6 object (R6)
  • the old CohortMethodData object (S4)

The getDbCohortMethodData() creates a CohortDbInterface object, and then uses that as an argument when creating a CohortMethodDataR6 object. The getDbCohortMethodData() then continues to call the CohortMethodDataR6, which under the hood calls CohortDbInterface, ultimately constructing an old-fashioned CohortMethodData object.

Yes, so my goal for this implementation was to remain compatible with the current implementation of CohortMethod. This means that the input and output of the implementation should remain the same, so functionality that depends on it doesn't fail.

Any reason for having the two R6 objects? Code like this makes me wonder if that is adding anything. (The new solution also seems to have more than doubled the lines of code, making it hard to argue it has simplified things ;-) )

Initially I created one massive R6 class. While doing that, I realized that half the parameters:

  1. connectionDetails
  2. tempEmulationSchema
  3. cdmDatabaseSchema
  4. exposureDatabaseSchema
  5. exposureTable
  6. cdmVersion
  7. outcomeTable
  8. outcomeDatabaseSchema

are exclusively used for interfacing with the database, which is why I split the class in two. I do agree that this implementation is redundant. I initially only wanted to call methods from CohortMethodDataR6, but that plan fell through when I needed to disconnect from the database, so I will change that.

As to simplifying things: I guess it depends on how you define simplified. I agree that the code is scattered more across the two classes, but the trade-off is that the cyclomatic complexity of each function drastically decreases.

Current implementation

Function                       Cyclomatic Complexity
getDbCohortMethodData          28
downSample                     4
countPreSample                 2
handleCohortCovariateBuilders  9

R6 implementation

CohortMethodData

Function                       Cyclomatic Complexity
initialize                     1
getDbCohortMethodData          2
createCohorts                  1
downSample                     4
getCohorts                     1
buildCovariateData             2
validate                       6
countPreSample                 2
checkCohorts                   4
constructLabel                 6
filterCohorts                  10
handleCohortCovariateBuilders  9

CohortDbInterface

Function         Cyclomatic Complexity
initialize       1
disconnect       1
createCohorts    1
getCohorts       1
countCohorts     1
sampleCohorts    1
countRaw         1
getOutcomes      1
rmTempTables     1
extractCovarDat  1
validate         1
