R6 v S3/S4 in HADES #82

Closed
mdlavallee92 opened this issue Sep 6, 2023 · 4 comments

@mdlavallee92
Collaborator

From #77 @chrisknoll has posed an interesting way of viewing Capr potentially through R6 via a pure OOP system. The purpose of this post is to

a) understand the benefits of switching to R6 for Capr and
b) consider the impact of R6 within HADES...when/where to use it, why is it beneficial and ultimately does it even matter

I am hoping to get some feedback or thoughts from others @chrisknoll, @azimov, @ablack3, @anthonysena, @schuemie. I know a post was made a while back referencing this same topic.

Thoughts on OOP in context of Capr

Capr is currently written in S4. This was done for two (at this point, flimsy) reasons:

  1. R6 was not available at the time. R still used Reference Classes as its pure OOP system. I explored this in early Capr development but opted to go the S4 route.
  2. S4 maintains the "feel" of R, whereas R6 is more amenable to programmers coming from the Java and Python worlds. S4 is a stricter version of S3, which does a better job of working in a functional programming pipeline. When Capr was originally created it was intended to lean heavily on the pipe operator %>%; however, this proved to be rather awkward.

Resources that give context on S4 and R6 can be found in chapters 14, 15, and 16 of Advanced R, while the strengths of S4 can be found here.

Starting at Capr v2, there was an intentional effort to move the feel of the code away from piping and towards nested functions. Constructing a cohort would hence feel like building the UI of a shinyDashboard. A dashboard requires a header, sidebar and body; within each section the user shapes the look by adding text, output, boxes, etc. A cohort definition is similar: the user constructs the sections of the definition, namely the entry, attrition, exit and era. An example of what Capr code should look like now is shown below:

library(Capr)


cd <- cohort(
  
  #entry event (i.e. primary criteria)
  entry = entry(
    drugExposure(metformin, male()), # index query of metformin users who are male
    observationWindow = continuousObservation(priorDays = 365L), #365 min prior obs
    primaryCriteriaLimit = "All" # use all index events
  ),
  
  # attrition to index event (i.e. inclusion rules)
  attrition = attrition(
    # no t1d any time prior
    'no t1d' = withAll( # start group
      exactly( # start criteria (i.e. count)
        x = 0,
        query = conditionOccurrence(t1d),
        aperture = duringInterval(
          startWindow = eventStarts(a = -Inf, b = 0, index = "startDate")
        )
      )
    ),
    expressionLimit = "All" # include all events for attrition
  ),
  
  # exit when the person leaves the cohort
  exit = exit(
    endStrategy = drugExit(
      conceptSet = metformin,
      surveillanceWindow = 30L
    ) # create metformin era to determine exit
  ),
  
  # era logic on how to collapse multiple events
  era = era(eraDays = 30L) # 30 days of metformin use builds an era (bit redundant)
  
)

If Capr were to switch to R6 the syntax would look more like this:

cd <- cohort$new(
  
  #entry event (i.e. primary criteria)
  entry = entry$new(
    #list of queries or single query
    list(
      drugExposure$new(
        conceptSet = metformin,
        attributes = 
          list(
            male$new()
          )
      )
    ),
    observationWindow = continuousObservation$new(prior = 365L),
    primaryCriteriaLimit = limit$new(type = "All"),
    additionalCriteria = NULL, # placeholder
    qualifyingLimit = limit$new(type = "All")
  ),
  
  # attrition to index event (i.e. inclusion rules)
  attrition = attrition$new(
    list(
      group$new(
        name = 'no t1d occurrence',
        type = "all",
        int = NULL, #placeholder
        criteria = criteria$new(
          type = "exactly",
          int = 0, # zero occurrences, matching x = 0 in the S4 version above
          query = conditionOccurrence$new(
            conceptSet = t1d
          ),
          aperture = aperture$new(
            startWindow = eventStarts$new(a = -Inf, b = 0, index = "startDate"),
            ignoreObservationPeriod = FALSE # placeholder
          )
        )
      )
    )
  ),
  
  # exit when the person leaves the cohort
  exit = exit$new(
    endStrategy = drugExit$new(
      conceptSet = metformin,
      surveillanceWindow = 30L
    ),
    censoringCriteria = NULL #placeholder
  ),
  
  # era logic on how to collapse multiple events
  era = era$new(
    eraDays = 30L
  )
  
)

Each class has a `new` method (its constructor) where we describe its details. Classes can have further methods such as JSON coercion, a SQL builder, a print method, and plot functions. This would be quite nice. I am conscious of not overlapping too much with CirceR.

Thoughts on R6

My main hesitation with R6 is that it removes the "feel" of R. R works best in S3 when you take advantage of its "pipe-ability" and functional programming attributes. Forcing R code into a pure OOP system may turn away tidyverse programmers trying to enter the OHDSI software space. I think there is a legitimate fear here given the design of the DARWIN software, which is quite tidyverse heavy. Not that it's any sort of competition.

Having said this, I am beginning to realize the benefits of using R6, particularly if we begin to think about complex objects (circe definition) and pipelines (strategus modules). Having strictly encapsulated objects makes it easier to force a complex routine across a network.

This post has gone on way too long (and I have accidentally deleted it twice), but maybe it starts a conversation about the HADES codebase as it becomes more and more complex :)

@azimov

azimov commented Sep 6, 2023

A point on the syntax:

I don't think the use of R6, in API form, has to use the $new syntactic style for initializing an instance of an object. In CohortGenerator for the recent implementation of cohort subsets the piping approach is used by adding constructors as first-class R functions. This isn't especially distinct from how an S3/S4 would work as an API. The decision to use R6 was entirely based around encapsulation - I don't expect any users of our packages to know much about OOP.

So, for your above example you could have an API that is identical to the current version 2 of Capr - your functions conditionOccurrence could just be a wrapper to create an instance of class ConditionOccurrence e.g:

conditionOccurrence <- function(...) {
    ConditionOccurrence$new(...)
}

There would then be no difference with the syntactic sugar of R in this case. For the cases of attrition, exactly, withAll etc there would be some need to handle collections of objects but I don't think the API would need to change there either.
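To make the wrapper idea concrete, here is a minimal sketch, assuming the R6 package is installed. The class `ConditionOccurrence` and its fields `conceptSet` and `attributes` are hypothetical stand-ins for illustration, not Capr's actual internals.

```r
library(R6)

# Hypothetical R6 class (illustration only, not part of Capr)
ConditionOccurrence <- R6Class(
  "ConditionOccurrence",
  public = list(
    conceptSet = NULL,
    attributes = NULL,
    initialize = function(conceptSet, ...) {
      self$conceptSet <- conceptSet
      self$attributes <- list(...)  # any extra arguments are kept as attributes
    }
  )
)

# Plain R function as the user-facing constructor, so callers never type $new
conditionOccurrence <- function(conceptSet, ...) {
  ConditionOccurrence$new(conceptSet = conceptSet, ...)
}

q <- conditionOccurrence("t1d")
class(q)  # "ConditionOccurrence" "R6"
```

The user-facing call looks the same as a Capr v2 style function call; only the object it returns is an R6 instance.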

People who prefer a Java-like declarative style could also just use the $new syntax (though I think having multiple functionally identical supported APIs for working with the package would be annoying to support).

And honestly, I agree that the current API is good and shouldn't be changed. So to me this is just a point about the internal design of the package. Would R6 make it significantly better? The main reason I use R6 over S4 is because (as you say) I greatly prefer the model for encapsulation (which is the entire point of OOP design to me). There are also some minor benefits with inheritance, but most OOP developers agree that more than a single level of inheritance is often too confusing and prefer just encapsulating an internal object.

Neither approach really enforces a "Safe Interface" approach like you would get in other languages (or even abstract base classes, which Python has good support for). So in my opinion the differences between the two approaches are largely just programmer preference: in R6 you're encapsulating an object but it's hardly type safe at runtime (honestly, I still think of an R6 object as a jazzed-up hash table pretending to be something more, while S4 is just a hash table with useful attributes).
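A small sketch of the "hardly type safe at runtime" point, assuming the R6 package is installed. The classes `Era` and `CheckedEra` are hypothetical. A public field accepts any value after construction; one common workaround (an assumption here, not something this thread prescribes) is an active binding that validates on assignment.

```r
library(R6)

# Hypothetical class: validation happens only in the constructor
Era <- R6Class(
  "Era",
  public = list(
    eraDays = NULL,
    initialize = function(eraDays) {
      stopifnot(is.numeric(eraDays), length(eraDays) == 1)
      self$eraDays <- eraDays
    }
  )
)

e <- Era$new(eraDays = 30L)
e$eraDays <- "thirty"  # accepted silently: public fields are not type checked

# Hypothetical variant: an active binding validates every assignment
CheckedEra <- R6Class(
  "CheckedEra",
  private = list(.eraDays = NULL),
  active = list(
    eraDays = function(value) {
      if (missing(value)) return(private$.eraDays)
      stopifnot(is.numeric(value), length(value) == 1)
      private$.eraDays <- value
    }
  ),
  public = list(
    initialize = function(eraDays) {
      self$eraDays <- eraDays  # goes through the active binding
    }
  )
)

ce <- CheckedEra$new(eraDays = 30L)
res <- tryCatch({ce$eraDays <- "thirty"; "accepted"}, error = function(err) "rejected")
res  # "rejected"
```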

In conclusion, @chrisknoll and I come from a more traditional OOP background so R6 seems more natural to us. However, we're not maintaining this package...

@schuemie
Member

schuemie commented Sep 7, 2023

This sounds like a good topic for the next HADES meeting. My current two cents with all of HADES in mind: R code is hard to maintain (no strong typing, no finding errors at compile time, and let's not get into the ever-shifting dependencies). R6 might help in some areas, but it also has its own limitations (e.g. can't use breakpoints in R6 methods). If I were to create a new package from scratch I might use R6, but I certainly wouldn't go back and change old code to use R6 just for the sake of slightly better OOP.

@chrisknoll

Hi, @mdlavallee92 ,

I didn't mean for #77 to present a critique of the Capr API. At its core, the circe object model is a set of Java classes (which is OO), so I was just presenting the OO view of creating instances and assigning values in an OO way (which in R syntax would be via R6).

CirceR (the R package) does not have any object model defined in it; it works only off the raw JSON string (or the list-of-lists that comes from jsonlite::parse_json). That makes it hard to programmatically build circe expressions in R (which is why we have Capr!). But Capr is solving a secondary problem of 'reusable fragments', such that when you merge a fragment (that can contain criteria and concept sets), Capr 'figures it out' and merges those reference IDs into the larger expression. So, in my mind it makes sense that Capr has Capr-specific objects to support the functionality it uses.

Back to the question of R6 use: the best example where I have made use of it is in the CohortIncidence package. It started off by a series of functions in the 2.x version, but 3.x introduced R6 classes (in case you are wondering what it was like to migrate).

What I appreciate about R6 is not just the encapsulation and 'by reference' semantics of the object fields, but documentation is pretty nice too. Here's the documentation output of an IncidenceDesign, and here is the source code annotation that produces it. Note how the documentation annotations are very close to the implementation of the fields.
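For readers less familiar with the 'by reference' point, a minimal sketch contrasting S4's copy-on-modify behaviour with R6's reference semantics, assuming the R6 package is installed. The classes `EraS4` and `EraR6` are hypothetical and exist only for this illustration.

```r
library(methods)
library(R6)

# S4: assigning to a new name and modifying it leaves the original untouched
setClass("EraS4", representation(eraDays = "integer"))
a <- new("EraS4", eraDays = 30L)
b <- a
b@eraDays <- 60L
a@eraDays  # 30: b is an independent copy

# R6: both names point at the same object
EraR6 <- R6Class(
  "EraR6",
  public = list(
    eraDays = NULL,
    initialize = function(eraDays) {
      self$eraDays <- eraDays
    }
  )
)
x <- EraR6$new(30L)
y <- x
y$eraDays <- 60L
x$eraDays  # 60: the change made through y is visible through x

# An explicit clone() is needed to get an independent copy
z <- x$clone()
z$eraDays <- 90L
x$eraDays  # still 60
```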

The other reason I leveraged it is that jsonlite has some bad behavior when it comes to rendering single-element vectors: you can tell it to universally treat single-element vectors as non-arrays, but it is perfectly valid (in JSON) to have a single-element value that is actually an array:

> x0 <- jsonlite::fromJSON("{\"x\":1, \"y\":[1]}");
> jsonlite::toJSON(x0);
{"x":[1],"y":[1]} 
> jsonlite::toJSON(x0,auto_unbox = T);
{"x":1,"y":1} 

The problem is that the raw JSON clearly shows x is not an array and y is, but jsonlite will serialize both as arrays or both as non-arrays depending on whether you set auto_unbox. I had a 'wonderful' discussion with the maintainer of jsonlite in this issue. Needless to say, it didn't go the way I wanted.

So, enter R6 classes, and implementing a toList() function on the class. This method handles marking the values that should be unboxed (i.e. they won't get turned into an array when you serialize to JSON) when you call asJson() on the class.
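A minimal sketch of that pattern, assuming the R6 and jsonlite packages are installed. The class `Payload` and its fields are hypothetical and this is not the CohortIncidence implementation; it only shows how `jsonlite::unbox()` lets a class decide per field which values serialize as scalars and which stay arrays.

```r
library(R6)
library(jsonlite)

# Hypothetical data-holder class (illustration only)
Payload <- R6Class(
  "Payload",
  public = list(
    x = NULL,
    y = NULL,
    initialize = function(x, y) {
      self$x <- x
      self$y <- y
    },
    toList = function() {
      list(
        x = jsonlite::unbox(self$x),  # always a scalar in the JSON output
        y = self$y                    # always an array, even with one element
      )
    },
    asJson = function() {
      jsonlite::toJSON(self$toList())
    }
  )
)

p <- Payload$new(x = 1, y = 1)
p$asJson()  # {"x":1,"y":[1]}
```

Because the class marks each field explicitly, there is no need for the global auto_unbox switch.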

Not having breakpoints in the R6 implementations hasn't been too much of a problem for me (I'm using these R6 classes mostly as data holders, not function holders, so maybe I'm just dodging those issues). And I do have helper 'factory functions' (example: createincidenceDesign()) that can give you auto-complete prompts on the elements of the object, but this strategy does make for a bit of 'double work': if your object introduces a new field, you have to go back to your factory function and provide an argument for it.
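A sketch of the factory-function pattern and the 'double work' it implies, assuming the R6 package is installed. `DesignSketch`, `createDesignSketch()`, and the fields `name` and `cleanWindow` are hypothetical and are not the CohortIncidence API.

```r
library(R6)

# Hypothetical data-holder class (illustration only)
DesignSketch <- R6Class(
  "DesignSketch",
  public = list(
    name = NULL,
    cleanWindow = NULL
    # adding a field here means also adding an argument to the factory below
  )
)

# Factory function: named arguments give editors something to auto-complete
createDesignSketch <- function(name, cleanWindow = 0L) {
  stopifnot(is.character(name), length(name) == 1)
  stopifnot(is.numeric(cleanWindow), length(cleanWindow) == 1)
  design <- DesignSketch$new()
  design$name <- name
  design$cleanWindow <- cleanWindow
  design
}

d <- createDesignSketch(name = "metformin incidence", cleanWindow = 30L)
d$cleanWindow  # 30
```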

Anyways, happy to discuss my experiences with it on a call. I agree with the sentiment of Martijn that unless you see a big gain of using it, I wouldn't go back and switch. My primary need was controlling JSON serialization (and I liked the way it handled data validation via checkmate), so I went with it.

@mdlavallee92
Collaborator Author

Thanks guys!

I did not take it as a critique @chrisknoll. I thought it was an interesting perspective to explore. With R6 popping up more and more in HADES packages I wanted to weigh up its potential within Capr. I strongly considered R6 in the version 2 update but opted not to, due to development time and balance with other projects. It makes a lot of sense to match the underlying nature of circe since Capr is an R interface to it as you stated @chrisknoll.

I agree @schuemie, maybe we can talk about this on the next HADES call or after the symposium as I know things get a bit hectic in the lead up.
