Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consider to replace "A" (active infections) and "B" (secondary infections) w. one large fixed-size matrix, and track indices? #5

Open
ashlinrichardson opened this issue Aug 12, 2020 · 2 comments

Comments

@ashlinrichardson
Copy link

Hi Henry, for discussion..

..Hypothetically, how about:
The data you're currently storing in data frames "A" and "B", all goes in a big fixed-size matrix M (m x n), which is fair since you have a population ceiling anyways (either human pop or computer memory)

  • matrix not dataframe, as you said
  • likely can set to 32bit float not 64bit, to increase speed! This can speedup 4x ish or more..
  • modify in-place as needed but never copy, move or delete any rows:

Henry just like you said:
do have to keep track of where the active and secondary infections (or free matrix rows) are in M..

What if: Use vector types to store the three categories of indexes to rows of M:
A: active
B: secondary
C: free space

(to merge A and B, in C++/Java you'd use a linked list and attach the tail pointer of A to the head of B, and vice versa if it's doubly linked so the merge operation is O(1)..
.. hopefully R also allows you to "stick" lists together (fingers crossed) but if not, if the high-level language insists doing something under the hood like appending B to A element wise you still get ||B|| * O(1)..
.... which is much better than ||B|| * n * O(1) in the case of moving all the data around (so just moving indices around is an effective pyramid scheme)

The motivation for the above:

  • no explicit counters
  • no filling / copying whole vectors (likely time savings of (n-1)/n)
  • explicitly tracking locations to remove any possible ambiguity re: what the high-level language is doing with the data

although there could be many other (better) approaches..

..Thoughts? Thanks for posting a super cool and fun code! Enjoying learning from the great style and learning lots about R from it too
cheers
A

@ashlinrichardson
Copy link
Author

ashlinrichardson commented Aug 12, 2020

P.S. I do like the idea of not using data frame and wonder if it would look like this (it may be crazy, I always enjoy removing "features" of programming languages if I can, syntactic "sugar" is overrated, everybody knows it seems that simple sugars promote uncontrolled cell division haha)

create_state_df <- function(n_cases, sim_params, sim_status,
                            initialize=FALSE, import=FALSE,
                            primary_state_df=NULL, primary_case_ids=NULL){
  # List of columns in state df
  col_names <- c("case_id", "status", "is_traced", "is_trace_app_user", "is_trace_app_comply",
                 "is_traced_by_app", "is_symptomatic", "days_infected", "incubation_length",
                 "isolation_delay",  "infection_length", "contact_rate", "n_sec_infects",
                 "generation_intervals") #
  n_cols <- length(col_names)

  ix <- list()
  for(i in 1:n_cols){
    ix[[col_names[i]]] <- i
  }

  # Create data frame
  state_df <- matrix(0., nrow=n_cases, ncol=n_cols)

  # If no cases being added, then return empty data frame
  # This should only happen when initializing
  if(n_cases == 0){
    return(state_df)
  }

  # Fill in start values
  if(initialize){
    state_df[,ix[['case_id']]] <- 1:n_cases # special case for starting out
  }
  else{
    state_df[,ix[['case_id']]] <- sim_status$last_case_id + 1:n_cases
  }
.....

@henry-ngo
Copy link
Collaborator

Thanks for the insights! Dataframes were helpful in developing the code to get it completed quickly when things were changing a lot (and didn't want to fiddle with things tricky indexing issues). Now that the simulation base is more stable, it's a good time to think about the matrix approach and how R handles things to get the best speed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants