The `SepsisTools` R package

This R package is a set of code chunks/functions that will be shared among analysts for global/pairwise permutation tests(and other needs).

Obectives

We aim to make sure that analysts are all using the same tested code, rather than simply trading code files back and forth. The advantages of packaging are

Uniformity of analysis - less gotchas with small differences between analysts' individual work, and robustness to different coding styles,
Versioning - one of the many benefits of using GitHub,
Easier code review - a central source of code only needs to be verified once before general lab use.

Collaborators

Brendon Phillips (SickKids); brendon.phillips@sickkids.ca,
Eleanor Pullenayegum; ,
Cole Heasley (SickKids); ,
Miranda Loutet (SickKids); ,
Celine Funk (SickKids); ,
Daniel Roth (SickKids); ,
Lisa Pell (SickKids); .

Package Installation

The package is available via GitHub via the commands:

# install the package
library(devtools)
devtools::install_github("brendonphillips/SepsisTools", force = TRUE)

# load the library
library(SepsisTools)

Why a new package?

The test statistic that we have settled on is not covered by other packages; the closest is the 'General Independence Test' offered in the R coin package, which offers a quadratic form of test statistic. Being unclear about the statistic implemented, the team of analysts decided to maintain our own methods.

Functionality

Permutation Tests

Motivation

For SEPSIS data sets, we realised that the Wald test fails when including groups with all-zeros (excluding these groups brings a host of problems, including questionable interpretation of the results), so we've gone with permutation tests, which we do both globally, and for pairwise comparisons between IP groups and placebo.

Theory

...

Implementation

These are the functions used in the package:

standard_table - selects the group, event and id features from the table and applies standardised column names "group_", "event_" and "id_" in preparation for other functions. A data frame is returned.
permute_group - randomly shuffles the groups to which events have been assigned in preparation for the "global_permutation_test" function. A data table giving both the original and permuted groups is output.
perm_test_statistic - calculates the sum of squares difference between the group means and the overall mean. A numerical value is output.
single_permutation - permutes the groups from the incoming data once (via "permute_groups") and calculates the test statistic (via "perm_test_statistic"). A numeric value is output.
global_permutation_test - takes a set of $N_{\text{obs}}$ observations and for a specified number of iterations $N_\text{iter}$ (where $N_{\text{iter}}\le N_{\text{obs}}!$), permutes the groups of the data set, calculates $\max\left(N_{\text{iter}}, N_{\text{obs}}!\right)$ test statistics and returns the p-value. These permutation tests can be run either serially or in parallel. A list is output with the features:
- $p - the p-value,
- error - the Monte-Carlo error term,
- N_trials - the number of permutation tests performed,
- N_obs - the number of observations in the tested data set.
pairwise_permutation_test - takes a data set with various groups, carries out a global permutation test (among all groups, if chosen by the user) and then calculates pairwise permutation tests between all combinations of some fixed groups and all of the other groups present in the data set. If chosen by the user, various family corrections can be carried out on the resulting p-values. All results are returned in a data frame.

Staffer Uniquification/Disambiguation

Motivation

Aim 11 of the study involves finding potential sources of contamination of maternal and sibling stool samples. A part of our method involves analysing the behaviour of a staffer before the collection of an LP202195-positive sample; for this, we need each research staff in SEPSIS to be uniquely identifiable throughout all encounters in the data set. QA showed that in > 20 cases, the same staffer was represented by a range of ID numbers (up to 5, among all formatted encounter data sets available).

Technique

Since ID numbers were paired with names, we used mangling and the Jaro-Winkler string distance metric to create a string-matching algorithm that identified potential matches (that is, one staffer represented by multiple IDs). The output of this algorithm (with other data analysis) was evaluated with help from study collaborators in Bangladesh, allowing for the identification of duplicated IDs. For each set of duplicates (ie., for each staffer) a new "RLxxxxx" ID was created (Roth Lab)

Implementation

The string matching process and algorithm are documented at length here. A data frame was created, where each staffer was matched to all their ID numbers through the study; a new ID was created with encode_staffer_ids algorithm, and a lookup table was assembled ("staffer_dictionary").

`encode_staffer_ids`

A new ID is created from an input string via the following process:

The name of the staffer is mangled (using the algorithm outlined here),
An XXH128 hash of the mangled name is created with R's rlang::hash function,
The index and value of each digit in the has is recorded,
The two vectors are multiplied, and the result vector is randomly shuffled,
Every other element of the vector is selected,
The prefix "RL" is appended,
The string is truncated at seven characters.

This gives an ID with the standard format "RLxxxxx" (where x is a digit 0-9). T

`uniquify_staffer_ids`

This function takes either a list or a data frame and looks up the new RL ID of every staffer in the data set.

Data Sets

There are two data sets provided with the package.

`class_performance`

This data set is meant to mimic the data sets seen in the SEPSIS study, and used to trial the functions relating to permutation testing.

data(class_performance) # load with

The data sets record the quizzes attempted by a number of students, their teachers, the quiz number, and whether they passed/failed. A sample follows:

student_id	class_teacher	quiz_number	passed_quiz	passed_quiz_numeric	passed_quiz_string
RL13020	Grey	5	TRUE	1	Yes
RL13020	Grey	9	TRUE	1	Yes
RL13020	Grey	11	TRUE	1	Yes
RL13020	Grey	12	TRUE	1	Yes
RL20429	Simpson	7	TRUE	1	Yes
RL20429	Simpson	8	TRUE	1	Yes
RL20429	Simpson	5	TRUE	1	Yes

There are 3618 records in the set, giving the TRUE/FALSE test results of 508 students split among 5 subjects taught by teachers with last names "Grey", "Simpson", "Cumberbatch", "Roth" and NA (for demonstration purposes). Examples for the get_p_value, global_permutation_test, pairwise_permutation_tests, perm_test_statistics, permute_groups and permgp_fn (deprecated).

`staffer_dictionary`

This is a lookup table mapping SEPSIS ID numbers to unique project "RL" numbers. It can be loaded with the command:

# load with
data(class_performance)

Explained above.

Package Details

Design

We aim to write and code this package in the tidyverse style, though R files may not always be completely linted before committing. All changes to the main branch must be done through a pull request from a dedicated branch.

Maintenance

For issues, contact me on Teams and/or email me.

Package Changes (to present)

Date	Action	Contributor	Note
25 Sep 2023	Initial inquiry re: accounting for 0 values	Lisa Pell
5 Oct 2023	Contribution of template R code	Eleanor Pullenayegum
10 Oct 2023	Adaptation of template R code	Cole Heasley
11 Oct 2023	Code completion	Brendon Phillips
17 Oct 2023	R package proposed	Brendon Phillips
23 Oct 2023	First draft of package finished	Brendon Phillips
25 Oct 2023	Package documented	Brendon Phillips
26 Oct 2023	problems with `ranseed` arguments fixed	Brendon Phillips	change necessitated by change to parameter names, `get_p_value_function` renamed to `single_permutation`
26 Oct 2023	staffer_dictionary amended after review	Brendon Phillips	one staffer name similarity dismissed as coincidence, dictionary regenerated
27 Oct 2023	p-value calculation corrected in the `global_permutation_test` function	Brendon Phillips	An issue was flagged initially by analyst Cole Heasley (subsequently confirmed by analyst Celine Funk) where the global permutation test applied to some data sets gave a p-value of 1 and an MC error of 0. The p-value calculation was changed from $\text{mean}(\text{teststatnull} \ge \text{teststat})$ to $\text{mean}(\text{teststatnull} > \text{teststat})$ (that is, the equal case was removed), and p-values decreased to believable values. This was the fault of the maintainer. Warnings ware now given when a seed was not set for any function in the package using random values.
ongoing	testing	Cole Heasley, Celine Funk

Dependencies

dplyr - .data, all_of, as_tibble, bind_rows, case_when, group_by, join_by, left_join. mutate, n, pull, rename, rename_with, right_join, rowwise, select, summarise, tibble, ungroup,
foreach - %do%, %dopar%, foreach,
data.table - fread, rbindlist,
doSNOW - registerDoDNOW,
haven - read_dta,
lazyeval - lasy_dots,
parallel - detectCores, makeCluster, stopCluster,
plyr - rbind.fill,
purrr - map, reduce,
readxl - read_excel,
rlang - hash,
stats - p.adjust

Decisions (Analysts)

Packages can be added over time; as of right now, no other code is of general use to the lab, other than the permutation tests and the staffer ID uniquification.
We will maintain the package in a public repository, given that none of the routines expose sensitive information, or could possibly lead to deanonymisation.

Sources

SEPSIS data (Microsoft 365), personal communication amongst the Analysis team.

To-do

access to permute_groups and standard_tble functions during parallel execution
na_fill argument for lists in encode_staffer_ids
histogram of test statistics during permutation tests

Licence

We use the MIT licence.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.github/workflows		.github/workflows
R		R
data		data
examples		examples
images		images
man		man
.Rbuildignore		.Rbuildignore
.gitattributes		.gitattributes
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.md		README.md
Rbuildignore		Rbuildignore
SepsisTools.Rproj		SepsisTools.Rproj
haha.rda		haha.rda
rothlab_permtest.code-workspace		rothlab_permtest.code-workspace
temp.R		temp.R
testbench.R		testbench.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The `SepsisTools` R package

Obectives

Collaborators

Package Installation

Why a new package?

Functionality

Permutation Tests

Motivation

Theory

Implementation

Staffer Uniquification/Disambiguation

Motivation

Technique

Implementation

`encode_staffer_ids`

`uniquify_staffer_ids`

Data Sets

`class_performance`

`staffer_dictionary`

Package Details

Design

Maintenance

Package Changes (to present)

Dependencies

Decisions (Analysts)

Sources

To-do

Licence

About

Releases

Packages

Contributors 2

Languages

License

DGBassaniLab/SepsisTools

Folders and files

Latest commit

History

Repository files navigation

The SepsisTools R package

Obectives

Collaborators

Package Installation

Why a new package?

Functionality

Permutation Tests

Motivation

Theory

Implementation

Staffer Uniquification/Disambiguation

Motivation

Technique

Implementation

encode_staffer_ids

uniquify_staffer_ids

Data Sets

class_performance

staffer_dictionary

Package Details

Design

Maintenance

Package Changes (to present)

Dependencies

Decisions (Analysts)

Sources

To-do

Licence

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

The `SepsisTools` R package

`encode_staffer_ids`

`uniquify_staffer_ids`

`class_performance`

`staffer_dictionary`

Packages