
Benchmark execution speed and test coverage #71

Open · ablack3 opened this issue Oct 6, 2024 · 5 comments
ablack3 (Collaborator) commented Oct 6, 2024

We would like to show some metrics after the sprint (see the sketch after this list):

  • Execution speed before (current release) and after the sprint
  • Test coverage before and after (also consider test quality)
  • Test speed before and after
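A minimal sketch of how the coverage and test-speed metrics could be collected before and after the sprint (assuming covr and devtools are acceptable tooling; the package path is a placeholder):

library(covr)

# Test coverage: run once on the current release and once on the sprint branch
cov <- covr::package_coverage("CohortDiagnostics")
covr::percent_coverage(cov)

# Test speed: wall-clock time of the full test suite
testTiming <- system.time(devtools::test("CohortDiagnostics"))
testTiming["elapsed"]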
ablack3 converted this from a draft issue on Oct 6, 2024
ablack3 changed the title from "Benchmark execution speed" to "Benchmark execution speed and test coverage" on Oct 6, 2024
@mvankessel-EMC

I parsed logs from an old log file, so this is mostly an example:

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
path <- "D:/Users/mvankessel/Documents/CohortDiagnosticsSprint/log.txt"

readLog <- function(path) {
  df <- read.csv(path, sep = "\t", header = FALSE)
  names(df) <- c("time", "thread", "level", "package", "task", "message")

  df %>%
    dplyr::tibble() %>%
    dplyr::mutate(
      time = as.POSIXct(x = .data$time, format = "%Y-%m-%d %H:%M:%OS"),
      time_normalized = difftime(.data$time, max(.data$time), units = "secs"),
      time_normalized = .data$time_normalized - min(.data$time_normalized),
      duration = -as.numeric(difftime(.data$time, dplyr::lead(.data$time), units = "secs"))
    )
}

log <- readLog(path)

g1 <- ggplot(data = log, mapping = aes(x = time_normalized, y = duration)) +
  geom_line()

g1 +
  geom_text(data = log[log$duration > 30, ], mapping = aes(x = time_normalized, y = duration, label = task))
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_line()`).
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_text()`).

Created on 2024-10-08 with reprex v2.1.1

ablack3 (Collaborator, Author) commented Oct 9, 2024

How about using the executionTimes.csv output file instead of parsing the log file?
One thing I did notice is that the unit of executionTime could be seconds, minutes, or hours. I fixed that in the darwin_sprint branch and opened an issue here: OHDSI#1144

[Image: screenshot of the executionTimes.csv output]

Also, I think the nesting/parent idea will need to stay (unless you have a proposal for changing it) so you have a hierarchy of times in this file.

For plotting, what about something like this from profvis?

[Image: example profvis-style timing plot]

It's also important to be able to compare two executionTimes.csv files from different databases, executions, and possibly versions of the CohortDiagnostics package.

I was thinking of renaming the output column from executionTime to executionTimeMinutes to make the time unit clear, but that would break the output format, so I'm unsure about that change.
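A rough sketch of such a comparison, assuming the file has task and executionTime columns (names taken from the screenshot; they may differ in the real output) and that executionTime is already in a single consistent unit; the paths are placeholders:

library(dplyr)

readExecutionTimes <- function(path, label) {
  read.csv(path) %>%
    dplyr::mutate(run = label)
}

before <- readExecutionTimes("run_before/executionTimes.csv", "before")
after <- readExecutionTimes("run_after/executionTimes.csv", "after")

# Put the two runs side by side and compute a per-task ratio
# (assumes task names are unique within a run)
dplyr::bind_rows(before, after) %>%
  dplyr::select(run, task, executionTime) %>%
  tidyr::pivot_wider(names_from = run, values_from = executionTime) %>%
  dplyr::mutate(ratio = before / after)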

@mvankessel-EMC

I like the idea of using the executionTimes.csv file. But like you said, I think it's pretty much unusable without knowing which unit the recorded duration is in. Looking at the screenshot, I think you can sometimes use the startTime column, as the timestamps align with the duration.
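A rough sketch of that idea, assuming startTime and executionTime columns (names taken from the screenshot) and that tasks are logged sequentially; the path is a placeholder:

library(dplyr)

executionTimes <- read.csv("executionTimes.csv") %>%
  dplyr::mutate(startTime = as.POSIXct(startTime)) %>%
  dplyr::arrange(startTime) %>%
  dplyr::mutate(
    # Gap to the next task's start; for sequential tasks this approximates the
    # duration in seconds and can be compared against executionTime to guess
    # whether it was recorded in secs, mins, or hours
    gapSecs = as.numeric(difftime(dplyr::lead(startTime), startTime, units = "secs"))
  )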

As for the hierarchy, I don't really mind either way. I think it would be neat to have a plot like profvis.


mvankessel-EMC commented Oct 9, 2024

The plotly plot didn't render here, but it lets you hover over the start of a task to show information such as the task name and the message.

I tried to emulate the plot you showed, so I drew boxes. I also tried plain horizontal lines, but then tasks that take 0 seconds don't show up, so I think keeping at least one vertical and one horizontal line per task would be nice.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

path <- "D:/Users/mvankessel/Documents/CohortDiagnosticsSprint/log.txt"

readLog <- function(path) {
  df <- read.csv(path, sep = "\t", header = FALSE)
  names(df) <- c("time", "thread", "level", "package", "task", "message")
  
  df %>%
    dplyr::tibble() %>%
    dplyr::mutate(
      time = as.POSIXct(x = .data$time, format = "%Y-%m-%d %H:%M:%OS"),
      time_normalized = difftime(.data$time, max(.data$time), units = "secs"),
      time_normalized = .data$time_normalized - min(.data$time_normalized),
      duration = -as.numeric(difftime(.data$time, dplyr::lead(.data$time), units = "secs"))
    )
}

splitLogOnParent <- function(log) {
  level0Rows <- log$message %>% 
    stringr::str_detect(pattern = "^(?!([ ]{1,})?-).+$")
  level0Rows <- seq(1, nrow(log))[level0Rows]
  
  l <- sapply(seq_len(length(level0Rows)), function(i) {
    if (!is.na(level0Rows[i + 1])) {
      block <- seq(level0Rows[i], level0Rows[i + 1] - 1)
      d <- dplyr::tibble(log[block, ]) %>%
        dplyr::mutate(time_end = max(time_normalized) - time_normalized)
      return(d)
    }
  })
  
  l[!sapply(l, is.null)]
}

log <- readLog(path)
splitLog <- splitLogOnParent(log)

f <- function(splitLog) {
  splitLog <- lapply(splitLog, function(log) {
    log %>%
      mutate(level = dplyr::case_when(
        startsWith(.data$message, " -") ~ 1,
        startsWith(.data$message, "  -") ~ 2,
        startsWith(.data$message, "   -") ~ 3,
        startsWith(.data$message, "    -") ~ 4,
        startsWith(.data$message, "     -") ~ 5,
        startsWith(.data$message, "      -") ~ 6,
        .default = 0
      ))
  }) %>%
    dplyr::bind_rows()
  ggplot(
    data = splitLog,
    mapping = aes(
      text = sprintf("task: %s\nmessage: %s", task, message)
    )
  ) +
  geom_segment(
    mapping = aes(
      x = time_normalized,
      xend = time_normalized + duration,
      y = level,
      yend = level
    )
  ) +
  geom_segment(
    mapping = aes(
      x = time_normalized,
      xend = time_normalized + duration,
      y = level + 0.3,
      yend = level + 0.3
    )
  ) +
  geom_segment(
    mapping = aes(
      x = time_normalized,
      xend = time_normalized,
      y = level,
      yend = level + 0.3
    )
  ) +
  geom_segment(
    mapping = aes(
      x = time_normalized + duration,
      xend = time_normalized + duration,
      y = level,
      yend = level + 0.3
    )
  ) +
  theme_minimal()
}

g <- f(splitLog)
g
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.

plotly::ggplotly(g)
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Error in `with_random_port()`:
#> ! Cannot find an available port. Please try again.
#> Caused by error in `startup()`:
#> ! Failed to start chrome. Error:

Created on 2024-10-09 with reprex v2.1.1


mvankessel-EMC commented Oct 11, 2024

Another log parse, but this time I took the times directly from the messages. With some regex, I extracted the time values and units and converted them all to seconds.

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
path <- "D:/CohortDiagnosticsBench/results/log.txt"

readLog <- function(path) {
  df <- read.csv(path, sep = "\t", header = FALSE)
  names(df) <- c("time", "thread", "level", "package", "task", "message")
  
  df %>%
    dplyr::tibble() %>%
    dplyr::mutate(
      time = as.POSIXct(x = .data$time, format = "%Y-%m-%d %H:%M:%OS"),
      time_normalized = difftime(.data$time, max(.data$time), units = "secs"),
      time_normalized = .data$time_normalized - min(.data$time_normalized)
    )
}

log <- readLog(path)

df <- bind_rows(
  log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? secs$", log$message), ] |>
    mutate(
      duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?"))
    ),
  
  log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? mins$", log$message), ] |>
    mutate(
      duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?")) * 60
    ),
  
  log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? hours$", log$message), ] |>
    mutate(
      duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?")) * 3600
    ),
  
  log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? days$", log$message), ] |>
    mutate(
      duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?")) * 3600 * 24
    )
)

ggplot(data = df, mapping = aes(x = .data$task, y = duration, fill = package)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Created on 2024-10-11 with reprex v2.1.1
