
Benchmark execution speed and test coverage #71

Open · ablack3 opened this issue Oct 6, 2024 · 5 comments
ablack3 (Collaborator) commented Oct 6, 2024

We would like to show some metrics after the sprint (see the sketch after this list):

  • Execution speed before (current release) and after the sprint
  • Test coverage before and after (also consider test quality)
  • Test speed before and after
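A minimal sketch of how the coverage and test-speed metrics could be collected before and after the sprint (assuming covr and devtools are acceptable tooling; the package path is a placeholder):

library(covr)

# Test coverage: run once on the current release and once on the sprint branch
cov <- covr::package_coverage("CohortDiagnostics")
covr::percent_coverage(cov)

# Test speed: wall-clock time of the full test suite
testTiming <- system.time(devtools::test("CohortDiagnostics"))
testTiming["elapsed"]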
ablack3 converted this from a draft issue on Oct 6, 2024
ablack3 changed the title from "Benchmark execution speed" to "Benchmark execution speed and test coverage" on Oct 6, 2024
@mvankessel-EMC

I parsed logs from an old log file, so this is mostly an example:

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
path <- "D:/Users/mvankessel/Documents/CohortDiagnosticsSprint/log.txt"

readLog <- function(path) {
  df <- read.csv(path, sep = "\t", header = FALSE)
  names(df) <- c("time", "thread", "level", "package", "task", "message")

  df %>%
    dplyr::tibble() %>%
    dplyr::mutate(
      time = as.POSIXct(x = .data$time, format = "%Y-%m-%d %H:%M:%OS"),
      time_normalized = difftime(.data$time, max(.data$time), units = "secs"),
      time_normalized = .data$time_normalized - min(.data$time_normalized),
      duration = -as.numeric(difftime(.data$time, dplyr::lead(.data$time), units = "secs"))
    )
}

log <- readLog(path)

g1 <- ggplot(data = log, mapping = aes(x = time_normalized, y = duration)) +
  geom_line()

g1 +
  geom_text(data = log[log$duration > 30, ], mapping = aes(x = time_normalized, y = duration, label = task))
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_line()`).
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_text()`).

Created on 2024-10-08 with reprex v2.1.1

ablack3 (Collaborator, Author) commented Oct 9, 2024

How about using the executionTimes.csv output file instead of parsing the log file?
One thing I did notice is that the unit of executionTime could be seconds, minutes, or hours. I fixed that in the darwin_sprint branch and opened an issue here: OHDSI#1144

[Image: screenshot of the executionTimes.csv output]

Also, I think the nesting/parent idea will need to stay (unless you have a proposal for changing it) so you have a hierarchy of times in this file.

For plotting, what about something like this from profvis?

[Image: example profvis-style timing plot]

It's also important to be able to compare two executionTimes.csv files from different databases, executions, and possibly versions of the CohortDiagnostics package.

I was thinking of renaming the output column from executionTime to executionTimeMinutes to make the time unit clear, but that would break the output format, so I'm unsure about that change.
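A rough sketch of such a comparison, assuming the file has task and executionTime columns (names taken from the screenshot; they may differ in the real output) and that executionTime is already in a single consistent unit; the paths are placeholders:

library(dplyr)

readExecutionTimes <- function(path, label) {
  read.csv(path) %>%
    dplyr::mutate(run = label)
}

before <- readExecutionTimes("run_before/executionTimes.csv", "before")
after <- readExecutionTimes("run_after/executionTimes.csv", "after")

# Put the two runs side by side and compute a per-task ratio
# (assumes task names are unique within a run)
dplyr::bind_rows(before, after) %>%
  dplyr::select(run, task, executionTime) %>%
  tidyr::pivot_wider(names_from = run, values_from = executionTime) %>%
  dplyr::mutate(ratio = before / after)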

@mvankessel-EMC

I like the idea of using the executionTimes.csv file. But like you said, I think it's pretty much unusable without knowing which unit the recorded duration is in. Looking at the screenshot, I think you can sometimes use the startTime column, as the timestamps align with the duration.
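A rough sketch of that idea, assuming startTime and executionTime columns (names taken from the screenshot) and that tasks are logged sequentially; the path is a placeholder:

library(dplyr)

executionTimes <- read.csv("executionTimes.csv") %>%
  dplyr::mutate(startTime = as.POSIXct(startTime)) %>%
  dplyr::arrange(startTime) %>%
  dplyr::mutate(
    # Gap to the next task's start; for sequential tasks this approximates the
    # duration in seconds and can be compared against executionTime to guess
    # whether it was recorded in secs, mins, or hours
    gapSecs = as.numeric(difftime(dplyr::lead(startTime), startTime, units = "secs"))
  )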

As for the hierarchy, I don't really mind either way. I think it would be neat to have a plot like profvis.


mvankessel-EMC commented Oct 9, 2024

The plotly plot didn't render here, but it lets you hover over the start of a task to show information such as the task name and the message.

I tried to emulate the plot you showed, so I drew boxes. I also tried plain horizontal lines, but then tasks that take 0 seconds don't show up, so I think keeping at least one vertical and one horizontal line per task would be nice.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

path <- "D:/Users/mvankessel/Documents/CohortDiagnosticsSprint/log.txt"

readLog <- function(path) {
  df <- read.csv(path, sep = "\t", header = FALSE)
  names(df) <- c("time", "thread", "level", "package", "task", "message")
  
  df %>%
    dplyr::tibble() %>%
    dplyr::mutate(
      time = as.POSIXct(x = .data$time, format = "%Y-%m-%d %H:%M:%OS"),
      time_normalized = difftime(.data$time, max(.data$time), units = "secs"),
      time_normalized = .data$time_normalized - min(.data$time_normalized),
      duration = -as.numeric(difftime(.data$time, dplyr::lead(.data$time), units = "secs"))
    )
}

splitLogOnParent <- function(log) {
  level0Rows <- log$message %>% 
    stringr::str_detect(pattern = "^(?!([ ]{1,})?-).+$")
  level0Rows <- seq(1, nrow(log))[level0Rows]
  
  l <- sapply(seq_len(length(level0Rows)), function(i) {
    if (!is.na(level0Rows[i + 1])) {
      block <- seq(level0Rows[i], level0Rows[i + 1] - 1)
      d <- dplyr::tibble(log[block, ]) %>%
        dplyr::mutate(time_end = max(time_normalized) - time_normalized)
      return(d)
    }
  })
  
  l[!sapply(l, is.null)]
}

log <- readLog(path)
splitLog <- splitLogOnParent(log)

f <- function(splitLog) {
  splitLog <- lapply(splitLog, function(log) {
    log %>%
      mutate(level = dplyr::case_when(
        startsWith(.data$message, " -") ~ 1,
        startsWith(.data$message, "  -") ~ 2,
        startsWith(.data$message, "   -") ~ 3,
        startsWith(.data$message, "    -") ~ 4,
        startsWith(.data$message, "     -") ~ 5,
        startsWith(.data$message, "      -") ~ 6,
        .default = 0
      ))
  }) %>%
    dplyr::bind_rows()
  ggplot(
    data = splitLog,
    mapping = aes(
      text = sprintf("task: %s\nmessage: %s", task, message)
    )
  ) +
  geom_segment(
    mapping = aes(
      x = time_normalized,
      xend = time_normalized + duration,
      y = level,
      yend = level
    )
  ) +
  geom_segment(
    mapping = aes(
      x = time_normalized,
      xend = time_normalized + duration,
      y = level + 0.3,
      yend = level + 0.3
    )
  ) +
  geom_segment(
    mapping = aes(
      x = time_normalized,
      xend = time_normalized,
      y = level,
      yend = level + 0.3
    )
  ) +
  geom_segment(
    mapping = aes(
      x = time_normalized + duration,
      xend = time_normalized + duration,
      y = level,
      yend = level + 0.3
    )
  ) +
  theme_minimal()
}

g <- f(splitLog)
g
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.

plotly::ggplotly(g)
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Error in `with_random_port()`:
#> ! Cannot find an available port. Please try again.
#> Caused by error in `startup()`:
#> ! Failed to start chrome. Error:

Created on 2024-10-09 with reprex v2.1.1


mvankessel-EMC commented Oct 11, 2024

Another log parse, but this time I took the times directly from the messages. With some regex, I extracted the time values and units and converted them all to seconds.

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
path <- "D:/CohortDiagnosticsBench/results/log.txt"

readLog <- function(path) {
  df <- read.csv(path, sep = "\t", header = FALSE)
  names(df) <- c("time", "thread", "level", "package", "task", "message")
  
  df %>%
    dplyr::tibble() %>%
    dplyr::mutate(
      time = as.POSIXct(x = .data$time, format = "%Y-%m-%d %H:%M:%OS"),
      time_normalized = difftime(.data$time, max(.data$time), units = "secs"),
      time_normalized = .data$time_normalized - min(.data$time_normalized)
    )
}

log <- readLog(path)

df <- bind_rows(
  log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? secs$", log$message), ] |>
    mutate(
      duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?"))
    ),
  
  log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? mins$", log$message), ] |>
    mutate(
      duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?")) * 60
    ),
  
  log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? hours$", log$message), ] |>
    mutate(
      duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?")) * 3600
    ),
  
  log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? days$", log$message), ] |>
    mutate(
      duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?")) * 3600 * 24
    )
)

ggplot(data = df, mapping = aes(x = .data$task, y = duration, fill = package)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

Created on 2024-10-11 with reprex v2.1.1
