-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Benchmark execution speed and test coverage #71
Comments
I parsed logs from an old log-file, so this is mostly an example: library(ggplot2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union path <- "D:/Users/mvankessel/Documents/CohortDiagnosticsSprint/log.txt"
readLog <- function(path) {
df <- read.csv(path, sep = "\t", header = FALSE)
names(df) <- c("time", "thread", "level", "package", "task", "message")
df %>%
dplyr::tibble() %>%
dplyr::mutate(
time = as.POSIXct(x = .data$time, format = "%Y-%m-%d %H:%M:%OS"),
time_normalized = difftime(.data$time, max(.data$time), units = "secs"),
time_normalized = .data$time_normalized - min(.data$time_normalized),
duration = -as.numeric(difftime(.data$time, dplyr::lead(.data$time), units = "secs"))
)
}
log <- readLog(path)
g1 <- ggplot(data = log, mapping = aes(x = time_normalized, y = duration)) +
geom_line()
g1 +
geom_text(data = log[log$duration > 30, ], mapping = aes(x = time_normalized, y = duration, label = task))
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_line()`).
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_text()`). Created on 2024-10-08 with reprex v2.1.1 |
How about using the executionTimes.csv output file instead of parsing the log file? Also I think the nesting/parent idea will need to stay (unless you have a proposal for changing it) so you have a hierachy of times in this file. For plotting what about something like this from profivis? It's also important to compare two executionTime.csv files from different databases, executions, and possibly versions of the CohortDiagnostics package. I was thinking of changing the output column from executionTime to executionTimeMinutes to make the time unit clear but then that would break the output format so I'm unsure about that change. |
I like the idea of using the executionTimes.csv file. But like you said, I think its pretty much unusable without knowing what unit the recorded duration time is in. Looking at the screenshot, I think you can use the As for the hierarchy, I don't really mind either way. I think it would be neat to have a plot like profvis. |
The plotly plot didn't render. But it makes it so you can hover over the start of a task, and it will show information, like the task name, and the message, etc. I tried to emulate the plot you showed, so I did boxes. I tried just horizontal lines aswell, but tasks that last 0 seconds won't show up then. So I think at least keeping one vertical, and one horizontal line would be nice. library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union library(ggplot2)
path <- "D:/Users/mvankessel/Documents/CohortDiagnosticsSprint/log.txt"
readLog <- function(path) {
df <- read.csv(path, sep = "\t", header = FALSE)
names(df) <- c("time", "thread", "level", "package", "task", "message")
df %>%
dplyr::tibble() %>%
dplyr::mutate(
time = as.POSIXct(x = .data$time, format = "%Y-%m-%d %H:%M:%OS"),
time_normalized = difftime(.data$time, max(.data$time), units = "secs"),
time_normalized = .data$time_normalized - min(.data$time_normalized),
duration = -as.numeric(difftime(.data$time, dplyr::lead(.data$time), units = "secs"))
)
}
splitLogOnParent <- function(log) {
level0Rows <- log$message %>%
stringr::str_detect(pattern = "^(?!([ ]{1,})?-).+$")
level0Rows <- seq(1, nrow(log))[level0Rows]
l <- sapply(seq_len(length(level0Rows)), function(i) {
if (!is.na(level0Rows[i + 1])) {
block <- seq(level0Rows[i], level0Rows[i + 1] - 1)
d <- dplyr::tibble(log[block, ]) %>%
dplyr::mutate(time_end = max(time_normalized) - time_normalized)
return(d)
}
})
l[!sapply(l, is.null)]
}
log <- readLog(path)
splitLog <- splitLogOnParent(log)
f <- function(splitLog) {
splitLog <- lapply(splitLog, function(log) {
log %>%
mutate(level = dplyr::case_when(
startsWith(.data$message, " -") ~ 1,
startsWith(.data$message, " -") ~ 2,
startsWith(.data$message, " -") ~ 3,
startsWith(.data$message, " -") ~ 4,
startsWith(.data$message, " -") ~ 5,
startsWith(.data$message, " -") ~ 6,
.default = 0
))
}) %>%
dplyr::bind_rows()
ggplot(
data = splitLog,
mapping = aes(
text = sprintf("task: %s\nmessage: %s", task, message)
)
) +
geom_segment(
mapping = aes(
x = time_normalized,
xend = time_normalized + duration,
y = level,
yend = level
)
) +
geom_segment(
mapping = aes(
x = time_normalized,
xend = time_normalized + duration,
y = level + 0.3,
yend = level + 0.3
)
) +
geom_segment(
mapping = aes(
x = time_normalized,
xend = time_normalized,
y = level,
yend = level + 0.3
)
) +
geom_segment(
mapping = aes(
x = time_normalized + duration,
xend = time_normalized + duration,
y = level,
yend = level + 0.3
)
) +
theme_minimal()
}
g <- f(splitLog)
g
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous. plotly::ggplotly(g)
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Error in `with_random_port()`:
#> ! Cannot find an available port. Please try again.
#> Caused by error in `startup()`:
#> ! Failed to start chrome. Error: Created on 2024-10-09 with reprex v2.1.1 |
Another log parse, but this time I took the times directly from the messages. With some regex, I was able to get the time units, and converted them all to seconds. library(ggplot2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union path <- "D:/CohortDiagnosticsBench/results/log.txt"
readLog <- function(path) {
df <- read.csv(path, sep = "\t", header = FALSE)
names(df) <- c("time", "thread", "level", "package", "task", "message")
df %>%
dplyr::tibble() %>%
dplyr::mutate(
time = as.POSIXct(x = .data$time, format = "%Y-%m-%d %H:%M:%OS"),
time_normalized = difftime(.data$time, max(.data$time), units = "secs"),
time_normalized = .data$time_normalized - min(.data$time_normalized)
)
}
log <- readLog(path)
df <- bind_rows(
log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? secs$", log$message), ] |>
mutate(
duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?"))
),
log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? mins$", log$message), ] |>
mutate(
duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?")) * 60
),
log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? hours$", log$message), ] |>
mutate(
duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?")) * 3600
),
log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? days$", log$message), ] |>
mutate(
duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?")) * 3600 * 24
)
)
ggplot(data = df, mapping = aes(x = .data$task, y = duration, fill = package)) +
geom_bar(stat = "identity") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) Created on 2024-10-11 with reprex v2.1.1 |
We would like to show some metrics after the sprint.
The text was updated successfully, but these errors were encountered: