-
-
Notifications
You must be signed in to change notification settings - Fork 878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse the text to detect doc comments and inlines properly in spin() #1605
Conversation
Benchmark here. In summary,
So, I think the overhead of using # define functions -------------------------
grepl_doc_comment = function(doc, x) {
d = getParseData(parse(text = x))
doc_lines = d[d$col1 == 1 & d$token == "COMMENT" & grepl(doc, d$text), ]$line1
seq_along(x) %in% doc_lines
}
try_parse = function(code, silent = TRUE) {
!inherits(
try(parse(text = code, keep.source = FALSE), silent = silent), 'try-error'
)
}
grepl_doc_comment2 = function(doc, x) {
pos = 1
is_complete = logical(length(x))
for (i in seq_along(x)) {
if (!try_parse(x[seq(pos, i)])) {
next()
}
if (i == pos) {
is_complete[i] = TRUE
}
pos = i + 1
}
is_complete & grepl(doc, x)
}
txt1 = c("#' test", "code <- \"", "#' test\"")
# confirm results --------------------------
grepl("^#+'[ ]?", txt1) # wrong
#> [1] TRUE FALSE TRUE
grepl_doc_comment("^#+'[ ]?", txt1) # ok
#> [1] TRUE FALSE FALSE
grepl_doc_comment2("^#+'[ ]?", txt1) # ok
#> [1] TRUE FALSE FALSE
# benchmark --------------------------------
bench::mark(
grepl("^#+'[ ]?", txt1),
grepl_doc_comment("^#+'[ ]?", txt1),
grepl_doc_comment2("^#+'[ ]?", txt1),
check = FALSE
)
#> # A tibble: 3 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc
#> <chr> <bch:t> <bch:tm> <bch:t> <bch:> <dbl> <bch:byt> <dbl>
#> 1 "grepl(\"~ 5.1us 11.92us 7.5us 1.25ms 83906. 0B 0
#> 2 "grepl_do~ 710.6us 1.05ms 835.5us 4.77ms 952. 72.6KB 4
#> 3 "grepl_do~ 424.1us 656.52us 654us 2.24ms 1523. 0B 2
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>
txt2 = rep(txt1, times = 100)
bench::mark(
grepl("^#+'[ ]?", txt2),
grepl_doc_comment("^#+'[ ]?", txt2),
grepl_doc_comment2("^#+'[ ]?", txt2),
check = FALSE
)
#> # A tibble: 3 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc
#> <chr> <bch:t> <bch:t> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 "grepl(\"~ 47.2us 71.22us 51.4us 1.61ms 14042. 1.22KB 1
#> 2 "grepl_do~ 1.28ms 2.02ms 1.69ms 4.59ms 496. 289.19KB 3
#> 3 "grepl_do~ 43.86ms 53.94ms 53.05ms 71.37ms 18.5 3.66KB 2
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>
txt3 = rep(c("code <- \"", "", "", "", "", "", "", "", "", "#' test\""), times = 30)
bench::mark(
grepl("^#+'[ ]?", txt3),
grepl_doc_comment("^#+'[ ]?", txt3),
grepl_doc_comment2("^#+'[ ]?", txt3),
check = FALSE
)
#> # A tibble: 3 x 10
#> expression min mean median max `itr/sec` mem_alloc n_gc
#> <chr> <bch:t> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 "grepl(\"~ 36.8us 53.86us 39.7us 2.73ms 18567. 1.22KB 0
#> 2 "grepl_do~ 844us 1.25ms 1.01ms 3.63ms 798. 75.94KB 4
#> 3 "grepl_do~ 115.4ms 131.84ms 133.41ms 146.72ms 7.58 3.66KB 2
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm> |
Hmm, /*
print("This is comment
*/
") |
Given that ") |
Ready for review now. |
Sounds great! Thanks! I'll review it either this Friday or a week later. |
Thanks, no hurry at all :) |
Sorry, this sentence is not true, although the fix may stay the same...
A "complete" line can have spaces at its beginnings, which means there's no It seems I should either
Let me think a bit. |
Sorry for deciding late. I chose this because 2. seems to make the implementation a bit complecated.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! Thank you very much!
Thanks for merging! |
I feel you understand the knitr source code well enough and is also careful enough, so I just granted the push access to this repo to you, in case I'm too busy and you want to make a critical change (or merge an important PR). I have wished to give other people push access for long, but never did it, and you are the first official collaborator of this repo :) |
Wow, thanks! I don't think I fully deserve such a great honor, but I'd love to help you when you are too busy. Honestly, I only understand the source codes partially, so I'll need to learn a lot... For the time being, I'll do my best not to make a mess before doing actual contribution. 👍 |
Partially understanding the source code is enough. I don't fully understand the source code of all packages I maintain 😉 |
This had the unfortunate side-effect that I can no longer use spin() for languages other than R ! I've borrowed a bunch of your older code, and added a 'language' parameter to make this backward compatible for non-R languages. Thanks! |
@Hemken This sounds like a bug to me. Do you mind filing a new issue? Thanks! |
Problem
(originally reported at rstudio/rmarkdown#1444)
Currently,
spin()
detects doc comments by checking if the"^#+'[ ]?"
matches on a line-by-line basis.knitr/R/spin.R
Line 71 in 01407a7
Accordingly, it mistakenly detects string literals that contain
#'
as doc comments in cases like this:The same thing occurs on
inline
in the following case:In addition to string literals, symbols can also be multi-line:
Fix
To identify the true comments and double-brackets correctly, we need to parse the whole code first and ignore string literals and symbols that are over multiple lines. To me, @yihui's suggestion of
utils::getParsedData()
at rstudio/rmarkdown#1444 (comment) seems simpler and faster (benchmark: #1605 (comment)), so I implemented as such.The idea is simple; since multi-line
STR_CONST
orSYMBOL
are tokenized as below, if the beginning of a line is contained by them, there are nocol1 = 1
token in the line. So, we can filter the parsed data bycol1 == 1
to see what lines are "complete".Created on 2018-09-16 by the reprex package (v0.2.0).