“ymd” family of functions fail with “Error in gsub,” language locale bug? #181

Closed
ghost opened this issue Apr 26, 2013 · 34 comments

@ghost

ghost commented Apr 26, 2013

I am using version 3.0.0 of R on 64-bit Windows 7. It may (or may not) be worth noting that I live in Japan and the system language of my OS is Japanese. I am running R in English, however.

Lubridate has been updated to the most recent version, and

library(lubridate)

loads the package without any issue. I can for example run something like

now()
[1] "2013-04-26 21:07:30 JST"
now() - days(2)
[1] "2013-04-24 21:07:41 JST"

with no issue. That said, the critically important ymd family of functions (ymd, dmy, etc.) does not function properly. For ymd, mdy, any of them, with any argument, I get the following error:

ymd("2010-12-08")
Error in gsub("+", "*", fixed = T, gsub(">", "_e>", num)) :
invalid multibyte string at '<8c>)<28>?![[:alpha:]]))|((?<H_s_e>
2[0-4]|[01]?\d)\D+(?<M_s_e>[0-5]?\d)\D+((?<OS_s_S_e>[0-5]?\d.\d+)|
(?<S_s_e>[0-6]?\d))))'

I received a response on StackExchange suggesting it might be a bug related to the language locale. For reference, the sessionInfo() is as follows:

sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Japanese_Japan.932 LC_CTYPE=Japanese_Japan.932
[3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C
[5] LC_TIME=Japanese_Japan.932

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] lubridate_1.3.0 TSA_1.01 tseries_0.10-31 mgcv_1.7-22
[5] locfit_1.5-9.1 leaps_2.9

loaded via a namespace (and not attached):
[1] digest_0.6.3 grid_3.0.0 lattice_0.20-15 Matrix_1.0-12
[5] memoise_0.1 nlme_3.1-109 plyr_1.8 quadprog_1.5-5
[9] stringr_0.6.2 zoo_1.7-9

Any attention would be most appreciated.

@vspinu
Member

vspinu commented Apr 26, 2013

I am afraid this is a system specific problem. I cannot reproduce it on linux. This is what I have:

> Sys.setlocale("LC_TIME", "ja_JP.utf8")
[1] "ja_JP.utf8"
> format(Sys.time(), format = "%a %Y %b %d %I:%M:%S %p")
[1] "金 2013  4月 26 04:40:43 午後"
> ymd("2010-12-08")
[1] "2010-12-08 UTC"
> 

Let's try to isolate the problem.

From what I can see it has to do with the following code in lubridate:::.build_locale_regs()

## #408-#410, /home/vitoshka/TVC/lubridate/R/guess.r
num_exact[] <- gsub("(?<!\\()\\?(?!<)", "", perl = T, # remove ?
                    gsub("+", "*", fixed = T,
                         gsub(">", "_e>", num))) # append _e to avoid duplicates

It would be nice if you could post the value of the "num" variable here. You can get it either by placing a browser() call at that location, or by setting options(error=recover) and choosing the appropriate frame once the error has occurred.
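For readers who cannot use the interactive prompt, here is a minimal non-interactive sketch of the same idea: catch the error while the call stack is still intact and pull a local variable out of whichever frame holds it. The function `build_regs` and its contents are hypothetical stand-ins, not lubridate code.

```r
# Hypothetical stand-in for a function that fails deep inside nested calls,
# with a local variable `num` we would like to inspect.
build_regs <- function() {
  num <- c(d = "(?<d>[0-3]?\\d)", H = "(?<H>[01]?\\d)")
  toupper(stop("simulated failure"))  # error is raised while `num` is in scope
}

captured <- NULL
msg <- tryCatch(
  withCallingHandlers(
    build_regs(),
    error = function(e) {
      # Inside a calling handler the call stack is still intact, so we can
      # search the active frames for a local variable named `num`.
      for (fr in as.list(sys.frames())) {
        if (exists("num", envir = fr, inherits = FALSE)) {
          captured <<- get("num", envir = fr)
        }
      }
    }
  ),
  error = function(e) conditionMessage(e)
)

print(msg)
print(captured)
```

This does roughly what picking a frame under options(error=recover) does by hand, so it is only useful when an interactive session is not available.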

@ghost
Author

ghost commented Apr 26, 2013

Thanks so much for the response.
I tried working with the code a bit, though couldn't get as far as the value of variable "num." As the code below shows, I must be missing something on how to reach the arguments of that nested gsub. Apologies for having to be walked through this, I'm still very new to R.

options(error=recover)
ymd("2012-02-02")
Error in gsub("+", "*", fixed = T, gsub(">", "_e>", num)) :
invalid multibyte string at '<8c>)<28>?![[:alpha:]]))|((?<H_s_e>2[0-4]|[01]?\d)\D+(?<M_s_e>[0-5]?\d)\D+((?<OS_s_S_e>[0-5]?\d.\d+)|(?<S_s_e>[0-6]?\d))))'

Enter a frame number, or 0 to exit

1: ymd("2012-02-02")
2: .parse_xxx(..., orders = "ymd", quiet = quiet, tz = tz, locale = locale,
3: as.POSIXct(parse_date_time(dates, orders, quiet = quiet, tz = tz, locale
4: parse_date_time(dates, orders, quiet = quiet, tz = tz, locale = locale,
5: .local_parse(x[to_parse], TRUE)
6: .best_formats(train, orders, locale = locale, .select_formats)
7: unique(guess_formats(x, orders, locale = locale, preproc_wday = TRUE))
8: guess_formats(x, orders, locale = locale, preproc_wday = TRUE)
9: .get_loc_regs(locale)
10: f(...)
11: gsub("(?<!\\()\\?(?!<)", "", perl = T, gsub("+", "*", fixed = T, gsub(">
12: gsub("+", "*", fixed = T, gsub(">", "_e>", num))

Selection: 12
Called from: top level
Browse[1]> num
Error during wrapup: object 'num' not found
Browse[1]> fixed
[1] TRUE
Browse[1]> gsub
function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
{
if (!is.character(x))
x <- as.character(x)
.Internal(gsub(as.character(pattern), as.character(replacement),
x, ignore.case, perl, fixed, useBytes))
}
<bytecode: 0x0c74c7dc>
<environment: namespace:base>
Browse[1]> gsub
function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
{
if (!is.character(x))
x <- as.character(x)
.Internal(gsub(as.character(pattern), as.character(replacement),
x, ignore.case, perl, fixed, useBytes))
}
<bytecode: 0x0c74c7dc>
<environment: namespace:base>
Browse[1]> num
Error during wrapup: object 'num' not found

@vspinu
Member

vspinu commented Apr 26, 2013

You are doing things right. The frame you should enter is 11, or even 10, not 12. Thanks.

@ghost
Author

ghost commented Apr 26, 2013

Thanks very much for the guidance. Got it from frame 10. It resulted in quite the explosion of text. I wonder if it's having issues with the AM/PM kanji in there?

Selection: 10
Called from: top level
Browse[1]> num
d
"(?<d>[012]?[1-9]|3[01]|[12]0)"
H
"(?<H>2[0-4]|[01]?\\d)"
h
"(?<H>2[0-4]|[01]?\\d)"
I
"(?<I>1[0-2]|0?[1-9])"
j
"(?<j>[0-3]?\\d?\\d)"
M
"(?<M>[0-5]?\\d)"
S
"((?<OS_S>[0-5]?\\d\\.\\d+)|(?<S>[0-6]?\\d))"
s
"((?<OS_S>[0-5]?\\d\\.\\d+)|(?<S>[0-6]?\\d))"
U
"(?<U>[0-5]?\\d)"
w
"(?<w>[0-6])"
u
"(?<u>[1-7])"
W
"(?<W>[0-5]?\\d)"
Y
"(?<Y>\\d{4})"
y
"((?<Y_y>\\d{4})|(?<y>\\d{2}))"
Oz
"(?<Oz_Oz>[-+]\\d{4})"
OO
"(?<OO>[-+]\\d{2}:\\d{2})"
Oo
"(?<Oo>[-+]\\d{2})"
T
"(((?<I_s>1[0-2]|0?[1-9])\\D+(?<M_s_T>[0-5]?\\d)\\D+((?<OS_s_T_S>[0-5]?\\d\\.\\d+)|(?<S_s_T>[0-6]?\\d))\\D*(?<p_s>午前|午後)(?![[:alpha:]]))|((?<H_s>2[0-4]|[01]?\\d)\\D+(?<M_s>[0-5]?\\d)\\D+((?<OS_s_S>[0-5]?\\d\\.\\d+)|(?<S_s>[0-6]?\\d))))"
R
"(((?<I_s>1[0-2]|0?[1-9])\\D+(?<M_s_T>[0-5]?\\d)\\D*(?<p_s>午前|午後)(?![[:alpha:]]))|((?<H_s>2[0-4]|[01]?\\d)\\D+(?<M_s>[0-5]?\\d)))"
r
"(((?<I_s>1[0-2]|0?[1-9])\\D*(?<p_s>午前|午後)(?![[:alpha:]]))|(?<H_s>2[0-4]|[01]?\\d))"

Browse[1]>

@vspinu
Member

vspinu commented Apr 27, 2013

Yes, it has to do with the encoding and most likely with the fact that your R is English and locale is Japanese.

Now take those funny expressions one at a time and check which one is causing the problem. For example:

funny_exp <- "(((?<I_s>1[0-2]|0?[1-9])\\D*(?<p_s>午前|午後)(?![[:alpha:]]))|(?<H_s>2[0-4]|[01]?\\d))"
gsub("+", "*", funny_exp, fixed = T)
gsub("(?<!\\()\\?(?!<)", "", funny_exp, perl = T)

Once we figure this out we are very close to a reproducible example that you can post on Stack Overflow or R-help. This is really not a lubridate issue per se, but it would be good to know what is going on for the future.
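Checking the expressions one by one can also be scripted. The following is a sketch only: it uses a hypothetical three-element subset of the patterns (with ASCII AM/PM in place of the kanji) and reports, per element, whether both substitutions ran without error.

```r
# Hypothetical subset of the patterns; in practice `num` would be the full
# vector taken from the debugger session.
num <- c(
  H = "(?<H>2[0-4]|[01]?\\d)",
  M = "(?<M>[0-5]?\\d)",
  r = "(((?<I_s>1[0-2]|0?[1-9])\\D*(?<p_s>AM|PM)(?![[:alpha:]]))|(?<H_s>2[0-4]|[01]?\\d))"
)

# Apply both substitutions to one pattern; return "ok" or the error message.
try_one <- function(p) {
  tryCatch({
    gsub("+", "*", p, fixed = TRUE)
    gsub("(?<!\\()\\?(?!<)", "", p, perl = TRUE)
    "ok"
  }, error = function(e) conditionMessage(e))
}

status <- vapply(num, try_one, character(1))
print(status)
```

Any element whose status is not "ok" is a candidate for the minimal reproducible example.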

@ghost
Author

ghost commented Apr 27, 2013

Great, once I get home tonight I will try playing around with it in more detail. I will post again as soon as I can.

For reference, however, even returning my R to Japanese so that it matches the native locale, the error still occurs.

@ghost
Author

ghost commented Apr 27, 2013

Bizarre... I went through all the values of "num" in there, and each time the exact same pattern of output came out. No errors or anomalies pop up, however. The pattern is, for example,

strange <- "(((?<I_s>1[0-2]|0?[1-9])\\D*(?<p_s>午前|午後)(?![[:alpha:]]))|(?<H_s>2[0-4]|[01]?\\d))"
gsub("+", "*", strange, fixed = T)
[1] "(((?<I_s>1[0-2]|0?[1-9])\\D*(?<p_s>午前|午後)(?![[:alpha:]]))|(?<H_s>2[0-4]|[01]?\\d))"
gsub("(?<!\\()\\?(?!<)", "", strange, perl = T)
[1] "(((?<I_s>1[0-2]|0[1-9])\\D*(?<p_s>午前|午後)(?![[:alpha:]]))|(?<H_s>2[0-4]|[01]\\d))"

or

strange <- "(?<H>2[0-4]|[01]?\\d)"
gsub("+", "*", strange, fixed = T)
[1] "(?<H>2[0-4]|[01]?\\d)"
gsub("(?<!\\()\\?(?!<)", "", strange, perl = T)
[1] "(?<H>2[0-4]|[01]\\d)"

where the question mark(s) are removed in the second gsub. Besides that, fiddling around with things I cannot seem to come across a reproduction of the error.

R is quite widely used in Japan, and I would be surprised if no one in the country was able to use lubridate.

Apologies for not being able to get more information together myself, the support is much appreciated.

@vspinu
Member

vspinu commented Apr 27, 2013

The value of num is correct. It's some multibyte glitch in gsub with fixed=T. Try the following

fmt <- format(as.POSIXct("1970-01-01 02:00:00"), "%a+%A+%b@%B@%p@")
gsub("+", "*", fmt, fixed = T)

It should give "木*木曜日* 1月@1月@午前@"

Also try:

gsub("+", "*", sprintf("%s", fmt), fixed = T)

This is how num is constructed, using format and sprintf.

@ghost
Author

ghost commented Apr 27, 2013

Seems to be no issue in trying these guys as well:

fmt <- format(as.POSIXct("1970-01-01 02:00:00"), "%a+%A+%b@%B@%p@")
gsub("+", "*", fmt, fixed = T)
[1] "木*木曜日*1@1月@午前@"
gsub("+", "*", sprintf("%s", fmt), fixed = T)
[1] "木*木曜日*1@1月@午前@"

@vspinu
Member

vspinu commented Apr 28, 2013

Ok I think I know where it comes from, but I have no clue how to solve it.

Does this work for you?

paste(c("午前", "午後"), collapse = "|")
## -> [1] "午前|午後"

If so, does this work as expected:

ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")
ampm
## -> [1] "午前" "午後"
paste(ampm, collapse = "|")
## -> [1] "午前|午後"

@vspinu
Member

vspinu commented Apr 28, 2013

And then of course,

gsub("+", "*", paste(ampm, collapse = "|"), fixed = T)
gsub("+", "*", sprintf("%s", paste(ampm, collapse = "|")), fixed = T)

since that is where the error comes from.

@vspinu
Member

vspinu commented Apr 28, 2013

Eh, it sinks in gradually. Also try a double gsub:

gsub("+", "*", gsub("|", "*", sprintf("%s", paste(ampm, collapse = "+|")), fixed = T), fixed = T)

This is how remote debugging works :)

@ghost
Author

ghost commented Apr 28, 2013

This is an extremely useful learning experience... While I have some experience with programming, it has all been fairly small-scale applications and thus was always quite easy to debug. Going through the above suggestions, the functions' responses are as follows:

paste(c("午前", "午後"), collapse = "|")
[1] "午前|午後"
ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")
ampm
[1] "午前" "午後"
paste(ampm, collapse = "|")
[1] "午前|午後"
gsub("+", "*", paste(ampm, collapse = "|"), fixed = T)
[1] "午前|午後"
gsub("+", "*", sprintf("%s", paste(ampm, collapse = "|")), fixed = T)
[1] "午前|午後"
gsub("+", "*", gsub("|", "*", sprintf("%s", paste(ampm, collapse = "+|")), fixed = T), fixed = T)
[1] "午前**午後"

@vspinu
Member

vspinu commented Apr 28, 2013

Sorry. I am out of options, this is the most obscure bug I have seen in years.

Here is the last attempt:

gsub("+", "*", gsub(">", "*", sprintf("%s", paste(ampm, collapse = "+>"))), fixed = T)

Here is the isolated code from .build_locale_regs. It is virtually identical to the internal code (btw, I hope you are using the most recent version of lubridate from GitHub).

ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")
p <- unique(ampm)
p <- p[nzchar(p)]
alpha_p <- sprintf("(?<p>%s)(?![[:alpha:]])", paste(p, collapse = "|"))

##  NUMERIC FORMATS
num <- c(
  d = "(?<d>[012]?[1-9]|3[01]|[12]0)",
  H = "(?<H>2[0-4]|[01]?\\d)",
  h = "(?<H>2[0-4]|[01]?\\d)",
  I = "(?<I>1[0-2]|0?[1-9])", 
  j = "(?<j>[0-3]?\\d?\\d)", 
  M = "(?<M>[0-5]?\\d)",
  S = "((?<OS_S>[0-5]?\\d\\.\\d+)|(?<S>[0-6]?\\d))", 
  s = "((?<OS_S>[0-5]?\\d\\.\\d+)|(?<S>[0-6]?\\d))",
  U = "(?<U>[0-5]?\\d)", 
  w = "(?<w>[0-6])", # merge with a, A??
  u = "(?<u>[1-7])", 
  W = "(?<W>[0-5]?\\d)", 
  ## x = "(?<x>\\d{2}/[01]?\\d/[0-3]?\\d)", 
  ## X = "(?<X>[012]?\\d:[0-5]?\\d:[0-6]?\\d)", 
  Y = "(?<Y>\\d{4})",
  y = "((?<Y_y>\\d{4})|(?<y>\\d{2}))",
  Oz = "(?<Oz_Oz>[-+]\\d{4})", ## strptime implements only this format (4 digits)
  ## F = "(?<F>\\d{4)-\\d{2}-\\d{2})",
  OO = "(?<OO>[-+]\\d{2}:\\d{2})", 
  Oo = "(?<Oo>[-+]\\d{2})")


check <- sprintf("((%s\\D+%s\\D+%s\\D*%s)|(%s\\D+%s\\D+%s))",
                 num[["I"]], num[["M"]], num[["S"]], alpha_p, num[["H"]], num[["M"]], num[["S"]])

check <- gsub("(<[IMSpHS]|<OS)", "\\1_s", check)

gsub("(?<!\\()\\?(?!<)", "", perl = T, 
     gsub("+", "*",  fixed = T,  
          gsub(">", "_e>", check))) # append _e to avoid duplicates

If this one doesn't give you an error I am afraid you have to step through the code and try to isolate the problem yourself. Here is how to do that:

.date_template <- lubridate:::.date_template
lubridate:::.build_locale_regs ## print the function (no parentheses) to get the code

Now copy paste the body of the function into a new file and execute as usual. If you don't get the error that can only mean you are kidding me. Once you get an error, try to cut the irrelevant pieces till you get something manageable.

BTW, just to make sure,

lubridate:::.build_locale_regs()

should give you the original error.

@ghost
Author

ghost commented Apr 29, 2013

Thanks so much for all the help. Extremely appreciated.
I'll try my best with the debugging and if I make any progress will let you know.

@hadley
Member

hadley commented Apr 29, 2013

@Vitoshka this is probably some detail with Windows locales + character encodings and will probably need a windows box to reproduce.

@ghost
Author

ghost commented Apr 29, 2013

Following the above suggestions, it seems the "location" of the error has been narrowed down.

As above, assigning

ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")

trying this guy finally brought up an error:

gsub("+", "*", gsub(">", "*", sprintf("%s", paste(ampm, collapse = "+>"))), fixed = T)
Error in gsub("+", "*", gsub(">", "*", sprintf("%s", paste(ampm, collapse = "+>"))), :
invalid multibyte string at ' < 8 c > '

I put in the spaces between the <, 8, c, and > since Github's auto-formatting made it disappear otherwise.
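Some context on the 8c in that message: in CP932 (the code page named in the sessionInfo() above), 午 is encoded as the two bytes 8c df, so a lone 8c is the lead byte of that character with its trail byte missing. The sketch below prints the CP932 bytes of both AM/PM strings so this can be checked. It assumes the platform's iconv recognises the name "CP932", which may not hold everywhere, and returns NULL if it does not.

```r
# "午前" (AM) and "午後" (PM), written with \u escapes so the script itself
# does not depend on the source file's encoding.
ampm <- c("\u5348\u524d", "\u5348\u5f8c")

# Convert to CP932 and return the raw bytes; NULL if this platform's iconv
# does not know the "CP932" name (an assumption, not guaranteed).
to_cp932_bytes <- function(s) {
  tryCatch(iconv(s, from = "UTF-8", to = "CP932", toRaw = TRUE)[[1]],
           error = function(e) NULL)
}

bytes <- lapply(ampm, to_cp932_bytes)
print(bytes)
```

If the conversion succeeds, each string should come out as four bytes beginning with 8c.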

@vspinu
Member

vspinu commented Apr 29, 2013

Thanks. Just to make sure:

The following does work as expected:

    gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), fixed = T), fixed = T)

And this one breaks:

    gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>")), fixed = T)

If so I can fix it today and you probably should report an R bug.
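The two cases can also be run side by side in one script that reports the outcome instead of stopping at the first error. This is a sketch: the helper `attempt` is a hypothetical name, and the time zone is passed via tz = "UTC" rather than in the string. On a machine without the problematic locale, both entries are expected to simply report "ok".

```r
times <- as.POSIXct(c("1970-01-01 01:00:00", "1970-02-02 22:00:00"), tz = "UTC")
ampm <- format(times, format = "%p")
joined <- paste(ampm, collapse = "+>")

# Evaluate an expression and report "ok" or the error message.
attempt <- function(expr) {
  tryCatch({
    force(expr)
    "ok"
  }, error = function(e) conditionMessage(e))
}

status <- c(
  # inner gsub uses fixed = TRUE (reported to work)
  inner_fixed = attempt(gsub("+", "*", gsub(">", "*", joined, fixed = TRUE), fixed = TRUE)),
  # inner gsub uses the default regex engine (reported to break)
  inner_regex = attempt(gsub("+", "*", gsub(">", "*", joined), fixed = TRUE))
)
print(status)
```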

@ghost
Author

ghost commented Apr 29, 2013

Really! That would be wonderful. Indeed, the first one works as expected, while the latter does break:

gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), fixed = T), fixed = T)
[1] "午前**午後"
gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>")), fixed = T)
Error in gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>")), :
invalid multibyte string at '< 8 c >'

I'll be sure to report the bug. Please let me know if there is anything else I should do.

vspinu added a commit to vspinu/lubridate that referenced this issue Apr 29, 2013
@vspinu
Member

vspinu commented Apr 29, 2013

I have committed a change. Try it out with:

library(devtools)
install_github("lubridate", "vitoshka")

@ghost
Author

ghost commented Apr 30, 2013

Apologies for the delay... for some reason I'm having issues with install_github, getting an error in tools:::.install_packages() saying it "cannot create temporary directory." Read/write permissions are all fine for the temp directory, so I'm digging around the error a bit deeper.

Once I get this worked out I'll get back to you ASAP.

@ghost
Author

ghost commented May 1, 2013

Sorry about the delay. I've got devtools working properly again (had to tweak some environment variables) and I believe the download from github went fine:

install_github("lubridate","vitoshka")
Installing github repo(s) lubridate/master from vitoshka
Installing lubridate.zip from https://github.com/vitoshka/lubridate/archive/master.zip
Installing lubridate
"C:/.../R-3.0.0/bin/x64/R" --vanilla CMD INSTALL "C:...\lubridate-master" --library="C:/.../R-3.0.0/library" --with-keep.source
installing source package 'lubridate' ... # asterisks below removed by me due to github formatting
R
data
moving datasets to lazyload DB
inst
preparing package for lazy loading
help
installing help indices
building package indices
installing vignettes
testing if installed package can be loaded
arch - i386
arch - x64
DONE (lubridate)

However, unfortunately when I load the package and try the function again,

library(lubridate)
ymd("2012-02-02")
Error in gsub(">", "_e>", num, fixed = TRUE) :
invalid multibyte string at '< 8c >)< 28 >?![[:alpha:]]))|((?< H_s >2[0-4]|[01]?\d)\D+(?< M_s >[0-5]?\d)\D+((?< OS_s_S >[0-5]?\d.\d+)|(?< S_s >[0-6]?\d))))'
Enter a frame number, or 0 to exit
1: ymd("2012-02-02")
2: parse.r#67: .parse_xxx(..., orders = "ymd", quiet = quiet, tz = tz, loca
3: parse.r#551: as.POSIXct(parse_date_time(dates, orders, quiet = quiet, tz
4: parse_date_time(dates, orders, quiet = quiet, tz = tz, locale = locale,
5: parse.r#450: .local_parse(x[to_parse], TRUE)
6: parse.r#427: .best_formats(train, orders, locale = locale, .select_forma
7: guess.r#248: unique(guess_formats(x, orders, locale = locale, preproc_wd
8: guess_formats(x, orders, locale = locale, preproc_wday = TRUE)
9: guess.r#134: .get_loc_regs(locale)
10: f(...)
11: guess.r#409: gsub("(?<!\\()\\?(?!<)", "", perl = TRUE, gsub("+", "*", fi
12: gsub("+", "*", fixed = TRUE, gsub(">", "_e>", num, fixed = TRUE))
13: gsub(">", "_e>", num, fixed = TRUE)

which is a bit different than before, though I'm not sure of what has precisely changed. Going through the same routine as above once again, the results were the same, including with the same final error popping up here:

ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")
gsub("+", "*", gsub("|", "*", sprintf("%s", paste(ampm, collapse = "+|")), fixed = T), fixed = T)
[1] "午前**午後"
gsub("+", "*", gsub(">", "*", sprintf("%s", paste(ampm, collapse = "+>"))), fixed = T)
Error in gsub("+", "*", gsub(">", "*", sprintf("%s", paste(ampm, collapse = "+>"))), :
invalid multibyte string at '< 8c >'
Enter a frame number, or 0 to exit
1: gsub("+", "*", gsub(">", "*", sprintf("%s", paste(ampm, collapse = "+>")

Assuming the download from github was of the correct files, please let me know if anything looks interesting to you in terms of the error content. I'll be sure to try anything out if need be. Regards!

vspinu added a commit to vspinu/lubridate that referenced this issue May 1, 2013
@vspinu
Member

vspinu commented May 1, 2013

Hm, it is getting grimmer and grimmer. I have committed a change that completely avoids the standard R regexp engine (that is, it uses perl or fixed regexps only). I hope perl works for you; otherwise there is really no other option than deactivating internationalization in lubridate.
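For reference, the three matching modes being discussed are selected through arguments to gsub. The sketch below uses an ASCII stand-in for the AM/PM string, so it shows only how the modes are selected, not the locale-specific failure.

```r
x <- "AM+>PM"  # ASCII stand-in for the locale-specific AM/PM string

default_re <- gsub(">", "*", x)                # default regex engine
fixed_str  <- gsub(">", "*", x, fixed = TRUE)  # literal matching, no regex
perl_re    <- gsub(">", "*", x, perl = TRUE)   # PCRE engine

print(c(default = default_re, fixed = fixed_str, perl = perl_re))
```

On plain ASCII input all three give the same result; the thread is about the case where they stop agreeing on multibyte input.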

Try my master branch again and also try this:

gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), perl = T), fixed = T)

And you really should report this bug to R people (here https://bugs.r-project.org/bugzilla3/)

@hadley
Member

hadley commented May 1, 2013

There is no point submitting a bug to R unless you can create a simple reproducible example. For example, when I run the following code on my windows machine, I don't get an error (but neither do I get the correct output)

Sys.setlocale("LC_ALL", "Japanese_Japan.932")

ampm <- format(as.POSIXct(c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")), format = "%p")
ampm
# [1] "??" "??"

gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), fixed = T), fixed = T)
# [1] "??**??"
gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>")), fixed = T)
# [1] "??**??"

@ghost
Author

ghost commented May 1, 2013

For reference, I have been able to reproduce it on other machines, though the setup has been identical (Japanese 64-bit Windows 7).

With the new version, I get

d <- ymd("2012-02-02")
Error in gsub(">", "_e>", num, fixed = TRUE) :
invalid multibyte string at '< 8c >)< 28 >?![[:alpha:]]))|((?< H_s >2[0-4]|[01]?\d)\D+(?< M_s >[0-5]?\d)\D+((?< OS_s_S >[0-5]?\d.\d+)|(?< S_s >[0-6]?\d))))'
Enter a frame number, or 0 to exit
1: ymd("2012-02-02")
2: parse.r#67: .parse_xxx(..., orders = "ymd", quiet = quiet, tz = tz, loca
3: parse.r#551: as.POSIXct(parse_date_time(dates, orders, quiet = quiet, tz
4: parse_date_time(dates, orders, quiet = quiet, tz = tz, locale = locale,
5: parse.r#450: .local_parse(x[to_parse], TRUE)
6: parse.r#427: .best_formats(train, orders, locale = locale, .select_forma
7: guess.r#248: unique(guess_formats(x, orders, locale = locale, preproc_wd
8: guess_formats(x, orders, locale = locale, preproc_wday = TRUE)
9: guess.r#134: .get_loc_regs(locale)
10: f(...)
11: guess.r#409: gsub("(?<!\\()\\?(?!<)", "", perl = TRUE, gsub("+", "*", fi
12: gsub("+", "*", fixed = TRUE, gsub(">", "_e>", num, fixed = TRUE))
13: gsub(">", "_e>", num, fixed = TRUE)

The other statement yields:

Error in gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), perl = T), :
invalid multibyte string at '< 8c >'
Enter a frame number, or 0 to exit
1: gsub("+", "*", gsub(">", "*", paste(ampm, collapse = "+>"), perl = T), f
Selection: 1
Called from: top level
Browse[1]> perl
[1] FALSE

@hadley
Member

hadley commented May 1, 2013

It's unlikely that the R maintainers will have a Japanese version of windows available, so it would be helpful to create an example that fails for everyone. It's quite possible that there are other ways to fix the bug by correcting the string encoding but without a reproducible example, there's no way I can explore.

@vspinu
Member

vspinu commented May 1, 2013

@hadley

This is why this bug is a nightmare. The bug is easily reproducible on a Japanese machine, and it is a problem with the regexp parser, because fixed=T works.

Though in your case something else might be going on. The ?? in the output might simply mean that your terminal doesn't know how to display the characters, or that the encoding is absent on your machine. I have no clue how Windows deals with this, but on Linux I get a warning when I try to set a missing language locale.

@vspinu
Member

vspinu commented May 1, 2013

Fixing the encoding with Encoding(x) <- value, right?

Maybe at least ask on R-devel. Someone might recommend a fix without even reproducing the problem.


@hadley
Member

hadley commented May 1, 2013

Ok - I can reproduce it now - the key is to use the R gui, not RStudio (I'll report that bug).

But the plot thickens:

Sys.setlocale("LC_ALL", "Japanese_Japan.932")

times <- c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")
ampm <- format(as.POSIXct(times), format = "%p")
x <- gsub(">", "*", paste(ampm, collapse = "+>"))

y <- "午前+*午後"
identical(x, y)
# [1] TRUE
gsub("+", "*", x, fixed = T)
# Error in gsub("+", "*", x, fixed = T) : 
#  invalid multibyte string at '<8c>'
gsub("+", "*", y, fixed = T)
# [1] "午前**午後"
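One general way two strings can compare identical() yet differ underneath is a difference in marked encoding: identical() treats strings in different marked encodings as equal if they agree once translated to UTF-8. Whether that is what happens with x and y above is not established in this thread; the sketch below only illustrates the mechanism, using "é" and latin1 as an arbitrary example.

```r
a <- "\u00e9"                                 # "é", marked as UTF-8
b <- iconv(a, from = "UTF-8", to = "latin1")  # same character, marked latin1

print(Encoding(c(a, b)))                      # the two encoding marks
print(identical(a, b))                        # compared as characters
print(identical(charToRaw(a), charToRaw(b)))  # compared as raw bytes
```

Printing Encoding(x) and charToRaw(x) for both strings from the example above would show whether they differ in the same way.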

@hadley
Member

hadley commented May 1, 2013

@hadley
Member

hadley commented May 7, 2013

It seems like a known problem, but it's not obvious what the fix is.

@vspinu
Member

vspinu commented May 7, 2013

R-devel was pretty silent :) Would it help to explicitly convert the string to UTF-8 before processing with grep?

I see enc2utf8, Encoding and iconv, which are apparently designed for this task.

Vitalie


@hadley
Member

hadley commented May 7, 2013

No, that doesn't help as far as I can tell :(

@ghost
Author

ghost commented May 7, 2013

Thanks so much for all your efforts guys.
I've seen some Japanese stats bloggers using lubridate before, so I'll try and get in touch to see if there has been any kind of workaround.
