fread memory leak when small number bumped up to character #918
Which version of data.table, please? In particular, is it v1.9.4 or higher?
Although that was a crash rather than a memory leak, it could manifest itself as a memory leak as well. I'm guessing that the scientific notation is a red herring, since reading of scientific notation is pretty stable. Please provide your sessionInfo() output.
Here is the sessionInfo: R version 3.1.1 (2014-07-10) [the locale, attached packages, and loaded namespaces were collapsed in the original post]. Here is the output for a 130m file; it took 5G of my memory when fread finished. When I tried fread on another 300m file without colClasses to preset the column to character, it ended up taking 30G of my memory. The verbose output began:

Input contains no \n. Taking this to be a filename to open
Many thanks. No type bumps are happening then. That warning is telling you about loss of accuracy; I can't see why that would result in larger memory usage. Can you also provide the commands you ran and the output with verbose=TRUE?
I shared my data. Just run fread on it and monitor the memory usage in Windows' Task Manager; you can see how much memory is held by the R session. On my machine it took 13G of memory after reading 8.9% of the 219m file, and it was very slow. I had to stop the R session, otherwise it could have taken all my 36G of memory and crashed my machine. If I specify both columns as character, then the memory usage is normal:

dat = fread('debug.csv', verbose=TRUE, colClasses = list(character=c('V1','V2')))
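To complete the workaround (a sketch, not from the original report), continuing from the call above: coerce the character columns back to numeric afterwards. as.numeric parses these subnormal strings without fread's per-field warning, though the values still round to the nearest representable double.

# coerce both columns back to numeric by reference
dat[, c('V1','V2') := lapply(.SD, as.numeric), .SDcols = c('V1','V2')]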
Matt, could you reproduce this issue?
Thanks. Have downloaded and run, but it works fine here on Linux using the same CRAN version (v1.9.4). How strange! It's a really simple file (2 numeric columns). Maybe it is to do with the ERANGE warning then, and since it works fine here on Linux maybe it's a Windows-only problem. Could you try removing some digits from the numbers in the file and see if it then works?

$ R
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> require(data.table)
Loading required package: data.table
data.table 1.9.4 For help type: ?data.table
*** NB: by=.EACHI is now explicit. See README to restore previous behaviour.
> DT = fread("debug.csv", verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.219189 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
Found 2 columns
First row with 2 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 6313724
Subtracted 2 for last eol and any trailing empty lines, leaving 6313722 data rows
Type codes ( first 5 rows): 33
Type codes (+ middle 5 rows): 33
Type codes (+ last 5 rows): 33
Type codes: 33 (after applying colClasses and integer64)
Type codes: 33 (after applying drop or select (if supplied)
Allocating 2 column slots (2 - 0 dropped)
Read 6313722 rows and 2 (of 2) columns from 0.219 GB file in 00:00:12
0.051s ( 0%) Memory map (rerun may be quicker)
0.000s ( 0%) sep and header detection
1.166s ( 10%) Count rows (wc -l)
0.000s ( 0%) Column type detection (first, middle and last 5 rows)
0.201s ( 2%) Allocation of 6313722x2 result (xMB) in RAM
9.927s ( 88%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.000s ( 0%) Changing na.strings to NA
11.344s Total
Warning message:
In fread("debug.csv", verbose = TRUE) :
C function strtod() returned ERANGE for one or more fields. The first was string input '2.32741124362878E-309'. It was read using (double)strtold() as numeric value 2.3274112436287792E-309 (displayed here using %.16E); loss of accuracy likely occurred. This message is designed to tell you exactly what has been done by fread's C code, so you can search yourself online for many references about double precision accuracy and these specific C functions. You may wish to use colClasses to read the column as character instead and then coerce that column using the Rmpfr package for greater accuracy.
> print(DT)
V1 V2
1: 0.04007285 0.8010419
2: 0.04898210 0.7638770
3: -0.07365259 0.8425065
4: -59.74810854 0.1918155
5: -39.97517367 0.3965352
---
6313718: 1.94000883 0.5497541
6313719: 0.11001585 0.5206822
6313720: 0.11050033 0.4940505
6313721: 0.06749008 0.3739320
6313722: 0.04381130 0.2168884
> sapply(DT,class)
V1 V2
"numeric" "numeric"
> system("ls -lh debug.csv")
-rw-r----- 1 mdowle mdowle 225M Nov 1 09:23 debug.csv
> system("head debug.csv")
V1,V2
0.0400728476107927,0.801041935693612
0.0489820969563939,0.763877007211593
-0.0736525895727923,0.842506514813372
-59.7481085409406,0.191815544728967
-39.9751736658287,0.396535177786151
-0.0464704021731438,0.405312124283196
0.313994450640944,0.163031903374044
0.402107788498037,0.0678644932003186
0.21551724137931,0.177449963851469
> system("tail debug.csv")
-0.0350671334109038,0.84996384428055
0.148874493837568,0.851469131093495
-0.132688742284806,0.858676881345481
-0.0635038868127799,0.988171874680027
1.94000883097708,0.549754083953011
0.110015845109051,0.520682247269614
0.110500329886343,0.494050546140977
0.0674900801548572,0.373931998470873
0.043811304035503,0.216888407882552
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] graphics grDevices datasets stats utils methods base
other attached packages:
[1] data.table_1.9.4 bit64_0.9-4 bit_1.1-12
loaded via a namespace (and not attached):
[1] chron_2.3-45 plyr_1.8.1 Rcpp_0.11.3 reshape2_1.4 stringr_0.6.2
> tables()
NAME NROW NCOL MB COLS KEY
[1,] DT 6,313,722 2 97 V1,V2
Total: 97MB
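For reference, the Rmpfr route suggested in that warning would look roughly like this (a sketch, assuming the Rmpfr package is installed; mpfr() parses decimal strings at a chosen precision):

require(data.table)
require(Rmpfr)
DT = fread("debug.csv", colClasses = list(character = c("V1","V2")))
# parse V1 at 128-bit precision, well beyond the 53-bit mantissa of a double
v1 = mpfr(DT$V1, precBits = 128)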
Did you have a chance to try it on a Windows machine?
I hunted online for similar issues but didn't get lucky.
Yes, I can confirm the memory leak on Windows.
I'd love to see this fixed. Reading my 30MB file, which contains some super small numbers, leaves me with 5+ GB of leaked memory on Windows, forcing me to restart R between executions. I could treat the columns as character and then coerce, but what a pain.
@braidm Can you provide some more context, please? How small are "super small numbers"? Say, can you post a few lines of your file as an example? Also, do you see the leak in the CRAN version, or in the latest dev (or both)?
I cannot share my original file, which has 160,000 rows, 28 fields, tab delimited, and leaks over 7GB whenever I read it (on disk it is 30MB). The leak does not happen if all numerics are larger than the machine limit of 2.225074e-308. However, I created a test file which leaks about 300MB each time it is read. With repeated executions it will leak multiple GB. Here is the message from data.table [warning text collapsed in the original post], and here is the sessionInfo [likewise collapsed].
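A test file like that can be generated along these lines (a sketch; the file name and size are illustrative). Values below .Machine$double.xmin (about 2.225074e-308) are subnormal and trip strtod()'s ERANGE path:

n = 1e6
df = data.frame(V1 = runif(n) * 1e-309,  # subnormal magnitudes
                V2 = runif(n))
write.csv(df, "subnormal_test.csv", row.names = FALSE)
DT = data.table::fread("subnormal_test.csv", verbose = TRUE)  # watch memory during this call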
@braidm Thanks for providing the test data. I was able to verify that the "leak" occurs on a Windows machine with the CRAN version of data.table. Notably, the same problem does not occur on a Linux machine or a macOS machine. The good news is that the problem doesn't occur with the latest development version of data.table.
Please let me know if this solves your problem.
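For reference, one common way to get the development version (an assumption, not stated in this thread; it needs the devtools package plus a build toolchain such as Rtools on Windows):

# install.packages("devtools")  # if not already installed
devtools::install_github("Rdatatable/data.table")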
I'm grateful for you taking a look. I cannot try the dev version of data.table at this time. You seem to be suggesting the memory is not truly leaked, because it can get garbage collected. What is strange, then, is that in RStudio, after the memory has ballooned to 8GB from reading a 30MB file, a subsequent run pushes the demand yet higher, into the teens of GB; if it were only waiting on gc, we should be OK. And if it's not a gc issue, how else would the memory be reclaimed? Thanks again!
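One way to check whether the ballooned memory is merely awaiting collection (a sketch, with a hypothetical file name): force a collection and compare the OS view before and after.

library(data.table)
DT = fread("small_numbers.csv")  # memory balloons here
gc(verbose = TRUE)               # report cells in use before and after collection
rm(DT); gc()                     # drop the table, collect again, re-check Task Manager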
Talking to Pasha, we've realized where the 'leak' arises. Each and every warning is slightly different because it includes the string value in this part of the warning message: "The first was string input '...'". With millions of out-of-range fields, that means millions of distinct warning strings. dev is much better and the problem has gone away. Just a test then needs to be added to the test suite, and this is closed.
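A tiny illustration (values made up, message abbreviated): every out-of-range field embeds its own input string in its warning, so no two warnings are equal and nothing can be deduplicated.

vals = c("2.32741124362878E-309", "1.11022302462516E-310")
msgs = sprintf("strtod() returned ERANGE for string input '%s'", vals)
length(unique(msgs))  # 2 -- one distinct warning string per offending field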
Closed by #2451 |
I have a 300m file with some small numbers in scientific notation (E-300), and after using fread it used more than 30 gig of my memory. After manually using colClasses to specify the column to be character, the problem is gone and the memory usage is normal. I suspect there is a memory leak somewhere when handling small numbers.