java.lang.NegativeArraySizeException #222
Can you narrow it down to a particular WARC that's causing the issue?
Haven't been able to. If you look at the error trace, I've tested the last batch of WARCs that the script ingested and they all work. i.e. tested on:
So either our error logging is fishy, or something's happening in the combination of data? (Have I missed a WARC here, @ruebot?)
Just had this happen again on a collection that we had successfully run URL extraction on, but crashed during link extraction (twice).
I just ran the same script on Rho (/mnt/vol1/data_sets/walk-test/*.gz) and it worked.
Aye, it works on some collections and not on others. I guess it must be related to funky data, although there's a ton of it within these Archive-It collections. @ruebot – maybe we should move a funky collection over to
Sure. Tell me what collection to copy over, and I'll make it happen.
Why don't we move it over and run:

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val university_of_alberta_websites =
  RecordLoader.loadArchives("/data/university_of_alberta_websites/*.gz", sc)
    .keepValidPages()
    .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl)))
    .countItems()
    .saveAsTextFile("/data/derivatives/urls/university_of_alberta_websites")

and see if it blows up?
rsyncing over now.
Forgot to say it was done. Test directory is
👍 @ruebot. Am running this on rho. We'll see if it's a dataset problem or a WALK problem. Stay tuned!

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val university_of_alberta_websites =
  RecordLoader.loadArchives("/mnt/vol1/data_sets/TEST/*.gz", sc)
    .keepValidPages()
    .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl)))
    .countItems()
    .saveAsTextFile("/mnt/vol1/derivative_data/walk/university_of_alberta_websites")
Curses. Failed again with this error. Different this time. At least we know it's not related to the system, but connected to the files. I guess the next step is to try error logging, isolating a WARC or something. 😦
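A minimal sketch of that isolation step (not warcbase code — `job` is a placeholder standing in for the real per-file Spark run): process each WARC individually, logging its name first and catching failures, so the offending file identifies itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class IsolateBadWarc {
    // Run the job on each file separately; collect the names that fail.
    static List<String> findFailures(List<String> warcs, Consumer<String> job) {
        List<String> failed = new ArrayList<>();
        for (String warc : warcs) {
            // The last name printed before a hard crash is the culprit.
            System.out.println("processing " + warc);
            try {
                job.accept(warc);
            } catch (RuntimeException e) {
                failed.add(warc);
                System.out.println("FAILED on " + warc + ": " + e.getMessage());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Toy demo: pretend the second file blows up.
        List<String> demo = Arrays.asList("ARCHIVEIT-1.warc.gz", "ARCHIVEIT-2.warc.gz");
        List<String> bad = findFailures(demo, w -> {
            if (w.endsWith("2.warc.gz"))
                throw new RuntimeException("NegativeArraySizeException stand-in");
        });
        System.out.println("bad files: " + bad);
    }
}
```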
I note that this error is thrown here. The code is minting a byte array, which will choke on large (>2GB) payloads.

Firstly, somewhere upstream you are casting

But that's not really the point, because arrays in Java are limited to 2GB anyway. If you are going to read into a byte array you'll need to truncate the payload (ensuring
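To make the failure mode concrete (a sketch, not warcbase's actual code): a payload length is a long, and casting a value over Integer.MAX_VALUE to int wraps negative, so `new byte[n]` throws `NegativeArraySizeException`. Clamping before the cast avoids the crash; `MAX_PAYLOAD` here is an assumed cap for illustration.

```java
public class PayloadCast {
    // Assumed cap for illustration; a real reader would pick its own limit.
    static final long MAX_PAYLOAD = 64L * 1024 * 1024; // 64 MB

    // Clamp BEFORE the long-to-int cast so a >2GB length can't wrap negative.
    static int safeLength(long declaredLength) {
        return (int) Math.min(declaredLength, MAX_PAYLOAD);
    }

    public static void main(String[] args) {
        long threeGb = 3L * 1024 * 1024 * 1024;   // a 3 GB record
        int naive = (int) threeGb;                // wraps to -1073741824
        System.out.println("naive cast: " + naive);
        try {
            byte[] buf = new byte[naive];         // throws
        } catch (NegativeArraySizeException e) {
            System.out.println("caught: " + e);
        }
        System.out.println("clamped: " + safeLength(threeGb)); // 67108864
    }
}
```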
Ohhhh. That makes sense, because the great majority of the WARCs in the
Seems to be the same as issue #234, which we're encountering at the ArchivesUnleashed hackathon 2.0. Moving the discussion over there.
Closed as moving to #234, and opening up a new ticket on WALK.
We (@ruebot and I) are running a URL extract job with the following script:
On a Compute Canada VM running Ubuntu.
It fails with the following error (tested twice):
Full error trace is available at https://gist.github.com/ruebot/25d505d4e530c3b9430135f6c9f140fe#file-gistfile1-txt.
Any clue what's up?