
java.lang.NegativeArraySizeException #222

Closed
ianmilligan1 opened this issue Apr 26, 2016 · 15 comments

@ianmilligan1
Collaborator

We (@ruebot and I) are running a URL extraction job with the following script:

import org.warcbase.spark.matchbox._ 
import org.warcbase.spark.rdd.RecordRDD._ 

val university_of_alberta_websites = 
  RecordLoader.loadArchives("/data/university_of_alberta_websites/*.gz", sc) 
  .keepValidPages() 
  .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl))) 
  .countItems() 
  .saveAsTextFile("/data/derivatives/urls/university_of_alberta_websites")

This is on a Compute Canada VM running Ubuntu.

It fails with the following error (tested twice):

java.lang.NegativeArraySizeException
    at org.warcbase.data.WarcRecordUtils.copyStream(WarcRecordUtils.java:125)
    at org.warcbase.data.WarcRecordUtils.getContent(WarcRecordUtils.java:98)
    at org.warcbase.spark.archive.io.GenericArchiveRecord.<init>(GenericArchiveRecord.scala:48)
    at org.warcbase.spark.matchbox.RecordLoader$$anonfun$loadArchives$2.apply(RecordLoader.scala:45)
    at org.warcbase.spark.matchbox.RecordLoader$$anonfun$loadArchives$2.apply(RecordLoader.scala:45)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Full error trace is available at https://gist.github.com/ruebot/25d505d4e530c3b9430135f6c9f140fe#file-gistfile1-txt.

Any clue what's up?

@lintool
Owner

lintool commented Apr 27, 2016

Can you narrow it down to a particular WARC that's causing the issue?

@ianmilligan1
Collaborator Author

Haven't been able to. If you look at the error trace, I've tested the last batch of WARCs that the script ingested and they all work.

i.e. tested on:

ARCHIVEIT-1830-NONE-EWVEGS-20120301170834-00230-crawling211.us.archive.org-6682.warc.gz       
ARCHIVEIT-1830-MONTHLY-PBLTCT-20121012203013-00001-wbgrp-crawl063.us.archive.org-6683.warc.gz 
ARCHIVEIT-1830-SEMIANNUAL-JOB166244-20150727112437633-00015.warc.gz                           
ARCHIVEIT-1830-NONE-TEZIEC-20111016191300-00002-crawling200.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-NONE-FWPGCP-20111002023208-00154-crawling209.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-MONTHLY-OZBJIK-20120612205311-00002-crawling200.us.archive.org-6681.warc.gz    
ARCHIVEIT-1830-NONE-UHNVDX-20110930193117-00255-crawling202.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-MONTHLY-CSRYZP-20120815092532-00017-crawling208.us.archive.org-6681.warc.gz    
ARCHIVEIT-1830-MONTHLY-QRDYRH-20120312205931-00006-crawling113.us.archive.org-6681.warc.gz    
ARCHIVEIT-1830-NONE-NHHENM-20120529210105-00011-crawling212.us.archive.org-6680.warc.gz       
ARCHIVEIT-1830-NONE-UHNVDX-20110930061048-00095-crawling202.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-NONE-FWPGCP-20111001190411-00049-crawling209.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-NONE-FWPGCP-20111002110121-00311-crawling209.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-SEMIANNUAL-JOB166244-20150727082037153-00012.warc.gz                           
ARCHIVEIT-1830-NONE-5176-20140601225449881-00000-wbgrp-crawl052.us.archive.org-6442.warc.gz   
ARCHIVEIT-1830-NONE-FWPGCP-20111002045116-00207-crawling209.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-QUARTERLY-MWZFCS-20120629155541-00077-crawling212.us.archive.org-6682.warc.gz  
ARCHIVEIT-1830-NONE-FRYBJH-20111008205513-00038-crawling208.us.archive.org-6683.warc.gz       
ARCHIVEIT-1830-NONE-FWPGCP-20111001145505-00017-crawling209.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-NONE-XCUJHA-20111206043053-00000-crawling206.us.archive.org-6683.warc.gz       
ARCHIVEIT-1830-NONE-UHNVDX-20110930101326-00147-crawling202.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-NONE-EWVEGS-20120301152756-00203-crawling211.us.archive.org-6682.warc.gz       
ARCHIVEIT-1830-QUARTERLY-MWZFCS-20120628023343-00007-crawling212.us.archive.org-6682.warc.gz  
ARCHIVEIT-1830-NONE-CHONHQ-20111006151942-00058-crawling205.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-NONE-2540-20140603145904065-00000-wbgrp-crawl104.us.archive.org-6445.warc.gz   
ARCHIVEIT-1830-NONE-YDUKWP-20111130205519-00016-crawling203.us.archive.org-6680.warc.gz       
ARCHIVEIT-1830-NONE-UHNVDX-20110930090309-00132-crawling202.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-QUARTERLY-MWZFCS-20120629222444-00101-crawling212.us.archive.org-6682.warc.gz  

So either our error logging is fishy, or something's happening in the combination of data?

(have I missed a WARC here, @ruebot?)

@ianmilligan1
Collaborator Author

Just had this happen again on a collection we had successfully run URL extraction on; the link extraction job crashed (twice).

[Stage 0:====================>                               (1048 + 16) / 2673]INFO  WacGenericInputFormat - Loading file:/data/idle_no_more/ARCHIVEIT-3490-DAILY-RGLLBX-20130130061306-00006-wbgrp-crawl054.us.archive.org-6680.warc.gz
[Stage 0:====================>                               (1049 + 16) / 2673]ERROR Executor - Exception in task 1036.0 in stage 0.0 (TID 1036)
java.lang.NullPointerException
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:60)
        at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
INFO  WacGenericInputFormat - Loading file:/data/idle_no_more/ARCHIVEIT-3490-NONE-OUDOSH-20130201045442-00313-wbgrp-crawl057.us.archive.org-6682.warc.gz
WARN  TaskSetManager - Lost task 1036.0 in stage 0.0 (TID 1036, localhost): java.lang.NullPointerException
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:60)
        at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

ERROR TaskSetManager - Task 1036 in stage 0.0 failed 1 times; aborting job
WARN  TaskSetManager - Lost task 1063.0 in stage 0.0 (TID 1063, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1048.0 in stage 0.0 (TID 1048, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1059.0 in stage 0.0 (TID 1059, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1054.0 in stage 0.0 (TID 1054, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1062.0 in stage 0.0 (TID 1062, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1039.0 in stage 0.0 (TID 1039, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 997.0 in stage 0.0 (TID 997, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1058.0 in stage 0.0 (TID 1058, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1061.0 in stage 0.0 (TID 1061, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1065.0 in stage 0.0 (TID 1065, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1057.0 in stage 0.0 (TID 1057, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1056.0 in stage 0.0 (TID 1056, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1053.0 in stage 0.0 (TID 1053, localhost): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1036 in stage 0.0 failed 1 times, most recent failure: Lost task 1036.0 in stage 0.0 (TID 1036, localhost): java.lang.NullPointerException
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:60)
        at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
        at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:264)
        at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:126)
        at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
        at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
        at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:547)
        at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:548)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.RDD.sortBy(RDD.scala:545)
        at org.warcbase.spark.rdd.RecordRDD$CountableRDD.countItems(RecordRDD.scala:40)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:80)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:82)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:84)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:86)
        at $iwC$$iwC$$iwC.<init>(<console>:88)
        at $iwC$$iwC.<init>(<console>:90)
        at $iwC.<init>(<console>:92)
        at <init>(<console>:94)
        at .<init>(<console>:98)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$pasteCommand(SparkILoop.scala:825)
        at org.apache.spark.repl.SparkILoop$$anonfun$standardCommands$8.apply(SparkILoop.scala:345)
        at org.apache.spark.repl.SparkILoop$$anonfun$standardCommands$8.apply(SparkILoop.scala:345)
        at scala.tools.nsc.interpreter.LoopCommands$LoopCommand$$anonfun$nullary$1.apply(LoopCommands.scala:65)
        at scala.tools.nsc.interpreter.LoopCommands$LoopCommand$$anonfun$nullary$1.apply(LoopCommands.scala:65)
        at scala.tools.nsc.interpreter.LoopCommands$NullaryCmd.apply(LoopCommands.scala:76)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:809)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:60)
        at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)


scala> WARN  TaskSetManager - Lost task 1064.0 in stage 0.0 (TID 1064, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1044.0 in stage 0.0 (TID 1044, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1045.0 in stage 0.0 (TID 1045, localhost): TaskKilled (killed intentionally)

@jrwiebe
Collaborator

jrwiebe commented May 14, 2016

I just ran the same script on Rho (/mnt/vol1/data_sets/walk-test/*.gz) and it worked.

@ianmilligan1
Collaborator Author

Aye, it works on some collections and not on others. I guess it must be related to funky data, although there's a ton of it within these Archive-It collections.

@ruebot – maybe we should move a funky collection over to rho, so we can make sure it's not somehow the setup on WALK.

@ruebot
Contributor

ruebot commented May 17, 2016

Sure. Tell me what collection to copy over, and I'll make it happen.

@ianmilligan1
Collaborator Author

Why don't we move university_of_alberta_websites over, run a variation of

import org.warcbase.spark.matchbox._ 
import org.warcbase.spark.rdd.RecordRDD._ 

val university_of_alberta_websites = 
  RecordLoader.loadArchives("/data/university_of_alberta_websites/*.gz", sc) 
  .keepValidPages() 
  .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl))) 
  .countItems() 
  .saveAsTextFile("/data/derivatives/urls/university_of_alberta_websites")

and see if it blows up?

@ruebot
Contributor

ruebot commented May 17, 2016

rsyncing over now.

@ruebot
Contributor

ruebot commented May 18, 2016

Forgot to say it's done. The test directory is /mnt/vol1/data_sets/TEST on rho.

@ianmilligan1
Collaborator Author

👍 @ruebot.

Am running this on rho. We'll see if it's a dataset problem or a WALK problem. Stay tuned!

import org.warcbase.spark.matchbox._ 
import org.warcbase.spark.rdd.RecordRDD._ 

val university_of_alberta_websites = 
  RecordLoader.loadArchives("/mnt/vol1/data_sets/TEST/*.gz", sc) 
  .keepValidPages() 
  .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl))) 
  .countItems() 
  .saveAsTextFile("/mnt/vol1/derivative_data/walk/university_of_alberta_websites")

@ianmilligan1
Collaborator Author

Curses. Failed again with this error. Different this time.

At least we know it's not related to the system, but connected to the files. I guess the next step is to try error logging and isolating a WARC or something. 😦

@anjackson
Contributor

anjackson commented Jun 14, 2016

I note that this error is thrown here: byte [] scratchbuffer = new byte[recordLength];

The code is minting a byte array, which will choke on large (>2GB) payloads. First, somewhere upstream you are casting recordLength to an int, and because the value can be a long, this will sometimes set the sign bit, creating a negative value.

But that's not really the point because arrays in Java are limited to 2GB anyway. If you are going to read into a byte array you'll need to truncate the payload (ensuring byte[].length <= Integer.MAX_VALUE). FWIW, in webarchive-discovery I used a streaming interface rather than an in-memory array, which is trickier but significantly reduces memory pressure.
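For example, a minimal sketch of the overflow (my own illustration, not the actual warcbase code): any record length just over the 2 GB array limit wraps to a negative value when narrowed to an Int, which is exactly the NegativeArraySizeException in WarcRecordUtils.copyStream above. Runnable in the spark-shell:

// Minimal sketch of the narrowing bug; not the warcbase code itself.
val recordLength: Long = 3L * 1024 * 1024 * 1024        // a payload just over the 2 GB array limit
val narrowed: Int = recordLength.toInt                   // sign bit set: -1073741824
println(s"$recordLength bytes narrowed to Int = $narrowed")
// new Array[Byte](narrowed) would throw java.lang.NegativeArraySizeException here.

// One possible guard (hypothetical, not a warcbase API): cap the buffer size,
// accepting that anything past the cap is truncated rather than read.
val safeLength: Int = math.min(recordLength, Int.MaxValue.toLong - 8).toInt
println(s"capped buffer size = $safeLength bytes")

The cap avoids the crash, but truncation loses data, which is why the streaming approach is the better long-term fix.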

@ruebot
Contributor

ruebot commented Jun 14, 2016

Ohhhh. That makes sense, because the great majority of the warcs in the dataset are around the ~1GB default. But there's a scattering of ~20GB warcs.
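For scale, a quick back-of-envelope check (my own illustration, not from the thread): a Java byte array tops out at Integer.MAX_VALUE elements, roughly 2.1 GB, so if any single record in those ~20GB files approaches that size it cannot be buffered in one array at all, and lengths between 2 GB and 4 GB narrow to a negative Int.

// Rough sanity check in the spark-shell.
val arrayLimitBytes: Long = Int.MaxValue                 // 2147483647 bytes, about 2 GiB
val bigWarc: Long = 20L * 1024 * 1024 * 1024             // ~20 GiB file
println(bigWarc > arrayLimitBytes)                       // true: nowhere near fitting in one array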

@lintool
Owner

lintool commented Jun 15, 2016

Seems to be the same as issue #234, which we're encountering at the ArchivesUnleashed hackathon 2.0. Moving the discussion over there.

@ianmilligan1
Collaborator Author

Closing as we're moving to #234 and opening a new ticket on WALK.
