This repository has been archived by the owner on Jun 28, 2022. It is now read-only.

stream2es indexing of local wikipedia dump fails #50

Open
stucker0530 opened this issue Sep 11, 2015 · 3 comments

@stucker0530

I am getting the following error when attempting to ingest a local dump of the latest Wikipedia. I am running ES 1.7.1 and stream2es 20150720170522978252e.

[stream2es]$ ./stream2es wiki --max-docs 5 --source ./enwiki-latest-pages-articles1.xml.bz2
java.io.IOException: unexpected end of stream
at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.bsGetBit(CBZip2InputStream.java:371)
at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.recvDecodingTables(CBZip2InputStream.java:476)
at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:550)
at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:287)
at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.init(CBZip2InputStream.java:246)
at org.elasticsearch.river.wikipedia.bzip2.CBZip2InputStream.<init>(CBZip2InputStream.java:148)
at org.elasticsearch.river.wikipedia.support.WikiXMLParser.getInputSource(WikiXMLParser.java:80)
at org.elasticsearch.river.wikipedia.support.WikiXMLSAXParser.parse(WikiXMLSAXParser.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
at clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:313)
at stream2es.stream.wiki$fn__6612$fn__6613.invoke(wiki.clj:45)
at stream2es.main$stream_BANG_.invoke(main.clj:241)
at stream2es.main$main.invoke(main.clj:329)
at stream2es.main$_main.doInvoke(main.clj:335)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at stream2es.main.main(Unknown Source)
2015-09-11T11:13:32.937-0600 ERROR unexpected exception: java.io.IOException: unexpected end of stream
2015-09-11T11:13:33.056-0600 INFO 00:00.208 0.0d/s 0.0K/s (0.0mb) indexed 0 streamed 0 errors 0
[stream2es]$
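
The trace above ends inside the bzip2 decoder, so one thing worth ruling out is a truncated or corrupt download. The following is a minimal sketch, not part of stream2es: it uses Python's standard `bz2` module to read the whole archive and report whether decompression reaches the end cleanly. The function name `check_bz2` and the chunk size are made up for illustration. A clean result here would not prove that the decoder stream2es uses can read the file, only that the archive itself is complete.

```python
import bz2
import sys


def check_bz2(path, chunk_size=1 << 20):
    """Decompress the whole file, discarding the output.

    Returns (ok, decompressed_bytes, error_message). A truncated archive
    raises EOFError in Python's bz2 module; corrupt data raises OSError.
    """
    total = 0
    try:
        with bz2.open(path, "rb") as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (EOFError, OSError) as exc:
        return False, total, str(exc)
    return True, total, None


if __name__ == "__main__" and len(sys.argv) == 2:
    ok, total, err = check_bz2(sys.argv[1])
    if ok:
        print("archive OK, %d bytes decompressed" % total)
    else:
        print("archive failed after %d bytes: %s" % (total, err))
```

Saved as a script (the file name is up to you), it would be run with the dump path as its only argument, for example `./enwiki-latest-pages-articles1.xml.bz2`. The stock `bzip2 -t <file>` command performs a similar integrity test from the shell.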

@drewr
Contributor

drewr commented Sep 11, 2015

Which dump did you download? You'd want this one:

@Jbrunn

Jbrunn commented Sep 14, 2015

I'm having the same issue (without the --max-docs option). I've tried both of the dumps that you suggested. I'm on OS X, if that makes any difference. I have also turned sleep off to rule that out as a possible cause. The bz2 dump you suggested gave me the highest number of documents successfully processed so far, at 534,792. Any guidance would be appreciated.

@funnydevnull

I'm using the dump enwiki-20140707-pages-articles.xml.bz2 and so far it's working (but only 62k articles in so far).
