Home
- Before building nutchpy from source, make sure you have the following set up:
  - Apache Maven (Installation instructions)
  - py4j (Installation instructions)
- Get the source by cloning the repository using the following command.
git clone https://github.com/ContinuumIO/nutchpy.git
- Then run the following commands to run the setup.py script. (Make sure you have superuser permission when running the setup script.)
cd nutchpy
sudo python setup.py install
The nutchpy setup comes with two simple examples by default. Its basic usage is as follows:
import nutchpy
node_path = "<FULL-PATH-TO-CRAWLED-DATA>/data"
seq_reader = nutchpy.sequence_reader
print(seq_reader.head(n, node_path))  # Prints the first n rows of the file (n is an integer you choose)
print(seq_reader.slice(start, stop, node_path))  # Prints the rows between start and stop (both integers)
data = seq_reader.read(node_path)
print(data) # Prints the whole file content
- node_path: generally the path to the crawled data file. On a default Nutch installation, it typically looks something like nutch/runtime/local/crawl/crawldb/current/part-00000/data
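A crawl directory may hold more than one part directory next to part-00000. The sketch below lists every data file under crawldb/current so that each one can be passed as node_path. The helper name find_data_files and the assumption that sibling directories follow the part-* naming pattern are illustrative and not part of nutchpy. The demo builds a throwaway directory that only mimics the layout.

```python
import glob
import os
import tempfile


def find_data_files(crawl_dir):
    """Return the sorted paths of all crawldb data files under crawl_dir.

    Assumption: the layout is <crawl_dir>/crawldb/current/part-*/data,
    matching the example path shown above.
    """
    pattern = os.path.join(crawl_dir, "crawldb", "current", "part-*", "data")
    return sorted(glob.glob(pattern))


# Demo on a temporary directory that imitates the layout (no real crawl data).
demo_root = tempfile.mkdtemp()
for part in ("part-00000", "part-00001"):
    part_dir = os.path.join(demo_root, "crawldb", "current", part)
    os.makedirs(part_dir)
    open(os.path.join(part_dir, "data"), "w").close()

found = find_data_files(demo_root)
for data_file in found:
    print(data_file)
```

With a real installation, the argument would be the crawl directory itself, for example the nutch/runtime/local/crawl directory from the path above.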
To process the entire data set and go through the URLs, read the content with the read method. The content is returned in the form of a list. The sample below runs through all the URLs.
import nutchpy
path = 'path-to-nutch/nutch/runtime/local/crawl/crawldb/current/part-00000/data'
data = nutchpy.sequence_reader.read(path)
for list_item in data:
    print(list_item[0])  # Prints the URL
    print(list_item[1])  # Prints details about the URL
One row of sample output from the above code looks as follows:
https://www.abc.com
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Sep 26 23:52:36 PDT 2015
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0
Signature: null
Metadata:
_repr_=https://www.abc.com
_pst_=moved(12), lastModified=0: https://www.abc.com
Content-Type=text/html
_rs_=115
Using the above sample program, one can get all the details of the crawl database: the status of each URL (whether it has been fetched or not), the reason a fetch failed, and the MIME types of the fetched files.
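As a rough illustration of extracting the status, the sketch below counts how many URLs are in each state by matching the "Status:" line of each record's details. It assumes that calling str() on the second element of a row yields text like the sample output above. The function name count_statuses and the sample rows are hypothetical stand-ins for the list returned by the reader.

```python
import re
from collections import Counter

# Matches lines such as "Status: 1 (db_unfetched)" in a record's details text.
STATUS_RE = re.compile(r"^\s*Status:\s*(\d+)\s*\((\w+)\)", re.MULTILINE)


def count_statuses(records):
    """Count crawl statuses over an iterable of (url, details) rows."""
    counts = Counter()
    for record in records:
        # Assumption: str() of the details gives the text shown in the sample output.
        details = str(record[1])
        match = STATUS_RE.search(details)
        counts[match.group(2) if match else "unknown"] += 1
    return counts


# Hypothetical rows standing in for nutchpy.sequence_reader.read(path).
sample = [
    ("https://www.abc.com", "Version: 7\nStatus: 1 (db_unfetched)\nScore: 0.0"),
    ("https://www.abc.com/a", "Version: 7\nStatus: 2 (db_fetched)\nScore: 0.0"),
    ("https://www.abc.com/b", "Version: 7\nStatus: 1 (db_unfetched)\nScore: 0.0"),
]
print(count_statuses(sample))
```

The same pattern could be adapted to other fields in the details, such as Content-Type, by changing the regular expression.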
The program may sometimes fail with the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.continuumio.seqreaderapp.SequenceReader.slice.
: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3181)
at java.util.ArrayList.grow(ArrayList.java:261)
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
at java.util.ArrayList.add(ArrayList.java:458)
at com.continuumio.seqreaderapp.SequenceReader.slice(SequenceReader.java:143)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
This happens because the data set is too large to fit in the Java heap when it is loaded in one go through py4j. To overcome this issue, use the slice and head methods to read the data in smaller pieces. A sample program that runs through the entire data set using slice is given below. It parses through the data 1000 items at a time.
import nutchpy
# Parses through one chunk of data and does the processing on it
def parseData(data):
    # Return False if there is no new data to parse through
    if not data:
        return False
    # (Process the items in data here)
    # Return True if there may be more data
    return True
i = 0
path = 'path-to-nutch/nutch/runtime/local/crawl/crawldb/current/part-00000/data'
dataPresent = True
# Parse through the data 1000 items at a time; otherwise the system won't be able to handle such a large data set
while dataPresent:
    data = nutchpy.sequence_reader.slice(i, i + 1000, path)
    dataPresent = parseData(data)
    i = i + 1000
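The same chunked loop can also be wrapped in a generator, so that the calling code iterates over rows one at a time without managing offsets. This is a minimal sketch, not part of nutchpy: the name iter_records is hypothetical, and it relies on the same assumption as the program above, namely that slice returns an empty result once the offsets pass the end of the file. A fake slice function stands in for the reader so the example runs without crawl data.

```python
def iter_records(slice_fn, path, chunk_size=1000):
    """Yield rows one by one, fetching chunk_size rows per slice call.

    slice_fn must accept (start, stop, path), like sequence_reader.slice.
    Assumption: it returns an empty result past the end of the data.
    """
    start = 0
    while True:
        chunk = slice_fn(start, start + chunk_size, path)
        if not chunk:
            return
        for record in chunk:
            yield record
        start += chunk_size


# Hypothetical stand-in for nutchpy.sequence_reader.slice, backed by a list.
FAKE_ROWS = [("url-%d" % k, "details") for k in range(2500)]


def fake_slice(start, stop, path):
    return FAKE_ROWS[start:stop]


urls = [record[0] for record in iter_records(fake_slice, "unused-path")]
print(len(urls))  # 2500
```

With the real reader, the call would be iter_records(nutchpy.sequence_reader.slice, path). Only one chunk is held in Python at a time, which keeps memory use bounded by chunk_size.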