
Install Hadoop

After compiling the product following the instructions in Compile, you need to create a Hadoop cluster to process a large number of ARC files. This page is based on the tutorials http://wiki.apache.org/hadoop/QuickStart, http://hadoop.apache.org/common/docs/current/cluster_setup.html and http://wiki.apache.org/hadoop/GettingStartedWithHadoop. The Hadoop cluster is only used by the PWA to create indexes for the collections; for an active, production Hadoop cluster consult the project page.

This tutorial assumes basic knowledge of Hadoop and some knowledge of the Apache Tomcat server.

First, grab the Hadoop files generated by the Compile procedure and copy them to all your cluster servers.

Hadoop configuration files

As the tutorial pages above explain, Hadoop can run in a distributed way. This is the best way to process a large number of ARC files.

As a base configuration, the Portuguese Web Archive assumes the following directory structure, and this tutorial is based on it:

/opt/searcher
             /apache-tomcat-5.5.25 -> tomcat server
             /arcproxy -> files for the arcproxy database
             /collections -> collections served by this hadoop search server
             /dictionaries -> dictionaries for the spellchecker application
             /hadoop -> hadoop for processing the arc files
             /logs -> Logs from the hadoop search servers
             /run -> directory to hold the pid files
             /scripts -> scripts for starting applications
/data/outputs -> directory for the indexes created from the arc files
/hadooptemp -> directory for all the Hadoop files used for indexing

Starting up a cluster:

  • You will need to define a server as the Hadoop master; this server will run the namenode and the jobtracker (for more information see the Hadoop wiki page and the project page).
  • Copy the Hadoop folder after Compile to every server of the cluster.
  • Ensure that the Hadoop package is accessible from the same path on all nodes that are to be included in the cluster. If you have separated configuration from the install then ensure that the config directory is also accessible the same way.
  • Populate the slaves file with the nodes to be included in the cluster. One node per line.
  • Format the Namenode
  • Configure an environment variable for the Hadoop home: % export HADOOP_HOME=/opt/searcher/hadoop
  • Run the command % ${HADOOP_HOME}/bin/start-dfs.sh on the node you want the Namenode to run on. This will bring up HDFS with the Namenode running on the machine you ran the command on and Datanodes on the machines listed in the slaves file mentioned above.
  • Run the command % ${HADOOP_HOME}/bin/start-mapred.sh on the machine you plan to run the Jobtracker on. This will bring up the Map/Reduce cluster with the Jobtracker running on the machine you ran the command on and Tasktrackers running on the machines listed in the slaves file. (A consolidated sketch of these commands follows this list.)
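
To make the sequence concrete, here is a minimal sketch of the startup commands, assuming the directory layout above and that the Namenode and Jobtracker both run on the master (adjust hosts and paths to your cluster):

# Run on the master node.
export HADOOP_HOME=/opt/searcher/hadoop

# Bring up HDFS: Namenode on this machine, Datanodes on the hosts in conf/slaves.
${HADOOP_HOME}/bin/start-dfs.sh

# Bring up Map/Reduce: Jobtracker on this machine, Tasktrackers on the hosts in conf/slaves.
${HADOOP_HOME}/bin/start-mapred.sh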

Once you know which server will act as master, define it in the configuration files. Set up the masters in the ${HADOOP_HOME}/conf/masters file:

master.example.com

Set up the slaves in the ${HADOOP_HOME}/conf/slaves file:

server1.example.com
server2.example.com

For your cluster you need to edit the file ${HADOOP_HOME}/conf/hadoop-site.xml. You should change the value of fs.default.name to the namenode server and the value of mapred.job.tracker to the jobtracker server. You also need to adjust the directory paths for your system: dfs.name.dir, dfs.data.dir, mapred.system.dir, mapred.local.dir and hadoop.tmp.dir. Every option configurable in this file is documented in ${HADOOP_HOME}/conf/hadoop-default.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
   <name>fs.default.name</name>
   <value>master.example.com:9000</value>
 </property>
 <property>
   <name>mapred.job.tracker</name>
   <value>master.example.com:9001</value>
 </property>
 <property>
   <name>dfs.name.dir</name>
   <value>/hadooptemp/dfs/namenode</value>
 </property>
 <property>
   <name>dfs.data.dir</name>
   <value>/hadooptemp/dfs/datanode</value>
 </property>
 <property>
   <name>mapred.system.dir</name>
   <value>/hadooptemp/mapred/system</value>
 </property>
 <property>
   <name>mapred.local.dir</name>
   <value>/hadooptemp/mapred/local</value>
 </property>
 <property>
   <name>mapred.child.java.opts</name>
   <value>-Xmx6000m</value>
 </property>
 <property>
   <name>hadoop.tmp.dir</name>
   <value>/hadooptemp/tmp/hadoop-${user.name}</value>
 </property>
</configuration>

Edit the JAVA_HOME variable in ${HADOOP_HOME}/conf/hadoop-env.sh:

...
export JAVA_HOME=/usr/java/default
...

Edit the files on the master and copy them to all the other machines.
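
One possible way to push the edited configuration from the master to every node is a small loop over the slaves file. This is only a sketch, assuming passwordless SSH (configured in the next step) and the same /opt/searcher/hadoop path on all nodes:

# Copy the configuration directory from the master to every slave.
for node in $(cat ${HADOOP_HOME}/conf/slaves); do
    rsync -av ${HADOOP_HOME}/conf/ ${node}:${HADOOP_HOME}/conf/
done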

You also need to share the SSH public key from the master server with the other servers:

  1. generate ssh key:
 ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
  1. remove the password request for localhost:
 cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  1. test (if no password is requested then it is OK):
 ssh localhost
  1. Copy the key to the other machines; repeat for every machine in the cluster.
ssh-copy-id -i ~/.ssh/id_dsa.pub user@server1.example.com
  1. Create the directories configured in hadoop-site.xml (a sketch for creating them on every node follows the note below):
mkdir -p /hadooptemp/dfs/namenode/
mkdir -p /hadooptemp/dfs/datanode/
mkdir -p /hadooptemp/mapred/system/
mkdir -p /hadooptemp/mapred/local/
  1. Format the HDFS (attention: the Y must be capitalized when confirming):
${HADOOP_HOME}/bin/hadoop namenode -format

Note: if the format is aborted, remove the directories (the ones from hadoop-site.xml) and format again:

rm -rf /hadooptemp/dfs/namenode/*
rm -rf /hadooptemp/dfs/datanode/*
rm -rf /hadooptemp/mapred/system/*
rm -rf /hadooptemp/mapred/local/*
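
The directories configured in hadoop-site.xml must exist on every node of the cluster, not only on the master. A minimal sketch for creating them everywhere, assuming passwordless SSH is already working and the slaves file is up to date:

# Create the hadoop-site.xml directories on the master and on every slave.
for node in localhost $(cat ${HADOOP_HOME}/conf/slaves); do
    ssh ${node} "mkdir -p /hadooptemp/dfs/namenode /hadooptemp/dfs/datanode /hadooptemp/mapred/system /hadooptemp/mapred/local"
done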

Using Hadoop

  1. Start the Hadoop daemons on all machines of the cluster (run from the master):
${HADOOP_HOME}/bin/start-all.sh
  1. Check that the services started correctly (a quick check sketch follows this list):
NameNode http://master.example.com:50070/
JobTracker http://master.example.com:50030/
  1. Stop the Hadoop daemons:
${HADOOP_HOME}/bin/stop-all.sh
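
As a quick sanity check, the two web interfaces can also be probed from the command line. This sketch assumes the master is master.example.com and the default web UI ports shown above:

# Check that the NameNode and JobTracker web interfaces answer.
curl -sf -o /dev/null http://master.example.com:50070/ && echo "NameNode UI up" || echo "NameNode UI down"
curl -sf -o /dev/null http://master.example.com:50030/ && echo "JobTracker UI up" || echo "JobTracker UI down"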

Operating system changes

Change the limits for the number of open files

Some operating system parameters need to be changed on the servers. Because Hadoop opens a large number of files, add the following two lines to /etc/security/limits.conf:

...
* hard nofile 65000
* soft nofile 30000
...
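
The new limits only apply to sessions started after the change. A quick way to verify them after logging in again:

# Verify the open-file limits for the current session.
ulimit -Hn   # hard limit, should print 65000
ulimit -Sn   # soft limit, should print 30000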

Set the default charset on the machines

Change the LANG variable in the file /etc/sysconfig/i18n:

LANG="pt_PT.ISO-8859-1"
...

Install search system

Requirements

  • Install apache tomcat 5.5.25 http://archive.apache.org/dist/tomcat/tomcat-5/v5.5.25/bin/apache-tomcat-5.5.25.tar.gz
  • Get the compilation result of Hadoop 0.14.4 (instructions in Compile)
  • For simplification purposes, the files are referred to as:
    • nutchwax.jar: pwa-technologies/PwaArchive-access/projects/nutchwax/nutchwax-job/target/nutchwax-job-0.11.0-SNAPSHOT.jar
    • nutchwax.war: pwa-technologies/PwaArchive-access/projects/nutchwax/nutchwax-webapp/target/nutchwax-webapp-0.11.0-SNAPSHOT.war
    • wayback.war: pwa-technologies/PwaArchive-access/projects/wayback/wayback-webapp/target/wayback-1.2.1.war
    • pwalucene.jar: pwa-technologies/PwaLucene/target/pwalucene-1.0.0-SNAPSHOT.jar

Install tomcat

  • To install and configure Tomcat, this documentation should be followed.
  • Untar the file to /opt/searcher/apache-tomcat-5.5.25: tar -zxf apache-tomcat-5.5.25.tar.gz
  • Configure an environment variable for the Catalina home: % export CATALINA_HOME=/opt/searcher/apache-tomcat-5.5.25 (a consolidated sketch of these steps follows this list).
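
A minimal sketch of these steps, assuming the archive is downloaded into /opt/searcher:

# Download, unpack and start Tomcat 5.5.25.
cd /opt/searcher
wget http://archive.apache.org/dist/tomcat/tomcat-5/v5.5.25/bin/apache-tomcat-5.5.25.tar.gz
tar -zxf apache-tomcat-5.5.25.tar.gz
export CATALINA_HOME=/opt/searcher/apache-tomcat-5.5.25
${CATALINA_HOME}/bin/startup.sh    # stop it later with ${CATALINA_HOME}/bin/shutdown.sh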

Configure nutchwax Web application

Copy the file nutchwax.war to ${CATALINA_HOME}/webapps/

Set the following in ${CATALINA_HOME}/webapps/nutchwax/WEB-INF/classes/hadoop-site.xml:

<name>searcher.dir</name>
<value>/opt/searcher/scripts</value>
...
<name>wax.host</name>
<value>example.com:8080/wayback/wayback</value>

The searcher.dir directory (/opt/searcher/scripts) should contain a file named search-servers.txt that defines where the nutch servers are running:

  example.com 21111
  example.com 21112

The wax.host property should point to the host with the wayback configuration that is open to the world.
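
A minimal sketch for creating the search-servers.txt expected by searcher.dir, using the example hosts and ports above (replace them with your own nutch search servers):

# Create the list of nutch search servers read by the nutchwax webapp.
mkdir -p /opt/searcher/scripts
cat > /opt/searcher/scripts/search-servers.txt <<EOF
example.com 21111
example.com 21112
EOF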

Configure wayback Web application

Copy the file wayback.war to ${CATALINA_HOME}/webapps/

Update the ${CATALINA_HOME}/webapps/wayback/WEB-INF/wayback.xml file:

  • The resourceStore maps an ARC file name to a URL for the ARC itself; normally the ARC URL is served by an HTTP server. The configuration should point to the arcproxy, which knows where all the ARC files are.
  • The resourceIndex is where the search for a URL is performed. It should be configured with the nutchwax search (opensearch).
  • The uriConverter is used to build the replay URL that will be accessed.
...
<property name="resourceStore">
<bean class="org.archive.wayback.resourcestore.Http11ResourceStore">
<property name="urlPrefix" value="http://127.0.0.1:8080/arcproxy/arcproxy/" />
...
<property name="resourceIndex">
<bean class="org.archive.wayback.resourceindex.NutchResourceIndex" init-method="init">
<property name="searchUrlBase" value="http://127.0.0.1:8080/nutchwax/opensearch" />
...
<property name="uriConverter">
<bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
<property name="replayURIPrefix" value="http://MACHINE:8080/wayback/wayback/" />
...

Configure arcproxy Web application

There are two ways of setting up an arcproxy:

1 - Isolated arcproxy

    • Copy the file wayback.war to ${CATALINA_HOME}/webapps/ and rename it to arcproxy (mv wayback.war arcproxy.war)

Replace the ${CATALINA_HOME}/webapps/arcproxy/WEB-INF/wayback.xml file:

  • Change bdbPath, bdbName and logPath to the values you want; the directories have to be writable by the Tomcat application (a sketch for preparing them follows the configuration below).
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>

<!--
    The following 3 beans are required when using the ArcProxy for providing 
    HTTP 1.1 remote access to ARC files distributed across multiple computers 
    or directories.
-->
 
	<bean id="filelocationdb" class="org.archive.wayback.resourcestore.http.FileLocationDB"
		init-method="init">
		<property name="bdbPath" value="/home/wayback/searcher/arcproxy" />
		<property name="bdbName" value="arquivo" />
		<property name="logPath" value="/home/wayback/searcher/arcproxy/tmp_arc-db.log" />
	</bean>

	<bean name="8080:arcproxy" class="org.archive.wayback.resourcestore.http.ArcProxyServlet">
		<property name="locationDB" ref="filelocationdb" />
	</bean>
	<bean name="8080:locationdb" class="org.archive.wayback.resourcestore.http.FileLocationDBServlet">
		<property name="locationDB" ref="filelocationdb" />
	</bean>

</beans>
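
Before starting Tomcat, the bdbPath and logPath locations configured above must exist and be writable. A minimal sketch, assuming Tomcat runs as a user named tomcat (adjust the user and paths to your setup):

# Create the arcproxy database directory and give it to the Tomcat user.
mkdir -p /home/wayback/searcher/arcproxy
chown -R tomcat: /home/wayback/searcher/arcproxy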

2 - Wayback: append the configuration of the beans above into the wayback.xml file.

...
<property name="resourceStore">
<bean class="org.archive.wayback.resourcestore.Http11ResourceStore">
<property name="urlPrefix" value="http://127.0.0.1:8080/wayback/arcproxy/" />
...
<bean id="filelocationdb" class="org.archive.wayback.resourcestore.http.FileLocationDB"
		init-method="init">
		<property name="bdbPath" value="/opt/searcher/arcproxy" />
		<property name="bdbName" value="arquivo" />
		<property name="logPath" value="/opt/searcher/arcproxy/tmp_arc-db.log" />
	</bean>

	<bean name="8080:arcproxy" class="org.archive.wayback.resourcestore.http.ArcProxyServlet">
		<property name="locationDB" ref="filelocationdb" />
	</bean>
	<bean name="8080:locationdb" class="org.archive.wayback.resourcestore.http.FileLocationDBServlet">
		<property name="locationDB" ref="filelocationdb" />
	</bean>

</beans>
...
<property name="resourceIndex">
<bean class="org.archive.wayback.resourceindex.NutchResourceIndex" init-method="init">
<property name="searchUrlBase" value="http://127.0.0.1:8080/nutchwax/opensearch" />
...
<property name="uriConverter">
<bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter">
<property name="replayURIPrefix" value="http://MACHINE:8080/wayback/wayback/" />
...

Configure browser Web application

The browser application makes the ARC files available over HTTP by exposing a repository of ARC files. The application is created by making a folder for it inside the Tomcat webapps directory:

mkdir -p ${CATALINA_HOME}/webapps/browser/WEB-INF
mkdir -p ${CATALINA_HOME}/webapps/browser/files

Then create a web.xml file in the WEB-INF directory:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE web-app
    PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
    "http://java.sun.com/dtd/web-app_2_3.dtd">

<web-app>
    <display-name>browser</display-name>
    <description>File Browsing Application for the Document Share</description>

    <!-- Enable directory listings by overriding the server default web.xml -->
    <!-- definition for the default servlet -->
    <servlet>
        <servlet-name>DefaultServletOverride</servlet-name>
        <servlet-class>org.apache.catalina.servlets.DefaultServlet</servlet-class>
        <init-param>
            <param-name>listings</param-name>
            <param-value>true</param-value>
        </init-param>
        <load-on-startup>1</load-on-startup>
    </servlet>

    <!-- Add a mapping for our new default servlet -->
    <servlet-mapping>
        <servlet-name>DefaultServletOverride</servlet-name>
        <url-pattern>/</url-pattern>
    </servlet-mapping>

</web-app>
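
Once Tomcat picks up the application, ARC files placed under the files directory are listed and served over HTTP. A minimal sketch for testing it, where the source ARC path is only an example:

# Publish an ARC file through the browser application and check the listing.
cp /data/arcs/example.arc.gz ${CATALINA_HOME}/webapps/browser/files/
curl http://localhost:8080/browser/files/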

Start Search System

For this you need nutchwax.jar and the Hadoop folder produced after Compile has been run.

  • Copy the Hadoop folder to the server that will serve the collection. For each collection you will need a new copy of this folder (example: /opt/searcher/collections/test_collection_hadoop).
  • Copy the file nutchwax.jar to /opt/searcher/collections/test_collection_hadoop
  • Get the scripts from https://github.com/arquivo/pwa-technologies/tree/master/scripts to /opt/searcher/scripts
  • Configure an environment variable for the scripts directory: % export SCRIPTS_DIR=/opt/searcher/scripts.
  • Copy the file pwalucene.jar to /opt/searcher/scripts.
  • Configure an environment variable for the collections directory: % export COLLECTIONS_DIR=/opt/searcher/collections.
  • Create the file ${COLLECTIONS_DIR}/search-servers.txt with the following definition, one line per collection:
hostname  port_for_server folder_of_hadoop_server folder_for_outputs

example:

master.example.com 21111 /opt/searcher/collections/test_collection_hadoop /data/outputs
  • Start the servers: % ${SCRIPTS_DIR}/start-slave-searchers.sh.
  • Check /opt/searcher/logs/slave-searcher-21111.log for server startup errors (see the sketch after this list).
  • To stop the server use: % ${SCRIPTS_DIR}/stop-slave-searchers.sh
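
A quick sanity check after starting the slave searchers, assuming the port 21111 used in the example above:

# Look for startup errors and confirm the search server is listening on its port.
grep -i error /opt/searcher/logs/slave-searcher-21111.log
netstat -tlnp | grep 21111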

Install Plone

  1. Download https://github.com/arquivo/pwa-technologies/tree/master/Plone/ploneConf
  2. Download http://sobre.arquivo.pt/~pwa/PWA-TechnologiesSourceCodeDump22-11-2013/Data.fs
  3. tar -xvf Plone-3.0.5-UnifiedInstaller.tar.gz
  4. tar -xvf LinguaPlone-2.0.tar.gz
  5. Install Plone without PWA configurations
Plone-3.0.5-UnifiedInstaller/install.sh standalone
  6. The following files are the PWA visual configurations:
Plone-3.0.5/zinstance/lib/python/plone/app/i18n/locales/browser/selector.py
Plone-3.0.5/zinstance/lib/python/plone/app/i18n/locales/browser/languageselector.pt
Plone-3.0.5/zinstance/lib/python/plone/app/layout/viewlets/personal_bar.pt
Plone-3.0.5/zinstance/Products/LinguaPlone/browser/languageselector.pt
Plone-3.0.5/zinstance/Products/PloneTranslations/i18n/plone-pt.po
  7. Replace the files generated by the Plone installation with the PWA files:
cp ploneConf/selector.py Plone-3.0.5/zinstance/lib/python/plone/app/i18n/locales/browser/
cp ploneConf/languageselector.pt Plone-3.0.5/zinstance/lib/python/plone/app/i18n/locales/browser/
cp ploneConf/personal_bar.pt Plone-3.0.5/zinstance/lib/python/plone/app/layout/viewlets/
cp ploneConf/LinguaPlone/languageselector.pt Plone-3.0.5/zinstance/Products/LinguaPlone/browser/
cp ploneConf/PloneTranslations/* Plone-3.0.5/zinstance/Products/PloneTranslations/i18n/
cp Data.fs Plone-3.0.5/zinstance/var/
  8. Start Plone:
 Plone-3.0.5/zinstance/bin/zopectl start
  9. Test the configuration:

http://servername:8080/arquivo-web