Eversong edited this page Jun 26, 2015 · 1 revision

Introduction

During the development of the PWA IR (information retrieval) system, we faced limitations in search speed, result quality and scalability. To cope with these, we modified the archive-access project (http://archive-access.sourceforge.net/) to support the web archive IR requirements.

The code of Nutchwax, Nutch and Wayback was adapted to these requirements. Several optimizations were added, such as a simpler way of searching document versions, and several bottlenecks were resolved. All user interactions with the system are now logged for search log mining. These data can be used to detect problems and to design more efficient search solutions for users. The following two papers address these issues:

Details

This code is integrated with our Lucene extension, PwaLucene.

This version is being used at http://arquivo.pt.