PwaArchiveAccess
While developing the PWA IR (information retrieval) system, we faced limitations in search speed, result quality and scalability. To address them, we modified the archive-access project (http://archive-access.sourceforge.net/) to support the web archive IR requirements.
The code of Nutchwax, Nutch and Wayback was adapted to the web archive IR requirements. We added several optimizations, such as simplifying the way document versions are searched, and resolved several bottlenecks. All user interactions with the system are now logged for search log mining. These data can be used to detect problems and to envision more efficient search solutions for users. The following two papers address these issues:
- Miguel Costa, Mário J. Silva, Characterizing Search Behavior in Web Archives, Temporal Web Analytics Workshop, 2011.
- Miguel Costa, Mário J. Silva, Understanding the Information Needs of Web Archive Users, 10th International Web Archiving Workshop, 2010.
This code is integrated with our Lucene extension, PwaLucene.
This version is being used at http://arquivo.pt.