Exodatasets & Vespa Indexes #372

docuracy · 2024-09-17T17:41:09Z

An improved indexing system using Vespa, together with the overhaul of the system for asserting links between Places outlined here, might remove the current need for manual reconciliation and accessioning. This approach can eliminate the need for multiple passes through contributed datasets, offering contributors a unified interface to complete all stages of place data processing in a single pass, rather than waiting for dataset-wide match-seeking operations to finish.

This system could present a Cluster of potential matches from exodatasets like Wikidata, GeoNames, and others (see below), as well as from other WHG datasets, and it might suggest relevant citations from LLMs (proposed here). It would allow a dataset contributor to focus on a single place in their dataset and complete all stages of its processing before moving on to another. It would also reduce the friction that arises when dealing with historical places which do not exist in the modern exodatasets, but which may well be present in other contributed datasets.

Vespa

Vespa offers hybrid search capabilities that can seamlessly combine traditional keyword-based search with vector-based search methods. Running in a Docker container, it would allow ranking based on textual, spatial, temporal, linguistic, phonetic, and semantic facets across both global and regional exodatasets and contributed WHG data. Vespa supports real-time updates and provides a broad range of APIs for querying and data management.
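As a rough illustration of what "hybrid" means at query time, the sketch below builds a request body for Vespa's query API that combines a keyword match (`userQuery()`) with an approximate nearest-neighbour match on an embedding. The schema name `place`, the field `toponym_embedding`, and the rank profile `hybrid` are assumptions made up for this example; nothing here has been agreed for WHG.

```python
import json


def build_hybrid_query(text, embedding, hits=10):
    """Build a Vespa query body mixing keyword and vector retrieval.

    `place`, `toponym_embedding` and the `hybrid` rank profile are
    hypothetical names; a real schema would define its own.
    """
    return {
        # Keyword match OR approximate nearest-neighbour match on the embedding.
        "yql": (
            "select * from place where userQuery() or "
            f"({{targetHits:{hits}}}nearestNeighbor(toponym_embedding, q))"
        ),
        "query": text,                # consumed by userQuery()
        "input.query(q)": embedding,  # query tensor referenced as `q` in the YQL
        "ranking": "hybrid",          # hypothetical rank profile name
        "hits": hits,
    }


body = build_hybrid_query("Londinium", [0.1, 0.2, 0.3])
print(json.dumps(body, indent=2))
```

The body would be POSTed to the Vespa container's `/search/` endpoint; how the textual and vector scores are blended would be decided in the rank profile, which is not shown here.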

Normalised Place Records (NPRs)

Each exodataset would require its own transformer function to produce indexable NPRs, with namespaced @ids recording the authority name. The NPRs would include many of the LPF properties, with these additional fields (among others):

Embeddings (for each toponym) for phonetic and semantic representations. These would be generated using pre-trained multilingual G2P and BERT models (which would also provide embeddings for search terms).
Simplified representations of geometries as points and bounding boxes, to be used in containment and proximity queries.
Reduction of timespans to an array of start and end years for each record.
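A minimal sketch of one such transformer, assuming an LPF-like feature with a single GeoJSON geometry and `when.timespans` entries whose `start` and `end` carry an `in`, `earliest` or `latest` year. The function names, the NPR field names, and the `authority:local_id` form of the namespaced id are illustrative assumptions, and embeddings are left out entirely.

```python
def flatten_coords(geometry):
    """Collect every [lon, lat] pair from a GeoJSON geometry, whatever its nesting."""
    def walk(node):
        if isinstance(node[0], (int, float)):
            yield node
        else:
            for child in node:
                yield from walk(child)
    return list(walk(geometry["coordinates"]))


def bbox_and_point(coords):
    """Return (min_lon, min_lat, max_lon, max_lat) and the bounding-box centre."""
    lons = [c[0] for c in coords]
    lats = [c[1] for c in coords]
    bbox = (min(lons), min(lats), max(lons), max(lats))
    # The box centre is a cheap representative point, not a true centroid.
    point = ((bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2)
    return bbox, point


def year_of(bound):
    """Read a signed year from a timespan bound such as {"in": "-0300-01-01"}."""
    iso = bound.get("in") or bound.get("earliest") or bound.get("latest")
    sign = -1 if iso.startswith("-") else 1
    return sign * int(iso.lstrip("-").split("-")[0])


def to_npr(feature, authority):
    """Transform one LPF-like feature into a hypothetical NPR dict."""
    bbox, point = bbox_and_point(flatten_coords(feature["geometry"]))
    timespans = feature.get("when", {}).get("timespans", [])
    return {
        "id": f"{authority}:{feature['@id']}",  # namespaced with the authority name
        "names": feature.get("names", []),
        "bbox": bbox,
        "point": point,
        "years": [[year_of(t["start"]), year_of(t["end"])] for t in timespans],
    }


feature = {
    "@id": "12345",
    "names": [{"toponym": "Abingdon", "lang": "en"}],
    "geometry": {"type": "Polygon", "coordinates": [[
        [-1.5, 51.5], [-1.0, 51.5], [-1.0, 52.0], [-1.5, 52.0], [-1.5, 51.5],
    ]]},
    "when": {"timespans": [{"start": {"in": "0676"}, "end": {"in": "1538"}}]},
}
npr = to_npr(feature, "example")
```

Real LPF records can hold geometry collections and open-ended timespans, which this sketch does not handle.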

Benefits

Features from WHG dataset contributions would also be transformed to NPRs (with an additional raw LPF field for swift retrieval of entire place records and datasets) and stored in Vespa, and once fully accessioned these would be available in API searches. Vespa would also deliver dataset downloads based on stored raw LPF, and dataset FeatureCollections customised for map tileset generation and efficient browser-based map visualisations.

This system would allow:

Automatic acceptance for any "very obvious" matches in the accessioning workflow.
Preparation of candidate-matches for entire datasets in the background while a contributor steps through individual records.
Great improvement to the quality and extent of our existing range of APIs.
Improved support for search incorporating toponymic BCP 47 language-tags.
Replacement of our inefficient mapdata filesystem cache.
Removal of Place data from our Postgres database, for both submitted datasets and exodatasets.
Considerable reduction in backup data volume and overhead.
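One possible reading of "very obvious" in the first item is a candidate that scores highly and is also well clear of the runner-up. The sketch below assumes each candidate is a dict with an `id` and a normalised `score` between 0 and 1; the thresholds are placeholders, not tuned values.

```python
def triage(candidates, accept_at=0.95, margin=0.15):
    """Decide whether the best candidate can be accepted without review.

    Assumes `score` is normalised to 0..1; both thresholds are illustrative.
    """
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    if not ranked:
        return ("no_candidates", None)
    top = ranked[0]
    runner_up = ranked[1]["score"] if len(ranked) > 1 else 0.0
    # Accept only when the top score is high AND clearly ahead of the next one.
    if top["score"] >= accept_at and top["score"] - runner_up >= margin:
        return ("auto_accept", top["id"])
    return ("needs_review", top["id"])


clear = triage([{"id": "wikidata:Q84", "score": 0.98},
                {"id": "geonames:2643743", "score": 0.60}])
close = triage([{"id": "wikidata:Q84", "score": 0.98},
                {"id": "geonames:2643743", "score": 0.93}])
```

Here `clear` would be auto-accepted, while `close` would still go to the contributor because two candidates are nearly tied.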

It should be implemented in such a way that the indexing of each exodataset might be periodically and independently refreshed.

Reinforcement Learning

Integration of Stable-Baselines3 for reinforcement learning in the Vespa indexing system could further enhance the place-matching and ranking process. User interactions—such as marking good or bad matches when contributing a dataset—can be used as training data, allowing the model to adjust future rankings. The search engine would continue to refine and optimise its performance, delivering more accurate place matches both on the web site and in the WHG APIs, and reducing the need for manual intervention in the accessioning workflow.
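The sketch below covers only the data-preparation step implied above: turning contributor decisions into reward-labelled examples. It does not use Stable-Baselines3, and the event shape (`features`, `decision`) and the reward values are assumptions for illustration.

```python
REWARDS = {"accepted": 1.0, "rejected": -1.0}  # illustrative reward values


def feedback_to_examples(events):
    """Turn review events into (features, reward) pairs.

    Events with any other decision (for example "skipped") carry no
    signal and are dropped.
    """
    return [
        (event["features"], REWARDS[event["decision"]])
        for event in events
        if event["decision"] in REWARDS
    ]


events = [
    {"features": {"name_sim": 0.97, "distance_km": 0.4}, "decision": "accepted"},
    {"features": {"name_sim": 0.91, "distance_km": 310.0}, "decision": "rejected"},
    {"features": {"name_sim": 0.50, "distance_km": 12.0}, "decision": "skipped"},
]
examples = feedback_to_examples(events)
```

How such examples would feed a training loop, and whether a full reinforcement-learning setup is needed rather than a simpler learned ranker, is left open here.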

Proposed Exodatasets

Further suggestions of exodatasets not already aggregated by GeoNames would be very welcome; please add them in a comment below.

Global

GeoNames (25m place names aggregated from numerous sources).
Wikidata: download the dump of all entities (available as a torrent from https://www.wikidata.org/wiki/Wikidata:Database_download), stream it with ijson, and filter places from the stream directly into Vespa.
OSM (6m+ nodes tagged as places).
Pleiades (63,282 toponyms; 71,082 attestations; 37,743 places).
TGN (3m+ place records; 5m+ names): "While most records in TGN include coordinates, these coordinates are approximate and are intended for reference ("finding purposes") only (as is true of coordinates in most atlases and other resources)."
Library of Congress (LOC) (not geolocated, but linked to other geolocated sources: example).
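For the Wikidata item above, a streaming filter might look like the following sketch. It assumes the bz2-compressed JSON dump (one large array of entities) and treats "has a coordinate location (P625)" as a stand-in test for "is a place", which is only a heuristic. The function names and output fields are hypothetical, and ijson is a third-party package imported lazily so the filter itself can be tried without it.

```python
import bz2


def place_from_entity(entity):
    """Return a small place dict if the entity has a usable P625 claim, else None."""
    for statement in entity.get("claims", {}).get("P625", []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # "somevalue"/"novalue" snaks carry no coordinates
        value = snak["datavalue"]["value"]
        return {
            "id": f"wikidata:{entity['id']}",
            "names": {lang: label["value"]
                      for lang, label in entity.get("labels", {}).items()},
            # ijson may yield Decimal for numbers, so coerce to float.
            "point": (float(value["longitude"]), float(value["latitude"])),
        }
    return None


def stream_places(dump_path):
    """Yield place dicts from a Wikidata JSON dump without loading it into memory."""
    import ijson  # third-party; imported here so place_from_entity works without it

    with bz2.open(dump_path, "rb") as dump:
        # "item" selects each element of the top-level JSON array in turn.
        for entity in ijson.items(dump, "item"):
            place = place_from_entity(entity)
            if place is not None:
                yield place


sample = {
    "id": "Q84",
    "labels": {"en": {"language": "en", "value": "London"},
               "fr": {"language": "fr", "value": "Londres"}},
    "claims": {"P625": [{"mainsnak": {
        "snaktype": "value",
        "datavalue": {"value": {"latitude": 51.5, "longitude": -0.125}},
    }}]},
}
place = place_from_entity(sample)
```

Each yielded dict would still need passing through the Wikidata transformer to become an NPR before being fed to Vespa.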

Regional

DE

DNB: Deutsche Nationalbibliothek (incorporates some coordinates directly from GeoNames).

GB

EPNS: Survey of English Place-Names
GB1900: UK-specific dataset with over 2.5 million historical place names.
RCAHMW: List of Historic Place Names of Wales (700k+ names).
Semantic Name Authority Repository Cymru (60k+ place names).

PL

PRNG: Państwowy Rejestr Nazw Geograficznych

tomersagi · 2024-09-24T07:51:39Z

oh boy. I will have a read.

docuracy · 2024-09-24T13:05:00Z

https://www.infoworld.com/article/3535633/why-vector-databases-arent-just-databases.html

tomersagi · 2024-09-24T14:35:56Z

Looks exciting! A few issues to consider:

Many places have name variants in different languages. How do you handle this in indexing and search?
There are relations between places that can help researchers find out more about a place. Some of these come via the external references (e.g., Wikidata); some can be contributed with the dataset by the researcher. A graph-based traversal system would let the researcher follow these links, not just the same-as links found.
There is more to place linking than "same-as": the RL pipeline can be trained with additional relations, such as "part of" (the City of London is part of the London municipality) and "replaced" (a newer settlement in the same location as an older one).

docuracy · 2024-09-24T18:58:13Z

My current thinking is to split the toponyms out into a separate (uid-cross-referenced) index, together with their various BCP 47 tags (where known). This way embeddings would be calculated for each only once.
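A minimal sketch of that split, assuming each place carries a `uid` and a list of names with optional `lang` tags. The key is the (toponym, language tag) pair, with `und` (the BCP 47 tag for an undetermined language) as the fallback. The `embed` callable stands in for whatever model is eventually used; all names here are illustrative.

```python
def build_toponym_index(places, embed):
    """Build a toponym index keyed by (toponym, BCP 47 tag).

    Each distinct key is embedded once; places that share it are
    cross-referenced by uid instead of repeating the embedding.
    """
    index = {}
    for place in places:
        for name in place["names"]:
            key = (name["toponym"], name.get("lang", "und"))
            if key not in index:
                index[key] = {"embedding": embed(key[0]), "place_uids": []}
            if place["uid"] not in index[key]["place_uids"]:
                index[key]["place_uids"].append(place["uid"])
    return index


calls = []


def fake_embed(toponym):
    """Placeholder embedding that records how often it is called."""
    calls.append(toponym)
    return [float(len(toponym))]


places = [
    {"uid": "whg:1", "names": [{"toponym": "London", "lang": "en"},
                               {"toponym": "Londres", "lang": "fr"}]},
    {"uid": "whg:2", "names": [{"toponym": "London", "lang": "en"}]},
]
index = build_toponym_index(places, fake_embed)
```

Two places share the toponym "London" tagged `en`, yet `fake_embed` runs only once for it, and a search hit on that toponym leads back to both place uids.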

docuracy · 2024-10-08T13:52:22Z

Vespa backups can be effectively managed through replication to a remote server equipped with ZFS. This setup provides additional storage capacity and facilitates load sharing for the Vespa index, improving performance by distributing query and indexing loads. Additionally, it enables the creation of instant snapshots of the replicated data, ensuring both data availability and efficient backup management.

docuracy self-assigned this Sep 17, 2024

docuracy added enhancement infrastructure labels Sep 17, 2024

docuracy added this to the v3 beta+ milestone Sep 17, 2024

docuracy added elastic API labels Sep 18, 2024

docuracy changed the title ~~Vespa Indexes~~ Authorities & Vespa Indexes Sep 19, 2024

docuracy pinned this issue Sep 19, 2024

docuracy changed the title ~~Authorities & Vespa Indexes~~ Exodata & Vespa Indexes Sep 29, 2024

docuracy changed the title ~~Exodata & Vespa Indexes~~ Exodatasets & Vespa Indexes Sep 29, 2024

docuracy removed this from the v3 beta+ milestone Oct 3, 2024

docuracy added the reconciliation label Oct 8, 2024

docuracy added the Technical Board label Nov 10, 2024
