Exodatasets & Vespa Indexes #372

Open
docuracy opened this issue Sep 17, 2024 · 5 comments

Comments

@docuracy
Member

docuracy commented Sep 17, 2024

An improved indexing system using Vespa, together with the overhaul of the system for asserting links between Places outlined here, might remove the current need for manual reconciliation and accessioning. This approach can eliminate the need for multiple passes through contributed datasets, offering contributors a unified interface to complete all stages of place data processing in a single pass, rather than waiting for dataset-wide match-seeking operations to finish.

This system could present a Cluster of potential matches from exodatasets like Wikidata, GeoNames, and others (see below), as well as from other WHG datasets, and it might suggest relevant citations from LLMs (proposed here). It would allow a dataset contributor to focus on a single place in their dataset and complete all stages of its processing before moving on to another. It would also reduce the friction that arises when dealing with historical places which do not exist in the modern exodatasets, but which may well be present in other contributed datasets.

Vespa

Vespa offers hybrid search capabilities that can seamlessly combine traditional keyword-based search with vector-based search methods. Running in a Docker container, it would allow ranking based on textual, spatial, temporal, linguistic, phonetic, and semantic facets across both global and regional exodatasets and contributed WHG data. Vespa supports real-time updates and provides a broad range of APIs for querying and data management.
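
As a rough illustration of the hybrid search described above, the sketch below builds a request body for Vespa's query API that combines a keyword clause (`userQuery()`) with a nearest-neighbour clause. The field name `toponym_embedding`, the query tensor name `q`, the rank profile name `hybrid`, and the helper function itself are assumptions made for illustration; none of them are defined in this proposal.

```python
import json


def build_hybrid_query(text, embedding, hits=10, target_hits=100):
    """Build a Vespa query body mixing keyword and vector retrieval.

    Assumed (hypothetical) schema: a tensor field `toponym_embedding`
    and a rank profile named `hybrid` that declares the input `query(q)`.
    """
    yql = (
        "select * from sources * where userQuery() or "
        f"({{targetHits:{target_hits}}}nearestNeighbor(toponym_embedding, q))"
    )
    return {
        "yql": yql,
        "query": text,                # consumed by userQuery()
        "input.query(q)": embedding,  # query-side vector
        "ranking": "hybrid",          # hypothetical rank profile
        "hits": hits,
    }


body = build_hybrid_query("Eboracum", [0.1, 0.2, 0.3])
print(json.dumps(body, indent=2))
```

A body like this would be POSTed to the container's search endpoint. Spatial and temporal facets would be added as further YQL conditions or as ranking features in the rank profile.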

Normalised Place Records (NPRs)

Each exodataset would require its own transformer function to produce indexable NPRs, with namespaced @ids recording the authority name. The NPRs would include many of the LPF properties, with these additional fields (among others):

  • Embeddings (for each toponym) for phonetic and semantic representations. These would be generated using pre-trained multilingual grapheme-to-phoneme (G2P) and BERT models (which would also provide embeddings for search terms).
  • Simplified representations of geometries as points and bounding boxes, to be used in containment and proximity queries.
  • Reduction of timespans to an array of start and end years for each record.
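
A minimal sketch of what one transformer function might look like, covering the namespaced `@id`, the simplified geometry, and the reduced timespans (embeddings are left out). The input record shape, the output field names, and the function names are all hypothetical; a real transformer would be written against each exodataset's actual format.

```python
def bbox_of(coords):
    """Return (min_lon, min_lat, max_lon, max_lat) for nested GeoJSON coordinates."""
    points = []

    def walk(node):
        # A position is a list whose first element is a number.
        if isinstance(node[0], (int, float)):
            points.append(node)
        else:
            for child in node:
                walk(child)

    walk(coords)
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    return (min(lons), min(lats), max(lons), max(lats))


def to_npr(authority, record):
    """Transform one source record (hypothetical shape) into a normalised place record."""
    min_lon, min_lat, max_lon, max_lat = bbox_of(record["geometry"]["coordinates"])
    return {
        "@id": f"{authority}:{record['id']}",  # namespaced by authority
        "toponyms": record["names"],
        "bbox": [min_lon, min_lat, max_lon, max_lat],
        # Bounding-box centre as a crude representative point.
        "point": [(min_lon + max_lon) / 2, (min_lat + max_lat) / 2],
        # Timespans reduced to [start_year, end_year] pairs.
        "years": [[int(s), int(e)] for s, e in record.get("timespans", [])],
    }


example = {
    "id": "example-1",
    "names": ["London", "Londinium"],
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-0.5, 51.3], [0.3, 51.3], [0.3, 51.7], [-0.5, 51.7], [-0.5, 51.3]]],
    },
    "timespans": [("43", "410")],
}
npr = to_npr("geonames", example)
print(npr["@id"], npr["bbox"], npr["years"])
```

The bounding-box centre is only a stand-in for a representative point; a centroid or a point guaranteed to lie on the geometry may be a better choice for irregular shapes.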

Benefits

Features from WHG dataset contributions would also be transformed to NPRs (with an additional raw LPF field for swift retrieval of entire place records and datasets) and stored in Vespa, and once fully accessioned these would be available in API searches. Vespa would also deliver dataset downloads based on stored raw LPF, and dataset FeatureCollections customised for map tileset generation and efficient browser-based map visualisations.

This system would allow:

  • Automatic acceptance for any "very obvious" matches in the accessioning workflow.
  • Preparation of candidate-matches for entire datasets in the background while a contributor steps through individual records.
  • Great improvement to the quality and extent of our existing range of APIs.
  • Improved support for search incorporating toponymic BCP 47 language-tags.
  • Replacement of our inefficient mapdata filesystem cache.
  • Removal of Place data from our Postgres database, for both submitted datasets and exodatasets.
  • Considerable reduction in backup data volume and overhead.
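
One way the automatic acceptance of "very obvious" matches listed above might be decided is sketched here: accept only when the top candidate scores highly and is clearly ahead of the runner-up. The function name, the score range, and both thresholds are assumptions for illustration and would need tuning against real reconciliation data.

```python
def auto_accept(candidates, min_score=0.95, min_margin=0.10):
    """Return the id of a 'very obvious' match, or None to defer to the contributor.

    `candidates` is a list of (place_id, score) pairs with scores in [0, 1].
    Both thresholds are hypothetical and would need tuning on real data.
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_id, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score >= min_score and best_score - runner_up >= min_margin:
        return best_id
    return None


print(auto_accept([("a", 0.98), ("b", 0.60)]))  # clear winner
print(auto_accept([("a", 0.98), ("b", 0.95)]))  # too close to call
```

Requiring a margin as well as a minimum score keeps ambiguous clusters, where several candidates score well, in front of the contributor.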

It should be implemented in such a way that the indexing of each exodataset might be periodically and independently refreshed.

Reinforcement Learning

Integrating Stable-Baselines3 for reinforcement learning into the Vespa indexing system could further enhance place matching and ranking. User interactions, such as marking good or bad matches when contributing a dataset, can be used as training data, allowing the model to adjust future rankings. The search engine would then continue to refine its performance, delivering more accurate place matches both on the web site and in the WHG APIs, and reducing the need for manual intervention in the accessioning workflow.
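
The feedback loop can be pictured without committing to a particular library. The sketch below does not use Stable-Baselines3; it only shows how contributor decisions might be turned into reward-bearing rows that a learner could later consume. The event shape, the reward values, and the rank-based scaling are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical reward scheme: confirmed matches are rewarded, rejections penalised.
REWARDS = {"accept": 1.0, "reject": -1.0, "skip": 0.0}


@dataclass
class Feedback:
    query_id: str      # place being reconciled
    candidate_id: str  # candidate shown to the contributor
    rank: int          # 1-based position in the presented list
    action: str        # "accept", "reject" or "skip"


def to_training_row(event):
    """Convert one feedback event into a (state, action, reward) style row."""
    reward = REWARDS[event.action]
    # Accepting a low-ranked candidate means the ranking was poor,
    # so scale positive rewards down by the rank at which the match appeared.
    if reward > 0:
        reward /= event.rank
    return {"state": event.query_id, "action": event.candidate_id, "reward": reward}


row = to_training_row(Feedback("place-1", "candidate-7", rank=2, action="accept"))
print(row)
```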

Proposed Exodatasets

Further suggestions of exodatasets not already aggregated by GeoNames would be very welcome; please add them in a comment below.

Global

  • GeoNames (25m place names aggregated from numerous sources).
  • Wikidata: download the dump of all entities (e.g. by torrent) from https://www.wikidata.org/wiki/Wikidata:Database_download, parse it as a stream with ijson, and filter places from the stream directly into Vespa.
  • OSM (6m+ nodes tagged as places).
  • Pleiades (63,282 toponyms; 71,082 attestations; 37,743 places).
  • TGN (3m+ place records; 5m+ names). Note: "While most records in TGN include coordinates, these coordinates are approximate and are intended for reference ("finding purposes") only (as is true of coordinates in most atlases and other resources)".
  • Library of Congress (LOC) (not geolocated, but linked to other geolocated sources: example).
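
For the Wikidata item in the list above, the streaming filter might look something like the following. This sketch swaps ijson for the standard library: the Wikidata JSON dump is laid out as one array with a single entity per line, so it can be read line by line. Treating any entity with a coordinate claim (P625) as a place is an assumption made here for brevity; a real filter would probably also inspect instance-of (P31) claims. The function name is hypothetical.

```python
import json


def iter_places(lines):
    """Yield (qid, lat, lon) for entities that carry a coordinate claim (P625)."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue  # array delimiters and blank lines
        entity = json.loads(line)
        for claim in entity.get("claims", {}).get("P625", []):
            snak = claim.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue  # "novalue"/"somevalue" snaks carry no coordinates
            value = snak["datavalue"]["value"]
            yield entity["id"], value["latitude"], value["longitude"]
            break  # one coordinate per entity is enough for this sketch
```

Because `iter_places` accepts any iterable of lines, it could be fed from a compressed dump opened in text mode (for example with `bz2.open(path, "rt", encoding="utf-8")`), so the full dump never has to be held in memory.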

Regional

DE

GB

PL

docuracy self-assigned this Sep 17, 2024
docuracy added the enhancement and infrastructure labels Sep 17, 2024
docuracy added this to the v3 beta+ milestone Sep 17, 2024
docuracy changed the title from "Vespa Indexes" to "Authorities & Vespa Indexes" Sep 19, 2024
docuracy pinned this issue Sep 19, 2024
@tomersagi

oh boy. I will have a read.

@docuracy
Member Author

@tomersagi

Looks exciting! A few issues to consider:

  • Many places have name variants in different languages. How would you handle this in indexing and search?
  • There are relations between places that can help researchers find out more about a place. Some of these come via the external references (e.g., Wikidata); some can be contributed with the dataset by the researcher. A graph-based traversal system would allow the researcher to traverse these links, not just the same-as links found.
  • There is more to place linking than "same-as": the RL pipeline can be trained with additional relations, such as "part of" (London city is part of the London municipality) or "replaced" (a newer settlement in the same location as an older one).

@docuracy
Member Author

My current thinking is to split the toponyms out into a separate (uid-cross-referenced) index, together with their various BCP 47 tags (where known). This way embeddings would be calculated only once for each distinct toponym.
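
A minimal sketch of that split, assuming places arrive as a mapping from uid to (name, language tag) pairs. Each distinct toponym-and-tag combination becomes one index entry holding the uids of every place that uses it, so an embedding would only need to be computed once per entry. The data shapes and the function name are hypothetical.

```python
def split_toponyms(places):
    """Build a toponym index keyed by (name, language tag), cross-referenced to place uids.

    `places` maps a place uid to a list of (name, bcp47_tag_or_None) pairs.
    """
    index = {}
    for uid, names in places.items():
        for name, lang in names:
            key = (name, lang or "und")  # "und" is the BCP 47 tag for undetermined
            entry = index.setdefault(
                key, {"toponym": name, "lang": key[1], "place_uids": []}
            )
            if uid not in entry["place_uids"]:
                entry["place_uids"].append(uid)
    return index


places = {
    "p1": [("London", "en"), ("Londres", "fr")],
    "p2": [("London", "en"), ("Lundenwic", None)],
}
toponym_index = split_toponyms(places)
print(len(toponym_index), toponym_index[("London", "en")]["place_uids"])
```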

docuracy changed the title from "Authorities & Vespa Indexes" to "Exodata & Vespa Indexes" Sep 29, 2024
docuracy changed the title from "Exodata & Vespa Indexes" to "Exodatasets & Vespa Indexes" Sep 29, 2024
docuracy removed this from the v3 beta+ milestone Oct 3, 2024
@docuracy
Member Author

docuracy commented Oct 8, 2024

Vespa backups can be effectively managed through replication to a remote server equipped with ZFS. This setup provides additional storage capacity and facilitates load sharing for the Vespa index, improving performance by distributing query and indexing loads. Additionally, it enables the creation of instant snapshots of the replicated data, ensuring both data availability and efficient backup management.
