Exodatasets & Vespa Indexes #372
Comments
oh boy. I will have a read.
Looks exciting! A few issues to be considered -
My current thinking is to split the toponyms out into a separate (uid-cross-referenced) index, together with their various BCP 47 tags (where known). This way embeddings would be calculated for each only once.
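A minimal sketch of that split, purely for illustration: each distinct toponym string becomes one record carrying its known BCP 47 tags and the uids of the places that use it, so the embedding function runs once per toponym. The `embed` stand-in, the field names and the uid formats are all assumptions, not an agreed schema.

```python
def embed(text):
    """Hypothetical stand-in for a real embedding model."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def build_toponym_index(places, embed_fn=embed):
    """Collapse repeated toponyms into one cross-referenced record each.

    `places` is an iterable of (place_uid, toponym, bcp47_tag_or_None).
    Returns a dict keyed by the toponym string.
    """
    index = {}
    for place_uid, toponym, lang in places:
        record = index.get(toponym)
        if record is None:
            # First sighting of this toponym: embed it exactly once.
            record = {
                "toponym": toponym,
                "embedding": embed_fn(toponym),
                "langs": [],
                "place_uids": [],
            }
            index[toponym] = record
        if lang is not None and lang not in record["langs"]:
            record["langs"].append(lang)
        if place_uid not in record["place_uids"]:
            record["place_uids"].append(place_uid)
    return index
```

Keying on the bare string means "London" tagged `en` and "London" tagged `fr` share one embedding; if language-specific embeddings were wanted, the key would need to include the tag.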
Vespa backups can be effectively managed through replication to a remote server equipped with ZFS. This setup provides additional storage capacity and facilitates load sharing for the Vespa index, improving performance by distributing query and indexing loads. Additionally, it enables the creation of instant snapshots of the replicated data, ensuring both data availability and efficient backup management.
An improved indexing system using Vespa, together with the overhaul of the system for asserting links between Places outlined here, might remove the current need for manual reconciliation and accessioning. This approach can eliminate the need for multiple passes through contributed datasets, offering contributors a unified interface to complete all stages of place data processing in a single pass, rather than waiting for dataset-wide match-seeking operations to finish.
This system could present a Cluster of potential matches from exodatasets like Wikidata, GeoNames, and others (see below), as well as from other WHG datasets, and it might suggest relevant citations from LLMs (proposed here). It would allow a dataset contributor to focus on a single place in their dataset and complete all stages of its processing before moving on to another. It would also reduce the friction that arises when dealing with historical places which do not exist in the modern exodatasets, but which may well be present in other contributed datasets.
Vespa
Vespa offers hybrid search capabilities that can seamlessly combine traditional keyword-based search with vector-based search methods. Running in a Docker container, it would allow ranking based on textual, spatial, temporal, linguistic, phonetic, and semantic facets across both global and regional exodatasets and contributed WHG data. Vespa supports real-time updates and provides a broad range of APIs for querying and data management.
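As a rough illustration of what a hybrid query might look like, the sketch below builds a request body for Vespa's query API that combines a keyword match (`userQuery()`) with an approximate nearest-neighbour match over a vector field. The field name `embedding`, the query tensor name `q` and the rank profile name `hybrid` are assumptions; they would have to match whatever schema is eventually defined. The function only builds the body and does not contact a server.

```python
def build_hybrid_query(text, query_vector, hits=10, target_hits=100):
    """Build a Vespa query body mixing keyword and vector retrieval.

    Assumes a schema with a tensor field named `embedding` and a rank
    profile named `hybrid` that declares the input `query(q)`.
    """
    yql = (
        "select * from sources * where userQuery() or "
        f"({{targetHits:{target_hits}}}nearestNeighbor(embedding, q))"
    )
    return {
        "yql": yql,
        "query": text,                    # consumed by userQuery()
        "input.query(q)": query_vector,   # consumed by nearestNeighbor
        "ranking": "hybrid",              # hypothetical rank profile name
        "hits": hits,
    }
```

The resulting dict would be sent as JSON in a POST to the container's `/search/` endpoint. Spatial, temporal and phonetic facets would be further terms in the YQL or in the rank profile, and are left out here.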
Normalised Place Records (NPRs)
Each exodataset would require its own transformer function to produce indexable NPRs, with namespaced @id values recording the authority name. The NPRs would include many of the LPF properties, with these additional fields (among others):
Benefits
Features from WHG dataset contributions would also be transformed to NPRs (with an additional raw LPF field for swift retrieval of entire place records and datasets) and stored in Vespa, and once fully accessioned these would be available in API searches. Vespa would also deliver dataset downloads based on stored raw LPF, and dataset FeatureCollections customised for map tileset generation and efficient browser-based map visualisations.
This system would allow:
It should be implemented in such a way that the indexing of each exodataset might be periodically and independently refreshed.
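One way the per-exodataset transformers and independent refresh might fit together is sketched below: a registry maps each authority name to its own transformer, and a refresh touches only one authority at a time. The NPR field names (`names`, `geometry`, `raw_lpf`), the `whg` namespace and the shape of the input records are assumptions for illustration; only the GeoNames column names (`geonameid`, `name`, `latitude`, `longitude`) and the LPF `names`/`toponym`/`lang` keys follow the published formats.

```python
def geonames_to_npr(row):
    """Transform one GeoNames row (as a dict) into a hypothetical NPR."""
    return {
        "@id": f"geonames:{row['geonameid']}",  # namespaced by authority
        "names": [{"toponym": row["name"], "lang": None}],
        "geometry": {
            "type": "Point",
            "coordinates": [float(row["longitude"]), float(row["latitude"])],
        },
    }


def lpf_to_npr(feature):
    """Transform one contributed LPF feature, keeping the raw record."""
    return {
        "@id": f"whg:{feature['@id']}",
        "names": [
            {"toponym": n["toponym"], "lang": n.get("lang")}
            for n in feature.get("names", [])
        ],
        "geometry": feature.get("geometry"),
        "raw_lpf": feature,  # retained for swift whole-record retrieval
    }


# One transformer per authority; adding an exodataset means adding an entry.
TRANSFORMERS = {"geonames": geonames_to_npr, "whg": lpf_to_npr}


def refresh(authority, records):
    """Re-transform a single authority's records, independent of the rest."""
    transform = TRANSFORMERS[authority]
    return [transform(record) for record in records]
```

Because each NPR carries its authority in the `@id` prefix, a refresh could replace all documents with that prefix without disturbing the other exodatasets.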
Reinforcement Learning
Integration of Stable-Baselines3 for reinforcement learning in the Vespa indexing system could further enhance the place-matching and ranking process. User interactions—such as marking good or bad matches when contributing a dataset—can be used as training data, allowing the model to adjust future rankings. The search engine would continue to refine and optimise its performance, delivering more accurate place matches both on the web site and in the WHG APIs, and reducing the need for manual intervention in the accessioning workflow.
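Before any model is involved, contributor decisions would need to be captured in a form a learner can consume. The sketch below shows one possible shape for that step only: each good/bad marking becomes a record with a scalar reward. It does not use Stable-Baselines3, and the event fields and reward values are assumptions.

```python
# Hypothetical reward values for each contributor decision.
REWARDS = {"accept": 1.0, "reject": -1.0, "skip": 0.0}


def to_training_records(events):
    """Turn contributor feedback events into reward-labelled records.

    Each event is a dict with `query_id`, `candidate_id`, `rank`
    (position at which the candidate was shown) and `decision`.
    """
    return [
        {
            "query_id": event["query_id"],
            "candidate_id": event["candidate_id"],
            "rank": event["rank"],
            "reward": REWARDS[event["decision"]],
        }
        for event in events
    ]
```

Records like these could later feed an environment wrapper for whichever learning library is chosen.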
Proposed Exodatasets
Further suggestions of exodatasets not already aggregated by GeoNames would be very welcome - please add in a comment below.
Global
Regional
DE
GB
PL