forked from ontoportal/ontologies_linked_data
Feature: Index Ontologies metadata and content & Agents (#130)
* use standard SOLR in docker compose with no old OntoPortal configs
* migrate the ontology properties SOLR configuration to use the Schema API
* migrate the ontology classes SOLR configuration to use the Schema API
* migrate provisional classes indexation to use the Schema API and model hooks
* update tests to handle the new indexation API
* simplify the ontology properties index schema
* update the class and properties schemas to use the existing dynamic field names
* index submission and ontology metadata on save
* index agent metadata
* add ontology and agent metadata indexation tests
* make agent name, acronym, email and identifiers searchable
* unindex an ontology submission when it is archived
* make ontology acronym and name searchable
* update the embedded ontology to include all fields and update the submission on save
* fix embedded-docs search tests
* rename ontology unindex to unindex_all_data to prevent conflicts
* implement indexing of all ontology content
* fix the unescaping of indexed property names
* fix an issue, after updating the RDF gem to 3.0, where request params were frozen
* add parallel processing to the index_all_data step
* clear indexed data after an ontology is deleted
* optimize index-all-data on Virtuoso and GraphDB by pre-fetching all ids
  * Before optimization: fs ⇒ 15.224 s, ag ⇒ 19.239 s, vo ⇒ 42.953 s, gb ⇒ 33.528 s
  * After optimization: fs ⇒ 15.370 s, ag ⇒ 17.368 s, vo ⇒ 16.565 s, gb ⇒ 15.432 s
1 parent cab072e · commit d37aeaf
Showing 12 changed files with 496 additions and 140 deletions.
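Several of the bullets above replace hand-maintained Solr configuration files with calls to Solr's Schema API. As a minimal sketch of that approach (the host, the ontology_data collection name and the field definition below are illustrative assumptions, not the exact schema this commit installs), a dynamic field can be declared with a single JSON request:

    require 'net/http'
    require 'json'
    require 'uri'

    # Hedged sketch: host, collection and field definition are assumptions.
    uri = URI('http://localhost:8983/solr/ontology_data/schema')

    command = {
      # Declare a dynamic field so any '*_txt' property can be indexed
      # without pre-declaring every property name in the schema.
      'add-dynamic-field' => {
        'name' => '*_txt',
        'type' => 'text_general',
        'multiValued' => true,
        'stored' => true,
        'indexed' => true
      }
    }

    response = Net::HTTP.post(uri, command.to_json, 'Content-Type' => 'application/json')
    puts response.body

Managing fields this way lets the schema evolve from code (or migrations) instead of shipping edited solrconfig/schema files with the deployment, which is what makes a stock SOLR container usable in docker compose.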
11 changes: 0 additions & 11 deletions
lib/ontologies_linked_data/concerns/mappings/mapping_external.rb
This file was deleted.
161 changes: 161 additions & 0 deletions
lib/ontologies_linked_data/concerns/ontology_submissions/submission_index_all_data.rb
@@ -0,0 +1,161 @@
require 'parallel'

module LinkedData
  module Concerns
    module OntologySubmission
      # Indexes every triple of a submission's graph into the
      # :ontology_data Solr collection.
      module IndexAllData

        module ClassMethods
          # Drop all previously indexed documents of an ontology from the
          # :ontology_data collection before re-indexing.
          def clear_indexed_content(ontology)
            conn = Goo.init_search_connection(:ontology_data)
            begin
              conn.delete_by_query("ontology_t:\"#{ontology}\"")
            rescue StandardError => e
              puts e.message
            end
            conn
          end
        end

        def self.included(base)
          base.extend(ClassMethods)
        end

        def index_sorted_ids(ids, ontology, conn, logger, commit = true)
          # Index the ids in slices of 100, ten slices at a time.
          total_triples = Parallel.map(ids.each_slice(100), in_threads: 10) do |ids_slice|
            index_ids = 0
            triples_count = 0
            documents = {}
            time = Benchmark.realtime do
              documents, triples_count = fetch_triples(ids_slice, ontology)
            end

            # `next`, not `return`: a `return` inside the Parallel block would
            # jump out of the whole method; 0 keeps the results summable.
            next 0 if documents.empty?

            logger.info("Worker #{Parallel.worker_number} > Fetched #{triples_count} triples of #{id} in #{time} sec.") if triples_count.positive?

            time = Benchmark.realtime do
              conn.index_document(documents.values, commit: false)
              conn.index_commit if commit
              index_ids = documents.size
              documents = {} # release the slice's documents once indexed
            end
            logger.info("Worker #{Parallel.worker_number} > Indexed #{index_ids} ids of #{id} in #{time} sec.")
            triples_count
          end
          total_triples.sum
        end

        def index_all_data(logger, commit = true)
          page = 1
          size = 1000
          count_ids = 0
          total_time = 0
          total_triples = 0
          old_count = -1

          ontology = self.bring(:ontology).ontology
                         .bring(:acronym).acronym
          conn = init_search_collection(ontology)

          ids = []

          # Page through all subject ids until a page adds nothing new.
          while count_ids != old_count
            old_count = count_ids
            count = 0
            time = Benchmark.realtime do
              ids = fetch_sorted_ids(size, page)
              count = ids.size
            end

            count_ids += count
            total_time += time
            page += 1

            next unless count.positive?

            logger.info("Fetched #{count} ids of #{id} page: #{page} in #{time} sec.")

            total_triples += index_sorted_ids(ids, ontology, conn, logger, commit)
          end
          logger.info("Completed indexing all ontology data: #{self.id} in #{total_time} sec. (#{count_ids} ids / #{total_triples} triples)")
          logger.flush
        end

        private

        # Pre-fetch a sorted page of distinct subject ids from the
        # submission's graph: one cheap id query per page, which is the
        # optimization that brought Virtuoso and GraphDB times down.
        def fetch_sorted_ids(size, page)
          query = Goo.sparql_query_client.select(:id)
                     .distinct
                     .from(RDF::URI.new(self.id))
                     .where(%i[id p v])
                     .limit(size)
                     .offset((page - 1) * size)

          query.each_solution.map(&:id).sort
        end

        def update_doc(doc, property, new_val)
          # Indexed property names are escaped URIs: '___' stands in for
          # '://' and '_' for '/'.
          unescaped_prop = property.gsub('___', '://')

          unescaped_prop = unescaped_prop.gsub('_', '/')
          existent_val = doc["#{unescaped_prop}_t"] || doc["#{unescaped_prop}_txt"]

          if !existent_val && !property['#']
            unescaped_prop = unescaped_prop.sub(%r{/([^/]+)$}, '#\1') # replace the last '/' with '#'
            existent_val = doc["#{unescaped_prop}_t"] || doc["#{unescaped_prop}_txt"]
          end

          if (existent_val && new_val) || new_val.is_a?(Array)
            # Multiple values: move the property to a multi-valued *_txt field.
            doc.delete("#{unescaped_prop}_t")
            doc["#{unescaped_prop}_txt"] = Array(existent_val) + Array(new_val).map(&:to_s)
          elsif existent_val.nil? && new_val
            doc["#{unescaped_prop}_t"] = new_val.to_s
          end
          doc
        end

        def init_search_collection(ontology)
          self.class.clear_indexed_content(ontology)
        end

        # Fetch all triples of a slice of ids with a single filtered SPARQL
        # query and fold them into one Solr document per subject.
        def fetch_triples(ids_slice, ontology)
          documents = {}
          count = 0
          filter = ids_slice.map { |x| "?id = <#{x}>" }.join(' || ')
          query = Goo.sparql_query_client.select(:id, :p, :v)
                     .from(RDF::URI.new(self.id))
                     .where(%i[id p v])
                     .filter(filter)
          query.each_solution do |sol|
            count += 1
            doc = documents[sol[:id].to_s]
            doc ||= {
              id: "#{sol[:id]}_#{ontology}", submission_id_t: self.id.to_s,
              ontology_t: ontology, resource_model: self.class.model_name,
              resource_id: sol[:id].to_s
            }
            property = sol[:p].to_s
            value = sol[:v]

            if property.eql?(RDF.type.to_s)
              update_doc(doc, 'type', value)
            else
              update_doc(doc, property, value)
            end
            documents[sol[:id].to_s] = doc
          end
          [documents, count]
        end

      end
    end
  end
end
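For orientation, a short usage sketch of the mixin above. The acronym and the Goo lookup pattern are illustrative assumptions, not code from this commit; only index_all_data is defined here.

    require 'logger'

    # Hypothetical usage; 'STY' and the where(...) lookup are illustrative.
    logger = Logger.new($stdout)

    submission = LinkedData::Models::OntologySubmission
                 .where(ontology: [acronym: 'STY'])
                 .first

    # Clears any previously indexed content for the ontology, then pages
    # through all subject ids and indexes their triples into :ontology_data.
    submission.index_all_data(logger)

Because each slice of ids is fetched and indexed by its own worker thread, the per-worker log lines interleave; the final "Completed indexing" line reports the aggregate id and triple counts.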