
Solr 8.8 upgrade - remaining issues with solrconfig.xml #7662

Open
poikilotherm opened this issue Mar 8, 2021 · 8 comments

Comments

@poikilotherm
Contributor

poikilotherm commented Mar 8, 2021

Mistake

When we upgraded from Solr 7.3.0, we made one bad mistake (mea culpa): we did not adapt luceneMatchVersion to the version of the running server.

Other changes

We also did not incorporate upstream changes to solrconfig.xml:

--- solrconfig.xml	2021-03-08 10:29:37.810488567 +0100
+++ solrconfig-881.xml	2021-02-12 19:56:43.000000000 +0100
@@ -35,7 +35,7 @@
        that you fully re-index after changing this setting as it can
        affect both how text is indexed and queried.
   -->
-  <luceneMatchVersion>7.3.0</luceneMatchVersion>
+  <luceneMatchVersion>8.8.1</luceneMatchVersion>
 
   <!-- <lib/> directives can be used to instruct Solr to load any Jars
        identified and use them to resolve any "plugins" specified in
@@ -69,20 +69,11 @@
        If a 'dir' option (with or without a regex) is used and nothing
        is found that matches, a warning will be logged.

The formerly present JARs have been excluded since 8.0; see apache/lucene-solr@dce36c1.

I don't know if we actually use any of those. We should remove them and see if anything breaks.

-       The examples below can be used to load some solr-contribs along
+       The example below can be used to load a solr-contrib along
        with their external dependencies.
     -->
-  <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
-  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />
+    <!-- <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-ltr-\d.*\.jar" /> -->
 
-  <lib dir="${solr.install.dir:../../../..}/contrib/clustering/lib/" regex=".*\.jar" />
-  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-clustering-\d.*\.jar" />
-
-  <lib dir="${solr.install.dir:../../../..}/contrib/langid/lib/" regex=".*\.jar" />
-  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-langid-\d.*\.jar" />
-
-  <lib dir="${solr.install.dir:../../../..}/contrib/velocity/lib" regex=".*\.jar" />
-  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-velocity-\d.*\.jar" />
   <!-- an exact 'path' can be used instead of a 'dir' to specify a
        specific jar file.  This will cause a serious error to be logged
        if it can't be loaded.

These are newer changes we should incorporate.

@@ -161,6 +152,15 @@
     <!-- <ramBufferSizeMB>100</ramBufferSizeMB> -->
     <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
 
+    <!-- Expert: ramPerThreadHardLimitMB sets the maximum amount of RAM that can be consumed
+         per thread before they are flushed. When limit is exceeded, this triggers a forced
+         flush even if ramBufferSizeMB has not been exceeded.
+         This is a safety limit to prevent Lucene's DocumentsWriterPerThread from address space
+         exhaustion due to its internal 32 bit signed integer based memory addressing.
+         The specified value should be greater than 0 and less than 2048MB. When not specified,
+         Solr uses Lucene's default value 1945. -->
+    <!-- <ramPerThreadHardLimitMB>1945</ramPerThreadHardLimitMB> -->
+
     <!-- Expert: Merge Policy
          The Merge Policy in Lucene controls how merging of segments is done.
          The default since Solr/Lucene 3.3 is TieredMergePolicy.
@@ -367,23 +367,32 @@
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -->
   <query>
 
-    <!-- Maximum number of clauses in each BooleanQuery,  an exception
-         is thrown if exceeded.  It is safe to increase or remove this setting,
-         since it is purely an arbitrary limit to try and catch user errors where
-         large boolean queries may not be the best implementation choice.
+    <!-- Maximum number of clauses allowed when parsing a boolean query string.
+         
+         This limit only impacts boolean queries specified by a user as part of a query string,
+         and provides per-collection controls on how complex user specified boolean queries can
+         be.  Query strings that specify more clauses then this will result in an error.
+         
+         If this per-collection limit is greater then the global `maxBooleanClauses` limit
+         specified in `solr.xml`, it will have no effect, as that setting also limits the size
+         of user specified boolean queries.
       -->
-    <maxBooleanClauses>1024</maxBooleanClauses>
+    <maxBooleanClauses>${solr.max.booleanClauses:1024}</maxBooleanClauses>
 
     <!-- Solr Internal Query Caches
 
-         There are two implementations of cache available for Solr,
-         LRUCache, based on a synchronized LinkedHashMap, and
-         FastLRUCache, based on a ConcurrentHashMap.
+         There are four implementations of cache available for Solr:
+         LRUCache, based on a synchronized LinkedHashMap, 
+         LFUCache and FastLRUCache, based on a ConcurrentHashMap, and CaffeineCache -
+         a modern and robust cache implementation. Note that in Solr 9.0
+         only CaffeineCache will be available, other implementations are now
+         deprecated.
 
          FastLRUCache has faster gets and slower puts in single
          threaded operation and thus is generally faster than LRUCache
          when the hit ratio of the cache is high (> 75%), and may be
          faster under other scenarios on multi-cpu systems.
+         Starting with Solr 9.0 the default cache implementation used is CaffeineCache.
     -->
 
     <!-- Filter Cache
@@ -403,13 +412,12 @@
            initialSize - the initial capacity (number of entries) of
                the cache.  (see java.util.HashMap)
            autowarmCount - the number of entries to prepopulate from
-               and old cache.
+               an old cache.
            maxRamMB - the maximum amount of RAM (in MB) that this cache is allowed
                       to occupy. Note that when this option is specified, the size
                       and initialSize parameters are ignored.
       -->
-    <filterCache class="solr.FastLRUCache"
-                 size="512"
+    <filterCache size="512"
                  initialSize="512"
                  autowarmCount="0"/>
 
@@ -421,8 +429,7 @@
             maxRamMB - the maximum amount of RAM (in MB) that this cache is allowed
                        to occupy
       -->
-    <queryResultCache class="solr.LRUCache"
-                      size="512"
+    <queryResultCache size="512"
                       initialSize="512"
                       autowarmCount="0"/>
 
@@ -432,14 +439,12 @@
          document).  Since Lucene internal document ids are transient,
          this cache will not be autowarmed.
       -->
-    <documentCache class="solr.LRUCache"
-                   size="512"
+    <documentCache size="512"
                    initialSize="512"
                    autowarmCount="0"/>
 
     <!-- custom cache currently used by block join -->
     <cache name="perSegFilter"
-           class="solr.search.LRUCache"
            size="10"
            initialSize="0"
            autowarmCount="10"
@@ -452,8 +457,7 @@
          even if not configured here.
       -->
     <!--
-       <fieldValueCache class="solr.FastLRUCache"
-                        size="512"
+       <fieldValueCache size="512"
                         autowarmCount="128"
                         showItems="32" />
       -->
@@ -469,7 +473,6 @@
       -->
     <!--
        <cache name="myUserCache"
-              class="solr.LRUCache"
               size="4096"
               initialSize="1024"
               autowarmCount="1024"
@@ -521,6 +524,23 @@
       -->
     <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
 
+  <!-- Use Filter For Sorted Query
+
+   A possible optimization that attempts to use a filter to
+   satisfy a search.  If the requested sort does not include
+   score, then the filterCache will be checked for a filter
+   matching the query. If found, the filter will be used as the
+   source of document ids, and then the sort will be applied to
+   that.
+
+   For most situations, this will not be useful unless you
+   frequently get the same search repeatedly with different sort
+   options, and none of them ever use "score"
+-->
+    <!--
+       <useFilterForSortedQuery>true</useFilterForSortedQuery>
+      -->
+
     <!-- Query Related Event Listeners
 
          Various IndexSearcher related events can trigger Listeners to
@@ -569,6 +589,64 @@
 
   </query>
 
+  <!-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     Circuit Breaker Section - This section consists of configurations for
+     circuit breakers
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -->
+
+    <!-- Circuit Breakers
+
+     Circuit breakers are designed to allow stability and predictable query
+     execution. They prevent operations that can take down the node and cause
+     noisy neighbour issues.
+
+     This flag is the uber control switch which controls the activation/deactivation of all circuit
+     breakers. If a circuit breaker wishes to be independently configurable,
+     they are free to add their specific configuration but need to ensure that this flag is always
+     respected - this should have veto over all independent configuration flags.
+    -->
+    <circuitBreakers enabled="true">
+
+    <!-- Memory Circuit Breaker Configuration
+
+     Specific configuration for max JVM heap usage circuit breaker. This configuration defines whether
+     the circuit breaker is enabled and the threshold percentage of maximum heap allocated beyond which queries will be rejected until the
+     current JVM usage goes below the threshold. The valid value range for this value is 50-95.
+
+     Consider a scenario where the max heap allocated is 4 GB and memoryCircuitBreakerThreshold is
+     defined as 75. Threshold JVM usage will be 4 * 0.75 = 3 GB. Its generally a good idea to keep this value between 75 - 80% of maximum heap
+     allocated.
+
+     If, at any point, the current JVM heap usage goes above 3 GB, queries will be rejected until the heap usage goes below 3 GB again.
+     If you see queries getting rejected with 503 error code, check for "Circuit Breakers tripped"
+     in logs and the corresponding error message should tell you what transpired (if the failure
+     was caused by tripped circuit breakers).
+    -->
+    <!--
+   <memBreaker enabled="true" threshold="75"/>
+    -->
+
+      <!-- CPU Circuit Breaker Configuration
+
+     Specific configuration for CPU utilization based circuit breaker. This configuration defines whether the circuit breaker is enabled
+     and the average load over the last minute at which the circuit breaker should start rejecting queries.
+
+     Consider a scenario where the max heap allocated is 4 GB and memoryCircuitBreakerThreshold is
+     defined as 75. Threshold JVM usage will be 4 * 0.75 = 3 GB. Its generally a good idea to keep this value between 75 - 80% of maximum heap
+     allocated.
+    -->
+
+      <!--
+       <cpuBreaker enabled="true" threshold="75"/>
+      -->
+
+  </circuitBreakers>
+
 
   <!-- Request Dispatcher

These are definitely changes we made. I don't know why they happened (it's really tricky to trace their origin), and I don't know if this is actually used.

@@ -693,48 +771,6 @@
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
-      <str name="defType">edismax</str>
-      <float name="tie">0.075</float>
-        <str name="qf">
-            dvName^400
-            authorName^180
-            dvSubject^190
-            dvDescription^180
-            dvAffiliation^170
-            title^130
-            subject^120
-            keyword^110
-            topicClassValue^100
-            dsDescriptionValue^90
-            authorAffiliation^80
-            publicationCitation^60
-            producerName^50
-            fileName^30
-            fileDescription^30
-            variableLabel^20
-            variableName^10
-            _text_^1.0
-        </str>
-        <str name="pf">
-            dvName^200
-            authorName^100
-            dvSubject^100
-            dvDescription^100
-            dvAffiliation^100
-            title^75
-            subject^75
-            keyword^75
-            topicClassValue^75
-            dsDescriptionValue^75
-            authorAffiliation^75
-            publicationCitation^75
-            producerName^75
-        </str>
-        <!-- Even though this number is huge it only seems to apply a boost of ~1.5x to final result -MAD 4.9.3--> 
-        <str name="bq">
-            isHarvested:false^25000
-        </str>
-
       <!-- Default search field
          <str name="df">text</str> 
         -->
@@ -805,43 +841,12 @@
     </lst>
   </requestHandler>

More upstream changes that should be incorporated. (Seems related to the same change in apache/lucene-solr@dce36c1.)

-
-  <!-- A Robust Example
-
-       This example SearchHandler declaration shows off usage of the
-       SearchHandler with many defaults declared
-
-       Note that multiple instances of the same Request Handler
-       (SearchHandler) can be registered multiple times with different
-       names (and different init parameters)
-    -->
-  <requestHandler name="/browse" class="solr.SearchHandler" useParams="query,facets,velocity,browse">
-    <lst name="defaults">
-      <str name="echoParams">explicit</str>
-    </lst>
-  </requestHandler>
-
-  <initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
+  <initParams path="/update/**,/query,/select,/spell">
     <lst name="defaults">
       <str name="df">_text_</str>
     </lst>
   </initParams>
 
-  <!-- Solr Cell Update Request Handler
-
-       http://wiki.apache.org/solr/ExtractingRequestHandler
-
-    -->
-  <requestHandler name="/update/extract"
-                  startup="lazy"
-                  class="solr.extraction.ExtractingRequestHandler" >
-    <lst name="defaults">
-      <str name="lowernames">true</str>
-      <str name="fmap.meta">ignored_</str>
-      <str name="fmap.content">_text_</str>
-    </lst>
-  </requestHandler>
-
   <!-- Search Components
 
        Search components are registered to SolrCore and used by
@@ -972,30 +977,6 @@
     </arr>
   </requestHandler>
 
-  <!-- Term Vector Component
-
-       http://wiki.apache.org/solr/TermVectorComponent
-    -->
-  <searchComponent name="tvComponent" class="solr.TermVectorComponent"/>
-
-  <!-- A request handler for demonstrating the term vector component
-
-       This is purely as an example.
-
-       In reality you will likely want to add the component to your
-       already specified request handlers.
-    -->
-  <requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
-    <lst name="defaults">
-      <bool name="tv">true</bool>
-    </lst>
-    <arr name="last-components">
-      <str>tvComponent</str>
-    </arr>
-  </requestHandler>
-
-  <!-- Clustering Component. (Omitted here. See the default Solr example for a typical configuration.) -->
-
   <!-- Terms Component
 
        http://wiki.apache.org/solr/TermsComponent
@@ -1016,30 +997,6 @@
     </arr>
   </requestHandler>
 
-
-  <!-- Query Elevation Component
-
-       http://wiki.apache.org/solr/QueryElevationComponent
-
-       a search component that enables you to configure the top
-       results for a given query regardless of the normal lucene
-       scoring.
-    -->
-  <searchComponent name="elevator" class="solr.QueryElevationComponent" >
-    <!-- pick a fieldType to analyze queries -->
-    <str name="queryFieldType">string</str>
-  </searchComponent>
-
-  <!-- A request handler for demonstrating the elevator component -->
-  <requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
-    <lst name="defaults">
-      <str name="echoParams">explicit</str>
-    </lst>
-    <arr name="last-components">
-      <str>elevator</str>
-    </arr>
-  </requestHandler>
-
   <!-- Highlighting Component
 
        http://wiki.apache.org/solr/HighlightingParameters

🚨 THIS IS CRUCIAL FOR US. Newer versions of Solr default to the managed schema factory that @pkiraly suggested in #5989.

@@ -1170,8 +1127,6 @@
 
        See http://wiki.apache.org/solr/GuessingFieldTypes
     -->
-<schemaFactory class="ClassicIndexSchemaFactory"/>
-
   <updateProcessor class="solr.UUIDUpdateProcessorFactory" name="uuid"/>
   <updateProcessor class="solr.RemoveBlankFieldUpdateProcessorFactory" name="remove-blank"/>
   <updateProcessor class="solr.FieldNameMutatingUpdateProcessorFactory" name="field-name-mutating">

These have been changed upstream; as the new format strings use patterns with optional sections (instead of one entry per variant), they should be OK to incorporate.

@@ -1183,28 +1138,16 @@
   <updateProcessor class="solr.ParseDoubleFieldUpdateProcessorFactory" name="parse-double"/>
   <updateProcessor class="solr.ParseDateFieldUpdateProcessorFactory" name="parse-date">
     <arr name="format">
-      <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
-      <str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
-      <str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
-      <str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
-      <str>yyyy-MM-dd'T'HH:mm:ssZ</str>
-      <str>yyyy-MM-dd'T'HH:mm:ss</str>
-      <str>yyyy-MM-dd'T'HH:mmZ</str>
-      <str>yyyy-MM-dd'T'HH:mm</str>
-      <str>yyyy-MM-dd HH:mm:ss.SSSZ</str>
-      <str>yyyy-MM-dd HH:mm:ss,SSSZ</str>
-      <str>yyyy-MM-dd HH:mm:ss.SSS</str>
-      <str>yyyy-MM-dd HH:mm:ss,SSS</str>
-      <str>yyyy-MM-dd HH:mm:ssZ</str>
-      <str>yyyy-MM-dd HH:mm:ss</str>
-      <str>yyyy-MM-dd HH:mmZ</str>
-      <str>yyyy-MM-dd HH:mm</str>
-      <str>yyyy-MM-dd</str>
+      <str>yyyy-MM-dd['T'[HH:mm[:ss[.SSS]][z</str>
+      <str>yyyy-MM-dd['T'[HH:mm[:ss[,SSS]][z</str>
+      <str>yyyy-MM-dd HH:mm[:ss[.SSS]][z</str>
+      <str>yyyy-MM-dd HH:mm[:ss[,SSS]][z</str>
+      <str>[EEE, ]dd MMM yyyy HH:mm[:ss] z</str>
+      <str>EEEE, dd-MMM-yy HH:mm:ss z</str>
+      <str>EEE MMM ppd HH:mm:ss [z ]yyyy</str>
     </arr>
   </updateProcessor>

Is the removal of this processor still a thing?

-
-  <!--Dataverse removed-->
-<!--  <updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields">
+  <updateProcessor class="solr.AddSchemaFieldsUpdateProcessorFactory" name="add-schema-fields">
     <lst name="typeMapping">
       <str name="valueClass">java.lang.String</str>
       <str name="fieldType">text_general</str>
@@ -1212,7 +1155,7 @@
         <str name="dest">*_str</str>
         <int name="maxChars">256</int>
       </lst>
-
+      <!-- Use as default mapping instead of defaultFieldType -->
       <bool name="default">true</bool>
     </lst>
     <lst name="typeMapping">
@@ -1232,11 +1175,11 @@
       <str name="valueClass">java.lang.Number</str>
       <str name="fieldType">pdoubles</str>
     </lst>
-    </updateProcessor> -->
+  </updateProcessor>

We should use the setting to disable this instead of changing the default... 🙈

   <!-- The update.autoCreateFields property can be turned to false to disable schemaless mode -->
-  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:false}"
-           processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date">
+  <updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}"
+           processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
     <processor class="solr.LogUpdateProcessorFactory"/>
     <processor class="solr.DistributedUpdateProcessorFactory"/>
     <processor class="solr.RunUpdateProcessorFactory"/>
@@ -1265,46 +1208,6 @@
      </updateRequestProcessorChain>
     -->

More upstream changes due to the removed libs. It looks like we never configured those.

-  <!-- Language identification
-
-       This example update chain identifies the language of the incoming
-       documents using the langid contrib. The detected language is
-       written to field language_s. No field name mapping is done.
-       The fields used for detection are text, title, subject and description,
-       making this example suitable for detecting languages form full-text
-       rich documents injected via ExtractingRequestHandler.
-       See more about langId at http://wiki.apache.org/solr/LanguageDetection
-    -->
-  <!--
-   <updateRequestProcessorChain name="langid">
-     <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
-       <str name="langid.fl">text,title,subject,description</str>
-       <str name="langid.langField">language_s</str>
-       <str name="langid.fallback">en</str>
-     </processor>
-     <processor class="solr.LogUpdateProcessorFactory" />
-     <processor class="solr.RunUpdateProcessorFactory" />
-   </updateRequestProcessorChain>
-  -->
-
-  <!-- Script update processor
-
-    This example hooks in an update processor implemented using JavaScript.
-
-    See more about the script update processor at http://wiki.apache.org/solr/ScriptUpdateProcessor
-  -->
-  <!--
-    <updateRequestProcessorChain name="script">
-      <processor class="solr.StatelessScriptUpdateProcessorFactory">
-        <str name="script">update-script.js</str>
-        <lst name="params">
-          <str name="config_param">example config parameter</str>
-        </lst>
-      </processor>
-      <processor class="solr.RunUpdateProcessorFactory" />
-    </updateRequestProcessorChain>
-  -->
-
   <!-- Response Writers
 
        http://wiki.apache.org/solr/QueryResponseWriter
@@ -1340,23 +1243,6 @@
     <str name="content-type">text/plain; charset=UTF-8</str>
   </queryResponseWriter>
 
-  <!--
-     Custom response writers can be declared as needed...
-    -->
-  <queryResponseWriter name="velocity" class="solr.VelocityResponseWriter" startup="lazy">
-    <str name="template.base.dir">${velocity.template.base.dir:}</str>
-    <str name="solr.resource.loader.enabled">${velocity.solr.resource.loader.enabled:true}</str>
-    <str name="params.resource.loader.enabled">${velocity.params.resource.loader.enabled:false}</str>
-  </queryResponseWriter>
-
-  <!-- XSLT response writer transforms the XML output by any xslt file found
-       in Solr's conf/xslt directory.  Changes to xslt files are checked for
-       every xsltCacheLifetimeSeconds.
-    -->
-  <queryResponseWriter name="xslt" class="solr.XSLTResponseWriter">
-    <int name="xsltCacheLifetimeSeconds">5</int>
-  </queryResponseWriter>
-
   <!-- Query Parsers
 
        https://lucene.apache.org/solr/guide/query-syntax-and-parsing.html

Conclusion

Instead of maintaining a static config, we should rely on the _default configset and apply our changes to it.
At least this is what I'm going to do in the Dataverse Solr container images.
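
As a sketch of what "applying our changes" could look like mechanically, here is an identity transform with targeted overrides. This is illustrative only, not the final transformation; the two overrides are examples, not our complete set of changes:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- identity template: copy the stock _default solrconfig.xml unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- example override: ship with schemaless mode disabled by default -->
  <xsl:template match="updateRequestProcessorChain[@name='add-unknown-fields-to-the-schema']/@default">
    <xsl:attribute name="default">${update.autoCreateFields:false}</xsl:attribute>
  </xsl:template>

  <!-- example addition: bring back the classic schema factory -->
  <xsl:template match="config">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
      <schemaFactory class="ClassicIndexSchemaFactory"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Run with e.g. xsltproc dataverse.xsl solrconfig.xml > solrconfig-dataverse.xml (file names illustrative).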

poikilotherm changed the title from "Solr 8.8 upgrade - remaining issues" to "Solr 8.8 upgrade - remaining issues with solrconfig.xml" on Mar 8, 2021
@poikilotherm
Contributor Author

Pinging @qqmyers @mheppler @scolapasta @pdurbin @sekmiller here.

@mheppler
Contributor

mheppler commented Mar 8, 2021

Noted, @poikilotherm. Thank you for catching this, opening an issue, and providing all the details. I was already coordinating with @qqmyers and @scolapasta on #7378, and I'll add this new issue to my agenda as well.

Might be worth scheduling another tech hour discussion tomorrow, if there are any questions.

@poikilotherm
Copy link
Contributor Author

poikilotherm commented Mar 8, 2021

I also looked around for upstream changes to schema.xml. There are some, and maybe we should discuss those, too (deprecation of TrieXXXField, some language stuff).
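
For reference, the Trie deprecation will eventually mean swapping Trie-based types for their Points-based equivalents, roughly like this (illustrative field type names, not our actual schema.xml):

  <!-- deprecated since Solr 7, removed in Solr 9: -->
  <fieldType name="plong" class="solr.TrieLongField" precisionStep="0"/>
  <!-- Points-based replacement; docValues enable sorting/faceting: -->
  <fieldType name="plong" class="solr.LongPointField" docValues="true"/>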

@pdurbin
Member

pdurbin commented Mar 8, 2021

@poikilotherm first of all, thanks for creating this issue.

Instead of maintaining a static config, we should rely on the _default configset and apply our changes to it.

This sounds good, but I'm not sure how it would work technically. As a starting point, it probably makes sense to list "our changes" so that we're all on the same page. We know we want "boosting", for example (see #1928 (comment)), but I'm sure there are other tweaks we've made that I'm not thinking of. My guess is that we make fewer than half a dozen changes to the Solr config. Perhaps we should start by listing them in the dev guide so that when we do upgrades developers are aware of them.

poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Dec 21, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Dec 21, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Dec 23, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Dec 23, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Dec 23, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Dec 23, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Dec 23, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Dec 23, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Dec 23, 2021
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Jan 3, 2022
Simple Makefile to download Solr, extract the default configset
and create a Dataverse-flavored one.

- Uses Maven to find the Solr distribution version to download.
- Uses xsltproc to apply our XSLT transformations to solrconfig.xml
- Replaces the managed-schema with the static one we provide
- Zips the configset to make it distributable as an artifact
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Jan 3, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Feb 3, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Feb 3, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Feb 7, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Feb 7, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Feb 7, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 26, 2022
…QSS#7662

Instead of relying on Java-provided exceptions, we want to track line numbers
and other details of the parsing process, so we need custom mechanics.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 26, 2022
Our custom metadata block TSV files follow a certain order of things.
We also do not allow for repetitions or similar. All of this can most
easily be depicted with a state machine, so we know where to send a
line for parsing.

This commit also adds the very basic (empty) POJOs to store the block,
fields and vocabularies in, to enable testing the state transitions.

It also adds constants we rely on, like the trigger char, the comment
intro and the field delimiter.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 26, 2022
The TSV parser needs to verify whether a certain line is a header line
matching the spec. To avoid duplicated validation code, this validator
can be used with an arbitrary list of strings (so it can be reused for
blocks, fields and vocabularies).

As we will need to validate URLs in certain fields, this validator
also offers a helper function to create predicates checking for valid URLs.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 26, 2022
The Block POJO now contains the header specification (it uses the Validator
class to perform the validation) and allows parsing a line into a List.
A later relaxation of the spec, allowing for reordering of fields etc.,
is possible, while the calling code of the parser can reuse the found
header definition.

A builder pattern is used to parse and validate the actual definition.
As the block may only be used once the definition, all fields and
vocabularies have been parsed (if there is an error within the TSV,
the parsing has to fail!), the builder pattern is a natural match.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 26, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 29, 2022
This simple class will make the parser somewhat configurable, so future
changes and command line options can be integrated more easily.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 29, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 29, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 29, 2022
Instead of defining a static trigger, we want to be able to configure
the trigger sign. To that end, we use the keyword only and move the
trigger handling into the ParsingState (which is analysing the line for
state transitions anyway).
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 29, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 29, 2022
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 29, 2022
- Implement first details of the Block POJO
- Change parsing with BlockBuilder to use an internal state with a non-exposed Block object
- The BlockBuilder may manipulate the Block, but after calling build() the calling code will
  have no option to edit the POJO (proper encapsulation and sealing)
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 29, 2022
Add field types and make them usable as predicates for fields.
Add test.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 29, 2022
Predicates are not null safe - need to make validate() check for null
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 29, 2022
Includes all the predicates according to spec, and tests for them.
qqmyers added a commit to QualitativeDataRepository/dataverse that referenced this issue May 4, 2022
luceneMatchVersion update should be the only real change.
qqmyers added a commit to QualitativeDataRepository/dataverse that referenced this issue May 5, 2022
qqmyers added a commit to QualitativeDataRepository/dataverse that referenced this issue May 5, 2022
@gksachin04

gksachin04 commented May 24, 2022

After upgrading from Solr 8.2 to Solr 8.8.2, LTR response has degraded by 35%. Please find the LTR-specific details:

QUERY_DOC_FV

Let me know if there is anything that needs to be added.
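
For context: QUERY_DOC_FV is the feature-value cache from Solr's standard learning-to-rank (LTR) setup; Dataverse's solrconfig.xml does not configure LTR. A typical declaration, as given in the Solr Ref Guide rather than anything Dataverse ships, looks like:

  <cache name="QUERY_DOC_FV"
         class="solr.search.LRUCache"
         size="4096"
         initialSize="2048"
         autowarmCount="4096"
         regenerator="solr.search.NoOpRegenerator"/>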

@pdurbin
Member

pdurbin commented May 24, 2022

@gksachin04 in PR #8415 we already upgraded to Solr 8.11. If you're still having a problem with that version, can you please open a fresh issue? Thanks. And yes, more details would be great. 😄

@gksachin04

@pdurbin Thanks, do you have an LTR-specific configuration in Solr 8.11?

@pdurbin
Member

pdurbin commented May 24, 2022

@gksachin04 sorry, I don't. You might want to ask the community about it: https://groups.google.com/g/dataverse-community
