Skip to content

A solr6+ schema experiment focusing on easy indexing and clean stored field names

Notifications You must be signed in to change notification settings

billdueber/solr6_test_conf

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

30 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Bill Dueber’s DynamicField Solr Config

The goals

I started this looking for a configuration that was flexible and resulting in solr documents that made sense to me. I ended up with the following goals:

  • every field is either stored or indexed — never both.

  • indexed fields reflect their type in the name

  • stored fields are either a bare word ('main_author') or a bare word with a _a indicating it’s multivalued (an array)

  • a single field can result in multiple indexed fields

This results in nice clean stored values (either field or field_a), and takes advantage of the fieldTypes described at the bottom of this document.

Using the dynamic field definitions

A constructed field name has two to four parts, separated by underscores.

  • The basename. This is the descriptive field name (title, author, etc.)

  • The fieldtype suffix. There is a mapping of fieldtype suffixes to fieldtypes in the generate_dfields.rb script. Some map to multiple indexed types, and adding to that list is as easy as editing the top of the file.

  • An optional _stored (that literal string), indicating that the item should be stored.

  • An optional _single (again that exact string) indicating that this is a single-valued field instead of a multi-valued field.

If _stored and _single are both present, they need to be in that order.

Examples

title_t_stored

Will create both stored and indexed fields, multivalued because we didn’t specify _single

  • title_t, an indexed text field

  • title_a, a multivalued stored field ('a' for array, since this is a multi-valued field)

mainauthor_ef_stored_single
  • mainauthor_e, an indexed exactish field

  • mainauthor_f, an indexed string suitable for faceting

  • mainauthor, the stored, single-valued field.

fulltext_t
  • fulltext_t, just the single indexed, multivalued field with no stored field.

fulltext_t_single
  • fulltext_t, again the indexed field with no stored field, but this one will complain if you try to send multiple values.

rawmarc
  • Just the single-valued string field rawmarc

emoji_a
  • A single multi-valued string field called emoji_a

Fields meant for sorting

Two fieldname suffixes are special-cased: ssort for strings meant as a sort key, and isort for integers (longs, actually, under the hood) meant as a sort key. The idea is that you don’t have to separately store a sort field as, say, title_sort_str just so you can see what it is when debugging.

Examples:

  • title_ssort will just produce the string field title_ssort; no changes

  • title_ssort_stored will produce title_ssort, but also a stored string called title_sort (note sort instead of ssort)

  • Similarly, age_isort_stored will produce both age_isort (indexed long) and age_sort (stored string)

A couple quick notes

  • Anything that doesn’t match a dynamic field is going to end up as a single-valued stored, unindexed string. In particular, folks that like to use whatever_display for display text can still do so.

  • Anything that ends is _a will end up as a set of multivalued, stored, unindexed strings.

Known limitations of the dynamic fields

  • All indexed types are multivalued under the hood. This means that if you define two fields:

    • fulltext_tsearch_single

    • fulltext_t_single

…​then you’ll have overloaded fulltext_t and it will end up with multiple values if you send data for both fulltet_tsearch_single and fulltet_t_single, even though both individual fields are single-valued. There’s no good solution except to be aware of what indexed fields you’re actually producing

  • There’s no good way to know what’s actually been indexed. This is a limitation of dynamic fields in general, but my schema exacerbates the problem because there’s not a one-to-one mapping between the field name sent to solr (title_t_stored_single) and the actual fields solr has (title_t and title_a in this case)

Why this works

I take advantage of a couple peculiarities of solr:

  • There’s no penalty (that I can find, anyway) for having a stored, unindexed field and an unstored, indexed field as opposed to a single field that is both stored and indexed

  • Dynamic fields can be totally ignored (neither indexed nor stored) but still be available for copyFields

  • Searching a multi-valued field with one value is no different than searching a single-valued field. This allows me to "reuse" indexed field types while allowing the field name actually passed to be used as a gatekeeper for non-multi fields (e.g., if you send multiple values to a single-valued field, it’ll still blow up real nice).

Field types

There are several field type definitions in the [conf/schema](https://github.com/billdueber/solr6_test_conf/tree/master/test_core/conf/schema) directory that might have some advantages over the stock Solr types. Some highlights:

Pre-tokenization manipulation

Some common and/or important text strings are hard to search on, like &, C++_ and A♮. The [common text chain](https://github.com/billdueber/solr6_test_conf/blob/master/test_core/conf/schema/basic_text_chain.xml) I use does reasonably substitutions of these before tokenization, so you can muck with punctuation terms before throwing them out. I also take that opporntunity to do unicode normalization.

text

A basic analyzed text type, built for unicode support (for those of us that have to deal with many languages) and using unicode folding (lowercasing), normalization, and the ICU tokenizer. Forms the basis of all text_leftjustified and exactish

text_leftjustified

The text_leftjustified type will only match a phrase query at the start of a string.

exactish

A replacement of sorts for the String type, for exact matching without taking into account case or most punctuation.

numericID

A relatively specialized type that allows you to extract numeric strings from text, demanding that they be of a certain length (or length range). Currently set up, essentially, for ISSN extraction, but can be adapted for any data where the numeric ID you’re looking for might be buried in other text.

Special library types

…​for us library-types. This repo includes a .jar file and fieldTypes that do normalization on ISBNs and LCCNs, so you know index-time and query-time changes are equivalent.

About

A solr6+ schema experiment focusing on easy indexing and clean stored field names

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages