Bill Dueber’s DynamicField Solr Config

Table of Contents

The goals
Using the dynamic field definitions
Field types

The goals

I started this looking for a configuration that was flexible and resulting in solr documents that made sense to me. I ended up with the following goals:

every field is either stored or indexed — never both.
indexed fields reflect their type in the name
stored fields are either a bare word ('main_author') or a bare word with a _a indicating it’s multivalued (an array)
a single field can result in multiple indexed fields

This results in nice clean stored values (either field or field_a), and takes advantage of the fieldTypes described at the bottom of this document.

Using the dynamic field definitions

A constructed field name has two to four parts, separated by underscores.

The basename. This is the descriptive field name (title, author, etc.)
The fieldtype suffix. There is a mapping of fieldtype suffixes to fieldtypes in the generate_dfields.rb script. Some map to multiple indexed types, and adding to that list is as easy as editing the top of the file.
An optional _stored (that literal string), indicating that the item should be stored.
An optional _single (again that exact string) indicating that this is a single-valued field instead of a multi-valued field.

If _stored and _single are both present, they need to be in that order.

Examples

title_t_stored

Will create both stored and indexed fields, multivalued because we didn’t specify _single

title_t, an indexed text field
title_a, a multivalued stored field ('a' for array, since this is a multi-valued field)

mainauthor_ef_stored_single

mainauthor_e, an indexed exactish field
mainauthor_f, an indexed string suitable for faceting
mainauthor, the stored, single-valued field.

fulltext_t

fulltext_t, just the single indexed, multivalued field with no stored field.

fulltext_t_single

fulltext_t, again the indexed field with no stored field, but this one will complain if you try to send multiple values.

rawmarc

Just the single-valued string field rawmarc

emoji_a

A single multi-valued string field called emoji_a

Fields meant for sorting

Two fieldname suffixes are special-cased: ssort for strings meant as a sort key, and isort for integers (longs, actually, under the hood) meant as a sort key. The idea is that you don’t have to separately store a sort field as, say, title_sort_str just so you can see what it is when debugging.

Examples:

title_ssort will just produce the string field title_ssort; no changes
title_ssort_stored will produce title_ssort, but also a stored string called title_sort (note sort instead of ssort)
Similarly, age_isort_stored will produce both age_isort (indexed long) and age_sort (stored string)

A couple quick notes

Anything that doesn’t match a dynamic field is going to end up as a single-valued stored, unindexed string. In particular, folks that like to use whatever_display for display text can still do so.
Anything that ends is _a will end up as a set of multivalued, stored, unindexed strings.

Known limitations of the dynamic fields

All indexed types are multivalued under the hood. This means that if you define two fields:
- fulltext_tsearch_single
- fulltext_t_single

…then you’ll have overloaded fulltext_t and it will end up with multiple values if you send data for both fulltet_tsearch_single and fulltet_t_single, even though both individual fields are single-valued. There’s no good solution except to be aware of what indexed fields you’re actually producing

There’s no good way to know what’s actually been indexed. This is a limitation of dynamic fields in general, but my schema exacerbates the problem because there’s not a one-to-one mapping between the field name sent to solr (title_t_stored_single) and the actual fields solr has (title_t and title_a in this case)

Why this works

I take advantage of a couple peculiarities of solr:

There’s no penalty (that I can find, anyway) for having a stored, unindexed field and an unstored, indexed field as opposed to a single field that is both stored and indexed
Dynamic fields can be totally ignored (neither indexed nor stored) but still be available for copyFields
Searching a multi-valued field with one value is no different than searching a single-valued field. This allows me to "reuse" indexed field types while allowing the field name actually passed to be used as a gatekeeper for non-multi fields (e.g., if you send multiple values to a single-valued field, it’ll still blow up real nice).

Field types

There are several field type definitions in the [conf/schema](https://github.com/billdueber/solr6_test_conf/tree/master/test_core/conf/schema) directory that might have some advantages over the stock Solr types. Some highlights:

Pre-tokenization manipulation: Some common and/or important text strings are hard to search on, like &, C++_ and A♮. The [common text chain](https://github.com/billdueber/solr6_test_conf/blob/master/test_core/conf/schema/basic_text_chain.xml) I use does reasonably substitutions of these before tokenization, so you can muck with punctuation terms before throwing them out. I also take that opporntunity to do unicode normalization.
text: A basic analyzed text type, built for unicode support (for those of us that have to deal with many languages) and using unicode folding (lowercasing), normalization, and the ICU tokenizer. Forms the basis of all text_leftjustified and exactish
text_leftjustified: The text_leftjustified type will only match a phrase query at the start of a string.
exactish: A replacement of sorts for the String type, for exact matching without taking into account case or most punctuation.
numericID: A relatively specialized type that allows you to extract numeric strings from text, demanding that they be of a certain length (or length range). Currently set up, essentially, for ISSN extraction, but can be adapted for any data where the numeric ID you’re looking for might be buried in other text.
Special library types: …for us library-types. This repo includes a .jar file and fieldTypes that do normalization on ISBNs and LCCNs, so you know index-time and query-time changes are equivalent.

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
bin		bin
configure		configure
lib		lib
test_core		test_core
.gitignore		.gitignore
README.adoc		README.adoc
solr.xml		solr.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bill Dueber’s DynamicField Solr Config

The goals

Using the dynamic field definitions

Examples

Fields meant for sorting

A couple quick notes

Known limitations of the dynamic fields

Why this works

Field types

About

Releases

Packages

Languages

billdueber/solr6_test_conf

Folders and files

Latest commit

History

Repository files navigation

Bill Dueber’s DynamicField Solr Config

The goals

Using the dynamic field definitions

Examples

Fields meant for sorting

A couple quick notes

Known limitations of the dynamic fields

Why this works

Field types

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages