I started this looking for a configuration that was flexible and resulted in solr documents that made sense to me. I ended up with the following goals:
- every field is either stored or indexed, never both
- indexed fields reflect their type in the name
- stored fields are either a bare word ('main_author') or a bare word with a `_a` suffix indicating it’s multivalued (an array)
- a single field can result in multiple indexed fields

This results in nice clean stored values (either field or field_a), and takes advantage of the fieldTypes described at the bottom of this document.
A constructed field name has two to four parts, separated by underscores.
- The basename. This is the descriptive field name (title, author, etc.)
- The fieldtype suffix. There is a mapping of fieldtype suffixes to fieldtypes in the `generate_dfields.rb` script. Some map to multiple indexed types, and adding to that list is as easy as editing the top of the file.
- An optional `_stored` (that literal string), indicating that the item should be stored.
- An optional `_single` (again, that exact string), indicating that this is a single-valued field instead of a multi-valued field.

If `_stored` and `_single` are both present, they need to be in that order.

- `title_t_stored`
  - Will create both stored and indexed fields, multivalued because we didn’t specify `_single`
  - `title_t`, an indexed `text` field
  - `title_a`, a multivalued stored field ('a' for array, since this is a multi-valued field)
- `mainauthor_ef_stored_single`
  - `mainauthor_e`, an indexed `exactish` field
  - `mainauthor_f`, an indexed string suitable for faceting
  - `mainauthor`, the stored, single-valued field
- `fulltext_t`
  - `fulltext_t`, just the single indexed, multivalued field with no stored field
- `fulltext_t_single`
  - `fulltext_t`, again the indexed field with no stored field, but this one will complain if you try to send multiple values
- `rawmarc`
  - Just the single-valued string field `rawmarc`
- `emoji_a`
  - A single multi-valued string field called `emoji_a`

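To make the naming rules concrete, here is a minimal Ruby sketch of how a constructed field name could be expanded into the fields solr actually ends up with. It is not the code from `generate_dfields.rb`; the `SUFFIX_MAP` table and the `expand_field_name` helper are hypothetical names invented for illustration, and the table only covers the suffixes used in the examples above.

```ruby
# Hypothetical suffix table (an assumption for this sketch). The real
# mapping lives at the top of generate_dfields.rb and may differ.
SUFFIX_MAP = {
  't'  => %w[t],
  'e'  => %w[e],
  'f'  => %w[f],
  'ef' => %w[e f]
}.freeze

# Expand a constructed field name into the indexed field names and the
# (optional) stored field name it implies.
def expand_field_name(name)
  rest   = name.dup
  # Strip the optional markers from the right; _single must come last.
  single = !rest.sub!(/_single\z/, '').nil?
  stored = !rest.sub!(/_stored\z/, '').nil?
  base, _sep, suffix = rest.rpartition('_')
  types = SUFFIX_MAP[suffix]

  # No recognised suffix: the name is passed through as a plain stored field.
  return { indexed: [], stored: name } if base.empty? || types.nil?

  stored_name = nil
  stored_name = single ? base : "#{base}_a" if stored
  { indexed: types.map { |t| "#{base}_#{t}" }, stored: stored_name }
end

p expand_field_name('title_t_stored')
p expand_field_name('mainauthor_ef_stored_single')
p expand_field_name('rawmarc')
```

Stripping `_single` before `_stored` from the right-hand end is what enforces the required ordering: a name with the two markers reversed simply fails to match a known suffix and falls through as a plain stored field.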
Two fieldname suffixes are special-cased: `ssort` for strings meant as a sort key, and `isort` for integers (longs, actually, under the hood) meant as a sort key. The idea is that you don’t have to separately store a sort field as, say, `title_sort_str` just so you can see what it is when debugging.
Examples:
- `title_ssort` will just produce the string field `title_ssort`; no changes
- `title_ssort_stored` will produce `title_ssort`, but also a stored string called `title_sort` (note `sort` instead of `ssort`)
- Similarly, `age_isort_stored` will produce both `age_isort` (indexed long) and `age_sort` (stored string)

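The sort-suffix behavior in these examples can be sketched in a few lines of Ruby. As before, this is an illustration and not the actual generator; `SORT_KINDS` and `expand_sort_field` are hypothetical names, and the type labels are just descriptive strings.

```ruby
# Descriptive labels for the two special-cased sort suffixes
# (hypothetical names, for illustration only).
SORT_KINDS = { 'ssort' => 'string sort key', 'isort' => 'long sort key' }.freeze

# Return a hash of { field name => description } for a sort-style
# field name, or nil if the name does not use a sort suffix.
def expand_sort_field(name)
  m = /\A(?<base>.+)_(?<kind>ssort|isort)(?<stored>_stored)?\z/.match(name)
  return nil unless m

  fields = { "#{m[:base]}_#{m[:kind]}" => SORT_KINDS[m[:kind]] }
  # With _stored, a human-readable copy is kept under <base>_sort.
  fields["#{m[:base]}_sort"] = 'stored string' if m[:stored]
  fields
end

p expand_sort_field('title_ssort')
p expand_sort_field('title_ssort_stored')
p expand_sort_field('age_isort_stored')
```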
- Anything that doesn’t match a dynamic field is going to end up as a single-valued stored, unindexed string. In particular, folks that like to use `whatever_display` for display text can still do so.
- Anything that ends in `_a` will end up as a set of multivalued, stored, unindexed strings.

- All indexed types are multivalued under the hood. This means that if you define two fields:
  - `fulltext_tsearch_single`
  - `fulltext_t_single`

  …then you’ll have overloaded `fulltext_t`, and it will end up with multiple values if you send data for both `fulltext_tsearch_single` and `fulltext_t_single`, even though both individual fields are single-valued. There’s no good solution except to be aware of what indexed fields you’re actually producing.
- There’s no good way to know what’s actually been indexed. This is a limitation of dynamic fields in general, but my schema exacerbates the problem because there’s not a one-to-one mapping between the field name sent to solr (`title_t_stored_single`) and the actual fields solr has (`title_t` and `title` in this case).

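One way to stay aware of the overloading caveat is to check a list of constructed field names for indexed fields that more than one of them produces. The Ruby sketch below does that under stated assumptions: `SUFFIX_TO_TYPES`, `indexed_fields`, and `overloaded_indexed_fields` are hypothetical names, and the `tsearch` entry is a guess at a suffix that maps to `t` plus one other type, chosen only so the example collides the way the caveat describes.

```ruby
# Hypothetical suffix table; 'tsearch' => t + search is an assumption
# made for this example, not the real mapping.
SUFFIX_TO_TYPES = {
  't'       => %w[t],
  'tsearch' => %w[t search]
}.freeze

# Indexed field names implied by one constructed field name.
def indexed_fields(name)
  rest = name.sub(/(_stored)?(_single)?\z/, '')
  base, _sep, suffix = rest.rpartition('_')
  SUFFIX_TO_TYPES.fetch(suffix, []).map { |t| "#{base}_#{t}" }
end

# Map each indexed field fed by more than one constructed name to the
# list of names that feed it.
def overloaded_indexed_fields(names)
  producers = Hash.new { |h, k| h[k] = [] }
  names.each do |name|
    indexed_fields(name).each { |field| producers[field] << name }
  end
  producers.select { |_field, sources| sources.size > 1 }
end

p overloaded_indexed_fields(%w[fulltext_tsearch_single fulltext_t_single])
```

Running a check like this over the field names an indexing job sends is a cheap guard, though it only helps if the suffix table matches the one the schema was generated from.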
I take advantage of a couple peculiarities of solr:
- There’s no penalty (that I can find, anyway) for having a stored, unindexed field and an unstored, indexed field as opposed to a single field that is both stored and indexed.
- Dynamic fields can be totally ignored (neither indexed nor stored) but still be available for copyFields.
- Searching a multi-valued field with one value is no different than searching a single-valued field. This allows me to "reuse" indexed field types while allowing the field name actually passed to be used as a gatekeeper for non-multi fields (e.g., if you send multiple values to a single-valued field, it’ll still blow up real nice).

There are several field type definitions in the [conf/schema](/~https://github.com/billdueber/solr6_test_conf/tree/master/test_core/conf/schema) directory that might have some advantages over the stock Solr types. Some highlights:
- Pre-tokenization manipulation
  - Some common and/or important text strings are hard to search on, like `&`, `C++`, and `A♮`. The [common text chain](/~https://github.com/billdueber/solr6_test_conf/blob/master/test_core/conf/schema/basic_text_chain.xml) I use does reasonable substitutions of these before tokenization, so you can muck with punctuation terms before throwing them out. I also take that opportunity to do unicode normalization.
- `text`
  - A basic analyzed text type, built for unicode support (for those of us that have to deal with many languages) and using unicode folding (lowercasing), normalization, and the ICU tokenizer. Forms the basis of `text_leftjustified` and `exactish`.
- `text_leftjustified`
  - The `text_leftjustified` type will only match a phrase query at the start of a string.
- `exactish`
  - A replacement of sorts for the String type, for exact matching without taking into account case or most punctuation.
- `numericID`
  - A relatively specialized type that allows you to extract numeric strings from text, demanding that they be of a certain length (or length range). Currently set up, essentially, for ISSN extraction, but can be adapted for any data where the numeric ID you’re looking for might be buried in other text.
- Special library types
  - …for us library-types. This repo includes a .jar file and fieldTypes that do normalization on ISBNs and LCCNs, so you know index-time and query-time changes are equivalent.
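
As a rough illustration of the idea behind `numericID` (pulling digit runs of a required length out of surrounding text), here is a small Ruby sketch. It is not the Solr analysis chain from this repo, just the same idea in plain Ruby; `extract_numeric_ids` and its parameters are hypothetical, and it deliberately ignores details such as an ISSN check digit of `X`.

```ruby
# Pull out runs of digits whose length falls within [min_len, max_len].
# Hyphens between digits are dropped first, so "1234-5678" counts as
# one eight-digit run. (Hypothetical helper, for illustration only.)
def extract_numeric_ids(text, min_len: 8, max_len: 8)
  text.gsub(/(?<=\d)-(?=\d)/, '')
      .scan(/\d+/)
      .select { |run| run.length.between?(min_len, max_len) }
end

p extract_numeric_ids('ISSN 0028-0836 (print), vol. 12')
p extract_numeric_ids('ids 12345 and 1234567', min_len: 5, max_len: 7)
```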