Elasticsearch for regulator search

Off-the-shelf Elasticsearch will not give a regulator a good search experience. We know, because we've integrated it for several regulators. What follows is what made the difference between "search box that returns 5,000 results sorted by relevance" and "search box that helps a citizen find the regulator's response to their specific complaint type".

Highlighted snippets are mandatory

The default search-result presentation (title + description) gives users no way to assess which of 50 results actually contain the phrase they searched for. Highlighted snippets, which show the query terms in context within the result body, are the single largest UX improvement available. Elasticsearch supports this natively through the highlight section of a search request; we wire it through into the Volto search results component.
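
As a rough sketch of what that request can look like, the function below builds a search body with a highlight section and pulls the snippets back out of a hit. The field names (title, body), the mark tags, and the fragment sizes are assumptions for illustration, not values from any particular deployment.

```python
from typing import Any


def build_highlight_query(user_query: str, size: int = 10) -> dict[str, Any]:
    """Build a search body that asks Elasticsearch for highlighted snippets."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": user_query,
                # Assumed field names; title matches count double.
                "fields": ["title^2", "body"],
            }
        },
        "highlight": {
            # Tags wrapped around each matched term in the snippet.
            "pre_tags": ["<mark>"],
            "post_tags": ["</mark>"],
            "fields": {
                # Up to two fragments of roughly 160 characters per hit.
                "body": {"fragment_size": 160, "number_of_fragments": 2}
            },
        },
    }


def extract_snippets(hit: dict[str, Any]) -> list[str]:
    """Return the highlighted body fragments for one hit, or an empty list."""
    return hit.get("highlight", {}).get("body", [])
```

The front end then renders the fragments from extract_snippets in place of the static description, falling back to the description when the list is empty.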

Faceting that knows the content model

Regulators publish heterogeneous content: investigation reports, complaints summaries, guidance documents, annual reports, media releases, datasets. A flat search across all of them buries what the user actually wants. We configure facets per content type — date ranges that make sense ("reports from 2020 onwards"), classifications relevant to the regulator's subject domain, and language facets for multilingual content.

The work isn't building facets in Elasticsearch — that's the easy part. The work is content modelling: making sure the metadata that supports useful faceting exists on every document, consistently, going back five years. We do that as part of CMS migration projects.
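
To show how small the Elasticsearch side is, here is a minimal sketch of a faceted request. It assumes every document carries content_type and language as keyword fields and published as a date field; those names are hypothetical and depend entirely on the content model.

```python
from typing import Any, Optional


def build_faceted_query(
    text: str,
    content_type: Optional[str] = None,
    published_from: Optional[str] = None,
) -> dict[str, Any]:
    """Build a search body with facet counts and optional facet filters."""
    filters: list[dict[str, Any]] = []
    if content_type:
        # Exact match on an assumed keyword field.
        filters.append({"term": {"content_type": content_type}})
    if published_from:
        # For example "2020-01-01" for "reports from 2020 onwards".
        filters.append({"range": {"published": {"gte": published_from}}})

    return {
        "query": {
            "bool": {
                "must": [{"match": {"body": text}}],
                "filter": filters,
            }
        },
        "aggs": {
            # Each aggregation produces the counts for one facet.
            "content_types": {"terms": {"field": "content_type", "size": 20}},
            "languages": {"terms": {"field": "language", "size": 10}},
            "years": {
                "date_histogram": {
                    "field": "published",
                    "calendar_interval": "year",
                }
            },
        },
    }
```

Because the filters sit inside the query, the counts reflect the filtered result set. A real interface would probably move the selected facet into post_filter so that the other options in the same facet keep their counts.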

Per-language analyzers

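If your regulator publishes in Welsh, Mandarin, Arabic, or Hindi (and many of them do), stemming and tokenisation matter. Elasticsearch ships built-in analyzers for many languages, including Arabic and Hindi, and handles Chinese through the cjk analyzer or the Smart Chinese (smartcn) analysis plugin. Welsh has no built-in analyzer, so it needs a custom one or a fallback to the standard analyzer. Pick the right analyzer per document language, store the language as a document field, and route queries appropriately.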
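
One common way to arrange this, sketched below, is a separate body field per language, each with its own analyzer, plus a keyword field recording the document language. The field naming scheme (body_ar, body_hi, and so on) and the language-to-analyzer table are assumptions; the analyzer names themselves are Elasticsearch built-ins.

```python
from typing import Any

# Built-in analyzer per language code. Anything not listed falls back to
# "standard" until a custom analyzer is written for it.
ANALYZER_BY_LANGUAGE = {
    "en": "english",
    "ar": "arabic",
    "hi": "hindi",
    "zh": "cjk",
}
FALLBACK_ANALYZER = "standard"


def body_field(language: str) -> str:
    """Name of the (hypothetical) per-language body field."""
    return f"body_{language}"


def build_mappings(languages: list[str]) -> dict[str, Any]:
    """Index mappings with one analyzed body field per language."""
    properties: dict[str, Any] = {"language": {"type": "keyword"}}
    for lang in languages:
        properties[body_field(lang)] = {
            "type": "text",
            "analyzer": ANALYZER_BY_LANGUAGE.get(lang, FALLBACK_ANALYZER),
        }
    return {"mappings": {"properties": properties}}


def build_language_query(text: str, language: str) -> dict[str, Any]:
    """Route a query to the field analyzed for the user's language."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {body_field(language): text}}],
                "filter": [{"term": {"language": language}}],
            }
        }
    }
```

For example, build_mappings(["en", "ar", "cy"]) gives the Arabic field the arabic analyzer and leaves the Welsh field on the standard analyzer.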

Cross-language search is a separate, harder problem — answering "what does this regulator say about X" when the document is in a language the user doesn't read. We solve that via translation memory or, increasingly, via the LLM layer (see our Onyx integration).

Synonyms and ontology

Regulators have their own vocabulary. "Adverse outcome" / "complaint" / "incident" / "notification" may all refer to the same regulatory event under different names in different sub-domains. Citizens search using everyday language; the documents use the regulator's terms. A synonym list, maintained by content staff and applied at query time, closes that gap.
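
A minimal sketch of query-time synonyms follows, assuming content staff maintain the list as groups of equivalent terms. The analyzer and filter names (regulator_search, regulator_synonyms) and the body field are hypothetical; the synonym_graph filter and the search_analyzer mapping option are standard Elasticsearch features.

```python
from typing import Any


def to_synonym_rules(groups: list[list[str]]) -> list[str]:
    """Turn groups of equivalent terms into comma-separated synonym rules."""
    return [
        ", ".join(term.strip().lower() for term in group)
        for group in groups
        if len(group) > 1  # a single term has nothing to expand to
    ]


def build_synonym_index(groups: list[list[str]]) -> dict[str, Any]:
    """Index settings that apply the synonyms only when searching."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "regulator_synonyms": {
                        "type": "synonym_graph",
                        "synonyms": to_synonym_rules(groups),
                    }
                },
                "analyzer": {
                    "regulator_search": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "regulator_synonyms"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "body": {
                    "type": "text",
                    # Documents are indexed as written ...
                    "analyzer": "standard",
                    # ... and synonyms expand the user's query instead.
                    "search_analyzer": "regulator_search",
                }
            }
        },
    }
```

Applying the list at search time rather than index time means a change to the synonyms does not require reindexing the documents.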

Boost the recent and the official

"Most recent guidance" should outrank "2014 guidance that mentions the same term more often." Boost by recency, especially on guidance-type documents. Boost by document type when authority matters — official policy outranks a media release on the same topic by default.

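Both boosts can be expressed with a function_score query. The sketch below is one possible arrangement: a Gaussian decay on an assumed published date field, multiplied by a per-type weight. The type names, the weights, and the one-year scale are illustrative assumptions that would need tuning against real queries.

```python
from typing import Any

# Hypothetical authority weights per content type.
TYPE_WEIGHTS = {"policy": 2.0, "guidance": 1.5, "media_release": 0.8}


def build_boosted_query(text: str, recency_scale: str = "365d") -> dict[str, Any]:
    """Wrap a text query so recent and official documents score higher."""
    functions: list[dict[str, Any]] = [
        # The recency multiplier halves for each `recency_scale` of age.
        {
            "gauss": {
                "published": {
                    "origin": "now",
                    "scale": recency_scale,
                    "decay": 0.5,
                }
            }
        }
    ]
    for content_type, weight in TYPE_WEIGHTS.items():
        # Each weight applies only to documents of the matching type.
        functions.append(
            {"filter": {"term": {"content_type": content_type}}, "weight": weight}
        )

    return {
        "query": {
            "function_score": {
                "query": {"match": {"body": text}},
                "functions": functions,
                "score_mode": "multiply",  # combine the matching functions
                "boost_mode": "multiply",  # then scale the text relevance
            }
        }
    }
```

A stronger recency effect for guidance alone could be added as a second decay function with a filter on that type.
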
The non-Elasticsearch wins

After everything above, what makes the biggest difference is the search bar itself. Show recent searches. Show autocomplete suggestions from the actual document corpus. Show "did you mean" for typos. Show how many results there are before the user commits to a search.
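
Two of these features lean on Elasticsearch suggesters. As a hedged sketch, the request below combines a completion suggester for autocomplete with a term suggester for "did you mean". It assumes a completion-type field called title_suggest and a text field called body, both hypothetical names.

```python
from typing import Any, Optional


def build_suggest_body(typed: str) -> dict[str, Any]:
    """Request autocomplete options and spelling corrections in one call."""
    return {
        "size": 0,  # suggestions only, no search hits
        "suggest": {
            "autocomplete": {
                "prefix": typed,
                "completion": {
                    "field": "title_suggest",
                    "size": 5,
                    "skip_duplicates": True,
                },
            },
            "did_you_mean": {
                "text": typed,
                # Only suggest for words that are missing from the index.
                "term": {"field": "body", "suggest_mode": "missing"},
            },
        },
    }


def did_you_mean(response: dict[str, Any]) -> Optional[str]:
    """Rebuild a corrected query from term-suggester output, if any."""
    entries = response.get("suggest", {}).get("did_you_mean", [])
    if not any(entry.get("options") for entry in entries):
        return None  # every word was recognised
    words = [
        entry["options"][0]["text"] if entry.get("options") else entry["text"]
        for entry in entries
    ]
    return " ".join(words)
```

The result count shown before the user commits can come from the count API, which takes the same query and returns only a number.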

Search is a UI problem with a backend underneath. Most regulator search failures we see are UI failures with a perfectly good Elasticsearch cluster behind them.