Introduction to The Solr Enterprise Search Server

Solr in a Nutshell

Solr is a standalone enterprise search server with a web-services-like API. You put documents into it (called “indexing”) via XML over HTTP. You query it via HTTP GET and receive XML results.
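As a rough illustration of that round trip, the sketch below builds an XML update message and a query URL without contacting a server. The base URL, the `/select` path, and the field names `id` and `title` are assumptions made for the example; none of them come from the text above.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

# Assumed base URL; the text does not name a host, port, or path.
SOLR_BASE = "http://localhost:8983/solr"


def build_add_xml(doc):
    """Serialize a dict of field values as an <add><doc> update message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def build_query_url(query):
    """Build the HTTP GET URL for a simple search (path is an assumption)."""
    return f"{SOLR_BASE}/select?{urlencode({'q': query})}"


# Hypothetical document and query, for illustration only.
add_xml = build_add_xml({"id": "doc1", "title": "Hello Solr"})
query_url = build_query_url("title:hello")
print(add_xml)
print(query_url)
```

The XML string would be sent in the body of an HTTP POST to the update endpoint, and the URL would be fetched with a plain GET.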

Advanced Full-Text Search Capabilities
Optimized for High Volume Web Traffic
Standards Based Open Interfaces – XML and HTTP
Comprehensive HTML Administration Interfaces
Scalability – Efficient Replication to other Solr Search Servers
Flexible and Adaptable with XML configuration
Extensible Plugin Architecture

Solr Uses the Lucene Search Library and Extends it!

A Real Data Schema, with Dynamic Fields, Unique Keys
Powerful Extensions to the Lucene Query Language
Support for Dynamic Result Grouping and Filtering
Advanced, Configurable Text Analysis
Highly Configurable and User Extensible Caching
Performance Optimizations
External Configuration via XML
An Administration Interface
Monitorable Logging
Fast Incremental Updates and Snapshot Distribution

Detailed Features

Schema

Defines the field types and fields of documents
Can drive more intelligent processing
Declarative Lucene Analyzer specification
Dynamic Fields enable on-the-fly addition of fields
CopyField functionality allows indexing a single field multiple ways, or combining multiple fields into a single searchable field
Explicit types eliminate the need to guess the types of fields
External file-based configuration of stopword lists, synonym lists, and protected word lists
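The dynamic field and CopyField ideas above can be sketched in a few lines. This is a toy model, not Solr's implementation: the field names, the `*_i` and `*_s` patterns, and the `text_all` destination are all hypothetical, and glob matching via `fnmatchcase` is a stand-in for whatever matching the schema actually performs.

```python
from fnmatch import fnmatchcase

# Hypothetical schema; names, patterns, and types are illustrative only.
FIELDS = {"id": "string", "title": "text"}
DYNAMIC_FIELDS = [("*_i", "int"), ("*_s", "string")]
COPY_FIELDS = [("title", "text_all"), ("author_s", "text_all")]


def resolve_type(name):
    """Return the declared type for a field, falling back to dynamic patterns."""
    if name in FIELDS:
        return FIELDS[name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type
    raise KeyError(f"no field or dynamic field matches {name!r}")


def apply_copy_fields(doc):
    """Copy source field values into their destination fields."""
    out = {name: [value] for name, value in doc.items()}
    for source, dest in COPY_FIELDS:
        if source in doc:
            out.setdefault(dest, []).append(doc[source])
    return out


print(resolve_type("price_i"))
print(apply_copy_fields({"title": "Solr", "author_s": "Ann"}))
```

Here `price_i` was never declared explicitly but still resolves to a type, and both `title` and `author_s` end up searchable through the single combined field.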

Query

HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby)
Highlighted context snippets
Faceted Searching based on field values and explicit queries
Sort specifications added to query language
Constant-scoring range and prefix queries – no idf, coord, or lengthNorm factors, and no restriction on the number of terms the query matches
Function Query – influence the score by a function of a field’s numeric value or ordinal
Performance Optimizations
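A request that combines several of the query features listed above might be assembled as follows. The host, path, and field names (`title`, `category`, `price`) are assumptions. The parameter names (`wt`, `hl`, `hl.fl`, `facet`, `facet.field`, `facet.query`, `sort`) are standard Solr request parameters, though the exact sort syntax has varied between releases, so check the version in use.

```python
from urllib.parse import urlencode

# Assumed endpoint; not taken from the text above.
SELECT_URL = "http://localhost:8983/solr/select"


def build_search_url(query):
    """Build a search URL with highlighting, faceting, and sorting enabled."""
    params = [
        ("q", query),
        ("wt", "json"),                       # response format
        ("hl", "true"),                       # highlighted context snippets
        ("hl.fl", "title"),                   # field to highlight (hypothetical)
        ("facet", "true"),
        ("facet.field", "category"),          # facet on field values
        ("facet.query", "price:[0 TO 100]"),  # facet on an explicit query
        ("sort", "price desc"),
    ]
    return SELECT_URL + "?" + urlencode(params)


search_url = build_search_url("video")
print(search_url)
```

A list of pairs is used rather than a dict because parameters such as `facet.field` may legitimately appear more than once in a single request.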

Core

Pluggable query handlers and extensible XML data format
Document uniqueness enforcement based on unique key field
Batches updates and deletes for high performance
User configurable commands triggered on index changes
Searcher concurrency control
Correct handling of numeric types for both sorting and range queries
Ability to control where docs with the sort field missing will be placed
Support for dynamic grouping of search results
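The two sorting points above (numeric-aware ordering and control over documents that lack the sort field) can be illustrated with a small in-memory sketch. The function name, the `missing_last` flag, and the sample documents are hypothetical; this models the behaviour described, not how Solr implements it.

```python
def sort_docs(docs, field, missing_last=True, reverse=False):
    """Sort dicts by a numeric field, placing docs without it first or last."""
    present = [d for d in docs if d.get(field) is not None]
    missing = [d for d in docs if d.get(field) is None]
    # Values are compared as numbers, so 9 sorts before 10
    # (as strings, "10" would sort before "9").
    present.sort(key=lambda d: d[field], reverse=reverse)
    return present + missing if missing_last else missing + present


docs = [
    {"id": "a", "price": 10},
    {"id": "b"},               # no price at all
    {"id": "c", "price": 9},
]
print([d["id"] for d in sort_docs(docs, "price")])
print([d["id"] for d in sort_docs(docs, "price", missing_last=False)])
```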

Caching

Configurable Query Result, Filter, and Document cache instances
Pluggable Cache implementations
Cache warming in background
- When a new searcher is opened, configurable searches are run against it in order to warm it up to avoid slow first hits. During warming, the current searcher handles live requests.
Autowarming in background
- The most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabling high cache hit rates across index/searcher changes.
Fast/small filter implementation
User level caching with autowarming support
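The autowarming idea described above can be sketched with a small LRU cache: when a new searcher is opened, the most recently used keys from the old cache are recomputed against the new index and placed in the new cache. The class, the `autowarm` function, and the `compute` callback are all hypothetical names for illustration, and the real caches are more involved than this.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal least-recently-used cache keyed on arbitrary hashable keys."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used

    def most_recent_keys(self, count):
        """Return up to `count` keys, most recently used first."""
        return list(reversed(self._items))[:count]


def autowarm(old_cache, compute, autowarm_count):
    """Build a cache for the new searcher from the old cache's hottest keys."""
    new_cache = LRUCache(old_cache.capacity)
    # Insert the least recent of the selected keys first so that the
    # recency order carries over into the new cache.
    for key in reversed(old_cache.most_recent_keys(autowarm_count)):
        new_cache.put(key, compute(key))  # recompute against the new index
    return new_cache


old = LRUCache(capacity=10)
for q in ("a", "b", "c"):
    old.put(q, q.upper())
old.get("a")  # "a" becomes the most recently used entry
warmed = autowarm(old, compute=lambda q: q.upper(), autowarm_count=2)
print(warmed.most_recent_keys(2))
```

Values are recomputed rather than copied because results cached against the old index may no longer be valid for the new one.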

Replication

Efficient distribution of index parts that have changed via rsync transport
Pull strategy allows for easy addition of searchers
Configurable distribution interval allows tradeoff between timeliness and cache utilization
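To make the "only the parts that have changed" idea concrete, the sketch below compares two file listings and reports which files a searcher would need to pull. It is an illustration of the concept only: it is not how rsync decides what to transfer, and the function name, file names, and the (size, mtime) signature are all assumptions.

```python
def files_to_pull(master_listing, searcher_listing):
    """Return names of files that are new or differ on the master.

    Each listing maps a file name to a (size, mtime) tuple.
    """
    return sorted(
        name
        for name, signature in master_listing.items()
        if searcher_listing.get(name) != signature
    )


# Hypothetical index file listings.
master = {"seg_1": (100, 1), "seg_2": (250, 5), "seg_3": (40, 9)}
searcher = {"seg_1": (100, 1), "seg_2": (200, 3)}
changed = files_to_pull(master, searcher)
print(changed)
```

Because each searcher runs this comparison itself on its own schedule, adding another searcher needs no change on the master, which is the appeal of the pull strategy.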

Admin Interface

Comprehensive statistics on cache utilization, updates, and queries
Text analysis debugger, showing result of every stage in an analyzer
Web Query Interface with debugging output
- parsed query output
- Lucene explain() document score detailing
- explain score for documents outside of the requested range to debug why a given document wasn’t ranked higher.