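Zend_Search_Lucene, included in Zend Framework

PHP has many frameworks. One of them is Zend Framework, which I came across while looking for a search engine implemented in PHP. The framework includes the Zend_Search_Lucene package, a PHP port of Lucene, one of the Jakarta projects.

I built a search engine with this package and tested it. As expected, it is fast. However, it slowed down noticeably as the number of documents grew.

My tentative conclusion, which will vary with hardware, is that it is useful for sites that search about 100,000 documents or fewer. The heavy load did not come from finding the documents that match a keyword, but from building an array of the matched documents and sorting it to suit the user's request. I suspect this is a limitation of PHP. If the number of matching documents averages no more than about a thousand, it should remain useful even when the index holds many documents. It should also be useful for understanding search engines and Lucene.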
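
To make the sorting cost concrete, here is a minimal Python sketch (the test above used PHP; Python is used here only for illustration). It assumes hits are hypothetical (doc_id, score) pairs and shows one general way to avoid sorting the entire hit array when only the first page of results is needed. This is a generic technique, not something the package itself is claimed to do.

```python
import heapq
import random


def top_k_hits(hits, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    # nlargest tracks only k candidates at a time, so it does roughly
    # O(n log k) work instead of the O(n log n) of sorting every hit.
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])


def full_sort_hits(hits, k):
    """Baseline: sort every hit by score, then take the first k."""
    return sorted(hits, key=lambda hit: hit[1], reverse=True)[:k]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical result set: 100,000 hits with random scores.
    hits = [(doc_id, rng.random()) for doc_id in range(100_000)]
    assert top_k_hits(hits, 10) == full_sort_hits(hits, 10)
    print(top_k_hits(hits, 3))
```

Both functions return the same first page; the difference is only in how much work is spent on hits the user never sees.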

Introduction to The Solr Enterprise Search Server

Solr in a Nutshell


Solr is a standalone enterprise search server with a web-services-like API. You put documents in it (called “indexing”) via XML over HTTP. You query it via HTTP GET and receive XML results.
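
As a rough illustration of that flow, the Python sketch below builds the XML body for indexing one document and the URL for a GET query. It only constructs the messages and sends nothing. The base URL, the field names, and the helper function names are assumptions made for this example.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumed base URL of a local Solr instance; adjust for a real deployment.
SOLR_BASE = "http://localhost:8983/solr"


def build_add_xml(doc):
    """Serialize one document (dict of field name -> value) as an <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def build_query_url(query, rows=10):
    """Build the HTTP GET URL for a search request."""
    params = urllib.parse.urlencode({"q": query, "rows": rows})
    return f"{SOLR_BASE}/select?{params}"


if __name__ == "__main__":
    # Hypothetical document and query.
    print(build_add_xml({"id": "doc1", "title": "Hello Solr"}))
    print(build_query_url("title:hello"))
```

In a real setup the XML would be sent with an HTTP POST to the server's update handler and followed by a commit before the document becomes searchable; the exact paths depend on the installation.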



  • Advanced Full-Text Search Capabilities
  • Optimized for High Volume Web Traffic
  • Standards Based Open Interfaces – XML and HTTP
  • Comprehensive HTML Administration Interfaces
  • Scalability – Efficient Replication to other Solr Search Servers
  • Flexible and Adaptable with XML configuration
  • Extensible Plugin Architecture

Solr Uses the Lucene Search Library and Extends it!




  • A Real Data Schema, with Dynamic Fields, Unique Keys
  • Powerful Extensions to the Lucene Query Language
  • Support for Dynamic Result Grouping and Filtering
  • Advanced, Configurable Text Analysis
  • Highly Configurable and User Extensible Caching
  • Performance Optimizations
  • External Configuration via XML
  • An Administration Interface
  • Monitorable Logging
  • Fast Incremental Updates and Snapshot Distribution

Detailed Features


Schema




  • Defines the field types and fields of documents
  • Can drive more intelligent processing
  • Declarative Lucene Analyzer specification
  • Dynamic Fields enable on-the-fly addition of fields
  • CopyField functionality allows indexing a single field multiple ways, or combining multiple fields into a single searchable field
  • Explicit types eliminate the need for guessing types of fields
  • External file-based configuration of stopword lists, synonym lists, and protected word lists
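
The dynamic field and CopyField ideas in the list above can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not Solr's implementation: the schema contents, the glob-style patterns, and the function names are all hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical schema: explicit fields, glob-style dynamic field patterns,
# and copy rules (source field -> destination field).
EXPLICIT_FIELDS = {"id": "string", "title": "text", "all_text": "text"}
DYNAMIC_FIELDS = [("*_i", "int"), ("*_s", "string")]
COPY_FIELDS = [("title", "all_text"), ("author_s", "all_text")]


def resolve_type(field_name):
    """Return the declared type of a field, falling back to dynamic patterns."""
    if field_name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[field_name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return field_type
    raise KeyError(f"no field or dynamic field matches {field_name!r}")


def apply_copy_fields(doc):
    """Return a new document in which each copy destination collects a list of values."""
    result = dict(doc)
    for source, dest in COPY_FIELDS:
        if source not in doc:
            continue
        existing = result.get(dest, [])
        if not isinstance(existing, list):
            existing = [existing]
        result[dest] = existing + [doc[source]]
    return result


if __name__ == "__main__":
    print(resolve_type("price_i"))
    print(apply_copy_fields({"id": "1", "title": "Solr", "author_s": "Kim"}))
```

Because types come from the schema rather than from the values, a field such as the hypothetical price_i is treated as an integer without any guessing.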

Query



  • HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby)
  • Highlighted context snippets
  • Faceted Searching based on field values and explicit queries
  • Sort specifications added to query language
  • Constant scoring range and prefix queries – no idf, coord, or lengthNorm factors, and no restriction on the number of terms the query matches.
  • Function Query – influence the score by a function of a field’s numeric value or ordinal
  • Performance Optimizations
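
Several of the query features above are switched on through request parameters. The Python sketch below assembles such a query string; the field names (cat, inStock, name, price) and the helper function are assumptions for illustration, and the parameter set shown is a small subset.

```python
import urllib.parse


def build_search_params(query, facet_fields=(), sort=None,
                        highlight_fields=(), wt="json"):
    """Return a URL-encoded query string for a search request."""
    params = [("q", query), ("wt", wt)]  # wt selects the response format
    if sort:
        params.append(("sort", sort))  # e.g. "price desc"
    if facet_fields:
        params.append(("facet", "true"))
        # One facet.field parameter per field to count values for.
        params.extend(("facet.field", name) for name in facet_fields)
    if highlight_fields:
        params.append(("hl", "true"))
        params.append(("hl.fl", ",".join(highlight_fields)))
    return urllib.parse.urlencode(params)


if __name__ == "__main__":
    print(build_search_params(
        "ipod",
        facet_fields=["cat", "inStock"],
        sort="price desc",
        highlight_fields=["name"],
    ))
```

The resulting string would be appended to the select URL of a running server; which parameters are honored depends on the Solr version and its configuration.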

Core



  • Pluggable query handlers and extensible XML data format
  • Document uniqueness enforcement based on unique key field
  • Batches updates and deletes for high performance
  • User configurable commands triggered on index changes
  • Searcher concurrency control
  • Correct handling of numeric types for both sorting and range queries
  • Ability to control where docs with the sort field missing will be placed
  • Support for dynamic grouping of search results
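
Three of the core behaviors above (unique key enforcement, batched updates and deletes, and placement of documents that lack the sort field) can be mimicked with a toy in-memory index. The TinyIndex class below is hypothetical and says nothing about how Solr stores data internally.

```python
class TinyIndex:
    """Toy in-memory index keyed by a unique field."""

    def __init__(self, unique_key="id"):
        self.unique_key = unique_key
        self.docs = {}

    def add_batch(self, docs):
        """Apply a batch of adds; a later doc with the same key replaces the earlier one."""
        for doc in docs:
            self.docs[doc[self.unique_key]] = doc

    def delete_batch(self, keys):
        """Delete a batch of documents by key, ignoring unknown keys."""
        for key in keys:
            self.docs.pop(key, None)

    def sorted_by(self, field, missing_last=True):
        """Sort docs by a field, placing docs that lack the field first or last."""
        present = [d for d in self.docs.values() if field in d]
        missing = [d for d in self.docs.values() if field not in d]
        present.sort(key=lambda d: d[field])
        return present + missing if missing_last else missing + present


if __name__ == "__main__":
    index = TinyIndex()
    index.add_batch([
        {"id": "a", "price": 30},
        {"id": "b"},                 # no price field
        {"id": "a", "price": 10},    # replaces the first "a"
        {"id": "c", "price": 20},
    ])
    print([d["id"] for d in index.sorted_by("price")])
```

Prices are compared as numbers here, so 10 sorts before 20; comparing them as strings would give a different and usually unwanted order for values such as 9 and 10.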

Caching



  • Configurable Query Result, Filter, and Document cache instances
  • Pluggable Cache implementations
  • Cache warming in background

    • When a new searcher is opened, configurable searches are run against it in order to warm it up to avoid slow first hits. During warming, the current searcher handles live requests.

  • Autowarming in background

    • The most recently accessed items in the caches of the current searcher are re-populated in the new searcher, enabling high cache hit rates across index/searcher changes.

  • Fast/small filter implementation
  • User level caching with autowarming support
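
The autowarming idea above can be sketched as follows. This is a simplified model under stated assumptions: the LRUCache class, the autowarm function, and the new_search callback are hypothetical names, and the real caches are configured in Solr rather than written by hand.

```python
from collections import OrderedDict


class LRUCache:
    """Small LRU cache; the most recently used entries sit at the end."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.max_size:
            self.entries.popitem(last=False)  # evict least recently used


def autowarm(old_cache, new_search, autowarm_count):
    """Build a cache for a new searcher from the old cache's most recent keys.

    Values are recomputed with new_search rather than copied, because the
    index behind the new searcher may have changed.
    """
    new_cache = LRUCache(old_cache.max_size)
    keys = list(old_cache.entries)
    recent_keys = keys[-autowarm_count:] if autowarm_count > 0 else []
    for key in recent_keys:
        new_cache.put(key, new_search(key))
    return new_cache


if __name__ == "__main__":
    old = LRUCache(3)
    for q in ["a", "b", "c"]:
        old.put(q, f"old:{q}")
    old.get("a")  # "a" becomes the most recently used key
    new = autowarm(old, lambda q: f"new:{q}", 2)
    print(list(new.entries.items()))
```

In this model the old cache stays untouched while the new one is filled, which mirrors the point that the current searcher keeps serving live requests during warming.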

Replication



  • Efficient distribution of index parts that have changed via rsync transport
  • Pull strategy allows for easy addition of searchers
  • Configurable distribution interval allows tradeoff between timeliness and cache utilization

Admin Interface



  • Comprehensive statistics on cache utilization, updates, and queries
  • Text analysis debugger, showing result of every stage in an analyzer
  • Web Query Interface with debugging output

    • parsed query output
    • Lucene explain() document score detailing
    • explain score for documents outside of the requested range to debug why a given document wasn’t ranked higher.