Apache Solr Search Mastery

 

Peter Wolanin and Robert Douglass work on the Solr search module in Drupal. They both work for Acquia.

Overview

  • What Solr is and how to run it locally
  • Getting Drupal data into Solr
  • Changes in Drupal 7
  • Field API integration
  • Searching Solr from Drupal
  • Modifying what's searched and the results
  • Theming search results

The code examples in this presentation work in Drupal 7; the Drupal 6 module is very similar, with slight modifications.

Solr is a server, just like your web server. It specializes in indexing text so that it can be searched very quickly, and it handles grouping and clustering well, which enables features like faceted search. It's based on the Lucene search engine and exposes it through an HTTP interface.

You send HTTP POST requests containing XML documents to add them to the search index; Solr breaks each document down and indexes it. Then you send HTTP GET requests to the server and get back a result set. Conceptually, it's very similar to inserting data into and querying data from a MySQL database, except that Solr is optimized for text retrieval and grouping above all else. Solr's index is also flat, so you can only search against whole documents, not relationships between documents.
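To make the add-then-query round trip concrete, here is a minimal sketch in plain PHP (no Drupal required). The function names, the example field values and the base URL are assumptions for illustration; /update and /select are Solr's standard request handler paths. It only builds the request body and URL and does not send anything.

```php
<?php
// Hypothetical helper: build the XML body of a Solr "add" request.
// The result would be POSTed to <base>/update with an XML content type.
function example_solr_add_xml(array $fields) {
  $xml = '<add><doc>';
  foreach ($fields as $name => $value) {
    $xml .= '<field name="' . htmlspecialchars($name, ENT_QUOTES | ENT_XML1, 'UTF-8') . '">';
    $xml .= htmlspecialchars($value, ENT_QUOTES | ENT_XML1, 'UTF-8');
    $xml .= '</field>';
  }
  return $xml . '</doc></add>';
}

// Hypothetical helper: build the URL of a GET search request.
function example_solr_select_url($base, $keys, array $params = array()) {
  $params = array('q' => $keys) + $params;
  return $base . '/select?' . http_build_query($params, '', '&');
}

// Example values only; the id format is made up for this sketch.
echo example_solr_add_xml(array('id' => 'node/1', 'title' => 'Cats & dogs')), "\n";
echo example_solr_select_url('http://localhost:8983/solr', 'drupal search', array('rows' => 10)), "\n";
```

Note that documents added this way only become searchable after Solr processes a commit.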

Once you have your Solr server running, you can edit the query URL by hand (it's a clean, REST-style API) to try out different searches.

Getting It Running

Once you have Solr installed, you'll have to replace the schema.xml and solrconfig.xml with the ones from the Drupal module, and then invoke the start.jar:

java -jar start.jar

You can do all this in about five minutes. Then, if you go to localhost:8983/solr/admin you'll get the Solr control panel.

Just as in a database, there are data types and fields with Solr. In the schema.xml file, we define the types of fields. We have things like the body and teaser and title, which you'll recognize from the Drupal world as handling data from nodes. There are some other interesting fields; the site, hash and id fields. We have a concept of multi-site search in our Solr Drupal configuration. We use the hash and site fields to support these multi-site installations. Many of these are String fields, but there are other types.

Indexed vs. stored: indexed means the field is searchable; stored means you can get the original value back from Solr in the results. By using stored fields, you can avoid a lot of node_load() calls when displaying results. This is one reason Solr is faster than Drupal core search.

Solr also supports dynamic fields, which are matched by name pattern. That lets us accommodate things like CCK fields: Solr determines the field type from the field name's prefix.
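The prefix idea can be sketched as a small naming function. This is an illustration only: the function name, the type-to-letter table and the "s_"/"m_" single/multiple convention are assumptions here, and the authoritative list of prefixes is whatever the dynamicField entries in the module's schema.xml define.

```php
<?php
// Illustrative only: derive a Solr field name whose prefix encodes the
// data type and whether the field holds one value or many.
// The prefix table below is an assumption, not copied from schema.xml.
function example_index_key($type, $multiple, $name) {
  $prefixes = array(
    'string' => 's',
    'text' => 't',
    'integer' => 'i',
    'boolean' => 'b',
    'date' => 'd',
  );
  if (!isset($prefixes[$type])) {
    throw new InvalidArgumentException("Unknown index type: $type");
  }
  // 'm_' marks a multi-valued field, 's_' a single-valued one.
  return $prefixes[$type] . ($multiple ? 'm_' : 's_') . $name;
}

echo example_index_key('string', TRUE, 'field_tags'), "\n";   // sm_field_tags
echo example_index_key('integer', FALSE, 'field_year'), "\n"; // is_field_year
```

With a pattern such as "sm_*" declared once in the schema, any new field named this way is accepted without editing schema.xml again.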

Getting a Connection to Solr

You need a connection object whenever you want to query Solr. We have a factory function:

$conn = apachesolr_get_solr($host, $port, $path);

Querying

The classes that you use are based on an interface that we came up with, which has helper functions you can use to help you build a query, such as:

  • get_filters($name)
  • has_filter($field, $value)
  • add_filter($field, $value, $exclude)
  • remove_filter($field, $value)

We also have some functions for getting and setting the search keys:

  • get_keys()
  • set_keys($keys)
  • remove_keys()

Finally you can execute the search:

search($keys = NULL)
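To show how these helpers fit together, here is a self-contained sketch. ExampleSolrQuery is a hypothetical stand-in that only mimics the method names listed above; it is not the module's class, and fq_strings() is an extra helper invented for this example to show how filters could be turned into Solr filter strings.

```php
<?php
// Hypothetical stand-in for the query object described above.
class ExampleSolrQuery {
  protected $keys = '';
  protected $filters = array();

  public function get_keys() { return $this->keys; }
  public function set_keys($keys) { $this->keys = $keys; }
  public function remove_keys() { $this->keys = ''; }

  public function add_filter($field, $value, $exclude = FALSE) {
    $this->filters[] = array(
      'field' => $field,
      'value' => (string) $value,
      'exclude' => (bool) $exclude,
    );
  }

  public function has_filter($field, $value) {
    foreach ($this->filters as $filter) {
      if ($filter['field'] === $field && $filter['value'] === (string) $value) {
        return TRUE;
      }
    }
    return FALSE;
  }

  public function remove_filter($field, $value) {
    $kept = array();
    foreach ($this->filters as $filter) {
      if ($filter['field'] !== $field || $filter['value'] !== (string) $value) {
        $kept[] = $filter;
      }
    }
    $this->filters = $kept;
  }

  // Return all filters, or only those on the named field.
  public function get_filters($name = NULL) {
    $found = array();
    foreach ($this->filters as $filter) {
      if ($name === NULL || $filter['field'] === $name) {
        $found[] = $filter;
      }
    }
    return $found;
  }

  // Invented for this sketch: render filters in Solr's field:value syntax,
  // with a leading "-" for excluded values.
  public function fq_strings() {
    $out = array();
    foreach ($this->filters as $filter) {
      $out[] = ($filter['exclude'] ? '-' : '') . $filter['field'] . ':' . $filter['value'];
    }
    return $out;
  }
}

$query = new ExampleSolrQuery();
$query->set_keys('drupal search');
$query->add_filter('type', 'page');
$query->add_filter('uid', 1, TRUE);    // exclude content by user 1
$query->remove_filter('type', 'page');
$query->add_filter('type', 'story');
print_r($query->fq_strings());         // -uid:1 and type:story
```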

Drupal 7 Changes

  • The $query object itself now has the params array included as $query->params.
  • Previously, you had to get your $solr search object separately, but to simplify life we've added a $query->search() method that passes the call through to the $solr object.
  • Taxonomy in Drupal 7 is now part of the Field API, so we moved a lot of the taxonomy specific code, and it's now much more integrated with the Field API code.
  • We've been able to remove a bunch of code from apachesolr by committing fixes to Drupal 7 search. We have this concept in Drupal 7 of available search implementations vs active implementations.
  • As with Drupal core search, indexing runs on cron, which sends the documents to Solr as XML. You can implement hook_apachesolr_update_index() to customize documents before they're sent to the search index; it works like an _alter() hook.
  • You can also use hook_apachesolr_modify_query() to modify the query before it is sent to Solr.
  • You can control the indexing more precisely using hook_apachesolr_node_exclude() and hook_node_update_index().
  • We can create multiple documents from one node (e.g. a document per comment) using hook_apachesolr_document_handlers().

Example from McGill University

The demo showed how to build custom search pages.

Analysis of an apachesolr search request

POST
hook_menu()
hook_menu_alter()
search_view()
hook_apachesolr_prepare_query($query)
hook_apachesolr_modify_query($query)
$response = $query->search(...)
$results = apachesolr_search_process_response()
theme('search_results', $results)

We used hook_menu() to define our paths, and hook_menu_alter() to swap the page callback so that we could change the displayed items, layout and formatting of the search page.

$query->params parameters

  • rows
  • fl (fields to return)
  • facet
  • hl (highlighting)
  • spellcheck
  • bq, bf, qf (boosting)
  • everything else
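As a rough illustration of how such parameters end up in the request, the sketch below serializes a params array into a query string. The helper name and the field names type and uid are assumptions; the parameter names themselves (rows, fl, facet, facet.field, hl, spellcheck, qf) are standard Solr parameters. A custom encoder is used because Solr expects multi-valued parameters as repeated keys, which PHP's http_build_query() does not produce.

```php
<?php
// Hypothetical helper: encode Solr parameters, repeating the key for
// array values (facet.field=type&facet.field=uid).
function example_solr_query_string(array $params) {
  $pairs = array();
  foreach ($params as $name => $value) {
    foreach ((array) $value as $single) {
      $pairs[] = rawurlencode($name) . '=' . rawurlencode((string) $single);
    }
  }
  return implode('&', $pairs);
}

$params = array(
  'rows' => 10,                          // results per page
  'fl' => 'id,title',                    // stored fields to return
  'facet' => 'true',
  'facet.field' => array('type', 'uid'), // assumed field names
  'hl' => 'true',                        // highlighting
  'spellcheck' => 'true',
  'qf' => 'title^5 body',                // per-field boosts (dismax parser)
);
echo example_solr_query_string($params), "\n";
```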

Solr Resources

If you need to understand Solr at this level:

  • Read the Solr wiki on apache.org
  • There's also a book, Solr 1.4 Enterprise Search Server
  • Lucid Imagination is a company that specializes in Solr; they have a PDF you can download for free
  • Finally, join some Solr mailing lists

You can use hook_apachesolr_prepare_query($query) to set default Solr search parameters, such as a default filter. The user will see these changes, because they're made in prepare_query and not modify_query. The filters are passed to Solr through the fq (filter query) parameter.
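A default filter set this way might look like the sketch below. The module name mymodule and the type:course filter are assumptions for illustration, and the hook signature follows the one shown in these notes; check the module's own API documentation for the exact signature in your version.

```php
<?php
// Sketch of a hook implementation in a hypothetical module "mymodule".
// It restricts searches to one content type unless that filter is
// already present on the query.
function mymodule_apachesolr_prepare_query($query) {
  // 'type' and 'course' are example values, not taken from the module.
  if (!$query->has_filter('type', 'course')) {
    $query->add_filter('type', 'course');
  }
}
```

Checking has_filter() first keeps the hook from adding the same filter twice if it runs more than once for a query.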

Theming

By default, Solr will return a block of text that describes the node. For our course catalog, we wanted to change this. We wanted our snippet to show students which department, level and quarter were associated with each course.

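// To override this in your own theme, use your theme's name as the prefix.
function theme_apachesolr_search_snippets($document, $snippets) {
  // Build the snippet text from the fields on the Solr document.
  return 'whatever you want!';
}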

We also have search-result.tpl.php, which you can use to render a single search result. Remember: if any of the output is user input, run it through check_plain(). Solr can send back the same (unsafe) user input that you originally put in the index. See apachesolr_clean_text() if you want to index text without tags.
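As a sketch of the course catalog snippet with escaping applied, the plain PHP below builds the snippet from a few stored values. The function names and the department, level and quarter array keys are assumptions; example_check_plain() stands in for Drupal's check_plain() so the example runs outside Drupal.

```php
<?php
// Stand-in for Drupal's check_plain(): escape text for safe HTML output.
function example_check_plain($text) {
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Hypothetical snippet builder; the field names are made up for this sketch.
function example_course_snippet(array $fields) {
  $parts = array();
  foreach (array('department', 'level', 'quarter') as $name) {
    if (isset($fields[$name]) && $fields[$name] !== '') {
      // Escape every value that came back from the index.
      $parts[] = ucfirst($name) . ': ' . example_check_plain($fields[$name]);
    }
  }
  return implode(' | ', $parts);
}

echo example_course_snippet(array(
  'department' => 'Physics <b>& Astronomy</b>',
  'level' => '200',
  'quarter' => 'Fall',
)), "\n";
```

Any markup in the stored values comes out escaped rather than rendered, which is the behavior you want for text that originated as user input.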

3 Comments

Extending the highlight result

Nice article!

Greetings from Sacramento where I am working on my law degree. I am trying to display a larger snippet than the default snippet used in ApacheSolr. Currently it will only display about 100 characters per keyword. So if my snippet has:

the law is good and great 100 more characters here

I would like to be able to display *at least* 200 characters regardless of what Solr finds in that snippet. For some reason Drupal likes to truncate the snippets to about 100 characters unless it finds another keyword within those 100, and only then extends the snippet from 100 characters to 200, which is annoying.

Any idea? I am looking at line 1464 apachesolr_search.module but can't track it any further...

Hi there, interesting that you should mention McGill University, since I work there and indeed worked directly on our implementation of apachesolr! Curious to know, how did you know we were using it? Cheers for the article btw!
