Although search engines sometimes return the “perfect” document at the top of the results list, this is often not the case.  Searches like “tv”, “las vegas” or “gun control” could mean a lot of things to different people, based in part on what they’re trying to accomplish and on their personal opinions and preferences, so there isn’t one “right” document anyway.

It’s a good idea to present other clickable options in your results list, to let users drill down and tell you more about what they’re looking for.  This gives your site a second chance to get it right!

Two popular types of results list navigators are Taxonomies and Facets.  They have a lot in common: both present clickable links that filter down the results list to a smaller set, and in some systems portions are implemented with the same code.  But typical differences include:

Facets:

  • Usually rely on field-based data / attributes (e.g.: color, size, price, etc.)
  • Data type can be text, numeric or date based
  • Mostly independent of each other
  • Usually have match counts
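
As a rough illustration of match counts and drill-down, here is a minimal sketch in plain Python.  It doesn’t use any particular search engine; the sample documents, field names and helper functions are all made up for the example.

```python
from collections import Counter

# Hypothetical result set: each hit is a dict of field-based attributes.
results = [
    {"name": "Tee A", "color": "red", "size": "M", "price": 12.0},
    {"name": "Tee B", "color": "blue", "size": "M", "price": 15.0},
    {"name": "Tee C", "color": "red", "size": "L", "price": 22.0},
]

def facet_counts(docs, field):
    """Count how many docs carry each value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)

def apply_filter(docs, field, value):
    """Simulate a user clicking one facet value to narrow the list."""
    return [doc for doc in docs if doc.get(field) == value]

# Counts shown next to each clickable color link.
color_counts = facet_counts(results, "color")

# After the user clicks "red", the other facets are recounted
# against the smaller result set.
red_only = apply_filter(results, "color", "red")
size_counts_after_click = facet_counts(red_only, "size")
```

Because the facets are mostly independent, the same two helpers work for any field; a real engine would compute the counts for you as part of the query.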

Taxonomies:

  • Usually Subject or Category based
  • Typically a hierarchy; sometimes you can expand and navigate it as a tree
  • Usually text based
  • May have match counts, but this is not universal
  • Standard taxonomies exist within certain industries
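
To make the hierarchy idea concrete, here is a small sketch that rolls match counts up a category tree.  The documents, the “/”-separated path format and the function names are assumptions for illustration, not any product’s actual data model.

```python
from collections import Counter

# Hypothetical documents, each tagged with one path in a category tree.
docs = [
    {"id": 1, "category": "Electronics/Televisions"},
    {"id": 2, "category": "Electronics/Televisions"},
    {"id": 3, "category": "Electronics/Cameras"},
    {"id": 4, "category": "Home/Kitchen"},
]

def taxonomy_counts(documents, sep="/"):
    """Count matches at every level, so a parent includes its children."""
    counts = Counter()
    for doc in documents:
        parts = doc["category"].split(sep)
        for depth in range(1, len(parts) + 1):
            counts[sep.join(parts[:depth])] += 1
    return counts

def children(counts, parent, sep="/"):
    """Return the direct children of `parent`, for expanding one tree node."""
    prefix = parent + sep
    return {
        path: n
        for path, n in counts.items()
        if path.startswith(prefix) and sep not in path[len(prefix):]
    }

counts = taxonomy_counts(docs)
electronics_children = children(counts, "Electronics")
```

Storing the full path on each document is only one possible design; it keeps the rollup simple, at the cost of re-tagging documents if the tree is reorganized.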

Some large sites actually use both.  For example, a large eCommerce site might have a taxonomy of product categories and subcategories (such as Electronics / Televisions), and then also have specific attributes within each category (such as TV size or screen type).

But as the title suggests, what’s right for a site depends a lot on the type of data being searched, and the types of queries you expect users to run.

Factors that tend to favor Facets:

  • Items being searched have lots of good, well-defined data, such as database fields from a product inventory, or documents with lots of curated Metadata
  • Many similar items that vary by only 1 or 2 properties (e.g.: size, material, manufacturer, etc.)
  • Different types of data (text, numeric, dates, location, etc.)
  • Search engine supports / excels at it
  • Competing sites have it!

Factors that tend to favor Taxonomies:

  • Content is logically group-able into categories and subcategories
  • Documents have fewer well-defined fields
  • Content and users are from an industry that is accustomed to using a Taxonomy
  • Content is from many sources or groups
  • Content with widely distributed location data that has a logical nesting (cities, states, countries, etc.)
  • Dates that may be inconsistent or incomplete (e.g.: only have year or month and year)
  • Industry standard taxonomies exist
  • Somewhat specialized staff are available to create and maintain the taxonomy

You’ll notice that Location appeared in both lists.  In Faceted sites there’s a tendency towards raw longitude / latitude data and letting people search by distance.  In taxonomies it’s usually more of a logical grouping.  Similarly, dates can work in either; in faceted applications they tend to be more precise and consistent, whereas in taxonomies they can be more vague, such as “18th century” or “1970s”.  But as I said above, there’s a lot of overlap between taxonomies and facets, so dates and times can be worked into either type of system.
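
As a sketch of the taxonomy-style treatment of dates, the snippet below turns a bare year into a nested “century / decade” path.  The function names and the path format are invented for this example; it assumes you have at least a year to work with.

```python
def decade_label(year):
    """1975 -> '1970s'."""
    return f"{year // 10 * 10}s"

def century_label(year):
    """1975 -> '20th century', 2001 -> '21st century'."""
    century = (year - 1) // 100 + 1
    if 10 <= century % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th ... are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(century % 10, "th")
    return f"{century}{suffix} century"

def date_path(year):
    """Build a two-level taxonomy path from a year alone."""
    return f"{century_label(year)}/{decade_label(year)}"

path_1975 = date_path(1975)
path_1789 = date_path(1789)
```

A faceted system would more likely keep the full date and offer numeric ranges; this kind of coarse bucketing is handy when the source data only has a year to begin with.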

Facets are more common and probably better understood; if you’re on the fence, I’d say try them first.  For Solr users, you’ll certainly find a lot more examples online.  Taxonomies are a must for some content; just keep in mind that the workflow of creating/obtaining and maintaining them is more complex and implies a longer term, more specialized staffing commitment.
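
For the Solr case, a facet request is mostly a matter of adding a few query parameters.  The sketch below only builds the query string with Python’s standard library; the host, the “products” core and the field names are assumptions, so substitute your own.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust for your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/products/select"

def build_facet_query(user_query, facet_fields, filters=()):
    """Build a query string that asks Solr for hits plus facet counts."""
    params = [("q", user_query), ("rows", "10"), ("facet", "true")]
    # One facet.field parameter per field whose value counts we want back.
    params += [("facet.field", field) for field in facet_fields]
    # Skip facet values that match zero documents.
    params.append(("facet.mincount", "1"))
    # Each facet link the user already clicked becomes a filter query.
    params += [("fq", f) for f in filters]
    return urlencode(params)

query_string = build_facet_query(
    "tv", ["brand", "screen_type"], filters=["brand:Acme"]
)
request_url = f"{SOLR_SELECT_URL}?{query_string}"
```

Nothing here actually contacts a server; it just shows which parameters are involved, and the field names “brand” and “screen_type” stand in for whatever your schema defines.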

And what if your data doesn’t have good Metadata and is not logically grouped into any convenient categories either!?  What if you just have a “giant pile of text”?  This isn’t a great situation to be in, but it’s much more common than you might think.  Generally the idea is to “upgrade your data” with tools that infer additional metadata or category data.  These tools aren’t perfect, so lower your expectations a bit.

One common approach is to use Entity Extraction to find Metadata in the text and add those attributes to each document.  This allows you to display facets after all: things like people, places, company names, amounts of money, dates, etc.  This won’t be perfect: some documents might not have any recognizable entities, and others might have ambiguous references (Paris, France or Paris, Maine?).  If your data has some fields, but the values are somewhat inconsistent, you might just consider a Data Cleansing tool.
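
To give a feel for the idea, here is a deliberately crude stand-in for entity extraction, using regular expressions and a tiny lookup list instead of a real extraction tool.  The patterns, the place list and the output field names are all assumptions for the example.

```python
import re

# Very rough patterns; a real entity extractor is far more capable.
MONEY = re.compile(r"\$\d[\d,]*(?:\.\d+)?")
YEAR = re.compile(r"\b(?:1[5-9]|20)\d{2}\b")

# Tiny hypothetical gazetteer of place names to look for.
PLACES = {"Paris", "Las Vegas", "Maine"}

def extract_entities(text):
    """Return facet-ready fields inferred from raw text."""
    return {
        "money": MONEY.findall(text),
        "year": YEAR.findall(text),
        # Note: "Paris" stays ambiguous (France or Maine?); this sketch
        # makes no attempt to disambiguate.
        "place": sorted(
            p for p in PLACES if re.search(rf"\b{re.escape(p)}\b", text)
        ),
    }

entities = extract_entities("The Paris office spent $1,200.50 in 2019.")
no_entities = extract_entities("Nothing recognizable in this sentence.")
```

The second call shows the “no recognizable entities” case: every list comes back empty, so that document simply wouldn’t appear under any of these facets.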

Another technique to consider is to try to Autocategorize your documents into a Taxonomy.  This is typically done by either scanning your content for keywords to match up with an existing taxonomy, or by clustering similar documents and trying to auto-generate a taxonomy based on word and phrase occurrences.  These tools are perhaps less evolved and, as they say, “your mileage may vary”.
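
The keyword-matching flavor can be sketched in a few lines.  This is only an illustration of the first approach (matching against an existing taxonomy), not the clustering one; the taxonomy, the keyword lists and the scoring rule are made up.

```python
import re

# Hypothetical taxonomy: category path -> keywords that suggest it.
TAXONOMY_KEYWORDS = {
    "Electronics/Televisions": {"tv", "television", "hdtv"},
    "Travel/Las Vegas": {"vegas", "casino"},
    "Politics/Gun Control": {"gun", "firearm"},
}

def categorize(text, min_hits=1):
    """Return matching category paths, best match first."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    scores = {}
    for path, keywords in TAXONOMY_KEYWORDS.items():
        # Score = number of word occurrences that are keywords for this path.
        hits = sum(1 for word in words if word in keywords)
        if hits >= min_hits:
            scores[path] = hits
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda p: (-scores[p], p))

tv_categories = categorize("New HDTV deals: this television beats every TV")
uncategorized = categorize("nothing relevant here")
```

Raising `min_hits` trades coverage for precision: fewer documents get a category, but the ones that do are less likely to be miscategorized by a single stray keyword.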

For web content, one “poor man’s” approach is to just look for patterns in the URLs.  If you can characterize some of the main web sites that the content came from, you can add a source or category field.  Looking at the actual path in URLs might give you some hint about the date, region or group.
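
Here is roughly what that could look like.  The host-to-source mapping and the “/YYYY/MM/” path convention are assumptions; you’d need to look at your own URLs to see which patterns actually exist.

```python
import re
from urllib.parse import urlparse

# Hypothetical mapping from host name to a human-readable source label.
SOURCES = {
    "blog.example.com": "Company Blog",
    "support.example.com": "Support",
}

# Matches paths that contain /YYYY/MM/ (a common blog URL layout).
DATE_IN_PATH = re.compile(r"/((?:19|20)\d{2})/(0[1-9]|1[0-2])(?:/|$)")

def fields_from_url(url):
    """Infer extra fields (source, year, month) from the URL alone."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    fields = {"source": SOURCES.get(host, "Other")}
    match = DATE_IN_PATH.search(parsed.path)
    if match:
        fields["year"], fields["month"] = match.group(1), match.group(2)
    return fields

blog_fields = fields_from_url("https://blog.example.com/2015/03/facets-vs-taxonomies")
other_fields = fields_from_url("https://other.example.org/about")
```

Even the single “source” field produced here is enough to drive one drill-down link in the results list.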

User expectation and experience can make a big difference.  Shoving a dozen different facets at a casual user might just confuse them.  Even if you can only offer one drill-down item, e.g.: “source”, that still might be enough to improve the overall search experience.