One of Solr’s lesser-known features (at least from my perspective) is the StatsComponent. Stats is a feature that was added in Solr 1.4 and enables Solr to compute various statistics for numeric fields in your documents. It even supports computing these statistics per facet constraint from other fields.

Since Primary Election season is ramping up (here in the US, anyway), we’ll demonstrate Solr’s Stats functionality using some data from Data.gov. Specifically, we’ll index the “2009-2010 Candidate Summary File” data from the FEC…

Summary financial information about campaigns for U.S. Senate, U.S. House of Representatives, and President of the United States

That certainly sounds like it should contain some interesting numeric data, and the “Data Dictionary/Variable List” certainly seems to support this. But before we can use Solr to get some statistics from this data, we need to index it. We start by fetching the data, and taking a peek inside….

$ curl -sSL http://www.data.gov/download/1930/xml | xmllint --format - | head -20
<?xml version="1.0"?>
<data.fec.gov xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.fec.gov/data http://www.fec.gov/finance/disclosure/schema/CandidateSummary.xsd">
  <title>Candidate Summary</title>
  <description>Financial information for each candidate who has registered with the FEC or appears on an official state ballot for an election to the U.S. House of Representatives or U.S. Senate during the current two-year election cycle (including special elections).</description>
  <timestamp>Thu, 06 May 2010 05:05:47 EST</timestamp>
  <copyright>Copyright 2010, Federal Election Commission.</copyright>
  <can_sum>
    <lin_ima>http://images.nictusa.com/cgi-bin/fecimg/?H4NY07011</lin_ima>
    <can_id>H4NY07011</can_id>
    <can_nam>ACKERMAN, GARY L.</can_nam>
    <can_off>H</can_off>
    <can_off_sta>NY</can_off_sta>
    <can_off_dis>05</can_off_dis>
    <can_par_aff>DEM</can_par_aff>
    <can_inc_cha_ope_sea>INCUMBENT</can_inc_cha_ope_sea>
    <can_str1>113 DEER RUN</can_str1>
    <can_str2/>
    <can_cit>ROSLYN HEIGHTS</can_cit>
    <can_sta>NY</can_sta>
    <can_zip>11577</can_zip>

We haven’t seen much numeric data yet, but we’ve at least verified that the data seems to match the metadata we’ve been given — an important first step. So now we can set up our basic Solr index. The first step is to create a simple config file for the DataImportHandler (DIH). DIH makes it easy to index from a variety of sources, including XML files. Solr 1.4 comes with an example “rss-data-config.xml” file that demonstrates indexing an RSS Feed, and we’re going to use that as a template for creating our “fec-data-config.xml”….

<dataConfig>
  <dataSource type="HttpDataSource" encoding="UTF-8" />
  <document>
    <entity name="data-gov-1930"
	    pk="id"
	    url="http://www.data.gov/download/1930/xml"
	    processor="XPathEntityProcessor"
	    forEach="/data.fec.gov/can_sum"
	    transformer="DateFormatTransformer,RegexTransformer">
      <field column="id" xpath="/data.fec.gov/can_sum/can_id" />
      <field column="name" xpath="/data.fec.gov/can_sum/can_nam" />
      <field column="url" xpath="/data.fec.gov/can_sum/lin_ima" />
      <field column="office" xpath="/data.fec.gov/can_sum/can_off" />
      <field column="state" xpath="/data.fec.gov/can_sum/can_off_sta" />
      <field column="district" xpath="/data.fec.gov/can_sum/can_off_dis" />
      <field column="party" xpath="/data.fec.gov/can_sum/can_par_aff" />
      <field column="candidate_type" xpath="/data.fec.gov/can_sum/can_inc_cha_ope_sea" />

      <field column="itemized_individual_contrib" xpath="/data.fec.gov/can_sum/ind_ite_con" regex="\$|,|(\.\d{1,2})" replaceWith="" />
      <field column="unitemized_individual_contrib" xpath="/data.fec.gov/can_sum/ind_uni_con" regex="\$|,|(\.\d{1,2})" replaceWith="" />
      <field column="individual_contrib" xpath="/data.fec.gov/can_sum/ind_con" regex="\$|,|(\.\d{1,2})" replaceWith="" />

      <field column="party_contrib" xpath="/data.fec.gov/can_sum/par_com_con" regex="\$|,|(\.\d{1,2})" replaceWith="" />
      <field column="committees_contrib" xpath="/data.fec.gov/can_sum/oth_com_con" regex="\$|,|(\.\d{1,2})" replaceWith="" />
      <field column="candidate_contrib" xpath="/data.fec.gov/can_sum/can_con" regex="\$|,|(\.\d{1,2})" replaceWith="" />
      <field column="total_contrib" xpath="/data.fec.gov/can_sum/tot_con" regex="\$|,|(\.\d{1,2})" replaceWith="" />
      <field column="net_contrib" xpath="/data.fec.gov/can_sum/net_con" regex="\$|,|(\.\d{1,2})" replaceWith="" />

      <field column="candidate_loan" xpath="/data.fec.gov/can_sum/can_loa" regex="\$|,|(\.\d{1,2})" replaceWith="" />
      <field column="other_loan" xpath="/data.fec.gov/can_sum/oth_loa" regex="\$|,|(\.\d{1,2})" replaceWith="" />
      <field column="total_loan" xpath="/data.fec.gov/can_sum/tot_loa" regex="\$|,|(\.\d{1,2})" replaceWith="" />

      <field column="net_operating_expenditures" xpath="/data.fec.gov/can_sum/net_ope_exp" regex="\$|,|(\.\d{1,2})" replaceWith="" />

      <field column="report_date" xpath="/data.fec.gov/can_sum/cov_end_dat" dateTimeFormat="MM/dd/yy" />
    </entity>
  </document>
</dataConfig>

A couple of important things to note here…

  • I’ve taken the somewhat “terse” field names from the source data and expanded them a bit. Core fields have simple names, while field names ending with “contrib” relate to contributions and field names ending with “loan” relate to loans.
  • DIH’s XPathEntityProcessor uses a custom XPath implementation that allows it to support streaming of large XML data sources without building an entire DOM in memory — the downside is that, as a result, it only supports a small subset of XPath. Hence the somewhat verbose expressions.
  • DIH has a NumberFormatTransformer which can be handy for parsing currency, but the Java NumberFormat for en-US Currency has some peculiar ideas about negative values, so we use a simple regex instead. (This also makes it easy for us to ignore pennies.)
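As a quick sanity check, the cleanup regex from the config above is easy to exercise outside of DIH. Here is a small sketch in Python (the sample dollar amounts are made up for illustration):

```python
import re

# Same pattern as in the DIH config: strip dollar signs, commas,
# and a trailing ".NN" cents suffix, leaving a plain integer string.
CURRENCY_JUNK = re.compile(r"\$|,|(\.\d{1,2})")

def clean_currency(raw):
    return CURRENCY_JUNK.sub("", raw)

print(clean_currency("$1,234.56"))        # -> 1234
print(clean_currency("-$14,000,000.00"))  # -> -14000000
```

Note that the minus sign survives untouched, which is exactly the property that made the regex preferable to NumberFormat here.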

Now that we have a DIH config, we need an instance of Solr to use it. Starting with the example configs from Solr 1.4, we edit the fields in our schema.xml to match the fields in the data that we are interested in…

 <fields>
   <field name="id" type="string" multiValued="false" />
   <field name="name" type="text" multiValued="false" />
   <field name="url" type="string" indexed="false" multiValued="false" />
   <field name="report_date" type="date" multiValued="false" />
   <field name="net_operating_expenditures" type="long" multiValued="false" />
   <dynamicField name="*_loan" type="long" multiValued="false" />
   <dynamicField name="*_contrib" type="long" multiValued="false" />
   <dynamicField name="*" type="string" multiValued="false" />
 </fields>
 <uniqueKey>id</uniqueKey>
 <defaultSearchField>name</defaultSearchField>

Note that we’re using dynamicFields to define all of the *_loan and *_contrib fields as a simple way to save time instead of listing them all out — a handy side effect of picking a simple field naming system.

We also make some modifications to our solrconfig.xml file…

<requestHandler name="/load" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">fec-data-config.xml</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>
<requestHandler name="standard" class="solr.SearchHandler" default="true">
   <lst name="defaults">
     <str name="echoParams">none</str>
     <str name="q">*:*</str>
     <int name="rows">0</int>
     <bool name="stats">true</bool>
   </lst>
</requestHandler>

The first section creates an instance of the DIH that knows about our fec-data-config.xml file. The second section is the default SearchHandler with some modifications to its default behavior:

  • Search for all documents: q=*:*
  • Enable statistics: stats=true
  • Hide the rows that match: rows=0 (this is just for convenience; since we’re mainly interested in the stats, we frequently won’t care about the specific rows that match)

With this in place, we start up Solr, instruct DIH to do a “full import”, and monitor it to see when it’s finished (i.e., idle)…

$ curl -sS "http://localhost:8983/solr/load?command=full-import" > /dev/null
$ curl -sS "http://localhost:8983/solr/load" | grep status
 <int name="status">0</int>
<str name="status">busy</str>
<lst name="statusMessages">
$ curl -sS "http://localhost:8983/solr/load" | grep status
 <int name="status">0</int>
<str name="status">idle</str>
<lst name="statusMessages">

About 8 seconds later (mostly due to lag from the FEC webserver), we have our data and we can start running some queries. Without even using the StatsComponent, we can already learn some interesting facts from this data using a basic search/sort:

  • The candidate with the highest net operating expenditures ($14,404,986) is the Senate campaign for Harry Reid (D) in Nevada: /solr/select?rows=1&sort=net_operating_expenditures+desc
  • The Senate candidate who has received the most total loans ($14,000,000) for her campaign is Linda McMahon (R) in Connecticut: /solr/select?q=office:S&rows=1&sort=total_loan+desc
  • The Texas candidate for the House of Representatives who has reported the highest total contributions ($1,983,946) is Chet Edwards (D) in district #17: /solr/select?q=%2Boffice:H+%2Bstate:TX&rows=1&sort=total_contrib+desc

I know what you’re thinking: “These are just basic searches of the raw data — where are the stats?” So here we go….

/solr/select?stats.field=total_contrib&stats.facet=office&stats.facet=party

With this single URL, we ask for three very powerful pieces of information, and get a response back in about 45ms….

  • stats.field=total_contrib – This instructs Solr to compute a set of basic statistics using the total_contrib field of all docs matching our query (which for now is all documents in our index). That simple param gives us the following information:
    <result name="response" numFound="2783" start="0"/>
    ...
    <double name="min">15.0</double>
    <double name="max">1.5645053E7</double>
    <double name="sum">7.35762149E8</double>
    <long name="count">1619</long>
    <long name="missing">1164</long>
    <double name="sumOfSquares">2.038433485484829E15</double>
    <double name="mean">454454.6936380482</double>
    <double name="stddev">1026250.909483362</double>

    So we can see that of the 2783 campaigns that we have data for, 1619 have reported on their total contributions. Of those, the candidate with the smallest amount of total contributions received only $15, while the candidate with the highest total contributions received $15,645,053. The sum total of all contributions reported by all candidates is $735,762,149 — and it’s only May!

  • stats.facet=office – This param tells the StatsComponent that in addition to the “main” stats listed above, we’d also like to see stats “broken down” for each “facet” of the office field. Comparing the stats from the Senate campaigns with those of the House of Representatives, we can see that on “average” the Senate candidates are raking in roughly 4 times what the House candidates are getting ($1,296,275 vs. $328,988), although the sum total of all contributions to House campaigns is roughly twice that of all Senate campaigns ($463,544,421 vs. $272,217,728).
  • stats.facet=party – This tells Solr that in parallel with the other stats we’ve already seen, we also want it to break down the stats on the “party” field. With this we can compare the total contributions based on the political affiliation of candidates:
    <lst name="IND">
      <double name="min">90.0</double>
      <double name="max">31845.0</double>
      <double name="sum">219388.0</double>
      <long name="count">32</long>
      <long name="missing">94</long>
      <double name="sumOfSquares">4.021803544E9</double>
      <double name="mean">6855.875</double>
      <double name="stddev">9012.011410406132</double>
    </lst>
    ...
    <lst name="DEM">
      <double name="min">18.0</double>
      <double name="max">1.2442579E7</double>
      <double name="sum">3.84472068E8</double>
      <long name="count">574</long>
      <long name="missing">279</long>
      <double name="sumOfSquares">1.08397412569238E15</double>
      <double name="mean">669811.9651567944</double>
      <double name="stddev">1200966.840203952</double>
    </lst>
    ...
    <lst name="REP">
      <double name="min">15.0</double>
      <double name="max">1.5645053E7</double>
      <double name="sum">3.49445424E8</double>
      <long name="count">974</long>
      <long name="missing">542</long>
      <double name="sumOfSquares">9.53132479905344E14</double>
      <double name="mean">358773.5359342916</double>
      <double name="stddev">922350.4925797922</double>
    </lst>

    Here we can see that the mean total contributions reported by Democratic candidates is roughly twice as much as that of Republicans, and ten times as much as independents ($669,811 vs. $358,773 vs. $6,855).
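Incidentally, the mean and stddev values in these blocks can be re-derived from the raw sum, count, and sumOfSquares figures Solr returns. A quick sanity check in Python against the DEM numbers above (the sample-standard-deviation formula is my reading of how Solr computes stddev from its running sums, so treat it as an assumption):

```python
import math

# Figures copied from the <lst name="DEM"> stats block above
count = 574
total = 3.84472068e8               # the "sum" value
sum_of_squares = 1.08397412569238e15

mean = total / count
# Sample standard deviation derived from the running sums
variance = (sum_of_squares - total * total / count) / (count - 1)
stddev = math.sqrt(variance)

print(round(mean, 2))    # ~669811.97
print(round(stddev, 2))  # ~1200966.84
```

Both values line up with what Solr reported for the DEM facet, which is a nice confirmation that the “extra” fields like sumOfSquares are there precisely so the derived stats can be combined or recomputed later.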

All of these stats so far have been computed across the entire index — but we can also narrow the scope to analyze stats of only specific documents….

  • Itemized Individual Contributions made to Democrat and Republican Candidates for Senate: /solr/select?q=%2Bparty%3A(DEM+REP)+%2Boffice%3AS&stats.facet=party&stats.field=itemized_individual_contrib
    <lst name="REP">
      <double name="min">15.0</double>
      <double name="max">9543057.0</double>
      <double name="sum">9.2695335E7</double>
      <long name="count">127</long>
      <long name="missing">74</long>
      <double name="sumOfSquares">3.40761264537769E14</double>
      <double name="mean">729884.5275590551</double>
      <double name="stddev">1472241.498436037</double>
    </lst>
    <lst name="DEM">
      <double name="min">24.0</double>
      <double name="max">1.0200223E7</double>
      <double name="sum">1.01949939E8</double>
      <long name="count">69</long>
      <long name="missing">63</long>
      <double name="sumOfSquares">4.58668696973531E14</double>
      <double name="mean">1477535.347826087</double>
      <double name="stddev">2128359.138730218</double>
    </lst>
  • Net operating expenditures for Democratic candidates for the House of Representatives broken down by state: /solr/select?q=%2Bparty%3ADEM+%2Boffice%3AH&stats.facet=state&stats.field=net_operating_expenditures
    ...
    <lst name="MI">
      <double name="min">585.0</double>
      <double name="max">761487.0</double>
      <double name="sum">3046606.0</double>
      <long name="count">18</long>
      <long name="missing">4</long>
      <double name="sumOfSquares">1.32730625872E12</double>
      <double name="mean">169255.88888888888</double>
      <double name="stddev">218504.30768344642</double>
    </lst>
    ...
    <lst name="MA">
      <double name="min">1130.0</double>
      <double name="max">839985.0</double>
      <double name="sum">3454586.0</double>
      <long name="count">11</long>
      <long name="missing">3</long>
      <double name="sumOfSquares">1.650605122526E12</double>
      <double name="mean">314053.2727272727</double>
      <double name="stddev">237840.5102811928</double>
    </lst>
    ...
    <lst name="IL">
      <double name="min">2981.0</double>
      <double name="max">1206092.0</double>
      <double name="sum">1.1289099E7</double>
      <long name="count">39</long>
      <long name="missing">16</long>
      <double name="sumOfSquares">6.798080371243E12</double>
      <double name="mean">289464.07692307694</double>
      <double name="stddev">304798.9738687473</double>
    </lst>
    ...

These statistics are all relatively simplistic — and in truth, would be easier to compute in many other tools. The key takeaway I want people to be aware of is that if your data is already in Solr for searching, Solr can generate these types of stats for any search result quickly and easily.
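And once the numbers come back, they are trivial to consume programmatically, since the stats are plain XML. A minimal sketch using Python’s standard library (it embeds the DEM stats block from earlier rather than querying a live server):

```python
import xml.etree.ElementTree as ET

# A per-facet stats entry as returned by Solr
# (copied from the DEM block shown earlier)
RESPONSE = """
<lst name="DEM">
  <double name="min">18.0</double>
  <double name="max">1.2442579E7</double>
  <double name="sum">3.84472068E8</double>
  <long name="count">574</long>
  <long name="missing">279</long>
  <double name="sumOfSquares">1.08397412569238E15</double>
  <double name="mean">669811.9651567944</double>
  <double name="stddev">1200966.840203952</double>
</lst>
"""

def parse_stats(xml_text):
    """Turn one <lst> stats element into a dict of numbers."""
    root = ET.fromstring(xml_text)
    stats = {}
    for child in root:
        value = child.text
        stats[child.get("name")] = int(value) if child.tag == "long" else float(value)
    return stats

dem = parse_stats(RESPONSE)
print(dem["count"], dem["mean"])
```

From there it is a short step to charting contributions by party, state, or office directly off a live Solr response.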