A customer that had implemented custom security filtering in Solr 3.x, and then moved to 4.x, recently worked with us to port their filtering code to Solr 5.x.  Lucene was overhauled in 5.x (see LUCENE-5666 for the gory details) such that uninverted access used for sorting, faceting, grouping, etc. now goes through the DocValues API instead of FieldCache.  Below is a refresher course on how Solr implements custom filtering and what’s changed with Solr 5.x. (Here’s our previous post about custom security filters for Solr 3.x and Solr 4.x.)

Recap of Solr’s Filtering and Caching

First let’s review Solr’s filtering and caching capabilities. Queries to Solr involve a full-text, relevancy-scored query (the infamous q parameter). As users navigate, they will browse into facets, and the search application generates filter query (fq) parameters for faceted navigation (e.g. fq=color:red). Filter queries are not involved in document scoring; they serve only to reduce the search space. Solr has a filter cache that caches the document set of each unique filter query. These document sets are generated in advance, cached, and used to reduce the documents considered by the main query. Caching can be turned off on a per-filter basis. When filters are not cached, they are used in parallel with the main query to “leap frog” to documents for consideration, and a cost can be associated with each filter in order to prioritize the leap-frogging (smallest set first would minimize the documents considered for matching).
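
The leap-frogging described above can be pictured with two sorted lists of document ids. What follows is only a toy sketch in plain Java, not Lucene’s actual implementation; the class name, method names, and doc ids are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: intersects two sorted doc-id lists by leap-frogging. */
class LeapFrogSketch {

  /** Returns the doc ids present in both sorted, duplicate-free arrays. */
  static List<Integer> intersect(int[] a, int[] b) {
    List<Integer> hits = new ArrayList<>();
    int i = 0, j = 0;
    while (i < a.length && j < b.length) {
      if (a[i] == b[j]) {          // both sides agree: a match
        hits.add(a[i]);
        i++;
        j++;
      } else if (a[i] < b[j]) {    // a is behind: leap a forward to b's candidate
        i = advance(a, i, b[j]);
      } else {                     // b is behind: leap b forward to a's candidate
        j = advance(b, j, a[i]);
      }
    }
    return hits;
  }

  /** Moves forward from 'from' to the first position whose value is >= target. */
  static int advance(int[] docs, int from, int target) {
    int pos = from;
    while (pos < docs.length && docs[pos] < target) pos++;
    return pos;
  }

  public static void main(String[] args) {
    int[] colorRed = {2, 5, 9, 14, 21};          // hypothetical matches for fq=color:red
    int[] mainQuery = {1, 2, 3, 9, 10, 21, 30};  // hypothetical matches for q
    System.out.println(intersect(colorRed, mainQuery)); // prints [2, 9, 21]
  }
}
```

Neither list is ever fully scanned against the other; each side only jumps to the next candidate proposed by its partner, which is why driving the process from the smallest set keeps the number of candidates low.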

Post Filtering

Even without caching, filter sets are generated in advance by default. In some cases it can be prohibitively expensive to generate a filter set. One example is access control filtering that needs to take the user’s query context into account in order to know which documents may be returned. Ideally, only documents that match the query and the straightforward filters should be evaluated for security access control. It’s wasteful to evaluate any other documents that would not otherwise match anyway. So let’s run through an example… a contrived example for the sake of showing how Solr’s post filtering works. Here’s the design:

Documents have an “access control list” associated with them, specifying allowed and disallowed users as well as allowed and disallowed groups.  The access control list is an ordered list of allowed/disallowed users and groups. Order matters, such that the first matching rule determines access.  If no rule matches, the document is not allowed.

For example, a document could have an access control string specified as “+u:user1 +g:group1 -g:group2 +u:user2 -u:user3”. Query requests to Solr will include the user name and the user’s group memberships. Given this example access control string, here’s how this contrived design should respond:

user='user1', groups=null: allowed
user='user2', groups=null: allowed
user='user1', groups=[group1]: allowed
user='user2', groups=[group2]: NOT ALLOWED
user='user3', groups=[group1]: allowed
user='user3', groups=[group2]: NOT ALLOWED
user='user3', groups=[group1, group2]: allowed

That’s to say, if user2 searches as a member of group2, he should not be allowed to find this particular document (-g:group2 precedes +u:user2 in the rules, and order matters). I know, I know, this is pretty contrived, but it does mirror the kind of complexity needed in some very real-world environments.

Because these rules are dependent on order and the query request, it’s not possible to do a straightforward Lucene query to filter allowed documents. Play along with me here on this example; I tried to make it sufficiently complicated to require a custom filter.  If your filtering needs can be accomplished using Solr’s fq capability, use that instead; see the “Fine Print” section below for a reminder and elaboration of this point.

Solr has a PostFilter feature that allows this last check on filtering documents on the fly. It takes some know-how to implement a PostFilter appropriately, so the code example here will be a nice starting point for your own custom post filtering. The way a PostFilter gets leveraged is through a Solr QParserPlugin. Here’s my AccessControlQParserPlugin, which is just a simple factory to construct an AccessControlQuery:

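import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

public class AccessControlQParserPlugin extends QParserPlugin {
  public static final String NAME = "acl";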

  @Override
  public void init(NamedList args) {
  }

  @Override
  public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {

      @Override
      public Query parse() throws SyntaxError {
        return new AccessControlQuery(localParams.get("user"), localParams.get("groups"));
      }
    };
  }
}

Here’s AccessControlQuery:

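import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class AccessControlQuery extends ExtendedQueryBase implements PostFilter {
  private final String user;
  private final String[] groups;

  public AccessControlQuery(String user, String groups) {
    this.user = user;
    // groups arrives as a comma-separated local param; it is null when the param is omitted
    this.groups = groups == null ? null : groups.split(",");
  }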

  /**
   * acl is in the form of a series of whitespace separated [+|-][u|g]:name
   * allowed is determined by any explicit user or group mentions, plus or minus
   * order matters
   * if nothing matches, it is not allowed
   */
  public static boolean isAllowed(String acl, String user, String[] groups) {

    if (user == null && groups == null) return false;

    String[] permissions = acl.split(" ");

    for(String p : permissions) {
      boolean allowed = p.charAt(0) == '+';
      String name = p.substring(3);
      if (p.charAt(1) == 'u') { // user
        if (user != null && user.equals(name)) return allowed;
      } else { // group
        if (groups != null) {
          for (String g : groups) {
            if (g.equals(name)) return allowed;
          }
        }
      }
    }

    return false;
  }

  @Override
  public boolean getCache() {
    return false;  // never cache
  }

  @Override
  public int getCost() {
    return Math.max(super.getCost(), 100);  // never return less than 100 since we only support post filtering
  }

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      SortedDocValues acls;

      @Override
      protected void doSetNextReader(LeafReaderContext context) throws IOException {
        acls = context.reader().getSortedDocValues("acl");
        super.doSetNextReader(context);
      }

      @Override
      public void collect(int doc) throws IOException {
        if (isAllowed(acls.get(doc).utf8ToString(), user, groups)) super.collect(doc);
      }

    };
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;
    if (!super.equals(o)) return false;

    AccessControlQuery that = (AccessControlQuery) o;

    if (!Arrays.equals(groups, that.groups)) return false;
    if (user != null ? !user.equals(that.user) : that.user != null) return false;

    return true;
  }

  @Override
  public int hashCode() {
    int result = super.hashCode();
    result = 31 * result + (user != null ? user.hashCode() : 0);
    result = 31 * result + (groups != null ? Arrays.hashCode(groups) : 0);
    return result;
  }

  public static void main(String[] args) {
    String acl = "+u:user1 +g:group1 -g:group2 +u:user2 -u:user3";

    System.out.println("acl = " + acl);

    test(acl, "user1", null);
    test(acl, "user2", null);
    test(acl, "user1", new String[] {"group1"});
    test(acl, "user2", new String[] {"group2"});
    test(acl, "user3", new String[] {"group1"});
    test(acl, "user3", new String[] {"group2"});
    test(acl, "user3", new String[] {"group1","group2"});
  }

  private static void test(String acl, String user, String[] groups) {
    System.out.println("user='" + user + "'" +
        ", groups=" + (groups == null ? null : Arrays.asList(groups)) +
        ": " + (isAllowed(acl, user, groups) ? "allowed" : "NOT ALLOWED"));
  }
}

The main() method was used to generate the above rule processing results. A few notes to emphasize from this code:

  • This implementation can only be used as a filter query (fq) parameter, not a q parameter.
  • hashCode/equals are very important to get right, otherwise unexpected/incorrect results can occur.
  • Caching is explicitly disabled, so no need to set cache=false.
  • Solr only kicks in a PostFilter when the cost is >= 100; that’s why the getCost method is the way it is.
  • The custom filtering logic is all within the single isAllowed() method.
  • This example was built using the Lucene/Solr 5.x codebase.  It is not compatible with any prior version of Solr.
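
To make the hashCode/equals point concrete, here is a small stand-alone sketch. AclKey is a hypothetical stand-in for the user/groups state held by AccessControlQuery, not a Solr class. The assumption behind it is that Solr may use query objects as part of cache keys (the query result cache, for instance), so two filters for different users must never compare equal, while two filters for the same user and groups should.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical stand-in for the user/groups state held by AccessControlQuery. */
final class AclKey {
  private final String user;
  private final String[] groups;

  AclKey(String user, String... groups) {
    this.user = user;
    this.groups = groups;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof AclKey)) return false;
    AclKey that = (AclKey) o;
    // Arrays.equals compares contents; groups.equals(that.groups) would compare references.
    return Objects.equals(user, that.user) && Arrays.equals(groups, that.groups);
  }

  @Override
  public int hashCode() {
    // Must be built from the same fields as equals(), again by array contents.
    return 31 * Objects.hashCode(user) + Arrays.hashCode(groups);
  }

  public static void main(String[] args) {
    Set<AclKey> seen = new HashSet<>();
    seen.add(new AclKey("alice", "hr", "sales"));
    System.out.println(seen.contains(new AclKey("alice", "hr", "sales"))); // true
    System.out.println(seen.contains(new AclKey("bob", "hr", "sales")));   // false
    System.out.println(new String[] {"hr"}.equals(new String[] {"hr"}));   // false
  }
}
```

The last line is the trap: a plain array only equals itself, which is why the listing above leans on Arrays.equals and Arrays.hashCode for the groups field.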

In this implementation, the access control rules are entirely specified on each document, in the acl field. In order to efficiently filter by these rules at query time, Lucene’s DocValues are used.
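
One edge the listing leaves open is a document whose acl value is missing or malformed. The assumption here is that Lucene 5.x reports a missing SortedDocValues entry as an empty value, so such a document would likely reach isAllowed() as an empty string and fail on p.charAt(0). Below is a sketch of a stricter, fail-closed variant; SafeAcl is a hypothetical name, and this is one possible policy rather than the original code.

```java
/** Sketch of a stricter first-match evaluator; not the listing's original code. */
class SafeAcl {

  static boolean isAllowed(String acl, String user, String[] groups) {
    if (acl == null || (user == null && groups == null)) return false;

    for (String p : acl.trim().split("\\s+")) {
      // Expect at least "+u:x": sign, type, colon, then a non-empty name.
      if (p.length() < 4 || p.charAt(2) != ':') return false; // malformed: deny
      char sign = p.charAt(0);
      char type = p.charAt(1);
      if ((sign != '+' && sign != '-') || (type != 'u' && type != 'g')) return false;

      String name = p.substring(3);
      if (type == 'u') {
        if (name.equals(user)) return sign == '+';
      } else if (groups != null) {
        for (String g : groups) {
          if (name.equals(g)) return sign == '+';
        }
      }
    }
    return false; // nothing matched: deny
  }

  public static void main(String[] args) {
    System.out.println(isAllowed("", "bob", new String[] {"hr"}));        // false
    System.out.println(isAllowed("+u:bob", "bob", null));                 // true
    System.out.println(isAllowed("-g:sales  +u:bob", "bob",
        new String[] {"sales"}));                                         // false
  }
}
```

Denying on the first malformed token is deliberate: silently skipping a broken deny rule could expose a document that a later allow rule happens to match.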

So, with all that implementation behind us, here’s how we finally use it: index some documents and make queries that filter using the “acl” query parser. Here are the documents, in CSV format:

acl_docs.csv

id,acl
1,+u:bob
2,-g:sales +g:engineering
3,+g:hr -g:engineering
4,-u:alice +g:hr
5,+g:hr -u:alice
6,+g:sales +g:engineering -u:bob
7,+g:hr -u:alice +g:sales
8,+g:sales
9,+g:engineering
10,+g:hr

An “acl_example” collection was created and two important configuration changes were made: register the “acl” query parser and customize the “acl” field definition.  To register the “acl” query parser, add this to solrconfig.xml:

     <queryParser name="acl" class="AccessControlQParserPlugin"/>

 

The “acl” field is defined in the schema as:

    <field name="acl" type="string" indexed="true" stored="true" multiValued="false" docValues="true"/>

Note docValues="true", a key setting here.

The documents were indexed using Solr’s bin/post tool:

bin/post -c acl_example acl_docs.csv

And finally let’s see the results, using the base request of http://localhost:8983/solr/acl_example/select?q=*:*, which by itself returns all documents. Appending an fq parameter using the syntax &fq={!acl user='username' groups='group1,group2'} applies the security filter. Here are several variations in user and groups and the results:

&fq={!acl user='alice' groups=''}: Matching ids: None
&fq={!acl user='bob' groups=''}: Matching ids: 1
&fq={!acl user='alice' groups='hr'}: Matching ids: 3 5 7 10
&fq={!acl user='alice' groups='hr,sales'}: Matching ids: 3 5 6 7 8 10
&fq={!acl user='alice' groups='hr,sales,engineering'}: Matching ids: 3 5 6 7 8 9 10
&fq={!acl user='bob' groups='hr'}: Matching ids: 1 3 4 5 7 10
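
The matching ids above can be cross-checked without a running Solr. The sketch below replays the first-match rule over the ten sample documents in plain Java. AclReplay and matchingIds are hypothetical names, and isAllowed is a condensed copy of the logic from the listing, so treat it as a sanity check rather than as part of the plugin.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Replays the first-match rule over the sample documents, without Solr. */
class AclReplay {
  static final Map<Integer, String> DOCS = new LinkedHashMap<>();
  static {
    DOCS.put(1, "+u:bob");
    DOCS.put(2, "-g:sales +g:engineering");
    DOCS.put(3, "+g:hr -g:engineering");
    DOCS.put(4, "-u:alice +g:hr");
    DOCS.put(5, "+g:hr -u:alice");
    DOCS.put(6, "+g:sales +g:engineering -u:bob");
    DOCS.put(7, "+g:hr -u:alice +g:sales");
    DOCS.put(8, "+g:sales");
    DOCS.put(9, "+g:engineering");
    DOCS.put(10, "+g:hr");
  }

  // Condensed copy of the first-match logic; groups must be non-null here.
  static boolean isAllowed(String acl, String user, String[] groups) {
    for (String p : acl.split(" ")) {
      boolean allowed = p.charAt(0) == '+';
      String name = p.substring(3);
      if (p.charAt(1) == 'u') {
        if (name.equals(user)) return allowed;
      } else {
        for (String g : groups) {
          if (name.equals(g)) return allowed;
        }
      }
    }
    return false;
  }

  /** groupsParam mimics the comma-separated 'groups' local param. */
  static List<Integer> matchingIds(String user, String groupsParam) {
    String[] groups = groupsParam.split(",");
    List<Integer> ids = new ArrayList<>();
    for (Map.Entry<Integer, String> e : DOCS.entrySet()) {
      if (isAllowed(e.getValue(), user, groups)) ids.add(e.getKey());
    }
    return ids;
  }

  public static void main(String[] args) {
    System.out.println(matchingIds("alice", ""));         // []
    System.out.println(matchingIds("alice", "hr,sales")); // [3, 5, 6, 7, 8, 10]
    System.out.println(matchingIds("bob", "hr"));         // [1, 3, 4, 5, 7, 10]
  }
}
```

Document 2 is a good one to trace by hand: for alice with groups hr,sales,engineering, the leading -g:sales rule wins before +g:engineering is ever reached, so the document stays hidden.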

Fine Print

It’s important to note that PostFilter is a last resort for implementing document filtering. Don’t make the solution more complicated than it needs to be. More often than not, even access control filtering can be implemented using plain ol’ search techniques, by indexing allowed users and groups onto documents and using the lucene (or another) query parser to do the trick. Only when the rules are too complicated, or external information is needed, does a custom PostFilter make sense. Performance is key here: the internal #collect() method will be called for every document that matches the query and the other filters, so what happens in #collect() needs to be highly optimized. A *:* query was used in this example, causing every document in the index to be post-filter evaluated, which may be prohibitive on a large index. Your application may therefore need to require a narrowing query or another filter constraint before kicking in a PostFilter.
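
For comparison, here is roughly what the plain ol’ approach could look like when the rules are a simple allow-list with no deny rules and no ordering. This is a sketch under assumptions: allowed_users and allowed_groups are hypothetical multi-valued string fields that you would index yourself, and the class and method names are made up. The result is an ordinary fq value for the standard lucene query parser.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: builds an ordinary fq for an allow-list model (no deny rules, no ordering). */
class AllowListFilter {

  // Hypothetical multi-valued string fields holding who may see each document.
  static final String USER_FIELD = "allowed_users";
  static final String GROUP_FIELD = "allowed_groups";

  /** Matches documents that list the user or any one of the user's groups. */
  static String buildFq(String user, List<String> groups) {
    StringBuilder fq = new StringBuilder();
    fq.append(USER_FIELD).append(":\"").append(escape(user)).append('"');
    for (String g : groups) {
      fq.append(" OR ").append(GROUP_FIELD).append(":\"").append(escape(g)).append('"');
    }
    return fq.toString();
  }

  /** Escapes backslash and double quote so a value is safe inside a quoted term. */
  static String escape(String value) {
    return value.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  public static void main(String[] args) {
    // prints allowed_users:"alice" OR allowed_groups:"hr" OR allowed_groups:"sales"
    System.out.println(buildFq("alice", Arrays.asList("hr", "sales")));
  }
}
```

A filter like this can be cached and needs no custom plugin, which is the reason to prefer it whenever the access rules allow.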

Code

Here’s the code as text files making it cleaner and easier to save than copying and pasting from above: