Introducing Anda: a New Crawler in Lucidworks Fusion
Introduction
Lucidworks Fusion 2.0 ships with roughly 30 out-of-the-box connector plugins to facilitate data ingestion from a variety of common datasources. Ten of these connectors are powered by a new general-purpose crawler framework called Anda, created at Lucidworks to help simplify and streamline crawler development. Connectors to each of the following Fusion datasources are powered by Anda under the hood:
- Web
- Local file
- Box
- Dropbox
- Google Drive
- SharePoint
- JavaScript
- JIRA
- Drupal
- Github
Inspiration for Anda came from the realization that most crawling tasks have quite a bit in common across crawler implementations. Much of the work entailed in writing a crawler stems from generic requirements unrelated to the exact nature of the datasource being crawled, which indicated the need for some reusable abstractions. The following crawler functionalities are implemented entirely within the Anda framework code. While their behavior is highly configurable via properties in Fusion datasources, the developer of a new Anda crawler needn’t write any code to leverage these features:
- Starting, stopping, and aborting crawls
- Configuration management
- Crawl-database maintenance and persistence
- Link-legality checks and link-rewriting
- Multithreading and thread-pool management
- Throttling
- Recrawl policies
- Deletion
- Alias handling
- De-duplication of similar content
- Content splitting (e.g. CSV and archives)
- Emitting content
Instead, Anda reduces the task of developing a new crawler to providing the Anda framework with access to your data. Developers provide this access by implementing one of two Java interfaces that form the core of the Anda Java API: Fetcher and FS (short for filesystem). These interfaces provide the framework code with the necessary methods to fetch documents from a datasource and discern their links, enabling traversal to additional content in the datasource. Fetcher and FS are designed to be as simple to implement as possible, with nearly all of the actual traversal logic relegated to framework code.
Developing a Crawler
With so many generic crawling tasks, it’s simply inefficient to write an entirely new crawler from scratch for each additional datasource. So in Anda, the framework itself is essentially the one crawler, and we plug in access to the data that we want it to crawl. The Fetcher interface is the more generic of the two ways to provide that access.
Writing a Fetcher
public interface Fetcher extends Component<Fetcher> {
    public Content fetch(String id, long lastModified, String signature) throws Exception;
}
Fetcher is a purposefully simple Java interface that defines a method fetch() to fetch one document from a datasource. There’s a WebFetcher implementation of Fetcher in Fusion that knows how to fetch web pages (where the id argument to fetch() will be a web page URL), a GithubFetcher for Github content, etc. The fetch() method returns a Content object containing the content of the “item” referenced by id, as well as any links to additional “items”, whatever they may be. The framework itself is truly agnostic to the exact type of “items” and datasource in play; dealing with any datasource-specific details is the job of the Fetcher.
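To make the shape of an implementation concrete, here is a bare-bones sketch of a Fetcher for a hypothetical HTTP-backed datasource. The Content constructor and addLink() call below are assumptions made for illustration (the Anda SDK will document the real Content API), and the Component lifecycle methods required by the Fetcher interface are omitted:

// Sketch of a Fetcher for a hypothetical HTTP-backed datasource.
// The Content constructor and addLink() are assumed names for illustration;
// Component lifecycle methods are omitted for brevity.
public class SimpleHttpFetcher implements Fetcher {

    @Override
    public Content fetch(String id, long lastModified, String signature) throws Exception {
        // Here the item id is a URL; other fetchers might use file paths, database keys, etc.
        java.net.HttpURLConnection conn =
            (java.net.HttpURLConnection) new java.net.URL(id).openConnection();
        byte[] bytes = conn.getInputStream().readAllBytes();

        // Assumed constructor: item id, raw bytes, and a last-modified timestamp.
        Content content = new Content(id, bytes, conn.getLastModified());

        // Any links returned here are what the framework traverses next.
        for (String link : discoverLinks(bytes)) {
            content.addLink(link); // assumed method
        }
        return content;
    }

    // Datasource-specific link discovery; a real web fetcher would parse HTML here.
    private java.util.List<String> discoverLinks(byte[] bytes) {
        return java.util.Collections.emptyList();
    }
}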
A Fusion datasource definition provides Anda with a set of start-links (via the startLinks property) that seed the first calls to fetch() in order to begin the crawl, and traversal continues from there via links returned in Content objects from fetch(). Crawler developers simply write code to fetch one document and discern its links to additional documents, and the framework takes it from there. Note that Fetcher implementations should be thread-safe, and the fetchThreads datasource property controls the size of the framework’s thread-pool for fetching.
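For reference, a datasource definition supplying these properties might look roughly like the following. The startLinks and fetchThreads property names come from this post; the surrounding field names and values are illustrative and may differ from the actual Fusion datasource schema:

{
  "id": "example-web-datasource",
  "connector": "lucid.anda",
  "type": "web",
  "properties": {
    "startLinks": [ "http://example.com/" ],
    "fetchThreads": 5
  }
}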
Incremental Crawling
The additional lastModified and signature arguments to fetch() enable incremental crawling. Maintenance and persistence of a crawl-database is one of the most important tasks handled completely by the framework. Values for lastModified (a date) and signature (an optional String indicating any kind of version, e.g. an ETag in a web crawl) are returned as fields of Content objects, saved in the crawl-database, and then read from the crawl-database and passed to fetch() in re-crawls. A Fetcher should use this metadata to avoid reading and returning an item’s content when it hasn’t changed since the last crawl, e.g. by setting an If-Modified-Since header with the lastModified value when making HTTP requests. There are special “discard” Content constructors for the scenario where an unchanged item didn’t need to be fetched.
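As a sketch of what this might look like for an HTTP-based fetcher, the fragment below sends the stored lastModified and signature values back to the server and returns a discard when the item is unchanged. The Content.discard() factory and the Content constructor are assumed names for illustration; the actual discard constructors are defined by the Anda API:

// Sketch: incremental fetch over HTTP using values saved in the crawl-database.
// Content.discard() and the Content constructor are assumed names for illustration.
public Content fetch(String id, long lastModified, String signature) throws Exception {
    java.net.HttpURLConnection conn =
        (java.net.HttpURLConnection) new java.net.URL(id).openConnection();
    if (lastModified > 0) {
        conn.setIfModifiedSince(lastModified);               // sets the If-Modified-Since header
    }
    if (signature != null) {
        conn.setRequestProperty("If-None-Match", signature); // signature held an ETag in a prior crawl
    }
    if (conn.getResponseCode() == java.net.HttpURLConnection.HTTP_NOT_MODIFIED) {
        // Unchanged since the last crawl: nothing to read, nothing to emit.
        return Content.discard(id);
    }
    byte[] bytes = conn.getInputStream().readAllBytes();
    return new Content(id, bytes, conn.getLastModified());
}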
Emitting Content
Content objects returned from fetch() might be discards in incremental crawls, but those containing actual content will be emitted to the Fusion pipeline for processing and to be indexed into Solr. The crawler developer needn’t write any code in order for this to happen. The pipelineID property of all Fusion datasources configures the pipeline through which content will be processed, and the user can configure the various stages of that pipeline using the Fusion UI.
Configuration
Fetcher extends another interface called Component, used to define its lifecycle and provide configuration. Configuration properties themselves are defined using an annotation called @Property, e.g.:
@Property(title="Obey robots.txt?", type=Property.Type.BOOLEAN, defaultValue="true") public static final String OBEY_ROBOTS_PROP = "obeyRobots";
This example from WebFetcher (the Fetcher implementation for Fusion web crawling) defines a boolean datasource property called obeyRobots, which controls whether WebFetcher should heed directives in robots.txt when crawling websites (disable this setting with care!). Fields with @Property annotations for datasource properties should be defined right in the Fetcher class itself, and the title= attribute of a @Property annotation is used by Fusion to render datasource properties in the UI.
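A custom fetcher can expose its own settings the same way. The property below is purely hypothetical, and the INTEGER type constant is an assumption (only BOOLEAN appears in this post); how the configured value is handed back to the fetcher goes through the Component lifecycle, which is beyond the scope of this post:

// Hypothetical datasource property for a custom fetcher.
// Property.Type.INTEGER is assumed; only Property.Type.BOOLEAN is shown in this post.
@Property(title="Request timeout (ms)", type=Property.Type.INTEGER, defaultValue="10000")
public static final String REQUEST_TIMEOUT_PROP = "requestTimeout";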
Error Handling
Lastly, it’s important to note that fetch() is allowed to throw any Java exception. Exceptions are persisted, reported, and handled by the framework, including logic to decide how many times fetch() must consecutively fail for a particular item before that item will be deleted from Solr. Most Fetcher implementations will want to catch and react to certain errors (e.g. retrying failed HTTP requests in a web crawl), but any hard failures can simply bubble up through fetch().
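For example, a fetcher might retry transient network errors a few times before giving up and letting the exception reach the framework. The doFetch() helper here is hypothetical and stands in for the actual fetch logic:

// Sketch: retry transient I/O errors, then let the exception bubble up so the
// framework can persist and report the failure (and eventually delete the item
// if it keeps failing across crawls). doFetch() is a hypothetical helper.
public Content fetch(String id, long lastModified, String signature) throws Exception {
    int attempts = 0;
    while (true) {
        try {
            return doFetch(id, lastModified, signature);
        } catch (java.io.IOException e) {
            attempts++;
            if (attempts >= 3) {
                throw e; // hard failure: surface it to the framework
            }
            Thread.sleep(1000L * attempts); // simple backoff before retrying
        }
    }
}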
What’s next?
At present, Anda’s sweet spot is quick and easy development of crawlers, which usually connote something a bit more specific than the broader term “connector”. That items link to other items is currently a core assumption of the Anda framework: web pages have links to other web pages and filesystems have directories linking to other files, yielding structures that clearly require crawling.
We’re working towards enabling additional ingestion paradigms, such as iterating over result-sets (e.g. from a traditional database) instead of following links to define the traversal. Mechanisms to seed crawls in such a fashion are also under development. For now, it may make sense to develop connectors whose ingestion paradigms are less about crawling (e.g. the Slack and JDBC connectors in Fusion) using the general Fusion connectors framework. Stay tuned for future blog posts covering new methods of content ingestion and traversal in Anda.
An Anda SDK with examples and full documentation is also underway, and this blog post will be updated as soon as it’s available. Please contact Lucidworks in the meantime.
Additional Reading
Planned upcoming blog posts (links will be posted when available):
Web Crawling in Fusion
The Anda-powered Fusion web crawler provides a number of options to control how web pages are crawled and indexed, the speed of crawling, etc.
Content De-duplication with Anda
De-duplication of similar content is a complex but generalizable task that we’ve tackled in the Anda framework, making it available to any crawler developed using Anda.
Anda Crawler Development Deep-dive
Writing a Fetcher is one of the two ways to provide Anda with access to your data; it’s also possible to implement an interface called FS (short for filesystem). Which one you choose will depend chiefly on whether the target datasource can be modeled in terms of a standard filesystem. If a datasource generally deals in files and directories, then writing an FS may be easier than writing a Fetcher.