Introducing Anda: a New Crawler in Lucidworks Fusion
Introduction
Lucidworks Fusion 2.0 ships with roughly 30 out-of-the-box connector plugins to facilitate data ingestion from a variety of common datasources. Ten of these connectors are powered by a new general-purpose crawler framework called Anda, created at Lucidworks to help simplify and streamline crawler development. Connectors to each of the following Fusion datasources are powered by Anda under the hood:
- Web
- Local file
- Box
- Dropbox
- Google Drive
- SharePoint
- JavaScript
- JIRA
- Drupal
- Github
Inspiration for Anda came from the realization that most crawling tasks have quite a bit in common across crawler implementations. Much of the work entailed in writing a crawler stems from generic requirements unrelated to the exact nature of the datasource being crawled, which indicated the need for some reusable abstractions. The following crawler functionalities are implemented entirely within the Anda framework code. While their behavior is highly configurable via properties in Fusion datasources, the developer of a new Anda crawler needn’t write any code to leverage these features:
- Starting, stopping, and aborting crawls
- Configuration management
- Crawl-database maintenance and persistence
- Link-legality checks and link-rewriting
- Multithreading and thread-pool management
- Throttling
- Recrawl policies
- Deletion
- Alias handling
- De-duplication of similar content
- Content splitting (e.g. CSV and archives)
- Emitting content
Instead, Anda reduces the task of developing a new crawler to providing the Anda framework with access to your data. Developers provide this access by implementing one of two Java interfaces that form the core of the Anda Java API: Fetcher and FS (short for filesystem). These interfaces provide the framework code with the necessary methods to fetch documents from a datasource and discern their links, enabling traversal to additional content in the datasource. Fetcher and FS are designed to be as simple to implement as possible, with nearly all of the actual traversal logic relegated to framework code.
Developing a Crawler
With so many generic crawling tasks, it’s simply inefficient to write an entirely new crawler from scratch for each additional datasource. So in Anda, the framework itself is essentially the one crawler, and we plug in access to the data that we want it to crawl. The Fetcher interface is the more generic of the two ways to provide that access.
Writing a Fetcher
public interface Fetcher extends Component<Fetcher> {
    public Content fetch(String id, long lastModified, String signature) throws Exception;
}
Fetcher is a purposefully simple Java interface that defines a method fetch() to fetch one document from a datasource. There’s a WebFetcher implementation of Fetcher in Fusion that knows how to fetch web pages (where the id argument to fetch() will be a web page URL), a GithubFetcher for Github content, etc. The fetch() method returns a Content object containing the content of the “item” referenced by id, as well as any links to additional “items”, whatever they may be. The framework itself is truly agnostic to the exact type of “items” and datasource in play; dealing with any datasource-specific details is the job of the Fetcher.
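To make the shape of an implementation concrete, here is a bare-bones sketch of a Fetcher for a hypothetical HTTP-backed datasource. The Content constructor and addLink() call below are assumptions made for illustration (the Anda SDK will document the real Content API), and the Component lifecycle methods required by the Fetcher interface are omitted:

// Sketch of a Fetcher for a hypothetical HTTP-backed datasource.
// The Content constructor and addLink() are assumed names for illustration;
// Component lifecycle methods are omitted for brevity.
public class SimpleHttpFetcher implements Fetcher {

    @Override
    public Content fetch(String id, long lastModified, String signature) throws Exception {
        // Here the item id is a URL; other fetchers might use file paths, database keys, etc.
        java.net.HttpURLConnection conn =
            (java.net.HttpURLConnection) new java.net.URL(id).openConnection();
        byte[] bytes = conn.getInputStream().readAllBytes();

        // Assumed constructor: item id, raw bytes, and a last-modified timestamp.
        Content content = new Content(id, bytes, conn.getLastModified());

        // Any links returned here are what the framework traverses next.
        for (String link : discoverLinks(bytes)) {
            content.addLink(link); // assumed method
        }
        return content;
    }

    // Datasource-specific link discovery; a real web fetcher would parse HTML here.
    private java.util.List<String> discoverLinks(byte[] bytes) {
        return java.util.Collections.emptyList();
    }
}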
A Fusion datasource definition provides Anda with a set of start-links (via the startLinks property) that seed the first calls to fetch() in order to begin the crawl, and traversal continues from there via links returned in Content objects from fetch(). Crawler developers simply write code to fetch one document and discern its links to additional documents, and the framework takes it from there. Note that Fetcher implementations should be thread-safe, and the fetchThreads datasource property controls the size of the framework’s thread-pool for fetching.
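For reference, a datasource definition supplying these properties might look roughly like the following. The startLinks and fetchThreads property names come from this post; the surrounding field names and values are illustrative and may differ from the actual Fusion datasource schema:

{
  "id": "example-web-datasource",
  "connector": "lucid.anda",
  "type": "web",
  "properties": {
    "startLinks": [ "http://example.com/" ],
    "fetchThreads": 5
  }
}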
Incremental Crawling
The additional lastModified and signature arguments to fetch() enable incremental crawling. Maintenance and persistence of a crawl-database is one of the most important tasks handled completely by the framework. Values for lastModified (a date) and signature (an optional String indicating any kind of version, e.g. an ETag in a web crawl) are returned as fields of Content objects, saved in the crawl-database, and then read from the crawl-database and passed to fetch() in re-crawls. A Fetcher should use this metadata to avoid reading and returning an item’s content when it hasn’t changed since the last crawl, e.g. by setting an If-Modified-Since header with the lastModified value when making HTTP requests. There are special “discard” Content constructors for the scenario where an unchanged item didn’t need to be fetched.
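As a sketch of what this might look like for an HTTP-based fetcher, the fragment below sends the stored lastModified and signature values back to the server and returns a discard when the item is unchanged. The Content.discard() factory and the Content constructor are assumed names for illustration; the actual discard constructors are defined by the Anda API:

// Sketch: incremental fetch over HTTP using values saved in the crawl-database.
// Content.discard() and the Content constructor are assumed names for illustration.
public Content fetch(String id, long lastModified, String signature) throws Exception {
    java.net.HttpURLConnection conn =
        (java.net.HttpURLConnection) new java.net.URL(id).openConnection();
    if (lastModified > 0) {
        conn.setIfModifiedSince(lastModified);               // sets the If-Modified-Since header
    }
    if (signature != null) {
        conn.setRequestProperty("If-None-Match", signature); // signature held an ETag in a prior crawl
    }
    if (conn.getResponseCode() == java.net.HttpURLConnection.HTTP_NOT_MODIFIED) {
        // Unchanged since the last crawl: nothing to read, nothing to emit.
        return Content.discard(id);
    }
    byte[] bytes = conn.getInputStream().readAllBytes();
    return new Content(id, bytes, conn.getLastModified());
}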
Emitting Content
Content objects returned from fetch() might be discards in incremental crawls, but those containing actual content will be emitted to the Fusion pipeline for processing and to be indexed into Solr. The crawler developer needn’t write any code in order for this to happen. The pipelineID property of all Fusion datasources configures the pipeline through which content will be processed, and the user can configure the various stages of that pipeline using the Fusion UI.
Configuration
Fetcher extends another interface called Component, used to define its lifecycle and provide configuration. Configuration properties themselves are defined using an annotation called @Property, e.g.:
@Property(title="Obey robots.txt?", type=Property.Type.BOOLEAN, defaultValue="true") public static final String OBEY_ROBOTS_PROP = "obeyRobots";
This example from WebFetcher (the Fetcher implementation for Fusion web crawling) defines a boolean datasource property called obeyRobots, which controls whether WebFetcher should heed directives in robots.txt when crawling websites (disable this setting with care!). Fields with @Property annotations for datasource properties should be defined right in the Fetcher class itself, and the title= attribute of a @Property annotation is used by Fusion to render datasource properties in the UI.
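A custom fetcher can expose its own settings the same way. The property below is purely hypothetical, and the INTEGER type constant is an assumption (only BOOLEAN appears in this post); how the configured value is handed back to the fetcher goes through the Component lifecycle, which is beyond the scope of this post:

// Hypothetical datasource property for a custom fetcher.
// Property.Type.INTEGER is assumed; only Property.Type.BOOLEAN is shown in this post.
@Property(title="Request timeout (ms)", type=Property.Type.INTEGER, defaultValue="10000")
public static final String REQUEST_TIMEOUT_PROP = "requestTimeout";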
Error Handling
Lastly, it’s important to note that fetch() is allowed to throw any Java exception. Exceptions are persisted, reported, and handled by the framework, including logic to decide how many times fetch() must consecutively fail for a particular item before that item will be deleted from Solr. Most Fetcher implementations will want to catch and react to certain errors (e.g. retrying failed HTTP requests in a web crawl), but any hard failures can simply bubble up through fetch().
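For example, a fetcher might retry transient network errors a few times before giving up and letting the exception reach the framework. The doFetch() helper here is hypothetical and stands in for the actual fetch logic:

// Sketch: retry transient I/O errors, then let the exception bubble up so the
// framework can persist and report the failure (and eventually delete the item
// if it keeps failing across crawls). doFetch() is a hypothetical helper.
public Content fetch(String id, long lastModified, String signature) throws Exception {
    int attempts = 0;
    while (true) {
        try {
            return doFetch(id, lastModified, signature);
        } catch (java.io.IOException e) {
            attempts++;
            if (attempts >= 3) {
                throw e; // hard failure: surface it to the framework
            }
            Thread.sleep(1000L * attempts); // simple backoff before retrying
        }
    }
}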
What’s next?
At present, Anda’s sweet spot is quick and easy development of crawlers, which usually connote something a bit more specific than the broader term “connector”. That items link to other items is currently a core assumption of the Anda framework: web pages have links to other web pages and filesystems have directories linking to other files, yielding structures that clearly require crawling.
We’re working towards enabling additional ingestion paradigms, such as iterating over result-sets (e.g. from a traditional database) instead of following links to define the traversal. Mechanisms to seed crawls in such a fashion are also under development. For now, it may make sense to develop connectors whose ingestion paradigms are less about crawling (e.g. the Slack and JDBC connectors in Fusion) using the general Fusion connectors framework. Stay tuned for future blog posts covering new methods of content ingestion and traversal in Anda.
An Anda SDK with examples and full documentation is also underway, and this blog post will be updated as soon as it’s available. Please contact Lucidworks in the meantime.
Additional Reading
Planned upcoming blog posts (links will be posted when available):
Web Crawling in Fusion
The Anda-powered Fusion web crawler provides a number of options to control how web pages are crawled and indexed, the speed of crawling, etc.
Content De-duplication with Anda
De-duplication of similar content is a complex but generalizable task that we’ve tackled in the Anda framework, making it available to any crawler developed using Anda.
Anda Crawler Development Deep-dive
Writing a Fetcher is one of the two ways to provide Anda with access to your data; it’s also possible to implement an interface called FS (short for filesystem). Which one you choose will depend chiefly on whether the target datasource can be modeled in terms of a standard filesystem. If a datasource generally deals in files and directories, then writing an FS may be easier than writing a Fetcher.