Indexing web sites in Solr with Python

In this post I will show a simple yet effective way of indexing web sites into a Solr index, using Scrapy and Python.

We see a lot of advanced Solr-based applications, with sophisticated custom data pipelines that combine data from multiple sources, or with large-scale requirements. Equally, we often see people who want to start implementing search in a minimally-invasive way, using existing websites as integration points rather than integrating deeply with particular CMSes or databases that may be maintained by other groups in an organisation. While crawling websites sounds fairly basic, you soon run into gotchas: some with the mechanics of crawling, but more importantly with the structure of websites.
If you simply parse the HTML and index the text, you will index a lot of text that is not actually relevant to the page: navigation sections, headers and footers, ads, links to related pages. Trying to clean that up afterwards is often not effective; you’re much better off preventing that cruft going into the index in the first place. That involves parsing the content of the web page, and extracting information intelligently. And there’s a great tool for doing this: Scrapy. In this post I will give a simple example of its use. See Scrapy’s tutorial for an introduction and further information.

Crawling

My example site will be my personal blog. I write the blog in Markdown, generate HTML with Jekyll, deploy through git, and host on lighttpd and CloudFront; but none of that makes a difference to how we consume the content: we'll just crawl the website.

First to prepare to run Scrapy, in a Python virtualenv:
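
    virtualenv scrapy-env          # any directory name will do
    . scrapy-env/bin/activate
    pip install scrapy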

Then to create a Scrapy application, named blog:
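
    scrapy startproject blog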

The items we want to index are the blog posts; I just use title, URL and text fields:
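
In blog/items.py, something along these lines (the item class name is arbitrary):

    from scrapy.item import Item, Field

    class BlogItem(Item):
        title = Field()
        url = Field()
        text = Field()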

Next I create a simple spider which crawls my site, identifies blog posts by URL structure, and extracts the text from each blog post. The cool thing about this is that we can extract specific parts of the page.
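
A sketch of such a spider; the domain, the post div class, and the XPath expressions are placeholders to adapt to your own site's layout:

    import re
    from urlparse import urljoin

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request

    from blog.items import BlogItem

    # links to binary types we don't want to follow
    SKIP_EXTENSIONS = ('.png', '.jpg', '.gif', '.pdf', '.zip', '.gz')
    # blog posts are identified by their /YYYY/MM/DD/ URL structure
    POST_PATH = re.compile(r'/\d{4}/\d{2}/\d{2}/')

    class BlogSpider(BaseSpider):
        name = "blog"
        # placeholder domain: use your own site here
        allowed_domains = ["blog.example.com"]
        start_urls = ["http://blog.example.com/"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # follow links, turning relative hrefs into absolute URLs
            for href in hxs.select('//a/@href').extract():
                url = urljoin(response.url, href)
                if not url.endswith(SKIP_EXTENSIONS):
                    yield Request(url, callback=self.parse)
            # if this page is a blog post, extract just the parts we want
            if POST_PATH.search(response.url):
                item = BlogItem()
                item['url'] = response.url
                item['title'] = u" ".join(hxs.select('//h1/text()').extract())
                # placeholder XPath: point this at the element holding the post body
                item['text'] = u" ".join(hxs.select('//div[@class="post"]//text()').extract())
                yield item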

The main bit of logic here is the matching: my blog URLs all start with dates (/YYYY/MM/DD/), so I use that to identify blog posts, which I then parse using XPath. Gotchas here are that you need to create absolute URLs from the relative paths in HTML <a> tags (with urljoin), and that I skip links to binary types. I could have used the CrawlSpider and defined rules for extracting/parsing, but with the BaseSpider it's a bit clearer to see what happens.

To run the crawl:
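
    scrapy crawl blog -o items.json -t json

(With recent Scrapy versions the -t json flag is unnecessary; the format is inferred from the file extension.)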

which produces a JSON file items.json with a list of items like:
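
Each entry is a plain JSON object; the URL and text below are illustrative:

    [
      {"title": "Installing Distributed Solr 4 with Fabric",
       "url": "http://blog.example.com/2013/06/11/installing-solr-with-fabric/",
       "text": "..."},
      ...
    ]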

Indexing

Next, to get that into Solr, we could use the JSON Request Handler and transform the JSON into the appropriate form; but seeing as we're using Python, we'll just use pysolr.

To install pysolr:
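
    pip install pysolr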

and write the python code:
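
A minimal script, saved as (say) index.py:

    import json
    import pysolr

    # Solr 4 server on my LAN; change the URL to point to your instance
    solr = pysolr.Solr('http://vm116.lan:8983/solr/collection1')

    with open('items.json') as f:
        items = json.load(f)

    for item in items:
        # Solr needs a unique "id" field; the URL serves nicely
        item['id'] = item['url']

    solr.add(items)
    solr.commit()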

This looks too easy, right? The trick is that the attribute names title/url/text in the JSON file match field definitions in the default Solr schema.xml. Note that the text field is configured to be indexed but not stored; this means you do not get the page content back with your query, and you can't do things like highlighting.

We do need an "id" field, and we'll use the URL for that. I could have set that in the crawler, so it would become part of the JSON file, but seeing as it is a Solr-specific requirement (and so I could illustrate simple field mapping) I chose to do it here.

The URL points at a Solr 4 server on the host "vm116.lan" on my LAN; adjust the hostname and port number to match yours. See the Solr 4.3.0 tutorial for details on how to run Solr.
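
In short: unpack the 4.3.0 distribution and start the bundled example server:

    cd solr-4.3.0/example
    java -jar start.jar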

To run (change the URL to point to your Solr instance):
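
Assuming the script above was saved as index.py:

    python index.py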

and to query, point your browser to http://vm116.lan:8983/solr/collection1/browse and search for Fabric, which should find the Installing Distributed Solr 4 with Fabric post. Or use the lower-level Solr query page http://vm116.lan:8983/solr/#/collection1/query and do a query for text:Fabric.

In this example we’re crawling and indexing in separate stages; you may want to post documents to Solr directly from the crawler instead.

License and Disclaimer

The code snippets above are covered by:

The MIT License (MIT)

Copyright (c) 2013 Martijn Koster

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the “Software”), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
