Generating a Sitemap from a Solr Index

January 5, 2017

Our clients often ask whether Solr supports generating a sitemap from an existing Solr index. While Solr has a full-featured set of APIs, these interfaces are geared toward providing a generic data-management platform for your application. So the short answer is: no, Solr doesn't have a specialized API for generating sitemaps, RSS feeds, and so on.

That said, with just a few lines of code you can create your own sitemap generator.  

For the purposes of this article, I've written rudimentary sitemap generators in Java, PHP, and Python.  You'll find each of these examples below.  They are all about the same length, and all do essentially the same thing:

   1) Query the collection's /select handler (q=*:*) and retrieve the raw response.  Note that Solr returns only 10 rows by default, so request enough rows (or page through the results) to cover every document.

   2) Parse the raw response into a JSON object.

   3) Iterate over the documents, extracting the URLs and appending them to the XML output string.

   4) Print out the result. 
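Those four steps can also be sketched as a single Python 3 script. This is a hedged variant, not a drop-in replacement for the examples below: the collection name is a placeholder, and it assumes the uniqueKey field is `id`. It pages through the full result set with Solr's cursorMark deep paging, since the /select handler returns only 10 rows per request by default:

```python
import json
import urllib.parse
import urllib.request
from xml.sax.saxutils import escape

# Placeholder URL -- substitute your own host and collection name.
SOLR_SELECT = "http://localhost:8983/solr/MY_COLLECTION_NAME/select"

def fetch_all_ids(select_url):
    """Page through every document using cursorMark deep paging.

    cursorMark requires a sort on the uniqueKey field (assumed here to be 'id')."""
    cursor, ids = "*", []
    while True:
        params = urllib.parse.urlencode({
            "q": "*:*", "wt": "json", "rows": 500,
            "sort": "id asc", "cursorMark": cursor,
        })
        with urllib.request.urlopen(select_url + "?" + params) as resp:
            data = json.loads(resp.read())
        ids.extend(doc["id"] for doc in data["response"]["docs"])
        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:  # cursor stopped advancing: no more pages
            break
        cursor = next_cursor
    return ids

def build_sitemap(urls):
    """Wrap a list of page URLs in sitemap XML, escaping reserved characters."""
    entries = "".join("<url><loc>%s</loc></url>" % escape(u) for u in urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
            + entries + "</urlset>")
```

With a running Solr instance, `print(build_sitemap(fetch_all_ids(SOLR_SELECT)))` would emit the complete sitemap.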

   I). Java Sitemap Example

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Iterator;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

public class Sitemap {

    public static void main(String[] args) {
        // Solr returns only 10 rows by default; raise it (or page) to cover every document
        String url = "http://localhost:8983/solr/[MY_COLLECTION_NAME]/select?q=*%3A*&wt=json&rows=10000";
        StringBuffer buf = new StringBuffer();
        try {
            URL solrSite = new URL(url);

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(solrSite.openStream()));

            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                buf.append(inputLine);
            }
            in.close();

            JSONParser parser = new JSONParser();
            JSONObject jsonObject = (JSONObject) parser.parse(buf.toString());
            JSONObject resp = (JSONObject) jsonObject.get("response");
            JSONArray docs = (JSONArray) resp.get("docs");
            Iterator<JSONObject> iter = docs.iterator();
            JSONObject doc;
            buf = new StringBuffer();
            buf.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");

            buf.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");

            while (iter.hasNext()) {
                doc = iter.next();
                buf.append("<url><loc>" + doc.get("id") + "</loc></url>");
            }

            buf.append("</urlset>");

        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.out.println(buf.toString());
        }

    }
}

  II). PHP Sitemap Example

 
<?php header("Content-Type: text/xml");
   
   // Solr returns only 10 rows by default; raise it to cover every document
   $content = file_get_contents("http://localhost:8983/solr/[MY_COLLECTION_NAME]/select?q=*%3A*&wt=json&rows=10000");
   $json = json_decode($content, true);
   
   $output = '<?xml version="1.0" encoding="UTF-8"?>

    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';

    $docs = $json["response"]["docs"];
    
    foreach ($docs as $doc) {
        $output .= "<url>";
        $output .= "<loc>" . $doc["id"] . "</loc>";
        $output .= "</url>";
    }
   
   $output .= "</urlset>";

   echo $output;

?>

  III). Python Sitemap Example

#!/usr/bin/env python2
# encoding: UTF-8
import urllib
import json

if __name__ == "__main__":
   # Solr returns only 10 rows by default; raise it to cover every document
   link = "http://localhost:8983/solr/[MY_COLLECTION_NAME]/select?q=*%3A*&wt=json&rows=10000"
   f = urllib.urlopen(link)
   myfile = f.read()
   stdout = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"

   jsonout = json.loads(myfile)
   docs = jsonout["response"]["docs"]
   for doc in docs:
       stdout += "<url><loc>" + doc["id"] + "</loc></url>"

   stdout += "</urlset>"
   print stdout

Note: In the examples above, I am simply printing out the result.  For your implementation, you will likely write the output to a file in your site's root directory.  Automating this task can be accomplished with a simple cron job.  Also, for these examples I tried to limit my imports to what would be available with a basic core installation of each language.  There are certainly many ways you could go about it, but these provide basic examples.  Further, I'm only setting the required 'loc' element here, using the 'id' field gathered when the document was crawled.  You could extend these to include the optional elements (e.g. 'lastmod', 'changefreq', 'priority').
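As a sketch of those extensions, the Python 3 helpers below emit the optional 'lastmod' element alongside 'loc' and write the finished map to a file instead of printing it. The `last_modified` field name, the output path, and the script path in the cron line are all illustrative assumptions, not part of any standard schema:

```python
from xml.sax.saxutils import escape

def doc_to_entry(doc):
    """Render one <url> entry. 'id' holds the page URL; 'last_modified' is a
    hypothetical schema field -- swap in whatever date field your schema stores."""
    entry = "<url><loc>%s</loc>" % escape(doc["id"])
    if "last_modified" in doc:
        # Sitemap dates use W3C Datetime; the YYYY-MM-DD date portion alone is valid.
        entry += "<lastmod>%s</lastmod>" % doc["last_modified"][:10]
    return entry + "</url>"

def write_sitemap(path, docs):
    """Write the sitemap for 'docs' (a parsed Solr 'docs' array) to 'path',
    e.g. a file in your site's web root such as /var/www/html/sitemap.xml."""
    body = "".join(doc_to_entry(d) for d in docs)
    xml = ('<?xml version="1.0" encoding="UTF-8"?>'
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
           + body + "</urlset>")
    with open(path, "w") as out:
        out.write(xml)
```

For the cron side, an entry along the lines of `0 2 * * * /usr/bin/python3 /opt/scripts/solr_sitemap.py` (paths illustrative) would regenerate the file nightly at 2 a.m.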

 Happy Mapping! 



