Indexing PDF for OSINT and Pentesting

By Alejandro Nolla – @z0mbiehunt3r

Most of us, when conducting OSINT tasks or gathering information to prepare a pentest, draw on Google hacking techniques like site:company.acme filetype:pdf “for internal use only” or something similar to search for potentially sensitive information uploaded by mistake. At other times, a customer will ask us to find out whether they have leaked this kind of sensitive information through negligence, and we proceed to apply some Google hacking fu.

But what happens if we don’t want to run these queries against Google and, furthermore, don’t want to follow links from the search results that could leak referrers? Sure, we could download the documents and review them manually on our own machine, but that is boring and time consuming. This is where Apache Solr comes into play: it processes the documents and builds an index of them, giving us almost real-time search capabilities.

What is Solr?

Solr is a schema-based (with dynamic field support as well) search solution built on Apache Lucene. It provides full-text search, document processing, a REST API that returns results in formats such as XML or JSON, and more. Solr gives us many options for indexing documents: how to treat the text, how to tokenize it, whether to convert it to lowercase automatically, how to build a distributed cluster, automatic duplicate document detection and so on.

Setting up Solr

There is a lot of material about how to install Solr, so I’m not going to cover it here, just the core-specific options for this quick’n dirty solution. The first thing to do is create a core config dir and a data dir; in this case I created /opt/solr/pdfosint/ and /opt/solr/pdfosintdata/ to store config and document data respectively.

To set up the schema, just create the file /opt/solr/pdfosint/conf/schema.xml with the following content:

schema.xml content for pdfosint core

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="pastebincom" version="1.5">
 <fields>
   <field name="id" type="uuid" indexed="true" stored="true" default="NEW" multiValued="false" />
   <field name="text" type="text_general" indexed="true" stored="true"/>
   <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
   <field name="_version_" type="long" indexed="true" stored="true"/>
   <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
 </fields>

 <types>
   <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
   <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
   <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
   <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
   <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
 </types>
</schema>

Just a quick review of the schema.xml config: I specified an id field to be unique (UUID), a text field to store the text itself, a timestamp set to the date when the document is pushed into Solr, _version_ to track versions (used internally by Solr, for replication and so on) and a dynamic field named attr_* to store any value provided by the parser that is not specified in the schema. Last, I specified how to treat text when indexing and querying: the tokenizer is whitespace based (it splits words on whitespace only, without caring about punctuation) and the tokens are then converted to lowercase. If you want to know more about text processing I would recommend Python Text Processing with NLTK 2.0 Cookbook as an introduction, Natural Language Processing with Python for more in-depth usage (both Python based) and the Natural Language Processing online course available on Coursera.

The next step is telling Solr about the new core, just by adding it to /opt/solr/solr.xml:
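To make the effect of that analyzer chain concrete, here is a minimal Python sketch that imitates it outside Solr. It is only an approximation under stated assumptions: Python's str.split() stands in for the whitespace tokenizer, and the function names analyze and contains_phrase are made up for this illustration, they are not part of Solr.

```python
def analyze(text):
    """Approximate the text_general chain: split on whitespace, then lowercase."""
    return [token.lower() for token in text.split()]


def contains_phrase(document, phrase):
    """Return True if the analyzed phrase appears as consecutive tokens in the document."""
    doc_tokens = analyze(document)
    query_tokens = analyze(phrase)
    size = len(query_tokens)
    if size == 0:
        return False
    return any(
        doc_tokens[start:start + size] == query_tokens
        for start in range(len(doc_tokens) - size + 1)
    )


# Punctuation is not stripped, so "only." and "only" are different tokens.
for sample in ("This is for INTERNAL use only", "For internal use only."):
    print(sample, "->", analyze(sample), contains_phrase(sample, "internal use only"))
```

The second sample shows the trade-off of a whitespace-only tokenizer: a trailing full stop stays glued to the last word, so a phrase query that ends in the bare word may miss it.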

New core for PDF indexing

<cores>
  ...
  <core name="pdfosint" instanceDir="pdfosint"/>
</cores>

All that is left is to give Solr binary document processing capabilities through a request handler, in this case only for the pdfosint core. To do so, create /opt/solr/pdfosint/solrconfig.xml (we can always copy the example provided with Solr and modify it as needed) and specify the request handler:

Setting up solr request handler for binary documents

 <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
        <str name="fmap.content">text</str>
        <str name="lowernames">true</str>
        <str name="uprefix">attr_</str>
        <str name="captureAttr">true</str>
    </lst>
  </requestHandler>

A quick review of this: the class may need to be changed depending on the Solr version and its class names; fmap.content tells Solr to index the extracted text into a field called text; lowernames converts the extracted field names to lowercase (replacing characters that are not letters or digits with underscores); uprefix specifies how to handle fields that were parsed but are not defined in schema.xml (in this case they get the prefix attr_, so they match the dynamic field); and captureAttr tells Solr to index parsed attributes into separate fields. To learn more about ExtractingRequestHandler, see its documentation.
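The way these options combine can be sketched in a few lines of Python. This is a simplified model of the field-name mapping, not Solr's actual code: it assumes the order lowernames, then fmap, then uprefix, it treats only the explicitly declared schema fields as known, and the names SCHEMA_FIELDS, FMAP, UPREFIX and map_field_name are hypothetical.

```python
# Fields declared explicitly in the schema.xml shown earlier.
SCHEMA_FIELDS = {"id", "text", "timestamp", "_version_"}
# Mirrors fmap.content=text from the request handler defaults.
FMAP = {"content": "text"}
# Mirrors uprefix=attr_ from the request handler defaults.
UPREFIX = "attr_"


def map_field_name(name, lowernames=True):
    """Return the index field name that an extracted field would end up in."""
    if lowernames:
        # Lowercase letters and digits, turn anything else into an underscore.
        name = "".join(ch.lower() if ch.isalnum() else "_" for ch in name)
    # Apply explicit source -> target mappings.
    name = FMAP.get(name, name)
    # Unknown fields get the prefix so they match the attr_* dynamic field.
    if name not in SCHEMA_FIELDS:
        name = UPREFIX + name
    return name


for extracted in ("content", "Content-Type", "Author"):
    print(extracted, "->", map_field_name(extracted))
```

Under this model the document body lands in text, while metadata reported by the parser ends up in attr_ fields that can be searched separately.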

Now we have to install the libraries required for binary parsing and indexing. For this, I created /opt/solr/extract/ and copied solr-cell-4.2.0.jar into it from the dist directory of the Solr distribution archive, along with everything from contrib/extraction/lib/, also from the distribution archive.

Finally, add this line to /opt/solr/pdfosint/solrconfig.xml to specify where to load the libraries from:

...
<lib dir="/opt/solr/extract" regex=".*\.jar" />
...

To learn more about this process, and for more recipes, I strongly recommend Apache Solr 4 Cookbook.

Indexing and digging data

Now we have an extracting and indexing handler at http://localhost:8080/solr/pdfosint/update/extract/, so we only need to send the PDFs to Solr and analyze them. Once they are downloaded (or maybe fetched from a meterpreter session? }:) ), the easiest way is to send them to Solr with curl:

$ for i in /tmp/pdf/*.pdf; do curl "http://localhost:8080/solr/pdfosint/update/extract/?commit=true" -F "myfile=@$i"; done

After a while, depending on several factors like machine specs and document sizes, we should have an index like this:
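If you would rather drive the upload from a script, the same idea can be sketched in Python with only the standard library. Note the assumptions: instead of the multipart form that curl -F sends, this version posts the raw PDF bytes as the request body with an application/pdf content type, and the names EXTRACT_URL, build_extract_request and index_directory are made up for this example.

```python
import glob
import os
import urllib.request

# Handler URL taken from the text above; adjust host and port to your setup.
EXTRACT_URL = "http://localhost:8080/solr/pdfosint/update/extract/"


def build_extract_request(pdf_bytes, commit=True):
    """Build a POST request carrying one PDF as the raw request body."""
    url = EXTRACT_URL + "?commit=" + ("true" if commit else "false")
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
    )


def index_directory(directory):
    """Send every *.pdf file in the directory to Solr, one request per file."""
    for pdf_path in sorted(glob.glob(os.path.join(directory, "*.pdf"))):
        with open(pdf_path, "rb") as handle:
            request = build_extract_request(handle.read())
        with urllib.request.urlopen(request) as response:
            print(pdf_path, response.status)
```

Calling index_directory("/tmp/pdf") would then play the role of the shell loop. Committing on every request keeps the sketch close to the curl command, although for a large batch it may be cheaper to commit once at the end.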

Now we can run a query for documents containing the phrase “internal use only” and bingo!:
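The same phrase query can also be sent from a script. The following Python sketch assumes that the core exposes the usual /select search handler and that JSON output is requested with wt=json; neither is shown in the configuration above, and the names SELECT_URL, build_phrase_query_url and search_phrase are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed search handler for the pdfosint core (not shown in the config above).
SELECT_URL = "http://localhost:8080/solr/pdfosint/select"


def build_phrase_query_url(phrase, field="text", rows=10):
    """Build a select URL that searches one field for an exact phrase."""
    # Escape backslashes and double quotes so the phrase stays one quoted term.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": '%s:"%s"' % (field, escaped), "wt": "json", "rows": rows}
    return SELECT_URL + "?" + urllib.parse.urlencode(params)


def search_phrase(phrase):
    """Run the phrase query and return the list of matching documents."""
    with urllib.request.urlopen(build_phrase_query_url(phrase)) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["response"]["docs"]
```

With Solr running, search_phrase("internal use only") would return the stored fields of each matching document, which makes it easy to loop over a list of interesting phrases.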

It’s important to keep in mind that Solr splits and processes the words of a query before searching, just as it does with text at indexing time. To see how a phrase is treated and indexed by Solr when submitted, we can run an analysis with the built-in interface:

I hope you find this useful and give it a try, see you soon!

Original post by Alejandro Nolla – @z0mbiehunt3r can be found here.
