Search Results for: document/5ea8054ed8348e6f/highlight_arbitrary_text

Getting Started with Lucene Setup

…understanding as many features about the content as possible. Regardless of the number of documents, there are many tips and techniques that can help. I’ll cover the highlights here, but…

Full Text Search Engines vs. DBMS

searching large volumes of unstructured text (documents or other ‘records’ containing free-form text) and returning these documents based on how well they match the user’s query. They…

Content Extraction with Tika

…content type detection and content extraction framework. Tika provides a general application programming interface that can be used to detect the content type of a document and also parse textual…

Crawling in Open Source, Part 1

…HTTPS, FTP and local file system. Nutch can also extract textual content from several document formats like HTML, RSS, ATOM, PDF, MS formats (doc, excel, ppt), etc. right out of…

Solution for multi-term synonyms in Lucene/Solr using the Auto Phrasing TokenFilter

…causes the document to match; there are problems with highlighting the original document when synonym is matched (see unit tests for an example), if the synonym is of different length…

Debugging Search Application Relevance Issues

…of queries, documents and relevance judgments created by a third party group. The most popular instance of this is the Text Retrieval Evaluation Conference (TREC) held by the National Institute…

Scaling Lucene and Solr

…score higher. Just as with omitTf, this may not be useful for short or non full text fields. Norms are stored in the index as a byte value per document

Solr Cloud Document Routing

Overview Solr Cloud document routing was released in Solr 4.1. This feature expanded upon the simple hash based routing that was available in Solr 4.0 by introducing a new…

Poor man's "entity" extraction with Solr

…you really want to do with lat/long’s extracted from arbitrary text is to attach it to the document as a formal, retrievable, filterable, geospatial field type.  Spatial search is a…

When the mapping gets tough, the tough use JavaScript

The use case for this stage is when processing a collection of documents where the document text may contain a certain amount of boilerplate text, e.g., disclaimers in email messages….