Building a Subject Classifier Using Automatically Discovered Keyword Clusters, Part I

A lot has been written and discussed over the years about the relative benefits of automated or machine based classification and that of human curated ontologies and taxonomies. It is generally acknowledged that human curated taxonomies can be more precise but they are difficult to scale and require a lot of maintenance to achieve good recall. Machine learning techniques have the advantage that with some effort, they can be made to scale quite nicely but it is difficult to get them to be 100% accurate. The initial cost of building machine learning models can be acceptable, but like any 80/20 problem, perfecting them takes much more effort. This is complicated by issues like “overfitting” where the model performs very well on the training set but very poorly on “real” data. There is also the fact that almost all of these techniques require a numerical “vectorization” of textual data on which the learning algorithms operate. Thus the models that they create are often not intuitive and cannot be edited by human subject matter experts – they are effectively a “black box”. (You can learn what words have been selected as keywords by mapping their numerical IDs but it is very difficult to edit the “model”). Google’s Word2Vec algorithm stands out as a exception here in that it creates word – to – word vectors that can be used to do linear algebra-esque operations on conceptual relationships such as “Man is to Woman” as “Boy is to Girl”, “Brother is to Sister” or “King is to Queen”. More on this fascinating ‘pivot’ and on Word2Vec later.

My goal here is to develop a subject classifier that uses automatically discovered key term “clusters” that can then be used to classify documents. It is similar to methods like Latent Dirichlet Allocation (LDA) or Latent Semantic Indexing (LSI) that identify keywords statistically and then use the keywords to classify documents. In contrast, other unsupervised learning methods like K-Means first cluster the documents in a purely statistical way and then identify common words that are then used to label the clusters. The method I am describing here also differs from traditional ML in that it does not use a vector approach to find the keyword ‘clusters’, rather it uses Solr facets to determine keyword co-occurrences in documents. Like vector-based supervised learning techniques, there is some aspect to this method that is automated and some that requires human subject matter expertise. The goal though is to minimize the ratio of human-time vs machine time – i.e. lessening the workload on the human at the expense of the machine.

Keywords have the basic attribute that they occur frequently in documents about a certain subject but rarely if at all in documents that do not discuss that subject. An important characteristic of keywords that we wish to discover ‘automagically’ is that they are not just statistically but also semantically related to the document subject – that is, they are there for a reason, not just by chance! Acronyms, jargon terms and other specialized terms or phrases tend to make good keywords. Keywords that are associated with the same subject tend to co-occur in the documents that we want to classify. For example, the terms ‘kerberos’ and ‘ldap’ tend to occur only in documents that discuss security or related concepts. Thus one way to distinguish a keyword from a non-keyword is that they tend to cluster and that they only do so in the documents that we want to tag. General language or stop words do not share this characteristic, even rare ones.

So in a nutshell, the method that I will describe looks for what I call “keyword clusters” first, using a simple faceting technique and then uses these keyword clusters to cluster documents (and potentially to do other interesting things). I will cover the latter topic in Part II of this post.

If successful, methods like this can bridge the gap between purely human curated taxonomies and machine generated ones – it lets machines and humans both do what they are good at. The knock on taxonomy based approaches is that they require constant maintenance, especially as concepts and terminology evolve. This approach helps to solve this problem by letting the machine find the new keyword relationships either because a new word has been coined or because an existing term is being used in new ways. An example of the latter would be the word ‘spark’ – an existing word that has now acquired a new meaning in to the distributed computing and machine learning communities. The term ‘apple’ would be another example. An example of the first type would be ‘hadoop’ – which wasn’t a word in our lexicon until Doug Cutting introduced it to us. (For that matter Doug, what is a ‘nutch’?) The Apache Foundation is a source of many of these terms – new or old. Another example here is ‘zookeeper’ which manages a different type of ‘Zoo’ than the one in San Diego (but still has the problem of waste removal but in a different sense of course).

Like all machine learning approaches, the automated keyword clustering algorithm that I will describe is not perfect – it makes mistakes. But unlike the traditional mathematical vectorization approaches, the mistakes can be corrected by human curation. For example, the technique does not detect phrases. It detects that the terms ‘certification’ and ‘authority’ are related to the subject of ‘Security’ but it doesn’t know that ‘Certification Authority’ is a key phrase in this domain. I have addressed this problem using the Autophrasing TokenFilter that I introduced to the Solr community a few years ago (1). This enables subject matter experts to pre-seed the process with known key phrases like “big data” that would otherwise be treated as two noise words (pretty much). Another example would be the term “query”, which for the content set that I am working with (Lucidworks blogs) is pretty much ubiquitous – but “query parser” or “query response time” are specific concepts that make very good key phrases. So, like taxonomy management, the keyword clustering approach benefits from vocabulary curation like stopwords, autophrases and synonyms. It works pretty well out of the box but works much better if you seed it with some semantic knowledge. Some of these things can be fed with other machine learning techniques such as NLP parsers to detect noun phrases.

Automated generation of keywords from a Solr collection:

So how does keyword clustering work? The first step is to create a Solr collection of some set of documents. I used our lucidworks.com site and did some processing in a Lucidworks Fusion Index Pipeline to focus on blog articles (as I write this, I am creating more data for a future execution of this method – and also creating some problems for it too – as discussed below.) With the Fusion Index Pipeline HTMLTransformStage (moved to the Datasource layer in Fusion 3.0), I pulled out just the article text sections of blog articles and article title and removed some boilerplate. It turns out that getting clean data for this procedure is very important. Indexing the raw HTML page introduces a lot of noise that can trip you up. Once I have this collection autophrased and indexed, I can use the /terms request handler to get a list of all of the unique terms and phrases in this set of blog articles from which I will try to select out the key terms. To start with, I pre-filter the raw /terms list, choosing terms that are not too common (doc frequency < 25%) but not too rare either (doc Frequency > 1%).

Since in Solr, I can facet on any field whether tokenized (text) or not, for each term in the terms list, I do two sets of queries: a positive query in which I ask for all documents that have a given term (from the terms list) in the field article_text and a negative query in which I ask for all documents that do not have this term. For each of these queries, I facet on the autophrased article_text field – and collect a lot of facet values (50,000). (Note that for term sets greater than 16 Million, faceting on a tokenized field will not work – so I needed to do a little fiddling at index time to get the term set into a multi-valued string field, with Doc Values turned on). The queries that I sent to Solr for each candidate keyword look like this:

Positive Query:

q=article_text_auto:<query>&facet.field=article_terms&facet.mincount=2&rows=0&facet.limit=50000&fq=article_text_auto:*

Negative Query:

q=-article_text_auto:<query>&facet.field=article_terms&facet.mincount=2&rows=0&facet.limit=50000&fq=article_text_auto:*

So the difference is the minus sign in the negative query case. I through in fq=article_text_auto:* to make sure I am only dealing with documents that have a value for this field (about half the documents from the crawl didn’t). For each facet value (which is another term or phrase in the document set relative to the test word) I then calculate a ratio of the number of facet counts divided by the total number of hits for the positive query and another ratio for the negative query case. I then calculate the ratio for the positive query divided by the ratio for the negative query – this tends to be a high number (as much as 1000 or more) for terms that concentrate in the same documents as the initial term and low for words that occur with about the same frequency in the positive and negative result sets. I also use a proximity search to filter out terms that do not occur within 10 words of the term being tested. I then apply a threshold to remove non-keywords. The threshold that I settled on was 5.0 meaning that a word was more than 5 times more likely in the positive set compared to the negative set.

    Facet count (positive Q)
   ---------------------------
    Total counts (positive Q)
-------------------------------------------
    Facet count (negative Q)
   ---------------------------
    Total counts (negative Q)

This technique works pretty well in that it pulls out related keywords but it also brings in some noise words that have the characteristic of being relatively rare that due to the sample size (about 950+ blog articles) get randomly associated with the source term. Here is an example of the output for a security keyword ‘authorization’ (the numbers on the left are the ratio of ratios that were calculated from the facet and hit counts for the positive and negative queries):

178.312 transport
92.458 authentication
84.911 kerberos
75.648 permission
74.297 login
47.550 secure
39.625 ldap
39.625 specifies
39.625 authorized
39.625 identity
22.763 mechanism
16.982 protocol
14.859 rule
14.265 subsequent
12.608 security
12.297 status
11.598 layer
10.489 browser
9.510 plugin
8.741 log
8.491 properly
7.925 domain
7.034 request
6.426 operation
6.202 api
5.683 require
5.487 directory
5.283 unique
5.283 completely
5.105 server

As you can see, given the source term ‘authorization’, it found related terms like ‘authentication’, ‘kerberos’, ‘permission’, ‘login’, ‘secure’ and so on, but also terms like ‘transport’, ‘subsequent’, ‘properly’, ‘completely’ etc. The probable reason for this is that the non-keywords are rare and so by chance were more concentrated in these documents. They could also reflect author styles as the blog articles about security were written by a few authors. Given a larger sample size, it it possible that there would be less of this kind of noise. Continued curation of the stopwords list helps too.

So this is pretty good already, but how can we improve the situation? Well remember that I calculated these clusters for every word in the document set, only some of which were determined to be keywords based on the thresholding of the ratios (about one third). That means that I have clusters for most of the related words for each term too. So what I did was to ask these other clusters, which terms in the source term set they also associated with, and counted the ‘votes’ so to speak. This analysis is called a cross-correlation. In other words, I am boosting cases where the associated terms are reciprocal. I then used this count to boost the scores and applied another threshold. This made the clusters for the key term ‘authorization’ somewhat cleaner.

1849.166 authentication
1443.482 kerberos
983.420 permission
742.968 login
594.375 ldap
570.600 secure
317.000 specifies
250.396 mechanism
203.786 protocol
198.125 identity
133.734 rule
100.863 security
94.401 browser
86.082 status
85.590 subsequent
61.185 log
55.475 domain
49.238 request
46.390 layer
43.892 directory
25.525 server

So what is happening here is that key terms tend to vote for each other because they are semantically not just statistically related in the document set, but unrelated terms (relative to the keyword being examined) do not, so they tend to drop out of the picture. Some noise words hang around but overall, the result is cleaner.

Here is another example for the phrase “machine learning” – extracted as a token using the AutophrasingTokenFilter:

296.237 classification
193.387 natural language processing
192.554 collaborative filtering
148.542 semantic search
137.538 vector
123.785 taming text
116.379 computer science
114.432 apache mahout
113.782 semantic
110.031 temporal
102.079 algorithm
82.523 regression
74.271 ontology
68.769 artificial intelligence
61.892 clustering
60.806 technique
60.173 apache spark
59.600 scientist
58.189 pattern
55.015 mathematical
55.015 mining
50.431 information retrieval
50.014 phd
34.385 timothy
33.009 engineering
27.508 academic
23.578 data analytics
22.506 detect
20.269 map reduce

There is not too much to quibble about here. ‘timothy’ undoubtedly refers to our own Tim Potter who did a lot of work with Apache Spark and other machine learning tools. Names are noise to subject matter determinations, but they can be useful later to identify experts in those subjects. Some terms like ‘engineering’, ‘academic’ and ‘phd’ are a bit non-specific but overall, the cluster paints a pretty good picture of the subject area.

So with two relative simple techniques, I now have a set of keyword clusters that would make good sense to a subject matter expert. I should also mention at this point that some true noise words did stick around (most were filtered out in the first step) so I did some additional cleanup by creating a stop word list that was used to re-compute the first phase. The stop words that got through on my initial attempts tended to have long lists of associated terms with relatively low ratio scores but above the threshold that I set. Most were obvious stop words like ‘them’ and ‘being’ – so this represents some of the human sweat-equity that this technique uses. Other than that, the clusters above are entirely machine generated. Note that the stop words should be reusable for any subject areas so that this effort can be leveraged to cluster other document sets.

Noun Phrases,Term Proximity and Traditional Machine Learning Approaches

Two of the most significant improvements were achieved by adding a proximity filter and isolating noun phrases or autophrases – sequences of two or more tokens that mean one thing. In the data set that I used for this work (Lucidworks blogs), words like ‘search’, ‘query’, ‘data’ and ‘Solr’ are ubiquitous. By themselves they can have many contexts, but a ‘query parser’ is a very different thing than a ‘query log’, ‘query analytics’ or a ‘query request’. The phrase ‘critical issue’ means something very different from ‘not an issue’. In the same way, ‘data science’ is not the same thing as a ‘data model’, ‘data center’ or even a ‘data scientist’. A ‘Solr collection’ is different from a ‘Solr query’, a ‘Solr core’, a ‘Solr cluster’ and so on. For that matter, a ‘Solr collection’ is not the same thing as ‘garbage collection’. In other words, when it comes to semantics, phrases matter – a lot. Adding these autophrases made a considerable improvement in cluster quality, both by adding more precise terms and by reducing the frequency of ambiguous single terms that were absorbed by the noun phrases. A good synonyms list can also contribute to cluster quality. By collecting a set of synonymous terms under one roof so to speak, synonyms do two things for us. First they can increase the prevalence of a keyword or phrase by boosting its frequency to where it becomes significant by summing the contributions of its synonymous alternatives. Second, with noise words, synonyms can boost the frequency of the preferred variant to the point that they drop out of the picture. The net effect is an overall improvement in signal-to-noise. So the same best-practices that apply to search applications also apply to text analytics (this is an important realization by the way). The more semantic knowledge the algorithm has to work with up front, the better it does. No rocket science here, just hard reality, as well as something that I have been harping on in this and recent blog posts – in text search – semantics matter.

Another significant boost in quality was achieved by adding proximity searches to filter the clusters. The current implementation uses a proximity radius of 10 terms. Since documents may discuss more than one topic, the co-occurrence of keywords within sentences or paragraphs provides a tighter semantic focus and provides better statistics for clustering. With respect to traditional ML approaches like LDA, this is one metric that cannot be used with complete documents. The reason is that the positional information needed for proximity analysis is thrown away in a bag-of-words measurement like TF-IDF which the traditional vector approaches rely upon. Once information is lost, it cannot be recreated. Term vectors can only be computed on a document-wide basis and thus can only have document level granularity. This is a major source of noise in and of itself, especially for documents that may discuss a range or topics. In my view, clustering techniques that can operate at a finer level or granularity will be superior for this reason alone. However, all is not lost. Pre-chunking or snippeting of documents into topic focused snippets or simply breaking up the document by paragraphs can be used with standard ML methods. Autophrasing can also be used to improve the precision of vector-based ML methods like LDA and Word2Vec. Anything we can do to improve signal to noise will be of benefit and getting agreement between disparate methods gives us greater confidence in the results.

Comparison with Word2Vec

The machine learning algorithm output that most resembles what the keyword clustering algorithm produces is Word2Vec. Both produce what can be called “term association vectors”. There are some key differences however. For one thing, the keyword clustering approach works well for small document sets whereas Word2Vec typically requires much more data to crunch to come up with decent term association vectors. They also take a different path. Keyword clustering is top-down. It starts at the document and corpus level and works its way down to individual key terms, filtering out noise words on the way. Word2Vec is by contrast bottom up – it looks at individual words and nearby words that are statistically associated with them. It finds synonyms, matching terms in noun-noun, adjective-noun and noun-verb phrases. Interestingly, using autophrasing with Word2Vec turns it more into “Thing2Vec”. With individual words, like “network”, Word2Vec will find associations like “card”, “adapter” and “traffic” but if I give it phrases like “network card”, “network adapter” and “network traffic” as inputs – it will find higher-level concepts like “connection failure”. It will use stopwords or rare terms with equal affinity as long as there is a statistically significant association with the head term of the vector. The document corpus is significant only in providing domain isolation for words that might have other meanings in a wider context and in providing enough data to get stable correlations.

For example, the famous word vector math trick that Word2Vec does with “Man is to Woman as King is to Queen” is undoubtedly accelerated by leaving stopwords in, especially gender-specific subject, object and possessive pronouns such as “he, him, his” and “she, her, hers” respectively. Given enough data, it should eventually learn to do this pivot without these pronouns using names often associated with royalty such as “Elizabeth”, “Victoria”, “Henry” or “George” as association terms but it should converge faster if we leave the stopwords in. However, stopwords are a double-edged sword here – see Lasya Marla’s recent Lucidworks blog post on Word2Vec for more on this. By contrast, the keyword clustering technique seeks to avoid noise words with pre-screening by document frequency, help from stopwords lists and by its own filtering logic.

What is remarkable is that both automatically produce term association vectors that “make sense”. They differ in detail and this itself is useful, as it is feasible to combine them to get better coverage of the concepts. I won’t show you the Word2Vec output for the Lucidworks Blogs because that would do it a disservice – there is simply not enough data for Word2Vec to work with in this data set. However, I have done this comparison with much larger proprietary customer data that I can’t show here. I’ll provide this comparison in a future post where I crunch some public-domain data with more heft like Wikipedia or the Enron email set using both techniques.

So for small datasets, the keyword clustering approach shines. One of its downsides is scalability – since it is an N-squared algorithm that requires hitting Solr a lot, it goes very slowly for large data sets. However, since it doesn’t need much data to get good results, randomly sampling down a large collection is an effective technique when dealing with larger document sets. This works well because the association weights are based on a ratio. Therefore, as long as your downsampling technique is even handed, the same or very similar clusters can be produced using a fraction of the data. I verified this on a large data set – the clusters for 10% of the data were nearly identical to the clusters for 100% but it only took a fraction of the time to compute. Or you can simply let it go at its pace against a backup Solr instance on a relaxed schedule because vocabulary associations do not nearly change as fast as content does – in fact for most associations the change rates are glacial. Only when new vocabulary is introduced or new usages for old vocabulary (e.g. ‘spark’) do we need to update the vectors. Slow may therefore not be such a bad thing as long as we can ensure quality in the end. Nevertheless, it is competitive with Word2Vec – on a proprietary dataset of about 0.5 Million documents, both techniques finished in about 5-10 hours or so.

Where do we go from here?

The next step is to use these keyword clusters to improve document clustering / classification. I’ll cover that in Part II of this series. The key to the approach is to calculate a feature vector which has the cumulative keyword score for a document’s top 20 keywords and then to use standard K-Means clustering to do the heavy lifting part. I am getting some very exciting results from this so stay tuned. (I will also have github code for the series by then).

The keyword clusters are also useful in other ways – query expansion immediately springs to mind. The associated terms in each cluster are not synonyms per se (although some may be) and can be easily looked up at query time to find associated terms for a given query term by storing them in a Solr ‘sidecar’ collection. They even come with built-in boost factors! Easy-peasy! At this point, the individual keyword clusters are not comprehensive though. They form a loosely coupled network of interlinked clusters. For example, the terms ‘saml’, ‘sso’ and ‘authentication’ are not in the same cluster but they are only one or two ‘hops’ from each other. Document clustering helps to coalesce these clusters providing a better framework for classification or query expansion as I will show in my next post. The clusters can also be used for a better “More Like This” implementation or a related terms taxonomy that can be used in a similar fashion to faceted navigation. Right now, these techniques are based on TF-IDF which the document clustering work shows is much lower in signal-to-noise than the keyword clustering method is. As I talked about earlier, clean up the garbage before it goes into the ML grinder I say.

I should also mention that the clusters shown above can be thought of as “models” in the same way that a K-Means cluster or a Logistic Regression model can. The main difference is that the keyword cluster models are textual and can therefore be edited by people. This provides a new way to work on the 80-20 problem. So while the technique that I have described is fully ‘unsupervised’ or automatic, I have ‘baked-in’ subject matter expertise at various points in the process from creating autophrase and stop words and synonyms lists to editing of the final results. This means that if your initial attempts are not so impressive, you can improve it in a step-wise fashion that is based on domain knowledge, not random walks in the dark (within the ML black box that is). Hybrid methods like this are the one way to meld machine learning with human knowledge to go beyond what either one can do on its own.

The 2025 AI reality check: What 1,100+ companies actually deploy vs. what they claim

2025 Generative AI Benchmark Report reveals only 6% deployed agentic AI while...

The State of Generative AI in Global Business: 2025 Benchmark Report, Dawn of the Agentic AI Era

The first-of-its-kind study using autonomous AI agents to benchmark AI capabilities across...

How an electronics giant meets engineers where they are, with 44 million products in catalog

Meet Mohammad Mahboob: A search platform director navigating 44 million products across...