Apache Solr and Optimizing Your Index
Optimize operations and expungeDeletes may not be as bad for you as they once were. They are still expensive and should not be used lightly.
However, these operations are no longer as prone to the problems described in my previous article. If you are not familiar with Solr/Lucene's segment merging process, you'll find some useful background information in that blog.
Summary
- expungeDeletes and optimize/forceMerge, as implemented by the default TieredMergePolicy (TMP), behave quite differently as of Apache Solr 7.5.
- TieredMergePolicy will soon have additional options for controlling the percentage of deleted documents in an index. See LUCENE-8263 for the current status.
- TMP now respects the maxMergedSegmentMB configuration parameter by default for forceMerge and expungeDeletes.
- If you need the old behavior for forceMerge (optimize), you can get it by specifying maxSegments on the optimize command (see the sketch after this list).
- expungeDeletes has no option to exceed maxMergedSegmentMB.
- If you have created very large segments, then as deleted documents accumulate in them, those segments will be "singleton merged" to purge the deleted documents. NOTE: currently this only happens when your index has about 50% deleted documents, although this may become tunable in a follow-on JIRA.
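Here is the sketch referenced above, showing both invocations; the host, port, and collection name (techproducts) are placeholders, not anything from the original post:

```shell
# Solr 7.5+ default: optimize now respects maxMergedSegmentMB, so this
# will NOT collapse the index into one giant segment.
curl "http://localhost:8983/solr/techproducts/update?optimize=true"

# Old behavior on demand: explicitly request a single segment, which can
# produce a segment far larger than maxMergedSegmentMB.
curl "http://localhost:8983/solr/techproducts/update?optimize=true&maxSegments=1"
```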
Introduction
A while back I blogged about a "gotcha" when using Solr's optimize and expungeDeletes commit options. As of Apache Solr 7.5, the worst-case scenario described in that post is no longer valid. If you want to see all the gory details, read LUCENE-7976 and the associated JIRAs. WARNING: your eyes may glaze over when Solr/Lucene developers discuss something like this.
As of Solr 7.5, optimize (aka forceMerge) and expungeDeletes respect the maxMergedSegmentMB configuration parameter when using TieredMergePolicy, which is both the default and the recommended merge policy.
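For orientation, maxMergedSegmentMB lives on the merge policy in solrconfig.xml. A minimal sketch; 5000 MB is TMP's default ceiling, and the surrounding elements are abbreviated:

```xml
<!-- solrconfig.xml, inside <indexConfig> -->
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <!-- Cap on the size of segments produced by merging. As of Solr 7.5
       this cap also applies to optimize/forceMerge and expungeDeletes. -->
  <double name="maxMergedSegmentMB">5000</double>
</mergePolicyFactory>
```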
For such a simple statement, there are some fairly significant ramifications, hence this blog post.
A Quick Review of forceMerge and expungeDeletes Prior to Apache Solr 7.5
First, a quick review. The default behavior when running optimize or specifying expungeDeletes on the commit command was that all segments selected for merging were merged into a single segment, no matter how large the resulting segment became.
- For optimize, the entire index was merged into the number of segments specified by the maxSegments parameter (default 1).
- For expungeDeletes, all segments with more than 10% deleted documents were merged into a single segment. (Both commands are invoked as sketched below; only their behavior has changed since.)
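The invocations themselves haven't changed between versions, only what they do under the hood. A minimal sketch, again with placeholder host and collection:

```shell
# Hard commit that also merges away segments with > 10% deleted docs.
curl "http://localhost:8983/solr/techproducts/update?commit=true&expungeDeletes=true"

# The same request expressed as an XML update message.
curl "http://localhost:8983/solr/techproducts/update" \
  -H "Content-Type: text/xml" \
  --data-binary '<commit expungeDeletes="true"/>'
```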
For "natural" merging as an index is being updated, a process something like the following kicked off on every hard commit:
- all segments with < 50% of maxMergedSegmentMB "live" docs were examined and selected segments were merged.
- "selected segments" means heuristics were applied to choose the merges that required the least work while still respecting maxMergedSegmentMB.
The critical difference here is that optimize/forceMerge and expungeDeletes did not respect maxMergedSegmentMB. In both cases, merged segments have all data associated with deleted documents in the original segments removed. This reduces the amount of disk space occupied by the index and reduces the number of segments in the index.
Why Was maxMergedSegmentMB Implemented in the First Place?
There’s a long discussion here, but I’m going to skip much of it and say that keeping an index up to date involves a number of competing priorities, and maxMergedSegmentMB was part of resolving them. The various bits that need to be balanced include:
- Keeping I/O under control as indexing and searching can be sensitive to I/O bottlenecks.
- Keeping the segment count under control to prevent running out of file handles and the like.
- Keeping memory consumption under control; the idea of requiring, say, 5G of heap just for indexing is unacceptable.
- When TMP was originally written, there were significant speed gains to be had by merging down to one segment; later versions of Solr don’t show the same level of improvement.
As Lucene has evolved, the utility of forceMerge/optimize has lessened, but the underlying merge policy needed to catch up.
The New Way
As of Apache Solr 7.5, optimize (aka forceMerge) and expungeDeletes now use the same algorithm that "natural" merges use. The relevant difference between "natural", "forceMerge/optimize", and "expungeDeletes" is which segments are candidates for merging.
There are three cases:
- natural: All segments are considered for merging. This is the normal operation when indexing documents to Solr/Lucene. The various possibilities are scored and the cheapest ones are chosen as measured by estimates of computation and I/O. Large segments with few deletions are unlikely to be considered cheap and thus rarely merged.
- expungeDeletes: Segments with > 10% deleted documents, no matter how large, are considered for merging.
- optimize: Siiiigh, there are two sub-cases here, depending on whether maxSegments is defined:
- maxSegments is specified: all segments are eligible.
- maxSegments is not specified: all segments with < maxMergedSegmentMB of "live" documents, plus all segments with deleted documents, are eligible. Thus segments > maxMergedSegmentMB that have no deleted docs are not eligible.
"Wait!" you cry. "You’ve told us that maxMergedSegmentMB is respected for expungeDeletes and optimize/forceMerge, yet you can specify maxSegments=1 and have segments waaaaaay over maxMergedSegmentMB! How does that work?"
I’m so glad you asked (I love providing both sides of the argument. While I can disagree with myself, I never lose the argument! Yes you do. No I don’t. You’re a big stupid-head… Excuse me, my therapist says I should perform calming exercises when that starts happening).
Ok, I’m back now.
TMP in Solr 7.5 introduces a "singleton merge". Whenever a segment qualifies for merging, if it’s "too big" it can be re-written into a new segment, removing deleted documents in the process.
This has some interesting consequences. Say you have optimized down to 1 segment and start indexing more docs that cause deletions to occur. The blog post linked at the top of this article expounds on the negatives there, namely that that single large segment won’t be merged away until the vast majority of it consists of deleted documents. This is no longer true. When certain other conditions are met, a "singleton merge" will be performed on that one overly-large segment, essentially rewriting it to exactly 1 new segment and removing deleted documents. It will gradually shrink back to under maxMergedSegmentMB, at which point it’s treated just like any other segment.
WARNING: This comes at a cost of course, that cost being increased I/O. Let’s say you have a segment 200GB in size. Let’s further say that it consists of 20% deleted documents and is selected for a singleton merge. You’ll re-write 160GB at some point determined by the merging algorithm. It gives you a way to recover from conditions outlined in the blog linked at the beginning of the article that doesn’t require re-indexing, but it’s best by far to not get into that situation in the first place.
I’ll repeat this several times:
Do not assume optimize/forceMerge and/or expungeDeletes are A Good Thing, measure first.
If you can show evidence that it’s valuable in your situation, then only do these operations under controlled conditions as they’re expensive.
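"Measure first" can be as simple as looking at what you’d actually reclaim. One hedged way to do that: Solr’s implicit segments handler reports per-segment sizes and deleted-document counts (collection name is again a placeholder):

```shell
# How much of the index is actually deleted documents? Check the
# per-segment "delCount" values before deciding an optimize or
# expungeDeletes is worth the I/O.
curl "http://localhost:8983/solr/techproducts/admin/segments?wt=json"
```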
But You’re Still Talking About 50% Deleted Documents; That’s Too Much!
I’m so glad you asked (reprise).
A follow-on JIRA, LUCENE-8263, discusses the approach being used to control this. I’ll update this blog post when the code is committed to Solr. You’ll be able to specify that your index contain no more than a defined percentage of deleted documents.
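To make that concrete, here is a purely speculative sketch of what such a knob might look like in solrconfig.xml. The parameter name deletesPctAllowed comes from the patch under discussion in LUCENE-8263 and may well change before anything is committed:

```xml
<!-- HYPOTHETICAL: based on the uncommitted LUCENE-8263 patch. -->
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <!-- Rough ceiling on the percentage of deleted documents in the index;
       lower values trade more merge I/O for less wasted space. -->
  <double name="deletesPctAllowed">33.0</double>
</mergePolicyFactory>
```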
WARNING: TANSTAAFL (There Ain’t No Such Thing As A Free Lunch). This reduction in deleted documents will come at the cost of increased I/O as well as CPU utilization. If the percentage of deleted docs matters to you, it’s preferable to just run expungeDeletes during off hours.
Why expungeDeletes rather than forceMerge/optimize? Well, it’s a judgement call, the consideration being whether you’re willing to expend the resources to rewrite a segment that’s 4.999G in size to reclaim 1 document’s worth of resources.
What Do You Recommend?
In order of preference:
- Don’t worry, be happy! Unless you have good reason to require that deleted docs are purged, just don’t worry about it. Let the default settings control it all.
- When LUCENE-8263 is available (probably Solr 7.5), assign a new target percent deleted to TMP (in solrconfig.xml for Solr users), and measure, measure, measure. This will increase your I/O and CPU utilization during regular indexing. If you only test in a development environment, that increased load may not seem significant, but it may become so in production.
- Periodically execute a commit with expungeDeletes. Don’t fiddle with the 10% default, it represents a reasonable compromise between wasted space and out-of-control I/O. Lucene is very good at skipping deleted docs, the main expense is disk space and memory. If those aren’t in short supply, leave it alone (or even increase it).
- Optimize/forceMerge periodically. This is not nearly as "fraught" as before, since maxMergedSegmentMB is respected, so you won’t automatically create huge segments. But it will consume more I/O and CPU than an expungeDeletes.
- Optimize/forceMerge with maxSegments=1. This is OK if (and only if) you can tolerate re-running the command regularly. One typical pattern is when an index is updated only once a day during off hours; you can follow that update with an optimize/forceMerge (see the cron sketch after this list).
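For that last pattern, a minimal cron sketch; the schedule, host, and collection are placeholders and assume the nightly index update finishes before it fires:

```shell
# crontab: at 02:30, after the nightly update, merge down to one segment.
# Tolerable only because it is re-run every night as segments accumulate.
30 2 * * * curl -s "http://localhost:8983/solr/techproducts/update?optimize=true&maxSegments=1" > /dev/null
```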
Conclusion
Optimize/forceMerge are better behaved, but still expensive operations. We strongly advise that you do not do these at all without seriously considering the consequences. A horrible anti-pattern is to do these operations from a client program on each commit. In fact we discourage even issuing basic commits from a client program.
If you’ve tested, rather than assumed, that optimize/forceMerge and/or expungeDeletes is beneficial, run them periodically from a cron job during off hours.
This post was originally published on June 20, 2018.