The Fusion platform, powered by Solr, is our primary interface for searching various kinds of content on our flagship product, Fitch Connect. Some content types are directly searchable, while others are available through filtering and drilling down into the data. The complexity of our data requires us to store the content in different collections in Solr. This session will cover a use case on how to implement streaming expressions. We will showcase the power of streaming expressions in handling complex data structures and how they enable us to perform text search, numeric range filters, and alphanumeric range filters, all together at the same time on different data sets. We will also present the solution we devised to enhance query performance through a hybrid approach involving streaming expressions and Fusion query pipelines. We will cover use cases on streaming source and streaming decorator implementation.
Weiling Su, Sr. Software Engineer, Fitch Solutions
Prem Prakash, Technical Lead, Fitch Solutions
This session will be helpful for Solr developers and architects. If you are working with complex data structures in Solr and looking for a solution for querying and filtering the data, you don’t want to miss this session!
In this session attendees will learn how to implement queries using streaming expressions involving complex data structures and joins. We will cover various streaming expression functions and how to integrate streaming expressions with Fusion.
Prem Prakash: Hello and good afternoon everyone. Welcome to our session on Powering Advanced Search using Solr Streaming Expressions.
The session will be presented by myself. I’m Prem Prakash, a Technical Lead at Fitch Solutions, working on the search platform.
I will also be joined by my colleague Weiling Su, Weiling is a Senior Developer with Fitch Solutions, and she is our in-house Solr and Fusion expert.
We’ll start with a brief introduction to our organization. Fitch Solutions is part of the Fitch Group. We offer integrated credit and macro intelligence solutions, helping clients to manage their counterparty risk. We provide insights and expertise into credit markets and offer various solutions through our flagship product, Fitch Connect, which helps our clients make more informed decisions.
If you want to know more about Fitch Solutions, please visit our site, www.fitchsolutions.com.
If you wish to explore the products and solutions provided by Fitch Solutions, you can register at login.fitchconnect.com/register.
Moving on to our agenda for the session today.
So this is what our search page looks like. On the top we have Select Data Items. We also have various filter categories, such as Sector, Geography, Financials and Credit. Sector, Geography and Credit offer selectable filters, whereas for Financials, we offer numeric range filters as well.
Once the filters are applied, we get the results in the “manage results” section.
Taking a closer look at the filter options available to us, we have entity filters where we can perform a keyword search. We can search for an entity by its name and apply that as a filter. We also have the option to filter the entities by regions, and a particular market sector.
With the Ratings Filter, we can multi-select various rating options. We also have some filters within the ratings to apply rating actions, rating watches and rating outlook on top of the Ratings Filter.
Coming to the financial filter options, once we select a financial field, we get an option to select a Range Filter in terms of above, below, and between, and apply the Financial Filter by providing the values in the text box here.
These are the options that we are giving to our users to filter the data by.
Now when we look at the data structure here, our data is stored in three different collections: one for the entity data, another for the ratings, and the third for the financials. If you look here, the entity and ratings data is joined by the entity ID, and all three collections are joined by the agent ID.
Now our requirement is to join the data from these three collections and perform a keyword search on top of it, apply your text filter, and apply numeric range filters for the financial data, and we also want the ability to retrieve the historical data as well.
This is a closer look at what the data looks like. If you see, the entity, ratings and financial data is all joined by the agent ID. The entity ID is used interchangeably with the agent ID in the financial collection.
Now the entity and ratings collection if you’ll see is joined by the entity ID.
Now our requirement is to pull all the data, available here in these three collections, by a single query.
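To make those join keys concrete, here is a minimal Python sketch of how documents in the three collections might relate. All field names and values are illustrative approximations of what is shown on the slide, not our actual schema:

```python
# One sample document per collection (hypothetical field names).
entity = {"id": "E1", "agentId": "A100", "entityId": "ENT-1",
          "entityName": "Example Bank"}
rating = {"id": "R1", "agentId": "A100", "entityId": "ENT-1",
          "ratingType": "Long-Term IDR", "ratingValue": "A+"}
financial = {"id": "F1", "agentId": "A100",  # entity ID used interchangeably
             "totalAssetBank": 3.5, "periodType": "Annual"}

# The join requirement in miniature:
# entity joins ratings on entityId; all three collections share agentId.
assert entity["entityId"] == rating["entityId"]
assert entity["agentId"] == financial["agentId"] == rating["agentId"]
```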
Okay, so how do we solve this problem? Before we solve the problem, we also looked into the option of storing all the data into a single collection. However, that was not an option for us, given the complexity of the data structure and the nature of the historical data available to us.
What benefit do we get from storing each data set in its own collection? First and foremost, the data structure is quite simple this way and easy to handle. It provides better isolation of the data, and retrieving only a particular data set is easier due to that isolation.
There are many of the requirements where we need to show only a particular data set on different screens.
The data refresh also can be handled quite easily, as different datasets may come from different sources.
Overall we found that the maintainability of the data is much easier if we store the data into its own collection.
In order to solve our problem, we tried several approaches that support Solr data join. We looked at various options, such as Solr Join Query and Sub Query, Solr Parallel SQL, Fusion SQL, et cetera.
We did a side-by-side comparison on the various parameters listed here. We looked at whether they support joins, which is our basic requirement.
Do they have any limit on the number of collections that we can join?
In future, we wanted the ability to join more than three collections as well.
We also looked at whether we can perform filters on both sides of the join, whether they support keyword search, sorting, grouping, pagination, and whether they allow any customizations to be applied on the functions.
Pagination support was one of our key requirements. Pagination is listed here as a “maybe”; we’ll come to the pagination implementation with streaming expressions in the later slides.
Based on our analysis, streaming expression emerged as a clear choice for us, which solved all our problems of retrieving the data from different collections.
What are the benefits of using streaming expressions? They provide a powerful stream processing language and have a load of functions to perform different operations. There is a big, growing function library and good community support. One community contribution was very useful to us; we’ll cover more about that in the later slides.
Talking a little bit more about streaming expressions, they provide three different types of functions: stream sources, stream decorators, and stream evaluators. Stream sources originate the streams.
The most commonly used function here is search, which performs a query on a Solr collection. Other available functions are tuple, facet, topic, et cetera. Stream decorators wrap other stream functions or perform operations on streams.
The functions available here are: select, sort, top, innerJoin, et cetera.
The stream evaluators are different from the stream sources and stream decorators. Both stream sources and stream decorators return streams of tuples. Tuples are nothing but rows of data retrieved from Solr. Stream evaluators are more like traditional functions that evaluate their parameters and return a result.
Most of the functions here are self-explanatory, for further details, you can refer to the reference guides available in the Solr documentation.
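As a rough sketch of how these function types compose, a stream decorator simply wraps a stream source. The collection and field names below are hypothetical, and the expressions are built as plain strings for illustration:

```python
# Stream source: reads a stream of tuples from a Solr collection.
source = ('search(entity, q="entityName_t:Bank", '
          'fl="entityId,entityName_t", sort="entityId asc", qt="/export")')

# Stream decorator: wraps the source and operates on its tuple stream.
decorated = f'sort({source}, by="entityName_t asc")'

# (A stream evaluator, by contrast, computes a per-tuple value and would
# appear inside a function such as select().)
print(decorated)
```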
So with this, I’ll hand it over to Weiling who will cover the next set of slides, and will walk us through a use case on the streaming expression.
Weiling Su: Thank you Prem. My name is Weiling, a Senior Search Engineer and Fusion Admin at Fitch Solutions. I have worked with and followed Solr since Solr 1.4. We will walk you through a use case to understand how streaming expressions are implemented to support Fitch Solutions’ One Search requirement.
For example, one of the One Search queries looks like this: find an entity whose name matches “Bank”, whose financial Total Asset (Bank) is greater than or equal to 2.0 million, whose rating type is Long-Term Issuer Default Rating, and whose rating value is A minus, A, A plus, AA, AA plus, or AAA. For the output, we would like to sort by entity name in ascending order, with the offset starting at zero and a page size equal to 25.
Let’s look at the entity data stored in the core collection. We use a search function to filter out the matching tuples from the core Solr collection. A tuple is a streaming data item, essentially a row of field values retrieved from Solr.
In the core collection, we apply a keyword search on entityName_t with “Bank” and add filtering conditions by specifying content type equal to entity. For the sort option, we can sort on any of the returned fields.
Since our next step is to perform joins, and the join operation requires both joined streams to be sorted on the same field and in the same order, you can save an extra sort function by specifying the same sort field and order here, to prepare for the later join.
After that, we use the select function to project the fields and/or rename them. In this case, we want to rename the entity ID to the issuer-or-transaction ID, and the ID to the group-by ID. The qt parameter by default uses the search handler, which returns paginated output, so you won’t get the total record set back. If you want to apply the join to the whole collection’s data, the qt parameter must be set to export.
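Putting those pieces together, the entity-side stream might look roughly like this. The field names and the select renames are approximations of what is said in the talk, not verified against the real schema:

```python
# Entity-side stream: search with /export, then select to project/rename.
entity_stream = (
    'select('
      'search(entity,'
      ' q="entityName_t:Bank",'
      ' fq="contentType:entity",'
      ' fl="entityId,id,entityName_t",'
      ' sort="entityId asc",'
      ' qt="/export"),'
    ' entityId as issuerOrTransactionId,'
    ' id as groupById,'
    ' entityName_t as entityName)'
)
print(entity_stream)
```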
In our case, we want to perform an innerJoin to find all the matches at once. In the second collection, we filter data out of the financial Solr collection using the search function, applying the range filter on total asset bank greater than or equal to 2.0, and also specifying other filter criteria: periodType equal to Annual, latestStatement equal to true, statement date within the last 18 months, AccountingStandards equal to USGAAP, and consolidationType equal to Consolidated.
We then use the select function to rename the fields to more meaningful names, and perform the sort function on entity ID, in ascending order.
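The financial-side stream might then be sketched as below. The field names are approximations transcribed from the talk:

```python
# Financial-side stream: range filter plus fixed filter criteria,
# re-sorted on the join key for the upcoming innerJoin.
financial_stream = (
    'sort('
      'search(financials,'
      ' q="totalAssetBank:[2.0 TO *]",'
      ' fq="periodType:Annual AND latestStatement:true'
      ' AND accountingStandards:USGAAP AND consolidationType:Consolidated",'
      ' fl="agentId,totalAssetBank",'
      ' sort="agentId asc",'
      ' qt="/export"),'
    ' by="agentId asc")'
)
print(financial_stream)
```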
For rating collections, we use a search function to apply the filtering on rating value, and specify the rating type equal to Fitch ratings, and set ratingLatestFlg to be true.
We pick the entity IDs as our sorting field and sort in ascending order. As I talked about in the previous slides, we are preparing for the later join condition.
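The rating-side stream follows the same pattern, with a multi-value filter on the rating values. Again, the field names are hypothetical approximations:

```python
# Rating-side stream: OR-list filter on ratingValue, latest ratings only,
# sorted on entityId to prepare for the join with the entity data.
ratings_stream = (
    'search(ratings,'
    ' q="ratingValue:(A- A A+ AA AA+ AAA)",'
    ' fq="ratingLatestFlg:true",'
    ' fl="entityId,ratingType,ratingValue",'
    ' sort="entityId asc",'
    ' qt="/export")'
)
print(ratings_stream)
```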
Once all the data has been filtered properly from each collection, we are ready to move on to the join operations.
We make the innerJoin by calling innerJoin between the core collection and the financials collection, with the join condition on agent ID. There are several join types to pick from. You can choose the hashJoin function too; if the data is small enough to fit into your memory, it will definitely give you a performance edge over innerJoin, but we have too much data to use it.
When calling the innerJoin function, you specify the join condition with the on parameter. If the join fields on both sides have the same name, you can just put a single field name there. If they are not the same, you need to specify it as left field name equals right field name.
Sometimes we need to perform another sort before the next join, because the join field or the order is not the same.
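The two forms of the on parameter can be sketched like this; left_stream and right_stream are placeholders standing in for full search expressions:

```python
left_stream = "search(entity, ...)"       # placeholder expression
right_stream = "search(financials, ...)"  # placeholder expression

# Same join-field name on both sides:
join_same = f'innerJoin({left_stream}, {right_stream}, on="agentId")'

# Different field names on each side, written as left=right:
join_diff = (f'innerJoin({left_stream}, {right_stream},'
             f' on="entityId=issuerOrTransactionId")')
print(join_same)
print(join_diff)
```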
Now, this is our final set of inner join. The data returned from the previous join of entity and financials is now joined with the rating data.
We apply a sort function on top of the innerJoin between the entity and financial data, which is required to join this data correctly with the ratings, because the innerJoin with the rating data is on the entity ID, not the agent ID we used in the previous join.
Here is what our final query looks like. We perform the inner join on entity and financial and that result is then joined with the rating.
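The nesting described above can be sketched by composing the pieces in Python; the search expressions here are simplified, hypothetical versions of the full streams shown earlier:

```python
def inner_join(left, right, on):
    """Compose an innerJoin streaming expression from two sub-expressions."""
    return f'innerJoin({left}, {right}, on="{on}")'

entities = ('search(entity, q="entityName_t:Bank",'
            ' sort="agentId asc", qt="/export")')
financials = ('search(financials, q="totalAssetBank:[2.0 TO *]",'
              ' sort="agentId asc", qt="/export")')
ratings = ('search(ratings, q="ratingValue:(A- A A+ AA AA+ AAA)",'
           ' sort="entityId asc", qt="/export")')

# Join entity and financials on agentId first, then re-sort the result on
# entityId before joining with ratings, which uses a different join key.
step1 = inner_join(entities, financials, on="agentId")
final = inner_join(f'sort({step1}, by="entityId asc")', ratings, on="entityId")
print(final)
```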
In our requirement, we need to perform a reduce operation to group by the group-by ID field, because all final results are aligned by the group-by ID. One entity is going to have many matching ratings and many financial records; once grouped, we know whether we have zero, one or many matches, and how many unique entities are matched.
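The effect of that reduce step can be illustrated in plain Python: group the joined, already-sorted tuples by the group-by ID and collect each group's rows. This mimics the semantics only, not Solr's implementation, and the sample tuples are invented:

```python
from itertools import groupby

# Joined tuples, already sorted by groupById (reduce requires sorted input).
joined = [
    {"groupById": "E1", "ratingValue": "A",  "totalAssetBank": 3.1},
    {"groupById": "E1", "ratingValue": "A+", "totalAssetBank": 3.1},
    {"groupById": "E2", "ratingValue": "AA", "totalAssetBank": 5.7},
]

grouped = {key: list(rows) for key, rows in
           groupby(joined, key=lambda t: t["groupById"])}

unique_entities = len(grouped)   # how many unique entities matched
assert unique_entities == 2
assert len(grouped["E1"]) == 2   # one entity, many rating/financial rows
```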
Finally, we are applying the pagination, using the limit function and skip function to display them in a paginated manner.
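The pagination wrapper might then look like this; the num parameter name for our custom skip and limit functions is a hypothetical stand-in, and "reduce(...)" is a placeholder for the full grouped expression:

```python
def paginate(expr, offset, page_size):
    # skip drops the first `offset` tuples; limit caps the page size.
    return f'limit(skip({expr}, num={offset}), num={page_size})'

page_one = paginate("reduce(...)", offset=0, page_size=25)
print(page_one)
```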
Now let’s run the above query and take a look at the JSON data returned from all the previous streaming expression functions. The left panel contains the unique entity information, although it only picks one row out of the many to display. The right panel displays the grouped data in an array. You can see that it contains the different possible joins with the rating and financial details there.
We were happy that we implemented a streaming expression to produce the output we want, but how can we display it all, given the performance concern? Users want to be able to browse the records in a paginated format. But we searched through all of Solr’s streaming expression functions, and there was no such thing. What should we do?
Go back and re-engineer it? No.
We knew Solr is open source software, and we decided to implement new Solr streaming expressions to support it. Before we did, we found an existing patch for it, but it was for the newer Solr 8, not for Solr 7, and the streaming expression architecture had been refactored in Solr 8.
So we migrated the same computing logic to our own Solr version, and then integrated it back into Solr.
To add new streaming functions, you are required to extend the TupleStream abstract class and implement the Expressible interface. Two stream functions were added: SkipStream is used to skip the first n tuples, and LimitStream was created to limit the total number of records to return.
Last, we updated the StreamHandler class to load them into the streamFactory, so the streamFactory object loads them along with the other streaming expression functions at startup.
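Our actual implementation is Java, extending TupleStream and implementing Expressible, but the read-loop semantics of the two new functions can be sketched in Python with generators:

```python
from itertools import islice

def skip_stream(tuples, n):
    """Mimic SkipStream: discard the first n tuples, yield the rest."""
    return islice(tuples, n, None)

def limit_stream(tuples, n):
    """Mimic LimitStream: yield at most n tuples, then stop."""
    return islice(tuples, n)

# Page 3 of a 100-tuple stream with a page size of 25 (offset 50):
page = list(limit_stream(skip_stream(iter(range(100)), 50), 25))
assert page == list(range(50, 75))
```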
To make sure our version is compatible, we downloaded the same version we use in our server and made changes and updates to it, and recompiled it to deploy to the Solr server.
We are using Solr through Lucidworks’ Fusion software. Solr is embedded inside the Fusion software, so we had to update the Fusion Solr libraries so our Fusion could see the new streaming expression functions. Fusion has repackaged Solr, so there are several places to update. Before you update them, back up the old ones, so you can roll back to the previous state in case you mess up.
After testing on one Solr instance was successful, we updated the rest of the Solr instances in the Solr cluster.
The phase of building the streaming expressions is done. Next I will hand over to Prem; he will show you how we tackled the next phase of the project: performance.
Prem: Thank you Weiling.
Talking about performance optimizations.
We were very excited to use streaming expressions; they solved all our problems. However, we observed some slowness in the queries when performing complex joins involving many filters.
To improve this, we analyzed the queries and figured out ways to improve the performance. We found the following approaches that helped us:
We will elaborate further on the hybrid approach in our next slides.
Now let’s take a look at the architecture that we have. Our index data is available in Solr, and we have created pipelines in Fusion to retrieve the data from Solr. Here we have introduced another layer between Fusion and a web client called Fusion Query Service.
The Fusion Query Service helps construct the streaming expressions for us. The request coming from the client is parsed and converted into a streaming expression query in this layer.
The response received from the Solr or Fusion is transformed into a JSON output and returned to the client.
This kind of architecture helps us hide the unnecessary implementation details from the client, and also lets us expose a simple request payload for clients to query the data. Also, this way the client need not worry about the joins and the technologies used underneath.
These are the benefits this kind of architecture provides to us.
Coming back to the hybrid approach. We made a decision based on user behavior and the functionality driven from that.
When we load our default page, we show only the entity and default rating information. In those cases, we have seen that the queries performed through the plain pipeline perform better than the join queries. We route those queries through a normal entity-and-rating pipeline and process the data in the service layer itself. However, when a user applies the advanced filters on the entity, rating and financial data, we route those queries through the streaming expression pipeline. This approach gives us a fine balance between querying the simple data versus the advanced data using complex joins.
With this, giving it back to Weiling, where she will talk about some session takeaways. And we will see you again in the question and answer sessions. Thank you.
Weiling: Thank you Prem. When we look back at our journey with the streaming expression implementation, some things we did very well; others we could have done better with other options.
These are the takeaways we learned from our experience. If you need to add a customized implementation for an internal-only function, the Solr streaming expression plugin architecture will be a better and more pluggable option.
There’s a previous Activate talk from Bloomberg about this; you can search for it on YouTube or ask me later in the chat channel. You are also welcome to share and make a contribution back to the Solr community.
Or you can use our approach: rebuild and redeploy Solr itself. Create an empty worker collection to run the streaming expression handler.
You can tune streaming expression performance by creating multiple shards and replicas, which will trigger multiple workers to run the streaming expression on an empty data set.
Use parallel and parallelList for parallel computing whenever possible; for some queries there’s no dependency or order restriction, and parallel stream processing will maximize throughput. Apply the filter before applying the join operator: based on our performance testing, the streaming expression computation time is linearly correlated with the size of the data, so applying the filter first reduces the total data size flowing through the pipeline.
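The filter-before-join advice can be shown with a toy calculation: a sorted-merge innerJoin touches every tuple it is fed, so its cost grows roughly linearly with input size, and filtering first shrinks both inputs. The data below is invented purely for illustration:

```python
# Two unfiltered inputs of 1000 tuples each.
entities = [{"agentId": i} for i in range(1000)]
financials = [{"agentId": i, "assets": i * 0.01} for i in range(1000)]

# Filtering first shrinks both sides before the join sees them.
left = [t for t in entities if t["agentId"] % 10 == 0]   # 100 tuples
right = [t for t in financials if t["assets"] >= 2.0]    # 800 tuples

fed = len(left) + len(right)   # tuples the join must now process
assert fed < len(entities) + len(financials)
```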
Last, build and validate the streaming expression step by step, from the inside out. We found this very effective for debugging and for building the query right the first time.
And finally, question and answers.
If you have any question on any of the topics presented here in this session, please reach out to us through the chat facility provided by Activate.
Thank you for joining the session, and thanks to Lucidworks as well for giving us the opportunity to present at this conference. In case you need to reach out to us, our emails are available under the speaker profiles.
Thank you, have a great day ahead and enjoy the rest of the sessions.