Utilizing Apache Pulsar and Apache NiFi, we can parse any document in real time at scale. We receive a lot of documents via cloud storage, email, social channels and internal document stores. We want to make all the content and metadata available to Apache Solr for categorization, full text search, optimization and combination with other datastores. We will stream not only documents, but also REST feeds, logs and IoT data. Once data is produced to Pulsar topics, it can instantly be ingested into Solr through the Pulsar Solr Sink.
Utilizing a number of open source tools, we have created a real-time, scalable, any-document-parsing data flow. We use Apache Tika for document processing with real-time language detection, natural language processing with Apache OpenNLP, and sentiment analysis with Stanford CoreNLP, Spacy and TextBlob. We will walk everyone through creating an open source flow of documents utilizing Apache NiFi as our integration engine. We can convert PDF, Excel and Word to HTML and/or text. We can also extract the text to apply sentiment analysis and NLP categorization to generate additional metadata about our documents. We will also extract and parse images; if they contain text, we can extract it with TensorFlow and Tesseract.
Intended Audience
Data Engineers, Search Engineers, Programmers, Analysts, Data Scientists, Operators
Attendee Takeaway
You will learn how to use open source streaming to sink data at scale to Solr.
Speaker
Timothy Spann, Developer Advocate, StreamNative
[Timothy Spann]
Hi, I’m Timothy Spann. Today, I’m gonna be talking to you about real-time, cloud-native, open source streaming of any data to Apache Solr.
Let’s get started. I’m a developer advocate working in open source, mostly around Apache projects, mainly Apache Pulsar, as well as Apache Flink, Apache NiFi, and some of those other great open source projects. I post a lot of example code, applications, articles and videos at the link shown. And we’ll make sure that you get these slides after the conference.
Today, I’m gonna be talking about two primary open source tools that help you feed data into your Solr instance, whether that’s in the cloud or in a small cluster. I’m running it today in Docker. What’s nice with Solr is you can put it wherever you need to, and I’ll show you some of the other tools that make processing things like documents, REST feeds, log data and IoT data pretty straightforward. It’s not that complex. I call the set of tools that I work with the FLiP Stack. It sounds kind of cool to me and lets people know what tools we’re using. Apache Flink and Pulsar work really well together, and there are a number of other Apache projects out there, like NiFi and Tika, that work really well with them. And obviously they work well with open source data sources and sinks like Solr.
So today I want to give you a little bit of an idea of what’s in this pipeline. I’ve got a couple of different sources of data: IoT data, some cloud status data, a couple of different feeds. What’s nice is, with the tools we’re using, you have a ton of different sources you could use, and I’ll show you some getting fed into Solr through NiFi, and some being read natively by Pulsar, with Pulsar acting as a sink to Solr and NiFi acting as a sink to Solr.
What’s nice is there are a couple of different ways to do it, which is a great feature of open source and makes it really easy as well. Let’s drill down on that a little more. Today the sources of data are JSON and text, but they could also be things like PDFs, MongoDB, different big data sources, cloud data. What’s nice is, with the set of open source tools we have here, it doesn’t really matter. I’m also using Apache OpenNLP to analyze some of that data. One of my sources of data is a directory of some of my presentations, so it’s looking at PowerPoints and PDFs, pulling out the text with Tika, and making it available so I can put it into Solr, where I can easily search it and use it for other applications. Great feature there. I could also push that data to any of the different cloud sinks there, so it can be used for any kind of analytics, applications or mobile apps you might need. You know, you get the idea.
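As an illustration of that Tika step, here is a minimal sketch of pulling plain text out of a PDF or PowerPoint with the Tika facade; the file path is just a placeholder, and the same call works for any format Tika can detect.

```java
import org.apache.tika.Tika;
import java.io.File;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        // Tika auto-detects the file type (PDF, PPTX, DOCX, ...) and returns plain text
        Tika tika = new Tika();
        String text = tika.parseToString(new File("talks/solr-pulsar-deck.pdf"));
        System.out.println(text);
    }
}
```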
Now, to make this a little easier: Apache open source tools are great, but sometimes you don’t want to have to set up all the different servers, all the different clusters, monitoring, management. Fortunately, there are some cloud tools that make it easier. StreamNative Cloud makes it a lot easier. It runs natively in the cloud, has Flink and Pulsar in there, and can run on Kubernetes, which makes it very easy to scale up and down. That’s important when you don’t know the size of your workloads, or when you sometimes have really powerful jobs coming through that need that extra horsepower and sometimes don’t.
So Pulsar is a very interesting system for doing messaging and streaming, as the brokers just do compute, which is pretty nice when you think of it in cloud-native terms. Compute is isolated from storage, which is handled by Apache BookKeeper, and that lets you scale the two independently, which often makes a lot of sense, because sometimes you have a lot of messages coming through. Some of them could be persistent, some might not be, and storage doesn’t usually require as many nodes as compute. We can also run Pulsar Functions inside of Pulsar to do some of that processing while data comes in and out. You could do some machine learning or other analytics in there, which gives you lots of options, especially when you can also share that load with Flink and Flink SQL for some of that application-tier work.
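To make that concrete, here is a minimal sketch of a Pulsar Function in Java; the class name is hypothetical, and it just tags each incoming message before it moves on to the output topic.

```java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// A tiny Pulsar Function: runs inside the cluster on every message from its input topic(s)
public class TagMessageFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        // Enrich the payload in flight; the return value is published to the output topic
        return input.trim() + " [processed-by:" + context.getFunctionName() + "]";
    }
}
```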
Now, what’s cool with Pulsar is you could obviously have a ton of data out there, and you may not want to keep it all in those individual BookKeeper nodes, or maybe the storage there is expensive or small. You can easily configure Pulsar to have tiered storage, so when it hits certain age or size thresholds, that data will go right to something like S3 or other much more scalable and cheaper storage, like HDFS. Having all of this managed for you in a cleaner way helps too: Kubernetes is how everyone likes to do most of their cloud compute now, so having that as an option to run all your Pulsar clusters is great. And whether you deploy it locally or on Amazon, Google, Azure or Alibaba, it’s all handled for you, which makes it easy.
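As a rough sketch of how that gets wired up (assuming the broker already has an offload driver such as S3 configured), the offload threshold can be set per namespace with the Java admin client; the admin URL and namespace here are placeholders.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class ConfigureOffload {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")   // placeholder admin endpoint
                .build();
        // Once the namespace holds more than ~1 GB, older ledgers are offloaded
        // to the tiered storage backend configured on the broker (e.g. S3).
        admin.namespaces().setOffloadThreshold("public/default", 1024L * 1024 * 1024);
        admin.close();
    }
}
```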
Let’s talk a little bit about Pulsar. I gave you a little bit of an overview there, but there are a couple of key features that you might not know about. When you think, okay, another messaging tool, another streaming tool, this is a little different. As we mentioned, that tiered storage is a big difference, and having the separation between compute and storage is another big one. A nice built-in feature here is geo-replication. You don’t have to build out some specialized clustering or caching on your own, or use special hardware, to do that; geo-replication is in there. It scales out really well, and again, having compute and storage separated makes that very easy. Multi-tenancy is built in, so it’s very easy to keep things apart. I might have different Solr apps that really shouldn’t be together, maybe because there’s proprietary data, so you can set up different tenants and different namespaces so no one can see anyone else’s data. That keeps them very separated and lets you run one cluster, so you don’t have to run extra clusters for that. We’ll be using a Pulsar connector that lets us easily stream right into Solr. It’s a simple sink, and I’ll give you the source code for that. What’s nice is, as soon as I get that message into a Pulsar topic, it will go right into Solr. It couldn’t be easier than that. You don’t have to write any special code. Once it gets in there, it’s there, which makes it very nice.
Now you might tell me, I already have other messaging queues, and I don’t want to rewrite or move everything to a new messaging system right now. And that’s fine. Another unique feature of Pulsar is that it can support other messaging protocols. This comes in really handy if you have, say, MQTT; I use that for IoT. The Pulsar cluster can natively accept those protocol calls. You don’t have to change your client, you just point it at a different broker. Same for AMQP, JMS and Kafka, which makes it very nice. I could take all those apps that I have on different messaging clusters, which may not be scalable, and have them all go to Pulsar. One cluster that scales out, tiered storage, geo-replication, different types of subscriptions: it makes it really easy.
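As a quick sketch of what that protocol compatibility looks like from the client side (assuming the MQTT-on-Pulsar protocol handler, MoP, is enabled on the broker and listening on its usual port), an existing Eclipse Paho client simply points at the Pulsar broker; the host, port, client ID and topic here are placeholders.

```java
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class MqttToPulsar {
    public static void main(String[] args) throws Exception {
        // Same Paho MQTT client you already use; only the broker URL changes
        MqttClient client = new MqttClient("tcp://pulsar-broker:1883", "iot-sensor-1");
        client.connect();
        client.publish("iot-energy", new MqttMessage("{\"voltage\": 118.4}".getBytes()));
        client.disconnect();
    }
}
```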
Now, what’s nice is that those two sides are unified. So queuing, which you tend to do with something like ActiveMQ or Rabbit, Pulsar supports that. Data streaming, think Kafka or Kinesis, Pulsar does that too. Just different subscription types make it very straightforward. And again, once I get that data into Pulsar, maybe sinking it to Solr is enough, maybe it’s not. Maybe I want to do additional stream computing on it; the nice connector there with Flink makes that trivial. Solr, obviously everyone here knows Solr, that’s why you’re at the conference, but I just wanted to show you that Pulsar plus Solr are really good friends, because I just set up that sink and start streaming data there. You don’t have to write any custom apps. You don’t have to think about it. So I could just have the data go there every time I add something. I could still consume it somewhere else for other processing, but it makes it really easy to make things available to your search.
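Here is a minimal sketch of how that queuing-versus-streaming choice shows up on the consumer side: the subscription type set when subscribing decides it, with Shared giving work-queue style delivery across consumers and Exclusive or Failover giving ordered streaming. The service URL, topic and subscription names are placeholders.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class EnergyConsumer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")        // placeholder broker URL
                .build();
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/energy")   // placeholder topic
                .subscriptionName("solr-feed")
                .subscriptionType(SubscriptionType.Shared)     // queue-style; Exclusive/Failover for streaming
                .subscribe();
        // Pull one message and print its payload
        byte[] payload = consumer.receive().getData();
        System.out.println(new String(payload));
        consumer.close();
        client.close();
    }
}
```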
Now, in here as well, I’ve got a little bit of NiFi. It lets me do some of the cleanup and gives me a very easy way to ingest and transform data. This is another good way to get all different types of data into Solr, and I’ll show you that in the demo. We’ll go through it pretty quickly, but NiFi has native connectors for Solr and lets you pick which fields you want to index. You don’t have to do too much work to get your data into Solr quickly. You could also query it back out, so if you need to see what’s going on in Solr or use it in an app, that’s very trivial. I do a little bit with NLP, so you could do some natural language processing, very straightforward. Again, we’ll show you this in the demos. Tika lets me process any kind of file, like turning a PDF into usable data, which is really powerful.
Now let’s go to the demo; I think that’s really important. I have a Pulsar cluster running. It’s got schemas, it’s got some messages already, and I’ll show you how I got the data there. For certain applications, I’m doing it with NiFi. In this NiFi flow, I’m listing all those PDFs that I have, and then I’m processing them, pulling the data apart, splitting it, and running some NLP on there to see if there are any reserved words or locations in there. Those can be really handy. Here, I found a name and a date, and that way I can use that. I’m doing some sentiment analysis, getting all that data into a nice format, and then I can push it right to Solr. Obviously I’ve got a ton of data there. Let’s take a look and see which collection that’s in: I’m pushing that to my document collection. If you look here, you can see all the data I’m getting out. There’s the sentiment, there’s the name of the file, there’s the individual sentence it came from. So there’s a ton of different documents. I could have pushed it in as one big document, but I really wanted it at an event level so I could really get to that depth. Again, it depends on what you want to do. It’s pretty straightforward to do that.
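For the name-and-date extraction step described above, here is a rough sketch of the kind of Apache OpenNLP call involved; it assumes you have downloaded the standard tokenizer and person-name model files (en-token.bin and en-ner-person.bin), which are not bundled, and the sample sentence is made up.

```java
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;
import java.io.FileInputStream;
import java.util.Arrays;

public class FindNames {
    public static void main(String[] args) throws Exception {
        // Pre-trained OpenNLP models, downloaded separately
        TokenizerME tokenizer = new TokenizerME(
                new TokenizerModel(new FileInputStream("en-token.bin")));
        NameFinderME nameFinder = new NameFinderME(
                new TokenNameFinderModel(new FileInputStream("en-ner-person.bin")));

        String[] tokens = tokenizer.tokenize("Timothy Spann presented this talk in October.");
        for (Span span : nameFinder.find(tokens)) {
            // Each span covers the token range of a detected person name
            System.out.println(String.join(" ",
                    Arrays.copyOfRange(tokens, span.getStart(), span.getEnd())));
        }
    }
}
```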
Now, in this one, I’m consuming from a topic that I’ve already populated, so it makes it pretty easy. Consuming from Pulsar is really simple. You put in the name of the topic, you might have a login, and you put in a name for a subscription. Subscriptions let me have different people consume topics, and control how they’re going to consume them, which is very important. And then I just set specs on how many messages to pull at a time, those sorts of things. So you see here, I’ve got 30 of them. NiFi is gonna run a query on there. The query’s pretty straightforward: this is energy data, and if the voltage gets over a certain amount, I want to send an alert out. Otherwise I’m gonna send it to Solr, and this is again NiFi pushing to Solr. If we take a look at the collection for energy over here, you can see we’re getting current, voltage, power, the things you’d expect from energy data. Pretty straightforward. I’ve got some data collectors running here on some devices. One of them’s grabbing energy records. Over here, I’m running a camera on a Jetson box. It’s taking pictures, looking at something in my office here; I don’t know what I pointed that camera at, too many cameras in here. So that’s running, and that’s gonna be sending more data, and energy is getting some more data.
And over here, I’ve got more data coming through. In this one, I’m grabbing statuses from different cloud resources, and I’m just gonna run it once; they don’t update too often. That’s gonna go through some NiFi custom processing here. I’ve got these stopped. What’s cool in NiFi is you can start and stop the flow and it doesn’t matter, it never loses data; in between each step is a connection pool. Here, I’m sending special data out to Slack if it meets a certain criteria; otherwise I’m building up a custom record that I want to go out to Solr. This data has different incidents that are happening in different cloud resources that I’m paying for, and there’s all my metadata, so let me send those out to Solr. This is the incident data, so let’s find that one. This is energy, which we just mentioned. This is Jetson data, another IoT dataset. Here’s our incident data, and as you can see, AWS had some connectivity issues. You can see this one came through when something was resolved. So this could be a very useful dataset when you’re trying to find out when something was offline in your cloud resources.
Again, pretty straightforward with NiFi, and that’s just running there. I’ve got these two devices processing data pretty quickly. And then over here, I’ve got an example Java application that I wrote to read from an IoT data source and send it to Pulsar. It’s pretty straightforward, but I wanted you to have that Java code so you could do it yourself. And then, having that Pulsar sink there, the data automatically gets into Solr, so you don’t even have to think about that part. Write your data, get it into a messaging system that can be available to anyone, anywhere, from any kind of client, and have it automatically streamed to Solr so it’s available for searching. Great feature there. Here you have all your different login credentials if you need them. I’m running this example locally because I have my IoT devices on a local network; security with IoT is pretty risky, so I like to run them on my own disconnected network, and I just run Solr on a local server here. And you see here, this is as easy as it is: pick a name for the producer, pick a topic, take your message, pick a key, put in a value, send it. If you’ve done Kafka, it’s gonna look familiar. If you’ve done JMS, it looks familiar. Pretty straightforward. And here, JSON comes in, I map it automatically into my message, and then that’s the same format that gets sent to Solr.
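Here is a minimal sketch of that kind of Pulsar producer in Java; the service URL, producer name, topic, key and JSON payload are placeholders, not the exact code from the demo repo.

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class IotProducer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")        // placeholder; add auth here if needed
                .build();
        Producer<byte[]> producer = client.newProducer()
                .producerName("iot-energy-producer")
                .topic("persistent://public/default/energy")
                .create();
        // Key by device, value is the JSON payload read from the sensor
        producer.newMessage()
                .key("plug-1")
                .value("{\"voltage\": 118.4, \"current\": 0.52, \"power\": 61.6}".getBytes())
                .send();
        producer.close();
        client.close();
    }
}
```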
So when we have one that’s Jetson, we can find the right collection. Here’s Jetson; there are all those fields that we were looking at. This is that IoT application we saw running on the Jetson. It’s getting an image, and it’s getting some information on the device, so you know its MAC address, IP address, all those sorts of things that might be important to you. I think it’s taking a picture of my monitor, which is not the most fun picture to take, but interesting. You see here, it’s just constantly running, grabbing more data, pushing that right to Solr. And now I have all that IoT and deep learning data available for me, for whatever kind of applications you build on top of Solr, whether it’s dashboards, reports, whatever it may be. Lots of things you could do there, very powerful, and pretty straightforward. Now, one thing I want to show you real quick is that I mentioned schemas. It’s pretty easy for you to do these schemas; Pulsar supports a lot of them and has a built-in schema registry. This one is just a JSON schema, but it could be Avro; lots of different options there. Here you see I can support nulls. This will look familiar: these are the same fields we have here, some are double, some are string. Pretty straightforward, not too much else to do. It’s pretty straightforward for you to just put data into Pulsar and have it automatically go into Solr. Creating that sink is just one line. You have to download the connector (most distributions will have it, or if you’re using StreamNative, you’ll have it there for you), put in a little bit of information, and if you have any secure credentials, you put them in there. You’re set to go. I’ve got that all documented on GitHub. Pretty straightforward.
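As a sketch of how that JSON schema comes about, a plain Java class can be handed to the producer and Pulsar derives and registers the schema for the topic; the class, field names and topic below are placeholders modeled loosely on the energy readings shown earlier, and the sink itself is the one-line connector setup described above, documented in the GitHub repo.

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class EnergySchemaProducer {
    // Placeholder POJO; Pulsar derives a JSON schema from it and registers it for the topic
    public static class EnergyReading {
        public String device;
        public Double voltage;
        public Double current;
        public Double power;
    }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        // Schema.AVRO(EnergyReading.class) works the same way if you prefer Avro
        Producer<EnergyReading> producer = client
                .newProducer(Schema.JSON(EnergyReading.class))
                .topic("persistent://public/default/energy")
                .create();
        EnergyReading reading = new EnergyReading();
        reading.device = "plug-1";
        reading.voltage = 118.4;
        reading.current = 0.52;
        reading.power = 61.6;
        producer.send(reading);
        producer.close();
        client.close();
    }
}
```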
If you want to learn more, I’ve got a ton of links here. You’ll get these slides. You’ll be ready to go. Thank you for coming to my talk and enjoy the rest of your conference.