Where did all the Librarians go?

Saw this in Cyberspace somewhere:

“I hate people who don’t know the difference between “your” and “you’re”. There so stupid!!”

You’ve probably gotten tired of me by now. That’s OK, because I’m tired of me too. Believe me, you don’t have to live with me – I do. You may be thinking – “Search Curmudgeon guy? He’s a real jerk”. No argument there. But if you have gotten this far into the blog – that is, you got past the author tag line “By Search Curmudgeon” without clicking off (which at this point might make a tweet with some texting-style lingo like ‘lol’ and lots of emojis) – maybe I am being too self-critical.

I make my living working with computers, but that doesn’t mean that I’m in love with the damn things. I mean, it scares the heck out of me that my car is essentially a computer now. “Raid kills Bugs Dead” could now become “Bugs kill People Dead”. The more you know about computers, the more you will agree with that! It reminds me of an old internet joke that was at least mildly amusing at the time because it was really about computer programmer geeks like myself, not computers – and it was ludicrous. It may still be funny now – you be the judge – but now it’s also true. The joke went something like this. Three engineers are driving in a car: an electrical engineer, a mechanical engineer and a software engineer. The car breaks down and they get into an argument about what’s wrong with it. The electrical engineer says “It must be the ignition system”, the mechanical engineer says “No, it’s gotta be the transmission” and the software engineer says – “Well, why don’t we just get out of the car and then get back in?” I’ll wait till the laughter subsides …

In any case, fast forwarding to the now, I was driving my new car and somehow the blind-spot warning system went offline (a really cool innovation by the way!). I was left to fend off potential lane incidents the old-fashioned way – I had to actually LOOK behind me before making a lane change!!! OMG!! I was pretty stressed out – I mean, it’s a brand new car and now I have to take it back to the dealer to find out why the blind-spot warning system only lasted a few thousand miles. Then it hit me – it’s software, dummy – maybe I didn’t give it a chance to load when I tried to call my girlfriend using the voice-activated bluetooth (another awesome thing) before the voice recognition system had a chance to initialize itself … So, sure enough, when I restarted the car the next morning, the system came up good as new, like nothing had happened. (The Curmudgeon has a girlfriend?? Yes, I do. <G>) A software failure in the blind-spot warning system I can live with, but what else can go wrong? Unfortunately, lots of things. Now when you take your car back to the dealer for a recall, more often than not they will be installing a new software patch – or maybe they could spare you the trip and do what Microsoft does with your home PC: update and reboot your car computer’s OS remotely at the least convenient moment, i.e. while you are driving. … No, they’re (or should it be ‘their’ ?) not that stupid … are they? “I’m sorry officer, it wasn’t my fault. Just as I was pulling onto the Interstate, Microsoft rebooted my car.”

But enough about cars. I’m here to talk about (ah, rant, that is) search related stuff. But the theme is established – we have grown so enamored of what computers can do for us that we let them do as many things as we can think of – especially really cool things. But let’s take a step back for a moment. Bugs aside (like the poor, they will always be with us), are computers capable of doing everything that we want them to (yet)? I stress the word “want” here, because you really can’t always get what you want – thanks Mick and Keith – and what we should be focusing on is getting them to do what we need. We should hand a task to the computer when doing it ourselves is A) too tedious or overwhelming (or we are just lazy) and 2) we know that the computer can do it really well and really fast. So we need a case clause in our project management software with two branches labeled “Yes – computers are really good at that” and “No – computers totally suck at that”, and let humans pick up the slack on the second one. Pushing the envelope technologically speaking is important, but we need to be sensible too. And hiring people back is better for the economy.

As another example of what I am talking about, have you ever overheard someone talking into their cell phone where the “person” they are talking to is obviously not one? It often goes something like this:

“yes …. yes …. yes …. repeat menu …. yes …. no …. yes …. yes …. I don’t know …uhhh … SH*T … Can I please talk to a person now? … oh OK REPRESENTATIVE!!”

And since programmers usually have a good sense of humor (they have to or they would go insane) and can detect expletives in the output of the voice recognition system, the computer unbeknownst to you – or maybe beknownst if quite obnoxiously they had it on speaker – might have responded:

“I don’t respond to profanity – please say ‘I’m Sorry’ and then select a menu option – and if you do it again, I’m going to call your Mom.”

But wouldn’t it be cool if we had one of these for ourselves to screen our calls like businesses do? We could have our app say “Please respond to one of the following menu options: 1 – Family, 2 – Friend, 3 – Business Associate, 4 – Doctor’s Office, 5 – Bill Collector, 6 – Solicitor/Telemarketer/‘Courtesy Caller’, 7 – Computer”. If the answer is 1, 2 or 3 we could provide some security questions like “What did I do to the dog when I was 3?” (if family) or “What’s my favorite drinking spot?” (if friend) or “What’s my typical Starbucks order?” (if business associate). For 4 we would have it ask “What’s my Date of Birth?”. For 5 and 6, we can have our app just say “F*ck Off” and “Not Interested” respectively and hang up. I’m not sure what to do if our app is called by another computer. In the worst case scenario, this could cause an infinite recursion that would drive our cell phone bill through the roof. (“Sorry Verizon, your robo-caller and my personal answering app were stuck in an infinite loop 47 times last month – no, I’m not going to pay the $15,632.27 – just shut the damn thing off – I’m switching to T-Mobile.”)

So these computer phone answering systems are now ubiquitous. Quick quiz – when was the last time that you spoke to a person on the initial call to a bank, insurance company or, ah snap, basically any business? That is because we have fired most of the phone support workers and replaced them with that same robotic female voice that our GPS and car bluetooth systems use. Likewise, in the search business, we’ve fired the librarians and replaced them with HP Autonomy IDOL (now commonly referred to as Autonomy IDLE thanks to me – heh heh). It used to be, back in the day, that librarians would be hired to help other employees find information. They were experts at getting information out of systems that had inscrutable UIs and complex, arcane query languages (uh, to be more precise – query programming languages). Then Google came along and everything changed. Ah, we don’t need these people anymore, we can just have our employees “Google It”. This works as far as it goes, but in the enterprise, it doesn’t go very far.

But I am NOT saying that search systems are still so bad that they can only be used effectively by someone with a Master of Library Science degree, or MLS – which also stands for “More Literate Sh*t”, to put it in the pantheon of degree acronym joke rewordings that began with “BullSh*t”, “Bullsh*t Artist”, “Master Bullsh*t Artist”, “More Sh*t” and “Piled Higher and Deeper” (yours truly). Far from it. We have come a long way in that respect as our apps have become Google-ized and the arcane “Advanced Search” screen is largely a thing of the past. What I am saying is that the systems can get even better if we hire some of the librarians back (definitely Marian but not Conan The Librarian to be sure) to help us make them smarter – because in my opinion (notice that I didn’t say ‘humble’ because that is one thing that the Curmudgeon is definitely NOT) – there are still some things that computers suck at in the search world, and those won’t be totally solved by armies of software developers any time soon. Maybe they eventually will, but in the meantime there is still work for humans. And that work is to help the computer understand semantic contexts by equipping it with lexical knowledge bases. I’m a humanist, believe it or not – I like humans even if they don’t like me sometimes – I EARNED my nickname of ‘curmudgeon’ you know.

Now I know that “Taxonomy” is a dirty word to many, especially the “Just let the Wookiee – uh – machine do it.” crowd – who always say “But that’s too slow and it doesn’t scale, nyah, nyah, pants on fire” – as if scale and speed are everything. Their systems may provide crappy answers – but they provide them really fast and can do it at tremendous scale. Now we don’t have to find a gas station to ask for directions and end up speaking to an attendant who doesn’t speak much English – we can talk to our car computer’s GPS, which understands English about as well. This may or may not be more enlightening when you are lost, but it is undoubtedly a much faster way to get unhelpful information.

What I am really getting at here is that for every 5 or 6 software developers that you hire, hire one person whose job is to create or find data sets that the software developers can use to build smarter systems. There are a lot of really good open source knowledge bases out there – especially in healthcare – but it takes time and effort to 1) find them and C) integrate them for the current purpose. Taxonomies or ontologies, synonym lists, phrase lists, stopword lists, precomputed Word2Vec models, DBpedia, Open Calais etc. etc. etc. But the time invested in doing this is well worth it. Your users will thank you for it, believe me. You don’t have to boil the ocean here. The more semantic knowledge the computer has to work with, the better it does at listening and talking to humans (even without emojis). So you can start small and build your vocabulary sets more enthusiastically as you see your search relevance start to zoom up or see your click-through rates start to become very respectable (and more to the point, profitable!).
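To make the “start small” point concrete, here is a minimal sketch of what even a tiny phrase list and stopword list buy you before anything gets indexed. The word lists and the `analyze` function are made up for illustration (a real search engine would do this inside its analyzer chain, not in a hand-rolled loop), so treat it as a picture of the idea rather than anyone’s actual implementation.

```python
# Hypothetical lists of the kind a librarian might curate for a healthcare site.
STOPWORDS = {"the", "a", "of", "for", "in"}
PHRASES = {("heart", "attack"), ("blood", "pressure"), ("myocardial", "infarction")}


def analyze(text, phrases=PHRASES, stopwords=STOPWORDS):
    """Keep known phrases together as single tokens and drop stopwords."""
    words = text.lower().split()
    max_len = max((len(p) for p in phrases), default=1)
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest known phrase that could start at position i.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = tuple(words[i:i + n])
            if candidate in phrases:
                tokens.append(" ".join(candidate))
                i += n
                break
        else:
            # No phrase matched here: keep the single word unless it is a stopword.
            if words[i] not in stopwords:
                tokens.append(words[i])
            i += 1
    return tokens


print(analyze("Risk of heart attack in the elderly"))
# ['risk', 'heart attack', 'elderly']
```

Without the phrase list, “heart” and “attack” are just two unrelated words to the machine. With it, the engine can treat “heart attack” as one concept, and every phrase a human adds makes it a little smarter.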

For example, I don’t know how many times I’ve engaged with a customer who was unhappy with their Solr search engine’s relevance and found that their synonyms.txt file just knows that “Television” == TV and that “fooaaa”, “baraaa”, and “bazaaa” mean the same thing. If you don’t get this joke, look at the ‘synonyms.txt’ file that ships with Solr 6.2 – and by the way, what the hell is a “pixma”? OK, it’s a Canon printer model – thanks again Google! Maybe the customer thinks that “it should just work” out of the box. Yeah, if the user types in “aaafoo” – the search engine will return stuff that has “aaabar” as well! Cool, right? Works for me – no kidding, that’s how I unit test search code too – lots of documents named “Test Document 42” where “foo”, “bar”, “baz”, and “bat” are all keywords. And speaking of “foobar”, another anecdote that I inferred from watching “Saving Private Ryan” is that “foobar” is derived from the military FUBAR which stands for – oh you know – but I’ve used more than enough profanity for one blog post already so I’ll sanitize it – “F*cked Up Beyond All Recognition”. And if you read my last blog post, note that “foo” rhymes with “poo” – not sure what relevance that observation has though. Maybe the first programmer who used this to document something meant to say “fubar” but since programmers are notorious for being crappy spellers … And trust me, there is a lot of “foobar” (sic) code out there – I’ve seen more than my share. Maybe that’s why I’m such a curmudgeon.

No people, the OOTB ‘synonyms.txt’ file that ships with Solr is intended to document how to write one – i.e. what its syntax is. It is NOT intended to be used in production applications, but surprisingly it crops up there far too often. Why? Because we have fired all of the librarians who actually understand why we need to edit and maintain this thing. Machine learning algorithms can deal with two types of data – data to analyze and data to help the algorithm analyze other data. Yes, there is a major scaling problem with terms and phrases but there are already vocabularies out there like WordNet that are tackling this issue and some very cool entity extraction software to find phrases automagically. And scale is relative to the domain you are in. WordNet may not know about your business’ product names or specialized jargon for example (Google may because it crawled your website but this data may not be for sale). In as much time as it takes 14 developers to design, code, test, debug, test, debug, test, debug, test, re-design, re-code, test, debug, test, debug, test and deploy (phew!) a complex enterprise application, one or two librarians can create a fairly complete lexicon for the enterprise with the important phrases, stopwords and synonyms that can feed a kick-ass Lucene Analyzer chain or do some bitchin’ query expansions. The programmers don’t even have to lift a finger in this case because maybe some of them aren’t even aware of ‘synonyms.txt’ to begin with. Let them think that their magic code is what improved the search results.
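For the programmers who have never opened the file, here is a rough sketch of what a synonyms file means in practice. It reads the two rule styles used in Solr-style synonym files (comma-separated equivalents, and explicit `=>` mappings that replace the left side with the right) and uses them to expand a query. The sample entries and the function names `parse_synonyms` and `expand_query` are my own inventions for illustration, and this toy only handles single-word query terms. It is not how Solr or Lucene implement it internally.

```python
from collections import defaultdict

# Hypothetical entries a librarian might maintain, written in Solr-style syntax.
SAMPLE = """
# lines starting with '#' are comments
television, tv, telly
laptop, notebook
pixima => pixma
"""


def parse_synonyms(text):
    """Turn synonym rules into a dict of {term: set of terms to search for}."""
    rules = defaultdict(set)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            # Explicit mapping: each term on the left is replaced by the right side.
            left, right = line.split("=>", 1)
            targets = {t.strip().lower() for t in right.split(",") if t.strip()}
            for source in left.split(","):
                source = source.strip().lower()
                if source:
                    rules[source] |= targets
        else:
            # Equivalence group: every term expands to the whole group.
            group = {t.strip().lower() for t in line.split(",") if t.strip()}
            for term in group:
                rules[term] |= group
    return dict(rules)


def expand_query(query, rules):
    """Return, for each query word, the sorted list of terms to match."""
    return [sorted(rules.get(token, {token})) for token in query.lower().split()]


rules = parse_synonyms(SAMPLE)
print(expand_query("cheap TV", rules))
# [['cheap'], ['television', 'telly', 'tv']]
```

Notice that all of the intelligence lives in `SAMPLE`, not in the code. The code is the easy part. Knowing that your customers say “telly” is the librarian’s part.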

Or even better, have them integrate a “Taxonomy”, which is a dirty word in some circles. Just have your Librarians refer to it as a “lexicon” or “vocabulary” and nobody will notice.

We’ll just keep the secret to ourselves.