Chatbots are becoming increasingly popular. They are often described as a new phase in how users interact with digital devices, a new kind of user interface. Companies already use chatbots for a range of customer and user interactions, which is why many startups now offer bot-related services internationally and even here in the Philippines.
This also includes us here at Indigo Research.
Indigo Research was founded on the desire to push the limits of technology further. Research is our primary driving force. We take bleeding-edge technologies fresh from the lab and work on applying these to impactful use-cases that benefit industry and society. We believe that true innovation is best achieved through pure and applied research.
That being said, one of our ongoing research products is a conversational engine that can be used for chatbots. This post discusses the techniques we're currently using in our engine. We hope that the developments we make in this space will help improve the technology behind the chatbots being offered today.
Lector is a conversational engine built to parse a message, map it to a matching rule, and then provide the appropriate response. It's still at a very early stage, and the current implementation uses a mix of TF-IDF and word2vec.
A demo UI to showcase Lector
The screenshot above is a demo user interface that shows how Lector works. The user's input message was parsed and split into words to determine each word's importance score. That score was computed using TF-IDF.
Term Frequency-Inverse Document Frequency (TF-IDF) is a simple NLP technique from information retrieval. It scores how important a word is to a document relative to the corpus. The score of a word increases with the number of times it appears in the document, but is offset by how frequently the word appears across the corpus. This helps adjust for words that appear frequently in general (possible stop words). See the screenshot below for the importance score of each word.
Each word is given an importance score computed using TF-IDF
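To make the scoring concrete, here is a minimal sketch of TF-IDF in plain Python. It is not Lector's actual code: the smoothed IDF formula, the `tf_idf` function name, and the tiny corpus are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(document, corpus):
    """Score each word in `document` (a list of tokens) against `corpus` (a list of token lists)."""
    counts = Counter(document)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        # Term frequency: share of the document taken up by this word.
        tf = count / len(document)
        # Document frequency: number of corpus documents containing the word.
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed inverse document frequency (an assumed variant);
        # a word found in every document gets an IDF of 0.
        idf = math.log((1 + n_docs) / (1 + df))
        scores[word] = tf * idf
    return scores

# Hypothetical toy corpus, for illustration only.
corpus = [
    ["can", "i", "enroll", "today"],
    ["can", "i", "pay", "online"],
    ["where", "is", "the", "school"],
]
message = ["can", "i", "enroll"]
scores = tf_idf(message, corpus)
```

In this toy corpus, "enroll" appears in only one document while "can" appears in two, so "enroll" ends up with the higher importance score.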
Additionally, word2vec was used to find similar words. It is a technique for representing words as vectors called word embeddings. Representing entities as vectors makes vector operations possible, for instance, measuring similarity using distances.
"pede": [ ["pwde", 0.9432555437088013], ["pwedi", 0.9389035701751709], ["pde", 0.9375213980674744], ["pwede", 0.9282710552215576], ["pwd", 0.9014163613319397] ], "enrol": [ ["enroll", 0.8133280873298645] ], "skul": [ ["school", 0.7343262434005737] ]
An example similarity output of word2vec
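The similarity lookup can be sketched as follows, assuming cosine similarity as the measure (a common choice for word embeddings). The three-dimensional vectors here are made up for illustration and are not real word2vec output, and `most_similar` is a hypothetical helper, not Lector's API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, embeddings, top_n=5):
    """Return up to `top_n` (word, similarity) pairs, most similar first."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Made-up embeddings: real word2vec vectors are learned from a corpus
# and usually have a hundred or more dimensions.
embeddings = {
    "pede": [0.9, 0.1, 0.0],
    "pwede": [0.8, 0.2, 0.1],
    "skul": [0.1, 0.9, 0.2],
    "school": [0.2, 0.8, 0.3],
}
nearest_to_pede = most_similar("pede", embeddings, top_n=1)
nearest_to_skul = most_similar("skul", embeddings, top_n=1)
```

With these toy vectors, the nearest neighbor of "pede" is "pwede" and the nearest neighbor of "skul" is "school", which mirrors the shape of the output shown above.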
Lector uses word2vec to match similar and even misspelled words. In the above example, different variations of pwede were found. Also, misspellings such as skul were matched to their appropriate spelling. This approach normalizes each word and removes the need to explicitly specify the exact words needed for a response.
The rules are used to determine the matching question and the appropriate response for the user's input message. The parsed message is compared to each rule, and a comprehension score is computed using the important (TF-IDF) and similar (word2vec) words. The rule with the highest comprehension score is selected as the output.
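The matching step could look something like the sketch below. The exact comprehension score formula is not described in this post, so the one used here is purely an assumption: it adds up the importance of every message word that hits a rule keyword, either directly or through one of its similar words. The rule format, function names, and sample values are hypothetical as well.

```python
def comprehension_score(tokens, keywords, importance, similar):
    """Assumed scoring: sum the importance of each token that matches a rule keyword."""
    score = 0.0
    for token in tokens:
        # A token matches directly or through any of its similar words.
        candidates = {token} | set(similar.get(token, []))
        if candidates & keywords:
            score += importance.get(token, 0.0)
    return score

def best_rule(tokens, rules, importance, similar):
    """Return the name of the rule with the highest comprehension score."""
    return max(
        rules,
        key=lambda name: comprehension_score(
            tokens, rules[name]["keywords"], importance, similar
        ),
    )

# Hypothetical rules and scores, for illustration only.
rules = {
    "enrollment": {"keywords": {"enroll", "school"}, "response": "Here is how to enroll."},
    "payment": {"keywords": {"pay", "tuition"}, "response": "Here is how to pay."},
}
importance = {"pede": 0.1, "enrol": 0.6, "skul": 0.5}  # stand-in for TF-IDF scores
similar = {"pede": ["pwede"], "enrol": ["enroll"], "skul": ["school"]}  # stand-in for word2vec output

message = ["pede", "enrol", "skul"]
matched = best_rule(message, rules, importance, similar)
reply = rules[matched]["response"]
```

Here "enrol" and "skul" reach the enrollment rule through their similar words, so that rule wins even though none of its keywords appear verbatim in the message.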
We at Indigo Research are aware that this approach is still very far from perfect. We're still researching how we can validate and improve the methodology behind it. A research paper with a more detailed explanation of our method is also in the works. If you have any questions about how we implemented this solution, feel free to hit us up at [email protected].