The Bag-of-Words (BoW) model analyzes text by counting how often each word appears, ignoring word order. It's like making a list of all words in a document and noting their frequency.
# For the sentence "Squealing suitcase squids are not like regular squids."
# the Bag-of-Words dictionary would be:
{'squeal': 1, 'like': 1, 'not': 1, 'suitcase': 1, 'be': 1, 'regular': 1, 'squid': 2}
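The counting step can be sketched with Python's `collections.Counter`; the lemmatized token list is assumed to come from an earlier preprocessing step:

```python
from collections import Counter

# Lemmatized tokens for "Squealing suitcase squids are not like regular squids."
tokens = ['squeal', 'suitcase', 'squid', 'be', 'not', 'like', 'regular', 'squid']

# Count how often each word appears, ignoring word order
bag_of_words = Counter(tokens)
print(bag_of_words['squid'])   # 2 -- "squid" appears twice
```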
To use text in models, we turn it into numbers using a method called feature extraction (or vectorization). Each word in a feature dictionary is assigned an index, and a text becomes a vector whose entries count how often each word appears.
# For the dictionary {'are': 0, 'many': 1, 'success': 2, 'there': 3, 'to': 4, 'ways': 5}
# the sentence "many success ways" would be represented as:
[0, 1, 1, 0, 0, 1]
When new text is introduced, it is converted into numbers using the same feature dictionary that was built from the training text, so new data stays comparable with the data the model was trained on.
# With the dictionary {'are': 0, 'many': 1, 'success': 2, 'there': 3, 'to': 4, 'ways': 5}
# the new text "many success ways" would be converted into the BoW vector:
[0, 1, 1, 0, 0, 1]
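The conversion above can be sketched in a few lines of Python; `text_to_bow_vector` is a hypothetical helper name, and unseen words are simply skipped:

```python
# Feature dictionary mapping each training word to a vector index
features_dictionary = {'are': 0, 'many': 1, 'success': 2, 'there': 3, 'to': 4, 'ways': 5}

def text_to_bow_vector(tokens, features_dictionary):
    # Start with a zero for every word in the dictionary
    vector = [0] * len(features_dictionary)
    for token in tokens:
        # Only words seen during training have an index; others are ignored
        if token in features_dictionary:
            vector[features_dictionary[token]] += 1
    return vector

print(text_to_bow_vector(['many', 'success', 'ways'], features_dictionary))
# [0, 1, 1, 0, 0, 1]
```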
Feature vectors are numerical representations of text based on the frequency of words. These vectors help in processing and comparing text data.
# For the training data "There are many ways to success." and the test data "many success ways"
# the feature dictionary might look like:
{'are': 0, 'many': 1, 'success': 2, 'there': 3, 'to': 4, 'ways': 5}
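One simple way to build such a feature dictionary from training tokens, assigning indices in alphabetical order of the unique words (`create_features_dictionary` is a hypothetical helper name):

```python
def create_features_dictionary(tokens):
    # Each unique training word gets an index, in alphabetical order
    return {word: index for index, word in enumerate(sorted(set(tokens)))}

training_tokens = ['there', 'are', 'many', 'ways', 'to', 'success']
print(create_features_dictionary(training_tokens))
# {'are': 0, 'many': 1, 'success': 2, 'there': 3, 'to': 4, 'ways': 5}
```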
Language smoothing helps manage new or unknown words by giving them a small probability, preventing them from being ignored entirely.
# Example dictionary:
{'squeal': 0, 'suitcase': 1, 'squid': 2, 'be': 3, 'not': 4, 'like': 5, 'regular': 6}
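A minimal sketch of add-one (Laplace) smoothing over word counts; the counts and the extra vocabulary slot for unknown words are illustrative:

```python
# Word counts from a small training text
counts = {'squeal': 1, 'suitcase': 1, 'squid': 2, 'be': 1, 'not': 1, 'like': 1, 'regular': 1}
vocabulary_size = len(counts) + 1  # +1 slot for unknown words
total = sum(counts.values())

def smoothed_probability(word):
    # Add one to every count so unseen words get a small, nonzero probability
    return (counts.get(word, 0) + 1) / (total + vocabulary_size)

print(smoothed_probability('squid'))    # seen word:   (2 + 1) / (8 + 8)
print(smoothed_probability('octopus'))  # unseen word: (0 + 1) / (8 + 8)
```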
Bag-of-Words data can be sparse, meaning many words in a dictionary may not appear in every document, leading to a lot of zeroes in the data.
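The sparsity is easy to see by counting the zero entries of a BoW vector (illustrative numbers):

```python
# A BoW vector over a large dictionary is mostly zeros for any one document
vector = [0, 1, 1, 0, 0, 1]
zero_fraction = vector.count(0) / len(vector)
print(zero_fraction)  # 0.5 here; with thousands of dictionary words it approaches 1.0
```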
Perplexity measures how well a model predicts text. A lower perplexity indicates better performance in predicting words.
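One common formulation computes perplexity as the exponential of the average negative log-probability the model assigns to each word; a minimal sketch with made-up probabilities:

```python
import math

def perplexity(probabilities):
    # Exponential of the average negative log-probability per word
    n = len(probabilities)
    return math.exp(-sum(math.log(p) for p in probabilities) / n)

# A model that assigns each word probability 0.25 has perplexity 4
print(perplexity([0.25, 0.25, 0.25]))

# Higher per-word probabilities (better predictions) give lower perplexity
print(perplexity([0.5, 0.5, 0.5]))
```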
Vectors are numerical representations of data. In text processing, they help represent the meaning and relationships between words.
import numpy as np

even_nums = np.array([2, 4, 6, 8, 10])
Word embeddings are numerical representations of words. They capture semantic meaning by encoding words into vectors of numbers.
import spacy
nlp = spacy.load('en_core_web_md')  # a model with word vectors, e.g. the medium English model
Vectors can be used to measure how similar or different words are from each other. This helps in understanding relationships between words.
# Compare two words via their vectors (nlp loaded as above)
nlp('cat').similarity(nlp('dog'))
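A common way to compare two vectors is cosine similarity, which can be sketched with NumPy (the vectors below are illustrative, not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means same direction (very similar), near 0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([1.0, 2.0, 0.5])   # made-up vectors for illustration
dog = np.array([1.1, 1.9, 0.4])
car = np.array([-2.0, 0.1, 3.0])

print(cosine_similarity(cat, dog))  # close to 1: similar words
print(cosine_similarity(cat, car))  # near 0: less related
```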
Word embeddings provide context to words by encoding their meanings in numerical vectors, making it easier to process and understand text.
nlp('peace').vector
Word2Vec is a method for creating word embeddings, which can be used to find relationships and meanings of words.
nlp('peace').vector
Gensim allows you to create word embeddings and access them as vectors, which helps in analyzing word meanings and relationships.
[5.2907305, -4.20267, 1.6989858, -1.422668, -1.500128, ...]
Retrieval-based chatbots use a set of predefined responses to answer user questions. They work by identifying the user's intent, recognizing key entities, and selecting the appropriate response.
# Example of finding the best response using tf-idf
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# processed_docs holds the preprocessed candidate responses,
# with the user's preprocessed message as the last element
vectorizer = TfidfVectorizer()
tfidf_vectors = vectorizer.fit_transform(processed_docs)

# Compare the user's message (the last vector) against every document
cosine_similarities = cosine_similarity(tfidf_vectors[-1], tfidf_vectors)

# The most similar document ([-1]) is the message itself, so take the next one
similar_response_index = cosine_similarities.argsort()[0][-2]
best_response = documents[similar_response_index]
Retrieval-based chatbots use methods like Bag-of-Words or tf-idf to understand what the user is asking by comparing the intent of the user’s message with predefined responses.
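A minimal pure-Python sketch of this intent matching, using Bag-of-Words counts and cosine similarity (the responses and user message are made up for illustration):

```python
from collections import Counter
import math

responses = ["the weather is sunny today",
             "the train leaves at nine",
             "tickets cost five dollars"]

def cosine(counts_a, counts_b):
    # Cosine similarity between two word-count dictionaries
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

def best_response(user_message):
    user_counts = Counter(user_message.split())
    scores = [cosine(user_counts, Counter(r.split())) for r in responses]
    # Pick the predefined response whose BoW vector is closest to the user's
    return responses[scores.index(max(scores))]

print(best_response("when does the train leave"))
# the train leaves at nine
```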
import spacy

# Load a model with word vectors (word2vec-style embeddings)
word2vec = spacy.load('en_core_web_md')

# Check similarity between words
tokens = word2vec("wednesday dog flower")
response_category = word2vec("weekday")

output_list = list()
for token in tokens:
    # similarity() takes a Doc/Token, not a string
    output_list.append(f"{token.text} {response_category.text} {token.similarity(response_category)}")

# Example output:
# wednesday weekday 0.453354920245737
# dog weekday 0.21911001129423147
# flower weekday 0.17118961389940174
Retrieval-based chatbots can identify important entities (like names or dates) using methods such as part-of-speech tagging or word embeddings.
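As a toy stand-in for real POS tagging or NER, a sketch that treats capitalized, non-sentence-initial words as candidate named entities (`candidate_entities` is a hypothetical helper, far cruder than an actual tagger):

```python
def candidate_entities(sentence):
    # Crude heuristic: capitalized words after the first position
    # are treated as candidate proper-noun entities
    words = sentence.split()
    return [w.strip('.,') for i, w in enumerate(words)
            if w[0].isupper() and i > 0]

print(candidate_entities("Book a flight to Paris for Alice on Monday."))
# ['Paris', 'Alice', 'Monday']
```

A real system would use a tagger or entity recognizer (e.g. spaCy's `doc.ents`) instead of this heuristic.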