Search Functionality

Abhinav notes - (Elasticsearch open-source and TF-IDF algo)

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a simple yet powerful algorithm for information retrieval, based on the frequency of words in a given document and across a body of documents (a corpus). The basic idea behind TF-IDF is to weight each term in a document by how often it appears in that document, offset by how common it is across the entire body of documents: a term that is frequent in one document but rare elsewhere gets a high weight.
Steps:
Collect and preprocess your product data: TF-IDF needs a collection of product data that includes the product name, description, and other relevant information. Preprocess it by removing stop words (common words like "the" and "and" that carry little meaning) and by stemming (reducing words to their root form, such as "running" to "run").
Calculate the IDF values: IDF measures how rare a term is across the body of documents. For each term, count the number of documents that contain it, divide the total number of documents by that count, and take the logarithm of the result: IDF(term) = log(N / df(term)), where N is the total number of documents and df(term) is the number of documents containing the term. Rare terms get a high IDF, and a term that appears in every document gets an IDF of 0.
Calculate the TF-IDF values: Once you have the IDF values, you can calculate the TF-IDF values for each term in each document. To do this, you simply multiply the term frequency (how often a term appears in a document) by the IDF value for that term.
Rank the results: Once you have the TF-IDF values for each term in each document, you can rank the search results by their relevance to the search query. For each document, sum the TF-IDF values of the terms that appear in the query, then sort the documents by that score, highest first.
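The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the product texts, the tiny stop-word list, and the function names (tokenize, build_index, search) are all made up for the example, stemming is left out to keep it short, and IDF is taken as the natural log of (total documents / documents containing the term).

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"the", "and", "a", "an", "for", "of", "with", "in"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words (no stemming)."""
    tokens = (word.strip(".,!?\"'()").lower() for word in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]


def build_index(docs):
    """Return per-document term counts (TF) and an IDF value for every term."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # each document counts once per term
    n_docs = len(docs)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return term_counts, idf


def search(query, term_counts, idf):
    """Score each document by summing TF * IDF over the query terms, best first."""
    query_terms = tokenize(query)
    scores = {}
    for doc_id, counts in term_counts.items():
        score = sum(counts[t] * idf.get(t, 0.0) for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical product descriptions, for illustration only.
products = {
    "p1": "Red running shoes for men",
    "p2": "Blue running shorts",
    "p3": "Red leather wallet",
}
term_counts, idf = build_index(products)
results = search("red shoes", term_counts, idf)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

For the query "red shoes", p1 ranks first because it contains both terms and "shoes" is rare (it appears in only one product), p3 ranks second because it matches only the more common "red", and p2 is dropped because it matches neither term.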
Amazon reportedly uses TF-IDF-style scoring to help determine which products to show to customers. The scoring takes into account the words used in a customer's search query, as well as the frequency of those words in different product descriptions.
This algorithm can be implemented with relatively little engineering effort and can lead to a better customer experience.
Elasticsearch is a powerful, open-source search engine that is designed to handle large amounts of data and provide fast, relevant search results. It uses a combination of different algorithms and techniques to provide accurate search results, including:
Tokenization: Elasticsearch tokenizes each document into individual terms, which are then indexed and used for search.
Inverted index: Elasticsearch uses an inverted index to quickly look up terms and their associated documents.
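TF-IDF: Elasticsearch scores search results using the frequency of terms in a document and their rarity in the body of documents. Since version 5.0 its default scoring function is BM25, a refinement of TF-IDF that also dampens the effect of repeated terms and normalizes for document length; classic TF-IDF was the default in earlier versions.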
Query parsing: Elasticsearch supports a wide variety of query types, including simple keyword searches, phrase searches, and more advanced queries using boolean operators, wildcards, and regular expressions.
Relevance scoring: Elasticsearch calculates a relevance score for each search result based on factors such as term frequency and term proximity.
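To make the tokenization and inverted index ideas above concrete, here is a small Python sketch. It is only a toy model of the concept, not how Elasticsearch is implemented internally: the function names (tokenize, build_inverted_index, match_all) and the sample products are assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase terms, stripping surrounding punctuation."""
    terms = (word.strip(".,!?\"'()").lower() for word in text.split())
    return [t for t in terms if t]


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def match_all(query, index):
    """Boolean AND: return the ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical product descriptions, for illustration only.
products = {
    "p1": "Red running shoes",
    "p2": "Blue running shorts",
    "p3": "Red leather wallet",
}
index = build_inverted_index(products)
print(sorted(index["running"]))                 # documents containing "running"
print(sorted(match_all("red running", index)))  # documents containing both terms
```

The point of the inverted index is that a lookup touches only the entries for the query terms instead of scanning every document, which is what keeps search fast as the number of documents grows.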
Elasticsearch is highly configurable and can be customized to fit our needs.
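As one illustration of that configurability, the sketch below builds the JSON body for creating an index with a custom analyzer that lowercases text, removes stop words, and applies stemming. The analyzer name product_text and the field names name and description are hypothetical; standard, lowercase, stop, and porter_stem are Elasticsearch's built-in tokenizer and token filter names. The body is only printed here, since sending it requires a running cluster (for example as the body of an index-creation request), and the exact settings should be checked against the Elasticsearch version in use.

```python
import json

# Hypothetical index definition for product search (names are assumptions).
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "product_text": {
                    "type": "custom",
                    "tokenizer": "standard",  # split text into word tokens
                    "filter": ["lowercase", "stop", "porter_stem"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "name": {"type": "text", "analyzer": "product_text"},
            "description": {"type": "text", "analyzer": "product_text"},
        }
    },
}

# Serialize to the JSON that would be sent when creating the index.
request_json = json.dumps(index_body, indent=2)
print(request_json)
```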
