Extra Resources
- Scrape Anything with DeepSeek V3 + Scraping Tool Integration (CHEAP & EASY)
- Tantivy - a search engine library written purely in Rust
- spider-rs - a web crawler and automation framework for Rust
- I Made a FAST Search Engine - an extremely fast search engine written in Rust
- Building a Search Engine in rust - a series of blog posts
Very primitive search engine architecture
```mermaid
graph LR
    Client(Client)
    QueryEngine(Query Engine)
    Indexer(Indexer)
    Spider(Spider)
    Database[(Database)]
    Database <-- TCP/IP --> Spider
    Database <-- TCP/IP --> Indexer
    Database -- TCP/IP --> QueryEngine -- HTTP --> Client
```
- Spider (Web Crawler): Gathers all the pages it can from the internet and stores their data to be processed later
- Indexer: Organizes the data collected by the spider so that it is easy to search later; it makes the query engine’s job easier
- Query Engine: Processes whatever the user enters, and searches through the indexed data
- Client: Main web page/web app that the user interacts with
Distributed Systems Architecture: Each component of the architecture is independent of the others, so each one can be written in a different programming language
A more modern search engine architecture
```mermaid
graph TD
    Client(Client)
    ReverseProxy{{Reverse Proxy}}
    Database[(Database)]
    PageRankService(Page Rank Service)
    Redis(Redis)
    IndexerMessageQueue{{Indexer Message Queue}}
    SpiderMessageQueue{{Spider Message Queue}}
    ScalingService(Scaling Service)
    subgraph QueryEngineCluster[Query Engine Cluster]
        QueryEngines(Query Engine 1..N)
    end
    subgraph BacklinksCluster[Backlinks Cluster]
        Backlinks(Backlinks 1..N)
    end
    subgraph IndexerCluster[Indexer Cluster]
        Indexer(Indexer 1..N)
    end
    subgraph SpiderCluster[Spider Cluster]
        Spider(Spider 1..N)
    end
    QueryEngineCluster -- HTTP o--o ReverseProxy -- HTTP o--o Client
    QueryEngineCluster -- TCP/IP --> Database
    Database -- TCP/IP --> PageRankService
    Database -- TCP/IP --> BacklinksCluster
    Database -- TCP/IP --> IndexerCluster
    Redis --- IndexerMessageQueue -- TCP/IP --> IndexerCluster
    Redis --- SpiderMessageQueue -- TCP/IP --> SpiderCluster
    BacklinksCluster --- ScalingService
    IndexerCluster --- ScalingService
    SpiderCluster --- ScalingService
```
Spider
The point of the spider is to collect all the pages that exist on the internet and store their data so we can organize it later.
- You provide the spider with a list of URLs (seed URLs)
- The spider fetches the HTML from the seed URLs
WARNING
Often times, websites don’t want to be crawled, so they add a file called
robots.txt that tells web crawlers which pages of the site they can and cannot scrape. It is generally recommended to respect the choice of such websites and honor the robots.txt file instead of crawling disallowed pages.
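As a rough illustration, here is a very naive robots.txt check. It assumes the reqwest crate with its "blocking" feature enabled, and it only honours global Disallow: rules while ignoring user-agent groups; a real spider should use a proper robots.txt parser.

```rust
// Naive robots.txt check, assuming the `reqwest` crate ("blocking" feature).
// Only global `Disallow:` rules are considered; user-agent groups are ignored.
fn is_allowed(base: &str, path: &str) -> Result<bool, reqwest::Error> {
    let robots = reqwest::blocking::get(format!("{base}/robots.txt"))?.text()?;
    let disallowed = robots
        .lines()
        .filter_map(|line| line.trim().strip_prefix("Disallow:"))
        .map(str::trim)
        .any(|rule| !rule.is_empty() && path.starts_with(rule));
    Ok(!disallowed)
}

fn main() -> Result<(), reqwest::Error> {
    // Check whether the spider may crawl a given path on this site.
    println!("{}", is_allowed("https://example.com", "/private/page")?);
    Ok(())
}
```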
- The returned HTML is parsed to extract information like URLs, the title, and some other basic data
- The URLs that you get from a page need to be normalized. The returned HTML might contain both absolute (https://example.com/page) and relative (/page) URLs, so we need to convert all of them into absolute URLs that the spider can easily traverse. We also need to ignore certain link types such as JavaScript, CSS, and other static files
- You need to keep track of two things: the URLs you need to visit and the URLs you have visited. Every time you visit a web page and extract the URLs from that page, the URL that you visited needs to be stored inside the visited set, and the URLs you need to visit need to be stored inside a queue. The spider then goes through each URL in the queue until there are no more URLs in the queue or until you stop the spider yourself (see the sketch after this list)
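Here is a minimal sketch of that queue/visited-set loop, assuming the url crate for normalization. The extract_hrefs function is a hypothetical stand-in for the actual fetching and HTML parsing.

```rust
use std::collections::{HashSet, VecDeque};

use url::Url; // assumes the `url` crate

// Hypothetical stand-in: a real spider would fetch the page and parse
// the href attributes out of the returned HTML.
fn extract_hrefs(_page: &Url) -> Vec<String> {
    vec!["/about".to_string(), "https://example.com/blog".to_string()]
}

fn main() {
    let seed = Url::parse("https://example.com/").unwrap();
    let mut queue: VecDeque<Url> = VecDeque::from([seed]);
    let mut visited: HashSet<Url> = HashSet::new();

    // Breadth-first traversal: keep popping from the queue until it is empty.
    while let Some(page) = queue.pop_front() {
        if !visited.insert(page.clone()) {
            continue; // already crawled this URL
        }
        for href in extract_hrefs(&page) {
            // `join` resolves relative links ("/about") against the current
            // page and leaves absolute links untouched.
            if let Ok(absolute) = page.join(&href) {
                if !visited.contains(&absolute) {
                    queue.push_back(absolute);
                }
            }
        }
    }
    println!("crawled {} pages", visited.len());
}
```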
NOTE
The process described above is called Breadth-First Search (BFS). It is a common tree traversal algorithm. Another such algorithm is Depth-First Search, but BFS allows the spider to cover a broader range of categories of pages and URLs efficiently.
- If you are targeting just a specific website or range of websites, you should limit your spider to the domains of just those websites
- In order to prevent dire consequences such as IP bans, server crashes, and violating the legal terms of websites, it is recommended to add a delay between your requests (usually 1-2 seconds)
- You can also add proper storage of the crawled pages and sites into a database. When you crawl a page, you store that page’s link, text (not HTML, just plain text), and metadata (title, description, keywords, etc.). You can also use a database table to track which pages and URLs you have already visited (a storage sketch follows below)
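A small storage sketch, assuming the rusqlite crate and a made-up pages table; any of the databases mentioned later in this section would work just as well.

```rust
use rusqlite::{params, Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open("crawl.db")?;

    // Hypothetical schema: one row per crawled page, keyed by its URL.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (
            id    INTEGER PRIMARY KEY,
            url   TEXT UNIQUE,
            title TEXT,
            body  TEXT
         )",
        [],
    )?;

    // Store a crawled page: the URL, its title, and the extracted plain text.
    conn.execute(
        "INSERT OR IGNORE INTO pages (url, title, body) VALUES (?1, ?2, ?3)",
        params![
            "https://example.com/",
            "Example Domain",
            "This domain is for use in illustrative examples..."
        ],
    )?;

    // The UNIQUE constraint doubles as a cheap "have we visited this URL?" check.
    let visited: bool = conn.query_row(
        "SELECT EXISTS(SELECT 1 FROM pages WHERE url = ?1)",
        params!["https://example.com/"],
        |row| row.get(0),
    )?;
    println!("visited: {visited}");
    Ok(())
}
```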
NOTE
Many modern websites don’t include any data whatsoever when you make a simple HTTP fetch request, because the pages are rendered on the client side. In order to tackle this, your crawler must be able to use headless Chrome or a simulated browser environment to load and scrape the pages. You also have to pay attention to the seed URLs that you choose. Ideally, you want to choose URLs of pages or websites that get listed and referenced quite often. This includes Wikipedia, news pages, blogs, etc.
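For client-side rendered pages, a sketch along these lines is one option, assuming the headless_chrome crate and a locally installed Chrome/Chromium; other browser-automation crates would work similarly.

```rust
use headless_chrome::Browser; // assumes the `headless_chrome` crate

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Launch a headless browser (a local Chrome/Chromium install is required).
    let browser = Browser::default()?;
    let tab = browser.new_tab()?;

    // Navigate and wait for client-side rendering to finish before scraping.
    tab.navigate_to("https://example.com")?;
    tab.wait_until_navigated()?;

    // `get_content` returns the DOM as the browser rendered it, including
    // content that a plain HTTP fetch would never see.
    let html = tab.get_content()?;
    println!("rendered page is {} bytes", html.len());
    Ok(())
}
```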
Along with the URLs, metadata, and raw text of the pages, a spider will also have to store a few additional things in order to be turned into a useful search engine:
- Backlinks: All the links that take you to the current page
- Outlinks: All the links that take you from the current page to other pages
These two things might seem unimportant, but in later stages they will be used to determine which pages should be shown to the user first.
In order to better optimize the spider, we utilize a priority message queue. This helps in crawling the important pages before the others. If a URL is added to the queue more than once, its priority increases. The URLs with the highest priority will be crawled first.
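A minimal in-process sketch of such a priority queue (in production this would live inside the message queue itself): re-adding a URL bumps its priority, and stale heap entries are skipped lazily when popping.

```rust
use std::collections::{BinaryHeap, HashMap};

/// Every time a URL is discovered again its priority goes up,
/// and the highest-priority URL is crawled first.
struct PriorityQueue {
    priorities: HashMap<String, u32>,
    heap: BinaryHeap<(u32, String)>,
}

impl PriorityQueue {
    fn new() -> Self {
        Self { priorities: HashMap::new(), heap: BinaryHeap::new() }
    }

    /// Add a URL, or bump its priority if it was already enqueued.
    fn push(&mut self, url: &str) {
        let priority = self.priorities.entry(url.to_string()).or_insert(0);
        *priority += 1;
        self.heap.push((*priority, url.to_string()));
    }

    /// Pop the URL with the highest priority.
    fn pop(&mut self) -> Option<String> {
        while let Some((priority, url)) = self.heap.pop() {
            // Skip entries whose priority has been bumped since they were pushed.
            if self.priorities.get(&url) == Some(&priority) {
                self.priorities.remove(&url);
                return Some(url);
            }
        }
        None
    }
}

fn main() {
    let mut queue = PriorityQueue::new();
    queue.push("https://example.com/a");
    queue.push("https://example.com/b");
    queue.push("https://example.com/b"); // discovered twice -> higher priority
    assert_eq!(queue.pop().as_deref(), Some("https://example.com/b"));
}
```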
Indexer
It takes raw content like HTML, processes it, and creates data structures from that processed data which are optimized for fast searching.
- The indexer receives data such as URL, metadata, and raw HTML from the spider.
- The full HTML of the pages is very bloated and we need to reduce it to the content that matters to us.
- We strip out all the script tags, style tags, etc.
- We extract only the plain text from the pages and any extra metadata that may be useful
- Now that we have the plain text, we need to tokenize it. We basically just split the whole string into a list of words, lowercase them, remove punctuation, and detect which language it is (often not needed).
- Now that we have the tokenized lists, we need to normalize the data a little bit. This is done to improve efficiency and remove further bloat from the stored data.
- Stemming: reduce the words to their roots (convert all the words into their simplest forms): jumps, jumping, jumped → jump
- Lemmatization: a smarter version of stemming which is also context aware
- Stop-word removal: remove the common useless words like prepositions, connectors, etc.
- For each page, we create a simple map that contains all the unique words of that page and their weights (simply the number of times they appear on the page)
- Now, we just store all the elements of that map along with the page (URL) on which they appear and their weights on that page into the database. The best approach is the following
- the object in the database will contain the following fields: id, word, url, tf, idf, weight, document_id, position
- id, word, url, document_id are self-explanatory
- tf is the “Term Frequency”: how often a word appears in a document; the higher the tf, the more important the word is for that document
- idf is the “Inverse Document Frequency”: how rare a word is across all documents; the higher the idf, the rarer the word. It is computed as idf(t) = log(N / DF(t)), where N is the total number of documents and DF(t) is the number of documents containing the term ‘t’
- weight is just the product of tf and idf
- position is where exactly in the document the word appears
- All this data is finally stored in a database to be queried later by the query engine (a sketch of the tokenize → weight pipeline follows below).
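Here is a rough sketch of the tokenize → term-frequency → TF-IDF pipeline described above. The stop-word list is a toy one, and stemming/lemmatization are omitted for brevity.

```rust
use std::collections::{HashMap, HashSet};

/// Lowercase, strip punctuation, split on whitespace, drop stop words.
fn tokenize(text: &str) -> Vec<String> {
    let stop_words: HashSet<&str> = ["the", "a", "an", "and", "or", "of", "to", "in", "is", "it"]
        .into_iter()
        .collect();
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty() && !stop_words.contains(w))
        .map(str::to_string)
        .collect()
}

/// Term frequency: how often each unique word appears in one document.
fn term_frequencies(tokens: &[String]) -> HashMap<String, usize> {
    let mut tf = HashMap::new();
    for token in tokens {
        *tf.entry(token.clone()).or_insert(0) += 1;
    }
    tf
}

/// Inverse document frequency: idf(t) = ln(N / DF(t)),
/// where N is the number of documents and DF(t) counts documents containing t.
fn idf(term: &str, documents: &[Vec<String>]) -> f64 {
    let n = documents.len() as f64;
    let df = documents
        .iter()
        .filter(|doc| doc.iter().any(|w| w == term))
        .count() as f64;
    (n / df.max(1.0)).ln()
}

fn main() {
    let raw_docs = [
        "Rust is a systems programming language.",
        "A search engine written in Rust.",
        "The spider crawls the web and the indexer indexes it.",
    ];
    let documents: Vec<Vec<String>> = raw_docs.iter().map(|d| tokenize(d)).collect();

    // TF-IDF weight of every word in the first document.
    for (word, tf) in term_frequencies(&documents[0]) {
        let weight = tf as f64 * idf(&word, &documents);
        println!("{word}: tf={tf}, weight={weight:.3}");
    }
}
```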
NOTE
There is another component of a search engine that generally sits between the indexer and the query engine, called PageRank. It is an algorithm used to rank web pages in the search engine results. It was the algorithm that started Google.
Page Ranking
In order to rank websites as more or less important, you can use algorithms like PageRank or BM25. PageRank is the algorithm that made Google.
NOTE
With the PageRank algorithm, you can create millions of websites that link to your page to make your page look more important, and people used to do this in the earlier days of Google. In order to prevent this, we can use something like TrustRank.
We can use these algorithms to check which pages are globally more important than the others, and these pages will be prioritized during search queries that contain words from them.
Yes, this does give a ranking to all the pages in the index, so one page ends up with the highest rank, but that does not mean it is the best page of all. The ranking of a page depends upon the context of the query. Besides the page rank (which is influential), many other factors contribute to the rank of a web page in the query results, such as geographical location, the contextual keywords, the device, the user history, and much more.
When building a proper page ranker, you want to turn it into its own separate service in order to follow the distributed systems architecture and ensure proper scalability.
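For reference, here is a minimal PageRank power-iteration sketch over an in-memory outlink map; a real page rank service would read the backlinks/outlinks from the database and run this at a much larger scale.

```rust
use std::collections::HashMap;

/// A minimal PageRank power iteration. `links[page]` holds the pages
/// that `page` links out to.
fn pagerank(
    links: &HashMap<String, Vec<String>>,
    damping: f64,
    iterations: usize,
) -> HashMap<String, f64> {
    let pages: Vec<&String> = links.keys().collect();
    let n = pages.len() as f64;
    let mut ranks: HashMap<String, f64> =
        pages.iter().map(|p| ((*p).clone(), 1.0 / n)).collect();

    for _ in 0..iterations {
        let mut next: HashMap<String, f64> =
            pages.iter().map(|p| ((*p).clone(), (1.0 - damping) / n)).collect();

        for (page, outlinks) in links {
            if outlinks.is_empty() {
                continue; // dangling pages are ignored in this simplified sketch
            }
            // Each page distributes its current rank evenly over its outlinks.
            let share = damping * ranks[page] / outlinks.len() as f64;
            for target in outlinks {
                if let Some(rank) = next.get_mut(target) {
                    *rank += share;
                }
            }
        }
        ranks = next;
    }
    ranks
}

fn main() {
    let mut links = HashMap::new();
    links.insert("a".to_string(), vec!["b".to_string(), "c".to_string()]);
    links.insert("b".to_string(), vec!["c".to_string()]);
    links.insert("c".to_string(), vec!["a".to_string()]);

    for (page, rank) in pagerank(&links, 0.85, 20) {
        println!("{page}: {rank:.4}");
    }
}
```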
Query Engine
It is the part that actually does the searching. It takes in whatever the user enters in the search box and fetches the results for that search from the indexes that we have created.
- The query engine takes the user input and runs it through the process of normalization; ideally, the normalization used here should be exactly the same as the normalization used in the indexer
- lowercasing, removing punctuation, splitting into words, removing stop words, stemming and lemmatizing
- Now that the search query has been normalized, you can implement some logic that is usually optional but very powerful, like detecting phrases (groups of words), handling operators (AND, OR, NOT), or applying filters (limiting the search to a specific filetype, date, domain, website, etc.)
- Now the actual searching begins. Based on the parsed query, we search through the index database to find matches for the words or phrases from the query. We find the documents that contain those words, along with their weight, idf, tf, etc.
- We go through each keyword in the query one by one and get its results from the database. The pages that contain all (or a majority) of the search terms will be given a higher rank, and the ones that contain fewer keywords will be ranked lower but still included in the results for a better user experience. To rank the pages, we can simply sum the weights of all the keywords that appear in each page and sort by that score (see the sketch after this list).
- We filter the results by applying the operator logic and the filters that we parsed earlier.
- Now that we have many pages/URLs in which the words appear, we compare all of the pages and return the ones that are most relevant to the user’s query. Each document is usually scored/ranked on factors such as TF, IDF, weight (TF-IDF), recency, and custom logic (popularity, click-throughs, etc.). This is the ranking logic that SEO tries to influence; in this part, you will be implementing it.
- After obtaining a properly ranked list of pages/URLs, we just sort it in descending order of rank and return the results to the user.
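A minimal sketch of the “sum the weights and sort” step, using a hypothetical Posting struct that mirrors the index fields described earlier.

```rust
use std::collections::HashMap;

// Hypothetical posting pulled from the index table described earlier.
struct Posting {
    word: String,
    url: String,
    weight: f64, // tf * idf
}

/// Score documents by summing the weights of every query term they contain,
/// then return URLs sorted by descending score.
fn rank(query_terms: &[String], postings: &[Posting]) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for posting in postings {
        if query_terms.iter().any(|t| t == &posting.word) {
            *scores.entry(posting.url.clone()).or_insert(0.0) += posting.weight;
        }
    }
    let mut ranked: Vec<(String, f64)> = scores.into_iter().collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    ranked
}

fn main() {
    let query: Vec<String> = vec!["rust".into(), "search".into()];
    let postings = vec![
        Posting { word: "rust".into(), url: "https://a.example".into(), weight: 1.2 },
        Posting { word: "search".into(), url: "https://a.example".into(), weight: 0.8 },
        Posting { word: "rust".into(), url: "https://b.example".into(), weight: 0.5 },
    ];
    for (url, score) in rank(&query, &postings) {
        println!("{url}: {score:.2}");
    }
}
```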
Scaling up services
Having just one instance of the spider, indexer, and query engine quickly leads to performance issues, as a single instance of each is not enough for a fully working, production-ready search engine.
This is where the distributed systems architecture comes in clutch. It allows us to run multiple instances of the same service at the same time. So instead of having just 1 spider crawling the web, we can have a bunch of them crawling simultaneously, and the same goes for the other services as well (horizontal scaling).
Valkey is a drop-in alternative to Redis (it exists because of the licensing issues going on with Redis). It allows us to have different message queues that the different services can access in order to synchronize with each other.
In other words, each spider accesses a shared message queue from the Valkey database, so the cluster as a whole never crawls a single URL multiple times.
To actually store data that will be accessed later by components such as the Indexer, we can use any SQL or NoSQL database like SQLite, MySQL, MongoDB, Supabase, Pocketbase, etc.
In order to ensure that all the instances in a cluster are synchronized (different spiders don’t crawl the same page multiple times), we utilize a Valkey database (a small sketch follows below).
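A small sketch of that synchronization, assuming a recent version of the redis crate (which speaks the same protocol Valkey uses) and hypothetical spider:queue / spider:visited keys.

```rust
use redis::Commands; // the `redis` crate also works against a Valkey server

fn main() -> redis::RedisResult<()> {
    let client = redis::Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;

    let url = "https://example.com/page";

    // Only enqueue the URL if no spider in the cluster has seen it before.
    let seen: bool = con.sismember("spider:visited", url)?;
    if !seen {
        let _: () = con.sadd("spider:visited", url)?;
        let _: () = con.lpush("spider:queue", url)?;
    }

    // Each spider instance pops its next URL from the shared queue.
    let next: Option<String> = con.rpop("spider:queue", None)?;
    println!("next url to crawl: {:?}", next);
    Ok(())
}
```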
Scaling Service
Basically, it is a service that scales the number of instances in the various clusters up and down according to the traffic and computational needs. If a cluster is not serving much traffic or data, it is scaled down, and vice versa.
Reverse Proxy (Load Balancer)
It just prevents the query engine from crashing or slowing down by distributing the incoming traffic between the various instances, depending on which ones are currently serving more or less traffic.
The most commonly used tools include Nginx and Caddy, although Caddy seems to be the more “modern” solution to this problem.
Spell Checking
For a better optimized query engine, we need to implement a middleware that checks whether the entered query contains any spelling errors.
If the query contains a spelling error, just passing it along to the search engine would give poor results, because the misspelled word most likely doesn’t exist in the index; it’s a mistake.
In order to fix this, we can compare the words that the user entered with the words from a dictionary, and use the Levenshtein distance to figure out the most likely correct word.
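A minimal sketch of dictionary-based correction using the classic dynamic-programming Levenshtein distance; a real spell checker would also weigh in word frequency, keyboard proximity, and similar signals.

```rust
/// Classic dynamic-programming Levenshtein distance between two words.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];

    for (i, &ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j + 1] + 1) // deletion
                .min(curr[j] + 1)           // insertion
                .min(prev[j] + cost);       // substitution
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Suggest the dictionary word closest to the (possibly misspelled) input.
fn suggest<'a>(word: &str, dictionary: &'a [&'a str]) -> Option<&'a str> {
    dictionary
        .iter()
        .copied()
        .min_by_key(|candidate| levenshtein(word, candidate))
}

fn main() {
    let dictionary = ["search", "engine", "rust", "spider"];
    println!("{:?}", suggest("serach", &dictionary)); // Some("search")
}
```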
Client
It is just a frontend app that is a wrapper around the entire search engine. It makes requests to the query engine and displays the results in a pretty format.