System Design: Distributed Search System
What we need search system?
A search system is a system that takes some text input, which is query, then return the relevant content in a few seconds.
there are 3 main components in a search system:
- crawler: which fetch the content available in target material pool
- indexer: build the searchable index for the content crawled
- searcher: takes in a search query, then run the search query on the index created by indexer
Design
Requirements
- search: user can give some text input, and get some relevant content back
Integration
overall design
basically the distributed search system could be divided into two phases: online phase and offline phase.
- online phase: in this phase, our system takes into users’ input, interpret it and return the results in real time
- offline phase: in this phase, crawler will fetch the content from target documents pool, then indexer will build inverted indexing upon them.
crawling
Inevitably, we need to build a distributed system for crawler due to the high demand of storage space.
distributed crawler will be discussed later. while now, let’s imagine, distributed crawlers works as expected, then the crawled content will be divided across the nodes.
then the problem is, how could we build the distributed indexing upon the crawled documents?
let’s look into indexing itself.
indexing
indexing is the organization and manipulation of data that’s done to facilitate fast and accurate information retrieval.
a searchable index
the simplest way to build a searchable index is just assign a unique index to each document, then store the mapping in key-value store.
while in real world, the document is very diverse, like video, audio or executable code. so in this way our table could be very big and hard to maintain.
Also, this data structure could make the efficient search impossible.
inverted index
then let’s talk the solution for above issue.
an inverted index is a Map-like data structure that employs a document-term matrix, by splitting the whole documents into individual words, remove those non-sense word, then build the statistics of each word in one documents.
so instead of getting a list of individual word from a document, we can search a list of documents from individual word.
the information we can get from this index are:
- a list of documents in which the term appeared
- a list that counts of frequency with which the term appears in each documents
searching on inverted index
assume our system already build up the inverted index, how is the searching working?
we split the query text into individual word, then find the top 10 related documents of them, merge the result and return.
distributed indexer
let’s take a look at the problem if we perform indexing on a single node.
the overall process should looks like:
- the indexing process takes the documents and convert them into individual words, then build the inverted index upon them
- query process interprets the binary file that contains the inverted index, then find the most relevant documents to return.
so we need to partition the indexing, there are two directions of partition:
- document partition: we store the subsets of documentations on each nodes
- term partition: in this way, we already build up the inverted index, then split the subset of entries in inverted index among the nodes.
I will use document partition as example to explain the overall processes
overall workflow
indexing:
- crawler collect a document set
- cluster manager split the input documents into N partitions, the size of each partitions is determined by cluster manager depends on the size of overall data
- cluster manager has the healthy information of each nodes through the heartbeat
- cluster manager run indexing algorithms for all the N partitions simultaneously, each process produce a tiny inverted index and stored on node’s local storage.
searching:
- when system get user’s input, we run parallel searches on each tiny inverted index on nodes’ local storages
- merger will merge the result from each inverted index is a mapping list
- then we can get an aggregated mapping list based on the frequency of input term
- sorted documents based on the relevance get returned to user
problems
so we have a initial design of search system, while it has several issues
- collocated indexing and searching : we perform the indexing and search operations on same nodes, both operations are resource demanding so they can interfere each other.
- indexing re-computation: we assume each replica compute the index individually, leads to the waste of resources.
solution
- collocated indexing and search:
- we can separate the indexing tasks and search tasks
- crawler fetch the content and store in crawler’s node
- crawler node could also responsible to build inverted index on local documents
- crawler node store the inverted index file into distributed storage
- searching process will be executed on distributed storage
- we can separate the indexing tasks and search tasks
- recompute index: we can only compute the inverted index on primary node, then replica the inverted index to secondary nodes.