
Let’s design a large-scale distributed email service like Gmail or Outlook.

In 2020, Gmail had 1.8B active users and Outlook had 400M active users.

Step1: Scope the problem

Email services have grown significantly in complexity and scale.

Since an email system can have many features, let’s assume we will design the following:

  • send and receive emails
  • fetch all emails
  • search emails by subject / senders / body

Step2: High level design

Back-of-the-envelope estimation

  • 1B users
  • QPS for sending email = 10 ^ 5
  • storage requirement: 1000 PB
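
As a quick sanity check, the three estimates above can be converted into per-user figures. The sketch below uses only those three numbers; the decimal unit convention (1 PB = 10^15 bytes) and the variable names are assumptions made for illustration.

```python
# Back-of-the-envelope check using only the three estimates listed above.
USERS = 1_000_000_000        # 1B users
SEND_QPS = 10 ** 5           # emails sent per second
STORAGE_PB = 1000            # total storage in petabytes

SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_PB = 10 ** 15      # assumption: decimal units
BYTES_PER_GB = 10 ** 9       # assumption: decimal units

# What the QPS figure implies per day and per user.
emails_per_day = SEND_QPS * SECONDS_PER_DAY
emails_per_user_per_day = emails_per_day / USERS

# What the storage figure implies per user.
storage_per_user_gb = STORAGE_PB * BYTES_PER_PB / USERS / BYTES_PER_GB

print(f"emails sent per day: {emails_per_day:,}")
print(f"emails sent per user per day: {emails_per_user_per_day:.2f}")
print(f"storage per user: {storage_per_user_gb:.0f} GB")
```

In other words, the estimates correspond to each user sending a handful of emails per day and holding roughly 1 GB of mail.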

It’s clear we will have to deal with a lot of data.

Email Knowledge 101

Historically, most mail servers have used mail protocols such as POP, IMAP and SMTP.

SMTP: Simple Mail Transfer Protocol

The standard protocol for sending emails from one mail server to another.

POP: Post Office Protocol

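A standard mail protocol for receiving and downloading emails from a remote mail server to a local email client.

Once emails are downloaded to your computer, they are typically deleted from the email server.

IMAP: Internet Message Access Protocol

Also a standard mail protocol for receiving emails in a local email client. Unlike POP, emails stay on the server, so they can be read from multiple devices.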
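
To make the protocols a little more concrete, here is a minimal sketch using Python’s standard library. It builds a message with `email.message.EmailMessage` and defines a helper that would hand it to an SMTP server through `smtplib`. The addresses, host, and helper names (`build_message`, `send_via_smtp`) are hypothetical, and the send helper is only defined, not called, because it needs a reachable server.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a message with the usual headers and a plain-text body."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_smtp(msg: EmailMessage, host: str, port: int = 25) -> None:
    """Hand the message to an SMTP server (requires a reachable server)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)


# Hypothetical addresses, for illustration only.
msg = build_message("user1@example.com", "user2@example.org", "Hello", "Hi there!")
print(msg["Subject"])
```

On the receiving side, the standard library offers `poplib` and `imaplib` for talking to POP and IMAP servers respectively.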

Traditional Mail Server

The process consists of the following steps:

  1. User 1 logs in to the Outlook client, composes an email and presses the send button.
  2. The email is sent to the Outlook mail server. The Outlook client and the mail server communicate over SMTP.
  3. The Outlook mail server queries DNS to find the address of the recipient’s SMTP server. In this case, it’s the Gmail SMTP server.
  4. The Outlook mail server sends the email to the Gmail mail server.
  5. The Gmail server stores the email to make it available to user 2.
  6. The Gmail client fetches new emails through the IMAP/POP server when user 2 logs in.
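
The six steps above can be sketched as a small in-memory simulation. This is only an illustration: the `MX_RECORDS` dictionary stands in for a real DNS lookup, and the `MailServer` class and server names are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for DNS: recipient domain -> mail server name.
MX_RECORDS = {"gmail.com": "gmail-smtp", "outlook.com": "outlook-smtp"}


class MailServer:
    """Toy mail server that stores mail per recipient and relays to other servers."""

    def __init__(self, name: str, servers: dict):
        self.name = name
        self.mailboxes = defaultdict(list)  # recipient address -> stored emails
        self.servers = servers              # shared registry: name -> MailServer
        servers[name] = self

    def submit(self, sender: str, recipient: str, body: str) -> None:
        """Steps 2-4: accept mail from a client, resolve the target server, relay."""
        domain = recipient.split("@", 1)[1]
        target = self.servers[MX_RECORDS[domain]]  # step 3: "DNS" lookup
        target.deliver(sender, recipient, body)    # step 4: server-to-server transfer

    def deliver(self, sender: str, recipient: str, body: str) -> None:
        """Step 5: store the email so the recipient can fetch it later."""
        self.mailboxes[recipient].append({"from": sender, "body": body})

    def fetch(self, user: str) -> list:
        """Step 6: what an IMAP/POP fetch would return for this user."""
        return list(self.mailboxes[user])


servers = {}
outlook = MailServer("outlook-smtp", servers)
gmail = MailServer("gmail-smtp", servers)

# Step 1: user 1 composes and sends an email from the Outlook side.
outlook.submit("user1@outlook.com", "user2@gmail.com", "Hello from Outlook")
print(gmail.fetch("user2@gmail.com"))
```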

Distributed Mail Server

Let’s examine the email sending flow first.

Email Sending Flow

  1. The load balancer makes sure incoming requests do not exceed the rate limit.
  2. Web servers are responsible for
    1. validating the email
    2. passing the email to the message queue
  3. SMTP outgoing workers pull messages from the outgoing queue and make sure emails are virus free.
  4. Outgoing emails are stored in the “Sent Folder”.
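
The sending flow above can be sketched as a single-process pipeline. This is a rough illustration under several assumptions: the rate limit values, the loose address check, and the `looks_infected` placeholder are all hypothetical, and a real system would run the workers as separate services behind a durable queue.

```python
import queue
import re
import time

RATE_LIMIT = 3          # hypothetical: max requests per sender per window
WINDOW_SECONDS = 60.0   # hypothetical window length

_request_log = {}               # sender -> timestamps of recent requests
outgoing_queue = queue.Queue()  # queue between web servers and SMTP workers
sent_folder = {}                # sender -> list of sent emails


def allow_request(sender, now=None):
    """Step 1: sliding-window rate limit, as a load balancer might enforce it."""
    now = time.monotonic() if now is None else now
    recent = [t for t in _request_log.get(sender, []) if now - t < WINDOW_SECONDS]
    allowed = len(recent) < RATE_LIMIT
    if allowed:
        recent.append(now)
    _request_log[sender] = recent
    return allowed


def is_valid(email):
    """Step 2.1: basic validation with a deliberately loose address check."""
    address_ok = re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email["to"]) is not None
    return address_ok and bool(email["subject"] or email["body"])


def web_server_handle(email):
    """Steps 1-2: rate limit, validate, then enqueue for the outgoing workers."""
    if not allow_request(email["from"]):
        return "rate_limited"
    if not is_valid(email):
        return "invalid"
    outgoing_queue.put(email)
    return "queued"


def looks_infected(email):
    """Placeholder virus check; a real system would call a scanning service."""
    return "EICAR" in email["body"]


def smtp_outgoing_worker():
    """Steps 3-4: drain the queue, drop infected mail, record the rest as sent."""
    delivered = []
    while not outgoing_queue.empty():
        email = outgoing_queue.get()
        if looks_infected(email):
            continue
        sent_folder.setdefault(email["from"], []).append(email)
        delivered.append(email)
    return delivered


ok = {"from": "a@example.com", "to": "b@example.org", "subject": "Hi", "body": "Hello"}
bad = {"from": "a@example.com", "to": "not-an-address", "subject": "Hi", "body": "Hello"}
results = [web_server_handle(ok), web_server_handle(bad)]
delivered = smtp_outgoing_worker()
print(results, len(delivered))
```

Putting the queue between the web servers and the workers means a slow virus scan delays delivery but does not block the user’s send request.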

Email Receiving Flow

  1. Incoming emails arrive at the SMTP load balancer.
  2. The load balancer distributes traffic among the SMTP servers.
  3. Emails are put in the incoming email queue.
  4. Mail processing workers are responsible for time-consuming jobs such as validation.
  5. Emails that pass validation are stored in storage.
  6. When the receiver logs in to the email client, the client fetches the available emails from storage.
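
The receiving flow above can be sketched in the same spirit. The round-robin policy, the `passes_validation` check, and all class and function names are assumptions for illustration; the source does not specify how traffic is distributed or what validation involves.

```python
import itertools
from collections import defaultdict, deque


class SmtpServer:
    """Toy SMTP server that forwards accepted mail to the incoming queue."""

    def __init__(self, name, incoming_queue):
        self.name = name
        self.incoming_queue = incoming_queue
        self.accepted = 0

    def accept(self, email):
        """Step 3: accept the email and put it on the incoming queue."""
        self.accepted += 1
        self.incoming_queue.append(email)


class LoadBalancer:
    """Steps 1-2: round-robin distribution across SMTP servers (assumed policy)."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def route(self, email):
        next(self._cycle).accept(email)


def passes_validation(email):
    """Hypothetical check standing in for the time-consuming processing jobs."""
    return bool(email.get("to")) and bool(email.get("from"))


def mail_processing_worker(incoming_queue, storage):
    """Steps 4-5: validate queued emails and store the ones that pass."""
    while incoming_queue:
        email = incoming_queue.popleft()
        if passes_validation(email):
            storage[email["to"]].append(email)


def fetch_emails(storage, user):
    """Step 6: the client fetches the available emails for the logged-in user."""
    return list(storage.get(user, []))


incoming = deque()
storage = defaultdict(list)
servers = [SmtpServer("smtp-1", incoming), SmtpServer("smtp-2", incoming)]
balancer = LoadBalancer(servers)

for i in range(4):
    balancer.route({"from": f"sender{i}@example.com", "to": "user2@example.org", "body": f"msg {i}"})
# This one has no sender and should be rejected by validation.
balancer.route({"from": "", "to": "user2@example.org", "body": "missing sender"})

mail_processing_worker(incoming, storage)
inbox = fetch_emails(storage, "user2@example.org")
print(len(inbox), [s.accepted for s in servers])
```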

Step3: Design deep dive

Metadata DB

Let’s examine the characteristics of email metadata:

  • Headers are usually small and frequently accessed.
  • Email bodies can range from small to big.
  • Emails owned by a user are only accessible by that user.
  • Data recency impacts data usage. Users usually read recent emails.

At a high level, an email service should support the following queries:

  1. get all emails for a user
  2. create/delete a specific email
  3. fetch all read/unread emails
  4. mark unread emails as read

Weighing these access patterns and queries, we can choose a relational DB for this use case.
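
To see how the four queries map onto a relational model, here is a minimal sketch using Python’s built-in `sqlite3` module. The table name, column names, index, and sample rows are all assumptions for illustration; a production system would use a distributed relational store and a schema tuned to its own access patterns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE emails (
        email_id    INTEGER PRIMARY KEY,
        user_id     INTEGER NOT NULL,
        sender      TEXT NOT NULL,
        subject     TEXT NOT NULL,
        body        TEXT NOT NULL,
        is_read     INTEGER NOT NULL DEFAULT 0,
        received_at TEXT NOT NULL
    )
    """
)
# Every query is scoped to one user and usually wants the newest mail first.
conn.execute("CREATE INDEX idx_user_recent ON emails (user_id, received_at DESC)")

# 2 (create). Insert a few hypothetical emails.
rows = [
    (1, "a@example.com", "Welcome", "Hello!", 0, "2020-01-01T09:00:00"),
    (1, "b@example.com", "Invoice", "See attached.", 1, "2020-01-02T09:00:00"),
    (2, "c@example.com", "Lunch?", "Noon?", 0, "2020-01-03T09:00:00"),
]
conn.executemany(
    "INSERT INTO emails (user_id, sender, subject, body, is_read, received_at) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

# 1. Get all emails for a user, newest first.
all_for_user = conn.execute(
    "SELECT subject FROM emails WHERE user_id = ? ORDER BY received_at DESC", (1,)
).fetchall()

# 3. Fetch all unread emails for a user.
unread = conn.execute(
    "SELECT subject FROM emails WHERE user_id = ? AND is_read = 0", (1,)
).fetchall()

# 4. Mark unread emails as read.
conn.execute("UPDATE emails SET is_read = 1 WHERE user_id = ? AND is_read = 0", (1,))
unread_after = conn.execute(
    "SELECT COUNT(*) FROM emails WHERE user_id = ? AND is_read = 0", (1,)
).fetchone()[0]

# 2 (delete). Delete a specific email, still scoped to its owner.
conn.execute("DELETE FROM emails WHERE user_id = ? AND email_id = ?", (1, 2))
remaining = conn.execute(
    "SELECT COUNT(*) FROM emails WHERE user_id = ?", (1,)
).fetchone()[0]

print(all_for_user, unread, unread_after, remaining)
```

Including `user_id` in every `WHERE` clause reflects the rule that emails are only accessible by their owner, and the index on `(user_id, received_at)` reflects the bias toward recent emails.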

Consistency

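A distributed DB that relies on replication for high availability must make a fundamental trade-off between consistency and availability.

We decide to trade availability for consistency: when the two conflict, a mailbox may be briefly unavailable, but it should never show stale or conflicting data.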
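
One way to picture this choice is a toy model with a single primary and lagging replicas, where the service refuses requests during a primary failure instead of answering from a replica that might be stale. The `ReplicatedMailbox` class and its behavior are a hypothetical simplification, not a description of any particular database.

```python
class Unavailable(Exception):
    """Raised when the mailbox cannot be served consistently."""


class ReplicatedMailbox:
    """Toy mailbox with one primary and read-only replicas (hypothetical model)."""

    def __init__(self, replica_count=2):
        self.primary = []
        self.replicas = [[] for _ in range(replica_count)]
        self.primary_up = True

    def write(self, email):
        if not self.primary_up:
            raise Unavailable("primary is down; refusing write")
        self.primary.append(email)

    def replicate(self):
        """Asynchronous replication step; replicas lag until this runs."""
        for replica in self.replicas:
            replica[:] = self.primary

    def read(self):
        # Consistency over availability: never fall back to a possibly stale replica.
        if not self.primary_up:
            raise Unavailable("primary is down; refusing possibly stale read")
        return list(self.primary)


box = ReplicatedMailbox()
box.write("email 1")
box.replicate()
box.write("email 2")    # not yet replicated
box.primary_up = False  # simulate a primary failure

try:
    box.read()
    outcome = "served"
except Unavailable:
    outcome = "unavailable"

print(outcome, box.replicas[0])
```

An availability-first design would instead serve the replica’s contents here, and the user would temporarily not see "email 2".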

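Search

The search feature in an email system has a lot more writes than reads: every email sent or received has to be indexed, while searches are comparatively rare.

We can leverage Elasticsearch to build an inverted index and support the search features.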
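
To show the idea behind an inverted index, here is a minimal in-memory sketch. It is not Elasticsearch’s API; the `InvertedIndex` class, its tokenizer, and the sample emails are assumptions made purely to illustrate how tokens map to email IDs.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Maps each token to the set of email IDs that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    @staticmethod
    def tokenize(text):
        """Lowercase the text and split it into alphanumeric tokens."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, email_id, email):
        """Index subject, sender, and body; this runs on every email (a write)."""
        for field in ("subject", "sender", "body"):
            for token in self.tokenize(email[field]):
                self.postings[token].add(email_id)

    def search(self, query):
        """Return the IDs of emails that contain every token in the query."""
        tokens = self.tokenize(query)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = InvertedIndex()
index.add(1, {"subject": "Quarterly report", "sender": "alice@example.com",
              "body": "Numbers attached."})
index.add(2, {"subject": "Lunch", "sender": "bob@example.com",
              "body": "Report to the cafe at noon."})

print(sorted(index.search("report")))
print(sorted(index.search("report alice")))
```

Because each email is indexed once on arrival and searched only occasionally, the index has to absorb many more writes than reads, which matches the pattern noted above.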

Step4: Wrap up

In this article, we started from the traditional email architecture, then evaluated how to scale it up for the sending flow and the receiving flow separately.

Then we took a deep dive into the choice of DB solution and the factors to consider during that process, and explored how to support search features in the system.

Some takeaways:

  1. When to introduce a message queue: when a component is time consuming, e.g. the SMTP worker, put a message queue in front of it to improve system performance.
  2. When evaluating a DB solution, some things to consider:
    1. the frequently used queries
    2. the data access pattern