WhatsApp System Design
Never dive directly into the design phase; it may raise red flags during the interview!
Interviewer: Can you design a messaging platform similar to WhatsApp?
1. Feature Expectations [5 mins]
You: Before we proceed, could you please clarify the use cases and scope for WhatsApp's system design? This will help me understand the main goals and functionalities we need to focus on.
Interviewer: Absolutely. WhatsApp aims to accommodate a large user base with seamless communication capabilities, ensuring messages are delivered reliably and securely. We need robust infrastructure to handle peak loads without compromising performance or security.
You: Thank you for the clarification. Based on this, let's outline the functional requirements of WhatsApp to ensure it meets these goals effectively.
Functional Requirements
- Real-time messaging
- Group messaging
- Online status
- Image and video uploads
These features are central to the user experience and represent the primary interactions on the platform.
To keep the discussion focused and manageable, I will not cover features like voice and video calls, status updates, and payment integration, as these can be considered secondary and would require additional time to discuss comprehensively.
Target users of WhatsApp are individuals, groups, and businesses. Knowing your audience helps tailor the design to meet their specific needs.
Limit the number of features discussed to one or two, as covering more can be time-consuming and may detract from explaining the most critical aspects of the design.
2. Estimations [5 mins]
Understanding how much data the system handles and how fast it processes it is crucial for designing it well. This ensures that the network and storage can handle all user activities smoothly, without slowing down or causing issues.
- Assume a daily active user base of 1 billion
- Total Profile Data (name, contact details, etc.): 1 billion users * 1 KB/user = 1 TB
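- On average, each user sends 100 messages per day, or 100 billion messages per day in total
- With each message averaging 200 bytes in size, the total daily data generated is approximately 100 billion * 200 bytes = 20 terabytes (TB).
- Spread over the 1,440 minutes in a day, 100 billion messages works out to approximately 70 million messages per minute (about 1.16 million per second). To handle this volume, WhatsApp requires a system capable of sustaining around 70 million transactions per minute (TPM) on average, with headroom for peaks.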
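The arithmetic behind these figures can be double-checked with a short Python script. The constants simply restate the assumptions above (using decimal units, 1 TB = 10^12 bytes); the per-minute and per-second rates are daily averages, so peak traffic would be higher.

```python
# Back-of-the-envelope estimates; every constant restates an assumption from the list above.
DAILY_ACTIVE_USERS = 1_000_000_000
PROFILE_BYTES_PER_USER = 1_000          # ~1 KB of profile data per user
MESSAGES_PER_USER_PER_DAY = 100
BYTES_PER_MESSAGE = 200                 # average message size
TB = 10**12                             # decimal terabyte

profile_storage_tb = DAILY_ACTIVE_USERS * PROFILE_BYTES_PER_USER / TB
messages_per_day = DAILY_ACTIVE_USERS * MESSAGES_PER_USER_PER_DAY
daily_message_data_tb = messages_per_day * BYTES_PER_MESSAGE / TB
messages_per_minute = messages_per_day / (24 * 60)       # average TPM
messages_per_second = messages_per_day / (24 * 60 * 60)  # average TPS

print(f"Profile storage:      {profile_storage_tb:.0f} TB")              # 1 TB
print(f"Messages per day:     {messages_per_day:,}")                     # 100,000,000,000
print(f"Message data per day: {daily_message_data_tb:.0f} TB")           # 20 TB
print(f"Average TPM:          {messages_per_minute / 1e6:.1f} million")  # 69.4 million
print(f"Average TPS:          {messages_per_second / 1e6:.2f} million")  # 1.16 million
```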
Clear estimations demonstrate planning and analytical skills crucial for system scalability and performance assessment.
3. Design Goals [5 mins]
Based on estimates and discussions, the system's non-functional requirements include managing high user loads effectively, implementing robust data security measures, and having scalable infrastructure for rapid growth.
- Low latency
- High availability
- Security
- Scalability
Specify latency and throughput targets, and decide on consistency and availability levels based on the estimations discussed, for a robust system design.
4. High-Level Design [5-8 mins]
To create a high-level system architecture for WhatsApp, we'll focus on several key aspects such as designing APIs for both reading and writing data, defining the structure of the database, implementing core algorithms, and outlining the overall framework for managing how data is read from and written to the system.
I. APIs for Read/Write Scenarios
Creating APIs for WhatsApp is important because it sets up clear ways for different parts of the system to communicate with each other. This organization makes it easier to modify and add new features later on. APIs also allow other programs or services to use WhatsApp's features in a controlled manner, helping to build and expand the platform smoothly.
Send Message
Endpoint | Parameters | Response |
---|---|---|
POST /messages/send | senderId, recipientId, content | Success/Failure message |
Receive Message
Endpoint | Parameters | Response |
---|---|---|
Real-time via WebSockets | N/A | Real-time message |
Send Media
Endpoint | Parameters | Response |
---|---|---|
POST /messages/sendMedia | senderId, recipientId, media, caption | Success/Failure message |
Update Online Status
Endpoint | Parameters | Response |
---|---|---|
POST /users/updateStatus | userId, status | Success/Failure message |
Group Messaging
Endpoint | Parameters | Response |
---|---|---|
POST /groups/{groupId}/messages/send | senderId, content | Success/Failure message |
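To make the Send Message contract concrete, here is a minimal Python sketch of how a handler for POST /messages/send might validate its parameters and return a success or failure response. MessageStore, send_message, the response fields, and MAX_CONTENT_CHARS are hypothetical names chosen for illustration, and an in-memory list stands in for the Message Database.

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class MessageStore:
    """Hypothetical in-memory stand-in for the Message Database."""
    messages: list = field(default_factory=list)

    def save(self, sender_id: str, recipient_id: str, content: str) -> str:
        message_id = str(uuid.uuid4())
        self.messages.append({
            "message_id": message_id,
            "sender_id": sender_id,
            "recipient_id": recipient_id,
            "content": content,
            "timestamp": time.time(),
            "status": "sent",
        })
        return message_id

MAX_CONTENT_CHARS = 4096  # illustrative limit, not a documented WhatsApp value

def send_message(store: MessageStore, body: dict) -> dict:
    """Handle a POST /messages/send body and return a Success/Failure response."""
    for name in ("senderId", "recipientId", "content"):
        if not isinstance(body.get(name), str) or not body[name]:
            return {"status": "failure", "error": f"missing or invalid '{name}'"}
    if len(body["content"]) > MAX_CONTENT_CHARS:
        return {"status": "failure", "error": "content too long"}
    message_id = store.save(body["senderId"], body["recipientId"], body["content"])
    return {"status": "success", "messageId": message_id}

store = MessageStore()
print(send_message(store, {"senderId": "u1", "recipientId": "u2", "content": "Hi"})["status"])  # success
print(send_message(store, {"senderId": "u1", "content": "Hi"})["error"])  # missing or invalid 'recipientId'
```

Rejecting malformed requests at this layer keeps invalid data out of the Message Database; the same pattern would apply to the media, status, and group endpoints.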
II. Database Schema
WhatsApp, as a messaging application, prioritizes availability and partition tolerance over strict consistency, which aligns well with the characteristics offered by NoSQL databases. Unlike SQL databases that adhere to ACID (Atomicity, Consistency, Isolation, Durability) properties for data integrity, NoSQL databases provide flexibility and scalability without requiring strict consistency across all operations. This makes NoSQL databases like MongoDB suitable for handling WhatsApp's dynamic and high-volume messaging environment, where rapid data ingestion and flexible schema management are essential, and occasional data inconsistencies are acceptable.
Users Table
{
  "user_id": "UUID",
  "phone_number": "string",
  "name": "string",
  "profile_picture_url": "string",
  "created_at": "timestamp"
}
Messages Table
{
  "message_id": "UUID",
  "sender_id": "UUID",
  "recipient_id": "UUID",
  "content": "string",
  "timestamp": "timestamp",
  "status": "string"
}
Groups Table
{
  "group_id": "UUID",
  "group_name": "string",
  "created_by": "UUID",
  "members": [
    {
      "user_id": "UUID",
      "is_admin": "boolean"
    },
    ...
  ],
  "recipient_id": "UUID",
  "created_at": "timestamp",
  "status": "string"
}
Conversations Table
{
  "conversation_id": "UUID",
  "conversation_name": "string",
  "participants": [
    {
      "user_id": "UUID",
      "is_admin": "boolean"
    },
    ...
  ],
  "created_at": "timestamp"
}
User Conversation Table
{
  "user_conversation_id": "UUID",
  "conversation_id": "UUID",
  "user_id": "UUID",
  "is_admin": "boolean"
}
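The User Conversation Table is essentially a join table between users and conversations. Here is a rough sketch, assuming the rows are available in memory (the sample IDs are made up), of how it can be indexed in both directions: by user, to load a chat list, and by conversation, to find the members a message must reach.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like the User Conversation Table above.
user_conversation_rows = [
    {"user_conversation_id": "uc1", "conversation_id": "c1", "user_id": "alice", "is_admin": True},
    {"user_conversation_id": "uc2", "conversation_id": "c1", "user_id": "bob", "is_admin": False},
    {"user_conversation_id": "uc3", "conversation_id": "c2", "user_id": "alice", "is_admin": False},
]

def build_indexes(rows):
    """Index the join table both ways: user -> conversations and conversation -> members."""
    by_user = defaultdict(set)
    by_conversation = defaultdict(set)
    for row in rows:
        by_user[row["user_id"]].add(row["conversation_id"])
        by_conversation[row["conversation_id"]].add(row["user_id"])
    return by_user, by_conversation

by_user, by_conversation = build_indexes(user_conversation_rows)
print(sorted(by_user["alice"]))       # ['c1', 'c2']
print(sorted(by_conversation["c1"]))  # ['alice', 'bob']
```

In a NoSQL store, this usually translates into keeping the data queryable by both user_id and conversation_id, for example through a secondary index or a second, denormalized copy.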
Ensure clear and concise communication of design choices and their implications to demonstrate deep understanding and critical thinking.
5. Message Communication [5 mins]
It is crucial to understand the communication between client and server when implementing a chat application. This understanding forms the foundation for real-time, bi-directional data exchange, enabling seamless interaction between users through instant messaging. Establishing a secure connection between a client and a server involves several steps:
1. TCP Handshake
TCP handshake is a process where a client and server establish a connection by exchanging synchronization (SYN) and acknowledgment (ACK) packets to ensure reliable data transmission.
- SYN (Synchronize): Client sends a SYN segment to the server to initiate a connection request.
- SYN-ACK (Synchronize-Acknowledgment): Server responds with SYN-ACK to acknowledge the request and signal readiness to establish a connection.
- ACK (Acknowledgment): Client sends an ACK segment to acknowledge the server's response, completing the TCP handshake and establishing a reliable connection.
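In practice the operating system performs this exchange inside calls such as connect() and accept(), so application code never builds these segments itself. The following Python simulation (not real networking) shows how the sequence and acknowledgment numbers line up: each side acknowledges the other's initial sequence number plus one, because a SYN consumes one sequence number. Segment and three_way_handshake are illustrative names.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Segment:
    flags: frozenset            # e.g. {"SYN"}, {"SYN", "ACK"}, {"ACK"}
    seq: int                    # sequence number carried by this segment
    ack: Optional[int] = None   # acknowledgment number, set only when ACK is present

SEQ_SPACE = 2**32  # TCP sequence numbers are 32-bit and wrap around

def three_way_handshake(client_isn: int, server_isn: int) -> list:
    """Return the three segments of a TCP handshake (a simulation, not real networking)."""
    syn = Segment(frozenset({"SYN"}), seq=client_isn)
    # The server acknowledges the client's SYN, which consumes one sequence number.
    syn_ack = Segment(frozenset({"SYN", "ACK"}), seq=server_isn, ack=(syn.seq + 1) % SEQ_SPACE)
    # The client acknowledges the server's SYN the same way.
    ack = Segment(frozenset({"ACK"}), seq=syn_ack.ack, ack=(syn_ack.seq + 1) % SEQ_SPACE)
    return [syn, syn_ack, ack]

# Real stacks choose initial sequence numbers (ISNs) unpredictably; random is used here for illustration.
for segment in three_way_handshake(random.randrange(SEQ_SPACE), random.randrange(SEQ_SPACE)):
    print(sorted(segment.flags), segment.seq, segment.ack)
```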
2. SSL/TLS Handshake
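Once the TCP connection is established, the SSL/TLS handshake begins. The SSL/TLS handshake lets the client and server agree on encryption parameters and establish a secure, encrypted connection. Here's a simplified explanation of the handshake flow; the message names follow TLS 1.2, while TLS 1.3 condenses the exchange (for example, key-exchange data travels in ClientHello and ServerHello, and ServerKeyExchange and ServerHelloDone are dropped):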
- ClientHello: The client initiates the handshake by sending its supported SSL/TLS version, cipher suites, and other parameters.
- ServerHello: The server responds with its chosen SSL/TLS version, cipher suite, and other parameters.
- Certificate (Optional): The server sends its digital certificate for authentication.
- ServerKeyExchange (Optional): The server sends necessary data for key exchange (e.g., parameters for Diffie-Hellman).
- CertificateRequest (Optional): The server requests the client's certificate for mutual authentication.
- ServerHelloDone: The server signals that it has completed its part of the handshake.
- ClientKeyExchange: The client sends data for key exchange.
- ChangeCipherSpec: Both client and server signal readiness to begin encrypted communication.
- Finished: Both parties verify the integrity of the handshake and establish secure communication.
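Applications rarely implement these messages directly. With Python's standard ssl module, for example, the entire handshake from ClientHello to Finished runs inside SSLContext.wrap_socket(). The sketch below is a minimal client setup under that assumption; make_client_context and open_secure_connection are hypothetical helper names, and the host is whatever server you connect to.

```python
import socket
import ssl

def make_client_context() -> ssl.SSLContext:
    """Client-side TLS settings: verify the server certificate and refuse anything older than TLS 1.2."""
    context = ssl.create_default_context()  # loads system CA roots and enables hostname checking
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context

def open_secure_connection(host: str, port: int = 443) -> ssl.SSLSocket:
    """The TCP handshake happens in create_connection(); the TLS handshake runs inside wrap_socket()."""
    raw_sock = socket.create_connection((host, port), timeout=10)
    try:
        # server_hostname is used for SNI and for checking the certificate against the host name.
        return make_client_context().wrap_socket(raw_sock, server_hostname=host)
    except Exception:
        raw_sock.close()
        raise

# Example usage (needs network access, so it is left commented out):
# with open_secure_connection("example.com") as tls_sock:
#     print(tls_sock.version())  # the negotiated protocol, e.g. "TLSv1.3"
```

Whether TLS 1.2 or TLS 1.3 is actually negotiated depends on what both peers support.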
3. HTTPS Communication
After the TCP handshake and SSL/TLS handshake have successfully established a connection, HTTPS communication involves the secure exchange of HTTP messages between the client and the server. Here’s a breakdown of what happens:
- HTTPS Request: The client sends an encrypted HTTP request to the server, specifying what resource it wants.
- Server Processing: The server decrypts the request, processes it (e.g., fetches a webpage, handles form data), and prepares a response.
- HTTPS Response: The server sends an encrypted HTTP response back to the client, containing the requested resource or indicating an error.
- Secure Transmission: All communication (requests and responses) is encrypted, ensuring data confidentiality and integrity.
- Connection Maintenance: The TCP and SSL/TLS connections remain open for efficient subsequent requests, reducing overhead.
Message Communication Methods
Now that we understand how connections are established between clients and servers, let's delve into the communication methods used for data exchange:
1. HTTP(S) Requests
HTTP(S) requests are fundamental for sending and receiving messages. Clients such as mobile apps or web browsers initiate them, for example using HTTP POST requests to transmit message content to the server.
2. Polling
Polling means that clients ask the server at regular intervals whether there are new messages or updates. When the server gets a request, it checks for new data and responds immediately, even if there is nothing new. This method works well when receiving updates instantly isn't crucial, or when it's hard to keep a continuous connection open, as WebSockets require.
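A minimal Python sketch of client-side polling, assuming a hypothetical in-memory inbox on the server. It shows the main cost of the approach: the client sends a request every interval, and most responses come back empty. InMemoryInbox and poll are illustrative names.

```python
import time
from typing import Callable, List

class InMemoryInbox:
    """Hypothetical server-side inbox; fetch_new() answers immediately, even when empty."""

    def __init__(self):
        self._messages: List[str] = []
        self._cursor = 0  # index of the first message this (single) client has not yet fetched

    def deliver(self, message: str) -> None:
        self._messages.append(message)

    def fetch_new(self) -> List[str]:
        new = self._messages[self._cursor:]
        self._cursor = len(self._messages)
        return new

def poll(fetch: Callable[[], List[str]], interval_s: float, rounds: int,
         sleep: Callable[[float], None] = time.sleep) -> List[List[str]]:
    """Ask the server for new messages every interval_s seconds, whether or not any exist."""
    responses = []
    for _ in range(rounds):
        responses.append(fetch())  # one request per round, often returning nothing
        sleep(interval_s)
    return responses

inbox = InMemoryInbox()
inbox.deliver("hello")
print(poll(inbox.fetch_new, interval_s=0.01, rounds=3))  # [['hello'], [], []]
```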
3. Long Polling
Long polling is like asking the server to keep the line open until it has something new to say or until a timeout happens. This helps reduce the delay in getting updates compared to regular checking. It's good for apps that need updates quickly but can't stay connected all the time, like when messages come in occasionally.
Feature | Polling | Long Polling |
---|---|---|
Request Timing | Client sends requests at regular intervals. | Client sends a request and waits until the server has new data. |
Server Response | Server responds immediately to each request with current data. | Server holds the request open until new data is available, then responds. |
Client Behavior | Initiates requests periodically regardless of server readiness. | Waits for server response before sending next request, reducing unnecessary requests. |
Connection Handling | Typically closes and re-establishes connection after each request. | Keeps connection open longer, reducing overhead from frequent connection setup. |
Latency | May introduce latency due to fixed intervals between requests. | Reduces latency as server notifies client immediately upon data availability. |
Suitability | Suitable for applications where periodic updates suffice and maintaining continuous connection is impractical. | Ideal for real-time applications needing prompt notifications on data updates, reducing delay compared to regular polling. |
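The long-polling column of the table can be sketched with a condition variable: the server parks the request until a message arrives or the timeout expires. LongPollInbox and its method names are hypothetical, and a real server would hold an open HTTP request rather than block a thread per client.

```python
import threading
from typing import List

class LongPollInbox:
    """Hypothetical server-side inbox that holds a request open until data arrives or a timeout expires."""

    def __init__(self):
        self._messages: List[str] = []
        self._cond = threading.Condition()

    def deliver(self, message: str) -> None:
        with self._cond:
            self._messages.append(message)
            self._cond.notify_all()  # wake any waiting long-poll requests

    def long_poll(self, timeout_s: float) -> List[str]:
        """Return pending messages as soon as any exist; return [] if timeout_s elapses first."""
        with self._cond:
            self._cond.wait_for(lambda: self._messages, timeout=timeout_s)
            pending, self._messages = self._messages, []
            return pending

inbox = LongPollInbox()
threading.Timer(0.1, inbox.deliver, args=("hello",)).start()
print(inbox.long_poll(timeout_s=5))    # ['hello'], returned after about 0.1 s rather than 5 s
print(inbox.long_poll(timeout_s=0.2))  # [] once the timeout expires
```

After each response the client immediately issues the next long-poll request, which is why fewer requests come back empty than with plain polling.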
4. WebSockets
WebSockets keep a persistent connection open between clients and servers, allowing them to send messages in both directions without delay. This is different from methods that repeatedly check for updates. WebSockets are crucial for chat platforms like WhatsApp, where receiving messages instantly, seeing when someone is typing, and knowing when a message has been read are very important. They deliver updates right away, reduce waiting times, and can handle many concurrent connections across different devices and platforms.
HTTPS to WebSocket Upgrade Process
- Establish HTTPS Connection: Client initiates a standard HTTPS connection to the server.
- Upgrade Request (WebSocket): Client sends an HTTP GET request with Upgrade: websocket and Connection: Upgrade headers, asking to switch the connection to the WebSocket protocol.
- Upgrade Response (WebSocket): Server replies with HTTP 101 Switching Protocols, confirming the switch to the WebSocket protocol.
- Bi-directional WebSocket Communication Established: Both client and server can now communicate bi-directionally in real-time using WebSocket.
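During the upgrade, the client sends a random Sec-WebSocket-Key header, and the server proves it understood the request by returning Sec-WebSocket-Accept, computed as the Base64-encoded SHA-1 digest of the key concatenated with a fixed GUID defined in RFC 6455. Here is a Python sketch of both sides of that exchange; the function names, host, and path are illustrative.

```python
import base64
import hashlib
import os

WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # fixed GUID defined in RFC 6455

def make_upgrade_request(host: str, path: str) -> tuple:
    """Build the client's HTTP/1.1 upgrade request and return it together with the random key it used."""
    key = base64.b64encode(os.urandom(16)).decode("ascii")  # 16 random bytes, Base64-encoded
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Key: {key}\r\n"
        "Sec-WebSocket-Version: 13\r\n"
        "\r\n"
    )
    return request, key

def accept_value(key: str) -> str:
    """Value the server returns in Sec-WebSocket-Accept alongside 101 Switching Protocols."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

# Example key/answer pair from RFC 6455, section 1.3.
print(accept_value("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

The client compares the returned value with accept_value(key) for the key it sent; over an HTTPS connection (wss://), this exchange travels inside the already-established TLS session.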
6. Deep Dive [10-12 mins]
Let's carefully plan out WhatsApp's system architecture to ensure we build features that meet all our needs. Going step by step helps us avoid mistakes and ensures the system is reliable and can grow smoothly over time.
1. Users
Users are the end-users of the chat application, using smartphones to send messages, share media, and get notifications. Their active participation and interaction are crucial to the app's success.
2. Load Balancer
Load balancer distributes incoming requests from users across multiple instances of the API Gateway for better performance and availability.
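Round robin is one common way to spread requests across API Gateway instances. The sketch below assumes a static list of hypothetical instance names plus a simple healthy/unhealthy flag; production load balancers add active health checks and other policies.

```python
import itertools
from typing import Dict, List

class RoundRobinBalancer:
    """Cycles through backends in order, skipping any marked unhealthy."""

    def __init__(self, backends: List[str]):
        self._backends = list(backends)
        self._healthy: Dict[str, bool] = {b: True for b in self._backends}
        self._cycle = itertools.cycle(self._backends)

    def mark_down(self, backend: str) -> None:
        self._healthy[backend] = False

    def mark_up(self, backend: str) -> None:
        self._healthy[backend] = True

    def pick(self) -> str:
        # Try each backend at most once per call before giving up.
        for _ in range(len(self._backends)):
            candidate = next(self._cycle)
            if self._healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy backends")

lb = RoundRobinBalancer(["gw-1", "gw-2", "gw-3"])  # hypothetical API Gateway instances
print([lb.pick() for _ in range(4)])  # ['gw-1', 'gw-2', 'gw-3', 'gw-1']
lb.mark_down("gw-2")                   # gw-2 is now skipped until mark_up("gw-2")
```

Because WebSocket connections are long-lived, a least-connections policy may balance that traffic more evenly than plain round robin.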
3. API Gateway
API Gateway serves as the single entry point for all client requests, routing each one to the appropriate backend service, such as the User Service, Message Service, or Media Service. It handles authentication, authorization, and traffic management, centralizing request handling to simplify service interactions and enhance security across the system.
- POST Message: Directs the request to send a new message.
- GET Messages: Routes the request to retrieve messages for the user.
- POST Group: Handles the action of creating a new group chat.
- POST Online Status: Handles updating the user's online status.
- POST Media: Manages the request to send media within a message.
The API Gateway abstracts the complexity of the backend services from the clients and provides a unified interface.
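The gateway's core job, authenticating a request and forwarding it to the right service, can be sketched as a prefix lookup. The route table, the token check, and the choice to send group requests to the Message Service are assumptions for illustration, not WhatsApp's actual configuration.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical prefix -> backend mapping; routing /groups to the Message Service is an assumption.
ROUTES: Dict[str, str] = {
    "/users": "user-service",
    "/messages": "message-service",
    "/groups": "message-service",
    "/media": "media-service",
}

VALID_TOKENS: Set[str] = {"token-abc"}  # stand-in for a real authentication check

def route(path: str, auth_token: Optional[str]) -> Tuple[int, str]:
    """Return (HTTP status, target backend or error text) for an incoming request."""
    if auth_token not in VALID_TOKENS:
        return 401, "unauthorized"
    # Match whole path segments; the longest matching prefix wins, so more specific routes can be added later.
    matches = [p for p in ROUTES if path == p or path.startswith(p + "/")]
    if not matches:
        return 404, "no route"
    return 200, ROUTES[max(matches, key=len)]

print(route("/messages/send", "token-abc"))  # (200, 'message-service')
print(route("/messages/send", None))         # (401, 'unauthorized')
```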
4. User Service
The User Service handles user-related operations, managing user data including IDs, names, and status information such as last seen. It handles user authentication, profile management, and configuration settings, ensuring seamless user experiences and maintaining user data integrity across the platform.
5. User Database
User Database stores comprehensive user profiles and associated data necessary for user management within the chat application. It provides a reliable source for user-related information, supporting authentication processes, personalized interactions, and user-specific services.
6. Message Service
Message Service manages message delivery in the app, ensuring messages reach recipients promptly based on conversations or user IDs. It supports real-time communication, updates conversation details, and confirms message delivery, enhancing user engagement and maintaining message integrity.
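Here is a simplified view of the delivery decision the Message Service has to make: push the message over the recipient's open connection if they are online; otherwise queue it, ask the Notification Service to send a push notification, and flush the queue when the recipient reconnects. MessageRouter and its callback are hypothetical, and plain lists stand in for real connections and storage.

```python
from collections import defaultdict
from typing import Callable, Dict, List

class MessageRouter:
    """Hypothetical delivery logic: push to online recipients, queue for offline ones."""

    def __init__(self, notify: Callable[[str, dict], None]):
        self.connections: Dict[str, List[dict]] = {}             # user_id -> messages pushed over an open connection
        self.pending: Dict[str, List[dict]] = defaultdict(list)  # undelivered messages per offline user
        self.notify = notify                                     # stand-in for the Notification Service

    def connect(self, user_id: str) -> List[dict]:
        """Register an open connection and flush anything queued while the user was offline."""
        self.connections[user_id] = []
        backlog = self.pending.pop(user_id, [])
        for message in backlog:
            self._push(user_id, message)
        return backlog

    def disconnect(self, user_id: str) -> None:
        self.connections.pop(user_id, None)

    def send(self, message: dict) -> str:
        recipient = message["recipient_id"]
        if recipient in self.connections:
            self._push(recipient, message)
            return "delivered"
        message["status"] = "sent"
        self.pending[recipient].append(message)
        self.notify(recipient, message)
        return "queued"

    def _push(self, user_id: str, message: dict) -> None:
        # Simplification: a real system would mark the message delivered only after the client acknowledges it.
        message["status"] = "delivered"
        self.connections[user_id].append(message)
```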
7. Conversation Database
Conversation Database stores conversation IDs, names, and participants (users and groups), serving as a core data repository for the Conversation Service. It facilitates efficient retrieval of conversation details and management of group memberships and communication dynamics in the chat application.
8. User Conversation Database
User Conversation Database maintains associations between users and their respective conversations or group chats within the chat application. It stores records linking user IDs with conversation IDs, facilitating efficient retrieval of conversation details, management of group memberships, and targeted communication among users participating in specific conversations. This database supports personalized user experiences and streamlined communication dynamics within group settings.
9. Media Service
Responsible for managing media-related operations, the Media Service processes uploads, stores media files in the Object Store, and updates the Message Database with URLs linking to stored media. It ensures efficient management and retrieval of media content, supporting multimedia messaging capabilities within the chat application.
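Below is a sketch of the upload path, assuming the Media Service names each object by the SHA-256 hash of its bytes so that identical uploads are stored only once. The class name, URL layout, and content-type mapping are hypothetical, and a dictionary stands in for the Object Store. If media is encrypted on the client with per-file keys, identical files no longer produce identical hashes, so this deduplication would not apply.

```python
import hashlib
from typing import Dict

class MediaService:
    """Hypothetical upload flow: store bytes in an object store and return a URL for the message record."""

    def __init__(self, cdn_base_url: str):
        self.object_store: Dict[str, bytes] = {}  # stand-in for the Object Store
        self.cdn_base_url = cdn_base_url.rstrip("/")

    def upload(self, data: bytes, content_type: str) -> str:
        extension = {"image/jpeg": "jpg", "image/png": "png", "video/mp4": "mp4"}.get(content_type, "bin")
        # Naming objects by their SHA-256 digest means identical bytes map to the same key.
        key = f"{hashlib.sha256(data).hexdigest()}.{extension}"
        self.object_store.setdefault(key, data)
        # This URL is what would be written to the Message Database alongside the message.
        return f"{self.cdn_base_url}/{key}"

media = MediaService("https://cdn.example.com/media/")  # hypothetical CDN base URL
url = media.upload(b"...image bytes...", "image/jpeg")
print(url.endswith(".jpg"))  # True
```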
10. Media Storage
Serving as a dedicated repository, the Object Store securely stores media files uploaded by users within the chat application. It efficiently manages large volumes of media data and integrates with the Media Service for seamless upload, storage, and retrieval of media files associated with messages exchanged between users.
11. Message Database
Message Database stores message content, metadata, and media URLs, serving as a centralized repository for managing message details within the messaging application.
12. Group Database
Group Database stores metadata essential for managing groups within the chat application, including group names, creators, and settings. It also maintains associations between group IDs and conversation IDs, enabling efficient retrieval of group-specific conversations and facilitating seamless communication and membership management among group participants.
13. Notification Service
The Notification Service sends push notifications to users' devices based on events such as new messages or updates in conversation status, enhancing user engagement by keeping users informed in real-time.
Figure: High-level architecture of WhatsApp
7. Further Optimizations [2-5 mins]
1. Caching
Caching is crucial for a messaging service like WhatsApp. It enhances the service's performance by storing frequently accessed data, such as user profiles, chat histories, and media thumbnails, in memory. This allows for quick retrieval, significantly reducing the load on the primary database servers. As a result, WhatsApp can handle a large number of simultaneous users efficiently.
Moreover, caching ensures the service remains reliable even during peak usage times or temporary network issues. By keeping essential data readily available, caching helps maintain a smooth and uninterrupted user experience. Overall, caching makes WhatsApp faster, more efficient, and more dependable for millions of users worldwide.
Cache systems like Redis, Memcached, AWS ElastiCache, Microsoft Azure Cache for Redis, and Google Cloud Memorystore store data using key-value pairs efficiently. They offer a range of features and integrations, addressing diverse caching requirements from high-performance in-memory caching to managed cloud services that ensure scalability and reliability.
Data Stored in Cache
{
  "key": "string",
  "value": "string"
}
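The caches described below are typically read with a cache-aside pattern: check the cache, fall back to the database on a miss, then populate the cache with an expiry. Here is a minimal in-process Python sketch; TTLCache is a stand-in for Redis or Memcached, and load_from_db is a hypothetical database call.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

class TTLCache:
    """Minimal in-process key-value cache with per-entry expiry (a stand-in for Redis/Memcached)."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._data: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries on read
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (self.clock() + self.ttl_s, value)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

def get_user(user_id: str, cache: TTLCache, load_from_db: Callable[[str], dict]) -> dict:
    """Cache-aside read: try the cache, fall back to the database, then populate the cache."""
    user = cache.get(user_id)
    if user is None:
        user = load_from_db(user_id)
        cache.set(user_id, user)
    return user
```

On writes, a common approach is to update the database first and then delete the cache key, so the next read reloads fresh data.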
I. User Cache
User cache stores frequently accessed user details such as username, profile picture, status, and last seen timestamp to minimize database queries. It also includes contact lists and user-specific settings.
Example User Cache
[
  {
    "key": "e193c87b2fd6a181c04cc8a4b97238ce",
    "value": {
      "user_id": "e193c87b2fd6a181c04cc8a4b97238ce",
      "phone_number": "+1234567890",
      "name": "Test User",
      "profile_picture_url": "https://example.com/profiles/alice.jpg",
      "created_at": "2012-05-02 11:59:25"
    }
  },
  {
    ...
  }
]
II. Conversation Cache
Conversation cache holds metadata about conversations, including participants, the last message ID, and the last message timestamp. It also tracks the unread message count for each conversation.
Example Conversation Cache
[
  {
    "key": "b30f706b8764cdb3ef733fae29bfef81",
    "value": {
      "conversation_id": "b30f706b8764cdb3ef733fae29bfef81",
      "conversation_name": "Design Discussion",
      "participants": [
        {
          "user_id": "1234",
          "is_admin": true
        },
        {
          ...
        }
      ],
      "created_at": "2017-02-04 08:33:46"
    }
  },
  {
    ...
  }
]
III. User Conversation Cache
User conversation cache lists all conversations a user is part of, including pinned and archived conversations for quick access. It helps in rapidly loading a user's conversation history.
Example User Conversation Cache
[
  {
    "key": "7a39f348a90ca9973a9af26fdb5b0387",
    "value": {
      "conversation_id": "b30f706b8764cdb3ef733fae29bfef81",
      "user_id": "e193c87b2fd6a181c04cc8a4b97238ce"
    }
  },
  {
    ...
  }
]
IV. Message Cache
Message cache stores individual messages with details like message ID, conversation ID, sender ID, timestamp, content, and status. This enables quick retrieval and display of messages within conversations.
Example Message Cache
[
  {
    "key": "651c6a4488742276767202af02c9a3a4",
    "value": {
      "sender_id": "1234",
      "recipient_id": "3456",
      "content": "Hello There!",
      "timestamp": "2024-06-28T09:15:00Z",
      "status": "read"
    }
  },
  {
    ...
  }
]
V. Group Cache
Group Cache stores essential metadata related to groups within the chat application, such as group names, creators, and settings. It optimizes group-related operations by caching frequently accessed group data, enhancing performance in group management tasks and facilitating quick retrieval of group details.
Example Group Cache
[
  {
    "key": "group:9876",
    "value": {
      "group_name": "We Rock Designs",
      "created_by": "1234",
      "members": [
        {
          "user_id": "1234",
          "is_admin": true
        },
        {
          ...
        }
      ],
      "group_profile_picture_url": "string",
      "created_at": "2023-12-20 13:34:12",
      "status": "active"
    }
  },
  {
    ...
  }
]
2. CDN (Content Delivery Network)
The CDN optimizes media content distribution by storing and delivering images, videos, and other media files from geographically distributed servers closer to users. This geographical proximity reduces latency and accelerates content delivery, enhancing user experience by ensuring fast and reliable access to multimedia content across devices.
I. Push CDN
Push CDNs proactively distribute pre-cached content from origin servers to edge servers globally, optimizing availability and reducing latency for static content like images and videos.
II. Pull CDN
Pull CDNs fetch content from origin servers on demand, checking edge server caches first and dynamically caching new or updated content, ideal for dynamic websites needing real-time updates and flexibility in content delivery.
Figure: High-level architecture of WhatsApp after optimizations
8. Data Flow [5-8 mins]
It's important to describe clearly how data moves through the entire system. Each flow below should support the non-functional requirements identified earlier: handling a high number of users efficiently, enforcing strong data security, and scaling the infrastructure quickly as needed.
Sending a Message:
- User sends a message through their device over the internet.
- Request is routed through the Load Balancer and API Gateway.
- Message Service processes the message, updates the Conversation Database with the new message details, and stores the message content and metadata in the Message Database.
- Notification Service is triggered to notify message recipients.
Receiving Messages:
- User requests messages using a GET /message endpoint.
- The Message Service looks up the user's conversations in the User Conversation Database, retrieves message content, metadata, and media URLs from the Message Database, fetches user details for group messages, and updates the Conversation Database with any new message details.
Media Handling:
- User selects an image to send through their device.
- Image upload request is routed through the API Gateway to the Media Service.
- Media Service processes the media and stores it in the Media Storage (Object Store) and updates the Message Database with the media URL.
- Recipients receive the media URL with the message, and the CDN efficiently delivers the image itself to them for viewing.
Last Seen Update:
- User updates their last seen status, indicating when they were last active.
- Request is sent through the API Gateway to the User Service.
- User Service updates the User Database with the new last seen timestamp (/POST updateLastSeen).
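One simple way to derive both online status and last seen from the same data, assuming clients send periodic heartbeats (for example over their WebSocket connection): store the latest heartbeat time and treat a user as online if it falls within a short window. PresenceTracker and the 30-second window are illustrative assumptions, and a dictionary stands in for the last-seen field in the User Database.

```python
import time
from typing import Callable, Dict, Optional

class PresenceTracker:
    """Hypothetical presence logic: heartbeats update last seen; a user is 'online' if seen recently."""

    def __init__(self, online_window_s: float = 30.0, clock: Callable[[], float] = time.time):
        self.online_window_s = online_window_s
        self.clock = clock
        self.last_seen: Dict[str, float] = {}  # stand-in for the last-seen field in the User Database

    def heartbeat(self, user_id: str) -> None:
        self.last_seen[user_id] = self.clock()

    def status(self, user_id: str) -> str:
        seen: Optional[float] = self.last_seen.get(user_id)
        if seen is None:
            return "unknown"
        if self.clock() - seen <= self.online_window_s:
            return "online"
        return f"last seen at {seen:.0f}"
```

Keeping this value in a cache and writing it to the User Database less frequently would reduce write load, at the cost of a slightly stale last-seen time.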
Don't forget to explain the end to end flow of your design!
This architecture is designed to be scalable, resilient, and efficient, ensuring that the platform can handle a high volume of user interactions and data processing with minimal latency and high availability.