System Design
Knowledge about system design.
Steps
- Requirements engineering
- Functional requirements
- Non-functional requirements
- Capacity estimation
- Requests (Write/Read, RPS)
- Bandwidth: (KB/sec)
- Storage size: (TB)
- Data model
- entities, attribute, relation
- API design
- functions, arguments, response
- System design
- System components
- Client application
- Service: provides a clear function
- Multiple instances
- Load Balancer
- Relational database (SQL)
- Federation (functional partitioning)
- Message queue
- File storage
- Cache
- Techniques
- Pull/push
- Chunks
- System components
- Design discussion
Requirements engineering
Initial analysis
- Read- or write- heavy?
- Single server or distributed service?
- Availability or data consistency?
- What is the actual scale of the system
Functional requirements
What the system is supposed to do? To be specific, from user’s perspective. Define the
- Core features
- Support features
Non-functional requirements
How the system is expected to work? Following are common considered non-functional requirements
- Availability: how long the system is up and running per year
- e.g. 99.999% means down for maximum 5.15 mins a year
- need to make it reliability, redundancy, fault tolerance
- Data consistency (lineralizability): it appears to be the same in all corresponding nodes at the same time
- Scalability: an ability of a system to handle a fast growing user base and temporary load spikes seamlessly
~ Vertical scalability Horizontal scalability Meaning add more hardware resources to single machine scale by adding more servers to the cluster Advantages - Fast inter-process communication
- Data consistency- Higher availability
- Scales linearlyDisadvantages - Single point of failure
- Economical limit of scalability- Data inconsistency
- Complex architecture - Latency: time to wait for data requested and delivered
- experienced latency = network latency + system latency
- To reduce the latency
- High query performance of database and data model
- Using caching
- Compartibility: the ability of a system to operate seamlessly with other software, hardware and system
Capacity estimation
Given the numbers of
- Daily Active Users (DAU) or Active users per month (MAU)
- Key metric to describe scale
- Peak Active Users
- Triggered by?
- Leads to x times of DAU?
- Interactions per users
- How many times each user interact with it
- Read-write ration for core features
- Request size
- Size of payload of these requests
- Replication factor
- Database is replicate data to multiple nodes
- Increase read scalability
- Creates redundancy
- Takes up more storage capacity
We could calculate the metrics for
- Throughput, RPS
- Bandwidth, Kbps, Mbps
- Usually: 80 Kbps for VoIP calling, 150 Kbps for screen sharing, 0.5 Mbps for live streaming, 3 Mbps for 720p video, 25 Mbps for 4k HD video
- Storage: GB, TB
Common used rough numbers
- Seconds per day: 10^5
- Requests per day -> requests per second: /10^5
- Days per year: 4 * 10^2
- Storage per second -> storage in 5 years: *(2 * 10^3)
- Thousands, KB: 10^3
- Million, MB: 10^6
- Billion, GB: 10^9
- Trillion, TB: 10^12
- Quadrillion, PB: 10^15
- Average images: 200 KB
- Standard videos for streaming: 50 MB per min
Data model
- Entities: each entity needs a separate table
- Attributes: what attributes each entity requires
- Connections: 1-to-1, 1-to-many
e.g.
- User
- UserId
- ResourceId
- Resources
- Key
- Resource content
- Due date
- Pre-defined assets
- Key range
- Whether in Use
Choose database based on data model built. Consider the following
- ACID needed?
- Scalability is a concern?
- Data query needed?
API Design
Create a binding contract between clients and servers.
- Function signature from requirements
- Arguments
- Response body
System components
Database
~ | Relationship database | Non-relationship database |
---|---|---|
Support operations | create, read, update, delete (CRUD) | same |
Features | - Support complex query - Transactions simplify error handling - Block until all nodes committed |
Store key-value pairs |
Retrieving methods | SQL query | retrieve by keys: starting with a certain letter, within a range of numbers, less than or greater than a certain number, within a certain period of time if the key is a timestamp |
Principles | ACID - Atomicity: translation is the smallest behavior - Consistency: only allow specific format of data - Isolation: multi-users can read the same record - Durability: once a transaction is committed to the database, it will preserve there |
BASE - Basically available: it’s always possible to read and write data, even though might not consistent - Soft state: data state could change without interactions with the application - Eventual consistency: data will eventually become consistent once input stops. Allow for high availability and scalability |
Emphasis on | Consistency | Scalability |
Using scenarios | - Well-structured data - Requires lots of complex querying - Cannot tolerant any data inconsistency |
- A very large amount of data that easily fit into a tabular form - Requires high availability and performance, can tolerant temp inconsistency - In-memory user cases (build caching layer, extreme access speed) - Handle lots of small continuous reads and writes (shopping carts, user preference of specific users) |
Disadvantages | requires more effort to be scalable | - temporary inconsistency data across node - No standard query language (query needs to be implemented in application layer) - No data normalization (leads to duplicate data) |
Examples | oracle MySQL | LevelDB (Google), RocksDB & ZippyDB (Meta), DynamoDB (AWS) |
Message queue
Message broker could provide async operations, which could benefit
- Decouple internal services
- Avoid single point of failures
- Reduces latency since services not blocking
- Very useful for long-running jobs
Messaging ways
- Point-to-point messaging
- only send 1 message single time
- could handle if the receiver’s lost
- Publish/subscribe messaging
- Distribute to all listeners
Examples
- RabbitMQ: messages not send to queues directly but published to exchanges
- Kafka: handles massive loads for a long period of time
- Redis: in-memory data store
- Azure Service Bus
- AmazonMQ: Amazon managed RabbitMQ
- Cloud Tasks (Google)
File storage
- File storage
- Centralized, high accessible location for files. Lower cost
- Windows, Linux, MacOS OS
- Object storage
- Unstructured data, scalability. Provides context about the data. Can only update the whole object
- AWS S3, Azure Blob, Google storage
- Block storage
- Store chunks of data, high structured, low data transfer overheads, low latency
- Amazon EBS, Azure Disk Storage, Google Cloud Persistent Disk
Cache & CDN
- Hard disk -> (15x faster) SSD -> (300x faster) RAM
- Cache keep often used data into RAM
- Methods to keep cache sync
- Set a expiration time
- Invalidation service
- Use cases
- handle many small continuous reads and writes
- Need to temporarily storing basic information
- Don’t require frequent data updates
- Examples
- Redis
- Memcache: in-memory key-value store for small data chunkcs
- Hazlecast: distributed in-memory object store
Content delivery network (CDN)
- CDN server holds a cache closing to the clients after the first time query
- To cache new modified data
- Don’t store time sensitive data
- Set a maximum duration in case the browser caches
- Cache busting. Make all cached data invalid and cache them again
Search engine database
- Search engines are NoSQL databases that excel at providing relevant results evn with typos
- Indexing allows for faster search performance
- Components
- Document: represents a specific entity, containing data to return to clients
- Inverted Index: split each document into individual search terms and makes each word point to the actual documents
- Ranking algorithm: produces a ranked list of documents
- Examples
- Lucene
- Elastic Search
- Atlas Search (MongoDB)
- Redis Search
Some techniques
- Base64 encoding: allow to transmit characters. Increase 2 bytes to 3 bytes.
- MD5 hash: map a string to fixed-size (128) string
- Websocket: real-time communication between 2 clients
- Chat apps
- Live-collaboration apps
- Social feeds
- Tracking apps
- IoT solutions
- Polling: request -> response -> wait to request again
- Updates might get delayed by the time period between pools
- Long polling: request -> waiting until time out -> request immediately
- Updates are sent to clients immediately
- Push: server send events
- Twitter updates
- Updating statues
- Push notifications
- Newsletters
- Sports results
- Disadvantages
- Unidirectional communication
- Server don’t know when connection is lost
- Chunk: break down a whole file into several chunks
- Easy for transmit
- Reduce the risk of network failures
- Streaming service pipeline
- File chunker
- Content filtering
- Transcoder: transcode into target format
- Quality conversion: choose resolution
- TCP: for reliable data delivery. Suitable for media streaming
- UDP: for fast data delivery. Suitable for online gaming, video chat