System Design

Knowledge about system design.

Steps

  1. Requirements engineering
    • Functional requirements
    • Non-functional requirements
  2. Capacity estimation
    • Requests (Write/Read, RPS)
    • Bandwidth: (KB/sec)
    • Storage size: (TB)
  3. Data model
    • entities, attribute, relation
  4. API design
    • functions, arguments, response
  5. System design
    • System components
      • Client application
      • Service: provides a clear function
        • Multiple instances
        • Load Balancer
      • Relational database (SQL)
        • Federation (functional partitioning)
      • Message queue
      • File storage
      • Cache
    • Techniques
      • Pull/push
      • Chunks
  6. Design discussion

Requirements engineering

Initial analysis

  1. Read- or write- heavy?
  2. Single server or distributed service?
  3. Availability or data consistency?
  4. What is the actual scale of the system

Functional requirements

What the system is supposed to do? To be specific, from user’s perspective. Define the

  • Core features
  • Support features

Non-functional requirements

How the system is expected to work? Following are common considered non-functional requirements

  • Availability: how long the system is up and running per year
    • e.g. 99.999% means down for maximum 5.15 mins a year
    • need to make it reliability, redundancy, fault tolerance
  • Data consistency (lineralizability): it appears to be the same in all corresponding nodes at the same time
  • Scalability: an ability of a system to handle a fast growing user base and temporary load spikes seamlessly
    ~ Vertical scalability Horizontal scalability
    Meaning add more hardware resources to single machine scale by adding more servers to the cluster
    Advantages - Fast inter-process communication
    - Data consistency
    - Higher availability
    - Scales linearly
    Disadvantages - Single point of failure
    - Economical limit of scalability
    - Data inconsistency
    - Complex architecture
  • Latency: time to wait for data requested and delivered
    • experienced latency = network latency + system latency
    • To reduce the latency
      • High query performance of database and data model
      • Using caching
  • Compartibility: the ability of a system to operate seamlessly with other software, hardware and system

Capacity estimation

Given the numbers of

  • Daily Active Users (DAU) or Active users per month (MAU)
    • Key metric to describe scale
  • Peak Active Users
    • Triggered by?
    • Leads to x times of DAU?
  • Interactions per users
    • How many times each user interact with it
    • Read-write ration for core features
  • Request size
    • Size of payload of these requests
  • Replication factor
    • Database is replicate data to multiple nodes
    • Increase read scalability
    • Creates redundancy
    • Takes up more storage capacity

We could calculate the metrics for

  • Throughput, RPS
  • Bandwidth, Kbps, Mbps
    • Usually: 80 Kbps for VoIP calling, 150 Kbps for screen sharing, 0.5 Mbps for live streaming, 3 Mbps for 720p video, 25 Mbps for 4k HD video
  • Storage: GB, TB

Common used rough numbers

  • Seconds per day: 10^5
    • Requests per day -> requests per second: /10^5
  • Days per year: 4 * 10^2
    • Storage per second -> storage in 5 years: *(2 * 10^3)
  • Thousands, KB: 10^3
  • Million, MB: 10^6
  • Billion, GB: 10^9
  • Trillion, TB: 10^12
  • Quadrillion, PB: 10^15
  • Average images: 200 KB
  • Standard videos for streaming: 50 MB per min

Data model

  • Entities: each entity needs a separate table
  • Attributes: what attributes each entity requires
  • Connections: 1-to-1, 1-to-many

e.g.

  • User
    • UserId
    • ResourceId
  • Resources
    • Key
    • Resource content
    • Due date
  • Pre-defined assets
    • Key range
    • Whether in Use

Choose database based on data model built. Consider the following

  • ACID needed?
  • Scalability is a concern?
  • Data query needed?

API Design

Create a binding contract between clients and servers.

  • Function signature from requirements
  • Arguments
  • Response body

System components

Database

~ Relationship database Non-relationship database
Support operations create, read, update, delete (CRUD) same
Features - Support complex query
- Transactions simplify error handling
- Block until all nodes committed
Store key-value pairs
Retrieving methods SQL query retrieve by keys: starting with a certain letter, within a range of numbers, less than or greater than a certain number, within a certain period of time if the key is a timestamp
Principles ACID
- Atomicity: translation is the smallest behavior
- Consistency: only allow specific format of data
- Isolation: multi-users can read the same record
- Durability: once a transaction is committed to the database, it will preserve there
BASE
- Basically available: it’s always possible to read and write data, even though might not consistent
- Soft state: data state could change without interactions with the application
- Eventual consistency: data will eventually become consistent once input stops. Allow for high availability and scalability
Emphasis on Consistency Scalability
Using scenarios - Well-structured data
- Requires lots of complex querying
- Cannot tolerant any data inconsistency
- A very large amount of data that easily fit into a tabular form
- Requires high availability and performance, can tolerant temp inconsistency
- In-memory user cases (build caching layer, extreme access speed)
- Handle lots of small continuous reads and writes (shopping carts, user preference of specific users)
Disadvantages requires more effort to be scalable - temporary inconsistency data across node
- No standard query language (query needs to be implemented in application layer)
- No data normalization (leads to duplicate data)
Examples oracle MySQL LevelDB (Google), RocksDB & ZippyDB (Meta), DynamoDB (AWS)

Message queue

Message broker could provide async operations, which could benefit

  • Decouple internal services
  • Avoid single point of failures
  • Reduces latency since services not blocking
  • Very useful for long-running jobs

Messaging ways

  • Point-to-point messaging
    • only send 1 message single time
    • could handle if the receiver’s lost
  • Publish/subscribe messaging
    • Distribute to all listeners

Examples

  • RabbitMQ: messages not send to queues directly but published to exchanges
  • Kafka: handles massive loads for a long period of time
  • Redis: in-memory data store
  • Azure Service Bus
  • AmazonMQ: Amazon managed RabbitMQ
  • Cloud Tasks (Google)

File storage

  • File storage
    • Centralized, high accessible location for files. Lower cost
    • Windows, Linux, MacOS OS
  • Object storage
    • Unstructured data, scalability. Provides context about the data. Can only update the whole object
    • AWS S3, Azure Blob, Google storage
  • Block storage
    • Store chunks of data, high structured, low data transfer overheads, low latency
    • Amazon EBS, Azure Disk Storage, Google Cloud Persistent Disk

Cache & CDN

  • Hard disk -> (15x faster) SSD -> (300x faster) RAM
  • Cache keep often used data into RAM
  • Methods to keep cache sync
    • Set a expiration time
    • Invalidation service
  • Use cases
    • handle many small continuous reads and writes
    • Need to temporarily storing basic information
    • Don’t require frequent data updates
  • Examples
    • Redis
    • Memcache: in-memory key-value store for small data chunkcs
    • Hazlecast: distributed in-memory object store

Content delivery network (CDN)

  • CDN server holds a cache closing to the clients after the first time query
  • To cache new modified data
    • Don’t store time sensitive data
    • Set a maximum duration in case the browser caches
    • Cache busting. Make all cached data invalid and cache them again

Search engine database

  • Search engines are NoSQL databases that excel at providing relevant results evn with typos
  • Indexing allows for faster search performance
  • Components
    • Document: represents a specific entity, containing data to return to clients
    • Inverted Index: split each document into individual search terms and makes each word point to the actual documents
    • Ranking algorithm: produces a ranked list of documents
  • Examples
    • Lucene
    • Elastic Search
    • Atlas Search (MongoDB)
    • Redis Search

Some techniques

  • Base64 encoding: allow to transmit characters. Increase 2 bytes to 3 bytes.
  • MD5 hash: map a string to fixed-size (128) string
  • Websocket: real-time communication between 2 clients
    • Chat apps
    • Live-collaboration apps
    • Social feeds
    • Tracking apps
    • IoT solutions
  • Polling: request -> response -> wait to request again
    • Updates might get delayed by the time period between pools
  • Long polling: request -> waiting until time out -> request immediately
    • Updates are sent to clients immediately
  • Push: server send events
    • Twitter updates
    • Updating statues
    • Push notifications
    • Newsletters
    • Sports results
    • Disadvantages
      • Unidirectional communication
      • Server don’t know when connection is lost
  • Chunk: break down a whole file into several chunks
    • Easy for transmit
    • Reduce the risk of network failures
  • Streaming service pipeline
    • File chunker
    • Content filtering
    • Transcoder: transcode into target format
    • Quality conversion: choose resolution
  • TCP: for reliable data delivery. Suitable for media streaming
  • UDP: for fast data delivery. Suitable for online gaming, video chat

Reference