IoT sensors sending frequent small bursts of data.
06 - DNS, Dynamic DNS, and the DNS Lookup Process
What is DNS?
DNS (Domain Name System) is the internet’s phonebook. It translates human-readable domain names (like openai.com) into machine-readable IP addresses (like 104.18.12.123), enabling devices to locate and communicate with each other.
Key Components of DNS
Domain Name: A readable address like example.com.
IP Address: The actual network address (IPv4 or IPv6).
DNS Resolver: A server (typically run by your ISP, or a public one such as Google's 8.8.8.8 or Cloudflare's 1.1.1.1) that performs DNS queries on behalf of clients.
Root DNS Servers: Top of the DNS hierarchy. Direct queries to TLD servers.
TLD Servers: Handle top-level domains like .com, .org, .net.
Authoritative DNS Servers: Contain the actual IP address for a domain.
What is Dynamic DNS (DDNS)?
Dynamic DNS automatically updates DNS records whenever the device's IP changes (especially useful for devices on dynamic IPs, like most home internet connections).
Use Cases:
Hosting a server at home (e.g., security camera or web server).
IoT devices.
Remote access when IP changes frequently.
How It Works:
A DDNS service monitors your IP.
When it changes, it sends an update to the DNS server.
Your domain now points to the new IP without manual changes.
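The update loop can be sketched in a few lines. The public-IP lookup service is a real one, but the provider update URL, hostname, and token below are hypothetical placeholders for whatever your DDNS service documents.

```javascript
// Decide whether an update is needed (pure, so it is easy to test).
function needsUpdate(currentIp, lastKnownIp) {
  return currentIp !== null && currentIp !== lastKnownIp;
}

// Sketch of the DDNS client loop. The provider URL, hostname, and token
// are hypothetical stand-ins, not a real DDNS API.
async function checkAndUpdate(state) {
  const res = await fetch("https://api.ipify.org?format=json"); // public IP lookup
  const { ip } = await res.json();
  if (needsUpdate(ip, state.lastKnownIp)) {
    await fetch(
      `https://ddns.example.com/update?hostname=home.example.com&ip=${ip}`,
      { headers: { Authorization: "Bearer <token>" } }
    );
    state.lastKnownIp = ip; // remember the IP we last pushed
  }
}

// Re-check every 5 minutes:
// setInterval(() => checkAndUpdate(state), 5 * 60 * 1000);
```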
DNS Lookup Process (Step-by-Step)
Let’s say a user visits www.example.com. Here’s what happens:
1. Browser Cache
Browser checks if it has recently resolved www.example.com.
If yes, it uses that cached IP.
2. Operating System Cache
If not found in the browser, the OS checks its DNS cache.
3. DNS Resolver (Usually ISP)
If not found locally, the query is sent to the configured DNS resolver (e.g., Google’s 8.8.8.8).
4. Root Server Query
Resolver asks the root server for .com TLD info.
5. TLD Server Query
Root server responds with TLD server for .com.
Resolver queries this TLD server.
6. Authoritative Server Query
TLD responds with the authoritative name server for example.com.
Resolver queries this server.
7. IP Address Returned
Authoritative server responds with the IP address of www.example.com.
8. Client Connects to Server
The resolver sends the IP to the client.
The browser makes an HTTP/S request to that IP.
9. Caching for Speed
All intermediate steps are cached by the resolver and the client for future speed.
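The chain of queries above can be simulated with a toy in-memory hierarchy. All server names and zone data here are invented for illustration; a real resolver speaks the DNS wire protocol over UDP/TCP.

```javascript
// Toy model of iterative resolution: root → TLD → authoritative, with caching.
const rootZone = { "com.": "tld-com-server" };
const tldZones = { "tld-com-server": { "example.com.": "ns1.example.com" } };
const authZones = { "ns1.example.com": { "www.example.com.": "93.184.216.34" } };
const resolverCache = new Map();

function resolve(name, tld, domain) {
  if (resolverCache.has(name)) return resolverCache.get(name); // cached answer
  const tldServer = rootZone[tld];                 // ask root: who handles .com?
  const authServer = tldZones[tldServer][domain];  // ask TLD: who is authoritative?
  const ip = authZones[authServer][name];          // ask authoritative: what's the IP?
  resolverCache.set(name, ip);                     // cache for future lookups
  return ip;
}
```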
Example Timeline:
User types www.example.com
↓
Local browser/OS cache → miss
↓
Query to ISP DNS resolver (or 8.8.8.8)
↓
Root server ("where's .com?")
↓
TLD server ("where's example.com?")
↓
Authoritative server ("what's www.example.com?")
↓
IP returned (e.g., 93.184.216.34)
↓
Browser loads the website
07 - HTTP, RPC, Methods, Status Codes, SSL/TLS, HTTPS
1. What is HTTP?
HTTP (HyperText Transfer Protocol) is the foundational protocol for data communication on the web. It defines how clients (usually browsers) communicate with servers.
Stateless: Each request is independent.
Text-based: HTTP/1.x uses plain text for requests and responses (HTTP/2 uses a binary framing layer).
Request/Response Model: The client sends a request, and the server responds.
2. What is RPC?
RPC (Remote Procedure Call) is a protocol that allows a program to execute a function or procedure on a remote server as if it were local.
RPC abstracts the network layer, making remote calls look like local function calls.
3. Common HTTP Methods
GET: Retrieve a resource.
POST: Create a resource or submit data.
PUT: Replace a resource.
PATCH: Partially update a resource.
DELETE: Remove a resource.
4. HTTP Status Codes
3xx: Redirection
301 Moved Permanently: Resource moved to a new URL.
302 Found: Temporary redirect.
304 Not Modified: Use cached version.
4xx: Client Errors
400 Bad Request: Invalid input.
401 Unauthorized: Auth required.
403 Forbidden: Authenticated but not allowed.
404 Not Found: Resource not found.
5xx: Server Errors
500 Internal Server Error: Something broke.
502 Bad Gateway: Invalid response from upstream.
503 Service Unavailable: Server is overloaded or down.
5. SSL/TLS and HTTPS
What is SSL/TLS?
SSL (Secure Sockets Layer) and its successor TLS (Transport Layer Security) encrypt communication between client and server.
Protects against:
Eavesdropping
Data tampering
Man-in-the-middle (MITM) attacks
How SSL/TLS Works (Simplified):
Handshake: Client and server agree on encryption methods and exchange keys.
Authentication: Server provides a digital certificate (usually issued by a trusted CA).
Session Keys: Both sides derive session keys for encryption.
Encrypted Communication: All subsequent data is encrypted using these keys.
6. What is HTTPS?
HTTPS = HTTP + TLS
Ensures encrypted and secure communication over the web.
Common on websites handling sensitive data (login, payments).
08 - WebSocket and Polling
Overview
Real-time communication is a key requirement in many modern applications, such as chat apps, multiplayer games, live updates, and collaborative tools. Two primary techniques for enabling this are WebSockets and Polling.
1. Polling
What is Polling?
Polling is a technique where the client repeatedly asks the server for new data at regular intervals.
Example:
Client sends a request every few seconds.
Server responds with the latest data (even if there's no update).
Types of Polling
a. Short Polling
Client sends requests at fixed intervals.
Easy to implement but inefficient.
Can lead to server overload and wasted bandwidth.
setInterval(() => {
  fetch("/new-messages")
    .then((res) => res.json())
    .then((data) => console.log(data));
}, 3000); // every 3 seconds
b. Long Polling
Client sends request and waits.
Server holds the request open until new data is available.
Once data is sent, the client immediately sends a new request.
More efficient than short polling.
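A minimal long-polling loop might look like this. The /updates endpoint is an assumed name for a server route that holds the connection open until data is available; fetchFn is injected so the loop can be exercised without a real server.

```javascript
// Long polling: each response is immediately followed by a new request.
// fetchFn stands in for fetch; maxRounds bounds the loop for demonstration.
async function longPoll(fetchFn, onData, maxRounds = Infinity) {
  for (let i = 0; i < maxRounds; i++) {
    const res = await fetchFn("/updates"); // server holds this open until new data
    onData(await res.json());              // deliver the update, then loop again
  }
}
```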
Pros
Simple to implement.
Works over standard HTTP/HTTPS.
Cons
Higher latency than WebSocket.
More overhead due to repeated HTTP connections.
Not truly real-time.
2. WebSocket
What is WebSocket?
WebSocket is a protocol that enables full-duplex (two-way) communication between client and server over a single, long-lived connection.
Starts as HTTP → Upgrades to WebSocket via Upgrade header.
Uses ws:// or wss:// (secure).
Ideal for real-time data transfer.
WebSocket Lifecycle
Handshake: Client sends HTTP request with Upgrade: websocket.
Upgrade: Server responds with 101 Switching Protocols.
Open Connection: Persistent TCP connection is established.
Data Exchange: Both client and server can send data anytime.
Close Connection: Either party can close the connection.
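In the browser, the lifecycle above is driven by the WebSocket API; the endpoint below is hypothetical, and the message handler is factored out so it can be tested without a live connection.

```javascript
// Route an incoming message by type; both sides can push at any time.
function handleMessage(raw) {
  const msg = JSON.parse(raw);
  if (msg.type === "ping") return "pong";            // server keep-alive
  if (msg.type === "chat") return `${msg.from}: ${msg.text}`;
  return null;                                        // ignore unknown types
}

// Sketch of a client; the wss:// URL is an invented example endpoint.
function connect(url = "wss://example.com/chat") {
  const socket = new WebSocket(url); // HTTP handshake + 101 upgrade happen here
  socket.addEventListener("open", () =>
    socket.send(JSON.stringify({ type: "chat", from: "me", text: "hello" }))
  );
  socket.addEventListener("message", (e) => console.log(handleMessage(e.data)));
  socket.addEventListener("close", () => console.log("connection closed"));
  return socket;
}
```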
Pros
Real-time, bidirectional communication with very low latency.
Less network overhead after connection is established.
Cons
Requires more complex server-side setup.
Not ideal for simple applications or infrequent data changes.
3. Polling vs WebSocket
| Feature | Polling | WebSocket |
|---|---|---|
| Protocol | HTTP | TCP (after HTTP upgrade) |
| Direction | Client → Server | Bidirectional |
| Latency | Higher | Very low |
| Overhead | High (many HTTP requests) | Low (single persistent connection) |
| Complexity | Simple | Moderate to complex |
| Use Case | Simple updates | Real-time apps |
09 - API Paradigms
Overview
APIs (Application Programming Interfaces) define how different software components communicate. Choosing the right API paradigm impacts performance, scalability, and developer experience. This lesson covers major API paradigms:
REST
GraphQL
gRPC
Others (WebSockets, RPC)
1. REST (Representational State Transfer)
Characteristics
Stateless client-server architecture.
Uses HTTP methods: GET, POST, PUT, DELETE.
Resource-oriented (URLs represent resources).
Sessions & Cookies
Stateless by default; sessions are maintained using cookies or tokens (e.g., JWT).
Cookies store session ID on client, sent with every request.
Server validates session ID to associate with a user.
Example
GET /users/123
Returns user data for user with ID 123.
Pros
Simple and widely adopted.
Good for CRUD-based applications.
Caches well.
Cons
Overfetching or underfetching data.
Multiple round trips for complex queries.
2. GraphQL
Characteristics
Declarative data fetching query language.
One endpoint (usually /graphql) for all interactions.
Query and Mutation
Query: Fetch data.
Mutation: Modify data (create/update/delete).
Overfetching & Underfetching
Solves overfetching by allowing clients to specify exactly what they need.
Also prevents underfetching that leads to multiple requests.
Example Query
{
  user(id: "123") {
    name
    email
  }
}
Pros
Reduces number of requests.
Fine-grained data fetching.
Strong typing (schema-driven).
Cons
Complex caching.
Harder to debug and monitor.
Learning curve for newcomers.
3. gRPC
Characteristics
High-performance RPC framework developed by Google.
Uses HTTP/2 and Protocol Buffers (Protobuf) for compact communication.
Protocol Buffers
Efficient binary serialization format.
Smaller payload size compared to JSON.
Streaming Support
Supports:
Unary (single request/response)
Server-side streaming
Client-side streaming
Bi-directional streaming
Pros
Great for internal microservice communication.
Strong typing and fast performance.
Built-in code generation for multiple languages.
Cons
Not human-readable (binary format).
Less browser support (typically used in backend systems).
4. Other Paradigms
WebSockets
Full-duplex communication over a single TCP connection.
Suitable for real-time apps (chat, notifications).
Traditional RPC (Remote Procedure Call)
Function-call semantics over the network.
gRPC is a modern take on RPC.
10 - API Design
Introduction to API Design
API (Application Programming Interface) design is the process of developing APIs that effectively expose data and application functionality for consumption by developers and applications. Good API design is crucial for developer experience, system scalability, and long-term maintenance.
Core API Design Paradigms
REST (Representational State Transfer)
REST is an architectural style for distributed systems, particularly web services. RESTful APIs use HTTP methods explicitly and have the following characteristics:
Stateless: Server doesn't store client state between requests
Resource-based: Everything is a resource identified by URIs
Representation-oriented: Resources can have multiple representations (JSON, XML, etc.)
Uniform interface: Consistent approach to resource manipulation
Client-server architecture: Separation of concerns between client and server
Layered system: Components cannot "see" beyond their immediate layer
REST Maturity Levels (Richardson Maturity Model)
Level 0: Single URI, single HTTP method (typically POST)
Level 1: Multiple URIs for different resources, but still a single HTTP method
Level 2: HTTP verbs used as intended (GET, POST, PUT, DELETE)
Level 3: HATEOAS (Hypermedia as the Engine of Application State) - APIs provide hyperlinks to guide clients
RPC (Remote Procedure Call)
RPC focuses on actions and functions rather than resources:
Typically uses POST for most operations
Emphasizes operations over resources
Examples include gRPC, XML-RPC, and JSON-RPC
Often preferred for internal service-to-service communication
gRPC
Uses Protocol Buffers for serialization
Supports bidirectional streaming
Provides auto-generated client libraries
GraphQL
A query language and runtime for APIs:
Client specifies exactly what data it needs
Single endpoint for all operations
Strongly typed schema
Reduces over-fetching and under-fetching of data
Supports queries (read), mutations (write), and subscriptions (real-time updates)
CRUD Operations
CRUD represents the four basic operations for persistent storage:
| Operation | Description | HTTP Method (REST) | SQL |
|---|---|---|---|
| Create | Add new resources | POST | INSERT |
| Read | Retrieve resources | GET | SELECT |
| Update | Modify existing resources | PUT/PATCH | UPDATE |
| Delete | Remove resources | DELETE | DELETE |
REST CRUD Mapping
POST /users # Create a user
GET /users # Read all users
GET /users/{id} # Read specific user
PUT /users/{id} # Update user (full replacement)
PATCH /users/{id} # Update user (partial modification)
DELETE /users/{id} # Delete user
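The mapping above can be sketched as in-memory handlers; the store and function names are invented, and a real service would put these behind a router with proper validation.

```javascript
// In-memory CRUD handlers mirroring the REST routes above.
const users = new Map();
let nextId = 1;

function createUser(data) {                         // POST /users
  const id = String(nextId++);
  users.set(id, { id, ...data });
  return users.get(id);
}
function getUser(id) {                              // GET /users/{id}
  return users.get(id) ?? null;
}
function replaceUser(id, data) {                    // PUT /users/{id}
  if (!users.has(id)) return null;
  users.set(id, { id, ...data });                   // full replacement
  return users.get(id);
}
function patchUser(id, partial) {                   // PATCH /users/{id}
  const existing = users.get(id);
  if (!existing) return null;
  users.set(id, { ...existing, ...partial });       // partial modification
  return users.get(id);
}
function deleteUser(id) {                           // DELETE /users/{id}
  return users.delete(id);
}
```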
URL Design Best Practices
Resource Naming
Use nouns, not verbs (e.g., /products not /getProducts)
Use plural nouns for collections (/users not /user)
Use concrete names over abstract concepts
Use lowercase letters and hyphens for multi-word resources
Hierarchy and Relationships
Express parent-child relationships in the URL path:
/departments/{id}/employees
/users/{id}/posts
Consider the depth of nesting (avoid too many levels)
Query Parameters vs Path Parameters
Path parameters: Identify specific resources
/users/123
Query parameters: Filter, sort, paginate, or control response format (e.g., /users?role=admin&sort=name&page=2)
API Design Tools
Swagger/OpenAPI Editor: For designing and documenting APIs
Postman: API testing and documentation
Insomnia: REST and GraphQL client
Apiary: API design, prototyping, and documentation
11 - Caching
Introduction to Caching
Caching is a technique that stores copies of frequently accessed data in a high-speed storage layer (the cache), allowing future requests for that data to be served faster. Caching plays a crucial role in improving application performance, reducing latency, decreasing network traffic, and minimizing load on backend servers.
Why Use Caching?
Improved Performance: Reduced data retrieval time
Enhanced Throughput: Handle more requests with the same infrastructure
Reduced Bandwidth Usage: Less data transferred over networks
Decreased Server Load: Fewer requests to origin servers
Higher Availability: Services remain available even when backend experiences issues
Cost Savings: Lower compute and network resource requirements
Key Caching Metrics
Throughput
Throughput in caching refers to the number of requests a cache can process per unit of time (typically measured in requests per second).
Factors affecting throughput:
Cache hit ratio
Cache access latency
Network bandwidth and latency
Backend service performance
Cache size and eviction policy
Hardware resources (CPU, memory, network)
Optimizing for throughput:
Increase cache memory allocation
Use efficient data structures
Implement appropriate eviction policies
Distribute cache load across multiple nodes
Minimize network round trips
Pre-warm cache with frequently accessed data
Measuring throughput:
Throughput = (Total Requests × Hit Ratio) / Time Period
Higher hit ratio = higher effective throughput
Monitor throughput during peak loads to identify bottlenecks
Cache Hit Ratio
The percentage of requests that are served from the cache:
Hit Ratio = (Cache Hits / Total Requests) × 100%
Good hit ratios typically range from 80-95% depending on the application
Low hit ratio may indicate:
Cache size too small
TTL values too short
Suboptimal eviction policy
Cacheable content not being cached
Highly random access patterns
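A cache wrapper can track its own hit ratio with two counters; the class and method names below are invented for illustration.

```javascript
// Minimal cache wrapper that tracks its hit ratio as a percentage.
class CountingCache {
  constructor() {
    this.store = new Map();
    this.hits = 0;
    this.misses = 0;
  }
  get(key) {
    if (this.store.has(key)) {
      this.hits++;
      return this.store.get(key);
    }
    this.misses++;
    return undefined;
  }
  put(key, value) {
    this.store.set(key, value);
  }
  hitRatio() {
    const total = this.hits + this.misses;
    // Hit Ratio = (Cache Hits / Total Requests) × 100%
    return total === 0 ? 0 : (this.hits / total) * 100;
  }
}
```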
Types of Caches
Client-Side Caching
Browser Cache: Stores web assets (HTML, CSS, JS, images) locally in the user's browser
Application Cache: Data stored within client-side applications (mobile apps, SPAs)
Service Workers: Enable offline functionality for web applications
Server-Side Caching
Application/In-Memory Cache: Data stored in the application's memory space (e.g., using Redis, Memcached)
Page Cache: Fully rendered HTML pages cached on the server
Fragment Cache: Portions of pages or components cached separately
Database Query Cache: Results of database queries stored for reuse
Write-Through
Application writes to the cache, and the cache synchronously writes the same data to the database
Pros:
Cache and database stay consistent; reads after writes see fresh data
Cons:
Write operations have higher latency (must update both DB and cache)
Cache may contain unread data, wasting resources
Additional write operation even if data isn't read later
Write-Behind (Write-Back)
Application updates the cache
Cache asynchronously updates the database
function saveData(key, value) {
cache.put(key, value)
cacheQueue.queueForPersistence(key, value)
return success
}
// Separate process
function persistCacheUpdates() {
while (item = cacheQueue.dequeue()) {
database.put(item.key, item.value)
}
}
Pros:
Reduced write latency
Ability to batch multiple writes together
Buffer against write-heavy workloads
Improved throughput for write operations
Cons:
Risk of data loss if cache fails before writing to database
More complex implementation
Potential data consistency issues
Write-Around
Application writes directly to the database, bypassing cache
Cache is only populated when data is read
function saveData(key, value) {
database.put(key, value)
return success
}
function getData(key) {
data = cache.get(key)
if (data == null) {
data = database.get(key)
cache.put(key, data)
}
return data
}
Pros:
Prevents cache churn from write-heavy operations
Good for data that won't be read immediately or frequently
Cons:
Cache misses for recently written data
Higher read latency after writes
Refresh-Ahead
Proactively refreshes frequently accessed items before they expire
function setUpRefreshAhead(key, accessThreshold, refreshWindow) {
  item = cache.get(key)
  if (item.accessCount > accessThreshold && item.expiryTime - now() < refreshWindow) {
    asyncRefresh(key)
  }
}
function asyncRefresh(key) {
data = database.get(key)
cache.put(key, data)
}
Pros:
Reduces latency for frequently accessed items
Smooths out database load spikes
Maintains high throughput during peak times
Cons:
More complex to implement
May waste resources refreshing items that won't be accessed again
Difficult to predict which items to refresh
Cache Eviction Policies
When a cache reaches its capacity limit, eviction policies determine which items to remove to make space for new data.
Least Recently Used (LRU)
Removes the items that haven't been accessed for the longest time.
Least Frequently Used (LFU)
Removes items that are accessed least often. Tracks access frequency (hit count) for each item.
Other Notable Eviction Policies
First In, First Out (FIFO): Removes the oldest entries first
Random Replacement: Randomly selects items for eviction
Most Recently Used (MRU): Removes the most recently used items (useful for certain scanning patterns)
Redis as a Caching Solution
Redis (Remote Dictionary Server) is an open-source, in-memory data structure store that can be used as a database, cache, message broker, and streaming engine.
HTTP Caching with Cache-Control Headers
no-cache: Must revalidate with server before using cached version
no-store: Don't cache at all
public: Response can be stored by any cache
private: Response is for a single user only
Cache-Control: max-age=3600, public
Caching Challenges
Cache Invalidation
Determining when and how to invalidate cached items
Ensuring all relevant caches are invalidated
Balancing freshness vs. performance
"There are only two hard things in Computer Science: cache invalidation and naming things."
- Phil Karlton
Cache Stampede/Thundering Herd
When many requests simultaneously try to refresh an expired cache item
Solutions:
Lock/Mutex: Only one request regenerates the cache
Stale-While-Revalidate: Serve stale content while refreshing asynchronously
Probabilistic Early Expiration: Randomly refresh before expiration
Sliding Expiration Window: Reset TTL on access
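The lock/mutex idea is often implemented as "single-flight": concurrent misses for the same key share one in-progress load instead of each hitting the backend. The cache and loader are injected here, and all names are invented.

```javascript
// Single-flight: only the first miss for a key triggers the loader;
// concurrent callers await the same in-flight promise.
const inFlight = new Map();

async function getOrLoad(key, cache, loader) {
  if (cache.has(key)) return cache.get(key); // fast path: cache hit
  if (!inFlight.has(key)) {
    inFlight.set(
      key,
      loader(key).then((value) => {
        cache.set(key, value); // populate the cache once
        inFlight.delete(key);
        return value;
      })
    );
  }
  return inFlight.get(key); // everyone shares the same promise
}
```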
Content Delivery Networks (CDNs) & Edge Computing
Introduction to CDNs
A Content Delivery Network (CDN) is a distributed network of servers deployed in multiple data centers across different geographic locations. CDNs are designed to deliver content to end-users with high availability and performance by serving content from edge servers that are physically closer to users than origin servers.
Core CDN Concepts
Purpose and Benefits
Improved Performance: Reduced latency through geographic proximity
Increased Reliability: Distributed architecture minimizes single points of failure
Scalability: Better handling of traffic spikes and large audience sizes
Cost Efficiency: Reduced origin server load and bandwidth costs
Security: Protection against DDoS attacks and other threats
Key CDN Components
Edge Locations/Points of Presence (PoPs): Distributed servers in various geographic locations
Origin Server: The original source of content (your web server or cloud storage)
Distribution Network: The infrastructure connecting edge locations and origin
Cache Servers: Specialized servers optimized for content delivery
Control Plane: Management and configuration systems
How CDNs Work
User requests content from a website/application
DNS routes the request to the nearest CDN edge server
The edge server checks its cache for requested content
If content is in cache (cache hit), it's delivered directly to the user
If content is not cached (cache miss), the edge server requests it from the origin
The edge server caches the content and delivers it to the user
Subsequent requests for the same content are served from the edge cache
CDN Architecture Types
Traditional Pull CDNs
In a pull CDN, content is "pulled" from the origin server when first requested by a user and not found in the edge cache.
Characteristics:
Content is cached on-demand (lazy loading)
Origin server remains the source of truth
Automatic cache population based on user requests
Better suited for frequently changing content
Examples: Cloudflare, Amazon CloudFront (in pull mode), Akamai
Push CDNs
In a push CDN, content is proactively "pushed" to the edge servers before users request it.
Characteristics:
Content is uploaded to CDN in advance
Better for static content that doesn't change frequently
Provides more control over what's cached and when
Requires explicit cache invalidation when content changes
Often used for large media files, software downloads, etc.
Examples: Amazon CloudFront (in push mode), Azure CDN
CDN Content Types
Static Content Delivery
Static content doesn't change between user requests and is ideal for CDN caching:
Images and Graphics: JPG, PNG, GIF, SVG, WebP
CSS and JavaScript Files: Style sheets and client-side scripts
Health check status: Up/down state of backend servers
When to Scale
Vertical scaling: More powerful load balancer hardware/instances
Horizontal scaling: Multiple load balancers with DNS round-robin or GSLB
Consider scaling when:
CPU/memory utilization consistently above 70%
Connection queues building up
Increased latency under load
14 - Consistent Hashing
The Challenge of Distributed Data Placement
Imagine you have a large collection of data items that need to be distributed across multiple servers. How do you decide which server should handle which data item? This decision becomes particularly challenging when servers are added or removed from the system.
Traditional Approach and Its Limitations
Traditional Method: Using modulo arithmetic (hash(key) % number_of_servers)
Example:
With 4 servers (S0, S1, S2, S3) and keys hashed between 0-999
Key "user_profile_123" hashes to 857
857 % 4 = 1, so it goes to server S1
Problem:
When you add a 5th server, almost all keys need to be remapped:
Now 857 % 5 = 2, so it moves to server S2
In fact, approximately 80% of all keys would need to be moved!
Consistent Hashing
Conceptual Model: The Hash Ring
Imagine a circular ring numbered from 0 to 359 degrees (like a compass). Both servers and data keys are placed on this ring using a hash function.
How it works:
Place each server at positions on the ring based on their hash values
For each data key, find its position on the ring
Move clockwise from the key's position and assign it to the first server encountered
Concrete Example:
Server A hashes to position 80 on the ring
Server B hashes to position 160
Server C hashes to position 240
Server D hashes to position 320
If we have data keys that hash to positions 30, 110, 200, and 290:
Key at position 30 goes to Server A (next server clockwise)
Key at position 110 goes to Server B
Key at position 200 goes to Server C
Key at position 290 goes to Server D
Server Addition Example:
Add Server E at position 140
Now the key at position 110 will move to Server E
All other key assignments remain the same!
Server Removal Example:
Remove Server B (at position 160)
Only the keys between positions 80-160 are affected; they now go to Server C
All other key assignments remain unchanged
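The worked example above can be coded directly. This toy ring takes explicit positions on the 0-359 circle to match the example; a real implementation would derive positions by hashing server names and keys.

```javascript
// Toy consistent-hash ring over positions 0-359.
class HashRing {
  constructor() {
    this.nodes = []; // kept sorted by position: [{ pos, server }]
  }
  add(server, pos) {
    this.nodes.push({ pos, server });
    this.nodes.sort((a, b) => a.pos - b.pos);
  }
  remove(server) {
    this.nodes = this.nodes.filter((n) => n.server !== server);
  }
  locate(keyPos) {
    // first server clockwise from the key's position, wrapping past 359 to 0
    const owner = this.nodes.find((n) => n.pos >= keyPos) ?? this.nodes[0];
    return owner.server;
  }
}
```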
Virtual Nodes
Problem: With few servers, distribution can be unbalanced
Solution: Each physical server is represented by multiple points on the ring
Example:
Instead of just "Server A," we have "Server A-1," "Server A-2," ... "Server A-100"
Each gets placed at different positions on the ring
This spreads the load more evenly
Real-World Analogy
Think of consistent hashing like postal delivery zones. Each server is responsible for a "zone" on the ring. When a new post office opens, it only takes over a portion of one existing zone, rather than changing all zone boundaries.
Rendezvous Hashing (Highest Random Weight)
Conceptual Model: Scoring Contest
Imagine that for each data item, all servers compete in a "contest" to determine who will store that item. The contest is deterministic but unique for each key-server pair.
How it works:
For a given data key, each server generates a score based on hash(server_id + key)
The server with the highest score "wins" the key
Concrete Example:
With servers A, B, C, D and data key "user_123":
| Server | Score for "user_123" |
|---|---|
| A | 82 |
| B | 45 |
| C | 96 |
| D | 31 |
Server C has the highest score, so it gets assigned "user_123"
For a different key "product_456":
| Server | Score for "product_456" |
|---|---|
| A | 74 |
| B | 91 |
| C | 23 |
| D | 57 |
Server B wins this "contest" and gets assigned "product_456"
Server Addition Example:
Add Server E which scores 88 for "user_123" and 42 for "product_456"
Since E's score (88) for "user_123" is less than C's score (96), "user_123" stays on server C
Since E's score (42) for "product_456" is less than B's score (91), "product_456" stays on server B
Only keys where the new server gets the highest score will move
Server Removal Example:
Remove Server C
For "user_123", we need to find the next highest score, which is Server A (82)
Only keys previously assigned to Server C need to be reassigned
Real-World Analogy
Think of rendezvous hashing like a specialized job assignment. Each data item is a job with unique requirements, and each server has different skills for different jobs. The server that's the best match for a particular job gets assigned that job.
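The scoring contest can be sketched in a few lines. The hash below is an FNV-1a-style stand-in, not a recommendation; any well-mixed hash of the server+key pair works.

```javascript
// Deterministic per-(server, key) score; FNV-1a-style mixing as a stand-in.
function hashScore(s) {
  let h = 2166136261;
  for (const c of s) {
    h ^= c.charCodeAt(0);
    h = Math.imul(h, 16777619) >>> 0;
  }
  return h;
}

// Rendezvous (highest-random-weight): the server with the top score wins the key.
function owner(key, servers) {
  let best = null;
  let bestScore = -1;
  for (const server of servers) {
    const score = hashScore(server + key);
    if (score > bestScore) {
      bestScore = score;
      best = server;
    }
  }
  return best;
}
```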
Practical Applications
Consistent Hashing Applications:
CDN: Akamai uses consistent hashing to determine which edge server should cache specific content
Distributed Databases: Cassandra and DynamoDB use consistent hashing for data partitioning
Caching Systems: Memcached clients use consistent hashing to distribute cache entries
Rendezvous Hashing Applications:
Content-addressable Storage: Systems like Ceph use rendezvous hashing for object placement
Load Balancers: Some advanced load balancers use rendezvous hashing for request distribution
Peer-to-peer Networks: Used to determine which peer should store particular content
When to Choose Which
Choose Consistent Hashing when:
You need a well-established algorithm with extensive literature and tooling
The ring structure provides useful properties for your application
Memory overhead isn't a significant concern
Choose Rendezvous Hashing when:
You want a simpler implementation
Memory efficiency is important
You need natural load balancing without virtual nodes
15 - SQL
SQL (Structured Query Language)
SQL is a domain-specific language used for managing and manipulating relational databases. It serves as the standard language for relational database management systems (RDBMS).
Core Components of SQL
1. Data Definition Language (DDL)
Commands that define and modify database structure:
CREATE: Create databases, tables, indexes, views
ALTER: Modify existing database objects
DROP: Remove database objects
TRUNCATE: Remove all records from a table
2. Data Manipulation Language (DML)
Commands that manipulate data within tables:
SELECT: Retrieve data from one or more tables
INSERT: Add new records
UPDATE: Modify existing records
DELETE: Remove records
3. Data Control Language (DCL)
Commands that control access to data:
GRANT: Give privileges to users
REVOKE: Remove privileges from users
4. Transaction Control Language (TCL)
Commands that manage transactions:
COMMIT: Save changes permanently
ROLLBACK: Restore to the last commit point
SAVEPOINT: Create points to roll back to
Key SQL Concepts
Joins
Connect rows from multiple tables based on related columns:
INNER JOIN: Returns records with matching values in both tables
LEFT JOIN: Returns all records from the left table and matching records from the right
RIGHT JOIN: Returns all records from the right table and matching records from the left
FULL JOIN: Returns all records when there is a match in either table
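As a sketch with invented users and orders tables, the two most common joins look like this:

```sql
-- Hypothetical tables: users(id, name) and orders(id, user_id, total).

-- INNER JOIN: only users who have at least one order
SELECT u.name, o.total
FROM users u
INNER JOIN orders o ON o.user_id = u.id;

-- LEFT JOIN: every user, with NULL totals for users without orders
SELECT u.name, o.total
FROM users u
LEFT JOIN orders o ON o.user_id = u.id;
```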
Indexes
Special data structures that improve the speed of data retrieval operations:
Make queries faster but can slow down write operations
Primary keys are automatically indexed
Constraints
Rules enforced on data columns:
PRIMARY KEY: Uniquely identifies each record
FOREIGN KEY: Ensures referential integrity
UNIQUE: Ensures all values in a column are different
CHECK: Ensures values meet specified conditions
NOT NULL: Ensures a column cannot have NULL value
Aggregate Functions
Operations that perform calculations on sets of values:
COUNT(): Counts rows
SUM(): Calculates total
AVG(): Calculates average
MIN(): Finds minimum value
MAX(): Finds maximum value
B+ Trees
B+ Trees are self-balancing tree data structures that maintain sorted data and allow for efficient insertion, deletion, and search operations. They are the most common implementation for indexes in database systems.
Structure of B+ Trees
1. Nodes
Root Node: The top node of the tree
Internal Nodes: Contain keys and pointers to child nodes
Leaf Nodes: Store actual data or pointers to data
2. Properties
All leaf nodes are at the same level (balanced tree)
Each internal node contains between ⌈m/2⌉ and m children (where m is the order of the tree)
Leaf nodes are linked together in a linked list (crucial for range queries)
All keys are present in leaf nodes
B+ Tree Operations
Search Operation
Start at the root node
For each level, find the appropriate subtree based on key comparison
Continue until reaching a leaf node
Scan the leaf node for the target key
Range Queries
Search for the lower bound key
Once found in a leaf node, traverse the linked list of leaf nodes
Continue until reaching the upper bound
Time complexity: O(log n + k) where k is the number of elements in the range
Advantages in Database Systems
Shallow depth: Most databases can find any row with 3-4 disk reads
Sequential access: Leaf nodes link enables efficient range scans
Space utilization: High branching factor minimizes wasted space
Self-balancing: Maintains performance even with frequent changes
ACID Properties
ACID is an acronym that represents a set of properties ensuring reliable processing of database transactions.
The Four ACID Properties
1. Atomicity
Transactions are "all or nothing"
If any part fails, the entire transaction fails (rollback)
The database state remains unchanged if a transaction fails
Example: A bank transfer must either complete fully (debit one account and credit another) or not happen at all. Partial completion is not acceptable.
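In SQL terms (table and column names invented), atomicity is exactly what the transaction block guarantees:

```sql
-- Hypothetical accounts(id, balance) table: move 100 from account 1 to account 2.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- both updates become permanent together
-- On any error, issue ROLLBACK instead, restoring the pre-transaction state.
```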
2. Consistency
Transactions only transition the database from one valid state to another
All constraints, triggers, and rules must be satisfied
Data integrity is preserved
Example: If a table has a constraint that account balances cannot be negative, any transaction resulting in a negative balance will be rejected.
3. Isolation
Concurrent transactions do not interfere with each other
Results of a transaction are invisible to other transactions until completed
Prevents "dirty reads," "non-repeatable reads," and "phantom reads"
Example: When two users update the same data simultaneously, isolation ensures one user's changes don't overwrite or interfere with the other's.
4. Durability
Once a transaction is committed, it remains so
Changes survive system failures (power outages, crashes)
Typically implemented using transaction logs
Example: After confirming a payment, the data is permanently stored even if the database crashes immediately afterward.
ACID vs. BASE
Modern distributed systems sometimes use BASE (Basically Available, Soft state, Eventually consistent) as an alternative to ACID:
| ACID | BASE |
|---|---|
| Strong consistency | Eventual consistency |
| High isolation | Lower isolation |
| Focus on reliability | Focus on availability |
| Traditional RDBMS | Often used in NoSQL systems |
16 - NoSQL
NoSQL ("Not Only SQL") databases are non-relational database systems designed to handle various data models, provide horizontal scalability, and deliver high performance for specific use cases.
Key Characteristics
Schema Flexibility: Dynamic or no schema requirements
Horizontal Scalability: Ability to scale across multiple servers
High Availability: Designed for distributed environments
Eventual Consistency: Often prioritize availability over consistency
Specialized Workloads: Optimized for specific data patterns
NoSQL Data Models
1. Key-Value Stores
The simplest NoSQL model, storing data as key-value pairs:
Structure: Each item is stored as an attribute name (key) with its value
Performance: Extremely fast lookups by key
Use Cases: Caching, session storage, user preferences
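A key-value store's interface is essentially put/get by key. Here is a minimal in-memory sketch with optional per-key TTL (lazy expiry), loosely modeled on how caches like Redis expire keys; the class and key names are made up:

```python
import time

class KVStore:
    """Minimal in-memory key-value store with optional per-key TTL."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def put(self, key, value, ttl=None):
        expires = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires = entry
        if expires is not None and time.monotonic() >= expires:
            del self._data[key]  # lazy expiry: remove on access, like many caches
            return default
        return value

store = KVStore()
store.put("session:42", {"user": "alice"}, ttl=30)
store.get("session:42")  # {'user': 'alice'} — O(1) lookup by key
```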
2. Document Stores
Store semi-structured data in document format (typically JSON or BSON):
Structure: Collections of documents, each with unique structure
Flexibility: Schema-free with nested data structures
Querying: Rich query capabilities on document contents
Use Cases: Content management, user profiles, real-time analytics
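Querying on document contents, including nested fields, is what sets document stores apart from plain key-value stores. A toy matcher over JSON-like documents (the sample documents and dotted-path syntax are illustrative, echoing conventions in stores like MongoDB):

```python
users = [
    {"_id": 1, "name": "alice", "address": {"city": "Oslo"}, "tags": ["admin"]},
    {"_id": 2, "name": "bob", "address": {"city": "Paris"}},  # no 'tags' field: schema-free
]

def find(docs, path, value):
    """Return documents whose (possibly nested) field at 'path' equals value."""
    results = []
    for doc in docs:
        node = doc
        for part in path.split("."):  # walk into nested structure
            if isinstance(node, dict) and part in node:
                node = node[part]
            else:
                node = None
                break
        if node == value:
            results.append(doc)
    return results

find(users, "address.city", "Oslo")  # matches the first document only
```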
3. Column-Family Stores
Store data in column families - groups of related data:
Structure: Rows with dynamic columns grouped into families
Performance: Optimized for queries over large datasets
Sparsity: Efficiently handles sparse data
Use Cases: Time-series data, logging systems, heavy write workloads
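The row/column-family layout can be sketched as nested maps: each row key holds one or more column families, and rows simply omit columns they don't have, which is why sparse data is cheap. The sensor data below is made up:

```python
# Row key -> column families -> dynamic (column, value) pairs.
table = {
    "sensor:42": {
        "readings": {"t1000": 21.5, "t1060": 21.7},  # time-series family
        "meta": {"location": "lab-3"},
    },
    "sensor:43": {
        "readings": {"t1000": 19.9},  # sparse row: fewer columns, no wasted space
    },
}

table["sensor:42"]["readings"]["t1060"]  # 21.7
```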
4. Graph Databases
Specialized for highly connected data:
Structure: Nodes (entities), edges (relationships), and properties
Traversal: Optimized for relationship queries and traversals
Use Cases: Social networks, recommendation engines, fraud detection
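A relationship query like "friends within two hops" is a graph traversal. A small breadth-first sketch over an adjacency list (the social graph is invented for illustration):

```python
from collections import deque

# Directed "follows" edges as adjacency lists (hypothetical data).
edges = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}

def reachable_within(graph, start, max_hops):
    """BFS: all nodes reachable from start within max_hops edges."""
    seen = {start: 0}            # node -> hop count at which it was found
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue             # don't expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    seen.pop(start)
    return set(seen)

reachable_within(edges, "alice", 1)  # direct connections only
reachable_within(edges, "alice", 2)  # includes friends-of-friends
```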
Popular NoSQL Database Examples
Key-Value Stores
Redis: In-memory store with rich data structures (strings, lists, sets)
MySQL Cluster: Auto-sharding key-value storage with a shared-nothing architecture
Caching Systems
Redis Cluster: Shards keys across nodes using hash slots
Memcached: Relies on client-side sharding
Distributed Storage Systems
HDFS: Replicates data blocks rather than sharding by key
Ceph: Dynamic subtree partitioning combined with replication
18 - CAP Theorem
Overview
The CAP theorem, formulated by Eric Brewer in 2000, states that a distributed data system can guarantee at most two of the following three properties simultaneously:
Consistency
Availability
Partition tolerance
The Three Properties Explained
Consistency
All nodes see the same data at the same time
Every read receives the most recent write or an error
Ensures that data is identical across all nodes in the system
Similar to the concept of linearizability or atomic consistency
Availability
Every request to a non-failing node receives a response
The system remains operational and responsive even during failures
No request can be left hanging without a response
Does not guarantee that the response contains the most recent data
Partition Tolerance
The system continues to operate despite network partitions
Network partitions occur when nodes cannot communicate with each other
Messages between nodes may be delayed or lost
Essential property for distributed systems that operate across networks
CAP Theorem in Practice
CA Systems (Consistency + Availability)
Sacrifice partition tolerance
Examples: Traditional RDBMSs (PostgreSQL, MySQL with single-node setup)
Not truly distributed as they cannot handle network partitions
Rarely achievable in real-world distributed environments
CP Systems (Consistency + Partition Tolerance)
Sacrifice availability during partitions
Examples: MongoDB, HBase, Redis (in certain configurations)
May become unavailable during network partitions to maintain consistency
Choose consistency over availability when data accuracy is critical
AP Systems (Availability + Partition Tolerance)
Sacrifice strong consistency
Examples: Cassandra, Amazon Dynamo, CouchDB
Provide eventual consistency instead of immediate consistency
Choose availability over consistency when system uptime is critical
Consistency Models
Strong Consistency: All reads reflect the latest write
Eventual Consistency: All replicas eventually converge given enough time
Causal Consistency: Operations causally related must be seen in the same order
Read-your-writes Consistency: A client always sees its own writes
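Eventual consistency can be made concrete with a toy last-write-wins model: two replicas accept conflicting writes during a partition, then exchange state and converge once the partition heals. The class, keys, and timestamps below are invented for illustration; real systems (e.g., Dynamo-style stores) use vector clocks or similar mechanisms rather than bare timestamps:

```python
class LWWReplica:
    """Last-write-wins replica: a toy model of eventual consistency."""
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        current = self.store.get(key)
        if current is None or ts > current[0]:  # keep only the newest write
            self.store[key] = (ts, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def merge(self, other):
        """Anti-entropy: pull the other replica's state, keeping newer writes."""
        for key, (ts, value) in other.store.items():
            self.write(key, value, ts)

a, b = LWWReplica(), LWWReplica()
a.write("x", "old", ts=1)  # reaches replica a only (simulated partition)
b.write("x", "new", ts=2)  # concurrent write lands on replica b
stale = a.read("x")        # 'old' — replicas disagree during the partition
a.merge(b)
b.merge(a)                 # partition heals; replicas exchange state and converge
```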
Practical Considerations
System Design Tradeoffs
Business requirements should guide CAP choices
Financial systems may prioritize consistency
Social media may prioritize availability
Handling Partition Scenarios
Detect partitions quickly
Define recovery procedures when partitions heal
Consider using compensating transactions
Regional Considerations
Multi-region deployments face CAP challenges more acutely
Geographic distance increases partition probability
Consider region-specific consistency requirements
19 - Object Storage
Overview
Object storage is a data storage architecture that manages data as discrete units called objects, rather than as files in a hierarchical file system or blocks in a block storage system. Objects are stored in a flat address space, making object storage highly scalable and well-suited for unstructured data.
Key Characteristics
Object Structure
Data: The actual content being stored (document, image, video, etc.)
Metadata: Descriptive information about the object (creation date, size, custom attributes)
Unique Identifier: Globally unique ID or key used to retrieve the object
Architecture
Flat Namespace: No hierarchical directory structure; objects are addressed by globally unique identifiers in a single flat namespace
RESTful APIs: Typically accessed via HTTP-based APIs (GET, PUT, DELETE)
Web-Scale Design: Built for massive scalability (petabytes and beyond)
Distributed: Data distributed across multiple nodes and often multiple locations
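The object model above (data + metadata + unique key in a flat namespace) can be sketched in a few lines of Python. The class and method names are made up; in a real service such as Amazon S3, put/get/delete map to HTTP PUT/GET/DELETE requests:

```python
import datetime
import uuid

class ObjectStore:
    """Flat-namespace object store sketch: data + metadata + unique key."""
    def __init__(self):
        self._objects = {}  # flat namespace: unique key -> (data, metadata)

    def put(self, data: bytes, **metadata):
        key = str(uuid.uuid4())  # globally unique identifier, no directories
        metadata.setdefault("size", len(data))
        metadata.setdefault(
            "created", datetime.datetime.now(datetime.timezone.utc).isoformat()
        )
        # Objects are written whole and replaced, never edited in place.
        self._objects[key] = (data, dict(metadata))
        return key

    def get(self, key):
        return self._objects[key]       # ~ HTTP GET in a real API

    def delete(self, key):
        self._objects.pop(key, None)    # ~ HTTP DELETE

store = ObjectStore()
key = store.put(b"hello", content_type="text/plain")
data, meta = store.get(key)
```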
Advantages
Unlimited Scalability: Can scale horizontally to exabytes of data
Cost-Effectiveness: Often less expensive than block or file storage for large datasets
Data Durability: Multiple copies of data stored across different devices/locations
Rich Metadata: Extended object information beyond basic file attributes
Global Access: Objects accessible from anywhere via HTTP(S)
Immutability: Objects typically aren't modified but replaced entirely
Limitations
Performance: Generally higher latency than block storage
Limited File System Operations: No direct file locking or appending
Not Suitable for Databases: Poor fit for transactional workloads
Limited OS Integration: Not directly mountable as a traditional file system
20 - Message Queues
Message queues are asynchronous communication mechanisms that enable services to communicate without being directly connected. They serve as intermediaries that pass messages between different parts of a system.
Core Components
Producer: Application that creates and sends messages to the queue
Consumer: Application that receives and processes messages from the queue
Broker: Middleware that manages the queue and ensures message delivery
Queue/Topic: The storage mechanism where messages reside until processed
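The producer/consumer relationship can be sketched with Python's standard-library queue.Queue standing in for the broker (a real broker adds persistence, delivery guarantees, and network transport; the message contents here are arbitrary):

```python
import queue
import threading

broker = queue.Queue()  # stands in for the broker-managed queue
results = []

def consumer():
    while True:
        msg = broker.get()   # blocks until a message arrives
        if msg is None:      # sentinel message: shut down
            break
        results.append(msg.upper())  # "process" the message
        broker.task_done()

worker = threading.Thread(target=consumer)
worker.start()

# The producer sends and moves on — it never waits for processing.
for text in ["hello", "world"]:
    broker.put(text)
broker.put(None)
worker.join()
# results == ['HELLO', 'WORLD']
```

Adding more consumer threads reading from the same queue is the "add consumers to scale" idea from the benefits list.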
Key Benefits
Decoupling: Services don't need to know about each other
Scalability: Systems can handle variable loads by adding consumers
Resilience: Messages persist during downstream service outages
Asynchronous Processing: Producers continue without waiting for consumers
Load Leveling: Absorbs traffic spikes, preventing service overload
Messaging Patterns
Publish-Subscribe (Pub/Sub)
In this pattern, publishers send messages to topics, and subscribers receive all messages from the topics they subscribe to.
Characteristics
One-to-many: Messages from one publisher go to multiple subscribers
Topic-based: Messages are filtered based on topics
Loose coupling: Publishers don't know who will receive their messages
Use Cases
Event notifications
Broadcasting updates
Real-time dashboards
Log distribution
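The one-to-many, topic-based fan-out can be sketched as a tiny in-process broker; the class, topic name, and message below are invented, and real pub/sub systems deliver over the network rather than via direct callbacks:

```python
from collections import defaultdict

class PubSubBroker:
    """Topic-based pub/sub sketch: every subscriber to a topic gets every message."""
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # One-to-many fan-out; the publisher never sees who is subscribed.
        for callback in self._subscribers[topic]:
            callback(message)

broker = PubSubBroker()
dashboard, audit = [], []
broker.subscribe("orders", dashboard.append)
broker.subscribe("orders", audit.append)
broker.publish("orders", {"id": 7, "total": 99})
# Both subscribers received the same message.
```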
Message Delivery Models
Push Model
In the push model, the broker actively sends messages to consumers as they arrive.