Big Data & Cloud
The Cloud
- Service Level Agreements (SLA) between provider and renter
- Service Oriented Architecture (SOA)
Big Data
- Pushes the limits of Volume, Velocity, Variety, Veracity
- Often done in the cloud, but the two concepts are independent of each other
BigTable
- Database-style storage built on a cluster, layered on top of Google File System (GFS)
- Sparse schema: a row only stores the columns it actually uses. Column-oriented: values of a column are stored together, which helps compression.
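The sparse, column-oriented layout can be illustrated with a toy in-memory store. This is only a sketch of the idea, not BigTable's actual implementation; the class name `SparseColumnStore`, its methods, and the row and column names are all made up for the example.

```python
from collections import defaultdict


class SparseColumnStore:
    """Toy column-oriented store: each column keeps only the rows that have a value."""

    def __init__(self):
        # column name -> {row key -> value}; cells that were never written take no space
        self.columns = defaultdict(dict)

    def put(self, row, column, value):
        self.columns[column][row] = value

    def get(self, row, column, default=None):
        # .get() on the outer dict avoids creating empty columns on reads
        return self.columns.get(column, {}).get(row, default)

    def scan_column(self, column):
        """Return the (row, value) pairs of one column, sorted by row key."""
        return sorted(self.columns.get(column, {}).items())


# Hypothetical data: two rows, only one of which has an "anchor:home" cell
store = SparseColumnStore()
store.put("com.example/index", "anchor:home", "Example")
store.put("com.example/index", "contents", "<html>...</html>")
store.put("org.sample/about", "contents", "<html>about</html>")
```

Because each column lives in its own dictionary, scanning one column never touches the values of the others.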
Information Architecture
- Blueprint for the site, largely about design
- Choose labels well
- Categorization within your site
- Keep URLs independent of the logical hierarchy of your site
Auctions and Recommender Systems
Auctions
- Tightly related to game theory
- Dutch (descending price) vs. English (ascending price) auctions
- Pay per click auctions (e.g. Google AdWords)
- A common problem with the model: a bidder may want two objects together but be unable to win both at the same time
- Collusion with other bidders
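The difference between the English and Dutch formats can be sketched with a small simulation. It assumes, unrealistically, that every bidder bids truthfully up to a private valuation; the function names, bidder names, and values are hypothetical.

```python
def english_auction(valuations, start=0, increment=1):
    """Ascending auction: the price rises until only one bidder is still willing to pay."""
    price = start
    active = sorted(b for b, v in valuations.items() if v >= price)
    while len(active) > 1:
        still_in = [b for b in active if valuations[b] >= price + increment]
        if not still_in:
            break  # all remaining bidders drop out together; tie broken by name order
        price += increment
        active = still_in
    return (active[0], price) if active else (None, None)


def dutch_auction(valuations, start, decrement=1):
    """Descending auction: the price falls until the first bidder accepts it."""
    price = start
    while price > 0:
        takers = sorted(b for b, v in valuations.items() if v >= price)
        if takers:
            return takers[0], price  # assumption: bidders accept as soon as price <= valuation
        price -= decrement
    return None, None


# Hypothetical private valuations
bids = {"alice": 10, "bob": 7, "carol": 4}
english_result = english_auction(bids)
dutch_result = dutch_auction(bids, start=12)
```

Under these assumptions the English winner pays just above the second-highest valuation, while the Dutch winner pays their own valuation. Real Dutch bidders would wait for a lower price, which is where the game theory comes in.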
Recommenders
- Challenges: lots of products, not much feedback from users
- Collaborative filtering: recommend items that users similar to you liked
- Like tf-idf: each user (or item) is treated as a weighted vector, and vectors are compared for similarity
- Cosine similarity or Pearson correlation
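The two similarity measures can be sketched in plain Python. The rating data and the helper `most_similar` are made up for illustration, and the sketch assumes every user has rated every item, which real recommenders cannot rely on.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def pearson_correlation(u, v):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u], [b - mean_v for b in v])


def most_similar(target, ratings, similarity):
    """Return the other user whose ratings are most similar to the target's."""
    others = (user for user in ratings if user != target)
    return max(others, key=lambda user: similarity(ratings[target], ratings[user]))


# Hypothetical ratings of the same four items by three users
ratings = {
    "alice": [5, 3, 4, 4],
    "bob": [4, 2, 3, 3],
    "carol": [1, 5, 2, 1],
}
neighbor = most_similar("alice", ratings, pearson_correlation)
```

Pearson correlation subtracts each user's mean rating first, so a user who rates everything one point lower than another still counts as very similar.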
DNS & CDN
DNS
- DNS translates domain names (for humans) to IP addresses (for routers)
- DNS is a globally accessible database of (name, IP address) tuples. Large, lots of potential problems
- DNS namespace is a tree structure where the root node has no parent
- Each subtree is an administrative zone served by authoritative servers (AS). There are primary and secondary servers. Zones are defined recursively. Clients contact the AS to get an IP address.
- Chain of caching DNS servers between client and AS
- DNS Cache Poisoning: bad info is inserted into a DNS server and cached. Remedy: DNS entries must be cryptographically signed (DNSSEC)
- Invalid domain names are sometimes redirected to an ad page; such redirects can also be used to steal cookies
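The walk down the zone tree, plus caching, can be sketched with a toy resolver. The zone table, names, and addresses are invented (documentation-range IPs), the class `CachingResolver` is hypothetical, and real resolvers also handle TTLs, record types, and network failures, all of which are omitted here.

```python
# Hypothetical zone data: each zone either answers with a record
# or delegates part of the namespace to a child zone.
ZONES = {
    ".": {"delegations": ["com."], "records": {}},
    "com.": {"delegations": ["example.com."], "records": {}},
    "example.com.": {"delegations": [], "records": {"www.example.com.": "192.0.2.10"}},
}


class CachingResolver:
    def __init__(self, zones):
        self.zones = zones
        self.cache = {}   # name -> address; no expiry in this sketch
        self.queries = 0  # number of zone lookups performed

    def resolve(self, name):
        if name in self.cache:
            return self.cache[name]
        zone = "."  # always start at the root of the tree
        while True:
            self.queries += 1
            data = self.zones[zone]
            if name in data["records"]:
                self.cache[name] = data["records"][name]
                return self.cache[name]
            # follow the delegation whose zone name is a label-aligned suffix of the query
            child = next(
                (z for z in data["delegations"] if name == z or name.endswith("." + z)),
                None,
            )
            if child is None:
                return None  # no such name
            zone = child


resolver = CachingResolver(ZONES)
first = resolver.resolve("www.example.com.")
lookups_after_first = resolver.queries
second = resolver.resolve("www.example.com.")  # served from the cache
```

The second lookup does not touch any zone, which is also why a poisoned cache entry keeps being served until it is evicted.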
CDN
- Built on top of a giant DNS hack
- Solves the problem of people all over trying to access same data at same time: reduces load on data center and spreads load across internet
- e.g. Akamai: uses custom DNS servers to dynamically direct clients to the right caching servers
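One way to picture the "DNS hack" is a DNS server that answers each query with the address of the best available cache instead of a fixed address. The sketch below is an assumption-laden illustration, not how Akamai actually chooses servers: the edge addresses, the latency numbers, and the function `pick_edge` are all hypothetical.

```python
ORIGIN = "203.0.113.5"  # hypothetical data center address, used as a fallback


def pick_edge(latency_ms, healthy):
    """Return the healthy edge cache with the lowest measured latency, else the origin."""
    candidates = {edge: ms for edge, ms in latency_ms.items() if edge in healthy}
    if not candidates:
        return ORIGIN
    return min(candidates, key=candidates.get)


# Hypothetical latency measurements from one client's network to each edge cache
latency_from_client = {
    "198.51.100.10": 120,  # far-away edge
    "198.51.100.20": 15,   # nearby edge
    "198.51.100.30": 240,
}
all_edges = set(latency_from_client)
answer = pick_edge(latency_from_client, healthy=all_edges)
```

If the nearby edge fails its health check, the same function falls back to the next-closest edge, and to the origin only when no edge is usable.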
Website Syndication
- Content creator publishes content in files formatted according to Atom or RSS
- Content duplicator fetches the syndicated content (e.g. headlines) and displays it on their own website
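The duplicator's side can be sketched with Python's standard `xml.etree.ElementTree` module. The feed below is a made-up RSS 2.0 document embedded as a string; a real duplicator would download it from the creator's site, and the function name `headlines` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed as published by a content creator
RSS_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <description>Made-up feed for illustration</description>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""


def headlines(feed_xml):
    """Extract (title, link) pairs from every <item> of an RSS feed."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("title"), item.findtext("link")) for item in root.iter("item")]


for title, link in headlines(RSS_FEED):
    print(f"{title}: {link}")
```

Atom feeds use different element names (`entry` instead of `item`) and an XML namespace, so they would need a slightly different query.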
Scaling
- DNS + Load Balancing
- Round Robin DNS: DNS responds with a permuted list of load balancing proxies (i.e. a list of their IP addresses)
- Use a proxy to do load balancing so that it can keep track of more detailed information; proxies connect to backend database servers
- Multiple servers with one 'master' and multiple 'slaves'
- "Consistent" database is a database where all users see the same data at the same time
- Partition Data: each server only keeps a fraction of the data
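- CAP Theorem: Consistency, Availability, Partition tolerance. Pick 2 for your database
- ACID (Atomicity, Consistency, Isolation, Durability): traditional databases. BASE (Basically Available, Soft state, Eventual consistency): NoSQL (when availability is more important than consistency)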
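Two of the ideas above, round robin DNS and data partitioning, can be sketched in a few lines. Rotation is used here as one simple way to permute the list, and hashing the key is one common partitioning scheme among several; the proxy addresses and the function names are hypothetical.

```python
import hashlib


def round_robin_answers(addresses):
    """Yield the address list rotated by one more position on every query."""
    start = 0
    while True:
        yield addresses[start:] + addresses[:start]
        start = (start + 1) % len(addresses)


def shard_for(key, num_shards):
    """Map a key to one of num_shards servers using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical load balancing proxies
PROXIES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
dns = round_robin_answers(PROXIES)
first_answer = next(dns)   # one client sees this ordering
second_answer = next(dns)  # the next client sees the list rotated by one

shard = shard_for("user:42", num_shards=4)  # which server stores this key
```

Clients usually try the first address in the list, so rotating the list spreads them across the proxies. A stable hash is used instead of Python's built-in `hash()` because the latter is randomized per process for strings.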
Scaling Datacenters
- Split up work across datacenters: horizontal sharding (datacenters assigned by location), separate datacenters for different products, or copy everything
- Deploying code: rolling restarts, canarying, blue/green deployment, or continuous deployment
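A canary followed by rolling restarts can be sketched as a loop that updates one server first, then the rest in small batches, and halts at the first unhealthy server. The function names, server names, and the `restart` and `healthy` callbacks are placeholders; a real rollout would call into deployment tooling and monitoring.

```python
def rolling_batches(servers, batch_size):
    """Split the server list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [servers[i:i + batch_size] for i in range(0, len(servers), batch_size)]


def deploy(servers, batch_size, restart, healthy):
    """Canary one server, then roll through the rest; stop at the first failure.

    Returns (updated_servers, failed_server); failed_server is None on success.
    """
    if not servers:
        return [], None
    updated = []
    canary, rest = servers[0], servers[1:]
    for batch in [[canary]] + rolling_batches(rest, batch_size):
        for server in batch:
            restart(server)  # servers outside this batch keep serving traffic
            if not healthy(server):
                return updated, server  # halt the rollout
            updated.append(server)
    return updated, None


# Hypothetical fleet; restart just records the call, health check always passes
fleet = ["web1", "web2", "web3", "web4", "web5"]
restarted = []
result = deploy(fleet, batch_size=2, restart=restarted.append, healthy=lambda s: True)
```

If the canary itself is unhealthy, nothing else is restarted, which limits the damage of a bad release to a single server.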