Big Data & Cloud

The Cloud

  • Service Level Agreements (SLA) between provider and renter
  • Service Oriented Architecture (SOA)

Big Data

  • Pushes the limits of Volume, Velocity, Variety, Veracity
  • Often done in the cloud, but the two are independent: big data does not require the cloud, and vice versa

BigTable

  • Database-style storage built on a cluster, layered on top of the Google File System
  • Sparse schema format: column-oriented compression. Columns stored together.
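A minimal sketch of the sparse, column-oriented idea, assuming a toy in-memory layout. The class name SparseColumnStore and the example row keys are hypothetical and are not BigTable's actual API. Each column is stored together, and empty cells take no space.

```python
from collections import defaultdict


class SparseColumnStore:
    """Toy column-oriented store: each column keeps only the rows that have a value."""

    def __init__(self):
        # column name -> {row key -> value}; absent cells are simply not stored
        self.columns = defaultdict(dict)

    def put(self, row, column, value):
        self.columns[column][row] = value

    def get(self, row, column, default=None):
        # .get() avoids creating an empty column as a side effect of reading
        return self.columns.get(column, {}).get(row, default)

    def scan_column(self, column):
        """Return (row, value) pairs for one column, sorted by row key."""
        return sorted(self.columns.get(column, {}).items())


store = SparseColumnStore()
store.put("com.example/index", "title", "Home")
store.put("com.example/index", "lang", "en")
store.put("com.example/about", "title", "About")
```

Reading one column touches only that column's data, which is the property that makes per-column compression practical.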

Information Architecture

  • Blueprint for the site, largely about design
  • Choose labels well
  • Categorization within your site
  • Keep URL independent of logical hierarchy of your site

Auctions and Recommender Systems

Auctions

  • Tightly related to game theory
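  • Dutch vs. English auctions: in a Dutch auction the price descends until a bidder accepts; in an English auction the price ascends until one bidder remains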
  • Pay per click auctions (e.g. Google AdWords)
  • A common problem with the model: a bidder wants two objects together but cannot obtain both at the same time
  • Collusion with other bidders
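The difference between the two auction formats can be sketched as follows. This is a simplified model: bidders are assumed to act on fixed private valuations and to accept a price as soon as it is at or below their valuation, which real strategic bidders would not necessarily do. The function names and the example valuations are hypothetical.

```python
def english_auction(valuations, start=0, increment=1):
    """Ascending price: bidders drop out once the next price exceeds their valuation."""
    price = start
    active = {b for b, v in valuations.items() if v >= price}
    if not active:
        return None, start
    while len(active) > 1:
        # keep only the bidders willing to pay the next price
        staying = {b for b in active if valuations[b] >= price + increment}
        if not staying:
            break  # everyone left would drop out together (a tie)
        price += increment
        active = staying
    winner = max(active, key=valuations.get)
    return winner, price


def dutch_auction(valuations, start, decrement=1):
    """Descending price: the first bidder to accept the current price wins."""
    price = start
    while price > 0:
        takers = [b for b, v in valuations.items() if v >= price]
        if takers:
            return takers[0], price
        price -= decrement
    return None, 0


bids = {"a": 10, "b": 7, "c": 4}
print(english_auction(bids))          # winner pays just above the runner-up's valuation
print(dutch_auction(bids, start=12))  # winner pays the first price they accept
```

Under these assumptions the same bidder wins both auctions, but at different prices.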

Recommenders

  • Challenges: lots of products, not much feedback from users
  • Collaborative filtering: recommend what people similar to you like
    • Like tf-idf: users are compared as weighted vectors, much as documents are
    • Cosine similarity or Pearson correlation
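A minimal sketch of user-based collaborative filtering with cosine similarity, assuming ratings are stored as sparse dicts. The user names, items, and the similarity-weighted scoring rule are illustrative assumptions, not a prescribed method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(target, ratings, top_n=1):
    """Rank items the target has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for user, theirs in ratings.items():
        if user == target:
            continue
        sim = cosine(ratings[target], theirs)
        for item, rating in theirs.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


ratings = {
    "ann": {"book": 5, "lamp": 1},
    "bob": {"book": 5, "lamp": 1, "desk": 4},
    "cat": {"book": 1, "lamp": 5, "rug": 5},
}
print(recommend("ann", ratings))
```

Here "ann" rates like "bob" and unlike "cat", so bob's unrated item outranks cat's. Pearson correlation could be swapped in for cosine() by subtracting each user's mean rating first.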

DNS & CDN

DNS

  • DNS translates domain names (for humans) to IP addresses (for routers)
  • DNS is a globally accessible database of (name, IP address) tuples. Large, lots of potential problems
  • DNS namespace is a tree structure where the root node has no parent
  • Each subtree is an administrative zone served by authority servers (AS), with primary and secondary servers. Zones are defined recursively. Clients contact the AS for an IP address.
  • Chain of caching DNS servers between client and AS
  • DNS Cache Poisoning: bad info is inserted into a DNS server and cached. Remedy: DNS entries must be cryptographically signed (DNSSEC)
  • Some invalid domain names are redirected to an ad page; this redirection can also be used to steal cookies
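The zone tree and the caching layer can be sketched together. This is a toy model, not the real DNS protocol: the ZONES table, the CachingResolver class, and the fixed TTL are all hypothetical, and the addresses come from documentation-reserved ranges.

```python
import time

# Hypothetical zone data: each zone either delegates a label to a child zone
# or holds the final (name, IP address) records.
ZONES = {
    ".": {"delegate": {"com": "com."}},
    "com.": {"delegate": {"example": "example.com."}},
    "example.com.": {"records": {"www.example.com": "192.0.2.10"}},
}


class CachingResolver:
    def __init__(self, zones, ttl=60, clock=time.monotonic):
        self.zones, self.ttl, self.clock = zones, ttl, clock
        self.cache = {}    # name -> (address, expiry time)
        self.lookups = 0   # number of walks down to the authority servers

    def _walk(self, name):
        """Follow delegations from the root zone down to the authoritative zone."""
        zone = "."
        for label in reversed(name.split(".")):
            delegations = self.zones[zone].get("delegate", {})
            if label not in delegations:
                break
            zone = delegations[label]
        return self.zones[zone].get("records", {}).get(name)

    def resolve(self, name):
        hit = self.cache.get(name)
        if hit and hit[1] > self.clock():
            return hit[0]  # served from cache, no walk needed
        self.lookups += 1
        address = self._walk(name)
        if address is not None:
            self.cache[name] = (address, self.clock() + self.ttl)
        return address


resolver = CachingResolver(ZONES)
print(resolver.resolve("www.example.com"))
```

The cache is also where poisoning would happen: whatever lands in self.cache is trusted until it expires, which is why signed records matter.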

CDN

  • Built on top of a giant DNS hack
  • Solves the problem of people all over trying to access same data at same time: reduces load on data center and spreads load across internet
  • e.g. Akamai: Uses custom DNS servers to dynamically direct clients to right caching servers
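A rough sketch of the "DNS hack": the CDN's DNS server answers the same hostname differently depending on who asks. The EDGE_CACHES table, the latency figures, and the function name are hypothetical; this is not Akamai's actual selection logic, which is not described here.

```python
# Hypothetical edge caches and measured latencies (ms) from each client region.
EDGE_CACHES = {
    "198.51.100.1": {"eu": 12, "us": 95, "asia": 180},
    "198.51.100.2": {"eu": 90, "us": 15, "asia": 140},
    "198.51.100.3": {"eu": 170, "us": 130, "asia": 20},
}


def cdn_dns_answer(hostname, client_region, healthy=None):
    """Answer a DNS query with the healthy edge cache closest to the client."""
    candidates = [ip for ip in EDGE_CACHES if healthy is None or ip in healthy]
    if not candidates:
        raise LookupError(f"no edge cache available for {hostname}")
    return min(candidates, key=lambda ip: EDGE_CACHES[ip][client_region])


print(cdn_dns_answer("img.example.com", "eu"))
print(cdn_dns_answer("img.example.com", "asia"))
```

Passing a healthy set lets the same function route around a failed cache, which is one way the load gets spread across the internet.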

  • Website Syndication: Content creator places content in files created according to Atom or RSS. Content duplicator accesses syndicated content (headlines) and displays on their own website

Scaling

  • DNS + Load Balancing
  • Round Robin DNS: DNS responds with a permuted list of load-balancing proxies (as a list of their IP addresses)
  • Use a proxy to do load balancing so that it can keep track of more detailed information; proxies connect to backend database servers
  • Multiple servers with one 'master' and multiple 'slaves'
  • "Consistent" database is a database where all users see the same data at the same time
  • Partition Data: each server only keeps a fraction of the data
  • CAP Theorem: Consistency, Availability, Partition tolerance. Pick 2 for your database
  • ACID: Traditional Databases. BASE: NoSQL (when availability more important than consistency)
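Two of the ideas above, round robin DNS and data partitioning, can be sketched briefly. The rotation used here is one simple choice of permutation, and hash-based sharding is one assumed way of giving each server a fraction of the data; both function names are hypothetical.

```python
import hashlib
from itertools import count


def round_robin_answers(addresses):
    """Yield the address list rotated by one position per query."""
    for i in count():
        k = i % len(addresses)
        yield addresses[k:] + addresses[:k]


def shard_for(key, num_shards):
    """Map a key to a shard with a stable hash, so every server agrees on placement."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


answers = round_robin_answers(["203.0.113.1", "203.0.113.2", "203.0.113.3"])
print(next(answers))  # first client sees proxy .1 first
print(next(answers))  # next client sees proxy .2 first
print(shard_for("user:42", num_shards=4))
```

A cryptographic hash is used instead of Python's built-in hash() because the built-in is randomized per process for strings, so different servers would disagree on which shard owns a key.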

Scaling Datacenters

  • Split up work across datacenters: horizontal sharding (datacenters according to location), separate datacenters for different products, or copy everything
  • Deploying code: rolling restarts, canarying, blue/green deployment, or continuous deployment
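Canarying can be sketched as sending a small, stable fraction of users to the new version before rolling it out everywhere. The hash-bucket scheme, the function names, and the version labels "v1" and "v2" are assumptions for illustration.

```python
import hashlib


def serves_canary(user_id, canary_percent):
    """Deterministically place a user in the canary group via a hash bucket (0-99)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent


def pick_version(user_id, canary_percent, stable="v1", canary="v2"):
    """Return which deployed version should handle this user's requests."""
    return canary if serves_canary(user_id, canary_percent) else stable


users = [f"user{i}" for i in range(1000)]
on_canary = sum(pick_version(u, canary_percent=5) == "v2" for u in users)
print(f"{on_canary} of {len(users)} users routed to the canary")
```

Because the bucket depends only on the user id, a user stays on the same version between requests, and raising canary_percent only ever adds users to the canary group.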