Web Search and System Design

Three basic problems:

  1. Result relevance
  2. Processing speed
  3. Scaling to many documents

Result Relevance

  • The most basic search uses boolean queries: a document is simply included in or excluded from the results
  • But we also need to order the results, which requires ranking
  • Hit-counting factors used in ranking:
    1. Term frequency
    2. Term scarcity in collection
    3. Length of documents
  • Vector Space Query Model
    • Documents are points in a high-dimensional space, with one axis per word in the vocabulary
    • Documents (points) that are close together are more similar
    • Thus querying is the same as looking for docs near the query in this space
    • Docs and Queries are vectors, with terms and their respective weights
    • Weight of the kth term in document i: $$w_{ik} = tf_{ik} \cdot \log(N/n_k)$$, where $$tf_{ik}$$ is the term frequency, $$N$$ is the number of documents in the collection, and $$n_k$$ is the number of documents containing term k (a small sketch follows this list)
      • The $$\log(N/n_k)$$ factor is the IDF (Inverse Document Frequency): it gives high values for rare words and low values for common words
    • Also be sure to normalize weights so that longer docs are not given more weight
    • One way is to use cosine similarity, which normalizes by vector length (see the worked example and sketch below)
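A minimal sketch of this weighting, assuming natural log (the base only rescales the weights) and illustrative names:

```python
import math

def tfidf_weight(term_freq, N, n_k):
    """Weight of term k in a document: tf * log(N / n_k).

    term_freq -- occurrences of term k in the document
    N         -- total number of documents in the collection
    n_k       -- number of documents containing term k
    """
    return term_freq * math.log(N / n_k)

# At the same term frequency, a rare word (in 10 of 1000 docs)
# far outweighs a common one (in 900 of 1000 docs).
print(tfidf_weight(3, 1000, 10))   # ~13.8
print(tfidf_weight(3, 1000, 900))  # ~0.32
```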

Example: What is the result of the similarity computation with query vector Q = (0.4, 0.8) and document D2 = (0.2, 0.7)?

$$sim(Q, D_2) = \frac{(0.4)(0.2) + (0.8)(0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}}$$

$$= \frac{0.64}{\sqrt{0.424}} \approx 0.98$$
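The same computation in code; a small sketch of cosine similarity in plain Python (names are illustrative):

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between query vector q and document vector d."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norms = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / norms

# Reproduces the worked example: Q = (0.4, 0.8), D2 = (0.2, 0.7)
print(cosine_similarity((0.4, 0.8), (0.2, 0.7)))  # ~0.98
```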

Techniques to Evaluate Search Ranking

  1. Precision/Recall Curves
    • Precision is % of selected items that are correct
    • Recall is % of correct items that are selected
    • Precision and recall trade off against each other, as the P/R curve shows: tuning a system to improve one typically hurts the other
  2. Kendall's Tau
    • Compute the fraction of pairwise orderings that are consistent: $$\tau = \frac{n_c - n_d}{\frac{1}{2}n(n-1)}$$ where $$n_c$$ = number of consistent (concordant) pairs, $$n_d$$ = number of inconsistent (discordant) pairs, and $$n$$ = number of ranked items, so $$\frac{1}{2}n(n-1)$$ is the total number of pairs

Ex: Correct ordering: 1,2,3,4

Search Engine A: 1,2,4,3

Of the 6 pairs, only (3,4) is ordered inconsistently, so $$n_c = 5$$ and $$n_d = 1$$:

Kendall's Tau = $$\frac{5 - 1}{\frac{1}{2}(4)(4-1)} = \frac{4}{6} \approx 0.67$$

  3. Mean Reciprocal Rank
    • The mean over all search queries of 1/rank of the first relevant result (see the sketch below)
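A small sketch of both metrics, using illustrative function names:

```python
from itertools import combinations

def kendalls_tau(ranking, correct):
    """(n_c - n_d) / (n(n-1)/2), computed over all pairs of ranked items."""
    pos = {item: i for i, item in enumerate(ranking)}
    true_pos = {item: i for i, item in enumerate(correct)}
    n_c = n_d = 0
    for a, b in combinations(correct, 2):
        # A pair is consistent if both orderings agree on a-vs-b
        if (pos[a] < pos[b]) == (true_pos[a] < true_pos[b]):
            n_c += 1
        else:
            n_d += 1
    n = len(correct)
    return (n_c - n_d) / (0.5 * n * (n - 1))

def mean_reciprocal_rank(first_relevant_ranks):
    """Mean of 1/rank of the first relevant result, one rank per query."""
    return sum(1 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

print(kendalls_tau([1, 2, 4, 3], [1, 2, 3, 4]))  # (5 - 1) / 6 = 0.666...
print(mean_reciprocal_rank([1, 3, 2]))           # (1 + 1/3 + 1/2) / 3 ≈ 0.61
```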

Link/Graph Analysis

Measures of Prestige and Importance

  • Degree centrality: the node with the highest degree (number of links) is the most important
  • Closeness centrality: take the sum of the distances to all other nodes and take the inverse
  • PageRank: $$PR(A) = \frac{1-d}{N} + d \sum_i \frac{PR(I_i)}{C(I_i)}$$ where $$d$$ is the damping factor, $$N$$ is the total number of pages, the $$I_i$$ are the pages linking to A, and $$C(I_i)$$ is the number of outgoing links on page $$I_i$$

    In other words: (1-d)/N plus d times the sum, over each page linking to A, of that page's PageRank divided by its number of outgoing links (see the sketch below)
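A minimal power-iteration sketch of this formula; the graph and all names are illustrative, and dangling pages (no outlinks) are not handled:

```python
def pagerank(links, d=0.85, iters=50):
    """Iterate PR(A) = (1-d)/N + d * sum(PR(I)/C(I)) to a fixed point.

    links -- dict mapping each page to the list of pages it links to
    d     -- damping factor (0.85 is the commonly cited value)
    """
    pages = list(links)
    N = len(pages)
    pr = {p: 1 / N for p in pages}
    for _ in range(iters):
        pr = {p: (1 - d) / N
                 + d * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
              for p in pages}
    return pr

# Tiny illustrative graph: A and B link to C, C links back to A.
# B gets only the baseline (1-d)/N, since nothing links to it.
print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))
```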

  • Hubs and Authorities: a page is a good authority if it is pointed to by many good hubs
  • A page is a good hub if it points to many good authorities

$$auth(p) = \sum_{q \to p} hub(q)$$ (sum over all pages q that link to p)

$$hub(p) = \sum_{p \to q} auth(q)$$ (sum over all pages q that p links to)
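A sketch of the iteration these two equations define, with L2 normalization each round so the scores stay bounded (graph and names illustrative):

```python
def hits(links, iters=50):
    """Alternate auth(p) = sum of hub over in-links, hub(p) = sum of auth over out-links."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# A and B both point to C, so C emerges as the strongest authority
hub, auth = hits({"A": ["C"], "B": ["C"], "C": []})
print(auth)  # {'A': 0.0, 'B': 0.0, 'C': 1.0}
```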

  • HITS (Hyperlink-Induced Topic Search): the algorithm that iterates these two equations until the scores converge
  • Latent Semantic Indexing: identifies topics rather than individual terms
  • Ways to optimize your page for Search Engines:
    1. Use text, not images or Flash for important content
    2. Check that pages still work with JavaScript, Java, and CSS disabled, since crawlers may not execute them
    3. Avoid links that look like form queries
    4. Have pages focusing on particular topics
    5. Have relevant sites link to yours
  • Link Spam: require user-submitted links to carry the rel="nofollow" attribute, which tells search engines to ignore them for ranking

Google File System (GFS)

  • Want clusters of commodity machines with local disks, connected by a network

Basic Physical Set Up

  1. Servers go into a rack
  2. Servers have local processing and disks
  3. Servers in a rack share a switch
  4. Switch provides intra- and inter-rack network connectivity
  5. Intra-rack bandwidth is cheap
  6. Inter-rack bandwidth is expensive

Failure Rates

  • MTTF (Mean Time to Failure): the length of time a device is expected to last
  • MTBF (Mean Time Between Failures): the average time between successive failures of a repairable device
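Example (illustrative numbers): if each server has an MTTF of 3 years (about 26,000 hours), a cluster of 1,000 such servers should expect a failure roughly every 26 hours, so failure handling must be routine rather than exceptional.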

Consistency Model

  • Defines what clients see if different clients simultaneously write to and read from the same file region
  • A file region is consistent if all clients always see the same data, regardless of which replica they read from
  • A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety

Failure Modes

  • If a chunkserver dies? Chunkservers report to the master every few seconds (heartbeat messages); when one stops reporting, the master can ask other chunkservers to re-replicate its data
  • If the master dies? A shadow master provides read-only access until the master restarts
  • If a network switch dies? Chunk replicas are spread across racks, so data stays reachable when a whole rack loses connectivity

MapReduce

  • "Data-centric programming"
  • Master, Grouper, Reducers, Mappers
  • Note: the key/value types output by map() are the same as the key/value types input to reduce()

Example:

  • Write mapper and reducer functions for computing the dot product of two large vectors

Solution: mapper function: (i, (a[i], b[i])) => (1, a[i]*b[i]), emitting every product under the single key 1

reducer function: (1, list of a[i]*b[i]) => (1, sum); a runnable sketch follows below
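A toy single-machine simulation of this job; the run_mapreduce harness and all names are illustrative, not a real MapReduce API:

```python
from itertools import groupby
from operator import itemgetter

def map_dot(i, pair):
    """For index i, emit the product of the two components under key 1."""
    a_i, b_i = pair
    yield (1, a_i * b_i)

def reduce_dot(key, values):
    """Sum all products that arrived under the shared key."""
    yield (key, sum(values))

def run_mapreduce(mapper, reducer, records):
    """Map every record, group the pairs by key, then reduce each group."""
    pairs = sorted((kv for rec in records for kv in mapper(*rec)),
                   key=itemgetter(0))
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reducer(key, [v for _, v in group]))
    return out

a, b = [1, 2, 3], [4, 5, 6]
print(run_mapreduce(map_dot, reduce_dot, list(enumerate(zip(a, b)))))  # [(1, 32)]
```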

Example:

  • Describe a mapper and reducer for word counting

Solution:

mapper function: Map(lineNo, textStr):
    for each word w in textStr: emit(w, 1)

reducer function: Reduce(w, interm-vals):
    int result = 0
    for each v in interm-vals: result += v
    emit(w, result)
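A runnable version, reusing the illustrative run_mapreduce harness from the dot-product sketch above:

```python
def map_wordcount(line_no, text):
    """Emit (word, 1) for each word occurrence; the line number is unused."""
    for w in text.split():
        yield (w, 1)

def reduce_wordcount(word, counts):
    """All counts for one word arrive together; sum them."""
    yield (word, sum(counts))

lines = list(enumerate(["the quick fox", "the lazy dog", "the fox"]))
print(run_mapreduce(map_wordcount, reduce_wordcount, lines))
# [('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```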

  • MapReduce execution consists of 2 main stages
  • Stage 1, map(): partition the input data and run map() on many machines in parallel, then group the emitted data by key
  • Stage 2, reduce(): each reducer receives all values grouped under its keys and combines them
  • Mapped values that should go to the same reducer must be emitted with the same key