Web Search and System Design

Three basic problems:

  1. Result relevance
  2. Processing speed
  3. Scaling to many documents

Result Relevance

  • The most basic search uses boolean queries: a document is simply included in or excluded from the results
  • But we also need to order the results, which requires ranking
  • Hit-counting factors used in ranking:
    1. Term frequency
    2. Term scarcity in collection
    3. Length of documents
  • Vector Space Query Model
    • Documents are points in a high-dimensional space, with one axis per word in the vocabulary
    • Documents (points) that are close together are more similar
    • Thus querying is the same as looking for docs near the query in this space
    • Docs and Queries are vectors, with terms and their respective weights
    • Weight of the kth term in document i: $$w_{ik} = tf_{ik} \cdot \log(N/n_k)$$, where $$tf_{ik}$$ is the term frequency, $$N$$ is the number of documents in the collection, and $$n_k$$ is the number of documents containing term k (a small sketch follows this list)
      • The $$\log(N/n_k)$$ factor is the IDF (Inverse Document Frequency): it gives high values for rare words and low values for common words
    • Also be sure to normalize weights so that longer docs are not given more weight
    • One way is to use cosine similarity, which normalizes by vector length (see the worked example and sketch below)
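A minimal sketch of this weighting, assuming natural log (the base only rescales the weights) and illustrative names:

```python
import math

def tfidf_weight(term_freq, N, n_k):
    """Weight of term k in a document: tf * log(N / n_k).

    term_freq -- occurrences of term k in the document
    N         -- total number of documents in the collection
    n_k       -- number of documents containing term k
    """
    return term_freq * math.log(N / n_k)

# At the same term frequency, a rare word (in 10 of 1000 docs)
# far outweighs a common one (in 900 of 1000 docs).
print(tfidf_weight(3, 1000, 10))   # ~13.8
print(tfidf_weight(3, 1000, 900))  # ~0.32
```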

Example: What is the result of the similarity computation with query vector Q = (0.4, 0.8) and document D2 = (0.2, 0.7)?

$$sim(Q, D_2) = \frac{(0.4)(0.2) + (0.8)(0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}}$$

$$= \frac{0.64}{\sqrt{0.424}} \approx 0.98$$
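The same computation in code; a small sketch of cosine similarity in plain Python (names are illustrative):

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between query vector q and document vector d."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norms = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / norms

# Reproduces the worked example: Q = (0.4, 0.8), D2 = (0.2, 0.7)
print(cosine_similarity((0.4, 0.8), (0.2, 0.7)))  # ~0.98
```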

Techniques to Evaluate Search Ranking

  1. Precision/Recall Curves
    • Precision is % of selected items that are correct
    • Recall is % of correct items that are selected
    • Precision and recall trade off against each other, as the P/R curve shows: tuning a system to improve one typically hurts the other
  2. Kendall's Tau
    • Compute the fraction of pairwise orderings that are consistent: $$\tau = \frac{n_c - n_d}{\frac{1}{2}n(n-1)}$$ where $$n_c$$ = number of consistent (concordant) pairs, $$n_d$$ = number of inconsistent (discordant) pairs, and $$n$$ = number of ranked items, so $$\frac{1}{2}n(n-1)$$ is the total number of pairs

Ex: Correct ordering: 1,2,3,4

Search Engine A: 1,2,4,3

Of the 6 pairs, only (3,4) is ordered inconsistently, so $$n_c = 5$$ and $$n_d = 1$$:

Kendall's Tau = $$\frac{5 - 1}{\frac{1}{2}(4)(4-1)} = \frac{4}{6} \approx 0.67$$

  3. Mean Reciprocal Rank
    • The mean over all search queries of 1/rank of the first relevant result (see the sketch below)
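A small sketch of both metrics, using illustrative function names:

```python
from itertools import combinations

def kendalls_tau(ranking, correct):
    """(n_c - n_d) / (n(n-1)/2), computed over all pairs of ranked items."""
    pos = {item: i for i, item in enumerate(ranking)}
    true_pos = {item: i for i, item in enumerate(correct)}
    n_c = n_d = 0
    for a, b in combinations(correct, 2):
        # A pair is consistent if both orderings agree on a-vs-b
        if (pos[a] < pos[b]) == (true_pos[a] < true_pos[b]):
            n_c += 1
        else:
            n_d += 1
    n = len(correct)
    return (n_c - n_d) / (0.5 * n * (n - 1))

def mean_reciprocal_rank(first_relevant_ranks):
    """Mean of 1/rank of the first relevant result, one rank per query."""
    return sum(1 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

print(kendalls_tau([1, 2, 4, 3], [1, 2, 3, 4]))  # (5 - 1) / 6 = 0.666...
print(mean_reciprocal_rank([1, 3, 2]))           # (1 + 1/3 + 1/2) / 3 ≈ 0.61
```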

Link/Graph Analysis

Measures of Prestige and Importance

  • Degree centrality: the node with the highest degree (number of links) is the most important
  • Closeness centrality: take the sum of the distances to all other nodes and take the inverse
  • PageRank: $$PR(A) = \frac{1-d}{N} + d \sum_i \frac{PR(I_i)}{C(I_i)}$$ where $$d$$ is the damping factor, $$N$$ is the total number of pages, the $$I_i$$ are the pages linking to A, and $$C(I_i)$$ is the number of outgoing links on page $$I_i$$

    In other words: (1-d)/N plus d times the sum, over each page linking to A, of that page's PageRank divided by its number of outgoing links (see the sketch below)
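A minimal power-iteration sketch of this formula; the graph and all names are illustrative, and dangling pages (no outlinks) are not handled:

```python
def pagerank(links, d=0.85, iters=50):
    """Iterate PR(A) = (1-d)/N + d * sum(PR(I)/C(I)) to a fixed point.

    links -- dict mapping each page to the list of pages it links to
    d     -- damping factor (0.85 is the commonly cited value)
    """
    pages = list(links)
    N = len(pages)
    pr = {p: 1 / N for p in pages}
    for _ in range(iters):
        pr = {p: (1 - d) / N
                 + d * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
              for p in pages}
    return pr

# Tiny illustrative graph: A and B link to C, C links back to A.
# B gets only the baseline (1-d)/N, since nothing links to it.
print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))
```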

  • Hubs and Authorities: a page is a good authority if it is pointed to by many good hubs
  • A page is a good hub if it points to many good authorities

$$auth(p) = \sum_{q \to p} hub(q)$$ (sum over all pages q that link to p)

$$hub(p) = \sum_{p \to q} auth(q)$$ (sum over all pages q that p links to)
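A sketch of the iteration these two equations define, with L2 normalization each round so the scores stay bounded (graph and names illustrative):

```python
def hits(links, iters=50):
    """Alternate auth(p) = sum of hub over in-links, hub(p) = sum of auth over out-links."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# A and B both point to C, so C emerges as the strongest authority
hub, auth = hits({"A": ["C"], "B": ["C"], "C": []})
print(auth)  # {'A': 0.0, 'B': 0.0, 'C': 1.0}
```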

  • HITS (Hyperlink-Induced Topic Search): the algorithm that iterates these two equations until the scores converge
  • Latent Semantic Indexing: identifies topics rather than individual terms
  • Ways to optimize your page for Search Engines:
    1. Use text, not images or Flash for important content
    2. Check that pages still work with JavaScript, Java, and CSS disabled, since crawlers may not execute them
    3. Avoid links that look like form queries
    4. Have pages focusing on particular topics
    5. Have relevant sites link to yours
  • Link Spam: require user-submitted links to carry the rel="nofollow" attribute, which tells search engines to ignore them for ranking

Google File System (GFS)

  • Want clusters of commodity machines with local disks, connected by a network

Basic Physical Set Up

  1. Servers go into a rack
  2. Servers have local processing and disks
  3. Servers in a rack share a switch
  4. Switch provides intra- and inter-rack network connectivity
  5. Intra-rack bandwidth is cheap
  6. Inter-rack bandwidth is expensive

Failure Rates

  • MTTF (Mean Time to Failure): the length of time a device is expected to last
  • MTBF (Mean Time Between Failures): the average time between successive failures of a repairable device
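Example (illustrative numbers): if each server has an MTTF of 3 years (about 26,000 hours), a cluster of 1,000 such servers should expect a failure roughly every 26 hours, so failure handling must be routine rather than exceptional.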

Consistency Model

  • Defines what clients see if different clients simultaneously write to and read from the same file region
  • A file region is consistent if all clients always see the same data, regardless of which replica they read from
  • A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety

Failure Modes

  • If a chunkserver dies? Chunkservers report to the master every few seconds (heartbeat messages); when one stops reporting, the master can ask other chunkservers to re-replicate its data
  • If the master dies? A shadow master provides read-only access until the master restarts
  • If a network switch dies? Chunk replicas are spread across racks, so data stays reachable when a whole rack loses connectivity

MapReduce

  • "Data-centric programming"
  • Master, Grouper, Reducers, Mappers
  • Note: the key/value types output by map() are the same as the key/value types input to reduce()

Example:

  • Write mapper and reducer functions for computing the dot product of two large vectors

Solution: mapper function: (i, (a[i], b[i])) => (1, a[i]*b[i]), emitting every product under the single key 1

reducer function: (1, list of a[i]*b[i]) => (1, sum); a runnable sketch follows below
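A toy single-machine simulation of this job; the run_mapreduce harness and all names are illustrative, not a real MapReduce API:

```python
from itertools import groupby
from operator import itemgetter

def map_dot(i, pair):
    """For index i, emit the product of the two components under key 1."""
    a_i, b_i = pair
    yield (1, a_i * b_i)

def reduce_dot(key, values):
    """Sum all products that arrived under the shared key."""
    yield (key, sum(values))

def run_mapreduce(mapper, reducer, records):
    """Map every record, group the pairs by key, then reduce each group."""
    pairs = sorted((kv for rec in records for kv in mapper(*rec)),
                   key=itemgetter(0))
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reducer(key, [v for _, v in group]))
    return out

a, b = [1, 2, 3], [4, 5, 6]
print(run_mapreduce(map_dot, reduce_dot, list(enumerate(zip(a, b)))))  # [(1, 32)]
```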

Example:

  • Describe a mapper and reducer for word counting

Solution:

mapper function: Map(lineNo, textStr):
    for each word w in textStr: emit(w, 1)

reducer function: Reduce(w, interm-vals):
    int result = 0
    for each v in interm-vals: result += v
    emit(w, result)
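A runnable version, reusing the illustrative run_mapreduce harness from the dot-product sketch above:

```python
def map_wordcount(line_no, text):
    """Emit (word, 1) for each word occurrence; the line number is unused."""
    for w in text.split():
        yield (w, 1)

def reduce_wordcount(word, counts):
    """All counts for one word arrive together; sum them."""
    yield (word, sum(counts))

lines = list(enumerate(["the quick fox", "the lazy dog", "the fox"]))
print(run_mapreduce(map_wordcount, reduce_wordcount, lines))
# [('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```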

  • MapReduce execution consists of 2 main stages
  • Stage 1, map(): partition the input data and run map() on many machines in parallel, then group the emitted data by key
  • Stage 2, reduce(): each reducer receives all values grouped under its keys and combines them
  • Mapped values that should go to the same reducer must be emitted with the same key