Web Search and System Design
Three basic problems:
- Result relevance
- Processing speed
- Scaling to many documents
Result Relevance
- Most basic search is boolean queries: simply include or exclude a document from results
- But we need to give order to results, with ranking
- Hit Counting methods:
- Term frequency
- Term scarcity in collection
- Length of documents
- Vector Space Query Model
- Documents are points in a high-dimensional space with one axis per term
- Documents (points) that are close together are more similar
- Thus querying is the same as looking for docs near the query in this space
- Docs and Queries are vectors, with terms and their respective weights
- Weight of the kth term in document i: $$w_{ik} = tf_{ik} \cdot \log(N/n_k)$$, where N is the number of documents and $$n_k$$ is the number of documents containing term k
- The second part of this equation is the IDF (Inverse Document Frequency): provides high values for rare words, low values for common words
- Also normalize the weights so that longer documents are not given more weight simply for being long
- One way is to use cosine similarity, which divides the dot product by the vector lengths
Example: What is the result of the similarity computation with query vector Q = (0.4, 0.8) and document D2 = (0.2, 0.7)?
$$sim(Q, D_2) = \frac{(0.4)(0.2) + (0.8)(0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}}$$
$$= \frac{0.64}{\sqrt{0.424}} \approx 0.98$$
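The same computation as a runnable sketch; the toy corpus and tokenized documents below are made up for illustration, and the weighting and similarity functions follow the formulas above:

```python
import math

def tf_idf_weight(term, doc_terms, corpus):
    """Weight of a term in a document = term frequency * log(N / n_k)."""
    tf = doc_terms.count(term)                              # term frequency in this doc
    N = len(corpus)                                         # number of documents
    n_k = sum(1 for d in corpus if term in d)               # documents containing the term
    return tf * math.log(N / n_k)

def cosine_similarity(q, d):
    """Length-normalized dot product, so longer vectors are not favored."""
    dot = sum(qi * di for qi, di in zip(q, d))
    q_len = math.sqrt(sum(qi * qi for qi in q))
    d_len = math.sqrt(sum(di * di for di in d))
    return dot / (q_len * d_len)

corpus = [["web", "search", "ranking"], ["web", "design"]]   # illustrative toy corpus
print(tf_idf_weight("ranking", corpus[0], corpus))           # 1 * log(2/1) ≈ 0.69

# Reproduces the worked example: Q = (0.4, 0.8), D2 = (0.2, 0.7)
print(round(cosine_similarity([0.4, 0.8], [0.2, 0.7]), 2))   # 0.98
```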
Techniques to Evaluate Search Ranking
- Precision/Recall Curves
- Precision is % of selected items that are correct
- Recall is % of correct items that are selected
- There is a tradeoff between them, as the P/R curve shows: tuning the system to improve one typically lowers the other (see the sketch after this list)
- Kendall's Tau
- Compute the fraction of pairwise orderings that are consistent: $$\tau = \frac{n_c - n_d}{\frac{1}{2}n(n-1)}$$ where $$n_c$$ = number of consistent pairs, $$n_d$$ = number of inconsistent pairs, and $$n$$ = number of items ranked
Ex: Correct ordering: 1,2,3,4
Search Engine A: 1,2,4,3
Kendall's Tau = $$\frac{5 - 1}{\frac{1}{2} \cdot 4 \cdot (4-1)} = \frac{4}{6} \approx 0.67$$ (5 of the 6 pairs are ordered consistently, 1 is not)
- Mean Reciprocal Rank
- Average of 1/rank of the first relevant result, taken over all search queries
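A small sketch of the three evaluation measures above, assuming rankings and relevance judgments are plain Python lists; the final call reproduces the Kendall's Tau example:

```python
from itertools import combinations

def precision_recall(selected, relevant):
    """Precision = % of selected items that are correct; recall = % of correct items selected."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    return hits / len(selected), hits / len(relevant)

def kendalls_tau(correct_order, ranking):
    """tau = (n_c - n_d) / (0.5 * n * (n - 1)), comparing all pairwise orderings."""
    pos = {item: i for i, item in enumerate(ranking)}
    n_c = n_d = 0
    for a, b in combinations(correct_order, 2):   # (a, b) means a should precede b
        if pos[a] < pos[b]:
            n_c += 1
        else:
            n_d += 1
    n = len(correct_order)
    return (n_c - n_d) / (0.5 * n * (n - 1))

def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank of the first relevant result over a set of queries."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

print(kendalls_tau([1, 2, 3, 4], [1, 2, 4, 3]))   # (5 - 1) / 6 = 0.666...
```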
Link/Graph Analysis
Measures of Prestige and Importance
- Degree centrality: whoever has highest degree wins
- Closeness centrality: take the sum of the distances to all other nodes and invert it
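A minimal sketch of both centrality measures; the small adjacency-list graph is invented for illustration:

```python
from collections import deque

graph = {              # undirected toy graph as adjacency lists (illustrative only)
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B"],
    "D": ["A"],
}

def degree_centrality(g, node):
    """Highest degree 'wins': simply count a node's neighbors."""
    return len(g[node])

def closeness_centrality(g, node):
    """Inverse of the sum of shortest-path distances to all other nodes."""
    dist = {node: 0}
    queue = deque([node])
    while queue:                      # BFS gives shortest paths in an unweighted graph
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return 1.0 / sum(d for n, d in dist.items() if n != node)

print(degree_centrality(graph, "A"), closeness_centrality(graph, "A"))  # 3, 0.333...
```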
PageRank: $$PR(A) = \frac{1-d}{N} + d \sum_i \frac{PR(I_i)}{C(I_i)}$$
In other words, (1-d)/N plus d times the sum, over every page I_i that links to A, of PR(I_i) divided by C(I_i), the number of outgoing links of I_i
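A power-iteration sketch of the PageRank formula, assuming the graph is a dict from each page to its outgoing links; the damping factor d = 0.85 and the tiny link graph are assumptions for illustration, not from the notes:

```python
def pagerank(links, d=0.85, iterations=50):
    """PR(A) = (1 - d)/N + d * sum over in-neighbors I of PR(I)/C(I),
    where C(I) is the number of outgoing links of I."""
    pages = list(links)
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}            # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = (pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) / N + d * sum(incoming)
        pr = new_pr
    return pr

# Tiny illustrative link graph: A -> B, A -> C, B -> C, C -> A
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```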
- Hubs and Authorities: Page is good authority if it is pointed-to by many good hubs
- Page is good hub if it points to many good authorities
$$auth(p) = \sum_{i=1}^{n} hub(i)$$, summed over the n pages i that point to p
$$hub(p) = \sum_{i=1}^{n} auth(i)$$, summed over the n pages i that p points to
- HITS (Hyperlink-Induced Topic Search) is the algorithm that iterates these two updates
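A sketch of the HITS iteration using the same dict-of-out-links representation as the PageRank sketch above; the normalization step is a standard addition to keep the scores bounded:

```python
import math

def hits(links, iterations=20):
    """auth(p) = sum of hub scores of pages pointing to p;
    hub(p) = sum of authority scores of pages p points to."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # normalize so scores do not grow without bound
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

hub, auth = hits({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(auth)   # C is pointed to by both A and B, so it ends up with the top authority score
```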
- Latent Semantic Indexing: identifies topics (latent concepts) instead of individual terms
- Ways to optimize your page for Search Engines:
- Use text, not images or Flash for important content
- Check that pages still work with JavaScript, Java, and CSS disabled, since crawlers may not execute them
- Avoid links that look like form queries
- Have pages focusing on particular topics
- Have relevant sites link to yours
- Fighting link spam: require that user-submitted links carry the rel="nofollow" attribute, which tells search engines to ignore them for ranking
Google File System (GFS)
- Built from clusters of commodity machines with local disks, connected by a network
Basic Physical Set Up
- Servers go into a rack
- Servers have local processing and disks
- Servers in a rack share a switch
- Switch provides intra- and inter-rack network connectivity
- Intra-rack bandwidth is cheap
- Inter-rack bandwidth is expensive
Failure Rates
- MTTF (Mean Time to Failure): the length of time a device is expected to last
- MTBF (Mean Time Between Failures): the average time between successive failures of a repairable device
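A back-of-the-envelope sketch of why failures are routine at cluster scale; the fleet size and per-server MTTF below are assumed, purely illustrative numbers:

```python
# Assumed numbers for illustration: 10,000 servers, per-server MTTF of 3 years.
servers = 10_000
mttf_days = 3 * 365            # mean time to failure per server, in days

failures_per_day = servers / mttf_days
print(failures_per_day)        # ~9 failures/day: with enough machines, something is always broken
```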
Consistency Model
- Defines what different clients see if they simultaneously write to and read from the same file region
- A file region is consistent if all clients always see the same data, regardless of which replicas they read from
- A region is defined after a file data mutation if it is consistent and clients see what the mutation wrote in its entirety
Failure Modes
- If a chunkserver dies? Chunkservers report to the master every few seconds (heartbeat messages); when the heartbeats stop, the master asks other chunkservers to re-replicate the lost chunks
- If the master dies? A shadow master provides read-only access until the master restarts
- If a network switch dies? Chunk replicas are spread across racks, so the data stays reachable through other racks
MapReduce
- "Data-centric programming"
- Master, Grouper, Reducers, Mappers
- Note: the output (key, value) types of map() are the same as the input types to reduce()
Example:
- Write mapper and reducer functions for computing the dot product of two large vectors
Solution: mapper function: (a[i], b[i]) => emit (1, a[i]*b[i]) for each index i
reducer function: (1, list of a[i]*b[i] values) => emit (1, sum of the list)
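A runnable sketch of this dot-product job, simulating the map and reduce phases in plain Python; the single shared key 1 routes every partial product to one reducer, as in the solution above:

```python
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# Map phase: one (key, value) pair per index, all emitted under the same key.
mapped = [(1, ai * bi) for ai, bi in zip(a, b)]

# Reduce phase: a single reducer sees every value for key 1 and sums them.
def reduce_dot(key, values):
    return (key, sum(values))

print(reduce_dot(1, [v for _, v in mapped]))   # (1, 32.0)
```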
Example:
- Describe a mapper and reducer for word counting
Solution:
Map(lineNo, textStr):
    for each word w in textStr: emit(w, 1)
Reduce(w, intermVals):
    int result = 0
    for each v in intermVals: result += v
    emit(w, result)
- MapReduce execution consists of 2 main stages.
- Stage 1, map(): partition the input data and run map() on many machines, then group the intermediate data by key
- Stage 2, reduce(): run reduce() once per key over that key's grouped values
- Mapped values that should go to the same reducer must be emitted with the same key (see the sketch below)
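A minimal end-to-end sketch of the two stages using the word-count job above; the in-memory dict that groups values by key stands in for the shuffle the framework performs across machines:

```python
from collections import defaultdict

def map_fn(line_no, text):
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog"]

# Stage 1: run the mapper over the (partitioned) input, then group intermediate values by key.
groups = defaultdict(list)
for i, line in enumerate(lines):
    for key, value in map_fn(i, line):
        groups[key].append(value)

# Stage 2: run the reducer once per key.
print([reduce_fn(w, vals) for w, vals in groups.items()])  # [('the', 2), ('quick', 1), ...]
```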