Workshop: Week 7: Language Models
===============================================================

Question 1: KL divergence
=========================

In lecture 12, slide 12, we calculated the log probability of
Shakespeare's play, "All's Well that Ends Well", for each of five
different source texts.  Redo this calculation, but using the KL
divergence between the play, $p$, and each of the source texts, $T$:

    KL_D(M_p || M_T)

Some words will occur in the play that don't occur in the source
texts.  We will therefore have to smooth each text model with a
background model, $M_B$:

    M_T' = (1 - \lambda) M_T + \lambda M_B

and use $M_T'$ in place of $M_T$ in the KL divergence calculation.
(You don't need to do this for the play model (why?).)  A reasonable
way to form $M_B$ is to aggregate the counts of term occurrences in
all six texts (the five source texts and the play), and then form an
MLE estimator from these counts.  (In principle, \lambda should be
smaller where the source text $T$ is larger, since we then have a
larger sample of the desired language, but we won't go into this
detail here.)

The frequency data for each of the source texts can be found at:

    http://www.williamwebber.com/comp90042/wksh/wk07/dat/txt-freq.zip

Calculate KL_D between the play and each of the source texts, for
\lambda = 0.4.  Does the ranking agree with that given in the lecture?
Try \lambda = 0.1 and \lambda = 0.9.  Does the ranking change?  Why or
why not?

IMPLEMENTATION NOTE: You may find it useful to define the following
functions (sketched below):

  load_freq(fname):
      Load a frequency file into a {term: freq} dictionary.

  freq_to_prop(freq_dict):
      Take a {term: freq} dictionary and produce a {term: prop}
      dictionary, where prop is the proportion of times the term
      appears (all props in one dictionary should sum to 1).

  sum_freq_dicts(freq_dict_list):
      Sum the term frequencies in a list of {term: freq} dictionaries,
      and return the result.

  smooth_models(fg_mdl, bg_mdl, lmbda):
      Smooth a foreground model (a {term: prop} dictionary) with a
      background model, using the formula

          M_S = (1 - \lambda) M_F + \lambda M_B

      and return the result.

  kl_div(p, q):
      Calculate the KL divergence between two distributions (expressed
      as {term: prop} dictionaries).
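As a concrete starting point, here is a minimal Python sketch of these
helpers.  It assumes the frequency files hold one whitespace-separated
"term count" pair per line; check the actual format of the files in
txt-freq.zip and adjust load_freq() accordingly.

    import math

    def load_freq(fname):
        """Load a frequency file into a {term: freq} dictionary."""
        freqs = {}
        with open(fname) as f:
            for line in f:
                # Assumed format: one "term count" pair per line.
                term, count = line.split()
                freqs[term] = freqs.get(term, 0) + int(count)
        return freqs

    def freq_to_prop(freq_dict):
        """Turn raw counts into an MLE model (props sum to 1)."""
        total = float(sum(freq_dict.values()))
        return {term: freq / total for term, freq in freq_dict.items()}

    def sum_freq_dicts(freq_dict_list):
        """Sum term frequencies across {term: freq} dictionaries."""
        summed = {}
        for fd in freq_dict_list:
            for term, freq in fd.items():
                summed[term] = summed.get(term, 0) + freq
        return summed

    def smooth_models(fg_mdl, bg_mdl, lmbda):
        """M_S = (1 - lambda) M_F + lambda M_B.  Iterates over the
        background vocabulary, which here covers the foreground's."""
        return {term: (1 - lmbda) * fg_mdl.get(term, 0.0) + lmbda * bg
                for term, bg in bg_mdl.items()}

    def kl_div(p, q):
        """KL divergence D(p || q); q must be non-zero wherever p is."""
        return sum(prob * math.log(prob / q[term])
                   for term, prob in p.items() if prob > 0)

Note that kl_div() relies on the smoothing above to guarantee that $q$
assigns non-zero probability to every term in $p$; terms with zero
probability in $p$ contribute nothing to the sum, which is why the
play model itself needs no smoothing.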
Question 2: KL divergence and LM
================================

In lecture 13, slide 6, we said that calculating document--query match
using the KL divergence between the document model and the MLE query
model:

    R(q, D) = - KL_D(M_q || M_D)

is rank equivalent to regular LM scoring:

    S(q, D) = \prod_i P(q_i | M_D)

Prove the rank equivalence.

Question 3: Language model querying
===================================

Implement an LM search engine.  Use the LYRL30k dataset as the corpus.
For the document model, use:

    M_D' = (1 - \lambda) M_D + \lambda M_C

Set \lambda to 0.4.  Calculate the score of a document $D$ for query
$q$ as:

    R(D, q; C) = \sum_i \log P(q_i | M_D')

(Why do we do this rather than directly multiplying the probabilities:

    S(D, q; C) = \prod_i P(q_i | M_D')

? What is the relationship between R() and S()?)

Give the depth-10 ranking returned for the query "oil price iraq
tension".

IMPLEMENTATION NOTE: The file:

    http://www.williamwebber.com/comp90042/wksh/wk07/code/coll.py

contains parse_lyrl_coll(fname), which parses the LYRL30k dataset into
a {docid: docbow} dictionary, where each docbow is a {term: freq}
dictionary.  Use this to load up the dataset.  Iterate through the
documents, and reuse freq_to_prop() from Question 1 to create a
foreground model for each document.  Then invert this to create a
nested {term: {doc: P(t | D)}} dictionary.  Write this out to a shelve
file (or other index).

Reuse sum_freq_dicts() and freq_to_prop() from Question 1 to create
your background model, and write it out to a file (using the pickle
module).  Note that you should _not_ smooth the document models before
storing them in the index, as this would create a dictionary for each
document that is as large as the collection vocabulary.

At query processing time, process terms one at a time, and keep an
accumulator dictionary of documents, adding each term's contribution
of \log P(t | M_D') to that document's score, as with standard
inverted-index processing.  Note, though, that you have to take some
care.  If a query term does not appear in a document, the document
still gets a score for that term from the background model.  We
obviously don't want to score documents that contain _no_ query terms;
but if a document contains _any_ query terms, then for each query term
it lacks it should still get the background score of:

    \log(\lambda P(t | M_C))

An easy (though slightly inefficient) way to do this is to first
determine the list of documents that have at least one query term
(that is, create an accumulator dictionary in which every candidate
document starts with a score of 0.0).  Then, when processing a query
term, iterate through the list of candidate documents, _not_ the list
of documents the term appears in.  If the term does not occur in a
candidate document, give that document the background score.  (A
sketch of this query-processing loop appears at the end of this
sheet.)

Question 4: Relevance model
===========================

In lecture 13, we saw how one could build a "model of relevance" using
pseudo-relevance feedback.  We used this model to improve search
results, and also to perform CLIR.  We can also directly interrogate
the model to see which terms have a high weight in it; this helps us
interpret the model.

Write code to calculate a relevance model using the formula given by
Lavrenko and Croft (2001) (lecture 13, slide 15).  Set their parameter
$\lambda$ to 0.6.  Instantiate the model (that is, calculate the
actual P(w | q; {\cal F})) for the words in the feedback vocabulary
(w \in V_{\cal F}).  Report the top 20 highest-weighted terms in the
relevance model for "oil price iraq tension", built from the top 10
results returned in Question 3.
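To make the Question 3 query-processing description concrete, here is
a minimal sketch of the scoring loop.  The names score_query, index,
and bg_mdl are assumptions, not given code: index is the nested
{term: {docid: P(t | M_D)}} dictionary built above (a shelve object
also works here), and bg_mdl is the {term: P(t | M_C)} background
model.

    import math

    def score_query(query, index, bg_mdl, lmbda=0.4, depth=10):
        # Drop query terms absent from the collection vocabulary: they
        # would contribute log(0) to every candidate equally.
        terms = [t for t in query if t in bg_mdl]
        # Candidate documents are those containing at least one query
        # term; every candidate starts with an accumulator score of 0.0.
        accum = {}
        for t in terms:
            for docid in index.get(t, {}):
                accum[docid] = 0.0
        # Iterate over the candidates for each term, not over the
        # term's posting list, so that documents lacking the term still
        # receive the background score log(lambda * P(t | M_C)).
        for t in terms:
            postings = index.get(t, {})
            bg = lmbda * bg_mdl[t]
            for docid in accum:
                p = (1 - lmbda) * postings.get(docid, 0.0) + bg
                accum[docid] += math.log(p)
        ranking = sorted(accum.items(), key=lambda kv: kv[1],
                         reverse=True)
        return ranking[:depth]

And a sketch for the Question 4 relevance model.  The authoritative
formula is on lecture 13, slide 15; the version below assumes the
common Lavrenko-and-Croft form

    P(w | q; {\cal F}) \propto \sum_{D \in {\cal F}}
        P(w | M_D') \prod_i P(q_i | M_D')

with M_D' smoothed using \lambda = 0.6, so check it against the slide
before relying on it.  Here doc_mdls is assumed to map docid to the
un-inverted, unsmoothed {term: P(t | M_D)} foreground model from the
indexing step.

    def relevance_model(query, fb_docids, doc_mdls, bg_mdl, lmbda=0.6):
        """Estimate P(w | q; F) over the feedback vocabulary V_F."""
        def p_smooth(term, mdl):
            return ((1 - lmbda) * mdl.get(term, 0.0)
                    + lmbda * bg_mdl.get(term, 0.0))
        rm = {}
        for docid in fb_docids:
            mdl = doc_mdls[docid]
            # The query likelihood under this document weights its
            # contribution to the relevance model.
            q_lik = 1.0
            for t in query:
                q_lik *= p_smooth(t, mdl)
            for term in mdl:  # V_F: terms in the feedback documents
                rm[term] = rm.get(term, 0.0) + p_smooth(term, mdl) * q_lik
        # Normalize so the estimates form a probability distribution.
        total = sum(rm.values())
        return {term: w / total for term, w in rm.items()}

Usage might then look like (names hypothetical):

    query = "oil price iraq tension".split()
    top10 = score_query(query, index, bg_mdl)      # from Question 3
    rm = relevance_model(query, [d for d, s in top10], doc_mdls, bg_mdl)
    top20 = sorted(rm.items(), key=lambda kv: kv[1], reverse=True)[:20]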