Homework 5

Due March 20* (do what you can in the short time allotted)


The problems in this homework refer to the following f_kd matrix (each row is one of the documents D1 through D10; each of the 16 columns is a keyword):
 
 D1:  2   0   0   0   0   4   0   0   2   1   1   0   0   0   0   0
 D2:  0   0   0   0   1   1   0   0   7   0   2   0   0   0   0   6
 D3:  0   0   1   0   0   1   0   0   3   0   3   0   2   0   3   0
 D4:  0   3   1   0   0   0   0   5   0   0   0   3   0   0   0   0
 D5:  0   0   0   1   0   0   0   0   0   0   0   0   0   0   3   0
 D6:  0   0   0   0   1   0   0   0   3   0   0   2   0   0   0   0
 D7:  1   0   0   0   0   0   1   0   0   0   0   0   0   0   0   3
 D8:  0   0   1   0   0   0   0   3   0   0   0   2   0   1   0   0
 D9:  0   1   2   0   0   0   2   0   3   0   0   0   2   1   0   1
D10:  0   0   0   0   1   1   0   0   0   1   0   5   0   0   0   0
For purposes of calculation, you are advised to load these values into a spreadsheet program. Alternatively, you may prefer to write your own code in C++ or Java to do the problems below.
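
If you write code instead of using a spreadsheet, the matrix can be typed in directly as a list of rows. The sketch below uses Python purely for illustration (the assignment itself suggests C++ or Java); the names F, n_docs, and n_keywords are hypothetical and not part of the assignment.

```python
# f_kd matrix from the assignment: one row per document (D1..D10),
# one column per keyword (16 keywords, in the order printed above).
# The variable names here are illustrative assumptions.
F = [
    [2, 0, 0, 0, 0, 4, 0, 0, 2, 1, 1, 0, 0, 0, 0, 0],  # D1
    [0, 0, 0, 0, 1, 1, 0, 0, 7, 0, 2, 0, 0, 0, 0, 6],  # D2
    [0, 0, 1, 0, 0, 1, 0, 0, 3, 0, 3, 0, 2, 0, 3, 0],  # D3
    [0, 3, 1, 0, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 0, 0],  # D4
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0],  # D5
    [0, 0, 0, 0, 1, 0, 0, 0, 3, 0, 0, 2, 0, 0, 0, 0],  # D6
    [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3],  # D7
    [0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 2, 0, 1, 0, 0],  # D8
    [0, 1, 2, 0, 0, 0, 2, 0, 3, 0, 0, 0, 2, 1, 0, 1],  # D9
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 5, 0, 0, 0, 0],  # D10
]

n_docs = len(F)
n_keywords = len(F[0])

# Guard against transcription slips: every row must have the same width.
if any(len(row) != n_keywords for row in F):
    raise ValueError("rows of F have inconsistent lengths")
```

Checking the row lengths once at load time is a cheap way to catch a dropped or doubled entry before it distorts every later result.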


1. Using the cosine similarity measure, compute the similarity matrix (a 10 by 10 matrix). Find the average similarity Sim. (This was done for last week's homework; you can use those results.)

2. Find the average document D* (the componentwise mean of the ten document vectors), and compute the average similarity of each Di to D*.

3. Compute the value of Disc_k for each keyword (Eqn 3.28) using both kinds of similarity (across all pairs as in #1 and using the average document approach as in #2).

4. Compute the idf_k value for each keyword (Eqn 3.40).

5. Generate the matrix of tf-idf weights w_kd (Eqn 3.41).
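
The five problems above chain together, and the rough Python sketch below shows one way the pieces might fit. It runs on a small made-up matrix (toy), not on the homework data. The formulas are assumptions, since Eqns 3.28, 3.40, and 3.41 are not reproduced in this handout: the average similarity is taken over distinct pairs i < j, Disc_k is taken as (average similarity with keyword k removed) minus (average similarity with all keywords), idf_k as log2(N / d_k) where d_k is the number of documents containing keyword k, and w_kd as f_kd times idf_k. Check each against the text and adjust the sign, log base, or any additive constant as needed. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def avg_pairwise_sim(docs):
    """Problem 1: mean cosine over distinct pairs i < j (assumed convention)."""
    n = len(docs)
    sims = [cosine(docs[i], docs[j]) for i in range(n) for j in range(i + 1, n)]
    return sum(sims) / len(sims)

def centroid(docs):
    """Problem 2: average document D*, the componentwise mean of the rows."""
    return [sum(col) / len(docs) for col in zip(*docs)]

def avg_centroid_sim(docs):
    """Problem 2: mean cosine between each document and D*."""
    c = centroid(docs)
    return sum(cosine(d, c) for d in docs) / len(docs)

def drop_keyword(docs, k):
    """Copy of the matrix with keyword column k removed."""
    return [row[:k] + row[k + 1:] for row in docs]

def discrimination_values(docs, avg_sim):
    """Problem 3 (assumed form): Disc_k = avg_sim without keyword k - avg_sim with it."""
    base = avg_sim(docs)
    return [avg_sim(drop_keyword(docs, k)) - base for k in range(len(docs[0]))]

def idf(docs):
    """Problem 4 (assumed form): idf_k = log2(N / d_k); 0.0 if no document has k."""
    n = len(docs)
    df = [sum(1 for f in col if f > 0) for col in zip(*docs)]
    return [math.log2(n / d) if d else 0.0 for d in df]

def tf_idf(docs):
    """Problem 5 (assumed form): w_kd = f_kd * idf_k."""
    weights = idf(docs)
    return [[f * w for f, w in zip(row, weights)] for row in docs]

# Made-up 3-document, 2-keyword matrix, for illustration only.
toy = [[1, 0], [0, 1], [1, 1]]

disc_pairs = discrimination_values(toy, avg_pairwise_sim)
disc_centroid = discrimination_values(toy, avg_centroid_sim)
weights = tf_idf(toy)
```

Passing the averaging function into discrimination_values lets the same code serve both variants asked for in problem 3. To run it on the homework data, replace toy with the 10 by 16 matrix of f_kd values.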