Homework 5
Due March 20* (do what you can in the short time allotted)
The problems in this homework refer to the following f_kd matrix
(one row per document, one column per keyword):
D1: 2 0 0 0 0 4 0 0 2 1 1 0 0 0 0 0
D2: 0 0 0 0 1 1 0 0 7 0 2 0 0 0 0 6
D3: 0 0 1 0 0 1 0 0 3 0 3 0 2 0 3 0
D4: 0 3 1 0 0 0 0 5 0 0 0 3 0 0 0 0
D5: 0 0 0 1 0 0 0 0 0 0 0 0 0 0 3 0
D6: 0 0 0 0 1 0 0 0 3 0 0 2 0 0 0 0
D7: 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 3
D8: 0 0 1 0 0 0 0 3 0 0 0 2 0 1 0 0
D9: 0 1 2 0 0 0 2 0 3 0 0 0 2 1 0 1
D10: 0 0 0 0 1 1 0 0 0 1 0 5 0 0 0 0
For purposes of calculation, you are advised to load these values into
a spreadsheet program. Alternatively, you may prefer to write your own
code in C++ or Java to do the problems below.
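If you go the programming route, one possible way to hold the data is a plain list of rows. The sketch below uses Python purely for brevity (an assumption; the handout suggests C++ or Java), and the names F, NUM_DOCS and NUM_KEYWORDS are illustrative, not part of the assignment. The values are copied from the matrix above.

```python
# Frequency matrix f_kd from the handout.
# Each row is one document (D1..D10); each column is one of the 16 keywords.
F = [
    [2, 0, 0, 0, 0, 4, 0, 0, 2, 1, 1, 0, 0, 0, 0, 0],  # D1
    [0, 0, 0, 0, 1, 1, 0, 0, 7, 0, 2, 0, 0, 0, 0, 6],  # D2
    [0, 0, 1, 0, 0, 1, 0, 0, 3, 0, 3, 0, 2, 0, 3, 0],  # D3
    [0, 3, 1, 0, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 0, 0],  # D4
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0],  # D5
    [0, 0, 0, 0, 1, 0, 0, 0, 3, 0, 0, 2, 0, 0, 0, 0],  # D6
    [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3],  # D7
    [0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 2, 0, 1, 0, 0],  # D8
    [0, 1, 2, 0, 0, 0, 2, 0, 3, 0, 0, 0, 2, 1, 0, 1],  # D9
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 5, 0, 0, 0, 0],  # D10
]

NUM_DOCS = len(F)         # number of documents (rows)
NUM_KEYWORDS = len(F[0])  # number of keywords (columns)

# Guard against a typo when copying: every row must have the same width.
if any(len(row) != NUM_KEYWORDS for row in F):
    raise ValueError("rows of F have inconsistent lengths")

print(NUM_DOCS, "documents,", NUM_KEYWORDS, "keywords")
```

Checking the row widths right after typing the data in is a cheap way to catch a dropped or doubled entry before it silently skews every later result.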
1. Using the cosine similarity measure, compute the similarity matrix
(a 10 by 10 matrix). Find the average similarity, Sim.
(This was done for last week's homework -- you can use those results.)
2. Find the average document D*, and compute the average similarity of
each document D_i to D*.
3. Compute the value of Disc_k for each keyword (Eqn 3.28)
using both kinds of similarity (across all pairs as in #1 and using the
average document approach as in #2).
4. Compute the idf_k value for each keyword (Eqn 3.40).
5. Generate the matrix of tf-idf weights w_kd (Eqn 3.41).
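For those writing code instead of using a spreadsheet, the following is a minimal sketch of the quantities the five problems ask for. It is written in Python for brevity, and it rests on several assumptions that you should check against the textbook equations, which are not reproduced in this handout: the average similarity is taken over distinct pairs with the diagonal excluded; Disc_k is taken as the average similarity with keyword k removed minus the average similarity with all keywords present; idf_k is taken as log2(N / d_k), where d_k is the number of documents containing keyword k; and w_kd is taken as f_kd times idf_k. All function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(docs):
    """Problem 1: full document-by-document cosine similarity matrix."""
    return [[cosine(u, v) for v in docs] for u in docs]


def average_pairwise_similarity(docs):
    """Problem 1: mean similarity over distinct pairs i < j (assumed convention)."""
    n = len(docs)
    total = sum(cosine(docs[i], docs[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def centroid(docs):
    """Problem 2: the average document D*, i.e. the column-wise mean."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]


def average_centroid_similarity(docs):
    """Problem 2: mean similarity of each document to D*."""
    c = centroid(docs)
    return sum(cosine(d, c) for d in docs) / len(docs)


def drop_keyword(docs, k):
    """Copy of the matrix with keyword column k removed."""
    return [row[:k] + row[k + 1:] for row in docs]


def discrimination_values(docs, avg_sim):
    """Problem 3: assumed Disc_k = avg_sim(without k) - avg_sim(all keywords).

    Pass average_pairwise_similarity or average_centroid_similarity as avg_sim.
    """
    base = avg_sim(docs)
    return [avg_sim(drop_keyword(docs, k)) - base
            for k in range(len(docs[0]))]


def idf(docs):
    """Problem 4: assumed idf_k = log2(N / d_k); 0.0 if no document has k."""
    n = len(docs)
    values = []
    for col in zip(*docs):
        d_k = sum(1 for x in col if x > 0)
        values.append(math.log2(n / d_k) if d_k else 0.0)
    return values


def tf_idf(docs):
    """Problem 5: assumed w_kd = f_kd * idf_k."""
    weights = idf(docs)
    return [[f * w for f, w in zip(row, weights)] for row in docs]
```

Each function takes the matrix as a list of rows, one per document, so the 10 by 16 matrix above can be passed in directly, for example `discrimination_values(F, average_centroid_similarity)` if the matrix is stored in a variable named F. Passing the averaging function as a parameter lets problem 3 reuse the same code for both kinds of similarity. If the textbook's equations use a different logarithm base, sign convention, or averaging convention, only the lines marked as assumed need to change.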