Workshop: Week 5

PRELIM 1. RCV1v2 judgments.
===========================

The file:

   http://www.williamwebber.com/comp90042/wksh/wk05/dat/lyrl30k_tpcs.txt.gz

(also available as:

   http://www.williamwebber.com/comp90042/wksh/wk05/dat/lyrl30k_tpcs.txt.zip
   http://www.williamwebber.com/comp90042/wksh/wk05/dat/lyrl30k_tpcs.txt

) contains the LYRL30k documents for each RCV1v2 topic, one line per
topic, with the first token being the topic code, and the remainder the
docids. (If you want to know what the topic codes mean, see:

   http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a02-orig-topics-hierarchy/rcv1.topics.hier.orig

) Download and unpack this data, since you'll need it for later
questions.

For the binary classification experiments in this workshop, we'll use
Topic GCAT ("government/social"). Around 1/3 of documents fall into
this category, making it an easy one to find sufficient training data
for. (Roughly speaking, a binary classifier trains only as well as the
number of minority-class examples it sees allows.) For these binary
classification tasks, the two classes will be "GCAT" and "not GCAT".

PRELIM 2. Scikit-learn.
=======================

There is a handy machine learning package for Python called
scikit-learn. We'll use this for more advanced classification
algorithms such as SVMs. Please install the package. (Note that it has
quite a few dependencies; if you can't get it installed, don't worry
for now, and just skip the questions that require it; but we will need
it later, so please try to get it installed in your own time.) When you
have it installed, run through the brief tutorial at:

   http://scikit-learn.org/stable/tutorial/basic/tutorial.html

to understand the basics of the package (and also to make sure that it
is working for you).
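Loading the judgments file might look like the following sketch, which assumes
the file has been downloaded and gunzipped to lyrl30k_tpcs.txt in the working
directory (the function name is my own choice):

```python
def load_topic_judgments(path):
    """Parse the RCV1v2 judgments file: one line per topic, first token
    the topic code, remaining tokens the docids of its documents.
    Returns a dict mapping topic code -> set of docids."""
    topics = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip any blank lines
            topics[fields[0]] = set(fields[1:])
    return topics
```

Using sets for the docids makes the later membership tests (is document
d in class GCAT?) constant-time.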
=====================================================================

Question 1: Wrapper code
========================

Repurpose code (or indices) from previous workshops to convert the
LYRL30k document set into unit-normalized TF*IDF document vectors
(other representations are possible, but we'll use this one since we
have it), then store them in an index.

Implement code that samples $k$ docvecs from the docvec index
(random.sample() will help you here). We need to split this sample into
training and test sets. To avoid overlap between the sets, the best way
to do this is to draw one sample of ntrain + ntest documents, then
split it into training and test sets by:

   test_set = smpl_set[0:ntest]
   train_set = smpl_set[ntest:(ntest + ntrain)]

To keep your experiments reproducible, seed the random number generator
using random.seed(1). (Note that we take the test set from the front.
This allows us to try different training set sizes with the same test
set -- so long as we seed the random generator with the same value each
time.) You'll need to keep hold of the document ids that correspond to
each document vector, but _don't_ put them into the document vectors
(at best that will cause the learning algorithm to treat them as
features; at worst it will break the algorithm altogether).

Implement code that loads the RCV1v2 judgments (see Prelim 1 above) and
does the following:

   1. For a given document, tells you whether it belongs to a class
   2. For a set of documents and a class, gives you all documents in
      the set that belong to that class
   3. For a set of documents and a class, gives you all the documents
      in the set that do not belong to that class

(These functions are obviously related.) Do _not_, however, put topic
labels into the document vectors: the learning algorithms will treat
them as a feature and _do extremely well_, since you are directly
telling them what class a document belongs to.

Question 2: Rocchio classifier
==============================

Implement a Rocchio classifier.
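The sampling, splitting, and class-membership lookups for Question 1 might be
sketched as below. The function names are my own, and the judgments are assumed
to be a dict mapping topic code to a set of docids (as produced when parsing
the Prelim 1 file):

```python
import random

def split_train_test(docids, ntrain, ntest, seed=1):
    """Draw one sample of ntrain + ntest docids and split it, taking the
    test set from the front as the handout describes. Seeding makes the
    split reproducible."""
    random.seed(seed)
    smpl_set = random.sample(docids, ntrain + ntest)
    test_set = smpl_set[0:ntest]
    train_set = smpl_set[ntest:(ntest + ntrain)]
    return train_set, test_set

def in_class(judgments, docid, cls):
    """True iff docid is judged to belong to class cls."""
    return docid in judgments.get(cls, set())

def class_members(judgments, docids, cls):
    """All docids in the given set that belong to cls."""
    return [d for d in docids if in_class(judgments, d, cls)]

def class_non_members(judgments, docids, cls):
    """All docids in the given set that do not belong to cls."""
    return [d for d in docids if not in_class(judgments, d, cls)]
```

Note that the docids, not the docvecs, are what get sampled and judged;
the docvec index maps each sampled docid back to its vector.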
This consists of two parts: model construction and classification. The
model construction takes a dictionary of training document vectors, one
list per class (where the class tag is given as the key). It calculates
the mean document vector for each class, and returns a dictionary
mapping each class to its mean document vector. The classification
takes a document vector, calculates its cosine distance to each mean
class vector, and returns the class that the document vector is
closest to.

Question 3: Rocchio multi-class classification
==============================================

Randomly split the LYRL30k document set into 15k training documents and
15k test documents. Build a training model for every RCV1v2 class that
has at least one document in the training set. Classify the 15k test
documents. (Allocate each document to exactly one class, even though in
practice a document can belong to more than one class.)

** Amendment, 2014-04-02: 15k training / 15k test can
** take quite a long time. Reduce to 5k training /
** 5k test if this is the case.

What is the total percentage of wrong labels? That is, what percentage
of documents are allocated to a class that they do not belong to?

Question 4: Rocchio binary classification
=========================================

Redo Question 3, but now with only two classes: GCAT and not-GCAT.
Calculate the number of TP, FP, FN, and TN for this classification.
From these, calculate and report the following metrics:

   1. Accuracy
   2. F1

Question 5: Binary class ranking
================================

We said in lectures that some binary classifiers are also able to rank
documents by their degree of closeness to the positive class versus the
negative class. It is tempting to implement this for Rocchio
classification by building only a model of the positive class (here,
GCAT), and ranking documents by their distance from this model. Why is
this wrong? Propose a correct way of doing it.
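The Rocchio model construction and classification of Question 2 can be
sketched as follows, assuming docvecs are represented as sparse
{term: weight} dicts (cosine similarity is used, so "closest" means
highest similarity; the function names are mine):

```python
import math

def rocchio_train(training_sets):
    """training_sets: {class_tag: [docvec, ...]}.
    Returns {class_tag: mean docvec} -- the class centroids."""
    models = {}
    for cls, docvecs in training_sets.items():
        totals = {}
        for dv in docvecs:
            for term, wgt in dv.items():
                totals[term] = totals.get(term, 0.0) + wgt
        models[cls] = {t: w / len(docvecs) for t, w in totals.items()}
    return models

def cosine(v1, v2):
    """Cosine similarity between two sparse docvecs."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def rocchio_classify(models, docvec):
    """Return the class whose mean vector the docvec is closest to."""
    return max(models, key=lambda cls: cosine(models[cls], docvec))
```

The centroids are generally not unit length even when the training
docvecs are, which is why the cosine computation normalizes explicitly.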
Question 6: kNN binary classification
=====================================

Implement a kNN binary classifier. No model needs to be built (except
to have the training docvec list handy). Classification of a document
$d$ is performed by comparing its docvec to each training document
vector, and recording the classes of the top $k$. If more than $k / 2$
of these are in the positive class, assign the document to that class;
otherwise, assign it to the negative class. (It makes sense to choose
an odd $k$.)

This is an O(t * c) algorithm, where $t$ is the number of training
documents and $c$ is the number of test documents. To keep this
manageable, reduce your training set to 2000 documents, and your test
set to 500 documents. (If this still takes too long to run, decrease
training to 1000 and test to 250.)

Run your code with GCAT as the positive class and not-GCAT as the
negative class. What are your accuracy and F1 score? Re-run the Rocchio
binary classifier from Question 4, reducing training and test to
2000/500 (or 1000/250). Which classifier is more accurate?

Question 7: SVM classification
==============================

Implement wrapper code to run the scikit-learn SVM classifier. (The
main issue here is getting the docvecs into the format that
scikit-learn expects.) Run on the same 2000/500 (1000/250)
training/test split as the Rocchio and kNN classifiers. Calculate
accuracy and F1. Which classifier is best?

Question 8: Learning curve
==========================

As we add more training data, our classifier should improve. Test this
with the SVM classifier. Extend the test set to 10,000 documents (SVM
is quick to classify). Try training set sizes of 100, 200, 500, 1000,
2000, 5000, and 10000. Calculate the F1 score for each, and graph the
results. At what level does the classifier stop improving? (You may
need to increase the training set size further, though I expect 10000
will be enough.)
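The kNN classifier of Question 6, plus the accuracy/F1 computation that
Questions 4, 6, and 7 all need, might be sketched as below (docvecs again
assumed to be sparse {term: weight} dicts; names are my own):

```python
import math

def cosine(v1, v2):
    """Sparse cosine similarity (repeated here so the sketch is
    self-contained)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn_classify(train, docvec, k=5):
    """train: list of (docvec, is_positive) pairs. Rank the training
    docs by similarity to docvec, take the top k, and return True iff
    more than k/2 of them are positive (an odd k avoids ties)."""
    nearest = sorted(train, key=lambda pair: cosine(pair[0], docvec),
                     reverse=True)[:k]
    npos = sum(1 for _, is_pos in nearest if is_pos)
    return npos > k / 2

def accuracy_f1(predicted, actual):
    """predicted, actual: parallel lists of booleans (positive class =
    True). Returns (accuracy, F1) via the TP/FP/FN/TN counts."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    acc = (tp + tn) / len(actual)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, f1
```

Sorting all of `train` for every test document is what makes the whole
run O(t * c) in similarity computations, hence the reduced set sizes.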