DUC 2003 set up several working groups to study particular matters, and some further studies carried out after DUC 2003 can be treated in the same way. The following are summary reports for three of these efforts.

Post DUC-03 Working Group on Reviewing and Revising the Linguistic Quality Questions
A Nenkova

The WG subsumed two activities:

1) The development of a set of output summary quality questions for use in intrinsic evaluation and appropriate to many different types of summary (e.g. single-document, multi-document, extractive, constructive, etc.).
2) A study of a method for evaluating summary content, designed to accept the fact that there may be multiple alternative, and equally good, summaries for the same source text(s).

The output of (1) is the Quality Question Set, available at http://duc.nist.gov/duc2004/quality.questions.txt

The output of (2) is a report, `Evaluating content selection in human- or machine-generated summaries: the pyramid scoring method', by R J Passonneau and A Nenkova, Sept 2003, at http://www1.cs.columbia.edu/~library/TR-repository/reports/reports-2003/cucs-025-03.pdf and an HLT-NAACL paper, `Evaluating Content Selection in Summarization: the Pyramid Method', by A Nenkova and R Passonneau, at http://www1.cs.columbia.edu/~ani/papers/pyramid.pdf

Post DUC-03 Study on extrinsic evaluation
B Dorr, R Shwartz

Summary

We performed an experiment to evaluate short summaries extrinsically. The application was making TREC relevance judgments. We tested subjects on 6 short summarization methods as well as on the full documents. The results showed that judgments made on the full documents were not very consistent with the NIST judgments, or between two of our subjects. Judgments made on summaries were somewhat worse than those made on documents, and roughly equal across the 6 methods. We believe the high level of noise in the basic task hid the differences. Our next test will use TDT event judgments, which we believe will be more consistent.
A more detailed description is available.

Working One-man Group on Automatic Evaluation Metrics
C-Y Lin

Status at February 27 04

(1) ROUGE evaluation package v1.2.1 was released to the public last week. It can be downloaded from http://www.isi.edu/~cyl.
(2) N-gram overlap and LCS-based similarity measures are included in ROUGE v1.2.1.
(3) A working note explaining how ROUGE works is included in the package. It is also attached for your comments.

The following are ongoing:

(1) A new measure, skip bigram, will be added to the package. Skip bigrams are word pairs in their sentence order, allowing for arbitrary gaps. This new measure, called ROUGE-S, computes skip bigram overlap between a candidate summary and a set of reference summaries. ROUGE-S has been tested in automatic MT evaluation, and I have initial evidence showing that ROUGE-S is better than BLEU, NIST, ROUGE-N (n-gram based ROUGE), and ROUGE-L (LCS-based ROUGE). Information about this new metric and its application has been reported in a paper submitted to ACL 2004. If you would like a copy of the draft paper, please let me know.
(2) I will analyze DUC 2001, 2002, and 2003 data using ROUGE, and the initial results will probably be reported at the DUC workshop. I am planning to send a paper based on these analyses to COLING 2004 or to the Text Summarization workshop at ACL 2004.
(3) I would like to carry out error analysis at the sentence level using MT data to see what is missing from the current automatic measures. I expect that these will be synonyms, paraphrases, syntactic-level equivalences, etc. New measures will then be designed to take these variations into account. These results will then be applied to automatic evaluation of summarization.

ROUGE-Working-Note-v1.3.pdf
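
The skip bigram overlap described above for ROUGE-S can be illustrated with a minimal Python sketch. This is not the ROUGE package's code, and it makes several assumptions the report does not state: whitespace tokenization, lowercasing, no limit on the gap between the two words, clipped (multiset) matching, and a single candidate sentence compared with a single reference sentence rather than a set of reference summaries. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens):
    """Count every ordered word pair (w_i, w_j) with i < j in one sentence."""
    # combinations() keeps positional order, so each pair is in sentence order
    # and the two words may be separated by an arbitrary gap.
    return Counter(combinations(tokens, 2))


def skip_bigram_overlap(candidate, reference):
    """Return (recall, precision, f1) of skip bigram overlap for two sentences.

    Assumption: whitespace tokenization and lowercasing; the real package
    may preprocess text differently.
    """
    cand = skip_bigrams(candidate.lower().split())
    ref = skip_bigrams(reference.lower().split())
    matches = sum((cand & ref).values())  # multiset intersection, clipped counts
    recall = matches / sum(ref.values()) if ref else 0.0
    precision = matches / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return recall, precision, f1


if __name__ == "__main__":
    reference = "police killed the gunman"
    for candidate in ("police kill the gunman", "the gunman kill police"):
        print(candidate, skip_bigram_overlap(candidate, reference))
```

In this toy example the first candidate shares three of the six reference skip bigrams (police-the, police-gunman, the-gunman), while the second shares only one (the-gunman), so word order is rewarded even though both candidates contain the same words.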