DUC 2003 set up several working groups to study particular issues,
and some further studies carried out after DUC 2003 can be treated
in the same way. Summary reports for three of these efforts follow.


Post DUC-03 Working Group on Reviewing and Revising the Linguistic
Quality Questions.
A Nenkova

The WG subsumed two activities:

1) The development of a set of output summary quality questions
for use in intrinsic evaluation and appropriate to many different
types of summary (e.g. single-document, multi-document, extractive,
constructive, etc.).

2) A study of a method for evaluation of summary content designed
to accept the fact that there may be multiple alternative, and
equally good, summaries for the same source text(s).

The output of (1) is the Quality Question Set available at
http://duc.nist.gov/duc2004/quality.questions.txt

The output of (2) is a report `Evaluating content selection in
human- or machine-generated summaries: the pyramid scoring method',
by R J Passonneau and A Nenkova, Sept 2003, at

http://www1.cs.columbia.edu/~library/TR-repository/reports/reports-2003/cucs-025-03.pdf

and an HLT-NAACL paper 'Evaluating Content Selection in Summarization: the
Pyramid Method', by A Nenkova and R Passonneau, at

http://www1.cs.columbia.edu/~ani/papers/pyramid.pdf
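
As a rough illustration of the idea behind (2), the sketch below scores
a summary against several reference summaries at once. It assumes the
content units have already been identified by hand and are given as
plain labels; the function name pyramid_score and the toy labels are
hypothetical and are not taken from the report. A unit's weight is the
number of reference summaries that express it, and the score is the
weight the summary achieves divided by the best weight attainable with
the same number of units.

```python
from collections import Counter


def pyramid_score(candidate_units, reference_unit_sets):
    """Observed content weight divided by the best weight attainable
    with the same number of content units (hypothetical helper)."""
    # Weight of a unit = number of reference summaries expressing it.
    weights = Counter(u for units in reference_unit_sets for u in set(units))
    expressed = set(candidate_units)
    # Units absent from every reference contribute weight 0.
    observed = sum(weights[u] for u in expressed)
    # An ideal summary of the same size takes the heaviest units first.
    best = sum(sorted(weights.values(), reverse=True)[:len(expressed)])
    return observed / best if best else 0.0


# Toy data: three reference summaries, units labelled a-d.
references = [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
print(pyramid_score({"a", "c"}, references))  # 0.8
print(pyramid_score({"a", "b"}, references))  # 1.0
```

Two summaries that pick different low-weight units can receive the same
score, which is how the method allows for alternative, equally good
summaries.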


Post DUC-03 Study on extrinsic evaluation
B Dorr, R Shwartz

Summary

    We performed an experiment to evaluate short summaries
    extrinsically.  The application was to make TREC relevance
    judgments.  We tested subjects on 6 short summarization methods as
    well as the full documents.  The results showed that the judgments
    on the full documents were not very consistent with the NIST
    judgments or between two of our subjects.  The results when making
    judgments on summaries were somewhat worse than for documents, and
    all about equal among the 6 methods.  We believe the high level of
    noise in the basic task hid the differences.  Our next test will
    be done using TDT event judgments, which we believe will be more
    consistent.

A more detailed description is available.
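
The summary above does not say how consistency between judgments was
measured. As one simple possibility, the sketch below computes raw
agreement: the fraction of documents on which two sets of binary
relevance judgments coincide. The function name, the document ids and
the judgments are all made up for illustration.

```python
def agreement_rate(judgments_a, judgments_b):
    """Fraction of commonly judged documents given the same label.

    Each argument maps a document id to True (relevant) or False.
    """
    shared = judgments_a.keys() & judgments_b.keys()
    if not shared:
        return 0.0
    agreed = sum(judgments_a[d] == judgments_b[d] for d in shared)
    return agreed / len(shared)


# Hypothetical judgments, not data from the study.
reference = {"d1": True, "d2": False, "d3": True, "d4": False}
subject = {"d1": True, "d2": True, "d3": True, "d5": False}
print(round(agreement_rate(reference, subject), 3))  # 0.667
```

Raw agreement takes no account of chance agreement, so a
chance-corrected statistic may be preferable when one label dominates.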


Working One-man Group on Automatic Evaluation Metrics

C-Y Lin


Status at February 27 04

(1) ROUGE evaluation package v1.2.1 was released to the public last
week. It can be downloaded from http://www.isi.edu/~cyl.
(2) N-gram overlap and LCS based similarity measures are included in ROUGE
v1.2.1.
(3) A working note explaining how ROUGE works is included in the package. It
is also attached for your comments.
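
To make item (2) concrete, here is a minimal sketch of the two kinds of
measure: recall-oriented n-gram overlap and longest common subsequence
(LCS). It is not the ROUGE package's code. The function names are
illustrative, the input is assumed to be already tokenized, and the
package may differ in tokenization, stemming, and the way multiple
references are combined.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_recall(candidate, references, n):
    """Matched reference n-grams over total reference n-grams.

    A reference n-gram counts as matched at most as many times as it
    occurs in the candidate.
    """
    cand = ngrams(candidate, n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        total += sum(ref_counts.values())
        matched += sum(min(c, cand[g]) for g, c in ref_counts.items())
    return matched / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


cand = "the cat sat on the mat".split()
ref = "the cat was on the mat".split()
print(round(ngram_recall(cand, [ref], 1), 3))  # 0.833
print(ngram_recall(cand, [ref], 2))            # 0.6
print(lcs_length(cand, ref))                   # 5
```

Unlike n-gram overlap, the LCS rewards words that appear in the same
order even when they are not adjacent.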

The following are ongoing:
(1) A new measure, skip bigram, will be added to the package. Skip bigrams
are word pairs in their sentence order, allowing for arbitrary gaps. This
new measure, called ROUGE-S, computes skip bigram overlap between a candidate
summary and a set of reference summaries. ROUGE-S has been tested in
automatic MT evaluation and I have initial evidence showing that ROUGE-S is
better than BLEU, NIST, ROUGE-N (ngram based ROUGE), and ROUGE-L (LCS-based
ROUGE). Information about this new metric and its application has been
reported in a paper submitted to ACL 2004. If you would like to have a copy
of the draft paper, please let me know.

(2) I will analyze DUC 2001, 2002, and 2003 data using ROUGE, and the initial
results will probably be reported at the DUC workshop. I am planning to submit
a paper based on these analyses to COLING 2004 or the Text Summarization
workshop at ACL 2004.

(3) I would like to carry out error analysis at the sentence level using MT data
to see what is missing from the current automatic measures. I expect that
these will be synonyms, paraphrases, syntactic level equivalences, etc. New
measures will then be designed to consider these variations. These results
will then be applied to automatic evaluation of summarization.
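
The skip bigram idea in ongoing item (1) can be sketched as follows.
This is only an illustration of the definition given there (word pairs
in sentence order, with arbitrary gaps); it is not the ROUGE-S
implementation. The function names are made up, the input is assumed to
be tokenized, and only a single reference is handled because the way
several references are combined is not described here.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens):
    """All word pairs (w_i, w_j) with i < j, kept in sentence order."""
    return Counter(combinations(tokens, 2))


def skip_bigram_overlap(candidate, reference):
    """Return (recall, precision) of skip bigram matches."""
    cand, ref = skip_bigrams(candidate), skip_bigrams(reference)
    # Multiset intersection: each pair matches at most min(count) times.
    matched = sum((cand & ref).values())
    recall = matched / sum(ref.values()) if ref else 0.0
    precision = matched / sum(cand.values()) if cand else 0.0
    return recall, precision


ref = "police killed the gunman".split()
print(skip_bigram_overlap("police kill the gunman".split(), ref))  # (0.5, 0.5)
```

With the same reference, the reordered candidate "the gunman kill
police" shares only the pair (the, gunman), so skip bigrams remain
sensitive to word order even though gaps are allowed.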


ROUGE-Working-Note-v1.3.pdf