DUC 2007: Task, Documents, and Measures


The Document Understanding Conference (DUC) is a series of summarization evaluations that have been conducted by the National Institute of Standards and Technology (NIST) since 2001. Its goal is to further progress in automatic text summarization and enable researchers to participate in large-scale experiments in both the development and evaluation of summarization systems.

DUC 2007 will consist of two tasks. The tasks are independent, and participants in DUC 2007 may choose to do one or both tasks:

  1. Main task
  2. Update task (pilot)

    The main task is the same as the DUC 2006 task and will model real-world complex question answering, in which a question cannot be answered by simply stating a name, date, quantity, etc. Given a topic and a set of 25 relevant documents, the task is to synthesize a fluent, well-organized 250-word summary of the documents that answers the question(s) in the topic statement. Successful performance on the task will benefit from a combination of IR and NLP capabilities, including passage retrieval, compression, and generation of fluent text.

    The update task will be to produce short (~100 words) multi-document update summaries of newswire articles under the assumption that the user has already read a set of earlier articles. The purpose of each update summary will be to inform the reader of new information about a particular topic.


    Documents for summarization

    The documents for summarization will come from the AQUAINT corpus, comprising newswire articles from the Associated Press and New York Times (1998-2000) and Xinhua News Agency (1996-2000). The corpus has the following DTD:

    • AQUAINT corpus (DTD)
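
    As a rough illustration only, the Python sketch below splits an AQUAINT-style SGML file into documents and pulls out a few fields. The tag names used here (<DOC>, <DOCNO>, <DATE_TIME>, <HEADLINE>, <TEXT>) are assumptions based on the corpus's usual layout, which differs slightly across the three news sources; the DTD linked above is the authoritative description.

      import re

      # Illustrative sketch: split an AQUAINT-style SGML file into documents and
      # extract a few fields.  The tag names are assumptions; consult the DTD for
      # the authoritative structure.
      DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)

      def _field(block, tag):
          m = re.search(r"<%s>(.*?)</%s>" % (tag, tag), block, re.DOTALL)
          return m.group(1).strip() if m else ""

      def read_aquaint_file(path):
          """Yield (docno, date_time, headline, text) tuples from one corpus file."""
          with open(path, encoding="latin-1") as f:
              data = f.read()
          for block in DOC_RE.findall(data):
              text = _field(block, "TEXT")
              text = re.sub(r"</?P>", " ", text)         # drop paragraph tags
              text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
              yield (_field(block, "DOCNO"), _field(block, "DATE_TIME"),
                     _field(block, "HEADLINE"), text)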

    NIST assessors will develop topics of interest to them. The assessor will create a topic and choose a set of 25 documents relevant to the topic. These documents will form the document cluster for that topic. Topics and document clusters will be distributed by NIST. Only DUC 2007 participants who have completed all required forms will be allowed access.


    Main Task

    Reference summaries

    Each topic and its document cluster will be given to 4 different NIST assessors, including the developer of the topic. Each assessor will create a ~250-word summary of the document cluster that satisfies the information need expressed in the topic statement. These multiple reference summaries will be used in the evaluation of summary content.

    System task

    Given a DUC topic and a set of 25 relevant documents, create from the documents a brief, well-organized, fluent summary that answers the information need expressed in the topic statement. All processing of documents and generation of summaries must be automatic.

    The summary can be no longer than 250 words (whitespace-delimited tokens). Summaries over the size limit will be truncated. No bonus will be given for creating a shorter summary. No specific formatting other than linear is allowed.
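
    Because the limit is defined over whitespace-delimited tokens, submissions can be checked or pre-truncated with a few lines of code. The following Python sketch mirrors the stated rule; NIST's own truncation script is not reproduced here.

      def truncate_summary(text, limit=250):
          """Keep at most `limit` whitespace-delimited tokens; the rest is
          dropped, mirroring the stated size rule."""
          return " ".join(text.split()[:limit])

      # Example: a 300-token summary is cut back to its first 250 tokens.
      long_summary = " ".join("token%d" % i for i in range(300))
      assert len(truncate_summary(long_summary).split()) == 250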

    There will be 45 topics in the test data. Each group can submit one set of results, i.e., one summary for each topic/cluster. Participating groups should be able to evaluate additional results themselves using ISI's ROUGE/BE package.

    Evaluation

    All summaries will first be truncated to 250 words. Where sentences need to be identified for automatic evaluation, NIST will then use a simple Perl script for sentence segmentation.
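
    NIST's segmentation script is not reproduced here; purely to illustrate the kind of lightweight, rule-based splitting involved, the Python sketch below breaks text after sentence-final punctuation when it is followed by whitespace and a capital letter or opening quote.

      import re

      # Illustrative rule-based sentence splitter; NIST's actual script may differ.
      _BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=["\'A-Z])')

      def split_sentences(text):
          text = re.sub(r"\s+", " ", text).strip()
          return [s for s in _BOUNDARY.split(text) if s]

      print(split_sentences('The bill passed. "We are pleased," she said. Turnout rose.'))
      # -> ['The bill passed.', '"We are pleased," she said.', 'Turnout rose.']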

    • NIST will manually evaluate the linguistic well-formedness of each submitted summary using a set of quality questions.

    • NIST will manually evaluate the relative responsiveness of each submitted summary to the topic. Here are instructions to the assessors for judging responsiveness.

    • NIST will run ROUGE-1.5.5 to compute ROUGE-2 and ROUGE-SU4, with stemming and keeping stopwords. Jackknifing will be implemented so that human and system scores can be compared (a sketch of this leave-one-out procedure appears after this list). ROUGE-1.5.5 will be run with the following parameters:

      ROUGE-1.5.5.pl -n 2 -x -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0 -d

        -n 2 compute ROUGE-1 and ROUGE-2
        -x do not calculate ROUGE-L
        -m apply Porter stemmer on both models and peers
        -2 4 compute Skip Bigram (ROUGE-S) with a maximum skip distance of 4
        -u include unigram in Skip Bigram (ROUGE-S)
        -c 95 use 95% confidence interval
        -r 1000 bootstrap resample 1000 times to estimate the 95% confidence interval
        -f A scores are averaged over multiple models
        -p 0.5 compute F-measure with alpha = 0.5
        -t 0 use model unit as the counting unit
        -d print per-evaluation scores

    • NIST will calculate overlap in Basic Elements (BE) between automatic and manual summaries. Summaries will be parsed with Minipar, and BE-F will be extracted. These BEs will be matched using the Head-Modifier criterion.

        ROUGE-1.5.5.pl -3 HM -d

    • Groups may participate in an optional manual evaluation of summary content using the pyramid method, which will be carried out cooperatively by DUC participants.
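
    The jackknifing mentioned in the ROUGE bullet above is what makes human and automatic scores comparable: with k reference summaries per topic, a system summary is scored k times, each time against a different set of k-1 references, and the scores are averaged, while a reference summary is scored against the other k-1 references. The Python sketch below shows that averaging with a stand-in scoring function; the real scores come from ROUGE-1.5.5 itself.

      from statistics import mean

      def score_against(peer, references):
          """Stand-in for a real ROUGE computation against a set of references
          (here, average unigram recall, purely to make the sketch runnable)."""
          peer_tokens = set(peer.lower().split())
          recalls = []
          for ref in references:
              ref_tokens = ref.lower().split()
              hits = sum(1 for t in ref_tokens if t in peer_tokens)
              recalls.append(hits / len(ref_tokens) if ref_tokens else 0.0)
          return mean(recalls)

      def jackknifed_score(peer, models, peer_is_model=False):
          """With k model summaries: a system peer is scored against each of the
          k leave-one-out sets of k-1 models and the results are averaged; a
          human peer is scored once against the other k-1 models."""
          if peer_is_model:
              return score_against(peer, [m for m in models if m is not peer])
          return mean(score_against(peer, models[:i] + models[i + 1:])
                      for i in range(len(models)))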


    Update Task (Pilot)

    The update summary pilot task will be to create short (100-word) multi-document summaries under the assumption that the reader has already read a number of previous documents. The topics and documents for the update pilot will be a subset of those for the main DUC task. There will be approximately 10 topics in the test data, with 25 documents per topic. For each topic, the documents will be ordered chronologically and then partitioned into 3 sets, A-C, where the time stamps on all the documents in each set are ordered such that time(A) < time(B) < time(C). There will be approximately 10 documents in Set A, 8 in Set B, and 7 in Set C.
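
    Given the stated cluster sizes, the partition for a topic can be written down directly; the Python sketch below sorts the topic's 25 documents by date and splits them into sets A, B, and C of 10, 8, and 7 documents. The (doc_id, timestamp, text) tuple layout is illustrative, not a prescribed format.

      def partition_update_clusters(docs, sizes=(10, 8, 7)):
          """Order a topic's documents chronologically and split them into
          clusters A, B, and C so that time(A) < time(B) < time(C).
          `docs` is a list of (doc_id, timestamp, text) tuples; the tuple
          layout is illustrative only."""
          assert sum(sizes) == len(docs), "expected 25 documents per topic"
          ordered = sorted(docs, key=lambda d: d[1])            # oldest first
          a_end, b_end = sizes[0], sizes[0] + sizes[1]
          return {"A": ordered[:a_end],
                  "B": ordered[a_end:b_end],
                  "C": ordered[b_end:]}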

    Reference Summaries

    Instructions given to NIST assessors for writing update summaries.

    Each topic and its 3 document clusters, A-C, will be given to 4 different NIST assessors. Each assessor will create three ~100-word topic-focused summaries that contribute to satisfying the information need expressed in the topic statement:

    1. A summary of documents in cluster A
    2. An update summary of documents in B, under the assumption that the reader has already read documents in A
    3. An update summary of documents in C, under the assumption that the reader has already read documents in A and B

    These multiple reference summaries will be used in the evaluation of summary content.

    System Task

    Given a DUC topic and its 3 document clusters, A-C, create from the documents three brief, fluent summaries that contribute to satisfying the information need expressed in the topic statement:
    1. A summary of documents in cluster A
    2. An update summary of documents in B, under the assumption that the reader has already read documents in A
    3. An update summary of documents in C, under the assumption that the reader has already read documents in A and B
    Each summary can be no longer than 100 words (whitespace-delimited tokens). Summaries over the size limit will be truncated. No specific formatting other than linear is allowed. Each group can submit one set of results, i.e., one summary for each of the document clusters for each topic. Within a topic, the document clusters must be processed in chronological order; i.e., you cannot look at documents in cluster B or C when generating the summary for cluster A, and you cannot look at the documents in cluster C when generating the summary for cluster B. However, the documents within a cluster can be processed in any order.
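
    In other words, a conforming system works through each topic's clusters in order, letting earlier clusters inform later summaries but never the reverse. The Python sketch below shows that control flow with a hypothetical summarize(topic, new_docs, already_read) function and the 100-word truncation applied to each output.

      def run_update_task(topic, clusters, summarize, word_limit=100):
          """Process clusters A, B, C in chronological order.  `summarize` is a
          hypothetical system function taking (topic, new_docs, already_read);
          documents from later clusters are never visible to earlier summaries."""
          already_read = []
          summaries = {}
          for label in ("A", "B", "C"):
              docs = clusters[label]
              summary = summarize(topic, docs, list(already_read))
              summaries[label] = " ".join(summary.split()[:word_limit])
              already_read.extend(docs)     # these documents now count as read
          return summaries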

    Evaluation

    All summaries will first be truncated to 100 words. Where sentences need to be identified for automatic evaluation, NIST will then use a simple Perl script for sentence segmentation.

    • NIST will run ROUGE-1.5.5 to compute ROUGE-2 and ROUGE-SU4, with stemming and keeping stopwords. Jackknifing will be implemented so that human and system scores can be compared.

    • NIST will calculate overlap in Basic Elements (BE) between automatic and manual summaries. Summaries will be parsed with Minipar, and BE-F will be extracted. These BEs will be matched using the Head-Modifier criterion.

    • NIST will conduct a manual evaluation of summary content using a pyramid-like method based on information nuggets.


    Tools for DUC 2007


    DUC Workshop Papers and Presentations

    Each participant in the system task should submit a paper describing their system architecture, results, and analysis; these papers will be published in the DUC 2007 Workshop Proceedings. Participants who would like to give an oral presentation of their paper at the workshop should submit a presentation proposal by March 21, 2007, and the Program Committee will select the groups who will present at the workshop.

For data, past results, mailing list or other general information
contact: Lori Buckland (lori.buckland@nist.gov)
For other questions contact: Hoa Dang (hoa.dang AT nist.gov)
Last updated: Thursday, 24-Mar-2011 11:40:47 MDT
Date created: Wednesday, 18-October-06