Document Understanding Conferences

DUC 2004: Documents, Tasks, and Measures

Some comparisons to DUC 2003

Information on ROUGE
Here is a link to ISI's webpage on ROUGE, provided by Chin-Yew Lin.

Documents for summarization
Tasks 1 and 2 (English documents)

Documents/clusters: NIST staff chose 50 TDT topics/events/timespans and a subset of the documents TDT annotators found for each topic/event/timespan. Each subset contained on average 10 documents. The documents came from the AP newswire and the New York Times newswire.

Manual summaries: NIST assessors created a very short summary (<= 75 bytes, no specific format other than linear) of each document and a short summary (<= 665 bytes) of each cluster. These summaries were not focused in any particular way beyond the content of the documents themselves. The manual summarizers were NOT given the TDT topic. Here are the summarization instructions the NIST summarizers were given.

Tasks 3 and 4 (Arabic documents with English translations)

Documents/clusters: The above 50 TDT topics were chosen so that 13 of them also have relevant documents in an Arabic source. These were supplemented with 12 new topics in the same style. We used these 25 topics and a subset of the documents TDT annotators found for each topic/event/timespan. Each subset contained on average 10 documents. For each cluster we also attempted to provide documents that came from the TDT English sources and were relevant to the topic. The number of such documents varied by topic: for 13 of the topics there were some relevant documents from English sources; for the others we used however many were found after a fixed amount of searching, and in some cases none may have been found. To the extent possible, the English documents came from dates the same as or close to those of the Arabic documents. The Arabic documents came from the Agence France Presse (AFP) Arabic Newswire (1998, 2000-2001). They were translated to English by two fully automatic machine translation (MT) systems.

Manual summaries: Summarizers created a very short summary (~10 words, no specific format other than linear) of each document and a short summary (~100 words) of each cluster, in English. In both cases the summarizer worked from professional translations to English of the original Arabic document(s). Here are the summarization instructions the LDC summarizers were given. (The instructions pre-date the determination of the size limits in bytes; that is why those limits are stated in words.) These summaries were not focused in any particular way beyond the content of the documents themselves.

Sample data: For one docset (30047) that we did not use for testing, we provided sample test data files: the output of two machine translation systems, the manual translations for the document set and for each document, additional English documents relevant to the topic of the docset, and the manual summaries (4 humans' versions) of the manually translated documents. The MT output contains multiple variants for each sentence, ranked by descending quality. In the actual test data NIST will provide a version containing only the best translation for each sentence. Note that ISI and IBM used independent sentence separation.

Task 5 (TREC documents)

Documents/clusters: NIST assessors chose 50 clusters of TREC documents such that all the documents in a given cluster provide at least part of the answer to a broad question the assessor formulated. The question was of the form "Who is X?", where X was the name of a person. Each cluster contained on average 10 documents. The documents came from TREC collections, each with its own tagging.

Manual summaries: NIST assessors created a focused short summary (<= 665 bytes) of each cluster, designed to answer the question defined by the assessor.
Tasks and measures

In what follows, the evaluation of quality and coverage implements the SEE manual evaluation protocol. The discussion group led by Ani Nenkova has produced the following linguistic questions and answers, which will be used within the SEE manual evaluation protocol. (Please note that SEE will only be used in task 5 for 2004.)

For all tasks, we truncated summaries over the target length (defined in terms of characters, with whitespace and punctuation included), and there was no bonus for summarizing in less than the target length. We modeled a situation in which an application has a fixed amount of time/space for a summary: summary material beyond the target length cannot be used, and compression to less than the target length underutilizes the available time/space.

For short summaries, overlap in content between a submitted summary and a manual model was defined for the assessors in terms of shared facts. Because very short summaries may be lists of key words, the requirement that they express complete facts was relaxed; for very short summaries, overlap in content was instead defined for the assessors in terms of shared references to entities such as people, things, places, events, etc.

The TDT topics themselves were not input either to the manual summarizers or to the automatic summarization systems.

NIST created some simple automatic (baseline) summaries to be included in the evaluation along with the other submissions. Definitions of the baselines are available. Each group could submit up to 3 prioritized runs/results for each task.
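The truncation rule is mechanical: anything past the target number of bytes, counting whitespace and punctuation, is simply cut. Purely as an illustration (this is not NIST's scoring code, and the official baseline definitions are documented separately and may differ), here is a minimal Python sketch of byte-limit truncation together with a hypothetical lead-style baseline:

# Minimal sketch, not the official NIST tooling: truncate a summary to a byte
# budget (whitespace and punctuation count), and build a hypothetical
# lead-style baseline from the opening of a document.

def truncate_to_bytes(summary: str, limit: int, encoding: str = "utf-8") -> str:
    """Cut the summary after `limit` bytes; material beyond the limit is unused."""
    data = summary.encode(encoding)
    if len(data) <= limit:
        return summary
    # Drop bytes past the limit, discarding any partially cut final character.
    return data[:limit].decode(encoding, errors="ignore")

def lead_baseline(document: str, limit: int) -> str:
    """Hypothetical lead baseline (an assumption, for illustration only):
    the start of the document, truncated to the byte budget."""
    return truncate_to_bytes(document.strip(), limit)

if __name__ == "__main__":
    doc = "U.S. and British officials met Monday to discuss the crisis, aides said. ..."
    print(lead_baseline(doc, 75))    # very short summary budget (75 bytes)
    print(lead_baseline(doc, 665))   # short summary budget (665 bytes)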
Task 1: Use the 50 TDT English clusters. Given each document, create a very short summary (<= 75 bytes) of the document. Summaries over the size limit will be truncated; no bonus for creating a shorter summary. No specific format other than linear is required. Submitted summaries will be evaluated solely using ROUGE n-gram matching.

Task 2: Use the 50 TDT English clusters. Given each document cluster, create a short summary (<= 665 bytes) of the cluster. Summaries over the size limit will be truncated; no bonus for creating a shorter summary. Note that the TDT topic will NOT be input to the system. Submitted summaries will be evaluated solely using ROUGE n-gram matching.

Task 3 (cross-lingual, very short single-document summaries): Two required runs and one optional one per group. Summaries over the size limit will be truncated; no bonus for creating a shorter summary. No specific format other than linear is required. Submitted summaries will be evaluated solely using ROUGE n-gram matching.

Task 4 (cross-lingual, short multi-document summaries): Two required runs and one optional one per group. Summaries over the size limit will be truncated; no bonus for creating a shorter summary. Submitted summaries will be evaluated solely using ROUGE n-gram matching.

Task 5: Use the 50 TREC clusters. Given each document cluster and a question of the form "Who is X?", where X is the name of a person or group of people, create a short summary (<= 665 bytes) of the cluster that responds to the question. Summaries over the size limit will be truncated; no bonus for creating a shorter summary. The summary should be focused by the given natural language question (stated in a sentence).
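Tasks 1-4 are scored with Chin-Yew Lin's ROUGE package (see the link above), which counts n-gram overlap between a submitted summary and the manual models. As a rough sketch of the idea only, simplified and not the official package (no stemming, stopword handling, jackknifing, or the package's exact tokenization), ROUGE-N recall can be illustrated as:

# Simplified illustration of ROUGE-N recall, not the official ROUGE package:
# the fraction of the model's n-grams that also appear in the candidate summary.
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams over a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate: str, model: str, n: int = 2) -> float:
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(model.lower().split(), n)
    if not ref:
        return 0.0
    # Clipped overlap: each model n-gram is matched at most as often as it occurs.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# Example (hypothetical summaries): unigram recall against a single manual model.
print(rouge_n_recall("hurricane mitch hits honduras coast",
                     "hurricane mitch strikes honduras", n=1))  # 0.75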
For task 5, NIST will evaluate the summaries intrinsically (SEE) for quality and coverage. Here are the instructions to the assessors for using SEE. In addition, NIST will evaluate each summary for its "responsiveness" to the question. Here are the instructions to the assessors for judging relative responsiveness.
For data, past results, mailing list or other general information contact: Lori Buckland (lori.buckland@nist.gov)
For other questions contact: Paul Over (over@nist.gov)
Last updated: Thursday, 24-Mar-2011 12:03:11 EDT
Date created: Friday, 26-July-02