DUC 2003: Documents, Tasks, and Measures
DUC 2003 will use documents from the TDT and TREC collections and will
incorporate focus of various sorts to reduce variability and better model
real tasks. It will examine automatic creation of short and very short
summaries. What follows is a brief description of the data and tasks -
a more detailed version of what was developed at the DUC 2002 Workshop.
Documents for summarization
-
30 TREC document clusters
Documents/clusters: NIST assessors will choose 30 clusters
of TREC documents related to subjects of interest to them. Each subset
will contain on average 10 documents.
The documents will come from the following collections with their own
taggings:
-
AP newswire, 1998-2000
-
New York Times newswire, 1998-2000
-
Xinhua News Agency (English version), 1996-2000
Here is a DTD. (A rough sketch of reading documents in this style appears after this list.)
Manual summaries: NIST assessors will create a very
short summary (~10 words, no specific format other than linear) of each
document. They will also create a focused short summary (~100 words) of
each cluster, designed to reflect a viewpoint defined by the assessor.
-
30 TDT document clusters
Documents/clusters: NIST staff will choose 30 TDT topics/events/timespans
and a subset of the documents TDT annotators found for each topic/event/timespan.
Each subset will contain on average 10 documents.
The documents will come from the same collections identified above.
Manual summaries: NIST assessors will be given the TDT
topic and will create a very short summary (~10 words, no specific
format other than linear) of each document and a short summary of each
cluster. These summaries will not be focused in any particular way
beyond what the documents and the topic themselves provide.
-
30 TREC Novelty track document clusters
Documents/clusters: NIST staff will choose 30 TREC Novelty
Track question topics and a subset of the documents TREC assessors found
relevant to each topic. Each subset will contain on average 22 documents.
The documents will come from the following collections with their own
taggings:
-
Financial Times of London, 1991-1994
-
Federal Register, 1994
-
FBIS, 1996
-
Los Angeles Times 1989-1990
Here are the DTDs.
Manual summaries: NIST assessors will create a focused
short summary (~100 words) of each cluster, designed to answer the question
posed by the TREC topic.
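The exact tagging differs by collection and is defined by the DTDs referenced above. As a rough, purely illustrative sketch (not taken from the DTDs), the following Python assumes hypothetical TREC-style element names - DOC, DOCNO, HEADLINE, and TEXT - and pulls those fields out of a concatenated document file:

import re

# Hypothetical TREC-style element names (DOC, DOCNO, HEADLINE, TEXT);
# the authoritative names and structure come from each collection's DTD.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)

def field(doc, tag):
    # Contents of the first <tag>...</tag> element, or "" if absent.
    m = re.search(rf"<{tag}>(.*?)</{tag}>", doc, re.DOTALL)
    return m.group(1).strip() if m else ""

def read_docs(path):
    # Yield (docno, headline, text) triples from one concatenated file.
    with open(path, encoding="utf-8", errors="replace") as f:
        data = f.read()
    for m in DOC_RE.finditer(data):
        doc = m.group(1)
        yield field(doc, "DOCNO"), field(doc, "HEADLINE"), field(doc, "TEXT")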
Tasks and measures
In what follows, the evaluation of quality and coverage implements
the SEE manual
evaluation protocol. Where sentences needed to be identified, a simple Perl
script for sentence separation was used.
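The Perl script itself is not reproduced here. As a rough, hypothetical approximation of what such a simple rule-based sentence separator does, in Python:

import re

# Approximation only, not the actual DUC Perl script: break after ., !, or ?
# when followed by whitespace and an uppercase letter or an opening quote.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[\"'A-Z])")

def split_sentences(text):
    text = " ".join(text.split())   # collapse newlines and runs of spaces
    return [s for s in _BOUNDARY.split(text) if s]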
-
Task 1 - Very short summaries
Use the 30 TDT clusters and the 30 TREC clusters. Given each document,
create a very short summary (~10 words, no specific format other
than linear) of it.
NIST will evaluate a subset of the summaries intrinsically (SEE) for
coverage (similar to DUC 2002). In addition, NIST will assign each of
the evaluated summaries to one of a set of given categories based on anticipated
"usefulness"
(See the Issues section below).
-
Task 2 - Short summaries focused by events
Use the 30 TDT clusters. Given each document cluster and the
associated TDT topic, create a short summary (~100 words) of
the cluster.
NIST will evaluate the summaries intrinsically (SEE) for quality and
length-adjusted coverage.
-
Task 3 - Short summaries focused by viewpoints
Use the 30 TREC clusters. Given each document cluster and a viewpoint
description, create a short summary (~100 words) of the cluster
from the point of view specified.
The viewpoint description will be a natural language string no longer
than a sentence. It will describe the important facet(s) of the cluster
that the assessor has decided to include in the short summary. These facet(s)
will be represented in all, or all but one, of the documents in the cluster.
NIST will evaluate the summaries intrinsically (SEE) for quality and
length-adjusted coverage.
-
Task 4 - Short summaries in response to a question
Use the 30 TREC Novelty track clusters. Given each document cluster,
a question, and the set of sentences in each document deemed relevant to
the question, create a short summary (~100 words) of the cluster
that answers the question. The set of sentences in each document that were
deemed relevant and novel will also be made available. The sentences were
identified automatically by the simple sentence separation program used
in DUC 2002.
The instructions given to the humans who identified the relevant and
novel sentences included the following:
-
Order the printed documents according to the ranked list in the topic.
-
Using the description part of the topic only, go through each printed document
and mark in yellow all sentences that directly provide information requested
by the description. Do not mark sentences that are introductory or explanatory
in nature. In particular, if there is a set of sentences that provide a
single piece of information, only select the sentence that provides the
most detail in that set. If two adjacent sentences are needed to provide
a single piece of information because of an unusual sentence construction
or error in the sentence segmentor, mark both.
-
Go to the computer and pull up the online version of your documents.
Go through each document, selecting the sentences that you have previously
marked (you can change your mind). Save this edited version as "relevant".
-
Now go through the online version looking for duplicate information. Order
is important here; if a piece of information has already been picked, then
repeats of that same information should be deleted. Instances that give
further details of that information should be retained, but instances that
summarize details seen earlier should be deleted. Save this second edited
version as "new".
Here is what the humans creating the summaries will be asked to
do:
In this round NIST will mail you 12 sets of printed documents. NIST
will also email you 12 files - one for each set of documents. Each file
will contain a topic statement which poses a question and a list of sentences
which have been determined to be relevant to the question posed by the
topic. In addition, some sentences will be marked as "novel". This means
that someone reading the list from top to bottom decided that the "novel"
sentences introduced new information.
Your task is to create a summary of about 100 words for each file
of relevant sentences. The sentences marked as "novel" may be useful in
creating your summary.
The printed documents are ONLY there for reference - in case you
have trouble understanding any of the sentences. For example you might
need to refer to the printed document to figure out who a pronoun in the
sentence file refers to. Please do not incorporate facts that only occur
in the printed document into your summary. Your summary should be of the
sentences in the sentence file.
Sample novelty topic,
documents, and relevant/new sentence lists are available.
The full set of data from the TREC 2002 Novelty Track is available here:
NIST will evaluate the summaries intrinsically (SEE) for quality and
length-adjusted coverage.
In addition, NIST will assign each of the evaluated summaries for a
cluster to one of a set of given categories based on "responsiveness"
to the question (See the Issues section below).
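As a purely illustrative sketch (not a submitted system, and not an official baseline), a summary of roughly the right size could be assembled from the provided sentence lists by preferring the "novel" sentences, keeping the original order, and stopping near 100 words:

def sketch_task4_summary(relevant, novel, limit=100):
    # relevant: the relevant sentences, in the order given.
    # novel: the subset of those sentences marked "novel".
    # Illustration only: take novel sentences first, then the rest,
    # and stop once about `limit` words have been accumulated.
    novel_set = set(novel)
    ordered = [s for s in relevant if s in novel_set] + \
              [s for s in relevant if s not in novel_set]
    out, words = [], 0
    for s in ordered:
        n = len(s.split())
        if out and words + n > limit:
            break
        out.append(s)
        words += n
    return " ".join(out)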
Issues
-
Operational definition of usefulness categories in task 1:
For each document in the subset whose summaries are being judged,
the assessor will be presented with the document and all the
submitted very short summaries of that document. The instructions to
the assessor will include the following:
Imagine that to save time, rather than read through a set of complete
documents in search of one of interest to you, you could first read
a list of very short summaries of those documents and based on those
summaries choose which documents to read in their entirety.
It would of course be possible to create various very short summaries
of a given document and some such summaries might be more helpful than
others (e.g., tell you more about the content relevant to the subject, be
easier to read, etc.). Your task is to help us understand how relatively
helpful a number of very short summaries of the same document are.
Please read all the following very short summaries of the document you have
been given. Assume the document is one you should read. Grade each summary
according to how useful you think it would be in getting you to choose
the document: 0 (worst, of no use), 1, 2, 3, or 4 (best, as good as having
the full document).
-
Operational definition of responsiveness categories in task 4:
The assessor will be presented with a question (topic), all the
submitted short summaries being judged for that question, and
the relevant sentences from the set of documents being summarized.
The instructions to the assessor will include the following:
You have been given a question (topic), the relevant sentences from
a document set, and a number of short summaries of those sentences -
designed to answer the question. Some of the summaries may be
more responsive (in form and content) to the question than others.
Your task is to help us understand how relatively well each summary
responds to the question.
Read the question and all the associated short summaries. Consult the
relevant sentences in the document set as needed. Then grade
each summary according to how responsive it is to the question:
0 (worst, unresponsive), 1, 2, 3, or 4 (best, fully responsive).
-
Revised quality questions:
We will reuse the 12 quality questions from DUC 2002.
-
Definition of the baseline for the very short summaries:
We
will use the HEADLINE element from each document as the baseline very
short summary of that document. Such elements exist for over 80% of
the documents in the collections used. Where such an element does not
exist, we will supply one. NOTE: The documents will be distributed
without the HEADLINE elements, as a convenience to participants.
Baselines for all tasks have now been defined.
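A hedged sketch of how such a baseline could be generated from a document's HEADLINE and body text (field names as in the hypothetical reader sketched earlier); the fall-back branch is only a stand-in, since NIST states it will supply a headline where one is missing:

def baseline_very_short(headline, text, max_words=10):
    # Illustrative sketch of the Task 1 baseline: the HEADLINE element,
    # trimmed to about ten words. Falling back to the first words of the
    # body text stands in for the headline NIST would supply.
    source = headline if headline.strip() else text
    return " ".join(source.split()[:max_words])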