Introduction to DUC-2001: an Intrinsic Evaluation of Generic News Text Summarization Systems


Table of Contents

Introduction to DUC-2001: an Intrinsic Evaluation of Generic News Text Summarization Systems

Document Understanding Conferences (DUC)

Summarization road map

DUC-2001 schedule

Goals of the talk

The design…

Data: Formation of training/test document sets

Human Summary Creation

Training and test document sets

Example training and test document sets

Automatic baselines

Submitted summaries

Evaluation basics

Phases: Summary evaluation and evaluation of the evaluation

Models

Model editing: very limited

Peers

The implementation…

Origins of the evaluation framework: SEE

Overall peer quality: difficult to define operationally

SEE: overall peer quality

Overall peer quality: assessor feedback

Counts of peer units (sentences) in submissions: widely variable

Grammaticality across all summaries

Most baselines contained a sentence fragment

Grammaticality: singles vs multis (single- vs multi-document seems to have little effect)

Grammaticality: among multis (why more low scores for baseline 50s and human 400s?)

Cohesion across all summaries: median baselines = systems < humans

Cohesion: singles vs multis

Cohesion: among multis (why more high-scoring system summaries in the 50s?)

Organization across all summaries: median baselines > systems > humans

Organization: singles vs multis

Organization: among multis (why more high-scoring system summaries in the 50s? why are human summaries worse for the 200s?)

Cohesion vs organization: any real difference for assessors? Why is organization ever higher than cohesion?

Per-unit content: evaluation details

SEE: per-unit content

Per-unit content: assessor feedback

Per-unit content: measures

Average coverage across all summaries

Average coverage: singles vs multis

Average coverage: among multis (small improvement as size increases)

Average coverage by system for singles

Average coverage by system for multis

Average coverage by docset for 2 systems: averages hide lots of variation by docset-assessor

SEE: unmarked peer units

Unmarked peer units: evaluation details

Unmarked peer units: assessor feedback

Unmarked peer units: few extremely good or bad

Phase 2 initial results

Summing up …

Summing up …

Summing up …

Summing up …

Author:
Paul Over
Retrieval Group, Information Access Division
National Institute of Standards and Technology
