Table of Contents
Introduction to DUC-2001: an Intrinsic Evaluation of Generic News Text Summarization Systems
Document Understanding Conferences (DUC)
Summarization road map
DUC-2001 schedule
Goals of the talk
The design…
Data: Formation of training/test document sets
Human Summary Creation
Training and test document sets
Example training and test document sets
Automatic baselines
Submitted summaries
Evaluation basics
Phases - Summary evaluation and evaluation evaluation
Models
Model editing very limited
Peers
The implementation…
Origins of the evaluation framework - SEE+++
Overall peer quality - Difficult to define operationally
SEE: overall peer quality
Overall peer quality: assessor feedback
Counts of peer units (sentences) in submissions - Widely variable
Grammaticality across all summaries
Most baselines contained a sentence fragment
Grammaticality: singles vs multis - Single- vs multi-document seems to have little effect
Grammaticality: among multis - Why more lower scores for baseline 50s and human 400s?
Cohesion across all summaries - Median baselines = systems < humans
Cohesion: singles vs multis
Cohesion: among multis - Why more higher system summaries in 50s?
Organization across all summaries - Median baselines > systems > humans
Organization: singles vs multis
Organization: among multis - Why more higher system summaries in 50s? Why are human summaries worse for the 200s?
Cohesion vs Organization - Any real difference for assessors? Why is organization ever higher than cohesion?
Per-unit content: evaluation details
SEE: per-unit content
Per-unit content: assessor feedback
Per-unit content: measures
Average coverage across all summaries
Average coverage: singles vs multis
Average coverage: among multis - Small improvement as size increases
Average coverage by system for singles
Average coverage by system for multis
Average coverage by docset for 2 systems - Averages hide lots of variation by docset-assessor
SEE: unmarked peer units
Unmarked peer units: evaluation details
Unmarked peer units: assessor feedback
Unmarked peer units - Few extremely good or bad
Phase 2 initial results
Summing up …
Summing up …
Summing up …
Summing up …
Author: Paul Over, Retrieval Group, Information Access Division, National Institute of Standards and Technology