Summarization Datasets
There are 18 summarization datasets in our directory. Each entry links to its source, paper, and download. Browse the full list below or filter by language.
Summarization is the task of condensing a longer document into a shorter version that retains its key points.
Updated June 2026
- CorpusTCC | Summarization | Portuguese
- Multi-Xscience | Summarization | English
- Gigaword | Summarization | English
- SAMSum | Summarization | English
- Essex Arabic Summaries Corpus (EASC) | Summarization | Arabic
- KALIMAT Multipurpose Arabic Corpus | Summarization, Named Entity Recognition (NER), Part-of-Speech (POS) | Arabic
- The New York Times Annotated Corpus | Summarization, Information Extraction | English
- MATINF | Classification, Question Answering, Summarization | Chinese
- daekeun-ml/naver-news-summarization-ko | Summarization | KO
- DAMO-NLP-SG/multimodal_textbook | Text Generation, Summarization | EN
- Cornell Newsroom | Text Corpora, Summarization | English
- WikiHow | Text Corpora, Summarization | English
- Open-Orca/OpenOrca | Text Classification, Token Classification, Table Question Answering, Question Answering, Zero Shot Classification, Summarization, Feature Extraction, Text Generation | EN
- euirim/goodwiki | Text Generation, Summarization | EN
- dennlinger/eur-lex-sum | Translation, Summarization | BG, HR, CS
- Open-Orca/SlimOrca | Text Classification, Token Classification, Table Question Answering, Question Answering, Zero Shot Classification, Summarization, Feature Extraction, Text Generation | EN
- ccdv/pubmed-summarization | Summarization, Text Generation | EN
- defunct-datasets/amazon_us_reviews | Summarization, Text Generation, Fill Mask, Text Classification | EN