The Semantic Scholar Open Research Corpus (S2ORC)
Text Corpora, Knowledge Base, English, Benchmark
The Semantic Scholar Open Research Corpus (S2ORC) is an English-language text corpus and knowledge base, used as a benchmark dataset, that comprises 136M nodes and 467M edges and is distributed in JSON format.
This dataset is used as an LLM benchmark.
About The Semantic Scholar Open Research Corpus (S2ORC)
The dataset contains 136M+ paper nodes, including 12.7M+ full-text papers, connected by 467M+ citation edges.
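To make the node and edge structure concrete, here is a minimal sketch of how paper records in JSON form might be turned into a citation graph. The sample records are invented for illustration, and the field names `paper_id`, `title`, and `outbound_citations` are assumptions rather than a confirmed schema, as is the one-record-per-line layout. The helpers `build_citation_graph` and `in_degree` are hypothetical names, not part of any official tooling.

```python
import json
from collections import defaultdict

# Hypothetical sample records, one JSON object per line.
# The field names are assumptions made for this illustration.
SAMPLE_JSONL = """\
{"paper_id": "p1", "title": "A", "outbound_citations": ["p2", "p3"]}
{"paper_id": "p2", "title": "B", "outbound_citations": ["p3"]}
{"paper_id": "p3", "title": "C", "outbound_citations": []}
"""


def build_citation_graph(lines):
    """Return (nodes, edges) from an iterable of JSON strings, one paper per line."""
    nodes = {}   # paper id -> title
    edges = []   # (citing paper id, cited paper id)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        pid = record["paper_id"]
        nodes[pid] = record.get("title", "")
        for cited in record.get("outbound_citations", []):
            edges.append((pid, cited))
    return nodes, edges


def in_degree(edges):
    """Count how many times each paper is cited."""
    counts = defaultdict(int)
    for _, cited in edges:
        counts[cited] += 1
    return dict(counts)


nodes, edges = build_citation_graph(SAMPLE_JSONL.splitlines())
citation_counts = in_degree(edges)
```

At the scale described here (hundreds of millions of edges), holding every edge in a Python list would likely be impractical. Streaming the records and writing edges to disk or a graph store is the more realistic design, and the same per-record logic would still apply.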
Details
- Task: Text Corpora, Knowledge Base
- Language: English
- Format: JSON
- Rows / instances: 467M edges, 136M nodes
- Creator: Lo et al.
- Year: 2020