allenai/olmOCR-mix-0225
General NLP | English | odc-by
allenai/olmOCR-mix-0225 is an English-language General NLP dataset distributed in Parquet format under the odc-by license. It falls in the 100K<n<1M size category and has been downloaded 686 times.
About allenai/olmOCR-mix-0225
olmOCR-mix-0225
olmOCR-mix-0225 is a dataset of ~250,000 PDF pages that have been OCRed into plain text in natural reading order using gpt-4o-2024-08-06 and a special prompting strategy that preserves any born-digital content from each page....
Details
- Task: General NLP
- Language: English
- Format: Parquet
- Rows / instances: N/A
- Size: 100K<n<1M
- Creator: allenai
- Year: 2025
- License: odc-by
- Downloads: 686
- Likes: 171