CODE Datasets
We catalog 10 CODE datasets for NLP and machine learning, including 1 benchmark. Browse the list below or narrow it down by task.
This page covers datasets tagged with the CODE language; some entries also include English (EN) text.
Updated June 2026
- codeparrot/apps | Text Generation | CODE
- xlangai/DS-1000 | General NLP | CODE
- codeparrot/codecomplex | Text Generation | CODE
- codeparrot/github-code | Text Generation | CODE
- Locutusque/UltraTextbooks | Text Generation | EN, CODE
- bigcode/the-stack | Text Generation | CODE
- bigcode/the-stack-v2 | Text Generation | CODE
- bigcode/the-stack-dedup | Text Generation | CODE
- BAAI/TACO | Text Generation | CODE
- nyuuzyou/google-code-archive | Text Generation | CODE, EN | Benchmark