List of Programming Language Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. Datasets for low-resource domains such as programming languages can be hard to find, so we collected a list of programming language NLP datasets for machine learning: a large curated base of training and testing data that lets you start your machine learning (ML) project right away. The list covers a wide range of NLP use cases, from text classification and part-of-speech (POS) tagging to machine translation.


Custom fine-tuning with programming language datasets

Metatext is a powerful no-code tool to train, tune, and integrate custom NLP models.


Found 12 programming language datasets

Let’s get started!

CodeXGLUE: CT-All
The task is to predict the word that fills a blank given the context around it, which can be formulated as a multiple-choice classification problem. Each instance in the dataset contains a masked code function, its docstring, and the target word.
CodeXGLUE: CT-Max/Min
This dataset differs from CT-All in that the candidate answers are limited to two words (max and min, as the name indicates). As in CT-All, the task is to predict the word that fills a blank given the context around it, which can be formulated as a multiple-choice classification problem. Each instance in the dataset contains a masked code function, its docstring, and the target word.
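As a rough illustration of the multiple-choice formulation shared by the two cloze-test datasets above, the sketch below picks the highest-scoring candidate for a masked function. The field names, the `<mask>` placeholder, and the toy `docstring_overlap` scorer are assumptions made for this example, not the CodeXGLUE file format or a real model; in practice the scorer would be a trained model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClozeInstance:
    # Hypothetical field names; the real dataset files may be laid out differently.
    masked_code: str       # function body with one word replaced by a mask
    docstring: str         # natural language description of the function
    candidates: List[str]  # words the model chooses between
    target: str            # the word that was masked out


def predict(instance: ClozeInstance,
            score: Callable[[ClozeInstance, str], float]) -> str:
    """Multiple-choice classification: return the best-scoring candidate."""
    return max(instance.candidates, key=lambda cand: score(instance, cand))


def docstring_overlap(instance: ClozeInstance, candidate: str) -> float:
    """Toy stand-in for a model: count docstring words starting with the candidate."""
    words = instance.docstring.lower().split()
    return float(sum(word.startswith(candidate) for word in words))


example = ClozeInstance(
    masked_code="def largest(values):\n    return <mask>(values)",
    docstring="Return the maximum element of values.",
    candidates=["max", "min"],
    target="max",
)

prediction = predict(example, docstring_overlap)
is_correct = prediction == example.target
```

Accuracy over a dataset would then be the fraction of instances for which `is_correct` holds.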
CodeXGLUE: PY 150/Java Corpus token
Datasets used for token-level code completion in Python and Java.
CodeXGLUE: PY 150/Java Corpus line
Datasets used for line-level code completion in Python and Java.
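To make the difference between the token level and the line level concrete, here is a minimal sketch that derives both kinds of training examples from one Python snippet. The helper names and the choice of "all preceding lines" as context are assumptions for illustration; the actual datasets come with their own preprocessing.

```python
import io
import tokenize
from typing import List, Tuple

# Layout-only token types that carry no code content for this illustration.
_SKIPPED = {
    tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
    tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT,
}


def code_tokens(source: str) -> List[str]:
    """Split Python source into token strings using the standard tokenizer."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in _SKIPPED]


def token_level_examples(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Each example: (all tokens seen so far, the next token to predict)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def line_level_examples(source: str) -> List[Tuple[str, str]]:
    """Each example: (all preceding lines, the next line to predict)."""
    lines = source.splitlines()
    return [("\n".join(lines[:i]), lines[i]) for i in range(1, len(lines))]


snippet = "x = 1\ny = x + 2\n"
tokens = code_tokens(snippet)
token_examples = token_level_examples(tokens)
line_examples = line_level_examples(snippet)
```

Token-level completion is scored one token at a time, while line-level completion asks for a whole line at once, so the second setting yields far fewer but longer targets per file.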
CodeXGLUE: CodeSearchNet, AdvTest
Given a natural language query, the task is to retrieve the source code that matches it. To test the generalization ability of a model, function names and variables in the test sets are replaced by special tokens.
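The identifier replacement described above can be sketched with Python's standard `ast` module (`ast.unparse` needs Python 3.9 or later). The placeholder names `FUNC` and `VAR_n` are assumptions chosen for this example, and the dataset's own special tokens and normalization procedure may differ. This simple version only renames parameters and variables that are assigned before they are read.

```python
import ast


class IdentifierNormalizer(ast.NodeTransformer):
    """Replace function names, parameters, and assigned variables with placeholders."""

    def __init__(self) -> None:
        self.mapping = {}  # original identifier -> placeholder

    def _placeholder(self, name: str) -> str:
        if name not in self.mapping:
            self.mapping[name] = f"VAR_{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        node.name = "FUNC"        # hypothetical special token for function names
        self.generic_visit(node)  # parameters are visited before the body
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self._placeholder(node.arg)
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Rename assignment targets and any name already mapped; names such as
        # built-ins (e.g. max) are never assigned here, so they stay untouched.
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._placeholder(node.id)
        return node


source = """
def find_largest(values):
    best = max(values)
    return best
"""

tree = IdentifierNormalizer().visit(ast.parse(source))
normalized = ast.unparse(tree)
```

After normalization the function no longer carries a descriptive name, so a search model has to rely on what the code does rather than on how it is named.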
CodeXGLUE: NL Code Search WebQuery
Code search aims to find the code snippet that best matches the intent of a query. This task is formulated as text-code classification: given a query and a code snippet, decide whether the code answers the query.

Classify and extract text 10x better and faster 🦾

Metatext helps you classify and extract information from text and documents using language models customized with your own data and expertise.