List of Sentiment Analysis Datasets for Machine Learning Projects

High-quality datasets are the key to good performance in natural language processing (NLP) projects. We collected a list of NLP datasets for the sentiment analysis task to help you get started with your machine learning projects. Below you will find a large curated training base for sentiment analysis.

What is the sentiment analysis task?

Sentiment analysis is the process of identifying the sentiment that a writer or speaker expresses in a given piece of text, for example whether the tone is positive, negative, or neutral.
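
As a rough illustration of the task, the sketch below assigns a polarity label by counting words from a tiny hand-made lexicon. The word lists and the function name `classify_sentiment` are assumptions made for this example only; models trained on the datasets listed on this page would learn such cues from labeled data rather than from a fixed list.

```python
import re

# Tiny illustrative lexicons (assumed for this example, not taken from any dataset).
POSITIVE_WORDS = {"good", "great", "excellent", "love", "friendly"}
NEGATIVE_WORDS = {"bad", "terrible", "awful", "hate", "rude"}


def classify_sentiment(text: str) -> str:
    """Return 'positive', 'negative', or 'neutral' from simple word counts."""
    # Lowercase the text and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    score = positives - negatives
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify_sentiment("The staff were friendly and the food was great."))
print(classify_sentiment("Terrible service, I hate waiting."))
print(classify_sentiment("The flight departs at noon."))
```

A word-count baseline like this ignores negation and context, which is one reason labeled datasets are needed for training stronger models.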


Fine-tune custom models with sentiment analysis datasets

Metatext is a powerful no-code tool for training, tuning, and integrating custom NLP models.


Found 92 Sentiment analysis Datasets

Let’s get started!

FinancialPhraseBank
Dataset contains the sentiments for financial news headlines from the perspective of a retail investor.
CASA (IndoNLU)
An aspect-based sentiment analysis dataset consisting of around a thousand car reviews collected from multiple Indonesian online automobile platforms. The task is defined as multi-label classification, where each label represents the sentiment for a single aspect with three possible values: positive, negative, and neutral.
HoASA (IndoNLU)
An aspect-based sentiment analysis dataset consisting of hotel reviews collected from the hotel aggregator platform, AiryRooms. The dataset covers ten different aspects of hotel quality. There are four possible sentiment classes for each sentiment label: positive, negative, neutral, and positive-negative.
EmoT (IndoNLU)
Dataset used for emotion classification of tweets with 5 categories: anger, fear, happiness, love and sadness.
SmSA (IndoNLU)
Dataset is a collection of comments and reviews in Indonesian obtained from multiple online platforms. The text was crawled and then annotated by several Indonesian linguists to construct this dataset. There are three possible sentiments: positive, negative, and neutral.
TermA (IndoNLU)
Dataset consists of thousands of hotel reviews, which each contain a span label for aspect and sentiment words representing the opinion of the reviewer on the corresponding aspect. The labels use Inside-Outside-Beginning (IOB) tagging representation with two kinds of tags, aspect and sentiment.
SigmaLaw-ABSA
Dataset contains legal data consisting of 39,155 legal cases, including 22,776 taken from the United States Supreme Court. For the data collection process, about 2,000 sentences were gathered for annotation, and court cases were selected without targeting any specific category. Party-based sentiment polarity values are annotated: negative, positive, and neutral.
PerSenT
Dataset that captures the sentiment of an author towards the main entity in a news article. This dataset contains annotation for 5.3K documents and 38K paragraphs covering 3.2K unique entities.
XED
Dataset consists of emotion-annotated movie subtitles from OPUS. Plutchik's 8 core emotions were used for annotation, and the data is multilabel. The original annotations were sourced mainly for English and Finnish, with the rest created by projecting annotations onto aligned subtitles in 41 additional languages; 31 languages are included in the final dataset (more than 950 annotated subtitle lines).
Vietnamese Students’ Feedback Corpus (UIT-VSFC)
Dataset contains over 16,000 sentences that are human-annotated for two different tasks: sentiment-based and topic-based classification.
MalayalamMixSentiment
Dataset contains 6,739 comments and 7,743 distinct sentences. There are 5 classes: Positive, Negative, Mixed feelings, Neutral, and Non-Malayalam. Downloading the dataset requires emailing the author.
Yelp Polarity Reviews
Dataset contains 1,569,264 samples from the Yelp Dataset Challenge 2015. This subset has 280,000 training samples and 19,000 test samples in each polarity. Dataset from FastAI's website.
PolEmo2.0-IN & OUT
Dataset contains online reviews from the medicine and hotel domains. The task is to predict the sentiment of a review.
Clash of Clans
Dataset contains 50K user comments, both from the iTunes App Store and Google Play. The dataset spans from Oct 18, 2018 to Feb 1, 2019.
MPQA Opinion Corpus
Dataset contains news articles and other text documents manually annotated for opinions and other private states (e.g., beliefs, emotions, sentiments, and speculations).
Arabic Jordanian General Tweets (AJGT)
Dataset consists of 1,800 tweets annotated as positive or negative. The tweets are written in Modern Standard Arabic (MSA) or Jordanian dialect.
ASTD: Arabic Sentiment Tweets Dataset
Dataset contains over 10k Arabic sentiment tweets classified into 4 classes: subjective positive, subjective negative, subjective mixed, and objective.
Webis-CLS-10
The Cross-Lingual Sentiment (CLS) dataset comprises about 800,000 Amazon product reviews in 4 languages: English, German, French, and Japanese.
SemEval-2016 Task 4
Dataset contains 5 subtasks involving the sentiment analysis of tweets.
ArguAna TripAdvisor Corpus
Dataset contains 2,100 hotel reviews balanced with respect to the reviews' sentiment scores. Reviews are segmented into subsentence-level statements that have been manually classified as a fact, a positive opinion, or a negative opinion.
Excitement Datasets
Datasets contain negative feedback from customers in which they state reasons for dissatisfaction with a given company. The datasets are available in English and Italian.
Large Movie Review Dataset - IMDb
Dataset contains 25,000 highly polar movie reviews for training, and 25,000 for testing.
Wisesight Sentiment Corpus
Dataset contains around 26,700 messages in the Thai language from various social media platforms with human-annotated sentiment classification (positive, neutral, negative, and question).
Irony Sarcasm Analysis Corpus
Dataset contains tweets in 4 subgroups: irony, sarcasm, regular, and figurative. Obtaining the tweets requires the Twitter API.
Sentiment Corpus of App Reviews with Fine-grained Annotations in German (SCARE)
Dataset consists of fine-grained annotations for mobile application reviews from the Google Play Store. For each user review, the mentioned application aspects (e.g., the design or the usability) are annotated, as are the subjective phrases that evaluate these aspects. In addition, the polarity (positive, negative, or neutral) of each subjective phrase is recorded, as well as the relationship of an aspect to the main app under discussion. Retrieving the data requires emailing the source for a password.
Dutch Book Reviews
Dataset contains book reviews along with associated binary sentiment polarity labels.
Amazon Fine Food Reviews
Dataset consists of reviews of fine foods from Amazon.
Amazon Reviews
US product reviews from Amazon.
Blogger Authorship Corpus
Blog post entries of 19,320 people from blogger.com.
CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI)
Dataset contains more than 23,500 sentence utterance videos from more than 1,000 online YouTube speakers. The dataset is gender balanced. All sentence utterances are randomly chosen from various topics and monologue videos.
Examiner Pseudo-News Corpus
Clickbait, spam, crowd-sourced headlines from 2010 to 2015.
NYSK Dataset
English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn.
Sentiment Labeled Sentences Dataset
Dataset contains 3,000 sentiment labeled sentences.
Sentiment140
Tweet data from 2009 including original text, timestamp, user, and sentiment.
Skytrax User Reviews Dataset
User reviews of airlines, airports, seats, and lounges from Skytrax.
The Stanford Sentiment Treebank (SST)
Sentence sentiment classification of movie reviews.
Twitter Dataset for Arabic Sentiment Analysis
Dataset contains Arabic tweets.
Twitter US Airline Sentiment
Dataset contains airline-related tweets that were labeled with positive, negative, and neutral sentiment.
Yelp Open Dataset
Dataset contains millions of Yelp reviews. In addition, it contains business data including location data, attributes, and categories.
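
Some entries in the list describe their label format in enough detail to sketch in code. TermA, for instance, uses IOB tagging with two kinds of tags, aspect and sentiment. The minimal sketch below shows how such tags could be decoded into labeled spans. The tag strings (`B-ASPECT`, `I-SENTIMENT`), the example sentence, and the function name `iob_to_spans` are assumptions for illustration and may not match the dataset's actual files.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (label, text) spans."""
    spans: List[Tuple[str, str]] = []
    label: Optional[str] = None   # label of the span currently being built
    words: List[str] = []         # tokens collected for that span

    for token, tag in zip(tokens, tags):
        # "B-ASPECT" -> ("B", "ASPECT"); "O" -> ("O", "")
        prefix, _, tag_label = tag.partition("-")
        continues = prefix == "I" and tag_label == label

        # Close the open span unless this token continues it.
        if not continues and label is not None:
            spans.append((label, " ".join(words)))
            label, words = None, []

        if continues:
            words.append(token)
        elif prefix in ("B", "I"):
            # A stray I- tag without a matching open span is treated as a start.
            label, words = tag_label, [token]

    # Close a span that runs to the end of the sentence.
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


# Hypothetical hotel-review sentence with assumed tag names.
tokens = ["The", "room", "service", "was", "very", "slow"]
tags = ["O", "B-ASPECT", "I-ASPECT", "O", "B-SENTIMENT", "I-SENTIMENT"]
print(iob_to_spans(tokens, tags))
```

Decoding to spans this way makes it easier to pair each aspect phrase with the sentiment phrase that describes it before training or evaluation.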

Classify and extract text 10x better and faster 🦾

Metatext helps you classify and extract information from text and documents using language models customized with your data and expertise.