Aspect Extraction and Content Classification on Cosmetic Reviews with BERT

Dec 23, 2020 · 6 min read


Madhawadias | Python Developer @iLabs

Text written in natural language is difficult for a computer to interpret. Natural Language Processing (NLP) is a field, closely tied to machine learning, that lets a computer system work with text written in human language. User reviews are a popular source for gathering user opinions on a product or service.

In this article, we compare and contrast different sentiment analysis technologies and methods. The data used for this analysis is extracted from Sephora.

“Online reviews affect product marketing, and companies use online reviews to investigate consumer attitudes and perceptions of their products.” - Kim Sung Guena, Kang Juyoung

Natural Language Processing is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications (Liddy, 2001).

Past research on sentiment analysis has used LSTM (Long Short-Term Memory) networks and TF-IDF (Term Frequency-Inverse Document Frequency) features, which are no longer the most accurate methods available. Later studies applied BERT (Bidirectional Encoder Representations from Transformers) to sentiment analysis and reported that it outperformed both LSTM and TF-IDF models. BERT was therefore chosen to improve sentiment analysis on cosmetic reviews.

Categorizing and profiling cosmetic products (or restaurants) and their customers according to aspects and demographics is essential for the strategic decisions of an owner or stakeholder of a cosmetics or hospitality service provider.

What you will learn:

  • Understand what BERT is
  • Preprocess text data for BERT (Tokenization, attention masks, and padding)
  • Use Transfer Learning to build Content Classifier using the Transformers library by Hugging Face
  • Evaluate the model on test data
  • Classify raw text

What is BERT?

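BERT stands for Bidirectional Encoder Representations from Transformers. If that sounds Greek to you, you have come to the right place!
BERT was introduced in the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (Devlin et al., 2018).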

BERT Model

Problem and the Solution

As humans, when we read a text from left to right, we remember the previous words and what they meant, and we pay attention to the keywords that feel important. Older NLP methods handle this poorly: an LSTM is a recurrent neural network that reads one word at a time and tends to forget words that appeared much earlier, while TF-IDF only weighs how often words occur and ignores their order altogether.

This is where BERT comes into play. Because it reads the text bidirectionally, it can use the words both before and after a given word to make sense of the entire sentence. Combined with the attention mechanism of the Transformer, this gave BERT state-of-the-art results on a range of NLP benchmarks when it was released. The attention mechanism allows the model to learn contextual relations between words. For example, in “Madhawa has a dog. His dog is a Labrador”, the model can learn that ‘His’ refers to Madhawa.
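To make the idea of attention a little more concrete, here is a minimal sketch of scaled dot-product attention for a single word, written in plain Python. The two-dimensional vectors are made up for illustration; in BERT they are learned, much larger, and passed through separate query, key, and value projections that this sketch leaves out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The context vector is the weighted average of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Hypothetical 2-d vectors for three tokens (not real BERT embeddings).
tokens = ["Madhawa", "dog", "His"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

# How much does "His" attend to each token?
weights, context = attend(vectors[2], vectors, vectors)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.2f}")
```

With these toy vectors, “His” puts its largest weight on “Madhawa” simply because their vectors point in a similar direction. During training, BERT learns vectors that produce this kind of behaviour.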

Things to know before starting to Code

Masked Language Modelling (Masked LM)

The objective of this task is to guess the masked tokens from the surrounding context.
e.g. He [MASK] a new [MASK] = He bought a new car
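A rough sketch of how such training examples can be produced is shown below. It is a simplification: the BERT paper masks about 15% of the tokens and, of those, replaces only 80% with [MASK] (10% get a random token and 10% stay unchanged). The function name mask_tokens and its parameters are my own, not part of any library.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and remember the originals as labels."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]  # the word the model must recover
        masked[pos] = "[MASK]"
    return masked, labels

# mask_prob is raised to 0.4 so that two of the five tokens get masked.
original = ["He", "bought", "a", "new", "car"]
masked, labels = mask_tokens(original, mask_prob=0.4)
print(masked)
print(labels)
```

During pre-training, the model sees the masked list and is scored on how well it predicts the words stored in labels.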

Next Sentence Prediction (NSP) — Binary classification

The objective of this task is to predict whether the second sentence follows the first.
e.g. [CLS] He [MASK] a new [MASK]. [SEP] Wow! That’s awesome [SEP] (follows)
e.g. [CLS] He [MASK] a new [MASK]. [SEP] That building is the tallest [SEP] (does not follow)
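The sketch below shows one way to lay out a sentence pair for this task, assuming the sentences are already split into tokens. Besides the special tokens, BERT also receives segment IDs that tell it which sentence each token belongs to. The helper build_pair is a name made up for this example.

```python
def build_pair(sentence_a, sentence_b):
    """Join two tokenized sentences the way BERT expects a sentence pair."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    # Segment ids: 0 covers [CLS], sentence A and its [SEP];
    # 1 covers sentence B and its [SEP].
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    return tokens, segment_ids

first = ["He", "[MASK]", "a", "new", "[MASK]", "."]
second = ["Wow", "!", "That", "'s", "awesome"]

tokens, segment_ids = build_pair(first, second)
print(tokens)
print(segment_ids)
```

The binary label (follows or does not follow) is predicted from the output at the [CLS] position.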


Google Colab project can be found here.

Data Preprocessing

You must know by now that before any machine learning task, raw data should be preprocessed and converted to numbers. Sadly machines are only good with numbers. Mr. BERT requires even more preprocessing!

Here are the requirements:

  • Add special tokens to separate sentences and do classification
  • Pass sequences of constant length (introduce padding)
  • Create an array of 0s (pad token) and 1s (real token) called attention mask
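The padding and attention-mask requirements can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how the Transformers library implements it. The token IDs are hypothetical, and pad_id=0 is an assumption that matches the standard BERT vocabularies, where [PAD] has ID 0.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or pad to max_len and build the matching attention mask."""
    ids = token_ids[:max_len]
    attention_mask = [1] * len(ids)  # 1 = real token
    padding = max_len - len(ids)
    # 0 in the mask marks padding the model should ignore.
    return ids + [pad_id] * padding, attention_mask + [0] * padding

# Hypothetical token ids for two reviews of different lengths.
batch = [[101, 2307, 3688, 102], [101, 2307, 102]]
encoded = [pad_and_mask(ids, max_len=6) for ids in batch]

for input_ids, attention_mask in encoded:
    print(input_ids, attention_mask)
```

Note that the plain slice used here would cut off the closing [SEP] of a review that is too long; the library tokenizers take care to keep the special tokens when they truncate.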

The Transformers library provides (you’ve guessed it) a wide variety of Transformer models (including BERT). It works with TensorFlow and PyTorch! It also includes prebuilt tokenizers that do most of the work.

Choosing the BERT model:

You can use either a cased or an uncased version of BERT and its tokenizer. I’ve experimented with both, and the cased version worked better on this data. Intuitively, that makes sense, since “BAD” might convey more sentiment than “bad”.

Special Tokens:

[SEP] - marker for the end of a sentence
[CLS] - marker placed at the start of every input; its output is used for classification
[UNK] - stands in for any token that is not in the vocabulary
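Here is a toy illustration of how these tokens show up when text is turned into IDs. The vocabulary and the function to_ids are invented for this example; a real BERT vocabulary has roughly 30,000 entries, and its WordPiece tokenizer first tries to split an unfamiliar word into known sub-word pieces before falling back to [UNK].

```python
# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "this": 4, "serum": 5, "is": 6, "great": 7}

def to_ids(words):
    """Wrap the words in [CLS] ... [SEP] and map each token to its id."""
    tokens = ["[CLS]"] + list(words) + ["[SEP]"]
    # Any token missing from the vocabulary falls back to the [UNK] id.
    return [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]

ids = to_ids(["this", "serum", "is", "amazeballs"])
print(ids)  # [2, 4, 5, 6, 1, 3]
```

The made-up word “amazeballs” is not in the toy vocabulary, so it is mapped to the ID of [UNK].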

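The encoding can be done using the tokenizer’s encode_plus() method, which adds the special tokens, pads or truncates the sequence to a fixed length, and returns the input IDs together with the attention mask.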

Content Classification with BERT

For this project, we have chosen the BertForSequenceClassification model from Hugging Face. Its output is a SequenceClassifierOutput (if return_dict=True is passed or when config.return_dict=True) or a tuple of torch.FloatTensor comprising various elements depending on the configuration (BertConfig) and inputs.

Under the hood, the base BERT model returns last_hidden_state, the sequence of hidden states from the last layer of the model. The pooled_output is obtained by applying the BertPooler to last_hidden_state:

pooled_output is a summary of the content, according to BERT.
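As a rough picture of what the pooler does: it takes the hidden state of the first token ([CLS]), passes it through a dense layer, and applies tanh. The plain-Python sketch below mimics that for a single sequence. The numbers, the tiny hidden size, and the function name pool are made up; in the real model the weights are learned and the hidden size of BERT-base is 768.

```python
import math

def pool(last_hidden_state, weight, bias):
    """Dense layer plus tanh applied to the first token's hidden state."""
    cls_vector = last_hidden_state[0]  # hidden state of the [CLS] token
    return [
        math.tanh(sum(w * h for w, h in zip(row, cls_vector)) + b)
        for row, b in zip(weight, bias)
    ]

# Hypothetical values: 3 tokens, hidden size 2.
hidden = [[0.5, -1.0], [0.2, 0.3], [0.9, 0.1]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the output is tanh of the [CLS] vector
bias = [0.0, 0.0]

pooled = pool(hidden, weight, bias)
print(pooled)
```

The classifier head then works on this single vector instead of on every token in the review.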

Training the Model

To reproduce the training procedure from the BERT paper, we’ll use the AdamW optimizer provided by Hugging Face. It applies weight decay correctly (decoupled from the gradient update), so it’s similar to the original paper. We’ll also use a linear scheduler with no warmup steps.
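To see what a linear scheduler does to the learning rate, here is a small stand-alone sketch of the schedule’s shape. It is meant to mirror the behaviour of get_linear_schedule_with_warmup from the Transformers library, but it is my own re-implementation for illustration. The step counts and the base learning rate of 2e-5 are assumptions.

```python
def linear_schedule_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr during warmup (skipped when warmup_steps=0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical run: 100 batches per epoch for 10 epochs.
total_steps = 100 * 10
schedule = [linear_schedule_lr(s, total_steps) for s in (0, 500, 1000)]
print(schedule)
```

With no warmup, the learning rate starts at its full value and falls in a straight line to zero by the last training step.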

How do we come up with all hyperparameters? The BERT authors have some recommendations for fine-tuning:

  • Batch size: 16, 32
  • Learning rate (Adam): 5e-5, 3e-5, 2e-5
  • Number of epochs: 2, 3, 4

We’re going to ignore the recommended number of epochs but stick with the rest. Note that increasing the batch size reduces the training time significantly, but it can give you lower accuracy.

The model is trained for 10 epochs, and the checkpoint with the highest validation accuracy is kept as the final BERT classifier.
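Picking the best checkpoint comes down to tracking the validation accuracy after every epoch. The sketch below shows that bookkeeping on its own; the accuracy values are invented for illustration and are not the results of this project.

```python
def best_epoch(val_accuracies):
    """Return (epoch number, accuracy) of the best validation score; epochs start at 1."""
    best_idx = max(range(len(val_accuracies)), key=val_accuracies.__getitem__)
    return best_idx + 1, val_accuracies[best_idx]

# Made-up validation accuracies for 10 epochs.
history = [0.71, 0.78, 0.82, 0.85, 0.84, 0.86, 0.88, 0.87, 0.88, 0.86]

epoch, accuracy = best_epoch(history)
print(f"Best epoch: {epoch} (validation accuracy {accuracy:.2f})")
```

On a tie, the earliest epoch wins. In a real PyTorch training loop you would typically save the model weights (for example with torch.save(model.state_dict(), path)) whenever the validation accuracy improves, rather than deciding only at the end.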


Let’s start the evaluation of our model.


Recall is the percentage of all relevant (actually positive) examples that the model classified correctly: true positives / (true positives + false negatives). The higher the percentage, the fewer relevant examples the model misses.


Precision is the percentage of examples predicted as positive that are actually positive: true positives / (true positives + false positives). The higher the percentage, the fewer false alarms the model raises.


The F1 score combines recall and precision into a single number: it is the harmonic mean of the two. A model with a higher F1 score is generally considered the better model.
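The three metrics can be computed by hand in a few lines, as the sketch below shows for one class. The class names and labels are hypothetical and only there to make the example runnable.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Hypothetical true and predicted classes for six reviews.
y_true = ["skincare", "skincare", "makeup", "makeup", "skincare", "makeup"]
y_pred = ["skincare", "makeup", "makeup", "skincare", "skincare", "makeup"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="skincare")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For a multi-class problem, the same calculation is repeated with each class in turn as the positive one.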

Determining the Confusion Matrix

The performance of a machine learning model, in this case the BERT model, can be summarized with a confusion matrix. Each row of the confusion matrix represents the instances of an actual class and each column represents the instances of a predicted class (Hay, 1988).
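A confusion matrix with this layout (rows for actual classes, columns for predicted classes) can be built as in the following sketch. The class names and labels are again hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count predictions; rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

# Hypothetical true and predicted classes for six reviews.
labels = ["skincare", "makeup"]
y_true = ["skincare", "skincare", "makeup", "makeup", "skincare", "makeup"]
y_pred = ["skincare", "makeup", "makeup", "skincare", "skincare", "makeup"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correct predictions, and every off-diagonal cell shows which class was confused with which.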


Super Job! You learned how to:

  • Preprocess text data for BERT (Tokenization, attention masks, and padding)
  • Use Transfer Learning to build Content Classifier using the Transformers library by Hugging Face
  • Evaluate the model on test data
  • Classify raw text

Originally published on December 23, 2020.



