What is topic modelling?
Topic modelling is an unsupervised machine learning technique that uncovers the hidden topics and underlying patterns in unstructured text data. As the volume of text data grows rapidly, extracting meaningful insights from it has become increasingly important, for example to better understand a business and its customers.
What are topics in the context of topic modelling?
Topics are underlying themes, each represented by a group of words that frequently occur together in documents. For example, a news article may have an underlying topic such as entertainment, politics, or foreign relations, or a combination of these topics.
Steps in Topic Modelling
- Data Preparation
This step involves cleaning and preprocessing the text data. Typical preprocessing tasks include tokenization, removing punctuation, removing stop words, and stemming or lemmatization.
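As a rough illustration of the preprocessing described above, here is a minimal sketch using only the Python standard library. The stop-word list and the `preprocess` helper are made up for this example; stemming and lemmatization are left out because they are usually handled by a library such as NLTK or spaCy.

```python
import re

# A tiny illustrative stop-word list; real projects usually rely on a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "on", "to", "was", "were"}

def preprocess(text):
    """Lowercase the text, tokenize it, and drop punctuation and stop words."""
    # Keeping only runs of letters tokenizes and removes punctuation in one pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("The election results were announced on Monday, and the markets rallied!")
print(tokens)
# ['election', 'results', 'announced', 'monday', 'markets', 'rallied']
```

The regular expression also discards digits and splits on apostrophes, which may or may not suit a given corpus.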
- Building a Document Term Matrix (DTM)
A DTM is the format in which text is fed into the model. Rows represent documents and columns represent words. Each cell holds the frequency of the word in the document when the term frequency method is used; other weightings, such as TF-IDF, are used depending on the algorithm applied.
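To make the row and column layout concrete, the sketch below builds a small term-frequency DTM by hand. The three tokenized documents are hypothetical; in practice a library vectorizer (for example scikit-learn's `CountVectorizer`) would normally do this work.

```python
from collections import Counter

# Hypothetical, already-preprocessed documents (lists of tokens).
docs = [
    ["election", "vote", "election"],
    ["film", "actor", "film", "award"],
    ["vote", "award"],
]

# Vocabulary: one column per distinct word, in sorted order.
vocab = sorted({tok for doc in docs for tok in doc})

# One row per document; each cell is the count of that word in the document.
dtm = []
for doc in docs:
    counts = Counter(doc)
    dtm.append([counts[word] for word in vocab])

print(vocab)  # ['actor', 'award', 'election', 'film', 'vote']
for row in dtm:
    print(row)
# [0, 0, 2, 0, 1]
# [1, 1, 0, 2, 0]
# [0, 1, 0, 0, 1]
```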
- Selecting Topic Modelling Algorithm
Several topic modelling algorithms are available, such as NMF (Non-Negative Matrix Factorization) and LDA (Latent Dirichlet Allocation). LDA is the more commonly used of the two.
- Model Training
The model is trained after the data is split into training and test sets. The DTM is used as the input to discover the topics within the corpus. Hyperparameter tuning is also performed, chiefly to choose the number of topics.
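The training step might look roughly like the sketch below, which assumes scikit-learn is installed and uses LDA. The toy corpus, the manual train/test split, and the candidate topic counts are all invented for illustration, and held-out perplexity is only one of several possible criteria for picking the number of topics.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; a real corpus would be far larger.
train_docs = [
    "the minister won the election after the final vote",
    "voters backed the minister in the election",
    "the actor won an award for the film",
    "the film festival gave the director an award",
    "parliament will vote on the election law",
    "the actor and director promoted the new film",
]
test_docs = [
    "the election vote was close",
    "the film won an award",
]

# Build the DTM on the training split, then reuse its vocabulary for the test split.
vectorizer = CountVectorizer(stop_words="english")
train_dtm = vectorizer.fit_transform(train_docs)
test_dtm = vectorizer.transform(test_docs)

# Try a few candidate topic counts and record held-out perplexity (lower is better).
perplexities = {}
for n_topics in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(train_dtm)
    perplexities[n_topics] = lda.perplexity(test_dtm)

# Refit with the best-scoring topic count and get per-document topic proportions.
best_n = min(perplexities, key=perplexities.get)
best_lda = LatentDirichletAllocation(n_components=best_n, random_state=0)
doc_topics = best_lda.fit_transform(train_dtm)  # rows: documents, columns: topics

print(perplexities)
print(doc_topics.shape)
```

On a corpus this small the perplexity values carry little meaning; the point is only the shape of the loop.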
- Topic interpretation and evaluation
Examine the words most strongly associated with each topic to interpret the theme of that topic.
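Interpretation usually starts from the highest-weighted words in each topic. The sketch below assumes a topic-word weight matrix is already available (in scikit-learn this would be the fitted model's `components_` attribute); the vocabulary, the weights, and the `top_words` helper are hypothetical.

```python
# Hypothetical vocabulary and topic-word weights (one row per topic).
vocab = ["actor", "award", "election", "film", "minister", "vote"]
topic_word = [
    [0.1, 0.3, 2.5, 0.1, 1.8, 2.9],
    [2.2, 1.9, 0.1, 3.0, 0.2, 0.1],
]

def top_words(weights, vocab, n=3):
    """Return the n words with the largest weight in one topic."""
    ranked = sorted(zip(weights, vocab), reverse=True)
    return [word for _, word in ranked[:n]]

for topic_id, weights in enumerate(topic_word):
    print(topic_id, top_words(weights, vocab))
# 0 ['vote', 'election', 'minister']
# 1 ['film', 'actor', 'award']
```

Reading these lists, one might label the first topic "politics" and the second "entertainment"; that labelling step remains a human judgement.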
Applications of Topic Modelling
- Content Recommendations: Recommend content on the same topics to users based on their history.
- Customer Insights: Identify the topics in reviews shared by customers.
- Text Categorization: Categorize text data based on the topics interpreted from topic modelling.
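As one example of the text categorization use case, each document can be assigned the label of its dominant topic. The sketch below assumes per-document topic proportions are already available from a trained model; the numbers, the labels, and the `dominant_topic` helper are hypothetical.

```python
# Hypothetical labels, assigned by a person after interpreting each topic.
topic_labels = ["politics", "entertainment"]

# Hypothetical per-document topic proportions (rows: documents, columns: topics).
doc_topics = [
    [0.92, 0.08],
    [0.15, 0.85],
    [0.55, 0.45],
]

def dominant_topic(row):
    """Return the index of the topic with the largest proportion."""
    return max(range(len(row)), key=lambda i: row[i])

categories = [topic_labels[dominant_topic(row)] for row in doc_topics]
print(categories)  # ['politics', 'entertainment', 'politics']
```

A document like the third one, split almost evenly between topics, might be better tagged with both labels than forced into one.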