Tuesday, 19 September 2023

Named Entity Recognition (NER) in Natural Language Processing

What is Named Entity Recognition (NER)?

Named entity recognition (NER) is a sub-task of natural language processing (NLP). The objective of NER is to identify named entities in text data and classify them into predefined categories such as person, organization, location, date, and percentage. NER supports information extraction, text understanding, and document summarization. NER models help organizations extract insights from text, automate information retrieval, and improve search functionality.

NER Categorization

The following are the primary entity categories in NER.

  • Persons: names of people
  • Organizations: companies, government bodies, political groups
  • Locations: names of places, including cities, countries, monuments, etc.
  • Dates: specific date expressions such as years, months, and days
  • Numbers: numerical values such as percentages, currencies, measurements, etc.
  • Miscellaneous: other named entities such as product names, event titles, skills, etc.

Importance of NER

  • Information Extraction: Extracting names of people, organizations, locations, etc. from unstructured text.
  • Question Answering: Chatbots identify entities mentioned in user queries and retrieve relevant information.
  • Document Summarization: Identifying and highlighting the key named entities in a document.
  • Sentiment Analysis: Identifying which organizations and products are discussed in customer reviews, so that sentiment can be tied to the right entity.

Techniques in NER

  • Rule-based NER: These systems rely on predefined patterns, regular expressions, or dictionaries to identify named entities.
  • Statistical NER: These models use ML algorithms such as conditional random fields (CRF) and hidden Markov models (HMM). They require labelled training data for learning.
  • Deep learning-based NER: Recurrent neural networks (RNNs) and transformers have gained popularity for their ability to capture contextual information and achieve state-of-the-art results.
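
To make the rule-based technique above concrete, here is a minimal sketch that combines a small dictionary with a few regular expressions. The dictionary entries, the patterns, the label names, and the sample sentence (including the person and the bank) are all made up for illustration; a real system would need far broader rules.

```python
import re

# Hypothetical dictionary (gazetteer) of known entities. Illustrative only.
GAZETTEER = {
    "Acme Bank of India": "ORGANIZATION",
    "Mumbai": "LOCATION",
    "India": "LOCATION",
}

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Illustrative patterns for a few categories; these are assumptions,
# not an exhaustive rule set.
PATTERNS = [
    ("PERCENT", re.compile(r"\b\d+(?:\.\d+)?%")),
    ("DATE", re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")),
    ("PERSON", re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\. [A-Z][a-z]+(?: [A-Z][a-z]+)*")),
]


def extract_entities(text):
    """Return (entity_text, label) pairs in order of appearance."""
    candidates = []
    # Dictionary lookup: match each known name as a whole phrase.
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            candidates.append((match.start(), match.end(), label))
    # Pattern lookup: apply each regular expression.
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            candidates.append((match.start(), match.end(), label))

    # Prefer longer matches and drop anything overlapping a kept match,
    # so "India" inside "Acme Bank of India" is not reported twice.
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))
    kept = []
    for start, end, label in candidates:
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, label))
    kept.sort()
    return [(text[s:e], label) for s, e, label in kept]


sample = (
    "Dr. Asha Rao of Acme Bank of India said in Mumbai on "
    "19 September 2023 that profits rose 6.5%."
)
for entity, label in extract_entities(sample):
    print(f"{label:<12} {entity}")
```

Rules like these are easy to read and need no training data, but they only find what the dictionary and patterns anticipate, which is one reason statistical and deep learning models are often preferred on varied text.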

Challenges in NER

  • Ambiguity: Text data often contains ambiguous references to entities, which makes it difficult to assign the correct category. For example, "Apple" may refer to a company or a fruit.
  • Named Entity Variability: The same entity can appear in various forms, such as abbreviations, misspellings, and synonyms.
  • Domain Specificity: NER models perform differently across domains, since each domain has its own vocabulary and context.

Applications of NER

  • Healthcare: NER is used to extract medical entities like patient names, diseases, and treatment information from electronic health records.
  • Finance: NER is used to identify entities such as company names, stock symbols, and financial metrics from reports and news articles.
  • Legal: NER assists in recognizing legal entities, case names, and references to legal documents in legal texts.
 

Monday, 18 September 2023

Sentiment Analysis in Natural Language Processing

What is Sentiment Analysis?

Sentiment analysis, also known as opinion mining, is used to extract sentiment or opinions from text data such as feedback, comments, and tweets. The objective of sentiment analysis is to classify the text as positive or negative.

Key steps involved in sentiment analysis

  • Text Preprocessing
  1. Tokenization - Splitting the text data into individual words or tokens
  2. Lowercasing - Converting the text data to lower case so that it is consistent across the whole dataset
  3. Stop word removal - Removing common words that carry little meaning, such as 'a', 'an', 'the', etc.
  4. Stemming or Lemmatization - Reducing words to their root form, such as 'playing' to 'play'
  • Feature Extraction
  1. Bag of words - Text data is represented by the frequency of each word
  2. Term Frequency-Inverse Document Frequency (TF-IDF) - Words are weighted by how important they are to a document relative to the whole collection of documents
  3. Word Embeddings - Pre-trained models (e.g. Word2vec, GloVe) can be used to create word vectors that capture semantic meaning
  • Model Selection
  1. Lexicon based - Sentiment is scored using a sentiment lexicon, i.e. a list of words with predefined polarity
  2. Machine learning models - Supervised or unsupervised models such as Naive Bayes, support vector machines (SVM), and LSTMs, or transformer-based models such as BERT
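
The steps above can be sketched end to end with the standard library only. This is a rough illustration, not a production pipeline: the stop-word list, the sentiment lexicon, and the `crude_stem` helper are tiny made-up stand-ins for what a proper NLP library would provide, and returning "neutral" for a score of zero is an added assumption.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list and sentiment lexicon (assumptions).
STOP_WORDS = {"a", "an", "the", "is", "was", "were", "and", "but", "it", "this", "i"}
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "poor": -1, "hate": -1}


def crude_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, remove stop words, then stem."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


def bag_of_words(tokens):
    """Represent the text as word -> frequency."""
    return Counter(tokens)


def lexicon_sentiment(text):
    """Sum lexicon polarities over the bag of words and map to a label."""
    counts = bag_of_words(preprocess(text))
    score = sum(LEXICON.get(word, 0) * n for word, n in counts.items())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # assumption: a zero score is reported as neutral


print(preprocess("The players were playing a great game"))
print(lexicon_sentiment("The battery is great and I love it"))
print(lexicon_sentiment("Bad screen, poor support"))
```

A lexicon-based scorer like this needs no training data, but it misses negation ("not good") and sarcasm, which is where the machine learning models listed above tend to do better.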

Applications of Sentiment Analysis

  • Customer Feedback Analysis - Analyze customer review data
  • Social Media Monitoring - Track and analyze sentiment expressed on social media
  • Market Research - Understand public sentiment towards specific products

Topic Modelling in Natural Language Processing

What is topic modelling?

Topic modelling is a machine learning technique whose objective is to discover the underlying patterns and hidden topics in unstructured text data. As the volume of text data grows, it has become increasingly important to extract meaningful insights from it and use them to inform business decisions.

What are topics in the context of topic modelling?

Topics are underlying patterns or themes, each represented by a group of words that frequently occur together in documents. For example, a news article may have an underlying topic such as entertainment, politics, or foreign relations, or a combination of these topics.

Steps in Topic Modelling

  • Data Preparation
This step involves cleaning and preprocessing the text data. The tasks associated with preprocessing include tokenization, removing punctuation, removing stop words, and stemming or lemmatization.
  • Building a Document Term Matrix (DTM)
A DTM is the format in which text data is fed to the model. Rows represent documents and columns represent words. Each cell holds the frequency of the word in the document if the term frequency method is used; otherwise it holds another value, depending on the algorithm applied.
  • Selecting Topic Modelling Algorithm
Different topic modelling algorithms can be used, such as NMF (Non-Negative Matrix Factorization) or LDA (Latent Dirichlet Allocation). Of the two, LDA is the more commonly used.
  • Model Training
The model is trained after the data is split into training and test sets. The DTM is used as the input to discover the topics within the corpus. Hyperparameter tuning is also performed, for example to choose the number of topics.
  • Topic interpretation and evaluation
Examine the words most strongly associated with each topic to interpret the theme of that topic.
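
The document term matrix from the steps above can be sketched in a few lines of plain Python. The three tokenized documents are made up for illustration, and `tfidf_weight` shows one common unsmoothed TF-IDF variant as an assumed example of the "other values" a cell may hold; libraries typically add smoothing and normalization.

```python
import math
from collections import Counter


def build_dtm(tokenized_docs):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in document i."""
    vocabulary = sorted({token for doc in tokenized_docs for token in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


def tfidf_weight(matrix):
    """Reweight term counts by log(N / document frequency); unsmoothed variant."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / doc_freq[j]) if doc_freq[j] else 0.0
         for j in range(n_terms)]
        for row in matrix
    ]


# Made-up, already preprocessed documents (rows of the DTM).
docs = [
    ["election", "vote", "party", "vote"],
    ["film", "actor", "award"],
    ["party", "film", "award", "award"],
]
vocabulary, dtm = build_dtm(docs)
print(vocabulary)
for row in dtm:
    print(row)
```

Each row of `dtm` is one document and each column one word from `vocabulary`, which is the shape that algorithms such as LDA or NMF expect as input.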

Applications of Topic Modelling

  • Content Recommendations: Recommending content on the same topics to users based on their history
  • Customer Insights: Identifying the topics in reviews shared by customers
  • Text Categorization: Categorizing text data based on the topics identified by topic modelling