# Lecture 8: Further on NLP

In this lecture, we will learn how to do basic sentiment analysis and topic modelling. From both we will further understand of the critical role of the pre-processing step and the word and document vectorization we learned i Lecture 8.

## Intro to Sentiment Analysis

### What is sentiment analysis and how to do it?

Sentiment analysis in essence is introduced to solve a text classification problem according to the text's potential emotional value. In short, sentiment analysis aims to classify a given text into an overall positive and negative emotional categories. There are in general two approaches to do it depending on wether we have suitable labeled data to apply sentiment analysis on a given text:
- Lexicon base sentiment analyser.
- Machine learning model of sentiment classifier.

We will discuss examples of each approach in details below.

### The Sentiment Lexicon approach

A lexicon is a collection of words compiled using expert knowledge for a specific purpose. Sentiment lexicons contain commonly used words and the sentiment associated with them, such as: ‘happy’ (with a sentiment score of 1)  or ‘frustrated’ (with a sentiment score of -1). The assigned negative and positive values indicates sentiment polarity; the assigned magnitude indicates the strength. <br>

There are several standard English sentiment lexicons with varying vocabulary size and representation that we can use, including:
- [AFINN Lexicon](https://www.geeksforgeeks.org/python-sentiment-analysis-using-affin/) (3300 words, each with a sentiment score range of -3 to +3).
- [SentiWordNet](https://www.nltk.org/howto/sentiwordnet.html)
- [Bing Liu’s lexicon](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) (6800 words in separate positive and negative lists;
- [VADER lexicon](https://vadersentiment.readthedocs.io/en/latest/)

Using any of these Sentiment Lexicons, a sentiment analyses score of a given text (document, sentences, phrases, or words) is computed based on the sentiment score of each word in the text which is found in the chosen lexicon.

#### VADER Examples of sentiment analysis with NLTK

Below, we present two examples to perform sentiment analysis with two of the Sentiment Lexicons available in NLTK module. For these examples, we will use the following text data and VADER lexicon from NLTK which may need to be downloaded first usign `nltk.download()` if you have never done it:
- twitter_samples: Sample of Twitter posts
- movie_reviews: Two thousand movie reviews categorized by Bo Pang and Lillian Lee
- vader_lexicon: A scored list of words and jargon created by C.J. Hutto and Eric Gilbert

The vader_lexicon is NLTK pre-trained sentiment analyser of VADER (Valence Aware Dictionary and sEntiment Reasoner). This lexicon is best suited for language used in social media (short sentences). It is considered to be less accurate for longer, structured sentences. <br>

To use NLTK's VADER Lexicon, we: 
- First, create an instance of `nltk.sentiment.SentimentIntensityAnalyzer`, then
- Use the .polarity_scores() method on the text (a string object) which we want to sentiment-analyse. 

In Python, these steps are performend as shown in the following example codes. The output of this short example is reproduced below.

`VADER sentiment analysis:`
` {'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}`

The interpretation NLTK SentimentIntensityAnalyzer.polarity_scores() can be interpreted as follows:
- There are 4 different scores reported by VADER:
- The three scores labeled with ‘neg’, ‘neu’, and ‘pos’ sum to 1. So, these may be interpreted as probabilities. For example, there is a 0.705 probability that the sentiment value of the text (i.e `sentence1`) is positive.
- The score labeled ‘compound’ is the aggregate sentiment score. This can be thought of as the overall normalised sum of (‘neg’, ‘neu’ and ‘pos’). This sum also ranges from -1 to 1.

In [1]:
# VADER example 1
from nltk.sentiment import SentimentIntensityAnalyzer
vader = SentimentIntensityAnalyzer()
sentence1 = "Wow, NLTK is really powerful!"
print("VADER sentiment analysis:\n", vader.polarity_scores(sentence1))

VADER sentiment analysis:
 {'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}


##### Setiment analysis of NLTK twitter posts

In the next example, we will use VADER lexicon to do sentiment analysis of a sample of Twitter posts in NLTK. After downloading the sample and import it into Python, we load the data onto a DataFrame object `twitter_df`.
```python
#download NLTK twitter samples
nltk.download(['twitter_samples'])

#importing twitter_samples
from nltk.corpus import twitter_samples

# load tweets into DataFrame
twitter_df = pd.DataFrame()
twitter_df['tweet'] = twitter_samples.strings()
```
Originally, NLTK's Twitter post sample contains 30,000 posts. For our example, we select a random sample of 1000 which we obain by issuing the following codes:

```python
# there are 30000 tweets in the sample; let's just 1000 random sample
twitter_df = twitter_df.sample(n=1000, random_state=5)
```

```{note}
Setting random_state to a specific value ensures for reproducibility of our analysis. If we do not set random_state, then the next time we run the Python codes we may always get a slightly different result due to the sample is randomised.
```

We will store the sentiment analysis score of each Twitter post in a new column of the DataFrame `twitter_df`. We will name this new column as `sentiment`.
```python
twitter_df['sentiment'] = twitter_df['tweet'].apply(
    lambda tweet: sentimentclass(tweet,threshold))
twitter_df.head()
```
In the above code, we apply a [lambda](https://www.geeksforgeeks.org/python-lambda-anonymous-functions-filter-map-reduce/) function which will call our own custom function (`sentimentclass()`) which will score the sentiment value of each row in `twitter_df['tweet']` column. <br>

Our custom function wwhich we call as `sentimentclass(sentence, threshold=0):` is where the actual VADER sentiment analysis happening. For the threshold value, we use 0.25 to reduce the number of ambiguous cases incorrectly classified. (In the next example, we will use labeled data with true human sentiment classification and then estimate a Machine Learning Classifier to do sentiment analysis. If we have such labelled data now, we can try to find the optimal threshold value which produces the highest predictive accuracy)

```python
threshold = 0.25

# we use lambda function to apply the sentimentclass function to each row
# within 'tweet' column in twitter_df and then saving the result to a new
# column 'sentiment'
twitter_df['sentiment'] = twitter_df['tweet'].apply(
    lambda tweet: sentimentclass(tweet,threshold))
twitter_df.head()
```

In [7]:
#%% Sentiment analysis of Twitter post
import nltk
import pandas as pd 

# a function to classify the tweets into positive, negative or neutral
# using VADER's sentiment analysis
def sentimentclass(sentence, threshold=0):
    """
        Using NLTK's VADER sentiment analysis to classifiy the input sentence 
        into positve, negative, or neutral. 
        
        input: 
            sentence a raw string containing the text to analyse the sentiment
            threshold: a value between [0 and 1] to determine threshold value
                for classifying the sentiment analysis score.
                'pos': if score>threshold
                'neg': if score<-threshold
                'neu': if -threshold<score<threshold
                
        return: 'pos', 'neu' or 'neg'
        
    """
    from nltk.sentiment import SentimentIntensityAnalyzer 
    sia = SentimentIntensityAnalyzer()
    vaderscore = sia.polarity_scores(sentence)['compound']
    if threshold<0 or threshold>1:
        threshold = 0 #default
    if vaderscore > threshold:
        return 'positive'
    elif vaderscore <-threshold:
        return 'negative'
    else:
        return 'neutral'

#download NLTK twitter samples
nltk.download(['twitter_samples'])

#importing twitter_samples
from nltk.corpus import twitter_samples

# load tweets into DataFrame
twitter_df = pd.DataFrame()
twitter_df['tweet'] = twitter_samples.strings()

# there are 30000 tweets in the sample; let's just 1000 random sample
twitter_df = twitter_df.sample(n=1000, random_state=5)
        
#create a new column in twitter_df containing the sentiment value
#we will use 0.25 as threshold to reduce ambiguous cases incorrectly 
#classified (Ideally, we want labeled data with true human sentiment
#classification and then estimate a classification model to optimize
#the threshold value for the highest predictive accuracy)

threshold = 0.25

# we use lambda function to apply the sentimentclass function to each row
# within 'tweet' column in twitter_df and then saving the result to a new
# column 'sentiment'
twitter_df['sentiment'] = twitter_df['tweet'].apply(
    lambda tweet: sentimentclass(tweet,threshold))
twitter_df.head()


[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\apalangkaraya\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


Unnamed: 0,tweet,sentiment
8033,@ffsjason I'm not. datz you. :-),neutral
29952,RT @cristinaprkr: The level of blatant misinfo...,negative
2736,Cant stand seeing my titos and titas cry :((((...,negative
29677,RT @blairmcdougall: Salmond on Sky encouraging...,positive
3285,@MsCarlyDowd we're not sure :( might have to w...,positive


##### Sentiment analysis of movie reviews

In this example, we will use NLTK’s movie_reviews database, which is a collection of movie reviews where the reviews are already classified by human. In other words, the movie review data are labelled data that can be used to develop and judge the accuracy of a machine learning classifier. In the data, the fileids' first three letters indicate human-label of the review. For example the `fileids` 'neg/cv000_29416.txt' means the movie review text has been labelled as a negative review. To quickly identify `fileids` which are associated with positive and negative reviews separately, we can use `movie_reviews.fileids(categories=)`.
```python
positive_review_ids = movie_reviews.fileids(categories=["pos"])
negative_review_ids = movie_reviews.fileids(categories=["neg"])
```

To score the sentiment value, we will use the VADER lexicon sentiment analyser. Recall that VADER is likely better for sentiment analysis of short sentences such as tweet. However, the movie review text is longer than Twitter post. This means, it may be better to split each review text into separate sentences. We can then rate the sentiment score of each sentence individually followed by taking an average of the sentences' sentiment scores. 

The custom function `meansentiment(review_id, threshold=0)` will use NLTK's `sentence_tokenize` function to split a given movie review text specified by `review_id` into its sentences and, for each of the sentences, compute the VADER `compound` sentiment scores. The function will then compute a simple average (using the `mean` function imported from the `statistics package`) of these scores and sotre it in a varaible `meanscore`. 

```python
    sia = SentimentIntensityAnalyzer()
    text = nltk.corpus.movie_reviews.raw(review_id)
    scores = [sia.polarity_scores(sentence)["compound"] 
              for sentence in nltk.sent_tokenize(text) ]
    meanscore = mean(scores)
```

Then, as in the previous example, the `meanscore` is classfied into 'positive', 'negative', or 'neutral' using the input `threshold value`. <br>

In this example, since we have labelled sentiment data, we will try a number of threshold values [0,0.025, 0.05, 0.1,0.125, 0.15, 0.2] to find the optimal threshold value which maximize the proportion of correctly VADER classified sentiment category when compared to human classfication of the movie review's sentiment.

```python
#let's compare VADER to human labels of the movie reviews
for threshold in [0,0.025, 0.05, 0.1,0.125, 0.15, 0.2]:
    correct = 0
    neutral = 0
    for review_id in all_review_ids:
        if meansentiment(review_id, threshold)=='positive':
            if review_id in positive_review_ids:
                correct += 1
        elif meansentiment(review_id, threshold)=='negative':
            if review_id in negative_review_ids:
                correct += 1
        else:
            #if neutral, then it is too ambiguous to classify by VADER
            #so we will drop the case from the evaluation
            neutral +=1
    print(F"At threshold = {threshold}; {correct / (len(all_review_ids)-neutral):.2%} correct")
    print(F"At threshold = {threshold}; {neutral} reviews were too ambiguous.")
```

There are 2000 moview reviews in the original NLTK's data. To speed up the run time of this example, we will use only 100 random sample of the review.
```python
# random sample of 100 ids
import random
random.seed(5)
sample_size = 100
sample_review_ids =  random.sample([x for x in all_review_ids], sample_size)
```

The results seem to suggest that if we keep increasing the threshold value, the correctly classified proportion increases but at an accelerating cost of having ambiguous sentiment values. For example, at threshold = 0.2, we have 86.67% correct classfication, but 85% of the reviews cannot be classified. If any guess is better than no guess, then we may want to lower the threshold value. In this case, we may want to set the threshold at 0.025 with accuracy of 63% with only 10% of the sample unclassfied.

_Question_: How else can we improve from the 63% accuracy rate? There are several things to see if we can improve this accuracy while avoiding having too many unscored text.
- Preprocess the review text before submitting to VADER (SentimentIntensityAnalyzer)
    - Drop non-English words
    - Identify entity names (e.g. actor’s names) and drop them
- Extract/generate new features based on the review text and use them to train a better classifier
- Use a better Sentiment Analyzer which is more appropriate for the type of the text.

In [10]:
#%% Sentiment analysis of NLTK's movie eviews
import nltk

#download NLTK twitter samples
nltk.download(['movie_reviews'])

#importing twitter_samples
from nltk.corpus import movie_reviews

#movie_reviews contains separate fileids for separate review
print(movie_reviews.fileids())

#look at an example review
#print(movie_reviews.raw('neg/cv000_29416.txt'))

#notice fileids' first three letters indicate human-label of the review
#we can use the categories of fileids to systematically identify positive
#and negative review
positive_review_ids = movie_reviews.fileids(categories=["pos"])
negative_review_ids = movie_reviews.fileids(categories=["neg"])
all_review_ids = positive_review_ids + negative_review_ids

# random sample of 100 ids
import random
random.seed(5)
sample_size = 100
sample_review_ids =  random.sample([x for x in all_review_ids], sample_size)

def meansentiment(review_id, threshold=0):
    """
        Return positive, negative or neural classification for the
        review of the provided review_id and given threshold level
    """
    from statistics import mean
    from nltk.sentiment import SentimentIntensityAnalyzer 
    sia = SentimentIntensityAnalyzer()
    text = nltk.corpus.movie_reviews.raw(review_id)
    scores = [sia.polarity_scores(sentence)["compound"] 
              for sentence in nltk.sent_tokenize(text) ]
    meanscore = mean(scores)
    if threshold<0 or threshold>1:
        threshold = 0 #default
    if meanscore > threshold:
        return 'positive'
    elif meanscore <-threshold:
        return 'negative'
    else:
        return 'neutral'
    
#let's compare VADER to human labels of the movie reviews
for threshold in [0,0.025, 0.05, 0.1,0.125, 0.15, 0.2]:
    correct = 0
    neutral = 0
    for review_id in sample_review_ids:
        if meansentiment(review_id, threshold)=='positive':
            if review_id in positive_review_ids:
                correct += 1
        elif meansentiment(review_id, threshold)=='negative':
            if review_id in negative_review_ids:
                correct += 1
        else:
            #if neutral, then it is too ambiguous to classify by VADER
            #so we will drop the case from the evaluation
            neutral +=1
    print(F"At threshold = {threshold}; {correct / (len(sample_review_ids)-neutral):.2%} correct")
    print(F"At threshold = {threshold}; {neutral} reviews were too ambiguous.")


[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\apalangkaraya\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt', 'neg/cv006_17022.txt', 'neg/cv007_4992.txt', 'neg/cv008_29326.txt', 'neg/cv009_29417.txt', 'neg/cv010_29063.txt', 'neg/cv011_13044.txt', 'neg/cv012_29411.txt', 'neg/cv013_10494.txt', 'neg/cv014_15600.txt', 'neg/cv015_29356.txt', 'neg/cv016_4348.txt', 'neg/cv017_23487.txt', 'neg/cv018_21672.txt', 'neg/cv019_16117.txt', 'neg/cv020_9234.txt', 'neg/cv021_17313.txt', 'neg/cv022_14227.txt', 'neg/cv023_13847.txt', 'neg/cv024_7033.txt', 'neg/cv025_29825.txt', 'neg/cv026_29229.txt', 'neg/cv027_26270.txt', 'neg/cv028_26964.txt', 'neg/cv029_19943.txt', 'neg/cv030_22893.txt', 'neg/cv031_19540.txt', 'neg/cv032_23718.txt', 'neg/cv033_25680.txt', 'neg/cv034_29446.txt', 'neg/cv035_3343.txt', 'neg/cv036_18385.txt', 'neg/cv037_19798.txt', 'neg/cv038_9781.txt', 'neg/cv039_5963.txt', 'neg/cv040_8829.txt', 'neg/cv041_22364.txt', 'neg/cv042_11927.txt', 'neg/cv043_16808.t

#### The Bing Liu Lexicon

Unlike the VADER lexicon which contains word samples with specific sentiment scores, the Bing-Liu lexicon consist of two separate word samples without any specific score. Instead, there is a a list of words which express positive opinion and a separate list of words which express negative opinion. Furthermore, The Bing Liu lexicon also contain misspelled words to make it more suitable for application on texts extracted from online discussion forums, social media, and other such sources including Amazon customer reviews data. <br>

This lexicon is also available from NLTK (called as “opinion_lexicon”):
```python
from nltk.corpus import opinion_lexicon
nltk.download('opinion_lexicon')
```

An approach to use the Bing Liu lexicon on a reviewText can be described as follows: 
- First, create a Bing Liu word dictionary with "word" as the key and a specified sentiment *score* as the value {"word":score} where 
    - "word" is each word in the Bing Liu Lexicon
    - *score* is +1 if the "word" is from the positive list and -1 if the "word" is from the negative list.
- Then, for each review entry, 
    -Word-tokenize the reviewText
    -For each word in reviewText, get its sentiment score from the Bing Liu word dictionary
    -Aggregate the total sentiment score for the reviewText and compute average score. 

##### Bing Liu application on Amazon product review

In this example, we will apply Bing Liu application on Amazon Product Review text. The source for the text data is https://nijianmo.github.io/amazon/index.html and we use the 2018 Amazon Review data. The specific Product Review text we will analysis is the product review for Magazine Subscriptions (Magazine_Subscriptions_5.json.gz)
[See also (the main source)](https://snap.stanford.edu/data/web-Amazon.html). <br>

The Amazon Product Review data for magazine subscription contains the following fields:
- *Overall*: This is the final rating provided by the reviewer. Ranges from 1 (lowest) to 5 (highest).
- *Verified*: This indicates whether the product purchase has been verified by Amazon.
- *ReviewerID*: A unique identifier allocated by Amazon to each reviewer.
- *asin*: A unique product code that Amazon uses to identify the product.
- *reviewText*: The actual text in the review provided by the user.
- *Summary*: This is the headline or summary of the review that the user provided

The Amazon Product Review data is in a [JSON](https://www.w3schools.com/js/js_json_intro.asp) data exchange format. On a glance, a JSON file looks like a Python dictionary. However, it must be noted that a JSON file is a normal string/text file. In contrast, a Python dictionary is a Python object residing in the memory. In other words, a JSON file is basically a normal text file in which its rows looks like a Python dictionary which may represent each row in a DataFrame.

![json file](json.png)

![pandas row](pandasrow.png)

We can load a JSON file onto a DataFrame using Pandas' read_json. Notice that in this example, our JSON file is compressed as [gzip file](https://en.wikipedia.org/wiki/Gzip). Pandas can handle this type of compressed file directly.

```python
amazon_df = pd.read_json('Magazine_Subscriptions_5.json.gz', lines=True)
amazon_df.sample(5)
```

As described earlier, we first import the Bing Liu lexicon and create a dictionary object `bingliuworddict` and populate it with the words from Bing Liu Lexicon's positive and negative list with +1 and -1 as the value.

```python
# Create a dictionary which we can use for scoring our review text
pos_score = 1
neg_score = -1
bingliuworddict = {}
# Adding the positive words to the dictionary
for word in opinion_lexicon.positive():
    bingliuworddict[word] = pos_score
# Adding the negative words to the dictionary
for word in opinion_lexicon.negative():
    bingliuworddict[word] = neg_score
```

Then, the sentiment analysis scoring is done in a custom function `bing_liu_score(text)` which takes an input `text`, word tokenises it, and then compute the average sentiment score.

```python
def bing_liu_score(text):
    """
        A function to compute the sentiment score of the input "text" based 
        on Bing Liu Lexicons
    """
    #word as the token
    from nltk.tokenize import word_tokenize
    sentiment_score = 0
    try:
        bag_of_words = word_tokenize(text.lower())
    except:
        print("skipping review; cant create bag of words")
    else:
        for word in bag_of_words:
            if word in bingliuworddict:
                sentiment_score += bingliuworddict[word]
        return sentiment_score / len(bag_of_words)
```

To compute the Bing Liu sentiment score for all of the review text in the Amazon data, we use Pandas' `.apply()` method to apply bing_liu_score function on each row on the column `reviewText`:

```python
# Apply the bing_liu_score function on each row of the review text column ('text')
amazon_df['Bing_Liu_Score'] = amazon_df['reviewText'].apply(bing_liu_score)

#see a random sample of 2 reviewText and their sentiment scores
sample10_df = amazon_df[['asin','reviewText','Bing_Liu_Score']].sample(10)

#comparing Bing_Liu_Score to the Overall rating (1-5 stars)
amazon_df.groupby('overall').agg({'Bing_Liu_Score':'mean'})
```

On the last line above, we compute the average Bing Liu Score of each classification under the `overall` column. The classifications in the `overall` column are associated with Amazon's `*`, `**`, `***`, `****`, and `*****` ratings assigned by each _human_ reviewer. From the results, we can infer, for example the following classfication:
- \* (one star): $\textrm{Bing Liu score} < 0.011$
- ** (two stars): $0.011 <= \textrm{Bing Liu score} < 0.018$
- *** (three stars): $0.018 <= \textrm{Bing Liu score} < 0.037$
- **** (four stars): $0.037 <= \textrm{Bing Liu score} < 0.109$
- ***** (five stars): $\textrm{Bing Liu score} >= 0.109$ 


In [12]:
#%% Using Bing Liu Lexicon on Amazon Magazine Review data
import nltk
import pandas as pd

amazon_df = pd.read_json('Magazine_Subscriptions_5.json.gz', lines=True)
amazon_df.sample(5)

#importing the Bing Liu lexicon 
from nltk.corpus import opinion_lexicon

#if necessary, download first
nltk.download('opinion_lexicon')

#summary measures of the opinion_lexicon
print('Total number of words in opinion lexicon', len(opinion_lexicon.words()))
print('Examples of positive words in opinion lexicon',
opinion_lexicon.positive()[:5])
print('Examples of negative words in opinion lexicon',
opinion_lexicon.negative()[:5])

# Create a dictionary which we can use for scoring our review text
pos_score = 1
neg_score = -1
bingliuworddict = {}
# Adding the positive words to the dictionary
for word in opinion_lexicon.positive():
    bingliuworddict[word] = pos_score
# Adding the negative words to the dictionary
for word in opinion_lexicon.negative():
    bingliuworddict[word] = neg_score

# A function to compute the sentiment score based on Bing Liu Lexicons
def bing_liu_score(text):
    """
        A function to compute the sentiment score of the input "text" based 
        on Bing Liu Lexicons
    """
    #word as the token
    from nltk.tokenize import word_tokenize

    sentiment_score = 0
    try:
        bag_of_words = word_tokenize(text.lower())
    except:
        print("skipping review; cant create bag of words")
    else:
        for word in bag_of_words:
            if word in bingliuworddict:
                sentiment_score += bingliuworddict[word]
        return sentiment_score / len(bag_of_words)

# Apply the bing_liu_score function on each row of the review text column ('text')
amazon_df['Bing_Liu_Score'] = amazon_df['reviewText'].apply(bing_liu_score)

#see a random sample of 2 reviewText and their sentiment scores
sample10_df = amazon_df[['asin','reviewText','Bing_Liu_Score']].sample(10)

#comparing Bing_Liu_Score to the Overall rating (1-5 stars)
amazon_df.groupby('overall').agg({'Bing_Liu_Score':'mean'})


[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     C:\Users\apalangkaraya\AppData\Roaming\nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!


Total number of words in opinion lexicon 6789
Examples of positive words in opinion lexicon ['a+', 'abound', 'abounds', 'abundance', 'abundant']
Examples of negative words in opinion lexicon ['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable']
skipping review; cant create bag of words


Unnamed: 0_level_0,Bing_Liu_Score
overall,Unnamed: 1_level_1
1,0.011047
2,0.018279
3,0.036957
4,0.108515
5,0.17508


### Doing Sentiment Analysis with Machine Learning

#### Limitations of the Lexicon approach

From our previous discussions and examples, we can summarise a few limitations of using the Lexicon approach to do sentiment analysis.
- THe size of the lexicon. If a word does not exist in the chosen lexicon, then the sentiment score does nont capture all the words in the text. 
- The chosen lexicon may not be the gold standard, nor the sentiment score/polarity provided by the lexicon’s author(s). A particular lexicon may not be suitable for the intended purpose. 
    - Bing Liu lexicon is more suitable for online usage of language
    - VADER lexicon would be better suited for Twitter’s tweets since it includes support for popular acronyms (e.g., LOL) and emojis.
- Lexicons overlook negation because they only match words and not phrases. Sentence containing the phrase “not bad” would be rated negative instead of neutral. 
- Alternative approaches? Supervised machine learning (but, we need labelled sentiment data).

#### Applying Support Vector Machine (SVM) for sentiment analysis

The Support Vector Machine (SVM) is a Machine Learning algorithm with a basic idea of, given the observed data represented by the "dots" and "crosses" as illustrated in the following diagram (see Mark E. Fenner's *Machine Learning with Python for Everyone*), how to find the best support vectors to separate the two classes implied by the observed data.

![Support Vector Machine](svm.png)

In the sentiment analysis, we can imagine that the "dots" represent the negative valued text and the "crosses" represent the positive valued text such that the sentiment analysis problem is translated into an SVM estimation problem of finding the the two support vectors represented by the dashed lines. <br>

While a full discussion of the algorithm of the SVM model and its estimation are beyond this course, we can still discuss how we can use the [sklearn](https://scikit-learn.org/stable/modules/svm.html) module to implement the model. In essence, SVM algorithm is preferred (relative to a linear regression or non-linear logistic or other basic ML models) when working with text data because it is more suited to work with sparse data and when the input features are purely numeric (as in the case of TFIDF word vectors) instead of categorical. There are available SVM estimation functions in `sklear` depending on the actual implementation:
- `sklearn.svm.SVC` (can be specified to produce LinearSVC model, but slower)
- `sklearn.svm.LinearSVC` (This is more specific than the other two, but much faster)
- `sklearn.linear_model.SGDClassifier` (can be specified to produce LinearSVC model, but slower)

In the following example, which consists of six steps, we will implement the SVM model to redo the Amazon magazine subscription review sentiment analysis:
- Step 0: Load the Amazaon Magazine Subscription Review data to DataFrame and create target variable
- Step 1: Pre-process the review text
- Step 2: Split the data into train and test sample
- Step 3: Vectorizing the reviewText using TF-IDF vector representations
- Step 4: Train the SVM Classifier
- Step 5: Evaluating the predictive performance of the SVM Classifier

In Step 0, we load the Magazine Subscription product review data into DataFrame in a similar way shown in the previous example by using Pandas' `.read_json()` method. The most important part in this first step is the creation of the _target variable_ which we store in `sentiment` column in the DataFrame (`amazon_df`). To simplify the classification problem, in this example we only consider a binary classification for the `sentiment` column as the target variable:
- $'sentiment' = 1 \textrm{ if 'overall'} > 3$
- $'sentiment' = 0 \textrm{ if 'overall'} <=3$ 
Thus, we defined a 'positive' review as review with 4 and 5 stars overall rating. 

```python
# Assigning a new [1,0] target class label based on the product rating
# sentiment = 0 (negative); sentiment = 1 (positive)
amazon_df['sentiment'] = 0
amazon_df.loc[amazon_df['overall'] > 3, 'sentiment'] = 1
amazon_df.loc[amazon_df['overall'] < 3, 'sentiment'] = 0
```

To make the classification problem even simpler, we also drop any case in which `overall`=3 since it is probably a neutral review. Furthermore, we also drop non-verified review since we want to be certain that our target variable to have as little noise as possible which may come from inconsistent 'overall' rating and the actual text of the review.

```python
# Drop rows if overall rating is 3
amazon_df = amazon_df[amazon_df.overall!=3]

# Use only verified reviews
amazon_df = amazon_df[amazon_df.verified==True]
```

This step leaves up with a sample size of $n = 1,571$ review text with a `sentiment` classification of positive reviews (1,454) and negative reviews (117). Obviously, this sample distribution is skewed toward positive reviews. This might have some undesirable implications on our SVM model, so we will need to do some stratified random sampling when setting up the training and test sample as discussed later.

In [13]:
#%% Amazon Review: supervised learning approach for sentiment analysis
import pandas as pd

# STEP0: Load the data to DataFrame
amazon_df = pd.read_json('Magazine_Subscriptions_5.json.gz', lines=True)

# Assigning a new [1,0] target class label based on the product rating
# sentiment = 0 (negative); sentiment = 1 (positive)
amazon_df['sentiment'] = 0
amazon_df.loc[amazon_df['overall'] > 3, 'sentiment'] = 1
amazon_df.loc[amazon_df['overall'] < 3, 'sentiment'] = 0

# Drop rows if overall rating is 3
amazon_df = amazon_df[amazon_df.overall!=3]

# Use only verified reviews
amazon_df = amazon_df[amazon_df.verified==True]

# Removing unnecessary columns to keep a simple DataFrame
amazon_df.drop(columns=['reviewTime', 'unixReviewTime', 'overall', 
                        'reviewerID', 'summary', 'vote', 'image',
                        'style', 'reviewerName' ], inplace=True)
amazon_df.sample(3)

#tabulate the sentiment value (data is skewed, most review is positive)
amazon_df['sentiment'].value_counts()



sentiment
1    1454
0     117
Name: count, dtype: int64

In Step 1, we proceed to pre-process the review text in the usual ways:
- Clean the text from non-alphanumeric characters

```python
# First clean the text from an any special characters, HTML tags, and URLs:
amazon_df['text_orig'] = amazon_df['reviewText'].copy()
amazon_df['reviewText'] = amazon_df['reviewText'].apply(clean)
```

- Normalise to lower case and Lemmatise with parts-of-speech tagging

```python
# Preprocessed
amazon_df['reviewText'] = amazon_df['reviewText'].apply(prep)
```

Each of the above pre-processing steps is done by a custom function namely: `clean()` and `prep` (which calls a custom function `get_wordnet_pos` and a `NLTK`'s `WordNetLemmatizer`).

Lastly, we drop if drop the cleaned reviewText is empty.
```python
amazon_df = amazon_df[amazon_df['reviewText'].str.len()!=0]
```

In [17]:
# STEP 1: Now prepare the review text data

def clean(text):
    """
        Text cleaning function taken from Blueprints for Text Analytics
        The function uses Regular Expression fro the clearning
    """
    import html
    import re
    # convert html escapes like &amp; to characters.
    try:
        text = html.unescape(text)
    except:
        print('error in handling html escape')
    else:
        try:
            # tags like <tab>
            text = re.sub(r'<[^<>]*>', ' ', text)
        except:
            print('error in regular expression')
            return text.strip()
        else:        
            # markdown URLs like [Some text](https://....)
            text = re.sub(r'\[([^\[\]]*)\]\([^\(\)]*\)', r'\1', text)
            # text or code in brackets like [0]
            text = re.sub(r'\[[^\[\]]*\]', ' ', text)
            # standalone sequences of specials, matches &# but not #cool
            text = re.sub(r'(?:^|\s)[&#<>{}\[\]+|\\:-]{1,}(?:\s|$)', ' ', text)
            # standalone sequences of hyphens like --- or ==
            text = re.sub(r'(?:^|\s)[\-=\+]{2,}(?:\s|$)', ' ', text)
            # sequences of white spaces
            text = re.sub(r'\s+', ' ', text)
            return text.strip()

# First clean the text from an any special characters, HTML tags, and URLs:
amazon_df['text_orig'] = amazon_df['reviewText'].copy()
amazon_df['reviewText'] = amazon_df['reviewText'].apply(clean)

# Second preprocess the text (lower case, no punctuations, etc)
def get_wordnet_pos(word):
    """ Get Part-of-speech (POS) tag of input word, and return the first POS 
    tag character (which is the character that lemmatize() accepts as input)
    """
    
    from nltk import pos_tag
    from nltk.corpus import wordnet
    
    tag_firstchar = pos_tag([word])[0][1][0].upper()
    tag_dict = {'J': wordnet.ADJ,
                'N': wordnet.NOUN,
                'V': wordnet.VERB,
                'R': wordnet.ADV}

    return tag_dict.get(tag_firstchar, wordnet.NOUN)  # Note that the default value to return is "N" (NOUN)

#preprocess function
def prep(docs, filtpunc=True):
    """
        Input: English sentences
        Output: preprocessed list of sentences
        Preprocessing: 
            1. filtered punctuations (if punct==True)
            2. lemmatized (with POS tag) and converted to lowercase.
    """

    from nltk import word_tokenize, sent_tokenize
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()

    docs_p = []
    
    if docs == None:
        return docs_p
    else:
        try:
            docs = sent_tokenize(docs, language='english')
        except:
            print('No need sentence tokenizing')
        for doc in docs:

            #word tokenize
            doc = word_tokenize(doc, language="english")

            #convert lowercase then remove punctuations
            if filtpunc:
                doc =[word.lower() for word in doc  if word.isalpha()]
            else:
                doc =[word.lower() for word in doc]
                    
            #lemmatize
            doc = [lemmatizer.lemmatize(
                word, pos=get_wordnet_pos(word)) for word in doc]
            
            #join the words into the original doc format
            docs_p.append(' '.join(doc))
    
    return ''.join(str(x) for x in docs_p)

# Preprocessed
amazon_df['reviewText'] = amazon_df['reviewText'].apply(prep)

#doc = 	"I'm old, and so is my computer.  Any advice that can help me maximize my computer perfomance is very welcome.  MaximumPC has some good tips on computer parts, vendors, and usefull tests"
#print(prep(doc))

#drop if cleaned reviewText is empty
amazon_df = amazon_df[amazon_df['reviewText'].str.len()!=0]



error in handling html escape


In Step 2, we set up the training and test sample. Since the sample is skewed (most of sentiment is positive), we use sentiment classfication to stratify the random sampling to ensure that the resulting sample split has similar proportion to before the split. 

```{note}
In more advanced ML techniques, we may want to do random re-sampling to create a more balanced sample to improve the predictive performance of the SVM model.
```

To split the sample, we call `sklearn.model_selection`'s train_test_split:

```python
x_train, x_test, y_train, y_test = train_test_split(amazon_df['reviewText'],
                                                    amazon_df['sentiment'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=amazon_df['sentiment'])
```
By specifying `stratify=amazon_df['sentiment']`, we ensure that both training and testing data containe around 93% positive sentiment.

In [18]:
#%% STEP2: Split the data into train and test sample
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(amazon_df['reviewText'],
                                                    amazon_df['sentiment'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=amazon_df['sentiment'])
print (f'Size of Training Data: {x_train.shape[0]} (reviews)')
print (f'Size of Test Data: {x_test.shape[0]} (reviews)')
print ('Distribution of classes in Training Data :')
print (f'Positive Sentiment {(sum(y_train == 1)/ len(y_train) * 100.0):.2f}%')
print (f'Negative Sentiment {(sum(y_train == 0)/ len(y_train) * 100.0):.2f}%')
print ('Distribution of classes in Testing Data :')
print (f'Positive Sentiment {(sum(y_test == 1)/ len(y_test) * 100.0):.2f}%')
print (f'Negative Sentiment {(sum(y_test == 0)/ len(y_test) * 100.0):.2f}%')



Size of Training Data: 1249 (reviews)
Size of Test Data: 313 (reviews)
Distribution of classes in Training Data :
Positive Sentiment 92.47%
Negative Sentiment 7.53%
Distribution of classes in Testing Data :
Positive Sentiment 92.65%
Negative Sentiment 7.35%


In Step 3, we construct the feature variables. More specifically, our feature variables consist of the words in the Document-Term Matrix produced by the `sklearn.feature_extraction.text`'s `TfidfVectorizer`. The basic idea is that some specific words may be more strongly associated with positive review rating (`sentiment = 1` in the target variable). If that is the case, then the SVM algorithim will be able to pick up the pattern from the training data such that when we provide the set of TFIDF vectors in the test data we can get a reasonably good predictive performance.

The construction  of the TfidfVectorizer is done by the following statement:
```python
tfidf = TfidfVectorizer(min_df = 10, ngram_range=(1,1))
```

In the above statement, we set `ngram_range=(1,1)` because, for this example, we only want the "term" in the TFIDF vectors to consist only of unigram (i.e. single word). If we want to also consider bigram (i.e. two word phrases to capture compound terms), then we can set `ngram_range=(1,2)`. In addition, we also set the parameter `min_df=10` to drop any terms that appear too infrequently (`min_df=10` means ignore terms that appear in less than 10 documents). The default value for `min_df` is 1, which means "ignore terms that appear in less than 1 document".

In [23]:
#%% STEP3: Vectorizing the reviewText using TF-IDF vector representations
#(This step is needed because Machine Learning does not understand text)

#import the TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

#identify the vocabulary
#Note1: here setting ngram_range=(1,1) means we only consider unigram (i.e. single word)
#if we want to consider unigram and bigram (two word phrases) then ngram_range=(1,2)
#Note2: the parameter min_df is to remove terms that appear too infrequently. 
#min_df=10 means ignore terms that appear in less than 10 documents.
#The default min_df is 1, which means "ignore terms that appear in less than 1 document".

tfidf = TfidfVectorizer(min_df = 10, ngram_range=(1,1))

#create the document term matrix for the train and test data
x_train_tf = tfidf.fit_transform(x_train)
x_test_tf = tfidf.transform(x_test)

#if you are curious to browse the document term matrix
x_train_tf_df = pd.DataFrame(x_train_tf.toarray(), 
                            columns=tfidf.get_feature_names_out())

x_train_tf_df.tail()


Unnamed: 0,about,actually,ad,add,advice,after,again,all,already,also,...,wonderful,work,world,worth,would,write,year,you,young,your
1244,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1245,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1246,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.237294,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.109506,0.0,0.0
1247,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058187,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.167409,0.0,0.0,0.0,0.0


In Step 4, the actual training of the SVM Classfier is performed. In this example, we will use the `LinearSVC` (an SVM algorithm). You may try other algorithm 
such as randomforest to see if there is improvement in predictive performance. Once the model is trained (by calling the `.fit() method)`, we make prediction by calling the `.predict()` method, supplying the test data of the features: `x_test_tf`. The result of `.predict()` is a Numpy array, which we convert to DataFrame to quickly peek its first five values.

In [26]:
#%% STEP4: Train the Machine Learning model and produce prediction

from sklearn.svm import LinearSVC
model1 = LinearSVC(random_state=42, tol=1e-5)
model1.fit(x_train_tf, y_train)

y_pred = model1.predict(x_test_tf)

# peek at y_pred values
pd.DataFrame({'y_pred':y_pred}).head()




Unnamed: 0,y_pred
0,1
1,1
2,1
3,1
4,1


In Step 5, which is the last step, we [evaluate the predictive performance](https://scikit-learn.org/stable/modules/model_evaluation.html) of our SVM Classifier by comparing the value of `y_pred` to `y_test` (the true observed value of the `sentiment` target variable as we defined in Step 0). We consider two predictive performance metrics: [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) and [ROC-AUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html). <br>

The results seem to suggest a reasonably good predictive performance with Accuracy Score = 0.96 and ROC-AUC Score = 0.76. Furthremore, when we compared the predictive performance of our SVM Classifier and the Bing Liu Lexicon, we find that the Bing Liu Lexicon Accuracy of 0.79 seems to be significantly lower than the 0.96 Accuracy Score of the SVM Classifier.

In [27]:
#%% STEP5: Evaluating the predictive performance
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

print (f'Accuracy Score: {accuracy_score(y_test, y_pred):.2f}')
print (f'ROC-AUC Score: {roc_auc_score(y_test, y_pred):.2f}')

#look at some sample
sample_reviews = amazon_df.sample(5)
sample_reviews_tf = tfidf.transform(sample_reviews['reviewText'])
sentiment_predictions = model1.predict(sample_reviews_tf)
sentiment_predictions = pd.DataFrame(data = sentiment_predictions,
index=sample_reviews.index,
columns=['sentiment_prediction'])
sample_reviews = pd.concat([sample_reviews, sentiment_predictions], axis=1)
print ('Some sample reviews with their sentiment - ')
print(sample_reviews[['text_orig','sentiment_prediction']]) 

#%% Comparing with the Bing Liu Lexicon sentiment classification
#Next, compare the accuracy of BingLiu Lexicon
def baseline_scorer(text):
    score = bing_liu_score(text)
    if score > 0:
        return 1
    else:
        return 0
y_pred_baseline = x_test.apply(baseline_scorer)
acc_score = accuracy_score(y_pred_baseline, y_test)

print()
print('Predictive Performance Comparison\n')
print (f'Bing Liu Lexicon Accuracy: {acc_score:.2f}')
print (f'SVM Machine Learning Accuracy: {accuracy_score(y_test, y_pred):.2f}')

Accuracy Score: 0.96
ROC-AUC Score: 0.76
Some sample reviews with their sentiment - 
                                              text_orig  sentiment_prediction
2204  good realistic receipes. Good variety too! Tri...                     1
2154                                 Can't wait to read                     1
1447                   Quickly shipped. Loved this item                     1
554   Consumer Reports has been around for years and...                     1
937   High quality magazine.  Lots of reviews about ...                     1

Predictive Performance Comparison

Bing Liu Lexicon Accuracy: 0.79
SVM Machine Learning Accuracy: 0.96


## Introduction to Topic Modelling

A topic can be thought of as set of words that “go together”. For examples, when we think of "sports" as a topic, then the set of words that may come to mind consist of "athlete", "stadium", "game", "soccer", "Olympics", etc. This is because these words usually go together as the words associated with "sports" as a topic. To consider another example, the words such as "Chanelle", "boutique", "dress", "New York", and "Chadstone" may go together under the topic of "fashion".

### What is topic modelling and how to do it?

In essence, topic modelling is a statistical learning (i.e. machine learning) modelling to automatically discover the set of possible topics associated with a given corpus (i.e. a collection of documents). A corpus such as the whole [Wikipedia](https://www.wikipedia.org/) are likely to contain many different topics which are mostly `latent` from the point of view the reader of the corpus. In topic modelling, a topic is defined as the probability distribution over a fixed set of words contained in the corpus. To take our topic definition earlier, we can think of a topic as containing “the set of words that come to mind when referring to this topic” and the probability mass associated to each of these words. For example, if we combined the previous two groups of words which go together under the sports and fashion topics, then we can expect that for the sports topic, the words such as "athlete" and "Olympics" would have higher probability mass than the words such as "dress" and "Chanelle". Similarly, the topic of a document as a whole can be thought of as the probability distribution over a fixed set of latent topics associated to the words set within the document. This probability distritbution over latent topics reveals the topic of the document. <br>

Doing topic modelling is basicall the same as estimating a machine learning model which represent each latent topic's probability distribution over words in the document and the the probability over the latent topics's in the document as a whole. One of the most often used topic models is known as the _Latent Dirichlet Allocation_ (a.k.a the LDA) model. `LDA` is a probabilistic generative model and in Python there are several modules which offer LDA trainers including `Gensim`, `scikit-learn`, and many others. Other topic models beside LDA include, for examples, Latent Semantic Analysis (`LSA`), Probabilistic Latent Semantic Analysis (`PLSA`), and Correlated Topic Models (`CTM`). <br>

A full discussion of the mathematics and statistics behind the algorithm for the LDA model is beyond our course, however in essence in can be summarised into the following steps:
- Start with a fixed number of latent topics
- For each topic k, sample a topic-word distribution ϕk ~ Dir(β)
- For each document “d”, sample a document-topic distribution θd ~ Dir(α)
- For each word w in document “d”, sample a topic zdw ~ Multinomial(θd) and sample a word wdw ~ Multinomial(ϕzdw)

The output of these steps would be: 
- A list of topics and for each topic, a probability distribution over words.
- For each document, a probability distribution over the topics.

### Doing LDA topic modelling on Australian Research Council project summary

Every year, the Australian Research Council (ARC) awards research grants to successful grant applications from academic researchers based in Australian universities (who may have overseas collaboration partners). In their project proposal, these researchers provide a _project summary_ description which explains in one or two paragraphs what their proposed projects are all about. Below are actual text of two such [project summaries](https://dataportal.arc.gov.au/NCGP/Web/Grant/Grants):
- _Grant Application 1_: Optimum control of the in-use performance of talc-based compositions. It is important to improve the quality of their Talcom body powder, baby powder and other cosmetic products involving talc. The areas that can and need to be improved are shining characteristics, assessing the slip properties as well as developing the cosmetic chemistry of talc and other additives.  The proposed project will generate: a) simple but reliable test methods for measuring slip and shine, b) methods for control of the physical and chemical characteristics of talc blends, c) mathematical model(s) for property and process control, which is useful to improvement of the final talc properties and in-use service.
- _Grant Application 2_: Application of Silver Coatings to medical Devices for Antimicrobial Properties using Electroless Deposition. Silver compounds, eg. in topical creams, can be used to treat chronic infections.  The results are mediocre, and there may be significant side effects. Metallic silver when coated on bandages or medical devices is gaining wider acceptance, but the dissolution rate must be improved to minimise infection.  In this project an electroless silver coating process will be developed, with bath chemistry and coating conditions optimised for an ideal dissolution rate.  This project will lead to the development of improved medical devices that will have significant social and economic benefits for Australia.

In the example below, we will develop a basic LDA model to extract 50 possible latent topics from a sample of 1,000 granted ARC project proposal. The text data are contained in a CSV file where each row represents a single project proposal (which we will call as a single document):

```python
# ARC Project Grant Summary Data
arcdesc = pd.read_csv('ARCLP1.csv')
arcdesc.info()

# Pre-processed the grant summary text
arcdesc['grantsummary'] = arcdesc['grant-summary'].apply(preptext, args=(False,True))
```

The first step we do after loading the ARC project proposal text data is to pre-process the project summary text that we want to model. This text is under the column `grant-summary`. To do the text pre-processing, we `.apply()` a custom pre-processing function `preptext()` on each row (i.e., each document). The resulting pre-processed text is saved as a separate column `grantsummary`.

![grantsummary](grantsummary.png)

In [35]:
import pandas as pd
import re
 
#%% Pre-procesing function for Topic modelling ARC Grant Project Summary
commonwords = ['national', 'include', 'expect', 'understand', 'benefit', 'study',
             'novel', 'approach', 'result', 'test', 'an', 'aim', 'by', 'this',
             'with', 'australian', 'australia', 'that', 'have', 'their', 'on',
             'such', 'can', 'these', 'how', 'from', 'use', 'also', 'well', 
             'project', 'may', 'whether', 'year', 'per', 'cent', 'proposal',
             'u', 'provide', 'would', ]

my_stopwords = nltk.corpus.stopwords.words('english') + commonwords
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem
my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'

# preparing the text for topic modelling
def preptext(texttoclean, bigrams=False, lemmatize=False):

    from nltk import word_tokenize, sent_tokenize
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()

    def get_wordnet_pos(word):
        """ Get Part-of-speech (POS) tag of input word, and return the first POS 
        tag character (which is the character that lemmatize() accepts as input)
        """
    
        from nltk import pos_tag
        from nltk.corpus import wordnet
    
        tag_firstchar = pos_tag([word])[0][1][0].upper()
        tag_dict = {'J': wordnet.ADJ,
                    'N': wordnet.NOUN,
                    'V': wordnet.VERB,
                    'R': wordnet.ADV}

        return tag_dict.get(tag_firstchar, wordnet.NOUN)  # Note that the default value to return is "N" (NOUN)

    texttoclean = texttoclean.lower() # lower case
    texttoclean = re.sub('['+my_punctuation + ']+', ' ', texttoclean) # strip punctuation
    texttoclean = re.sub('\s+', ' ', texttoclean) #remove double spacing
    texttoclean = re.sub('([0-9]+)', '', texttoclean) # remove numbers
    texttoclean_token_list = [word for word in texttoclean.split(' ')
                            if word not in my_stopwords] # remove stopwords
    # texttoclean_token_list = [word_rooter(word) if '#' not in word else word
    #                     for word in texttoclean] # apply word rooter
    # print(texttoclean_token_list)

    if lemmatize:
        # texttoclean = [lemmatizer.lemmatize(word, pos=get_wordnet_pos(word)) for word in texttoclean]
        texttoclean = [lemmatizer.lemmatize(word) for word in texttoclean]
                 
    if bigrams:
        texttoclean_token_list = texttoclean_token_list+[texttoclean_token_list[i]+'_'+
                                                         texttoclean_token_list[i+1]
                                            for i in range(len(texttoclean_token_list)-1)]

    texttoclean = ' '.join(texttoclean_token_list)
    return texttoclean


#%% Topic Modelling of ARC Grant Project Summary
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# ARC Project Grant Summary Data
arcdesc = pd.read_csv('ARCLP1.csv')
arcdesc.info()

# Pre-processed the grant summary text
arcdesc['grantsummary'] = arcdesc['grant-summary'].apply(preptext, args=(False,True))
arcdesc[['grantsummary', 'grant-summary']].head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   code                              1000 non-null   object 
 1   scheme-name                       1000 non-null   object 
 2   funding-commencement-year         1000 non-null   int64  
 3   scheme-information                1000 non-null   object 
 4   current-admin-organisation        1000 non-null   object 
 5   announcement-admin-organisation   1000 non-null   object 
 6   grant-summary                     1000 non-null   object 
 7   lead-investigator                 1000 non-null   object 
 8   current-funding-amount            1000 non-null   int64  
 9   announced-funding-amount          1000 non-null   int64  
 10  grant-status                      1000 non-null   object 
 11  primary-field-of-research         1000 non-null   object 
 12  anticip

Unnamed: 0,grantsummary,grant-summary
0,optimum control performance talc based composi...,Optimum control of the in-use performance of t...
1,application silver coatings medical devices an...,Application of Silver Coatings to medical Devi...
2,qua queensland digital ultra atlas aims design...,QUA:Queensland digital Ultra-Atlas. This proje...
3,electronic properties diamondlike carbon appli...,Electronic properties of diamondlike carbon fo...
4,intermittent reinforcement scheduling improvin...,Intermittent reinforcement scheduling: Improvi...


##### Getting the topics and the probability distribution over the terms

Once we have a pre-processed text, we invoke `sklearn`'s `CountVectorizer` where we specify to ignore terms which appear in more than 50% of the documents (max_df = 0.50), term which appears too infrequently in less than 1% of the documents, and to consider only single words and no digits (`'\w+|\$[\d\.]+|\S+'`). As an example, these parameters are set for illustration purpose only. In practice, you would need to try to set the parameters by analysing their effects on the quality of the topics produced. 

```python
# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=0.50, min_df=0.01, token_pattern='\w+|\$[\d\.]+|\S+')

# apply transformation to get document-term-matrix based on Count Vector
tf = vectorizer.fit_transform(arcdesc['grantsummary']).toarray()
```
Later on, for topic labelling purpose, we will need the actual terms produced by `CountVectorizer` which could be associated with each topic. These terms can be obtained by the following statement:

```python
# tf_feature_names tells us what word each column in the matric represents
# tf_feature_names = vectorizer.get_feature_names()
tf_feature_names = vectorizer.get_feature_names_out()
```

Then, to estimate the LDA model,  need to specify the number of latent topics to identify. For replication, we also set the random generator seed.
```python
# LDA model
number_of_topics = 50
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
model.fit(tf)

# a simple function to display the topics
def display_topics(model, feature_names, no_top_words):
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict["Topic %d words" % (topic_idx)]= ['{}'.format(feature_names[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx)]= ['{:.1f}'.format(topic[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)

# Number of top words to display
no_top_words = 10

# Display the top words for each topic
topics50df = display_topics(model, tf_feature_names, no_top_words)
```

Lastly, the results of LatentDirichletAllocation() is a Sparse Matrix object. To facilitate for viewing the list of topics produced, we write a custom function (`display_topics()`) which require three inputs:
- model: the estimated/fitted LDA model
- feature_names: the names of the terms
- no_top_words: the number of the top terms we want to consider to represent the topics

In this example, we will consider the top 10 terms (i.e. ten terms with the highest probability mass for each topics). The image below shows a truncated snapshot of the DataFrame `topics50df`. So, for example, the first topic is associated the terms ['surface', 'resource', 'south', 'steel', 'remote' ....] as the first five of the top 10 terms. We can think of these top 10 firms are the terms which define Topic 0. As we specified, the DataFrame `topics50df` contain the top 10 terms (and their respective probability masses) for each of the 50 topics.

![topics](topics.png)

In [36]:
# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=0.50, min_df=0.01, token_pattern='\w+|\$[\d\.]+|\S+')

# apply transformation to get document-term-matrix based on Count Vector
tf = vectorizer.fit_transform(arcdesc['grantsummary']).toarray()

# tf_feature_names tells us what word each column in the matric represents
# tf_feature_names = vectorizer.get_feature_names()
tf_feature_names = vectorizer.get_feature_names_out()

# LDA model
number_of_topics = 50
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
model.fit(tf)

# a simple function to display the topics
def display_topics(model, feature_names, no_top_words):
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict["Topic %d words" % (topic_idx)]= ['{}'.format(feature_names[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx)]= ['{:.1f}'.format(topic[i])
                        for i in topic.argsort()[:-no_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)

# Number of top words to display
no_top_words = 10

# Display the top words for each topic
topics50df = display_topics(model, tf_feature_names, no_top_words)
topics50df.head()


Unnamed: 0,Topic 0 words,Topic 0 weights,Topic 1 words,Topic 1 weights,Topic 2 words,Topic 2 weights,Topic 3 words,Topic 3 weights,Topic 4 words,Topic 4 weights,...,Topic 45 words,Topic 45 weights,Topic 46 words,Topic 46 weights,Topic 47 words,Topic 47 weights,Topic 48 words,Topic 48 weights,Topic 49 words,Topic 49 weights
0,surface,36.3,knowledge,19.6,health,27.1,systems,14.7,public,21.1,...,development,22.7,research,9.7,cultural,47.0,processing,36.8,development,21.3
1,research,14.7,design,15.5,research,22.3,plant,14.5,history,21.0,...,materials,19.1,water,9.0,research,41.5,development,19.2,data,15.7
2,south,14.0,protein,11.9,public,20.2,based,10.7,western,14.1,...,new,15.5,control,9.0,management,33.1,devices,14.5,research,15.6
3,steel,12.1,gold,11.8,management,17.2,disease,10.2,research,10.4,...,quality,12.2,behaviour,8.1,industry,29.3,based,13.7,develop,15.4
4,remote,10.1,technology,11.3,community,14.4,marine,9.7,outcomes,10.3,...,genes,11.7,food,7.5,heritage,19.0,develop,12.7,systems,15.1


##### Labelling the topics and probability distribution over the topics

Ideally, we would want to have an "English" label for each of the topics instead of calling them as Topic 0, Topic 1, ..., etc. In this example, we will attempt to manually label each topic based on the top 10 terms. For more advanced topic modelling, there are some alternatives which have been proposed to do automatic labelling of LDA topics. For examples, see the discussion [here](https://www.researchgate.net/publication/220874747_Automatic_Labelling_of_Topic_Models) and other related articles.

Once we have a label for each of the topic, we can compute the probabilitu distribution over topic for each document in order to chracterise the topic of each document. Recall that in this example, a document is a funded ARC grant project proposal. Thus, effectively, we can now say what each project proposal's topic is and group the project proposals based on their common topic and do further analyses. For simplicity, we will only consider the top 3 topics associated with each document as shown in the DataFrame `doctoptopic`.

```python
doc_topic_dist = model.transform(tf)

# Each document's probabilistic distribution over the topics
doctopicdistdf = pd.DataFrame(doc_topic_dist, columns=topics50)

#top topic for each doc
df = doctopicdistdf
doctoptopic = df.apply(lambda x: pd.Series(x.sort_values(ascending=False)
       .iloc[:3].index, 
      index=['top50_1','top50_2','top50_3']), axis=1).reset_index()

```
For example, if we only consider the top 1 topic label, then Project Proposal #1 is about "regional-data-support; Project Proposal #2 is about "road-drivers-researc"; and Project Proposal #3 is about "cultural-history".

In [37]:
# topic labels (manual labelling based on keywords of each topic in topics50df)
topics50 = ['1.metal-surface', '2.knowledge-design-protein-gold','3.public-health-management',
          '4.plant-based-systems-disease', '5.western-history', '6.road-drivers-research',
          '7.drug-controlled-release', '8.industry-data-control', '9.effective-industry-system',
          '10.water-flow-model', '11.industry-labor-model', '12.industrial-gas-housing',
          '13.rural-effects-model', '14.oil-industry-performance', '15.high-species-control',
          '16.blood-cell-system', '17.water-management', '18.mobile-based-applications',
          '19.young-children-policy', '20.indigenous-disease-drug', '21.school-learning',
          '22.new-blood-services', '23.fish-environmental', '24.native-plant-design',
          '25.age-care-services', '26.regional-data-support', '27.water-nutrient-forest',
          '28.new-systems-seed','29.new-fish-control', '30.mobile-data-pressure',
          '31.arts-data-research', '32.water-transfer-system', '33.networks-growth-policy',
          '34.social-rural-community', '35.industry-based-water', '36.social-change',
          '37.information-support-services', '38.new-species-development', '39.cell-support-system',
          '40.research-test-data', '41.human-safety-policy', '42.cultural-history',
          '43.molecular-management', '44.water-system', 
          '45.urban-ecological-risk', '46.genes-disease-quality', '47.water-control-behavior',
          '48.cultural-management', '49.devices-processing', '50.genetic-data-development']

doc_topic_dist = model.transform(tf)

# Each document's probabilistic distribution over the topics
doctopicdistdf = pd.DataFrame(doc_topic_dist, columns=topics50)

#top topic for each doc
df = doctopicdistdf
doctoptopic = df.apply(lambda x: pd.Series(x.sort_values(ascending=False)
       .iloc[:3].index, 
      index=['top50_1','top50_2','top50_3']), axis=1).reset_index()

doctoptopic.head()

Unnamed: 0,index,top50_1,top50_2,top50_3
0,0,26.regional-data-support,39.cell-support-system,49.devices-processing
1,1,6.road-drivers-research,49.devices-processing,1.metal-surface
2,2,42.cultural-history,2.knowledge-design-protein-gold,5.western-history
3,3,11.industry-labor-model,1.metal-surface,39.cell-support-system
4,4,6.road-drivers-research,34.social-rural-community,21.school-learning
