This connects to the Course Outline topic “Tools of Data Science”.
Learning outcomes:
- Complete an R programming lab in Google Colab that demonstrates AI model building and AI-assisted data analysis.

This lab focuses on text classification using machine learning techniques, which aligns with the course objectives related to NLP and ML applications.
Preamble: Setting up R in Google Colab
Before we begin our R programming lab, we need to set up our Google Colab environment to work with R. Follow these steps:
1. Sign in with your Google account if prompted.
2. Click "New Notebook" (or File > New Notebook).
3. Click "Untitled0.ipynb" at the top and rename the notebook to "R_AI_Data_Analysis_Lab".
4. Go to Runtime > Change runtime type and, in the pop-up, select "R" from the dropdown menu.
5. Install the necessary R packages. In the first code cell, paste `install.packages(c("tidyverse", "tidytext", "caret", "e1071", "syuzhet", "wordcloud"))` and run the cell by clicking the play button or pressing Shift+Enter. This may take a few minutes to complete.
6. Load the installed packages. In a new code cell, paste and run `library(tidyverse); library(tidytext); library(caret); library(e1071); library(syuzhet); library(wordcloud)`.
7. Confirm the setup. In a new cell, type and run `print("R is ready for our AI and Data Analysis lab!")`.

Now that we have our R environment set up in Google Colab, we're ready to begin the lab on AI model building and data analysis. Each subsequent part of the lab can be pasted into a new code cell and run sequentially.
Remember:
- You can add new code cells by clicking the "+ Code" button or with the keyboard shortcut Ctrl+M B.
- To run a cell, click the play button next to it or press Shift+Enter.
- You can add text cells for notes by clicking "+ Text" or with Ctrl+M T.

Let's begin our exploration of AI and data analysis with R!
[Proceed with the main lab content]
This preamble provides a clear, step-by-step guide for students to set up their R environment in Google Colab, ensuring everyone starts on the same page and can follow along with the lab smoothly.
```r
# Title: R Programming Lab for AI Model Building and Data Analysis
# Objective: Demonstrate text classification and sentiment analysis using R in Google Colab
# Part 1: Setup and Data Preparation
# Install and load necessary packages
install.packages(c("tidyverse", "tidytext", "caret", "e1071", "syuzhet", "wordcloud"))
library(tidyverse)
library(tidytext)
library(caret)
library(e1071)
library(syuzhet)
# Load dataset (IMDB Movie Reviews)
url <- "https://raw.githubusercontent.com/jbrownlee/Datasets/master/review_polarity.tar.gz"
download.file(url, destfile = "review_polarity.tar.gz")
untar("review_polarity.tar.gz")
# Read positive and negative reviews
pos_files <- list.files("txt_sentoken/pos", full.names = TRUE)
neg_files <- list.files("txt_sentoken/neg", full.names = TRUE)
# Each file contains one review; collapse its lines into a single string
pos_reviews <- map_chr(pos_files, ~ paste(read_lines(.x), collapse = " "))
neg_reviews <- map_chr(neg_files, ~ paste(read_lines(.x), collapse = " "))
# Create a dataframe
reviews <- data.frame(
  text = c(pos_reviews, neg_reviews),
  sentiment = factor(c(rep("positive", length(pos_reviews)),
                       rep("negative", length(neg_reviews))))
)
# Part 2: Text Preprocessing
reviews_clean <- reviews %>%
  mutate(text = str_to_lower(text),
         text = str_replace_all(text, "[^[:alnum:]\\s]", ""),
         text = str_replace_all(text, "\\s+", " "),
         text = str_trim(text))
# Part 3: Feature Extraction
reviews_tokens <- reviews_clean %>%
  unnest_tokens(word, text)
word_counts <- reviews_tokens %>%
  count(sentiment, word, sort = TRUE)
total_words <- word_counts %>%
  group_by(sentiment) %>%
  summarize(total = sum(n))
word_counts <- word_counts %>%
  left_join(total_words, by = "sentiment") %>%
  bind_tf_idf(word, sentiment, n)
# Select top features
top_features <- word_counts %>%
  arrange(desc(tf_idf)) %>%
  group_by(sentiment) %>%
  slice_max(tf_idf, n = 100, with_ties = FALSE) %>%
  ungroup() %>%
  distinct(word)
# Create a document-term matrix (one row per review, keyed by doc_id)
reviews_clean <- reviews_clean %>%
  mutate(doc_id = row_number())
reviews_dtm <- reviews_clean %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word) %>%
  filter(word %in% top_features$word) %>%
  cast_dtm(doc_id, word, n)
dtm_matrix <- as.matrix(reviews_dtm)
# Reviews containing none of the selected features drop out of the matrix,
# so realign the labels using the doc_id row names
dtm_labels <- reviews_clean$sentiment[as.integer(rownames(dtm_matrix))]
# Part 4: Model Training and Evaluation
set.seed(123)
train_index <- createDataPartition(dtm_labels, p = 0.8, list = FALSE)
train_data <- dtm_matrix[train_index, ]
test_data <- dtm_matrix[-train_index, ]
train_labels <- dtm_labels[train_index]
test_labels <- dtm_labels[-train_index]
# Train SVM model
svm_model <- svm(x = train_data, y = train_labels, kernel = "linear")
# Make predictions
predictions <- predict(svm_model, newdata = test_data)
# Evaluate model
confusion_matrix <- confusionMatrix(predictions, test_labels)
print(confusion_matrix)
# Part 5: Sentiment Analysis
# Function to sum NRC sentiment scores across a set of texts
get_sentiment_scores <- function(text) {
  scores <- get_nrc_sentiment(text)
  colSums(scores)
}
# Apply sentiment analysis to the review text, grouped by label
sentiment_scores <- reviews_clean %>%
  group_by(sentiment) %>%
  summarize(scores = list(get_sentiment_scores(text))) %>%
  mutate(scores = map(scores, ~ as_tibble(as.list(.x)))) %>%
  unnest(scores)
# Visualize sentiment scores
sentiment_scores %>%
  gather(emotion, score, -sentiment) %>%
  ggplot(aes(x = emotion, y = score, fill = sentiment)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Sentiment Analysis of Movie Reviews",
       x = "Emotion", y = "Score") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Part 6: Word Clouds for Visualization
library(wordcloud)
# Function to create a word cloud from a character vector of texts
create_wordcloud <- function(text_vec, title) {
  words <- tibble(text = text_vec) %>%
    unnest_tokens(word, text) %>%
    count(word, sort = TRUE) %>%
    filter(!word %in% stop_words$word)
  wordcloud(words = words$word, freq = words$n, max.words = 100,
            random.order = FALSE, rot.per = 0.35,
            colors = brewer.pal(8, "Dark2"))
  title(title)
}
# Create word clouds for positive and negative reviews
par(mfrow = c(1, 2))
create_wordcloud(reviews_clean$text[reviews_clean$sentiment == "positive"], "Positive Reviews")
create_wordcloud(reviews_clean$text[reviews_clean$sentiment == "negative"], "Negative Reviews")
# Part 7: Conclusion and Next Steps
cat("
Lab Conclusion:
1. We've built a text classification model using SVM to predict sentiment.
2. We've performed sentiment analysis to understand emotional content.
3. We've visualized the results using ggplot2 and word clouds.
Next Steps:
1. Experiment with other ML algorithms (e.g., Random Forest, Naive Bayes).
2. Try more advanced NLP techniques like word embeddings.
3. Scale up to larger datasets using distributed computing frameworks.
")
```
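To see what the Part 2 cleaning pipeline does to a single string, here is a base-R equivalent (the sample input is invented for illustration):

```r
# Base-R sketch of the Part 2 cleaning steps: lowercase, strip punctuation,
# collapse repeated whitespace, and trim the ends.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^[:alnum:][:space:]]", "", x)  # keep letters, digits, whitespace
  x <- gsub("\\s+", " ", x)                  # collapse runs of whitespace
  trimws(x)
}

print(clean_text("  GREAT movie!!  Loved   it... "))  # "great movie loved it"
```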
This lab covers several key aspects of AI model building and data analysis using R:
1. Data preparation and preprocessing
2. Feature extraction using TF-IDF
3. Machine learning model (SVM) for text classification
4. Model evaluation using confusion matrix
5. Sentiment analysis using the syuzhet package
6. Data visualization with ggplot2 and word clouds
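To make the TF-IDF weighting in item 2 concrete, here is a small base-R illustration on a toy corpus. The documents and terms are invented for the example; the formula follows the same convention as tidytext's `bind_tf_idf` (tf = term count / document length, idf = natural log of N over the number of documents containing the term):

```r
# Toy corpus of three "documents" as word vectors (illustrative)
docs <- list(c("good", "movie", "good"),
             c("bad", "movie"),
             c("good", "plot", "bad", "plot"))
terms <- sort(unique(unlist(docs)))

# Term frequency: term count divided by document length
tf <- t(sapply(docs, function(d) table(factor(d, levels = terms)) / length(d)))
# Inverse document frequency: log(N / number of documents containing the term)
idf <- log(length(docs) / colSums(tf > 0))
# TF-IDF: high for terms frequent in one document but rare across the corpus
tf_idf <- sweep(tf, 2, idf, `*`)
print(round(tf_idf, 3))
```

Note how "movie", which appears in two of the three documents, ends up with a lower weight than "plot", which is concentrated in a single document.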
The lab uses the IMDB movie review dataset, which is a common benchmark for sentiment analysis tasks. It demonstrates how to build a sentiment classifier and perform more in-depth sentiment analysis, aligning with the course objectives related to NLP and machine learning.
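When interpreting the confusion matrix produced in Part 4, recall that overall accuracy is the sum of the diagonal (correct predictions) divided by the total. A base-R illustration with made-up counts:

```r
# Hypothetical 2x2 confusion matrix (counts are invented for illustration)
cm <- matrix(c(180, 20,
               30, 170),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("negative", "positive"),
                             actual    = c("negative", "positive")))
# Accuracy = correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)  # 0.875
```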
To run this lab in Google Colab:
1. Create a new notebook and set the runtime to R (Runtime > Change runtime type), as described in the preamble
2. Copy and paste each part of the code into its own cell
3. Run the cells in order

Because the notebook uses the native R runtime, no `%%R` cell magic is required.
Note that the first run might take some time as it needs to install the necessary packages. Also, you might need to adjust the code if you encounter any memory constraints in Colab.
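If Colab memory does become a constraint, one simple adjustment is to work on a random subset of the reviews before building the document-term matrix. A minimal sketch, where the sample size and the stand-in data frame are illustrative:

```r
set.seed(42)
# Draw a random subset of rows; n is capped at the number of rows available
subsample_rows <- function(df, n) {
  df[sample(nrow(df), min(n, nrow(df))), , drop = FALSE]
}

# Stand-in for the lab's `reviews` data frame
demo <- data.frame(text = paste("review", 1:2000),
                   sentiment = rep(c("positive", "negative"), each = 1000))
demo_small <- subsample_rows(demo, 500)
print(nrow(demo_small))  # 500
```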
This lab provides a hands-on experience with R programming for AI and data analysis tasks, covering several topics mentioned in the course outline, particularly sections 7 (Big Data) and 8 (Data Science Tools for AI and ML).
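As a starting point for the first suggested next step, here is a from-scratch sketch of multinomial Naive Bayes on toy word counts. All names and numbers are invented for the example; in practice you could instead call `naiveBayes()` from the already-installed e1071 package.

```r
# Toy document-term counts: 4 documents x 3 words (illustrative data)
x <- matrix(c(3, 0, 1,
              0, 4, 2,
              2, 1, 0,
              0, 3, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("good", "bad", "plot")))
y <- factor(c("pos", "neg", "pos", "neg"))
classes <- levels(y)

# Per-class word log-probabilities with Laplace (add-one) smoothing
log_probs <- sapply(classes, function(cl) {
  counts <- colSums(x[y == cl, , drop = FALSE]) + 1
  log(counts / sum(counts))
})
log_prior <- log(table(y) / length(y))

# Score a new document (a count vector in the same word order as x's columns)
predict_nb <- function(doc) {
  scores <- log_prior[classes] + colSums(doc * log_probs)
  classes[which.max(scores)]
}

print(predict_nb(c(good = 2, bad = 0, plot = 1)))  # "pos"
```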