Lessons Learned in Training AI Models with xAPI Data

I posted an article, ChatGPT and xAPI, that started my journey towards understanding how I could use AI to analyse xAPI data and provide meaningful reports, predictions and outcomes.

First off, this has not been as straightforward as I thought it would be. I wrote the article xapi buckets to describe the process through a light-hearted analogy using buckets. In principle, it looked really, really easy, but oh boy, was I wrong!

I’ll admit, my primary tool for this has been ChatGPT, closely followed by Gemini. The irony of using these tools to understand how they work is not lost on me!

My first prompt was as simple as ‘Tell me how I can train an LLM using xAPI data. I know I can query OpenAI with xAPI data as part of the prompt, but I want to train and query. I’d like to do it on my local machine, I have a MacBook Pro so I should be OK!’

I was NOT ready for the response.

The response from ChatGPT:

Training a Language Model (LLM) using xAPI data on your local machine involves several steps, from data preprocessing to model training and fine-tuning. Here’s a step-by-step guide to help you through the process:

### 1. **Set Up Your Environment**

Ensure you have the necessary tools and libraries installed. You will need Python, PyTorch or TensorFlow, and the Hugging Face Transformers library.

```bash
pip install torch transformers datasets pandas
```

### 2. **Prepare Your xAPI Data**

Ensure your xAPI data is in a suitable format for training. You’ll need to preprocess it to create input-output pairs that the model can learn from. Typically, xAPI data is stored in JSON format, but for training, it might be easier to convert it to CSV.

#### Preprocessing xAPI Data

```python
import json

import pandas as pd

# Load xAPI data from JSON
with open('xapi_data.json', 'r') as file:
    xapi_data = json.load(file)

# Convert to DataFrame
df = pd.DataFrame(xapi_data['statements'])

# Save to CSV
df.to_csv('xapi_data.csv', index=False)
```

### 3. **Fine-tune the Language Model**

You can use the Hugging Face Transformers library to fine-tune a pre-trained model on your xAPI data.

#### Fine-tuning Script

```python
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Load your xAPI data
# (this script expects a 'text' column and, for training, a 'label' column)
data = pd.read_csv('xapi_data.csv')
dataset = Dataset.from_pandas(data)

# Preprocess the text (tokenization)
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Load pre-trained model
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased')

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',  # named eval_strategy in newer transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,
)

# Train the model
trainer.train()
```

### 4. **Evaluate and Save the Model**

After training, evaluate the model and save it for future use.

```python
# Evaluate the model
results = trainer.evaluate()

# Save the model
model.save_pretrained('./my_trained_model')
tokenizer.save_pretrained('./my_trained_model')
```

### 5. **Query the Trained Model**

You can now use your fine-tuned model to make predictions or classify new xAPI statements.

#### Querying Script

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    pipeline,
)

# Load the trained model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained('./my_trained_model')
tokenizer = AutoTokenizer.from_pretrained('./my_trained_model')

# Create a pipeline for classification
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

# Sample xAPI statement
sample_statement = "User completed a quiz on machine learning."

# Get prediction
prediction = classifier(sample_statement)
print(prediction)
```

### Notes:

1. **Hardware Considerations**: Training large models can be resource-intensive. If your MacBook Pro struggles with this, consider using cloud services with GPU support like Google Colab, AWS, or Azure.

2. **Data Quality**: Ensure your xAPI data is clean and well-prepared. Inconsistent or noisy data can significantly impact model performance.

3. **Model Selection**: The example uses `distilbert-base-uncased`, a smaller model suitable for many tasks. Depending on your needs, you might want to use a different model from the Hugging Face model hub.

By following these steps, you should be able to train and query an LLM using your xAPI data on your local machine.

What’s a token and a tokenizer? What’s Hugging Face? What’s LM Studio or Ollama? What is pre-trained and what is not?

I don’t want to know the mathematical formula for a model and, as far as I knew, a Transformer was a toy my kids had 30 years ago!

It wasn’t long before I realised this was going to take a bit of effort, but I was hooked. I have a Learning Record Store (LRS) with over 4 million records and I want answers!

Every free hour I had, I was ‘playing’ (some may call it research). I slowly started to understand what the words meant, but the learning curve was steep.

I decided to use a smaller dataset (another new word) based on the xAPI I have for the Video Demo stored in Veracity LRS. This is around 550 records, a nice simple set of data to work on, and meaningful data. The other reason for using a smaller dataset was computing power, another valuable lesson learnt. I lost count of the times I crashed my laptop trying to run the training.

The principle of training a model is very simple:

1. Collect the data
2. Clean and process the data
3. Train the model
4. Save the model
5. Query the model

Let’s look a little closer at each one:

Getting the data from an LRS is pretty simple, and the tooling for it has been around for a number of years. I used the Rustici TinCan API wrapper in PHP to get the data. I also limited each call to around 2,000 records and saved the timestamp of the last xAPI statement, which allowed me to set the since parameter in the next LRS call.
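As a rough illustration of that incremental fetch, here is a minimal Python sketch (the fetch described above was done in PHP, so this is not that code). The endpoint URL, the credentials and the helper names `build_statements_url`, `fetch_batch` and `last_stored` are hypothetical, and HTTP Basic auth is an assumption about the LRS; `since`, `limit` and `ascending` are query parameters of the xAPI statements resource.

```python
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_statements_url(endpoint, since=None, limit=2000):
    """Build the GET URL for one batch of statements (hypothetical helper)."""
    params = {"limit": limit, "ascending": "true"}  # oldest first
    if since:
        params["since"] = since  # only statements stored after this time
    return f"{endpoint.rstrip('/')}/statements?{urlencode(params)}"


def fetch_batch(endpoint, key, secret, since=None, limit=2000):
    """Fetch one batch from the LRS. Assumes the LRS accepts HTTP Basic auth."""
    token = base64.b64encode(f"{key}:{secret}".encode()).decode()
    request = Request(
        build_statements_url(endpoint, since, limit),
        headers={
            "Authorization": f"Basic {token}",
            "X-Experience-API-Version": "1.0.3",
        },
    )
    with urlopen(request) as response:
        return json.load(response)["statements"]


def last_stored(statements, fallback=None):
    """Return the time of the newest statement in an ascending batch."""
    if not statements:
        return fallback
    newest = statements[-1]
    return newest.get("stored") or newest.get("timestamp") or fallback
```

Saving the value returned by `last_stored` after each run and passing it back as `since` on the next run picks up where the previous call stopped.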

Clean and process the data: it turns out you can’t just pass in the raw xAPI JSON; you need to make the data more meaningful. This is still a work in progress, but the data needs to look something like:

Actor: Julian Davis. (Julian.Davis@example.com). performed completed. on xAPI Demo Video - Big Buck Bunny Test video. (ID: https://xapi.com.au/xAPI_UG_Demo_Video.mp4). at 2024-06-16 23:15:38

As I said, this is a work in progress, and I don’t think I have it right yet. This is a critical part of the process: getting the xAPI data right here will yield better results.
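One way to produce a line in that shape from a raw statement is sketched below. This is only an illustration under assumptions: the function names are made up, the statement is assumed to carry `actor.mbox`, a `verb.display` language map, `object.definition.name` and an ISO 8601 `timestamp`, and real data will likely need more fields (result, context) than this covers.

```python
from datetime import datetime


def first_lang(lang_map, default=""):
    """Pick the first entry of an xAPI language map, e.g. {"en-US": "completed"}."""
    if lang_map:
        return next(iter(lang_map.values()))
    return default


def flatten_statement(stmt):
    """Turn one xAPI statement (a dict) into a single readable line."""
    actor = stmt.get("actor", {})
    name = actor.get("name", "Unknown actor")
    mbox = actor.get("mbox", "")
    email = mbox[len("mailto:"):] if mbox.startswith("mailto:") else mbox

    verb = stmt.get("verb", {})
    # Fall back to the last segment of the verb IRI when there is no display text
    verb_name = first_lang(verb.get("display"), verb.get("id", "").rsplit("/", 1)[-1])

    obj = stmt.get("object", {})
    obj_id = obj.get("id", "")
    obj_name = first_lang(obj.get("definition", {}).get("name"), obj_id)

    # Normalise a trailing "Z" so older Python versions can parse the timestamp
    when = datetime.fromisoformat(stmt["timestamp"].replace("Z", "+00:00"))
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")

    return (
        f"Actor: {name}. ({email}). performed {verb_name}. "
        f"on {obj_name}. (ID: {obj_id}). at {stamp}"
    )
```

Keeping the flattening in one small function makes it easier to experiment with the wording of the line without touching the code that fetches the data.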

The other thing to consider is the format you pass to the model. Depending on the Model, you may need JSONL or just CSV.

This is the same content, but in JSONL:

{"messages": [{"role": "system", "content": "You are an assistant that helps users with xAPI queries."}, {"role": "user", "content": "Summarize the activities performed by all actors in the dataset."}, {"role": "assistant", "content": "Actor: Julian Davis. (Julian.Davis@example.com). performed completed. on xAPI Demo Video - Big Buck Bunny Test video. (ID: https://xapi.com.au/xAPI_UG_Demo_Video.mp4). at 2024-06-16 23:15:38"}]}
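Records in that chat-style layout can be generated rather than typed by hand. The sketch below is a minimal example, assuming each training example is a question paired with an already flattened answer; `to_chat_record` and `to_jsonl` are hypothetical helper names, and whether a given model expects this exact `messages` layout depends on the model.

```python
import json

SYSTEM_PROMPT = "You are an assistant that helps users with xAPI queries."


def to_chat_record(question, answer, system=SYSTEM_PROMPT):
    """Wrap one question/answer pair in the chat 'messages' layout."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


def to_jsonl(records):
    """Serialise records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


if __name__ == "__main__":
    record = to_chat_record(
        "Summarize the activities performed by all actors in the dataset.",
        "Actor: Julian Davis. (Julian.Davis@example.com). performed completed.",
    )
    # Write the training file; the file name is just an example
    with open("xapi_training.jsonl", "w", encoding="utf-8") as handle:
        handle.write(to_jsonl([record]))
```

Using `json.dumps` for each line means quotes and other special characters inside the statement text are escaped correctly, which is easy to get wrong when building the lines by hand.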

Training the model was very much trial and error. ChatGPT was invaluable here. It taught me how to run Python on my Mac, how to understand and interpret error messages by pasting them in as the next prompt, and how to optimise the training process. ChatGPT also explained what the code did. Such a powerful tool for boosting productivity and learning.

Training is the lengthy part of the process. It can take a considerable amount of time and may slow your machine down, especially if you’re trying to do something else on it at the same time.

I looked at multiple models and multiple ways to train a model. There is an overwhelming amount of information available, and sometimes finding what you need was like juggling jelly. There are open-source and paid models. Obviously paid will give you more, and this is where I initially started – till the costs started adding up!

This is what I ended up testing and playing with both locally and online:

- LM Studio with a basic GPT-2 model – Local machine

- Ollama with a GPT-2 model – Local machine

- Mistral and their Small model – Online (https://mistral.ai/)

- Hugging Face and pick one! There are thousands (I opted for GPT-2 models) – Online (https://huggingface.co/)

- Google Colab with a BERT model – Online (https://colab.google/)

I’ll post another article on the end-to-end process of the Google Colab experience with the data I used and how I got it to work.

Saving the model is simple and as easy as saving it to your drive or cloud service.

Querying the model is next – simple in principle, but I am yet to get really meaningful data from what I’ve tried so far. I’m trying to run prompts that return meaningful quantitative and qualitative outputs. Here are some of the prompts I used, based around the data I had:

Qualitative:

Summarize the activities performed by all actors in the dataset.
Provide a summary of all actions performed on the object named ‘xAPI Video Demo’.
Provide a detailed report of all activities with their results and timestamps.
Generate a report of all actors’ activities, including time, progress, and played segments.
List all activities performed by the actor named ‘John Doe’.
What activities did ‘John Doe’ perform on ‘xAPI Video Demo’?
Summarize the progress of ‘John Doe’ on the xAPI Video Demo.
What are the completion details for ‘Jane Doe’ on the xAPI Video Demo?
Analyze the progress of all actors for the xAPI Video Demo.
Analyze the engagement metrics for actors in the dataset.
Summarize the context and results of the activities performed by actors on ‘xAPI Video Demo’.
What were the session details for actors who interacted with ‘xAPI Video Analyze’?
Summarize the session details for all actors in the dataset.
Provide a detailed analysis of session length and interactions for the xAPI Video Demo.

Quantitative:

List all actors who completed the xAPI Video Demo.
Who has completed the xAPI Video Demo and what were their progress and results?
Which actor showed the most progress on the xAPI Video Demo?
Which segments of the xAPI Video Demo had the highest engagement?
Calculate the most popular times at which the video was paused.
Identify the most common pause times for the xAPI Video Demo.
Predict the likelihood of actors likely to complete the xAPI Video Demo based on their current progress.
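For the quantitative prompts, it may help to compute the expected answer directly from the statements, so there is something to compare the model’s reply against. The sketch below does this for the pause-time question. It assumes the statements follow the xAPI Video Profile (the `paused` verb IRI and the `time` result extension used here); the function name and the 10-second bucket size are illustrative choices, not part of the original experiment.

```python
from collections import Counter

# IRIs from the xAPI Video Profile (assumed to be what the data uses)
PAUSED = "https://w3id.org/xapi/video/verbs/paused"
TIME_EXT = "https://w3id.org/xapi/video/extensions/time"


def common_pause_times(statements, bucket_seconds=10, top=3):
    """Count pauses per time bucket and return the most common buckets.

    Each result is (bucket_start_in_seconds, number_of_pauses).
    """
    counts = Counter()
    for stmt in statements:
        if stmt.get("verb", {}).get("id") != PAUSED:
            continue  # not a pause event
        position = stmt.get("result", {}).get("extensions", {}).get(TIME_EXT)
        if position is None:
            continue  # pause without a recorded video position
        bucket = int(float(position) // bucket_seconds) * bucket_seconds
        counts[bucket] += 1
    return counts.most_common(top)
```

With a known answer in hand, a model reply to ‘Identify the most common pause times’ can be checked rather than taken on trust.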

Lessons:

This has been such an experience so far. Learning will never stop. Here are some of the lessons I’ve learnt to date:

- Prompts are key. With prompts comes great power – with power comes mistakes – learn from them.
- Know the model you want to use – not all models are the same.
- Understand the outcome you are trying to achieve, be that reporting, adaptive learning processes or student prediction, BEFORE you start.
- Use a free cloud solution like Colab. Some models run into the gigabytes and fill up your local drive quickly. You also need a fair amount of GPU resources, which the cloud solutions can offer.
- As always, Stay Curious.

Summary

This is just the beginning of my journey into xAPI and AI, and I will continue to share and collaborate with other curious digital learning professionals. There are tonnes of people out there much smarter than me, so let’s share what we learn and make learning analytics a force to be reckoned with.

Want to collaborate? Please connect with me on LinkedIn or contact me. I’d be super keen to hear from anyone that has been playing in this area as well.

STAY CURIOUS

