Kaggle_LLMClassificationFinetuning

Overview

This repo is the humble result of my work on a Kaggle competition: https://www.kaggle.com/competitions/llm-classification-finetuning/overview

The idea is to predict which response users will prefer in a head-to-head battle between chatbots powered by LLMs. The dataset is composed of a prompt and two responses coming from two different LLMs in the Chatbot Arena.

The only data accessible in the test set are [prompt], [response_a] and [response_b].

It is a multiclass classification task evaluated on the log loss of the probabilities predicted for each class.
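For reference, a minimal sketch of the metric using scikit-learn (the sample values below are made up for illustration):

```python
# Multiclass log loss over the three classes, as used by the competition.
import numpy as np
from sklearn.metrics import log_loss

y_true = [0, 2, 1]  # 0 = response_a preferred, 1 = response_b preferred, 2 = tie
y_pred = np.array([
    [0.70, 0.20, 0.10],
    [0.20, 0.30, 0.50],
    [0.10, 0.80, 0.10],
])  # one probability row per sample, each row summing to 1

print(log_loss(y_true, y_pred, labels=[0, 1, 2]))  # lower is better
```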

I did this mainly to improve my knowledge of NLP and LLM finetuning.

Evolution

I started with a notebook provided by Kaggle, working with TensorFlow and WSL. I had many issues with that combination (tensor incompatibilities between TensorFlow and Transformers, for instance), so I quickly recreated the notebook using PyTorch, which worked like a charm.

First I tried a solution using RoBERTa and a siamese network, tokenizing the prompt paired with each response separately. This achieved a modest result, but good enough to start with.
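Roughly, the setup looked like this; a minimal sketch rather than the exact notebook code (class and variable names are illustrative, and "roberta-base" stands in for the checkpoint actually used):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SiameseClassifier(nn.Module):
    """Shared encoder for (prompt, response_a) and (prompt, response_b);
    the two [CLS] embeddings are concatenated and classified."""
    def __init__(self, backbone="roberta-base", n_classes=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        self.head = nn.Linear(2 * self.encoder.config.hidden_size, n_classes)

    def encode(self, batch):
        return self.encoder(**batch).last_hidden_state[:, 0]  # [CLS] embedding

    def forward(self, pair_a, pair_b):
        emb_a = self.encode(pair_a)  # prompt paired with response_a
        emb_b = self.encode(pair_b)  # prompt paired with response_b
        return self.head(torch.cat([emb_a, emb_b], dim=-1))

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
kwargs = dict(return_tensors="pt", truncation=True, max_length=256)
pair_a = tokenizer("the prompt", "first response", **kwargs)
pair_b = tokenizer("the prompt", "second response", **kwargs)
logits = SiameseClassifier()(pair_a, pair_b)  # shape: (1, 3)
```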

Then I played a bit with some basic feature engineering (length, similarity, keyword overlap and lexical diversity), which improved my results a little. For that I created a model using RoBERTa, got both embeddings from it, concatenated them with a vector containing all my features, and added a classification head on top.
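A sketch of that feature-augmented head (the feature list, layer sizes and dropout rate below are illustrative, not the actual values):

```python
import torch
import torch.nn as nn

class FeatureAugmentedHead(nn.Module):
    """Classify the concatenation [emb_a ; emb_b ; handcrafted features]."""
    def __init__(self, hidden=768, n_features=4, n_classes=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * hidden + n_features, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, n_classes),
        )

    def forward(self, emb_a, emb_b, features):
        # features: e.g. [length ratio, similarity, keyword overlap, lexical diversity]
        return self.head(torch.cat([emb_a, emb_b, features], dim=-1))

head = FeatureAugmentedHead()
emb_a, emb_b = torch.randn(8, 768), torch.randn(8, 768)  # from the encoder
features = torch.randn(8, 4)                             # handcrafted features
print(head(emb_a, emb_b, features).shape)                # torch.Size([8, 3])
```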

Then I switched the backbone to mDeBERTa ("microsoft/mdeberta-v3-base" on HuggingFace), a multilingual DeBERTa-v3 model, which is meant to handle multilingual embeddings. This helped improve results as well.
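Since everything goes through the transformers Auto classes, swapping the backbone is a small change:

```python
from transformers import AutoModel, AutoTokenizer

# Same code path as the RoBERTa version; only the checkpoint name changes.
tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
encoder = AutoModel.from_pretrained("microsoft/mdeberta-v3-base")
```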

Finally, a good enhancement came from adding a warm-up/decay scheduler (originally present in the TensorFlow starter notebook), along with different starting learning rates for the finetuned backbone and the classification layer. This drastically improved my results. I did not take the time to search for optimal hyperparameters, since I had already spent enough time on this project and wanted to start something else, so there are still possible improvements to be made on this part.
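A sketch of that setup (the learning rates, warm-up fraction and step count below are illustrative, not the actual values):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, get_linear_schedule_with_warmup

encoder = AutoModel.from_pretrained("microsoft/mdeberta-v3-base")
head = nn.Linear(encoder.config.hidden_size, 3)

# Two parameter groups with different starting learning rates.
optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 1e-5},  # gentle finetuning of the backbone
    {"params": head.parameters(), "lr": 1e-3},     # faster training of the new head
])

num_training_steps = 1_000  # epochs * batches per epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # 10% warm-up, then linear decay
    num_training_steps=num_training_steps,
)

# In the training loop, step the scheduler once per optimizer step:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```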

Results

The competition scores submissions using the log loss between predictions and test set labels, with probabilities distributed between [response_a preferred], [response_b preferred] and [tie]. I scored a loss of 1.19, while the best entries on the leaderboard are close to 0.83. This is 'ok' but not a particularly good result.

But there is plenty of room for improvement, and I now have a good backbone to start another interesting competition based on almost the same setup.

Possible improvements

  • Create a pipeline with less data to be able to test different ideas/feature engineering/models, so I can iterate faster and compare more strategies.
  • Better feature engineering: I already have better ideas on how to handle similarity.
  • Try bigger and better models: I saw very good results from people using Gemma 2, and I recently learned about the existence of a multilingual Gemma 2 (https://huggingface.co/BAAI/bge-multilingual-gemma2) that I would like to test.
  • Grid search to optimize hyperparameters.
  • Get the most out of Kaggle's GPU T4 x2 accelerator by using multi-GPU training.
  • Increase the sequence length, currently at 256, which is not ideal.
  • Change the model to create only one embedding containing the prompt, resp_a and resp_b (see the sketch after this list). Currently the model uses too much memory by storing the prompt twice, and I am stuck with a poor sequence length (256).
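A sketch of the single-sequence idea from the last bullet (the separator format and field labels are an assumption; formats vary by tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")

# One sequence per sample, so the prompt is only stored once and the
# token budget is shared between the prompt and both responses.
text = (f"prompt: the prompt {tokenizer.sep_token} "
        f"response_a: first answer {tokenizer.sep_token} "
        f"response_b: second answer")
batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
# A single encoder pass then yields one embedding for the classification head.
```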

Next Step

Using this work as a baseline for another, similar (timed) competition: WSDM Cup - Multilingual Chatbot Arena. It is almost the same task, but with only a binary classification (no tie) and more emphasis on supporting multilingual prompts.

Links

Training on Kaggle: https://www.kaggle.com/code/ohmatheus/llm-classification-supervisedlearning
Prediction on Kaggle: https://www.kaggle.com/code/ohmatheus/llm-classification-predict
