
adds basic ragas eval #193

Merged: 6 commits merged into instructlab:main on Jan 9, 2025
Conversation

RobotSail (Member) commented Dec 6, 2024

This PR introduces Rubric-based evaluation through Ragas using the default with-reference rubric that they provide.

The current evaluation supports the following two modes:

  1. Being given a dataset containing records which hold user_input (question), reference (golden answer), and response (model answer)
  2. Being given a dataset with only the user_input and reference, plus a ModelConfiguration of a model which will be used to generate the response for each question. This can be any model running on an OpenAI-compatible endpoint.

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
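
For context, a minimal usage sketch of the two modes described above. The class, argument, and model names here (RagasEvaluator, ModelConfig, the run() parameters, the placeholder model name) follow the discussion in this PR and are assumptions, not necessarily the merged API:

```python
# Illustrative sketch only: names and signatures follow the PR discussion
# and may differ from the merged API.
from instructlab.eval.ragas import ModelConfig, RagasEvaluator  # import path assumed

evaluator = RagasEvaluator()

# Mode 1: each record already holds user_input, reference, and response.
completed_dataset = [
    {
        "user_input": "What is the capital of France?",
        "reference": "Paris is the capital of France.",
        "response": "The capital of France is Paris.",
    },
]
result = evaluator.run(dataset=completed_dataset)

# Mode 2: records hold only user_input and reference; a student model served
# on an OpenAI-compatible endpoint generates each response.
student_model = ModelConfig(
    model_name="granite-7b-lab",  # placeholder model name
)
questions = [
    {
        "user_input": "What is the capital of France?",
        "reference": "Paris is the capital of France.",
    },
]
result = evaluator.run(dataset=questions, student_model=student_model)
```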

@mergify mergify bot added dependencies Pull requests that update a dependency file ci-failure labels Dec 6, 2024
@mergify mergify bot added ci-failure and removed ci-failure labels Dec 6, 2024
@mergify mergify bot removed the ci-failure label Dec 6, 2024
requirements.txt (Outdated)
@@ -10,3 +10,5 @@ pandas
 pandas-stubs
 lm-eval>=0.4.4
 httpx
+
+ragas


missing newline at EOF

    def __init__(self):
        pass

    def run(


So for this, a user is expected to bring a list of Sample objects, which hold the input, prediction, and ground truth? Are we going to provide a way to build this list of Samples from given files or lists of each category, or is this more so just for use with self-built scripts that import the Sample object and build the list themselves?

RobotSail (Member Author):

Updated it so that dataset is now either a pathlib.Path object or a list of samples, and we read from it accordingly.
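
For reference, a rough sketch of that dual handling, assuming pandas is used to load JSON-lines files (this mirrors the snippet shown later in this PR; the helper name load_dataset is illustrative):

```python
from pathlib import Path

from pandas import DataFrame, read_json


def load_dataset(dataset) -> DataFrame:
    """Accept either a list of sample dicts or a Path to a JSON-lines file."""
    if isinstance(dataset, list):
        return DataFrame(dataset)
    if isinstance(dataset, Path):
        return read_json(dataset, orient="records", lines=True)
    raise TypeError(f"unsupported dataset type: {type(dataset)}")
```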

abhi1092 (Member) commented Dec 9, 2024

@RobotSail let me know once you are done with testing the code. Other than that LGTM.


max_tokens: int = 768

# Random seed for reproducibility. This is not supported everywhere and therefore is unreliable.
alimaredia (Contributor):

We had discussed this earlier, I think you're going to want to remove this comment.

RobotSail (Member Author):

@alimaredia I'll update this comment because I believe you are confusing its meaning with our earlier conversation. This comment relates to the seed not being supported by every model-serving framework.

RobotSail (Member Author):

@abhi1092 I've updated the code to have unit tests, please let me know if there was anything I missed.

src/instructlab/eval/ragas.py (outdated; review thread resolved)
src/instructlab/eval/ragas.py (review thread resolved)
api_key="test-api-key",
)
evaluator = RagasEvaluator()
result_df = evaluator._generate_answers_from_model(questions, student_model)
abhi1092 (Member):

@RobotSail shouldn't we hit a mock client here too?

abhi1092 (Member):

Instead of using gpt-3.5?

RobotSail (Member Author):

@abhi1092 We do test it with a mock-client, GPT-3.5 was just the first model that came to mind as a fill-in.

RobotSail (Member Author):

Updated this to use fake values to eliminate the possibility of it actually calling out to the real OpenAI API.
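
For reference, one way the fake values and mock client could be wired together in the test; the patch target and assertions are assumptions based on this thread, not the exact merged test:

```python
from unittest.mock import MagicMock, patch

from pandas import DataFrame

from instructlab.eval.ragas import RagasEvaluator  # import path assumed


@patch("instructlab.eval.ragas.get_openai_client")  # patch target assumed
def test_generate_answers_from_model(mock_get_client):
    # Fake client so the test never calls out to the real OpenAI API.
    mock_client = MagicMock()
    mock_client.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content="canned answer"))
    ]
    mock_get_client.return_value = mock_client

    questions = DataFrame([{"user_input": "What is 2 + 2?"}])
    student_model = MagicMock()  # stands in for the student ModelConfig

    evaluator = RagasEvaluator()
    result_df = evaluator._generate_answers_from_model(questions, student_model)

    assert result_df["response"].iloc[0] == "canned answer"
```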

Given a DataFrame containing `user_input` columns, generates responses from the given model
and returns a new DataFrame containing its answers in the `response` column.
"""
client = get_openai_client(
abhi1092 (Member):

@RobotSail maybe this method and also the class can instead take the client as input?

RobotSail (Member Author):

@abhi1092 That's a good idea. Although the way I've done it here is how it happens in MT-Bench as well. Would it make sense to make a follow-up issue for this?

RobotSail (Member Author):

@abhi1092 Actually I will just make this change in this PR since it's small.

RobotSail (Member Author):

Updated the PR to include this.
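
For reference, a hedged sketch of the dependency-injection change being discussed: the caller supplies the OpenAI client instead of the method constructing one internally. The parameter names (student_openai_client, model_name) are illustrative:

```python
from openai import OpenAI
from pandas import DataFrame


class RagasEvaluator:
    def _generate_answers_from_model(
        self,
        questions: DataFrame,
        student_model,                  # ModelConfig-like object (name assumed)
        student_openai_client: OpenAI,  # injected client; trivial to mock in tests
    ) -> DataFrame:
        updated_df = questions.copy()
        for i, row in updated_df.iterrows():
            response = student_openai_client.chat.completions.create(
                model=student_model.model_name,
                messages=[{"role": "user", "content": row["user_input"]}],
            )
            updated_df.at[i, "response"] = response.choices[0].message.content
        return updated_df
```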

########################################################################
# Test case: directly passing a dataset
########################################################################
result = evaluator.run(
abhi1092 (Member):

Here too, maybe we can call a mocked gpt-4o with a pre-defined evaluation result. That way we only test the evaluation logic given the student model response, the reference, and the judge model output.


Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
We want ragas to read from both a file path and a list of samples

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
When a dataset is provided but is missing the `response` field, we need to generate these responses. This commit ensures that in this case, we error out when a student model is not configured. Otherwise, we always generate these responses if the student model exists, regardless of whether `response` is already in the dataframe.

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
…ng that gets passed in to __init__

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
        if isinstance(dataset, list):
            input_df = DataFrame(dataset)
        elif isinstance(dataset, Path):
            input_df = read_json(dataset, orient="records", lines=True)
Contributor:

I think there's an implicit requirement here that the dataset referred to by the path is well-formed (shaped like list[Sample]). Could consider doing a quick check to make sure the required columns are present in the df and failing here if they aren't.

RobotSail (Member Author):

Sure, I don't see a reason not to.
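
A possible shape for that check, failing fast when the file is not shaped like list[Sample]; the column names come from the fields discussed in this PR, and the error type is an assumption:

```python
from pandas import DataFrame

# `response` is optional because it can be generated by the student model.
REQUIRED_COLUMNS = {"user_input", "reference"}


def validate_dataset(input_df: DataFrame) -> None:
    """Raise early if the loaded dataset is missing required columns."""
    missing = REQUIRED_COLUMNS - set(input_df.columns)
    if missing:
        raise ValueError(
            f"dataset is missing required columns: {', '.join(sorted(missing))}"
        )
```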

JamesKunstle (Contributor) left a comment:

This seems solid. I have a couple of function naming suggestions, and I think that we could type-safe the incoming data from a .jsonl file to fail as early as possible.

@mergify mergify bot added the one-approval label Jan 8, 2025

# we will be using gpt-4o for the foreseeable future, so we hardcode this
# for consistency of answers
critic_lm = ChatOpenAI(model=DEFAULT_JUDGE_MODEL)
alimaredia (Contributor) commented Jan 8, 2025:

How is the API key supposed to be passed into the judge model? If we're assuming the environment variable is set for that, we need to make that assumption clear. When I ran the short script you sent me yesterday, I hit this as an issue.

There were previous iterations of this PR that had a base_url and a key in the ModelConfig; I think those are needed, since the student and the judge each need at least a base_url, if not both.

alimaredia (Contributor):

I'm realizing now, maybe your intention is that the user's OpenAI key for the judge model is already set before calling RagasEvaluator.run(). Could we have a check for that, or make it more clear somehow?

RobotSail (Member Author):

I think since we want to lock it down to OpenAI for now just to have a consistent evaluation base, we can just expose the API key. We can make a follow-up where this is something configurable.

RobotSail (Member Author):

This comment is outdated; we've since updated this evaluation to accept the judge model name as well.
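
For reference, a hedged sketch of how the judge could be configured once the judge model name is accepted, with an explicit check for the OpenAI key; import paths and parameter names may differ across library versions:

```python
import os

from langchain_openai import ChatOpenAI  # import path assumed
from ragas.llms import LangchainLLMWrapper

DEFAULT_JUDGE_MODEL = "gpt-4o"


def build_judge_llm(judge_model_name: str = DEFAULT_JUDGE_MODEL) -> LangchainLLMWrapper:
    """Build the judge LLM, failing early if no OpenAI API key is available."""
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY must be set before calling RagasEvaluator.run()")
    critic_lm = ChatOpenAI(model=judge_model_name, api_key=api_key)
    return LangchainLLMWrapper(critic_lm)
```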

            updated_df.at[i, "response"] = response.choices[0].message.content
        return updated_df

    def _get_metrics(self) -> List[Metric]:
Contributor:

What's the use of this function, given that it's only being used in one place and we can't configure what's being passed into rubrics?

RobotSail (Member Author):

The point is that since we select the metrics carefully, having them isolated allows us to treat them with more care. Also - when you add to this list it becomes very messy, so it's just slightly more readable to have it in a separate method.
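
For reference, a sketch of what the isolated helper looks like; the metric class and import path vary across Ragas versions (older releases expose RubricsScoreWithReference, newer ones a RubricsScore that takes a rubrics mapping), so treat these names as assumptions:

```python
from ragas.metrics._domain_specific_rubrics import (  # import path assumed
    DEFAULT_WITH_REFERENCE_RUBRICS,
    RubricsScore,
)


def _get_metrics() -> list:
    # Curating the metric list in one place keeps run() readable as more
    # metrics are added over time.
    return [
        RubricsScore(rubrics=DEFAULT_WITH_REFERENCE_RUBRICS),
    ]
```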

@danmcp danmcp removed their request for review January 8, 2025 20:24
@mergify mergify bot added the ci-failure label Jan 8, 2025
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
@mergify mergify bot merged commit c437ef2 into instructlab:main Jan 9, 2025
17 checks passed
@mergify mergify bot removed the one-approval label Jan 9, 2025
Labels: dependencies (Pull requests that update a dependency file), testing (Relates to testing)
Projects: None yet

6 participants