
adds basic ragas eval #193

Merged: 6 commits merged into instructlab:main on Jan 9, 2025
Conversation

RobotSail (Member) commented Dec 6, 2024

This PR introduces Rubric-based evaluation through Ragas using the default with-reference rubric that they provide.

The current evaluation supports the following two modes:

  1. Being given a dataset containing records which hold user_input (question), reference (golden answer), and response (model answer)
  2. Being given a dataset with only the user_input and reference, plus a ModelConfiguration of a model which will be used to generate the response for each question. This can be any model running on an OpenAI-compatible endpoint.

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
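
For context, a minimal usage sketch of the two modes described above. The class, argument, and model names here (RagasEvaluator, ModelConfig, the run() parameters, the placeholder model name) follow the discussion in this PR and are assumptions, not necessarily the merged API:

```python
# Illustrative sketch only: names and signatures follow the PR discussion
# and may differ from the merged API.
from instructlab.eval.ragas import ModelConfig, RagasEvaluator  # import path assumed

evaluator = RagasEvaluator()

# Mode 1: each record already holds user_input, reference, and response.
completed_dataset = [
    {
        "user_input": "What is the capital of France?",
        "reference": "Paris is the capital of France.",
        "response": "The capital of France is Paris.",
    },
]
result = evaluator.run(dataset=completed_dataset)

# Mode 2: records hold only user_input and reference; a student model served
# on an OpenAI-compatible endpoint generates each response.
student_model = ModelConfig(
    model_name="granite-7b-lab",  # placeholder model name
)
questions = [
    {
        "user_input": "What is the capital of France?",
        "reference": "Paris is the capital of France.",
    },
]
result = evaluator.run(dataset=questions, student_model=student_model)
```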

@mergify mergify bot added dependencies Pull requests that update a dependency file ci-failure labels Dec 6, 2024
@mergify mergify bot added ci-failure and removed ci-failure labels Dec 6, 2024
@mergify mergify bot removed the ci-failure label Dec 6, 2024
requirements.txt (Outdated)
@@ -10,3 +10,5 @@ pandas
 pandas-stubs
 lm-eval>=0.4.4
 httpx
+
+ragas


missing newline at EOF

    def __init__(self):
        pass

    def run(


So for this, a user is expected to bring a list of Sample objects, which hold the input, prediction, and ground truth? Are we going to provide a way to build this list of Samples from given files or lists of each category, or is this more so just for use with self-built scripts that import the Sample object and build the list themselves?

RobotSail (Member Author):

Updated it so that dataset is now either a pathlib.Path object or a list of samples, and we read from it accordingly.
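
For reference, a rough sketch of that dual handling, assuming pandas is used to load JSON-lines files (this mirrors the snippet shown later in this PR; the helper name load_dataset is illustrative):

```python
from pathlib import Path

from pandas import DataFrame, read_json


def load_dataset(dataset) -> DataFrame:
    """Accept either a list of sample dicts or a Path to a JSON-lines file."""
    if isinstance(dataset, list):
        return DataFrame(dataset)
    if isinstance(dataset, Path):
        return read_json(dataset, orient="records", lines=True)
    raise TypeError(f"unsupported dataset type: {type(dataset)}")
```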

abhi1092 (Member) commented Dec 9, 2024

@RobotSail let me know once you are done with testing the code. Other than that LGTM.


max_tokens: int = 768

# Random seed for reproducibility. This is not supported everywhere and therefore is unreliable.
alimaredia (Contributor):

We had discussed this earlier, I think you're going to want to remove this comment.

RobotSail (Member Author):

@alimaredia I'll update this comment because I believe you are confusing its meaning with our earlier conversation. This comment relates to the seed not being supported by every model-serving framework.

RobotSail (Member Author):

@abhi1092 I've updated the code to have unit tests, please let me know if there was anything I missed.

src/instructlab/eval/ragas.py (outdated; review thread resolved)
src/instructlab/eval/ragas.py (review thread resolved)
api_key="test-api-key",
)
evaluator = RagasEvaluator()
result_df = evaluator._generate_answers_from_model(questions, student_model)
abhi1092 (Member):

@RobotSail shouldn't we hit a mock client here too?

abhi1092 (Member):

Instead of using gpt-3.5?

RobotSail (Member Author):

@abhi1092 We do test it with a mock-client, GPT-3.5 was just the first model that came to mind as a fill-in.

RobotSail (Member Author):

Updated this to use fake values to eliminate the possibility of it actually calling out to the real OpenAI API.
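
For reference, one way the fake values and mock client could be wired together in the test; the patch target and assertions are assumptions based on this thread, not the exact merged test:

```python
from unittest.mock import MagicMock, patch

from pandas import DataFrame

from instructlab.eval.ragas import RagasEvaluator  # import path assumed


@patch("instructlab.eval.ragas.get_openai_client")  # patch target assumed
def test_generate_answers_from_model(mock_get_client):
    # Fake client so the test never calls out to the real OpenAI API.
    mock_client = MagicMock()
    mock_client.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content="canned answer"))
    ]
    mock_get_client.return_value = mock_client

    questions = DataFrame([{"user_input": "What is 2 + 2?"}])
    student_model = MagicMock()  # stands in for the student ModelConfig

    evaluator = RagasEvaluator()
    result_df = evaluator._generate_answers_from_model(questions, student_model)

    assert result_df["response"].iloc[0] == "canned answer"
```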

Given a DataFrame containing `user_input` columns, generates responses from the given model
and returns a new DataFrame containing its answers in the `response` column.
"""
client = get_openai_client(
abhi1092 (Member):

@RobotSail maybe this method and also the class can instead take the client as input?

RobotSail (Member Author):

@abhi1092 That's a good idea. Although the way I've done it here is how it happens in MT-Bench as well. Would it make sense to make a follow-up issue for this?

RobotSail (Member Author):

@abhi1092 Actually I will just make this change in this PR since it's small.

RobotSail (Member Author):

Updated the PR to include this.
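
For reference, a hedged sketch of the dependency-injection change being discussed: the caller supplies the OpenAI client instead of the method constructing one internally. The parameter names (student_openai_client, model_name) are illustrative:

```python
from openai import OpenAI
from pandas import DataFrame


class RagasEvaluator:
    def _generate_answers_from_model(
        self,
        questions: DataFrame,
        student_model,                  # ModelConfig-like object (name assumed)
        student_openai_client: OpenAI,  # injected client; trivial to mock in tests
    ) -> DataFrame:
        updated_df = questions.copy()
        for i, row in updated_df.iterrows():
            response = student_openai_client.chat.completions.create(
                model=student_model.model_name,
                messages=[{"role": "user", "content": row["user_input"]}],
            )
            updated_df.at[i, "response"] = response.choices[0].message.content
        return updated_df
```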

########################################################################
# Test case: directly passing a dataset
########################################################################
result = evaluator.run(
abhi1092 (Member):

Here too, maybe we can call a mocked gpt-4o with a pre-defined evaluation result. That way we only test the evaluation logic given the student model response, the reference, and the judge model output.


Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
We want ragas to read from both a file path and a list of samples

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
When a dataset is provided but is missing the `response` field, we need to generate these responses. This commit ensures that in this case, we error out when a student model is not configured. Otherwise, we always generate these responses if the student model exists, regardless of whether `response` is already in the dataframe.

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
…ng that gets passed in to __init__

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
        if isinstance(dataset, list):
            input_df = DataFrame(dataset)
        elif isinstance(dataset, Path):
            input_df = read_json(dataset, orient="records", lines=True)
Contributor:

I think there's an implicit requirement here that the dataset referred to by the path is well-formed (shaped like list[Sample]). Could consider doing a quick check to make sure the required columns are present in the df and failing here if they aren't.

RobotSail (Member Author):

Sure, I don't see a reason not to.
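
A possible shape for that check, failing fast when the file is not shaped like list[Sample]; the column names come from the fields discussed in this PR, and the error type is an assumption:

```python
from pandas import DataFrame

# `response` is optional because it can be generated by the student model.
REQUIRED_COLUMNS = {"user_input", "reference"}


def validate_dataset(input_df: DataFrame) -> None:
    """Raise early if the loaded dataset is missing required columns."""
    missing = REQUIRED_COLUMNS - set(input_df.columns)
    if missing:
        raise ValueError(
            f"dataset is missing required columns: {', '.join(sorted(missing))}"
        )
```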

JamesKunstle (Contributor) left a comment:

This seems solid. I have a couple of function naming suggestions, and I think that we could type-safe the incoming data from a .jsonl file to fail as early as possible.

@mergify mergify bot added the one-approval label Jan 8, 2025

# we will be using gpt-4o for the foreseeable future, so we hardcode this
# for consistency of answers
critic_lm = ChatOpenAI(model=DEFAULT_JUDGE_MODEL)
alimaredia (Contributor) commented Jan 8, 2025:

How is the API key supposed to be passed into the judge model? If we're assuming the environment variable is set for that, we need to make that assumption clear. When I ran the short script you sent me yesterday, I hit this as an issue.

There were previous iterations of this PR that had a base_url and a key in the ModelConfig; I think those are needed, since the student and the judge each need at least a base_url, if not both.

alimaredia (Contributor):

I'm realizing now, maybe your intention is that the user's OpenAI key for the judge model is already set before calling RagasEvaluator.run(). Could we have a check for that, or make it more clear somehow?

RobotSail (Member Author):

I think since we want to lock it down to OpenAI for now just to have a consistent evaluation base, we can just expose the API key. We can make a follow-up where this is something configurable.

RobotSail (Member Author):

This comment is outdated; we've since updated this evaluation to accept the judge model name as well.
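
For reference, a hedged sketch of how the judge could be configured once the judge model name is accepted, with an explicit check for the OpenAI key; import paths and parameter names may differ across library versions:

```python
import os

from langchain_openai import ChatOpenAI  # import path assumed
from ragas.llms import LangchainLLMWrapper

DEFAULT_JUDGE_MODEL = "gpt-4o"


def build_judge_llm(judge_model_name: str = DEFAULT_JUDGE_MODEL) -> LangchainLLMWrapper:
    """Build the judge LLM, failing early if no OpenAI API key is available."""
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY must be set before calling RagasEvaluator.run()")
    critic_lm = ChatOpenAI(model=judge_model_name, api_key=api_key)
    return LangchainLLMWrapper(critic_lm)
```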

            updated_df.at[i, "response"] = response.choices[0].message.content
        return updated_df

    def _get_metrics(self) -> List[Metric]:
Contributor:

What's the use of this function, given that it's only being used in one place and we can't configure what's being passed into rubrics?

RobotSail (Member Author):

The point is that since we select the metrics carefully, having them isolated allows us to treat them with more care. Also - when you add to this list it becomes very messy, so it's just slightly more readable to have it in a separate method.
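
For reference, a sketch of what the isolated helper looks like; the metric class and import path vary across Ragas versions (older releases expose RubricsScoreWithReference, newer ones a RubricsScore that takes a rubrics mapping), so treat these names as assumptions:

```python
from ragas.metrics._domain_specific_rubrics import (  # import path assumed
    DEFAULT_WITH_REFERENCE_RUBRICS,
    RubricsScore,
)


def _get_metrics() -> list:
    # Curating the metric list in one place keeps run() readable as more
    # metrics are added over time.
    return [
        RubricsScore(rubrics=DEFAULT_WITH_REFERENCE_RUBRICS),
    ]
```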

@danmcp danmcp removed their request for review January 8, 2025 20:24
@mergify mergify bot added the ci-failure label Jan 8, 2025
Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>
@mergify mergify bot merged commit c437ef2 into instructlab:main Jan 9, 2025
17 checks passed
@mergify mergify bot removed the one-approval label Jan 9, 2025
Labels: dependencies (Pull requests that update a dependency file), testing (Relates to testing)
Projects: None yet

6 participants