
feat: Implement llm system evaluation #440

Merged
merged 13 commits into main from evaluation on Aug 19, 2024

Conversation

@Yuan325 (Collaborator) commented Jul 18, 2024

Implementation of LLM system evaluation.

Usage example:

To output a detailed metric table (row-specific scores), set the following environment variable: `export EXPORT_CSV=True`

Run the following: `python run_evaluation.py`

@Yuan325 requested a review from a team as a code owner July 18, 2024 20:59
@Yuan325 (Collaborator, Author) commented Aug 13, 2024

/gcbrun

Comment on lines +41 to +43
```
_RETRIEVAL_EXPERIMENT_NAME: "retrieval-phase-eval-${_PR_NUMBER}"
_RESPONSE_EXPERIMENT_NAME: "response-phase-eval-${_PR_NUMBER}"
```
Collaborator

Do these still get run on merge to main?

Collaborator Author

Nope, this cloudbuild.yaml only runs on PRs. If we want to run it periodically on main, we can set up a Cloud Build trigger and replace the name for this.

llm_demo/evaluation/eval_golden.py (review thread resolved)
Yuan325 and others added 13 commits August 19, 2024 15:05
Adding golden datasets that will be used in LLM system evaluation.

The golden dataset is separated into multiple types of queries:
- queries that use a specific tool
- airline-related queries (no tool calling; the answer is within the prompt)
- assistant-related questions (no tool calling; the answer is within the prompt)
- out-of-context questions (no tool calling)
- multi-tool selections (the agent selects multiple tools before returning a final answer to the user)
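
Below is a minimal sketch of what one golden entry might look like. The class name `EvalGolden` and its fields (`query`, `category`, `tool_calls`, `reference_answer`) are illustrative assumptions, not the actual schema in `llm_demo/evaluation/eval_golden.py`.

```
# Hypothetical sketch of a golden entry; field names are assumptions,
# not the real schema in llm_demo/evaluation/eval_golden.py.
from dataclasses import dataclass, field

@dataclass
class EvalGolden:
    query: str              # user input sent to the LLM
    category: str           # e.g. "tool", "airline", "assistant", "out-of-context", "multitool"
    tool_calls: list = field(default_factory=list)  # expected tool invocations; empty if none
    reference_answer: str = ""  # ground-truth answer for the response phase

goldens = [
    EvalGolden(
        query="Find flights from SFO to DEN tomorrow",
        category="tool",
        tool_calls=[{"name": "search_flights",
                     "arguments": {"departure_airport": "SFO", "arrival_airport": "DEN"}}],
    ),
    EvalGolden(
        query="What assistance do you provide?",
        category="assistant",
        reference_answer="I can help you search for flights and book tickets.",
    ),
]
```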
Add the function to get a prediction for each of the queries from the golden dataset. Predictions are compared against the goldens to compute metrics.

Usage example:

```
from evaluation import run_llm_for_eval, goldens

# set up orchestration, session, set uuid
eval_list = await run_llm_for_eval(goldens, orchestration, session, session_id)
```
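
As an illustration of the shape such a helper might take, the sketch below loops over the goldens and pairs each prediction with its golden for later scoring. The `orchestration.invoke` call and the response fields are assumptions, not the project's actual interface.

```
# Hypothetical sketch: run each golden query through the orchestration and
# record the prediction next to its golden for the two evaluation phases.
async def run_llm_for_eval(goldens, orchestration, session, session_id):
    eval_list = []
    for golden in goldens:
        # `invoke` and its return shape are assumed for illustration.
        response = await orchestration.invoke(golden.query, session, session_id)
        eval_list.append({
            "golden": golden,
            "predicted_tool_calls": response.get("tool_calls", []),
            "predicted_answer": response.get("output", ""),
        })
    return eval_list
```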
Add the function to run evaluation for the retrieval phase (evaluating the LLM's ability to pick and utilize the tools it is given).

Usage example:

```
eval_results = evaluate_retrieval_phase(eval_datas)
```
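
One simple retrieval-phase metric would be tool-selection accuracy: the fraction of queries for which the model invoked exactly the expected tools. The sketch below (reusing the hypothetical golden fields from above) is illustrative only, not the actual metric code reported to the Vertex AI experiment named in cloudbuild.yaml.

```
# Illustrative tool-selection accuracy; an assumed metric, not the real one.
def evaluate_retrieval_phase(eval_datas):
    correct = 0
    for data in eval_datas:
        expected = [t["name"] for t in data["golden"].tool_calls]
        predicted = [t["name"] for t in data["predicted_tool_calls"]]
        if expected == predicted:  # same tools, in the same order
            correct += 1
    return {"tool_selection_accuracy": correct / len(eval_datas)}
```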
Add the function to run evaluation for the response phase (evaluating the LLM's output against the instructions and context given).

Usage example:

```
eval_results = evaluate_response_phase(eval_datas)
```
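
As a stand-in illustration of response-phase scoring, the sketch below computes a token-overlap score between each predicted answer and its reference answer; the real evaluation likely uses richer, model-based metrics logged to Vertex AI experiments.

```
# Illustrative token-overlap (Jaccard) score; an assumed metric, not the real one.
def evaluate_response_phase(eval_datas):
    scores = []
    for data in eval_datas:
        ref = set(data["golden"].reference_answer.lower().split())
        pred = set(data["predicted_answer"].lower().split())
        scores.append(len(ref & pred) / len(ref | pred) if ref | pred else 0.0)
    return {"mean_answer_overlap": sum(scores) / len(scores)}
```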
Add the run_evaluation.py file and an automatic Cloud Build trigger during PRs.

The Cloud Build trigger will automatically run the evaluation, and the results will be added to GCP under vertexai/experiments.

A detailed metric table (with row-specific metrics) can be obtained by running locally with `export EXPORT_CSV=True`.
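
The `EXPORT_CSV` gate could plausibly be implemented along these lines; the helper name and row layout below are assumptions for illustration.

```
# Hypothetical sketch of the EXPORT_CSV gate in run_evaluation.py.
import csv
import os

def maybe_export_csv(rows, path="evaluation_results.csv"):
    """Write row-specific metric scores to a CSV if EXPORT_CSV=True is set."""
    if os.environ.get("EXPORT_CSV") != "True" or not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```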
Update evaluation goldens to use dynamic dates when evaluating queries that use terms like "today" or "tomorrow".
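
A possible shape for that substitution, assuming the goldens carry placeholders such as `{today}` and `{tomorrow}` (the placeholder names are an assumption for illustration):

```
# Resolve relative-date placeholders at evaluation time so goldens never go stale.
from datetime import date, timedelta

def resolve_dates(query_template: str) -> str:
    today = date.today()
    return query_template.format(
        today=today.isoformat(),
        tomorrow=(today + timedelta(days=1)).isoformat(),
    )

# e.g. resolve_dates("List flights from SFO to DEN on {tomorrow}")
```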
Update experiment names to differentiate experiments on the GCP portal.
@Yuan325 merged commit a2df60b into main Aug 19, 2024
14 of 15 checks passed
@Yuan325 deleted the evaluation branch August 19, 2024 22:48
ferdeleong pushed a commit that referenced this pull request Aug 21, 2024
Implementation of LLM system evaluation.

Usage example:

To output a detailed metric table (row-specific scores), set the following environment variable: `export EXPORT_CSV=True`

Run the following: `python run_evaluation.py`
kurtisvg added a commit that referenced this pull request Sep 3, 2024
🤖 I have created a release *beep* *boop*
---


## [0.2.0](v0.1.0...v0.2.0) (2024-08-27)


### Features

* Add langgraph orchestration ([#447](#447)) ([8cefed0](8cefed0))
* add ticket validation and insertion to cloudsql postgres ([#437](#437)) ([a4480fa](a4480fa))
* Add tracing to langgraph orchestration and postgres provider ([#473](#473)) ([a5759e9](a5759e9))
* Implement llm system evaluation ([#440](#440)) ([a2df60b](a2df60b))
* Remove user ID and user email from `list_tickets()` result ([#464](#464)) ([5958938](5958938))


### Bug Fixes

* update pytest to pytest_asyncio for async fixtures ([#474](#474)) ([c2ad4bb](c2ad4bb))
* update return from tools for langchain and function calling ([#476](#476)) ([9dfb60b](9dfb60b))

---
This PR was generated with [Release
Please](/~https://github.com/googleapis/release-please). See
[documentation](/~https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Kurtis Van Gent <31518063+kurtisvg@users.noreply.github.com>