
feat: Implement llm system evaluation #440

Merged
merged 13 commits into main from evaluation on Aug 19, 2024

Conversation

@Yuan325 (Collaborator) commented Jul 18, 2024

Implementation of LLM system evaluation.

Usage example:

To output a detailed metric table (row-specific scores), set the following environment variable: `export EXPORT_CSV=True`

Run the following: `python run_evaluation.py`

@Yuan325 requested a review from a team as a code owner July 18, 2024 20:59
@Yuan325 (Collaborator, Author) commented Aug 13, 2024

/gcbrun

Comment on lines +41 to +43
```
_RETRIEVAL_EXPERIMENT_NAME: "retrieval-phase-eval-${_PR_NUMBER}"
_RESPONSE_EXPERIMENT_NAME: "response-phase-eval-${_PR_NUMBER}"
```
Collaborator

Do these still get run on merge to main?

Collaborator Author

Nope, this cloudbuild.yaml only runs on PRs. If we want to run it periodically on main, we can set up a Cloud Build trigger and replace the name for this.

llm_demo/evaluation/eval_golden.py (review thread resolved)
Yuan325 and others added 13 commits August 19, 2024 15:05
Adding golden datasets that will be used in LLM system evaluation.

The golden dataset is separated into multiple types of queries:
- queries that use a specific tool
- airline-related queries (no tool calling; the answer is within the prompt)
- assistant-related questions (no tool calling; the answer is within the prompt)
- out-of-context questions (no tool calling)
- multi-tool selections (the agent selects multiple tools before returning a final answer to the user)
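
Below is a minimal sketch of what one golden entry might look like. The class name `EvalGolden` and its fields (`query`, `category`, `tool_calls`, `reference_answer`) are illustrative assumptions, not the actual schema in `llm_demo/evaluation/eval_golden.py`.

```
# Hypothetical sketch of a golden entry; field names are assumptions,
# not the real schema in llm_demo/evaluation/eval_golden.py.
from dataclasses import dataclass, field

@dataclass
class EvalGolden:
    query: str              # user input sent to the LLM
    category: str           # e.g. "tool", "airline", "assistant", "out-of-context", "multitool"
    tool_calls: list = field(default_factory=list)  # expected tool invocations; empty if none
    reference_answer: str = ""  # ground-truth answer for the response phase

goldens = [
    EvalGolden(
        query="Find flights from SFO to DEN tomorrow",
        category="tool",
        tool_calls=[{"name": "search_flights",
                     "arguments": {"departure_airport": "SFO", "arrival_airport": "DEN"}}],
    ),
    EvalGolden(
        query="What assistance do you provide?",
        category="assistant",
        reference_answer="I can help you search for flights and book tickets.",
    ),
]
```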
Add the function to get a prediction for each of the queries from the golden dataset. Predictions are compared against the goldens to compute metrics.

Usage example:

```
from evaluation import run_llm_for_eval, goldens

# set up orchestration, session, set uuid
eval_list = await run_llm_for_eval(goldens, orchestration, session, session_id)
```
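
As an illustration of the shape such a helper might take, the sketch below loops over the goldens and pairs each prediction with its golden for later scoring. The `orchestration.invoke` call and the response fields are assumptions, not the project's actual interface.

```
# Hypothetical sketch: run each golden query through the orchestration and
# record the prediction next to its golden for the two evaluation phases.
async def run_llm_for_eval(goldens, orchestration, session, session_id):
    eval_list = []
    for golden in goldens:
        # `invoke` and its return shape are assumed for illustration.
        response = await orchestration.invoke(golden.query, session, session_id)
        eval_list.append({
            "golden": golden,
            "predicted_tool_calls": response.get("tool_calls", []),
            "predicted_answer": response.get("output", ""),
        })
    return eval_list
```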
Add the function to run evaluation for the retrieval phase (evaluating the LLM's ability to pick and utilize the tools it is given).

Usage example:

```
eval_results = evaluate_retrieval_phase(eval_datas)
```
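
One simple retrieval-phase metric would be tool-selection accuracy: the fraction of queries for which the model invoked exactly the expected tools. The sketch below (reusing the hypothetical golden fields from above) is illustrative only, not the actual metric code reported to the Vertex AI experiment named in cloudbuild.yaml.

```
# Illustrative tool-selection accuracy; an assumed metric, not the real one.
def evaluate_retrieval_phase(eval_datas):
    correct = 0
    for data in eval_datas:
        expected = [t["name"] for t in data["golden"].tool_calls]
        predicted = [t["name"] for t in data["predicted_tool_calls"]]
        if expected == predicted:  # same tools, in the same order
            correct += 1
    return {"tool_selection_accuracy": correct / len(eval_datas)}
```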
Add the function to run evaluation for the response phase (evaluating the LLM's output against the instructions and context given).

Usage example:

```
eval_results = evaluate_response_phase(eval_datas)
```
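
As a stand-in illustration of response-phase scoring, the sketch below computes a token-overlap score between each predicted answer and its reference answer; the real evaluation likely uses richer, model-based metrics logged to Vertex AI experiments.

```
# Illustrative token-overlap (Jaccard) score; an assumed metric, not the real one.
def evaluate_response_phase(eval_datas):
    scores = []
    for data in eval_datas:
        ref = set(data["golden"].reference_answer.lower().split())
        pred = set(data["predicted_answer"].lower().split())
        scores.append(len(ref & pred) / len(ref | pred) if ref | pred else 0.0)
    return {"mean_answer_overlap": sum(scores) / len(scores)}
```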
Add the run_evaluation.py file and an automatic Cloud Build trigger during PRs.

The Cloud Build trigger will automatically run the evaluation, and the results will be added to GCP under vertexai/experiments.

A detailed metric table (with row-specific metrics) can be obtained by running locally with `export EXPORT_CSV=True`.
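
The `EXPORT_CSV` gate could plausibly be implemented along these lines; the helper name and row layout below are assumptions for illustration.

```
# Hypothetical sketch of the EXPORT_CSV gate in run_evaluation.py.
import csv
import os

def maybe_export_csv(rows, path="evaluation_results.csv"):
    """Write row-specific metric scores to a CSV if EXPORT_CSV=True is set."""
    if os.environ.get("EXPORT_CSV") != "True" or not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```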
Update evaluation goldens to use dynamic dates when evaluating queries that use terms like "today" or "tomorrow".
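
A possible shape for that substitution, assuming the goldens carry placeholders such as `{today}` and `{tomorrow}` (the placeholder names are an assumption for illustration):

```
# Resolve relative-date placeholders at evaluation time so goldens never go stale.
from datetime import date, timedelta

def resolve_dates(query_template: str) -> str:
    today = date.today()
    return query_template.format(
        today=today.isoformat(),
        tomorrow=(today + timedelta(days=1)).isoformat(),
    )

# e.g. resolve_dates("List flights from SFO to DEN on {tomorrow}")
```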
Update experiment names to differentiate experiments on the GCP portal.
@Yuan325 merged commit a2df60b into main Aug 19, 2024
14 of 15 checks passed
@Yuan325 deleted the evaluation branch August 19, 2024 22:48
ferdeleong pushed a commit that referenced this pull request Aug 21, 2024
Implementation of LLM system evaluation.

Usage example:

To output a detailed metric table (row-specific scores), set the following environment variable: `export EXPORT_CSV=True`

Run the following: `python run_evaluation.py`
kurtisvg added a commit that referenced this pull request Sep 3, 2024
🤖 I have created a release *beep* *boop*
---


## [0.2.0](v0.1.0...v0.2.0) (2024-08-27)


### Features

* Add langgraph orchestration ([#447](#447)) ([8cefed0](8cefed0))
* add ticket validation and insertion to cloudsql postgres ([#437](#437)) ([a4480fa](a4480fa))
* Add tracing to langgraph orchestration and postgres provider ([#473](#473)) ([a5759e9](a5759e9))
* Implement llm system evaluation ([#440](#440)) ([a2df60b](a2df60b))
* Remove user ID and user email from `list_tickets()` result ([#464](#464)) ([5958938](5958938))


### Bug Fixes

* update pytest to pytest_asyncio for async fixtures ([#474](#474)) ([c2ad4bb](c2ad4bb))
* update return from tools for langchain and function calling ([#476](#476)) ([9dfb60b](9dfb60b))

---
This PR was generated with [Release
Please](/~https://github.com/googleapis/release-please). See
[documentation](/~https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Kurtis Van Gent <31518063+kurtisvg@users.noreply.github.com>