feat: Implement llm system evaluation #440
Conversation
/gcbrun

_RETRIEVAL_EXPERIMENT_NAME: "retrieval-phase-eval-${_PR_NUMBER}"
_RESPONSE_EXPERIMENT_NAME: "response-phase-eval-${_PR_NUMBER}"
Do these still get run on merge to main?
Nope, this cloudbuild.yaml only runs on PRs. If we want to run it periodically on main, we can set up a Cloud Build trigger and replace the name for this.
Adding golden datasets that will be used in the LLM system evaluation. The golden dataset is separated into multiple types of queries:
- queries that use a specific tool
- airline-related queries (no tool calling; the answer is within the prompt)
- assistant-related questions (no tool calling; the answer is within the prompt)
- out-of-context questions (no tool calling)
- multitool selections (the agent selects multiple tools before returning a final answer to the user)
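For illustration only, a single golden entry could pair a query with the expected tool call and a reference answer. The schema and field names below are assumptions, not the dataset's actual format:

```python
# Hypothetical golden-dataset entry; field names are illustrative only.
golden_example = {
    "query": "Find flights from SFO to DEN tomorrow",
    "expected_tool": "search_flights",  # tool the agent is expected to select
    "expected_tool_args": {"departure_airport": "SFO", "arrival_airport": "DEN"},
    "reference_answer": "Here are the flights from SFO to DEN ...",
    "category": "tool_call",  # e.g. tool_call, airline, assistant, out_of_context, multitool
}
```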
Add the function to get a prediction for each of the queries from the golden dataset. Predictions are compared against the goldens to compute metrics. Usage example:
```
from evaluation import run_llm_for_eval, goldens

# set up orchestration, session, and session uuid
eval_list = await run_llm_for_eval(goldens, orchestration, session, session_id)
```
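As a rough sketch of what prediction collection might look like (not the PR's actual implementation; the orchestration interface and field names used here are assumptions):

```python
# Minimal sketch, assuming each golden has a "query" field and the orchestration
# exposes an async invoke(...) returning the final answer plus any tool calls made.
async def run_llm_for_eval_sketch(goldens, orchestration, session, session_id):
    eval_list = []
    for golden in goldens:
        result = await orchestration.invoke(golden["query"], session, session_id)
        eval_list.append(
            {
                "query": golden["query"],
                "reference": golden.get("reference_answer"),
                "prediction": result.get("output"),
                "predicted_tool_calls": result.get("tool_calls", []),
            }
        )
    return eval_list
```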
Add the function to run the retrieval-phase evaluation (evaluating the LLM's ability to pick and use the tools it is given). Usage example:
```
eval_results = evaluate_retrieval_phase(eval_datas)
```
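A simplified stand-in for the retrieval-phase check is tool-selection accuracy: compare the tool the agent actually called against the golden's expected tool. The PR reports its results to Vertex AI experiments; this sketch only shows the comparison idea, and the field names are assumptions:

```python
def evaluate_retrieval_phase_sketch(eval_datas):
    """Fraction of queries where the predicted tool matches the expected tool."""
    correct = 0
    for data in eval_datas:
        predicted = {call["name"] for call in data.get("predicted_tool_calls", [])}
        expected = data.get("expected_tool")
        if expected is None:
            # No-tool queries count as correct when no tool was called.
            correct += int(not predicted)
        else:
            correct += int(expected in predicted)
    return {"tool_selection_accuracy": correct / len(eval_datas)}
```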
Add the function to run the response-phase evaluation (evaluating the LLM's output against the instructions and context given). Usage example:
```
eval_results = evaluate_response_phase(eval_datas)
```
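For the response phase, the PR judges the final answer against the instructions and context; a crude local approximation is a text-similarity score between the prediction and the reference answer. This is only a placeholder metric, not the metric the PR actually uses:

```python
from difflib import SequenceMatcher

def evaluate_response_phase_sketch(eval_datas):
    """Average surface similarity between predictions and reference answers."""
    scores = [
        SequenceMatcher(None, d.get("prediction", ""), d.get("reference", "")).ratio()
        for d in eval_datas
    ]
    return {"mean_response_similarity": sum(scores) / len(scores)}
```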
Add the run_evaluation.py file and an automatic Cloud Build trigger on PRs. The Cloud Build trigger automatically runs the evaluation, and the results are added to GCP under Vertex AI > Experiments. A detailed metrics table (with row-specific metrics) can be obtained by running locally with `export EXPORT_CSV=True`.
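The CSV gate could be as simple as checking the environment variable before writing the per-row metrics table; the snippet below is a guess at that pattern, not the file's actual contents:

```python
import os
import pandas as pd

def maybe_export_csv(metrics_table: pd.DataFrame, path: str = "eval_results.csv") -> None:
    # Only write the row-level metrics when EXPORT_CSV is set, e.g. `export EXPORT_CSV=True`.
    if os.getenv("EXPORT_CSV", "").lower() == "true":
        metrics_table.to_csv(path, index=False)
```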
Update the evaluation goldens to use dynamic dates when evaluating queries that use terms like "today" or "tomorrow".
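Dynamic dates keep "today"/"tomorrow" queries from going stale; one way to do this (assumed, not taken from the PR) is to compute the dates when the goldens are loaded and substitute them into the expected values:

```python
from datetime import date, timedelta

today = date.today()
tomorrow = today + timedelta(days=1)

# Hypothetical golden entry whose expected arguments embed the computed date.
golden_dynamic = {
    "query": "List flights from SFO to DEN tomorrow",
    "expected_tool": "search_flights",
    "expected_tool_args": {"date": tomorrow.isoformat()},
}
```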
Update names to differentiate the experiments in the GCP portal.
Implementation of the LLM system evaluation. Usage example: to output the detailed metrics table (row-specific scores), set the environment variable `export EXPORT_CSV=True`, then run `python run_evaluation.py`.
🤖 I have created a release *beep* *boop*

---

## [0.2.0](v0.1.0...v0.2.0) (2024-08-27)

### Features

* Add langgraph orchestration ([#447](#447)) ([8cefed0](8cefed0))
* add ticket validation and insertion to cloudsql postgres ([#437](#437)) ([a4480fa](a4480fa))
* Add tracing to langgraph orchestration and postgres provider ([#473](#473)) ([a5759e9](a5759e9))
* Implement llm system evaluation ([#440](#440)) ([a2df60b](a2df60b))
* Remove user ID and user email from `list_tickets()` result ([#464](#464)) ([5958938](5958938))

### Bug Fixes

* update pytest to pytest_asyncio for async fixtures ([#474](#474)) ([c2ad4bb](c2ad4bb))
* update return from tools for langchain and function calling ([#476](#476)) ([9dfb60b](9dfb60b))

---

This PR was generated with [Release Please](/~https://github.com/googleapis/release-please). See [documentation](/~https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Kurtis Van Gent <31518063+kurtisvg@users.noreply.github.com>