-
-
Notifications
You must be signed in to change notification settings - Fork 351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor split function with tests #811
Conversation
The split function was (1) selecting files in included directories in the top half of the function, and (2) selecting files with valid extensions and sharding them in the second half. This PR divides the split function in a new `collect_files` function that selects files with valid extensions from non-excluded directories, and then passes the valid filepaths into the `split` function, which calls `collect_files`.
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @srdas. Please find comments / suggestions below.
Please make sure you are using pytest. Currently there is nothing in your tests incompatible with pytest so just implement comments suggesting use of different pytest features and make sure to run testing suite with pytest -vv -r ap --cov jupyter_ai
or just your test with pytest -vv -k test_collect_files
when testing while working on your test. Pytest docs for reference: https://docs.pytest.org/en/8.2.x/getting-started.html#.
packages/jupyter-ai/jupyter_ai/tests/doc_loaders/test_eligible_files.py
Outdated
Show resolved
Hide resolved
packages/jupyter-ai/jupyter_ai/tests/doc_loaders/test_eligible_files.py
Outdated
Show resolved
Hide resolved
packages/jupyter-ai/jupyter_ai/tests/doc_loaders/test_eligible_files.py
Outdated
Show resolved
Hide resolved
packages/jupyter-ai/jupyter_ai/tests/doc_loaders/test_eligible_files.py
Outdated
Show resolved
Hide resolved
The split function was (1) selecting files in included directories in the top half of the function, and (2) selecting files with valid extensions and sharding them in the second half. This PR divides the split function in a new `collect_files` function that selects files with valid extensions from non-excluded directories, and then passes the valid filepaths into the `split` function, which calls `collect_files`.
for more information, see https://pre-commit.ci
Replaced testing using unittests with testing using pytest fixtures.
…pyter-ai into refactor_split_and_test
for more information, see https://pre-commit.ci
replacd unittests with pytests
…pyter-ai into refactor_split_and_test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for implementing changes. Please find a couple of additional comments below
… split Further improvements to the code suggested from the review of PR
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the changes @srdas. Everything looks good except of one small comment, please give it a look: #811 (comment)
Changed function level constant from all caps to lower case to line up with the convention in https://peps.python.org/pep-0008/#constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @srdas.
* Refactor split function with test The split function was (1) selecting files in included directories in the top half of the function, and (2) selecting files with valid extensions and sharding them in the second half. This PR divides the split function in a new `collect_files` function that selects files with valid extensions from non-excluded directories, and then passes the valid filepaths into the `split` function, which calls `collect_files`. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Test changed to use pytest * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactor split function with test The split function was (1) selecting files in included directories in the top half of the function, and (2) selecting files with valid extensions and sharding them in the second half. This PR divides the split function in a new `collect_files` function that selects files with valid extensions from non-excluded directories, and then passes the valid filepaths into the `split` function, which calls `collect_files`. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Test changed to use pytest * refactored tests for directory.py using pytest fixtures Replaced testing using unittests with testing using pytest fixtures. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove old test files replacd unittests with pytests * Update test_directory.py * Update docstrings and further improve code for retrieve filepaths and split Further improvements to the code suggested from the review of PR * update docstring in test file * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update directory.py Changed function level constant from all caps to lower case to line up with the convention in https://peps.python.org/pep-0008/#constants. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
[1] The split function was (1) selecting files in included directories in the top half of the function, and (2) selecting files with valid extensions and sharding them in the second half. This PR divides the split function in a new
collect_files
function that selects files with valid extensions from non-excluded directories, and then passes the valid filepaths into thesplit
function, which callscollect_files
. This separates the file collection from the sharding in the RAG pipeline.[2] Added a unit test for the new
collect_files
function indirectory.py
. The refactoring in [1] makes unit testing separately for the file selection component simpler. Added a new test foldertests/doc_loaders
with test filetest_eligible_files.py
.