Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor split function with tests #811

Merged
merged 18 commits into from
Jun 5, 2024

Conversation

srdas
Copy link
Collaborator

@srdas srdas commented May 31, 2024

[1] The split function was (1) selecting files in included directories in the top half of the function, and (2) selecting files with valid extensions and sharding them in the second half. This PR divides the split function in a new collect_files function that selects files with valid extensions from non-excluded directories, and then passes the valid filepaths into the split function, which calls collect_files. This separates the file collection from the sharding in the RAG pipeline.

[2] Added a unit test for the new collect_files function in directory.py. The refactoring in [1] makes unit testing separately for the file selection component simpler. Added a new test folder tests/doc_loaders with test file test_eligible_files.py.

srdas and others added 2 commits May 31, 2024 12:11
The split function was (1) selecting files in included directories in the top half of the function, and (2) selecting files with valid extensions and sharding them in the second half. This PR divides the split function in a new `collect_files` function that selects files with valid extensions from non-excluded directories, and then passes the valid filepaths into the `split` function, which calls `collect_files`.
@srdas srdas added the enhancement New feature or request label May 31, 2024
@srdas srdas requested a review from 3coins May 31, 2024 19:18
@srdas srdas requested review from andrii-i and removed request for 3coins May 31, 2024 20:27
Copy link
Collaborator

@andrii-i andrii-i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @srdas. Please find comments / suggestions below.

Please make sure you are using pytest. Currently there is nothing in your tests incompatible with pytest so just implement comments suggesting use of different pytest features and make sure to run testing suite with pytest -vv -r ap --cov jupyter_ai or just your test with pytest -vv -k test_collect_files when testing while working on your test. Pytest docs for reference: https://docs.pytest.org/en/8.2.x/getting-started.html#.

srdas and others added 8 commits June 4, 2024 12:17
The split function was (1) selecting files in included directories in the top half of the function, and (2) selecting files with valid extensions and sharding them in the second half. This PR divides the split function in a new `collect_files` function that selects files with valid extensions from non-excluded directories, and then passes the valid filepaths into the `split` function, which calls `collect_files`.
Replaced testing using unittests with testing using pytest fixtures.
replacd unittests with pytests
@srdas srdas marked this pull request as ready for review June 4, 2024 19:49
@srdas srdas requested a review from andrii-i June 4, 2024 19:50
Copy link
Collaborator

@andrii-i andrii-i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for implementing changes. Please find a couple of additional comments below

@srdas srdas requested a review from andrii-i June 5, 2024 01:42
Copy link
Collaborator

@andrii-i andrii-i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the changes @srdas. Everything looks good except of one small comment, please give it a look: #811 (comment)

Changed function level constant from all caps to lower case to line up with the convention in https://peps.python.org/pep-0008/#constants.
@srdas srdas requested a review from andrii-i June 5, 2024 22:12
Copy link
Collaborator

@andrii-i andrii-i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @srdas.

@srdas srdas merged commit f138b0d into jupyterlab:main Jun 5, 2024
8 checks passed
@srdas
Copy link
Collaborator Author

srdas commented Jun 5, 2024

Thank you @srdas.

Thanks a ton, @andrii-i

Marchlak pushed a commit to Marchlak/jupyter-ai that referenced this pull request Oct 28, 2024
* Refactor split function with test

The split function was (1) selecting files in included directories in the top half of the function, and (2) selecting files with valid extensions and sharding them in the second half. This PR divides the split function in a new `collect_files` function that selects files with valid extensions from non-excluded directories, and then passes the valid filepaths into the `split` function, which calls `collect_files`.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Test changed to use pytest

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Refactor split function with test

The split function was (1) selecting files in included directories in the top half of the function, and (2) selecting files with valid extensions and sharding them in the second half. This PR divides the split function in a new `collect_files` function that selects files with valid extensions from non-excluded directories, and then passes the valid filepaths into the `split` function, which calls `collect_files`.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Test changed to use pytest

* refactored tests for directory.py using pytest fixtures

Replaced testing using unittests with testing using pytest fixtures.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove old test files

replacd unittests with pytests

* Update test_directory.py

* Update docstrings and further improve code for retrieve filepaths and split

Further improvements to the code suggested from the review of PR

* update docstring in test file

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update directory.py

Changed function level constant from all caps to lower case to line up with the convention in https://peps.python.org/pep-0008/#constants.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants