Commit

Updated links.
umarbutler committed Feb 13, 2025
1 parent 80d5964 commit f1b629e
Showing 5 changed files with 37 additions and 37 deletions.
56 changes: 28 additions & 28 deletions CHANGELOG.md
@@ -15,8 +15,8 @@ All notable changes to `semchunk` will be documented here. This project adheres

## [3.0.0] - 2024-12-31
### Added
- Added an `offsets` argument to `chunk()` and `Chunker.__call__()` that specifies whether to return the start and end offsets of each chunk ([#9](/~https://github.com/umarbutler/semchunk/issues/9)). The argument defaults to `False`.
- Added an `overlap` argument to `chunk()` and `Chunker.__call__()` that specifies the proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap ([#1](/~https://github.com/umarbutler/semchunk/issues/1)). The argument defaults to `None`, in which case no overlapping occurs.
- Added an `offsets` argument to `chunk()` and `Chunker.__call__()` that specifies whether to return the start and end offsets of each chunk ([#9](/~https://github.com/isaacus-dev/semchunk/issues/9)). The argument defaults to `False`.
- Added an `overlap` argument to `chunk()` and `Chunker.__call__()` that specifies the proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap ([#1](/~https://github.com/isaacus-dev/semchunk/issues/1)). The argument defaults to `None`, in which case no overlapping occurs.
- Added an undocumented, private `_make_chunk_function()` method to the `Chunker` class that constructs chunking functions with call-level arguments passed.
- Added more unit tests for new features as well as for multiple token counters and for ensuring there are no chunks comprised entirely of whitespace characters.
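
The `overlap` entry above describes two interpretations of one argument: a proportion of the chunk size when the value is below 1, and an absolute token count otherwise. The following is a minimal sketch of that rule only; `resolve_overlap` is a hypothetical helper, not part of `semchunk`, and both the rounding down and the cap at `chunk_size - 1` are assumptions rather than confirmed library behaviour.

```python
import math
from typing import Optional, Union


def resolve_overlap(chunk_size: int, overlap: Optional[Union[int, float]]) -> int:
    """Hypothetical helper: turn an `overlap` argument into a number of tokens."""
    # `None` (the documented default) means no overlapping occurs.
    if not overlap:
        return 0

    # Values below 1 are read as a proportion of the chunk size.
    # Assumption: the result is rounded down.
    if overlap < 1:
        return math.floor(chunk_size * overlap)

    # Values of 1 or more are read as a token count.
    # Assumption: the overlap is capped so chunks always advance.
    return min(int(overlap), chunk_size - 1)
```

Under these assumptions, `resolve_overlap(512, 0.5)` and `resolve_overlap(512, 256)` describe the same overlap of 256 tokens.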

@@ -37,11 +37,11 @@ All notable changes to `semchunk` will be documented here. This project adheres

## [2.2.1] - 2024-12-17
### Changed
- Started benchmarking [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) in parallel to ensure a fair comparison, courtesy of [@benbrandt](/~https://github.com/benbrandt) ([#17](/~https://github.com/umarbutler/semchunk/pull/12)).
- Started benchmarking [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) in parallel to ensure a fair comparison, courtesy of [@benbrandt](/~https://github.com/benbrandt) ([#17](/~https://github.com/isaacus-dev/semchunk/pull/12)).

## [2.2.0] - 2024-07-12
### Changed
- Switched from having `chunkerify()` output a function to having it return an instance of the new `Chunker()` class which should not alter functionality in any way but will allow for the preservation of type hints, fixing [#7](/~https://github.com/umarbutler/semchunk/pull/7).
- Switched from having `chunkerify()` output a function to having it return an instance of the new `Chunker()` class which should not alter functionality in any way but will allow for the preservation of type hints, fixing [#7](/~https://github.com/isaacus-dev/semchunk/pull/7).

## [2.1.0] - 2024-06-20
### Fixed
@@ -64,19 +64,19 @@ All notable changes to `semchunk` will be documented here. This project adheres

## [0.3.2] - 2024-06-01
### Fixed
- Fixed a bug where a `DivisionByZeroError` would be raised where a token counter returned zero tokens when called from `merge_splits()`, courtesy of [@jcobol](/~https://github.com/jcobol) ([#5](/~https://github.com/umarbutler/semchunk/pull/5)) ([7fd64eb](/~https://github.com/umarbutler/semchunk/pull/5/commits/7fd64eb8cf51f45702c59f43795be9a00c7d0d17)), fixing [#4](/~https://github.com/umarbutler/semchunk/issues/4).
- Fixed a bug where a `DivisionByZeroError` would be raised where a token counter returned zero tokens when called from `merge_splits()`, courtesy of [@jcobol](/~https://github.com/jcobol) ([#5](/~https://github.com/isaacus-dev/semchunk/pull/5)) ([7fd64eb](/~https://github.com/isaacus-dev/semchunk/pull/5/commits/7fd64eb8cf51f45702c59f43795be9a00c7d0d17)), fixing [#4](/~https://github.com/isaacus-dev/semchunk/issues/4).

## [0.3.1] - 2024-05-18
### Fixed
- Fixed typo in error messages in `chunkerify()` where it was referred to as `make_chunker()`.

## [0.3.0] - 2024-05-18
### Added
- Introduced the `chunkerify()` function, which constructs a chunker from a tokenizer or token counter that can be reused and can also chunk multiple texts in a single call. The resulting chunker speeds up chunking by 40.4% thanks, in large part, to a token counter that avoid having to count the number of tokens in a text when the number of characters in the text exceed a certain threshold, courtesy of [@R0bk](/~https://github.com/R0bk) ([#3](/~https://github.com/umarbutler/semchunk/pull/3)) ([337a186](/~https://github.com/umarbutler/semchunk/pull/3/commits/337a18615f991076b076262288b0408cb162b48c)).
- Introduced the `chunkerify()` function, which constructs a chunker from a tokenizer or token counter that can be reused and can also chunk multiple texts in a single call. The resulting chunker speeds up chunking by 40.4% thanks, in large part, to a token counter that avoid having to count the number of tokens in a text when the number of characters in the text exceed a certain threshold, courtesy of [@R0bk](/~https://github.com/R0bk) ([#3](/~https://github.com/isaacus-dev/semchunk/pull/3)) ([337a186](/~https://github.com/isaacus-dev/semchunk/pull/3/commits/337a18615f991076b076262288b0408cb162b48c)).

## [0.2.4] - 2024-05-13
### Changed
- Improved chunking performance with larger chunk sizes by switching from linear to binary search for the identification of optimal chunk boundaries, courtesy of [@R0bk](/~https://github.com/R0bk) ([#3](/~https://github.com/umarbutler/semchunk/pull/3)) ([337a186](/~https://github.com/umarbutler/semchunk/pull/3/commits/337a18615f991076b076262288b0408cb162b48c)).
- Improved chunking performance with larger chunk sizes by switching from linear to binary search for the identification of optimal chunk boundaries, courtesy of [@R0bk](/~https://github.com/R0bk) ([#3](/~https://github.com/isaacus-dev/semchunk/pull/3)) ([337a186](/~https://github.com/isaacus-dev/semchunk/pull/3/commits/337a18615f991076b076262288b0408cb162b48c)).
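
The 0.2.4 entry above mentions replacing a linear scan with a binary search when locating chunk boundaries. As a rough illustration of that idea (not the code from the referenced pull request), the sketch below finds the largest number of leading splits whose joined text fits within `chunk_size`. The function name, the `joiner` parameter and the assumption that token counts never decrease as more splits are added are all hypothetical.

```python
from typing import Callable, Sequence


def longest_fitting_prefix(
    splits: Sequence[str],
    chunk_size: int,
    token_counter: Callable[[str], int],
    joiner: str = " ",
) -> int:
    """Return the largest n such that joining splits[:n] fits within chunk_size tokens.

    Assumes the token count does not decrease as more splits are joined,
    which is what makes a binary search valid here.
    """
    low, high = 0, len(splits)

    while low < high:
        # Bias the midpoint upwards so the loop terminates when low + 1 == high.
        mid = (low + high + 1) // 2

        if token_counter(joiner.join(splits[:mid])) <= chunk_size:
            low = mid  # The first `mid` splits fit; try taking more.
        else:
            high = mid - 1  # Too many tokens; try taking fewer.

    return low


words = "The quick brown fox jumps over the lazy dog.".split()
count_words = lambda text: len(text.split())
boundary = longest_fitting_prefix(words, 4, count_words)
```

A linear scan would call the token counter once per candidate boundary, whereas this search calls it a number of times logarithmic in the number of splits, which matters most when chunk sizes are large.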

## [0.2.3] - 2024-03-11
### Fixed
@@ -117,24 +117,24 @@ All notable changes to `semchunk` will be documented here. This project adheres
### Added
- Added the `chunk()` function, which splits text into semantically meaningful chunks of a specified size as determined by a provided token counter.

[3.0.2]: /~https://github.com/umarbutler/semchunk/compare/v3.0.1...v3.0.2
[3.0.1]: /~https://github.com/umarbutler/semchunk/compare/v3.0.0...v3.0.1
[3.0.0]: /~https://github.com/umarbutler/semchunk/compare/v2.2.2...v3.0.0
[2.2.2]: /~https://github.com/umarbutler/semchunk/compare/v2.2.1...v2.2.2
[2.2.1]: /~https://github.com/umarbutler/semchunk/compare/v2.2.0...v2.2.1
[2.2.0]: /~https://github.com/umarbutler/semchunk/compare/v2.1.0...v2.2.0
[2.1.0]: /~https://github.com/umarbutler/semchunk/compare/v2.0.0...v2.1.0
[2.0.0]: /~https://github.com/umarbutler/semchunk/compare/v1.0.1...v2.0.0
[1.0.1]: /~https://github.com/umarbutler/semchunk/compare/v1.0.0...v1.0.1
[1.0.0]: /~https://github.com/umarbutler/semchunk/compare/v0.3.2...v1.0.0
[0.3.2]: /~https://github.com/umarbutler/semchunk/compare/v0.3.1...v0.3.2
[0.3.1]: /~https://github.com/umarbutler/semchunk/compare/v0.3.0...v0.3.1
[0.3.0]: /~https://github.com/umarbutler/semchunk/compare/v0.2.4...v0.3.0
[0.2.4]: /~https://github.com/umarbutler/semchunk/compare/v0.2.3...v0.2.4
[0.2.3]: /~https://github.com/umarbutler/semchunk/compare/v0.2.2...v0.2.3
[0.2.2]: /~https://github.com/umarbutler/semchunk/compare/v0.2.1...v0.2.2
[0.2.1]: /~https://github.com/umarbutler/semchunk/compare/v0.2.0...v0.2.1
[0.2.0]: /~https://github.com/umarbutler/semchunk/compare/v0.1.2...v0.2.0
[0.1.2]: /~https://github.com/umarbutler/semchunk/compare/v0.1.1...v0.1.2
[0.1.1]: /~https://github.com/umarbutler/semchunk/compare/v0.1.0...v0.1.1
[0.1.0]: /~https://github.com/umarbutler/semchunk/releases/tag/v0.1.0
[3.0.2]: /~https://github.com/isaacus-dev/semchunk/compare/v3.0.1...v3.0.2
[3.0.1]: /~https://github.com/isaacus-dev/semchunk/compare/v3.0.0...v3.0.1
[3.0.0]: /~https://github.com/isaacus-dev/semchunk/compare/v2.2.2...v3.0.0
[2.2.2]: /~https://github.com/isaacus-dev/semchunk/compare/v2.2.1...v2.2.2
[2.2.1]: /~https://github.com/isaacus-dev/semchunk/compare/v2.2.0...v2.2.1
[2.2.0]: /~https://github.com/isaacus-dev/semchunk/compare/v2.1.0...v2.2.0
[2.1.0]: /~https://github.com/isaacus-dev/semchunk/compare/v2.0.0...v2.1.0
[2.0.0]: /~https://github.com/isaacus-dev/semchunk/compare/v1.0.1...v2.0.0
[1.0.1]: /~https://github.com/isaacus-dev/semchunk/compare/v1.0.0...v1.0.1
[1.0.0]: /~https://github.com/isaacus-dev/semchunk/compare/v0.3.2...v1.0.0
[0.3.2]: /~https://github.com/isaacus-dev/semchunk/compare/v0.3.1...v0.3.2
[0.3.1]: /~https://github.com/isaacus-dev/semchunk/compare/v0.3.0...v0.3.1
[0.3.0]: /~https://github.com/isaacus-dev/semchunk/compare/v0.2.4...v0.3.0
[0.2.4]: /~https://github.com/isaacus-dev/semchunk/compare/v0.2.3...v0.2.4
[0.2.3]: /~https://github.com/isaacus-dev/semchunk/compare/v0.2.2...v0.2.3
[0.2.2]: /~https://github.com/isaacus-dev/semchunk/compare/v0.2.1...v0.2.2
[0.2.1]: /~https://github.com/isaacus-dev/semchunk/compare/v0.2.0...v0.2.1
[0.2.0]: /~https://github.com/isaacus-dev/semchunk/compare/v0.1.2...v0.2.0
[0.1.2]: /~https://github.com/isaacus-dev/semchunk/compare/v0.1.1...v0.1.2
[0.1.1]: /~https://github.com/isaacus-dev/semchunk/compare/v0.1.0...v0.1.1
[0.1.0]: /~https://github.com/isaacus-dev/semchunk/releases/tag/v0.1.0
4 changes: 2 additions & 2 deletions README.md
@@ -41,10 +41,10 @@ text = 'The quick brown fox jumps over the lazy dog.'
# OpenAI `tiktoken` encoding or Hugging Face model, or a custom tokenizer that has an `encode()`
# method (like a `tiktoken`, `transformers` or `tokenizers` tokenizer) or a custom token counting
# function that takes a text and returns the number of tokens in it.
chunker = semchunk.chunkerify('umarbutler/emubert', chunk_size) or \
chunker = semchunk.chunkerify('isaacus-dev/emubert', chunk_size) or \
semchunk.chunkerify('gpt-4', chunk_size) or \
semchunk.chunkerify('cl100k_base', chunk_size) or \
semchunk.chunkerify(AutoTokenizer.from_pretrained('umarbutler/emubert'), chunk_size) or \
semchunk.chunkerify(AutoTokenizer.from_pretrained('isaacus-dev/emubert'), chunk_size) or \
semchunk.chunkerify(tiktoken.encoding_for_model('gpt-4'), chunk_size) or \
semchunk.chunkerify(lambda text: len(text.split()), chunk_size)

8 changes: 4 additions & 4 deletions pyproject.toml
@@ -50,10 +50,10 @@ dependencies = [
]

[project.urls]
Homepage = "/~https://github.com/umarbutler/semchunk"
Documentation = "/~https://github.com/umarbutler/semchunk/blob/main/README.md"
Issues = "/~https://github.com/umarbutler/semchunk/issues"
Source = "/~https://github.com/umarbutler/semchunk"
Homepage = "/~https://github.com/isaacus-dev/semchunk"
Documentation = "/~https://github.com/isaacus-dev/semchunk/blob/main/README.md"
Issues = "/~https://github.com/isaacus-dev/semchunk/issues"
Source = "/~https://github.com/isaacus-dev/semchunk"

[tool.hatch.build.targets.sdist]
only-include = ['src/semchunk/__init__.py', 'src/semchunk/py.typed', 'src/semchunk/semchunk.py', 'pyproject.toml', 'README.md', 'LICENCE', 'CHANGELOG.md', 'tests/bench.py', 'tests/test_semchunk.py', '.github/workflows/ci.yml', 'tests/helpers.py']
2 changes: 1 addition & 1 deletion tests/helpers.py
@@ -37,7 +37,7 @@ def initialize_test_token_counters() -> dict[str, Callable[[str], int]]:
"""Initialize `tiktoken`, `transformers`, character and word token counters for testing purposes."""

gpt4_tiktoken_tokenizer = tiktoken.encoding_for_model('gpt-4').encode
emubert_transformers_tokenizer = make_transformers_tokenizer(transformers.AutoTokenizer.from_pretrained('umarbutler/emubert'))
emubert_transformers_tokenizer = make_transformers_tokenizer(transformers.AutoTokenizer.from_pretrained('isaacus-dev/emubert'))

def word_tokenizer(text: str) -> list[str]:
"""Tokenize a text into words."""
4 changes: 2 additions & 2 deletions tests/test_semchunk.py
@@ -159,7 +159,7 @@ def test_semchunk() -> None:
assert error_raised

# Test using `tiktoken` tokenizers, encodings and a `transformers` tokenizer by name with `chunkerify()`.
for name in ['cl100k_base', 'gpt-4', 'umarbutler/emubert']:
for name in ['cl100k_base', 'gpt-4', 'isaacus-dev/emubert']:
chunker = semchunk.chunkerify(name, 1)
chunker(DETERMINISTIC_TEST_INPUT)
if TEST_OFFSETS: chunker(DETERMINISTIC_TEST_INPUT, offsets = True)
@@ -175,7 +175,7 @@ def test_semchunk() -> None:
assert error_raised

# Test using a `transformers` tokenizer directly.
tokenizer = AutoTokenizer.from_pretrained('umarbutler/emubert')
tokenizer = AutoTokenizer.from_pretrained('isaacus-dev/emubert')
chunker = semchunk.chunkerify(tokenizer, 1)

# Test using a `tiktoken` tokenizer directly.
