Replies: 1 comment 1 reply
-
Tokenizer instances should return lists, and ideally a list of strings. Right now, here's an example of what's going on: import stringcompare
tokenizer = stringcompare.NGramTokenizer(3)
tokenizer("hello world")
<zip at ...>
list(tokenizer("Hello World"))
[('H', 'e', 'l'),
('e', 'l', 'l'),
('l', 'l', 'o'),
('l', 'o', ' '),
('o', ' ', 'W'),
(' ', 'W', 'o'),
('W', 'o', 'r'),
('o', 'r', 'l'),
('r', 'l', 'd')] It would be better for NGramTokeniser to return a list of strings instead of a list of character tuples (I'll fix that), but the two are basically equivalent. Here's how WhitespaceTokenizer works: tokenizer = stringcompare.WhitespaceTokenizer()
tokenizer("Hello world")
['Hello', 'world'] |
Beta Was this translation helpful? Give feedback.
1 reply
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
I was wondering about the return type of the Tokenizer Method.
The method "zip" seems to return a list of tuples, but we are unsure. We would like some clarification on what type is returned by the Tokenizer method.
Beta Was this translation helpful? Give feedback.
All reactions