Return Type of Tokenizer #15

JerryXin2 · 2022-04-05T01:02:30Z

I was wondering about the return type of the Tokenizer method.

Calling a tokenizer appears to give back a zip object, which seems to produce a list of tuples, but we are unsure. Could you clarify what type the Tokenizer method is supposed to return?

OlivierBinette · 2022-04-05T01:20:17Z

OlivierBinette
Apr 5, 2022
Maintainer

@JerryXin2

Tokenizer instances should return lists, and ideally a list of strings.

Right now, here's an example of what's going on:

import stringcompare

tokenizer = stringcompare.NGramTokenizer(3)

tokenizer("hello world")
 <zip at ...>

list(tokenizer("Hello World"))
 [('H', 'e', 'l'),
 ('e', 'l', 'l'),
 ('l', 'l', 'o'),
 ('l', 'o', ' '),
 ('o', ' ', 'W'),
 (' ', 'W', 'o'),
 ('W', 'o', 'r'),
 ('o', 'r', 'l'),
 ('r', 'l', 'd')]

It would be better for NGramTokenizer to return a list of strings instead of character tuples (I'll fix that), but the two carry the same information: joining the characters of each tuple gives the corresponding n-gram string.
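In the meantime, here is a minimal sketch of converting the tuples to strings yourself. The helper char_ngram_tuples is a hypothetical stand-in that only mimics the zip output shown above; it is not necessarily how stringcompare builds its n-grams.

```python
def char_ngram_tuples(sentence, n):
    # Hypothetical stand-in for the zip-based output shown above:
    # zips the string against n shifted copies of itself.
    return zip(*(sentence[i:] for i in range(n)))


# Materialize the zip object, then join each character tuple into a string.
tuples = list(char_ngram_tuples("Hello World", 3))
strings = ["".join(t) for t in tuples]

print(tuples[0])
print(strings)
```

Note that a zip object can only be iterated once, so wrap it in list(...) if you need to reuse the tokens.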

Here's how WhitespaceTokenizer works:

tokenizer = stringcompare.WhitespaceTokenizer()

tokenizer("Hello world")
 ['Hello', 'world']

1 reply

OlivierBinette Apr 5, 2022
Maintainer

@JerryXin2

Update: I've just updated the NGramTokenizer class to return a list of strings. Here's the new code:

class NGramTokenizer(Tokenizer):
    def __init__(self, n):
        self.n = n

    def tokenize(self, sentence):
        # Slide a window of n characters over the sentence.
        return [sentence[i:i+self.n] for i in range(len(sentence)-self.n+1)]
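To try this outside the package, here is a self-contained sketch. The Tokenizer base class below is a hypothetical minimal version that assumes calling an instance simply forwards to tokenize; the real base class in stringcompare may differ.

```python
class Tokenizer:
    # Hypothetical minimal base class (an assumption, not the library's code):
    # calling the instance forwards to tokenize().
    def tokenize(self, sentence):
        raise NotImplementedError

    def __call__(self, sentence):
        return self.tokenize(sentence)


class NGramTokenizer(Tokenizer):
    def __init__(self, n):
        self.n = n

    def tokenize(self, sentence):
        # One substring of length n per starting position.
        return [sentence[i:i+self.n] for i in range(len(sentence)-self.n+1)]


tokenizer = NGramTokenizer(3)
trigrams = tokenizer("Hello World")  # list of 3-character strings
short = tokenizer("hi")              # input shorter than n

print(trigrams)
print(short)
```

With this version the return value is a plain list of strings, and an input shorter than n gives an empty list because the range is empty.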
