pip install opensubtitles-dataloader
Download, preprocess and use sentences from the OpenSubtitles v2018 dataset without ever needing to load all of it into memory.
Pass the code of any language available in OpenSubtitles v2018, for example en.
opensubtitles-download en
Download the tokenized version.
opensubtitles-download en --token
Load the whole dataset.
opensubtitles_dataset = OpenSubtitlesDataset('en')
Load only the first 1 million lines.
opensubtitles_dataset = OpenSubtitlesDataset('en', first_n_lines=1_000_000)
Group sentences into groups of 5.
opensubtitles_dataset = OpenSubtitlesDataset('en', 5)
Group sentences into groups of 2 to 5 sentences.
opensubtitles_dataset = OpenSubtitlesDataset('en', (2, 5))
Split sentences using "\n".
opensubtitles_dataset = OpenSubtitlesDataset('en', delimiter="\n")
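To make the grouping and delimiter options more concrete, here is a rough, library-independent sketch of grouping consecutive sentences and joining each group with a delimiter. The helper name `group_sentences`, its `seed` parameter, and the exact behaviour (consecutive groups, random size within the range) are assumptions for illustration, not the package's actual implementation.

```python
import random
from typing import Iterator, List, Tuple, Union


def group_sentences(
    sentences: List[str],
    group_size: Union[int, Tuple[int, int]] = 1,
    delimiter: str = " ",
    seed: int = 0,
) -> Iterator[str]:
    """Hypothetical helper: yield consecutive sentences joined into groups."""
    rng = random.Random(seed)
    position = 0
    while position < len(sentences):
        if isinstance(group_size, tuple):
            low, high = group_size
            size = rng.randint(low, high)  # inclusive on both ends
        else:
            size = group_size
        # The last group may be shorter if the sentences run out.
        yield delimiter.join(sentences[position:position + size])
        position += size


lines = ["Hello.", "How are you?", "Fine.", "And you?", "Great.", "Bye."]
fixed_groups = list(group_sentences(lines, 2, delimiter="\n"))
ranged_groups = list(group_sentences(lines, (2, 5), delimiter="\n"))
```

In this sketch every sentence lands in exactly one group, so joining the groups back together reproduces the original sequence.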
Apply a custom preprocessing function.
opensubtitles_dataset = OpenSubtitlesDataset('en', preprocess_function=my_preprocessing_function)
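`my_preprocessing_function` is not defined in this README. A minimal sketch follows, assuming the function receives one sentence as a string and returns the cleaned string. That signature is an assumption, so check the package before relying on it.

```python
import re


def my_preprocessing_function(sentence: str) -> str:
    """Assumed signature: one sentence in, one cleaned sentence out."""
    sentence = sentence.lower()
    # Subtitles often mark a change of speaker with a leading "- ".
    sentence = re.sub(r"^\s*-\s*", "", sentence)
    # Collapse runs of whitespace into a single space.
    sentence = re.sub(r"\s+", " ", sentence)
    return sentence.strip()


cleaned = my_preprocessing_function("  - Hello   THERE!  ")
```

Hyphens inside a sentence are left alone; only a dash at the very start is removed.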
Split into training, validation and test sets.
train, valid, test = opensubtitles_dataset.split()
Set the fraction of the dataset assigned to each split.
train, valid, test = opensubtitles_dataset.split([0.7, 0.15, 0.15])
Use a seed for a reproducible split.
train, valid, test = opensubtitles_dataset.split(seed=42)
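What the fractions and the seed mean can be illustrated without the library. The sketch below shuffles indices with a seeded generator and cuts them at the cumulative fractions. `split_indices` is a hypothetical helper with arbitrarily chosen defaults, and the package may well split differently (for example without shuffling).

```python
import random
from typing import List, Optional, Sequence


def split_indices(
    n_items: int,
    fractions: Sequence[float] = (0.8, 0.1, 0.1),  # arbitrary default for this sketch
    seed: Optional[int] = None,
) -> List[List[int]]:
    """Hypothetical helper: partition range(n_items) according to fractions."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # same seed, same order
    parts, start, cumulative = [], 0, 0.0
    for fraction in fractions:
        cumulative += fraction
        end = round(cumulative * n_items)
        parts.append(indices[start:end])
        start = end
    return parts


train_idx, valid_idx, test_idx = split_indices(100, [0.7, 0.15, 0.15], seed=42)
```

Cutting at cumulative boundaries, rather than rounding each fraction separately, guarantees that every index ends up in exactly one part.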
Access an item by index.
train, valid, test = OpenSubtitlesDataset('en').split()
train[20_000]
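Indexing a large text file without holding it in memory is commonly done by remembering where each line starts on disk. The following is a generic sketch of that idea, not the package's code. `LazyLineFile` is a hypothetical class name.

```python
import os
import tempfile
from typing import List


class LazyLineFile:
    """Hypothetical sketch: random access to lines via stored byte offsets."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.offsets: List[int] = []
        # One pass over the file records where each line starts.
        # Only the integer offsets stay in memory, not the text.
        with open(path, "rb") as handle:
            offset = 0
            for raw_line in handle:
                self.offsets.append(offset)
                offset += len(raw_line)

    def __len__(self) -> int:
        return len(self.offsets)

    def __getitem__(self, index: int) -> str:
        # Seek straight to the requested line and read only that line.
        with open(self.path, "rb") as handle:
            handle.seek(self.offsets[index])
            return handle.readline().decode("utf-8").rstrip("\r\n")


# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
lazy_lines = LazyLineFile(tmp.name)
second = lazy_lines[1]
os.remove(tmp.name)
```

The memory cost is one integer per line, which stays small even for corpora with hundreds of millions of lines.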
Use with a PyTorch DataLoader.
from torch.utils.data import DataLoader
train, valid, test = OpenSubtitlesDataset('en').split()
train_loader = DataLoader(train, batch_size=16)