opensubtitles-dataloader

pip install opensubtitles-dataloader

Download, preprocess and use sentences from the OpenSubtitles v2018 dataset without ever needing to load all of it into memory.

Download

See possible languages here.

opensubtitles-download en

Download the tokenized version.

opensubtitles-download en --token

Use in Python

Load

opensubtites_dataset = OpenSubtitlesDataset('en')

Load only the first 1 million lines.

opensubtites_dataset = OpenSubtitlesDataset('en', first_n_lines=1_000_000)

Group sentences into groups of 5.

opensubtites_dataset = OpenSubtitlesDataset('en', 5)

Group sentences into groups of varying size, from 2 to 5 sentences each.

opensubtites_dataset = OpenSubtitlesDataset('en', (2,5))

Split sentences using "\n".

opensubtites_dataset = OpenSubtitlesDataset('en', delimiter="\n")
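To make the grouping and delimiter options above more concrete, here is a minimal sketch of what grouping consecutive sentences could look like. It is an illustration only, not the package's implementation: the function `group_sentences`, its parameters `n_sents` and `seed`, and the choice of non-overlapping groups are all assumptions made for this example.

```python
import random
from typing import List, Sequence, Tuple, Union


def group_sentences(
    sentences: Sequence[str],
    n_sents: Union[int, Tuple[int, int]],
    delimiter: str = " ",
    seed: int = 0,
) -> List[str]:
    """Join consecutive sentences into non-overlapping groups.

    n_sents is either a fixed group size or an inclusive (low, high) range
    from which the size of each group is drawn. (Hypothetical helper.)
    """
    rng = random.Random(seed)
    groups = []
    i = 0
    while i < len(sentences):
        size = n_sents if isinstance(n_sents, int) else rng.randint(*n_sents)
        # The last group may be shorter if the sentences run out.
        groups.append(delimiter.join(sentences[i:i + size]))
        i += size
    return groups


lines = ["a", "b", "c", "d", "e", "f", "g"]
fixed_groups = group_sentences(lines, 3)
varied_groups = group_sentences(lines, (2, 5), delimiter="\n")
```

With a fixed size of 3, the seven sentences end up in groups of 3, 3 and 1; with a range, each group size is drawn anew, so the grouping depends on the seed.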

Apply a custom preprocessing function.

opensubtites_dataset = OpenSubtitlesDataset('en', preprocess_function=my_preprocessing_function)
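The README does not say what `my_preprocessing_function` receives or returns. As a sketch, assuming it is called with one string and must return one string, a simple normalizer might look like this:

```python
import re


def my_preprocessing_function(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace.

    Assumption: the dataset passes in a single string and expects a string back.
    """
    return re.sub(r"\s+", " ", text).strip().lower()


cleaned = my_preprocessing_function("  Hello   THERE,  General Kenobi ")
```

If the dataset actually passes something else (for example a list of sentences), the function would need to be adapted accordingly.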

Split for Training

train, valid, test = opensubtites_dataset.split()

Set the fraction of the original dataset assigned to each split.

train, valid, test = opensubtites_dataset.split([0.7, 0.15, 0.15])

Use a seed.

train, valid, test = opensubtites_dataset.split(seed=42)
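To illustrate how fractions and a seed can determine a reproducible split, the sketch below partitions a range of indices. It is not taken from the package: `split_indices` is a hypothetical helper, and whether the real `split` shuffles before cutting is an assumption here.

```python
import random
from typing import List, Sequence


def split_indices(
    n_items: int, fractions: Sequence[float], seed: int = 42
) -> List[List[int]]:
    """Shuffle indices with a fixed seed and cut them according to fractions."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_items))
    # A seeded generator makes the shuffle, and therefore the split, repeatable.
    random.Random(seed).shuffle(indices)
    splits = []
    start = 0
    for k, fraction in enumerate(fractions):
        # The last split takes the remainder so no item is lost to rounding.
        is_last = k == len(fractions) - 1
        end = n_items if is_last else start + round(n_items * fraction)
        splits.append(indices[start:end])
        start = end
    return splits


train_idx, valid_idx, test_idx = split_indices(100, [0.7, 0.15, 0.15], seed=42)
```

Calling the function again with the same seed yields the same three index lists, which is the property a seed is meant to provide.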

Access

By index.

train, valid, test = OpenSubtitlesDataset('en').splits()
train[20_000]
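Index access on a corpus that is never fully loaded into memory is commonly implemented by recording where each line starts and seeking to it on demand. The sketch below shows that general technique on a small temporary file. The class `LazyLineFile` is hypothetical and is not claimed to be how this package works internally.

```python
import os
import tempfile
from typing import List


class LazyLineFile:
    """Random access to the lines of a text file without reading it all in."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.offsets: List[int] = []
        position = 0
        # One streaming pass: remember the byte offset where each line starts.
        with open(path, "rb") as handle:
            for line in handle:
                self.offsets.append(position)
                position += len(line)

    def __len__(self) -> int:
        return len(self.offsets)

    def __getitem__(self, index: int) -> str:
        # Seek straight to the requested line and read only that line.
        with open(self.path, "rb") as handle:
            handle.seek(self.offsets[index])
            return handle.readline().decode("utf-8").rstrip("\r\n")


# Demo on a small temporary file standing in for a downloaded corpus.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")

lazy_lines = LazyLineFile(tmp.name)
second = lazy_lines[1]
line_count = len(lazy_lines)
os.remove(tmp.name)
```

Only the list of offsets (one integer per line) is held in memory; the text itself stays on disk until a line is requested.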

With a PyTorch DataLoader.

from torch.utils.data import DataLoader
train, valid, test = OpenSubtitlesDataset('en').splits()
train_loader = DataLoader(train, batch_size=16)

MiniXC/opensubtitles-dataloader
