This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Fix bug with lazy data loading, un-implement __len__ on AllennlpLazyDataset #4328

Merged
epwalsh merged 6 commits from the lazy-data-loading branch into allenai:master on Jun 5, 2020

Conversation

epwalsh (Member) commented on Jun 5, 2020

This fixes a bug where using a lazy dataset reader produced a PyTorch warning on every single batch during training. The fix makes AllennlpLazyDataset consistent with torch.utils.data.IterableDataset by no longer implementing __len__() (previously, len() would just return 1).

[Screenshot: the PyTorch warning emitted on every batch]
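
For context, here is a minimal standalone sketch of how a hard-coded __len__ on an IterableDataset triggers the per-batch warning (this is not part of the PR; TinyLazyDataset and the numbers are made up). Once len() has been queried on the DataLoader, PyTorch remembers the reported dataset length and warns whenever it has already yielded more data than that length accounts for:

from torch.utils.data import DataLoader, IterableDataset


class TinyLazyDataset(IterableDataset):
    def __iter__(self):
        # Pretend to lazily stream 100 instances.
        yield from range(100)

    def __len__(self):
        # Mimics the old AllennlpLazyDataset behavior of always reporting 1.
        return 1


loader = DataLoader(TinyLazyDataset(), batch_size=10)
print(len(loader))  # a trainer-style len() query records the bogus length
for batch in loader:
    pass  # PyTorch warns on each batch beyond the reported length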

To reproduce the bug, run allennlp train on an experiment with a lazy dataset reader, for example with this Jsonnet config:

{
  "dataset_reader": {
    "type": "squad",
    "lazy": true,
    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": true
      },
      "token_characters": {
        "type": "characters",
        "character_tokenizer": {
          "byte_encoding": "utf-8",
          "start_tokens": [259],
          "end_tokens": [260]
        },
        "min_padding_length": 5
      }
    }
  },
  "train_data_path": "https://allennlp.s3.amazonaws.com/datasets/squad/squad-train-v1.1.json",
  "validation_data_path": "https://allennlp.s3.amazonaws.com/datasets/squad/squad-dev-v1.1.json",
  "model": {
    "type": "bidaf",
    "text_field_embedder": {
      "token_embedders": {
        "tokens": {
          "type": "embedding",
          "pretrained_file": "https://allennlp.s3.amazonaws.com/datasets/glove/glove.6B.100d.txt.gz",
          "embedding_dim": 100,
          "trainable": false
        },
        "token_characters": {
          "type": "character_encoding",
          "embedding": {
            "num_embeddings": 262,
            "embedding_dim": 16
          },
          "encoder": {
            "type": "cnn",
            "embedding_dim": 16,
            "num_filters": 100,
            "ngram_filter_sizes": [5]
          },
          "dropout": 0.2
        }
      }
    },
    "num_highway_layers": 2,
    "phrase_layer": {
      "type": "lstm",
      "bidirectional": true,
      "input_size": 200,
      "hidden_size": 100,
      "num_layers": 1
    },
    "matrix_attention": {
      "type": "linear",
      "combination": "x,y,x*y",
      "tensor_1_dim": 200,
      "tensor_2_dim": 200
    },
    "modeling_layer": {
      "type": "lstm",
      "bidirectional": true,
      "input_size": 800,
      "hidden_size": 100,
      "num_layers": 2,
      "dropout": 0.2,
    },
    "span_end_encoder": {
      "type": "lstm",
      "bidirectional": true,
      "input_size": 1400,
      "hidden_size": 100,
      "num_layers": 1,
    },
    "dropout": 0.2,
  },
  "data_loader": {
    "batch_size": 40,
    "num_workers": 1,
  },
  "trainer": {
    "num_epochs": 20,
    "grad_norm": 5.0,
    // "patience": 10,
    "validation_metric": "+em",
    "cuda_device": 0,
    "learning_rate_scheduler": {
      "type": "reduce_on_plateau",
      "factor": 0.5,
      "mode": "max",
      "patience": 2,
    },
    "optimizer": {
      "type": "adam",
      "betas": [0.9, 0.9],
    },
  },
}
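
For reference, with the config above saved as, say, bidaf_lazy.jsonnet, training would be started with something like allennlp train bidaf_lazy.jsonnet -s /tmp/bidaf_lazy (the config filename and serialization directory here are just illustrative).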

@epwalsh epwalsh requested a review from matt-gardner June 5, 2020 20:02
@@ -490,7 +490,7 @@ def run(self) -> Dict[str, Any]:
         return self.trainer.train()

     def finish(self, metrics: Dict[str, Any]):
-        if self.evaluation_data_loader and self.evaluate_on_test:
+        if self.evaluation_data_loader is not None and self.evaluate_on_test:
epwalsh (Member, Author) commented:

Need this so len() is not called.

@@ -361,7 +361,7 @@ def __init__(
         self.optimizer = optimizer

         if patience is None:  # no early stopping
-            if validation_data_loader:
+            if validation_data_loader is not None:
epwalsh (Member, Author) commented:

Need this so len() is not called.
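
Both checks above were switched to explicit "is not None" comparisons for the same reason: Python's truth test falls back to __len__ when __bool__ is not defined, and a data loader wrapping a lazy dataset can no longer answer len(). A rough illustration (not from the PR), using a plain torch DataLoader and a made-up LengthlessDataset as stand-ins for AllenNLP's own classes:

from torch.utils.data import DataLoader, IterableDataset


class LengthlessDataset(IterableDataset):
    # No __len__, just like AllennlpLazyDataset after this PR.
    def __iter__(self):
        yield from range(3)


loader = DataLoader(LengthlessDataset(), batch_size=2)

try:
    if loader:  # bool(loader) falls back to loader.__len__(), which needs len(dataset)
        pass
except TypeError as err:
    print("truthiness check fails:", err)

if loader is not None:  # identity check never touches __len__
    print("explicit None check is safe")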

epwalsh (Member, Author) commented on Jun 5, 2020

By the way, I tracked the __len__ implementation back to #3700 (comment). I think the rationale for adding it on lazy datasets is now outdated.

epwalsh (Member, Author) commented on Jun 5, 2020

I think this should close #4035

@epwalsh epwalsh linked an issue Jun 5, 2020 that may be closed by this pull request
@epwalsh epwalsh requested a review from matt-gardner June 5, 2020 21:04
@epwalsh epwalsh merged commit 902d36a into allenai:master Jun 5, 2020
@epwalsh epwalsh deleted the lazy-data-loading branch June 5, 2020 21:47
@@ -138,7 +138,9 @@ def get_values(self):
                 self.batch_num_total_epoch_end[-1] / (len(self.batch_num_total_epoch_end) - 1)
             )
         else:
-            actual_num_steps_per_epoch = max(self.num_steps_per_epoch, self.last_batch_num_total)
+            actual_num_steps_per_epoch = max(
+                self.num_steps_per_epoch or 1, self.last_batch_num_total
+            )
A reviewer (Member) commented:

Why does it default to 1 here? Isn't that a problem when it happens?

epwalsh (Member, Author) replied:


This gives the same behavior as before. I'm not sure whether that's an issue.
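
As a tiny illustration (not from the PR; the values below are made up), this is what the or-1 guard does when num_steps_per_epoch is unavailable, presumably because a lazy loader can no longer report a length:

num_steps_per_epoch = None  # unknown when the data loader is lazy
last_batch_num_total = 137  # running count of batches seen so far
actual_num_steps_per_epoch = max(num_steps_per_epoch or 1, last_batch_num_total)
print(actual_num_steps_per_epoch)  # 137: falls back to the observed batch count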

Successfully merging this pull request may close these issues.

Cases where the length of the IterableDataset is not overridden properly