
bug in gpt2 notebook (in tensorflow) #13332

Closed
randomgambit opened this issue Aug 30, 2021 · 8 comments

@randomgambit

Hello there!

I tried to use the language-modeling-from-scratch notebook https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling_from_scratch.ipynb#scrollTo=JEA1ju653l-p

More specifically, I need to run it with TensorFlow. Following the simple strategy of swapping in the TF versions of the Hugging Face classes, everything seems to work correctly until I reach the trainer step, where I get a mysterious cardinality error.

This looks like a bug... Can you please have a look at the code below?

model_checkpoint = "gpt2"
tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer"

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)

from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

def tokenize_function(examples):
    return tokenizer(examples["text"])

tokenized_datasets = datasets.map(tokenize_function, batched=True, remove_columns=['text'])

block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding instead if the model
    # supported it. You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000
)

print(tokenizer.decode(lm_datasets['train'][2]["input_ids"]))

from transformers import AutoConfig, TFAutoModelForCausalLM

config = AutoConfig.from_pretrained(model_checkpoint)
model = TFAutoModelForCausalLM.from_config(config)

from transformers import TFTrainer, TFTrainingArguments

training_args = TFTrainingArguments(
    "test-clm",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
)

trainer = TFTrainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets,
)

trainer.train()
Traceback (most recent call last):

  File "<ipython-input-82-01e49a077e43>", line 11, in <module>
    trainer.train()

  File "C:\Users\john\anaconda3\envs\keras\lib\site-packages\transformers\trainer_tf.py", line 472, in train
    train_ds = self.get_train_tfdataset()

  File "C:\Users\john\anaconda3\envs\keras\lib\site-packages\transformers\trainer_tf.py", line 150, in get_train_tfdataset
    self.num_train_examples = self.train_dataset.cardinality().numpy()

AttributeError: 'DatasetDict' object has no attribute 'cardinality'

What do you think?
Thanks!

@randomgambit
Author

summoning the masters @LysandreJik @sgugger @Rocketknight1 💯

@Rocketknight1
Member

Hey! There are a couple of issues here. The first is that we're trying to move away from TFTrainer towards Keras - there'll be a new version of that notebook coming soon, as promised!

That said, your approach should work in the meantime. The error you're getting is because lm_datasets is actually a DatasetDict containing both the train and validation sets, so everything downstream gets confused. You probably want to swap lm_datasets for lm_datasets['train'] in that call to TFTrainer. However, as I said, we're trying to deprecate TFTrainer, so I'd rather avoid doing any more bugfixing for it. I'm working on getting the new examples in ASAP!
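
For what it's worth, here's a rough, untested sketch of that swap. One extra wrinkle: TFTrainer calls .cardinality(), which is a tf.data.Dataset method, so on top of selecting the train split you most likely also need to convert the Hugging Face dataset into a tf.data.Dataset of (features, labels) pairs, something like:

import tensorflow as tf

# lm_datasets is a DatasetDict, so pick the split first.
train_split = lm_datasets["train"]
# Every row is exactly block_size tokens, so the columns stack into clean 2D arrays.
train_split.set_format(type="numpy", columns=["input_ids", "attention_mask", "labels"])

# TFTrainer expects the dataset to yield (features, labels) tuples.
features = {
    "input_ids": train_split["input_ids"],
    "attention_mask": train_split["attention_mask"],
}
tf_train = tf.data.Dataset.from_tensor_slices((features, train_split["labels"]))

trainer = TFTrainer(model=model, args=training_args, train_dataset=tf_train)
trainer.train()

Again, that's just a sketch rather than something we'd want to maintain, given the deprecation plans.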

@randomgambit
Author

randomgambit commented Aug 31, 2021

Thanks @Rocketknight1! Actually, I was getting the same error even when I was using a dataset that contains only a single split. But you are absolutely right: there is no need to fix something that is going to be deprecated soon. Happy to help if you need anything. Thanks!

@Rocketknight1
Member

The good news is that I'm moving on to those TF notebooks right now, so hopefully I'll have a proper example to show you soon. However, the official launch of the new notebooks might depend on the PR at huggingface/datasets#2731 being accepted and making it into a release, since I'm planning to use that new method in a lot of them.

Still, I'll make sure to ping you as soon as I have a LM example ready - just be aware that you might have to install a pre-release version of datasets to get it to work!
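
To give you a taste, here's a rough sketch of what the Keras-based flow might look like, assuming the to_tf_dataset() method from that PR and a transformers version recent enough to ship DefaultDataCollator and to fall back to the model's internal loss when compile() gets no loss argument (names and arguments may still shift before release):

import tensorflow as tf
from transformers import DefaultDataCollator

# Stacks the fixed-length examples into TF tensors.
collator = DefaultDataCollator(return_tensors="tf")

# to_tf_dataset is the new method from huggingface/datasets#2731.
tf_train = lm_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=collator,
)

# With "labels" included in the batch, the model can compute its own
# causal LM loss, so no explicit loss is passed to compile().
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5))
model.fit(tf_train, epochs=1)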

@randomgambit
Author

Got it. Happy to try out the beta version at my own risk and peril ;-)

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed Oct 7, 2021
