This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Taken Down in the Dynabench for Large Track #90

Open
CSJianYang opened this issue Jun 27, 2021 · 13 comments
Labels: flores (Flores competition)

Comments

@CSJianYang

CSJianYang commented Jun 27, 2021

Hi,
I passed the local and integration tests following the model submission workflow on GitHub and submitted my model. I then received the message "Your model t1 has been successfully deployed. You can find and publish the model at https://dynabench.org/models/119." (python handler.py, dynalab-cli test --local, and dynalab-cli test -n all all passed on our local server.)
However, the model status now shows "Taken Down". Would you mind sending me the detailed logs so I can debug our code? Thanks very much!

@gwenzek
Contributor

gwenzek commented Jun 28, 2021

Hi, I see that in the meantime you uploaded a new model that seems to work correctly. Is that right?

Here are the logs for your older model.
The failure seems to be a mismatch between the Hydra version you put in your requirements.txt and the version expected by the fairseq release you're using.

2021-06-26 17:13:12,097 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG - <fairseq.data.encoders.sentencepiece_bpe.SentencepieceBPE object at 0x7f7eac7201d0>
2021-06-26 17:13:12,316 [WARN ] W-9000-archive.ts1624724506-t1_1.0-stderr MODEL_LOG - /usr/local/lib/python3.6/dist-packages/hydra/experimental/initialize.py:37: UserWarning: hydra.experimental.initialize() is no longer experimental. Use hydra.initialize()
2021-06-26 17:13:12,317 [WARN ] W-9000-archive.ts1624724506-t1_1.0-stderr MODEL_LOG -   message="hydra.experimental.initialize() is no longer experimental."
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG - Error when composing. Overrides: ['common.no_progress_bar=False', 'common.log_interval=100', "common.log_format='json'", "common.tensorboard_logdir='/checkpoint/vishrav/tensorboard_logs/2021-03-28/mm100_flores_small.mem_fp16.transformer_wmt_en_de_big.shareemb.maxtok4096.uf2.adam.beta0.9_0.98.initlr1e-07.warmup4000.lr0.0006.clip0.0.drop0.1.act_drop0.0.atn_drop0.1.wd0.0.ls0.1.seed2.ngpu64'", 'common.wandb_project=null', 'common.seed=2', 'common.cpu=False', 'common.tpu=False', 'common.bf16=False', 'common.memory_efficient_bf16=False', 'common.fp16=True', 'common.memory_efficient_fp16=True', 'common.fp16_no_flatten_grads=False', 'common.fp16_init_scale=128', 'common.fp16_scale_window=null', 'common.fp16_scale_tolerance=0.0', 'common.min_loss_scale=0.0001', 'common.threshold_loss_scale=null', 'common.user_dir=null', 'common.empty_cache_freq=0', 'common.all_gather_list_size=16384', 'common.model_parallel_size=1', 'common.quantization_config_path=null', 'common.profile=False', 'common.reset_logging=True', 'common_eval.path=null', 'common_eval.post_process=null', 'common_eval.quiet=False', "common_eval.model_overrides='{}'", 'common_eval.results_path=null', 'distributed_training.distributed_world_size=64', 'distributed_training.distributed_rank=0', "distributed_training.distributed_backend='nccl'", "distributed_training.distributed_init_method='tcp://learnfair5060:17982'", 'distributed_training.distributed_port=17982', 'distributed_training.device_id=0', 'distributed_training.distributed_no_spawn=False', "distributed_training.ddp_backend='c10d'", 'distributed_training.bucket_cap_mb=25', 'distributed_training.fix_batches_to_gpus=False', 'distributed_training.find_unused_parameters=False', 'distributed_training.fast_stat_sync=False', 'distributed_training.broadcast_buffers=False', "distributed_training.distributed_wrapper='DDP'", 'distributed_training.slowmo_momentum=null', "distributed_training.slowmo_algorithm='LocalSGD'", 'distributed_training.localsgd_frequency=3', 'distributed_training.nprocs_per_node=1', 'distributed_training.pipeline_model_parallel=False', 'distributed_training.pipeline_balance=null', 'distributed_training.pipeline_devices=null', 'distributed_training.pipeline_chunks=0', 'distributed_training.pipeline_encoder_balance=null', 'distributed_training.pipeline_encoder_devices=null', 'distributed_training.pipeline_decoder_balance=null', 'distributed_training.pipeline_decoder_devices=null', "distributed_training.pipeline_checkpoint='never'", "distributed_training.zero_sharding='none'", 'distributed_training.tpu=True', 'dataset.num_workers=1', 'dataset.skip_invalid_size_inputs_valid_test=False', 'dataset.max_tokens=4096', 'dataset.batch_size=null', 'dataset.required_batch_size_multiple=8', 'dataset.required_seq_len_multiple=1', "dataset.dataset_impl='mmap'", 'dataset.data_buffer_size=10', "dataset.train_subset='train'", "dataset.valid_subset='valid'", 'dataset.validate_interval=1', 'dataset.validate_interval_updates=0', 'dataset.validate_after_updates=0', 'dataset.fixed_validation_seed=null', 'dataset.disable_validation=True', 'dataset.max_tokens_valid=4096', "dataset.batch_size_valid='${dataset.batch_size}'", 'dataset.curriculum=0', "dataset.gen_subset='test'", 'dataset.num_shards=1', 'dataset.shard_id=0', 'optimization.max_epoch=0', 'optimization.max_update=10000000', 'optimization.stop_time_hours=0.0', 'optimization.clip_norm=0.0', 'optimization.sentence_avg=False', 'optimization.update_freq=[2]', 
'optimization.lr=[0.0006]', 'optimization.stop_min_lr=-1.0', 'optimization.use_bmuf=False', "checkpoint.save_dir='/large_experiments/flores/checkpoints/mm100_flores/vishrav/mm100_flores_small.mem_fp16.transformer_wmt_en_de_big.shareemb.maxtok4096.uf2.adam.beta0.9_0.98.initlr1e-07.warmup4000.lr0.0006.clip0.0.drop0.1.act_drop0.0.atn_drop0.1.wd0.0.ls0.1.seed2.ngpu64'", "checkpoint.restore_file='checkpoint_last.pt'", 'checkpoint.finetune_from_model=null', 'checkpoint.reset_dataloader=False', 'checkpoint.reset_lr_scheduler=False', 'checkpoint.reset_meters=False', 'checkpoint.reset_optimizer=False', "checkpoint.optimizer_overrides='{}'", 'checkpoint.save_interval=1', 'checkpoint.save_interval_updates=25000', 'checkpoint.keep_interval_updates=-1', 'checkpoint.keep_last_epochs=-1', 'checkpoint.keep_best_checkpoints=-1', 'checkpoint.no_save=False', 'checkpoint.no_epoch_checkpoints=False', 'checkpoint.no_last_checkpoints=False', 'checkpoint.no_save_optimizer_state=False', "checkpoint.best_checkpoint_metric='loss'", 'checkpoint.maximize_best_checkpoint_metric=False', 'checkpoint.patience=-1', "checkpoint.checkpoint_suffix=''", 'checkpoint.checkpoint_shard_count=1', 'checkpoint.load_checkpoint_on_all_dp_ranks=False', "checkpoint.model_parallel_size='${common.model_parallel_size}'", 'checkpoint.distributed_rank=0', 'bmuf.block_lr=1.0', 'bmuf.block_momentum=0.875', 'bmuf.global_sync_iter=50', 'bmuf.warmup_iterations=500', 'bmuf.use_nbm=False', 'bmuf.average_sync=False', 'bmuf.distributed_world_size=64', 'generation.beam=5', 'generation.nbest=1', 'generation.max_len_a=0.0', 'generation.max_len_b=1024', 'generation.min_len=1', 'generation.match_source_len=False', 'generation.unnormalized=False', 'generation.no_early_stop=False', 'generation.no_beamable_mm=False', 'generation.lenpen=1.0', 'generation.unkpen=0.0', 'generation.replace_unk=null', 'generation.sacrebleu=False', 'generation.score_reference=False', 'generation.prefix_size=0', 'generation.no_repeat_ngram_size=0', 'generation.sampling=False', 'generation.sampling_topk=-1', 'generation.sampling_topp=-1.0', 'generation.constraints=null', 'generation.temperature=1.0', 'generation.diverse_beam_groups=-1', 'generation.diverse_beam_strength=0.5', 'generation.diversity_rate=-1.0', 'generation.print_alignment=False', 'generation.print_step=False', 'generation.lm_path=null', 'generation.lm_weight=0.0', 'generation.iter_decode_eos_penalty=0.0', 'generation.iter_decode_max_iter=10', 'generation.iter_decode_force_max_iter=False', 'generation.iter_decode_with_beam=1', 'generation.iter_decode_with_external_reranker=False', 'generation.retain_iter_history=False', 'generation.retain_dropout=False', 'generation.retain_dropout_modules=null', 'generation.decoding_format=null', 'generation.no_seed_provided=False', 'eval_lm.output_word_probs=False', 'eval_lm.output_word_stats=False', 'eval_lm.context_window=0', 'eval_lm.softmax_batch=9223372036854775807', 'interactive.buffer_size=0', "interactive.input='-'", 'optimizer=adam', 'optimizer._name=adam', "optimizer.adam_betas='(0.9, 0.98)'", 'optimizer.adam_eps=1e-08', 'optimizer.weight_decay=0.0', 'optimizer.use_old_adam=False', 'optimizer.tpu=True', 'optimizer.lr=[0.0006]', 'lr_scheduler=inverse_sqrt', 'lr_scheduler._name=inverse_sqrt', 'lr_scheduler.warmup_updates=4000', 'lr_scheduler.warmup_init_lr=1e-07', 'lr_scheduler.lr=[0.0006]']
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG - Backend worker process died.
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 182, in <module>
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     worker.run_server()
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 154, in run_server
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     self.handle_connection(cl_socket)
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 116, in handle_connection
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     service, result, code = self.load_model(msg)
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 89, in load_model
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     service = model_loader.load(model_name, model_dir, handler, gpu, batch_size, envelope)
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/usr/local/lib/python3.6/dist-packages/ts/model_loader.py", line 110, in load
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     initialize_fn(service.context)
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/usr/local/lib/python3.6/dist-packages/ts/model_loader.py", line 137, in <lambda>
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     initialize_fn = lambda ctx: entry_point(None, ctx)
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/home/model-server/tmp/models/4593aa57cb454867b37cbab3bb764c11/handler.py", line 286, in handle
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     _service.initialize(context)
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/home/model-server/tmp/models/4593aa57cb454867b37cbab3bb764c11/handler.py", line 115, in initialize
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     self.models, _model_args = checkpoint_utils.load_model_ensemble(model_pt_path.split(), task=self.task)
2021-06-26 17:13:12,396 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/home/model-server/code/fairseq/checkpoint_utils.py", line 269, in load_model_ensemble
2021-06-26 17:13:12,396 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     state,
2021-06-26 17:13:12,396 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/home/model-server/code/fairseq/checkpoint_utils.py", line 306, in load_model_ensemble_and_task
2021-06-26 17:13:12,396 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     cfg = convert_namespace_to_omegaconf(state["args"])
2021-06-26 17:13:12,396 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/home/model-server/code/fairseq/dataclass/utils.py", line 351, in convert_namespace_to_omegaconf
2021-06-26 17:13:12,396 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     composed_cfg = compose("config", overrides=overrides, strict=False)
2021-06-26 17:13:12,396 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG - TypeError: compose() got an unexpected keyword argument 'strict'
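
For anyone hitting the same TypeError: Hydra 1.1 removed the strict argument of compose(), while the fairseq code in this traceback still calls compose(..., strict=False), so the Hydra pin in requirements.txt needs to stay on the 1.0.x series. A minimal sketch of such a pin, with version bounds that are an assumption on my part (match whatever your fairseq checkout declares in its setup.py):

```text
# requirements.txt (sketch; exact bounds are assumptions, check your fairseq release)
# Hydra 1.1 dropped compose(strict=...), which this fairseq code still uses,
# so stay on Hydra 1.0.x and the matching omegaconf 2.0.x.
hydra-core>=1.0.7,<1.1
omegaconf<2.1
```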

@CSJianYang
Author

CSJianYang commented Jun 29, 2021

Hi, after fixing the environment setup, I successfully submitted two models: https://dynabench.org/models/121 (Base M2M: has scores on both the dev and devtest sets) and https://dynabench.org/models/123 (Large M2M: only has BLEU scores on the devtest set). However, https://dynabench.org/models/123 shows "Taken Down": it has BLEU scores on the devtest set but failed on the dev set. Could you please check the detailed logs to help me solve this? Is there a time limit that caused the failure on the dev set, or was it perhaps an out-of-memory error?

@gwenzek
Contributor

gwenzek commented Jun 29, 2021

Hmm, it seems you are pushing the limits of the system a bit; we may need to revise some constraints.
With batch_size = 128 you run out of memory, but with batch_size = 64 you hit the timeout.

I'll increase the timeout, but that requires redeploying the evaluation servers, which I don't have access to, so it won't happen before tomorrow.

In the meantime, can you try an intermediate batch size, say 96?

@CSJianYang
Author

CSJianYang commented Jun 30, 2021

@gwenzek Hi, I tried setting batch_size = 96, but the model was also "Taken Down" (https://dynabench.org/models/134). Could you please check the detailed log to see why the task failed? Could you also tell us the exact time limit (e.g. 20 min or 40 min), so we can make sure our model's inference time stays within it? Thanks very much!

@gwenzek
Contributor

gwenzek commented Jun 30, 2021

I did not see a model with a batch size of 96, only 64 and 128, and it seems your last model with a batch size of 128 succeeded.
The timeout is currently the 10-minute default of the tool we are using, but I've increased it to 20 minutes per request (not live yet). A request contains ~3000 sentences, so you need to process roughly 3 sentences per second.
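
For reference, the back-of-the-envelope calculation behind that number (values are the approximate ones quoted above):

```python
# Rough throughput budget from the approximate numbers above.
sentences_per_request = 3000        # ~3000 sentences per evaluation request
timeout_seconds = 20 * 60           # 20-minute per-request timeout (once live)

required_rate = sentences_per_request / timeout_seconds
print(f"~{required_rate:.1f} sentences/second needed")  # ~2.5, i.e. roughly 3 per second
```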

@gwenzek gwenzek self-assigned this Jul 1, 2021
@gwenzek
Contributor

gwenzek commented Jul 2, 2021

The timeout increase is live; you can try re-uploading your model. Thanks for your patience.

@CSJianYang
Author

Thanks very much @gwenzek! I have successfully uploaded the model. Moreover, I would like each minibatch (minibatch size = 64) to contain only one translation direction (e.g. only en->fr in a batch, not en->fr and cs->de mixed together). Would you mind giving some suggestions on how to achieve this?

@gwenzek
Contributor

gwenzek commented Jul 5, 2021

So the samples arrive grouped by language, IIRC: first all the en->de, then all the en->fr, and so on.
You could modify the for loop of handle in handler.py to also cut a minibatch when the language changes.
Notably this line: /~https://github.com/gwenzek/flores/blob/e0b63c00e6ceea2fad61f2e15c1ae0a24c86b4e4/dynalab/handler.py#L270-L270
The condition should read something like: if last_lang == minibatch_lang and len(samples) < batch_size and i + 1 < n
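
A minimal sketch of that kind of grouping loop, assuming each sample is a dict with sourceLanguage and targetLanguage keys and arrives already ordered by direction (field names are illustrative, not necessarily the exact handler API):

```python
def iter_minibatches(samples, batch_size=64):
    """Yield minibatches that never mix translation directions.

    Assumes `samples` arrives grouped by direction, as described above.
    The "sourceLanguage"/"targetLanguage" keys are illustrative.
    """
    minibatch, minibatch_lang = [], None
    for sample in samples:
        lang = (sample["sourceLanguage"], sample["targetLanguage"])
        # Cut the current minibatch when it is full or the direction changes.
        if minibatch and (lang != minibatch_lang or len(minibatch) >= batch_size):
            yield minibatch
            minibatch = []
        minibatch_lang = lang
        minibatch.append(sample)
    if minibatch:
        yield minibatch
```

Each yielded minibatch can then be translated in a single call, so a direction change always starts a fresh batch even if the previous one isn't full.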

@CSJianYang
Author

Thanks very much! I also tried submitting the model to the large track (https://dynabench.org/models/157), but it shows "Taken Down". Could you please help me check the reason? What is the exact time limit for the large track (nearly 10,000 translation directions)?

@gwenzek
Contributor

gwenzek commented Jul 22, 2021

Sorry, the large track is currently not working. I should have a fix by next week.

@gwenzek
Contributor

gwenzek commented Jul 30, 2021

Hi, I think I've fixed the issues with the large track. It was hitting several limits of the Dynabench design, and I had to push a few walls :-) Your two models for the large track have started their evaluation; I will keep you updated.

@gwenzek
Contributor

gwenzek commented Aug 2, 2021

Your model failed; it seems to be missing the language "zt".
I'm not sure what this code corresponds to, since it's not one of the codes used by Flores.

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 182, in <module>
     worker.run_server()
File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 154, in run_server
     self.handle_connection(cl_socket)
File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 116, in handle_connection
     service, result, code = self.load_model(msg)
File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 89, in load_model
     service = model_loader.load(model_name, model_dir, handler, gpu, batch_size, envelope)
File "/usr/local/lib/python3.6/dist-packages/ts/model_loader.py", line 110, in load
     initialize_fn(service.context)
File "/usr/local/lib/python3.6/dist-packages/ts/model_loader.py", line 137, in <lambda>
     initialize_fn = lambda ctx: entry_point(None, ctx)
File "/home/model-server/tmp/models/f4338934cc2b4d049867efb380f9a68a/handler.py", line 307, in handle
     _service.initialize(context)
File "/home/model-server/tmp/models/f4338934cc2b4d049867efb380f9a68a/handler.py", line 127, in initialize
     self.task = tasks.setup_task(self.cfg.task)
File "/home/model-server/code/fairseq/tasks/__init__.py", line 44, in setup_task
     return task.setup_task(cfg, **kwargs)
File "/home/model-server/code/fairseq/tasks/translation_multi_simple_epoch.py", line 129, in setup_task
     return cls(args, langs, dicts, training)
2021-07-30 23:52:01,199 [INFO ] epollEventLoopGroup-5-7 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
File "/home/model-server/code/fairseq/tasks/translation_multi_simple_epoch.py", line 106, in __init__
     args, self.lang_pairs, langs, dicts, self.sampling_method
File "/home/model-server/code/fairseq/data/multilingual/multilingual_data_manager.py", line 87, in setup_data_manager
     args, lang_pairs, langs, dicts, sampling_method
File "/home/model-server/code/fairseq/data/multilingual/multilingual_data_manager.py", line 77, in __init__
     self.lang_id[lang] = self.get_langtok_index(langtok, self.dicts[list(self.dicts.keys())[0]])
File "/home/model-server/code/fairseq/data/multilingual/multilingual_data_manager.py", line 425, in get_langtok_index
     ), "cannot find language token 
{}
 in the dictionary".format(lang_tok)
AssertionError: cannot find language token __zt__ in the dictionary
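
If it helps with debugging on your side, here is a minimal sketch of how to list the language tokens a fairseq dictionary actually contains (the dictionary path is an assumption; point it at the dict file inside your model archive):

```python
from fairseq.data import Dictionary

# Path is an assumption: use the dictionary file shipped with your checkpoint.
d = Dictionary.load("model_dir/dict.txt")

# fairseq multilingual language tokens look like "__en__", "__fr__", ...
lang_tokens = [s for s in d.symbols if s.startswith("__") and s.endswith("__")]
print(sorted(lang_tokens))
print("__zt__" in d.indices)  # False reproduces the assertion above
```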

@gwenzek gwenzek added the flores (Flores competition) label Aug 5, 2021
@CSJianYang
Author

CSJianYang commented Aug 5, 2021

@gwenzek, thanks for your efforts! We have tried uploading base and large models for the FLORES-FULL track (https://www.dynabench.org/models/312 and https://www.dynabench.org/models/315), and both jobs show the "Taken Down" status. Could you please provide the detailed logs? The deadline is approaching and we haven't yet successfully uploaded a model to the full track. Thanks very much!

@CSJianYang CSJianYang changed the title from "Taken Down in the Dynabench" to "Taken Down in the Dynabench for Large Track" on Aug 6, 2021