This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Taken Down in the Dynabench for Large Track #90

Open
CSJianYang opened this issue Jun 27, 2021 · 13 comments
Labels: flores (Flores competition)

Comments

@CSJianYang

CSJianYang commented Jun 27, 2021

Hi,
I passed the local and integration tests following the model submission workflow on GitHub and submitted my model. I then received the message "Your model t1 has been successfully deployed. You can find and publish the model at https://dynabench.org/models/119." (python handler.py, dynalab-cli test --local, and dynalab-cli test -n all all passed on our local server.)
However, the model status now shows "Taken Down". Would you mind sending me the detailed logs so I can debug our code? Thanks very much!

@gwenzek
Contributor

gwenzek commented Jun 28, 2021

Hi, I see that in the meantime you uploaded a new model that seems to work correctly. Is that right?

Here are the logs for your older model.
The failure seems to be a mismatch between the Hydra version you put in your requirements.txt and the version expected by the fairseq release you're using.

2021-06-26 17:13:12,097 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG - <fairseq.data.encoders.sentencepiece_bpe.SentencepieceBPE object at 0x7f7eac7201d0>
2021-06-26 17:13:12,316 [WARN ] W-9000-archive.ts1624724506-t1_1.0-stderr MODEL_LOG - /usr/local/lib/python3.6/dist-packages/hydra/experimental/initialize.py:37: UserWarning: hydra.experimental.initialize() is no longer experimental. Use hydra.initialize()
2021-06-26 17:13:12,317 [WARN ] W-9000-archive.ts1624724506-t1_1.0-stderr MODEL_LOG -   message="hydra.experimental.initialize() is no longer experimental."
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG - Error when composing. Overrides: ['common.no_progress_bar=False', 'common.log_interval=100', "common.log_format='json'", "common.tensorboard_logdir='/checkpoint/vishrav/tensorboard_logs/2021-03-28/mm100_flores_small.mem_fp16.transformer_wmt_en_de_big.shareemb.maxtok4096.uf2.adam.beta0.9_0.98.initlr1e-07.warmup4000.lr0.0006.clip0.0.drop0.1.act_drop0.0.atn_drop0.1.wd0.0.ls0.1.seed2.ngpu64'", 'common.wandb_project=null', 'common.seed=2', 'common.cpu=False', 'common.tpu=False', 'common.bf16=False', 'common.memory_efficient_bf16=False', 'common.fp16=True', 'common.memory_efficient_fp16=True', 'common.fp16_no_flatten_grads=False', 'common.fp16_init_scale=128', 'common.fp16_scale_window=null', 'common.fp16_scale_tolerance=0.0', 'common.min_loss_scale=0.0001', 'common.threshold_loss_scale=null', 'common.user_dir=null', 'common.empty_cache_freq=0', 'common.all_gather_list_size=16384', 'common.model_parallel_size=1', 'common.quantization_config_path=null', 'common.profile=False', 'common.reset_logging=True', 'common_eval.path=null', 'common_eval.post_process=null', 'common_eval.quiet=False', "common_eval.model_overrides='{}'", 'common_eval.results_path=null', 'distributed_training.distributed_world_size=64', 'distributed_training.distributed_rank=0', "distributed_training.distributed_backend='nccl'", "distributed_training.distributed_init_method='tcp://learnfair5060:17982'", 'distributed_training.distributed_port=17982', 'distributed_training.device_id=0', 'distributed_training.distributed_no_spawn=False', "distributed_training.ddp_backend='c10d'", 'distributed_training.bucket_cap_mb=25', 'distributed_training.fix_batches_to_gpus=False', 'distributed_training.find_unused_parameters=False', 'distributed_training.fast_stat_sync=False', 'distributed_training.broadcast_buffers=False', "distributed_training.distributed_wrapper='DDP'", 'distributed_training.slowmo_momentum=null', "distributed_training.slowmo_algorithm='LocalSGD'", 'distributed_training.localsgd_frequency=3', 'distributed_training.nprocs_per_node=1', 'distributed_training.pipeline_model_parallel=False', 'distributed_training.pipeline_balance=null', 'distributed_training.pipeline_devices=null', 'distributed_training.pipeline_chunks=0', 'distributed_training.pipeline_encoder_balance=null', 'distributed_training.pipeline_encoder_devices=null', 'distributed_training.pipeline_decoder_balance=null', 'distributed_training.pipeline_decoder_devices=null', "distributed_training.pipeline_checkpoint='never'", "distributed_training.zero_sharding='none'", 'distributed_training.tpu=True', 'dataset.num_workers=1', 'dataset.skip_invalid_size_inputs_valid_test=False', 'dataset.max_tokens=4096', 'dataset.batch_size=null', 'dataset.required_batch_size_multiple=8', 'dataset.required_seq_len_multiple=1', "dataset.dataset_impl='mmap'", 'dataset.data_buffer_size=10', "dataset.train_subset='train'", "dataset.valid_subset='valid'", 'dataset.validate_interval=1', 'dataset.validate_interval_updates=0', 'dataset.validate_after_updates=0', 'dataset.fixed_validation_seed=null', 'dataset.disable_validation=True', 'dataset.max_tokens_valid=4096', "dataset.batch_size_valid='${dataset.batch_size}'", 'dataset.curriculum=0', "dataset.gen_subset='test'", 'dataset.num_shards=1', 'dataset.shard_id=0', 'optimization.max_epoch=0', 'optimization.max_update=10000000', 'optimization.stop_time_hours=0.0', 'optimization.clip_norm=0.0', 'optimization.sentence_avg=False', 'optimization.update_freq=[2]', 
'optimization.lr=[0.0006]', 'optimization.stop_min_lr=-1.0', 'optimization.use_bmuf=False', "checkpoint.save_dir='/large_experiments/flores/checkpoints/mm100_flores/vishrav/mm100_flores_small.mem_fp16.transformer_wmt_en_de_big.shareemb.maxtok4096.uf2.adam.beta0.9_0.98.initlr1e-07.warmup4000.lr0.0006.clip0.0.drop0.1.act_drop0.0.atn_drop0.1.wd0.0.ls0.1.seed2.ngpu64'", "checkpoint.restore_file='checkpoint_last.pt'", 'checkpoint.finetune_from_model=null', 'checkpoint.reset_dataloader=False', 'checkpoint.reset_lr_scheduler=False', 'checkpoint.reset_meters=False', 'checkpoint.reset_optimizer=False', "checkpoint.optimizer_overrides='{}'", 'checkpoint.save_interval=1', 'checkpoint.save_interval_updates=25000', 'checkpoint.keep_interval_updates=-1', 'checkpoint.keep_last_epochs=-1', 'checkpoint.keep_best_checkpoints=-1', 'checkpoint.no_save=False', 'checkpoint.no_epoch_checkpoints=False', 'checkpoint.no_last_checkpoints=False', 'checkpoint.no_save_optimizer_state=False', "checkpoint.best_checkpoint_metric='loss'", 'checkpoint.maximize_best_checkpoint_metric=False', 'checkpoint.patience=-1', "checkpoint.checkpoint_suffix=''", 'checkpoint.checkpoint_shard_count=1', 'checkpoint.load_checkpoint_on_all_dp_ranks=False', "checkpoint.model_parallel_size='${common.model_parallel_size}'", 'checkpoint.distributed_rank=0', 'bmuf.block_lr=1.0', 'bmuf.block_momentum=0.875', 'bmuf.global_sync_iter=50', 'bmuf.warmup_iterations=500', 'bmuf.use_nbm=False', 'bmuf.average_sync=False', 'bmuf.distributed_world_size=64', 'generation.beam=5', 'generation.nbest=1', 'generation.max_len_a=0.0', 'generation.max_len_b=1024', 'generation.min_len=1', 'generation.match_source_len=False', 'generation.unnormalized=False', 'generation.no_early_stop=False', 'generation.no_beamable_mm=False', 'generation.lenpen=1.0', 'generation.unkpen=0.0', 'generation.replace_unk=null', 'generation.sacrebleu=False', 'generation.score_reference=False', 'generation.prefix_size=0', 'generation.no_repeat_ngram_size=0', 'generation.sampling=False', 'generation.sampling_topk=-1', 'generation.sampling_topp=-1.0', 'generation.constraints=null', 'generation.temperature=1.0', 'generation.diverse_beam_groups=-1', 'generation.diverse_beam_strength=0.5', 'generation.diversity_rate=-1.0', 'generation.print_alignment=False', 'generation.print_step=False', 'generation.lm_path=null', 'generation.lm_weight=0.0', 'generation.iter_decode_eos_penalty=0.0', 'generation.iter_decode_max_iter=10', 'generation.iter_decode_force_max_iter=False', 'generation.iter_decode_with_beam=1', 'generation.iter_decode_with_external_reranker=False', 'generation.retain_iter_history=False', 'generation.retain_dropout=False', 'generation.retain_dropout_modules=null', 'generation.decoding_format=null', 'generation.no_seed_provided=False', 'eval_lm.output_word_probs=False', 'eval_lm.output_word_stats=False', 'eval_lm.context_window=0', 'eval_lm.softmax_batch=9223372036854775807', 'interactive.buffer_size=0', "interactive.input='-'", 'optimizer=adam', 'optimizer._name=adam', "optimizer.adam_betas='(0.9, 0.98)'", 'optimizer.adam_eps=1e-08', 'optimizer.weight_decay=0.0', 'optimizer.use_old_adam=False', 'optimizer.tpu=True', 'optimizer.lr=[0.0006]', 'lr_scheduler=inverse_sqrt', 'lr_scheduler._name=inverse_sqrt', 'lr_scheduler.warmup_updates=4000', 'lr_scheduler.warmup_init_lr=1e-07', 'lr_scheduler.lr=[0.0006]']
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG - Backend worker process died.
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG - Traceback (most recent call last):
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 182, in <module>
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     worker.run_server()
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 154, in run_server
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     self.handle_connection(cl_socket)
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 116, in handle_connection
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     service, result, code = self.load_model(msg)
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 89, in load_model
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     service = model_loader.load(model_name, model_dir, handler, gpu, batch_size, envelope)
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/usr/local/lib/python3.6/dist-packages/ts/model_loader.py", line 110, in load
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     initialize_fn(service.context)
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/usr/local/lib/python3.6/dist-packages/ts/model_loader.py", line 137, in <lambda>
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     initialize_fn = lambda ctx: entry_point(None, ctx)
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/home/model-server/tmp/models/4593aa57cb454867b37cbab3bb764c11/handler.py", line 286, in handle
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     _service.initialize(context)
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/home/model-server/tmp/models/4593aa57cb454867b37cbab3bb764c11/handler.py", line 115, in initialize
2021-06-26 17:13:12,395 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     self.models, _model_args = checkpoint_utils.load_model_ensemble(model_pt_path.split(), task=self.task)
2021-06-26 17:13:12,396 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/home/model-server/code/fairseq/checkpoint_utils.py", line 269, in load_model_ensemble
2021-06-26 17:13:12,396 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     state,
2021-06-26 17:13:12,396 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/home/model-server/code/fairseq/checkpoint_utils.py", line 306, in load_model_ensemble_and_task
2021-06-26 17:13:12,396 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     cfg = convert_namespace_to_omegaconf(state["args"])
2021-06-26 17:13:12,396 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -   File "/home/model-server/code/fairseq/dataclass/utils.py", line 351, in convert_namespace_to_omegaconf
2021-06-26 17:13:12,396 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG -     composed_cfg = compose("config", overrides=overrides, strict=False)
2021-06-26 17:13:12,396 [INFO ] W-9000-archive.ts1624724506-t1_1.0-stdout MODEL_LOG - TypeError: compose() got an unexpected keyword argument 'strict'
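
For anyone hitting the same TypeError: Hydra 1.1 removed the strict argument of compose(), while the fairseq code in this traceback still calls compose(..., strict=False), so the Hydra pin in requirements.txt needs to stay on the 1.0.x series. A minimal sketch of such a pin, with version bounds that are an assumption on my part (match whatever your fairseq checkout declares in its setup.py):

```text
# requirements.txt (sketch; exact bounds are assumptions, check your fairseq release)
# Hydra 1.1 dropped compose(strict=...), which this fairseq code still uses,
# so stay on Hydra 1.0.x and the matching omegaconf 2.0.x.
hydra-core>=1.0.7,<1.1
omegaconf<2.1
```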

@CSJianYang
Author

CSJianYang commented Jun 29, 2021

Hi, after fixing the environment setup, I successfully submitted two models: https://dynabench.org/models/121 (Base M2M: has scores on both the dev and devtest sets) and https://dynabench.org/models/123 (Large M2M: only has BLEU scores on the devtest set). However, https://dynabench.org/models/123 shows "Taken Down": it has BLEU scores on the devtest set but failed on the dev set. Could you please check the detailed logs to help me solve this? Is there a time limit that caused the failure on the dev set, or was it perhaps an out-of-memory error?

@gwenzek
Contributor

gwenzek commented Jun 29, 2021

Hmm, it seems you are pushing the limits of the system a bit; we may need to revise some constraints.
With batch_size = 128 you run out of memory, but with batch_size = 64 you hit the timeout.

I'll increase the timeout, but that requires redeploying the evaluation servers, which I don't have access to, so it won't happen before tomorrow.

In the meantime, can you try an intermediate batch size, say 96?

@CSJianYang
Author

CSJianYang commented Jun 30, 2021

@gwenzek Hi, I tried setting batch_size = 96, but the model was also "Taken Down" (https://dynabench.org/models/134). Could you please check the detailed log to see why the task failed? Could you also tell us the exact time limit (e.g. 20 min or 40 min), so we can make sure our model's inference time stays within it? Thanks very much!

@gwenzek
Contributor

gwenzek commented Jun 30, 2021

I did not see a model with a batch size of 96, only 64 and 128, and it seems your last model with a batch size of 128 succeeded.
The timeout is currently the 10-minute default of the tool we are using, but I've increased it to 20 minutes per request (not live yet). A request contains ~3000 sentences, so you need to process roughly 3 sentences per second.
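
For reference, the back-of-the-envelope calculation behind that number (values are the approximate ones quoted above):

```python
# Rough throughput budget from the approximate numbers above.
sentences_per_request = 3000        # ~3000 sentences per evaluation request
timeout_seconds = 20 * 60           # 20-minute per-request timeout (once live)

required_rate = sentences_per_request / timeout_seconds
print(f"~{required_rate:.1f} sentences/second needed")  # ~2.5, i.e. roughly 3 per second
```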

@gwenzek gwenzek self-assigned this Jul 1, 2021
@gwenzek
Contributor

gwenzek commented Jul 2, 2021

The timeout increase is live; you can try re-uploading your model. Thanks for your patience.

@CSJianYang
Author

Thanks very much @gwenzek! I have successfully uploaded the model. Moreover, I would like each minibatch (minibatch size = 64) to contain only one translation direction (e.g. only en->fr in a batch, not en->fr and cs->de mixed together). Would you mind giving some suggestions on how to achieve this?

@gwenzek
Contributor

gwenzek commented Jul 5, 2021

So the samples arrive grouped by language, IIRC: first all the en->de, then all the en->fr, and so on.
You could modify the for loop of handle in handler.py to also cut a minibatch when the language changes.
Notably this line: /~https://github.com/gwenzek/flores/blob/e0b63c00e6ceea2fad61f2e15c1ae0a24c86b4e4/dynalab/handler.py#L270-L270
The condition should read something like: if last_lang == minibatch_lang and len(samples) < batch_size and i + 1 < n
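
A minimal sketch of that kind of grouping loop, assuming each sample is a dict with sourceLanguage and targetLanguage keys and arrives already ordered by direction (field names are illustrative, not necessarily the exact handler API):

```python
def iter_minibatches(samples, batch_size=64):
    """Yield minibatches that never mix translation directions.

    Assumes `samples` arrives grouped by direction, as described above.
    The "sourceLanguage"/"targetLanguage" keys are illustrative.
    """
    minibatch, minibatch_lang = [], None
    for sample in samples:
        lang = (sample["sourceLanguage"], sample["targetLanguage"])
        # Cut the current minibatch when it is full or the direction changes.
        if minibatch and (lang != minibatch_lang or len(minibatch) >= batch_size):
            yield minibatch
            minibatch = []
        minibatch_lang = lang
        minibatch.append(sample)
    if minibatch:
        yield minibatch
```

Each yielded minibatch can then be translated in a single call, so a direction change always starts a fresh batch even if the previous one isn't full.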

@CSJianYang
Author

Thanks very much! I also tried submitting the model to the large track (https://dynabench.org/models/157), but it shows "Taken Down". Could you please help me check the reason? What is the exact time limit for the large track (nearly 10,000 translation directions)?

@gwenzek
Contributor

gwenzek commented Jul 22, 2021

Sorry, the large track is currently not working. I should have a fix by next week.

@gwenzek
Contributor

gwenzek commented Jul 30, 2021

Hi, I think I've fixed the issues with the large track. It was hitting several limits of the Dynabench design, and I had to push a few walls :-) Your two models for the large track have started their evaluation; I will keep you updated.

@gwenzek
Contributor

gwenzek commented Aug 2, 2021

Your model failed; it seems to be missing the language "zt".
I'm not sure what this code corresponds to, since it's not one of the codes used by Flores.

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 182, in <module>
     worker.run_server()
File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 154, in run_server
     self.handle_connection(cl_socket)
File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 116, in handle_connection
     service, result, code = self.load_model(msg)
File "/usr/local/lib/python3.6/dist-packages/ts/model_service_worker.py", line 89, in load_model
     service = model_loader.load(model_name, model_dir, handler, gpu, batch_size, envelope)
File "/usr/local/lib/python3.6/dist-packages/ts/model_loader.py", line 110, in load
     initialize_fn(service.context)
File "/usr/local/lib/python3.6/dist-packages/ts/model_loader.py", line 137, in <lambda>
     initialize_fn = lambda ctx: entry_point(None, ctx)
File "/home/model-server/tmp/models/f4338934cc2b4d049867efb380f9a68a/handler.py", line 307, in handle
     _service.initialize(context)
File "/home/model-server/tmp/models/f4338934cc2b4d049867efb380f9a68a/handler.py", line 127, in initialize
     self.task = tasks.setup_task(self.cfg.task)
File "/home/model-server/code/fairseq/tasks/__init__.py", line 44, in setup_task
     return task.setup_task(cfg, **kwargs)
File "/home/model-server/code/fairseq/tasks/translation_multi_simple_epoch.py", line 129, in setup_task
     return cls(args, langs, dicts, training)
2021-07-30 23:52:01,199 [INFO ] epollEventLoopGroup-5-7 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
File "/home/model-server/code/fairseq/tasks/translation_multi_simple_epoch.py", line 106, in __init__
     args, self.lang_pairs, langs, dicts, self.sampling_method
File "/home/model-server/code/fairseq/data/multilingual/multilingual_data_manager.py", line 87, in setup_data_manager
     args, lang_pairs, langs, dicts, sampling_method
File "/home/model-server/code/fairseq/data/multilingual/multilingual_data_manager.py", line 77, in __init__
     self.lang_id[lang] = self.get_langtok_index(langtok, self.dicts[list(self.dicts.keys())[0]])
File "/home/model-server/code/fairseq/data/multilingual/multilingual_data_manager.py", line 425, in get_langtok_index
     ), "cannot find language token 
{}
 in the dictionary".format(lang_tok)
AssertionError: cannot find language token __zt__ in the dictionary
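
If it helps with debugging on your side, here is a minimal sketch of how to list the language tokens a fairseq dictionary actually contains (the dictionary path is an assumption; point it at the dict file inside your model archive):

```python
from fairseq.data import Dictionary

# Path is an assumption: use the dictionary file shipped with your checkpoint.
d = Dictionary.load("model_dir/dict.txt")

# fairseq multilingual language tokens look like "__en__", "__fr__", ...
lang_tokens = [s for s in d.symbols if s.startswith("__") and s.endswith("__")]
print(sorted(lang_tokens))
print("__zt__" in d.indices)  # False reproduces the assertion above
```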

@gwenzek gwenzek added the flores (Flores competition) label Aug 5, 2021
@CSJianYang
Author

CSJianYang commented Aug 5, 2021

@gwenzek, thanks for your efforts! We have tried uploading base and large models for the FLORES-FULL track (https://www.dynabench.org/models/312 and https://www.dynabench.org/models/315), and both jobs show the "Taken Down" status. Could you please provide the detailed logs? The deadline is approaching and we haven't yet successfully uploaded a model to the full track. Thanks very much!

@CSJianYang CSJianYang changed the title from "Taken Down in the Dynabench" to "Taken Down in the Dynabench for Large Track" on Aug 6, 2021