This is the official implementation of Pix2Seq in Tensorflow 2 with efficient TPUs/GPUs support. The original Pix2Seq code aims to be a general framework that turns RGB pixels into semantically meaningful sequences. We now extend it to be a generic codebase, with task-centric organization that supports different tasks as well as their combination, using generative modeling (both autoregressive and diffusion models, see below).
We added (official) implementations of FitTransformer (FIT) (as an encoder, a diffusion decoder, or an autoregressive decoder) see architectures/transformers.py.
We added (official) implementations of diffusion models (such as Bit Diffusion, RIN, see references below) built on top of the original Pix2Seq codebase and they can be found in tasks/, models/, and architectures/.
Please note that we have not yet added proper documentations on training these models.
Backbone | Total params (M) | Image size | Google cloud storage location |
---|---|---|---|
ResNet-50 | 36.6 | 640x640 | gs://pix2seq/obj365_pretrain/resnet_640x640_b256_s400k |
ResNet-50 (C4) | 84.7 | 640x640 | gs://pix2seq/obj365_pretrain/resnetc_640x640_b256_s400k |
ViT-B | 115.2 | 640x640 | gs://pix2seq/obj365_pretrain/vit_b_640x640_b256_s400k |
ViT-L | 341.2 | 640x640 | gs://pix2seq/obj365_pretrain/vit_l_640x640_b256_s400k |
Backbone | Total params (M) | Image size | COCO AP | Google cloud storage location |
---|---|---|---|---|
ResNet-50 | 36.6 | 640x640 | 39.1 | gs://pix2seq/coco_det_finetune/resnet_640x640 |
ResNet-50 | 36.6 | 1024x1024 | 41.7 | gs://pix2seq/coco_det_finetune/resnet_1024x1024 |
ResNet-50 | 36.6 | 1333x1333 | 42.6 | gs://pix2seq/coco_det_finetune/resnet_1333x1333 |
ResNet-50 (C4) | 84.7 | 640x640 | 44.7 | gs://pix2seq/coco_det_finetune/resnetc_640x640 |
ResNet-50 (C4) | 84.7 | 1024x1024 | 46.9 | gs://pix2seq/coco_det_finetune/resnetc_1024x1024 |
ResNet-50 (C4) | 84.7 | 1333x1333 | 47.3 | gs://pix2seq/coco_det_finetune/resnetc_1333x1333 |
ViT-B | 115.2 | 640x640 | 44.2 | gs://pix2seq/coco_det_finetune/vit_b_640x640 |
ViT-B | 115.2 | 1024x1024 | 46.5 | gs://pix2seq/coco_det_finetune/vit_b_1024x1024 |
ViT-B | 115.2 | 1333x1333 | 47.1 | gs://pix2seq/coco_det_finetune/vit_b_1333x1333 |
ViT-L | 341.2 | 640x640 | 47.6 | gs://pix2seq/coco_det_finetune/vit_l_640x640 |
ViT-L | 341.2 | 1024x1024 | 49.2 | gs://pix2seq/coco_det_finetune/vit_l_1024x1024 |
ViT-L | 341.2 | 1333x1333 | 50.0 | gs://pix2seq/coco_det_finetune/vit_l_1333x1333 |
Jointly fine-tuned on coco object detection, instance segmentation, captioning and keypoint detection.
Backbone | Total params (M) | Image size | COCO AP | Google cloud storage location |
---|---|---|---|---|
ViT-B | 115.2 | 640x640 | 44.2 | gs://pix2seq/multi_task/ckpt/vit_b_640x640 |
ViT-B | 115.2 | 1024x1024 | 46.5 | gs://pix2seq/multi_task/ckpt/vit_b_1024x1024 |
See colabs for inference and fine-tuning demos. Give it a try!
The following setup is required before running the code.
git clone /~https://github.com/google-research/pix2seq.git
pip install -r requirements.txt
Download COCO annotations from gs://pix2seq/multi_task/data/coco/json to /tmp/coco_annotations
(dir can be updated in the configs).
annotations_dir=/tmp/coco_annotations
wget https://storage.googleapis.com/pix2seq/multi_task/data/coco/json/captions_train2017_eval_compatible.json $annotations_dir
wget https://storage.googleapis.com/pix2seq/multi_task/data/coco/json/captions_val2017_eval_compatible.json $annotations_dir
wget https://storage.googleapis.com/pix2seq/multi_task/data/coco/json/instances_train2017.json $annotations_dir
wget https://storage.googleapis.com/pix2seq/multi_task/data/coco/json/instances_val2017.json $annotations_dir
wget https://storage.googleapis.com/pix2seq/multi_task/data/coco/json/person_keypoints_train2017.json $annotations_dir
wget https://storage.googleapis.com/pix2seq/multi_task/data/coco/json/person_keypoints_val2017.json $annotations_dir
(Optional) If accessing the pretrained checkpoints in Cloud is slowing down or blocking the start of training/eval, you can download them manually with following command gsutil cp -r gs://cloud_folder local_folder
, and update pretrained_ckpt
in the config file accordingly.
(Optional) If training fails at the start (due to NcclAllReduce error), try a different cross_device_ops
for tf.distribute.MirroredStrategy
in utils.py:build_strategy function.
Below is the instruction for starting a training job, where we've set up a configuration mainly for fine-tuning the objects365 pretrained models.
Step 1: check config_det_finetune.py and update if necessary, such as encoder_variant
, image_size
.
Step 2: run python3 run.py --mode=train --model_dir=/tmp/model_dir --config=configs/config_det_finetune.py --config.train.batch_size=32 --config.train.epochs=20 --config.optimization.learning_rate=3e-5
.
(Optional) Setup tensorboard for training curves with tensorboard --logdir=/tmp/model_dir
. Note: eval on this drill fine-tuning run (with vit-b 640x640 and 20 epochs) should give ~43.5 AP. Exact configurations used to reproduce the COCO fine-tuning results can be found in gs://pix2seq/coco_det_finetune/...
(Optional) Set --run_eagerly=True
for interactive debugging (which will be slower).
Below is the instruction for starting an evaluation job, which monitors the specified directory and perform (continuous) evaluation of the latest and un-evaluated checkpoints. It can be started in parallel to or after the training.
Step 1: check config_det_finetune.py and update if necessary, such as encoder_variant
, image_size
. Set checkpoint_dir
if the checkpoints to evaluate are not in model_dir
(e.g., for evaluating our provided fine-tuning checkpoints).
Step 2: run python3 run.py --mode=eval --model_dir=/tmp/model_dir --config=configs/config_det_finetune.py --config.dataset.coco_annotations_dir=/path/to/annotations --config.eval.batch_size=40
.
(Optional) Setup tensorboard for eval curves and detection visualizations with tensorboard --logdir=/tmp/model_dir
.
In configs/config_multi_task.py
uncomment the line with checkpoint_dir=get_multi_task_checkpoint_dir(...)
.
To evaluate for image size 1024x1024
update image_size
in the config.
config=configs/config_multi_task.py:object_detection@coco/2017_object_detection,vit-b
model_dir=/tmp/pix2seq_eval_det
# Path to save the detected boxes for evaluating other tasks.
boxes_json_path=$model_dir/boxes.json
python3 run.py --config=$config --model_dir=$model_dir --mode=eval --config.task.eval_outputs_json_path=$boxes_json_path
(Optional) In order to use the detected boxes generated in the previous step for eval of instance segmentation and keypoint detection, they need to be converted to tfrecords using the command below. Alternatively you can use the pre-processed tfrecords that we have provided.
box_tfrecords=/tmp/boxes
python3 data/scripts/merge_coco_json_tfrecord.py --tfrecord_path=gs://pix2seq/multi_task/data/coco/tfrecord/val* --annotation_path=$boxes_json_path --output_dir=$box_tfrecords
config=configs/config_multi_task.py:instance_segmentation@coco/2017_instance_segmentation,vit-b
val_file_pattern=gs://pix2seq/multi_task/data/coco/det_boxes/vit_b_640x640/*.tfrecord
# val_file_pattern=$box_tfrecords/*.tfrecord
# Number of masks to aggregate. Reduce this for faster but lower quality eval.
num_samples=8
model_dir=/tmp/pix2seq_eval_ins
python3 run.py --config=$config --model_dir=$model_dir --mode=eval --config.dataset.val_file_pattern=$val_file_pattern --config.task.ensemble_num_samples=$num_samples
config="configs/config_multi_task.py:keypoint_detection@coco/2017_keypoint_detection,vit-b"
val_file_pattern=gs://pix2seq/multi_task/data/coco/det_boxes/vit_b_640x640/*.tfrecord
# val_file_pattern=$box_tfrecords/*.tfrecord
model_dir=/tmp/pix2seq_eval_key
python3 run.py --config=$config --model_dir=$model_dir --mode=eval --config.dataset.val_file_pattern=$val_file_pattern
config=configs/config_multi_task.py:captioning@coco/2017_captioning,vit-b
model_dir=/tmp/pix2seq_eval_cap
python3 run.py --config=$config --model_dir=$model_dir --mode=eval
For captioning, the generated captions are written to $model_dir/coco_result_{step}_{uuid.uuid4()}.json
. Metrics can be computed using the official coco scripts.
Note: You can run eval on a subset of images by setting --config.eval.steps
.
@article{chen2021pix2seq,
title={Pix2seq: A language modeling framework for object detection},
author={Chen, Ting and Saxena, Saurabh and Li, Lala and Fleet, David J and Hinton, Geoffrey},
journal={arXiv preprint arXiv:2109.10852},
year={2021}
}
@article{chen2022unified,
title={A Unified Sequence Interface for Vision Tasks},
author={Chen, Ting and Saxena, Saurabh and Li, Lala and Lin, Tsung-Yi and Fleet, David J. and Hinton, Geoffrey},
journal={arXiv preprint arXiv:2206.07669},
year={2022}
}
@article{chen2022unified,
title={A generalist framework for panoptic segmentation of images and videos},
author={Chen, Ting and Li, Lala and Saxena, Saurabh and Hinton, Geoffrey and Fleet, David J.},
journal={arXiv preprint arXiv:2210.06366},
year={2022}
}
@article{chen2022analog,
title={Analog bits: Generating discrete data using diffusion models with self-conditioning},
author={Chen, Ting and Zhang, Ruixiang and Hinton, Geoffrey},
journal={arXiv preprint arXiv:2208.04202},
year={2022}
}
@article{jabri2022scalable,
title={Scalable Adaptive Computation for Iterative Generation},
author={Jabri, Allan and Fleet, David J. and Chen, Ting},
journal={arXiv preprint arXiv:2212.11972},
year={2022}
}
Diffusion noise scheduling paper:
@article{chen2023on,
title={On the Importance of Noise Scheduling for Diffusion Models},
author={Chen, Ting},
journal={arXiv preprint arXiv:2301.10972},
year={2023}
}
@article{chen2023fit,
title={FIT: Far-reaching Interleaved Transformers},
author={Chen, Ting and Li, Lala},
journal={arXiv preprint arXiv:2305.12689},
year={2023}
}
This is not an officially supported Google product.