GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

Preprint (arXiv 2311.07562)

VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

The Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS D&B 2023)

LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

The Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)

Multimodal Procedural Planning via Dual Text-Image Prompting

Preprint (arXiv 2305.01795)

Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text

The Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track (NeurIPS D&B 2023)

OpenFlamingo: An Open-Source Framework for Training Vision-Language Models with In-Context Learning

Preprint (arXiv 2308.01390)

Collaborative Generative AI: Integrating GPT-k for Efficient Editing in Text-to-Image Generation

The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023, Short)

Large Language Models Are Implicitly Topic Models: Explaining and Finding Good Demonstrations for In-Context Learning

The Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)

CLIP also Understands Text: Prompting CLIP for Phrase Understanding

Preprint (arXiv 2210.05836)

Text Infilling

Preprint (arXiv 1901.00158)