Visual Knowledge and Reasoning with Large Language Models: Classification via Description and Visual Inference via Program Execution for Reasoning (ViperGPT)

[CFAR Outstanding PhD Student Seminar Series]
Visual Knowledge and Reasoning with Large Language Models: Classification via Description and Visual Inference via Program Execution for Reasoning (ViperGPT) by Sachit Menon
14 Jun 2023 | 9.30am (Singapore Time)

While large language models (LLMs) have shown impressive capabilities in a wide range of natural language tasks, how could they help us solve computer vision tasks without being trained on visual data?

In this talk, Sachit Menon from Columbia University will discuss two research works that leverage LLMs for visual tasks: determining *what* information to use for a visual task (visual knowledge) and *how* to use that information (visual reasoning). The first work presents an alternative framework for classification with Vision-Language Models (VLMs), known as classification by description. The VLM checks for descriptive features rather than broad categories, e.g., to find a tiger, look for its stripes, its claws, and more. He will also explain how these descriptive features can be obtained from a language model, achieving better performance with inherent interpretability.
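A minimal sketch of the classification-by-description idea, under stated assumptions: an LLM produces descriptors for each class, a VLM (e.g., CLIP) scores how well each descriptor matches the image, and the image is assigned to the class whose descriptors match best on average. Here the VLM scores are stubbed with a dictionary so the logic is self-contained; the descriptor lists and score values are illustrative, not taken from the paper.

```python
def classify_by_description(descriptor_scores):
    """Pick the class whose descriptors best match the image on average.

    descriptor_scores maps class name -> list of image-descriptor
    similarity scores (one per descriptor, e.g., CLIP cosine similarity).
    The per-descriptor scores are also what makes the decision
    interpretable: you can see *which* features supported the prediction.
    """
    return max(
        descriptor_scores,
        key=lambda cls: sum(descriptor_scores[cls]) / len(descriptor_scores[cls]),
    )

# Hypothetical similarities for one image: the "tiger" descriptors
# (stripes, claws, orange fur) match better than the "zebra" ones.
scores = {
    "tiger": [0.31, 0.28, 0.27],   # stripes, claws, orange fur
    "zebra": [0.29, 0.12, 0.10],   # stripes, mane, black-and-white coat
}
print(classify_by_description(scores))  # -> tiger
```

In a real system the stubbed dictionary would be replaced by VLM image-text similarity scores for prompts like "a photo of a tiger, which has stripes".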

The second work will introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines, producing a result for any visual query. Sachit will conclude his talk by sharing how both approaches require no additional training, obtain state-of-the-art results on various datasets, provide useful intermediate results, and enable much more interpretability than previous models.
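A minimal sketch of the ViperGPT idea: a code-generation model emits a short Python program that composes vision-and-language subroutines, and executing that program answers the visual query. The module names below (`ImagePatch`, `find`, `exists`) and the generated program are illustrative stand-ins rather than ViperGPT's actual API; real subroutines would wrap detectors and VLMs, and the program would come from an LLM rather than being hard-coded.

```python
class ImagePatch:
    """Stub for an image region exposing vision subroutines."""

    def __init__(self, objects):
        self.objects = objects  # stand-in for detection results

    def find(self, name):
        # In a real system this would call an open-vocabulary detector
        # and return cropped patches for each matching object.
        return [ImagePatch([o]) for o in self.objects if o == name]

    def exists(self, name):
        return len(self.find(name)) > 0


# Program text as a code-generation model might emit it for the
# (hypothetical) query "How many mugs are in the image?"
generated_code = """
def execute_query(image):
    mugs = image.find("mug")
    return len(mugs)
"""

# Execute the generated program against the stubbed image. The
# intermediate value `mugs` is inspectable, which is the source of the
# interpretability the talk mentions.
namespace = {"ImagePatch": ImagePatch}
exec(generated_code, namespace)
image = ImagePatch(["mug", "laptop", "mug"])
print(namespace["execute_query"](image))  # -> 2
```

Because the answer is produced by running ordinary code, each intermediate result (e.g., the list of detected mugs) can be inspected directly, and no task-specific training is needed.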

SPEAKER
Sachit Menon
Ph.D. Student
Columbia University
Sachit Menon is a PhD student in Computer Science at Columbia University. Through his research, he hopes to develop new ways to learn or utilise models at scale and is particularly interested in representation learning, generative modelling and self-supervised methods, as well as their intersection. Recently, Sachit has been particularly interested in the potential for language to aid vision tasks. His doctoral work is supported by the Columbia Presidential Fellowship and the NSF Graduate Research Fellowship.