Hey there, folks! Today, we’ve got some exciting news to share with you about the latest developments in language models and how they’re being adapted to handle visual information seeking tasks. We’ve seen some impressive progress in the world of large language models (LLMs) when it comes to tasks like image captioning and visual question answering (VQA), but there’s still some room for improvement, especially in situations where external knowledge is needed to answer those tricky questions.
Introducing “AVIS: Autonomous Visual Information Seeking with Large Language Models”! This cutting-edge method combines LLMs with a range of tools to tackle visual information seeking tasks like never before. We’ve got three types of tools in our arsenal: computer vision tools to extract visual information from images, a web search tool to retrieve open-world knowledge and facts, and an image search tool to gather relevant information from the metadata associated with visually similar images.
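To make that tool set concrete, here’s a minimal sketch in Python. Everything in it is an assumption for illustration: the `Tool` dataclass, the function names, and the stubbed return values are ours, not the actual AVIS implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool registry; names and signatures are illustrative,
# not the actual AVIS API.
@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]  # query in, text observation out

def object_detection(region: str) -> str:
    """Computer vision tool: stands in for detectors/captioners/OCR."""
    return f"<visual information extracted from {region}>"

def web_search(query: str) -> str:
    """Web search tool: stands in for open-world knowledge retrieval."""
    return f"<search results for {query!r}>"

def image_search(region: str) -> str:
    """Image search tool: stands in for metadata of visually similar images."""
    return f"<metadata of images similar to {region}>"

TOOLS = [
    Tool("object_detection", "extract visual information from the image", object_detection),
    Tool("web_search", "retrieve open-world knowledge and facts", web_search),
    Tool("image_search", "gather metadata from visually similar images", image_search),
]
```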
So, how does AVIS work? Well, we’ve got an LLM-powered planner calling the shots, deciding which tool to use and what query to send at each step of the process. And we’ve also got an LLM-powered reasoner that analyzes the tools’ outputs and extracts the key information. It’s like having your very own AI assistant helping you navigate the complexity of visual information seeking. Plus, we’ve got a working memory component that stores the information gathered along the way.
Now, you might be wondering how AVIS stacks up against previous work. Well, there have been other attempts to incorporate tools into LLMs for multimodal inputs, but they often struggle in real-world scenarios. And sure, there are autonomous agents out there that use LLMs, but they place no restrictions on tool usage, which can lead to inefficient, meandering searches. That’s where AVIS shines, my friends: we ran a user study to learn how humans make these decisions, and we use those cues to guide and constrain the LLM’s decision-making, making it far more effective.
Speaking of the user study, we wanted to understand how humans make decisions when using external tools. We equipped users with the same set of tools as AVIS and recorded the sequences of actions they took. From these sessions we constructed a transition graph that defines distinct states and restricts the actions available at each state. We also reuse the human decisions themselves as examples to guide our planner and reasoner.
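Here’s one way to picture that transition graph in code: each state maps to the set of tools worth calling next, and the planner only ever chooses from that set. The states and edges below are a toy version we made up to show the mechanism; the graph AVIS actually derives from the user study is richer.

```python
# Toy transition graph: each state maps to the actions a human was
# observed to take from that state. Invented for illustration only.
TRANSITIONS: dict[str, set[str]] = {
    "START":           {"object_detection", "image_search"},
    "HAS_VISUAL_INFO": {"web_search", "image_search"},
    "HAS_ENTITY":      {"web_search"},
    "ANSWER_READY":    set(),  # terminal: the reasoner produces the answer
}

def allowed_actions(state: str, already_taken: set[str]) -> set[str]:
    """Actions the planner may choose: graph-permitted minus already-used."""
    return TRANSITIONS[state] - already_taken
```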
Now, let’s talk about the general framework of AVIS. We’ve got three primary components at play here. First, our planner determines the next action to take, including which tool to use and what query to send. Then, we’ve got a working memory that stores information from API executions. And lastly, we’ve got a reasoner that analyzes tool outputs and decides if more data retrieval is needed or if we’re ready to provide a final response.
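Putting the three components together gives a simple control loop. The sketch below reuses the `TOOLS` registry and `allowed_actions` helper from the earlier sketches, and swaps in trivial stubs (`plan`, `reason`) where the real system would call the LLM; it shows the shape of the loop, not the actual prompts or parsing.

```python
import random

def plan(memory: list[str], actions: set[str]) -> tuple[str, str]:
    """Stub for the LLM-powered planner: the real one prompts the LLM with
    the working memory plus in-context examples and parses a structured reply."""
    return random.choice(sorted(actions)), memory[0]

def reason(memory: list[str]) -> tuple[str | None, str]:
    """Stub for the LLM-powered reasoner: the real one judges whether the
    latest observation is useful and whether we can answer yet."""
    if len(memory) >= 3:  # pretend two tool calls are always enough
        return "<final answer>", "ANSWER_READY"
    return None, "HAS_VISUAL_INFO"

def run_tool(name: str, query: str) -> str:
    tool = next(t for t in TOOLS if t.name == name)
    return tool.run(query)

def answer_question(question: str, max_steps: int = 8) -> str:
    memory: list[str] = [f"Question: {question}"]    # working memory
    state, taken = "START", set()
    for _ in range(max_steps):
        actions = allowed_actions(state, taken)
        if not actions:                              # graph exhausted
            break
        tool_name, query = plan(memory, actions)     # planner decides
        observation = run_tool(tool_name, query)     # execute the API call
        memory.append(f"{tool_name}({query}) -> {observation}")
        taken.add(tool_name)
        answer, state = reason(memory)               # reasoner evaluates
        if answer is not None:
            return answer
    return "unable to answer within the step budget"

print(answer_question("What year was this building inaugurated?"))
```

Note how the working memory is just an append-only log that both LLM-powered components read from; that keeps the planner and reasoner themselves stateless between steps.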
The planner helps us navigate through a potentially large action space by referring to the transition graph and excluding actions that have already been taken. It also collects in-context examples from previous human decisions and uses them to formulate prompts for the LLM. The LLM then returns a structured answer, determining the next tool to activate and the query to send. This iterative process of using the planner and reasoner gradually leads us to the answer we seek.
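And here’s roughly what that prompt assembly could look like. The example store and the prompt template are invented for illustration; the actual prompts and structured output format used by AVIS will differ.

```python
# Hypothetical store of human demonstrations from the user study,
# keyed by state. The format is invented for illustration.
HUMAN_EXAMPLES: dict[str, list[str]] = {
    "START": [
        "Q: 'What year was this building built?' -> image_search(cropped building)",
    ],
    "HAS_VISUAL_INFO": [
        "Entity: 'Sydney Opera House' -> web_search('Sydney Opera House opening year')",
    ],
}

def build_planner_prompt(memory: list[str], state: str, actions: set[str]) -> str:
    """Assemble the planner prompt: relevant human demonstrations, the
    progress so far, and the menu of still-legal actions, then ask the
    LLM for a structured (tool, query) reply."""
    examples = "\n".join(HUMAN_EXAMPLES.get(state, []))
    history = "\n".join(memory)
    menu = ", ".join(sorted(actions))
    return (
        f"Examples of good decisions:\n{examples}\n\n"
        f"Progress so far:\n{history}\n\n"
        f"Choose exactly one of: {menu}.\n"
        "Reply in the form: TOOL=<name>; QUERY=<text>"
    )
```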
Now, let’s talk results! We evaluated AVIS on the Infoseek and OK-VQA datasets and compared it to previous baselines. And let me tell you, AVIS came out on top! It achieved strong accuracy on the challenging unseen-entity split of the Infoseek dataset, even without fine-tuning, and on OK-VQA, AVIS with few-shot in-context examples outperformed most previous work.
In conclusion, AVIS is a game-changer when it comes to visual information seeking tasks. By combining LLMs with a range of tools and guided decision-making, we’ve achieved state-of-the-art results. It’s like having your very own AI assistant, working tirelessly to find the answers you need. So, get ready to take your visual information seeking to the next level with AVIS!
And that’s it for today, folks. Thanks for tuning in, and be sure to check out AVIS for all your knowledge-intensive visual questions. Stay curious out there!