AI models can process text and images with impressive fluency, and sometimes they can even work through complex problems without tripping up too badly.
OpenAI, for instance, claimed its GPT-4 model scored 700 out of 800 on the SAT math exam. Not every claim has held up, though: a paper asserting that GPT-4 could pass the coursework for a computer science degree at MIT had to be withdrawn.
Now a group of ten researchers from UCLA, the University of Washington, and Microsoft Research has put together a benchmark called MathVista to measure how well large language models and multimodal models cope with problem solving that comes with a visual twist.
Their argument is that the mathematical reasoning abilities of foundation models in visual contexts have not been systematically examined, and MathVista is meant to fill that gap by putting the models side by side on the same set of visually grounded problems.
Whether these models can reason correctly about what they see matters: if an AI system is ever going to be trusted to drive a car without running anyone over, it had better be able to handle visual challenges reliably.
MathVista comprises 6,141 examples drawn from a range of existing multimodal datasets plus three new ones the team created: IQTest, FunctionQA, and PaperQA. The benchmark covers algebraic, arithmetic, geometric, logical, numeric, scientific, and statistical reasoning, across tasks such as figure question answering, geometry problem solving, math word problems, textbook question answering, and visual question answering.
Screenshot of a MathVista challenge question
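For readers who want to poke at the benchmark themselves, here is a minimal sketch of how one might inspect its composition by task type. It assumes the dataset is published on the Hugging Face Hub; the "AI4Math/MathVista" identifier, the "testmini" split name, and the metadata "task" field come from the project's public release rather than from this article, so check the dataset card for the actual schema.

```python
# Minimal sketch: tally MathVista examples by task type.
# Assumption: the benchmark is on the Hugging Face Hub under
# "AI4Math/MathVista" with a "testmini" split and a metadata dict
# that includes a "task" key -- verify against the dataset card.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("AI4Math/MathVista", split="testmini")

task_counts = Counter(ex["metadata"]["task"] for ex in ds)
for task, count in task_counts.most_common():
    print(f"{task}: {count}")
```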
The team ran a variety of foundation models through the benchmark, including ChatGPT, GPT-4, and Claude-2, alongside a selection of proprietary and open-source multimodal models. To keep the comparison honest, they also had human workers from Amazon Mechanical Turk take the test, and they measured a random-guessing baseline.
The good news is that every model beat random chance, which is no great surprise given that most of the questions are multiple choice. More notably, OpenAI's GPT-4V actually outperformed the human baseline in some areas, including algebraic reasoning and complex visual challenges involving tables and function plots.
Even so, GPT-4V answered only 49.9 percent of the questions correctly overall, well short of the 60.3 percent scored by the Mechanical Turk workers. There is clearly room for improvement.
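To put those percentages in context, here is a rough sketch of how a random-chance baseline and an overall accuracy score might be computed on a benchmark that mixes multiple-choice and free-form questions. The record format below is hypothetical, used only for illustration, and is not MathVista's actual schema or the researchers' evaluation code.

```python
# Rough sketch: random-chance baseline vs. exact-match accuracy on a
# mixed multiple-choice / free-form benchmark. Record format is
# hypothetical, not MathVista's actual schema.

def random_baseline(questions):
    """Expected accuracy of uniform guessing: 1/len(choices) for a
    multiple-choice item, effectively zero for a free-form answer."""
    expected = sum(
        1.0 / len(q["choices"]) if q.get("choices") else 0.0
        for q in questions
    )
    return expected / len(questions)

def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

questions = [
    {"choices": ["A", "B", "C", "D"], "answer": "B"},  # multiple choice
    {"choices": None, "answer": "42"},                 # free-form numeric
]
print(f"random-guess baseline: {random_baseline(questions):.1%}")  # ~12.5%
```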
So AI models are making real progress on visually grounded math, but they still have some way to go before they catch up with humans. Plenty of room left to push those boundaries.