Large language models can't plan, even if they write fancy essays

This article is part of our coverage of the latest AI research.

Large language models such as GPT-3 have grown to the point where it is difficult to measure the limits of their capabilities. When you have a very large neural network that can produce articles, write software code, and engage in conversations about feelings and life, you should expect it to be able to reason about tasks and plans like humans do, right?

Wrong. A study by researchers at Arizona State University, Tempe, showed that when it comes to planning and methodical thinking, LLMs perform very poorly, and suffer from many of the same failures observed in today’s deep learning systems.


Interestingly, the study found that, although very large LLMs such as GPT-3 and PaLM pass many tests designed to evaluate the reasoning abilities of artificial intelligence systems, they do so because those benchmarks are too simplistic or too flawed and can be “gamed” through statistical shortcuts, something deep learning systems are very good at.

With new LLMs being announced regularly, the authors propose new benchmarks to test the planning and reasoning capabilities of AI systems. The researchers hope their findings can help direct AI research toward artificial intelligence systems that can handle what is known as “System 2 thinking” tasks.

The illusion of planning and reasoning

“Late last year, we evaluated GPT-3’s ability to extract plans from text descriptions—a task that had been attempted with earlier special-purpose methods—and found that off-the-shelf GPT-3 did quite well compared to those special-purpose methods,” Subbarao Kambhampati, professor at Arizona State University and co-author of the study, told TechTalks. “That certainly made us wonder what ‘emergent capabilities’—if any—GPT-3 has for performing the simplest planning problems (e.g., making plans in toy domains). We soon found that GPT-3 was quite poor on anecdotal tests.”

However, one interesting fact is that GPT-3 and other large language models perform very well on benchmarks designed for common-sense reasoning, logical reasoning, and ethical reasoning, skills previously considered off-limits for deep learning systems. An earlier study by Kambhampati’s group at Arizona State University demonstrated the effectiveness of large language models in extracting plans from text descriptions. Other recent studies include one showing that LLMs can perform zero-shot reasoning if provided with special trigger phrases.

However, Kambhampati believes the term “reasoning” is often used loosely in these benchmarks and studies. What LLMs actually do is create a semblance of planning and reasoning through pattern recognition.

“Most benchmarks rely on shallow reasoning (one or two steps), as well as tasks that sometimes have no real underlying ground truth (e.g., getting an LLM to reason about an ethical dilemma),” he says. “It’s possible for a pure pattern-completion engine with no reasoning ability to still perform well on some such benchmarks. After all, while System 2 reasoning capabilities may sometimes get compiled down to System 1, it’s also the case that System 1 ‘reasoning abilities’ may simply be reflexive responses to patterns the system has seen in its training data, without anything resembling actual reasoning.”

System 1 and System 2 thinking

System 1 and System 2 thinking were popularized by psychologist Daniel Kahneman in his book Thinking, Fast and Slow. The first is the fast, reflexive, automatic type of thinking and acting we do constantly, such as walking, brushing our teeth, tying our shoes, or driving in familiar areas. In fact, most of our everyday speech is handled by System 1.

System 2, on the other hand, is the slower thinking mode we use for tasks that require methodical planning and analysis. We use System 2 to solve calculus equations, play chess, design software, plan trips, solve puzzles, etc.

But the line between System 1 and System 2 is not clear-cut. Take driving, for example. When you are learning to drive, you must concentrate fully on coordinating your muscles to control the gears, steering wheel, and pedals while also keeping an eye on the road and the side and rear mirrors. This is clearly System 2 at work. It consumes a lot of energy, requires your full attention, and is slow. But as you gradually repeat the procedure, you learn to drive without thinking about it. The driving task shifts to your System 1, allowing you to do it without straining your mind. One criterion for a task having been integrated into System 1 is the ability to perform it subconsciously while focusing on something else (for example, you can tie your shoes and talk at the same time, brush your teeth and read, or drive and talk).

Even many very complex tasks that remain in the domain of System 2 eventually become partially integrated into System 1. For example, professional chess players rely heavily on pattern recognition to speed up their decision-making. You can see similar examples in mathematics and programming, where after doing something repeatedly, tasks that previously required careful thought come to you automatically.

A similar phenomenon may occur in deep learning systems that have been exposed to very large data sets. They may have learned to perform the simple pattern recognition phase of complex reasoning tasks.

“Planning requires chaining together reasoning steps to come up with a plan, and solid ground truths about correctness can be established,” says Kambhampati.

A new benchmark to test planning in LLMs

“However, given the excitement surrounding the hidden/emergent abilities of LLMs, we thought it would be more constructive to develop a benchmark that provides a variety of planning/reasoning tasks, which can serve as a yardstick as people improve LLMs through fine-tuning and other approaches to adapt/improve their performance on reasoning tasks. This is what we ended up doing,” said Kambhampati.

The team developed their benchmark around domains used in the International Planning Competition (IPC). The framework consists of several tasks that evaluate different aspects of reasoning. For example, some tasks evaluate an LLM’s capacity to create a valid plan to achieve a specific goal, while others test whether the generated plan is optimal. Other tests include reasoning about the results of plan execution, recognizing whether different text descriptions refer to the same goal, reusing parts of one plan in another, reshuffling plans, and more.

To run the tests, the team used Blocks World, a classic problem domain that revolves around stacking a set of different blocks in a specific order. Every problem has an initial state, a goal, and a set of permitted actions.
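To make the setup concrete, here is a minimal sketch of what a Blocks World instance and a plan-validity check might look like in Python. This is for intuition only and is not the encoding or checker used in the paper.

```python
# A minimal, illustrative Blocks World instance and plan check (not the paper's encoding).

from typing import Dict, List, Tuple

State = Dict[str, str]  # maps each block to what it rests on ("table" or another block)

initial_state: State = {"A": "table", "B": "A", "C": "table"}  # B sits on A; A and C are on the table
goal_state: State = {"A": "B", "B": "C", "C": "table"}         # target stack: A on B on C

def is_clear(state: State, block: str) -> bool:
    """A block is clear if no other block rests on top of it."""
    return all(support != block for support in state.values())

def move(state: State, block: str, dest: str) -> State:
    """The single permitted action: move a clear block onto the table or onto another clear block."""
    if not is_clear(state, block):
        raise ValueError(f"{block} is not clear")
    if dest != "table" and (dest not in state or not is_clear(state, dest)):
        raise ValueError(f"cannot place {block} on {dest}")
    new_state = dict(state)
    new_state[block] = dest
    return new_state

def plan_is_valid(state: State, plan: List[Tuple[str, str]], goal: State) -> bool:
    """Check whether a sequence of (block, destination) moves legally reaches the goal."""
    for block, dest in plan:
        state = move(state, block, dest)
    return state == goal

# A hand-written plan for the instance above: put B on C, then A on B.
candidate_plan = [("B", "C"), ("A", "B")]
print(plan_is_valid(initial_state, candidate_plan, goal_state))  # True
```

The point of a planning domain like this is that validity and optimality can be checked mechanically, which is what makes the evaluation unambiguous.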

“The benchmark itself is extensible and is meant to include tests from multiple IPC domains,” said Kambhampati. “We used the Blocks World example to illustrate the different tasks. Each of these tasks (e.g., plan generation, goal randomization, etc.) can also be posed in other IPC domains.”

The benchmark that Kambhampati and his colleagues developed uses few-shot learning, where the prompt given to the machine learning model includes a solved example along with the main problem to be solved.
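As a rough illustration of that setup, a one-shot prompt could be assembled as below. The wording and tags are hypothetical; the paper’s actual prompt format may differ.

```python
# A rough sketch of a one-shot planning prompt of the kind described above.
# The wording and [STATEMENT]/[PLAN] tags are hypothetical, not the paper's exact format.

solved_example = """\
[STATEMENT]
Initial state: block B is on block A; blocks A and C are on the table; B and C are clear.
Goal: block A is on block B, and block B is on block C.
[PLAN]
1. Move block B onto block C.
2. Move block A onto block B.
[PLAN END]"""

new_problem = """\
[STATEMENT]
Initial state: block C is on block B; blocks A and B are on the table; A and C are clear.
Goal: block B is on block A.
[PLAN]"""

# The model is expected to complete the prompt with a valid plan for the new problem,
# which can then be checked programmatically against the domain rules.
prompt = solved_example + "\n\n" + new_problem
print(prompt)
```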

Unlike in other benchmarks, the problem descriptions in this benchmark are long and detailed. Solving them requires concentration and methodical planning and cannot be shortcut through pattern recognition. Even a human who wants to solve them has to think through each problem carefully, take notes, perhaps sketch the state, and plan a step-by-step solution.

“Reasoning is a System 2 task in general. The collective delusion has been to look at the kinds of reasoning benchmarks that can be addressed through compilation to System 1 (e.g., ‘the answer to this ethical dilemma, by pattern completion, is this’) as opposed to ones that actually require reasoning about the task at hand,” said Kambhampati.

Large language models are bad at planning

The researchers tested their framework on Davinci, the largest version of GPT-3. Their experiments showed that GPT-3 had mediocre performance on several types of planning tasks but performed very poorly in areas such as plan reuse, plan generalization, optimal planning, and re-planning.

“The initial studies we looked at basically showed that LLMs were terrible at anything that would be considered a planning task—including plan generation, optimal plan generation, plan reuse, or re-planning,” Kambhampati said. “They did better on planning-related tasks that didn’t require a chain of reasoning—such as goal randomization.”

In the future, the researchers will add test cases based on other IPC domains and establish a baseline for human performance on the same benchmark.

“We are also curious as to whether other variants of the LLM are better at this benchmark,” said Kambhampati.

Kambhampati stressed that the aim of the project was to establish the benchmark and give an idea of where the baseline currently stands. The researchers hope their work opens a new window for developing planning and reasoning capabilities in today’s AI systems. For example, one direction they propose is to evaluate how effective fine-tuning LLMs for reasoning and planning in a particular domain can be. The team already has preliminary results on the instruction-following variant of GPT-3, which appears to do slightly better on the easy tasks, although it too remains around the 5 percent level on the actual planning tasks, Kambhampati said.

Kambhampati also believes that studying and acquiring models of the world will be an important step for any AI system that can reason and plan. Other scientists, including deep learning pioneer Yann LeCun, have made similar suggestions.

“If we agree that reasoning is part of intelligence, and we want to claim that LLMs reason, we certainly need benchmarks for planning,” Kambhampati said. “Instead of taking a negative magisterial stance, we are providing benchmarks, so that people who believe reasoning can emerge from LLMs even without specialized mechanisms such as world models and reasoning about dynamics can use them to support their point of view.”

This article was originally published by Ben Dickson at TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the evil side of technology, the darker implications of new technology, and what we need to be aware of. You can read the original article here.


