Language models such as OpenAI’s GPT-3, which leverage AI techniques and large amounts of data to learn skills like writing text, have received increasing attention from the enterprise in recent years. From a qualitative standpoint, the results are good: GPT-3 and models inspired by it can write emails, summarize text, and even generate deep learning code in Python. But some experts are skeptical that the size of these models, and of their training datasets, actually translates into better performance.

Maria Antoniak, a natural language processing researcher and data scientist at Cornell University, says it’s an open question whether larger models are the right approach in natural language. While some of today’s best benchmark scores come from large datasets and models, it’s uncertain whether the pattern of dumping enormous amounts of data into models will pay off.

“The current structure of the field is task-focused, where the community gathers together to try to solve specific problems on specific datasets,” she told VentureBeat in a previous interview. “These tasks are usually very structured and can have their own weaknesses, so while they help our field move forward in some ways, they can also constrain us. Large models perform well on these tasks, but whether these tasks can ultimately lead us to any true language understanding is up for debate.”

Parameter count

Conventional wisdom once held that the more parameters a model had, the more complex tasks it could accomplish. In machine learning, parameters are internal configuration variables that a model uses when making predictions. Their values essentially define the skill of a model on a problem.
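As a concrete illustration, here is a minimal PyTorch sketch (ours, not drawn from any of the studies discussed here) that builds a toy two-layer network and counts its trainable parameters; every weight and bias tallied this way is one of the “parameters” behind headline figures like GPT-3’s 175 billion.

    # Minimal sketch: what "parameters" means in practice (illustrative only).
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(768, 3072),  # weights: 768 * 3072, biases: 3072
        nn.ReLU(),
        nn.Linear(3072, 768),  # weights: 3072 * 768, biases: 768
    )

    # Count every trainable value; GPT-3 has roughly 175 billion of these.
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"{n_params:,} trainable parameters")  # 4,722,432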

But a growing body of research casts doubt on this notion. This week, a team of Google researchers published a study claiming that a model far smaller than GPT-3, Finetuned Language Net (FLAN), bests GPT-3 “by a large margin” on a number of challenging benchmarks. FLAN, which has 137 billion parameters to GPT-3’s 175 billion, outperformed zero-shot GPT-3 on 19 of the 25 tasks the researchers evaluated and even surpassed few-shot GPT-3 on 10 of them.

FLAN differs from GPT-3 in that it’s fine-tuned on 60 natural language processing tasks expressed via instructions like “Is the sentiment of this movie review positive or negative?” and “Translate ‘how are you’ into Chinese.” According to the researchers, this “instruction tuning” improves the model’s ability to respond to natural language prompts by “teaching” it to perform tasks described via the instructions.
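To make instruction tuning concrete, the hypothetical sketch below (ours; FLAN’s actual templates and code differ) shows how a single labeled example can be rephrased into several natural language instructions before fine-tuning.

    # Hypothetical sketch of instruction tuning data prep (not FLAN's code):
    # one labeled sentiment example becomes several instruction-style prompts.
    TEMPLATES = [
        "Is the sentiment of this movie review positive or negative?\n{text}",
        "{text}\nDid the reviewer like the film? Answer positive or negative.",
    ]

    def to_instruction_examples(text, label):
        """Expand one (text, label) pair into instruction/answer pairs."""
        return [(template.format(text=text), label) for template in TEMPLATES]

    for prompt, answer in to_instruction_examples(
            "A gripping, beautifully shot film.", "positive"):
        print(prompt, "->", answer)

Fine-tuning on dozens of tasks phrased this way is, per the researchers, what teaches the model to follow instructions for tasks it has never seen.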

After training FLAN on a collection of webpages, programming languages, dialogs, and Wikipedia articles, the researchers found that the model could learn to follow instructions for tasks it hadn’t been explicitly trained on. Although the training data wasn’t as “clean” as GPT-3’s training set, FLAN still managed to surpass GPT-3 on tasks like question answering and summarizing long stories.

“The performance of FLAN compares favorably against both zero-shot and few-shot GPT-3, signaling the potential ability for models at scale to follow instructions,” the researchers wrote. “We hope that our paper will spur further research on zero-shot learning and using labeled data to improve language models.”

Dataset difficulties

As the Google study alluded to, the problem with large language models may lie in the data used to train them, as well as in common training techniques. For example, scientists at the Institute for Artificial Intelligence at the Medical University of Vienna, Austria, found that GPT-3 underperforms in domains like biomedicine compared with smaller, less architecturally complex but carefully fine-tuned models. Even when pretrained on biomedical data, large language models struggle to answer questions, classify text, and identify relationships on par with highly tuned models “orders of magnitude” smaller, according to the researchers.

“Large language models [can’t] achieve performance scores remotely competitive with those of a language model fine-tuned on the whole training data,” the Medical University of Vienna researchers wrote. “The experimental results suggest that, in the biomedical natural language processing domain, there is still much room for development of multitask language models that can effectively transfer knowledge to new tasks where a small amount of training data is available.”
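For readers unfamiliar with the approach, this is roughly what fine-tuning a small pretrained model on a domain task looks like. The sketch below is a generic Hugging Face Transformers example of ours, not the Vienna group’s setup; the model name and the labeled example are placeholders.

    # Illustrative sketch: one fine-tuning step on a domain classification task.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "distilbert-base-uncased"  # placeholder for a small biomedical model
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

    # One labeled example: does the sentence assert a drug-disease relation?
    batch = tokenizer(["Aspirin reduces the risk of stroke."], return_tensors="pt")
    labels = torch.tensor([1])

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    loss = model(**batch, labels=labels).loss  # standard classification loss
    loss.backward()   # in practice, repeat over the whole labeled training set
    optimizer.step()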

It could come down to data quality. A separate paper by Leo Gao, a data scientist at the community-driven project EleutherAI, implies that the way data in a training corpus is curated can significantly affect the performance of large language models. While it’s widely believed that using a classifier to filter out data from “low-quality sources” like Common Crawl improves training data quality, over-filtering can actually decrease GPT-like language models’ performance. When data is optimized too strongly for the classifier’s score, what remains becomes biased toward whatever satisfies the classifier, producing a less rich, less diverse dataset.

“While intuitively it may seem like the more data is discarded the higher quality the remaining data will be, we find that this is not always the case with shallow classifier-based filtering. Instead, we find that filtering improves downstream task performance up to a point, but then decreases performance again as the filtering becomes too aggressive,” Gao wrote. “[We] speculate that this is due to Goodhart’s law, as the misalignment between proxy and true objective becomes more significant with increased optimization pressure.”
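Here is a minimal sketch of the kind of shallow classifier-based filtering Gao describes (our illustration, not EleutherAI’s pipeline): documents are scored by a simple quality classifier and kept only above a threshold, so raising that threshold trades away diversity for classifier-pleasing text.

    # Illustrative shallow quality filter (not EleutherAI's actual code).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    docs = ["The committee published its findings today.",
            "CLICK here!!! win $$$ now",
            "Researchers described a new training method.",
            "buy cheap followers best price!!!"]
    labels = [1, 0, 1, 0]  # 1 = "high quality" reference text, 0 = junk

    vectorizer = TfidfVectorizer()
    classifier = LogisticRegression().fit(vectorizer.fit_transform(docs), labels)

    def keep(document, threshold):
        """Keep a crawled document only if its quality score clears the bar."""
        score = classifier.predict_proba(vectorizer.transform([document]))[0, 1]
        return score >= threshold

    crawl = ["A study of language models was released.", "FREE $$$ click now!!!"]
    for threshold in (0.3, 0.9):
        print(threshold, [d for d in crawl if keep(d, threshold)])
    # Gao's finding: past some threshold, filtering this aggressively discards
    # diverse but useful text (Goodhart's law), hurting downstream performance.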

Looking ahead

Smaller, more carefully tuned models could solve some of the other problems associated with large language models, like their environmental impact. In June 2019, researchers at the University of Massachusetts Amherst released a report estimating that the power required to train and search a certain model entails the emission of roughly 626,000 pounds of carbon dioxide, nearly five times the lifetime emissions of the average U.S. car.

GPT-3 consumed 1,287 megawatt-hours of energy during training and produced 552 metric tons of carbon dioxide emissions, a Google study found. By contrast, FLAN consumed 451 megawatt-hours and produced 26 metric tons.
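As a quick sanity check on those figures (our arithmetic, not the study’s), dividing each run’s reported emissions by its reported energy use gives its implied carbon intensity:

    # Implied carbon intensity from the reported figures (our arithmetic).
    gpt3_mwh, gpt3_tco2 = 1287, 552
    flan_mwh, flan_tco2 = 451, 26

    print(gpt3_tco2 / gpt3_mwh)  # ~0.43 metric tons of CO2 per MWh
    print(flan_tco2 / flan_mwh)  # ~0.06 metric tons of CO2 per MWh

The roughly sevenfold gap suggests the two models were trained on very differently powered infrastructure, so the emissions numbers reflect both the amount of compute and where the training ran.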

As the coauthors of a recent MIT paper wrote, training requirements will become prohibitively costly from a hardware, environmental, and monetary standpoint if the trend toward ever larger language models continues. Hitting performance targets economically will require more efficient hardware, more efficient algorithms, or other improvements such that the gain is a net positive.
