UMIACS Team Aims to Boost High-Performance Computing Software Development Using AI
In this era of artificial intelligence (AI), software developers increasingly rely on large language models (LLMs) like ChatGPT and GitHub Copilot to streamline their coding processes. Companies like Meta have reported that nearly all their developers use an internal LLM to enhance productivity, highlighting the essential role AI now plays in software development.
Despite these huge strides, AI still faces limitations in high-performance computing (HPC), which involves executing complex parallel programs and processing massive datasets across hundreds to thousands of GPUs.
Two University of Maryland researchers—Abhinav Bhatele (right in photo), an associate professor of computer science, and Tom Goldstein (left in photo), a professor of computer science—are part of a $7 million multi-institutional project supported by the Department of Energy (DOE) to address these limitations by developing an AI-assisted HPC software ecosystem.
The researchers are collaborating with scientists from the Lawrence Livermore National Laboratory (LLNL), Oak Ridge National Laboratory (ORNL), and Northeastern University on the project.
“HPC is all about using multiple computers at the same time to perform large-scale parallel computation,” explains Goldstein, who is director of the UMD Center for Machine Learning.
He emphasizes the complexity of the field, saying that while it’s relatively simple to write code for one computer, writing code that needs to run on 1,000 computers simultaneously is a feat.
“All processors in the system must work in unison—timing their computation and communication in a sort of symphony—to ensure everything works,” Goldstein says.
This complexity gives rise to several challenges, Bhatele notes. Although existing LLMs help developers write code for simpler programs, they aren’t nearly as effective when it comes to more complicated tasks. This is especially true for parallel programming—a method of running multiple calculations simultaneously to solve problems faster.
“If you ask an LLM to write code for sequential programs, which run one step after another, they’ll do just fine,” he says. “But if you ask them to write code for parallel programs, that’s where LLMs can fail, creating code that’s confusing or that just doesn’t work.”
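To make the distinction concrete, here is a minimal, hypothetical sketch (an illustration, not code from the project) of the same computation written sequentially and then in parallel. Real HPC programs coordinate work across many nodes with frameworks such as MPI; this sketch uses Python's standard multiprocessing module on a single machine for simplicity.

```python
from multiprocessing import Pool

def square(x):
    # Toy per-element computation standing in for real scientific work.
    return x * x

def sequential_sum(n):
    # Sequential version: one process handles every element, one after another.
    return sum(square(x) for x in range(n))

def parallel_sum(n, workers=4):
    # Parallel version: the work is split across several worker processes, and
    # the partial results are then combined (the "communication" step).
    with Pool(workers) as pool:
        return sum(pool.map(square, range(n)))

if __name__ == "__main__":
    assert sequential_sum(1_000) == parallel_sum(1_000)
    print("sequential and parallel results match")
```

Even in this toy case, the parallel version adds decisions about how to split the work and how to combine the results; at the scale of thousands of processors, those coordination details are exactly where machine-generated code tends to go wrong.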
Bhatele and Goldstein, who both have appointments in the University of Maryland Institute for Advanced Computer Studies (UMIACS), will receive approximately $1 million of the DOE funding over three years to address these limitations. They aim to enhance LLMs so that they can perform HPC tasks effectively, enabling developers to boost their productivity by at least 10 times.
“Basically, we just want to help make developers’ lives a little easier,” says Bhatele.
If these models can help developers write faster, better code, they could save a lot of time at almost no cost, he adds. With the DOE planning to publish the project’s deliverables as free and open-source tools, the team’s work has the potential to benefit HPC developers globally.
To achieve these goals, the researchers must target existing LLMs’ blind spots. Because language models struggle to process large amounts of code, Goldstein’s research aims to help them handle massive amounts of information.
To do so, he plans either to feed the models potentially millions of words of code so that they gain a comprehensive understanding of the codebase, or to help them retrieve only the most relevant parts of the code, allowing them to use the codebase without processing it all at once.
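The second strategy resembles what is often called retrieval-augmented generation. A generic sketch (again an illustration, not the team's actual method) might rank every file in a codebase against a developer's request and hand only the top matches to the model:

```python
def relevance(query, text):
    # Naive relevance score: count the tokens shared by the query and the file.
    # A real system would likely use learned embeddings rather than word overlap.
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve(query, codebase, k=3):
    # codebase: a dict mapping file names to their source text.
    # Return the k files most relevant to the query, so the model sees only a
    # small, focused slice of a potentially enormous codebase.
    ranked = sorted(codebase, key=lambda name: relevance(query, codebase[name]), reverse=True)
    return ranked[:k]

# Hypothetical usage: find the file most relevant to a communication bug.
files = {"exchange.c": "MPI_Isend MPI_Irecv halo exchange buffers",
         "io.c": "read write checkpoint files",
         "solver.c": "conjugate gradient iteration residual"}
print(retrieve("fix the halo exchange communication", files, k=1))  # ['exchange.c']
```

Either way, the model ends up working from a manageable amount of context rather than an entire multimillion-line codebase at once.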
Meanwhile, Bhatele’s work targets objectives beyond generating correct code. While many researchers focus on ensuring that an LLM’s code is accurate, far fewer work on improving secondary metrics such as the code’s performance and execution speed, its quality, and even its energy efficiency.
Given the complex nature of the project, the researchers expect several challenges along the way.
For example, accessing high-quality code datasets is difficult due to complex legal issues surrounding licensing. Additionally, finding datasets that contain the parallel code the researchers need is challenging because most existing datasets consist mainly of sequential programs.
To address this limitation, the researchers plan on manually gathering data and experimenting with techniques like generating synthetic code using LLMs to augment real data.
“We’re going to build our own datasets, our own language model tools, and our own retrieval models,” Goldstein says. “So, one of our biggest challenges is that we must cook from scratch. But that's a challenge that we're really excited about.”
Bhatele and Goldstein acknowledged the important contributions of their fellow researchers in navigating these hurdles: Harshitha Menon, a computer scientist at LLNL and long-time collaborator, will lead the project; William Godoy, a senior computer scientist at ORNL, will focus on executing multi-step tasks; and Arjun Guha and David Bau, associate and assistant professors of computer science at Northeastern, respectively, will work on safety and security.
The researchers also emphasized the essential role of the DOE and the National Laboratories, which will provide access to some of the fastest supercomputers in the U.S., such as Frontier at ORNL and Tuolumne at LLNL. These facilities will be crucial in helping them scale and test their models.
They also expressed appreciation for the resources and facilities provided by UMIACS. While the National Laboratories’ supercomputers are integral for large-scale computing, UMIACS will provide the researchers with essential local computing and storage facilities.
“Without local resources to develop training infrastructure and store data, it would be impossible to do a project like this,” Goldstein says. “UMIACS will be instrumental in providing computing resources that will act as our local intermediary—where we can stage and prep everything—before moving data to the supercomputers.”
—Story by Aleena Haroon, UMIACS communications group