The Path to Sustainable AI -- Core Principles and Best Practices

Large-scale AI models are considerable consumers of computing resources and energy, leaving a significant carbon footprint on our planet. Researchers estimate that training a single natural language processing model can generate as much CO2e (carbon dioxide equivalent) as the annual emissions of 120 homes. AI workloads in data centers accounted for roughly 15% of Google's total electricity consumption of 18.3 terawatt-hours in 2021, which is comparable to the annual energy usage of the entire city of Atlanta. And this was well before the boom of generative AI technologies we have been witnessing over the last couple of years. Driven by the growing demands of large-scale data analytics and AI workloads, data centers are projected to consume 3–13% of global electricity by 2030 -- a significant increase from just 1% in 2010. The computational demands of cutting-edge AI models are increasing 1,000-fold every three years, and AI could account for 14% of the world's total carbon emissions by 2040. Hence, a consensus has emerged among major IT companies, academic institutions, and federal government agencies, underscoring the urgency of reducing carbon emissions and averting the potentially daunting environmental impacts of emerging AI workloads.

In this blog post, I lay out the fundamental principles and best practices for achieving sustainable AI without notably compromising the accuracy or performance of these transformative workloads.

Selecting the AI Framework

Figure 1: Comparison of energy consumption, run-time performance (Rt), and accuracy (Acc) between PyTorch and TensorFlow.

Your selection of the AI framework will significantly impact the energy consumption of the AI models you develop with it. A thorough empirical study measured and compared the energy consumption and run-time performance of six different AI models implemented in the two most popular AI frameworks, PyTorch and TensorFlow. The study found that, in the training phase, TensorFlow achieves significantly better energy and run-time performance than PyTorch, with large effect sizes in 100% of the cases; in the inference phase, PyTorch instead exhibits significantly better energy and run-time performance than TensorFlow in 66% of the cases, again with large effect sizes. Specifically, TensorFlow outperforms PyTorch in energy consumption by 1.7x and in run-time performance by 2.1x when training models in the recommender systems and computer vision categories of the benchmark. By contrast, PyTorch is 2.1x more energy-efficient and 2.4x faster than TensorFlow when training models in the NLP category. For inference, TensorFlow is 1.4x faster and 1.7x more energy-efficient than PyTorch only for the recommender systems and ResNet-50 models, while PyTorch outperforms TensorFlow for the remaining models. Regarding accuracy, both frameworks achieve similar scores under the same configurations. Figure 1 presents the trade-offs among the three aspects, i.e., energy consumption (y-axis), run-time performance (circle size; bigger is better), and accuracy (circle color; darker is better), for PyTorch vs. TensorFlow. The takeaway is that AI model developers should choose the most appropriate framework for the model at hand, balancing accuracy, run-time performance, and energy consumption.
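
To compare frameworks on your own workload, it helps to measure energy and emissions directly rather than rely on published numbers alone. Below is a minimal sketch using the open-source CodeCarbon package; the tiny model, data, and training loop are placeholders for your own code, and the reported figure is an estimate of CO2e derived from measured energy use and the carbon intensity of your local grid.

# pip install codecarbon torch
import torch
from codecarbon import EmissionsTracker

# Placeholder model and data -- substitute your own workload here.
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
x = torch.randn(1024, 128)
y = torch.randint(0, 10, (1024,))

# Track energy consumption and estimated CO2e for the training loop.
tracker = EmissionsTracker(project_name="framework-comparison")
tracker.start()
try:
    for epoch in range(5):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
finally:
    emissions_kg = tracker.stop()  # estimated kilograms of CO2e

print(f"Estimated emissions: {emissions_kg:.6f} kg CO2e")

Running the same script against an equivalent TensorFlow implementation gives you a like-for-like energy comparison for your specific model and hardware.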

Simplifying the AI Model Structure

Figure 2: Sometimes simple is better. [image credits: @DowPhumiruk]

Traditionally, AI researchers have favored larger and more complex models to achieve higher accuracy. However, the computational cost of scaling up models may not always justify the marginal gains in accuracy. Try using smaller, simpler models when possible, as they often perform just as well on many tasks at a significantly lower computational cost. Likewise, eliminating redundant neurons in the hidden layers of Deep Neural Networks (DNNs) and creating structurally simplified models can significantly reduce computational demands while maintaining similar levels of accuracy. If the output of a neuron is linearly dependent on the outputs of other neurons in the same layer, that neuron is deemed redundant and can be removed from the network. In this process of structural simplification, known as pruning, removing a redundant neuron also eliminates all the edges (weights) connected to it, significantly reducing computation with minimal impact on output accuracy. One study in this area reports that structural simplification can cut DNN training costs by up to 33x in energy and 12x in memory, while the loss in model accuracy remains below 2%.
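
As a concrete illustration, PyTorch ships a pruning utility that can remove entire neurons (rows of a weight matrix) based on their norm. The sketch below demonstrates structured pruning on a single layer; note that it uses a simple L2-norm criterion rather than the linear-dependence criterion of the study cited above, and the 30% pruning ratio is illustrative.

# pip install torch
import torch
import torch.nn.utils.prune as prune

# A small fully connected layer standing in for one hidden layer of a DNN.
layer = torch.nn.Linear(256, 128)

# Structured pruning: zero out 30% of the output neurons (dim=0) with the
# smallest L2 norm, which removes all of their incoming weights at once.
prune.ln_structured(layer, name="weight", amount=0.3, n=2, dim=0)

# Fold the pruning mask into the weight tensor to make it permanent.
prune.remove(layer, "weight")

# Fraction of weights that are now zero (roughly 30% of the rows).
sparsity = float((layer.weight == 0).sum()) / layer.weight.numel()
print(f"Sparsity after pruning: {sparsity:.2%}")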

Using Low-Precision Data Formats for AI

Figure 3: Quantization of wider data formats (e.g., FP32) into narrower ones (e.g., INT8).

Recent research shows that low-precision data formats -- such as those using 8 bits or fewer per value -- can significantly enhance the performance and energy efficiency of AI training and inference with minimal impact on accuracy. Unlocking the full potential of these formats requires specialized techniques, such as quantization of wider data formats (e.g., FP32) into narrower ones (e.g., INT8), determining appropriate scaling factors, and emulating advanced data formats for research and development purposes. These formats also reduce the memory footprint of AI models while boosting effective memory capacity and network bandwidth, as fewer bits need to be stored and transmitted. Hugging Face Transformers enables models to be loaded in 8-bit or 4-bit precision via the bitsandbytes library, reducing memory usage by up to 4x with 8-bit models compared to 32-bit models, and by up to 8x with 4-bit models. Other open-source libraries, such as TensorRT by NVIDIA and Brevitas by AMD, enable neural network quantization and emulation, with support for both post-training quantization and quantization-aware training.
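
For example, the Transformers library can load a model with INT8 weights through bitsandbytes. The snippet below is a minimal sketch: the model name is just an example, and it assumes a CUDA-capable GPU with the bitsandbytes package installed.

# pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "facebook/opt-1.3b"  # example model; substitute your own

# Quantize the linear-layer weights to INT8 at load time.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU(s)
)

# Memory footprint in GB; expect roughly a 4x reduction versus FP32 weights.
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")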

Energy-aware AI Code Refactoring

Figure 4: Impact of intelligent/selective code smell refactoring on application resource consumption.

Most AI models are not developed using the most efficient software coding practices. In our recent study, we showed that certain coding practices can trigger a substantial surge in energy consumption, primarily stemming from the suboptimal utilization of computing resources. These practices are referred to as "code smells" (or sometimes "energy smells"), defined as characteristics of software source code that indicate a deeper, underlying issue. In this work, we performed a thorough investigation of 16 distinct code smells and other coding malpractices across 31 real-world open-source applications written in Java and Python. The study provided evidence that several common refactoring techniques employed to deal with certain types of code smells, such as god class, god method, long parameter list, and specific instances of type checking, can inadvertently increase CPU and memory usage, which in turn escalates the overall energy consumption of the application. It also demonstrated that a selective approach to code smell refactoring can result in considerable resource savings: for certain applications, this yielded a reduction of up to 39% in CPU utilization and up to 48% in memory utilization (as shown in Figure 4). The study further showed that selective refactoring can reduce energy consumption by up to 13.1% and carbon emissions by up to 5.1% per workload on average. These findings underscore that by identifying and selectively refactoring AI code smells that contribute to high energy consumption, we can develop more sustainable and eco-friendly AI systems without sacrificing the accuracy or performance of these models.
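
As a simplified, hypothetical illustration (not taken from the study), consider the "long parameter list" smell and its textbook refactoring. The point is that the refactored version, while more readable, introduces extra object creation per call -- exactly the kind of overhead that makes blanket refactoring counterproductive and a selective, measurement-driven approach necessary.

from dataclasses import dataclass

# Before: a "long parameter list" smell in a training entry point.
def train_model(lr, batch_size, epochs, optimizer, dropout, weight_decay):
    ...

# After: the textbook refactoring introduces a parameter object. Cleaner to
# read, but each call now allocates an extra object; at scale, this is the
# kind of overhead that can raise CPU and memory usage, so measure before
# and after refactoring rather than applying it indiscriminately.
@dataclass
class TrainConfig:
    lr: float
    batch_size: int
    epochs: int
    optimizer: str
    dropout: float
    weight_decay: float

def train_model_refactored(config: TrainConfig):
    ...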

GPU Power Capping

Figure 5: Time and energy usage comparison of three language modeling network infrastructures with different maximum power limits. Values given are percentages relative to the performance of the default 250W setting (100% indicated by black line).

A recent study proposes power capping as a useful tool for reducing GPU energy consumption during AI model training and inference. Most modern computing platforms allow users to adjust hardware settings for processors and GPUs via command-line tools that are generally not visible to users of a shared computing system. Over the duration of an AI training or inference task, the power consumed by hardware components can vary significantly based on the operation being performed, environmental conditions, and hardware limits. Power capping allows users to limit the maximum power available to hardware devices through these tools; it requires no changes to user code and is applied at the hardware level. Figure 5 depicts training performance with power caps of 100W, 150W, and 200W, plotted relative to the default limit of 250W. The experiments show that power caps can significantly reduce energy usage without affecting the predictions of trained models or, consequently, their accuracy on downstream tasks. The study also shows that power capping reduces energy usage regardless of model architecture.
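
In practice, power caps are set with vendor tools such as nvidia-smi, which typically requires administrator privileges. The sketch below wraps the relevant commands in Python; the GPU index and the 150W limit are illustrative values that should be tuned to your hardware and workload.

# Requires the NVIDIA driver's nvidia-smi utility and admin privileges.
import subprocess

GPU_ID = "0"            # target GPU index (illustrative)
POWER_LIMIT_W = "150"   # power cap in watts (illustrative; default is often 250W)

# Query the current and maximum allowed power limits for reference.
subprocess.run(["nvidia-smi", "-i", GPU_ID, "-q", "-d", "POWER"], check=True)

# Apply the power cap; no changes to training or inference code are needed.
subprocess.run(["nvidia-smi", "-i", GPU_ID, "-pl", POWER_LIMIT_W], check=True)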

There are several additional strategies to significantly reduce the energy consumption of AI workloads, including reducing data size, shortening training time, improving data loading efficiency, and optimizing inter-node communication. I hope to cover those practices in another blog post. 

Links to My Other Related Posts

I have written a couple of other related posts in the past. Here are quick links to those if you would like some further reading in this area.
