Navigating the Threat Landscape for Cloud-Based GPUs

May 23, 2024

By Numaan Huq, Philippe Lin, Roel Reyes, Charles Perine

The increased adoption of technologies like artificial intelligence (AI), machine learning (ML), large language models (LLMs), and high-performance computing (HPC) underscores the growing need to prioritize the security of graphics processing units (GPUs). GPUs are designed for parallel processing and can run thousands of simple compute tasks simultaneously, which speeds up AI, ML, LLM, and HPC applications. Given the essential role that GPUs play in business operations (AI, ML, LLM, and HPC workloads, for example, depend on cloud-based GPU systems), it's crucial that defenders implement protective measures for GPUs.

Our research paper, "A Survey of Cloud-Based GPU Threats and Their Impact on AI, HPC and Cloud Computing," delves into the current threat environment of GPUs and outlines recommendations to address these security challenges.

How GPUs are used

Businesses are transitioning from on-site setups to cloud-based GPUs for several reasons:

The scalability and flexibility of cloud-based GPUs are well-suited to accommodate peaks and troughs in demand for computing power.
Cloud-based GPUs allow users access to the latest GPU chips in the market. The pay-as-you-go model of cloud services means users don't have to invest heavily upfront in costly hardware and upkeep.
Users get worldwide access to shared GPU resources, sidestepping the complexities of hardware management.

Top GPU security concerns

GPUs contend with a slew of diverse and complex threats, with some components, including processing units and special function units, vulnerable to exploitation by malicious actors. This research includes a risk matrix that evaluates the likelihood and impact of 10 kinds of GPU attacks:

Threat Type	Risk Level	Likelihood	General Impact	Cloud Impact	AI Impact	HPC Impact
GPU Side-Channel Attacks	High	Medium: Attacks are possible, but not trivial to execute.	High: Potential for significant data leakage and security breaches.	High: Potentially exposes data across users in shared environments.	High: Risk of leaking sensitive inference data or insights into model internals.	High: Could lead to the disclosure of sensitive computation or simulation results.
GPU Rootkits	Medium	Low: Sophisticated attacks, less frequent in well-monitored environments.	High: Can have far-reaching effects through system compromise.	High: Can evade detection and persistently compromise cloud services.	High: Threatens the integrity of AI models and the confidentiality of proprietary information.	Medium: Possible disruption to HPC tasks; the impact varies by the specific use case.
API Abuse and Kernel Manipulation	High	Medium: Vulnerabilities can exist, and attackers might leverage them.	High: Potential for severe system compromise and data manipulation.	Medium: Potential for exploiting vulnerabilities, mitigated by cloud platform security controls.	High: Direct manipulation could compromise AI models and data.	High: Directly affects the integrity and execution of computational tasks.
Denial-of-Service Attacks	High	High: These attacks are common and can be easily launched.	High: Disrupt service availability, potentially causing significant losses.	High: Directly impact service availability, affecting multiple users.	High: Can render AI services inoperative, critically affecting availability.	High: Severely restrict access to computational resources, disrupting operations.
GPU Malware for Cryptomining	Medium	High: Malware is prevalent and targets any accessible resources.	Medium: Mainly impacts system performance and costs.	High: Consumes computational resources, leading to increased costs and degraded performance.	Low: Mainly a resource drain; indirect impact unless AI tasks are severely resource constrained.	Low: Similar to AI, mainly a resource drain with limited direct impact.
Exploiting Vulnerabilities in GPU Drivers	High	Medium: Vulnerabilities exist, but patching and mitigations are common.	High: Compromise can have severe consequences for system and data integrity.	Medium: Cloud platforms might mitigate some risks, but vulnerabilities can lead to system compromise.	High: Potentially compromises the integrity and confidentiality of AI processes.	High: Unauthorized access or disruption of tasks is a significant threat.
GPU-Assisted Code Obfuscation	Medium	Low: Requires specialized techniques, not as common as basic malware.	Medium: Can hinder security analysis and delay incident response.	Medium: Complicates malware detection within the cloud infrastructure.	Medium: Can obscure malicious activities affecting AI model integrity.	Medium: Could hide the presence of unauthorized computations or data manipulations.
Overdrive Fault Attacks	Medium	Low: Requires physical access or specialized manipulation techniques.	Medium: Can impact accuracy and reliability, more targeted in nature.	Low: Rare in controlled cloud environments but could occur through hardware manipulation.	Medium: Specific attacks might subtly alter the outcomes of AI models.	High: Precision tasks might be compromised, affecting critical results.
Memory Snooping/Cross-Virtual Machine (VM) Attacks in vGPU Environments	High	Medium: Attacks are possible on virtualized GPUs, especially if not properly configured.	High: Potential for major data breaches and loss of confidentiality.	High: Breaks isolation between users, undermining cloud security.	High: Unauthorized access to AI datasets and models poses a serious confidentiality risk.	High: Data leakage is a major concern, especially in shared computational environments.
Compromised AI Models/Trojaning	High	Medium: Attacks rely on model distribution channels and user trust.	High: Can lead to incorrect or malicious outputs with significant consequences.	Medium: Cloud infrastructure might not be directly impacted but facilitates model distribution.	High: Directly affects model integrity, leading to incorrect or malicious decisions.	Medium: Indirect impact initially, but a growing concern during model deployment impacting HPC.
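
For teams that want to work with the matrix programmatically, the ratings above can be encoded as plain data and filtered by workload. The following is a minimal sketch, not part of the research paper: the `Threat` type and the `threats_rated` helper are hypothetical names chosen for illustration, and only the High/Medium/Low ratings are transcribed (the explanatory text in each cell is omitted).

```python
from typing import NamedTuple


class Threat(NamedTuple):
    # Hypothetical record type; field names mirror the matrix columns.
    name: str
    risk: str
    likelihood: str
    general: str
    cloud: str
    ai: str
    hpc: str


# Ratings transcribed from the risk matrix above.
RISK_MATRIX = [
    Threat("GPU Side-Channel Attacks", "High", "Medium", "High", "High", "High", "High"),
    Threat("GPU Rootkits", "Medium", "Low", "High", "High", "High", "Medium"),
    Threat("API Abuse and Kernel Manipulation", "High", "Medium", "High", "Medium", "High", "High"),
    Threat("Denial-of-Service Attacks", "High", "High", "High", "High", "High", "High"),
    Threat("GPU Malware for Cryptomining", "Medium", "High", "Medium", "High", "Low", "Low"),
    Threat("Exploiting Vulnerabilities in GPU Drivers", "High", "Medium", "High", "Medium", "High", "High"),
    Threat("GPU-Assisted Code Obfuscation", "Medium", "Low", "Medium", "Medium", "Medium", "Medium"),
    Threat("Overdrive Fault Attacks", "Medium", "Low", "Medium", "Low", "Medium", "High"),
    Threat("Memory Snooping/Cross-VM Attacks in vGPU Environments", "High", "Medium", "High", "High", "High", "High"),
    Threat("Compromised AI Models/Trojaning", "High", "Medium", "High", "Medium", "High", "Medium"),
]


def threats_rated(column: str, rating: str) -> list[str]:
    """Return the names of threats whose given column has the given rating."""
    return [t.name for t in RISK_MATRIX if getattr(t, column) == rating]


# Example: which threats carry a High impact rating for HPC workloads?
for name in threats_rated("hpc", "High"):
    print(name)
```

A filter like this makes it easier to see, for instance, that Overdrive Fault Attacks are rated Low for cloud impact but High for HPC, so the priority order depends on the workload being defended.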

Mitigating threats to cloud-based GPUs

A robust set of protection measures can help fortify cloud-based GPU environments against cyberattacks. Broadly speaking, security approaches that defenders can put into practice as part of a layered defense strategy include the following:

Driver and firmware security. Consistently update GPU drivers and firmware with the most recent security patches to reduce exposure to exploits of known vulnerabilities.
GPU usage monitoring and anomaly detection. Use monitoring tools capable of pinpointing atypical behavior in GPU usage, which could be signs of malicious activity like cryptojacking, denial-of-service (DoS) attacks, or misuse of resources. AI/ML techniques can also help defenders detect more advanced cyberattacks.
Application-level security measures. Reduce potential threats in GPU-accelerated applications by implementing best practices for application security, including secure coding techniques, rigorously validating input data, and fortifying AI/ML models to withstand attackers’ data poisoning and evasion attempts.
Hardware Security Modules (HSMs) for sensitive operations. Use dedicated HSMs, which are designed to better withstand tampering and prevent data breaches, instead of general-purpose GPUs for critical cryptographic operations or when processing confidential data.
Access control policies. Implement stringent access policies, including role-based access control (RBAC) and auditing mechanisms. This helps ensure that only approved users and applications can use GPU resources.
Education and awareness. Promote awareness of the security risks of using GPUs via cloud services and provide training on how to identify suspicious activity that could point to GPU-related attacks.
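
To make the monitoring recommendation above more concrete, here is a minimal sketch of one simple anomaly check: comparing new GPU utilization samples against a known-good baseline using a z-score. This is an illustrative assumption, not the method described in the research paper. The function name `flag_anomalies`, the threshold, and the sample values are all hypothetical; in practice the samples might come from a tool such as `nvidia-smi --query-gpu=utilization.gpu --format=csv`, and production systems would likely use richer signals than utilization alone.

```python
from statistics import mean, stdev


def flag_anomalies(baseline, samples, threshold=3.0):
    """Return indices of samples that deviate strongly from the baseline.

    baseline: utilization percentages from a period assumed to be normal.
    samples:  new utilization percentages to check.
    """
    mu = mean(baseline)
    # Floor the spread so a perfectly flat baseline cannot cause division by zero.
    sigma = max(stdev(baseline), 1.0)
    return [
        i for i, value in enumerate(samples)
        if abs(value - mu) / sigma > threshold
    ]


# Hypothetical data: a mostly idle GPU that suddenly runs near full load,
# as might happen with cryptojacking or other resource misuse.
normal_period = [12, 15, 11, 14, 13]
new_readings = [12, 98, 97, 99, 14]
print(flag_anomalies(normal_period, new_readings))  # prints [1, 2, 3]
```

A fixed baseline is used here rather than a rolling window so that a sustained spike does not gradually get absorbed into what counts as normal. The trade-off is that the baseline must be refreshed when legitimate workloads change.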

The growing prevalence of AI and HPC deployments calls for organizations to take a forward-thinking approach to defending their GPU infrastructures, one that combines cybersecurity tools with time-tested security practices. Mitigating threats to GPUs takes coordinated effort between developers, who need to adopt secure coding practices, and cloud service providers, who should enforce intrusion and anomaly detection measures tailored specifically to GPUs.

Posted in Threat Landscape, Research