LLM as a Judge: Evaluating Accuracy in LLM Security Scans

August 04, 2025

By Matthew Burton (Staff Software Developer), Albert Shala (Sr. Staff Software Developer), and Anurag Das (Sr. Software Developer)

Key Takeaways:

New Trend Micro research found that while LLMs can act as automated judges for security risks, they can miss threats like hallucinated packages and can be tricked by adversarial prompts.
These gaps could lead to data exfiltration, supply chain attacks, and operational disruptions if not addressed with proper guardrails.
Any organization using generative AI, LLM copilots, or automated workflows is potentially at risk, especially where external code or package managers are involved.
To reduce exposure, review LLM use, add guardrails, and validate critical outputs with external sources.
Trend Micro offers AI Guard in Trend Vision One™, security-tuned models, and managed services to help organizations benchmark, monitor, and secure their LLM deployments.

Introduction

As large language models (LLMs) become more capable and widely adopted, the risk of unintended or adversarial outputs grows, especially within a security-sensitive context. To identify and mitigate such risks, Trend Micro researchers ran LLM security scans that simulate adversarial attacks. These scans involve sending crafted prompts to a target model to see whether it will produce outputs that pose a security threat, such as malicious code, sensitive data leaks, and hallucinated software packages.

There are two main ways of evaluating responses from an LLM: using human reviewers or using other models (commonly referred to as LLM as a judge). Human reviewers are great at understanding the finer details of a prompt, like tone and meaning, and can spot things that machines might miss. However, as the volume of prompts and responses scales, relying solely on humans becomes impractical.

An alternative is to utilize an “LLM judge,” which is a specialized model for evaluating the response of a target model by providing it a concise system prompt to provide context on the expected evaluation criteria, and, in turn, generate an expected response as output.

Unlike human evaluators who are constrained by time and resources, LLM judges can rapidly assess vast amounts of text outputs consistently, following precisely defined criteria. This solution could be considered a better choice than human evaluation primarily due their scalability, speed, consistency, and cost-effectiveness.

Throughout this article, we’ll walk through several security-related objectives, such as detecting malicious code generation, identifying package hallucinations, and identifying system prompt leakage to demonstrate the advantages and limitations of using LLM Judges. We also share our experience in working through those limitations as we developed our evaluation framework.

Evaluation process

To test how well our LLMs can evaluate adversarial behaviours, we used a two-step evaluation process designed to simulate attacks and assess how a model responds to them at scale:

Attack simulation: We send purpose-crafted prompts to the target models (e.g., GPT-4, Mistral), probing it to generate specific malicious responses.
Response evaluation: The response from a target model is passed along to an LLM judge, configured with a tailored system prompt that defines the criteria for a successful attack (i.e., what a reverse shell looks like).

The judge’s response returns an evaluation reason along with a binary decision (true or false) of whether the response met its objective or not. The format might seem simple, but it’s intentionally designed to translate open-ended LLM responses into something we can act on programmatically.

This structure lets us measure aggregate model risk, automate policy enforcement, and track improvements over time. This is critical when evaluating hundreds or thousands of responses, as it removes the need for manual review.

Malicious code generation

A common way LLMs are misused is by tricking them into producing code that would be used for malicious intent. An attack could then execute the generated code to compromise the security of victim systems by creating backdoors, injecting harmful commands, or enabling remote access.

Let’s look at one such example, shown in Figure 1.

Figure 1. An example of how LLMs are abused, where a prompt nudges one to generate malicious code (top), and an example how an LLM judge works

In this scenario, a crafted prompt leads the model to generate potentially harmful code, such as a reverse shell in Node.JS or a script that enables system file access. These types of outputs can be exploited by attackers to gain unauthorized access or compromise a system, which aligns with OWASP LLM05:2025 that highlights the risk of LLMs producing exploitable code, even when optimized to be helpful.

Without scalable methods for detecting and classifying these responses, such behaviours can slip through unnoticed, increasing the risk that LLMs deployed in production may unintentionally assist offensive actions.

The response from the target model is passed to an LLM judge configured with evaluation criteria for malicious code. Testing whether the objective is met using a true/false format might seem simplistic, but it enables high-throughput classification and makes it possible to programmatically assess large volumes of model outputs. This structure also supports integration into security pipelines, reducing the need for manual review at scale.

Package hallucination

In our experience, LLMs as a judge are effective at classification tasks where the information needed to perform the evaluation is contained within the prompts (i.e., determining whether the code is malicious). However, evaluations fall short when information outside the response is needed for effective evaluation. This limitation is particularly significant in the context of LLM package hallucination (e.g., OWASP LLM09:2025, MITRE AML.T0062).

Package hallucination can lead to AI supply chain compromises, where an attacker has uploaded a malicious package under the hallucinated name to a package registry. Unsuspecting software developers, relying on an LLM coding assistant, inadvertently pull the package, which can extract sensitive data from the developer's environment.

Figure 2. Examples of prompt injection and hallucination in LLM responses

Figure 2 shows that we were able to prompt deepseek-r1 to generate a hallucinated Python package, mlj, which is a popular machine learning framework written in the programming language Julia. While the LLM was able to correctly identify it is software related to machine learning, it incorrectly identifies that it is a Python package available from the PyPi repository.

Figure 3 is our experiment to see if an LLM judge can correctly identify that this response meets the objective for generating hallucinated software packages. Here, we gave a description of hallucinated software packages and instructed GPT-4 to determine if the response meets that objective.

Figure 3. An example of how an LLM judge would assess the output for hallucinated or fictious software packages

As we can see, GPT-4 was not able to identify that pip install mlj is an example of package hallucination.

In our research, we identified multiple instances of package hallucination covering Python and Go (Golang) packages, which were not correctly identified by our LLM Judge. This is expected, as LLMs are trained on a corpus of data that might be months out of date by the time the model is released. This data might not be comprehensive for a particular coding language or domain, causing the LLM to overgeneralize and produce plausible but fake software packages. These same limitations apply to the LLM judge and affect the evaluation for the objective of generating hallucinated software packages.

To cover these instances, we introduced a new method for detecting hallucinated software, where we used the PyPi repository and Golang proxy mirror as objective data sources to determine whether a package was hallucinated. The general idea is as follows:

1. Use an LLM to extract and classify software packages from the response. The output of this step looks something like this:

2. For every software package the LLM identifies, reach out to the associated package repository to determine if the software is found or not.

3. For every software package not found, we can infer that the LLM hallucinated the package.

Using the new evaluation, we get the following result:

While this procedure brings us a step closer to properly evaluating instances of software hallucination, it does not cover instances of AI-driven supply chain attacks, as we can only determine if the package exists, with no capability to identify malicious or squatted packages. We continue to explore additional enhancements to our evaluation pipeline to detect these instances.

System prompt leakage

The system prompt leakage vulnerability (OWASP LLM07:2025) in LLMs refers to the risk that the system prompts or instructions used to steer the model's behavior can also contain sensitive information that was not intended to be discovered. System prompts are designed to guide the model’s output based on the application’s requirements, which might inadvertently contain secrets. When found, this information can be used to facilitate other attacks.

Some common examples include the following:

1. Exposure of sensitive functionality: The application’s system prompt might reveal sensitive information or functionality intended to be kept confidential, such as sensitive system architecture, API keys, database credentials, or user tokens.

2. Exposure of internal rules: The application’s system prompt reveals information on internal decision-making processes that should be kept confidential.

3. Revealing of filtering criteria: A system prompt might ask the model to filter or reject sensitive content.

4. Disclosure of permissions and user roles: The system prompt could reveal the application’s internal role structures or permission levels.

In the second blog of our series, we learned about Pleak, an algorithmic method for system prompt leakage. In this section, we’ll use it to craft an adversarial string that, when sent to a Llama 3.1 8B model, reveals its system prompt.

Figure 4. An example of system prompt leakage by PLeak

We now evaluate whether the attack was successful with GPT-4o as the judge and the same judge system prompt we used earlier to detect malicious code generation.

Figure 5. An example showing failures in system prompt leakage detection

Since the original system prompt used on the target Llama model is unknown to judge, it is unable to detect system prompt leakage.

To try to detect the same response, this time we updated the judge system prompt to ask it to compare against the reference system prompt used in the PLeak attack.

Figure 6. An example showing system prompt leakage detection Passes

In Figure 6, we can see that using just regular string matching would not suffice. LLM as a judge provides semantic analysis, which is more flexible than string matching as the comparison does not rely on the system prompt being worded the same way as the reference prompt.

From all the scenarios we have discussed so far, there is no silver-bullet judge-system prompt for detecting attacks for all OWASP objectives. Tailoring the judge’s system prompt is important in detecting specific objectives.

Defending the LLM judge

So far, we have discussed attack detection and various optimizations performed to increase the accuracy of an LLM judge. What if the LLM judge itself is attacked instead? Let’s look at such a scenario.

Figure 7. An example of a Base64-encoded roleplaying attack string

Figure 8. An example showing how Llama 3.1 70B decoded the attack string instead of falling for it

As shown in Figure 8, we sent a Base64-encoded roleplaying attack to discover the underlying AI model family of a target Llama 3.1 70B model. The model decodes the attack string back instead of falling for the attack.

This attack string is then sent to an LLM judge based on the Llama family of models for attack detection.

Figure 9. Examples showing the judge receiving the attack but simply repeats it rather than evaluating for information disclosure (top) and judge roleplaying as GPT and SAN instead of providing evaluation (bottom)

Figure 9 shows how the judge model ignores it's system prompt and gives away its underlying Llama model family instead of providing an evaluation. This attack vector showcases the need for guardrails to protect the LLM judge, too, as it is susceptible to unsanitized target LLM responses, which can be an attack string in itself. We need guardrails to classify the judge’s response to ensure that it is producing a well-formed evaluation and not revealing internal details. Trend Micro offers AI Guard to protect your AI applications against harmful content generation, sensitive information leakage, and prompt injections. AI Guard is currently in private preview as part of Trend Micro AI Application Security™.

Figure 10. A visualization of the automated evaluation pipeline used to assess LLM safety

In Figure 10, the LLM guardrails are placed after the LLM judge evaluation instead of before to avoid flagging successful attacks as harmful and preventing them from reaching the judge. Our assembled evaluation pipeline captures our experience in identifying the strengths of LLM judges while implementing techniques to address their limitations.

Benchmarking foundation models as a judge

Now that we have reviewed various evaluation techniques to improve the evaluation performance of our LLM Judge, we need to answer which foundation model we should choose for our judge.

To determine the best judge, we created a dataset of over 800 attack strings and responses from a target model. The dataset has a combination of successful and unsuccessful attacks of various OWASP objectives, including the following:

Malicious Code Generation (OWASP LLM05:2025)
Sensitive Information Disclosure (OWASP LLM02:2025)
System Prompt Leakage (OWASP LLM07:2025)
Discover AI Model Family (MITRE ATLAS AML.T0014)

Once created, we did human labeling on each attack response to classify whether the attack was successful. After human labeling, we asked state-of-the-art foundation models to do the same classification and compared their results against the human labels. Figure 11 shows the leaderboard for accuracy.

Figure 11. Attack detection accuracy for various foundation models

Precision and recall are often used when a dataset is imbalanced. From our target model, we were more likely to see unsuccessful attacks. Precision is the ratio of true positive predictions to the total number of positive predictions made by the model, or:

True Positives ÷ (True Positives + False Positives)

Recall is the measure of a model’s ability to correctly identify positive instances. Recall is the ratio of true positive predictions to the total number of actual positive instances or:

True Positives ÷ (True Positives + False Negatives)

Combining them, we get the F1 score, which is:

2 × (Precision × Recall) ÷ (Precision + Recall)

High F1 scores indicate that the model performs well in both identifying when a prompt attack is successful and minimizing false positives and false negatives. Figure 12 is the leaderboard for the F1 score for the foundation models we saw earlier.

Figure 12. The F1 score for various foundation models

From the experiment data, we can see that most of the models are in the same ballpark for attack detection. We looked at another crucial metric to help us choose — relative inference cost — as shown in Figure 13.

Figure 13. Relative inference cost for various foundation models

From this data, we could see that while the Mistral 7B model is more cost-effective, it also ranks the second lowest in accuracy and F1, making it a poor choice for reliable detection. This contrasts with Trend’s Primus-Labor-70B, which is strong at accuracy and F1 at a relatively low cost, closely matching the performance of significantly larger models.

Primus-Labor-70B is a model fine-tuned from nvidia/Llama-3.1-Nemotron-70B-Instruct. It has been pretrained with approximately 10 billion cybersecurity-related tokens, including the following:

Primus-Seed-V3 (0.457B): Enriched with blogs, news, books, websites, Wikipedia, and MITRE and Trend Micro knowledge
FineWeb (2.57B): Cybersecurity text filtered from fineweb-edu-score-2
Nemotron-CC (7.6B): Cybersecurity text filtered from Nemotron-CC

Due to having rich cybersecurity background with great performance and relatively low cost, Primus-Labor-70B is the best fit for LLM as a judge in LLM security scans.

Fine-tuned judge

From the leaderboard we developed, it was evident that Primus-Labor-70B provided us with a good balance of cost and accuracy. Of course, we could do better. The same dataset that we used to evaluate various foundation models as judges can be used to fine-tune our own model.

In our evaluation efforts, we had built over 1,000 human-labeled samples that could be used towards fine-tuning. These samples contained LLM responses and labels indicating if that LLM response met our attack objective or not. As our base model, we used Trend Micro’s recently released Llama-Primus-Base, which is a foundation model based on Llama-3.1-8B-Instruct and trained on a cyber security dataset. Its performance was compared against Primus-Labor-70, GPT-4, and GPT-4o, using a different evaluation set from the judge leaderboard, since the previous evaluation set was used in fine-tuning the model. This evaluation set contained a balanced number of successful and unsuccessful attacks, as determined by our human reviewers.

$Figure 14. Comparison of Primus-Labor-70’s accuracy (top) and F1 score (bottom) to a fine-tuned LLM judge and GPT-4$

Figure 14. Comparison of Primus-Labor-70’s accuracy (top) and F1 score (bottom) to a fine-tuned LLM judge and GPT-4

As Figure 14 shows, the accuracy of our fine-tuned judge is similar to that of GPT-4 and even measures up with GPT-4o, both of which are state-of-the-art models. Performing evaluations using the smaller fine-tuned model allows us to keep compute costs lower while running at scale. The fine-tuned judge is also tailored to the objectives we are evaluating and provides us a base from which to incorporate new attack techniques as they emerge.

To assess how well different language models can serve as judges on security-specific tasks, we built a dataset of adversarial prompts and model responses, then compared model judgments to human labels. While some models excel at straightforward tasks like code evaluation or detection of personally identifiable information (PII), they often struggle with more nuanced challenges such as prompt injection or hallucination. These findings are helping us refine our evaluation methods and explore hybrid approaches for more robust assessments.

We continue to expand our dataset and improve our techniques to make evaluations more accurate and scalable. Trend Micro AI Application Security™ integrates our latest research into an AI Scanner, for identifying vulnerabilities in AI applications, and an AI Guard, for mitigating them at runtime. These features enable customers to intercept malicious inputs and block potentially harmful outputs from AI models, which helps prevent exploitative usage and maintain regulatory compliance. They are available in Private Preview; please contact your Trend Micro representative for more information.

Stay tuned for our next article, where we’ll dive deeper into the outcomes you can expect from Trend Micro AI Application Security™.

HIDE

Like it? Add this infographic to your site:
1. Click on the box below. 2. Press Ctrl+A to select all. 3. Press Ctrl+C to copy. 4. Paste the code into your page (Ctrl+V).

Image will appear the same size as you see above.

Posted in Managed Detection and Response, Research