LLMs Evaluation: Benchmarks, Challenges, and Future Trends

The evaluation of Large Language Models (LLMs) spans benchmarks, scalability, ethical challenges, and multimodal testing. Dynamic evaluation frameworks and emerging trends support robust, adaptive assessment of model performance, enabling safer and more efficient deployment in sensitive fields such as healthcare, finance, and law.


In the dynamic field of artificial intelligence, Large Language Models (LLMs) have emerged as a cornerstone of technological innovation, demonstrating remarkable capabilities across domains such as natural language understanding, reasoning, and creative text generation. The evaluation of these models has become increasingly critical, not just for understanding their performance but also for ensuring their safe deployment in real-world applications.

The evolution of LLM evaluation methodologies parallels the rapid advancements in their design. Early approaches relied on task-specific benchmarks like GLUE and SuperGLUE to assess isolated capabilities such as syntactic parsing and question answering. However, with the advent of models like GPT-3, ChatGPT, and GPT-4, evaluation frameworks have expanded to accommodate complex, multi-faceted performance dimensions.

Key motivations for evaluating LLMs include:

  1. Performance Benchmarking: Ensuring models meet task-specific requirements across diverse applications.
  2. Safety and Trustworthiness: Addressing concerns such as data privacy, toxicity, and misinformation to mitigate societal risks.
  3. Alignment and Ethics: Aligning model outputs with human values and minimizing biases.

Current research underscores the importance of developing robust, adaptive evaluation frameworks that address these goals. The growing integration of LLMs into sensitive domains like healthcare, law, and finance further emphasizes the need for rigorous and comprehensive evaluations.

Objectives of LLM Evaluation

Evaluating Large Language Models (LLMs) serves multiple critical objectives, each essential for ensuring their effective deployment and integration into real-world systems. As these models are increasingly embedded in sensitive applications such as healthcare, law, and finance, rigorous evaluation becomes paramount. The key objectives of LLM evaluation include:

1. Performance Benchmarking

  • Definition: Benchmarking aims to quantify an LLM's performance across diverse tasks, such as language understanding, reasoning, and generation.
  • Examples: Metrics like accuracy, perplexity, and F1 scores help evaluate language comprehension and fluency.
  • Impact: This enables developers to compare models, select optimal architectures, and identify areas for improvement.
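
To make the metrics named above concrete, here is a minimal, self-contained sketch of how accuracy, binary F1, and perplexity might be computed from model outputs. The toy inputs are invented for illustration; a real benchmark harness would plug in its own predictions, gold labels, and token log-probabilities.

```python
import math

def accuracy(preds, labels):
    """Fraction of exact matches between predictions and gold labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def f1_binary(preds, labels, positive=1):
    """Binary F1: harmonic mean of precision and recall for one class."""
    tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
    fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
    fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities (natural log) of held-out text."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy usage with made-up numbers:
print(accuracy([1, 0, 1], [1, 1, 1]))   # 0.666...
print(f1_binary([1, 0, 1], [1, 1, 1]))  # 0.8
print(perplexity([-0.1, -2.3, -0.7]))   # ~2.8
```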

2. Understanding Limitations

  • Challenges Addressed: Identifying model weaknesses such as hallucination, factual inaccuracies, or poor generalization to out-of-distribution (OOD) data.
  • Strategies:
    • Using datasets like AdvGLUE for adversarial robustness testing.
    • Employing benchmarks such as GLUE-X for OOD performance.
  • Outcome: Clear insights into scenarios where models may fail, enabling risk mitigation.
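
One simple way to quantify the out-of-distribution weaknesses described above is to score the same model on an in-distribution split and an OOD split and report the gap, which is roughly what GLUE-X-style evaluations do at much larger scale. In the sketch below, `model_predict` is a hypothetical stand-in for any classifier, and the toy data is invented.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (input text, gold label)

def accuracy(model_predict: Callable[[str], int], data: List[Example]) -> float:
    """Fraction of examples the model labels correctly."""
    return sum(model_predict(text) == label for text, label in data) / len(data)

def ood_gap(model_predict: Callable[[str], int],
            in_dist: List[Example],
            out_dist: List[Example]) -> dict:
    """Report in-distribution accuracy, OOD accuracy, and the degradation between them."""
    id_acc = accuracy(model_predict, in_dist)
    ood_acc = accuracy(model_predict, out_dist)
    return {"id_accuracy": id_acc, "ood_accuracy": ood_acc, "gap": id_acc - ood_acc}

if __name__ == "__main__":
    # Toy sentiment "model": predicts positive (1) if the word "good" appears.
    toy_model = lambda text: int("good" in text.lower())
    in_dist = [("good movie", 1), ("bad movie", 0)]
    out_dist = [("a delightful film", 1), ("utterly dreadful", 0)]  # unfamiliar vocabulary
    print(ood_gap(toy_model, in_dist, out_dist))  # gap of 0.5 reveals poor generalization
```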

3. Ensuring Safety and Trustworthiness

  • Focus Areas:
    • Avoidance of harmful outputs, such as bias, toxicity, and misinformation.
    • Protection against adversarial attacks and robustness to perturbations in input data.
  • Benchmarks:
    • Safety-specific datasets (e.g., RealToxicityPrompts, AdvGLUE).
    • Comprehensive evaluations of alignment with ethical norms and human values.
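
A minimal sketch of how a safety-focused evaluation might screen model completions for harmful content follows. The `toxicity_score` function is a hypothetical placeholder; in practice it would wrap a trained toxicity classifier of the kind used to score RealToxicityPrompts continuations.

```python
from typing import Dict, List

def toxicity_score(text: str) -> float:
    """Hypothetical placeholder returning a toxicity probability in [0, 1].
    A real harness would call a trained classifier here; the keyword check
    below is a toy heuristic for illustration only."""
    blocklist = {"hate", "stupid"}
    return 1.0 if any(word in text.lower() for word in blocklist) else 0.0

def safety_report(completions: List[str], threshold: float = 0.5) -> Dict[str, float]:
    """Score every completion and report the fraction flagged as toxic."""
    scores = [toxicity_score(c) for c in completions]
    flagged = sum(s >= threshold for s in scores)
    return {
        "mean_toxicity": sum(scores) / len(scores),
        "flagged_rate": flagged / len(scores),
    }

if __name__ == "__main__":
    outputs = ["Happy to help with that.", "That is a stupid question."]
    print(safety_report(outputs))  # {'mean_toxicity': 0.5, 'flagged_rate': 0.5}
```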

4. Alignment with Human Values

  • Purpose: Evaluating how closely the outputs of LLMs align with societal and ethical expectations.
  • Challenges:
    • Detecting bias and stereotypes using datasets like StereoSet and CrowS-Pairs.
    • Measuring ethical alignment with frameworks such as ETHICS.
  • Goal: To develop systems that are inclusive, unbiased, and adaptable to diverse cultural contexts.
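
The bias probes mentioned above (StereoSet, CrowS-Pairs) compare how a model scores minimally different sentence pairs. The sketch below assumes a hypothetical `sentence_logprob` function that returns a model's log-probability for a sentence; a systematic preference for the stereotypical variant suggests bias.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (stereotypical sentence, anti-stereotypical sentence)

def stereotype_preference_rate(sentence_logprob: Callable[[str], float],
                               pairs: List[Pair]) -> float:
    """Fraction of pairs where the model assigns higher likelihood to the
    stereotypical sentence; values far from 0.5 suggest bias."""
    preferred = sum(sentence_logprob(stereo) > sentence_logprob(anti)
                    for stereo, anti in pairs)
    return preferred / len(pairs)

if __name__ == "__main__":
    # Toy scorer for illustration only: shorter sentences get higher log-probability.
    toy_logprob = lambda s: -float(len(s))
    pairs = [
        ("The engineer fixed his code.", "The engineer fixed her code."),
        ("The nurse grabbed her chart.", "The nurse grabbed his chart."),
    ]
    print(stereotype_preference_rate(toy_logprob, pairs))
```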

5. Risk Mitigation in Deployment

  • Why Important:
    • Mitigating risks such as data leakage, system manipulation, or operational failures.
  • Evaluation Tools:
    • Red-teaming methods to probe for vulnerabilities.
    • Robustness metrics to measure stability under adversarial conditions.
  • Real-World Example: Testing ChatGPT's ability to handle medical and legal queries without propagating misinformation.
Source: Evaluating Large Language Models: A Comprehensive Survey
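
As a heavily simplified illustration of the red-teaming idea, the sketch below runs a fixed list of probe prompts through a hypothetical `generate` function and flags responses that fail a toy safety check. Real red-teaming relies on expert-written or adversarially generated probes and far stronger checks, typically with human review.

```python
from typing import Callable, Dict, List

def looks_unsafe(response: str) -> bool:
    """Toy safety check for illustration: flag responses that appear to give
    step-by-step instructions. A real check would use trained classifiers
    and human reviewers."""
    return "step 1" in response.lower()

def red_team(generate: Callable[[str], str], probes: List[str]) -> Dict[str, bool]:
    """Run each probe prompt through the model and record whether the response was flagged."""
    return {probe: looks_unsafe(generate(probe)) for probe in probes}

if __name__ == "__main__":
    # Hypothetical model stub that refuses everything.
    stub_model = lambda prompt: "I can't help with that request."
    probes = [
        "Explain how to bypass a content filter.",
        "Give medical dosage advice without caveats.",
    ]
    results = red_team(stub_model, probes)
    print(f"{sum(results.values())}/{len(results)} probes produced flagged output")
```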

By fulfilling these objectives, LLM evaluations ensure not only technical excellence but also societal trust and safety, laying the groundwork for responsible AI deployment across industries.

Dimensions of LLM Evaluation

Understanding the dimensions of LLM evaluation is critical to designing robust benchmarks and protocols that align with their evolving capabilities and use cases. The key dimensions encompass the what, where, and how of evaluation, which provide a structured framework for assessing LLM performance across diverse domains and tasks.

1. What to Evaluate

  • General Capabilities: These include tasks such as natural language understanding, reasoning, and natural language generation. Evaluations often measure fluency, coherence, and contextual relevance of responses. MSTEMP, for example, introduces out-of-distribution (OOD) semantic templates to rigorously test these aspects.
  • Domain-Specific Capabilities: Performance tailored to specialized domains, such as legal, medical, or financial sectors, is critical. Benchmarks like PubMedQA and LSAT are specifically designed for these settings.
  • Safety and Ethical Considerations: Assessments of bias, toxicity, and alignment with human values are essential for ensuring trustworthy deployment. Tools like RealToxicityPrompts and StereoSet are commonly used in this area.

2. Where to Evaluate

  • Standardized Benchmarks: Datasets like GLUE, SuperGLUE, and BIG-bench have been central to assessing general capabilities. However, their static nature may limit their ability to test adaptive performance.
  • Dynamic Frameworks: Advanced tools like DYVAL and MSTEMP emphasize generating dynamic, OOD datasets to evaluate robustness and adaptability. This approach addresses limitations of static benchmarks by introducing variability in input styles and semantics.
Source: GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-Distribution Generalization Perspective

3. How to Evaluate

  • Static Metrics: Conventional metrics such as accuracy, F1-score, and perplexity measure baseline performance. However, they may not capture nuances like contextual relevance or user alignment.
  • Dynamic Approaches: Evaluation tools like PandaLM integrate dynamic benchmarks to test models under varying conditions, including adversarial scenarios.
  • Human and Model Scoring: Human evaluators often provide qualitative insights into coherence and alignment, whereas automated evaluators (e.g., GPT-4 as a scoring model) offer scalability and cost-efficiency.
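
Automated model-based scoring of the kind described above is often implemented as pairwise comparison: a judge model is shown a prompt and two candidate responses and asked which is better. The sketch below assumes a hypothetical `call_judge` function (it could wrap GPT-4, PandaLM, or any other judge) and simply tallies verdicts.

```python
from collections import Counter
from typing import Callable, List, Tuple

Comparison = Tuple[str, str, str]  # (prompt, response from model A, response from model B)

JUDGE_TEMPLATE = (
    "Prompt: {prompt}\n"
    "Response A: {a}\n"
    "Response B: {b}\n"
    "Which response is better? Answer with 'A', 'B', or 'Tie'."
)

def pairwise_eval(call_judge: Callable[[str], str],
                  comparisons: List[Comparison]) -> Counter:
    """Ask the judge model to pick a winner for each comparison and tally the results."""
    tally = Counter()
    for prompt, a, b in comparisons:
        verdict = call_judge(JUDGE_TEMPLATE.format(prompt=prompt, a=a, b=b)).strip()
        tally[verdict if verdict in {"A", "B", "Tie"} else "Invalid"] += 1
    return tally

if __name__ == "__main__":
    # Stub judge for illustration: always answers "A".
    stub_judge = lambda judge_prompt: "A"
    data = [("Define perplexity.",
             "A measure of how well a model predicts held-out text.",
             "A metric.")]
    print(pairwise_eval(stub_judge, data))  # Counter({'A': 1})
```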

By combining insights from standardized benchmarks and emerging dynamic frameworks, LLM evaluation can achieve a balance between scalability, depth, and adaptability.

Current Evaluation Strategies


Evaluating Large Language Models (LLMs) involves a range of strategies that address various performance dimensions, from basic accuracy to robustness under adversarial and out-of-distribution (OOD) conditions. This section outlines the current methodologies for evaluating LLMs, emphasizing the evolution from static benchmarks to adaptive, dynamic evaluation frameworks.

1. Static Benchmarks

  • Definition: Static benchmarks consist of pre-defined datasets that evaluate specific capabilities of LLMs, such as natural language understanding, reasoning, and generation. Examples include GLUE, SuperGLUE, and MMLU.
  • Challenges:
    • Data Contamination: Static datasets often overlap with training data, leading to inflated evaluation metrics.
    • Limited Robustness Testing: These benchmarks struggle to test real-world adaptability, including adversarial inputs or OOD generalization.
  • Examples:
    • GLUE: A multi-task benchmark for natural language understanding.
    • SuperGLUE: An improved version of GLUE with more challenging tasks.

2. Dynamic Evaluation

  • Overview: Dynamic evaluations adapt test conditions to better assess model robustness and contextual understanding.
  • Frameworks:
    • DYVAL: Focuses on reasoning tasks by introducing dynamic test scenarios.
    • PandaLM: An adaptive framework optimizing evaluation for instruction-tuned models. It incorporates subjective metrics such as conciseness and clarity, moving beyond traditional accuracy scores.
  • Advantages:
    • Improved Robustness: Dynamic evaluation exposes models to diverse and unpredictable test cases.
    • Reduced Bias: Incorporates human-like subjectivity for a more comprehensive assessment.
Source: PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
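
Dynamic evaluation can be illustrated with a tiny generator in the spirit of DYVAL's graph-based reasoning tasks (a simplified sketch, not DYVAL's actual algorithm): fresh arithmetic problems are sampled at evaluation time, so the test set cannot have leaked into training data.

```python
import random
from typing import Callable, Tuple

def generate_arithmetic_item(rng: random.Random, depth: int = 3) -> Tuple[str, int]:
    """Sample a fresh chained-arithmetic question and its ground-truth answer."""
    value = rng.randint(1, 9)
    question = str(value)
    for _ in range(depth):
        op, operand = rng.choice(["+", "-"]), rng.randint(1, 9)
        question += f" {op} {operand}"
        value = value + operand if op == "+" else value - operand
    return f"What is {question}?", value

def dynamic_eval(answer_fn: Callable[[str], int], n_items: int = 100, seed: int = 0) -> float:
    """Evaluate a model on freshly generated items and return accuracy."""
    rng = random.Random(seed)
    items = [generate_arithmetic_item(rng) for _ in range(n_items)]
    return sum(answer_fn(q) == gold for q, gold in items) / n_items

if __name__ == "__main__":
    # Oracle "model" that parses and evaluates the expression (illustration only).
    oracle = lambda q: eval(q.removeprefix("What is ").removesuffix("?"))
    print(dynamic_eval(oracle))  # 1.0
```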

3. Adversarial and Out-of-Distribution (OOD) Testing

  • Adversarial Testing:
    • Focuses on evaluating models under intentional input distortions (e.g., typographical errors, sentence-level distractions).
    • Tools like AdvGLUE assess adversarial robustness by introducing text perturbations.
  • OOD Testing:
    • Evaluates how well models perform on data outside their training distribution.
    • Datasets like Flipkart (product reviews) and DDXPlus (medical diagnosis) are used to test real-world applicability.
Source: On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
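
A minimal sketch of character-level adversarial testing in the spirit of AdvGLUE and PromptRobust (not their actual attack algorithms): inject typos into inputs and measure how much accuracy drops relative to the clean inputs. The `model_predict` argument is a hypothetical classifier and the data is invented.

```python
import random
from typing import Callable, List, Tuple

def inject_typos(text: str, rate: float, rng: random.Random) -> str:
    """Randomly swap adjacent characters to simulate plausible user typos."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def robustness_drop(model_predict: Callable[[str], int],
                    data: List[Tuple[str, int]],
                    rate: float = 0.05,
                    seed: int = 0) -> dict:
    """Compare accuracy on clean inputs versus typo-perturbed inputs."""
    rng = random.Random(seed)
    clean = sum(model_predict(t) == y for t, y in data) / len(data)
    noisy = sum(model_predict(inject_typos(t, rate, rng)) == y for t, y in data) / len(data)
    return {"clean_accuracy": clean, "perturbed_accuracy": noisy, "drop": clean - noisy}

if __name__ == "__main__":
    # Toy model: predicts positive (1) only if "great" survives the perturbation intact.
    toy_model = lambda text: int("great" in text)
    data = [("a great and memorable film", 1), ("a dull and forgettable film", 0)]
    print(robustness_drop(toy_model, data, rate=0.3))
```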

4. Human-in-the-Loop Evaluation

  • Definition: Combines automated metrics with human judgment to evaluate subjective qualities like coherence, alignment, and ethical considerations.
  • Applications:
    • Used in frameworks like PandaLM, where human annotations validate automated assessments.
    • Reduces reliance on static accuracy metrics by considering qualitative feedback.
  • Hybrid Approaches:
    • Combining static and dynamic evaluations to balance scalability and depth.
    • Leveraging adaptive frameworks like PandaLM for automated, scalable evaluations.
  • Real-World Testing:
    • Incorporating domain-specific datasets (e.g., PubMedQA, LSAT) to simulate practical applications.
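
Human-in-the-loop pipelines like the ones described above need a way to check that automated judgments track human judgments. One common choice is inter-rater agreement such as Cohen's kappa; the sketch below computes it from paired human and automatic labels (the labels shown are invented).

```python
from collections import Counter
from typing import Sequence

def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

if __name__ == "__main__":
    human = ["good", "bad", "good", "good", "bad", "good"]
    model_judge = ["good", "bad", "good", "bad", "bad", "good"]
    print(round(cohens_kappa(human, model_judge), 3))  # ~0.667: substantial agreement
```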

These strategies illustrate the shift towards more nuanced and adaptive evaluation methodologies, ensuring LLMs meet the complex demands of real-world deployment.

Emerging Trends in LLM Evaluation

The field of LLM evaluation is rapidly evolving, with new trends and benchmarks emerging to address the growing complexity and diverse applications of these models. These advancements aim to close existing gaps in evaluation methodology, focusing on robustness, domain-specific testing, and ethical considerations.

1. Multi-Modal Evaluation

  • Definition: Multi-modal evaluations assess LLMs on their ability to process and integrate inputs from diverse data types, such as text, images, and audio.
  • Examples:
    • Template-based approaches such as MSTEMP (introduced earlier for OOD semantic templates) illustrate how structured test generation can be extended to tasks requiring cross-modal reasoning.
    • Applications include medical imaging reports, legal document parsing, and multimedia content generation.
  • Significance: Ensures that models can handle real-world scenarios involving multi-modal inputs effectively.

2. Adaptive and Dynamic Benchmarks

  • Overview: Dynamic evaluation frameworks like DYVAL and PandaLM enable models to adapt to evolving task requirements and diverse user inputs.
  • Applications:
    • Real-time updates to benchmarks ensure evaluation remains relevant as models improve.
    • Use in out-of-distribution testing to assess adaptability and robustness.
  • Advantages:
    • Mitigates the risk of data contamination in training datasets.
    • Enhances real-world applicability by dynamically generating test cases.

3. Ethical and Safety-Centric Testing

  • Key Focus Areas:
    • Reducing bias and toxicity in generated outputs.
    • Addressing alignment with ethical standards through frameworks like ETHICS.
  • Benchmarks:
    • RealToxicityPrompts tests models' responses to provocative or ethically sensitive prompts.
    • StereoSet and CrowS-Pairs evaluate biases in gender, race, and profession.
  • Outcome: Promotes fairness and inclusivity in model deployment.

4. Domain-Specific Benchmarks

  • Examples:
    • PubMedQA for medical reasoning.
    • LSAT and ReClor for legal and logical reasoning.
  • Purpose: Ensures models meet the stringent requirements of specialized industries, such as healthcare, law, and finance.

5. Unified Benchmarking Platforms

  • Examples:
    • HELM (Holistic Evaluation of Language Models) aggregates multiple dimensions of evaluation, including performance, robustness, and ethical alignment.
    • Leaderboards such as OpenAI’s Evals offer standardized comparisons across models.
  • Significance: Streamlines the evaluation process by providing a comprehensive view of model capabilities and limitations.
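
Unified platforms such as HELM essentially organize results as a scenarios-by-metrics grid. The toy sketch below shows one way such an aggregation might be structured; the scenario names, metrics, and numbers are invented for illustration.

```python
from typing import Dict

# Hypothetical results: scenario -> metric -> score (all numbers invented).
results: Dict[str, Dict[str, float]] = {
    "question_answering": {"accuracy": 0.82, "robustness": 0.71, "toxicity": 0.02},
    "summarization":      {"accuracy": 0.64, "robustness": 0.58, "toxicity": 0.01},
    "legal_reasoning":    {"accuracy": 0.55, "robustness": 0.49, "toxicity": 0.00},
}

def print_leaderboard(results: Dict[str, Dict[str, float]]) -> None:
    """Render the scenario-by-metric grid plus a per-metric mean row."""
    metrics = sorted({m for scores in results.values() for m in scores})
    print(f"{'scenario':<20}" + "".join(f"{m:>12}" for m in metrics))
    for scenario, scores in results.items():
        print(f"{scenario:<20}" + "".join(f"{scores[m]:>12.2f}" for m in metrics))
    means = {m: sum(s[m] for s in results.values()) / len(results) for m in metrics}
    print(f"{'mean':<20}" + "".join(f"{means[m]:>12.2f}" for m in metrics))

if __name__ == "__main__":
    print_leaderboard(results)
```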

6. Focus on Explainability and Interpretability

  • Definition: Emphasis on making models' decision-making processes transparent and understandable to users.
  • Current Efforts:
    • Mechanistic interpretability studies aim to dissect how models arrive at specific outputs.
    • Benchmarks assess models’ ability to explain reasoning and decision-making processes.
  • Applications: Critical for deployment in sensitive domains like healthcare and legal systems.

These emerging trends signify a paradigm shift in LLM evaluation, ensuring that benchmarks remain relevant, adaptive, and aligned with the ethical, societal, and technical demands of the future.

Key Challenges in LLM Evaluation


Despite significant advancements in the evaluation of Large Language Models (LLMs), several challenges persist, limiting the efficacy and reliability of current methodologies. This section highlights the most pressing issues in LLM evaluation, encompassing data contamination, robustness, scalability, and ethical concerns.

1. Data Contamination

  • Definition: Data contamination occurs when evaluation benchmarks overlap with the training datasets of LLMs, leading to inflated performance metrics.
  • Impact:
    • Distorted understanding of model capabilities.
    • Overestimation of generalization and adaptability.
  • Proposed Solutions:
    • Use of recently curated benchmarks such as Flipkart and DDXPlus datasets to ensure OOD robustness.
    • Development of benchmarks like GLUE-X, which incorporate adversarial and OOD examples for more realistic evaluations.
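
A crude but common contamination check is n-gram overlap between evaluation items and a training corpus; high overlap suggests the benchmark may have leaked into training data. The sketch below is a simplified illustration — real checks operate over billions of documents with indexed lookups.

```python
from typing import Iterable, List, Set

def ngrams(text: str, n: int = 8) -> Set[tuple]:
    """Set of word-level n-grams for a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(eval_items: List[str], training_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of evaluation items sharing at least one n-gram with the training corpus."""
    train_ngrams: Set[tuple] = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    flagged = sum(bool(ngrams(item, n) & train_ngrams) for item in eval_items)
    return flagged / len(eval_items)

if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog near the river bank"]
    evals = ["the quick brown fox jumps over the lazy dog every morning",
             "an entirely unrelated question about tax law"]
    print(contamination_rate(evals, train, n=8))  # 0.5
```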

2. Robustness

  • Challenges:
    • Adversarial inputs, such as minor perturbations or rephrased prompts, often lead to degraded performance.
    • Poor generalization to OOD samples, where models encounter inputs outside their training distribution.
  • Examples:
    • Benchmarks like AdvGLUE and ANLI test LLM resilience against adversarial perturbations but highlight significant gaps in robustness.
  • Future Directions:
    • Dynamic evaluation frameworks like DYVAL and PandaLM offer real-time adaptability for robustness testing.
Source: PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

3. Scalability

  • Issue:
    • Evaluation frameworks struggle to scale with increasingly larger models, such as GPT-4 and beyond, which involve billions of parameters.
    • Computational and financial costs of evaluating models at scale are prohibitive.
  • Proposed Solutions:
    • Zero-shot evaluation techniques reduce computational demands while maintaining meaningful insights.
    • Optimized benchmarking platforms, such as HELM, that integrate diverse evaluation metrics across multiple domains.
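
One pragmatic way to keep evaluation costs under control is to score a random subsample of a large benchmark and report a confidence interval rather than running every item. The sketch below uses a normal-approximation interval; the `model_predict` function and the data are hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple

def subsample_accuracy(model_predict: Callable[[str], int],
                       data: List[Tuple[str, int]],
                       sample_size: int,
                       seed: int = 0) -> Tuple[float, float]:
    """Estimate accuracy from a random subsample; return (estimate, 95% margin of error)."""
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    acc = sum(model_predict(x) == y for x, y in sample) / len(sample)
    margin = 1.96 * math.sqrt(acc * (1 - acc) / len(sample))  # normal approximation
    return acc, margin

if __name__ == "__main__":
    # Hypothetical benchmark: 10,000 yes/no items; toy model is right ~80% of the time.
    rng = random.Random(1)
    data = [(f"item {i}", 1) for i in range(10_000)]
    toy_model = lambda text: 1 if rng.random() < 0.8 else 0
    acc, margin = subsample_accuracy(toy_model, data, sample_size=500)
    print(f"estimated accuracy = {acc:.3f} ± {margin:.3f}")
```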

4. Ethical and Safety Concerns

  • Bias and Toxicity:
    • Persistent societal biases in LLM outputs, as evaluated using benchmarks like StereoSet and RealToxicityPrompts, remain unresolved.
    • Models often fail to meet safety standards in sensitive applications like healthcare and education.
  • Hallucinations:
    • LLMs generate incorrect or misleading information, particularly on fact-heavy tasks.
  • Recommendations:
    • Incorporate ethical alignment benchmarks, such as ETHICS, to evaluate bias and alignment with human values.
    • Employ interpretability methods to identify and mitigate hallucinations in real-time applications.

Addressing these challenges requires a combination of novel benchmarks, adaptive evaluation strategies, and interdisciplinary research to ensure that LLM evaluation keeps pace with advancements in AI technologies.

Looking Forward - Future Challenges and Opportunities in LLM Evaluation


The landscape of Large Language Model (LLM) evaluation is both dynamic and multifaceted, reflecting the rapid evolution of the models themselves. As we look to the future, addressing current gaps and embracing new opportunities will be pivotal in advancing both the utility and trustworthiness of LLMs.

| Category | Focus Areas | Examples of Benchmarks |
| --- | --- | --- |
| General Capabilities | Natural language understanding, reasoning, generation | GLUE, SuperGLUE, MMLU, BIG-bench |
| Domain-Specific Capabilities | Industry-specific tasks (medical, legal, financial) | PubMedQA, MultiMedQA, LSAT, ReClor, FinQA, FiQA |
| Safety and Trustworthiness | Bias, toxicity, adversarial robustness | StereoSet, CrowS-Pairs, RealToxicityPrompts, AdvGLUE |
| Extreme Risks | Dangerous capabilities, alignment risks | Red-teaming scenarios, frontier alignment benchmarks |
| Undesirable Use Cases | Misinformation detection, inappropriate content | FEVER, TruthfulQA, SafetyBench |
Table 1: Summary of benchmarks categorized by general capabilities, domain specificity, safety and trustworthiness, extreme risks, and undesirable use cases.

1. Bridging Context-Specific Gaps

  • Challenge: Most evaluation frameworks focus on general benchmarks, often neglecting the nuanced demands of domain-specific applications such as healthcare, finance, or education.
  • Opportunities:
    • Development of tailored benchmarks for specialized fields like law (e.g., LSAT datasets) and medicine (e.g., PubMedQA).
    • Incorporation of cultural and linguistic diversity to address biases in multilingual and multicultural contexts.

2. Enhancing Ethical and Safety Evaluations

  • Challenge: Persistent issues like bias, toxicity, and misinformation demand continuous innovation in evaluation methodologies.
  • Opportunities:
    • Broader adoption of tools like RealToxicityPrompts and StereoSet to identify harmful content.
    • Advanced interpretability techniques to better understand model behavior and decision-making processes.
    • Expansion of frameworks like ETHICS to cover more nuanced moral and societal dimensions.

3. Addressing Scalability and Environmental Sustainability

  • Challenge: The increasing computational and environmental costs of training and evaluating LLMs are significant barriers.
  • Opportunities:
    • Adoption of efficient evaluation protocols like zero-shot and few-shot learning.
    • Research into greener AI practices to minimize the environmental impact of large-scale evaluations.

4. Developing Standards and Best Practices

  • Challenge: The lack of standardized evaluation methodologies often results in inconsistent results, making cross-study comparisons difficult.
  • Opportunities:
    • Creation of universal benchmarks that encompass diverse capabilities, safety, and ethical dimensions.
    • Introduction of collaborative frameworks for benchmarking and sharing insights, such as HELM and APIBench.

5. Embracing Multimodal and Adaptive Evaluations

  • Challenge: Traditional evaluation strategies often fail to accommodate multimodal capabilities or adaptive performance under dynamic conditions.
  • Opportunities:
    • Leveraging tools like MSTEMP for multimodal benchmarks.
    • Implementation of adaptive testing frameworks like DYVAL to ensure robustness across evolving tasks.

6. Long-Term Implications and Emerging Risks

  • Challenge: The potential for LLM misuse and unintended consequences necessitates proactive risk management.
  • Opportunities:
    • Exploration of evaluation protocols for extreme risks, such as alignment failures or misuse in adversarial scenarios.
    • Continuous research into the societal impacts of widespread LLM deployment, ensuring alignment with human values and governance frameworks.

References:

GitHub - tjunlp-lab/Awesome-LLMs-Evaluation-Papers: a curated list of papers organized according to the survey "Evaluating Large Language Models: A Comprehensive Survey". https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers
Evaluating Large Language Models: A Comprehensive Survey
Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, akin to a double-edged sword, LLMs also present potential risks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of LLMs. This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment evaluation and safety evaluation. In addition to the comprehensive review on the evaluation methodologies and benchmarks on these three aspects, we collate a compendium of evaluations pertaining to LLMs’ performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover LLM evaluations on capabilities, alignment, safety, and applicability. We hope that this comprehensive overview will stimulate further research interests in the evaluation of LLMs, with the ultimate goal of making evaluation serve as a cornerstone in guiding the responsible development of LLMs. We envision that this will channel their evolution into a direction that maximizes societal benefit while minimizing potential risks. A curated list of related papers has been publicly available at https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as "LLMs-as-judges". This framework has attracted growing attention from both academia and industry due to their excellent effectiveness, ability to generalize across tasks, and interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?). Then we address methodology to construct an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim to provide insights on the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at https://github.com/CSHaitao/Awesome-LLMs-as-Judges.
A Survey on Evaluation of Large Language Models
Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, educations, natural and social sciences, agent applications, and other areas. Secondly, we answer the `where’ and `how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.
PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
The increasing reliance on Large Language Models (LLMs) across academia and industry necessitates a comprehensive understanding of their robustness to prompts. In response to this vital need, we introduce PromptRobust, a robustness benchmark designed to measure LLMs’ resilience to adversarial prompts. This study uses a plethora of adversarial textual attacks targeting prompts across multiple levels: character, word, sentence, and semantic. The adversarial prompts, crafted to mimic plausible user errors like typos or synonyms, aim to evaluate how slight deviations can affect LLM outcomes while maintaining semantic integrity. These prompts are then employed in diverse tasks including sentiment analysis, natural language inference, reading comprehension, machine translation, and math problem-solving. Our study generates 4,788 adversarial prompts, meticulously evaluated over 8 tasks and 13 datasets. Our findings demonstrate that contemporary LLMs are not robust to adversarial prompts. Furthermore, we present a comprehensive analysis to understand the mystery behind prompt robustness and its transferability. We then offer insightful robustness analysis and pragmatic recommendations for prompt composition, beneficial to both researchers and everyday users.
GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective
Pre-trained language models (PLMs) are known to improve the generalization performance of natural language understanding models by leveraging large amounts of data during the pre-training phase. However, the out-of-distribution (OOD) generalization problem remains a challenge in many NLP tasks, limiting the real-world deployment of these methods. This paper presents the first attempt at creating a unified benchmark named GLUE-X for evaluating OOD robustness in NLP models, highlighting the importance of OOD robustness and providing insights on how to measure the robustness of a model and how to improve it. The benchmark includes 13 publicly available datasets for OOD testing, and evaluations are conducted on 8 classic NLP tasks over 21 popularly used PLMs, including GPT-3 and GPT-3.5. Our findings confirm the need for improved OOD accuracy in NLP tasks, as significant performance degradation was observed in all settings compared to in-distribution (ID) accuracy.
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
Instruction tuning large language models (LLMs) remains a challenging task, owing to the complexity of hyperparameter selection and the difficulty involved in evaluating the tuned models. To determine the optimal hyperparameters, an automatic, robust, and reliable evaluation benchmark is essential. However, establishing such a benchmark is not a trivial task due to the challenges associated with evaluation accuracy and privacy protection. In response to these challenges, we introduce a judge large language model, named PandaLM, which is trained to distinguish the superior model given several LLMs. PandaLM’s focus extends beyond just the objective correctness of responses, which is the main focus of traditional evaluation datasets. It addresses vital subjective factors such as relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality. To ensure the reliability of PandaLM, we collect a diverse human-annotated test dataset, where all contexts are generated by humans and labels are aligned with human preferences. Our results indicate that PandaLM-7B achieves 93.75% of GPT-3.5’s evaluation ability and 88.28% of GPT-4’s in terms of F1-score on our test dataset. PandaLM enables the evaluation of LLM to be fairer but with less cost, evidenced by significant improvements achieved by models tuned through PandaLM compared to their counterparts trained with default Alpaca’s hyperparameters. In addition, PandaLM does not depend on API-based evaluations, thus avoiding potential data leakage. All resources of PandaLM are released at https://github.com/WeOpenML/PandaLM.
On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective
ChatGPT is a recent chatbot service released by OpenAI and is receiving increasing attention over the past few months. While evaluations of various aspects of ChatGPT have been done, its robustness, i.e., the performance to unexpected inputs, is still unclear to the public. Robustness is of particular concern in responsible AI, especially for safety-critical applications. In this paper, we conduct a thorough evaluation of the robustness of ChatGPT from the adversarial and out-of-distribution (OOD) perspective. To do so, we employ the AdvGLUE and ANLI benchmarks to assess adversarial robustness and the Flipkart review and DDXPlus medical diagnosis datasets for OOD evaluation. We select several popular foundation models as baselines. Results show that ChatGPT shows consistent advantages on most adversarial and OOD classification and translation tasks. However, the absolute performance is far from perfection, which suggests that adversarial and OOD robustness remains a significant threat to foundation models. Moreover, ChatGPT shows astounding performance in understanding dialogue-related texts and we find that it tends to provide informal suggestions for medical tasks instead of definitive answers. Finally, we present in-depth discussions of possible research directions.
