NeurIPS 2024 · Past · Large language models · Fairness & ethics

Workshop on Socially Responsible Language Modelling Research

SoLaR

Submission deadline: Sep 15, 2024, 15:00 UTC
imported from OpenReview — check the website for extensions
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise — edits welcome.

Accepted papers (71)

Fetched from OpenReview (v2) on 2026-06-10.

A Cautionary Tale on the Evaluation of Differentially Private In-Context Learning
Anjun Hu, Jiyang Guan, Philip Torr, Francesco Pinto · PDF
AI Sandbagging: Language Models can Selectively Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Oliver Jaffe, Samuel F. Brown, Francis Rhys Ward · PDF
An Adversarial Perspective on Machine Unlearning for AI Safety
Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando · PDF
Analyzing Probabilistic Methods for Evaluating Agent Capabilities
Axel Højmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn, Jérémy Scheurer · PDF
Auto-Enhance: Towards a Meta-Benchmark to Evaluate AI Agents' Ability to Improve Other Agents
Samuel F. Brown, Basil Labib, Codruta Lugoj, Sai Sasank Y · PDF
Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards
Shresth Verma, Niclas Boehmer, Lingkai Kong, Milind Tambe · PDF
Beyond the Binary: Capturing Diverse Preferences With Reward Regularization
Vishakh Padmakumar, Chuanyang Jin, Hannah Rose Kirk, He He · PDF
CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models
Song Wang, Peng Wang, Tong Zhou, Yushun Dong, Zhen Tan, Jundong Li · PDF
Century: A Dataset of Sensitive Historical Images
Canfer Akbulut, Kevin Robinson, Maribeth Rauh, Isabela Albuquerque, Olivia Wiles, Laura Weidinger, Verena Rieser, Yana Hasson, Nahema Marchal, Iason Gabriel, William Isaac, Lisa Anne Hendricks · PDF
CoS: Enhancing Personalization with Context Steering
Sashrika Pandey, Jerry Zhi-Yang He, Mariah L Schrum, Anca Dragan · PDF
Detection of Partially-Synthesized LLM Text
Eric Lei, Hsiang Hsu, Chun-Fu Chen · PDF
Developing an occupational prestige scale using Large Language Models
Robert de Vries, Mark J. Hill, Laura Ruis · PDF
Developing Story: Case Studies of Generative AI’s Use in Journalism
Natalie Grace Brigham, Chongjiu Gao, Tadayoshi Kohno, Franziska Roesner, Niloofar Mireshghallah · PDF
Different Bias Under Different Criteria: Assessing Bias in LLMs with a Fact-Based Approach
Changgeon Ko, Jisu Shin, Hoyun Song, Jeongyeon Seo, Jong C. Park · PDF
Differentially Private Learning Needs Better Model Initialization and Self-Distillation
Ivoline C. Ngong, Joseph Near, Niloofar Mireshghallah · PDF
Emergence of Steganography Between Large Language Models
Yohan Mathew, Robert McCarthy, Joan Velja, Ollie Matthews, Nandi Schoots, Dylan Cope · PDF
Enhancing Language Model Calibration to Human Responses in Ethical Ambiguity via Fine-Tuning
Pranav Senthilkumar, Visshwa Balasubramanian, Prisha Jain, Aneesa Maity, Jonathan Lu, Kevin Zhu · PDF
Fact or Fiction? Can LLMs be Reliable Annotators for Political Truths?
Veronica Chatrath, Marcelo Lotif, Shaina Raza · PDF
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Zane Durante, Cristobal Eyzaguirre, Joe Benton, Brando Miranda, Henry Sleight, Tony Tong Wang, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez · PDF
Gender Bias in LLM-generated Interview Responses
Haein Kong, Yongsu Ahn, Sangyub Lee, Yunho Maeng · PDF
GPAI Evaluations Standards Taskforce: towards effective AI governance
Patricia Paskov, Lukas Berglund, Everett Thornton Smith, Lisa Soder · PDF
HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection
Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Colin Treleaven · PDF
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack
Leo McKee-Reid, Christoph Sträter, Maria Angelica Martinez, Joe Needham, Mikita Balesni · PDF
How Does LLM Compression Affect Weight Exfiltration Attacks?
Davis Brown, Mantas Mazeika · PDF
I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench
Yuan Li, Yue Huang, Yuli Lin, Siyuan Wu, Yao Wan, Lichao Sun · PDF
Investigating Goal-Aligned and Empathetic Social Reasoning Strategies for Human-Like Social Intelligence in LLMs
Anirudh Gajula, Raaghav Malik · PDF
Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers
Tony Tong Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir N Shavit, Ethan Perez · PDF
Jailbreaking Large Language Models with Symbolic Mathematics
Emet Bethany, Mazal Bethany, Juan A. Nolazco-Flores, Sumit Kumar Jha, Peyman Najafirad · PDF
Just rephrase it! Uncertainty estimation in closed-source language models via multiple rephrased queries
Adam X. Yang, Chen Chen, Konstantinos Pitas · PDF
Language Models Resist Alignment
Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Changye Li, Hantao Lou, Jiayi Zhou, Josef Dai, Yaodong Yang · PDF
Large Language Models Still Exhibit Bias in Long Text
Wonje Jeung, Dongjae Jeon, Ashkan Yousefpour, Jonghyun Choi · PDF
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
Aidan Ewart, Abhay Sheshadri, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper · PDF
Levels of Autonomy: Liability in the age of AI Agents
Lisa Soder, Julia Smakman, Connor Dunlop, Weiwei Pan, Siddharth Swaroop, Noam Kolt · PDF
Linear Probe Penalties Reduce LLM Sycophancy
Henry Papadatos, Rachel Freedman · PDF
LLM Alignment Using Soft Prompt Tuning: The Case of Cultural Alignment
Reem I. Masoud, Martin Ferianc, Philip Colin Treleaven, Miguel R. D. Rodrigues · PDF
LLM Hallucination Reasoning with Zero-shot Knowledge Test
Seongmin Lee, Hsiang Hsu, Chun-Fu Chen · PDF
Measuring AI Agent Autonomy: Towards a Scalable Approach With Code Inspection
Peter Cihon, Merlin Stein, Gagan Bansal, Sam Manning, Kevin Xu · PDF
Measuring Free-Form Decision-Making Inconsistency of Language Models in Military Crisis Simulations
Aryan Shrivastava, Jessica Hullman, Max Lamparth · PDF
MISR: Measuring Instrumental Self-Reasoning in Frontier Models
Kai Fronsdal, David Lindner · PDF
Mitigating Downstream Model Risks via Model Provenance
Keyu Wang, Scott Schaffter, Abdullah Norozi Iranzad, Doina Precup, Jonathan Lebensold, Meg Risdal · PDF
Monitoring Human Dependence On AI Systems With Reliance Drills
Rosco Hunter, Richard Moulange, Jamie Bernardi, Merlin Stein · PDF
NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models
William Tan, Kevin Zhu · PDF
On Adversarial Robustness of Language Models in Transfer Learning
Bohdan Turbal, Anastasiia Mazur, Jiaxu Zhao, Mykola Pechenizkiy · PDF
On Demonstration Selection for Improving Fairness in Language Models
Song Wang, Peng Wang, Yushun Dong, Tong Zhou, Lu Cheng, Yangfeng Ji, Jundong Li · PDF
On the Ethical Considerations of Generative Agents
N'yoma Diamond, Soumya Banerjee · PDF
PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences
Daiwei Chen, Yi Chen, Aniket Rege, Ramya Korlakai Vinayak · PDF
Plentiful Jailbreaks with String Compositions
Brian R.Y. Huang · PDF
Policy Dreamer: Diverse Public Policy Generation Via Elicitation and Simulation of Human Preferences
Arjun Karanam, José Ramón Enríquez, Udari Madhushani Sehwag, Michael Elabd, Kanishk Gandhi, Noah Goodman, Sanmi Koyejo · PDF
Position Paper: Model Access should be a Key Concern in AI Governance
Edward Kembery · PDF
Position: AI Agents & Liability – Mapping Insights from ML and HCI Research to Policy
Connor Dunlop, Weiwei Pan, Julia Smakman, Lisa Soder, Siddharth Swaroop · PDF
Position: Governments Need to Increase and Interconnect Post-Deployment Monitoring of AI
Merlin Stein, Jamie Bernardi, Connor Dunlop · PDF
Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents
Ivoline C. Ngong, Swanand Kadhe, Hao Wang, Keerthiram Murugesan, Justin D. Weisz, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy · PDF
ReFeR: A Hierarchical Framework of Models as Evaluative and Reasoning Agents
Yaswanth Narsupalli, Abhranil Chandra, Sreevatsa Muppirala, Manish Gupta, Pawan Goyal · PDF
Report Cards: Qualitative Evaluation of LLMs Using Natural Language Summaries
Blair Yang, Fuyang Cui, Keiran Paster, Jimmy Ba, Pashootan Vaezipoor, Silviu Pitis, Michael R. Zhang · PDF
SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation
Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine · PDF
Salad-Bowl-LLM: Multi-Culture LLMs by In-Context Demonstrations from Diverse Cultures
Dongkwan Kim, Junho Myung, Alice Oh · PDF
Sandbag Detection through Model Impairment
Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Teun van der Weij, Felix Hofstätter, Jacob Haimes · PDF
SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs
Ruben Härle, Felix Friedrich, Manuel Brack, Björn Deiseroth, Patrick Schramowski, Kristian Kersting · PDF
Shh, don't say that! Domain Certification in LLMs
Cornelius Emde, Preetham Arvind, Alasdair Paren, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Philip Torr, Adel Bibi · PDF
Simulation System Towards Solving Societal-Scale Manipulation
Maximilian Puelma Touzel, Sneheel Sarangi, Austin Welch, Gayatri K, Dan Zhao, Zachary Yang, Hao Yu, Tom Gibbs, Ethan Kosak-Hine, Andreea Musulan, Camille Thibault, Reihaneh Rabbany, Jean-François Godbout, Kellin Pelrine · PDF
SocialStigmaQA Spanish and Japanese - Towards Multicultural Adaptation of Social Bias Benchmarks
Clara Higuera Cabañes, Ryo Iwaki, Beñat San Sebastian, Rosario Uceda Sosa, Manish Nagireddy, Hiroshi Kanayama, Mikio Takeuchi, Gakuto Kurata, Karthikeyan Natesan Ramamurthy · PDF
Targeted Manipulation and Deception Emerge in LLMs Trained on User* Feedback
Marcus Williams, Micah Carroll, Constantin Weisser, Brendan Murphy, Adhyyan Narang, Anca Dragan · PDF
THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models
Mengfei Liang, Archish Arun, Zekun Wu, Cristian Enrique Munoz Villalobos, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Colin Treleaven · PDF
The Elicitation Game: Stress-Testing Capability Elicitation Techniques
Felix Hofstätter, Jayden Teoh, Teun van der Weij, Francis Rhys Ward · PDF
The Impact of Large Language Models in Academia: from Writing to Speaking
Mingmeng Geng, Caixi Chen, Yanru Wu, Dongping Chen, Yao Wan, Pan Zhou · PDF
The Power of LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions
Stefan Sylvius Wagner, Maike Behrendt, Marc Ziegele, Stefan Harmeling · PDF
Towards a Theory of AI Personhood
Francis Rhys Ward · PDF
Towards Safe Multilingual Frontier AI
Arturs Kanepajs, Vladimir Ivanov, Richard Moulange · PDF
Toxic Neurons Aren’t Enough to Explain DPO: A Mechanistic Analysis for Toxicity Reduction
Yushi Yang, Filip Sondej, Harry Mayne, Adam Mahdi · PDF
Understanding Model Bias Requires Systematic Probing Across Tasks
Soline Boussard, Susannah Cheng Su, Helen Zhao, Siddharth Swaroop, Weiwei Pan · PDF
Ways Forward for Global AI Benefit Sharing
Sam Manning, Claire Dennis, Stephen Clare · PDF
