NeurIPS 2024 · Past · Large language models · Fairness & ethics

Workshop on Socially Responsible Language Modelling Research

SoLaR

Submission deadline
Sep 15, 2024, 15:00 UTC
imported from OpenReview — check the website for extensions
Submission portal
OpenReview
Notes
Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).

Accepted papers (71)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A Cautionary Tale on the Evaluation of Differentially Private In-Context Learning

    Anjun Hu, Jiyang Guan, Philip Torr, Francesco Pinto · PDF
  2. AI Sandbagging: Language Models can Selectively Underperform on Evaluations

    Teun van der Weij, Felix Hofstätter, Oliver Jaffe, Samuel F. Brown, Francis Rhys Ward · PDF
  3. An Adversarial Perspective on Machine Unlearning for AI Safety

    Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando · PDF
  4. Analyzing Probabilistic Methods for Evaluating Agent Capabilities

    Axel Højmark, Govind Pimpale, Arjun Panickssery, Marius Hobbhahn, Jérémy Scheurer · PDF
  5. Auto-Enhance: Towards a Meta-Benchmark to Evaluate AI Agents' Ability to Improve Other Agents

    Samuel F. Brown, Basil Labib, Codruta Lugoj, Sai Sasank Y · PDF
  6. Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards

    Shresth Verma, Niclas Boehmer, Lingkai Kong, Milind Tambe · PDF
  7. Beyond the Binary: Capturing Diverse Preferences With Reward Regularization

    Vishakh Padmakumar, Chuanyang Jin, Hannah Rose Kirk, He He · PDF
  8. CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models

    Song Wang, Peng Wang, Tong Zhou, Yushun Dong, Zhen Tan, Jundong Li · PDF
  9. Century: A Dataset of Sensitive Historical Images

    Canfer Akbulut, Kevin Robinson, Maribeth Rauh, Isabela Albuquerque, Olivia Wiles, Laura Weidinger, Verena Rieser, Yana Hasson, Nahema Marchal, Iason Gabriel, William Isaac, Lisa Anne Hendricks · PDF
  10. CoS: Enhancing Personalization with Context Steering

    Sashrika Pandey, Jerry Zhi-Yang He, Mariah L Schrum, Anca Dragan · PDF
  11. Detection of Partially-Synthesized LLM Text

    Eric Lei, Hsiang Hsu, Chun-Fu Chen · PDF
  12. Developing an occupational prestige scale using Large Language Models

    Robert de Vries, Mark J. Hill, Laura Ruis · PDF
  13. Developing Story: Case Studies of Generative AI’s Use in Journalism

    Natalie Grace Brigham, Chongjiu Gao, Tadayoshi Kohno, Franziska Roesner, Niloofar Mireshghallah · PDF
  14. Different Bias Under Different Criteria: Assessing Bias in LLMs with a Fact-Based Approach

    Changgeon Ko, Jisu Shin, Hoyun Song, Jeongyeon Seo, Jong C. Park · PDF
  15. Differentially Private Learning Needs Better Model Initialization and Self-Distillation

    Ivoline C. Ngong, Joseph Near, Niloofar Mireshghallah · PDF
  16. Emergence of Steganography Between Large Language Models

    Yohan Mathew, Robert McCarthy, Joan Velja, Ollie Matthews, Nandi Schoots, Dylan Cope · PDF
  17. Enhancing Language Model Calibration to Human Responses in Ethical Ambiguity via Fine-Tuning

    Pranav Senthilkumar, Visshwa Balasubramanian, Prisha Jain, Aneesa Maity, Jonathan Lu, Kevin Zhu · PDF
  18. Fact or Fiction? Can LLMs be Reliable Annotators for Political Truths?

    Veronica Chatrath, Marcelo Lotif, Shaina Raza · PDF
  19. Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

    Rylan Schaeffer, Dan Valentine, Luke Bailey, James Chua, Zane Durante, Cristobal Eyzaguirre, Joe Benton, Brando Miranda, Henry Sleight, Tony Tong Wang, John Hughes, Rajashree Agrawal, Mrinank Sharma, Scott Emmons, Sanmi Koyejo, Ethan Perez · PDF
  20. Gender Bias in LLM-generated Interview Responses

    Haein Kong, Yongsu Ahn, Sangyub Lee, Yunho Maeng · PDF
  21. GPAI Evaluations Standards Taskforce: towards effective AI governance

    Patricia Paskov, Lukas Berglund, Everett Thornton Smith, Lisa Soder · PDF
  22. HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection

    Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Colin Treleaven · PDF
  23. Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack

    Leo McKee-Reid, Christoph Sträter, Maria Angelica Martinez, Joe Needham, Mikita Balesni · PDF
  24. How Does LLM Compression Affect Weight Exfiltration Attacks?

    Davis Brown, Mantas Mazeika · PDF
  25. I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench

    Yuan Li, Yue Huang, Yuli Lin, Siyuan Wu, Yao Wan, Lichao Sun · PDF
  26. Investigating Goal-Aligned and Empathetic Social Reasoning Strategies for Human-Like Social Intelligence in LLMs

    Anirudh Gajula, Raaghav Malik · PDF
  27. Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers

    Tony Tong Wang, John Hughes, Henry Sleight, Rylan Schaeffer, Rajashree Agrawal, Fazl Barez, Mrinank Sharma, Jesse Mu, Nir N Shavit, Ethan Perez · PDF
  28. Jailbreaking Large Language Models with Symbolic Mathematics

    Emet Bethany, Mazal Bethany, Juan A. Nolazco-Flores, Sumit Kumar Jha, Peyman Najafirad · PDF
  29. Just rephrase it! Uncertainty estimation in closed-source language models via multiple rephrased queries

    Adam X. Yang, Chen Chen, Konstantinos Pitas · PDF
  30. Language Models Resist Alignment

    Jiaming Ji, Kaile Wang, Tianyi Qiu, Boyuan Chen, Changye Li, Hantao Lou, Jiayi Zhou, Josef Dai, Yaodong Yang · PDF
  31. Large Language Models Still Exhibit Bias in Long Text

    Wonje Jeung, Dongjae Jeon, Ashkan Yousefpour, Jonghyun Choi · PDF
  32. Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

    Aidan Ewart, Abhay Sheshadri, Phillip Huang Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper · PDF
  33. Levels of Autonomy: Liability in the age of AI Agents

    Lisa Soder, Julia Smakman, Connor Dunlop, Weiwei Pan, Siddharth Swaroop, Noam Kolt · PDF
  34. Linear Probe Penalties Reduce LLM Sycophancy

    Henry Papadatos, Rachel Freedman · PDF
  35. LLM Alignment Using Soft Prompt Tuning: The Case of Cultural Alignment

    Reem I. Masoud, Martin Ferianc, Philip Colin Treleaven, Miguel R. D. Rodrigues · PDF
  36. LLM Hallucination Reasoning with Zero-shot Knowledge Test

    Seongmin Lee, Hsiang Hsu, Chun-Fu Chen · PDF
  37. Measuring AI Agent Autonomy: Towards a Scalable Approach With Code Inspection

    Peter Cihon, Merlin Stein, Gagan Bansal, Sam Manning, Kevin Xu · PDF
  38. Measuring Free-Form Decision-Making Inconsistency of Language Models in Military Crisis Simulations

    Aryan Shrivastava, Jessica Hullman, Max Lamparth · PDF
  39. MISR: Measuring Instrumental Self-Reasoning in Frontier Models

    Kai Fronsdal, David Lindner · PDF
  40. Mitigating Downstream Model Risks via Model Provenance

    Keyu Wang, Scott Schaffter, Abdullah Norozi Iranzad, Doina Precup, Jonathan Lebensold, Meg Risdal · PDF
  41. Monitoring Human Dependence On AI Systems With Reliance Drills

    Rosco Hunter, Richard Moulange, Jamie Bernardi, Merlin Stein · PDF
  42. NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models

    William Tan, Kevin Zhu · PDF
  43. On Adversarial Robustness of Language Models in Transfer Learning

    Bohdan Turbal, Anastasiia Mazur, Jiaxu Zhao, Mykola Pechenizkiy · PDF
  44. On Demonstration Selection for Improving Fairness in Language Models

    Song Wang, Peng Wang, Yushun Dong, Tong Zhou, Lu Cheng, Yangfeng Ji, Jundong Li · PDF
  45. On the Ethical Considerations of Generative Agents

    N'yoma Diamond, Soumya Banerjee · PDF
  46. PAL: Pluralistic Alignment Framework for Learning from Heterogeneous Preferences

    Daiwei Chen, Yi Chen, Aniket Rege, Ramya Korlakai Vinayak · PDF
  47. Plentiful Jailbreaks with String Compositions

    Brian R.Y. Huang · PDF
  48. Policy Dreamer: Diverse Public Policy Generation Via Elicitation and Simulation of Human Preferences

    Arjun Karanam, José Ramón Enríquez, Udari Madhushani Sehwag, Michael Elabd, Kanishk Gandhi, Noah Goodman, Sanmi Koyejo · PDF
  49. Position Paper: Model Access should be a Key Concern in AI Governance

    Edward Kembery · PDF
  50. Position: AI Agents & Liability – Mapping Insights from ML and HCI Research to Policy

    Connor Dunlop, Weiwei Pan, Julia Smakman, Lisa Soder, Siddharth Swaroop · PDF
  51. Position: Governments Need to Increase and Interconnect Post-Deployment Monitoring of AI

    Merlin Stein, Jamie Bernardi, Connor Dunlop · PDF
  52. Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents

    Ivoline C. Ngong, Swanand Kadhe, Hao Wang, Keerthiram Murugesan, Justin D. Weisz, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy · PDF
  53. ReFeR: A Hierarchical Framework of Models as Evaluative and Reasoning Agents

    Yaswanth Narsupalli, Abhranil Chandra, Sreevatsa Muppirala, Manish Gupta, Pawan Goyal · PDF
  54. Report Cards: Qualitative Evaluation of LLMs Using Natural Language Summaries

    Blair Yang, Fuyang Cui, Keiran Paster, Jimmy Ba, Pashootan Vaezipoor, Silviu Pitis, Michael R. Zhang · PDF
  55. SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation

    Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine · PDF
  56. Salad-Bowl-LLM: Multi-Culture LLMs by In-Context Demonstrations from Diverse Cultures

    Dongkwan Kim, Junho Myung, Alice Oh · PDF
  57. Sandbag Detection through Model Impairment

    Cameron Tice, Philipp Alexander Kreer, Nathan Helm-Burger, Prithviraj Singh Shahani, Fedor Ryzhenkov, Teun van der Weij, Felix Hofstätter, Jacob Haimes · PDF
  58. SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs

    Ruben Härle, Felix Friedrich, Manuel Brack, Björn Deiseroth, Patrick Schramowski, Kristian Kersting · PDF
  59. Shh, don't say that! Domain Certification in LLMs

    Cornelius Emde, Preetham Arvind, Alasdair Paren, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Philip Torr, Adel Bibi · PDF
  60. Simulation System Towards Solving Societal-Scale Manipulation

    Maximilian Puelma Touzel, Sneheel Sarangi, Austin Welch, Gayatri K, Dan Zhao, Zachary Yang, Hao Yu, Tom Gibbs, Ethan Kosak-Hine, Andreea Musulan, Camille Thibault, Reihaneh Rabbany, Jean-François Godbout, Kellin Pelrine · PDF
  61. SocialStigmaQA Spanish and Japanese - Towards Multicultural Adaptation of Social Bias Benchmarks

    Clara Higuera Cabañes, Ryo Iwaki, Beñat San Sebastian, Rosario Uceda Sosa, Manish Nagireddy, Hiroshi Kanayama, Mikio Takeuchi, Gakuto Kurata, Karthikeyan Natesan Ramamurthy · PDF
  62. Targeted Manipulation and Deception Emerge in LLMs Trained on User* Feedback

    Marcus Williams, Micah Carroll, Constantin Weisser, Brendan Murphy, Adhyyan Narang, Anca Dragan · PDF
  63. THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models

    Mengfei Liang, Archish Arun, Zekun Wu, Cristian Enrique Munoz Villalobos, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Colin Treleaven · PDF
  64. The Elicitation Game: Stress-Testing Capability Elicitation Techniques

    Felix Hofstätter, Jayden Teoh, Teun van der Weij, Francis Rhys Ward · PDF
  65. The Impact of Large Language Models in Academia: from Writing to Speaking

    Mingmeng Geng, Caixi Chen, Yanru Wu, Dongping Chen, Yao Wan, Pan Zhou · PDF
  66. The Power of LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions

    Stefan Sylvius Wagner, Maike Behrendt, Marc Ziegele, Stefan Harmeling · PDF
  67. Towards a Theory of AI Personhood

    Francis Rhys Ward · PDF
  68. Towards Safe Multilingual Frontier AI

    Arturs Kanepajs, Vladimir Ivanov, Richard Moulange · PDF
  69. Toxic Neurons Aren’t Enough to Explain DPO: A Mechanistic Analysis for Toxicity Reduction

    Yushi Yang, Filip Sondej, Harry Mayne, Adam Mahdi · PDF
  70. Understanding Model Bias Requires Systematic Probing Across Tasks

    Soline Boussard, Susannah Cheng Su, Helen Zhao, Siddharth Swaroop, Weiwei Pan · PDF
  71. Ways Forward for Global AI Benefit Sharing

    Sam Manning, Claire Dennis, Stephen Clare · PDF