Are Aligned Language Models “Adversarially Aligned”?

Video link: https://www.youtube.com/watch?v=uqOfC3KSZFc

Duration: 1:02:41

Nicholas Carlini (Google DeepMind)
https://simons.berkeley.edu/talks/nicholas-carlini-google-deepmind-2023-08-16
Large Language Models and Transformers

An "aligned" model is "helpful and harmless". In this talk I will show that while language models may be aligned under typical situations, they are not "adversarially aligned". Using standard techniques from adversarial examples, we can construct inputs to otherwise-aligned language models to coerce them into emitting harmful text and performing harmful behavior. Creating aligned models robust to adversaries will require significant advances in both alignment and adversarial machine learning.
Tags:
Simons Institute
theoretical computer science
UC Berkeley
Computer Science
Theory of Computation
Theory of Computing
Large Language Models and Transformers
Nicholas Carlini