Are Aligned Language Models “Adversarially Aligned”?
Video Link: https://www.youtube.com/watch?v=uqOfC3KSZFc
Nicholas Carlini (Google DeepMind)
https://simons.berkeley.edu/talks/nicholas-carlini-google-deepmind-2023-08-16
Large Language Models and Transformers
An "aligned" model is "helpful and harmless". In this talk I will show that while language models may be aligned under typical situations, they are not "adversarially aligned". Using standard techniques from adversarial examples, we can construct inputs to otherwise-aligned language models to coerce them into emitting harmful text and performing harmful behavior. Creating aligned models robust to adversaries will require significant advances in both alignment and adversarial machine learning.
Tags: Simons Institute, theoretical computer science, UC Berkeley, Computer Science, Theory of Computation, Theory of Computing, Large Language Models and Transformers, Nicholas Carlini