Are Aligned Language Models “Adversarially Aligned”?

Video link: https://www.youtube.com/watch?v=uqOfC3KSZFc

Duration: 1:02:41

Nicholas Carlini (Google DeepMind)
https://simons.berkeley.edu/talks/nicholas-carlini-google-deepmind-2023-08-16
Large Language Models and Transformers

An "aligned" model is "helpful and harmless". In this talk I will show that while language models may be aligned under typical situations, they are not "adversarially aligned". Using standard techniques from adversarial examples, we can construct inputs to otherwise-aligned language models to coerce them into emitting harmful text and performing harmful behavior. Creating aligned models robust to adversaries will require significant advances in both alignment and adversarial machine learning.
Tags:
Simons Institute
theoretical computer science
UC Berkeley
Computer Science
Theory of Computation
Theory of Computing
Large Language Models and Transformers
Nicholas Carlini