Schedule
Legend: 🎥 lecture recording, 🖥️ slides, 📖 notes, 📝 written questions, ⌨️ coding assignment.
Apply by Jan 29th to join the course program: participants have their assignments graded and take part in discussions and speaker events.
Background
Safety Engineering
- 3
- Risk Decomposition
- 🎥, 🖥️
- risk analysis definitions, disaster risk equation, decomposition of safety areas, ability to cope and existential risk
- 4
- Accident Models
- 🎥, 🖥️
- FMEA, Bow Tie model, Swiss Cheese model, defense in depth, preventative and protective measures, complex systems, nonlinear causality, emergence, STAMP
- 5
- Black Swans
- 🎥, 🖥️
- unknown unknowns, long tailed distributions, multiplicative processes, extremistan
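The "multiplicative processes" and "extremistan" topics above can be illustrated with a short simulation (an illustrative sketch, not course material; the variable names and shock distribution are my own choices): compounding the same small shocks multiplicatively rather than additively yields a long-tailed distribution in which the top outcomes hold an outsized share of the total.

```python
import numpy as np

rng = np.random.default_rng(0)
shocks = rng.uniform(0.9, 1.1, size=(100_000, 50))

# Additive process: summing many small shocks gives a thin-tailed,
# roughly normal distribution of outcomes ("mediocristan").
additive = shocks.sum(axis=1)

# Multiplicative process: compounding the *same* shocks gives a
# long-tailed (lognormal) distribution -- the "extremistan" regime.
multiplicative = shocks.prod(axis=1)

def top_share(x, q=0.99):
    """Share of the total held by outcomes at or above the q-th quantile."""
    cut = np.quantile(x, q)
    return x[x >= cut].sum() / x.sum()
```

Under the additive process the top 1% of outcomes holds roughly 1% of the total; under the multiplicative process it holds several times more, which is why long-tailed risks resist averaging-based safety arguments.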
- Review questions 📝
Robustness
- 6
- Adversarial Robustness
- 🎥, 🖥️, 📖, ⌨️
- optimization pressure, PGD, untargeted vs targeted attacks, adversarial evaluation, white box vs black box, transferability, unforeseen attacks, text attacks, robustness certificates
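The PGD attack listed above can be sketched in a few lines (a minimal illustration against a toy logistic-regression model, not the course's assignment code; `eps`, `alpha`, and the model are my own assumptions): repeatedly step in the sign of the loss gradient, then project back onto an L∞ ball around the clean input.

```python
import numpy as np

def pgd_linf(x, y, w, b, eps=0.3, alpha=0.05, steps=10):
    """Untargeted PGD against a logistic model sigmoid(w.x + b):
    maximize cross-entropy loss within an L-infinity ball of radius eps."""
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))  # model confidence
        grad = (p - y) * w                           # d(cross-entropy)/dx
        x_adv = x_adv + alpha * np.sign(grad)        # gradient-sign ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)     # project onto the eps-ball
    return x_adv
```

With white-box access the attacker uses `w` directly, as here; black-box attacks must instead estimate or transfer the gradient, which connects to the transferability topic above.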
- 7
- Black Swan Robustness
- 🎥, 🖥️, 📖
- stress tests, train-test mismatch, adversarial distribution shifts, simulated scenarios for robustness
- 8
- Review questions 📝
Monitoring
- 8
- Anomaly Detection
- 🎥, 🖥️, 📖, ⌨️
- AUROC/AUPR/FPR95, likelihoods and detection, MSP baseline, OE, ViM, anomaly datasets, one-class learning, detecting adversaries, error detection
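Two of the topics above, the AUROC metric and the MSP baseline, can be sketched as follows (an illustrative sketch only; function names are my own, and production code would typically use a library such as scikit-learn for AUROC):

```python
import numpy as np

def auroc(scores_in, scores_out):
    """AUROC for anomaly detection: the probability that a random
    out-of-distribution example gets a higher anomaly score than a
    random in-distribution example (ties count half)."""
    s_in = np.asarray(scores_in)[:, None]
    s_out = np.asarray(scores_out)[None, :]
    return float(np.mean((s_out > s_in) + 0.5 * (s_out == s_in)))

def msp_anomaly_score(logits):
    """MSP baseline: 1 - max softmax probability; higher = more anomalous."""
    z = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return 1.0 - p.max(axis=1)
```

An uninformative detector scores AUROC 0.5 and a perfect one 1.0, which is why AUROC is reported alongside AUPR and FPR95 rather than raw accuracy.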
- 9
- Interpretable Uncertainty
- 🎥, 🖥️, 📖
- calibration vs sharpness, proper scoring rules, Brier score, RMS calibration error, reliability diagrams, confidence intervals, quantile prediction
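The Brier score and RMS calibration error listed above can be sketched as follows (an illustrative sketch for binary predictions; the binning scheme and names are my own assumptions, not the course's reference implementation):

```python
import numpy as np

def brier_score(probs, labels):
    """Brier score: mean squared error between predicted probabilities and
    binary outcomes. A proper scoring rule, so it rewards both calibration
    and sharpness; lower is better."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    return float(np.mean((probs - labels) ** 2))

def rms_calibration_error(probs, labels, bins=10):
    """RMS calibration error: bin predictions by confidence, then take the
    root-mean-square gap between each bin's mean confidence and its
    empirical accuracy, weighted by bin size."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, bins + 1)
    err = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        mask = (probs >= lo) & ((probs < hi) if i < bins - 1 else (probs <= hi))
        if mask.any():
            gap = probs[mask].mean() - labels[mask].mean()
            err += mask.mean() * gap ** 2
    return float(np.sqrt(err))
```

The per-bin gaps computed here are exactly what a reliability diagram plots, so the two diagnostics are usually reported together.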
- 10
- Transparency
- 🎥, 🖥️
- saliency maps, token heatmaps, feature visualizations, ProtoPNet
- 11
- Trojans
- 🎥, 🖥️, 📖, ⌨️
- hidden functionality from poisoning, treacherous turns
- 12
- Detecting Emergent Behavior
- 🎥, 🖥️, 📖
- emergent capabilities, instrumental convergence, Goodhart's law, proxy gaming
- 13
- Review questions 📝
Alignment
- 13
- Honest Models
- 🎥, 🖥️
- truthful vs. honest, inverse scaling, instances of model dishonesty
- 14
- Power Aversion
- 🖥️
- TBC early 2023; social, economic, and governmental formalizations of power bases; power penalties
- 15
- Machine Ethics
- 🎥, 🖥️, ⌨️
- normative ethics background, human values, value learning with comparisons, translating moral knowledge into action, moral parliament, value clarification
Systemic Safety
- 16
- ML for Improved Decision-Making
- 🎥, 🖥️, 📖
- forecasting, brainstorming
- 17
- ML for Cyberdefense
- 🎥, 🖥️
- intrusion detection, detecting malicious programs, automated patching, fuzzing
- 18
- Cooperative AI
- 🎥, 🖥️, 📖
- Nash equilibria, dominant strategies, stag hunt, Pareto improvements, cooperation mechanisms, morality as cooperation, cooperative dispositions, collusion externalities
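The stag hunt and Nash equilibria topics above can be made concrete with a tiny payoff table (an illustrative sketch; the specific payoff numbers are my own, chosen only to satisfy the standard stag-hunt ordering):

```python
import itertools

# Stag hunt payoffs as (row player, column player); actions 0=Stag, 1=Hare.
# Hunting stag pays off only if both cooperate; hare is the safe solo option.
PAYOFFS = {
    (0, 0): (4, 4),  # both hunt stag: best joint outcome
    (0, 1): (0, 3),  # row hunts stag alone and gets nothing
    (1, 0): (3, 0),
    (1, 1): (3, 3),  # both hunt hare: safe but worse for everyone
}

def pure_nash_equilibria(payoffs):
    """Return action profiles where neither player gains by deviating alone."""
    eqs = []
    for r, c in itertools.product((0, 1), repeat=2):
        u_r, u_c = payoffs[(r, c)]
        if all(payoffs[(r2, c)][0] <= u_r for r2 in (0, 1)) and \
           all(payoffs[(r, c2)][1] <= u_c for c2 in (0, 1)):
            eqs.append((r, c))
    return eqs
```

Both (Stag, Stag) and (Hare, Hare) come out as equilibria, yet (Stag, Stag) is a Pareto improvement over (Hare, Hare): getting players from the safe equilibrium to the cooperative one is exactly what the cooperation mechanisms in this lecture address.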
Additional Existential Risk Discussion
- 19
- X-Risk Overview
- 🎥, 🖥️
- arguments for x-risk
- 20
- Possible Existential Hazards
- 🎥, 🖥️
- weaponization, proxy gaming, treacherous turn, deceptive alignment, value lock-in, persuasive AI
- 21
- Safety-Capabilities Balance
- 🎥, 🖥️
- theories of impact, differential technological progress, capabilities externalities
- 22
- Natural Selection Favors AIs over Humans
- 🎥, 🖥️
- Lewontin's conditions, multiple AI agents, generalized Darwinism, mechanisms for cooperation
- 23
- Review and Conclusion
- 🎥, 🖥️, 📖
- pillars of ML safety research, task-train-deploy pipeline