Research Daily: Top AI papers of the day

ArXiv Paper Title:
Automatically Interpreting Millions of Features in Large Language Models

October 21, 2024

Keywords:
Large Language Models, Interpretability, Sparse Autoencoders, Natural Language Explanations

Read the paper on ArXiv

Figure: SAE latent explanations visualized in a random sentence, with detection and fuzzing scores shown.

Unlocking LLMs: Automating the Interpretation of Millions of Features

Introduction: Peeking Inside the Black Box

Large language models (LLMs) are remarkably capable, but understanding how they work remains a significant challenge. Individual neurons offer limited insight, while sparse autoencoders (SAEs) provide a promising pathway to unlock a model's inner workings: they decompose complex LLM activations into a much higher-dimensional, sparse latent space whose features are potentially easier for humans to interpret. However, these SAEs can have millions of latent features, making manual interpretation impossible. This research introduces an automated framework to tackle that challenge head-on.
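
Conceptually, an SAE is just an encoder-decoder pair with a sparsity constraint on the middle layer. Below is a minimal sketch of a TopK variant; the class name, dimensions, and the choice of TopK are illustrative assumptions, since the paper trains SAEs with several different architectures, activation functions, and loss functions.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal TopK SAE sketch: maps d_model-dimensional activations into a much
    wider, sparse latent space and reconstructs the input from it."""

    def __init__(self, d_model: int = 4096, d_latent: int = 131072, k: int = 32):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model, bias=False)
        self.k = k  # number of latents kept active per token (TopK sparsity)

    def forward(self, x: torch.Tensor):
        pre = self.encoder(x)
        # Keep only the k largest pre-activations per token; zero out the rest.
        top = torch.topk(pre, self.k, dim=-1)
        latents = torch.zeros_like(pre).scatter_(-1, top.indices, torch.relu(top.values))
        return self.decoder(latents), latents

# Training minimizes the reconstruction error of the collected LLM activations.
sae = SparseAutoencoder(d_model=512, d_latent=8192, k=16)  # small sizes for the demo
acts = torch.randn(8, 512)                                 # stand-in for residual-stream activations
recon, latents = sae(acts)
loss = nn.functional.mse_loss(recon, acts)
```

Each latent then corresponds to one column of the decoder weights, i.e. the direction it writes back into activation space, and it is these latents that the pipeline below tries to describe in words.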

The Automated Interpretation Pipeline: From Latent Features to Human-Understandable Explanations

This research paper details a novel, open-source framework designed to automatically generate and evaluate natural language explanations for the latent features learned by SAEs trained on LLMs. This innovative pipeline consists of several key steps:

  1. SAE Training: SAEs are trained on activations from different LLMs (Llama, Gemma) using a variety of architectures, activation functions, and loss functions.
  2. Activation Collection: Latent activations are collected from the trained SAEs by running them over large text corpora (RedPajama-v2, the Pile).
  3. Explanation Generation: LLMs (Llama 70b, Claude) generate natural language explanations for each latent feature, based on its activation patterns. Clever prompting techniques are used to maximize the quality and clarity of these explanations.
  4. Explanation Scoring: Five novel, computationally efficient scoring methods (Detection, Fuzzing, Surprisal, Embedding, Intervention) are introduced to assess explanation quality and are compared against existing, more computationally expensive simulation-based methods. The new methods focus on how well an explanation distinguishes activating from non-activating contexts (a detection-scoring sketch follows this list).
  5. Semantic Similarity Analysis: The semantic similarity between independently trained SAEs across different LLM layers is measured by using the Hungarian algorithm to align latent features and then comparing their explanations (see the alignment sketch below).
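
As a rough illustration of the detection score from step 4: a judge model is shown an explanation together with a balanced mix of activating and non-activating contexts and asked, for each one, whether the latent should fire; the score is the fraction of correct judgments. The sketch below is an assumption-laden paraphrase of that idea, and `ask_judge_llm` is a hypothetical stand-in for a call to the judge model, not an API from the authors' library.

```python
import random

def detection_score(explanation, activating, non_activating, ask_judge_llm):
    """Sketch of detection scoring: the judge sees the explanation and one
    context at a time and predicts whether the latent activates on it."""
    examples = [(ctx, True) for ctx in activating] + [(ctx, False) for ctx in non_activating]
    random.shuffle(examples)
    correct = 0
    for context, does_activate in examples:
        prompt = (
            f"Latent explanation: {explanation}\n"
            f"Text: {context}\n"
            "Would this latent activate on the text? Answer yes or no."
        )
        answer = ask_judge_llm(prompt)  # hypothetical call to the judge model
        predicted = answer.strip().lower().startswith("yes")
        correct += int(predicted == does_activate)
    return correct / len(examples)
```

Fuzzing scoring follows the same balanced-classification pattern, except that candidate activating tokens are highlighted inside each context and the judge decides whether the highlighted tokens match the explanation.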

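For step 5, the Hungarian algorithm solves a one-to-one assignment problem between the latents of two independently trained SAEs, so that total similarity across matched pairs is maximized. A minimal sketch follows, assuming the similarity signal is cosine similarity between embedded explanations; that choice is illustrative and may differ from the exact signal used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_latents(emb_a: np.ndarray, emb_b: np.ndarray):
    """Match each latent of SAE A to one latent of SAE B so that total cosine
    similarity between their explanation embeddings is maximized."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    similarity = a @ b.T                             # pairwise cosine similarities
    rows, cols = linear_sum_assignment(-similarity)  # negate: the solver minimizes cost
    return list(zip(rows, cols)), float(similarity[rows, cols].mean())

# Demo with random stand-in embeddings for two SAEs with 1,000 latents each.
pairs, mean_similarity = align_latents(np.random.randn(1000, 384), np.random.randn(1000, 384))
```

A high mean similarity across aligned pairs suggests that independently trained SAEs are learning, and explaining, largely the same features.
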
Key Findings: SAE Latents Outperform Individual Neurons

The research yields several key findings. Most notably, explanations of SAE latents score consistently higher than explanations of individual neurons, indicating that the sparse latent space learned by the SAE is substantially more interpretable than the model's raw neuron basis.

Future Directions and Limitations

While the results are promising, the authors note certain limitations and potential areas for improvement.

Future research could build on this open-source framework to address these limitations and push automated interpretability further.

This research represents a substantial step towards a deeper understanding of LLMs. By automating the interpretation of millions of latent features, this work opens exciting new avenues for research and application, paving the way for more interpretable and controllable AI systems. The authors have made their code and generated explanations publicly available, encouraging further exploration and development in this rapidly evolving field.