Preface

This is a non-exhaustive list that attempts to catalogue the current ways of extracting ‘steering vectors’ corresponding to concepts in a model’s latent representational space.

Motivation

One can think of training a model as learning to map inputs into a higher-dimensional space that pushes apart distinct concepts and pulls together similar datapoints.

This framing of model behavior is known as the Manifold Hypothesis:

The manifold hypothesis is that natural data forms lower-dimensional manifolds in its embedding space. There are both theoretical [1] and experimental [2] reasons to believe this to be true. If you believe this, then the task of a classification algorithm is fundamentally to separate a bunch of tangled manifolds.

One way to see this in action is by visualizing the latent space of a neural network across its layers and over successive training epochs. Each layer applies a high-dimensional transformation, but we can project the activations into 2D to see what’s going on: The Grand Tour of Layer Dynamics - Visualizing Neural Networks. The linked video shows how the concept of shoes comes to occupy a distinct region of the latent space (as early as layer 5), correlating with the model’s accurate disambiguation of shoe images.

Hence, you can think of layer-by-layer model processing as successive vector operations that push datapoints into their corresponding bins, much like how solving a Rubik’s cube is a sequence of rotations that push similar colors onto the same face!

A more intuitive visualization is watching how a dimensionality-reduced view of MNIST is pulled into the output digit bins over the course of training:

Extracting Directions in Latent Space

1: Contrastive Activation Addition

What? Take paired prompts that do and do not express the target concept, average the model’s activations for each group at a chosen layer, and use the difference of the two means as the steering vector.

Demo: refusal_demo.ipynb - Colab

Advantages:

  1. Easy and fast to execute + iterate over
  2. Requires only a relatively small number of contrastive samples

Drawbacks:

  1. The difference of the two means (μ₊ − μ₋) may not accurately capture non-linear structure within each group (e.g. if the positive set is actually heterogeneous)
    • Solution: check whether the individual difference directions are coplanar along a given linear dimension
  2. The dataset may contain subtle biases, e.g. when taking the difference to obtain a refusal direction, you might instead extract only “the refusal direction for lying” because you subtracted non-refusal examples from refusal examples that all involve lying. Similarly, differences in dataset representation (which classes are over-/undersampled) can affect the extracted vector.
  3. The obtained feature may be noisy and may contain small components of “absorbed” features due to the averaging process.
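
For concreteness, here is a minimal sketch of the difference-of-means computation using TransformerLens; the choice of model, layer, and prompts is an illustrative assumption rather than the setup from the linked demo.

```python
import torch
from transformer_lens import HookedTransformer

# Minimal difference-of-means (CAA-style) sketch. Model, layer, and prompts are
# illustrative assumptions, not taken from the linked notebook.
model = HookedTransformer.from_pretrained("gpt2")
layer = 6
hook_name = f"blocks.{layer}.hook_resid_post"

positive_prompts = [
    "I'm sorry, but I can't help with that request.",
    "I won't assist with that.",
]  # examples expressing the concept (here: refusal)
negative_prompts = [
    "Sure, here is how you can do that.",
    "Happy to help, the steps are as follows.",
]  # matched examples without the concept

def mean_last_token_activation(prompts):
    # Average the residual-stream activation at the final token of each prompt.
    acts = []
    for prompt in prompts:
        _, cache = model.run_with_cache(prompt)
        acts.append(cache[hook_name][0, -1, :])
    return torch.stack(acts).mean(dim=0)

# The steering vector is simply the difference of the two group means.
steering_vector = (
    mean_last_token_activation(positive_prompts)
    - mean_last_token_activation(negative_prompts)
)
```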

2: Probing

What? Train a linear/logistic regression model to classify the presence or absence of the concept we want a vector for, then read off the model’s weights to obtain the vector. Demo: deception-detection/deception_detection/detectors.py at main · ApolloResearch/deception-detection · GitHub

Advantages:

  1. Still relatively easy to run
  2. Reading off weights is intuitive

Drawbacks:

  1. Still dataset-dependent: you want the dataset to cover the concept you’re investigating as exhaustively as possible
  2. There is no guarantee that the concept has a linear representation to extract
  3. The quality of the extracted direction depends on the probe’s performance and on various training hyperparameters
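
A hedged sketch of the probing recipe with scikit-learn is below; the activations and labels are random placeholders standing in for activations collected as in the previous sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: in practice, `activations` would be an (n_samples, d_model)
# array of residual-stream activations and `labels` would mark the presence (1)
# or absence (0) of the concept in each sample.
rng = np.random.default_rng(0)
activations = rng.normal(size=(200, 768))
labels = rng.integers(0, 2, size=200)

# Train a logistic-regression probe to separate the two classes.
probe = LogisticRegression(max_iter=1000)
probe.fit(activations, labels)

# Read off the probe's weight vector; its (normalised) direction is the
# candidate concept direction.
direction = probe.coef_[0]
direction = direction / np.linalg.norm(direction)
```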

3: Sparse Autoencoders

What? Given a trained SAE, if we can identify a sufficiently atomic latent, we can use that latent’s direction (its decoder vector) as the vector representation of the feature.

Demo: GitHub - callummcdougall/sae_vis: Create feature-centric and prompt-centric visualizations for sparse autoencoders (like those from Anthropic’s published research).

Reference: SAE features for refusal and sycophancy steering vectors — AI Alignment Forum

Advantages:

  1. Easy to do if an open-source SAE exists

Drawbacks:

  1. The quality of the obtained vector (how noise-free it is) depends on the SAE’s training data and training process
  2. SAEs are prone to feature splitting and absorption.
  3. The feature can exist across multiple layers, either as part of a sequential computation or as a redundant circuit.
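
Assuming we already have a trained SAE’s decoder weights and have identified a relevant latent (both are placeholders below), reading off the vector is essentially a one-liner:

```python
import torch

# Placeholder shapes and index: in practice, W_dec comes from an open-source SAE
# checkpoint and latent_idx from feature inspection (e.g. with sae_vis).
d_sae, d_model = 16384, 768
W_dec = torch.randn(d_sae, d_model)  # stand-in for the trained decoder weights
latent_idx = 1234                    # stand-in for the concept-tracking latent

# The latent's decoder row is the candidate steering vector at the SAE's layer.
steering_vector = W_dec[latent_idx]
steering_vector = steering_vector / steering_vector.norm()
```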

4: Crosscoders

What? Given a trained crosscoder, if we can identify a sufficiently atomic latent, we can use it as the vector representation of the feature. The difference with a crosscoder is that it reads activations at one layer and reconstructs activations at later layers (or at several layers jointly), so its latents capture the in-sequence transformation applied to the inputs. From this, we can derive a circuit of operations in line with our target concept.

Reference: Sparse Crosscoders for Cross-Layer Features and Model Diffing

Demo: crosscoder-model-diff-replication/analysis.py at main · ckkissane/crosscoder-model-diff-replication · GitHub

Advantages:

  1. Higher quality vector
  2. Can determine which concepts are sequentially related to the concept we’re interested in

Drawbacks:

  1. The quality of the obtained vector (how noise-free it is) depends on the crosscoder’s training data and training process
  2. Crosscoders are also prone to feature splitting and absorption.
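
Under this framing, a hedged sketch of reading per-layer directions for a single crosscoder latent might look as follows; the shapes and index are placeholders, not taken from the linked replication.

```python
import torch

# Placeholder shapes: a crosscoder latent has a separate decoder direction for
# each layer it writes to, so one shared latent yields a family of per-layer vectors.
n_layers, d_sae, d_model = 12, 16384, 768
W_dec = torch.randn(n_layers, d_sae, d_model)  # stand-in for trained crosscoder decoders
latent_idx = 1234                              # stand-in for the concept latent

# Per-layer steering vectors for the same latent; comparing their norms indicates
# at which layers the feature is actually written into the residual stream.
per_layer_vectors = W_dec[:, latent_idx, :]
layer_strengths = per_layer_vectors.norm(dim=-1)
```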

5: Activation patching

What? Iteratively ablate or corrupt the outputs of neurons or other model subcomponents until we find a subset whose corruption strongly steers the answer in the opposite direction. You can then take the downstream effect of changes in a particular model component (e.g. corrupting attention head A3.2) on a later layer (e.g. layer 3’s residual stream) as the difference between the positive and negative versions of the concept. For example, if corrupting the path from A3.2 to MLP4 flips a truthful answer into a deceptive one, then computing the change this patch induces in the residual stream at layer 4 lets me isolate the vector representing this change.

Demo: Exploratory_Analysis_Demo.ipynb - Colab

Advantages:

  1. This might result in a cleaner vector as the low-level changes are accounted for in a more structured way.

Drawbacks:

  1. I haven’t seen an implementation that uses this to extract a vector; there are numerous paper references but no accompanying code. A hedged sketch of the general idea is given below.
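
In the absence of a reference implementation, here is one way this could look in TransformerLens; the model, hook points, and prompts are assumptions for illustration only, not a published recipe.

```python
from transformer_lens import HookedTransformer, utils

# Hedged sketch of patching-based direction extraction. Model, layers, and
# prompts are illustrative assumptions.
model = HookedTransformer.from_pretrained("gpt2")
layer = 4
resid_hook = utils.get_act_name("resid_post", layer)
later_hook = utils.get_act_name("resid_post", layer + 2)

clean_prompt = "When asked, the assistant answers honestly:"
corrupt_prompt = "When asked, the assistant answers deceptively:"

_, clean_cache = model.run_with_cache(clean_prompt)
_, corrupt_cache = model.run_with_cache(corrupt_prompt)

def patch_resid(resid, hook):
    # Overwrite the last token's residual stream with the corrupted run's value.
    resid[:, -1, :] = corrupt_cache[resid_hook][:, -1, :]
    return resid

# Re-run the clean prompt with the patch applied and cache the downstream effect.
model.add_hook(resid_hook, patch_resid)
_, patched_cache = model.run_with_cache(clean_prompt)
model.reset_hooks()

# The change the patch induces at a later point in the network is a candidate
# direction for the concept being flipped (here: honest vs. deceptive).
direction = patched_cache[later_hook][0, -1, :] - clean_cache[later_hook][0, -1, :]
```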

Appendix

Nanda, Neel. (2022). A Comprehensive Mechanistic Interpretability Explainer & Glossary. http://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J

Footnotes

  1. A lot of the natural transformations you might want to perform on an image, like translating or scaling an object in it, or changing the lighting, would form continuous curves in image space if you performed them continuously. - Neural Networks, Manifolds, and Topology — colah’s blog

  2. Carlsson et al. found that local patches of images form a Klein bottle. - Neural Networks, Manifolds, and Topology — colah’s blog