Weekly Log

Thesis

We aim to scope out a new weights-based interpretability method that leverages neural networks’ mechanistic (gate switching) and geometric (splines in activation space) properties to identify all conceptual regions present in a model, without the sparsity tradeoffs of sparse autoencoders or the reconstruction tradeoffs of parameter decomposition.

Previous work during the Pivotal Fellowship (Reynoso & Heimersheim, 2025) showed that:

  1. Plain L2 difference in activation space is not indicative of stable regions; cumulative L2 difference may be a better indicator of their presence.

  2. Perturbing along privileged directions in activation space (derived from SAEs) may produce earlier jumps in L2 and higher-than-baseline peaks. However, these metrics differ only slightly between the baseline random directions and the SAE directions.

  3. Counting the number of neurons that switch between active and inactive states under perturbation is indicative of transition points towards the target concept in fine-tuned models (a minimal sketch of this metric follows below).
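
For concreteness, here is a minimal sketch of the gate-switching count (and one possible reading of the cumulative L2 metric) for a single ReLU MLP layer. The weight matrix W, bias b, base activation x, and perturbation direction d are illustrative placeholders, not names from our codebase.

    import numpy as np

    def gate_pattern(x, W, b):
        """Binary gate pattern of a ReLU layer: 1 where the neuron is active."""
        return (W @ x + b > 0).astype(int)

    def gates_switched(x, d, alpha, W, b):
        """Number of neurons that flip between active and inactive when the
        activation x is perturbed by strength alpha along direction d."""
        return int(np.sum(gate_pattern(x, W, b) != gate_pattern(x + alpha * d, W, b)))

    def cumulative_l2(x, d, alphas, W, b):
        """One reading of 'cumulative L2 difference': the running sum of L2
        changes in the layer output between consecutive perturbation steps."""
        relu = lambda z: np.maximum(z, 0.0)
        outs = [relu(W @ (x + a * d) + b) for a in alphas]
        steps = [np.linalg.norm(outs[i + 1] - outs[i]) for i in range(len(outs) - 1)]
        return np.cumsum(steps)

    # Example sweep: look for jumps in the switch count along a direction.
    # alphas = np.linspace(0.0, 5.0, 200)
    # counts = [gates_switched(x, d, a, W, b) for a in alphas]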

We aim to determine:

  • Whether clustering splines (recording which neurons are active for a given feature) is enough to identify conceptual regions
    • Implications - if true, we can:
      • Provide an upper bound on the number of features that a model can contain (), where k corresponds to the number of layers in the model (a candidate bound is sketched after this list)
      • Simply run a clustering algorithm on the binary gate activation patterns of an MLP layer’s firing, e.g. [0 1 1 …], to group samples and associate them with features (a minimal sketch follows this list)
      • Generalize spline codes to determine when any future inputs activate features
    • Applications
      • Detect high-level traits, e.g. deception, with greater generalization than probes (which are heavily tuned to their training set)
      • More accurately represent layer-by-layer computation than sparse autoencoders and similar architectures, which yield multiple redundant circuits associated with a single feature
  • Whether gate switching behavior replicates for a wider set of concepts
    • Applications
      • Use this as a metric to steer activations accurately
      • Determine the model’s internal boundaries so that probes can be fit to more accurate boundaries
  • Whether feature-finding can be achieved through an analytic derivation of what makes gates switch signs (a worked first-layer example follows this list)
    • Implications - if true, we can:
      • Intervene on any input’s activation to determine transformations that steer outputs towards any specified concept
      • Come up with empirical guarantees for model performance given a set of inputs
    • Applications
      • Perform targeted interventions towards more desirable completions using this method
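
On the upper bound: one candidate bound of this form, stated here as an assumption rather than a derived result, is that a network whose k layers each contain n ReLU neurons admits at most 2^(n*k) distinct joint gate patterns, since each of the n*k gates is binary; the number of features identifiable by clustering gate patterns is therefore capped by that count.

A minimal sketch of the clustering idea, assuming binary gate patterns have already been extracted for a batch of samples (the extraction step can reuse gate_pattern above; all names are illustrative):

    import numpy as np
    from collections import defaultdict

    def cluster_by_gate_pattern(patterns):
        """Group sample indices by their exact binary gate pattern.

        patterns: (num_samples, num_neurons) array of 0/1 gate activations.
        Returns a dict mapping each distinct pattern (as a tuple) to the list of
        sample indices that produced it; each group is a candidate feature/region.
        """
        groups = defaultdict(list)
        for i, p in enumerate(patterns):
            groups[tuple(int(v) for v in p)].append(i)
        return dict(groups)

    # Exact-match grouping is the strictest version; a softer variant could
    # cluster patterns by Hamming distance (e.g. k-means or agglomerative
    # clustering on the 0/1 vectors) to merge near-identical regions.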
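
On the analytic direction: for the first layer there is a closed form for when a gate switches. A ReLU gate i flips along the ray x + alpha*d exactly where its pre-activation crosses zero, i.e. at alpha_i = -(w_i . x + b_i) / (w_i . d) whenever w_i . d != 0. A hedged sketch, reusing the illustrative W, b, x, d from above:

    import numpy as np

    def gate_crossings(x, d, W, b):
        """Perturbation strengths at which each first-layer ReLU gate flips when
        moving from x along d: solve w_i.(x + alpha*d) + b_i = 0 for alpha."""
        pre = W @ x + b        # pre-activations at the unperturbed point
        slope = W @ d          # rate of change of each pre-activation along d
        with np.errstate(divide="ignore", invalid="ignore"):
            return -pre / slope  # +/-inf or nan where the gate never flips

    # The smallest positive crossing is the first region boundary met along d;
    # stepping just past it is one way to target an intervention at the
    # neighbouring linear region. Deeper layers would need the composed
    # piecewise-linear map, which is only locally linear.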