Manifolds and Subspaces in Neural Networks: A Geometric Approach to Interpretability in Large Language Models
- Introduction: The Geometric Perspective on Neural Network Interpretability.
The remarkable success of deep neural networks, particularly large language models (LLMs), across a multitude of complex tasks has led to their widespread adoption in various applications. These models, characterized by their intricate architectures and vast parameter spaces, have demonstrated impressive capabilities in natural language understanding, generation, and reasoning. However, a significant challenge accompanying their effectiveness is their inherent “black-box” nature. The complex interplay of non-linear transformations within these networks often obscures the underlying mechanisms that drive their predictions and decisions. This lack of transparency poses a considerable hurdle for understanding their inner workings, ensuring their reliability in critical applications, and addressing potential ethical concerns such as algorithmic bias 1.
The growing demand for interpretability has spurred extensive research into methods that can shed light on the internal representations and computations of neural networks. One promising avenue of investigation involves viewing neural network representations through a geometric lens. By employing concepts from mathematics, such as manifolds and subspaces, researchers aim to uncover the hidden structures within the high-dimensional data that these networks process and generate. This geometric perspective offers a powerful framework for understanding how neural networks learn, represent information, and ultimately make decisions. This report will delve into the definitions, applications, and limitations of manifolds and subspaces in the context of neural networks, with a particular focus on large language models. It will explore how these geometric concepts are relevant to various interpretability techniques, providing a comprehensive overview of the current understanding in this rapidly evolving field.
- Foundational Concepts: Manifolds and Subspaces in Neural Network Representation Spaces.
2.1 Defining Manifolds in Neural Networks.
In mathematics, a manifold is formally defined as a topological space that locally resembles Euclidean space near each point 2. This means that while the overall structure of a manifold can be complex and curved, any sufficiently small region of it appears flat, just as each small patch of a rolled-up sheet of paper is flat even though the sheet as a whole is curved. The manifold hypothesis posits that real-world high-dimensional data, such as images, text, or neural activity patterns, often lies on lower-dimensional manifolds that are embedded within the higher-dimensional space in which they are represented 4. This suggests that the apparent complexity of the data might be due to its embedding in a high-dimensional space, while its intrinsic structure is simpler and lower-dimensional.
In the context of neural networks, the concept of a neural manifold refers to a collection of neural activities that correspond to the same task condition within the neural state space 7. For instance, in a network trained to recognize images of cats and dogs, the neural responses elicited by all the cat stimuli would constitute the “cat manifold,” and similarly for dogs 3. These neural manifolds are not arbitrary sets of points but are structured by the specific task the network is trained on and the underlying interactions between neurons. Neurons within a network are interconnected and exhibit co-modulation, meaning their firing patterns are not independent 2. These dependencies constrain the possible patterns of neural activity, causing them to be confined to a lower-dimensional manifold within the full, higher-dimensional space of all possible activity patterns. This intrinsic manifold is thought to be task-dependent, implying that different tasks or behaviors will give rise to different underlying manifolds 2. The goal of deep learning can be viewed as learning an approximation of these underlying manifolds through the architecture and training process of neural networks 5. Manifold learning, a subfield of machine learning, focuses on discovering and representing this hidden, low-dimensional structure within high-dimensional data 8. The realization that data might reside on such manifolds helps explain why machine learning models can often find meaningful features and make accurate predictions from seemingly complex, high-dimensional datasets 5.
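To make the manifold picture concrete, the following minimal sketch (illustrative only, not drawn from the cited studies) generates synthetic “neural activity” for 100 units from a two-dimensional latent variable. Although the ambient space has 100 dimensions, the activity is confined to a curved, two-dimensional manifold, which shows up as a PCA variance spectrum that saturates after a handful of components.

```python
# Minimal sketch (illustrative): synthetic "neural activity" generated from a 2-D latent
# variable lies on a 2-D, curved manifold inside a 100-D ambient space.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_samples, n_neurons, latent_dim = 2000, 100, 2

latent = rng.uniform(-1, 1, size=(n_samples, latent_dim))   # task-related latent state
W = rng.normal(size=(latent_dim, n_neurons))                # fixed random readout weights
b = rng.normal(size=n_neurons)
activity = np.tanh(latent @ W + b) + 0.01 * rng.normal(size=(n_samples, n_neurons))

# The ambient dimension is 100, yet almost all variance is captured by a few principal
# components, consistent with the activity being confined to a low-dimensional manifold.
pca = PCA().fit(activity)
print("cumulative variance of first 5 PCs:", pca.explained_variance_ratio_[:5].cumsum())
```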
2.2 Defining Subspaces in Neural Networks.
In linear algebra, a subspace is defined as a vector space that is entirely contained within another vector space. A key characteristic of a subspace is its linearity; it must include the origin (the zero vector) and be closed under vector addition and scalar multiplication. In the realm of neural networks, subspaces can represent specific regions within the weight space (the space of all possible weight configurations of the network) or the activation space (the space of all possible activation patterns of the neurons) that possess particular properties 10.
One important concept is that of active subspaces. An active subspace is a low-dimensional subspace that captures the directions in the input space (e.g., neuron activations) along which a function of interest (e.g., the network’s output or loss) exhibits the most significant variance 13. Identifying active subspaces can help in dimensionality reduction by highlighting the “active neurons” or combinations of neurons that are most crucial for the network’s function. This allows researchers to focus their interpretability efforts on a smaller, more manageable set of components 14. Neural network subspaces can also contain diverse solutions that achieve high accuracy on a given task 10. Exploring these subspaces, for instance, by finding lines or curves connecting different high-accuracy models, can lead to insights into the optimization landscape of neural networks and potentially yield models with better generalization and robustness 11. Furthermore, during training, subspaces within the weight space can be optimized to find regions that contain multiple accurate models, suggesting the existence of wider minima in the loss landscape that are associated with better generalization properties 12.
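The construction of an active subspace can be sketched in a few lines (the function, dimensions, and sample counts below are illustrative assumptions, not taken from the cited work): gradients of the scalar quantity of interest are averaged as outer products, and the leading eigenvectors of the resulting matrix span the directions along which that quantity varies most.

```python
# Minimal sketch of an active subspace on a toy function: eigenvectors of the averaged
# outer product of gradients give the directions of greatest variation.
import numpy as np

rng = np.random.default_rng(0)
dim = 20
a = rng.normal(size=dim)              # the toy function below varies only along direction a

def grad_f(x):
    # f(x) = sin(a . x)  =>  grad f(x) = cos(a . x) * a
    return np.cos(a @ x) * a

samples = rng.normal(size=(5000, dim))
grads = np.array([grad_f(x) for x in samples])
C = grads.T @ grads / len(grads)      # estimate of E[grad f(x) grad f(x)^T]

eigvals, eigvecs = np.linalg.eigh(C)  # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
print("top 3 eigenvalues:", eigvals[:3])                 # one dominant eigenvalue
print("alignment of top eigenvector with a:",
      abs(eigvecs[:, 0] @ a) / np.linalg.norm(a))        # close to 1: a 1-D active subspace
```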
2.3 Differentiating Manifolds and Subspaces.
The fundamental distinction between a manifold and a subspace lies in their linearity 8. A subspace is inherently a linear concept, characterized by its closure under linear operations and the inclusion of the origin. In contrast, a manifold can be non-linear, meaning it can be curved and does not necessarily contain the origin of the embedding space. Consider the analogy of a plane passing through the origin in three-dimensional space; this plane is a subspace. However, the surface of a sphere in the same space is a manifold because it is curved and does not include the origin.
Relating this back to neural networks, while the activations of neurons reside in a high-dimensional vector space, the task-relevant activations might be confined to a curved, lower-dimensional manifold within that space due to the non-linear nature of activation functions and the complex interactions between neurons 2. Many interpretability techniques, particularly those based on linear algebra, often focus on analyzing subspaces within these high-dimensional spaces. These subspaces can be thought of as local linear approximations of the potentially non-linear underlying manifold. While these linear approximations provide a valuable tool for analysis, it is important to recognize that they might not fully capture the intricate and non-linear structure of the true representation space.
- Dimensionality Reduction as a Tool for Understanding Feature Geometry.
Dimensionality reduction techniques play a crucial role in our ability to understand the high-dimensional representations learned by neural networks. These methods aim to project the data from a high-dimensional space into a lower-dimensional space while preserving some of its essential properties, making it easier to visualize and analyze 2. In the context of neural networks, these techniques can be applied to the feature manifolds and subspaces formed by neuron activations to gain insights into their underlying geometric structures.
3.1 Linear vs. Non-Linear Techniques.
Principal Component Analysis (PCA) is a widely used linear dimensionality reduction technique 15. PCA works by identifying the directions (principal components) in the data that capture the maximum variance. It essentially finds a new set of orthogonal axes that are aligned with the directions of greatest spread in the data. While PCA is effective for linearly related data and provides a straightforward linear projection that preserves the most variance, it operates under the assumption that the important information lies along straight lines (linear subspaces) 9. This assumption can be a limitation when dealing with neural network representations, which often exhibit complex, non-linear relationships. In such cases, PCA might fail to uncover the true underlying structure of the data manifold.
To address the limitations of linear techniques, a variety of non-linear manifold learning techniques have been developed. These include Uniform Manifold Approximation and Projection (UMAP), t-distributed Stochastic Neighbor Embedding (t-SNE), and Isometric Mapping (ISOMAP) 4. Unlike PCA, these methods are designed to preserve the local or global geometric structure of the data, even if it lies on a curved, non-linear manifold. For example, ISOMAP aims to preserve the geodesic distances between data points, which are the shortest path distances along the manifold 9. UMAP and t-SNE, on the other hand, focus on preserving the local neighborhood structure of the data in the lower-dimensional embedding 4. These non-linear techniques can be particularly useful for visualizing complex datasets where linear methods might obscure important relationships.
3.2 Effects on Underlying Geometric Structures Using Toy and Synthetic Datasets.
When applied to data with non-linear structure, such as a Swiss roll dataset, PCA often struggles to provide a faithful representation in a lower dimension 9. Due to its linear nature, PCA might attempt to “slice” straight through the curved manifold, leading to a “crowding effect” where points that are far apart along the manifold end up being close together in the PCA projection 9. This overlap can make it difficult to discern the true relationships and clusters within the data.
In contrast, manifold learning techniques like ISOMAP and UMAP have demonstrated the ability to better reveal the underlying structure of such datasets 9. For instance, ISOMAP can “unwrap” the Swiss roll, preserving the distances between points as if the roll were laid flat 9. Similarly, UMAP can often produce visualizations that maintain both the local and global structure of complex manifolds, allowing for a more intuitive understanding of the data’s organization 9. However, it is important to note that the choice of dimensionality reduction technique can significantly impact the resulting visualization and the conclusions drawn about the data’s structure 9. Different algorithms make different assumptions and have varying strengths and weaknesses. For example, t-SNE is known for its ability to preserve local structure well, often revealing clusters in the data, but it can sometimes distort the global relationships between these clusters 9. UMAP, while also effective at local structure preservation, often does a better job of maintaining the overall global structure of the data 9. Furthermore, the performance of these techniques can be highly dependent on the choice of hyperparameters, such as the perplexity in t-SNE or the number of neighbors in UMAP and ISOMAP 9. Different hyperparameter settings can lead to different visualizations of the same data, emphasizing the need for careful tuning and interpretation of the results. Visualizations obtained from non-linear dimensionality reduction techniques should therefore be interpreted with caution, keeping in mind the specific properties of the algorithm used and the potential for distortions.
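The contrast described above can be reproduced directly with scikit-learn. The sketch below applies PCA and ISOMAP to a synthetic Swiss roll; the hyperparameters (number of samples, noise level, number of neighbors) are illustrative and would need tuning on real data.

```python
# Minimal sketch comparing a linear and a non-linear reduction on the Swiss roll.
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)                      # linear: slices through the roll
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)   # preserves geodesic distances

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=color, s=5)
axes[0].set_title("PCA (distant points crowd together)")
axes[1].scatter(X_iso[:, 0], X_iso[:, 1], c=color, s=5)
axes[1].set_title("ISOMAP (roll unwrapped)")
plt.tight_layout()
plt.show()
```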
Table 1: Comparison of Dimensionality Reduction Techniques
Technique | Type | Primary Goal | Strengths | Weaknesses | Suitability for Manifold Learning |
PCA | Linear | Maximize Variance | Simple, fast, easily interpretable | Assumes linear relationships, sensitive to scale and outliers | Limited |
UMAP | Non-Linear | Preserve Local and Global Structure | Fast, scales well to large datasets, often preserves global structure better | Hyperparameters can significantly affect the visualization, interpretability of axes can be limited | Excellent |
t-SNE | Non-Linear | Preserve Local Structure | Excellent for revealing clusters | Can distort global structure, computationally expensive for large datasets, results can vary with initialization and perplexity | Good |
ISOMAP | Non-Linear | Preserve Geodesic Distances | Effective for unfolding certain types of manifolds | Sensitive to “short-circuit” errors, computationally expensive for large datasets | Good |
- Exploring Interpretability Tools: Subspace Projection, Null Spaces, and Activation Patching.
To gain a deeper understanding of the information encoded within the high-dimensional activation spaces of neural networks, researchers have developed various interpretability tools. Among these are subspace projection, the analysis of null spaces, and activation patching, each offering unique perspectives on the network’s internal representations and computations.
4.1 Subspace Projection.
Subspace projection involves projecting high-dimensional activation vectors onto lower-dimensional subspaces to identify and isolate relevant features or concepts 29. The underlying idea is that if a particular concept is encoded within the network’s activations, it might be represented along specific directions or within a lower-dimensional subspace of the overall activation space. By projecting the activations onto these learned subspaces, researchers can analyze the strength and nature of the concept’s representation. Supervised methods like Supervised Independent Subspace Principal Component Analysis (sisPCA) can be employed to discover independent subspaces that correspond to different underlying factors or sources of variability in the data 33. For instance, in biological data analysis, sisPCA can help disentangle gene expression patterns related to different cellular processes or experimental conditions. Furthermore, subspace inference techniques have been developed within the framework of Bayesian neural networks to make the computationally intensive Bayesian methods more tractable. By restricting inference to a lower-dimensional weight subspace, these methods allow for uncertainty quantification over the model’s parameters and predictions in a more efficient manner 30. Projecting activations onto these learned subspaces can therefore serve as a valuable tool for isolating and understanding the contribution of specific features or concepts to the network’s overall behavior.
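As a simplified illustration of the general idea (not an implementation of sisPCA or Bayesian subspace inference), the sketch below estimates a one-dimensional concept subspace as the difference of class means in a hypothetical activation space and measures how strongly each activation vector projects onto it.

```python
# Minimal sketch of subspace projection with a mean-difference concept direction.
# The activations are synthetic placeholders for hidden states labeled by concept presence.
import numpy as np

def concept_direction(acts_with, acts_without):
    """Unit vector pointing from the 'concept absent' mean to the 'concept present' mean."""
    d = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return d / np.linalg.norm(d)

def project_onto_subspace(acts, basis):
    """Project activations onto the subspace spanned by the rows of `basis` (assumed orthonormal)."""
    coords = acts @ basis.T            # coordinates within the subspace
    return coords @ basis              # projected vectors, back in activation space

rng = np.random.default_rng(0)
acts_with = rng.normal(loc=0.5, size=(200, 768))      # hypothetical activations, concept present
acts_without = rng.normal(loc=0.0, size=(200, 768))   # hypothetical activations, concept absent

u = concept_direction(acts_with, acts_without)
basis = u[None, :]                                    # 1-D concept subspace
proj = project_onto_subspace(acts_with, basis)        # components lying within the subspace
print("mean projection strength, present vs. absent:",
      (acts_with @ u).mean(), (acts_without @ u).mean())
```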
4.2 Null Spaces.
In linear algebra, the null space of a matrix (or a linear transformation) is the set of all input vectors that, when multiplied by the matrix, result in the zero vector 34. In the context of neural networks, analyzing the null space can provide insights into the redundancy or irrelevance of certain input components or directions within the activation space. For example, if specific parts of the input data consistently get mapped to zero by a neural network, it suggests that these parts do not contribute to the final prediction 35. This property has been explored in applications like image steganography, where information can be encoded in the null space of a network without affecting its primary task 35. In the study of neural circuits, the concept of “output-null” dimensions has emerged. These are directions in the neural activity space along which changes in activity do not affect the output of the circuit 36. The existence of such dimensions suggests that neural circuits might control communication by directing activity changes primarily within these output-null dimensions during preparatory phases, allowing for changes in internal state without prematurely triggering downstream effects. Iterative Nullspace Projection (INLP) is a specific technique that leverages the concept of null spaces for interpretability, particularly in the context of fairness. INLP aims to remove information about protected attributes, such as gender or race, from neural representations. It achieves this by iteratively training linear classifiers to predict the protected attribute from the representations and then projecting the representations onto the null space of these classifiers 37. This process makes it harder for linear models to extract the protected information from the representations, thereby mitigating potential biases. Analyzing the null space can thus reveal aspects of information redundancy or biases that might be encoded within the network’s learned representations.
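A minimal sketch of the INLP loop, in the spirit of the method described above, is shown below. The representations and protected attribute are synthetic, and a full application would iterate until linear probes fall to chance accuracy before using the projected representations downstream.

```python
# Minimal INLP-style sketch: repeatedly train a linear probe for the protected attribute
# and project the representations onto the probe's null space.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
z = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)   # protected attribute, leaking via dim 0

def nullspace_projection(w):
    """Projection matrix onto the null space of the row vector w."""
    w = w / np.linalg.norm(w)
    return np.eye(len(w)) - np.outer(w, w)

for step in range(3):
    probe = LogisticRegression(max_iter=1000).fit(X, z)
    print(f"step {step}: probe accuracy = {probe.score(X, z):.3f}")
    X = X @ nullspace_projection(probe.coef_[0])              # remove the probe's direction
```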
4.3 Activation Patching.
Activation patching, also known as Interchange Intervention or Causal Tracing, is a mechanistic interpretability technique used to understand the causal influence of specific internal activations on a neural network’s output 29. The core idea is to run the model on two different inputs (a “source” and a “destination” prompt) and then, during the forward pass with the destination prompt, replace the activations at a particular layer or component with the activations from the source prompt. By observing how this “patching” affects the model’s final output, researchers can infer the causal relevance of the patched activations for the behavior that differs between the two prompts. For example, if patching the activations of a specific attention head from a prompt where the model correctly answers a question into a prompt where it answers incorrectly leads to the correct answer, it suggests that this attention head plays a crucial role in that specific reasoning step. While activation patching is a powerful tool for uncovering causal relationships within neural networks, it is susceptible to “interpretability illusions” 29. These illusions can occur when patching along a seemingly relevant subspace appears to have the intended causal effect on the model’s behavior, but this effect is actually achieved by activating a dormant parallel pathway that is causally disconnected from the model’s output for the original input. This highlights the importance of careful experimental design and validation when using activation patching to avoid drawing misleading conclusions about the true underlying mechanisms. Activation patching can be performed at different levels of granularity, targeting components like the residual stream, individual attention heads, or the outputs of MLP layers 43. Different methods for activation patching exist, including denoising (patching clean activations into a corrupted run) and noising (patching corrupted activations into a clean run), each providing different types of information about the sufficiency and necessity of the patched components 43. The interpretation of activation patching results also heavily relies on the choice of appropriate metrics, such as the logit difference between the correct and incorrect answers, to accurately quantify the effect of the intervention 45.
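The sketch below outlines a denoising-style activation patch using plain PyTorch forward hooks. The model object, its block list, and the tokenized prompts are hypothetical placeholders; in practice a dedicated library (such as TransformerLens) typically handles this bookkeeping, and the patched module is assumed here to return a single tensor.

```python
# Minimal activation-patching sketch with PyTorch forward hooks (denoising: clean -> corrupted).
import torch

def get_activation(model, module, inputs):
    """Run the model once and cache the chosen module's output."""
    cache = {}
    def hook(mod, inp, out):
        cache["act"] = out.detach()
    handle = module.register_forward_hook(hook)
    with torch.no_grad():
        model(**inputs)
    handle.remove()
    return cache["act"]

def run_patched(model, module, inputs, patch_act):
    """Run the model with the chosen module's output replaced by `patch_act`."""
    def hook(mod, inp, out):
        return patch_act               # a value returned from a forward hook overrides the output
    handle = module.register_forward_hook(hook)
    with torch.no_grad():
        out = model(**inputs)
    handle.remove()
    return out

# Usage sketch (all names hypothetical):
# layer = model.blocks[7]
# clean_act = get_activation(model, layer, clean_inputs)
# patched_logits = run_patched(model, layer, corrupted_inputs, clean_act)
# The effect is then quantified against the clean and corrupted runs, e.g. via a logit difference.
```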
Table 2: Interpretability Tools and Their Mechanisms
Tool | Description | Goal/Purpose for Interpretability | Key Papers Referenced | Potential Limitations/Illusions |
Subspace Projection | Projecting high-dimensional activations onto lower-dimensional subspaces | Identify relevant features or concepts encoded in specific subspaces | 29 | Difficulty in determining the “correct” subspace, potential for spurious correlations |
Null Spaces | Analyzing the set of inputs or activations that map to zero or do not affect the output | Reveal information redundancy, biases, or control mechanisms within neural circuits | 34 | Complete removal of information can be non-trivial, interpretation can be abstract |
Activation Patching | Replacing internal activations from one model run with those from another to observe the effect on the output | Understand the causal influence of specific activations or components on model behavior | 29 | Susceptible to interpretability illusions, results can depend on the choice of prompts and metrics |
- Studying the Linear Representation Hypothesis and the Geometry of Large Language Models.
The linear representation hypothesis proposes that high-level semantic concepts are encoded as linear directions or low-dimensional subspaces within the representation spaces of large language models (LLMs) 50. If this hypothesis holds true, it would imply that understanding and manipulating the model’s behavior could be achieved through linear algebraic operations on these representation spaces. Researchers have formalized this hypothesis using the language of counterfactuals, where a concept is defined by pairs of inputs or outputs that differ only in the value of that concept 52. This formalization connects the linear representation hypothesis to interpretability techniques like linear probing (measuring the probability of a concept using a linear classifier) and model steering (controlling the model’s output by adding or subtracting a concept vector from the activations) 52.
To make sense of geometric notions like similarity and projection in the representation space, the concept of a “causal inner product” has been introduced 52. This non-Euclidean inner product is designed to respect the structure of language and unifies the embedding (input context) and unembedding (output word) representations of concepts. In this framework, causally separable concepts, which can be varied independently, are represented by orthogonal vectors in a transformed representation space 52. Studying the geometry of feature representations in LLMs involves examining how different concepts are arranged and related to each other in this high-dimensional space 51. Recent work has suggested that hierarchical relationships between concepts, such as the relationship between “animal” and “mammal,” might be encoded as orthogonality in the representation space 51. This implies that the vector representing a more specific concept might be orthogonal to the vector representing the broader category, or to the difference vector between them. This orthogonality could provide a structured way to understand how semantic hierarchies are organized within LLMs. However, some researchers have raised concerns that certain observed geometric properties, such as orthogonality and the formation of simplices for categorical concepts, might be trivially true in high-dimensional spaces due to the properties of random vectors, and therefore might not be specific indicators of meaningful concept representations within the model 57.
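A minimal sketch of concept-vector steering is given below, under the assumption that a concept direction has already been estimated (for example, as a mean difference of hidden states over counterfactual pairs). The module path, the scaling factor, and the handling of the block's output are hypothetical and must be adapted to the specific model.

```python
# Minimal steering sketch: add a scaled concept direction to a layer's hidden states.
import torch

def make_steering_hook(direction, alpha=4.0):
    direction = direction / direction.norm()
    def hook(module, inp, out):
        hidden = out[0] if isinstance(out, tuple) else out
        steered = hidden + alpha * direction          # push activations along the concept direction
        return (steered, *out[1:]) if isinstance(out, tuple) else steered
    return hook

# Usage sketch (all names hypothetical):
# direction = (hidden_with_concept - hidden_without_concept).mean(dim=0)
# handle = model.blocks[10].register_forward_hook(make_steering_hook(direction, alpha=4.0))
# steered_output = model.generate(**inputs)
# handle.remove()
```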
- Investigating the Distinction Between Linear and Non-Linear Representations in Neural Networks.
While the linear representation hypothesis offers an elegant framework for understanding LLMs, it is important to acknowledge that neural networks, by their very nature, are capable of learning and utilizing non-linear representations 65. The presence of non-linear activation functions in these networks allows them to model complex, non-linear relationships between inputs and outputs. To investigate whether specific features or concepts are represented linearly or non-linearly, researchers often employ probing techniques 68. Probing involves training simple classifiers, often linear models like logistic regression or support vector machines, on the intermediate activations of a trained neural network to predict a particular property or concept. If a linear probe can accurately predict the concept from the activations, it suggests that the information about that concept is organized in a way that is at least approximately linearly separable in the feature space, supporting the idea of a linear representation. The effectiveness of linear probes in uncovering various semantic and syntactic features in LLMs indicates that many important aspects of language are indeed encoded in a roughly linear fashion within these models 72.
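A minimal linear-probing sketch is shown below. The hidden states are synthetic placeholders for activations extracted from a single layer, and the label is constructed to be linearly decodable so that the probe succeeds, mirroring the reasoning above.

```python
# Minimal linear probe: logistic regression on frozen hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2000, 768))                    # hypothetical layer activations
labels = (hidden[:, :3].sum(axis=1) > 0).astype(int)     # concept that is linearly decodable here

X_train, X_test, y_train, y_test = train_test_split(hidden, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe test accuracy:", probe.score(X_test, y_test))
# High accuracy suggests the concept is (approximately) linearly represented at this layer;
# near-chance accuracy would motivate trying a non-linear probe instead.
```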
Conversely, if linear probes fail to accurately predict a concept, it might suggest that the concept is encoded in a more complex, non-linear manner. In such cases, non-linear probing techniques, using models with non-linear activation functions, might be more successful. Another set of techniques for studying concept representations involves concept erasure methods 74. The goal of concept erasure is to remove information about a specific concept from the network’s representations while preserving other relevant information. This can be achieved by learning a transformation of the representations that makes it difficult for a classifier to predict the target concept. Concept erasure can be linear, aiming to prevent linear classifiers from predicting the concept, or non-linear, aiming to thwart even non-linear classifiers 74. Methods like LEACE (Least-squares Concept Erasure) and TaCo (Targeted Concept Erasure) have been developed to effectively remove concept information, even against non-linear adversaries 77. Recent research has also provided evidence that not all features in language models are linear 81. For example, studies have discovered multi-dimensional, non-linear representations, such as circular features in the activation space that represent cyclical concepts like days of the week and months of the year. These findings suggest that while linear representations play a significant role in LLMs, the models also utilize more complex, non-linear encoding schemes for certain types of information, indicating that a purely linear view might not capture the full richness of the learned representations.
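The linear-versus-non-linear contrast can be illustrated on synthetic data: a concept encoded as an XOR of two directions defeats a linear probe but is recovered by a small MLP probe, the signature one would expect of a genuinely non-linear representation. The data and probe architectures below are illustrative assumptions.

```python
# Minimal sketch: a non-linearly encoded concept that a linear probe cannot read out.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(4000, 64))
concept = ((acts[:, 0] > 0) ^ (acts[:, 1] > 0)).astype(int)   # XOR of two directions

X_tr, X_te, y_tr, y_te = train_test_split(acts, concept, random_state=0)
linear_probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
mlp_probe = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0).fit(X_tr, y_tr)

print("linear probe accuracy:", linear_probe.score(X_te, y_te))   # near chance (~0.5)
print("MLP probe accuracy:", mlp_probe.score(X_te, y_te))         # well above chance
```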
- Examining How Concepts Are Encoded in Subspaces.
The idea that concepts might be encoded within low-dimensional subspaces of the high-dimensional activation space of neural networks is a recurring theme in interpretability research 2. This suggests that while the overall activation space might have a very large number of dimensions (corresponding to the number of neurons), the information relevant to a particular concept might be concentrated in a much smaller subset of these dimensions, forming a subspace. Orthogonality plays a crucial role in the efficient and potentially interpretable encoding of concepts within these subspaces 63. If the subspaces representing different concepts are orthogonal or nearly orthogonal to each other, it implies that these concepts are encoded in an independent manner, reducing redundancy and potentially making it easier to understand how each concept is represented and manipulated by the network without interference from others. Orthogonal representations can also contribute to more stable training and better generalization 91.
The dimensionality of these concept subspaces is also an important factor 100. Studies on the intrinsic dimensionality of data representations in trained neural networks have shown that the minimal number of parameters needed to describe these representations is often orders of magnitude smaller than the total number of neurons in a layer 100. This suggests that the essential information is compressed into a lower-dimensional space, and identifying this intrinsic dimensionality can help focus interpretability efforts on the most relevant aspects of the representation. The Johnson-Lindenstrauss (JL) lemma provides a theoretical foundation for why dimensionality reduction can be effective in preserving essential relationships in high-dimensional spaces 105. It states that a set of points in a high-dimensional space can be embedded into a much lower-dimensional space while approximately preserving the pairwise distances between the points. This lemma suggests that we can significantly reduce the dimensionality of neural network activation spaces without losing too much crucial information, which can be valuable for visualization and further analysis. However, it is also important to consider that in high-dimensional spaces, random vectors are likely to be nearly orthogonal 96. This ubiquity of near orthogonality can complicate the interpretation of observed orthogonality in neural network representations, as it might not always signify a specific, learned relationship between concepts.
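This caveat is easy to check directly: the short sketch below measures cosine similarities between independent random unit vectors and shows how quickly they concentrate near zero as the dimension grows.

```python
# Near-orthogonality of random directions in high dimensions.
import numpy as np

rng = np.random.default_rng(0)
for dim in (10, 100, 1000, 10000):
    v = rng.normal(size=(200, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)        # 200 random unit vectors
    cos = v @ v.T
    off_diag = np.abs(cos[~np.eye(len(cos), dtype=bool)])
    print(f"dim={dim:6d}  mean |cos|={off_diag.mean():.3f}  max |cos|={off_diag.max():.3f}")
# |cos| shrinks on the order of 1/sqrt(dim), so in thousands of dimensions near-orthogonal
# pairs are the default rather than the exception.
```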
- Exploring the Use of SVD for Analyzing Transformer Weight Matrices.
Singular Value Decomposition (SVD) is a powerful matrix factorization technique that decomposes a matrix into the product of three other matrices: a left singular vector matrix (U), a diagonal matrix of singular values (S), and the transpose of a right singular vector matrix (V^T) 112. In the context of transformer networks, SVD can be applied to the weight matrices of different layers, such as the query, key, value, and output matrices in attention heads, and the input and output weight matrices in the Multi-Layer Perceptron (MLP) layers. The singular values and vectors obtained from SVD reveal the principal directions of the linear transformations performed by these weight matrices, essentially highlighting the directions in the input space that have the most influence on the output.
A significant finding in recent research is that the singular value decompositions of transformer weight matrices, particularly those in the Output-Value (OV) circuits of attention heads and the MLP layers, are often highly interpretable 114. When the singular vectors of these weight matrices are projected back into the token embedding space (the vector space where individual words or tokens are represented), they often align with semantically meaningful clusters of tokens. For example, one singular vector might correspond to the concept of “negation” and activate strongly on words like “not,” “no,” and “never,” while another might correspond to “locations” and activate on city and country names 114. This suggests that transformer networks learn to align the principal directions of their weight matrices to read from and write to semantically interpretable directions in the residual stream, the central information highway of the transformer architecture 118. This interpretability of the SVD components allows researchers to gain insights into the specific functions and the types of information being processed by different parts of the transformer network. Furthermore, SVD can be used as a tool for both understanding and manipulating the representations within the network 117. By identifying the singular vectors that correspond to specific natural language concepts, researchers can potentially edit the model’s behavior by removing or enhancing the contribution of these vectors in the weight matrices. For instance, removing the singular vector associated with a harmful concept might mitigate the model’s tendency to produce toxic language. The analysis of transformer weight matrices using SVD also supports the idea of interpretable subspaces and low-rank communication channels between different layers of the network 119. These channels suggest that information flow and computation within transformers might be happening within specific, lower-dimensional subspaces of the weight and activation spaces, rather than utilizing the full dimensionality for every operation. Identifying these low-rank communication channels is a key step towards achieving a more mechanistic understanding of how information is processed and transformed within these complex models.
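The sketch below outlines this kind of SVD inspection. The OV-circuit matrix, the unembedding matrix, and the tokenizer are hypothetical placeholders for a real model's tensors, and whether the rows of V^T or the columns of U should be unembedded depends on the convention used for applying the weights (x·W versus W·x).

```python
# Minimal sketch: decompose a (d_model x d_model) weight matrix and list the tokens most
# aligned with its leading singular directions via the unembedding matrix.
import numpy as np

def top_tokens_per_singular_vector(W_ov, W_U, tokenizer, k_vectors=5, k_tokens=10):
    U, S, Vt = np.linalg.svd(W_ov)
    for i in range(k_vectors):
        logits = Vt[i] @ W_U                            # project the i-th singular direction into vocab space
        top_ids = np.argsort(logits)[::-1][:k_tokens]
        tokens = [tokenizer.decode([int(t)]) for t in top_ids]
        print(f"singular value {S[i]:.2f}: {tokens}")

# Usage sketch (all names hypothetical):
# W_ov = (W_V @ W_O).detach().cpu().numpy()     # OV circuit of one attention head
# W_U  = model.unembedding_weight.detach().cpu().numpy()
# top_tokens_per_singular_vector(W_ov, W_U, tokenizer)
```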
- Investigating the Geometry of Specific Concepts in Language Models.
Recent research has begun to explore the geometric representations of specific types of concepts within the representation spaces of large language models, aiming to uncover how semantic meaning is encoded through spatial arrangements of concept vectors.
9.1 Categorical Concepts.
One intriguing finding suggests that simple categorical concepts in LLMs, such as the different types of animals (e.g., mammal, bird, reptile, fish), might be represented as simplices in the representation space 51. A simplex is a geometric shape that is the generalization of a triangle or a tetrahedron to any number of dimensions. For example, in a 2D space, a simplex is a triangle, and in a 3D space, it’s a tetrahedron. The idea here is that each vertex of the simplex could correspond to a specific category within the broader concept (e.g., mammal, bird, etc.), and the representation of a particular instance or token related to that concept would lie somewhere within or on the boundary of this simplex. This geometric arrangement might reflect the mutually exclusive nature of the categories; an animal is typically either a mammal or a bird, not both simultaneously at the same level of categorization. However, it has been argued that the observation of simplices might not be entirely specific to meaningful categorical concepts, as even random sets of vectors in high-dimensional spaces can exhibit similar geometric properties 64.
9.2 Hierarchical Concepts.
Another line of research has focused on how hierarchical relationships between concepts, such as the “is-a” relationship (e.g., a dog is a mammal, a mammal is an animal), are encoded geometrically in LLMs. A prominent hypothesis is that these hierarchical relationships are represented through orthogonality between the concept vectors 51. For instance, the vector representing the concept “mammal” might be orthogonal to the vector representing the difference between “mammal” and its superordinate concept “animal.” This orthogonality could signify that the features that distinguish a mammal from a general animal are independent of the features that define an animal itself, reflecting the hierarchical structure where “mammal” inherits the properties of “animal” but also has its own unique characteristics. This type of geometric encoding could provide a mechanism for LLMs to understand and reason about semantic hierarchies. However, similar to the case of categorical concepts, the interpretation of orthogonality as a specific indicator of hierarchical relationships is also subject to the caveat that random vectors in high-dimensional spaces tend to be nearly orthogonal, so observed orthogonality might not always be a direct result of learned semantic structure 57.
Table 3: Summary of Findings on Concept Geometry in LLMs
Concept Type | Proposed Geometric Representation | Supporting Papers | Key Insights | Criticisms/Caveats |
Categorical Concepts | Simplices | 51 | May reflect the mutually exclusive nature of categories | Similar structures can arise from random vectors in high dimensions |
Hierarchical Concepts | Orthogonal Vectors/Subspaces | 51 | Could encode hierarchical inheritance and differentiation | Orthogonality is common in high-dimensional spaces, so observed orthogonality might not always be semantically significant |
- Considering the Impact of Layer Normalization on Interpretability.
Layer normalization is a widely used technique in deep neural networks, particularly in transformer architectures, designed to stabilize and accelerate the training process 126. It works by normalizing the activations of neurons within each layer across the features for each individual training example. This normalization helps to maintain a more consistent distribution of activations throughout the network, allowing for the use of higher learning rates and reducing the sensitivity to weight initialization, ultimately leading to faster convergence and improved model performance 126. While layer normalization offers significant benefits for training, it can pose challenges for the interpretability of the resulting models, particularly for mechanistic interpretability efforts 133. One of the main ways layer normalization hinders interpretability is by obscuring the direct relationship between the values in the residual stream (the central pathway for information flow in transformers) and the model’s final output logits (the unnormalized scores that determine the predicted probabilities of different tokens) 133. The normalization process removes information about the original scale and distribution of features, which can be important for techniques like the logit lens, where researchers try to directly map directions in the residual stream to changes in the output probabilities 133. Furthermore, layer normalization can complicate the decomposition of the transformer model into individual computational paths or “circuits” 133. The non-linear nature of the layer normalization operation makes it difficult to analyze the flow of information through the network in a purely linear way, hindering the identification of specific sub-networks responsible for particular functions 133. In response to these challenges, some researchers have explored methods to remove layer normalization from pre-trained models through fine-tuning, aiming to obtain models that are more amenable to mechanistic interpretability analysis without sacrificing performance 133. It is worth noting that layer normalization is distinct from other normalization techniques like batch normalization, which normalizes across the batch dimension, and weight normalization, which reparameterizes the weights of the network 127. Each of these techniques has its own impact on the training dynamics and the interpretability of the resulting neural networks.
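The sketch below computes layer normalization explicitly, which makes the source of the difficulty concrete: because the output is divided by a statistic of the input itself, the operation discards scale information and is not a linear function of the residual stream.

```python
# Layer normalization computed by hand on a single residual-stream vector.
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

d_model = 8
x = torch.randn(d_model)
gamma, beta = torch.ones(d_model), torch.zeros(d_model)

print(layer_norm(x, gamma, beta))
print(layer_norm(2 * x, gamma, beta))   # (almost) identical output: the scale of x is discarded,
                                        # which is part of what complicates logit-lens-style analysis
```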
- Brainstorming Potential Interactive Visualizations.
To further enhance the understanding of manifolds and subspaces in neural networks, particularly in the context of LLM interpretability, interactive visualizations can be invaluable tools. One promising approach involves using dimensionality reduction techniques like UMAP to create interactive 2D or 3D embeddings of the high-dimensional feature manifolds present in the activation space of LLMs 9. These visualizations could be generated for different layers of the network and for various tasks, allowing researchers to explore the spatial arrangement of neuron activations corresponding to different inputs or concepts. Interactivity would enable users to zoom in on specific regions, rotate the view to gain different perspectives, and query individual data points to examine the corresponding input or concept. This could potentially reveal clusters of similar inputs, the separation between different conceptual manifolds, and the overall geometric organization of the representation space. Another potential visualization method could focus on illustrating the effect of activation patching. By showing how the model’s output changes in response to the replacement of specific activations or subspaces with those from another input, researchers could gain a more intuitive understanding of the causal relationships within the network. For example, a visualization could highlight the change in the probability distribution over output tokens as activations from a “correct” example are patched into an “incorrect” example. Visualizing the principal components obtained from the SVD of weight matrices in the token embedding space could also be highly informative. An interactive visualization could allow users to explore the semantic clusters associated with different singular vectors, perhaps by displaying the top-activating tokens for each vector. This could provide a direct visual link between the mathematical decomposition of the weight matrices and the semantic concepts learned by the model. Finally, for concepts that are hypothesized to have specific geometric representations, such as simplices for categorical concepts, visualizations in a lower-dimensional embedding space could help to validate these theories and provide a more intuitive grasp of the underlying structure.
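As a starting point, the sketch below builds such an interactive embedding with the umap-learn and plotly packages. The activation matrix, labels, and hover texts are synthetic placeholders for hidden states extracted from an LLM and the inputs they correspond to; the UMAP hyperparameters are illustrative.

```python
# Minimal interactive UMAP view of layer activations (zoom, pan, hover to inspect examples).
import numpy as np
import pandas as pd
import umap
import plotly.express as px

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))               # hypothetical per-example hidden states
labels = rng.choice(["concept A", "concept B", "concept C"], size=1000)
texts = [f"example {i}" for i in range(1000)]            # hover text: the underlying inputs

embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2,
                      random_state=0).fit_transform(activations)

df = pd.DataFrame({"x": embedding[:, 0], "y": embedding[:, 1],
                   "label": labels, "text": texts})
fig = px.scatter(df, x="x", y="y", color="label", hover_name="text",
                 title="UMAP projection of layer activations (illustrative)")
fig.show()
```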
- Conclusion: Geometric Insights for Advancing Neural Network Interpretability.
This report has explored the concepts of manifolds and subspaces within the context of neural networks, with a particular focus on their implications for the interpretability of large language models. The geometric perspective offers a powerful framework for understanding the complex internal representations and computations of these models. We have discussed the definitions of manifolds and subspaces, the role of dimensionality reduction techniques in revealing underlying geometric structures, and various interpretability tools that leverage these concepts, including subspace projection, null space analysis, and activation patching. The linear representation hypothesis and the geometry of features in LLMs have been examined, highlighting the potential for concepts to be encoded as linear directions or within orthogonal subspaces. The distinction between linear and non-linear representations and the methods used to investigate them, such as probing and concept erasure, have been considered. The encoding of concepts in subspaces, the role of orthogonality and dimensionality (including the Johnson-Lindenstrauss lemma), and the insights gained from SVD analysis of transformer weight matrices have been detailed. Furthermore, the report has investigated the geometry of specific concept types, such as categorical and hierarchical concepts, and the impact of layer normalization on interpretability. Finally, we have brainstormed potential interactive visualizations that could further enhance our understanding of these geometric aspects. While significant progress has been made in applying geometric concepts to the challenge of neural network interpretability, it remains an ongoing area of research with many open questions. Future work will likely continue to refine our understanding of how high-dimensional data, including the internal representations of LLMs, reside on lower-dimensional manifolds, and how the linear approximations provided by subspaces can be effectively used for analysis. Further exploration of non-linear representation schemes and the development of more sophisticated interpretability tools that can handle the complexity of these models will be crucial for achieving truly transparent and reliable AI systems.
Works cited
-
A Survey on Neural Network Interpretability - GitHub Pages, accessed on March 23, 2025, https://petertino.github.io/web/PAPERS/Yu_Survey_Interpret_DNN.pdf
-
Neural Manifolds — Linear Algebra and Topology in Neuroscience …, accessed on March 23, 2025, https://medium.com/bits-and-neurons/neural-manifolds-linear-algebra-and-topology-in-neuroscience-dde5a8181811
-
The manifold framework of neural circuits - Geometry Matters, accessed on March 23, 2025, https://geometrymatters.com/the-manifold-framework-of-neural-circuits/
-
Manifold Hypothesis - PRIMO.ai, accessed on March 23, 2025, https://primo.ai/index.php/Manifold_Hypothesis
-
Unveiling Manifold learning. What a neural network is really doing? | by Yogesh Haribhau Kulkarni (PhD) | Technology Hits | Medium, accessed on March 23, 2025, https://medium.com/technology-hits/unveiling-manifold-learning-fec4126bde73
-
Manifold hypothesis - Wikipedia, accessed on March 23, 2025, https://en.wikipedia.org/wiki/Manifold_hypothesis
-
Understanding Feature Learning in Neural Networks via Manifold Capacity and Effective Geometry - CCN 2024, accessed on March 23, 2025, https://2024.ccneuro.org/pdf/488_Paper_authored_Learning_Geometry__CCN_-(1).pdf
-
Manifold Learning in Machine Learning | by Hey Amit | Medium, accessed on March 23, 2025, https://medium.com/@heyamit10/manifold-learning-in-machine-learning-e008e480d036
-
No Straight Lines Here: The Wacky World of Non-Linear Manifold …, accessed on March 23, 2025, https://sites.gatech.edu/omscs7641/2024/03/10/no-straight-lines-here-the-wacky-world-of-non-linear-manifold-learning/
-
Learning Neural Network Subspaces - Apple Machine Learning …, accessed on March 23, 2025, https://machinelearning.apple.com/research/learning-neural-network-subspaces
-
Learning Neural Network Subspaces - arXiv, accessed on March 23, 2025, https://arxiv.org/pdf/2102.10472
-
Improving Neural Network Subspaces - Apple Machine Learning Research, accessed on March 23, 2025, https://machinelearning.apple.com/research/improving-neural-subspaces
-
Active Subspace of Neural Networks: Structural Analysis and …, accessed on March 23, 2025, https://epubs.siam.org/doi/abs/10.1137/19M1296070
-
Active Subspace of Neural Networks: Structural Analysis and Universal Attacks ∗ - ece.ucsb.edu - University of California, Santa Barbara, accessed on March 23, 2025, https://web.ece.ucsb.edu/~zhengzhang/journals/2020_SIMODS_Cui_Active_Subsapce.pdf
-
Dimensionality reduction - Wikipedia, accessed on March 23, 2025, https://en.wikipedia.org/wiki/Dimensionality_reduction
-
Manifold Learning vs PCA. Dimensionality reduction might sound …, accessed on March 23, 2025, https://medium.com/biased-algorithms/manifold-learning-vs-pca-4b45df25ea0b
-
2.2. Manifold learning — scikit-learn 1.6.1 documentation, accessed on March 23, 2025, https://scikit-learn.org/stable/modules/manifold.html
-
Manifold-Learning/Isomap.ipynb at master · drewwilimitis/Manifold …, accessed on March 23, 2025, https://github.com/drewwilimitis/Manifold-Learning/blob/master/Isomap.ipynb
-
Unraveling Data Patterns with Isomap: A Guide to Dimensionality Reduction — Part 4 | by Connie Zhou | Medium, accessed on March 23, 2025, https://medium.com/@conniezhou678/unraveling-data-patterns-with-isomap-a-guide-to-dimensionality-reduction-part-4-1d774eee69a5
-
Beyond 3D: Flattening Complex Data Structures with Isomap | BIP xTech - Medium, accessed on March 23, 2025, https://medium.com/bip-xtech/beyond-3d-flattening-complex-data-structures-with-isomap-03643dca594c
-
Isomap - Wikipedia, accessed on March 23, 2025, https://en.wikipedia.org/wiki/Isomap
-
ISOMAP - YouTube, accessed on March 23, 2025, https://www.youtube.com/watch?v=bi-xB9OywlM
-
Applied topology 21: Nonlinear dimensionality reduction - Isomap, Part I - YouTube, accessed on March 23, 2025, https://www.youtube.com/watch?v=Xu_3NnkAI9s
-
Dimension Reduction - IsoMap | DigitalOcean, accessed on March 23, 2025, https://www.digitalocean.com/community/tutorials/dimension-reduction-with-isomap
-
Understanding UMAP, accessed on March 23, 2025, https://pair-code.github.io/understanding-umap/
-
Isomap for Dimensionality Reduction in Python - Ben Alex Keen, accessed on March 23, 2025, https://benalexkeen.com/isomap-for-dimensionality-reduction-in-python/
-
Swiss Roll Reduction with LLE in Scikit Learn - GeeksforGeeks, accessed on March 23, 2025, https://www.geeksforgeeks.org/swiss-roll-reduction-with-lle-in-scikit-learn/
-
Should dimensionality reduction for visualization be considered a “closed” problem, solved by t-SNE? - Cross Validated, accessed on March 23, 2025, https://stats.stackexchange.com/questions/270391/should-dimensionality-reduction-for-visualization-be-considered-a-closed-probl
-
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching | OpenReview, accessed on March 23, 2025, https://openreview.net/forum?id=Ebt7JgMHv1
-
Evaluating and Improving Subspace Inference in Bayesian Deep Learning | OpenReview, accessed on March 23, 2025, https://openreview.net/forum?id=1wRXUROlzY
-
Neural Networks Perform Sufficient Dimension Reduction - arXiv, accessed on March 23, 2025, https://arxiv.org/html/2412.19033v1
-
Forecast Combination and Interpretability Using Random Subspace - The Econometric Society, accessed on March 23, 2025, https://www.econometricsociety.org/event_papers/download/278/367/2/Kozyrev_RS_UPD.pdf
-
NeurIPS Poster Disentangling Interpretable Factors with Supervised Independent Subspace Principal Component Analysis, accessed on March 23, 2025, https://nips.cc/virtual/2024/poster/96270
-
Null space. The black hole of Math. | by Mathphye - Medium, accessed on March 23, 2025, https://medium.com/@mathphye/null-space-the-black-hole-of-math-b63dc14d0057
-
Null Space Properties of Neural Networks with Applications to Image Steganography - arXiv, accessed on March 23, 2025, https://arxiv.org/abs/2401.10262
-
Cortical activity in the null space: permitting preparation without movement - PMC - PubMed Central, accessed on March 23, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC3955357/
-
[PDF] Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection | Semantic Scholar, accessed on March 23, 2025, https://www.semanticscholar.org/paper/Null-It-Out%3A-Guarding-Protected-Attributes-by-Ravfogel-Elazar/e969aa3422a49152c22f3faf734e4561a2a3cf42
-
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection - ResearchGate, accessed on March 23, 2025, https://www.researchgate.net/publication/340683004_Null_It_Out_Guarding_Protected_Attributes_by_Iterative_Nullspace_Projection
-
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection - ACL Anthology, accessed on March 23, 2025, https://aclanthology.org/2020.acl-main.647/
-
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection - ACL Anthology, accessed on March 23, 2025, https://aclanthology.org/2020.acl-main.647.pdf
-
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection - ResearchGate, accessed on March 23, 2025, https://www.researchgate.net/publication/343299041_Null_It_Out_Guarding_Protected_Attributes_by_Iterative_Nullspace_Projection
-
Mechanistic Interpretability | Decode Neural Networks | CSA - Cloud Security Alliance, accessed on March 23, 2025, https://cloudsecurityalliance.org/blog/2024/09/05/mechanistic-interpretability-101
-
How to use and interpret activation patching - arXiv, accessed on March 23, 2025, https://arxiv.org/html/2404.15255v1
-
How to use and interpret activation patching - arXiv, accessed on March 23, 2025, https://arxiv.org/pdf/2404.15255
-
How to use and interpret activation patching - LessWrong, accessed on March 23, 2025, https://www.lesswrong.com/posts/FhryNAFknqKAdDcYy/how-to-use-and-interpret-activation-patching
-
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching - Semantic Scholar, accessed on March 23, 2025, https://www.semanticscholar.org/paper/Is-This-the-Subspace-You-Are-Looking-for-An-for-Makelov-Lange/61393e6ad8262f2fbf1ef980608c3deae5fb1afd
-
ICLR Poster Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching - ICLR 2025, accessed on March 23, 2025, https://iclr.cc/virtual/2024/poster/19100
-
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching | OpenReview, accessed on March 23, 2025, https://openreview.net/forum?id=cGui9S5GqL
-
[2311.17030] Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching - arXiv, accessed on March 23, 2025, https://arxiv.org/abs/2311.17030
-
The Linear Representation Hypothesis in Language Models - NeurIPS, accessed on March 23, 2025, https://neurips.cc/virtual/2023/77537
-
The Geometry of Categorical and Hierarchical Concepts in Large Language Models, accessed on March 23, 2025, https://openreview.net/forum?id=bVTM2QKYuA
-
The Linear Representation Hypothesis and the Geometry of Large Language Models - GitHub, accessed on March 23, 2025, https://raw.githubusercontent.com/mlresearch/v235/main/assets/park24c/park24c.pdf
-
The Linear Representation Hypothesis and the Geometry of Large Language Models - arXiv, accessed on March 23, 2025, https://arxiv.org/abs/2311.03658
-
The Linear Representation Hypothesis and the Geometry of Large Language Models | AI Research Paper Details - AIModels.fyi, accessed on March 23, 2025, https://www.aimodels.fyi/papers/arxiv/linear-representation-hypothesis-geometry-large-language-models
-
The Linear Representation Hypothesis and the Geometry of Large Language Models - arXiv, accessed on March 23, 2025, https://arxiv.org/html/2311.03658v2
-
The Linear Representation Hypothesis and the Geometry of Large Language Models - OpenReview, accessed on March 23, 2025, https://openreview.net/pdf?id=T0PoOJg8cK
-
Intricacies of Feature Geometry in Large Language Models - LessWrong, accessed on March 23, 2025, https://www.lesswrong.com/posts/eczwWrmX5XNEo7JsS/intricacies-of-feature-geometry-in-large-language-models
-
On the geometry of Large Language Models representations - Tech Talk Alberto Cazzaniga., accessed on March 23, 2025, https://www.youtube.com/watch?v=Ey8iiP0NblU
-
Intricacies of Feature Geometry in Large Language Models | Cool Papers, accessed on March 23, 2025, https://papers.cool/venue/Ut3ml7Hdwx@OpenReview
-
Do Large Language Models Truly Understand Geometric Structures? - OpenReview, accessed on March 23, 2025, https://openreview.net/forum?id=FjQOXenaXK
-
The Geometry of Categorical and Hierarchical Concepts in Large Language Models - arXiv, accessed on March 23, 2025, https://arxiv.org/html/2406.01506v1
-
The Geometry of Categorical and Hierarchical Concepts in Large Language Models, accessed on March 23, 2025, https://openreview.net/forum?id=KXuYjuBzKo
-
(PDF) The Geometry of Categorical and Hierarchical Concepts in Large Language Models, accessed on March 23, 2025, https://www.researchgate.net/publication/381152829_The_Geometry_of_Categorical_and_Hierarchical_Concepts_in_Large_Language_Models
-
The Geometry of Feelings and Nonsense in Large Language Models - LessWrong, accessed on March 23, 2025, https://www.lesswrong.com/posts/C8LZ3DW697xcpPaqC/the-geometry-of-feelings-and-nonsense-in-large-language
-
medium.com, accessed on March 23, 2025, https://medium.com/@abhishekjainindore24/linear-and-non-linear-neural-networks-ec67f2d61fb1#:~:text=Examples%20of%20linear%20activation%20functions,relationships%20between%20inputs%20and%20outputs.
-
Linear and non linear neural networks | by Abhishek Jain - Medium, accessed on March 23, 2025, https://medium.com/@abhishekjainindore24/linear-and-non-linear-neural-networks-ec67f2d61fb1
-
Activation functions in Neural Networks - GeeksforGeeks, accessed on March 23, 2025, https://www.geeksforgeeks.org/activation-functions-neural-networks/
-
sidn.baulab.info, accessed on March 23, 2025, https://sidn.baulab.info/probing/#:~:text=Intervention%20Technique%20for%20Probing%20Internal%20Representations&text=They%20found%20that%20linear%20probess,state%20in%20the%20network’s%20activations.
-
Probing - Structure and Interpretation of Deep Networks, accessed on March 23, 2025, https://sidn.baulab.info/probing/
-
Unveiling Neural Network Mysteries with Linear Classifier Probes - Code Labs Academy, accessed on March 23, 2025, https://codelabsacademy.com/en/blog/linear-classifier-role-in-the-analysis-of-deep-neural-networks
-
A Fresh Look at Nonlinearity in Deep Learning | TDS Archive - Medium, accessed on March 23, 2025, https://medium.com/towards-data-science/a-fresh-look-at-nonlinearity-in-deep-learning-a79b6955d2ad
-
Deep learning models are secretly (almost) linear - Beren’s Blog, accessed on March 23, 2025, https://www.beren.io/2023-04-04-DL-models-are-secretly-linear/
-
Deep learning models might be secretly (almost) linear - LessWrong, accessed on March 23, 2025, https://www.lesswrong.com/posts/JK9nxcBhQfzEgjjqe/deep-learning-models-might-be-secretly-almost-linear
-
Kernelized Concept Erasure - ACL Anthology, accessed on March 23, 2025, https://aclanthology.org/2022.emnlp-main.405.pdf
-
[2201.12191] Kernelized Concept Erasure - arXiv, accessed on March 23, 2025, https://arxiv.org/abs/2201.12191
-
Robust Concept Erasure via Kernelized Rate-Distortion Maximization - NIPS papers, accessed on March 23, 2025, https://proceedings.neurips.cc/paper_files/paper/2023/file/86bd650f85480c595ecab29081a3774e-Paper-Conference.pdf
-
NeurIPS Poster LEACE: Perfect linear concept erasure in closed form, accessed on March 23, 2025, https://neurips.cc/virtual/2023/poster/71176
-
LEACE: Perfect linear concept erasure in closed form - OpenReview, accessed on March 23, 2025, https://openreview.net/forum?id=awIpKpwTwF&noteId=Ju4XcafMir
-
Kernelized Concept Erasure | Bytez, accessed on March 23, 2025, https://bytez.com/docs/arxiv/2201.12191
-
TaCo: Targeted Concept Erasure Prevents Non-Linear Classifiers From Detecting Protected Attributes - arXiv, accessed on March 23, 2025, https://arxiv.org/html/2312.06499v4
-
Not All Language Model Features Are Linear- American Mathematical Society, accessed on March 23, 2025, https://meetings.ams.org/math/jmm2025/meetingapp.cgi/Paper/43413
-
(PDF) Not All Language Model Features Are Linear - ResearchGate, accessed on March 23, 2025, https://www.researchgate.net/publication/380847625_Not_All_Language_Model_Features_Are_Linear
-
Not All Language Model Features Are Linear - arXiv, accessed on March 23, 2025, https://arxiv.org/html/2405.14860v1
-
Not All Language Model Features Are Linear | AI Research Paper Details - AIModels.fyi, accessed on March 23, 2025, https://www.aimodels.fyi/papers/arxiv/not-all-language-model-features-are-linear
-
Paper page - Not All Language Model Features Are Linear - Hugging Face, accessed on March 23, 2025, https://huggingface.co/papers/2405.14860
-
Convolutional neural network models describe the encoding subspace of local circuits in auditory cortex | bioRxiv, accessed on March 23, 2025, https://www.biorxiv.org/content/10.1101/2024.11.07.622384v1.full-text
-
Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations | OpenReview, accessed on March 23, 2025, https://openreview.net/forum?id=NUQeYgg8x4
-
Convolutional neural network models describe the encoding subspace of local circuits in auditory cortex - PubMed, accessed on March 23, 2025, https://pubmed.ncbi.nlm.nih.gov/39574636/
-
Autoencoder - Wikipedia, accessed on March 23, 2025, https://en.wikipedia.org/wiki/Autoencoder
-
Encoding sensory and motor patterns as time-invariant trajectories in recurrent neural networks, accessed on March 23, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC5851701/
-
Controllable Orthogonalization in Training DNNs - CVF Open Access, accessed on March 23, 2025, https://openaccess.thecvf.com/content_CVPR_2020/papers/Huang_Controllable_Orthogonalization_in_Training_DNNs_CVPR_2020_paper.pdf
-
Orthogonal Neural Network: An Analytical Model for Deep Learning - MDPI, accessed on March 23, 2025, https://www.mdpi.com/2076-3417/14/4/1532
-
Batch Normalization Orthogonalizes Representations in Deep Random Networks - NeurIPS, accessed on March 23, 2025, https://proceedings.neurips.cc/paper/2021/file/26cd8ecadce0d4efd6cc8a8725cbd1f8-Paper.pdf
-
Existence, Stability and Scalability of Orthogonal Convolutional Neural Networks - Journal of Machine Learning Research, accessed on March 23, 2025, https://jmlr.org/papers/volume23/22-0026/22-0026.pdf
-
Orthogonal regularizers in deep learning: how to handle rectangular matrices? - Université catholique de Louvain, accessed on March 23, 2025, https://perso.uclouvain.be/estelle.massart/documents/papier_ICPR_orthogonal.pdf
-
Can We Gain More from Orthogonality Regularizations in Training Deep Networks? - NeurIPS, accessed on March 23, 2025, http://papers.neurips.cc/paper/7680-can-we-gain-more-from-orthogonality-regularizations-in-training-deep-networks.pdf
-
Near-Orthogonality Regularization in Kernel Methods, accessed on March 23, 2025, http://auai.org/uai2017/proceedings/papers/6.pdf
-
Classical and Quantum Algorithms for Orthogonal Neural Networks - OpenReview, accessed on March 23, 2025, https://openreview.net/forum?id=t7y6MKiyiWx
-
A Network Architecture Plug-in for Learning Orthogonal Filters - CVF Open Access, accessed on March 23, 2025, https://openaccess.thecvf.com/content_WACV_2020/papers/Zhang_Self-Orthogonality_Module_A_Network_Architecture_Plug-in_for_Learning_Orthogonal_Filters_WACV_2020_paper.pdf
-
Intrinsic dimension of data representations in deep neural networks - NIPS papers, accessed on March 23, 2025, http://papers.neurips.cc/paper/8843-intrinsic-dimension-of-data-representations-in-deep-neural-networks.pdf
-
The Curse of Dimensionality in Machine Learning: Challenges, Impacts, and Solutions, accessed on March 23, 2025, https://www.datacamp.com/blog/curse-of-dimensionality-machine-learning
-
What is the curse of dimensionality, and why does deep learning overcome it? - Xomnia, accessed on March 23, 2025, https://www.xomnia.com/post/what-is-the-curse-of-dimensionality-and-why-does-deep-learning-overcome-it/
-
[1905.12784] Intrinsic dimension of data representations in deep neural networks - arXiv, accessed on March 23, 2025, https://arxiv.org/abs/1905.12784
-
Can neural networks be used for dimensionality reduction? Are there any tasks you shouldn’t use NN for? | Kaggle, accessed on March 23, 2025, https://www.kaggle.com/discussions/questions-and-answers/159941
-
en.wikipedia.org, accessed on March 23, 2025, https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma#:~:text=The%20lemma%20states%20that%20a,is%20a%20random%20orthogonal%20projection.
-
Johnson–Lindenstrauss lemma - Wikipedia, accessed on March 23, 2025, https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma
-
The Johnson-Lindenstrauss Lemma - University of Toronto Mathematics, accessed on March 23, 2025, https://www.math.toronto.edu/undergrad/projects-undergrad/Project03.pdf
-
machine learning - When to use the Johnson-Lindenstrauss lemma over SVD?, accessed on March 23, 2025, https://cstheory.stackexchange.com/questions/21487/when-to-use-the-johnson-lindenstrauss-lemma-over-svd
-
Johnson-Lindenstrauss lemma | Personal blog of Boris Burkov, accessed on March 23, 2025, https://borisburkov.net/2021-09-10-1/
-
The Johnson-Lindenstrauss Lemma in Python | by r3d_robot - Medium, accessed on March 23, 2025, https://medium.com/@r3d_robot/the-johnson-lindenstrauss-lemma-ef10698d0dc6
-
Orthogonality expected at higher dimensions - Doug Turnbull, accessed on March 23, 2025, https://softwaredoug.com/blog/2022/12/26/surpries-at-hi-dimensions-orthoginality
-
Small Singular Values Matter: A Random Matrix Analysis of Transformer Models - arXiv, accessed on March 23, 2025, https://arxiv.org/html/2410.17770v2
-
Harnessing Singular Value Decomposition (SVD) for Efficient Neural Network Weight Pruning | by Vitality Learning, accessed on March 23, 2025, https://vitalitylearning.medium.com/harnessing-singular-value-decomposition-svd-for-efficient-neural-network-weight-pruning-21a44f22793f
-
SVD on Transformer Weight Matrices - YouTube, accessed on March 23, 2025, https://www.youtube.com/watch?v=gRjh4AYWZZo
-
ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models - arXiv, accessed on March 23, 2025, https://arxiv.org/html/2312.05821v1
-
Six (and a half) intuitions for SVD - GreaterWrong.com, accessed on March 23, 2025, https://www.greaterwrong.com/posts/iupCxk3ddiJBAJkts/six-and-a-half-intuitions-for-svd
-
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable - LessWrong : r/MachineLearning - Reddit, accessed on March 23, 2025, https://www.reddit.com/r/MachineLearning/comments/z7rabn/r_the_singular_value_decompositions_of/
-
The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable, accessed on March 23, 2025, https://www.lesswrong.com/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight
-
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models, accessed on March 23, 2025, https://arxiv.org/html/2406.09519v3
-
Decision Transformer Interpretability - LessWrong, accessed on March 23, 2025, https://www.lesswrong.com/posts/bBuBDJBYHt39Q5zZy/decision-transformer-interpretability
-
Transformer-Specific Interpretability - NAACL 24 Tutorial: Explanations in the Era of Large Language Models, accessed on March 23, 2025, https://explanation-llm.github.io/slides/section_4_slides.pdf
-
Interpretable Lightweight Transformer via Unrolling of Learned Graph Smoothness Priors, accessed on March 23, 2025, https://www.researchgate.net/publication/381227247_Interpretable_Lightweight_Transformer_via_Unrolling_of_Learned_Graph_Smoothness_Priors
-
A Comprehensive Mechanistic Interpretability Explainer & Glossary - Neel Nanda, accessed on March 23, 2025, https://www.neelnanda.io/mechanistic-interpretability/glossary
-
The Geometry of Categorical and Hierarchical Concepts in Large Language Models | AI Research Paper Details - AIModels.fyi, accessed on March 23, 2025, https://www.aimodels.fyi/papers/arxiv/geometry-categorical-hierarchical-concepts-large-language-models
-
The Geometry of Categorical and Hierarchical Concepts in Large Language Models, accessed on March 23, 2025, https://powerdrill.ai/discover/discover-The-Geometry-of-clx4ks37503ca019nt3owh82e
-
Mastering Layer Normalization: Enhancing Neural Networks for Optimal Performance, accessed on March 23, 2025, https://www.lunartech.ai/blog/mastering-layer-normalization-enhancing-neural-networks-for-optimal-performance
-
Layer Normalization Explained | Papers With Code, accessed on March 23, 2025, https://paperswithcode.com/method/layer-normalization
-
Normalization Methods in Deep Learning - Ahmad Badary, accessed on March 23, 2025, https://ahmedbadary.github.io/work_files/research/dl/concepts/norm_methods
-
Context Normalization Layer with Applications - arXiv, accessed on March 23, 2025, https://arxiv.org/html/2303.07651v2
-
Batch vs Layer Normalization - Zilliz Learn, accessed on March 23, 2025, https://zilliz.com/learn/layer-vs-batch-normalization-unlocking-efficiency-in-neural-networks
-
Visualizing What Batch Normalization Is and Its Advantages - Data Leads Future, accessed on March 23, 2025, https://www.dataleadsfuture.com/visualizing-what-batch-normalization-is-and-its-advantages/
-
Using Normalization Layers to Improve Deep Learning Models - MachineLearningMastery.com, accessed on March 23, 2025, https://machinelearningmastery.com/using-normalization-layers-to-improve-deep-learning-models/
-
You can remove GPT2’s LayerNorm by fine-tuning - arXiv, accessed on March 23, 2025, https://arxiv.org/pdf/2409.13710
-
Understanding and Improving Layer Normalization, accessed on March 23, 2025, http://papers.neurips.cc/paper/8689-understanding-and-improving-layer-normalization.pdf
-
Unlocking the Power of Batch Normalization in Neural Networks | by Juan C Olamendy, accessed on March 23, 2025, https://medium.com/@juanc.olamendy/unlocking-the-power-of-batch-normalization-in-neural-networks-655656d9bd04
-
Normalization Techniques in Deep Learning | Restackio, accessed on March 23, 2025, https://www.restack.io/p/explainability-in-deep-learning-answer-normalization-techniques-cat-ai
-
Visualizing What Batch Normalization Is and Its Advantages : r/datascience - Reddit, accessed on March 23, 2025, https://www.reddit.com/r/datascience/comments/1aihddg/visualizing_what_batch_normalization_is_and_its/