{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to transformers\n",
"\n",
"\n",
"\n",
"Star\n",
"Issue\n",
"Watch\n",
"Follow\n",
"\n",
"The transformer {cite}`vaswani2017attention` is a deep learning architecture which has powered many of the recent advances across a range of machine learning applications, including text modelling, image modelling {cite}`dosovitskiy2021image`, and many others.\n",
"This is an overview of the transformer architecture, including a self-contained mathematical description of the architectural details, and a concise implementation.\n",
"All of this exposition is based off an excellent introduction paper on transformers by Rich Turner {cite}`turner2023introduction`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modelling with tokens\n",
"__One architecture, many applications.__\n",
"The purpose of the transformer architecture was, originally, to model sequence data such as text.\n",
"The approach for achieving this was to first convert individual words, or characters, into one-dimensional arrays called _tokens_, and then operate on these tokens with a neural network.\n",
"This approach however extends beyond word modelling.\n",
"For example, the transformer can be applied to tasks as diverse as modelling of images and video, proteins, or weather.\n",
"In all these applications, the data are first converted into sets of tokens.\n",
"After this step, the transformer can be applied in roughly the same way, irrespective of the original representation of the data.\n",
"This versatility, together with their empirical performance, are some of the main appealing features of the transformer.\n",
"\n",
"__Inputs as tokens.__\n",
"In particular, for the moment, we will assume that the input data have already been converted into tokens and defer the details of this tokenisation for later.\n",
"More concretely, let us assume that each data example, e.g. a sentence, image, or protein, has been conerted into a set of tokens $\\{x_n\\}_{n=1}^N,$ where each $x_n$ is a $D$ dimensional array $x_n \\in \\mathbb{R}^D.$\n",
"We can collect these tokens into a single $D \\times N$ array $X^{(0)} \\in \\mathbb{R}^{D \\times N},$ forming a single data input for the transformer."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transformer block\n",
"Much like in other deep architectures, the transformer maintains a representation of the input data, and progressively refines it using a sequence of so-called _transformer blocks_.\n",
"In particular, given an initial representation $X^{(0)}$ the archtecture comprises of $M$ transformer blocks, i.e. for each $m = 1, \\dots, M,$ it computes\n",
"\n",
"$$X^{(m)} = \\texttt{TransformerBlock}(X^{(m-1)}).$$\n",
"\n",
"Each of these blocks consists of two main operations, namely a self-attention operation and a token-wise multi-layer perceptron (MLP) operation.\n",
"The self-attention operation has the role of combining the representations of different tokens in a sequence, in order to model dependencies between the tokens.\n",
"It is applied collectively to all tokens within the transformer block.\n",
"The MLP operation has the role of refining the representation of each token.\n",
"It is applied separately to each token and is shared across all tokens within a transformer block.\n",
"Let's look at these two operations in detail."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Self-attention\n",
"\n",
"__Attention.__\n",
"The role of the first operation in a transformer block is to combine the representations of different tokens in order to model dependencies between them.\n",
"Given a $D \\times N$ input array $X^{(m)} = (x_1, \\dots, x_N^{(m)})$ the output of the self-attention layer is another $D \\times N$ array $Y^{(m)} = (y_1, \\dots, y_N^{(m)}),$ where each column is simply a weighted average of the input features, that is\n",
"\n",
"$$\n",
"y^{(m)}_n = \\sum_{n' = 1}^N x^{(m - 1)}_{n'} A_{n', n}^{(m)}.\n",
"$$ (papers:eq:attention:sum)\n",
"\n",
"```{margin}\n",
"__Relation to CNNs:__\n",
"If we constrain the weights to have a single index that depends on the difference of $n$ and $n',$ i.e. to have the form $A_{n', n}^{(m)} = A_{n'-n}^{(m)},$ then the sum in equation {eq}`papers:eq:attention:sum` can be viewed as a convolution with zero padding.\n",
"One difference to standard CNNs is that their convolutions are local, meaning they have a limited kernel size and do not see the entire input, whereas here we would have a kernel that sees the entire input, i.e. $y^{(m)}_n$ would depend on all of the $x^{(m - 1)}_{n'}.$\n",
"Another important difference to standard CNNs is that the weights do not depend on the inputs $x^{(m - 1)}_n,$ whereas in a transformer they do.\n",
"This is gives significant additional flexibility to the transformer architecture.\n",
"```\n",
"The weighting array $A_{n', n}^{(m)}$ is of size $N \\times N$ and has the property that its columns normalise to one, that is $\\sum_{n'=1}^N A_{n', n}^{(m)} = 1.$\n",
"It is referred to the attention matrix because it weighs the extent to which the feature $y^{(m)}_n$ should depend on each $x^{(m)}_{n'},$ i.e. it determines the extent to which each $y^{(m)}_n$ should attend to each $x^{(m)}_{n'}.$\n",
"For compactness, we can collect these equations to a single linear operation, that is\n",
"\n",
"$$Y^{(m)} = X^{(m - 1)} A^{(m)}.$$\n",
"\n",
"But what about the attention weights themselves?\n",
"We have not specified how these are computed and, their precise definition is going to be one important factor that differentiates transformers from other architectures.\n",
"In fact, many other operations forming the core of other archictectures, such as convolution layers in convolutional neural networks (CNNs), can be written as similar weighted sums.\n",
"Let's next look at the specifics of the transformer attention weights.\n",
"\n",
"__Self-attention.__\n",
"One of the innovations within the transformer architecture is that the attention weights are adaptive, meaning that they are computed based on the input itself.\n",
"This is in contrast with other deep learning architectures such CNNs, where weighted sums are also used, but these weights are fixed and shared across all inputs.\n",
"One straightforward way to compute attention weights would be to compare them by a simple similarity metric, such as an inner product.\n",
"For example, given two tokens $x_i$ and $x_j,$ we can compute a dot-product between them, which acts as a similarity metric, exponetiate the result to make it positive and then normalise the result to ensure that each column sums to one, that is\n",
"\n",
"$$A^{(m)}_{n, n'} = \\frac{\\exp(x_n^\\top x_{n'})}{\\sum_{n'' = 1}^N \\exp(x_{n''}^\\top x_{n'})}.$$\n",
"\n",
"An alternative, slightly more flexible approach is to transform each token in the sequence by a linear map, say by applying a matrix $U \\in \\mathbb{R}^{K \\times D}$ to each token first, that is\n",
"\n",
"$$A^{(m)}_{n, n'} = \\frac{\\exp(x_n^\\top U^\\top U x_{n'})}{\\sum_{n'' = 1}^N \\exp(x_{n''}^\\top U^\\top U x_{n'})}.$$\n",
"\n",
"This allows the tokens to be compared in a different space.\n",
"For example, if $K < D$ this approach automatically projects out some of the components of the tokens, comparing them in a lower-dimensional space.\n",
"However, this approach still has an important limitation, namely symmetry.\n",
"Specifically, the attention matrix above would be symmetric, which means that any two tokens would attend to each other with equal strengths.\n",
"This might be undesirable because, for example, we could imagine that one token might be important for informing the representation of another token, but not the other way around.\n",
"To address this, we can apply different linear operations, say $U_k$ and $U_q$ to each of the tokens being compared, and instead compute\n",
"\n",
"$$A^{(m)}_{n, n'} = \\frac{\\exp(x_n^\\top U_k^\\top U_q x_{n'})}{\\sum_{n'' = 1}^N \\exp(x_{n''}^\\top U_k^\\top U_q x_{n'})}.$$\n",
"\n",
"In this way, the resulting attention matrix that is not necessarily symmetric and an overall more expressive architecture.\n",
"Tokens no longer have to attend to each other with the same strength.\n",
"This weighting is known as self-attention, since each token in the sequence attends to every other token of the same sequence.\n",
"It is also possible to generalise this to attention between different sequences, which might be useful for some applications such as, for example joint modelling of text and images.\n",
"This generalisation is called cross-attention, and we defer its discussion for later.\n",
"\n",
"__Multi-head self-attention.__\n",
"In order to increase the capacity of the self-attention layer, the transformer block includes $H$ separate self-attention operations with different parameters, in parallel.\n",
"The results of these operations are then projected down to a single $D \\times N$ array again, which is required for further processing.\n",
"In particular, we have\n",
"\n",
"```{margin}\n",
"As a recap to the notation in these equations: the $m$ superscript runs from $1$ to $M$ and is the index of the transformer block, the $n, n'$ and $n''$ superscripts run from $1$ to $N$ and index the tokens in the sequence within the current block, the $h$ subscript runs from $1$ to $H$ and denotes a particular self-attention head in the block.\n",
"Finally the $k$ and $q$ subscripts are not indices, but symbols distinguishing the two different kinds of matrices $U_k$ and $U_q.$\n",
"```\n",
"\n",
"$$\\begin{align}\n",
"Y^{(m)} = \\texttt{MHSA}(X^{(m - 1)}) &= \\sum^H_{h = 1} V_h^{(m)} X^{(m - 1)} A_h^{(m)}, \\text{ where } \\\\\n",
"\\left[A^{(m)}_h\\right]_{n, n'} &= \\frac{\\exp\\left(k_{h, n}^{(m)\\top} q_{h, n'}^{(m)}\\right)}{\\sum_{n'' = 1}^N \\exp\\left(k_{h, n''}^{(m)\\top} q_{h, n'}^{(m)}\\right)} \n",
"\\end{align}$$\n",
"\n",
"\n",
"```{margin}\n",
"Note that the matrices $U_{h, k},$ $U_{h, k}$ and $V_h$ correspond to the key, query and value matrices in the standard exposition of transformers.\n",
"Here we have taken a bottom-down approach without introducing these terms, but use this notation to make the relationship to the standard exposition clear.\n",
"```\n",
"where $q_{h, n}^{(m)} = U^{(m)}_{q, h} x_n^{(m-1)}$ and $k_{h, n}^{(m)} = U^{(m)}_{k, h} x_n^{(m-1)}.$\n",
"At this point we should note that, due to the nonlinearity of $A^{(m)},$ together with the multiplication by $V^{(m)}_h$ and summation across $h,$ multi-head cross attention performs not just inter-feature but also intra-feature processing, i.e. each token interacts with and changes its own representation.\n",
"However, the capacity of this intra-feature processing is limited, and it is the job of the second stage, the MLP, to address this.\n",
"Let's next look at the MLP stage."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multi-layer perceptron\n",
"The self-attention layer has the role of aggregating information across tokens in a sequence to model joint dependencies.\n",
"In order to refine the representations themselves, a simple MLP is applied to each token in isolation, in a relatively simple step\n",
"\n",
"$$x^{(m)}_n = \\texttt{MLP}(y^{(m)}_n).$$\n",
"\n",
"Note that this MLP is shared across all input locations, i.e. all tokens, within a given layer.\n",
"\n",
"### Residuals and normalisation\n",
"Before putting together the $\\texttt{MHSA}$ and $\\texttt{MLP}$ operations, we will add two ubiquitous deep learning operations to improve the stability and ease of training of the model, namely residual connections and normalisation.\n",
"\n",
"__Residual connections.__\n",
"Residual connections {cite}`he2015deep` are widely used across deep learning architectures, because they simplify model initialisation, stabilise learning and provide a useful inductive bias toward simpler functions.{cite}`szegedy2017inception`\n",
"Instead of specifying a mapping of the form $x^{(m)} = f(x^{(m)}),$ a residual connection amounts to specifying a function involving an identity function plus a residual term\n",
"\n",
"$$x^{(m)} = x^{(m-1)} + g(x^{(m)}).$$\n",
"\n",
"This can be equivalently thought of as learning to model differences between the representations at different blocks, that is $x^{(m)} - x^{(m-1)} = g(x^{(m)}).$\n",
"If we do not use residual connections and compose multiple blocks together, the activations in each can become more extreme as we go deeper in the network, resulting in either zero or extremely large gradients, which can be problematic during training.\n",
"One motivation for using residual connections is that, if we initialise the parameters of $g$ such that its outputs are close to zero, then $x^{(m)}$ will be approximately constant across $m = 1, \\dots, M.$\n",
"This can improve training ease and stability because all blocks in the network, even the deeper ones, receive an input close to $x^{(0)},$ and the gradients will tend to receive less extreme gradients.\n",
"Residual connections are used both in the $\\texttt{MHSA}$ and $\\texttt{MLP}$ layers of the transformer.\n",
"\n",
"__Token normalisation.__\n",
"Another ubiquitous and extremely useful operation in deep learning is normalisation.\n",
"There are various different kinds of normalisation, including LayerNorm {cite}`ba2016layer`, BatchNorm {cite}`ioffe2015batch`, GroupNorm {cite}`wu2018group` and InstanceNorm {cite}`ulyanov2016instance`.\n",
"Normalisation has been widely found to improve learning stability and overall model performance.\n",
"One reason for this is that normalisation typically prevents the inputs to a layer from becoming extremely large, which can result into extreme or staturated outputs, which in turn mean that the gradients with respect to the network parameters can be close to zero or extremely large.\n",
"The transformer architecture uses LayerNorm which, when applied to the tokens, amounts to per-token normalisation.\n",
"Specifically, when applied to an array $X$ of input tokens, LayerNorm amounts to\n",
"\n",
"$$\\bar{x}_{d, n} = \\texttt{LayerNorm}(X)_{d, n} = \\gamma_d \\frac{x_{d, n} - \\mu(x_n)}{\\sigma(x_n)} + \\beta_d,$$\n",
"\n",
"where $\\mu$ and $\\sigma$ denote operations that compute the mean and the standard deviation respectively, and $\\gamma_d$ and $\\beta_d$ are a learnt scale and a learnt shift.\n",
"In other words, within a transformer, LayerNorm separately normalises each token within each sequence within each batch.\n",
"\n",
"\n",
"### Putting it together\n",
"\n",
"```{margin}\n",
"__Relationship to GNNs:__\n",
"As highlighted by these equations, transformers are very similar in spirit to graph neural networks (GNNs).\n",
"Both architectures consist of interleaving local and aggregation operations.\n",
"In a GNN there are local transformations in the form of MLPs applied to the features corresponding to nodes of a graph, and there are also aggregation operations in the form of pooling, which aggregate information from all neighbours and edges of each node in a graph.\n",
"One key difference between transformers and GNNs is that the aggregation operation in a GNN typically only incorporates information across the neighbours of each node, whereas self-attention in transformers incorporates information across all tokens in a single forward pass.\n",
"```\n",
"\n",
"In summary, we can collect these operations into the following equations\n",
"\n",
"$$\\begin{align}\n",
"\\bar{X}^{(m-1)} &= \\texttt{LayerNorm}\\left(X^{(m-1)}\\right) \\\\\n",
"Y^{(m)} &= \\bar{X}^{(m-1)} + \\texttt{MHSA}\\left(\\bar{X}^{(m-1)}\\right) \\\\\n",
"\\bar{Y}^{(m)} &= \\texttt{LayerNorm}\\left(Y^{(m)}\\right) \\\\\n",
"X^{(m)} &= Y^{(m)} + \\texttt{MLP}(\\bar{Y}^{(m)})\n",
"\\end{align}$$\n",
"\n",
"These make up the entirety of the transformer block, which is repeated $M$ times to compute the output of the transformer.\n",
"An important detail we have not discussed thus far is how to build the tokens themselves.\n",
"\n",
"\n",
"### Tokens and embeddings\n",
"\n",
"__Tokenisation.__\n",
"Tokenisation is an application-specific detail but, generally, there are two main approaches, depending on whether the inputs are continuous or discrete.\n",
"As a reminder, in both cases, we want convert each input element in our sequence, say $s_n,$ to a $D$-dimensional array $x^{(0)}_n.$\n",
"We will specify a map $\\texttt{tokenise}$ that performs the operation $s_n = \\texttt{tokenise}(x^{(0)}_n)$ separately for the case where the inputs $s_n$ are discrete or continuous.\n",
"\n",
"__Discrete or continuous inputs.__\n",
"In text modelling the raw inputs are integers representing unique words or characters.\n",
"In such applications, i.e. whenever we have discrete inputs, we can use a look-up table containing learnable vectors.\n",
"That is, if $s_n \\in \\{1, \\dots, K\\},$ we can define $K$ arrays, each of length $D$, say $z_0, \\dots, z_K \\in \\mathbb{R}^D,$ and let\n",
"\n",
"$$x^{(0)}_n = \\texttt{tokenise-discrete}(s_n) = z_{s_n}.$$\n",
"\n",
"This allows us to map each word into a continuous space and operate on the resulting arrays with the transformer architecture.\n",
"In other applications, such as vision, the inputs are typically treated as continuous, that is $s_n \\in \\mathbb{R}^{D_s}.$\n",
"In such cases, we can simply apply a simple operation such as a linear transformation, to map each $s_n$ into a $D$-dimensional array.\n",
"For example, letting $W \\in \\mathbb{R}^{D\\times D_s},$ we can define\n",
"\n",
"$$x^{(0)}_n = \\texttt{tokenise-continuous}(s_n) = W s_n,$$\n",
"\n",
"giving a $D$-dimensional token which is ready for use in the transformer.\n",
"We have now covered almost all parts of the transformer, except one final, but very important point concerning the embeddings.\n",
"Thus far, we have glossed over the fact that the transformer block has no notion of position, which is a very important issue that we look into next.\n",
"\n",
"__Position embeddings.__\n",
"Specifically, the $\\texttt{MHSA}$ operation, the token-wise $\\texttt{MLP}$ operation, as well as $\\texttt{LayerNorm}$ and residual additions are all examples of permutation equivariant: permuting the tokens and applying any one of these operations gives the same result as first applying the operation and then permuting the resulting tokens.\n",
"Composing these operations retains permutation equivariance, meaning that permuting the elements of the original sequence and applying the transformer will yield exactly the same result as first applying the transformer and then permuting the resulting features.\n",
"This is undesirable because, for example in text modelling, the phrases \"Arsenal bets Chelsea\" and \"Chelsea beats Arsenal\" are composed of identical words but have opposite meanings, and we would like the resulting features produced by the transformer to reflect this.\n",
"One way to get around this issue is augmenting the tokens with information about the position of an input feature within the sequence.\n",
"For example, we could set up an additional embedding which directly maps each position to a learnable array and concatentate the result with the tokenised feature, that is\n",
"\n",
"$$x^{(0)}_n = \\texttt{tokenise}(s_n) \\odot \\texttt{tokenise-discrete}(n),$$\n",
"\n",
"where $\\odot$ denotes concatenation, and we have used different tokenisation functions for the sequence elements and their positions.\n",
"A variant of this approach is, instead of concatenating, to add the position embedding to the input tokens, that is\n",
"\n",
"$$x^{(0)}_n = \\texttt{tokenise}(s_n) + \\texttt{tokenise-discrete}(n).$$\n",
"\n",
"This approach can be viewed as a special case of the concatenation approach, followed by a fixed linear projection.\n",
"It is the standard approach used, for example, in vision transformers.\n",
"Another approach is applying, for example, sinusoidal functions with different frequencies on the input, for example\n",
"\n",
"$$\\texttt{tokenise-discrete}(n) = [\\sin(\\omega_1 n), \\dots, \\sin(\\omega_D n)],$$\n",
"\n",
"which are then concatentated to the tokenised features as described above.\n",
"Other approaches bake in positional information directly into the $\\texttt{MHSA}$ layer, for example by making the attention weights depend on the position difference of pairs of tokens.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementation\n",
"\n",
"Now that we've covered all the details, let's implement a small transformer!\n",
"\n",
"### (Multi head) self-attention\n",
"First, we turn to the $\\texttt{MHSA}$ layer, which consists of self attention layers, one for each attention head.\n",
"Let's first define the self attention layers."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2024-05-22 09:55:15.630210: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2\n",
"To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n"
]
}
],
"source": [
"from typing import List, Optional\n",
"\n",
"import tensorflow as tf\n",
"import tensorflow_datasets as tfds\n",
"import tensorflow_probability as tfp\n",
"tfk = tf.keras\n",
"\n",
"# Type for random seed\n",
"Seed = [tf.Tensor, tf.Tensor]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{margin}\n",
"Note that, the softmax operation used in the `self_attention_weights` method exponentiates the entries of an array and divides them by the sum of the resulting entries, in one step.\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"class SelfAttention(tfk.Model):\n",
"\n",
" def __init__(\n",
" self,\n",
" seed: Seed,\n",
" projection_dim: int,\n",
" name: str = \"self_attention\",\n",
" **kwargs,\n",
" ):\n",
"\n",
" super().__init__(name=name, **kwargs)\n",
"\n",
" # Split the seed and set up the dense layers\n",
" seed1, seed2 = tfp.random.split_seed(seed, 2)\n",
"\n",
" self.Uk = tfk.layers.Dense(\n",
" projection_dim,\n",
" activation=\"gelu\",\n",
" use_bias=False,\n",
" kernel_initializer=tf.initializers.GlorotNormal(seed=int(seed1[0])),\n",
" )\n",
"\n",
" self.Uq = tfk.layers.Dense(\n",
" projection_dim,\n",
" activation=\"gelu\",\n",
" use_bias=False,\n",
" kernel_initializer=tf.initializers.GlorotNormal(seed=int(seed2[0])),\n",
" )\n",
"\n",
"\n",
" def self_attention_weights(self, x: tf.Tensor) -> tf.Tensor:\n",
" \"\"\"\n",
" Compute self-attention weights for tokens in a sequence\n",
"\n",
" Args:\n",
" x: input sequence of tokens, shape (B, N, D)\n",
" \n",
" Returns:\n",
" attention weights, shape (B, N, N)\n",
" \"\"\"\n",
" k = self.Uk(x)\n",
" q = self.Uq(x)\n",
"\n",
" dot_product = tf.matmul(k, q, transpose_b=True)\n",
" dot_product /= tf.math.sqrt(tf.cast(tf.shape(k)[-1], tf.float32))\n",
"\n",
" return tf.nn.softmax(dot_product, axis=1)\n",
" \n",
" def call(self, x: tf.Tensor) -> tf.Tensor:\n",
" \"\"\"\n",
" Apply self-attention to a sequence of tokens\n",
"\n",
" Args:\n",
" x: input sequence of tokens, shape (B, N, D)\n",
"\n",
" Returns:\n",
" output sequence of tokens, shape (B, N, D)\n",
" \"\"\"\n",
" return tf.matmul(self.self_attention_weights(x), x, transpose_a=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Self-attention is remarkably simple!\n",
"Now, we can define multi-head self attention.\n",
"This is a simple extension of our existing self-attention module."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"class MultiHeadSelfAttention(tfk.Model):\n",
"\n",
" def __init__(\n",
" self,\n",
" seed: Seed,\n",
" token_dim: int,\n",
" projection_dim: int,\n",
" num_heads: int,\n",
" name: str = \"multi_head_self_attention\",\n",
" **kwargs,\n",
" ):\n",
" super().__init__(name=name, **kwargs)\n",
"\n",
" keys = tfp.random.split_seed(seed, 2*num_heads)\n",
" self.self_attention = [\n",
" SelfAttention(\n",
" seed=key,\n",
" projection_dim=projection_dim,\n",
" ) for key in keys[::2]\n",
" ]\n",
"\n",
" self.linear = [\n",
" tfk.layers.Dense(\n",
" token_dim,\n",
" use_bias=False,\n",
" activation=None,\n",
" kernel_initializer=tf.initializers.GlorotNormal(seed=int(key[0])),\n",
" ) for key in keys[1::2]\n",
" ]\n",
"\n",
" def call(self, x: tf.Tensor) -> tf.Tensor:\n",
" \"\"\"\n",
" Apply multi-head self-attention to a sequence of tokens\n",
"\n",
" Args:\n",
" x: input sequence of tokens, shape (B, N, D)\n",
"\n",
" Returns:\n",
" output sequence of tokens, shape (B, N, D)\n",
" \"\"\"\n",
" \n",
" # Compute tokens for each head and apply linear \n",
" heads = [\n",
" linear(sa(x))\n",
" for sa, linear in zip(self.self_attention, self.linear)\n",
" ]\n",
"\n",
" # Stack and sum across heads\n",
" return tf.reduce_mean(tf.stack(heads, axis=2), axis=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multi-layer perceptron\n",
"Now we turn to the MLP.\n",
"This is also a very simple implementation."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"class MLP(tfk.Model):\n",
"\n",
" def __init__(\n",
" self,\n",
" seed: Seed,\n",
" num_hidden: int,\n",
" num_layers: int,\n",
" num_output: Optional[int] = None,\n",
" name: str = \"mlp\",\n",
" **kwargs,\n",
" ):\n",
"\n",
" super().__init__(name=name, **kwargs)\n",
"\n",
" # Set up output dimensions of linear layers\n",
" out_feats = [num_hidden] * num_layers + [num_output]\n",
"\n",
" # Split the random key into sub-keys for each layer\n",
" seeds = tfp.random.split_seed(seed, num_layers+1)\n",
"\n",
" self.linear = [\n",
" tfk.layers.Dense(\n",
" out_feat,\n",
" activation=None,\n",
" kernel_initializer=tf.initializers.GlorotNormal(seed=int(seed[0])),\n",
" )\n",
" for seed, out_feat in zip(seeds, out_feats)\n",
" ]\n",
"\n",
"\n",
" def call(self, x: tf.Tensor) -> tf.Tensor:\n",
" \"\"\"\n",
" Compute forward pass through the MLP.\n",
"\n",
" Args:\n",
" x: input tensor of shape (..., feature_dim,)\n",
" \n",
" Returns:\n",
" output tensor of shape (..., feature_dim,)\n",
" \"\"\"\n",
" for layer in self.linear[:-1]:\n",
" x = layer(x)\n",
" x = tf.nn.gelu(x)\n",
" return self.linear[-1](x)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transformer block\n",
"\n",
"Now we're ready to define the transformer block, which consists of the multi-head self attention and mlp operations, as well as two normalisation layers, connected with residual connections."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"class TransformerBlock(tfk.Model):\n",
"\n",
" def __init__(\n",
" self,\n",
" seed: Seed,\n",
" token_dimension: int,\n",
" mlp_num_hidden: int,\n",
" mlp_num_layers: int,\n",
" num_heads: int,\n",
" name: str = \"transformer_block\",\n",
" **kwargs,\n",
" ):\n",
" super().__init__(name=name, **kwargs)\n",
"\n",
" key1, key2 = tfp.random.split_seed(seed, 2)\n",
" self.mhsa = MultiHeadSelfAttention(\n",
" seed=key1,\n",
" token_dim=token_dimension,\n",
" projection_dim=token_dimension,\n",
" num_heads=num_heads,\n",
" )\n",
"\n",
" self.mlp = MLP(\n",
" seed=key2,\n",
" num_hidden=mlp_num_hidden,\n",
" num_layers=mlp_num_layers,\n",
" num_output=token_dimension,\n",
" )\n",
"\n",
" self.ln1 = tfk.layers.LayerNormalization(axis=2)\n",
" self.ln2 = tfk.layers.LayerNormalization(axis=2)\n",
"\n",
" \n",
" def call(self, x: tf.Tensor) -> tf.Tensor:\n",
" \"\"\"\n",
" Apply the transformer block to input tokens `x`.\n",
"\n",
" Arguments:\n",
" x: input tensor of shape (B, N, D)\n",
"\n",
" Returns:\n",
" output tensor of shape (B, N, D)\n",
" \"\"\"\n",
" x = x + self.mhsa(self.ln1(x))\n",
" x = x + self.mlp(self.ln2(x))\n",
"\n",
" return x"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tokens and embeddings\n",
"\n",
"Next up, we'll need to also define how to tokenise a sequence and and generate positional embeddings.\n",
"First, let us consider an image classification task for the moment, and use the vision transformer (ViT) {cite}`dosovitskiy2021image` embedding style.\n",
"In ViT, an image is split into smaller sub-images called patches, each of which is linearly projected to form an embedding.\n",
"Convolutions are very handy here:\n",
"We can split an image into patches and project these, in one go, using convolutions with a stride equal to the kernel size.\n",
"\n",
"```{margin}\n",
"The `Conv2D` layer here splits the image into patches and linearly embeds each one, doing both steps in one go.\n",
"Specifically, convolution is a linear operation on a patch of size `(k, k)` where `k` is the kernel size.\n",
"By using striding (with a stride equal to the kernel size), we ensure each patch is processed separately, resulting in an image of size `(k/p, k/p)` where `p` is the patch size, which is then reshaped into one long sequence of tokens.\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"class ImageTokeniser(tfk.Model):\n",
"\n",
" def __init__(\n",
" self,\n",
" seed: Seed,\n",
" token_dimension: int,\n",
" patch_size: int,\n",
" name: str = \"image_tokeniser\",\n",
" **kwargs,\n",
" ):\n",
" super().__init__(name=name, **kwargs)\n",
"\n",
" assert patch_size % 2 == 0, \"Patch size must be even\"\n",
"\n",
" self.conv = tfk.layers.Conv2D(\n",
" filters=token_dimension,\n",
" kernel_size=(patch_size, patch_size),\n",
" strides=(patch_size, patch_size),\n",
" padding=\"VALID\",\n",
" activation=None,\n",
" data_format=\"channels_last\",\n",
" kernel_initializer=tf.initializers.GlorotNormal(seed=int(seed[0])),\n",
" )\n",
"\n",
" def call(self, x: tf.Tensor) -> tf.Tensor:\n",
" \"\"\"\n",
" Tokenise the image `x`, applying a strided convolution.\n",
" This is equivalent to splitting the image into patches,\n",
" and then linearly projecting each one of these using a\n",
" shared linear projection.\n",
"\n",
" Arguments:\n",
" x: image input tensor of shape (B, W, H, C)\n",
"\n",
" Returns:\n",
" output tensor of shape (B, N, D)\n",
" \"\"\"\n",
"\n",
" assert (\n",
" x.shape[1] % self.conv.kernel_size[0] == 0\n",
" and x.shape[2] % self.conv.kernel_size[1] == 0\n",
" ), (\n",
" f\"Input dimensions must be divisible by patch size, \"\n",
" f\"found {x.shape=} and {self.conv.kernel_size}.\"\n",
" )\n",
"\n",
" x = self.conv(x)\n",
" return tf.reshape(x, [tf.shape(x)[0], -1, tf.shape(x)[-1]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we turn to position embeddings.\n",
"For simplicity, let us assume that all sequences have a fixed length, i.e. all images we will process will process images of a fixed height and width.\n",
"We'll adopt a fairly general approach, by letting each embedding be a learnable array, and using a different such array for each position in the sequence."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"class PositionEmbedding(tfk.Model):\n",
"\n",
" def __init__(\n",
" self,\n",
" seed: Seed,\n",
" token_dimension: int,\n",
" sequence_length: int,\n",
" name: str = \"position_embedding\",\n",
" **kwargs,\n",
" ):\n",
" super().__init__(name=name, **kwargs)\n",
" \n",
" self.embeddings = tf.Variable(\n",
" tf.random.normal(\n",
" (sequence_length, token_dimension),\n",
" seed=int(seed[0]),\n",
" )\n",
" )\n",
"\n",
" def call(self, x: tf.Tensor) -> tf.Tensor:\n",
" \"\"\"\n",
" Add position embeddings to input tensor.\n",
"\n",
" Arguments:\n",
" x: input tensor of shape (B, N, D)\n",
"\n",
" Returns:\n",
" output tensor of shape (B, N, D)\n",
" \"\"\"\n",
" return x + self.embeddings[None, :, :]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Putting it together\n",
"\n",
"Finally, we're ready to put together the full transformer architecture.\n",
"We'll build a little ViT using our transformer blocks, image tokeniser and position embeddings.\n",
"Note that in order to obtain class logits from a ViT, we typically use an additional token $x_0^{(m)}$ called the class token.\n",
"This token initially does not depend on the input, i.e. $x_0^{(0)}$ is the same for every image (in the implementation below, it is a learnable parameter).\n",
"By attending to the other tokens, however, $x_0^{(m)}$ incorporates information about the input.\n",
"Once the transformer blocks have been applied, we take the resulting class token $x_0^{(M)}$ and pass it through a final MLP to obtain class logits."
]
},
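{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of this readout in symbols, assuming $K$ classes and writing $\\ell$ for the logits (both symbols are introduced here for illustration):\n",
"\n",
"$$\n",
"\\ell = \\text{MLP}\\big(x_0^{(M)}\\big) \\in \\mathbb{R}^K, \\qquad \\log p(y = k \\mid \\text{image}) = \\ell_k - \\log \\sum_{j=1}^K \\exp(\\ell_j).\n",
"$$\n",
"\n",
"The second expression is a log-softmax, which turns the logits into normalised log-probabilities.\n",
"The implementation in this section applies it to the output of the final MLP."
]
},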
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"class TinyVisionTransformer(tfk.Model):\n",
"\n",
"    def __init__(\n",
"        self,\n",
"        seed: Seed,\n",
"        tokeniser: ImageTokeniser,\n",
"        embedding: PositionEmbedding,\n",
"        token_dimension: int,\n",
"        mlp_num_hidden: int,\n",
"        mlp_num_layers: int,\n",
"        num_heads: int,\n",
"        num_blocks: int,\n",
"        num_classes: int,\n",
"        name: str = \"tiny_vision_transformer\",\n",
"        **kwargs,\n",
"    ):\n",
"        super().__init__(name=name, **kwargs)\n",
"\n",
"        # One seed per transformer block, plus one for the final MLP\n",
"        seeds = tfp.random.split_seed(seed, num_blocks+1)\n",
"        self.blocks = [\n",
"            TransformerBlock(\n",
"                seed=seeds[i],\n",
"                token_dimension=token_dimension,\n",
"                mlp_num_hidden=mlp_num_hidden,\n",
"                mlp_num_layers=mlp_num_layers,\n",
"                num_heads=num_heads,\n",
"            )\n",
"            for i in range(num_blocks)\n",
"        ]\n",
"\n",
"        self.final_mlp = MLP(\n",
"            seed=seeds[-1],\n",
"            num_hidden=mlp_num_hidden,\n",
"            num_layers=mlp_num_layers,\n",
"            num_output=num_classes,\n",
"        )\n",
"\n",
"        self.tokeniser = tokeniser\n",
"        self.embedding = embedding\n",
"\n",
"        # Learnable class token, shared across all inputs\n",
"        self.class_token = tf.Variable(\n",
"            tf.zeros(\n",
"                (1, 1, token_dimension),\n",
"                dtype=tf.float32,\n",
"            )\n",
"        )\n",
"\n",
"    def call(self, x: tf.Tensor) -> tf.Tensor:\n",
"        \"\"\"\n",
"        Apply vision transformer to batch of images.\n",
"\n",
"        Arguments:\n",
"            x: input image tensor of shape (B, H, W, C)\n",
"\n",
"        Returns:\n",
"            output logits tensor of shape (B, num_classes)\n",
"        \"\"\"\n",
"\n",
"        # Repeat the class token once per batch element\n",
"        class_token = tf.tile(\n",
"            self.class_token,\n",
"            [tf.shape(x)[0], 1, 1],\n",
"        )\n",
"\n",
"        # Tokenise, add position embeddings, then prepend the class token\n",
"        x = self.tokeniser(x)\n",
"        x = self.embedding(x)\n",
"        x = tf.concat([class_token, x], axis=1)\n",
"\n",
"        for block in self.blocks:\n",
"            x = block(x)\n",
"\n",
"        # Read out the class token and normalise with a log-softmax\n",
"        x = self.final_mlp(x[:, 0, :])\n",
"        return x - tf.math.reduce_logsumexp(x, axis=1, keepdims=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dataset\n",
"Because this is meant to be a demo that should run on a laptop, we'll use the MNIST dataset.\n",
"We'll use [tensorflow datasets](https://www.tensorflow.org/datasets/api_docs/python/tfds) to load the data and preprocess it."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"tags": [
"hide-cell"
]
},
"outputs": [],
"source": [
"import tensorflow as tf\n",
"import tensorflow_datasets as tfds\n",
"\n",
"def preprocess_image(image, label):\n",
"    # Rescale pixel values from [0, 255] to [-1, 1]\n",
"    image = 2. * (tf.cast(image, tf.float32) / 255.) - 1.\n",
"    return image, label\n",
"\n",
"def get_batches(batch_size: int, split: str, data_dir: str = \"/tmp/tfds\"):\n",
"\n",
"    # Conversion from labels to one-hot\n",
"    def one_hot(image, label):\n",
"        return image, tf.one_hot(label, 10)\n",
"\n",
"    assert split in [\"train\", \"test\"], \"Split must be 'train' or 'test'\"\n",
"    ds = tfds.load(\n",
"        name=\"mnist\",\n",
"        split=split,\n",
"        as_supervised=True,\n",
"        data_dir=data_dir,\n",
"        shuffle_files=False,\n",
"    )\n",
"    ds = ds.map(preprocess_image)\n",
"    ds = ds.map(one_hot)\n",
"    ds = ds.batch(batch_size)\n",
"    # Prefetch last, so that the whole input pipeline overlaps with training\n",
"    ds = ds.prefetch(tf.data.AUTOTUNE)\n",
"\n",
"    return ds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training\n",
"Now let's train the network.\n",
"In general, when training a ViT, a few tricks are typically used, including, for example, learning rate scheduling and data augmentation.\n",
"Dropout is also sometimes used in the architecture itself.\n",
"To keep things simple, we won't use any of these techniques here."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"tags": [
"hide-input",
"hide-output"
]
},
"outputs": [],
"source": [
"@tf.function\n",
"def train_step(\n",
"    model: tfk.Model,\n",
"    images: tf.Tensor,\n",
"    labels: tf.Tensor,\n",
"    loss_fn: tf.losses.Loss,\n",
"    optimizer: tf.optimizers.Optimizer,\n",
") -> tuple[tf.Tensor, tf.Tensor]:\n",
"\n",
"    # Forward pass, recording operations for differentiation\n",
"    with tf.GradientTape() as tape:\n",
"        logits = model(images)\n",
"        loss = loss_fn(labels, logits)\n",
"\n",
"    # Backward pass and parameter update\n",
"    gradients = tape.gradient(loss, model.trainable_variables)\n",
"    optimizer.apply_gradients(zip(gradients, model.trainable_variables))\n",
"\n",
"    return loss, logits\n",
"\n",
"# Model parameters\n",
"token_dimension = 128\n",
"patch_size = 4\n",
"num_mlp_hidden = 128\n",
"num_mlp_layers = 1\n",
"num_heads = 8\n",
"num_blocks = 8\n",
"num_classes = 10\n",
"\n",
"# Training parameters\n",
"batch_size = 16\n",
"num_epochs = 10\n",
"learning_rate = 1e-3\n",
"weight_decay = 1e-4\n",
"\n",
"# Create the tokeniser and embeddings\n",
"seeds = tfp.random.split_seed([0, 0], 3)\n",
"\n",
"tokeniser = ImageTokeniser(\n",
" seeds[0],\n",
" token_dimension=token_dimension,\n",
" patch_size=patch_size,\n",
")\n",
"\n",
"embedding = PositionEmbedding(\n",
" seeds[1],\n",
" token_dimension=token_dimension,\n",
" sequence_length=(28 // patch_size)**2,\n",
")\n",
"\n",
"# Create a transformer\n",
"transformer = TinyVisionTransformer(\n",
" seeds[2],\n",
" tokeniser=tokeniser,\n",
" embedding=embedding,\n",
" token_dimension=token_dimension,\n",
" mlp_num_hidden=num_mlp_hidden,\n",
" mlp_num_layers=num_mlp_layers,\n",
" num_heads=num_heads,\n",
" num_blocks=num_blocks,\n",
" num_classes=num_classes,\n",
")\n",
"\n",
"# Create optimizer\n",
"optimizer = tf.optimizers.Adam(\n",
" learning_rate=learning_rate,\n",
" weight_decay=weight_decay,\n",
")\n",
"\n",
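"# Create loss function and accuracy helper\n",
"loss_fn = tf.losses.CategoricalCrossentropy(from_logits=True)\n",
"\n",
"def accuracy(labels: tf.Tensor, logits: tf.Tensor) -> tf.Tensor:\n",
"    # Mean accuracy of the current batch only. A stateful\n",
"    # tf.metrics.CategoricalAccuracy object would instead return a running\n",
"    # average over every batch seen so far, mixing train and test batches.\n",
"    correct = tf.equal(tf.argmax(labels, axis=-1), tf.argmax(logits, axis=-1))\n",
"    return tf.reduce_mean(tf.cast(correct, tf.float32))\n"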
]
},
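{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of the sequence length implied by these settings: MNIST images are $28 \\times 28$ pixels with a single channel, so with patch size $P = 4$ (the symbol $P$ is introduced here for illustration) the tokeniser produces\n",
"\n",
"$$\n",
"N = \\left(\\frac{28}{P}\\right)^2 = \\left(\\frac{28}{4}\\right)^2 = 49\n",
"$$\n",
"\n",
"tokens per image, which is the `sequence_length` passed to the position embedding.\n",
"Each patch contains $4 \\cdot 4 \\cdot 1 = 16$ pixel values, which are projected to a token of dimension $D = 128.$\n",
"Once the class token is prepended, the transformer blocks operate on $N + 1 = 50$ tokens."
]
},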
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"tags": [
"remove-input"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 10: loss 0.090 (train 0.082), acc. 0.957 (train 0.956)\n"
]
}
],
"source": [
"from tqdm.notebook import tqdm\n",
"\n",
"# Keep track of losses and accuracies for plotting\n",
"all_train_losses = []\n",
"all_train_accuracies = []\n",
"all_test_losses = []\n",
"all_test_accuracies = []\n",
"num_steps = 0\n",
"\n",
"for epoch in range(num_epochs):\n",
"\n",
"    pbar = tqdm(get_batches(batch_size, \"train\"))\n",
"\n",
"    # Training pass over the full training set\n",
"    epoch_losses = []\n",
"    epoch_accuracies = []\n",
"    for images, labels in pbar:\n",
"        loss, logits = train_step(\n",
"            transformer,\n",
"            images,\n",
"            labels,\n",
"            loss_fn,\n",
"            optimizer,\n",
"        )\n",
"        acc = accuracy(labels, logits)\n",
"\n",
"        epoch_losses.append(loss)\n",
"        epoch_accuracies.append(acc)\n",
"        all_train_losses.append((num_steps, loss))\n",
"        all_train_accuracies.append((num_steps, acc))\n",
"\n",
"        num_steps += 1\n",
"\n",
"        pbar.set_description(\n",
"            f\"Epoch ({epoch+1:03d}) \"\n",
"            f\"mean loss: {tf.reduce_mean(epoch_losses):.3f}, \"\n",
"            f\"mean accuracy: {tf.reduce_mean(epoch_accuracies):.3f}\"\n",
"        )\n",
"\n",
"    # Evaluation pass over the test set at the end of each epoch\n",
"    test_losses = []\n",
"    test_accuracies = []\n",
"    for images, labels in get_batches(batch_size, \"test\"):\n",
"\n",
"        logits = transformer(images)\n",
"        test_losses.append(loss_fn(labels, logits))\n",
"        test_accuracies.append(accuracy(labels, logits))\n",
"\n",
"    mean_loss = tf.reduce_mean(test_losses)\n",
"    mean_acc = tf.reduce_mean(test_accuracies)\n",
"    all_test_losses.append((num_steps, mean_loss))\n",
"    all_test_accuracies.append((num_steps, mean_acc))\n",
"\n",
"print(\n",
"    f\"Epoch {num_epochs}: \"\n",
"    f\"loss {mean_loss:.3f} (train {tf.reduce_mean(epoch_losses):.3f}), \"\n",
"    f\"acc. {mean_acc:.3f} (train {tf.reduce_mean(epoch_accuracies):.3f})\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"tags": [
"remove-input",
"center-output"
]
},
"outputs": [
{
"data": {
"application/pdf": "",
"image/svg+xml": [
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"from matplotlib_inline.backend_inline import set_matplotlib_formats\n",
"\n",
"set_matplotlib_formats('pdf', 'svg')\n",
"\n",
"matplotlib.rcParams['mathtext.fontset'] = 'stix'\n",
"matplotlib.rcParams['font.family'] = 'STIXGeneral'\n",
"\n",
"train_steps, train_nlls = zip(*all_train_losses)\n",
"_, train_accs = zip(*all_train_accuracies)\n",
"test_steps, test_nlls = zip(*all_test_losses)\n",
"_, test_accs = zip(*all_test_accuracies)\n",
"\n",
"plt.figure(figsize=(12, 3))\n",
"plt.subplot(1, 2, 1)\n",
"plt.plot(train_steps, train_nlls, color=\"tab:blue\", linewidth=0.1, alpha=0.3)\n",
"plt.plot(test_steps, test_nlls, color=\"tab:red\")\n",
"plt.ylim([-0.1, 2])\n",
"plt.xlim([0, train_steps[-1]])\n",
"plt.xlabel(\"# gradient steps\", fontsize=18)\n",
"plt.ylabel(\"NLL\", fontsize=18)\n",
"plt.xticks(fontsize=14)\n",
"plt.yticks(fontsize=14)\n",
"\n",
"plt.subplot(1, 2, 2)\n",
"plt.plot(train_steps, train_accs, color=\"tab:blue\", label=\"Train\")\n",
"plt.plot(test_steps, test_accs, color=\"tab:red\", label=\"Test\")\n",
"plt.xlim([0, train_steps[-1]])\n",
"plt.xlabel(\"# gradient steps\", fontsize=18)\n",
"plt.ylabel(\"Accuracy\", fontsize=18)\n",
"plt.xticks(fontsize=14)\n",
"plt.yticks(fontsize=14)\n",
"\n",
"plt.legend(loc=\"lower right\", fontsize=16)\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"We have looked at the details of the transformer architecture.\n",
"It consists of identical blocks, each of which contains a self-attention and a multi-layer perceptron operation, together with normalisation layers and residual connections.\n",
"Coupling these together with position embeddings and an appropriate tokenisation layer makes up the entire transformer architecture.\n",
"We looked at a specific example for computer vision, the ViT, and trained it on MNIST."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n",
"```{bibliography}\n",
":filter: docname in docnames\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 2
}