
Four ways (two bad, two good) to calculate spike-triggered receptive fields

A spike of a neuron picks out a snippet of the visual stimulus being presented at that time. The collection of these snippets forms the spike-triggered snippets, and the mean of the snippets is the spike-triggered average (STA). If the stimulus is a video, the STA will be a 3D volume.

Consider an array of snippets with shape \( (N, T, H, W) \). There are \( N \) snippets, one per spike. Each snippet has \( T \) time steps, where at each time step a monochromatic \( H \times W \) image is shown. When the mean over snippets, snippets.mean(axis=0), is taken, we use coordinates \( (t, y, x) \) to index into the resulting 3D volume.
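As a minimal NumPy sketch (the array name, shapes and random data here are made up for illustration):

import numpy as np

# Hypothetical data: 1,000 binary snippets, each 25 frames of 16 x 16 pixels.
rng = np.random.default_rng(0)
snippets = rng.integers(0, 2, size=(1_000, 25, 16, 16)).astype(float)

sta = snippets.mean(axis=0)  # spike-triggered average, shape (T, H, W)
v = sta[3, 10, 12]           # mean pixel value at coordinate (t, y, x)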

When the stimulus is projected onto a retina, the \( y, x \) coordinates correspond to different locations on the retina. It's useful to reduce the 4D array of snippets to a 2D array in such a way that each \( (y, x) \) element measures the degree to which that pixel coordinate is involved in the neuron's firing.

I'll argue below that simply using the mean along the snippet and time dimensions is not a good option; slightly better is the snippet variance; better still is a sort of mean-squared-error; and finally, I think the best results come from using some basic information theory.

Stimulus and the STA

The below STA, which is just sta = snippets.mean(axis=0), is from a recording of a chicken retinal ganglion cell responding to a binary checkerboard stimulus, where each pixel has a 50% chance of being on or off at each time step. The stimulus used a single LED; I'm not sure what wavelength it was. This recording was carried out by Marvin Seifert in 2025, who has been steadily improving the quality of these types of recordings.

STA for a chicken retinal ganglion cell recorded with ~40,000 spikes. Here, the snippets include 800 ms before and 200 ms after a spike. The spike is shown with a ❚ at 800 ms. Normalization: the min and max values across the stack are mapped to 0 and 255 when creating the video. The same thing is done for all videos and images below.

The stimulus is projected onto an approximately 2 mm × 2 mm area by a DLP LightCrafter 4500. The projector has a resolution of 912 × 1140 pixels, making each pixel an approximately 2 μm square on the retina. This is close to the width of the outer segment of photoreceptors.

Each box of the checkerboard stimulus was larger, possibly 4 pixels wide; I can't remember exactly. The noise was also the shifted-noise variant, where the checkerboard grid is shifted randomly in an attempt to achieve higher resolution in the STA without using smaller boxes. I'm not fond of this technique, as it makes it harder to interpret the results. These details are not so important for the discussion of receptive fields. I just thought I'd mention them to give you a feel for the data.

2D receptive field

If we simply take the mean of the STA along the time dimension, any pixel coordinate that contributes to the neuron's firing by being both on and off at various times in the snippets can have its contributions averaged out. You can see this in the 3D STA above: its mean along the time dimension is the following image:

Mean of the STA along the time dimension. A lot of structure that appears in the STA video above is lost in this 2D representation.

The simplest "fix" is to make all contributions positive by squaring them.

2D receptive field via sample variance

Instead of rf = sta.mean(axis=0), we will calculate ((sta - c)**2).mean(axis=0), which is the mean squared deviation from a constant \( c \). There is a subtle but important question of what value to choose for \( c \). If \( c \) is the mean along the time dimension of the STA, then we would be calculating the sample variance, rf = sta.var(axis=0). This looks like:

Left: squared deviation of each (t,y,x) value from the mean of the STA along the time dimension, (sta - sta.mean(axis=0))**2. Right: 2D receptive field obtained by following up with a mean along the time dimension, rf = ((sta - sta.mean(axis=0))**2).mean(axis=0).

An example should make it clear why this is not a good approach. To keep the argument simple, assume that snippets do not include any timesteps after a spike; for example, they are just the 100 ms before a spike. Imagine the extreme case where the pixel at snippet coordinate (y,x) is ON for the entire 100 ms in every spike snippet. This would suggest that the pixel being ON for this 100 ms is a requirement for the neuron to fire, so we would expect a receptive field to have a high value at (y,x). But if we use rf = ((sta - sta.mean(axis=0))**2).mean(axis=0), the mean of the pixel value within the spike snippets will be 1, and the average squared deviation from this value will be 0. So if the receptive field is calculated in this way, it will be 0 at (y,x), which is not what we want.
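A minimal numeric illustration, using a single hypothetical pixel that is always ON:

import numpy as np

sta_pixel = np.ones(10)  # STA values over 10 time steps at one (y, x)
rf = ((sta_pixel - sta_pixel.mean(axis=0)) ** 2).mean(axis=0)
print(rf)  # 0.0: the most informative pixel gets the lowest score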

2D receptive field via deviation from stimulus mean

If we stick with the strategy of taking the mean squared deviation from a constant, a better choice for the constant is 0.5, which is the mean pixel value of the stimulus (the 50% ON 50% OFF stimulus).

This choice gives us the following:

Left: squared deviation of each (t,y,x) value from 0.5, (sta - 0.5)**2. Right: 2D receptive field obtained by following up with a mean along the time dimension, rf = ((sta - 0.5)**2).mean(axis=0).

Why is it better? Intuitively, we are measuring how much each pixel deviates from its average value in the stimulus, and this is the type of information we want to capture in the receptive field.
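Revisiting the always-ON pixel from the previous section shows the fix (same hypothetical setup):

import numpy as np

sta_pixel = np.ones(10)                     # always ON across the STA
rf = ((sta_pixel - 0.5) ** 2).mean(axis=0)
print(rf)  # 0.25, the largest value a pixel can achieve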

Connection to likelihood

A theoretical motivation for the use of (sta - 0.5)**2 comes from the fact that, for a mean pixel value \( v \) at some location \( (t,y,x) \), the value \( (v-0.5)^2 \) is, up to a scale factor and an additive constant, the negative log-likelihood of observing that mean pixel value \( v \). This is because the mean of many (~40,000) Bernoulli random variables with parameter \( p=0.5 \) is, by the central limit theorem, approximately normally distributed with mean \( \mu = 0.5 \). A normally distributed random variable has a probability density of the form:

\[ \mathbb{P}(Z=z) = c_1 e^{-\frac{(z-\mu)^2}{2\sigma^2}} \]

where \( c_1 \) is a constant. The log of this value is:

\[ \log(\mathbb{P}(Z=z)) = \log(c_1) - \frac{(z-\mu)^2}{2\sigma^2} \]

and so we can say that:

\[ (z - \mu)^2 \propto -\log(\mathbb{P}(Z=z)) + c_2 \]

where \( c_2 \) is another constant. So, the squared deviation from the mean pixel value is proportional to the negative log-likelihood of observing that mean pixel value.

Instead of shifting and scaling the data to fit in the range [0, 255], we could calculate the likelihood from the formula above, and give meaningful units to the intensities:

Negative log-likelihood for the mean value at each (t,y,x) coordinate in the STA.
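A sketch of that calculation. The STA value at each coordinate is the mean of \( N \) Bernoulli(0.5) draws, so its variance under the null hypothesis of no spike dependence is \( \sigma^2 = 0.25/N \) (the function name here is my own):

import numpy as np

def negative_log_likelihood(sta, n_spikes):
    # Under the null (pixel unrelated to the spike), each STA value is the
    # mean of n_spikes Bernoulli(0.5) draws: approximately N(0.5, 0.25 / n_spikes).
    sigma2 = 0.25 / n_spikes
    return (sta - 0.5) ** 2 / (2 * sigma2)  # -log-likelihood in nats, up to a constant

# e.g. with sta = snippets.mean(axis=0) from before:
# rf = negative_log_likelihood(sta, n_spikes=snippets.shape[0]).mean(axis=0)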

2D receptive field via relative entropy

Another option is to compare how much the pixel distribution has changed when conditioned on the spike.

An example. A pixel at coordinate (t,y,x) is [ON, OFF] with probability [0.5, 0.5]. If, conditional on a spike, the pixel is instead [ON, OFF] with probability [0.2, 0.8], then the pixel's distribution has changed. The entropy, which was \[ 0.5 \log_{2}(2) + 0.5 \log_{2}(2) = 1.0 \text{ bits,} \] has reduced to \[ 0.2 \log_{2}(5) + 0.8 \log_{2}(1.25) = 0.7219 \text{ bits.} \] In the extreme case where the pixel is always on in the spike snippets, the entropy would reduce to 0 bits. The difference between the original entropy of 1 bit and the conditional entropy of, say, 0.7219 bits contributes to the mutual information between the pixel and the spike. Contributes to, but is not equal to: mutual information would also consider all snippets where there is no spike.
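A quick check of those numbers (a throwaway sketch):

import numpy as np

def entropy_bits(p):
    # Entropy of a binary distribution [p, 1 - p], in bits.
    q = 1 - p
    return -(p * np.log2(p) + q * np.log2(q))

print(entropy_bits(0.5))  # 1.0
print(entropy_bits(0.2))  # ~0.7219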

We will calculate the relative entropy:

\[ D(\,(Y \,|\, S=1) \,\|\, Y) \]

This is the relative entropy of the pixel distribution conditional on a spike, relative to the original pixel distribution (written \( D \) to keep it distinct from the entropy \( H \)). For our case, where the original distribution is [0.5, 0.5] and the spike-conditioned distribution is [p, q], the calculation is:

\[ \begin{align*} D(\,(Y \,|\, S=1) \,\|\, Y) &= p \log_{2}\left(\frac{p}{0.5}\right) + q \log_{2}\left(\frac{q}{0.5}\right) \\ &= \log_{2}(2) + p \log_{2}(p) + q \log_{2}(q) \\ &= 1 - H(Y \,|\, S=1) \end{align*} \]

This is just 1 minus the entropy of the pixel distribution conditional on a spike. In Python, you could write:

import numpy as np

def relative_entropy(snippets):
    # snippets: binary array of shape (N, T, H, W); the mean over the spike
    # axis gives P(pixel ON | spike) at each (t, y, x) coordinate.
    p = snippets.mean(axis=0)
    q = 1 - p
    with np.errstate(divide="ignore", invalid="ignore"):
        # p * log2(p) evaluates to NaN when p == 0; treat 0 * log2(0) as 0.
        entropy = -np.nan_to_num(p * np.log2(p)) - np.nan_to_num(q * np.log2(q))
    return 1 - entropy
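Applied to the snippets array, the 2D receptive field then follows with a mean along the time dimension:

rf = relative_entropy(snippets).mean(axis=0)  # shape (H, W)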

The resulting 3D stack, and the 2D receptive field obtained by taking the mean along the time dimension, looks like this:

Left: \( D(\,(Y \,|\, S=1) \,\|\, Y) \). Right: 2D receptive field obtained by following up with a mean along the time dimension.

The results using the MSE (which was likelihood in disguise) and the relative entropy are very similar, but not exactly the same. They are similar because, for values close to 0.5, the two calculations track each other closely:

Relative entropy vs. MSE.
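The similarity is no accident: a second-order Taylor expansion of the binary entropy around \( p = 0.5 \) gives

\[ 1 - H(Y \,|\, S=1) \approx \frac{2}{\ln 2}\left(p - \tfrac{1}{2}\right)^2 \approx 2.885\,(p - 0.5)^2 \]

so, to second order, the relative entropy is just a scaled MSE.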

When the videos or images are normalized to the range [0, 255], this difference is less pronounced. Below is the MSE scaled to roughly match the relative entropy curve. You can see that they are not scaled versions of each other:

Relative entropy vs. scaled MSE.
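A quick numeric check of this (a sketch): if one curve were a scaled copy of the other, the ratio between them would be constant.

import numpy as np

p = np.linspace(0.01, 0.99, 50)  # grid that avoids p = 0.5 exactly
mse = (p - 0.5) ** 2
rel_ent = 1 + p * np.log2(p) + (1 - p) * np.log2(1 - p)
# The ratio is ~2.885 (= 2 / ln 2) near p = 0.5 and grows toward the
# edges, so no single scale factor maps one curve onto the other.
print((rel_ent / mse).round(2))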

They are surprisingly similar. For other scenarios, where the stimulus has a different distribution, the two curves can differ more. For the [0.5, 0.5] stimulus used here, there is no practical difference between the two approaches.

The takeaway: when creating receptive fields, do not use the mean of the STA along the time dimension, nor the sample variance along the time dimension; instead, use the mean squared error or the relative entropy. For the mean squared error, the "error" is measured from the stimulus mean, which is 0.5 for the stimulus used here. Instead of the MSE calculation, you may wish to calculate the exact log-likelihoods: the receptive fields will look the same, but the values will have interpretable units.