Four ways (two bad, two good) to calculate spike-triggered receptive fields
A spike of a neuron picks out a snippet of the visual stimulus being presented at that time. The collection of these snippets is the set of spike-triggered snippets, and the mean of the snippets is the spike-triggered average (STA). If the stimulus is a video, the STA will be a 3D volume.
Once the mean of the snippets, snippets.mean(axis=0), is taken, we use coordinates \( (t, y, x) \) to index into the resulting 3D volume.
When the stimulus is projected onto a retina, the \( y, x \) coordinates correspond to different locations on the retina. It's useful to reduce the 4D array of snippets to a 2D array in such a way that each \( (y, x) \) element measures the degree to which that pixel coordinate is involved in the neuron's firing.
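As a concrete sketch of the shapes involved (the array sizes here are made up for illustration):

import numpy as np

# 40,000 snippets, each 10 time steps of a 20 x 20 pixel stimulus.
snippets = np.random.randint(0, 2, size=(40_000, 10, 20, 20)).astype(float)
sta = snippets.mean(axis=0)   # 3D STA, indexed by (t, y, x)
rf = sta.mean(axis=0)         # one candidate 2D reduction, indexed by (y, x)
print(sta.shape, rf.shape)    # (10, 20, 20) (20, 20)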
I'll argue below that simply using the mean along the snippet and time dimensions is not a good option; slightly better is the snippet variance; better still is a sort of mean-squared-error; and finally, I think the best results come from using some basic information theory.
Stimulus and the STA
The STA below, which is just sta = snippets.mean(axis=0), is from a recording of a chicken retinal ganglion cell responding to a binary checkerboard stimulus. Each pixel has a 50% chance of being on or off at each time step. The stimulus used a single LED; I'm not sure what wavelength it was. This recording was carried out by Marvin Seifert in 2025, who has been steadily improving the quality of these types of recordings.
The stimulus is projected onto an approximately 2 mm × 2 mm area by a DLP LightCrafter 4500. The projector has a resolution of 912 × 1140 pixels, making each pixel an approximately 2 μm square on the retina. This is close to the width of the outer segment of photoreceptors.
Each box of the checkerboard stimulus was larger, possibly 4 pixels wide; I can't remember exactly. The noise was also the shifted-noise variant, where the checkerboard grid is shifted randomly in an attempt to achieve higher resolution in the STA without using smaller boxes. I'm not fond of this technique, as it makes it harder to interpret the results. These details are not so important for the discussion of receptive fields. I just thought I'd mention them to give you a feel for the data.
2D receptive field
If we simply take the mean of the STA along the time dimension, any pixel coordinate that contributes to the neuron's firing by being both on and off at various times in the snippets can have its contribution averaged out. You can see this for the 3D STA above: its mean along the time dimension is the following image:
The simplest "fix" is to make positive all contributions by taking a square.
2D receptive field via sample variance
Instead of rf = sta.mean(axis=0), we will calculate ((sta - c)**2).mean(axis=0), which is the mean squared deviation from a constant \( c \). There is a subtle but important question of what value to choose for \( c \). If \( c \) is the mean value along the time dimension of the STA, then we would be calculating the sample variance, rf = sta.var(axis=0). This looks like:
Left: (sta - sta.mean(axis=0))**2. Right: 2D receptive field obtained by following up with a mean along the time dimension, rf = ((sta - sta.mean(axis=0))**2).mean(axis=0).
An example should make it clear why this is not a good approach.
To make the argument simple, we will assume that snippets do not include
any timesteps after a spike, for example, they are just the 100 ms before
a spike. Imagine the extreme case where the pixel at snippet coordinate (y,x)
is ON for the entire 100 ms in every spike snippet. This would suggest the
pixel being ON for this 100 ms is a requirement for the neuron to fire.
So we would expect a receptive field to have a high value at (y,x).
But if we use rf = ((sta - sta.mean(axis=0))**2).mean(axis=0),
the mean of the pixel value within the spike snippets will be 1, and the
average squared deviation from this value will be 0. So if the
receptive field is calculated in this way, it will be 0 at (y,x), which
is not what we want.
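In code, with a made-up ten-step STA time course for that pixel:

import numpy as np

# Pixel that is ON for the whole 100 ms window in every snippet:
sta_yx = np.ones(10)
print(((sta_yx - sta_yx.mean()) ** 2).mean())  # 0.0, the sample variance misses it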
2D receptive field via deviation from stimulus mean
If we stick with the strategy of taking the mean squared deviation from a constant, a better choice for the constant is 0.5, which is the mean pixel value of the stimulus (the 50% ON 50% OFF stimulus).
This choice gives us the following:
Left: (sta - 0.5)**2. Right: 2D receptive field obtained by following up with a mean along the time dimension, rf = ((sta - 0.5)**2).mean(axis=0).
Why is it better? Intuitively, we are measuring how much each pixel deviates from its average value in the stimulus, and this is the type of information we want to capture in the receptive field.
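Revisiting the always-ON pixel from the variance example, a quick check that the new constant behaves as we want:

import numpy as np

sta_yx = np.ones(10)                 # always ON before every spike
print(((sta_yx - 0.5) ** 2).mean())  # 0.25, the maximum possible value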
Connection to likelihood
A theoretical motivation for the use of (sta - 0.5)**2 arises from the fact that, for a mean pixel value \( v \) at some location \( (t,y,x) \), the value \( (v-0.5)^2 \) is proportional, up to an additive constant, to the negative log-likelihood of observing that mean pixel value \( v \). This is because the mean of many (here, around 40,000) Bernoulli random variables with parameter \( p=0.5 \) is approximately normally distributed with mean \( \mu = 0.5 \).
A normally distributed random variable has a probability density of the following form:
\[ f(v) = c_1 \exp\!\left( -\frac{(v-\mu)^2}{2\sigma^2} \right) \]
where \( c_1 \) is a constant. The log of this value is:
\[ \log f(v) = \log c_1 - \frac{(v-\mu)^2}{2\sigma^2} \]
and so we can say that:
\[ (v-\mu)^2 = -2\sigma^2 \log f(v) + c_2 \]
where \( c_2 \) is another constant. So, the squared deviation from the mean pixel value is proportional, up to an additive constant, to the negative log-likelihood of observing that mean pixel value.
Instead of shifting and scaling the data to fit in the range [0, 255], we could calculate the likelihood from the formula above, and give meaningful units to the intensities:
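As a sketch of how one might compute this (the function name neg_log_likelihood and the n_spikes parameter are mine; it assumes the normal approximation above):

import numpy as np

def neg_log_likelihood(sta, n_spikes, p=0.5):
    # The mean of n_spikes Bernoulli(p) values is approximately
    # Normal(p, p * (1 - p) / n_spikes) by the central limit theorem.
    var = p * (1 - p) / n_spikes
    # Negative log-likelihood, dropping the constant normalization term.
    return (sta - p) ** 2 / (2 * var)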
2D receptive field via relative entropy
Another option is to compare how much the pixel distribution has changed when conditioned on the spike.
An example: a pixel at coordinate (t,y,x) is [ON, OFF] with probability [0.5, 0.5]. If, conditional on a spike, the pixel is now [ON, OFF] with probability [0.2, 0.8], then the pixel's distribution has changed. The entropy, which was \[ 0.5 \log_{2}(2) + 0.5 \log_{2}(2) = 1.0 \text{ bits,} \] has reduced to \[ 0.2 \log_{2}(5) + 0.8 \log_{2}(1.25) = 0.7219 \text{ bits.} \] In the extreme case where the pixel is always on in the spike snippets, the entropy would reduce to 0 bits. The difference between the original entropy of 1 bit and the conditional entropy of, say, 0.7219 bits contributes to the mutual information between the pixel and the spike. Contributes to, but is not equal to: mutual information would also consider all snippets where there is no spike.
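To check the arithmetic:

import numpy as np

p = np.array([0.2, 0.8])
print(-(p * np.log2(p)).sum())  # 0.7219... bits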
We will calculate the relative entropy:
\[ D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \log_2 \frac{P(i)}{Q(i)}, \]
which is the relative entropy of the pixel distribution conditional on a spike, \( P \), compared to the original pixel distribution, \( Q \). For our case, where the original distribution is [0.5, 0.5] and in the spike snippets the probability changes to [p, q], the calculation is:
\[ p \log_2 \frac{p}{0.5} + q \log_2 \frac{q}{0.5} = 1 + p \log_2 p + q \log_2 q, \]
which is just 1 minus the entropy of the pixel distribution conditional on a spike. In Python, you could write:
import numpy as np

def relative_entropy(snippets):
    # snippets: array of shape (num_snippets, T, H, W) with binary pixel values.
    p = snippets.mean(axis=0)  # P(pixel ON | spike) at each (t, y, x)
    q = 1 - p
    # Entropy of the spike-conditioned distribution, treating 0 * log2(0) as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        entropy = -np.where(p > 0, p * np.log2(p), 0.0) - np.where(q > 0, q * np.log2(q), 0.0)
    return 1 - entropy

The resulting 3D stack, and the 2D receptive field obtained by taking the mean along the time dimension, look like this:
The results using the MSE (which was likelihood in disguise) and the relative entropy are very similar, but not identical. They are similar because, for values close to 0.5, the two calculations give nearly the same result:
When the videos or images are normalized to the range [0, 255], this difference is less pronounced. Below is the MSE scaled to roughly match the relative entropy curve. You can see that they are not scaled versions of each other:
Still, they are surprisingly similar. For stimuli with a different distribution, the two curves can diverge further; for the [0.5, 0.5] stimulus used here, there is no practical difference between the two approaches.
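A small numerical check of that claim (the factor of 4 makes both curves equal 1 at v = 0 and v = 1):

import numpy as np

v = np.linspace(0.001, 0.999, 999)
mse = (v - 0.5) ** 2
rel_ent = 1 + v * np.log2(v) + (1 - v) * np.log2(1 - v)
print(np.abs(4 * mse - rel_ent).max())  # roughly 0.11, so not scaled copies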
The takeaway: when creating receptive fields, do not use the mean of the STA along the time dimension, nor the sample variance along the time dimension; use the mean squared error or the relative entropy instead. For the mean squared error, the "error" is measured from the stimulus mean, which is 0.5 for the stimulus used here. Instead of the MSE calculation, you may wish to calculate the exact log-likelihoods; the receptive fields will look the same, but the values will have interpretable units.
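Putting it all together, a sketch of the four calculations side by side (the snippets array is stand-in data; in practice it would be your real spike-triggered snippets):

import numpy as np

rng = np.random.default_rng(0)
snippets = rng.integers(0, 2, size=(40_000, 10, 20, 20)).astype(float)  # stand-in

sta = snippets.mean(axis=0)
rf_mean = sta.mean(axis=0)                # bad: ON and OFF contributions cancel
rf_var = sta.var(axis=0)                  # bad: an always-ON pixel scores 0
rf_mse = ((sta - 0.5) ** 2).mean(axis=0)  # good: deviation from the stimulus mean
p, q = sta, 1 - sta
with np.errstate(divide="ignore", invalid="ignore"):
    h = -np.where(p > 0, p * np.log2(p), 0.0) - np.where(q > 0, q * np.log2(q), 0.0)
rf_relent = (1 - h).mean(axis=0)          # good: relative entropy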