Misspecification test, or why a large neural network needs a high degree of sparsity to be statistically verified
Introduction
Suppose some human being is given a list of logical or practical questions, and has answered each of them.
Now a model is asked to simulate the human being with a likelihood function. The likelihood function outputs a likelihood for any input-output pair. For an input statement in the training data, the calculated likelihood should generally be higher when the corresponding output matches the human's output, and lower otherwise.
Below, we'll check whether the model has learned the "true" human function. In statistics this is called a misspecification test: a check of whether the learned likelihood function is correctly specified.
Misspecification test
In econometrics, the information matrix test is used to determine whether a regression model is misspecified.
Following the same notation as in the Wikipedia article, after learning, the model's parameters are stored in the vector $\theta$, and the logarithm of the learned likelihood function is denoted by $\ln f(\theta)$. If the model is a neural network, the parameters in $\theta$ would represent the weights inside the neural network.
Now, for each input-output training sample, one would evaluate the matrix below:

$$\frac{\partial^2 \ln f(\theta)}{\partial \theta \,\partial \theta'} + \frac{\partial \ln f(\theta)}{\partial \theta}\,\frac{\partial \ln f(\theta)}{\partial \theta'},$$

a square matrix of size $n \times n$, where $n$ is the number of parameters in $\theta$.
If the machine has learned the human being's function well, the information matrix test says that each element of this matrix should, statistically, be independent white noise centred around zero. Testing that hypothesis involves computing the covariance matrix of the elements of the above matrix across the training samples.
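As an illustration, here is a minimal sketch of the per-sample matrix for a toy logistic-regression likelihood (the variable names and synthetic data are ours, not from the sources; for simplicity the matrix is evaluated at the true parameters rather than the fitted ones). With data generated from the model itself, i.e. under correct specification, its average over samples should be near zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_params = 100_000, 3

# Generate data from a logistic model with known "true" parameters,
# so the likelihood is correctly specified by construction.
theta_true = np.array([0.5, -1.0, 0.25])
X = rng.normal(size=(n_samples, n_params))
p = 1.0 / (1.0 + np.exp(-X @ theta_true))
y = rng.binomial(1, p)

# Per-sample matrix: Hessian of ln f plus outer product of its gradient.
# For the logistic ln f this collapses to [(y-p)^2 - p(1-p)] * x x'.
coef = (y - p) ** 2 - p * (1.0 - p)                      # shape (n_samples,)
per_sample = coef[:, None, None] * (X[:, :, None] * X[:, None, :])

# Under correct specification the average over samples should be near zero,
# shrinking further as n_samples grows.
mean_matrix = per_sample.mean(axis=0)
print(np.abs(mean_matrix).max())
```

The full test would additionally estimate the covariance of these elements across samples to judge how far from zero is statistically significant; the sketch only shows the matrix whose elements are being tested.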
According to White, Halbert et al. 1992, if the model is correctly specified, this covariance matrix can be computed by a formula. The result is a matrix of size $n^2 \times n^2$, and would take a scale of computation of the order of $n^4$.
The scale of computation
For Large Language Models (LLMs), the number of parameters $n$ is usually of the order of billions, or $10^9$, if not more, e.g. Facebook (aka Meta)'s open-sourced Llama. Then $n^4$ would take the scale of computation to the order of $10^{36}$ at minimum.
According to a ranking of the fastest computers, the fastest one may execute at 1.102 exaFLOPS, or $1.102 \times 10^{18}$ floating-point operations per second. If we had 1,000 of them, the time for our computation would be of the order of $10^{15}$ seconds, that is, about 31 million years.
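A back-of-envelope sketch of that arithmetic (the figures are the assumptions stated above, not measurements):

```python
# Assumed figures from the text: a billion-parameter LLM and a fleet of
# 1,000 machines each running at 1.102 exaFLOPS.
n = 10**9                        # parameters
dense_ops = n**4                 # covariance over the ~n^2 matrix elements
flops = 1.102e18 * 1_000         # total fleet throughput, FLOP/s

seconds = dense_ops / flops
years = seconds / (3600 * 24 * 365)
print(f"{seconds:.1e} s ≈ {years:.1e} years")   # ~10^15 s, tens of millions of years
```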
Hence it may be close to impossible to statistically test whether an LLM is misspecified if the LLM is densely connected inside, i.e. each neuron connected to most other neurons.
If there is a high degree of sparsity in the neural network
If, relative to the total number of neurons, each neuron is connected to only a few fellow neurons, the matrix

$$\frac{\partial^2 \ln f(\theta)}{\partial \theta \,\partial \theta'} + \frac{\partial \ln f(\theta)}{\partial \theta}\,\frac{\partial \ln f(\theta)}{\partial \theta'}$$

would consequently be highly sparse, i.e. have many zero elements. The number of non-zero elements would be of the order of $n$, mostly on the diagonal of the matrix.
The covariance matrix of those roughly $n$ non-zero elements would be of a size of order $n \times n$. In the example of Llama, the scale of computation drops to $n^2 = 10^{18}$. The 1,000 supercomputers would need on the order of 0.001 seconds to run the statistical test.
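The same back-of-envelope arithmetic for the sparse case, under the same assumed fleet:

```python
# Assumed figures from the text: billion-parameter model, 1,000 machines
# at 1.102 exaFLOPS each.
n = 10**9
flops = 1.102e18 * 1_000

sparse_seconds = n**2 / flops    # covariance of only ~n non-zero elements
print(f"{sparse_seconds:.1e} s") # order 0.001 seconds
```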
Human’s number of neurons and their interconnections (synapses)
If we check this Wikipedia article on Neuron,
The human brain has some $8.6 \times 10^{10}$ (eighty six billion) neurons. Each neuron has on average 7,000 synaptic connections to other neurons. It has been estimated that the brain of a three-year-old child has about $10^{15}$ synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from $10^{14}$ to $5 \times 10^{14}$ synapses (100 to 500 trillion).
we can read two things:
first, for an adult human brain, relative to the total number of neurons, one neuron is on average connected to fewer than one-millionth of them;
second, the total number of synapses, our $n$, decreases as the human grows from child to adult. For the 3-year-old child, $n$ is circa $10^{15}$.
For a 3-year-old child, we can only hope that any single task involves far fewer than $10^{15}$ synapses if we are to statistically verify a computational model of it with the current machinery.
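To see why "far fewer" is necessary, apply the sparse-case arithmetic to the child's full synapse count, keeping the 1,000-machine fleet assumed earlier:

```python
# Assumed figures: the quoted ~10^15 synapses of a 3-year-old, and the
# hypothetical fleet of 1,000 machines at 1.102 exaFLOPS each.
n_synapses = 10**15
flops = 1.102e18 * 1_000

seconds = n_synapses**2 / flops   # sparse-test cost at the full synapse count
years = seconds / (3600 * 24 * 365)
print(f"{years:.0f} years")       # roughly 30 years: still impractical
```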
REFERENCES
White, Halbert, et al., 1992. Artificial Neural Networks: Approximation and Learning Theory.
Touvron, Hugo, et al., 2023. LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971