Brief Solutions Ltd

Understanding house and rent statistics, May 2025

2025-05-29T00:00:00+00:00

Following the webinar of Office for National Statistics “Bricks, mortar and data: understanding house and rent statistics”, below are the resources:

ONS produce the UK Average House Price index which is published on HM Land Registry available to download here https://www.gov.uk/government/statistical-data-sets/uk-house-price-index-data-downloads-march-2025. The lowest level is local authority, and this data is at the end of conveyancing, when HMLR have processed the logged sale.

Inaugural lecture of Prof Phyllis Illari on causality, May 2025

2025-05-20T00:00:00+00:00

Inaugural lecture of Prof Phyllis Illari, UCL on causality, May 2025.

Intro: causation vs correlation

What have I been up to? Philosophical accounts

What have I been up to? Talking to sciences

Causal Mosaic: philosophical concepts of causality

Causal Mosaic: other philosophical resources, esp. “variation” vs “production” (the box in the bottom centre)

Misspecification test, or why there is need for a high degree of sparsity for a large neural network to be statistically verified

2025-05-19T00:00:00+00:00

Introduction

Suppose some human being is given a list of logical or practical questions, and has answered each of them.

Now a model is asked to simulate the human being with a likelihood function. The likelihood function outputs a likelihood for any input-output pair. But for an input statement in the training data, the calculated likelihood should be generally higher when the correpsonding output matches the human’s output, and lower otherwsie.

Below, we’ll check if the model has learned the “true” human function. It is also called misspecification test in statistics: whether the learned likelihood function is correctly specified.

Misspecification test

In econometrics, the Information matrix test is used to determine whether a regression model is misspecified.

Following the same notations as in the Wikipedia article, after learning, the model’s $r$ parameters are stored in the vector $θ$ , and the logarithm of the learned likelihood function is denoted by $l (θ)$ . If the model is a neural network, the $r$ parameters in $θ$ would represent the weights inside the neural network.

Now, for each input-output training sample, one’d evaluate the matrix below

\frac{\partial^{2} l (θ)}{\partial θ \partial θ^{T}} + (\frac{\partial l (θ)}{\partial θ}) {(\frac{\partial l (θ)}{\partial θ})}^{T},

a square matrix of size $r \times r$ .

If the machine learned well the human being’s function, the information matrix test says that, each of the $r^{2}$ elements of the matrix should be independent white noise centred around zero, statistically. The test of that hypothesis will involve computing the covariance matrix between the $r^{2}$ elements of the above matrix, over the training samples.

According to White, Halbert et al 1992, if the model was correctly specified, this covariance matrix is computable by a formula. The result is a matrix of size $r^{2} \times r^{2}$ , and would take $r^{4}$ scale of computation.

The $r^{4}$ scale of computation

For Large Language Models (LLM), the number of parameters $r$ is usually in the order of billions, or $10^{9}$ , if not more, e.g. Facebook (aka Meta)’s open sourced Llama. Then $r^{4}$ would take it to the order of $10^{36}$ at minimum.

According to a list of ranking of fastest computers, the fastest one may execute at 1.102 exaFLOPS, or $1.102 \times 10^{18}$ floating-point operations per second. If we have 1,000 of them, the time for our computation would be in the order of $10^{15}$ seconds, that is, 31 million years.

Hence it may be close to impossible to statistically test if a LLM was misspecified, if the LLM was densely connected inside, i.e. one neuron connected to most other neurons.

If there is a high degree of sparsity in the neural network

If relative to the total number of neurons, one neuron is only connected to few fellow neurons, the matrix

\frac{\partial^{2} l (θ)}{\partial θ \partial θ^{T}} + (\frac{\partial l (θ)}{\partial θ}) {(\frac{\partial l (θ)}{\partial θ})}^{T}

would consequentially be highly sparse, or with many elements being zero. The number of non-zero elements would be in the order of $r$ , mostly on the diagonal of the matrix.

The covariance matrix of about $r$ non-zero elements, would be of size of order $r^{2}$ . In the example for Llama, the scale of computation drops to $10^{18}$ . The 1,000 supercomputers would need the order of 0.001 second to run the statistical test.

Human’s number of neurons and their interconnections (synapses)

If we check this Wikipedia article on Neuron,

The human brain has some $8.6 \times 10^{10}$ (eighty six billion) neurons.[28] Each neuron has on average 7,000 synaptic connections to other neurons. It has been estimated that the brain of a three-year-old child has about $10^{15}$ synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from $10^{14}$ to $5 \times 10^{14}$ synapses (100 to 500 trillion).

we can read two things:

first, for an adult human brain, relatively to the total number of neurons, averagely one neuron is connected to less than one-millionth of them;

second, the total number of synapses, our $r$ , decreases as the human grows from child to adult. Take the 3-year old child, $r$ is circa $10^{15}$ .

For a 3-year old child, we can only hope any single task has involved far fewer than $10^{15}$ synapses to be able to statistically verify a computational model of it with the current machinery.

REFERENCES

White, Halbert et al, 1992, Artificial Neural Networks: Approximation and Learning Theory.

LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971

Welcome to Jekyll!

2025-03-16T23:00:33+00:00

You’ll find this post in your _posts directory. Go ahead and edit it and re-build the site to see your changes. You can rebuild the site in many different ways, but the most common way is to run jekyll serve, which launches a web server and auto-regenerates your site when a file is updated.

Jekyll requires blog post files to be named according to the following format:

YEAR-MONTH-DAY-title.MARKUP

Where YEAR is a four-digit number, MONTH and DAY are both two-digit numbers, and MARKUP is the file extension representing the format used in the file. After that, include the necessary front matter. Take a look at the source for this post to get an idea about how it works.

Jekyll also offers powerful support for code snippets:

def print_hi(name)
  puts "Hi, #{name}"
end
print_hi('Tom')
#=> prints 'Hi, Tom' to STDOUT.

Check out the Jekyll docs for more info on how to get the most out of Jekyll. File all bugs/feature requests at Jekyll’s GitHub repo. If you have questions, you can ask them on Jekyll Talk.