During the global pandemic, the spread of coronavirus has been limited through a range of measures including lockdowns, restricted travel and constraints on socialising. Widespread testing has also played an important role in efforts to contain the virus. Test and Trace is the system the UK government have been using to monitor test results across the country and inform members of the public when they need to self-isolate. It’s potentially the most significant weapon in the fight against the virus, so the integrity and accuracy of the data is critical.
On the 5th of October 2020, it was reported that almost 16,000 positive tests had been left out of the central database. The omission resulted in rates of infection being under-reported, and meant tens of thousands of people who may have come into contact with the virus were left entirely unaware.
Initially, the explanations given for the missing data ranged from a software error to a ‘glitch’ to data exceeding the maximum file size. One prominent member of the government claimed the error was due to data being ‘truncated’ (no prizes for guessing who that was). To anyone with even a basic understanding of data, none of these explanations seemed satisfactory. So, what was the real reason, and how could it have been avoided?
As further information came to light, it turned out that the body in charge of Test and Trace, Public Health England (PHE), had been using Microsoft Excel to pull together the positive COVID-19 test results. This didn’t explain why the data was missing, but it was an alarming revelation considering the ‘world beating’ £12 billion system that was promised.
Finally, it emerged that the missing data was caused by human error: the way the system had been set up was not adequate for the scale of the information being uploaded.
PHE had been uploading test logs to the central spreadsheet in XLS, a legacy file format that caps each worksheet at 65,536 rows. Because the way positive tests are recorded requires multiple rows per test, there was simply not enough space to include them all: any rows beyond the limit were silently dropped.
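To make the failure mode concrete, here is a minimal Python sketch of what happens when a dataset that needs more rows than the format allows is forced into an XLS-sized worksheet. The volumes and the two-rows-per-test assumption are illustrative, not PHE’s actual figures.

```python
# Minimal sketch of the failure mode. The legacy .xls format caps a
# worksheet at 65,536 rows (2**16); anything beyond that is simply lost.
XLS_MAX_ROWS = 65_536

def flatten_to_rows(test_ids, rows_per_test=2):
    """Expand each test into multiple spreadsheet rows (illustrative)."""
    rows = []
    for test_id in test_ids:
        rows.extend((test_id, part) for part in range(rows_per_test))
    return rows

tests = [f"test-{i:05d}" for i in range(40_000)]  # hypothetical volume
rows = flatten_to_rows(tests)                     # needs 80,000 rows

stored = rows[:XLS_MAX_ROWS]                      # what the worksheet keeps
lost = len(rows) - len(stored)
print(f"rows needed: {len(rows)}, stored: {len(stored)}, lost: {lost}")
# rows needed: 80000, stored: 65536, lost: 14464
```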
Ensuring your methodology is fit for purpose is a baseline requirement for data analysis of any kind. Designing a more fitting system at the outset could have avoided this mistake altogether. While Excel is a versatile and reliable tool for storing and organising information up to a point, for something of this scale and level of importance, a bespoke solution would have been more appropriate. A simple database employing SQL would easily accommodate several million rows of data, while enabling efficient retrieval.
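As a rough illustration of the alternative, the sketch below uses Python’s built-in sqlite3 module, one of the simplest SQL databases available, to store a million rows. The schema and field names are hypothetical, but the point stands: even a lightweight database clears the XLS ceiling by orders of magnitude.

```python
import sqlite3

# Hypothetical schema: one row per reported result, indexed for retrieval.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE test_results (
           id          INTEGER PRIMARY KEY,
           specimen_id TEXT NOT NULL,
           result      TEXT NOT NULL,
           reported_at TEXT NOT NULL
       )"""
)
conn.execute("CREATE INDEX idx_reported_at ON test_results (reported_at)")

# Insert a million illustrative rows -- far beyond the 65,536-row XLS cap.
rows = ((f"SPC{i:07d}", "positive", "2020-10-02") for i in range(1_000_000))
conn.executemany(
    "INSERT INTO test_results (specimen_id, result, reported_at)"
    " VALUES (?, ?, ?)",
    rows,
)
conn.commit()

(total,) = conn.execute("SELECT COUNT(*) FROM test_results").fetchone()
print(f"rows stored: {total}")  # rows stored: 1000000
```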
If PHE had built quality assurance into their process, the problem would have been identified long before anybody was placed in harm’s way, irrespective of the way the process was designed.
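What might such a check look like? One common pattern is a reconciliation step that compares the number of records received against the number actually loaded, and refuses to proceed if they differ. The function below is a hypothetical sketch of that idea, not PHE’s actual process.

```python
# Hypothetical reconciliation check: compare records received with records
# loaded at each hand-off, so silent truncation is caught before reporting.
def reconcile(received: int, loaded: int) -> None:
    if loaded != received:
        raise ValueError(
            f"data loss detected: received {received} records, "
            f"loaded only {loaded}"
        )

try:
    reconcile(received=80_000, loaded=65_536)  # the truncated upload
except ValueError as err:
    print(err)  # data loss detected: received 80000 records, loaded only 65536
```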
The takeaways are twofold. Firstly, whether you are forecasting sales, measuring the success of a marketing campaign or reporting infection rates in a pandemic, choosing the right software and data formats, and being aware of their limitations, is absolutely crucial. Secondly, quality assurance should be a non-negotiable part of recording, analysing or reporting your data, so that errors are picked up before they do any damage.
The author: Matthew Nixon (@mattalytic)
Molzana (@MOLZANAuk)