Interviewer: Esther Kezia Thorpe
Esther: Before the FT’s coronavirus charts and trackers exploded in popularity, what sort of work were you doing in data at the FT?
John: As a data journalist, the specialism is essentially the data, like the stuff we work with, rather than any particular subject matter. So it’s really been quite varied.
Within that it tends to follow things I’m more interested in or have more knowledge of, so I do a lot of work on politics, quite a lot in economics, bits of sports, bits of environment, so a big range. If we looked over the last few months, I’m always heavily involved in our coverage of elections. So we did a lot of work in the lead up to and on the night and the aftermath of the UK election back in December.
I was involved in a lot of our stuff on the US Democratic primaries, but also been working a lot with our economics team on things like regional inequality – that’s something I’ve always done bits and pieces on.
But to give you an idea of the output of our team, I’d say across our team of data journalists and visual journalists, we’ll probably cover at least 100 more stories by the end of this year, easily. Obviously COVID is a massive story, and that’s one that all of us are involved in, and we’re doing loads of stuff just on this one story.
But even over the last few weeks, our team has been doing stuff on US job losses, and the US election and just bits of companies news and markets news. We cover all sorts of things, but for me, it’s tended to be mainly politics and economics before this.
Esther: On the Coronavirus crisis, the work you’ve done has been at the forefront of what publishers have been doing with data visualisations. If we rewind to January, when this was all kicking off, what did those conversations look like around how you were going to approach it? Because it’s not easy data.
John: Yeah, well, it’s interesting because back then – and this is a nice thing with this one, you can look back and remember when this was just a remote, little world affairs story that we were interested in – and I remember the conversations, because it’s the kind of thing that our team is always involved in from an early phase because the great thing with data is that it makes little difference whether we’re doing a story about London or about Wuhan.
If the information is there, we’ve got something to do with it. So we were having conversations with our world news desk, like you say, in sort of January, and certainly a lot in February about, what does this look like for the FT? Is this a China story? Or is it a global story? And do we do a couple of news stories or do we have an ongoing page tracking this?
And we decided fairly early on that we needed an ongoing ‘home’ for all things Coronavirus, because once it became clear that this was spreading beyond Wuhan, even when it was still confined to Asia but outside of China, it was clear that this was going to be something that we’d want to follow, and to do so on a single home instead of lots of stories. We started looking at that from very early on.
And I guess the thing that was different back then was that when there were fewer countries involved and the numbers were all much lower, the data accuracy and precision felt like less of a concern because it was individual…the numbers changing from day to day were relatively small and the Johns Hopkins team in America obviously had a pretty good handle on things.
And of course, the thing that we now know – like anyone who’s been following this story now knows well – is that the data today is anything but crystal clear, and anything but 100% reliable. So it’s changed a hell of a lot.
But back on day one, it was just a case of thinking, ‘Okay, Johns Hopkins have a source here, they’ve got a spreadsheet, we’ll just plug that in and do our stuff on the other end.’ Whereas since then it’s been a much more fiddly and manual reporting process.
Esther: What is that process like? What does the workflow look like from day to day?
John: So it’s changed loads as well over the weeks. When I think back to the ‘olden days’ of early March, when I started making the trajectory charts, it was at that point still a case of, all right, pull in the data from Johns Hopkins, maybe update it with some fresh numbers from Spain or Italy or the UK that would come out in the evening. Because I guess one of the tensions, one of the difficulties here in a newsroom, is that sources like Johns Hopkins or the European Centre for Disease Prevention and Control (ECDC), and these other aggregators, are good at getting the data out on a sort of 24-hour cycle.
But as a newsroom, if we’ve just published a story saying ‘Record number of deaths in Italy,’ then the data in our charts needs to reflect that update five minutes ago. So what became apparent fairly soon was that it was great that there are sources like ECDC and Johns Hopkins that we could pull in as a starting point. But we would then have to do a lot of manual data collection on the day to make sure that we had those European afternoon and evening updates in our data, even if they hadn’t yet filtered through to the other aggregators.
So I guess at the peak, probably late March and early April, when this was as big an exercise as it got for us, the way we were doing it was this: for a subset of about 10 countries of a particular, let’s say, editorial significance – so either these were the countries that the FT has most of its readers in, like the UK and US, or they were countries that were significant to the story, so Italy, France, Spain and China – we would go every day to the government websites in those countries, and to press conferences, and PDFs and all of this stuff, and we would pull out those numbers individually.
It would often even be one of our correspondents in those regions, I would chat to them directly, and they would say, ‘Okay, I’m at the press conference, and the officials just said it was 723 deaths today and 2,300 cases or something,’ and we would take those numbers directly from the horse’s mouth as it were.
And so for the rest of the world, we’d be using data from ECDC or Johns Hopkins that might be a few hours out of date. But we knew that that had come from a trusted aggregator, and it was then this set of 10 or so countries that we were pulling in ourselves.
And as I say, partly that was about timeliness, it was about making sure that all of the numbers we had were the very, very latest, but it’s also because I think what often doesn’t come across here is that the data from the likes of Johns Hopkins and the ECDC is also just being aggregated by teams just like ours. And that means, just as we might make mistakes in our data, there are often little funny things going on in those datasets as well.
So, just this morning, for example, the data from one of the aggregators said that Italy had had 2,195 deaths yesterday, when it was actually just 195, but someone had put an extra two in there.
So I guess in terms of the daily task, a lot of the time just went on this sort of sense checking. Every night, every evening as it was, I would be pulling in the data from the aggregators and from the government websites, but then the crucial task, and often the thing that took the most time, was starting to go through that and checking if it all made sense, because there will often be these things like a number typed in the wrong cell, or missing values for a country.
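As a rough illustration of the kind of sense check described here, a minimal Python sketch might flag any daily figure that is missing, negative, or far above the recent median. The function name, the seven-day window and the five-times threshold are assumptions made for this example, not the FT's actual process, and the series is made up apart from the final value, which mimics the 2,195-for-195 typo mentioned above.

```python
from statistics import median


def flag_suspicious(daily_counts, window=7, factor=5.0):
    """Return the indices of values that look wrong.

    A value is flagged if it is missing (None), negative, or more than
    `factor` times the median of the preceding `window` valid values.
    Both thresholds are illustrative assumptions.
    """
    flagged = []
    for i, value in enumerate(daily_counts):
        if value is None or value < 0:
            flagged.append(i)
            continue
        # Only compare against earlier values that are themselves usable.
        recent = [
            v for v in daily_counts[max(0, i - window):i]
            if v is not None and v >= 0
        ]
        if len(recent) >= 3:
            typical = median(recent)
            if typical > 0 and value > factor * typical:
                flagged.append(i)
    return flagged


# Hypothetical daily deaths; the last entry has a stray leading "2".
series = [260, 240, 230, 210, 200, 190, 2195]
print(flag_suspicious(series))
```

A flag here only marks a value for a human to look at; as the next paragraphs explain, an odd number may be a typo or a genuine change in what a country reports.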
And then there were the errors that – not so much errors, but the inconsistencies – that countries themselves introduced. So several times there have been countries that have decided to change their methodology halfway through. First it was China who changed its definition of a positive test, then it was France who started including care home deaths alongside other deaths when other countries weren’t doing that. Of course, the UK has done the same thing recently.
And then from time to time, you’ll get a country which releases a backlog of positive tests or deaths all in one dump, even though that number represents deaths or tests that have occurred over a period of weeks.
So a lot of the time, on any given day or evening of updating these charts went into that second phase after we pulled in the data, just saying, does this all make sense? Does any of this look weird? If it looks weird, is it a mistake, or is it because a country has changed what it’s doing today? If a country has changed what it’s doing, how do we address that?
So in some cases, we would strip out non-hospital deaths, for example, so we were comparing all countries like for like. In others, we would take one of these backlog dumps of data and we would ourselves spread it out across multiple days to remove the idea that a country had suddenly had a huge number of positive tests or deaths in one day.
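Spreading a backlog across several days could be sketched as follows. The interview does not say how the redistribution was done, so the even split used here is an assumption, as are the function name, its parameters and the figures.

```python
def spread_backlog(daily_counts, dump_index, backlog, days):
    """Redistribute a one-day backlog evenly over the preceding days.

    `backlog` events reported on day `dump_index` are removed from that
    day and shared across the `days` days ending on it. An even split
    is an illustrative assumption; the total is left unchanged.
    """
    if days < 1 or days > dump_index + 1:
        raise ValueError("not enough earlier days to spread over")
    if backlog > daily_counts[dump_index]:
        raise ValueError("backlog exceeds the value reported that day")

    adjusted = list(daily_counts)
    adjusted[dump_index] -= backlog
    share, remainder = divmod(backlog, days)
    start = dump_index - days + 1
    for offset in range(days):
        # Hand out any remainder one unit at a time to the earliest days.
        adjusted[start + offset] += share + (1 if offset < remainder else 0)
    return adjusted


# Hypothetical: day 4's figure of 520 includes a backlog of 400.
reported = [100, 110, 105, 520]
smoothed = spread_backlog(reported, dump_index=3, backlog=400, days=4)
print(smoothed)
```

The adjusted series keeps the same total as the reported one, so cumulative figures are unaffected; only the daily shape changes.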
So that sense checking, sanity checking and cleaning essentially, has always been the biggest task here, and I think that’s the biggest difference between COVID data and COVID as a data story, and anything else that journalists usually work with.
Because usually, when you’re working with GDP, or unemployment data, or climate change temperature data or anything like that, we’re used – and this isn’t just data journalists, but any journalist – you’re used to seeing a number and you just know that that number is what it is. You can compare the number from the UK with a number from India with the number from the US. But the thing that’s become very apparent early on here is that that’s absolutely not the case.
And so most of our work as data journalists is getting to the point when we can compare these numbers to one another, and that includes a lot of adjustments, and checking against alternative data sources all the way through.
Esther: I know people say numbers don’t lie, but in this case, numbers can be very, very much manipulated to make political points. And I’ve seen governments doing this with charts that journalists have been producing. How do you approach that, trying to level everything out and get all the data so that you can compare it in a way that…is it possible to do that without being political? How do you manage that in your team?
John: Yeah it’s really interesting, and I think any data journalist who says that data is objective, and it’s the highest form or the purest form of journalism, I think they’re pulling your leg.
Anyone who’s worked with data knows exactly that, that depending on what data sources you use, and depending on what version of some data you use – do you do per capita adjustments or not? Do you use a linear scale or a log scale, do you use the mean or the median – all of these things affect the output.
And as the journalist, as soon as you know that changing these measures, even if they’re all technically legitimate, as soon as you know that changing them will change the order in which countries appear, for example, you’re making an editorial decision. I know that if I change to per capita, the US won’t look as bad, but Switzerland, and Belgium will look worse.
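The effect on ordering is easy to demonstrate with a small sketch. The three countries and their figures below are entirely hypothetical, chosen only to show that switching from absolute counts to a per-million rate can reverse a ranking; the function name is likewise an assumption for this example.

```python
# Hypothetical figures, not real data.
countries = {
    "Country A": {"deaths": 60_000, "population": 300_000_000},
    "Country B": {"deaths": 8_000, "population": 10_000_000},
    "Country C": {"deaths": 25_000, "population": 60_000_000},
}


def rank_countries(data, per_million=False):
    """Return country names ordered from highest to lowest on the metric."""
    def metric(name):
        row = data[name]
        if per_million:
            return row["deaths"] / row["population"] * 1_000_000
        return row["deaths"]

    return sorted(data, key=metric, reverse=True)


print(rank_countries(countries))                    # absolute counts
print(rank_countries(countries, per_million=True))  # deaths per million
```

Both orderings come from the same underlying numbers and both calculations are legitimate, which is why the choice between them is an editorial one.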
And so the way you have to do it is you just have to put all of the politics of it aside and think, what is the fairest, or just sort of straightest way of doing this stuff? And I think the nice thing for me in this process is that because I’ve been quite open and transparent in communicating the rationale for this stuff on Twitter, for example, or in interviews, I’ve had to hold myself to that really.
Once I’ve set out why I think certain metrics are the best ones to use for statistical and graphical reasons, the nice thing there is, I can say, well look, those decisions were made, that was the reason for it. And that’s what we’re going to stick with. And if I’d suddenly switched from using absolute numbers to per capita, then people could rightly ask, ‘Oh, have you done that because it changes the story?’
These are absolutely editorial decisions as well as technical ones, but as I say, all we’ve tried to do at the FT – and I’m sure this is the same for most people covering this story – is just to be fair, and again to try and do this for the right reasons, make these decisions for statistical reasons rather than political ones.
Because the feedback we’ve had to these charts is highly partisan, not just feedback but people sharing our charts, there are people I’ve seen, using our charts to make points that I can’t fathom how they decided that this chart makes that point! There have been versions of the charts going around with all sorts of annotations scribbled on them.
And on the one hand, that’s the great thing about data visualisation, is that people can read into it all sorts of different things beyond the top level numbers. But as you said, it does mean that we’ve had to be acutely aware of the fact that this has become a very political topic, and any number that we put out isn’t going to be reported just as a straight number, it’s going to be reported as the number that proves x is bad and y is good.
The number of letters to the editor, and bits of feedback and that kind of stuff we’ve had from these charts is absolutely huge. I’ve had more conversations with our reader liaisons team, and even with the editor herself, over the last few weeks than I ever had done before because of the feedback to these charts.
So as you say, they may be quantitative and objective in terms of the calculations, but they are highly personal to people.
Esther: And I know you’ve kept quite a lot of communication on Twitter, you’ve invited people to ask you questions. Are there any really common misconceptions or challenges that have come up that you’ve had to deal with and respond to?
John: Yeah there’s all sorts of things here, and the misconceptions and things sort of work in different ways. I think one misconception is just the idea that this data does exactly what it says on the tin.
The fact that every Tuesday morning now, the latest UK death numbers come out, and at that point, we suddenly have three or four different numbers, all of which, on some level claim to be a count of UK COVID deaths. We have the number that the government’s reported the previous evening, we have the number that the ONS has reported that morning, we have the number of excess deaths from the ONS that morning, and it just means Tuesdays are always this crazy time when you have people writing all of these stories, each of them based on an official number, proving that the UK is the worst or the second worst, or the third worst, or things are at their peak, or they’re slightly off their peak.
And so one of the things is just a misconception that the data here is nearly as unequivocal as anything else. There are all sorts of things, there’s been a whole debate about whether these numbers should be adjusted for a country’s population – I was more involved in that debate earlier, and then I have gladly taken a bit of a step back more recently.
There aren’t many concerns, misconceptions that have come our way that I’ve felt have been unjustified. I think every time someone has said, ‘Hang on a minute, shouldn’t you be doing x?’ I think their concern has almost always come from the right place. But it’s just that, as we’ve said, this data and this phenomenon, these are not the normal things that we talk about, and while there are plenty of phenomena where adjusting for a population size, for example, is absolutely the first thing you should do, there are other things where it doesn’t necessarily give you any benefits, and it can even make things a bit more skewed.
There’s been constant debate around this stuff, we get the same thing on excess mortality at the moment, so every day when we put those numbers out, we’ll get some people coming in and saying, ‘Hang on, these numbers are exaggerating things,’ because not all of these excess deaths are going to be people dying from COVID. They’re going to be all sorts of things caused by the lockdown or caused by people staying away from hospitals.
But I’ll also get a lot of people saying, ‘Hang on, these numbers are an underestimate,’ because they’re not accounting for the fact that there have been fewer deaths from road accidents or homicides.
And both of those concerns are true, but that to me is exactly why excess deaths are a good measure, because they’re affected in both directions. And so at the end of the day, they’re probably going to be fairly close to accurate.
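For readers unfamiliar with the measure, excess deaths are usually calculated as deaths from all causes minus the number that would be expected for the same period, often an average of the same weeks in earlier years. A minimal sketch of that arithmetic follows; the function name, the plain mean as the baseline and all the figures are assumptions for illustration, not the FT's or the ONS's method.

```python
from statistics import mean


def excess_deaths(observed, baseline_years):
    """Return observed minus expected all-cause deaths for each week.

    `observed` is a list of weekly all-cause deaths for this year.
    `baseline_years` is a list of lists holding the same weeks in
    earlier years; the expected value is their plain average, which is
    an illustrative choice of baseline.
    """
    expected = [mean(week) for week in zip(*baseline_years)]
    return [obs - exp for obs, exp in zip(observed, expected)]


# Hypothetical weekly all-cause deaths for two weeks.
previous_years = [
    [1000, 1010],
    [980, 990],
    [1020, 1000],
]
this_year = [1500, 1800]
print(excess_deaths(this_year, previous_years))
```

Because the calculation uses all-cause deaths, it captures both effects raised above: extra deaths indirectly caused by lockdown push the figure up, while fewer road accidents or homicides pull it down.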
So again, I don’t feel that people are trolling on any of this stuff – I mean, there have been one or two that I’ve had to take a deep breath and ignore – but generally, I think it’s just that this is a really tricky issue. There aren’t many straightforward one word answers or one number answers. And to be honest, I think it’s been a great experience as a journalist to cover it for that reason, because it’s a very conversational story. We’re in constant dialogue with readers, with epidemiologists, with academics, and this whole thing is a conversation.
I think a lot of people are starting to become aware that even on the science side, a lot of answers on COVID are not straightforward, that it’s taken months for people, for experts to work out whether everyone should wear masks or not. There’s still a lot of debate about whether lockdown timing is the single critical factor, no one really knows what the R number is right now, for example.
So I think it’s just been a case of everyone learning that data and science and all of this stuff are fuzzier than we might have thought, and as a result, being at the centre of those conversations has been really, really rewarding.
Esther: I mean, in a way you’re not going to get a lot of clarity on a lot of this for many months to come yet, I mean we’re still getting death rates in from three weeks ago in the UK, let alone anywhere else.
John: No, exactly that, and at the start of this year, our team at the FT was thinking, ‘Okay, this is a US election year, and that’s going to dominate our work.’ And obviously, it still is a US election year and we’ve got people working on that who’ve been working on that since January and have remained on that, even while this has been going on.
But at the same time, we know that Coronavirus isn’t going anywhere anytime soon, and the story is changing but it’s not disappearing; the focus is now shifting on to countries easing lockdown, and then it may go on to second waves. There’s a growing consensus that normality isn’t going to resume really, and therefore that remains a story, the fact that things aren’t back to normal, and so we’ll still be covering that.
We’re going to have people, I suspect myself and various others, still working on this for the FT for months, if not even longer. It’s just a question of juggling things, because we don’t want to neglect other bits of the FT’s coverage that we would otherwise be doing. It’s a huge, sprawling story.
I remember, it was the week before we started working from home at the FT, I think maybe the first week of March, and it just became very clear during that week that our team’s workload had suddenly shot through the roof.
There were several evenings that week when a lot of us were still in the office well beyond 7pm, and that was because there were data stories springing up across all parts of the FT: the markets team wanted to look at whether the markets were dipping faster than they ever had done before, and then obviously the health team were looking at things on that.
It’s been a huge story, it’s going to continue to be a huge story, and I’m sure I’ll still be working on COVID-related stuff well into the tail end of the year.
Esther: The FT chose to put this Coronavirus story, the data and all the charts in front of their paywall, so it’s free for everybody to access. Has the FT seen – and I don’t want to use the word benefits because obviously it’s a horrendous health story – but has the FT seen benefits in terms of audience and readership growth from this?
John: The thing on that is that the FT’s analytics and audience teams are fantastic, and the way we tend to do things is, we never ascribe a subscription to one story. We don’t say, ‘Oh, what was the last thing someone read before subscribing.’ It’s a much more long term view of the various triggers that might have led someone to subscribe.
So it’s hard to know, it’s pretty much impossible to know, I think, whether we’ve got subscribers based purely on these pieces. But anecdotally, I can absolutely point to specific examples where people have emailed or tweeted to say, ‘I’ve subscribed to the FT because of this piece of work,’ or we’ve had people who’d let their subscription lapse and have resubscribed.
Esther: I would guess it’s a huge traffic hitter.
John: Oh, yeah, the main page that the trajectory charts all sit in is now by a large margin the FT’s most viewed page ever. But incidentally, the second most viewed ever was our Brexit poll tracker page that we did back in June 2016, so the data team has a good record with this stuff.
But yeah we are absolutely seeing people subscribe to the FT as a direct result of these pieces of work as well as all the rest of our coverage recently. And there’ll also obviously be an effect, a bit like there was in summer 2016, and then after Trump came in as well, in terms of people just reacting to a major shock event like this by thinking ‘I need to support quality journalism.’
So I think it’ll partly be that larger scale quality journalism effect, and partly it’s been great to see people responding directly to this page. It’s definitely brought people to the FT.
And I think it’s been great that people, especially, there are people who maybe haven’t had much of a relationship with the FT before March and April of this year, and would have thought, ‘Oh, they just cover finance.’ And I think, of course, COVID is absolutely a huge finance and business and an economic story.
But for people to see the Financial Times covers diseases and covers people and covers health, I think it’s great that we’ve been able to show that as well. It was ultimately a decision about wanting to put public service journalism in front of the public, but it has all sorts of side benefits for us and for the readers.
Esther: You touched on it briefly there, but I’ve seen data journalists around, I mean, not just the Coronavirus crisis, but Brexit and things like that, they’ve been described as the new ‘rockstars’ in journalism. Have you seen a shift in importance of the role over the years you’ve been in the field?
John: My sort of initial hackles raised answer to that is well, data journalists have always been doing great stuff!
I think for me, there’s two things I’d say on this. One is that the way I’ve come to think about COVID for data journalists is that, it’s a bit like covering an election but where there’s an election every day for months, in the sense that, when we cover elections, for a few days, we as data journalists are the centre of the operation, and the centre of where readers’ eyes are going, because we’re looking at the results, we’re looking at the polls, we’re looking at who voted for whom and why.
And we become the epicentre of all of the coverage for say the few days before and after and on the night. And that’s essentially been the dynamic this time, but it’s just not stopped.
So this is a story where you can’t really consume this story without data. I mean, of course, you absolutely can, and maybe it would be a bit less frantic if you did! But people need numbers when they’re being told by their government that their country is doing well, or that testing is increasing. And so the need for numbers, obviously points editors and readers to us.
But more generally, I think you’re right in the sense that there is more and more information out there which is already quantified or it’s at least quantifiable, it’s something that we as data journalists can gather and quantify. And that means there are more and more topics and events where pulling together the information, and visualising it, and putting it out there is going to be the best way to cover it, especially with the advent of techniques like web scraping, where we can now create structured data sets from written information online as well.
What I’d also say is, this isn’t just about data journalists, I think it’s also about the need for all journalists to be a bit more data literate, because you’d struggle to cover something like COVID authoritatively if you’re not data literate, because all you could then go on is what your sources or government spokespeople are saying to you. And then of course, if they are wrong, or if they’re deliberately trying to mislead, then you’re just laundering that false or misleading information.
I think, the way the world is moving, the ability to obtain, clean, analyse and visualise data is only going to grow more important, and for data journalists, it’s a huge opportunity.