How can we effectively model the transmission of covid19?

A crossroads of predictions, and why we need to pay attention immediately

Larry Tarof
Apr 2, 2020

Abstract

The increase of covid19 is an important topic at this point in time. Typical tabulations talk about the exponential increase and the hope to “flatten the curve”. This work attempts to explore some of the mathematical modeling behind this, but also in the context of active mitigation that is or is not being done.

Epidemiologists (those who study the transmission of diseases) need mathematical models in order to make predictions. These models require assumptions about the virus behavior, and the better these assumptions, the more accurate the predictions will be. Those who study outbreaks similar to the proliferation of covid19 generally assert/assume that a particular mathematical function, the “logistic function” [1], is effective in doing this modeling. This work endeavors to test this assumption, and identify bounds within the case data which point to how much confidence we either can or cannot have in the predictions which arise from this mathematical analysis.

Can we predict the growth and, eventually, the containment of covid19?

In this work, a specific methodology will be presented which, as will be shown, fits a known set of case data [2–4] surprisingly well to a logistic function. The key question is: is covid19 well described by the logistic function, as so many scientific and medical professionals believe? If yes, the following graphs show the predictive power and can give a guide to what we can expect. If, however, the growth remains exponential beyond a fraction of the population, this will mean that the best mathematical treatment of the covid19 growth fails, and we have no established means to predict its growth. Or are there additional analysis tools, tying in with actual practice today, which can be brought to bear?

This may be the most important question which faces our species in 2020.

Introduction

Overview

There are many useful resources to tabulate how acute the spread of covid19 is in different countries. For example, there is an excellent compilation by country of the exponential development of cases, complete with doubling time [5]. On a logarithmic scale, a straight line means that cases double at a constant interval. The steeper slopes are for faster doubling and the shallower slopes are for slower doubling. It is clear that some countries double as quickly as every 2 days, some as slowly as every 10 days in this rendering. Note that the curves for S. Korea and China bend over. Why is this? Why does this deviate from exponential growth? And what mathematical modeling can we use to understand this and hopefully obtain some predictive capability?

Fig 1: Total confirmed cases since 100th case, from [5]. Some countries double as quickly as every 2 days, some as slowly as every 10 days.
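The link between a straight line on such a plot and the doubling time can be made concrete with a short sketch. It assumes pure exponential growth at a constant daily rate; the function names and the rate symbol are introduced here for illustration and do not come from [5].

```python
import math

def doubling_time(daily_rate):
    """Days needed for cases to double under pure exponential growth.

    Assumes cases(t) = cases(0) * exp(daily_rate * t), an idealization.
    """
    return math.log(2) / daily_rate

def daily_rate_from_doubling(days_to_double):
    """Inverse relation: the growth rate implied by a given doubling time."""
    return math.log(2) / days_to_double

# Doubling every 2 days versus every 10 days, the range quoted for Fig 1
fast = daily_rate_from_doubling(2)
slow = daily_rate_from_doubling(10)
print(f"2-day doubling  -> rate {fast:.3f} per day")
print(f"10-day doubling -> rate {slow:.3f} per day")
```

Under this idealization the fastest countries in Fig 1 are growing at five times the daily rate of the slowest.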

There is considerable case data tabulating the daily number of cases of covid19 per country and sometimes even by state/district or municipality [2–4]. Access to this all-important primary data allows for independent analysis such as this work, which would otherwise not be possible.

The logistic function [1] is well known in physics, chemistry, biology and other disciplines. A versatile function, it is used for modeling population growth (biology/ecology), reaction rates (chemistry), tumor growth (medicine), electron distributions (atomic physics), crop response (agriculture) and even the probability of chess outcomes. Many have tried using the logistic function for modeling the covid19 growth, and this seems to be the accepted mathematical model (any Google search will show this quickly). Based on this important assumption, this work explores the predictions and evidence to date through the lens of the logistic function, and explores under what conditions we can have confidence in the logistic function and under what conditions perhaps we cannot.

The analysis below demonstrates that the logistic function can be used to effectively model the transmission of covid19 in many circumstances, and doing so highlights some important differences and similarities among the experiences and practices of different countries.

Basics of logistic function

This basic equation relates the number of cases at any date to the affected population (“percentage*population”), the acceleration rate k (how fast the function changes) and the midpoint (the inflection point date, at which the process begins to decelerate, which is what we all hope for).

Eqn 1: the logistic equation, cases(t) = percentage*population / (1 + exp(-k*(t - midpoint)))

At the outset (first cases), the number of cases increases because there are more people to transmit the disease than there are recipients of the disease. Later on, there are fewer people to infect, and once this inflection point is reached, the process of case increases slows down and eventually stops.

In the simple graph below (Fig 2, from Wikipedia) the inflection point is at zero; this is the point where half the population to be infected is infected. The acceleration rate k is a measure of how steep (high k) or shallow (low k) the curve is. The more easily the disease is transmitted, the higher k is, and the stronger the mitigation measures in place, the lower k is.

To predict the path of the disease, we need to know where we are on the curve at the start, what k is, and ultimately how many people will be affected.
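As a minimal sketch, the three quantities just described can be written as a small function. It assumes the standard form of the logistic function, with the eventual total (percentage*population) in the numerator; the function name, argument names and parameter values below are hypothetical and chosen only for illustration.

```python
import math

def logistic_cases(day, k, midpoint, saturation):
    """Cumulative cases on a given day under the logistic model.

    saturation is the eventual total (percentage * population in the text),
    k is the rate, and midpoint is the inflection-point day.
    """
    return saturation / (1.0 + math.exp(-k * (day - midpoint)))

# Hypothetical parameters, not fitted to any case data
k, midpoint, saturation = 0.3, 30, 100_000

for day in (0, 15, 30, 45, 60):
    print(day, round(logistic_cases(day, k, midpoint, saturation)))
```

At the midpoint the function returns exactly half of the saturation level, which is the inflection point discussed above.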

Fig 2: Simple visualization of the logistic function from Wiki.

A more detailed review with some specific visualizations is given in Appendix A. A key point is that in the absence of an inflection point, the disease continues to increase exponentially.

The key question is: how do these variables relate?

Modeling and fitting

In any real geographic setting (country, state, municipality) the question is how best to describe the increase in the number of cases using the three variables described in the one equation above. What is the percentage of population affected throughout the course of the virus? What is the rate of evolution? And what is the midpoint, which together with rate, determines the endpoint?

A typical way to model involves least squares difference fitting. This has some issues described in the next paragraph. For the mathematically less inclined, please just know that fitting the logistic function with the method presented in this work can correct for those problems.

A key issue with typical least squares fitting to the actual number of cases is that the fit becomes skewed towards the dates with the highest number of cumulative cases. In contrast, and to this author’s knowledge differently from other approaches, the method presented here is to do a least squares fit of the logarithm of the logistic function to the logarithm of the number of cases, using Microsoft Excel Solver for three variables: k (rate), the midpoint and the total percent of population affected. Equal weighting is used for the squared differences of the logarithms. In this way both low cumulative cases at early times and high cumulative cases at later times receive comparable weighting, so the fit is as close as possible over the full time period explored.
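The weighting argument can be illustrated with a small sketch. This is not the Excel Solver workflow itself, only the two error measures being compared; the data are synthetic, generated from hypothetical parameters, and the "model" is deliberately 10% too high on every day.

```python
import math

def logistic_cases(day, k, midpoint, saturation):
    return saturation / (1.0 + math.exp(-k * (day - midpoint)))

def linear_squares(observed, predicted):
    """Squared residuals on the raw case counts."""
    return [(p - o) ** 2 for o, p in zip(observed, predicted)]

def log_squares(observed, predicted):
    """Squared residuals on the logarithm of the case counts."""
    return [(math.log(p) - math.log(o)) ** 2 for o, p in zip(observed, predicted)]

# Synthetic "observations" from hypothetical parameters (not real case data)
days = range(30)
observed = [logistic_cases(d, 0.3, 40, 100_000) for d in days]
# A model that is 10% too high on every single day
predicted = [1.1 * c for c in observed]

def share_of_last(squares, n_last=5):
    """Fraction of the total squared error contributed by the last n_last days."""
    return sum(squares[-n_last:]) / sum(squares)

linear_share = share_of_last(linear_squares(observed, predicted))
log_share = share_of_last(log_squares(observed, predicted))
print(f"last 5 days, linear fit: {linear_share:.0%} of the error")
print(f"last 5 days, log fit:    {log_share:.0%} of the error")
```

With the same relative error on every day, the last five days carry the large majority of the linear squared error but only their proportional share (5 of 30 days) of the logarithmic one, which is why the logarithmic fit treats early and late data comparably.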

In the next sections different countries/areas will be explored, and key conclusions, predictions and questions will follow. N.B. all figures which look like Microsoft Excel graphs are in fact Microsoft Excel graphs produced from the author’s analysis of primary case data.

Following the graphical/mathematical analysis, the social reasons behind the curves are explored and the results tied in with best/worst practices.

Case study: known controlled cases

How amenable is the case data to the logistic function? In this section, the case data is explored for both China and S. Korea, and analyzed through the lens of the logistic function. It will be shown that a two-part function will be necessary to understand the case data.

China

China is clearly the most developed in terms of covid19 evolution. It is also complex, because this was Ground Zero for the covid19 virus knowledge. An excellent timeline is given by [6], reproduced here, also further analyzed in [7], and this is necessary background for understanding how to treat China mathematically using the logistic function. One complicating factor is the large spike in cases reported on Feb 12. This cannot be real, but rather is said to reflect a change in the way cases were reported [8].

Fig 3: Figure and caption reproduced from [6], but with further analysis in [7]. Epidemic Curve of the Confirmed Cases of Coronavirus Disease 2019 (COVID-19). Daily numbers of confirmed cases are plotted by date of onset of symptoms (blue) and by date of diagnosis (orange). Because, on retrospective investigation, so few cases experienced illness in December, these cases are shown in the inset. The difference between the cases by date of symptom onset curve (blue) and the cases by date of diagnosis curve (orange) illustrates lag time between the start of illness and diagnosis of COVID-19 by viral nucleic acid testing. The graph’s x-axis (dates from December 8, 2019, to February 11, 2020) is also used as a timeline of major milestones in the epidemic response. The first few cases of pneumonia of unknown etiology are shown in blue boxes on December 26 (n = 4) and 28–29 (n = 3). Most other cases that experienced onset of symptoms in December were only discovered when retrospectively investigated. Major epidemic response actions taken by the Chinese government are shown in brown boxes. The normally scheduled Lunar New Year national holiday is shown in light yellow, whereas the extended holiday during which attendance at school and work was prohibited (except for critical personnel such as health workers and police) is shown in dark yellow. This figure was adapted with permission. CDC indicates Chinese Center for Disease Control and Prevention; HICWM, Hubei Integrated Chinese and Western Medicine; 2019-nCoV, 2019 novel coronavirus; WHO, World Health Organization.

First off, it’s important to understand two key factors:

1. When cases were initially contracted c.f. when they were reported

2. What the jump in China cases on Feb 12 means

The figure above does an excellent job in demonstrating that symptoms occurred well before they were tracked as cases within China. An explanation of the Feb 12 jump is given in [8] (emphasis from the source quoted): “Most of these 13,332 clinically confirmed cases relate to a period going back days and weeks, and were all retrospectively reported as cases on a single day, but they need to be redistributed over the entire preceding period.” It is important to realize that a set of data is only as good as its source, and in this work the proportionality described below is used. There is no guarantee that this was the actual day-by-day number of cases, but some assumption must be made in order to attempt the mathematical analysis.

We will use both pieces of information in analyzing the China case data.

The next figure shows a tale of two sets of data: that before Feb 12 and after Feb 12, from [2]. The green line has k=0.35 and the orange line has k=0.17 (much lower rate), but with higher fraction of population. In this case it’s clear that the linear visualization allows better discernment between the two fits, but also just how good the fits actually are. The period subsequent to Feb 12 has a factor of ~500 better least squares difference relative to the period before Feb 12, despite how reasonable the fit is for the earlier time period. Why?

Fig 4: China cumulative data (manually extracted from [2]) with two potential fits: one prior to Feb 12 and one after Feb 12. Prior to Feb 12 k=0.35 (green curve) fits best and subsequent to Feb 12 k=0.17 (orange curve) fits best to the actual cumulative data (blue dots). Both exponential and linear visualizations are shown here.

To understand the above case data best, let us move to Hubei, which was the epicenter of the covid19 origination. Wuhan, the origin of the outbreak, is the capital of Hubei. It should be possible to evaluate the logistic treatment best using the Hubei case data.

Let us also do the mathematical exercise of distributing proportionally the cases prior to Feb 12 to conform to the post-Feb 12 standard. Let us then do the mathematical fit. The data looks like this:

Fig 5: Data for Hubei (manually extracted from [2]). The blue dots are the actual cumulative cases. The grey dots are the data proportioned to the Feb 12 endpoint. The purple curve (k=0.34, but lower saturation point than orange) fits the data reasonably well, but not perfectly, prior to Feb 12 and nearly, but not quite, connects up with the orange curve (k=0.15).

Using this mathematical proportionality exercise, where it is clear that the blue points prior to Feb 12 underestimated the actual number of cases, it is seen that the purple curve almost joins up with the orange curve, and that the orange curve is a nearly ideal fit to the data subsequent to Feb 14. Note that the purple curve is assuming a proportionality (hence the same k, but with a higher population fraction). Yet it is impossible to know the precise distribution of the excess cases which were reported on Feb 12.
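One possible reading of the proportionality exercise is sketched below. It assumes that the excess cases reported on the jump day are spread over the earlier days in proportion to the cases already reported on each day; the function name and the toy numbers are hypothetical and are not the Hubei data.

```python
def rescale_before_jump(cumulative, jump_index, excess):
    """Proportionally rescale cumulative counts before a reporting change.

    cumulative: cumulative case counts, one per day.
    jump_index: index of the first day reported under the new standard.
    excess: cases added retrospectively on that day.
    Every earlier day is multiplied by the same factor, so the early curve
    keeps its shape (and hence its k) but reaches a higher level.
    """
    before = cumulative[jump_index - 1]
    # ASSUMPTION: the excess is distributed in proportion to cases already
    # reported; the true day-by-day distribution is unknown.
    factor = (before + excess) / before
    adjusted = [c * factor for c in cumulative[:jump_index]]
    return adjusted + list(cumulative[jump_index:])

# Toy series with a reporting jump on day index 4 (made-up numbers)
toy = [100, 200, 400, 800, 2000, 2300]
adjusted = rescale_before_jump(toy, jump_index=4, excess=1000)
print(adjusted)
```

Because every early point is multiplied by the same factor, day-to-day growth ratios are unchanged, which is consistent with the rescaled curve having the same k and a higher saturation level.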

It is clear that the orange curve (low k=0.15) does not represent the early phase. Even taking the acknowledged under-reporting into consideration, it is not plausible that the orange fit describes the early outbreak, which is much better described by a high k and also, as it happens, a lower saturation point. Figs. A1 and A2 (in the appendix) can be used to help visualize that given the mid and endpoints in Hubei, there is no way to plausibly suggest that a low k was present early in the outbreak. Also, there is no way to plausibly suggest that the high k which fits the data reasonably well in the beginning would also fit the later stage, unless the actual data prior to Feb 12 had an unusual distribution not suggested by the data we do have. This idea will be discussed later, as well.

Referring back to Fig 3, we know that the lockdown occurred in Wuhan and 15 other cities on Jan 23–24, at the point where only ~500 cases were known to exist in China. Given the incubation period is 1–14 days and most commonly 5 days [9] as best understood at this time, but also given the lag between onset of symptoms and diagnosis/reporting is a median of 14 days (Fig. 3 in this work, from [6,7]), it is reasonable to use a two-week transition period for the Hubei data to go from high k to low k. This is probably what occurred, and the purple curve is a mathematical estimate, based on proportioning the data arbitrarily, that attempts to describe what actually happened, perhaps more precisely than the data allows.

What is conclusive, however, is that the lockdown in Hubei, and later throughout China, moved what was a high k rate of growth of cases to a low k rate of cases, independent of the fitting mechanism (proportionality in this case) to handle the precise data prior to Feb 12. What is also clear is that the logistic function does a good job of mathematically describing what happened. Given the excellent job done in China of containing the virus and saturating the curve, we can assert that k=0.154 is a good target for “flattening the curve” (containing the virus) and eventually saturating the curve.

South Korea

Let us turn next to the second Asia success story about “flattening the curve”. In this case, the early data is fit to one set of logistic criteria as a least squares logarithmic fit to all three variables (k=rate, midpoint and total population susceptible) and the late data is fit to a second set of logistic criteria, this time presupposing the final susceptible population and also using the same k as for Hubei.

Fig 6: Data for S. Korea manually extracted from [3]. The blue dots are the actual cumulative cases. The green curve (k=0.38, 0.016% of the population) fits the early data very well, and the orange curve (k=0.05, 0.022% of the population) fits the late data well but not the early data.

It is clear that the green curve (k=0.38, midpoint Mar 1) fits the early data very well, and the orange curve (k=0.05) fits the very late data well. This may look to some like a continued increase, which it is, but this is a prime example of how a very low k takes quite a while for the logistic function to saturate. Once again, the early data has a lower saturation point than the later data. It is clear from this case data that k=0.05 is the standard for controlling covid19 while causing minimal disruption to daily life, through strong “smart mitigation” strategies. Some of these strategies will be discussed later.

How did S. Korea respond to the covid19 crisis? There have been a number of pieces written on the topic [e.g. 10–13]. As of 25 February, Daegu officials were already, before 1000 cases had been identified, aggressively warning residents to take precautions, while allowing private businesses such as restaurants to stay open [12, data from 3]. As a precautionary measure, many restaurants check the temperatures of their customers before accepting them. Relentless public messaging urges South Koreans to seek testing if they or someone they know develop symptoms. Offices, hotels and other large buildings often use thermal image cameras to identify people with fevers. South Koreans’ cellphones vibrate with emergency alerts whenever new cases are discovered in their districts. Websites and smartphone apps detail hour-by-hour, sometimes minute-by-minute, timelines of infected people’s travel: which buses they took, when and where they got on and off, even whether they were wearing masks. People who believe they may have crossed paths with a patient are urged to report to testing centers [13]. Indeed, as of Mar 28, there were 394,000 tests done with 9,583 positive results [12]. This means that S. Korea is willing to invest in 41 tests for every positive result, has tested as of Mar 28 approximately 0.77% of its population (about 7.7 tests per 1000 people), and has a covid19 incidence rate of under 0.02% as of Mar 31.
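The testing arithmetic can be checked with a few lines. The test and positive counts are the ones quoted above; the population figure is an assumption added here (a round value of about 51 million), not a number from the cited sources.

```python
tests_done = 394_000      # tests as of Mar 28, as quoted in the text [12]
positives = 9_583         # positive results as of Mar 28, as quoted in the text [12]
population = 51_000_000   # ASSUMPTION: rough population of S. Korea, not from the sources

tests_per_positive = tests_done / positives
tests_per_thousand = 1000 * tests_done / population
share_tested = tests_done / population
share_positive = positives / population

print(f"tests per positive result:  {tests_per_positive:.0f}")
print(f"tests per 1000 people:      {tests_per_thousand:.1f}")
print(f"share of population tested: {share_tested:.2%}")
print(f"confirmed incidence:        {share_positive:.3%}")
```

Under this population assumption the tested share is a little under 1% of the population, or between 7 and 8 tests per 1000 people.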

Indeed, there is a Tik Tok video, also available on YouTube, which has gone “viral” about a handwashing dance [14]. S. Korea has accepted, in a full social way, the need for all inhabitants to be onboard in a multifaceted assault on covid19 spread.

It is noteworthy and striking just how many of the measures used with success in S. Korea were on full display, and perhaps debuted to the public, in the movie “Contagion” which was first screened in 2011. Sometimes Hollywood gets it right.

Case study: not yet under control

In this section the logistic function is used to fit the case data. Most of the emphasis is on the USA and Italy. It will be shown that the highest k values are necessary to describe the increasing case data, and that predictions from a single instance of the logistic function with high k do not necessarily work.

USA

The USA has more cases than any other country in the world at this point. What can we learn from mathematical analysis? The best fit to the cumulative data from 100 cases onwards as of Mar 29 is given in the figure below.

Fig 7: Statistics for the USA as an aggregate. The blue dots are the actual cumulative cases. The best fit since 100 cases until Mar 29 is given. For the best fit k=0.27, several candidate functions are given in the top graph, and the best fit is clear. This is isolated, again logarithmically, in the middle graph. The lowest graph shows a linear fit.

The USA is increasing in cases rapidly at this point, but the logistic function suggests that there is significant increase yet to come, and at this early stage, this modeling based on a single k suggests that the USA would experience 1.3M total cases with a midpoint of Apr 3. This is seen by many to be a bold prediction based on the much smaller number of cases observed to date. Also, looking at the last data points, this prediction seems perhaps a little high as the effect of the US slowdown is beginning to be felt. However, the USA is difficult to treat as a single entity because lockdowns started in California on Mar 19, but, for example, Kansas only instituted a lockdown as of Mar 30 and several states still have not. These maps from [15] show the current status of where the virus is most prevalent and what stay-at-home orders have been issued.

Fig 8a: USA covid19 incidence as of Mar 30 from [15]
Fig. 8b: USA covid19 stay-at-home orders as of Mar 30 from [15]

It is clear that the US is following the lead of neither China nor S. Korea in terms of social distancing or testing/tracking to this point.

On the social distancing side, perhaps the best recent examples of egregious neglect are seen in the Mardi Gras celebrations in New Orleans (near Feb 25), the Florida spring break beaches reported Mar 19 [15] and for which students tested positive within 5 days [16], and in the service at the River of Tampa Bay Church, where “the church has said it sanitized the building, and the pastor said on Twitter that the church is an essential business. He also attacked the media for ‘religious bigotry and hate’” [17]. Also, recently, (1) a Boston Biogen conference on Feb 26–27, with linkages to follow-on cases found in North Carolina, Indiana, New Jersey, Tennessee, Washington, D.C. and Europe [18,19], (2) a Feb 29 funeral in Albany, Georgia [20], and (3) a Mar 5 40th birthday party in Westport, Connecticut [21] have come to light, all of which are associated with cluster outbreaks. Undoubtedly there are others.

On the testing side, as of Mar 18–20, based on the rich dataset of [22], the USA had recorded 0.3 tests per 1000 people, c.f. S. Korea which at the same time had recorded 6.15 tests per 1000 people — a 20x ratio. At the same time, S. Korea had administered 36.6 tests per covid19 incidence c.f. 7.55 for the USA, a 4.8x ratio.

It may be that the US is too inhomogeneous a population to fully treat with a logistic function today. Perhaps, as in China, there are pockets of covid19 within a smaller population group which either will or will not explode depending on measures taken. Perhaps it is easier to mathematically address New York State, a more homogeneous population, where the greatest number of cases exist within the USA.

Accordingly, the case statistics for New York State as of Mar 29 are given in the figure below.

Fig 9a: Statistics for New York State from [4]. The blue dots are the actual cumulative cases. The best fit since 100 cases is given. Microsoft Excel Solver originally gave k=0.39, but the fit was not as good by eye, so k was manually adjusted in this particular instance to 0.38. The midpoint is Mar 25 in this fit.
Fig 9b: similar analysis applied a few days later. The blue dots are the actual cumulative cases from [4]. Once again, the linear cumulative cases go beyond what the best curvefit for the logistic function would indicate. By Mar 31, k had been reduced to 0.35.

The data suggests that New York State is ready for a slowdown at this point, all else equal. Here k=0.38, a fast acceleration. A few days later, the similar trend is observed, with k=0.35. However in both cases, the cumulative number of cases overshoots the predicted saturation point, which continues to increase. It is clear that the logistic function with a single curve is not an effective predictive model in this instance. It may be that the data is somewhat affected by the ability to test, so we cannot know at this time whether or not this data represents the true cases.

Gov. Cuomo has recently instituted measures and it is possible that these measures will affect these curves going forward.

The epicenter of covid19 acute difficulty in the USA at the moment is New York City itself. The data from [4] up to Mar 26 is given below. Beyond this date, this particular source has stopped tracking NYC as its own entity, so until this source resumes, it will not be possible to see the data going forward.

Fig 10: Statistics for New York City. The blue dots are the actual cumulative cases from [4]. This is the best fit from inception, rather than from a specific number of cases. In this case k=0.49. The model’s midpoint is Mar 24.

The fit for New York City, the most acute area in the USA for covid19 at this point, suggests a highly acute k=0.49, the highest studied, but also that the number of cases is going to level off soon, which seems quite doubtful. Again, it would appear that for very high k values the disease is spreading within a concentrated fraction of the population, but eventually the potentially infected population base itself will increase, meaning that the saturation estimate from this analysis is too low. I.e. when k is very high the numerator in the logistic function is itself a function of time, which makes prediction ineffective. We will return to this point later.

Canada

The statistics from Canada are available. The best fit from 50 cases on shows a midpoint of Apr 7, but does not particularly fit the early data well.

Fig 11: Canada statistics from [3]. The blue dots are the actual cumulative cases. This is the best fit from 50 cases onwards. The midpoint is Apr 7 and k=0.234. Social distancing measures have been in effect for most of Canada since

In mid-March a state of emergency was declared in all Canadian provinces. On Mar 12, at which point 142 cases had been reported, or 0.0004% of the population, Canada started aggressively closing schools [23], and shortly thereafter Canada instituted quarantine measures for anyone returning from abroad.

On the testing front, using the same data-rich set [22], Canada had administered 3 tests per 1000 people (c.f. 6 for S. Korea, a factor of 2x less) and 130 tests per covid19 incidence (c.f. 36.6 for S. Korea, a factor of 3.6x more). Canada is showing a willingness to test.

So far, the Canadian curve has been relatively gentle and seems to fit k=0.23, higher than Hubei, but lower than in many other places. It will be useful and important to revisit this curve once data is available for at least two weeks past the time of this writing. This k value lies between that which makes logistic function prediction effective and that which makes logistic function prediction ineffective. We should have a much better idea by, say, mid-April, of the effect of the social distancing and other measures put into place in Canada.

Europe

Italy has the most covid19 cases in Europe and is second only to the USA in total cases. Following an outbreak, lockdowns were instituted in the originating Lombardy region swiftly on Feb 21, when only 17 cases had been reported, and in most of Northern Italy on Mar 8, at which point 1492 cases had been reported [24]. The best attempts at fitting the logistic function until Mar 29 are given below.

Fig. 12: Statistics for Italy. The blue dots are the actual cumulative cases. Two fits are given: the original fit prior to lockdown and the fit subsequent to lockdown. The most striking feature is that the curve prior to lockdown totally missed the population, but had a high k=0.42. The curve subsequent to lockdown has k=0.21.

Why do the curves not fit as well as for China, despite a similar mathematical approach? Let us consider the specifics of the situation. The original outbreak was in an area of the Lombardy region with approximately 50,000 people. This is the area which was contained on Feb 21. The initial fits are consistent with that sort of population and nearby outlying areas, with a very high k. The lockdown extended to most of northern Italy on Mar 8. The best fit would suggest a maximum number of affected people of 110,000, but at this point covid19 is extending into southern Italy. The model assumes a constant population, but in this case, the relevant population is expanding (this idea has been explored in the USA, China and S. Korea sections as well). So this data would be consistent with a high k spread among a smaller population with a midpoint of Mar 5, 2020, escalating to a lower k spread among a larger population with a midpoint of Mar 20, 2020. That said, the daily statistics for Italy look like the number of cases is past its peak and the cumulative number of cases is starting to look under control; at least that’s the best view from today’s data.

Another note about Italy: the mortality rate, not gone into in detail in this work, is quite high. It has been noted that many of the covid19 deaths in Italy are at least partly attributable to other symptoms [9a], and also Italy’s population has an older demographic. Indeed, Italy is #6 in the world, c.f. the USA at #46 for life expectancy [25]. Covid19 is particularly harsh on the older demographic, as this figure from [26] shows. In future work, it would be important to correlate mortality and also attempt to predict hospital critical care requirements, but this goes beyond the scope of the present work.

Fig 13: reproduced from [26]

As stated above, the best fit logistic function would have predicted that Italy would have maxed out at roughly 110,000 people, but the data clearly shows that this is not the case, and the data from a few days later confirms it. Again, the evidence shows that if k is too high, a single logistic function fails to predict the inflection point or the saturation level. Again, this may be because the numerator in the logistic function is itself increasing with time, as the infection spreads to a greater geographic portion of the total available population. This idea is a familiar one from China, S. Korea and the USA, but is particularly apparent for Italy.

It is important to note that it is difficult to separate the exponential from the logistic fit prior to the inflection point. This has also been noted recently [27] for the specific case of using Python to try to predict the saturation point in Italy. The author in [27], for Italy specifically, notes that on different successive days he gets a different saturation point. In the present work, conceived of and written over roughly a week, the data has evolved significantly, and it is also this author’s observation that the saturation point changes significantly, as does k, throughout that period.
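The difficulty can be illustrated with two logistic curves that share the same k but differ tenfold in saturation level. This is a sketch with hypothetical parameters, not a fit to the Italian data; the midpoint of the second curve is shifted so that both curves have the same early exponential phase.

```python
import math

def logistic_cases(day, k, midpoint, saturation):
    return saturation / (1.0 + math.exp(-k * (day - midpoint)))

k = 0.3  # hypothetical rate, for illustration only

# Curve A: saturates at 100,000 with its midpoint on day 40
sat_a, mid_a = 100_000, 40.0
# Curve B: ten times the saturation, midpoint shifted so that both curves
# follow the same exponential at early times
sat_b = 10 * sat_a
mid_b = mid_a + math.log(10) / k

def relative_gap(day):
    """Relative difference between the two curves on a given day."""
    a = logistic_cases(day, k, mid_a, sat_a)
    b = logistic_cases(day, k, mid_b, sat_b)
    return (b - a) / b

for day in (10, 25, 40, 60):
    print(f"day {day}: curves differ by {relative_gap(day):.1%}")
```

Two weeks or more before the first midpoint the curves differ by well under a few percent, comparable to the noise in reported case data, even though their eventual totals differ by a factor of ten. Only near and after the inflection point do the curves separate, which is consistent with the saturation estimate shifting from day to day.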

Applying lessons of China control — some insight

Consider that the potential affected population started out locally, i.e. as a small number at the outset, and then through migration of people unknowingly carrying covid19, the potential affected population became larger. For the same k, the virus would grow exponentially for considerably more time, but also in the same period lockdowns were introduced. We know that in Hubei, the effect of lockdown was to have a late period, extending for some time, well described by k=0.15, once true social distancing came into effect. Why would other areas of the world be significantly different?

Italy revisited

Let us consider that Italy is in lockdown mode. Let us assume that the lessons of China apply, and that it took the full two weeks, similar to the modeling in Hubei above, for the social distancing effects to take hold. If we use Microsoft Excel Solver for the period starting 2 weeks after the lockdown and introduce the further constraint that k=0.15, which we learned from Hubei, what happens?

Fig 14: Italy covid19 cases, assuming Hubei k value applies once 2 wks have elapsed since the lockdown in Northern Italy. The blue dots are the actual cumulative cases. This fit is much closer to the data.
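A constrained fit of this kind can be sketched without a general-purpose solver. The sketch below is an alternative to the Excel Solver procedure, not a reproduction of it: with k held fixed, the best log-saturation for any trial midpoint is a simple average, so only the midpoint needs to be scanned. The data are synthetic, generated from hypothetical parameters, and all names are introduced here for illustration.

```python
import math

def logistic_cases(day, k, midpoint, saturation):
    return saturation / (1.0 + math.exp(-k * (day - midpoint)))

def fit_with_fixed_k(days, cases, k, candidate_midpoints):
    """Log-space least squares with k held fixed.

    For a given midpoint, the best log(saturation) is the mean of
    log(cases) + log(1 + exp(-k * (day - midpoint))), so only the
    midpoint is scanned.
    """
    best = None
    for midpoint in candidate_midpoints:
        targets = [math.log(c) + math.log1p(math.exp(-k * (d - midpoint)))
                   for d, c in zip(days, cases)]
        log_sat = sum(targets) / len(targets)
        sse = sum((log_sat - t) ** 2 for t in targets)
        if best is None or sse < best[0]:
            best = (sse, midpoint, math.exp(log_sat))
    _, midpoint, saturation = best
    return midpoint, saturation

# Synthetic data from hypothetical parameters (not the Italian case data)
days = list(range(10, 30))
cases = [logistic_cases(d, 0.15, 18, 80_000) for d in days]

fitted_midpoint, fitted_saturation = fit_with_fixed_k(days, cases, 0.15, range(0, 61))
print(fitted_midpoint, round(fitted_saturation))
```

On noise-free synthetic data the scan recovers the midpoint and saturation used to generate it; with real case data the result would depend on which days are included and on the assumed k.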

Two clear distributions are apparent. The first distribution has a high k=0.42 (even higher than China) at the outset, and predicts a relatively small affected population. This is consistent with a small geographic origin with rapid spreading. In the absence of lockdown, these people move around in Italy and begin to infect others. Once lockdown happens, there are those who are already infected and it takes some time for incubation/detection to happen. Once that period is complete, then under social distancing the distribution then flattens out to approximate the Hubei distribution, and with a midpoint of Mar 24, 16 days past the lockdown. As the number of cases is running slightly higher than the best fit curve in this model, this model predicts that Italy will saturate, if lockdown continues, at approximately 150,000 people. If southern Italy becomes seriously infected, this could give rise to a new spike. It would be important to lock down southern Italy now so that the curve flattens as shown in the model.

To restate a key point: this work appears to show that a high k is nearly indistinguishable from an exponential distribution for the cases studied in this work, and does not seem to actually reach the inflection point while k is high. What may be happening is that there is a high k within a smaller population, but the inflection point is not reached until the measures are in place to lessen k, which in all the cases studied in this work are also associated with a higher saturation point.
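
One way to see why this happens, as a sketch using the standard logistic form from [1] (the symbols N for cumulative cases, L for the saturation level and t_0 for the midpoint are names chosen here): well before the midpoint the exponential term in the denominator dominates, and the curve reduces to a pure exponential.

```latex
N(t) = \frac{L}{1 + e^{-k (t - t_0)}}
\quad\xrightarrow{\; t \ll t_0 \;}\quad
N(t) \approx L\, e^{k (t - t_0)} = \left( L\, e^{-k t_0} \right) e^{k t}
```

In this regime the data determine k and the single combination L e^{-k t_0}, but not L and t_0 separately. Any pair of saturation level and midpoint giving the same product fits the early data about equally well, which is consistent with the saturation estimate drifting upward as each new day of data arrives.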

New York revisited

What happens if we take the lesson of the fit in Italy and apply to New York, the most infected area in the US?

Let us once again use Microsoft Excel solver and, as in the case of Italy, constrain the curve to k=0.15 after Gov. Cuomo’s measures. Since insufficient time has elapsed, there is no data, but we can use the same k=0.15 and a midpoint 2 weeks after the lockdown, similar to the treatment of Hubei, matching the curves at the last known data points, to arrive at a prediction. Again, let’s assume the epicenter was New York City, which presently has a highly acute k=0.49 — the highest studied in this work by far. Let’s assume that people have migrated and started to incubate throughout NY, and that with an aggregate k of 0.38 in New York state, the inhabitants are sensitized to covid19 and that k=0.15 will apply going forward as the population is amenable to social distancing. As of the last date for which data was available, Mar 26, NYC represented roughly 65% of the cases within NY. This is not unlike the Hubei to China ratio. If we put all this together, what prediction do we arrive at?

Fig 15: New York State prediction going forward with k=0.15, matching Hubei and also the Italy prediction in previous section. The blue dots are the actual cumulative cases.

This prediction suggests that if social distancing is enforced, the total number of cases will be approximately 210,000 for the total state. It is not clear what proportion of this will be New York City itself. It is important to note that Hubei had the brunt of the cases in China, despite being such a small fraction of China’s overall population.
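
The matching step can be sketched as follows. If a logistic curve with rate k is required to pass through the last known data point, and its midpoint is fixed a known number of days later, the saturation level follows directly. This is one reading of "matching the curves at the last known data points", offered as an assumption; the function name and the input numbers below are placeholders for illustration, not the actual New York figures.

```python
import math

def saturation_from_anchor(last_cases, days_before_midpoint, k):
    """Saturation level of a logistic curve with rate k that passes through
    `last_cases` at a date `days_before_midpoint` days ahead of its midpoint.

    Derived from N = L / (1 + exp(k * d)), solved for L.
    """
    return last_cases * (1.0 + math.exp(k * days_before_midpoint))

# Placeholder inputs for illustration only (assumption: not real case data)
example = saturation_from_anchor(last_cases=40_000, days_before_midpoint=10, k=0.15)
print(round(example))
```

The result is sensitive to the assumed midpoint: each additional day between the last data point and the midpoint multiplies the exponential term by e^k, so a later midpoint implies a noticeably higher saturation level.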

A key conclusion is that a two-fit logistic function approach — the first to the initial epidemic spread and the second to a postulated endpoint modeled on best practice — looks at this point like a more effective prediction tool than a single fit logistic function.

An obvious question — can we revisit the USA prediction? Perhaps in an update, but it is important to note that the USA is not in sufficient social distancing mode at the time of this writing for the same math to apply. Certain states are, and other states continue to resist. The effects of the Mardi Gras in New Orleans, Spring Break in Florida, the River of Tampa Bay Church and so many others, where people undoubtedly contracted covid19 and are now in difficult to determine areas of the USA, have yet to be fully observed.

Smart mitigation

Much of the discussion above has focused on social distancing and covid19 avoidance. However, as learned from S. Korea and others, testing is of vital importance. In this section, the effect of testing and other smart mitigation is explored, and it is shown that the number of covid19 cases is a strong function in particular of the testing. This in turn influences the logistic fit.

The following graph has been taken from [28], and indicates the number of tests which had been done as of the same date for which the number of reported cases is given. The data are from Mar 18–20 and represent a comprehensive set.

Also overlaid on the graph, in this work, are isocontours (lines) of constant tests per case for visualization. More tests per capita indicate smart mitigation, fewer tests the opposite. It is seen that Russia, with supposedly the fewest cases per capita, leads the pack in tests per case. Also high on the list are other countries with notably low numbers of cases per capita at this time. Lagging the field, and with the most data to date, are the USA, Italy and Spain, with the fewest tests per confirmed case. It is no coincidence that these three countries also have the three highest incidences of covid19. It is clear that there is roughly a factor of 100x between the “best practice” (most tests per case) and “worst practice” (least tests per case).

Fig 16: tests per confirmed case from [28]. The lines, intended to be graded from green (safest) to red (least safe) indicate a constant number of tests per case.
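
As a short note on how such isocontours behave, assuming the graph uses logarithmic axes for both tests and cases (an assumption here, since the axes are not described in the text): writing T for total tests, C for confirmed cases and r for the tests-per-case ratio (symbols chosen for illustration), a contour of constant r satisfies

```latex
T = r\,C
\quad\Longrightarrow\quad
\log_{10} T = \log_{10} C + \log_{10} r
```

Each contour is then a straight line of unit slope, shifted by log10(r). Under this assumption, the factor of roughly 100x between best and worst practice corresponds to a separation of two decades between the outermost lines.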

To investigate the effect of cases per capita, this author has attempted to construct a table of some of the “best” practices, some middle and some of the “worst” practices, drawing on and synthesizing data from [3], [28] and [29]. An exhaustive table is beyond the scope of this work at this time, but this gives a good initial idea. Table 1 shows that the same three countries (the USA, Italy and Spain) represent three of the bottom four in tests per capita for those countries extracted in this work.

Table 1: selected countries with data compiled/calculated from [3], [28] and [29].
Fig 17: Incidences of covid 19 for the countries in Table 1 vs tests per incident — selected countries with data compiled/calculated from [3],[28] and [29].
Fig 18: Incidences of covid 19 for the countries in Table 1 vs tests per capita — selected countries with data compiled/calculated from [3],[28] and [29].

This table and these graphs indicate that the outcome roughly 2 wks later, whether measured as serious or critical cases, total cases or total mortality, is a strong function of the number of tests per incident and is correlated, but more loosely so, with the number of tests per capita. What does this mean in practical terms? It means that if extensive testing is done before the epidemic gets out of control, you don’t need as many tests. If judicious testing is combined with tracking and other measures, such as was done in S. Korea, then this analysis would conclude that k can become a lot smaller, flattening out the curve significantly.

If testing is not done in an expedient fashion and more tests per capita are required because there are more cases, there may be ethical dilemmas in choosing who does and who does not receive tests, if tests are a scarce resource. Best, of course, is to use the power of large numbers and resources to reduce the cost and increase the supply of tests. This work does not address these choices, resources, costs or supply issues.

Further implications, discussion

First off, it seems that in the majority of the cases tried so far, the logistic function can be used for analysis purposes. China’s statistical behavior fits the logistic model, and out to the endpoint, provided we consider an initial period, before action is taken, and a later period, after action is taken.

It is important to bear in mind a key anecdotal conclusion: a k value in the logistic function of <0.2 probably means the population as a whole is being considered and the result may well have predictive power, whereas a k value much higher than this probably means that there is a rapid infectious spread through a population base which itself is still expanding — i.e. the logistic function is treating a non-constant population. This is indistinguishable from exponential growth in this phase (as also noted in [26], but not for this reason of expanding population base). So far there is no evidence of a clear inflection point (the midpoint of the logistic function) of any population when k>0.2 or so — all that happens is the next day’s data raises the target saturation point. For k>0.3 or so, the particular geography in question is probably in full crisis mode.

The model for extreme social distancing, as practiced in China, leads to k=0.15, out to the endpoint. China, with the world’s largest population, has held the number of cases, at what appears at the time of this writing to be the endpoint, to 57 cases and 2 deaths per million people. This is a drastic, but deemed necessary, way to contain the spread within a population of 1.4B people. If the Spring Festival had happened as originally planned, there is little doubt that China would be facing a major, nearly uncontainable, epidemic. The wise decision to ban travel at that time, against long-standing public practice, made it possible to contain the spread. In this way of thinking, the actual population under scrutiny is probably significantly less than 1.4B. Viewed from Hubei’s standpoint, where the practice was most felt, the endpoint is more like 1150 cases per million population. This had the potential to spiral out of control, but was contained through effective social distancing practice.

The model of extreme testing and tracking, as practiced in S. Korea, is best described by a very low k=0.05 out to what appears to be an endpoint. In this case, the number of cases continues to grow, but slowly, in a measured fashion, consistent with how long the k=0.05 curve takes to play out. The endpoint simulated in this work is approximately 230 cases per million population. The population has to be amenable to lots of testing, lots of tracking and acceptable social practice, including handwashing and in many cases face masks (not discussed in detail in this work).

In contrast, Italy (and Spain, not explored in depth in this work yet) has approximately 2000 cases per million population and climbing, with more critical cases than the health system can support. Italy presently has 65 critical or serious cases per million population, more than enough to overload the health system. Even though the USA has more cases than any other country worldwide, serious and critical cases number only 14 per million, which is why the USA in its aggregate sense is not yet overloaded, although New York City, with roughly 3% of the USA population for the metropolitan area, has the brunt of the cases and has overloaded the health system. This is projected to get even worse before it gets better. Precise statistics for critical care patients in NYC seem to be unavailable at this time, but if the population and case ratio is taken, an estimate of approximately 140 critical cases per million can be made, even worse than Italy as an aggregate. This is clearly quite serious.
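
The arithmetic behind this kind of estimate can be sketched as follows, assuming that critical cases scale in proportion to confirmed cases (an assumption made here for illustration). Writing c for critical cases per million, f_cases for the share of national cases located in the region and f_pop for the region's share of the national population (symbols chosen here):

```latex
c_{\mathrm{NYC}} \approx c_{\mathrm{USA}} \times \frac{f_{\mathrm{cases}}}{f_{\mathrm{pop}}},
\qquad
140 \approx 14 \times \frac{f_{\mathrm{cases}}}{0.03}
\;\Longrightarrow\;
f_{\mathrm{cases}} \approx 0.30
```

The case share of roughly 30% is not quoted in the text; it is simply the value implied by the 14 per million, 3% and 140 per million figures above, if the estimate was formed this way.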

The logistic model has been applied to Italy and to New York State to imagine what a “flattened” curve looks like with the social distancing measures in place. If this prediction is true, then we could reasonably hope that Italy maxes out at approximately 150,000 cases and New York, with roughly a third of Italy’s population, maxes out at approximately 200–250,000 cases. There is additional hope in New York: a new initiative from Mount Sinai Hospital announced on Apr 1 is offering the opportunity to track covid19 symptoms [30], consistent with the practice adopted in S. Korea. Also, for the first time, temperature tracking within the USA is reported on a large geographic scale [31]. This is using the Information Age for good, and these measures can help significantly.

It is not possible to estimate how high the number of cases in the USA will climb until social distancing takes place for real. Episodes like the Mardi Gras, Florida Spring Break and the River of Tampa Bay Church can only serve to exacerbate the problem in the USA. If (1) measures such as those being adopted in New York by Gov. Cuomo to allow for social distancing and (2) exciting new developments such as the app developed by Mount Sinai Hospital in New York [30] can be extended to the rest of the USA, there is a far greater chance of limiting the human damage beyond what is already inevitable.

Summary/Conclusions

In conclusion, the logistic model has been applied to case data for the covid19 pandemic. It is seen that there is strong predictive power in the judicious use of the logistic model. Some key results are these:

  • Open access to rich sources of key primary data [2–4,22, etc.] allows for independent modeling to be done by people such as this author. It is the author’s belief that if we all work together, our best ideas can be combined to defeat this pandemic.
  • The logistic function is supposed to apply to covid19, and therefore should in principle have predictive power. Fitting exponential data linearly has flaws in that the most recent data is emphasized and the earlier data ignored. To address these issues, in this work, logistic function modeling was applied to case data from multiple countries, using Microsoft Excel solver to do least squares fitting to the logarithm of the three key variables in the logistic equation, the total affected population, k (rate) and the midpoint/inflection point date.
  • A two-step logistic model can be applied to model best practice cases. China has been shown to have k=0.15 through enforced social distancing and South Korea has been shown to have k=0.05 through “smart mitigation”.
  • A single logistic function fails to effectively predict the inflection point or the saturation point for k greater than approximately 0.3. For values much larger than this, an acute epidemic is perceived and the growth is indistinguishable from exponential, and leads to “phantom” population saturation predictions which are not realized. Only later, when strong mitigation strategies are in place, is there any real predictive power. Willful negligence of social distancing is a recipe for sustained high k.
  • Using k=0.15 in other countries’ data once lockdown is in place seems to be a potentially useful predictor of behavior going forward for countries which enact social distancing measures. This is a central hypothesis formed as a conclusion of this work [Apr 3 update: the Italy data seems to be conforming if plots further in time than Fig 14 are plotted against the same prediction].
  • The testing per covid19 incidence ratio is the best single predictor of range of cases, critical cases and mortality. It is not the absolute number of tests, but rather the testing per covid19 incidence ratio. Lots of testing later on does not compensate for testing earlier on, where the same number of tests are associated with a higher test to case ratio. It is merely playing catch-up when already late.
  • Tests per capita is the second best predictor of range of cases, critical cases and mortality. The sooner within the epidemic testing is widely invoked, the stronger the effect, because the test to case ratio is higher.

Further work

Beyond the scope of this work, but of critical importance:

  1. What is the hospital capacity and what rate of critical care can the system tolerate today, or must the system adapt to? This would involve analysis of critical care population, for how long a critical care episode lasts, and the local capacity (e.g. within New York City or certain areas of Italy) for said critical care.
  2. A key question: what resources would need to be put into place and what would be the most effective way to (1) identify and track suspected covid19 movement and (2) test and report local human temperatures on a wide scale? A start with (1) has been made with [30] and a start with (2) has been made with [31]. What other timely innovations in this era of internet connectivity can be made?
  3. Masks — not addressed in this treatment. How do masks factor into the logistic function treatment? Effect on k, for instance?
  4. More comprehensive and not attempted here — do the logistic curve trends for other countries also follow the patterns uncovered and asserted in this work?
  5. What would k, midpoint and saturation point look like for important candidate countries as a function of time? Either from inception, a key event or perhaps a rolling two week period? Would this sort of treatment lead to a better prediction of inflection point and final saturation point? This work hypothesizes, on the basis of sparse anecdotal modeling attempts, that k may have an initial increase, then a decrease, and that the saturation point increases, then flattens. Perhaps the combination of the two can be a useful predictor.

Acknowledgment

The author is indebted to S.Bradshaw and R.Clayton and others for stimulating discussions, and to D.Mont, T.Hoyt, K.Liu, M.Bongiovanni-Corbet and D. Tarof for useful comments which have been incorporated into this work. The data and its analysis are the sole responsibility of the author and any errors are mine, despite the help of all these people.

A further note — readers are pointing out typos, opportunities to clarify, small edits etc. — I’ve been updating as these observations come to my attention — thx, all.

References

  1. https://en.wikipedia.org/wiki/Logistic_function
  2. https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6
  3. https://www.worldometers.info/coronavirus/
  4. https://www.statnews.com/2020/03/26/covid-19-tracker/
  5. https://ourworldindata.org/grapher/covid-confirmed-cases-since-100th-case
  6. Wu Z, McGoogan JM. Characteristics of and Important Lessons From the Coronavirus Disease 2019 (COVID-19) Outbreak in China: Summary of a Report of 72 314 Cases From the Chinese Center for Disease Control and Prevention. JAMA. Published online February 24, 2020. doi:10.1001/jama.2020.2648
  7. https://medium.com/@tomaspueyo/coronavirus-act-today-or-people-will-die-f4d3d9cd99ca
  8. https://www.worldometers.info/coronavirus/how-to-interpret-feb-12-case-surge/
  9. https://www.who.int/news-room/q-a-detail/q-a-coronaviruses
  10. https://www.npr.org/sections/goatsandsoda/2020/03/26/821688981/how-south-korea-reigned-in-the-outbreak-without-shutting-everything-down
  11. https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_South_Korea
  12. https://www.washingtonpost.com/world/asia_pacific/japan-targets-coronavirus-testing-while-south-korea-goes-big-the-us-faces-which-path-to-take/2020/03/28/97e81b44-6eb6-11ea-a156-0048b62cdb51_story.html
  13. https://www.nytimes.com/2020/03/23/world/asia/coronavirus-south-korea-flatten-curve.html
  14. https://www.youtube.com/watch?v=YiNGNvP5Cnk
  15. https://www.usatoday.com/story/travel/destinations/2020/03/19/spring-break-beaches-florida-look-packed-despite-coronavirus-spread/2873248001/
  16. https://www.cbsnews.com/news/coronavirus-florida-spring-break-test-positive-covid-19-college-students-not-social-distancing-university-of-tampa/
  17. https://time.com/5812654/florida-pastor-coronavirus-arrest-services/
  18. https://www.bostonglobe.com/2020/03/17/metro/biogen-host-boston-conference-linked-dozens-coronavirus-cases-donates-10m-pandemic-response/
  19. https://www.nbcboston.com/news/coronavirus/boston-biogen-meeting-coronavirus-warning/2089438/ and https://ourworldindata.org/covid-testing
  20. https://www.nytimes.com/2020/03/30/us/coronavirus-funeral-albany-georgia.html
  21. https://www.nytimes.com/2020/03/23/us/coronavirus-westport-connecticut-party-zero.html
  22. https://ourworldindata.org/coronavirus
  23. https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Canada
  24. https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_the_United_States#Lockdowns and https://en.wikipedia.org/wiki/2020_Italy_coronavirus_lockdown
  25. https://www.worldometers.info/demographics/life-expectancy/
  26. https://www.livescience.com/why-italy-coronavirus-deaths-so-high.html
  27. https://towardsdatascience.com/covid-19-infection-in-italy-mathematical-models-and-predictions-7784b4d7dd8d
  28. https://ourworldindata.org/grapher/covid-19-tests-country
  29. https://www.worldometers.info/world-population/population-by-country/
  30. https://www.nbcnewyork.com/news/local/mount-sinai-launches-new-app-to-track-covid-19-in-nyc-asks-all-new-yorkers-to-help/2354621/
  31. https://www.washingtonpost.com/national/health-science/start-ups-health-weather-map-may-help-forecast-spread-of-diseases-like-covid-19/2020/03/26/36c069b8-6ef0-11ea-a3ec-70d7479d83f0_story.html

Appendix A — review of the logistic function

If you’re already familiar with the logistic function, you can skip this section. Or you can read quickly.

Our study begins with the mathematical presentation of the key variables in the logistic equation, presented here as follows:
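
In the standard form given in [1], the logistic function can be written as below. The symbol names are chosen here for convenience: N(t) is the cumulative number of cases at date t, L is the saturation level (the percentage of the population affected multiplied by the population), k is the rate factor and t_0 is the midpoint date.

```latex
N(t) = \frac{L}{1 + e^{-k\,(t - t_0)}}
```

At t = t_0 the exponential equals 1, so N(t_0) = L/2: the midpoint is the date on which half of the eventual cases have occurred.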

This mathematical expression relates the number of covid19 cases to three key variables:

(1) The percentage of the population affected. How many of a given population will contract covid19?

(2) The time midpoint or time inflection point. This is a specific date. Beyond it, the number of cumulative cases still increases, but the rate at which those cases increase slows down, and some sort of an end is in sight.

(3) The rate factor k. A higher k value means a faster acceleration and a lower k value means a slower acceleration.

How do these variables relate?

First, let’s consider a constant midpoint date and a particular population. In this example case, Apr 14, 2020 was chosen as the midpoint and the population was chosen as 100M. The graphs below show the number of cases vs date on both a logarithmic and a linear scale. It is instructive to study both. The linear curve (2nd graph) shows the “S”-curve shape of the logistic function. The S is sharper/steeper for higher k, more drawn out for lower k, of greater magnitude for a higher percentage of the population and of lower magnitude for a lower percentage of the population. These two graphs serve to demonstrate the behavior of the two key variables, k (rate) and percentage of affected population. Most of the graphs shown in this work incorporate both sets of graphs so the reader can evaluate both views.

Fig A1: Basic logistic math using a common midpoint. A hypothetical midpoint of Apr 14, 2020 is used as an example. For a population of 100M, different scenarios are used. The central scenario is k=0.3, 0.5% population affected total. Green dash is for flatter curve (k=0.2) and red dash is for more accelerated curve (k=0.4). The black dash-dot is for double the percent of population and the light blue dash-dot is for half the percent of the population. Two versions are shown: logarithmic scale and linear scale.

More useful in predictive power, though, is visualizing how the number of cases will increase with time from a given starting point. Here the same values of k and population fraction are involved. However, in this case, the starting point, rather than the midpoint, is common. All examples show 4 cases on Feb 15, 2020 and simulate how the cases evolve as either k is changed or the fraction of the population is changed. In this case, the midpoint (the halfway point on a linear scale, and 3dB down, i.e. 30% of a major division, on the log10 scale) is different for each case. And it is this midpoint, or inflection point, which shows the beginning of the end, which is of primary importance in any community or country which is experiencing the covid19 disease. This is shown in the next two graphs, again on logarithmic and linear scales. It is worth studying these to understand the impact of k and fraction of population.

Fig A2: Basic logistic math using a common starting point. A hypothetical starting point (4 cases) of Feb 15, 2020 is used as an example. For a population of 100M, different scenarios are used. The central scenario is k=0.3, 0.5% population affected total. Green dash is for flatter curve (k=0.2) and red dash is for more accelerated curve (k=0.4). The black dash-dot is for double the percent of population and the light blue dash-dot is for half the percent of the population. Two versions are shown: logarithmic scale and linear scale. The midpoint of the central scenario is the same as for the previous figure, but the midpoints of every other scenario are different.
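
For readers who want the relation behind the common starting point view, the midpoint can be obtained by solving the standard logistic form from [1] for the midpoint date. The symbols are names chosen here: N_s cases on the starting date t_s, saturation level L, rate k and midpoint t_0.

```latex
N_s = \frac{L}{1 + e^{-k\,(t_s - t_0)}}
\quad\Longrightarrow\quad
t_0 = t_s + \frac{1}{k}\,\ln\!\left( \frac{L}{N_s} - 1 \right)
```

For a fixed starting point, a higher k pulls the midpoint earlier, while a larger saturation level pushes it later, but only logarithmically.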


Larry Tarof

Larry is a semiconductor physicist by day and a musician (piano/voice/guitar, “Dr L’s Music”) evenings/weekends. He should someday update his LinkedIn profile.