Abstract
The emerging 2019 novel coronavirus (2019-nCoV) has taken the world by storm. This disease is not the same as the coronaviruses that commonly circulate among humans and cause mild illness, such as the common cold; a diagnosis of coronavirus 229E, NL63, OC43, or HKU1 is not the same as a diagnosis of 2019-nCoV. The new coronavirus therefore leaves many questions unanswered. Big Data is an important tool that health care officials are using to track and control the disease. This paper deploys a modular regression time-series function to forecast the confirmed cases, deaths, and recovered cases of COVID-19 worldwide, based on data published by WHO and multiple international government organizations.
- Introduction
The 2019 Novel Coronavirus (2019-nCoV) is a respiratory illness first confirmed in early December 2019 in Wuhan, Hubei Province, China. The Chinese government initially reported that patients had some link to a large seafood and animal market, suggesting animal-to-person spread. The disease has since been confirmed to spread from person to person as well [1]. The disease is novel with regard to its molecular differences from other identified coronaviruses such as 229E, NL63, OC43, and HKU1; in addition, as of the writing of this article, no confirmed vaccine or antiviral medicine has been made available to treat it [2]. It is no surprise that health organizations around the world have focused their attention on controlling the spread of 2019-nCoV.
Multiple governments and government health organizations, such as the World Health Organization (WHO), the National Health Commission of the People’s Republic of China (NHC), China CDC (CCDC), the Hong Kong Department of Health, the Macau Government, Taiwan CDC, US CDC, the Government of Canada, the Australian Government Department of Health, the European Centre for Disease Prevention and Control (ECDC), the Ministry of Health Singapore (MOH), and the Italy Ministry of Health, have published public data about the confirmed cases, deaths, and recovered cases among their respective citizens (WHO, however, reports cases worldwide). The Johns Hopkins University Center for Systems Science and Engineering (supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab) has compiled these sources into the 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE [3,4]. This dataset can be used for data analysis, manipulation, and forecasting.
For data visualization and forecasting, this paper uses Python, a popular programming language. Using Plotly, a Python graphing library [5], and the 2019-nCoV Johns Hopkins Data Repository, the number of confirmed cases, deaths, and recovered cases can be visualized; all three quantities appear to be increasing exponentially (Figure 1).
Figure 1 Worldwide Confirmed Cases, Deaths, and Recovery cases from 1/22/20-03/24/20.
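A minimal sketch of how a chart like Figure 1 could be drawn with Plotly is given here. It is not the code used for the paper: the values in `totals` are made-up placeholders rather than repository figures, and the helper names `build_traces` and `make_figure` are hypothetical.

```python
# Illustrative worldwide totals keyed by date; placeholder values only,
# NOT taken from the Johns Hopkins repository.
totals = {
    "date": ["2020-01-22", "2020-01-23", "2020-01-24"],
    "Confirmed": [100, 150, 230],
    "Deaths": [2, 3, 5],
    "Recovered": [1, 4, 9],
}


def build_traces(data):
    """Return one Plotly scatter-trace spec (a plain dict) per series."""
    return [
        {
            "type": "scatter",
            "mode": "lines+markers",
            "name": name,
            "x": data["date"],
            "y": data[name],
        }
        for name in ("Confirmed", "Deaths", "Recovered")
    ]


def make_figure(traces):
    """Build the figure; plotly is imported here so that build_traces
    can be used even where plotly is not installed."""
    import plotly.graph_objects as go

    fig = go.Figure(data=traces)
    fig.update_layout(
        title="Worldwide confirmed cases, deaths, and recoveries",
        xaxis_title="Date",
        yaxis_title="Cumulative count",
    )
    return fig


traces = build_traces(totals)
# make_figure(traces).show()  # uncomment to render the chart
```

Keeping the trace specification separate from the figure construction makes it easy to inspect what will be plotted before rendering.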
This paper focuses on forecasting confirmed cases, deaths, and recovered cases. A recovered case is defined here as one that meets all of the following criteria: two negative swab tests on consecutive days; an absence of fever, without the use of fever-reducing medication, for three full days; improvement in other symptoms, such as coughing and shortness of breath; and a period of seven full days since symptoms first appeared [6].
Data forecasting is the task of extrapolating data points further in time. It has been used in a multitude of fields, ranging from predicting fossil locations to predicting astronomical anomalies [7,8]. Forecasting is a useful way to make sense of large amounts of data and to help anticipate future events. There are many forecasting models, such as the drift method, the seasonal naïve approach, support vector machines, and artificial neural networks [9]. In this paper we use Python to apply a very common forecasting approach: time series forecasting.
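As a point of reference for the simpler models named above, the drift method can be sketched in a few lines: it extends the straight line drawn between the first and last observations. This is only an illustrative baseline, not the model used in this paper; the series is a made-up placeholder and `drift_forecast` is a hypothetical helper name.

```python
def drift_forecast(history, horizon):
    """Drift method: y[T+h] = y[T] + h * (y[T] - y[1]) / (T - 1)."""
    if len(history) < 2:
        raise ValueError("need at least two observations")
    # Average change per step between the first and last observation.
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + h * slope for h in range(1, horizon + 1)]


# Placeholder cumulative counts, not real data.
series = [10, 14, 19, 25, 30]
print(drift_forecast(series, 3))
```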
- Materials and Methodology
The data repository provided by the Johns Hopkins University Center for Systems Science and Engineering is the primary dataset for the forecasting model presented in this paper. The dataset used in this paper was last updated on 03/24/20. It includes the following variables: serial number, observation date, province/state, country/region, last update, confirmed cases, deaths, and recovered cases.
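A sketch of how per-region rows might be collapsed into worldwide daily totals with pandas follows. The column names (`ObservationDate`, `Country/Region`, `Confirmed`, `Deaths`, `Recovered`) are assumptions about how the variables listed above are labeled, the rows are placeholders rather than real data, and `worldwide_totals` is a hypothetical helper.

```python
import pandas as pd

# Placeholder rows using ASSUMED column names; values are not real data.
raw = pd.DataFrame(
    {
        "ObservationDate": ["01/22/2020", "01/22/2020", "01/23/2020", "01/23/2020"],
        "Country/Region": ["A", "B", "A", "B"],
        "Confirmed": [10, 5, 14, 9],
        "Deaths": [0, 1, 1, 1],
        "Recovered": [0, 0, 2, 1],
    }
)


def worldwide_totals(df):
    """Sum the three cumulative counts over all regions for each date."""
    out = df.copy()
    out["ObservationDate"] = pd.to_datetime(out["ObservationDate"], format="%m/%d/%Y")
    return (
        out.groupby("ObservationDate")[["Confirmed", "Deaths", "Recovered"]]
        .sum()
        .sort_index()
        .reset_index()
    )


totals = worldwide_totals(raw)
print(totals)
```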
Python 3.5 and the following libraries are used to implement the forecasting model: pandas, numpy, seaborn, matplotlib, plotly, fbprophet, and pycountry.
The model implemented in this paper is adapted from Facebook Core AI’s Prophet forecasting paper [10]. It is a time series forecasting model based on an additive procedure in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. Prophet works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
The Prophet model is defined as (Figure 2):

y(t) = g(t) + s(t) + h(t) + ε(t)

where g(t) is the trend function, which models non-periodic changes in the value of the time series; s(t) represents periodic changes (e.g., weekly and yearly seasonality); and h(t) represents the effects of holidays, which occur on potentially irregular schedules over one or more days. The error term ε(t) represents any idiosyncratic changes that are not accommodated by the model. It is important to note, however, that the logistic growth model used for the trend is a special case of the generalized logistic growth curve, which is itself only one type of sigmoid curve. The Prophet model is implemented using the Facebook Prophet library, which facilitates the forecasting process by simply feeding in the previous data.
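The additive structure described above can be illustrated numerically. The sketch below is a simplified toy, not Prophet's implementation: it uses a plain logistic curve for the trend (Prophet's own trend also supports changepoints and a time-varying capacity), a single Fourier term for weekly seasonality, no holiday effects, and no noise. All parameter values are arbitrary assumptions.

```python
import numpy as np


def logistic_trend(t, capacity, rate, midpoint):
    """g(t): logistic growth curve C / (1 + exp(-k * (t - m)))."""
    return capacity / (1.0 + np.exp(-rate * (t - midpoint)))


def weekly_seasonality(t, a, b):
    """s(t): first-order Fourier term with a 7-day period."""
    angle = 2.0 * np.pi * t / 7.0
    return a * np.cos(angle) + b * np.sin(angle)


t = np.arange(0, 28)  # four weeks of daily time steps
g = logistic_trend(t, capacity=1000.0, rate=0.3, midpoint=14.0)
s = weekly_seasonality(t, a=5.0, b=-3.0)
h = np.zeros_like(g)  # no holiday effects in this toy example

# Additive combination; the error term is omitted here.
y = g + s + h
```

Because the components are simply summed, each one can be inspected or replaced independently, which is the property that makes the model "modular".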
The model forecasts confirmed cases, deaths, and recovered cases from 04-01-2020 to 04-15-2020 using Prophet with 95% uncertainty intervals. No seasonality tuning was performed, and no additional regression terms were added.
Figure 2 The Facebook Prophet time series model summarized.
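The forecasting step could look roughly like the sketch below, which is an assumption-laden illustration and not the authors' code. It relies on Prophet's documented interface (`Prophet(interval_width=...)`, `fit`, `make_future_dataframe`, `predict`); the column name `ObservationDate` and the helper names `to_prophet_frame`, `horizon_days`, and `fit_and_forecast` are hypothetical.

```python
import pandas as pd


def to_prophet_frame(totals, value_col):
    """Prophet expects two columns: 'ds' (dates) and 'y' (values)."""
    return totals[["ObservationDate", value_col]].rename(
        columns={"ObservationDate": "ds", value_col: "y"}
    )


def horizon_days(last_observed, forecast_end):
    """Number of daily steps from the last observation to the end date."""
    return (pd.Timestamp(forecast_end) - pd.Timestamp(last_observed)).days


def fit_and_forecast(frame, periods):
    """Fit Prophet with default settings and 95% uncertainty intervals."""
    try:
        from prophet import Prophet  # newer package name
    except ImportError:
        from fbprophet import Prophet  # package name listed in this paper

    model = Prophet(interval_width=0.95)
    model.fit(frame)
    future = model.make_future_dataframe(periods=periods)
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]


# Data end on 03/24/20 and the forecast runs through 04/15/20.
periods = horizon_days("2020-03-24", "2020-04-15")
# forecast = fit_and_forecast(to_prophet_frame(totals, "Confirmed"), periods)
# print(forecast.tail(15))  # rows for 04-01-2020 through 04-15-2020
```

The same call would be repeated with the deaths and recovered columns to obtain the other two forecasts.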
- Results
Using Prophet, confirmed cases, deaths, and recovered cases are forecast from 04/01/20 to 04/15/20; each forecast consists of a predicted value together with lower and upper uncertainty bounds. The results for confirmed cases are shown in Table I and graphed in Figure 3. The results for deaths are shown in Table II and graphed in Figure 4. The results for recovered cases are shown in Table III and graphed in Figure 5.
Table I Forecasted confirmed cases by date including lower and upper uncertainty predictions
| Date | Predicted | Lower | Upper |
|---|---|---|---|
| 04-01-20 | 453079.445301 | 419513.649591 | 489212.379745 |
| 04-02-20 | 469124.886850 | 433109.124770 | 505562.892242 |
| 04-03-20 | 486153.186869 | 447807.002236 | 527074.312065 |
| 04-04-20 | 502529.593721 | 463820.128418 | 541716.175332 |
| 04-05-20 | 519217.466290 | 477959.049949 | 559228.976027 |
| 04-06-20 | 536899.806302 | 493526.477808 | 577874.905225 |
| 04-07-20 | 553624.325152 | 507990.543179 | 599764.276492 |
| 04-08-20 | 558972.941367 | 512067.448528 | 603994.585677 |
| 04-09-20 | 575018.382917 | 529097.743649 | 621851.808099 |
| 04-10-20 | 592046.682936 | 543933.863237 | 644608.498571 |
| 04-11-20 | 608423.089788 | 554294.063989 | 662345.878965 |
| 04-12-20 | 625110.962356 | 568976.305970 | 684913.009929 |
| 04-13-20 | 642793.302369 | 579227.765196 | 704431.586162 |
| 04-14-20 | 659517.821218 | 597737.597275 | 720024.128186 |
| 04-15-20 | 664866.437434 | 603810.623149 | 726738.134273 |
Table II Forecasted deaths by date including lower and upper uncertainty predictions
| Date | Predicted | Lower | Upper |
|---|---|---|---|
| 04-01-20 | 22585.187266 | 21178.557044 | 24080.738468 |
| 04-02-20 | 23417.122506 | 21972.416925 | 24916.567463 |
| 04-03-20 | 24328.257159 | 22606.544818 | 26014.643357 |
| 04-04-20 | 25259.147869 | 23540.297159 | 27038.111695 |
| 04-05-20 | 26206.462505 | 24432.292052 | 28165.841966 |
| 04-06-20 | 27197.541807 | 25047.127419 | 29000.991058 |
| 04-07-20 | 28230.167361 | 25946.760013 | 30254.327814 |
| 04-08-20 | 28832.410283 | 26404.584039 | 30966.692089 |
| 04-09-20 | 29664.345523 | 27207.506090 | 32099.965154 |
| 04-10-20 | 30575.480176 | 27940.697342 | 33150.336532 |
| 04-11-20 | 31506.370885 | 28739.088122 | 34265.891853 |
| 04-12-20 | 32453.685522 | 29468.474615 | 35504.261682 |
| 04-13-20 | 33444.764824 | 30155.166010 | 36665.576906 |
| 04-14-20 | 34477.390378 | 31011.621609 | 37981.901815 |
| 04-15-20 | 35079.633299 | 31276.591941 | 38824.239081 |
Table III Forecasted recovered cases by date including lower and upper uncertainty predictions
| Date | Predicted | Lower | Upper |
|---|---|---|---|
| 04-01-20 | 121728.856961 | 118364.783725 | 125295.153700 |
| 04-02-20 | 124095.768386 | 120061.339507 | 128195.299243 |
| 04-03-20 | 126560.755391 | 122519.559616 | 131212.448196 |
| 04-04-20 | 129609.682297 | 124625.420052 | 134590.361581 |
| 04-05-20 | 132473.766993 | 127668.713193 | 137927.978113 |
| 04-06-20 | 135096.620843 | 129442.154843 | 140453.558490 |
| 04-07-20 | 138229.146223 | 131994.325687 | 144038.737917 |
| 04-08-20 | 140073.486312 | 133492.175692 | 146196.948247 |
| 04-09-20 | 142440.397737 | 135479.818246 | 149118.426320 |
| 04-10-20 | 144905.384742 | 136716.940202 | 152145.464183 |
| 04-11-20 | 147954.311649 | 139758.738004 | 156329.311780 |
| 04-12-20 | 150818.396345 | 142215.670193 | 159890.308617 |
| 04-13-20 | 153441.250194 | 144086.838825 | 163330.860406 |
| 04-14-20 | 156573.775575 | 146928.498743 | 167165.276032 |
| 04-15-20 | 158418.115664 | 147715.166883 | 169698.813546 |
Figure 3 Confirmed cases plotted and forecasted over time.
Figure 4 Deaths plotted and forecasted over time.
Figure 5 Recovered cases plotted and forecasted over time.
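One quick way to read Table I is to look at the day-over-day increments of the predicted values. The sketch below uses only the numbers printed in that table. The increments repeat with a 7-day period, which is what one would expect if the extrapolated trend is locally linear and only a weekly seasonal term varies from day to day. This is an observation about the published numbers, not a statement about how the model was configured.

```python
# Predicted confirmed cases for 04-01-20 through 04-15-20, copied from Table I.
predicted = [
    453079.445301, 469124.886850, 486153.186869, 502529.593721,
    519217.466290, 536899.806302, 553624.325152, 558972.941367,
    575018.382917, 592046.682936, 608423.089788, 625110.962356,
    642793.302369, 659517.821218, 664866.437434,
]

# Day-over-day increments of the forecast (14 values for 15 dates).
increments = [b - a for a, b in zip(predicted, predicted[1:])]

# Compare each increment with the one seven days later.
weekly_gap = [abs(increments[i + 7] - increments[i]) for i in range(7)]

for day, inc in enumerate(increments, start=1):
    print(f"step {day:2d}: +{inc:,.1f}")
print("largest week-to-week difference:", max(weekly_gap))
```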
- Discussion
The results show clear upward trends in all three series. The graphs illustrate the projected course of COVID-19's spread worldwide. Note that the forecasts are based on data last updated on 03/24/20; the more recent the data, the more accurately the model can be expected to perform. Because confirmed cases, deaths, and recovered cases change minute by minute, the forecasts presented here may become inaccurate.
It will be interesting to compare the model's forecasts with the observed counts as cases continue to increase every hour and every day. Another factor to account for is a spike in reported cases (possibly resulting from a lack of testing in some countries in earlier months), which may also throw the model off.
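Such a comparison could be made concrete with two simple measures once observed counts are available: the mean absolute percentage error of the point forecast, and the share of observations that fall inside the forecast interval. The sketch below is one possible way to compute them; the numbers are placeholders for illustration only, and `mape` and `coverage` are hypothetical helper names.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted):
        raise ValueError("series must have the same length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def coverage(actual, lower, upper):
    """Fraction of observations that fall inside the forecast interval."""
    inside = sum(lo <= a <= hi for a, lo, hi in zip(actual, lower, upper))
    return inside / len(actual)


# Placeholder numbers for illustration only; not real observations.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
lower = [90.0, 170.0, 410.0]
upper = [120.0, 210.0, 450.0]

print(mape(actual, predicted))
print(coverage(actual, lower, upper))
```

For a well-calibrated 95% interval, the coverage over many forecast dates would be expected to sit near 0.95.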
- Conclusion
In this paper, data from a reliable data repository are processed and visualized, and Python is used to apply the Prophet time series prediction model to the data. Data analysis shows clear upward trends that follow a specific slope/pattern. The fit of the model can be evaluated in real time as the cases continue to increase. Future work may include introducing a recurrent neural network to adjust the parameters of the Prophet model based on its day-by-day predictive performance.
The code for this paper can be found here.
- References
[1] "What You Need To Know About The Coronavirus 2019". 2020. Cdc.Gov. https://www.cdc.gov/coronavirus/2019-ncov/downloads/2019-ncov-factsheet.pdf.
[2] "Coronavirus". 2020. Who.Int. https://www.who.int/emergencies/diseases/novel-coronavirus-2019.
[3] Dong, Ensheng, Hongru Du, and Lauren Gardner. 2020. "An Interactive Web-Based Dashboard To Track COVID-19 In Real Time". The Lancet Infectious Diseases. doi:10.1016/s1473-3099(20)30120-1.
[4] "Cssegisanddata/COVID-19". 2020. Github. https://github.com/CSSEGISandData/COVID-19.
[5] "Plotly Python Graphing Library". 2020. Plotly.Com. https://plotly.com/python/.
[6] "Coronavirus Disease 2019 (COVID-19)". 2020. Centers For Disease Control And Prevention. https://www.cdc.gov/coronavirus/2019-ncov/about/index.html.
[7] Block, Sebastián, Frédérik Saltré, Marta Rodríguez-Rey, Damien A. Fordham, Ingmar Unkel, and Corey J. A. Bradshaw. 2016. "Where To Dig For Fossils: Combining Climate-Envelope, Taphonomy And Discovery Models". PLOS ONE 11 (3): e0151090. doi:10.1371/journal.pone.0151090.
[8] Feigelson, Eric D., G. Jogesh Babu, and Gabriel A. Caceres. 2018. "Autoregressive Times Series Methods For Time Domain Astronomy". Frontiers In Physics 6. doi:10.3389/fphy.2018.00080.
[9] Chretien, Jean-Paul, Dylan George, Jeffrey Shaman, Rohit A. Chitale, and F. Ellis McKenzie. 2014. "Influenza Forecasting In Human Populations: A Scoping Review". PLOS ONE 9 (4): e94130. doi:10.1371/journal.pone.0094130.
[10] Taylor, Sean J, and Benjamin Letham. 2017. "Prophet: Forecasting At Scale". doi:10.7287/peerj.preprints.3190v2.