The emerging 2019 novel coronavirus (2019-nCoV) has taken the world by storm. This disease is not the same as the coronaviruses commonly found among humans, such as 229E, NL63, OC43, and HKU1, which cause mild illnesses like the common cold; a diagnosis of one of those is not a diagnosis of 2019-nCoV. The new coronavirus therefore leaves many questions unanswered. Big Data is an important tool that health care officials are using to track and control the disease. This paper deploys a modular regression time-series model to forecast worldwide confirmed cases, deaths, and recovered cases of COVID-19, based on data published by the WHO and multiple international government organizations.
The 2019 Novel Coronavirus (2019-nCoV) is a respiratory illness first confirmed in early December 2019 in Wuhan, Hubei Province, China. The Chinese government initially reported that patients had some link to a large seafood and animal market, suggesting animal-to-person spread. The disease has since been confirmed to spread from person to person as well. It is novel in its molecular differences from other identified coronaviruses such as 229E, NL63, OC43, and HKU1; in addition, as of the writing of this article, no confirmed vaccine or antiviral medicine is available to treat it. It is no surprise that health organizations around the world have focused their attention on controlling the spread of 2019-nCoV.
Multiple governments and government health organizations, such as the World Health Organization (WHO), the National Health Commission of the People’s Republic of China (NHC), China CDC (CCDC), the Hong Kong Department of Health, the Macau Government, Taiwan CDC, US CDC, the Government of Canada, the Australian Government Department of Health, the European Centre for Disease Prevention and Control (ECDC), the Ministry of Health Singapore (MOH), and the Italy Ministry of Health, have published public data on the confirmed cases, deaths, and recovered cases among their respective citizens (WHO, however, reports cases worldwide). The Johns Hopkins University Center for Systems Science and Engineering (supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab) has compiled these sources into the 2019 Novel Coronavirus COVID-19 (2019-nCoV) Data Repository by Johns Hopkins CSSE [3,4]. This dataset can be used for data analysis, manipulation, and forecasting.
This paper uses Python, a popular programming language, for data visualization and forecasting. Using Plotly, a Python graphing library, and the Johns Hopkins 2019-nCoV data repository, the numbers of confirmed cases, deaths, and recovered cases can be visualized; all three appear to grow roughly exponentially over the period shown (Figure 1).
Figure 1 Worldwide confirmed cases, deaths, and recovered cases from 01/22/20 to 03/24/20.
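One way to check the claim of roughly exponential growth is to fit a straight line to the logarithm of a series: an exponential curve is linear on a log scale, and the slope gives the daily growth rate. The sketch below is illustrative only. The function name `fit_exponential` is hypothetical, and the input is a synthetic series that doubles every four days, not the data behind Figure 1.

```python
import math

def fit_exponential(counts):
    """Least-squares fit of log(count) = a + r * day; returns (a, r)."""
    days = range(len(counts))
    logs = [math.log(c) for c in counts]
    n = len(counts)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    r = sxy / sxx
    a = mean_y - r * mean_x
    return a, r

# Synthetic series that doubles every 4 days (NOT real case counts).
synthetic_counts = [100 * 2 ** (day / 4) for day in range(20)]

intercept, daily_rate = fit_exponential(synthetic_counts)
doubling_time = math.log(2) / daily_rate
print(f"daily growth rate: {daily_rate:.4f}, doubling time: {doubling_time:.1f} days")
```

On real cumulative counts the fit will not be exact; a slope that declines over time would indicate growth slower than exponential.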
This paper focuses on forecasting confirmed cases, deaths, and recovered cases. A recovered case is defined here as one that meets all of the following criteria: two negative swab tests on consecutive days; an absence of fever, with no use of fever-reducing medication, for three full days; improvement in other symptoms, such as coughing and shortness of breath; and a period of seven full days since symptoms first appeared.
Forecasting is a common way to extrapolate data points further in time. It has been used in a multitude of fields, ranging from predicting fossil locations to predicting astronomical anomalies [7,8]. Forecasting helps make sense of large amounts of data and anticipate future events. There are many forecasting approaches, such as the drift method, the seasonal naïve approach, support vector machines, and artificial neural networks. In this paper we use Python to apply a very common approach: time series forecasting.
- Materials and Methodology
The data repository provided by the Johns Hopkins University Center for Systems Science and Engineering is the primary dataset for the forecasting model presented in this paper. The version of the dataset used here was last updated on 03/24/20. It includes the following variables: serial number, observation date, province/state, country/region, last update, confirmed cases, deaths, and recovered cases.
Python 3.5 and the following libraries are used to implement the forecasting model: pandas, numpy, seaborn, matplotlib, plotly, fbprophet, and pycountry.
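Because each row of the dataset describes one region on one observation date, worldwide totals require summing over regions for each date. A minimal pandas sketch of that step follows. The column names (`SNo`, `ObservationDate`, `Confirmed`, and so on), the date format, and the function name `worldwide_totals` are assumptions chosen to match the variables listed above, and the sample rows are made up.

```python
import io
import pandas as pd

# Tiny made-up sample in the assumed column layout (NOT real data).
SAMPLE_CSV = """SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
1,03/23/2020,Hubei,Mainland China,2020-03-23 23:19:34,100,10,50
2,03/23/2020,,Italy,2020-03-23 23:19:34,60,6,5
3,03/24/2020,Hubei,Mainland China,2020-03-24 23:37:31,101,11,55
4,03/24/2020,,Italy,2020-03-24 23:37:31,70,7,8
"""

def worldwide_totals(frame):
    """Sum the cumulative counts over all regions for each observation date."""
    frame = frame.copy()
    frame["ObservationDate"] = pd.to_datetime(
        frame["ObservationDate"], format="%m/%d/%Y"
    )
    return (
        frame.groupby("ObservationDate")[["Confirmed", "Deaths", "Recovered"]]
        .sum()
        .sort_index()
    )

raw = pd.read_csv(io.StringIO(SAMPLE_CSV))
totals = worldwide_totals(raw)
print(totals)
```

The resulting frame has one row per date, which is the shape a daily time-series model needs.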
The model implemented in this paper is adapted from Facebook Core AI’s Prophet Forecasting paper. It consists of a forecasting time series model based on an additive procedure where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. Prophet works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
The Prophet model is defined as (Figure 2):
y(t) = g(t) + s(t) + h(t) + ε(t)
where g(t) is the trend function, which models non-periodic changes in the value of the time series; s(t) represents periodic changes (e.g., weekly and yearly seasonality); and h(t) represents the effects of holidays, which occur on potentially irregular schedules over one or more days. The error term ε(t) represents any idiosyncratic changes that are not accommodated by the model. It is important to note, however, that the logistic growth model used for the trend is a special case of the generalized logistic growth curve, which is itself only one type of sigmoid curve. The model is implemented using the Facebook Prophet library, which facilitates the forecasting process by simply feeding in the previous data (
The model forecasts confirmed cases, deaths, and recovered cases from 04/01/2020 to 04/15/2020 using Prophet with 95% uncertainty intervals. Seasonality settings were left at their defaults, and no additional regressors were added.
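The forecasting setup described above might look like the following sketch. It assumes that the 95% intervals correspond to Prophet's `interval_width=0.95` argument and that the forecast horizon is counted in days from the last observation (03/24/20) to 04/15/20. The helper names `to_prophet_frame`, `forecast`, and `run_example` are hypothetical, and the history in `run_example` is a placeholder rather than the real series.

```python
from datetime import date
import pandas as pd

LAST_OBSERVATION = date(2020, 3, 24)   # last update of the dataset
FORECAST_END = date(2020, 4, 15)       # end of the forecast window

def to_prophet_frame(dates, values):
    """Prophet expects a two-column frame: 'ds' (dates) and 'y' (values)."""
    return pd.DataFrame({"ds": pd.to_datetime(list(dates)), "y": list(values)})

def forecast(history, periods, interval_width=0.95):
    """Fit Prophet with default settings and predict `periods` days ahead."""
    try:
        from prophet import Prophet      # package name since version 1.0
    except ImportError:
        from fbprophet import Prophet    # older package name, as listed in the paper
    model = Prophet(interval_width=interval_width)
    model.fit(history)
    future = model.make_future_dataframe(periods=periods)
    prediction = model.predict(future)
    return prediction[["ds", "yhat", "yhat_lower", "yhat_upper"]]

# Number of daily steps between the last observation and the forecast end.
periods = (FORECAST_END - LAST_OBSERVATION).days
print("forecast horizon in days:", periods)

def run_example():
    """Requires Prophet to be installed; uses placeholder data, not real counts."""
    days = pd.date_range("2020-01-22", "2020-03-24")
    history = to_prophet_frame(days, [100 * 1.1 ** i for i in range(len(days))])
    result = forecast(history, periods)
    # Keep only the reported window (04/01 to 04/15).
    return result[result["ds"] >= pd.Timestamp("2020-04-01")]
```

Calling `run_example()` returns one row per forecast date with the point prediction (`yhat`) and the lower and upper bounds (`yhat_lower`, `yhat_upper`). The same function can be applied separately to the confirmed, deaths, and recovered series.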
Figure 2 The Facebook Prophet time series model summarized.
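To make the additive structure concrete, the sketch below evaluates a toy version of y(t) = g(t) + s(t) + h(t) with the error term omitted. It is not the Prophet implementation: the trend uses the basic logistic form C / (1 + exp(-k(t - m))), the seasonality is a single sine term with a 7-day period, the holiday effect is a fixed offset, and every parameter value is arbitrary and chosen only for illustration.

```python
import math

def logistic_trend(t, capacity, rate, midpoint):
    """g(t): logistic growth curve that saturates at `capacity`."""
    return capacity / (1.0 + math.exp(-rate * (t - midpoint)))

def weekly_seasonality(t, amplitude):
    """s(t): a single sine term with a 7-day period."""
    return amplitude * math.sin(2.0 * math.pi * t / 7.0)

def holiday_effect(t, holidays, bump):
    """h(t): a fixed offset on days listed as holidays, zero otherwise."""
    return bump if t in holidays else 0.0

def additive_model(t, holidays=frozenset()):
    """y(t) = g(t) + s(t) + h(t), without the error term."""
    g = logistic_trend(t, capacity=1000.0, rate=0.3, midpoint=20.0)
    s = weekly_seasonality(t, amplitude=10.0)
    h = holiday_effect(t, holidays, bump=-25.0)
    return g + s + h

for day in (0, 20, 40):
    print(day, round(additive_model(day, holidays={20}), 2))
```

Because the components are simply summed, each one can be inspected or replaced on its own, which is what makes the model modular.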
Using Prophet, confirmed cases, deaths, and recovered cases are forecast from 04/01/20 to 04/15/20; each forecast consists of a predicted value together with lower and upper uncertainty bounds. The results for confirmed cases are shown in Table I and graphed in Figure 3. The results for deaths are shown in Table II and graphed in Figure 4. The results for recovered cases are shown in Table III and graphed in Figure 5.
Table I Forecasted confirmed cases by date including lower and upper uncertainty predictions
Table II Forecasted deaths by date including lower and upper uncertainty predictions
Table III Forecasted recovered cases by date including lower and upper uncertainty predictions
Figure 3 Confirmed cases plotted and forecasted over time.
Figure 4 Deaths plotted and forecasted over time.
Figure 5 Recovered cases plotted and forecasted over time.
The results show clear upward trends in all three series. The graphs further illustrate the projected course of COVID-19's spread worldwide. It is important to note that the forecasts are based on data last updated on 03/24/20; more recent data should yield more accurate forecasts. Because confirmed cases, deaths, and recovered cases change minute by minute, however, the model may be inaccurate.
It will be interesting to compare the model's forecasts against reported figures as cases continue to increase. Another factor to account for is a spike in reported cases (possibly caused by a lack of testing in some countries in earlier months), which may also throw the model off.
In this paper, data from a reliable repository are processed and visualized, and the Prophet time-series model is implemented in Python to forecast from those data. Data analysis shows clear upward trends that follow a consistent pattern. The fit of the model can be evaluated in real time as cases continue to increase. Future work may include introducing a recurrent neural network to adjust the parameters of the Prophet model based on its day-by-day predictive performance.
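Such a comparison could be quantified once reported figures for the forecast dates are available. The sketch below computes two common measures that the paper itself does not report: the mean absolute percentage error (MAPE) of the point forecasts, and the share of observations that fall inside the forecast interval. The function names are hypothetical and the numbers are invented for illustration; they are not taken from Tables I to III.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, as a percentage."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

def interval_coverage(actual, lower, upper):
    """Fraction of observations that lie inside [lower, upper]."""
    hits = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return hits / len(actual)

# Hypothetical numbers for illustration only (NOT from the paper's tables).
observed = [100.0, 200.0, 400.0, 800.0]
predicted = [110.0, 180.0, 400.0, 600.0]
lower = [90.0, 150.0, 350.0, 500.0]
upper = [130.0, 210.0, 450.0, 700.0]

mape = mean_absolute_percentage_error(observed, predicted)
coverage = interval_coverage(observed, lower, upper)
print(f"MAPE: {mape:.2f}%  coverage: {coverage:.2f}")
```

If the 95% intervals are well calibrated, coverage should be close to 0.95 over many forecast dates; much lower coverage would suggest the model is overconfident for this outbreak.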
The code for this paper can be found here.
[1] "What You Need To Know About The Coronavirus 2019". 2020. Cdc.Gov. https://www.cdc.gov/coronavirus/2019-ncov/downloads/2019-ncov-factsheet.pdf.
[2] "Coronavirus". 2020. Who.Int. https://www.who.int/emergencies/diseases/novel-coronavirus-2019.
[3] Dong, Ensheng, Hongru Du, and Lauren Gardner. 2020. "An Interactive Web-Based Dashboard To Track COVID-19 In Real Time". The Lancet Infectious Diseases. doi:10.1016/s1473-3099(20)30120-1.
[4] "Cssegisanddata/COVID-19". 2020. Github. https://github.com/CSSEGISandData/COVID-19.
[5] "Plotly Python Graphing Library". 2020. Plotly.Com. https://plotly.com/python/.
[6] "Coronavirus Disease 2019 (COVID-19)". 2020. Centers For Disease Control And Prevention. https://www.cdc.gov/coronavirus/2019-ncov/about/index.html.
[7] Block, Sebastián, Frédérik Saltré, Marta Rodríguez-Rey, Damien A. Fordham, Ingmar Unkel, and Corey J. A. Bradshaw. 2016. "Where To Dig For Fossils: Combining Climate-Envelope, Taphonomy And Discovery Models". PLOS ONE 11 (3): e0151090. doi:10.1371/journal.pone.0151090.
[8] Feigelson, Eric D., G. Jogesh Babu, and Gabriel A. Caceres. 2018. "Autoregressive Times Series Methods For Time Domain Astronomy". Frontiers In Physics 6. doi:10.3389/fphy.2018.00080.
[9] Chretien, Jean-Paul, Dylan George, Jeffrey Shaman, Rohit A. Chitale, and F. Ellis McKenzie. 2014. "Influenza Forecasting In Human Populations: A Scoping Review". PLOS ONE 9 (4): e94130. doi:10.1371/journal.pone.0094130.
[10] Taylor, Sean J, and Benjamin Letham. 2017. "Prophet: Forecasting At Scale". doi:10.7287/peerj.preprints.3190v2.