class: center, middle, inverse, title-slide # Large-Scale Time Series Forecasting ### Thiyanga S. Talagala ### R-Ladies Bergen, Norway ### 24 June 2021 --- ## About me <iframe src="https://thiyanga.netlify.app/" width="100%" height="400px"></iframe> Web: https://thiyanga.netlify.com/ --- background-image: url(img/jhu.png) background-size: contain --- class: inverse, center, middle background-image: url(img/jhu.png) background-position: 50% 60%1 background-size: contain
Let's visualize the coronavirus pandemic!
--- background-image: url(img/coronavirus.png) background-size: 90px background-position: 100% 6% # Data: coronavirus package ```r install.packages("coronavirus") # devtools::install_github("RamiKrispin/coronavirus") ``` ```r library(coronavirus) head(coronavirus, 8) ``` ``` date province country lat long type cases 1 2020-01-22 Afghanistan 33.93911 67.70995 confirmed 0 2 2020-01-22 Albania 41.15330 20.16830 confirmed 0 3 2020-01-22 Algeria 28.03390 1.65960 confirmed 0 4 2020-01-22 Andorra 42.50630 1.52180 confirmed 0 5 2020-01-22 Angola -11.20270 17.87390 confirmed 0 6 2020-01-22 Antigua and Barbuda 17.06080 -61.79640 confirmed 0 7 2020-01-22 Argentina -38.41610 -63.61670 confirmed 0 8 2020-01-22 Armenia 40.06910 45.03820 confirmed 0 ``` ---
--- ![](index_files/figure-html/unnamed-chunk-5-1.png)<!-- --> --- ![](index_files/figure-html/unnamed-chunk-6-1.png)<!-- --> --- class: split-70 hide-slide-number background-image: url("img/jhu.png") background-size: cover .column.slide-in-left[ .sliderbox.vmiddle.shade_main.center[ .font5[Time Series Features]]] .column[ ] --- # Time series features Transform a given time series `\(y=\{y_1, y_2, \cdots, y_n\}\)` to a feature vector `\(F = (f_1(y), f_2(y), \cdots, f_p(y))'\)`. ## Examples of time series features - strength of trend - strength of seasonality - lag-1 autocorrelation - spectral entropy - proportion of zeros More information: Talagala, T. S., Hyndman, R. J., & Athanasopoulos, G. (2018). Meta-learning how to forecast time series. Monash Econometrics and Business Statistics Working Papers, 6, 18. --- class: split-two white .column.bg-main1[.content.vmiddle.center[ ## Time-domain representation ![](index_files/figure-html/unnamed-chunk-7-1.png)<!-- --> ]] .column.bg-main2[.content.vmiddle.center[ ## Feature-domain representation ![](index_files/figure-html/unnamed-chunk-8-1.png)<!-- --> ]] --- class: split-two white .column.bg-white[.content[ ```r library(tsibble) norway <- confirmed %>% filter(country == "Norway") ``` ]] .column.bg-white[.content[ ]] --- class: split-two white .column.bg-white[.content[ ```r library(tsibble) norway <- confirmed %>% filter(country == "Norway") *norway.tsibble <- tsibble( # * date = as.Date("2020-01-22") + 0:491, # * Observation = norway$cases, # * index = date) # ``` ]] .column.bg-white[.content[ ]] --- class: split-two white .column.bg-white[.content[ ```r library(tsibble) norway <- confirmed %>% filter(country == "Norway") norway.tsibble <- tsibble( date = as.Date("2020-01-22") + 0:491, Observation = norway$cases, index = date) *norway.tsibble # ``` ``` # A tsibble: 492 x 2 [1D] date Observation <date> <dbl> 1 2020-01-22 0 2 2020-01-23 0 3 2020-01-24 0 4 2020-01-25 0 5 2020-01-26 0 6 2020-01-27 0 7 2020-01-28 0 8 2020-01-29 0 9 2020-01-30 0 10 2020-01-31 0 # … with 482 more rows ``` ]] .column.bg-white[.content[ ]] --- class: split-two white .column.bg-white[.content[ ```r library(tsibble) norway <- confirmed %>% filter(country == "Norway") norway.tsibble <- tsibble( date = as.Date("2020-01-22") + 0:491, Observation = norway$cases, index = date) norway.tsibble ``` ``` # A tsibble: 492 x 2 [1D] date Observation <date> <dbl> 1 2020-01-22 0 2 2020-01-23 0 3 2020-01-24 0 4 2020-01-25 0 5 2020-01-26 0 6 2020-01-27 0 7 2020-01-28 0 8 2020-01-29 0 9 2020-01-30 0 10 2020-01-31 0 # … with 482 more rows ``` ]] .column.bg-white[.content[ ```r *library(fable) # *autoplot(norway.tsibble) # ``` ![](index_files/figure-html/unnamed-chunk-14-1.png)<!-- --> ]] --- ## Compute features ```r *library(feasts) # *norway.tsibble %>% # * features(Observation, feature_set(tags = # * c("decomposition", "intermittent", "autocorrelation"))) %>% as.data.frame()# ``` ``` trend_strength seasonal_strength_week seasonal_peak_week seasonal_trough_week 1 0.9158947 0.4880091 0 5 spikiness linearity curvature stl_e_acf1 stl_e_acf10 acf1 acf10 1 5673.13 4291.447 767.7658 -0.206399 0.1942277 0.853502 6.380439 diff1_acf1 diff1_acf10 diff2_acf1 diff2_acf10 season_acf1 pacf5 1 -0.322936 0.1896912 -0.5713678 0.3516748 0.8460248 0.9376168 diff1_pacf5 diff2_pacf5 season_pacf zero_run_mean nonzero_squared_cv 1 0.453038 1.107245 0.2593937 6.833333 0.9778515 zero_start_prop zero_end_prop 1 0.07113821 0 ``` --- .pull-left[ ## tibble ```r confirmed ``` ``` # A tibble: 94,956 x 3 # Groups: country [193] country date cases <chr> <date> <dbl> 1 Afghanistan 2020-01-22 0 2 Afghanistan 2020-01-23 0 3 Afghanistan 2020-01-24 0 4 Afghanistan 2020-01-25 0 5 Afghanistan 2020-01-26 0 6 Afghanistan 2020-01-27 0 7 Afghanistan 2020-01-28 0 8 Afghanistan 2020-01-29 0 9 Afghanistan 2020-01-30 0 10 Afghanistan 2020-01-31 0 # … with 94,946 more rows ``` ] .pull-right[ ## tsibble ```r confirmed.tsibble <- confirmed %>% as_tsibble(index = date, key = country) confirmed.tsibble ``` ``` # A tsibble: 94,956 x 3 [1D] # Key: country [193] # Groups: country [193] country date cases <chr> <date> <dbl> 1 Afghanistan 2020-01-22 0 2 Afghanistan 2020-01-23 0 3 Afghanistan 2020-01-24 0 4 Afghanistan 2020-01-25 0 5 Afghanistan 2020-01-26 0 6 Afghanistan 2020-01-27 0 7 Afghanistan 2020-01-28 0 8 Afghanistan 2020-01-29 0 9 Afghanistan 2020-01-30 0 10 Afghanistan 2020-01-31 0 # … with 94,946 more rows ``` ] --- ## Features for all countries ```r features.all <- confirmed.tsibble %>% features(cases, feature_set(tags = c("decomposition", "intermittent", "autocorrelation"))) features.all ``` ``` # A tibble: 193 x 25 country trend_strength seasonal_strengt… seasonal_peak_we… seasonal_trough… <chr> <dbl> <dbl> <dbl> <dbl> 1 Afghanis… 0.870 0.260 6 5 2 Albania 0.985 0.493 2 6 3 Algeria 0.991 0.308 2 4 4 Andorra 0.684 0.656 6 5 5 Angola 0.914 0.380 1 6 6 Antigua … 0.437 0.169 2 5 7 Argentina 0.980 0.774 2 5 8 Armenia 0.982 0.876 2 6 9 Australia 0.833 0.271 1 3 10 Austria 0.984 0.712 2 6 # … with 183 more rows, and 20 more variables: spikiness <dbl>, # linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>, stl_e_acf10 <dbl>, # acf1 <dbl>, acf10 <dbl>, diff1_acf1 <dbl>, diff1_acf10 <dbl>, # diff2_acf1 <dbl>, diff2_acf10 <dbl>, season_acf1 <dbl>, pacf5 <dbl>, # diff1_pacf5 <dbl>, diff2_pacf5 <dbl>, season_pacf <dbl>, # zero_run_mean <dbl>, nonzero_squared_cv <dbl>, zero_start_prop <dbl>, # zero_end_prop <dbl> ``` --- class: split-two white .column.bg-white[.content[ # Feature-based visualization ```r features.all %>% ggplot(aes(x = trend_strength, y = seasonal_strength_week)) + geom_point() + coord_equal() + xlim(c(0,1)) + ylim(c(0,1)) + labs(x = "Trend strength", y = "Seasonal strength") + theme(legend.position = "bottom") ``` ]] .column.bg-white[.content[ #
]] --- .pull-left[ # Time-domain ![](index_files/figure-html/unnamed-chunk-21-1.png)<!-- --> ] .pull-right[ # Feature-space
] --- class: split-70 hide-slide-number background-image: url("img/ts.jpeg") background-size: cover .column.slide-in-left[ .sliderbox.vmiddle.shade_main.center[ .font5[Large-scale Forecasting]]] .column[ ] --- class: inverse, center, middle background-image: url(img/rice1.png) background-size: contain # Big picture --- class: inverse, center, middle background-image: url(img/rice2.png) background-size: contain # Big picture --- class: inverse, center, middle background-image: url(img/f1.png) background-size: contain --- class: inverse, center, middle background-image: url(img/f2.png) background-size: contain --- class: inverse, center, middle background-image: url(img/f3.png) background-size: contain --- class: inverse, center, middle background-image: url(img/f4.png) background-size: contain --- class: inverse, center, middle background-image: url(img/f5.png) background-size: contain --- class: inverse, center, middle background-image: url(img/f6.png) background-size: contain --- class: inverse, center, middle background-image: url(img/f7.png) background-size: contain --- class: inverse, center, middle background-image: url(img/f8.png) background-size: contain --- class: inverse, center, middle background-image: url(img/f9.png) background-size: contain --- class: inverse, center, middle background-image: url(img/f10.png) background-size: contain --- class: inverse, center, middle background-image: url(img/f11.png) background-size: contain --- class: inverse, center, middle background-image: url(img/f12.png) background-size: contain --- class: inverse, center, middle background-image: url(img/f13.png) background-size: contain --- ## FFORMS: Feature-based FORecast Model Selection ### seer (magic ball) .pull-left[ <img src="img/seer.png" width="20%" /> ```r install.packages("seer") #library(devtools) #install_github("thiyangt/seer") library(seer) ``` ] .pull-right[ Acropolis Museum, Athens, Greece <img src="img/greece.JPG" width="60%" /> ] --- # Example dataset ```r library(Mcomp) yearlym1 <- subset(M1, "yearly") yearlym1 ``` ``` M-Competition data: 181 YEARLY time series Type of data Period DEMOGR INDUST MACRO1 MACRO2 MICRO1 MICRO2 MICRO3 YEARLY 30 35 30 29 16 29 12 ``` --- ## Input: features ```r library(seer) cal_features(yearlym1[1:2], database="M1", h=6, highfreq=FALSE) ``` ``` # A tibble: 2 x 25 entropy lumpiness stability hurst trend spikiness linearity curvature e_acf1 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 0.442 0.0400 0.977 0.985 0.985 0.00000132 4.46 0.705 -0.0603 2 0.363 0.0790 0.894 0.988 0.989 0.00000154 4.47 0.613 0.272 # … with 16 more variables: y_acf1 <dbl>, diff1y_acf1 <dbl>, diff2y_acf1 <dbl>, # y_pacf5 <dbl>, diff1y_pacf5 <dbl>, diff2y_pacf5 <dbl>, nonlinearity <dbl>, # lmres_acf1 <dbl>, ur_pp <dbl>, ur_kpss <dbl>, N <int>, y_acf5 <dbl>, # diff1y_acf5 <dbl>, diff2y_acf5 <dbl>, alpha <dbl>, beta <dbl> ``` --- ## Output: class labels ```r seer::fcast_accuracy(tslist=yearlym1[1:2], models= c("arima","ets","rw", "theta", "nn"), database ="M1", cal_MASE, h=6, length_out = 1, fcast_save = TRUE) ``` ``` $accuracy arima ets rw theta nn YAF2 10.527612 10.319029 13.52428 12.088375 11.811087 YAF3 5.713867 7.704409 7.78949 6.225463 6.700791 $ARIMA YAF2 YAF3 "ARIMA(0,1,0) with drift" "ARIMA(0,1,1) with drift" $ETS YAF2 YAF3 "ETS(A,A,N)" "ETS(M,A,N)" $forecasts $forecasts$arima YAF2 YAF3 [1,] 579581.0 390955.9 [2,] 605761.9 407325.1 [3,] 631942.9 423694.4 [4,] 658123.8 440063.6 [5,] 684304.8 456432.8 [6,] 710485.7 472802.0 $forecasts$ets YAF2 YAF3 [1,] 556280.7 384603.9 [2,] 594333.0 385162.7 [3,] 632385.3 385721.5 [4,] 670437.6 386280.4 [5,] 708489.9 386839.2 [6,] 746542.3 387398.0 $forecasts$rw YAF2 YAF3 [1,] 553400 384040 [2,] 553400 384040 [3,] 553400 384040 [4,] 553400 384040 [5,] 553400 384040 [6,] 553400 384040 $forecasts$theta YAF2 YAF3 [1,] 565938.8 394342.6 [2,] 578486.4 404640.6 [3,] 591034.0 414938.7 [4,] 603581.5 425236.7 [5,] 616129.1 435534.8 [6,] 628676.6 445832.8 $forecasts$nn YAF2 YAF3 [1,] 575410.7 399629.0 [2,] 592457.7 407194.1 [3,] 605225.4 410556.4 [4,] 614538.9 411989.1 [5,] 621198.0 412588.3 [6,] 625889.9 412837.0 ``` --- ## Training set ```r prepare_trainingset(accuracy_set = accuracy_m1, feature_set = features_m1)$trainingset ``` ``` # A tibble: 2 x 26 entropy lumpiness stability hurst trend spikiness linearity curvature e_acf1 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 0.442 0.0400 0.977 0.985 0.985 0.00000132 4.46 0.705 -0.0603 2 0.363 0.0790 0.894 0.988 0.989 0.00000154 4.47 0.613 0.272 # … with 17 more variables: y_acf1 <dbl>, diff1y_acf1 <dbl>, diff2y_acf1 <dbl>, # y_pacf5 <dbl>, diff1y_pacf5 <dbl>, diff2y_pacf5 <dbl>, nonlinearity <dbl>, # lmres_acf1 <dbl>, ur_pp <dbl>, ur_kpss <dbl>, N <int>, y_acf5 <dbl>, # diff1y_acf5 <dbl>, diff2y_acf5 <dbl>, alpha <dbl>, beta <dbl>, # classlabels <chr> ``` --- ## FFORMS classifier ```r rf <- build_rf(training_set = training_set, testset= M3yearly_features, rf_type="ru", ntree=100, seed=1, import=FALSE, mtry = 8) ``` ```r head(rf$predictions) ``` ``` 1 2 3 4 5 6 ETS-trend rwd rwd rwd rwd rwd 10 Levels: ARIMA ARMA/AR/MA ETS-dampedtrend ETS-notrendnoseasonal ... wn ``` --- class: inverse, center, middle background-image: url(img/forest.jpg) background-size: content --- <img src="img/wolf.png" width="100%" /> Source: https://theblue.ai/blog/lime-models-explanation/ --- # explainer package Machine learning interpretability tools (randomForest) - Which features are the most important? - Where are they important? - How are they important? - When and how are features linked with the prediction outcome? - When and how strongly do features interact with other features? ```r devtools::install_github("thiyangt/explainer") library(explainer) ``` --- background-image: url(img/ice1.png) background-size: contain ### Partial dependence plots and ICE curves --- background-image: url(img/ice2.png) background-size: contain ### Partial dependence plots and ICE curves --- background-image: url(img/ice3.png) background-size: contain ### Partial dependence plots and ICE curves --- background-image: url(img/ice4.png) background-size: contain ### Partial dependence plots and ICE curves --- background-image: url(img/ice5.png) background-size: contain ### Partial dependence plots and ICE curves --- ### Partial dependency plots for hourly data: entropy - forecastability of a time series .pull-left[ <img src="img/entropy.png" width="3171" /> ] .pull-right[ <img src="img/en_pdp4.png" width="2620" /> ] --- # Interaction effect Interaction between linearity and seasonal lag at seasonally-differenced series ![](img/dtwopdp1-1.png) --- # Interaction effect Interaction between linearity and seasonal lag at seasonally-differenced series ![](img/dtwopdp2-1.png) --- class: middle center bg-main1 # Thank you <img src="img/seer.png" width="20%" />
<i class="fab fa-twitter fa-3x faa-float animated " style=" color:lightblue;"></i>
<i class="fab fa-github fa-3x faa-float animated " style=" color:yellow;"></i>
# @thiyangt # Web: https://thiyanga.netlify.com/