class: center, middle, inverse, title-slide

.title[
# Handling Massive Collections of Time Series Data with Feature Engineering
]
.subtitle[
## International Water Management Institute (IWMI)
]
.author[
### Thiyanga S. Talagala
]
.date[
### 2023-11-02
]

---

## Top-cited scientists

On October 4, 2023, Elsevier released an updated dataset titled “Updated Science-Wide Author Databases of Standardized Citation Indicators,” following a comprehensive analysis conducted by a team of experts led by Professor John P. A. Ioannidis from Stanford University.

Link: [https://thiyangt.github.io/citation/](https://thiyangt.github.io/citation/)

---
background-image: url(1.png)
background-size: contain

---

# About me

- PhD in Statistics and Mathematics, Monash University, Australia

--

- Senior Lecturer, Department of Statistics, University of Sri Jayewardenepura

--

- Co-founder and Co-organizer, R-Ladies Colombo

--

- Coordinator, Statistical Consultancy Service, University of Sri Jayewardenepura

--

- Founder and lead maintainer, [Dengue Data Hub](https://denguedatahub.netlify.app/)

--

Current research interests: Time Series Analysis, Data Visualization, Machine Learning, Machine Learning Interpretability, Algorithm Selection, Open Science, Reproducible Research

--

November 2023 highlights: [#30DayMapChallenge](https://30daymapchallenge.com/)

---
class: middle, center

# Handling Massive Collections of
# Time Series Data with
# Feature Engineering

---
class: middle, center

# Handling **Massive Collections** of
# Time Series Data with
# Feature Engineering

---
class: middle, center

# Handling **Massive Collections** of
# **Time Series Data** with
# Feature Engineering

---
background-image: url(dengue.jpg)
background-size: contain

---
class: middle, center

# `denguedatahub` R package installation

<img src="https://denguedatahub.netlify.app/logo.png" width="300px" />

`install.packages("denguedatahub")`

`library(denguedatahub)`

---

### District-wise weekly dengue cases from 2006 to August 2023

```r
library(denguedatahub)
library(tsibble)
srilanka_weekly_data
```

```
# A tibble: 21,934 × 6
    year  week start.date end.date   district    cases
 * <dbl> <dbl> <date>     <date>     <chr>       <dbl>
 1  2006    52 2006-12-23 2006-12-29 Colombo        71
 2  2006    52 2006-12-23 2006-12-29 Gampaha        12
 3  2006    52 2006-12-23 2006-12-29 Kalutara       12
 4  2006    52 2006-12-23 2006-12-29 Kandy          20
 5  2006    52 2006-12-23 2006-12-29 Matale          4
 6  2006    52 2006-12-23 2006-12-29 NuwaraEliya     1
 7  2006    52 2006-12-23 2006-12-29 Galle           1
 8  2006    52 2006-12-23 2006-12-29 Hambanthota     1
 9  2006    52 2006-12-23 2006-12-29 Matara         11
10  2006    52 2006-12-23 2006-12-29 Jaffna          0
# ℹ 21,924 more rows
```

---
class: middle, center

# Data visualization serves as a prerequisite for modeling.

--

## Why?

--

It plays a crucial role in helping us better **grasp the patterns**, **relationships**, and **essential characteristics** of the data we're working with, making it easier to create accurate and effective models.

---

# Static chart

![](index_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

# Interactive chart
---

![](index_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

#### Number of series: 1

![](index_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

---

#### Number of series: 2

![](index_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

#### Number of series: 3

![](index_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

---

## Number of series: 10

![](index_files/figure-html/unnamed-chunk-10-1.png)<!-- -->

---

## Number of series: More than 100

![](index_files/figure-html/unnamed-chunk-11-1.png)<!-- -->

---
background-image: url(https://thiyanga.netlify.app/post/positimages/site.png)
background-size: contain

# posit::conf(2023), Hyatt Regency in Chicago, Illinois, USA

---
background-image: url(usgs.png)
background-size: contain

---
class: middle, center

# Handling **Massive Collections** of
# **Time Series Data** with
# **Feature Engineering**

---
background-image: url(img/tukey.jpeg)
background-size: 200px
background-position: 100% 6%

# Time series features

- **Cognostics**: **Co**mputer-aided dia**gnostics** (John W. Tukey, 1985)

- Characteristics of time series

- Summary measures of time series

**Basic Principle**

- Transform a given time series `\(y=\{y_1, y_2, \cdots, y_n\}\)` into a feature vector `\(F = (f_1(y), f_2(y), \cdots, f_p(y))'\)`.

---

.pull-left[

#### Time-domain representation

![](index_files/figure-html/unnamed-chunk-12-1.png)<!-- -->

]

--

.pull-right[

#### Time series features

- Strength of trend

- Strength of seasonality

```
# A tibble: 6 × 3
  trend seasonality id
  <dbl>       <dbl> <chr>
1 0.995       0     N0001
2 0.591       0     N0633
3 0.961       0     N0625
4 0.178       0     N0645
5 0.251       0.906 N1912
6 0.968       0.927 N2012
```

]

---

# STL decomposition

.pull-left[

![](index_files/figure-html/unnamed-chunk-14-1.png)<!-- -->

]

.pull-right[

`$$y_t = T_t + S_t + R_t$$`

`$$F_T = \max \left(0, 1 - \frac{Var(R_t)}{Var(T_t + R_t)} \right)$$`

`$$F_S = \max \left(0, 1 - \frac{Var(R_t)}{Var(S_t + R_t)} \right)$$`

]

STL is an acronym for “Seasonal and Trend decomposition using Loess”.
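As a rough illustration of the two strength formulas above, the sketch below computes `\(F_T\)` and `\(F_S\)` from a base R `stl()` fit. The helper name `stl_strength()` and the choice `s.window = "periodic"` are assumptions made for this example only; `tsfeatures` may use different decomposition settings, so its values need not match these exactly.

```r
# Hypothetical helper (not part of tsfeatures): strength of trend and
# seasonality from an STL fit, following F_T and F_S as defined above.
stl_strength <- function(y, s.window = "periodic") {
  # stats::stl() needs a seasonal series (frequency > 1, at least two periods)
  fit <- stats::stl(y, s.window = s.window)
  comp <- fit$time.series
  trend <- as.numeric(comp[, "trend"])
  seasonal <- as.numeric(comp[, "seasonal"])
  remainder <- as.numeric(comp[, "remainder"])
  c(
    trend = max(0, 1 - var(remainder) / var(trend + remainder)),
    seasonality = max(0, 1 - var(remainder) / var(seasonal + remainder))
  )
}

# AirPassengers is a monthly series shipped with base R
strength <- stl_strength(AirPassengers)
print(round(strength, 3))
```

Base `stl()` only accepts seasonal series, so annual series (frequency 1) would need a different trend estimate before `\(F_T\)` can be computed.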
For more information, click [here](https://otexts.com/fpp2/stl.html).

---

.pull-left[

#### Time-domain representation

![](index_files/figure-html/unnamed-chunk-15-1.png)<!-- -->

]

--

.pull-right[

#### Feature-domain representation

![](index_files/figure-html/unnamed-chunk-16-1.png)<!-- -->

]

---

# R code

```r
six_series
```

```
## $N0001
## Time Series:
## Start = 1975
## End = 1988
## Frequency = 1
##  [1]  940.66 1084.86 1244.98 1445.02 1683.17 2038.15 2342.52 2602.45 2927.87
## [10] 3103.96 3360.27 3807.63 4387.88 4936.99
##
## $N0633
## Time Series:
## Start = 1974
## End = 1989
## Frequency = 1
##  [1] 6664 7719 6750 6638 6666 6497 5867 6895 6077 6796 7488 7285 8014 8731 7650
## [16] 7200
##
## $N0625
## Time Series:
## Start = 1973
## End = 1989
## Frequency = 1
##  [1] 1570 2270 2246 2072 2024 2120 2522 2560 2592 2576 2916 2874 3398 3580 3770
## [16] 4222 4486
##
## $N0645
## Time Series:
## Start = 1955
## End = 1986
## Frequency = 1
##  [1] 6030 5070 5970 7870 5490 7600 5620 5040 6140 5410 8880 8130 6850 6990 6180
## [16] 6310 5080 7400 5790 6682 6582 4167 7165 7426 7290 6900 7459 7003 6226 7453
## [31] 5009 6115
##
## $N1912
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1981 3959 3704 5149 5419 5151 6368 6427 5724 4809 4596 4454 4488
## 1982 3696 4103 5299 5156 5699 6326 5732 6029 5193 4888 4482 4417
## 1983 4539 3470 5358 5099 5979 6631 6163 6653 5648 4732 4429 3916
## 1984 4117 4636 4639 5033 5431 6100 6499 6716 4880 5206 4421 3897
## 1985 4479 4102 4301 5606 5762 5998 6279 5893 4825 4870 4088 3882
## 1986 4303 4065 4861 6173 5884 5856 5899 5293 4687 4796 3988 4080
## 1987 4048 4174 5353 5913 6196 6471 5950 5962 5203 4896 4085 4183
## 1988 3978 4371 5325 5928 5800 6568 6115 6202 5340 4983 4410 4531
## 1989 4630 4416 5290 5998 6153 6359 5826 6184 4989 5126 4668 4174
## 1990 4215 4284 5342 5109 6026 5835 5835 5876 4757 4920 4176 3886
## 1991 3987 3750 4591 5303 5822 5663
##
## $N2012
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1979 2317.0 2265.0 2842.0 2485.0 2904.0 2781.5 2587.0 2734.5 2389.5 2657.0
## 1980 2273.0 2400.5 2582.5 2733.5 2951.0 3036.0 2871.0 2939.5 3004.0 3050.5
## 1981 2414.0 2564.5 3128.5 3303.5 3274.0 3609.0 3262.0 3309.0 3263.0 2998.5
## 1982 2413.5 2525.0 3055.5 3138.5 3294.5 3650.0 3131.0 3328.5 3242.5 2702.0
## 1983 2456.5 2594.0 3283.0 3384.0 3642.5 4043.0 3460.5 3771.5 3568.0 3284.5
## 1984 3122.5 3362.0 3803.5 3771.5 4196.0 4199.5 3925.5 4165.0 3722.0 3826.5
## 1985 3152.5 3106.0 3771.5 4326.5 4650.5 4402.5 4311.0 4321.5 4029.5 4102.0
## 1986 3493.0 3405.5 3800.0 4519.5 4569.0 4455.0 4313.5 4281.5 4207.5 4317.0
## 1987 3515.5 3808.0 4290.5 4566.0 4632.5 4719.0 4592.0 4508.0 4448.5 4524.0
## 1988 3621.5 3966.0 4633.5 4698.0 5012.5 5179.0 4556.5 4854.0 4663.0 4540.5
## 1989 3986.0 4086.0 4626.5 4789.0 5208.5 5301.0
##         Nov    Dec
## 1979 2177.0 1870.0
## 1980 2381.5 2264.5
## 1981 2448.5 2202.0
## 1982 2378.0 2081.0
## 1983 2900.5 2529.5
## 1984 3165.0 2832.5
## 1985 3368.0 2855.0
## 1986 3300.5 3071.5
## 1987 3790.0 3438.0
## 1988 4098.5 3757.5
## 1989
```

---

# R code

```r
library(tsfeatures)
library(dplyr) # provides select()
# Compute STL-based features, then keep trend and seasonal strength
tsfeatures(six_series, c("stl_features")) |>
  select(trend, seasonal_strength)
```

```
## # A tibble: 6 × 2
##   trend seasonal_strength
##   <dbl>             <dbl>
## 1 0.995            NA
## 2 0.591            NA
## 3 0.961            NA
## 4 0.178            NA
## 5 0.251             0.906
## 6 0.968             0.927
```

---

## Examples of time series features

.pull-left[

- length

- strength of trend

- strength of seasonality

- lag-1 autocorrelation

- spectral entropy

- proportion of zeros

- spikiness

]

.pull-right[

- curvature

- linearity

- stability

- number of peaks

- parameter estimates of Holt-Winters' additive method

- unit root test statistics

]

---

# `tsfeatures` package

.pull-left[

## R installation

<img src="r.png" width="2557" />

`install.packages('tsfeatures')`

]

.pull-right[

## Python installation

<img src="python.png" width="2485" />

`pip install tsfeatures`

]

---

## Example: tsfeatures in R

```r
tslist <- list(sunspot.year, WWWusage, AirPassengers, USAccDeaths)
features <- tsfeatures(tslist)
features
```

```
## # A tibble: 4 × 20
##   frequency nperiods seasonal_period trend   spike linearity curvature e_acf1
##       <dbl>    <dbl>           <dbl> <dbl>   <dbl>     <dbl>     <dbl>  <dbl>
## 1         1        0               1 0.125 2.10e-5      3.58      1.11  0.793
## 2         1        0               1 0.985 3.01e-8      4.45      1.10  0.774
## 3        12        1              12 0.991 1.46e-8     11.0       1.09  0.509
## 4        12        1              12 0.802 9.15e-7     -2.12      2.85  0.258
## # ℹ 12 more variables: e_acf10 <dbl>, entropy <dbl>, x_acf1 <dbl>,
## #   x_acf10 <dbl>, diff1_acf1 <dbl>, diff1_acf10 <dbl>, diff2_acf1 <dbl>,
## #   diff2_acf10 <dbl>, seasonal_strength <dbl>, peak <dbl>, trough <dbl>,
## #   seas_acf1 <dbl>
```

---

# tidyverse

```r
library(tidyverse)
```

---

# tidyver~se~

```r
library(tidyverse)
```

---

# tidyverts

![](feast.png)

---

## Large-scale time series visualization

Demo: https://thiyangt.github.io/fformsviz/fforms.html

[Short demo](https://twitter.com/i/status/1634754298377560064)

[Dengue data visualization](https://denguedatahub.netlify.app/viz)

---
class: center, middle, inverse

# Which time series features should be used?
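---

## Sketch: from series to feature vectors

Three of the features listed earlier (length, lag-1 autocorrelation, proportion of zeros) are simple enough to compute with base R alone. The sketch below maps each series to a small feature vector, in the spirit of the basic principle `\(F = (f_1(y), \cdots, f_p(y))'\)`. The helper `simple_features()` and its column names are hypothetical and are not the `tsfeatures` implementation.

```r
# Hypothetical helper: map one series y to a small feature vector F(y)
simple_features <- function(y) {
  y <- as.numeric(y)
  c(
    length = length(y),
    # lag-1 autocorrelation: second element of the ACF (the first is lag 0)
    acf1 = stats::acf(y, lag.max = 1, plot = FALSE)$acf[2],
    # share of observations that are exactly zero
    prop_zeros = mean(y == 0)
  )
}

# Same built-in series as in the tsfeatures example
tslist <- list(sunspot.year, WWWusage, AirPassengers, USAccDeaths)

# One row per series, one column per feature
feature_matrix <- t(sapply(tslist, simple_features))
print(feature_matrix)
```

Each row is one series and each column one feature, so a large collection of series of different lengths becomes a single rectangular table.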
---

![](index_files/figure-html/unnamed-chunk-24-1.png)<!-- -->

---

![](index_files/figure-html/unnamed-chunk-25-1.png)<!-- -->

---

![](index_files/figure-html/unnamed-chunk-26-1.png)<!-- -->

---
class: center, middle

# Feature-based time series forecasting

---
class: inverse, center, middle
background-image: url(img/f1.png)
background-size: contain

---
class: inverse, center, middle
background-image: url(img/f2.png)
background-size: contain

---
class: inverse, center, middle
background-image: url(img/f3.png)
background-size: contain

---
class: inverse, center, middle
background-image: url(img/f4.png)
background-size: contain

---
class: inverse, center, middle
background-image: url(img/f5.png)
background-size: contain

---
class: inverse, center, middle
background-image: url(img/f6.png)
background-size: contain

---
class: inverse, center, middle
background-image: url(img/f7.png)
background-size: contain

---
class: inverse, center, middle
background-image: url(img/f8.png)
background-size: contain

---
class: inverse, center, middle
background-image: url(img/f9.png)
background-size: contain

---
class: inverse, center, middle
background-image: url(img/f10.png)
background-size: contain

---
class: inverse, center, middle
background-image: url(img/f11.png)
background-size: contain

---
class: inverse, center, middle
background-image: url(img/f12.png)
background-size: contain

---
class: inverse, center, middle
background-image: url(img/f13.png)
background-size: contain

---

### FFORMS: **F**eature-based **FOR**ecast **M**odel **S**election

.pull-left[

seer (magic ball)

<img src="img/seer.png" width="50%" />

```r
install.packages("seer")
#library(devtools)
#install_github("thiyangt/seer")
library(seer)
```

]

--

.pull-right[

Acropolis Museum, Athens, Greece

<img src="img/greece.JPG" width="60%" />

]

---
class: inverse
background-image: url(img/forest.jpg)
background-size: cover

## Random forest

---
class: center, inverse, middle
background-image: url(img/forest.jpg)
background-size: cover

# Can we trust ML algorithms if we don't know how they work?

---
class: inverse
background-image: url(img/forest.jpg)
background-size: cover

# Peeking inside FFORMS Random forest

---

# Peeking inside FFORMS Random forest

- Which features are the most important?

- Where are they important?

- How are they important?

--

> Machine learning interpretability methods.

Paper: Talagala, T. S., Hyndman, R. J., & Athanasopoulos, G. (2023). Meta‐learning how to forecast time series. Journal of Forecasting, 42(6), 1476-1501.

---

## Partial dependence plots for hourly data: entropy

<img src="entropy.png" width="80%" />

---

## Recap

--

- Time series features

--

- Storing

--

- Data visualization

--

- Forecasting

--

- Clustering

--

- Anomaly detection

---

![](anomaly.png)

---
class: center, middle

<img src="anomaly.png" width="800px" />

Source: https://prital.netlify.app/talks/f4sg2023/f4sg-talk#5

Contact: ~~Thiyanga~~ Priyanga Dilini Talagala

---
class: center, middle

# Thank You!
<i class="fab fa-twitter fa-3x faa-float animated " style=" color:lightblue;"></i>
<i class="fab fa-github fa-3x faa-float animated " style=" color:black;"></i>
# @thiyangt

### web: https://thiyanga.netlify.app

# email: ttalagala@sjp.ac.lk