<h2 id="a-high-level-overview-of-cloud-computing">A High-Level Overview of Cloud Computing</h2>
<p>In this post, I want to learn some of the basics of the Google Cloud Platform (GCP). I have previous experience working with Microsoft’s Azure platform, and so I am looking to expand and strengthen my understanding of cloud computing platforms in general as well as Google’s offerings more specifically.</p>
<h2 id="introduction">Introduction</h2>
<p>Cloud computing platforms in general offer many different services, which customers can pay for without having to manage the resources themselves [1]. Some of the most common services that these platforms offer are <strong>data storage</strong> capabilities and <strong>computing power</strong> [1].</p>
<p>The company offering the platform will typically distribute the computing resources over many locations (each location is known as a data center) [1]. Google Cloud Platform, for example, distributes their infrastructure over tens of different regions, which themselves are split up into smaller zones [2]. This distribution helps ensure that the platform is redundant and protected against outages, and customers can also minimize latency by connecting to the region nearest to them [4]. Some resources are global (they can be accessed from any region) while other resources can only be accessed by resources in the same zone or region [4].</p>
<p>Cloud computing is a useful service because it affords people access to extremely powerful computing resources without having to invest the money themselves in setting up those resources [1]. Many cloud computing providers allow customers to pay as they use the services, which saves them a large up-front capital investment because they don’t have to buy the computers themselves [1]. Customers can also quickly access more resources if necessary (or scale down their use as needed), which again is very convenient considering that the alternative would be to buy more computing resources yourself [3]. However, this cost structure can also induce a little anxiety in the customer, because it is possible to make mistakes (like leaving an instance running) that cost a lot of money!</p>
<p>Google Cloud Platform is one example of a cloud computing platform, and it offers a lot of support for data storage and machine learning [2]. GCP is built on the same infrastructure that Google uses for its own products (like Gmail and Google Drive) [2]. GCP is really an umbrella term: it covers over 100 distinct services, including computing resources, data storage, networking, AI support, and management tools.</p>
<p>Any work you do with a GCP resource must be part of a <strong>project</strong> [4]. Projects contain settings and metadata that control how your resources are used. Once you have established a project (and how you are going to pay for it) you can start to use the resources that GCP provides. GCP offers 3 main avenues for leveraging its services: 1) the console, 2) the command-line interface (CLI) and 3) client libraries [4]. The console is a web-based GUI that you can use to manage your projects [4]. The GCP CLI allows you to do the same thing from your terminal, either directly from your own computer or via a browser window [4]. The client libraries allow you to connect your Node.js and Python projects directly to GCP services, as well as allowing you to manage the resources that you are paying for [4].</p>
<p>Typically when you use a cloud computing platform for AI-related work, one of the first things you will have to do is create a <strong>virtual machine</strong> to run your code [3]. Typically GCP offers standard virtual machines (with standardized specifications for number of cores, size of RAM, operating system, etc.), but it also allows you to customize your virtual machine so that you can choose optimal specifications for your application (and budget) [3].</p>
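<p>As a rough sketch of what this looks like from the CLI, the commands below create, inspect, and delete a small VM. The instance name, machine type, and zone are illustrative placeholders, and flags can change between SDK versions, so treat this as a sketch rather than a recipe and check <code>gcloud compute instances create --help</code> for the current options:</p>

```shell
# Create a small VM (name, machine type and zone are placeholder values)
gcloud compute instances create my-test-vm \
    --machine-type=e2-medium \
    --zone=us-central1-a

# List running instances, then delete the VM so it stops incurring charges
gcloud compute instances list
gcloud compute instances delete my-test-vm --zone=us-central1-a
```

<p>Remembering that last step is how you avoid the "left an instance running" anxiety mentioned earlier.</p>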
<p>Once you have set up your virtual machine, you can interact with it using APIs (application programming interfaces) that are not unique to GCP [3]. That is, if you choose to use GCP to host your work, you are not locked into using a code base that you cannot transfer anywhere else - a lot of the code supporting interfacing with GCP can be used on other cloud computing platforms too [3].</p>
<p>This concludes a brief overview of how Google has set up its cloud computing platform.</p>
<h2 id="references">References</h2>
<p>[1] “Cloud computing.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Cloud_computing">https://en.wikipedia.org/wiki/Cloud_computing</a> Visited 7 Apr 2023.</p>
<p>[2] “Google Cloud Platform.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Google_Cloud_Platform">https://en.wikipedia.org/wiki/Google_Cloud_Platform</a> Visited 7 Apr 2023.</p>
<p>[3] Kevadiyasmeet. “What is Google Cloud Platform (GCP)?” Geeks for Geeks. 22 Jun 2022. <a href="https://www.geeksforgeeks.org/what-is-google-cloud-platform-gcp/">https://www.geeksforgeeks.org/what-is-google-cloud-platform-gcp/</a> Visited 7 Apr 2023.</p>
<p>[4] “Google Cloud overview.” Google Cloud. <a href="https://cloud.google.com/docs/overview">https://cloud.google.com/docs/overview</a> Visited 7 Apr 2023.</p>
<h2 id="a-first-foray-into-using-qgis">A First Foray into Using QGIS</h2>
<p>In this blog post we are going to learn a little about Geographic Information Systems (GIS) and play around with QGIS, a free and open source GIS tool.</p>
<h2 id="what-is-a-gis">What is a GIS?</h2>
<p>I have seen GIS mentioned a lot in discussions about using data science to solve problems related to climate change, but I never really understood what kind of data is used with a GIS or what problems it can solve. GIS is a general term that refers to computer hardware and software that is used to manipulate geographic data [1]. Sometimes the term is used so broadly that it also refers to the community of users, the common workflows for using the tool and the organizations that build and use the GIS tool [1]. But the most important aspect of a GIS is that it allows the user to join together geographical datasets by leveraging their shared location information as the “primary key,” to borrow a term from SQL [1]. Essentially, we have lots of datasets where the data is distributed over latitudes, longitudes and elevations, and a GIS allows us to overlay them and look for relationships between otherwise disparate datasets [1].</p>
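<p>To make the "join on location" idea concrete, here is a rough pandas analogy. This is not a real GIS workflow - a GIS uses true spatial geometries and indices rather than exact key matches - and all the column names and values below are made up for illustration:</p>

```python
import pandas as pd

# Two toy "layers" keyed by the same (lat, lon) grid cells; joining on
# location plays the role of a SQL join on a primary key (values made up)
rainfall = pd.DataFrame({
    "lat": [30.6, 30.6, 30.7],
    "lon": [-96.3, -96.4, -96.3],
    "rain_mm": [12.0, 8.5, 20.1],
})
landuse = pd.DataFrame({
    "lat": [30.6, 30.7],
    "lon": [-96.3, -96.3],
    "landuse": ["urban", "forest"],
})

# Rows match only where both layers describe the same location
merged = pd.merge(rainfall, landuse, on=["lat", "lon"], how="left")
```

<p>A real GIS does this with spatial indexing over geometries, but the underlying idea - location as the join key - is the same.</p>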
<p>The primary function of a GIS is to store geographic data. Different iterations of these systems do this in different ways - sometimes the data is stored as separate files, and sometimes as relational databases [1]. In addition to storing the data distributed over space, it is also becoming more common to see datasets stored over time as well, so you can see the temporal evolution of a geographic phenomenon [1]. A GIS can store many different types of data, but the two most commonly seen categories are discrete data (like classifications of geographic regions) and continuous data (like amount of rainfall) [1]. The GIS will also often enable analysis of the raw data to develop more complex types of data that can be overlaid on a map [1]. Typically, data is stored as either a set of raster images or as vector data [1]. More recently, 3D point clouds are also becoming more common [1].</p>
<p>There are many implementations of GIS tools, and currently ArcGIS is the industry standard, maintained by Esri [1]. Since I am on a grad student budget, I was interested in the free and open source platforms available, and it appears that QGIS is a popular alternative to ArcGIS.</p>
<p>I worked through Michael Treglia’s excellent introductory tutorial that showed how to build a map of a “suitability region” for a hypothetical species [2]. My final product is shown in Figure 1. Some of the topics covered in this tutorial included [2]:</p>
<ul>
<li>Importing <strong>raster</strong>, <strong>vector</strong> and <strong>CSV</strong> data</li>
<li>Managing <strong>layers</strong> on the map</li>
<li>Using <strong>plugins</strong> to import additional map layers</li>
<li>Vector <strong>geoprocessing tools</strong> for clipping and differencing layers of vector data</li>
<li>Vector <strong>data management tools</strong> for joining different datasets together</li>
<li>Using the <strong>raster calculator</strong> to derive new layers following some mathematical operations</li>
<li>Creating <strong>maps</strong> for export as PDF and other file types</li>
</ul>
<p><img src="/images/mymap.png" alt="Fig 1" title="Figure 1" /><br />
Figure 1 - My rendition of a map of a suitable region for a hypothetical species, with rivers overlaid, in College Station, TX</p>
<h2 id="references">References</h2>
<p>[1] “Geographic information system.” Wikipedia. Visited 4 Apr 2023. <a href="https://en.wikipedia.org/wiki/Geographic_information_system#">https://en.wikipedia.org/wiki/Geographic_information_system#</a></p>
<p>[2] Treglia, M. L. “An Introduction to GIS Using QGIS (v. 3.0).” 13 May 2018. <a href="https://mltconsecol.github.io/QGIS-Tutorial/QGIS-Tutorial/Treglia_QGIS_Tutorial_3_0.pdf">https://mltconsecol.github.io/QGIS-Tutorial/QGIS-Tutorial/Treglia_QGIS_Tutorial_3_0.pdf</a> Visited 10 Apr 2023.</p>
<h2 id="wids-datathon-2023">WiDS Datathon 2023</h2>
<p>This year I participated in my first ever datathon, hosted by the <a href="https://www.kaggle.com/competitions/widsdatathon2023">Women in Data Science</a> (WiDS) organization. The challenge for this year’s competition was to accurately predict daily temperatures over hundreds of regions in the United States over two months. The challenge was hosted on Kaggle. In this post, I will summarize my approach and the things I learned from participating in the datathon. I would like to sincerely thank the competition organizers, as well as all the other participants who kindly shared their Kaggle notebooks - I have cited the notebooks I read in the References section below.</p>
<h2 id="introduction">Introduction</h2>
<p>This year’s datathon gave participants a large dataset consisting of over 240 different features for 500+ unique regions inside the US, in daily increments over about 2 years [1]. The challenge was to use this dataset to predict values over a 2 month time span for the target feature [1]. The target variable was the arithmetic mean of the min and max observed temperatures over the next 2 weeks for each unique location and start date in the dataset [1].</p>
<p>The features in the dataset gave information about the geographical regions sampled, as well as meteorological data including temperature, precipitation, humidity, sea level pressure, wind speeds, data on significant effects like El Nino and tropical convection, etc. [1]. The dataset included both measured values as well as predictions made by a variety of existing meteorological models and ensembles, averaged daily and monthly [1].</p>
<p>The unique locations were characterized by normalized, anonymized latitude and longitude values, as well as by climate region labels and elevation data [1]. To help myself visualize the geographic distribution of the data, I plotted the climate region labels and elevation data as shown below in Figures 1 and 2.</p>
<p><img src="/images/geographic_regions_climate.png" alt="Fig 1" title="Figure 1" /><br />
Figure 1 - Geographic distribution of climate regions in the dataset.</p>
<p><img src="/images/geographic_regions_elevation.png" alt="Fig 2" title="Figure 2" /><br />
Figure 2 - Elevation data for all the unique locations sampled in the dataset.</p>
<p>Both the training and test datasets contained the same features - with the exception that the test dataset did not contain the target feature values [1]. But this meant that we could train models to make predictions using the other features because we had their values for the target time range - this was helpful because it meant we did not have to create our own additional features for the target time range by creating <a href="https://sassafras13.github.io/TimeSeriesForecastingML/">lagged values</a>.</p>
<p>For this challenge, I relied heavily on several Python libraries including <code class="language-plaintext highlighter-rouge">Pandas</code>, <code class="language-plaintext highlighter-rouge">plotly</code>, <code class="language-plaintext highlighter-rouge">statsmodels</code>, <code class="language-plaintext highlighter-rouge">sklearn</code> and Microsoft’s implementation of a light gradient-boosting model, <code class="language-plaintext highlighter-rouge">lightgbm</code>. I have included links to my Kaggle notebooks where applicable.</p>
<p>The overall outline of my approach to this datathon can be summarized as:</p>
<ol>
<li>Examine the raw data and <a href="https://sassafras13.github.io/BasicsTimeSeries/">impute missing values</a>.</li>
<li>Perform an <a href="https://sassafras13.github.io/EDA/">exploratory data analysis</a>, looking for patterns and features with strong correlations to the target variable.</li>
<li>Establish a baseline predictive approach using <a href="https://sassafras13.github.io/ARIMA/">ARIMA</a>.</li>
<li>Use feature engineering to generate additional features conveying useful time-related information.</li>
<li>Apply a machine learning solution to try to beat my baseline approach, primarily LightGBM [5].</li>
</ol>
<h2 id="examine-the-raw-data-and-impute-missing-values">Examine the Raw Data and Impute Missing Values</h2>
<p><img src="/images/missing_data.png" alt="Fig 3" title="Figure 3" /> <br />
Figure 3 - Visualized missing values in the training dataset.</p>
<p>The Kaggle notebook for my data intake process is available <a href="https://www.kaggle.com/code/janed57821/wids-data-intake">here</a>.</p>
<p>I found that there were 8 features which were missing values, as visualized in Figure 3. These features contained predictions of temperature and precipitation from the North American Multi-Model Ensemble (NMME) forecast model, which is “a collection of physics-based forecast models from various modeling centers in North America” according to the datathon’s data page [1]. I noted that some of the predictions were monthly mean predictions, while others were daily mean predictions. Examples of the daily and monthly predictions for the observed temperature from the CCSM3 model for (latitude, longitude) = (0, 0.84), taken from the training dataset, are shown below in Figures 4-5. In both cases, I decided to use the forward-fill method to impute the missing values since precipitation and temperature are strongly correlated with previous days/months.</p>
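<p>As a minimal sketch of the forward-fill step, here is how it might look in <code>Pandas</code> on a toy series. The column name and values are hypothetical, not the competition's actual data:</p>

```python
import numpy as np
import pandas as pd

# Toy daily series with gaps, standing in for one of the NMME forecast
# columns (the column name here is made up, not the dataset's real name)
df = pd.DataFrame(
    {"nmme_tmp2m": [21.0, np.nan, np.nan, 19.5, np.nan, 18.0]},
    index=pd.date_range("2015-01-01", periods=6, freq="D"),
)

# Forward-fill: each gap inherits the last observed value, which is
# reasonable when the variable is strongly autocorrelated day to day
df["nmme_tmp2m"] = df["nmme_tmp2m"].ffill()
```

<p>Note that forward-fill only works if the first value in each series is present; otherwise the leading gap stays unfilled.</p>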
<p><img src="/images/data_intake_nmme0-tmp2m-34w__ccsm30.png" alt="Fig 4" title="Figure 4" /><br />
Figure 4 - Monthly NMME model forecasts for the target label using the CCSM3 model.</p>
<p><img src="/images/data_intake_nmme-tmp2m-56w__ccsm3.png" alt="Fig 5" title="Figure 5" /><br />
Figure 5 - Weighted average of monthly NMME model forecasts for the target variable using the CCSM3 model.</p>
<p>One useful insight from this step was that the target time period did not immediately follow the training data. The training data ranged from 1 Sept 2014 to 31 Aug 2016, but the test data ran from 1 Nov 2020 to 31 Dec 2020. In total, there was a gap of 4 years, 2 months between the train and test datasets. This made generating a baseline more challenging, as we will see below.</p>
<h2 id="perform-an-eda">Perform an EDA</h2>
<p>The Kaggle notebook for my EDA is available <a href="https://www.kaggle.com/code/janed57821/wids-eda-corr-geo-seasonality-etc">here</a>.</p>
<p>In my EDA, I focused on trying to understand how the data was organized across geographic regions, and then I did a deep dive into the data for one location to see if I could find correlations between any of the features and the target feature. My reasoning was that the same features (which include data on the geographic region as well as various meteorological measures like temperature, precipitation, wind speeds, humidity, etc.) would correlate strongly with the target variable regardless of the geographic region.</p>
<p>I first looked at the distribution of climate regions in the data, and found that the 3 categories that were most common in the dataset were cold, semi-arid climates (“BSk”), warm-summer humid continental climates (“Dfb”) and humid, subtropical climates (“Cfa”). Conversely, the 3 least frequently appearing regions were Mediterranean-influence subarctic climates (“Dsc”), Monsoon-influenced hot-summer humid continental climate (“Dwa”) and Monsoon-influenced warm-summer humid continental climate (“Dwb”). This distribution is shown in Figure 6, and if we also look back at Figure 1, we can see that this distribution corresponds to the graphical representation. It appears, for example, that the most common “BSk” region (cold, semi-arid climate) appears in regions that likely are near the Rocky Mountains as well as mountain ranges in the eastern US. And on the other end of the distribution, the “Dwb” regions (monsoon-influenced warm-summer humid continental climate) only appear in the southern part of the country, closer to the Equator and the Gulf.</p>
<p><img src="/images/EDA_climate_region_distribution.png" alt="Fig 6" title="Figure 6" /> <br />
Figure 6 - Plot of frequency of each climate region label appearing in the training dataset.</p>
<p>Given that the cold, semi-arid climate (“BSk”) was the most frequently appearing category in the dataset, I selected a single location from that category for my deep dive into the available features. As stated in the Introduction, the dataset includes a wide variety of meteorological data. I noticed that there were many different features containing similar forms of data - for example, there were many temperature predictions from a range of models within the NMME ensemble, averaged over varying lengths of time. While the relationship between the target variable and the other features would likely depend somewhat on the frequency of the feature (i.e. if it varied daily, monthly or over some other period), I hypothesized that the general relationship between different types of meteorological data and the target variable would be relatively constant. That is, I hypothesized that all the temperature data would have similar amounts of correlation to the target variable when compared with the correlation between all the wind speed data and the target variable, for example. This is by no means a proven hypothesis, but it was a reasonable starting point to guide an in-depth analysis of some of the key features, since exploring all 240 would be prohibitively time-consuming. Instead, I selected 14 key features from the dataset, as well as the target variable, which broadly represented all the categories of data available in my dataset. A complete list of variables that I examined, along with a description, is available in my Kaggle notebook.</p>
<p>I began my deep dive by plotting a correlation matrix for these 14 key features and the target variable. Note that computing correlation for time series data is a little different than for other data types - I reviewed several common correlation methods and how they are applied to time series data in this <a href="https://sassafras13.github.io/EDA/">blog post</a>. For the correlation matrix shown in Figure 7, I used the Pearson correlation coefficient.</p>
<p><img src="/images/EDA_correlation_matrix.png" alt="Fig 7" title="Figure 7" /> <br />
Figure 7 - Correlation matrix (using Pearson correlation coefficient) for 14 key features and target variable.</p>
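<p>A correlation matrix like this is a one-liner in <code>Pandas</code>; here is a toy sketch on synthetic data (the column names and the seasonal signal are made up for illustration, echoing the kinds of relationships found above):</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 365
t = np.arange(n)

# Synthetic stand-ins: a seasonal "temperature" target, a noisy forecast of
# it, and an unrelated feature (all names and values are made up)
target = 10 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 1, n)
df = pd.DataFrame({
    "tmp2m_target": target,
    "nmme_mean": target + rng.normal(0, 2, n),  # tracks the target closely
    "mjo_phase": rng.uniform(0, 8, n),          # unrelated to the target
})

# DataFrame.corr() defaults to the Pearson correlation coefficient
corr = df.corr()
```

<p>On this toy data the noisy forecast correlates strongly with the target while the unrelated feature does not, mirroring the pattern in Figure 7.</p>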
<p>When we examine this correlation matrix, we can see that the features most highly correlated with the target variable (“contest-tmp2m-14d__tmp2m”) are the NMME model mean prediction of the temperature (“nmme-tmp2m-34w__nmmemean”), the precipitable water prediction (“contest-prwtr-eatm-14d”), and the potential evaporation (“contest-pevpr-sfc-gauss-14d”). The sea level pressure (“contest-slp-14d”) was also highly negatively correlated with the target variable. For contrast, some of the data with the least correlation was the Madden-Julian oscillation period and amplitude, which measure tropical convection (“mjo1d__phase” and “mjo1d__amplitude”). To visually confirm that these correlations were possible, I plotted the target variable against the feature in question as shown in Figures 8 - 11 below (the target variable is always plotted on the secondary y-axis in red).</p>
<p><img src="/images/EDA-nmme-tmp2m-34w__nmmemean.png" alt="Fig 8" title="Figure 8" /> <br />
Figure 8 - NMME mean temperature prediction compared to target variable.</p>
<p><img src="/images/EDA-contest-prwtr-eatm-14d__prwtr.png" alt="Fig 9" title="Figure 9" /> <br />
Figure 9 - Precipitable water predictions compared to target variable.</p>
<p><img src="/images/EDA-contest-pevpr-sfc-gauss-14d__pevpr.png" alt="Fig 10" title="Figure 10" /><br />
Figure 10 - Potential evaporation plotted against target variable.</p>
<p><img src="/images/EDA-mjo1d__amplitude.png" alt="Fig 11" title="Figure 11" /> <br />
<img src="/images/EDA-mjo1d__phase.png" alt="Fig 11" title="Figure 11" /> <br />
Figure 11 - The Madden-Julian oscillation phase and amplitude plotted against the target variable.</p>
<p>After looking at the raw time series data for these highly correlated features, I wanted to understand if they also shared similarities if we decomposed them into trends and seasonalities (I write about these concepts <a href="https://sassafras13.github.io/EDA/">here</a>). I used the statsmodels <code class="language-plaintext highlighter-rouge">seasonal_decompose</code> function to divide the target variable and my three highly correlated features into trends, seasonalities and residuals. I used an additive model and a period of 365 days (one year) because I know that temperatures tend to follow patterns that repeat on a yearly basis.</p>
<p><img src="/images/EDA_seasonaldecomp_target.png" alt="Fig 12" title="Figure 12" /> <br />
Figure 12 - Results of seasonal decomposition for target variable.</p>
<p>As we can see from Figure 12, the target variable does not have a strong trend component - this suggests that the temperature range was fairly constant over the time period for which we have data. Most of the target variable’s variation is contained in the seasonal component, indicating that the temperature is varying following a consistent pattern year over year. The residuals are zero indicating that this method was not able to separate out random variation from the trend and seasonal components. Now let’s look at how our highly correlated variables compare.</p>
<p><img src="/images/EDA_seasonaldecomp_nmme-tmp2m.png" alt="Fig 13" title="Figure 13" /> <br />
Figure 13 - Results of seasonal decomposition for NMME mean temperature prediction.</p>
<p>Here we can see that the NMME’s mean temperature forecast is very similar to the target variable in that there is also not a strong trend (i.e. the temperature cycles consistently year to year without showing a strong increase or decrease). The NMME forecast also has a strong seasonal component that matches the pattern of the target variable. This key feature has a consistently small, negative residual but that could just be an indication that the model fit has a slight DC offset from the actual data.</p>
<p><img src="/images/EDA_seasonaldecomp_prwtr-eatm.png" alt="Fig 14" title="Figure 14" /> <br />
Figure 14 - Seasonal decomposition for precipitable water predictions.</p>
<p>The precipitable water prediction does differ slightly from the target variable because we see signs of a more significant downward trend in the precipitable water values over the two years of data. However, the precipitable water prediction also has a strong seasonal component that, again, matches the target variable - this commonality is probably the primary reason for why the two features were found to be strongly correlated. Again, the residuals for this feature are zero.</p>
<p><img src="/images/EDA_seasonaldecomp_pevpr.png" alt="Fig 15" title="Figure 15" /> <br />
Figure 15 - Seasonal decomposition for potential evaporation.</p>
<p>The potential evaporation predictions also show a strong trend component, this time in the positive direction. It is possible that the trends do not seem to influence the Pearson correlation coefficient very strongly because we do not have a large enough time window in our data to really see the trend come into play - if we were looking at, say, a decade’s worth of data, it might reduce the correlation between this feature and the target more strongly. In contrast, the strong seasonal component that closely matches the target’s pattern is likely the driving factor behind the high correlation scores for this feature and the target.</p>
<p>We have almost wrapped up the EDA portion of this project, but before I stopped, I also wanted to take a closer look at the distribution of the target variable over all of the data (and not just for one location). I generated a violin plot of the target variable by region for the entire time period because I wanted to see if there were a lot of outliers in the target feature. Outliers can make it harder to generate good predictions with a model, so I wanted to get a sense of how difficult the task ahead was going to be. As we can see from the plot below, several of the climate regions showed outliers but most of these regions fell into the mid- to low-range in terms of their frequency in the dataset, suggesting that they will not make my job as a forecaster significantly more difficult because they will not appear very often.</p>
<p><img src="/images/EDA_target_violin.png" alt="Fig 16" title="Figure 16" /> <br />
Figure 16 - Violin plot of target variable divided by region over entire time period of training dataset.</p>
<p>Now let’s move on to setting up some baselines to see if that will turn out to be the case!</p>
<h2 id="establish-a-baseline-with-arima">Establish a Baseline with ARIMA</h2>
<p>The Kaggle notebook with my ARIMA model is available <a href="https://www.kaggle.com/code/janed57821/arima-model">here</a>.</p>
<p>I chose the ARIMA model as my baseline because I wanted to establish what I hoped would be a challenging baseline to surpass with a machine learning model. Some commonly used baseline forecasting models - such as the naive method (using the last value) or using the mean - are not appropriate for predicting two months into the future [7]. The ARIMA model includes terms that can capture several aspects of the time series (see my <a href="https://sassafras13.github.io/ARIMA/">blog post</a> for more details), making it one of the most adaptable non-ML predictive models available.</p>
<p>There are some drawbacks to using ARIMA, however. The first is that it is designed to be used only on the target feature, so I cannot leverage any of the other features in the dataset to help make predictions. Secondly, the datathon challenges participants to predict mean temperatures in 2020, but the training data ends in 2016 - a poor fit for ARIMA, which is typically used to predict the points immediately following the training series.</p>
<p>To adapt to this separation between the train and test time windows, I used training data up to 31 October 2015, then made predictions for the next 61 days in 2015 and assumed that they would also serve as predictions for 2020. Given what we learned about the target variable in my EDA, there was not a significant trend in the target over the 2 years of data in our training dataset. However, as we noted in that discussion, two years may not be a wide enough window to see a strong year over year trend emerge. It is possible that the temperatures observed in 2020 were significantly warmer or colder than those we saw in 2014 - 2016, so this approach may not be very accurate.</p>
<p>Another challenge with using ARIMA is that we must fit a separate ARIMA model for each of the 514 unique geographical locations, as described above. This means that the optimal model parameters for one region may not be optimal for another.</p>
<p>Okay, now that we’ve considered some of the issues we’ll have to address with using ARIMA, let’s set it up. First, I need to choose parameters p, q and d for the model - they determine the autoregression (p), moving average (q) and integration order (d) for the model. Typically, we can look at the autocorrelation of the time series to choose our q value and the partial autocorrelation to choose the p value, and then select a d value that minimizes the prediction error after p and q are chosen. Since we have 514 unique regions and thus 514 time series we need to consider, I chose one time series for each of the 3 most frequently appearing climate regions to use as data for fitting our model.</p>
<p>I computed both the autocorrelation and partial autocorrelation for one of these locations, as shown in Figures 17 - 19 below. As we can see in Figure 17, the plot shows that there is always a strong correlation between the current time point and previous time points over an entire year’s worth of data. This makes it more difficult to identify a particular value of q that will fit the data well.</p>
<p><img src="/images/ARIMA_loc10_ACF.png" alt="Fig 17" title="Figure 17" /> <br />
Figure 17 - Autocorrelation over the previous 365 days for location 10, target variable.</p>
<p><img src="/images/ARIMA_loc10_PACF.png" alt="Fig 18" title="Figure 18" /> <br />
Figure 18 - Partial autocorrelation for location 10 over the previous 365 days.</p>
<p>Similarly, in Figure 18, we also saw that there was consistently a strong partial autocorrelation between the current time point and individual previous time points, although the correlation more frequently alternated between positive and negative over the year-long window we examined. To try to find a pattern in the shorter term, we also plotted the partial autocorrelation over the last 50 days in Figure 19.</p>
<p><img src="/images/ARIMA_loc10_PACF_narrow.png" alt="Fig 19" title="Figure 19" /> <br />
Figure 19 - Partial autocorrelation for location 10 over the previous 50 days.</p>
<p>As we can see, there is still not a strong pattern over the previous 50 days. Overall this method has not really helped us to select good values of p and q. My next approach is to conduct a more traditional hyperparameter sweep varying the values of p, q and d for my 3 chosen locations. I tested values of 1 and 7 (a day and a week) for p, q, and d in all the possible combinations. I used an expanding window validation strategy to compute the model prediction error (using RMSE) over expanding periods of time selected from my training dataset [5]. I used the fitted models to predict the next 14 days for each fold in the expanding window strategy. (Thanks very much to Leonie for demonstrating how to implement a hyperparameter sweep with an expanding window in code [5].)</p>
<p>I found that values of (p, q, d) = (1, 1, 7) worked the best for 2 out of my 3 sample regions, so I selected that parameter set for my ARIMA model. Note that “best” in this situation means that these parameters minimized the mean RMSE computed over all the windows in the expanding window validation. The fitted ARIMA model predictions over 14 days for each of my 3 sample regions are shown in Figures 20 - 22 below. We can see that, while the model worked quite well for locations 1 and 10, the model was grossly wrong for location 139.</p>
<p><img src="/images/ARIMA_loc10_pred.png" alt="Fig 20" title="Figure 20" /> <br />
Figure 20 - Location 10 prediction with fitted ARIMA model for next 14 days.</p>
<p><img src="/images/ARIMA_loc1_pred.png" alt="Fig 21" title="Figure 21" /><br />
Figure 21 - Location 1 prediction with fitted ARIMA model for next 14 days.</p>
<p><img src="/images/ARIMA_loc139_pred.png" alt="Fig 22" title="Figure 22" /><br />
Figure 22 - Location 139 prediction with fitted ARIMA model over the next 14 days.</p>
<p>Using this set of hyperparameters, I fit an ARIMA model to all the data leading up to 31 October 2015 for each of the 514 unique locations, and then generated predictions for the next 61 days. I collated these predictions and submitted them to the datathon to obtain the RMSE against the full test dataset. I obtained an RMSE score of <strong>8.817</strong>, which is pretty terrible compared to the winning submissions, which had scores around 0.6. So overall my ARIMA approach was a very rough baseline, and in the next sections we will try to improve on it using an ML model.</p>
<h2 id="feature-engineering">Feature Engineering</h2>
<p>The Kaggle notebook for my feature engineering work is available <a href="https://www.kaggle.com/code/janed57821/wids-lgbm">here</a>. Many thanks to Leonie for sharing her tutorial on the topic [5].</p>
<p>While reading through Leonie’s tutorial on how to use the LightGBM framework (see next section), I noticed that she created a number of additional features for the training and test datasets [5]. At first I was confused about why she needed additional features because, to be honest, there already seemed to be so much data available. But after doing a bit more research, it appears that there are at least two reasons why we would want to use feature engineering to create specific additional features for time series predictions.</p>
<h3 id="data-drift">Data Drift</h3>
<p>The first reason why additional features can be helpful is due to a phenomenon known as <strong>data drift</strong> [8]. The basic definition of data drift is when data used to train a model is different in some way from the data used to validate/test the model [8]. We see this a lot in time series when, for example, the training data is from a different season than the test data, or when there is a strong trend in the data that makes past data in the series very different from future time points [8]. Drift can also be apparent in differences in geographic region, where we can expect that the ranges of values and the patterns we see may be very different for different climate regions or different hemispheres [8].</p>
<p>Admittedly, I did not look for evidence of data drift during my EDA (a lesson learned for the future!), so I cannot be sure that this phenomenon is present in the dataset. But if we assume that some drift is at least possible, then we will need to address it. In a production setting, we would frequently re-train the model as we acquired new data, so that the model would be more likely to accurately predict future values.</p>
<p>For us as participants in a datathon with a fixed dataset, we can take another approach, which is to encode information about the time points for the data we have to help the model identify and use seasonal/cyclical/trend patterns in the data for future predictions. These additional features may make the model more robust. We call this process <strong>periodic feature engineering</strong> and we’ll discuss it in more detail below.</p>
<h3 id="discontinuities">Discontinuities</h3>
<p>Another reason why we may want to create additional features related to time has to do with discontinuities in the regular datetime data type. If we use raw (ordinal) values for the day or hour, there are discontinuities at the end of a day, the end of a month and the end of a year [9] (we go from the 23rd hour to the 0th hour, or from the 31st day to the 1st day, etc.). These discontinuities in the inputs can lead to discontinuities (like spikes or troughs) in the output predictions of the model [9].</p>
<p>To address this issue, we can create more continuous representations of these times using several strategies. For example, we can encode hours in a day into a sine and cosine wave, where one period of the wave is equal to 24 hours in a day [9]. That way, the 23rd hour at the end of one day flows smoothly into the 0th hour at the start of the next day. For this dataset, we generate sine and cosine encodings for the day, week, month, quarter and season corresponding to each sample in the dataset [5, 9].</p>
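<p>A minimal sketch of this sine/cosine trick is shown below. The helper name is my own; scikit-learn's cyclical feature engineering tutorial [9] shows a fuller version, including spline alternatives.</p>

```python
import numpy as np
import pandas as pd

def cyclical_encode(values, period):
    """Map an integer time unit onto a circle so period boundaries are smooth."""
    radians = 2 * np.pi * np.asarray(values) / period
    return np.sin(radians), np.cos(radians)

# hour 23 and hour 0 now sit next to each other on the circle,
# instead of being 23 ordinal steps apart
df = pd.DataFrame({"timestamp": pd.date_range("2015-01-01", periods=6, freq="5h")})
df["hour_sin"], df["hour_cos"] = cyclical_encode(df["timestamp"].dt.hour, period=24)
print(df[["hour_sin", "hour_cos"]].round(3))
```

<p>The same helper works for day-of-week (period 7), month (period 12), and so on, which is how the periodic features for this dataset were generated.</p>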
<h2 id="using-lightgbm-to-predict-target-variable">Using LightGBM to Predict Target Variable</h2>
<p>The Kaggle notebook for the LightGBM implementation is available <a href="https://www.kaggle.com/code/janed57821/wids-lgbm">here</a>. Many thanks to Leonie for sharing her tutorial on the topic [5].</p>
<p>I chose to use the light gradient-boosting machine (LGBM) as my ML model. An LGBM is a free and open-source gradient-boosting framework that was developed by Microsoft [10]. The framework is based on a decision tree architecture and uses a lot of the same speed-ups as were provided by XGBoost [10]. I chose this model for several reasons. The first is that I wanted a tree-based model because they have good explainability (you can see where the model chose to create decision nodes within the dataset, as we’ll demonstrate below). Secondly, I liked that LGBMs leveraged a lot of the benefits of XGBoost*1, which include implementation-level details like sparse matrix optimization, the ability to do parallel training, etc. [10]. And thirdly, LGBMs have been very competitive in recent years in various time series prediction competitions [5] which makes them an optimal choice for this datathon.</p>
<p>To train my LGBM, I used the same expanding window validation strategy that I used to test my ARIMA model [5]. I chose relatively standard parameters for the model, and if I had had more time I would have performed hyperparameter tuning to optimize them. For the purposes of comparing the LGBM to ARIMA, however, I will show the results of testing my default model constructed with a maximum of 32 leaves and a max depth of 8 levels [5, 11]. I set the feature fraction to 50%, meaning that the algorithm can randomly select 50% of the training features to build the decision tree [5, 11]. I used a learning rate of 0.1, 1000 gradient boosting iterations and an early stopping condition to end training after 100 rounds if the validation RMSE does not improve in that time [5, 11].</p>
<p>The model reported a final validation RMSE of 0.94 and a final RMSE on the training data of 0.39, suggesting that the model had overfit to the dataset. (We will get the model’s final performance on the full test dataset for direct comparison with the ARIMA model shortly.)</p>
<p>I wanted to get a sense of how well my model was performing in the best and worst case so I plotted the model prediction against the validation data (ground truth). As you can see in Figure 23, my best case prediction (for location 33) was very close to the ground truth. Conversely, in Figure 24, the worst case prediction (for location 182) was pretty far off, but actually not as bad as my worst ARIMA model predictions. I say this because my prediction at least captured a similar trend and even predicted a small dip in the data – my ARIMA models could generate predictions going in completely the opposite direction to the ground truth.</p>
<p><img src="/images/LGBM_best_pred.png" alt="Fig 23" title="Figure 23" /><br />
Figure 23 - Best LGBM prediction for location 33 over 14 days.</p>
<p><img src="/images/LGBM_worst_pred.png" alt="Fig 24" title="Figure 24" /><br />
Figure 24 - Worst LGBM prediction for location 182 over next 14 days.</p>
<p>But there is even more to uncover with the LGBM model I trained, thanks to the explainability of decision tree architectures. I generated a plot of feature importance because I wanted to see if the LGBM relied heavily on the features that showed the strongest correlations with the target variable during my EDA. Before discussing the results, however, I want to point out that since I set the feature fraction to 50%, each tree in the boosted ensemble was built from a random subset of the features, so no individual tree was ever exposed to 100% of the available features [11].</p>
<p><img src="/images/LGBM_features.png" alt="Fig 25" title="Figure 25" /><br />
Figure 25 - Top 20 most important features in my trained LGBM.</p>
<p>As we can see in Figure 25, the sea level pressure (“contest-slp-14d”), precipitable water prediction (“contest-prwtr-eatm-14d”) and potential evaporation (“contest-pevpr-sfc-gauss-14d”) were all high-ranking features in the LGBM model. The NMME prediction of temperature for 3-4 weeks was also highly ranked, although my LGBM model relied on the output of an individual model within the NMME ensemble rather than the ensemble mean that we considered in the EDA. Nevertheless, I think this confirms that our EDA’s predictions about which features would be important were accurate.</p>
<p>The implementation of the LGBM model also allows for plotting the tree itself to review the final architecture and how the decision nodes were arranged. In Figure 26, we can see that the first two levels were dominated by dividing the data around temperatures (the first level splits the data around 9.785 degrees C, and the second level further splits these subsets). The potential evaporation feature is incorporated at the third level, and the other features of interest are incorporated at the 4th through 6th levels. (A note for the curious: each leaf node is labeled with a single value, which is the model’s predicted output for the data points that reach that leaf [12].)</p>
<p><img src="/images/LGBM_architecture.png" alt="Fig 26" title="Figure 26" /><br />
Figure 26 - Schematic of LGBM structure with all decision nodes and leaves shown.</p>
<p>Finally, how did our LGBM model compare to our ARIMA model? My final RMSE score was 1.53, compared to 8.817 with ARIMA. This error is still about 3X larger than the winning models for the datathon, but it shows a good improvement over ARIMA, and that’s without any hyperparameter tuning.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Thank you so much for reading my summary of my work on the WiDS 2023 Datathon. I feel like I barely scratched the surface of what it is possible to do with this dataset and all of the ML models available, but I also learned a lot from the things I did try. I may continue to iterate on this project in the future and try more models to see how they perform, and if I can get closer to the winning submissions.</p>
<h2 id="footnotes">Footnotes</h2>
<p>*1 One of the creators of XGBoost, Tianqi Chen, also teaches the Deep Learning Systems course at CMU. I took the class and it was an incredible introduction to the mechanics underlying the implementation of PyTorch for both CPU and CUDA architectures. After taking the class, I have a much greater appreciation for the importance of having a deep level of familiarity with the implementation of machine learning algorithms. I would highly recommend the course, which is also freely available <a href="https://dlsyscourse.org/">online</a>.</p>
<h2 id="references">References</h2>
<p>[1] WiDS Datathon 2023. “Adapting to Climate Change by Improving Extreme Weather Forecasts: Dataset Description.” <a href="https://www.kaggle.com/competitions/widsdatathon2023/data">https://www.kaggle.com/competitions/widsdatathon2023/data</a> Visited 11 Mar 2023.</p>
<p>[2] Linh, N. and Anh, V. “EDA, time series and climate regions analysis.” WiDS Datathon 2023. <a href="https://www.kaggle.com/code/linhnguyn555/eda-time-series-and-climate-regions-analysis">https://www.kaggle.com/code/linhnguyn555/eda-time-series-and-climate-regions-analysis</a> Visited 11 Mar 2023.</p>
<p>[3] Samaha, K. “EDA, WiDS Datathon-2023.” WiDS Datathon 2023. <a href="https://www.kaggle.com/code/khsamaha/eda-wids-datathon-2023">https://www.kaggle.com/code/khsamaha/eda-wids-datathon-2023</a> Visited 11 Mar 2023.</p>
<p>[4] Gunathilake, A. “WiDS Datathon 2023_Extreme Weather Forecasts.” WiDS Datathon 2023. <a href="https://www.kaggle.com/code/anjanagunathilake/wids-datathon-2023-extreme-weather-forecasts?scriptVersionId=119974583">https://www.kaggle.com/code/anjanagunathilake/wids-datathon-2023-extreme-weather-forecasts?scriptVersionId=119974583</a> Visited 11 Mar 2023.</p>
<p>[5] Leonie. “WiDS Datathon 2023: Forecasting with LGBM.” WiDS Datathon 2023. <a href="https://www.kaggle.com/code/iamleonie/wids-datathon-2023-forecasting-with-lgbm">https://www.kaggle.com/code/iamleonie/wids-datathon-2023-forecasting-with-lgbm</a> Visited 11 Mar 2023.</p>
<p>[6] Felicioni, F. “WiDS 2023: different locations train/test SOLVED.” WiDS Datathon 2023. <a href="https://www.kaggle.com/code/flaviafelicioni/wids-2023-different-locations-train-test-solved">https://www.kaggle.com/code/flaviafelicioni/wids-2023-different-locations-train-test-solved</a> Visited 11 Mar 2023.</p>
<p>[7] Hyndman, R. J., Athanasopoulos, G. Forecasting: Principles and Practice, 2nd. Ed. 2018. <a href="https://otexts.com/fpp2/">https://otexts.com/fpp2/</a> Visited 12 Mar 2023.</p>
<p>[8] Cheung, R., Datta, T., Massa, H., Ostermeier, S. “Data Drift for Dynamic Forecasts: An Arthur tutorial for the 2023 WiDS Datathon.” <a href="https://colab.research.google.com/drive/10r73mOp1R7cORfeuP97V65a-rgwGyfWr?usp=sharing#scrollTo=_UCopoHTKKag">https://colab.research.google.com/drive/10r73mOp1R7cORfeuP97V65a-rgwGyfWr?usp=sharing#scrollTo=_UCopoHTKKag</a> Visited 12 Mar 2023.</p>
<p>[9] “Time-related feature engineering.” scikit-learn.org. <a href="https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html">https://scikit-learn.org/stable/auto_examples/applications/plot_cyclical_feature_engineering.html</a> Visited 12 Mar 2023.</p>
<p>[10] “LightGBM.” Wikipedia. <a href="https://en.wikipedia.org/wiki/LightGBM">https://en.wikipedia.org/wiki/LightGBM</a> Visited 12 Mar 2023.</p>
<p>[11] “Parameters.” LightGBM. Microsoft Corporation, 2023. <a href="https://lightgbm.readthedocs.io/en/latest/Parameters.html">https://lightgbm.readthedocs.io/en/latest/Parameters.html</a> Visited 12 Mar 2023.</p>
<p>[12] “lightgbm.plot_tree.” LightGBM. Microsoft Corporation, 2023. <a href="https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.plot_tree.html">https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.plot_tree.html</a> Visited 12 Mar 2023.</p>

<h1><a href="http://sassafras13.github.io/ARIMA">Using the ARIMA Model for Forecasting</a> (2023-02-18)</h1>

<p>An important part of applying machine learning models to time series data is having a good baseline model for comparison. In this post, I want to learn about the AutoRegressive Integrated Moving Average (ARIMA) modeling approach for time series data. This is not considered a machine learning model, but was developed separately in the fields of statistics and econometrics and can serve as a good, competitive baseline for comparison with ML tools [1, 2].</p>
<p>The ARIMA model is comprised of several components, which refer to different letters in the name, that is [3]:</p>
<ul>
<li><strong>Autoregression (AR)</strong>: An autoregressive model will identify and use relationships between the current entry in the time series and lagged (i.e. previous) entries. This basically means that the model knows that past entries in the time series will affect the current value [4]. Typically an autoregressive model is a linear regression model that tries to fit coefficients to \(p\) previous observations [4]:</li>
</ul>
\[y_t = \beta_1 y_{t-1} + \beta_2 y_{t-2} + \dots + \beta_p y_{t-p}\]
<ul>
<li>
<p><strong>Integrated (I)</strong>: Using differences between entries in the time series to make it stationary (i.e. the mean of the distribution is fixed). We can compute one difference as \(y_t - y_{t-1}\), or do second-order differencing, i.e. \((y_t - y_{t-1}) - (y_{t-1} - y_{t-2})\), or some higher order difference, \(d\) [4].</p>
</li>
<li>
<p><strong>Moving average (MA)</strong>: Using the relationship between the current observation and the residual error in a moving average computed over \(q\) lagged entries in the series. Mathematically, this can be expressed as [4]:</p>
</li>
</ul>
\[y_t = \epsilon_t + \alpha_1 \epsilon_{t-1} + \alpha_2 \epsilon_{t-2} + \dots + \alpha_q \epsilon_{t-q}\]
<p>The full ARIMA model combines all of these terms together linearly. Generally, when calling an implementation of ARIMA in Python or some other coding language, we will need to specify the parameters \(p\), \(d\), \(q\), as discussed above [3, 4]. Specifically, these parameters define [3, 4]:</p>
<ul>
<li>\(p\): the number of autoregressive terms (the AR order)</li>
<li>\(d\): the differencing order (for the I component)</li>
<li>\(q\): the number of moving average terms (the MA order)</li>
</ul>
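<p>The differencing step (the \(d\) parameter) is easy to see with <code>numpy</code>; the toy series below is my own illustration.</p>

```python
import numpy as np

y = np.array([3.0, 7.0, 12.0, 18.0, 25.0])  # toy series with a growing trend
first = np.diff(y)         # d = 1:  y_t - y_{t-1}
second = np.diff(y, n=2)   # d = 2:  (y_t - y_{t-1}) - (y_{t-1} - y_{t-2})
print(first)   # [4. 5. 6. 7.]  (still trending)
print(second)  # [1. 1. 1.]     (constant, i.e. stationary)
```

<p>A library implementation such as statsmodels' <code>ARIMA</code> class performs this differencing internally when you pass <code>order=(p, d, q)</code>.</p>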
<p>There are several ways to choose the values of \(p\), \(d\) and \(q\). To choose \(p\) and \(q\), we can use autocorrelation functions to determine across how many previous time steps there is a significant correlation with the current time step [2, 3]. For the MA order (\(q\)), we can use the autocorrelation function (ACF) to determine the optimal number of terms to include in a moving average [4]. Specifically, we would look for the number of previous entries in the time series that had a significant autocorrelation value [3]. For the AR order (\(p\)), we can use the partial autocorrelation function (PACF), which computes the correlation between two points in the time series while excluding the influence of the points between them [4]. Choosing the differencing order is a little less straightforward: [2] recommends testing different values of \(d\) and comparing the model’s performance each time, using RMSE as the metric.</p>
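<p>As a sketch of what the ACF measures, here is a hand-rolled version applied to a synthetic seasonal series. In practice you would use statsmodels' <code>plot_acf</code> and <code>plot_pacf</code>, which also draw significance bands; the helper below is my own simplified version.</p>

```python
import numpy as np

def acf(series, max_lag):
    """Sample autocorrelation for lags 1..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return [float(np.dot(x[:-k], x[k:]) / denom) for k in range(1, max_lag + 1)]

# a noisy sine wave with a 20-step period: the ACF should spike near
# lag 20 (in phase) and dip near lag 10 (out of phase)
t = np.arange(200)
rng = np.random.default_rng(0)
series = np.sin(2 * np.pi * t / 20) + 0.1 * rng.standard_normal(t.size)
correlations = acf(series, max_lag=20)
print(round(correlations[9], 2), round(correlations[19], 2))
```

<p>Lags with strong correlations like these are the candidates we would consider when setting \(q\) (and, with the PACF, \(p\)).</p>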
<h2 id="references">References</h2>
<p>[1] Joseph, M. Modern Time Series Forecasting with Python : Explore Industry-Ready Time Series Forecasting Using Modern Machine Learning and Deep Learning. 1st ed. Birmingham: Packt Publishing, Limited, 2022.</p>
<p>[2] “Autoregressive integrated moving average.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average">https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average</a> Visited 18 Feb 2023.</p>
<p>[3] Brownlee, J. “How to Create an ARIMA Model for Time Series Forecasting in Python.” Machine Learning Mastery, 10 Dec 2020. <a href="https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/">https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/</a> Visited 18 Feb 2023.</p>
<p>[4] Maklin, C. “ARIMA Model Python Example - Time Series Forecasting.” Towards Data Science, 25 May 2019. <a href="https://towardsdatascience.com/machine-learning-part-19-time-series-and-autoregressive-integrated-moving-average-model-arima-c1005347b0d7">https://towardsdatascience.com/machine-learning-part-19-time-series-and-autoregressive-integrated-moving-average-model-arima-c1005347b0d7</a> Visited 18 Feb 2023.</p>

<h1><a href="http://sassafras13.github.io/StatsBasics">A Review of the Fundamentals of Statistics</a> (2023-02-17)</h1>

<p>I need to review some common statistical tests for comparing samples, but before I do that, I wanted to start with a brief overview of some key statistical concepts. I will outline where statistics fit into the scientific method, the fundamental ideas behind a statistical test, parametric and non-parametric tests, as well as different types of data that we can analyze with statistics.</p>
<h2 id="the-scientific-method-with-statistics">The Scientific Method, with Statistics</h2>
<p>The general steps in the scientific method are [3]:</p>
<ol>
<li>Define the research question.</li>
<li>Formulate the hypothesis.</li>
<li>Design the experiment.</li>
<li>Collect data.</li>
<li>Analyze the data.</li>
<li>Make conclusions.</li>
</ol>
<p>In this process, we use statistics to formulate the hypothesis (step 2) and to choose the sample size as part of the experimental design (step 3), as well as to analyze the results (step 5) [3]. In step 2, we want to clearly define the hypothesis in a way that we can accept or reject it using the results of our experiment, and we may need to define several, connected hypotheses to address the research question [3]. We will discuss this process in more detail in the next section.</p>
<p>In step 3, we need to collect sufficient samples (observations, data) to be able to rigorously accept or reject the hypothesis [3]. To do this, the samples must be [3]:</p>
<ol>
<li>Random</li>
<li>Representative</li>
<li>Sufficiently large</li>
</ol>
<p>The first item, randomness, is necessary because probability and statistics are based on the idea that samples are drawn randomly and that any sample we draw is equally likely to be drawn as any other [3]. Of course, sometimes it is impossible to draw purely random samples, in which case you should proceed cautiously and report as part of your results that you know the samples are not purely random [3]. The second item, that the samples are representative of the entire population they are drawn from, is likewise somewhat impossible to guarantee in practice, so we instead accept that if we get enough data, it will likely be reasonably representative of the population [3]. This leads us to the third item, the size of the data we collect. Here, we do not want to just collect a massive dataset - the point of statistics is to intelligently size the sample - so we determine how much data we should collect based on factors such as the size of the difference we are trying to observe, the variation in our target variable, the type of statistical analysis we are doing, and the sampling cost [3].</p>
<p>And step 5, analyzing the data, is where the bulk of statistics operates.</p>
<h2 id="the-basics-of-a-statistical-test">The Basics of a Statistical Test</h2>
<p>In statistics, we will often derive a <strong>null hypothesis</strong> from the research hypotheses [3]. The null hypothesis assumes that there is no difference between two groups, and that any differences that are observed are purely by chance [3]. For example, if we are testing whether a vaccine is effective against a disease by inoculating a control group with saline and a test group with the vaccine, the null hypothesis is that these two groups will not show any meaningful difference in their susceptibility to the disease [3]. We often refer to the null hypothesis as “H-nought” or \(H_0\) [3]. To be specific, in our example, \(H_0\) states that [3]:</p>
\[H_0 : \pi_1 = \pi_C\]
<p>where \(\pi_1\) is the proportion of patients with the disease in the treated group, and \(\pi_C\) is the proportion of patients with the disease in the control group [3]. Given the number of patients in each group and the incidence rate of disease in each group, we can compute the probability of observing our results if the null hypothesis were true [3]. For example, let’s say that for 10,000 patients in each group, we see that \(\pi_1 = 0.01\) and \(\pi_C = 0.04\) - that is, the treated group had a quarter of the incidence rate of the control group. The probability that this outcome arose purely by chance (i.e. the probability of the data under the null hypothesis) is very small, and so we can <strong>reject</strong> the null hypothesis and conclude that there is a statistically significant reason why the test group was healthier than the control group [3].</p>
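<p>To make the vaccine example concrete, here is a small permutation-test simulation. This is a sketch: the counts of 100 and 400 sick patients are my own illustration of the 0.01 and 0.04 incidence rates. Under \(H_0\) the group labels are interchangeable, so we shuffle them and ask how often chance alone reproduces a difference as large as the one observed.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000                                    # patients per group
sick_treated, sick_control = 100, 400         # 1% vs 4% incidence (illustrative)
observed_diff = sick_control / n - sick_treated / n

# pool both groups: under H0 there is just one shared incidence rate
outcomes = np.zeros(2 * n)
outcomes[: sick_treated + sick_control] = 1.0

extreme = 0
for _ in range(2_000):
    rng.shuffle(outcomes)                     # randomly relabel the two groups
    diff = abs(outcomes[:n].mean() - outcomes[n:].mean())
    if diff >= observed_diff:
        extreme += 1
print(extreme / 2_000)  # close to 0: chance essentially never produces a 3% gap
```

<p>The tiny estimated probability is exactly the kind of evidence that lets us reject \(H_0\).</p>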
<p>More specifically, we typically need to establish <em>a priori</em> a probability threshold: below it we reject the null hypothesis, and above it we accept the null hypothesis. Put another way, if the probability that the difference between the two groups is entirely due to chance is very small, then we can reject the null hypothesis, but if the probability is high that the measured difference is pure coincidence, then we accept the null hypothesis [3]. If we can reject the null hypothesis, then we can usually accept our experimental hypothesis instead (i.e. that this vaccine is an effective treatment for the disease) [3].</p>
<p>It is worth noting that this statistical test of either accepting or rejecting the null hypothesis is really asking the question of whether we would see similar results that differ so much from the null hypothesis again if we repeated the experiment [3]. It is also important to remember that accepting or rejecting a hypothesis is not the same as <em>proving</em> a hypothesis [3]. Since this method involves probability, there is always a non-zero possibility that the hypothesis we rejected was actually true, and vice versa [3].</p>
<p>There are two ways that researchers will typically assess the null hypothesis. The first is to compute the probability that the results were measured if the null hypothesis is true, and then use this probability to accept or reject the null hypothesis [3]. This probability is often called either the <strong>significance level</strong> or the <strong>P value</strong> [3]. When the P value is large, then we accept the null hypothesis [3]. Conversely, a more “decisive” approach (according to Dowdy and Wearden) is to set a <strong>rejection level</strong> before the experiment is conducted [3]. Then given the experimental data, a probability is computed and if it is below this predetermined level, the null hypothesis is rejected [3]. This method was more in favor when computers were not yet widely used for computation.</p>
<p>Depending on your research project, you may have specific hypotheses that you want to verify, or you may simply want to explore the data that you have [2]. If you have specific hypotheses, then you can use statistical tests to accept or reject them, but you should be careful of trying to test too many hypotheses with the same dataset [2]. For example, Campbell and Shantikumar recommend using a Bonferroni correction, which sets the significance level for \(n\) independent hypotheses at \(0.05 / n\) [2]. So, for example, when testing two independent hypotheses, each result is significant only if \(P &lt; 0.025\) [2].</p>
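<p>The Bonferroni correction is simple to apply in code; the p-values below are made up for illustration.</p>

```python
def bonferroni_threshold(n_hypotheses, alpha=0.05):
    """Per-hypothesis significance level for n independent hypotheses."""
    return alpha / n_hypotheses

p_values = [0.012, 0.030, 0.001]                  # illustrative test results
threshold = bonferroni_threshold(len(p_values))   # 0.05 / 3 ≈ 0.0167
significant = [p < threshold for p in p_values]
print(significant)  # → [True, False, True]
```

<p>Note that \(p = 0.030\) would have counted as significant at the uncorrected 0.05 level, which is exactly the multiple-testing trap the correction guards against.</p>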
<h2 id="parametric-and-non-parametric-statistical-tests">Parametric and Non-Parametric Statistical Tests</h2>
<p>There are a range of statistical tests available for comparing sets of data [1]. They often use different measures of the data, such as the mean or standard deviation, to compare against some criteria to determine if the two datasets are similar or not [1]. Broadly speaking, there are two classes of statistical tests: <strong>parametric</strong> and <strong>non-parametric</strong> tests [1]. The term “parametric” means that the test (or model, in the case of machine learning) makes some specific assumptions about the data being processed which affect the way the test is performed [2]. For example, many parametric tests (as we will see shortly) assume that the data is normally distributed [2]. Conversely, non-parametric tests make no such assumptions, and so might be a better choice if you don’t know how your data is distributed (or you know it is not normally distributed) [2].</p>
<h2 id="types-of-data">Types of Data</h2>
<p>Once the hypothesis has been defined, we need to consider what kinds of data we are going to collect (following steps 3-4 in the scientific method above) [2]. There are different taxonomies for classifying data that we can collect through measurement. One posed by Stevens has 4 types of data (nominal, ordinal, interval and ratio) [6]; others distinguish nominal, ordinal, quantitative (discrete) and quantitative (continuous) data [2, 3]. I was not sure what nominal and ordinal meant, so I wanted to define them here:</p>
<ul>
<li>
<p><strong>Nominal:</strong> Nominal data consists of labels or names applied to data points that are not numeric [4]. Typically the labels come from different categories that don’t overlap, and that do not have any hierarchical ranking [4]. We could only measure the central tendency of nominal data using the mode [4].</p>
</li>
<li>
<p><strong>Ordinal:</strong> Similar to nominal data, ordinal data is also a set of labels used to qualitatively describe data points [5]. However, unlike nominal data, there is an assumed ranking to the ordinal scale [5].</p>
</li>
</ul>
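<p>In pandas these two scale types map onto unordered and ordered categoricals; the categories below are my own illustration, not from the sources above.</p>

```python
import pandas as pd

# Nominal: labels with no ranking
blood_type = pd.Series(["A", "O", "B", "O"], dtype="category")

# Ordinal: labels with a declared ranking
severity = pd.Series(
    ["mild", "severe", "moderate"],
    dtype=pd.CategoricalDtype(["mild", "moderate", "severe"], ordered=True),
)

print(blood_type.cat.ordered)        # False: ordering comparisons are invalid
print((severity > "mild").tolist())  # [False, True, True]
print(blood_type.mode().tolist())    # ['O']: the mode is the only valid
                                     # measure of central tendency here
```

<p>Declaring the ordering up front lets downstream tests and plots respect the scale instead of treating the labels alphabetically.</p>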
<p>The type of data we are analyzing will affect the type of comparison test that we choose. Let’s do a brief survey of the types of comparison tests available to us.</p>
<h2 id="references">References</h2>
<p>[1] Sirisilla, S. “7 Ways to Choose the Right Statistical Test for Your Research Study.” Enago.com. 4 Jan 2023. <a href="https://www.enago.com/academy/right-statistical-test/">https://www.enago.com/academy/right-statistical-test/</a> Visited 17 Feb 2023.</p>
<p>[2] Campbell, M.J., Shantikumar, S. “Parametric and Non-parametric tests for comparing two or more groups.” HealthKnowledge, 2016. <a href="https://www.healthknowledge.org.uk/public-health-textbook/research-methods/1b-statistical-methods/parametric-nonparametric-tests">https://www.healthknowledge.org.uk/public-health-textbook/research-methods/1b-statistical-methods/parametric-nonparametric-tests</a> Visited 17 Feb 2023.</p>
<p>[3] Dowdy, S., Wearden, S. Statistics for Research, Second Edition. John Wiley & Sons, 1991.</p>
<p>[4] “What is Nominal Data? Definition, Examples, Variables & Analysis.” Simplilearn, 30 Sept 2022. <a href="https://www.simplilearn.com/what-is-nominal-data-article">https://www.simplilearn.com/what-is-nominal-data-article</a> Visited 17 Feb 2023.</p>
<p>[5] “Ordinal data.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Ordinal_data">https://en.wikipedia.org/wiki/Ordinal_data</a> Visited 17 Feb 2023.</p>
<p>[6] “Statistical data type.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Statistical_data_type">https://en.wikipedia.org/wiki/Statistical_data_type</a> Visited 17 Feb 2023.</p>

<h1><a href="http://sassafras13.github.io/StatsTests">Common Statistical Hypothesis Tests</a> (2023-02-17)</h1>

<p>In this post I want to survey some common statistical tests for comparing two or more samples from different populations to determine if they are different or not. This post builds on ideas I discussed in a <a href="https://sassafras13.github.io/StatsBasics/">previous one</a>. The comparison tests that we will review include:</p>
<ol>
<li>Student’s t-test</li>
<li>ANOVA</li>
<li>Z-test</li>
<li>Chi-squared test</li>
</ol>
<h2 id="students-t-test">Student’s t-test</h2>
<p>The t-test is a common test for determining whether or not two groups are similar. It compares the means of the two groups [1]. The test was developed when William Sealy Gosset noted that if fewer than 30 random samples are drawn from a normally distributed population, then the distribution of the t statistic computed from those samples is <em>not</em> normally distributed [2]. The t statistic is [2]:</p>
\[t = \frac{\bar{y} - \mu}{s / \sqrt{n}}\]
<p>Gosset found that the tails of the distribution of \(t\) for small \(n\) were larger than for the standard normal distribution, but as \(n\) increases the distribution becomes more normal [2]. The t distributions have several properties [2]:</p>
<ol>
<li>Unimodal</li>
<li>Asymptotic to the horizontal</li>
<li>Symmetrical about 0</li>
<li>Dependent on the number of degrees of freedom, \(v\), where in this case \(v = n - 1\)</li>
<li>More variable than the standard normal distribution</li>
<li>But approximately standard normal if the number of degrees of freedom is large</li>
</ol>
<p>We can use the t distribution to estimate probabilities for a non-normal distribution if it is, at least, symmetric, unimodal and does not have a very large variance [2]. There are tables available (or software) that list the critical t-values for different degrees of freedom and different probabilities that t exceeds the critical value (often referred to as \(\alpha\)) [2]. For example, the critical t-value for the case where there are 21 degrees of freedom and the probability that the measured t-value exceeds the critical value is 0.05 is \(t_{0.05, 21} = 1.721\) [2]. Now let’s see how we can use it for comparison testing (it can be used for other things as well).</p>
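<p>As a concrete illustration (my own sketch, not from the original text), critical t-values can be looked up with SciPy instead of a printed table:</p>

```python
from scipy import stats

# One-sided critical t-value for alpha = 0.05 with 21 degrees of freedom,
# matching the table value t_{0.05, 21} = 1.721
t_crit = stats.t.ppf(1 - 0.05, df=21)

# As the degrees of freedom grow, the t distribution approaches the
# standard normal, whose 95th percentile is about 1.645
t_large_df = stats.t.ppf(0.95, df=10_000)
```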
<h3 id="matched-paired-t-test">Matched (Paired) t-test</h3>
<p>We can use the t-test to make inferences related to the mean of the difference between two <em>matched</em> groups [2]. We can do this as long as our data is [2]:</p>
<ol>
<li>From a normal distribution (or at least a symmetric and unimodal one)</li>
<li>From a population whose variance is unknown (we estimate it from our sample)</li>
<li>Randomly sampled</li>
</ol>
<p>Matched groups will have some similarity between them, for example test scores for a set of students before and after taking a specific class [2]. The important thing is that both sets of data must have come from the same source, in some fashion, to remove variability that confounds the results [2]. Given the two groups of data, we can compute the difference between the two groups and the mean and variance of the difference. We can use this to compute the t-value and then compare it to the critical t-value for this particular number of degrees of freedom and our predetermined value of \(\alpha\). If the calculated t-value exceeds the critical value, then we can reject the null hypothesis and say that there is truly a difference between the two groups [2]. Note that we can also use the t-test to compute <strong>confidence intervals</strong> as needed [2].</p>
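<p>A minimal sketch of the matched t-test using SciPy, with invented before/after scores for illustration; it also verifies that the paired test is just a one-sample t-test on the differences:</p>

```python
import numpy as np
from scipy import stats

# Hypothetical test scores for the same 8 students before and after a class
before = np.array([72, 65, 80, 70, 68, 75, 78, 62])
after = np.array([78, 70, 82, 75, 66, 80, 85, 70])

# Paired t-test on the matched samples
t_stat, p_value = stats.ttest_rel(after, before)

# The same t statistic computed by hand from the differences
diffs = after - before
t_manual = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(len(diffs)))
```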
<h3 id="independent-two-sample-t-test">Independent (Two Sample) t-test</h3>
<p>There is another version of the t-test that allows us to compare samples from two different populations. In this case, we assume that the two populations are [2]:</p>
<ol>
<li>Independent</li>
<li>(Approximately) normal</li>
<li>Of unknown variance, but we assume the same variance in both populations</li>
</ol>
<p>In this case, the t-test can be written as [2]:</p>
\[t = \frac{ (\bar{y}_1 - \bar{y}_2) - (\mu_1 - \mu_2) } { \sqrt{ \frac{s_p^2}{n_1} + \frac{s_p^2}{n_2} } }\]
<p>Where the variance, \(s_p\), can be estimated as the <strong>pooled sample variance</strong> using a weighted average approach [2]:</p>
\[s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\]
<p>We should note, however, that this test is not accurate if the variances of the two groups are different. This test should really only be used if we know the two samples are independent.</p>
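<p>A sketch of the independent two-sample test on simulated data, computing the pooled-variance t statistic by hand and checking it against SciPy's equal-variance t-test:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(10.0, 2.0, size=15)
group2 = rng.normal(12.0, 2.0, size=20)

# Pooled sample variance: a weighted average of the two sample variances
n1, n2 = len(group1), len(group2)
s1, s2 = group1.var(ddof=1), group2.var(ddof=1)
sp2 = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)

# t statistic under the null hypothesis mu_1 = mu_2
t_manual = (group1.mean() - group2.mean()) / np.sqrt(sp2 / n1 + sp2 / n2)

# SciPy's equal-variance test uses the same pooled estimate
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=True)
```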
<h2 id="anova">ANOVA</h2>
<p>The Analysis of Variance (ANOVA) family of tests extends upon the capabilities of the t-test by allowing us to test hypotheses for 3 or more independent samples taken from different normal populations with different means but <strong>common variances</strong> [2]. In this scenario, the null hypothesis is that the means for all sets of samples are equal.</p>
<p>Let’s understand why the ANOVA test focuses on variance, and what that means from a mathematical standpoint. Let’s assume we’re going to compare the time required to solve a maze for 3 groups of mice who have had increasing amounts of knowledge of the maze [2]. There is an average time to solve the maze computed over all the mice for all the groups (\(\bar{y}\)), and there is an average time to solve the maze for each set of mice (\(\bar{y}_i\) for the \(i\)-th set of mice) [2]. Given this, we can write an additive model that describes the components contributing to the time to solve the maze for the \(j\)-th mouse in the \(i\)-th group [2]:</p>
\[y_{ij} = \bar{y} + (\bar{y}_i - \bar{y}) + \epsilon_{ij}\]
<p>Here, the first term is the mean time to solve the maze for all the mice, the second term is the <strong>mean treatment effect</strong>, or the adjustment to the mean, for all the mice in the \(i\)-th group, and finally the third term captures the stochastic variation in each mouse’s performance [2]. Note that we can assign the variable \(\alpha_i\) to the mean treatment effect term, and use it to rewrite the null hypothesis for this test [2]:</p>
\[H_0 : \alpha_1 = \alpha_2 = \alpha_3 = 0\]
<p>If the mean treatment effect is the same for all the mice, and we assume that the stochastic noise, \(\epsilon_{ij}\), is i.i.d. with mean 0 and variance \(\sigma^2\), then this is equivalent to saying we assume the mean for each group is equal [2]. This formulation of the problem is called a <strong>one-way completely randomized ANOVA</strong> [2].</p>
<p>As we saw earlier, there are different averages that we can calculate for our dataset, and there is a corresponding variance for each of these averages. Specifically [2]:</p>
<ul>
<li>
<p><strong>Total variance</strong>: \(\frac{\sum_i \sum_j (y_{ij} - \bar{y})^2} {na - 1}\) (where \(a\) is the number of groups and \(n\) is the number of samples within each group). This is the variance corresponding to the grand average for the entire dataset.</p>
</li>
<li>
<p><strong>Within-group variance</strong>: \(\frac{ \sum_i \sum_j (y_{ij} - \bar{y}_i)^2}{a (n - 1)}\). This is the variance of the observations around their own group mean, averaged over the groups. It is also called the <strong>pooled variance</strong>.</p>
</li>
<li>
<p><strong>Among-group variance</strong>: \(n \left[ \frac{\sum_i (\bar{y}_i - \bar{y})^2} {a - 1} \right]\). This is the variance between groups with respect to the grand average.</p>
</li>
</ul>
<p>If the null hypothesis is true, then the within-group variance should be roughly the same as the among-group variance, but if we reject the null hypothesis, then the among-group variance should be larger [2].</p>
<p>To perform the actual statistical test, we compute the F statistic which is a ratio of the among-group variance to the within-group variance, computed using mean squares, specifically [2]:</p>
\[F = \frac{ \text{among-MS}}{ \text{within-MS}}\]
<p>We compare the F statistic with the corresponding critical value to decide to reject or accept the null hypothesis (this is a one-sided test because we will only reject the null hypothesis if the \(\text{among-MS}\) is greater than the \(\text{within-MS}\)) [2].</p>
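<p>A small worked example (with invented maze-solving times for three groups of mice) that also verifies the F statistic against the ratio of mean squares defined above:</p>

```python
import numpy as np
from scipy import stats

# Hypothetical times (seconds) to solve the maze, n = 5 mice per group
group_a = np.array([34.0, 38.0, 40.0, 31.0, 35.0])
group_b = np.array([28.0, 30.0, 27.0, 33.0, 29.0])
group_c = np.array([22.0, 25.0, 24.0, 20.0, 23.0])

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

# The same F statistic from the mean squares (balanced design)
groups = [group_a, group_b, group_c]
a, n = len(groups), len(group_a)
grand_mean = np.concatenate(groups).mean()
among_ms = n * sum((g.mean() - grand_mean) ** 2 for g in groups) / (a - 1)
within_ms = sum(((g - g.mean()) ** 2).sum() for g in groups) / (a * (n - 1))
```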
<h2 id="z-test">Z-test</h2>
<p>The z-test is similar in concept to the t-test presented above, with the main difference being that the z statistic is assumed to be normally distributed [3]. This test also requires that we know the variance of the population from which we drew samples, which in practice is rare so this test is not used very often [3]. The z-test also does not have different critical values determined by sample size, unlike the t-test [3]. Based on these facts, I will forego the details of the calculation, assuming that in practice I would probably use the t-test instead for this particular application.</p>
<h2 id="chi-squared-test">Chi-Squared Test</h2>
<p>Unlike the other tests that we’ve seen so far, a chi-squared test is a non-parametric test [1] that can also be used to compare samples from different groups [4]. This test is primarily used to analyze data contained in <strong>contingency tables</strong> for large sample sizes [4]. A contingency table is used to record the number of samples that have different categorical properties, for example to tabulate how many dogs and cats have spots or stripes [5]. We typically use a chi-squared test to determine if two of these categorical variables (like species and coat pattern) are independent of each other [4]. I’m also not going to spend a lot of time detailing this test because I do not typically work with data of this kind in my research.</p>
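<p>Even though I won’t detail this test, a quick sketch of a chi-squared independence test on an invented contingency table looks like this:</p>

```python
import numpy as np
from scipy import stats

# Invented contingency table: rows are species (dog, cat),
# columns are coat pattern (spots, stripes)
table = np.array([[30, 10],
                  [15, 25]])

chi2, p_value, dof, expected = stats.chi2_contingency(table)
# dof = (rows - 1) * (cols - 1) = 1 for a 2x2 table
```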
<p>Thanks for reading along as I explore some common statistical tests.</p>
<h2 id="references">References</h2>
<p>[1] Sirisilla, S. “7 Ways to Choose the Right Statistical Test for Your Research Study.” Enago.com. 4 Jan 2023. <a href="https://www.enago.com/academy/right-statistical-test/">https://www.enago.com/academy/right-statistical-test/</a> Visited 17 Feb 2023.</p>
<p>[2] Dowdy, S., Wearden, S. Statistics for Research, Second Edition. John Wiley & Sons, 1991.</p>
<p>[3] “Z-test.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Z-test">https://en.wikipedia.org/wiki/Z-test</a> Visited 17 Feb 2023.</p>
<p>[4] “Chi-squared test.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Chi-squared_test">https://en.wikipedia.org/wiki/Chi-squared_test</a> Visited 17 Feb 2023.</p>
<p>[5] “Contingency tables.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Contingency_table">https://en.wikipedia.org/wiki/Contingency_table</a> Visited 17 Feb 2023.</p>Using ML Models for Time Series Forecasting2023-02-17T00:00:00+00:002023-02-17T00:00:00+00:00http://sassafras13.github.io/TimeSeriesForecastingML<p>I have recently been doing a deep dive into time series forecasting with machine learning. I have discussed some of the basic principles, some considerations for pre-processing data, and techniques for performing an EDA on time series data. Today, I want to write about how to apply familiar ML models to time series forecasting. On the surface I assumed that this would be a simple process, but as we’ll see in a minute there is a paradigm shift that has to be made to think about predicting the future, and a lot of nuance to applying ML to this particular problem. Let’s get started!</p>
<h2 id="time-series-forecasting-is-supervised-learning">Time Series Forecasting is Supervised Learning</h2>
<p>Brownlee shows that time series forecasting can be posed as a supervised learning problem for ML models [1]. The challenge, as we will see shortly, is how to prepare our data to fit into the supervised learning paradigm so that we can apply our usual ML tools to it. Recall that the supervised learning framework involves giving a model some input \(X\) and asking it to learn to predict a target variable, \(y\) [1].</p>
<h2 id="the-sliding-window-method">The Sliding Window Method</h2>
<p>So how do we apply this framework to a time series? One approach is known as the <strong>sliding window method</strong> or the <strong>lag method</strong> [1]. As an example, let’s consider that we have the daily average temperature in our neighborhood as a time series for the last 10 days:</p>
\[X = [40, 35, 37, 28, 24, 30, 32, 45, 37, 38]\]
<p>How do we convert this data to both input and output? We can define the target variable to be the next entry in the series, like so [1]:</p>
\[X = [?, 40, 35, 37, 28, 24, 30, 32, 45, 37, 38]\]
\[y = [40, 35, 37, 28, 24, 30, 32, 45, 37, 38, ?]\]
<p>Now, for every entry in \(X\) (excluding the first and last) the corresponding entry in \(y\) is the next day’s temperature [1]. The order of the time series is preserved in both \(X\) and \(y\) [1]. We will want to delete the first and last entries in both time series since we are missing entries there [1]. The lag in this method refers to the number of previous time steps there are - in this case, we have a lag of 1 [1].</p>
<p>The Pandas library has a useful function for doing this automatically, known as the <code class="language-plaintext highlighter-rouge">shift()</code> function, which can make copies of columns and pull the rows forwards or backwards in time [2].</p>
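<p>A minimal sketch of building a lag-1 feature with <code class="language-plaintext highlighter-rouge">shift()</code>, using the temperature series from the example above:</p>

```python
import pandas as pd

temps = [40, 35, 37, 28, 24, 30, 32, 45, 37, 38]
df = pd.DataFrame({"temp": temps})

# shift(1) pushes the column down one row, so each row's lag feature
# holds the previous day's temperature
df["temp_lag1"] = df["temp"].shift(1)

# The first row has no lagged value, so we drop it
df = df.dropna()
```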
<h2 id="the-sliding-window-method-for-multivariate-time-series">The Sliding Window Method for Multivariate Time Series</h2>
<p>Now that we have a basic grasp of the sliding window method, let’s apply it to multivariate time series - that is, time series with multiple features. Brownlee argues that ML is particularly well suited, compared to classical methods, for modeling multivariate time series data [1]. We can use the same sliding window approach for multivariate time series, and we can do so to predict either one or multiple features [1]. As an example, let’s imagine we now have the daily average relative humidity data in addition to our temperature data:</p>
\[X = \begin{bmatrix} 40 & 35 & 37 & 28 & 24 & 30 & 32 & 45 & 37 & 38 \cr 0.4 & 0.32 & 0.38 & 0.2 & 0.1 & 0.15 & 0.4 & 0.39 & 0.38 & 0.28 \end{bmatrix}\]
<p>If our goal was to predict only the temperature data, then we could apply the sliding window and make the lagged temperature data our target variable, \(y\) [1]. But if we wanted to predict both the input variables, then we could just set both the lagged versions of the inputs as the target variables [1].</p>
<p>So far this approach has allowed us to predict one step into the future, but what if we want to predict multiple steps into the future?</p>
<h2 id="multi-step-forecasting">Multi-Step Forecasting</h2>
<p>Continuing with the sliding window method, one way we could increase the range of our forecasts - for example from one day to two days in the future - would be to create two target variables, with one and then two days’ lag, as follows [1]:</p>
\[X = [?, 40, 35, 37, 28, 24, 30, 32, 45, 37, 38]\]
\[y_1 = [40, 35, 37, 28, 24, 30, 32, 45, 37, 38, ?]\]
\[y_2 = [35, 37, 28, 24, 30, 32, 45, 37, 38, ?, ?]\]
<p>Brownlee points out that while this technique can work, we’re now putting a higher “burden” on the input data to help us predict more output data [1]. Zulkifli recommends having a lag that is long enough to capture the period of a cycle of the data, whatever that might be [5].</p>
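<p>The two-step targets above can be built with negative shifts, which pull future values backwards in time (again using the temperature example):</p>

```python
import pandas as pd

temps = [40, 35, 37, 28, 24, 30, 32, 45, 37, 38]
df = pd.DataFrame({"temp": temps})

# Negative shifts create the one- and two-step-ahead targets
df["y1"] = df["temp"].shift(-1)  # tomorrow's temperature
df["y2"] = df["temp"].shift(-2)  # the day after tomorrow

# The last two rows are missing at least one target, so drop them
df = df.dropna()
```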
<h2 id="data-pre-processing-considerations-for-sliding-window-methods">Data Pre-Processing Considerations for Sliding Window Methods</h2>
<p>When we prepare data for time series forecasting with ML models, there are a couple of best practices to follow. First, the data should be scaled or normalized on input [5]. Next, we should try to provide stationary data to the model, so we should detrend the input data if possible [5]. (Note that if we detrend the inputs, the model forecasts the detrended series, so the trend must be added back to the predictions to recover values on the original scale.)</p>
<h2 id="using-random-forests-to-generate-time-series-forecasts">Using Random Forests to Generate Time Series Forecasts</h2>
<p>Now that we know how to prepare a dataset for time series forecasting, let’s understand what it means to use a common ML model - random forests - to make a time series prediction. The random forest is a popular model in industry because it is highly interpretable, since you can interrogate the tree structure and understand where the tree is splitting the dataset [5]. However, as Mwiti points out, random forests are not capable of extrapolation, because they are designed to make predictions by taking inputs and using them to output a leaf node whose value is based on the values of the input data [6]. This means that, while we can use random forests to make predictions based on the training data that we have, we should be prepared, in practice, to retrain these models often as new data comes in [6].</p>
<p>Recall that random forests are an ensemble model that use decision trees as the individual learners [3]. Each decision tree in the ensemble is trained on bootstrapped data - that is, data sampled from the dataset with replacement - and then the predictions made by each individual decision tree are combined together [3]. We often allow the decision trees inside the random forest to overfit somewhat on the bootstrapped data, which, in practice, means that we do not prune the trees [3]. In a random forest, the bootstrapping method will also choose a subset of features, so that not all decision trees are given the same features during training [3].</p>
<p>When training the random forest, we have to be careful to segment the data while preserving the ordering for validation [3]. The typical cross-validation method, k-fold cross-validation, would not be appropriate here because it randomizes the segmented data so the model might be presented with data from a later time point and asked to predict an earlier sequence [3]. There are two alternative approaches to performing evaluation with time series data. The first is to create multiple train-test splits, where each time the train data occurs before the test data [4]. The test data is the same size for each split, but the train data will increase in size as we split the data later and later in the series [4].</p>
<p>An alternative to having multiple train-test splits is to use the <strong>feed-forward</strong> or <strong>walk-forward</strong> validation method [3]. As before, we divide the data into train and test sets by splitting at a particular point in time, but this time the train data can be at least some minimum size or a fixed size, and each train-test split has a constant size and looks like a rolling window was applied to the time series [4]. The feed-forward method is more robust, but can also be more computationally expensive because the rolling window will only be shifted a limited number of steps forward each time, requiring many models to be built, trained and evaluated in the process [4].</p>
<p>Note that together, both these validation processes are known as <strong>backtesting</strong>, because we are checking the performance of the model on historical data [4].</p>
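<p>Both backtesting schemes can be sketched with scikit-learn's <code class="language-plaintext highlighter-rouge">TimeSeriesSplit</code>: by default the training window expands with each split, and setting <code class="language-plaintext highlighter-rouge">max_train_size</code> gives the fixed-size rolling window of walk-forward validation.</p>

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(12)  # stand-in for 12 ordered observations

# Expanding-window splits: training data always precedes the test data
tscv = TimeSeriesSplit(n_splits=3, test_size=2)
splits = list(tscv.split(series))
# Train sets: [0..5], [0..7], [0..9]; test sets: [6,7], [8,9], [10,11]

# Fixed-size rolling training window (walk-forward style)
walk = list(TimeSeriesSplit(n_splits=3, test_size=2, max_train_size=6).split(series))
```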
<h2 id="references">References</h2>
<p>[1] Brownlee, J. “Time Series Forecasting as Supervised Learning.” Machine Learning Mastery, 15 Aug 2020. <a href="https://machinelearningmastery.com/time-series-forecasting-supervised-learning/">https://machinelearningmastery.com/time-series-forecasting-supervised-learning/</a> Visited 17 Feb 2023.</p>
<p>[2] Brownlee, J. “How to Convert a Time Series to a Supervised Learning Problem in Python.” Machine Learning Mastery, 21 Aug 2019. <a href="https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/">https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/</a> Visited 17 Feb 2023.</p>
<p>[3] Brownlee, J. “Random Forest for Time Series Forecasting.” Machine Learning Mastery, 2 Nov 2020. <a href="https://machinelearningmastery.com/random-forest-for-time-series-forecasting/">https://machinelearningmastery.com/random-forest-for-time-series-forecasting/</a> Visited 17 Feb 2023.</p>
<p>[4] Brownlee, J. “How to Backtest Machine Learning Models for Time Series Forecasting.” Machine Learning Mastery, 28 Aug 2019. <a href="https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/">https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/</a> Visited 17 Feb 2023.</p>
<p>[5] Zulkifli, H. “Multivariate Time Series Forecasting Using Random Forest.” Towards Data Science, 31 Mar 2019. <a href="https://towardsdatascience.com/multivariate-time-series-forecasting-using-random-forest-2372f3ecbad1">https://towardsdatascience.com/multivariate-time-series-forecasting-using-random-forest-2372f3ecbad1</a> Visited 17 Feb 2023.</p>
<p>[6] Mwiti, D. “Random Forest Regression: When Does It Fail and Why?” Neptune.ai, 31 Jan 2023. <a href="https://neptune.ai/blog/random-forest-regression-when-does-it-fail-and-why">https://neptune.ai/blog/random-forest-regression-when-does-it-fail-and-why</a> Visited 17 Feb 2023.</p>A Brief Introduction to Time Series Forecasting2023-02-06T00:00:00+00:002023-02-06T00:00:00+00:00http://sassafras13.github.io/BasicsTimeSeries<p>In this post I want to introduce some of the concepts underpinning time series forecasting. I predict that this will be the first in a series of posts on this topic. In this post we will discuss some key concepts, discuss importing a dataset and imputing missing values.</p>
<p>Time series forecasting assumes that there is a <strong>data generating process</strong> (DGP), which is the underlying process producing the data we are analyzing. We never know the actual DGP completely; instead, we build models to try to capture it as accurately as possible [1,2].</p>
<p>One way to model time series data is with an autoregressive signal: a signal that depends linearly on its values at previous time steps, plus some stochastic term [1,3]. (This is different from a moving-average model, and an autoregressive process is not always stationary – for example, it may contain a unit root – which we’ll discuss in more detail in a future post [3].)</p>
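<p>A quick simulation (my own sketch, not from the book) makes the autoregressive definition concrete; the coefficient can then be recovered by regressing each value on the previous one:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, phi = 500, 0.8
eps = rng.normal(size=n)

# AR(1): each value is a linear function of the previous value plus noise
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + eps[t]

# Least-squares estimate of phi from the simulated data
phi_hat = np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1])
```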
<p>Most modeling approaches assume that the DGP is stationary, but this can often be broken in practice, either when the mean or the variance (heteroscedasticity) of the distribution changes over time [1].</p>
<h2 id="process-for-importing-a-dataset">Process for Importing a Dataset</h2>
<p>We talk here about importing a dataset to your work environment, cleaning missing values and other initial steps we take before performing an exploratory data analysis.</p>
<ol>
<li>
<p>Understand the features in the dataset. Read the documentation, understand the relationships between features in your dataset.</p>
</li>
<li>
<p>Check the data types of the features in the dataset. Convert dates to a datetime format if they are not already in that form. This allows you to obtain information related to the date/time using packages like Pandas [1] - not just day/month/year either, but day of the week, etc.</p>
</li>
<li>
<p>Use Pandas’ <code class="language-plaintext highlighter-rouge">describe()</code> and <code class="language-plaintext highlighter-rouge">info()</code> functions to see the basic statistics and a summary of every feature in your dataset. These will help you get a preliminary understanding of the dataset, and can also help you detect any missing values.</p>
</li>
<li>
<p>Find and fill in missing values. See more detailed discussion below.</p>
</li>
</ol>
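<p>Steps 2 and 3 might look like this in Pandas, using a small invented checkout dataset:</p>

```python
import pandas as pd

# Hypothetical daily library checkouts, with dates stored as strings
df = pd.DataFrame({
    "date": ["2023-01-02", "2023-01-03", "2023-01-04"],
    "checkouts": [12, 7, 9],
})

# Step 2: convert to datetime, then derive calendar features
df["date"] = pd.to_datetime(df["date"])
df["day_of_week"] = df["date"].dt.day_name()

# Step 3: basic statistics and a summary of every feature
summary = df["checkouts"].describe()
df.info()
```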
<h2 id="handling-missing-values">Handling Missing Values</h2>
<p>The first thing to ask is whether the values are really missing or not [1]. It might be that there is no data for a particular time period because there really was no data [1]. For example, if you are looking at data for books checked out of the library, not all books will be checked out all the time, so if there is no data for a particular book for a month, it might simply mean that no one was borrowing the book during that time.</p>
<p>Another thing to consider is whether or not there is a pattern in how and when the data is missing [1]. For example, suppose no books are checked out on Sundays because the library is closed that day. If we fill in those days with a 0, then a forecasting model that relies on patterns from previous days may predict that no books will be checked out on Monday either [1]. One way to address this is to make sure our model knows the day of the week each observation corresponds to, so it can learn this pattern as well - in which case filling in with zeros is acceptable [1].</p>
<p>Now if we assume that we have found missing values that we really do need to address, there are a number of ways to handle them [1]:</p>
<ul>
<li>
<p><strong>Forward fill</strong>: This method uses the last value that we have in the time series and uses it to fill all the missing values until we find the next value in our dataset.</p>
</li>
<li>
<p><strong>Backward fill</strong>: This is the same as above except we’re taking the next value after the missing data and using it to fill in the missing values.</p>
</li>
<li>
<p><strong>Mean value fill</strong>: Here we take the mean of the <strong>entire</strong> time series and use it to fill in the missing values.</p>
</li>
<li>
<p><strong>Linear interpolation</strong>: We fit a line to the values on either side of the missing data and use the line to interpolate the values of the missing points.</p>
</li>
<li>
<p><strong>Nearest interpolation</strong>: This method is a combination of forward and backward fill, where we choose the closest value that isn’t missing and use it to fill in the missing value.</p>
</li>
</ul>
<p>There are also a range of other interpolation techniques available in most data science libraries including fitting <strong>splines</strong> or <strong>polynomials</strong> to the entire dataset and using those fits to impute the missing data [1].</p>
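<p>The filling strategies above map directly onto Pandas one-liners; here is a sketch on a toy series with two missing values:</p>

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0])

ffilled = s.ffill()                        # forward fill: 10, 10, 10, 16, 18
bfilled = s.bfill()                        # backward fill: 10, 16, 16, 16, 18
mean_filled = s.fillna(s.mean())           # fill with the mean of the whole series
linear = s.interpolate(method="linear")    # 10, 12, 14, 16, 18
nearest = s.interpolate(method="nearest")  # closest non-missing neighbor
```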
<h2 id="references">References</h2>
<p>[1] Joseph, M. Modern Time Series Forecasting with Python : Explore Industry-Ready Time Series Forecasting Using Modern Machine Learning and Deep Learning. 1st ed. Birmingham: Packt Publishing, Limited, 2022.</p>
<p>[2] “Data generating process.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Data_generating_process">https://en.wikipedia.org/wiki/Data_generating_process</a> Visited 30 Jan 2023.</p>
<p>[3] “Autoregressive model.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Autoregressive_model">https://en.wikipedia.org/wiki/Autoregressive_model</a> Visited 30 Jan 2023.</p>Performing an EDA, with a Focus on Correlation, for Time Series Forecasting2023-02-06T00:00:00+00:002023-02-06T00:00:00+00:00http://sassafras13.github.io/EDA<p>In my previous post, I started learning about time series forecasting, and focused on some foundational concepts and techniques for preparing data. In this post, we are going to learn about several more key concepts (the components of a time series) and then discuss the exploratory data analysis (EDA) process, with a focus on calculating correlations between time series. Let’s get started.</p>
<h2 id="parts-of-a-time-series">Parts of a Time Series</h2>
<p>A time series can contain [1]:</p>
<ul>
<li>
<p><strong>Trends</strong>: Long-term changes (linear or otherwise) in the average value of a time series. For example, global warming shows up in temperature data as a long-term rise in the yearly average for many regions in the world.</p>
</li>
<li>
<p><strong>Seasonality</strong>: These are more regular variations over a period of time, like a peak in power consumption during warm months (for air conditioning) or increased rainfall during the rainy season.</p>
</li>
<li>
<p><strong>Cycles</strong>: Cyclic patterns are often conflated with seasonality but the difference is that a cyclic pattern is irregular, while seasonal patterns are regularly repeating behaviors. Manu cites economic recessions as being cyclic because they tend to repeat, but not in a regularly predictable pattern like changes in temperature over different seasons [1].</p>
</li>
<li>
<p><strong>Irregularities</strong>: This term refers to the remainder of a time series signal after the trends, seasonal and cyclic patterns have been accounted for. It typically refers to the stochastic component, and is also called the <strong>residual</strong> or error term.</p>
</li>
</ul>
<p>Often data scientists combine these components via addition or multiplication to build models of the underlying DGP.</p>
<h2 id="exploratory-data-analysis">Exploratory Data Analysis</h2>
<ol>
<li>
<p>Look for correlations between the features and the target. Use correlation heat maps for some key variables, or generate line plots (use smoothing if necessary) with variables you hypothesize are related to see if they exhibit similar patterns [1].</p>
</li>
<li>
<p>Plot some or all of the data, perhaps grouping by geographic region, time period, etc. Box plots can be a good choice here because they will indicate the interquartile range, outliers and the medians. Another good choice is a calendar heatmap, which again shows variation in a variable over time [1].</p>
</li>
<li>
<p>Examine the distribution of the target variable.</p>
</li>
<li>
<p>Look for trends, seasonality, and other cyclic patterns in the dataset.</p>
</li>
</ol>
<h2 id="decomposing-time-series-data">Decomposing Time Series Data</h2>
<p>As we discussed above, time series data can contain a series of patterns combined together, and it is possible to break the time series apart into these components. Manu Joseph recommends <strong>detrending</strong> the data first, then removing the seasonality, and leaving behind the residual signal at the end [1].</p>
<p>To detect trends in a time series, you can use something as simple as a moving average [1]. A moving average smooths out the data, often making it easier to see long-term trends [1]. However, moving averages usually leave behind a signal that is still somewhat noisy, which is not ideal - we want the trend and seasonality components that we extract to be very clean, with all the noise contained in the residual [1]. An alternative is locally estimated scatterplot smoothing (LOESS) regression, which fits a smooth curve to a noisy signal using a non-parametric approach (meaning, for example, that we do not assume normality) [1]. LOESS can be viewed as a more general version of the moving average, combining local regression models in a k-nearest-neighbors-style meta-model [2].</p>
<p>Assuming we have successfully extracted trends from the data, the next thing we need to do is to extract seasonal patterns. Since seasonal patterns are cyclic in nature, we will use a different set of tools to identify them, for example using <strong>period-adjusted averages</strong> or a <strong>Fourier series</strong> [1]. The period-adjusted average is a straightforward method where we compute the average of the series for one specific part of the cycle [1]. For example, if we are looking for seasonality in rainfall, we can imagine that the pattern repeats every year, and that the rainfall in one particular month should be similar from year to year [1]. The period-adjusted average method assumes this and would, in our example, compute the average rainfall for each month of the year over all the years of data that we have [1].</p>
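<p>The period-adjusted average amounts to a group-by over the position in the cycle; for monthly data with a yearly cycle, that is a group-by over calendar month (simulated rainfall shown here):</p>

```python
import numpy as np
import pandas as pd

# Three years of simulated monthly rainfall with a yearly cycle
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
rng = np.random.default_rng(1)
rain = pd.Series(
    50 + 30 * np.sin(2 * np.pi * idx.month / 12) + rng.normal(0, 5, 36),
    index=idx,
)

# Period-adjusted average: mean rainfall for each calendar month
monthly_avg = rain.groupby(rain.index.month).mean()
```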
<p>Instead of the period-adjusted average, we can also decompose seasonal patterns into a Fourier series. A Fourier series is a sum of sines and cosines, which are naturally repeating signals. We can fit the coefficients of a Fourier series (with a limited number of terms) to a seasonal pattern in the data [1].</p>
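<p>One way to fit such a series, sketched below on made-up data, is ordinary least squares over a design matrix of sine and cosine terms; the period and the number of harmonics are assumptions we would normally choose by inspecting the data:</p>

```python
# Fitting a truncated Fourier series to a seasonal pattern with least squares.
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(240)
period, n_terms = 12.0, 3
signal = 2 * np.sin(2 * np.pi * t / period) + 0.5 * np.cos(4 * np.pi * t / period)
signal = signal + rng.normal(0, 0.2, t.size)

# Design matrix: intercept plus sine/cosine pairs at the fundamental frequency
# and its first few harmonics
cols = [np.ones_like(t, dtype=float)]
for k in range(1, n_terms + 1):
    cols.append(np.sin(2 * np.pi * k * t / period))
    cols.append(np.cos(2 * np.pi * k * t / period))
X = np.column_stack(cols)

coeffs, *_ = np.linalg.lstsq(X, signal, rcond=None)
seasonal_fit = X @ coeffs  # smooth reconstruction of the seasonal component
```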
<p>There are several packages in Python that can perform decomposition for you. One popular tool is the <code class="language-plaintext highlighter-rouge">seasonal_decompose()</code> function from the <code class="language-plaintext highlighter-rouge">statsmodels.tsa.seasonal</code> library [3].</p>
<h2 id="correlation-in-time-series">Correlation in Time Series</h2>
<p>Finding relationships between different time series is a different challenge from finding correlation between features in a dataset. For a non-time series dataset, we can compute the correlation between features using a number of methods including the Pearson, Kendall or Spearman methods [4]. The <strong>Pearson correlation coefficient</strong> is a commonly used measure of correlation between two datasets, and it is specifically based on the assumption that the correlation is linear [5]. The Pearson correlation coefficient can have values between -1 and 1, where values near 0 indicate weak correlations and values near one of the extremes represent high correlation (either positive or negative) [5]. For a population, we can write the Pearson correlation coefficient between two datasets $X$ and $Y$ as [5]:</p>
\[\rho_{X, Y} = \frac{ \mathbb{E} \left[ (X - \mu_X) (Y - \mu_Y) \right] }{\sigma_X \sigma_Y}\]
<p>The <strong>Kendall rank correlation coefficient</strong> and <strong>Spearman’s rank correlation coefficient</strong> also measure the similarity between two datasets. Both of them measure the amount of rank correlation between two variables [6,7].</p>
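<p>All three coefficients are available through pandas’ <code class="language-plaintext highlighter-rouge">corr</code> method [4]. A small sketch on synthetic, linearly related data:</p>

```python
# Computing Pearson, Kendall, and Spearman correlations with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
x = rng.normal(size=500)
df = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(0, 0.5, 500)})

pearson = df["x"].corr(df["y"], method="pearson")
kendall = df["x"].corr(df["y"], method="kendall")
spearman = df["x"].corr(df["y"], method="spearman")
```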
<p>But for time series data, how can we take the temporal nature of the data into account? According to Cheong [8], we do not necessarily need different metrics: we can still use the Pearson correlation coefficient described above to measure the correlation between two time series. Cheong warns us that outliers can skew the results of the calculation, and that the Pearson correlation coefficient assumes the data is homoskedastic, but maintains that we can still use this metric [8]. Note that it does not provide information about whether one time series leads or lags another; it is just a global estimate of the correlation over all time [8]. It is also possible to compute the correlation coefficient over a local window and to roll the window along the time series to get a better sense of how the correlation itself varies with time [8].</p>
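<p>Both the global and the rolling-window versions of this calculation are one-liners in pandas; the window size of 30 below is an arbitrary choice for illustration:</p>

```python
# Global vs rolling-window Pearson correlation between two synthetic series.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 300
a = pd.Series(np.sin(np.linspace(0, 10, n)) + rng.normal(0, 0.2, n))
b = pd.Series(np.sin(np.linspace(0, 10, n)) + rng.normal(0, 0.2, n))

global_corr = a.corr(b)                      # one number for the whole period
rolling_corr = a.rolling(window=30).corr(b)  # how correlation varies over time
```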
<p>To figure out which signals lead and which lag, something a single Pearson coefficient cannot tell us, we can use cross correlation techniques [8]. For example, <strong>time lagged cross correlation</strong> (TLCC) still calculates the Pearson correlation, but does so repeatedly while shifting one signal relative to the other [8]. The lag at which we find the maximum correlation value (again, a global value for the entire time period) indicates whether one signal leads or lags the other [8]. If we wanted to know on a local level which signal is leading and how that changes over time, we can again use a windowed approach (WTLCC) [8]. The limitation of these methods is that they assume we should look for correlation either globally or over a fixed window size [8].</p>
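<p>A simple way to sketch TLCC is to shift one series against the other with pandas and recompute the Pearson correlation at each lag; the synthetic data and the lag range are assumptions for illustration:</p>

```python
# Time lagged cross correlation (TLCC): shift one series and recompute the
# Pearson correlation at each lag; the peak suggests lead/lag structure.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 400
leader = pd.Series(np.sin(np.linspace(0, 20, n)) + rng.normal(0, 0.1, n))
follower = leader.shift(5) + rng.normal(0, 0.1, n)  # lags the leader by 5 steps

lags = range(-20, 21)
tlcc = [leader.corr(follower.shift(-lag)) for lag in lags]
best_lag = list(lags)[int(np.argmax(tlcc))]  # expect a peak near lag 5
```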
<p>Another technique that we can use, and one which allows for comparing signals of different lengths, is called <strong>dynamic time warping</strong> (DTW) [8]. This method finds the minimum-distance alignment between two signals by computing the Euclidean distance between each time step of one signal and every time step of the other [8]. It is particularly useful when the two signals represent things that are happening at different speeds, like comparing speech or walking gaits [9]. To be more specific, DTW looks for an optimal matching between two sequences based on some rules [9]:</p>
<ol>
<li>Every index from the first sequence must be matched with one or more indices from the other sequence, and vice versa.</li>
<li>The first indices of each series need to match with each other, but these do not have to be the only matches they have. The same is true for the final indices of each sequence.</li>
<li>The mapping of the indices of the first series to the other must be monotonically increasing.</li>
</ol>
<p>There are multiple mappings that satisfy these rules, but the optimal one will also minimize some cost, like the total Euclidean distance [9]. The Python packages that implement DTW will often plot the two series in question and show where the matching occurred to indicate which parts of one series correlated the best with the second series [8].</p>
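<p>To make the algorithm concrete, here is a compact (unoptimized) dynamic-programming implementation of the DTW distance for 1-D sequences, following the matching rules above:</p>

```python
# A minimal dynamic-programming implementation of the DTW distance for 1-D
# sequences (a sketch, not an optimized production version).
import numpy as np

def dtw_distance(s, t):
    n, m = len(s), len(t)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])  # local distance in 1-D
            # Each step extends a monotonic matching: diagonal move = match,
            # horizontal/vertical moves = repeat an index of one sequence
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Sequences of different lengths tracing the same shape at different speeds
slow = [0, 0, 1, 1, 2, 2, 3, 3]
fast = [0, 1, 2, 3]
```

Because every value in the slow sequence can be matched exactly to a value in the fast one, the DTW distance between them is zero even though their lengths differ.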
<p>Finally, we can also use <strong>instantaneous phase synchrony</strong> to see when two oscillating signals are in phase [8]. This method does require that we filter the inputs so that they exhibit the primary wavelength we want to study [8]. This technique then uses the Hilbert transform to decompose the input signals into their phases and powers, and we compare the phases to see when the two signals are in or out of phase [8]. This method is similar to the rolling window correlation methods we saw earlier, but does not require the user to choose a window size [8].</p>
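<p>A sketch of the phase comparison using SciPy’s Hilbert transform is below; the synchrony formula is the one used in [8], and the clean synthetic sine waves stand in for real signals that would normally be band-pass filtered first:</p>

```python
# Instantaneous phase synchrony via the Hilbert transform.
import numpy as np
from scipy.signal import hilbert

t = np.linspace(0, 10, 1000)
sig1 = np.sin(2 * np.pi * 1.0 * t)
sig2 = np.sin(2 * np.pi * 1.0 * t + 0.5)  # same frequency, constant phase offset

# The analytic signal's angle gives the instantaneous phase at each time step
phase1 = np.angle(hilbert(sig1))
phase2 = np.angle(hilbert(sig2))

# Phase synchrony in [0, 1]: 1 means the signals are perfectly phase-locked
synchrony = 1 - np.sin(np.abs(phase1 - phase2) / 2)
```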
<h2 id="references">References</h2>
<p>[1] Joseph, M. Modern Time Series Forecasting with Python : Explore Industry-Ready Time Series Forecasting Using Modern Machine Learning and Deep Learning. 1st ed. Birmingham: Packt Publishing, Limited, 2022.</p>
<p>[2] “Local regression.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Local_regression">https://en.wikipedia.org/wiki/Local_regression</a> Visited 8 Feb 2023.</p>
<p>[3] “statsmodels.tsa.seasonal.seasonal_decompose.” statsmodels 0.13.5 02 Nov 2022. <a href="https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html">https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html</a> Visited 8 Feb 2023.</p>
<p>[4] “pandas.DataFrame.corr” pandas 1.5.3. <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html">https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html</a> Visited 17 Feb 2023.</p>
<p>[5] “Pearson correlation coefficient.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient">https://en.wikipedia.org/wiki/Pearson_correlation_coefficient</a> Visited 17 Feb 2023.</p>
<p>[6] “Kendall rank correlation coefficient.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient">https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient</a> Visited 17 Feb 2023.</p>
<p>[7] “Spearman’s rank correlation coefficient.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient">https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient</a> Visited 17 Feb 2023.</p>
<p>[8] Cheong, J. “Four ways to quantify synchrony between time series data.” Towards Data Science. 13 May 2019. <a href="https://towardsdatascience.com/four-ways-to-quantify-synchrony-between-time-series-data-b99136c4a9c9">https://towardsdatascience.com/four-ways-to-quantify-synchrony-between-time-series-data-b99136c4a9c9</a> Visited 17 Feb 2023.</p>
<p>[9] “Dynamic time warping.” Wikipedia. <a href="https://en.wikipedia.org/wiki/Dynamic_time_warping">https://en.wikipedia.org/wiki/Dynamic_time_warping</a> Visited 17 Feb 2023.</p>In my previous post, I started learning about time series forecasting, and focused on some foundational concepts and techniques for preparing data. In this post, we are going to learn about several more key concepts (the components of a time series) and then discuss the exploratory data analysis (EDA) process, with a focus on calculating correlations between time series. Let’s get started.A Framework for Solving Dynamic Programming Problems2023-01-28T00:00:00+00:002023-01-28T00:00:00+00:00http://sassafras13.github.io/SolvingDPProblems<p>In this post we’re going to talk about strategies for solving dynamic programming problems. I have written about <a href="https://sassafras13.github.io/DynamicProgramming/">dynamic programming</a> and <a href="https://sassafras13.github.io/recursion/">recursion</a> before, but now I want to focus specifically on how to frame problems as dynamic programming problems, and develop solutions to them. Let’s get started!</p>
<p>I’ll begin by outlining a framework for solving any dynamic programming problem presented by Otasevic in [1], and then we’ll go through each step in more detail. The steps are [1]:</p>
<ol>
<li>Determine if the problem can be solved with dynamic programming</li>
<li>Identify the variables in the problem</li>
<li>Define the recurrence relation</li>
<li>Identify the base case(s)</li>
<li>Choose an iterative or recursive solution</li>
<li>Add memoization</li>
<li>Determine the time complexity</li>
</ol>
<h2 id="1-determine-if-the-problem-can-be-solved-with-dynamic-programming">1. Determine if the problem can be solved with dynamic programming</h2>
<p>The key characteristic of a problem that can be solved with dynamic programming is that it should be possible to split the problem up into smaller subproblems [1]. In particular, the Principle of Optimality should apply - that is, the optimal solution to the overall problem must be built from optimal solutions to its subproblems. Often problems that ask you to maximize or minimize a quantity, or to count the number of possible arrangements, are dynamic programming problems [2].</p>
<h2 id="2-identify-the-variables-in-the-problem">2. Identify the variables in the problem</h2>
<p>In this step, we need to identify which variables are changing as we work through the subproblems [1]. One way to figure this out is to consider a couple of the subproblems and compare them, and see which variables changed from one subproblem to the next [1]. This is also known as defining the <strong>state</strong> in the problem [2]. This exercise will also help us define the subproblem and it also leads us to our next step…</p>
<h2 id="3-define-the-recurrence-relation">3. Define the recurrence relation</h2>
<p>The recurrence relation expresses how the subproblems we found above relate to each other [1]. The relation lays out all the possible next steps you can take from your current state, i.e. all the changes you can make to the variable(s) in your problem given where you are now [1]. Another way to think about the recurrence relation is to see that it is the definition of the <strong>state transition</strong> [2]. You can do this in words, or use some math to explain how each variable could be changed [1].</p>
<h2 id="4-identify-the-base-cases">4. Identify the base case(s)</h2>
<p>A base case is a subproblem that you can’t divide up into any smaller subproblems, and that doesn’t depend on any other subproblems for a solution [1]. It often can look like the solution to when your variables are equal to 0 or 1, for example. A good way to think about the base cases is to ask what the constraints are on the variables you identified in step 2 [1]. For example, if the variable must at minimum be equal to 0, what is the solution to the subproblem in that case?</p>
<h2 id="5-choose-an-iterative-or-recursive-solution">5. Choose an iterative or recursive solution</h2>
<p>Although we did talk about a recurrence relation in step 3, you don’t have to use recursion to solve this problem - you could also iterate over the subproblems in order [1]. Another way to think about the different approaches is to ask if you want to solve the problem top-down (using a memoization approach to save the solutions to the subproblems) or bottom-up (saving the solutions in tabular form) [3]. Recursive solutions often correspond to top-down approaches, while iterative solutions correspond to bottom-up methods [3]. Often there will be a way to solve the problem using either method, and in an interview setting either solution will likely do. However, Otasevic points out that in the real world, it is possible for a recursive solution to lead to a stack overflow, so it is good practice to point out to the interviewer that you are aware of this danger [1].</p>
<h2 id="6-add-memoization">6. Add memoization</h2>
<p>We talked about this in my previous blog posts - memoization is a way of saving solutions to subproblems so that if the answer is needed again, you can look it up instead of repeating the computation. Often memoization can really help speed up your algorithm implementation - in my own experience, adding memoization usually makes the difference between failing and passing a Leetcode problem. In Python it can be tempting to just use the <code class="language-plaintext highlighter-rouge">lru_cache</code> decorator to automatically include memoization, but I try to manually include it to teach myself how to implement it on a variety of problems. Otasevic suggests that you memoize anything you are about to return at the end of a function call, and to always check for the memoized solution before performing the computation [1].</p>
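<p>To make the two options concrete, here is a toy Fibonacci function memoized both ways; the function names are mine, not from [1]:</p>

```python
# Memoization two ways: the lru_cache decorator vs a manual dictionary.
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_cached(n):
    # The decorator transparently caches every (argument -> result) pair
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)

def fib_manual(n, memo=None):
    if memo is None:
        memo = {}
    # Always check for a memoized solution before doing any computation
    if n in memo:
        return memo[n]
    result = n if n < 2 else fib_manual(n - 1, memo) + fib_manual(n - 2, memo)
    memo[n] = result  # memoize what we are about to return
    return result
```

Either version turns the naive exponential-time recursion into a linear-time computation, since each subproblem is solved only once.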
<h2 id="7-determine-the-time-complexity">7. Determine the time complexity</h2>
<p>Okay, to be honest with you, I am usually very lazy and do not do this when I am Leetcoding, but you probably will want to do it in an actual interview context. Of course it is really important to know the time complexity of your algorithm because that will inform how well it will perform in the real world, and help you to look for other ways to speed up your algorithm. Otasevic recommends computing the time complexity using the following method [1]:</p>
<ol>
<li>Count the number of states, which usually depends on the number of variables you identified</li>
<li>Determine the work done in each state</li>
</ol>
<p>Now that we have seen a framework for solving dynamic programming problems, let’s apply it to a classic problem, the Knapsack Problem.</p>
<h2 id="the-knapsack-problem">The Knapsack Problem</h2>
<p>The Knapsack Problem is a well-known problem that can be solved using dynamic programming. In the problem, you are given a knapsack that has the capacity to carry up to <code class="language-plaintext highlighter-rouge">W</code> units of weight. From a set of <code class="language-plaintext highlighter-rouge">n</code> items, each of which has a value <code class="language-plaintext highlighter-rouge">v</code> and a weight <code class="language-plaintext highlighter-rouge">w</code>, you need to find the subset that maximizes the total value of the objects in the knapsack while not exceeding its weight capacity [4].</p>
<p>There are many variants of the problem - versions where you can break the objects, or where volume is also an issue, etc. - but the one we will consider here is called the 0-1 variant, where you can either put the entire item in the knapsack or you omit it, but you cannot break it up [4]. Now that we’ve defined the problem, let’s work through our framework. (While I am making my own notes here, I did refer to several solutions presented in [4] to help me fill out the framework.)</p>
<h3 id="1-can-the-problem-be-solved-with-dynamic-programming">1. Can the problem be solved with dynamic programming?</h3>
<p>Okay, so I guess I’ve given this one away a little bit, but we can also see that this problem could be broken into subproblems where I have to determine if each object should be added or not. You could also note that it is an optimization problem, which makes it a more likely candidate for a dynamic programming approach.</p>
<h3 id="2-what-are-the-problem-variables">2. What are the problem variables?</h3>
<p>As given in the problem description, there are 2 variables that we are concerned with here: <strong>weight</strong> and <strong>value</strong>. Between subproblems, what changes is which object we are currently considering and how much weight capacity remains. Any time we consider an object, we will know its weight and value, and we will need to decide if adding it will maximize the total value of our knapsack without exceeding its weight capacity.</p>
<h3 id="3-what-is-the-recurrence-relation">3. What is the recurrence relation?</h3>
<p>Since this is the 0-1 variant, I know that I only have two actions I can take for each object: I can either add it to the knapsack, or discard it. What kind of criteria would I use to make my choice? I want to add the object if it maximizes my total value.</p>
<h3 id="4-what-is-the-base-case">4. What is the base case?</h3>
<p>As we saw in step 2, I have two variables, weight and value. I am keeping track of the total weight and total value of my knapsack. The total weight is allowed to be in the range <code class="language-plaintext highlighter-rouge">[0, W]</code> where <code class="language-plaintext highlighter-rouge">W</code> is the capacity of the knapsack. The total value of my knapsack can range from <code class="language-plaintext highlighter-rouge">[0, infinity]</code>. I should also note that I am only allowed to add each object once, so if I have already added all the objects then I can’t add any more. This leads me to my base cases:</p>
<ul>
<li>If my total weight exceeds the knapsack capacity, then I cannot add the object and my additional value is 0.</li>
<li>If there are no objects left, then I cannot add any more, and again my additional value is 0.</li>
</ul>
<h3 id="5-should-i-use-an-iterative-or-recursive-approach">5. Should I use an iterative or recursive approach?</h3>
<p>Looking back at step 3, I said that I would add an object to the knapsack if it maximized the total value. But I don’t know if adding that object will maximize the total value unless I know that it is part of the optimal set of objects and that every subsequent object I add is also part of this set. But if I frame my solution as a recursive function, then each level in the recursion will only add the object in question if it <em>does</em> maximize the total value optimally. So let’s use a recursive approach here.</p>
<h3 id="6-how-to-memoize-my-solution">6. How to memoize my solution?</h3>
<p>Since my goal is to maximize the total value, I can store the computed total value of every subproblem in order to speed up my algorithm. More specifically, I will want to store the total value of adding each object at every step of the algorithm - this is 2 axes, so I will want a 2D table for storing my total values, where <code class="language-plaintext highlighter-rouge">table[index][weight]</code> stores the value of all the objects in the knapsack at the step <code class="language-plaintext highlighter-rouge">index</code> and for the total weight <code class="language-plaintext highlighter-rouge">weight</code> at that step.</p>
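<p>Putting steps 3 through 6 together, a recursive, memoized solution might be sketched as follows (this is my own illustration, using a dictionary keyed by <code class="language-plaintext highlighter-rouge">(index, remaining)</code> in place of the 2D table):</p>

```python
# A recursive, memoized 0-1 knapsack following the framework above.
def knapsack(values, weights, capacity):
    memo = {}  # caches the best value for each (index, remaining) subproblem

    def best(i, remaining):
        # Base cases: no objects left, or no capacity left
        if i == len(values) or remaining == 0:
            return 0
        # Check for a memoized solution before computing
        if (i, remaining) in memo:
            return memo[(i, remaining)]
        # Recurrence: either skip object i...
        result = best(i + 1, remaining)
        # ...or take it, if it fits in the remaining capacity
        if weights[i] <= remaining:
            result = max(result, values[i] + best(i + 1, remaining - weights[i]))
        memo[(i, remaining)] = result
        return result

    return best(0, capacity)
```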
<h3 id="7-what-is-the-time-complexity">7. What is the time complexity?</h3>
<p>Following the guidance Otasevic gave us, let’s first count the number of subproblems. This is easy in this case, because we are filling out the table of size <code class="language-plaintext highlighter-rouge">index</code> x <code class="language-plaintext highlighter-rouge">weight</code>, where <code class="language-plaintext highlighter-rouge">index</code> can have a maximum value equal to the number of objects, <code class="language-plaintext highlighter-rouge">n</code>, and the maximum weight is equal to the knapsack’s capacity, <code class="language-plaintext highlighter-rouge">W</code>.</p>
<p>How much work do we have to do at each step? We just do one computation - we select the maximum value computed with or without adding the next object. This is constant time, so in total our algorithm is $\mathcal{O}(n \cdot W)$ time.</p>
<p>With that, I will close this blog post and keep working on solving dynamic programming problems! I might return in the future with a selection of solutions that are interesting to explore in more detail. Thanks for reading.</p>
<h2 id="references">References</h2>
<p>[1] Otasevic, N. “Follow these steps to solve any Dynamic Programming interview problem.” Free Code Camp, 6 Jun 2018. <a href="https://www.freecodecamp.org/news/follow-these-steps-to-solve-any-dynamic-programming-interview-problem-cc98e508cd0e/">https://www.freecodecamp.org/news/follow-these-steps-to-solve-any-dynamic-programming-interview-problem-cc98e508cd0e/</a> Visited 28 Jan 2023.</p>
<p>[2] Kumar, N. “How to solve a Dynamic Programming Problem?” GeeksforGeeks, 10 Jan 2023. <a href="https://www.geeksforgeeks.org/solve-dynamic-programming-problem/">https://www.geeksforgeeks.org/solve-dynamic-programming-problem/</a> Visited 28 Jan 2023.</p>
<p>[3] Gavis-Hughson, S. “How to Solve Any Dynamic Programming Problem.” Learn to Code with Me, 12 Apr 2020. <a href="https://learntocodewith.me/posts/dynamic-programming/">https://learntocodewith.me/posts/dynamic-programming/</a> Visited 28 Jan 2023.</p>
<p>[4] Thelin, R. “Demystifying the 0-1 knapsack problem: top solutions explained.” Educative, 8 Oct 2020. <a href="https://www.educative.io/blog/0-1-knapsack-problem-dynamic-solution">https://www.educative.io/blog/0-1-knapsack-problem-dynamic-solution</a> Visited 28 Jan 2023.</p>In this post we’re going to talk about strategies for solving dynamic programming problems. I have written about dynamic programming and recursion before, but now I want to focus specifically on how to frame problems as dynamic programming problems, and develop solutions to them. Let’s get started!