20  Learning from Data

Yun-Tien Lee

“In God we trust; all others bring data.” — W. Edwards Deming

“The goal is to turn data into information, and information into insight.” — Carly Fiorina

20.1 Chapter Overview

We will touch on how to use data to inform a model: fitting parameters, forecasting, and the fundamental limitations of prediction.

20.2 How to learn from data

20.2.1 Understand the problem and define goals

  • Clarify objectives: Determine what we want to achieve with the data (e.g., prediction, classification, clustering, or insight extraction).
  • Identify key metrics: Decide how success will be measured (accuracy, RMSE, precision, etc.).
  • Know the context: Understand the domain and the business problem being addressed, since both shape the data analysis process.

20.2.2 Collect data

Various data may be available in different formats.

  • Ensure data relevance: The data should be relevant to the problem.
  • Consider data quality: Collect data with high accuracy, completeness, and consistency.

20.2.3 Explore and preprocess the data

This involves data cleaning and preparation to ensure the dataset is suitable for analysis; a short Julia sketch follows the list below.

  • Handle missing data: We could impute missing values (mean, median, or KNN imputation), or drop rows/columns with excessive missing data.
  • Deal with outliers: Use statistical techniques (e.g., z-scores) to detect and remove or cap extreme values.
  • Feature scaling: Apply normalization or standardization to ensure features are on comparable scales (important for algorithms like SVM, K-means, etc.).
  • Encode categorical data: Use techniques such as: one-hot encoding for nominal data, or label encoding or ordinal encoding for ordered categories.
  • Data visualization: Use tools like Makie.jl to visualize distributions, correlations, and missing values.
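
As an illustration, here is a minimal preprocessing sketch using DataFrames.jl and Statistics; the table df and its columns are hypothetical stand-ins rather than data used elsewhere in this chapter.

using DataFrames, Statistics

df = DataFrame(x=[1.0, 2.0, missing, 4.0, 100.0],
               city=["A", "B", "A", "C", "B"])

# Impute missing values with the column mean.
df.x = coalesce.(df.x, mean(skipmissing(df.x)))

# Cap outliers at three standard deviations from the mean.
lo, hi = mean(df.x) - 3std(df.x), mean(df.x) + 3std(df.x)
df.x = clamp.(df.x, lo, hi)

# Standardize to zero mean and unit variance.
df.x = (df.x .- mean(df.x)) ./ std(df.x)

# One-hot encode the nominal column.
for c in unique(df.city)
    df[!, Symbol("city_", c)] = df.city .== c
end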

20.2.4 Exploratory data analysis (EDA)

EDA helps discover patterns, relationships, and insights within the data. One can perform the following analyses, among others (a quick sketch follows the list):

  • Summary statistics: Check mean, variance, skewness, and correlations between variables.
  • Visualize relationships: Use histograms, scatter plots, box plots, and heatmaps to identify trends and correlations.
  • Detect multicollinearity: Check correlations between independent variables (e.g., Pearson’s correlation matrix).
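
For instance, a quick pass over a synthetic table might look like the following sketch; the near-collinear column z is constructed deliberately to show what multicollinearity looks like in the correlation matrix.

using DataFrames, Statistics, StatsBase, Random

Random.seed!(1)
df = DataFrame(x=randn(100), y=randn(100))
df.z = 0.9 .* df.x .+ 0.1 .* randn(100)  # nearly collinear with x by construction

describe(df)     # per-column summary statistics (mean, min, max, ...)
skewness(df.x)   # distribution shape beyond mean and variance
cor(Matrix(df))  # Pearson correlation matrix; entries near ±1 between
                 # predictors signal multicollinearity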

20.2.5 Select and engineer features

Feature selection and engineering help improve model performance by focusing on the most relevant information.
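
As a small illustration, the sketch below derives new features from raw columns; the column names are hypothetical.

using DataFrames

df = DataFrame(price=[10.0, 200.0, 3000.0], qty=[1, 2, 3])

df.log_price = log.(df.price)    # compress a heavily skewed feature
df.revenue = df.price .* df.qty  # derived (interaction) feature
select!(df, Not(:price))         # drop the raw column if it is now redundant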

20.2.6 Choose the right algorithm or model

Depending on our problem type, choose appropriate algorithms for learning from the data:

  • Supervised Learning (with labeled data):
    • Classification: Logistic regression, SVM, decision trees, random forests, or neural networks.
    • Regression: Linear regression, ridge regression, or gradient boosting.
  • Unsupervised Learning (without labeled data):
    • Clustering: K-means, DBSCAN, hierarchical clustering.
    • Dimensionality Reduction: PCA, t-SNE, or UMAP.
  • Reinforcement Learning: Learn from interactions with an environment (e.g., Q-learning, Deep Q-Networks).

20.2.7 Train and evaluate the model

  • Split the data: Use either a train-test split (e.g., 80/20 or 70/30) or cross-validation (e.g., k-fold cross-validation).
  • Fit the model: Train the model on the training set.
  • Evaluate the model: Use evaluation metrics appropriate to the task (a minimal sketch follows this list).
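
The following sketch shows an 80/20 train-test split with RMSE evaluation on synthetic data, using ordinary least squares via the backslash operator rather than any particular modeling package.

using Random, Statistics

Random.seed!(1)
n = 200
X = hcat(ones(n), randn(n))            # intercept plus one feature
y = X * [2.0, 3.0] .+ 0.5 .* randn(n)  # known coefficients plus noise

idx = shuffle(1:n)
train, test = idx[1:160], idx[161:end]  # 80/20 split

β = X[train, :] \ y[train]              # fit on the training set only
ŷ = X[test, :] * β                      # predict on the held-out set
rmse = sqrt(mean((y[test] .- ŷ) .^ 2))  # regression evaluation metric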

20.2.8 Tune hyperparameters

Hyperparameters control how models learn. One can use techniques like the following to tune hyperparameters (a grid-search sketch follows the list):

  • Grid search: Test a range of hyperparameter values.
  • Random search: Randomly explore combinations of hyperparameters.
  • Bayesian optimization: Use probabilistic models to guide hyperparameter search.
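
Here is a minimal grid-search sketch over the penalty λ of a ridge regression; the synthetic data, the λ grid, and the single validation split are all illustrative assumptions.

using Random, Statistics, LinearAlgebra

Random.seed!(1)
n = 200
X = hcat(ones(n), randn(n), randn(n))
y = X * [1.0, 2.0, 0.0] .+ 0.5 .* randn(n)
idx = shuffle(1:n)
tr, val = idx[1:160], idx[161:end]

# Validation RMSE of ridge regression with penalty λ.
function ridge_rmse(λ)
    β = (X[tr, :]' * X[tr, :] + λ * I) \ (X[tr, :]' * y[tr])
    sqrt(mean((y[val] .- X[val, :] * β) .^ 2))
end

λs = [0.0, 0.01, 0.1, 1.0, 10.0]  # the grid to search over
best_λ = λs[argmin(ridge_rmse.(λs))]

Random search and Bayesian optimization would replace the fixed grid with random draws or a probabilistic surrogate model, respectively.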

20.2.9 Deploy and Monitor the Model

Once the model performs well, deploy it to make predictions on new data.

  • Model deployment platforms: Use tools like Flask, FastAPI, or MLOps platforms.
  • Monitor performance: Continuously monitor metrics to detect concept drift or performance degradation (a toy drift check is sketched after this list).
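
As a toy illustration of monitoring, the sketch below flags possible drift when the recent mean error rises well above a historical baseline; both error series and the threshold are stand-ins.

using Statistics, Random

Random.seed!(1)
baseline_err = abs.(randn(500))      # stand-in for historical prediction errors
recent_err = abs.(randn(50) .+ 0.5)  # stand-in for errors on fresh data

# Flag drift when the recent mean error exceeds the baseline mean
# by more than two standard errors (an arbitrary threshold).
threshold = mean(baseline_err) + 2 * std(baseline_err) / sqrt(length(recent_err))
drift = mean(recent_err) > threshold
drift && @warn "Possible concept drift: recent error is elevated."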

20.2.10 Draw Insights and Make Decisions

Finally, interpret the results and use insights to make decisions or recommendations. Effective communication of findings is essential, especially for stakeholders.

  • Visualization: Use dashboards or reports to communicate findings.
  • Interpretability: Use explainable AI (e.g., SHAP values) to make model predictions transparent.

20.2.11 Limitations

However, there are certain fundamental limitations:

  • There is often inherent uncertainty and noise in the data itself.
  • Every model has its own assumptions and simplifications.
  • There may be non-stationarity in the data, especially in financial data. Non-stationary processes change over time, meaning that patterns learned from past data may no longer be valid in the future.
  • Models may overfit or underfit. Overfitting occurs when a model is too complex and captures noise instead of the underlying pattern, leading to poor generalization to new data. Underfitting occurs when the model is too simple to capture the relevant structure in the data. The sketch after this list illustrates both effects.
  • In high-dimensional spaces, data becomes sparse and meaningful patterns are harder to identify (the curse of dimensionality).
  • Some predictions may be limited by ethical concerns (e.g., predicting criminal behavior) or legal restrictions (e.g., privacy laws that limit data collection).
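
To make over- and underfitting concrete, the sketch below fits polynomials of increasing degree to noisy data and compares training error with error on a held-out set; the degrees and noise level are arbitrary choices.

using Random, Statistics

Random.seed!(2)
x = collect(range(0, 1; length=60))
y = sin.(2π .* x) .+ 0.2 .* randn(60)
tr, te = 1:2:60, 2:2:60  # interleaved train/test split

vander(x, d) = hcat((x .^ k for k in 0:d)...)  # polynomial design matrix

for d in (1, 3, 15)  # underfit, reasonable fit, overfit
    β = vander(x[tr], d) \ y[tr]
    err(idx) = sqrt(mean((y[idx] .- vander(x[idx], d) * β) .^ 2))
    println("degree $d: train RMSE = $(round(err(tr), digits=3)), ",
            "test RMSE = $(round(err(te), digits=3))")
end

Typically the degree-15 fit attains the lowest training error while its test error deteriorates, which is the signature of overfitting.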

20.3 Applications

20.3.1 Parameter fitting

Refer to Chapter 17 on Optimization for more details.

20.3.2 Forecasting

Forecasting is the process of making predictions about future events or outcomes based on historical data, patterns, and trends. It involves the use of statistical methods, machine learning models, or expert judgment to estimate future values in a time series or predict the likelihood of specific events. Forecasting is widely used in fields like economics, finance, meteorology, supply chain management, and business planning.

Here is an example of how to do time series forecasting in Julia, where marker sizes are scaled by the covariance of the predictions:

using CSV, DataFrames, CairoMakie, StateSpaceModels

# Load the classic airline passengers data set shipped with StateSpaceModels.jl
# and work on the log scale to stabilize the seasonal variance.
airp = CSV.read(StateSpaceModels.AIR_PASSENGERS, DataFrame)
log_air_passengers = log.(airp.passengers)
steps_existing = length(log_air_passengers)
steps_ahead = 30

# Fit a SARIMA(0, 1, 1)(0, 1, 1, 12) model (the classic "airline" model).
model_sarima = SARIMA(log_air_passengers; order=(0, 1, 1), seasonal_order=(0, 1, 1, 12))
fit!(model_sarima)
forec_sarima = forecast(model_sarima, steps_ahead)

# Plot the observed series, then the forecasts in red with marker sizes
# proportional to the forecast covariance (larger points mean more uncertainty).
f = Figure()
axis = Axis(f[1, 1], title="SARIMA")
scatter!(axis, 1:steps_existing, log_air_passengers)
scatter!(axis, steps_existing+1:steps_existing+steps_ahead,
         map(x -> x[1], forec_sarima.expected_value),
         color=:red, markersize=map(x -> x[1] * 1000, forec_sarima.covariance))
f