“In God we trust; all others bring data.” — W. Edwards Deming
“The goal is to turn data into information, and information into insight.” — Carly Fiorina
20.1 Chapter Overview
We will touch on how to use data to inform a model (fitting parameters and forecasting) and on the fundamental limitations of prediction.
20.2 How to learn from data
20.2.1 Understand the problem and define goals
Clarify objectives: Determine what you want to achieve with the data (e.g., prediction, classification, clustering, or insight extraction).
Identify key metrics: Determine how success will be measured (accuracy, RMSE, precision, etc.).
Know the context: Understand the domain and the business problem being addressed, as this shapes the data analysis process.
20.2.2 Collect data
Various data may be available in different formats.
Ensure data relevance: The data should be relevant to the problem.
Consider data quality: Collect data with high accuracy, completeness, and consistency.
20.2.3 Explore and preprocess the data
This involves data cleaning and preparation to ensure the dataset is suitable for analysis.
Handle missing data: Impute missing values (mean, median, or KNN imputation), or drop rows/columns with excessive missing data; a short sketch follows this list.
Deal with outliers: Use statistical techniques (e.g., z-scores) to detect and remove or cap extreme values.
Feature scaling: Apply normalization or standardization to ensure features are on comparable scales (important for algorithms like SVM, K-means, etc.).
Encode categorical data: Use techniques such as: one-hot encoding for nominal data, or label encoding or ordinal encoding for ordered categories.
Data visualization: Use tools like Makie.jl to visualize distributions, correlations, and missing values.
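As a minimal sketch of these preprocessing steps in Julia (the toy data, the 3-standard-deviation cutoff, and the choice of mean imputation are illustrative assumptions, not recommendations):

```julia
# Minimal preprocessing sketch: imputation, outlier capping, scaling, encoding.
using Statistics

x = [2.0, missing, 3.5, 120.0, 4.2, 3.8]          # toy feature with a missing value

# Mean imputation: replace `missing` with the mean of the observed values
x = coalesce.(x, mean(skipmissing(x)))

# Outliers via z-scores: flag and cap values more than 3 standard deviations out
μ, σ = mean(x), std(x)
outliers = abs.((x .- μ) ./ σ) .> 3
x = clamp.(x, μ - 3σ, μ + 3σ)

# Feature scaling: standardize to zero mean and unit variance
x_std = (x .- mean(x)) ./ std(x)

# One-hot encoding of a small categorical column
cats = ["red", "green", "red", "blue"]
levels = unique(cats)
onehot = [c == l for c in cats, l in levels]       # 4×3 Matrix{Bool}
```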
20.2.4 Exploratory data analysis (EDA)
EDA helps discover patterns, relationships, and insights within the data. Typical analyses include, but are not limited to, the following:
Summary statistics: Check the mean, variance, skewness, and correlations between variables; a short sketch follows this list.
Visualize relationships: Use histograms, scatter plots, box plots, and heatmaps to identify trends and correlations.
Reinforcement Learning: Learn from interactions with an environment (e.g., Q-learning, Deep Q-Networks).
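As a minimal EDA sketch in Julia (the simulated vectors and the use of CairoMakie for plotting are illustrative assumptions):

```julia
# Quick EDA: summary statistics plus a histogram and a scatter plot.
using Statistics, StatsBase, CairoMakie

x = randn(500)
y = 2 .* x .+ 0.5 .* randn(500)

println("mean = ", mean(x), ", var = ", var(x), ", skewness = ", skewness(x))
println("cor(x, y) = ", cor(x, y))

fig = Figure()
hist(fig[1, 1], x, bins = 30)                      # distribution of x
scatter(fig[1, 2], x, y, markersize = 4)           # relationship between x and y
fig
```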
20.2.7 Train and evaluate the model
Split the data: Use either a train-test split (e.g., 80/20 or 70/30) or cross-validation (e.g., k-fold cross-validation); a short sketch follows this list.
Fit the model: Train the model on the training set.
Evaluate the model: Use evaluation metrics appropriate to the task.
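As a minimal sketch of a split, fit, and evaluation in Julia (the simulated data and the plain least-squares model are illustrative assumptions):

```julia
# Train-test split, least-squares fit, and RMSE evaluation on toy data.
using Random, Statistics

Random.seed!(42)
n = 200
X = [ones(n) randn(n, 2)]                 # design matrix with an intercept column
y = X * [1.0, 2.0, -0.5] .+ 0.3 .* randn(n)

# 80/20 split on shuffled indices
idx = shuffle(1:n)
ntrain = round(Int, 0.8n)
train, test = idx[1:ntrain], idx[ntrain+1:end]

# Fit on the training set, evaluate on the held-out test set
β_hat = X[train, :] \ y[train]
rmse = sqrt(mean((X[test, :] * β_hat .- y[test]) .^ 2))
println("test RMSE = ", rmse)
```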
20.2.8 Tune hyperparameters
Hyperparameters control how models learn. One can use techniques like the following to tune hyperparameters:
Grid search: Test a grid of candidate hyperparameter values; a short sketch follows this list.
Random search: Randomly explore combinations of hyperparameters.
Bayesian optimization: Use probabilistic models to guide hyperparameter search.
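As a minimal grid-search sketch in Julia (the simulated data, the ridge-regression model, and the candidate grid are illustrative assumptions):

```julia
# Grid search over the ridge penalty λ, scored on a held-out validation split.
using Random, Statistics, LinearAlgebra

Random.seed!(0)
n, p = 150, 5
X = randn(n, p)
y = X * [1.5, 0.0, -2.0, 0.0, 0.7] .+ 0.5 .* randn(n)

idx = shuffle(1:n)
train, val = idx[1:100], idx[101:end]

ridge(X, y, λ) = (X'X + λ * I) \ (X'y)            # closed-form ridge solution
rmse(β, X, y) = sqrt(mean((X * β .- y) .^ 2))

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [rmse(ridge(X[train, :], y[train], λ), X[val, :], y[val]) for λ in grid]
best_λ = grid[argmin(scores)]
println("best λ = ", best_λ, " with validation RMSE = ", minimum(scores))
```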
20.2.9 Deploy and Monitor the Model
Once the model performs well, deploy it to make predictions on new data.
Model deployment platforms: Use tools like Flask, FastAPI, or MLOps platforms.
Monitor performance: Continuously monitor metrics to detect concept drift or performance degradation.
20.2.10 Draw Insights and Make Decisions
Finally, interpret the results and use the insights to make decisions or recommendations. Effective communication of findings, especially to stakeholders, is essential.
Visualization: Use dashboards or reports to communicate findings.
Interpretability: Use explainable AI (e.g., SHAP values) to make model predictions transparent.
20.2.11 Limitations
However, there are certain fundamental limitations:
There may often be inherent uncertainty and noise in the data itself.
Every model has its own assumptions and simplifications.
There may be non-stationarity in the data, especially in financial data. Non-stationary processes change over time, meaning that patterns learned from past data may no longer be valid in the future.
Models may be overfitting or underfitting. Overfitting occurs when a model is too complex and captures noise instead of the underlying pattern, leading to poor generalization to new data. Underfitting occurs when the model is too simple to capture the relevant structure in the data. A small illustration follows this list.
In high-dimensional spaces, data becomes sparse and meaningful patterns are harder to identify (the curse of dimensionality).
Some predictions may be limited by ethical concerns (e.g., predicting criminal behavior) or legal restrictions (e.g., privacy laws that limit data collection).
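As a small illustration of underfitting versus overfitting (the data-generating function, noise level, and polynomial degrees are illustrative assumptions):

```julia
# Fit polynomials of different degrees to noisy data; compare train and test RMSE.
using Random, Statistics

Random.seed!(7)
f(x) = sin(2π * x)
xtrain, xtest = rand(30), rand(200)
ytrain = f.(xtrain) .+ 0.2 .* randn(30)
ytest  = f.(xtest)  .+ 0.2 .* randn(200)

vander(x, d) = [xi^k for xi in x, k in 0:d]       # polynomial design matrix

for d in (1, 4, 15)                               # too simple, reasonable, too flexible
    β = vander(xtrain, d) \ ytrain
    train_rmse = sqrt(mean((vander(xtrain, d) * β .- ytrain) .^ 2))
    test_rmse  = sqrt(mean((vander(xtest, d) * β .- ytest) .^ 2))
    println("degree $d: train RMSE = $(round(train_rmse, digits = 3)), test RMSE = $(round(test_rmse, digits = 3))")
end
```

A low degree underfits (high error on both sets), while a very high degree fits the training points closely but typically generalizes worse to the test set.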
20.3 Applications
20.3.1 Parameter fitting
Refer to Chapter 17 on Optimization for more details.
20.3.2 Forecasting
Forecasting is the process of making predictions about future events or outcomes based on historical data, patterns, and trends. It involves the use of statistical methods, machine learning models, or expert judgment to estimate future values in a time series or predict the likelihood of specific events. Forecasting is widely used in fields like economics, finance, meteorology, supply chain management, and business planning.
Here is an example of how to do time-series forecasting in Julia, where point sizes show the variance of the predictions.
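As a minimal sketch (assuming an AR(1) model fitted by least squares to simulated data, with CairoMakie for plotting; the data, names, and model choice are illustrative assumptions):

```julia
# Fit an AR(1) model, forecast forward, and plot forecasts with marker sizes
# proportional to the growing predictive variance.
using Random, Statistics, CairoMakie

Random.seed!(1)

# Simulated AR(1) series y_t = c + ϕ*y_{t-1} + ε_t as stand-in data
n = 200
y = zeros(n)
for t in 2:n
    y[t] = 0.5 + 0.8 * y[t-1] + 0.3 * randn()
end

# Least-squares fit of c and ϕ from the lagged pairs (y_{t-1}, y_t)
A = [ones(n - 1) y[1:end-1]]
c_hat, ϕ_hat = A \ y[2:end]
σ2_hat = mean((y[2:end] .- A * [c_hat, ϕ_hat]) .^ 2)   # residual variance

# h-step-ahead point forecasts and their predictive variances
H = 20
yhat = zeros(H)
yhat[1] = c_hat + ϕ_hat * y[end]
for h in 2:H
    yhat[h] = c_hat + ϕ_hat * yhat[h-1]
end
v = [σ2_hat * sum(ϕ_hat^(2k) for k in 0:h-1) for h in 1:H]

# History in gray, forecasts in red; marker size grows with predictive variance
fig = Figure()
ax = Axis(fig[1, 1], xlabel = "t", ylabel = "y")
lines!(ax, 1:n, y, color = :gray)
scatter!(ax, n+1:n+H, yhat, markersize = 6 .+ 40 .* v, color = :tomato)
fig
```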