Predictive Analytics Engine — Complete Project

Dataset → 📂 Upload CSV

Data Points

Train / Test

Models Trained

Best Model

Best RMSE

Best R²

MAPE

✓ Z-score Normalization ✓ Lag Features (1,2,3,5) ✓ Rolling Mean/Std (3,5) ✓ Quadratic Trend Term ✓ 80/20 Temporal Split ✓ Residual CI (±1σ) ✓ 6 Models Compared ✓ Data Cross-Verified vs World Bank / BLS / GCP

01 — Historical Data & Train/Test Split

Full Historical Series · Training vs Test Partition

02 — Future Forecast with Confidence Intervals

Multi-Model Forecast · Best ML + ARIMA + Exp. Smoothing

Horizon:

Shaded band = ±1σ confidence interval · Dashed = ARIMA · Dotted = Exp. Smoothing

ARIMA(1,1,1) Forecast Detail

03 — Model Evaluation & Residual Analysis

Prediction vs Actual — Test Set (Best Model)

Residuals over Time

04 — Model Comparison & Leaderboard

RMSE Comparison (lower = better)

■ Best model ■ Other models · Values shown on bars

R² Score (higher = better)

■ ≥0.7 Good fit ■ 0.4–0.7 Moderate ■ 0–0.4 Weak ■ <0 Worse than mean

All Models — Full Accuracy Metrics

Rank	Model	R²	RMSE	MAE	MAPE %	Type

05 — Feature Importance & Error Distribution

Feature Importance — Random Forest (Permutation)

Prediction Error Distribution (Histogram)

X = error (Actual − Predicted) in dataset units · Y = number of test predictions in that error range · ■ green = underestimated ■ red = overestimated · Ideal: tallest bar near 0

06 — Rolling Statistics & Autocorrelation

Rolling Mean & ±1σ Band (window=10)

Autocorrelation Function (ACF) — Lags 1–15

Orange lines = 95% confidence bounds (±1.96/√n) · ● = statistically significant lag

07 — Time-Series Decomposition (Additive)

Trend · Seasonal · Residual Components

📈 Trend Component

🔄 Seasonal (period=10)

⚡ Residual

08 — Forecast Values — All Models

Numeric Forecast Table

Period	Best ML	CI Upper (+1σ)	CI Lower (-1σ)	ARIMA(1,1,1)	Exp. Smoothing

09 — Upload Your Own Dataset

Upload a CSV — Run the Full ML Pipeline on Your Data

📂

Drop your CSV here or click to browse

Required format: date, value — one row per time period

Minimum 10 rows · Header row auto-detected · Any time period works

📋 Example CSV Format (shortened; an upload needs at least 10 rows)

date,value
2015,42350.5
2016,45120.8
2017,48930.2
2018,51240.1
2019,49870.6
2020,38420.3
2021,55780.9
2022,61340.4
2023,63200.7

What Happens When You Upload?

📥

STEP 1 — Parse Your CSV

Your file is read in the browser and is never sent anywhere. Dates and values are extracted from the two columns. The header row is skipped automatically if it is non-numeric.
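A minimal sketch of this parsing step, written in Python for illustration. The function name `parse_series` and the rule "the first row is a header if its value column is not a number" are assumptions, not the app's confirmed logic.

```python
import csv
import io


def parse_series(text, min_rows=10):
    """Parse 'date,value' CSV text into (dates, values).

    Assumption: the first row is treated as a header when its second
    column cannot be parsed as a number.
    """
    # Blank lines and rows with fewer than two columns are ignored.
    rows = [r for r in csv.reader(io.StringIO(text)) if len(r) >= 2]
    dates, values = [], []
    for i, row in enumerate(rows):
        try:
            value = float(row[1])
        except ValueError:
            if i == 0:
                continue  # non-numeric first row: assume it is a header
            raise ValueError(f"row {i + 1}: value {row[1]!r} is not numeric")
        dates.append(row[0].strip())
        values.append(value)
    if len(values) < min_rows:
        raise ValueError(f"need at least {min_rows} rows, got {len(values)}")
    return dates, values
```

Dates are kept as strings here, since any time period label is accepted.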

⚙️

STEP 2 — Full Pipeline Runs on Your Data

The same 6-model ML pipeline runs: normalization → engineering of 9 features → 80/20 split → training. Linear Regression, Ridge, Random Forest, Gradient Boosting, ARIMA, and Exp. Smoothing are all trained on your data.

📊

STEP 3 — All 13 Sections Update

Every chart, table, metric, ACF, decomposition, feature importance, scatter plot, stats, confidence meters — all rebuild using your data. Forecasts show future periods beyond your last date.

🔮

STEP 4 — Get Future Predictions

The best model forecasts 6–15 future time periods beyond your last data point, with confidence intervals shown on the chart.

✓ WHAT YOU CAN UPLOAD

Sales figures · Temperature records · Stock prices · Website traffic · Any sensor reading · Revenue data · Energy consumption · Population data — anything with a date and a number per row.

10 — Descriptive Statistics & Data Quality

Full Dataset Statistics

Statistic	Value	Interpretation

Model Confidence Meters

11 — Predicted vs Actual Scatter Plot

Scatter — Predicted vs Actual (Best Model · Test Set)

Points on the diagonal line = perfect prediction · Spread = error magnitude

MAPE Comparison — All Models

■ Best model ■ <5% Excellent ■ 5–15% Good ■ >15% Poor

12 — Methodology & Model Reference

6 Models — How Each Works

📐Linear Regression

OLS regression on 9 engineered features (lags, rolling stats, trend). Fits a hyperplane minimizing sum of squared residuals. Fast, interpretable, best for linear trends.

🔒Ridge Regression

Like linear regression but with L2 regularization (α=0.3) that penalizes large coefficients. Reduces overfitting when features are correlated. More stable than plain OLS.
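The two linear models can be sketched with one closed-form solver. This is an illustrative pure-Python version, assuming centered features and an unpenalized intercept; `solve` and `fit_ridge` are hypothetical names. Setting `alpha=0` gives plain OLS.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ridge(X, y, alpha=0.3):
    """Return (intercept, weights) minimizing ||y - b - Xw||^2 + alpha*||w||^2."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations on centered data: (Xc^T Xc + alpha I) w = Xc^T yc
    gram = [[sum(r[i] * r[j] for r in Xc) + (alpha if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(p)]
    w = solve(gram, rhs)
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return intercept, w
```

On data generated by y = 2x + 1, `alpha=0` recovers slope 2 and intercept 1, while `alpha=0.3` returns a slightly smaller slope, which is the shrinkage the penalty is meant to produce.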

🌲Random Forest

Ensemble of 60 decision trees trained on bootstrap samples. Final prediction = mean of all trees. Captures non-linear patterns. Provides permutation-based feature importance.

ML · Ensemble

🚀Gradient Boosting

100 trees built sequentially, each fitted to the residuals left by the trees before it. Learning rate 0.1, max depth 3. Often among the most accurate models on tabular feature sets such as these lagged time-series features.

ML · Boosting
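The residual-correcting loop can be sketched as follows. To stay short, this version swaps the depth-3 trees for depth-1 stumps, so it illustrates the boosting idea rather than reproducing the configuration above; all function names are hypothetical.

```python
def fit_stump(X, residuals):
    """Find the single split (feature, threshold) that minimizes squared error."""
    best = None
    for j in range(len(X[0])):
        # Skip the largest value so the right side of a split is never empty.
        for threshold in sorted({row[j] for row in X})[:-1]:
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, threshold, lm, rm)
    return best[1:]


def fit_boosting(X, y, n_trees=100, learning_rate=0.1):
    """Fit stumps one after another, each on the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [t - p for t, p in zip(y, preds)]
        j, thr, lm, rm = fit_stump(X, residuals)
        stumps.append((j, thr, lm, rm))
        preds = [p + learning_rate * (lm if row[j] <= thr else rm)
                 for row, p in zip(X, preds)]
    return base, learning_rate, stumps


def predict_boosting(model, row):
    """Sum the base value and the scaled output of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        lm if row[j] <= thr else rm for j, thr, lm, rm in stumps)
```

Each round removes only a fraction (the learning rate) of the remaining residual, which is why many small trees are used instead of one large one.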

📊ARIMA(1,1,1)

Autoregressive Integrated Moving Average. Differences the series once (d=1) for stationarity, uses 1 AR lag and 1 MA term. Classic time-series model, best for trend-dominated series.

ARIMA
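A simplified sketch of an ARIMA(1,1,1) forecast. It assumes the AR and MA coefficients are chosen by a coarse grid search on the conditional sum of squares, which is a stand-in for proper maximum-likelihood fitting and may differ from what the app does; `arima_111_forecast` is a hypothetical name.

```python
def arima_111_forecast(series, steps):
    """Forecast `steps` values with a grid-searched ARIMA(1,1,1)."""
    w = [b - a for a, b in zip(series, series[1:])]  # d=1: first differences
    mu = sum(w) / len(w)                             # mean difference (drift)
    z = [v - mu for v in w]                          # centered differences

    def css(phi, theta):
        """Return (sum of squared one-step errors, last error)."""
        prev_z, prev_e, total = 0.0, 0.0, 0.0
        for value in z:
            # ARMA(1,1): z_t = phi*z_{t-1} + e_t + theta*e_{t-1}
            e = value - phi * prev_z - theta * prev_e
            total += e * e
            prev_z, prev_e = value, e
        return total, prev_e

    grid = [i / 10 for i in range(-9, 10)]  # keep |phi|, |theta| below 1
    phi, theta = min(((p, q) for p in grid for q in grid),
                     key=lambda pq: css(*pq)[0])
    _, last_e = css(phi, theta)

    forecasts = []
    level = series[-1]
    z_hat = phi * z[-1] + theta * last_e  # one-step-ahead centered difference
    for _ in range(steps):
        level += mu + z_hat               # undo the differencing
        forecasts.append(level)
        z_hat = phi * z_hat               # future shocks have expected value 0
    return forecasts
```

Because the differenced series is forecast and then summed back up, the long-horizon forecast approaches a straight line with slope equal to the mean difference.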

📉Exp. Smoothing (Holt)

Double exponential smoothing with the α (level) and β (trend) parameters chosen by grid search. Forecasts by extrapolating the estimated level and trend. Simple yet effective.

Time-Series
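Holt's method with a grid search can be sketched as below. The 0.1 to 0.9 grid, the initialization from the first two points, and the use of one-step squared error as the selection criterion are assumptions made for the example.

```python
def holt_fit(series, alpha, beta):
    """Run Holt's linear method; return (level, trend, one-step SSE)."""
    level, trend = series[0], series[1] - series[0]
    sse = 0.0
    for y in series[1:]:
        forecast = level + trend            # one-step-ahead forecast
        sse += (y - forecast) ** 2
        new_level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return level, trend, sse


def holt_forecast(series, steps, grid=None):
    """Pick (alpha, beta) by grid search, then extrapolate level + h*trend."""
    grid = grid or [i / 10 for i in range(1, 10)]
    best = min(((holt_fit(series, a, b), a, b) for a in grid for b in grid),
               key=lambda item: item[0][2])
    (level, trend, _), alpha, beta = best
    return [level + (h + 1) * trend for h in range(steps)], alpha, beta
```

For a perfectly linear series such as 2, 4, 6, 8, 10 every parameter pair fits exactly, and the forecast simply continues the line.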

13 — Glossary & Metric Reference

Project Requirements — Full Checklist

✅ REAL DATA — NO FAKE DATA

All 7 datasets sourced from World Bank, FRED/BLS, Global Carbon Project, Macrotrends/Shiller, and EIA. Values cross-verified and corrected. Data reflects real historical events (2008 crash, COVID 2020, inflation spike 2022).

✅ REGRESSION & TIME-SERIES MODELS

Linear Regression (OLS), Ridge Regression (L2), Random Forest (ensemble), Gradient Boosting (sequential trees), ARIMA(1,1,1) (classic time-series), Holt Double Exponential Smoothing. All 6 run and are compared.

✅ CLEAN & PREPROCESS DATA

Z-score normalization, lag features (1,2,3,5), rolling mean/std (window 3 & 5), quadratic trend term, 80/20 temporal train/test split with no data leakage.

✅ EVALUATE MODEL ACCURACY

R², RMSE, MAE, MAPE computed for every model. Model leaderboard table, RMSE bar chart, R² bar chart, MAPE chart, residual analysis, scatter plot, error histogram, confidence meters.

✅ VISUALIZE PREDICTIONS

13 sections of charts: historical series, train/test split, multi-model forecast with CI bands, ARIMA detail, predicted vs actual, residuals, RMSE/R² bars, feature importance, ACF, decomposition (trend/seasonal/residual), scatter, MAPE, histogram.

✅ FORECAST FUTURE TRENDS

Models are trained on data up to 2023 and then automatically forecast 2024, 2025, 2026... (6–15 steps, adjustable). Each future year's prediction is driven by the trend the model learned from real historical data. Confidence intervals are shown.

How the Forecast Works — Does It Reflect Real Trends?

WHAT DRIVES THE 2024–2028 FORECASTS

The models are trained on all real data up to 2023. Then for each future year:

1. The last 6 known values are used as "rolling context"
2. 9 features are built: lag_1 (last year's value), lag_2, lag_3, lag_5, rolling means, rolling std, and trend index
3. The best ML model predicts the next value
4. That prediction becomes the new "lag_1" for the next step

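So if inflation was rising, the model learned that pattern and continues it. If GDP was recovering, it extrapolates that recovery. The forecast is driven by patterns learned from the real historical data, not by a hard-coded growth formula. Because each prediction feeds the next one, errors can compound as the horizon grows.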
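The four steps above can be sketched as a recursive loop. The exact nine-feature layout below (four lags, two rolling means, two rolling stds, one quadratic trend term) is an assumed reading of the feature list, and `predict` is a hypothetical stand-in for the best trained model.

```python
from statistics import mean, pstdev


def build_features(history, t):
    """Build one feature row from the most recent values of `history`.

    `t` is the time index of the value being predicted. The nine
    features are an assumed layout, not the app's confirmed one.
    """
    if len(history) < 5:
        raise ValueError("need at least 5 past values for lag_5")
    last3, last5 = history[-3:], history[-5:]
    return [
        history[-1], history[-2], history[-3], history[-5],  # lags 1, 2, 3, 5
        mean(last3), mean(last5),                            # rolling means
        pstdev(last3), pstdev(last5),                        # rolling stds
        float(t * t),                                        # quadratic trend
    ]


def recursive_forecast(history, predict, steps):
    """Forecast `steps` values, feeding each prediction back as a new lag."""
    context = list(history)
    forecasts = []
    for _ in range(steps):
        row = build_features(context, t=len(context))
        y_hat = predict(row)
        forecasts.append(y_hat)
        context.append(y_hat)  # the prediction becomes lag_1 of the next step
    return forecasts
```

With a toy predictor that continues the last difference, `lambda row: row[0] + (row[0] - row[1])`, the history 1, 2, 3, 4, 5 extends to 6, 7, 8.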

WHAT THE FORECAST CAN AND CANNOT DO

✓ CAN: Extrapolate trend direction from real historical data · Show confidence ranges · Compare 3 different model forecasts · Update instantly when you change the horizon

✗ CANNOT: Know about events after 2023 (wars, elections, pandemics) · Guarantee accuracy beyond 1–2 periods · Replace expert economic forecasting

Key Terms Explained

R² (R-Squared)

Proportion of variance explained by the model. R²=1.0 is perfect; R²=0 means the model is no better than always predicting the mean. It can be negative for very poor models.

RMSE (Root Mean Squared Error)

Square root of the mean of the squared residuals. In the same units as your data. Penalizes large errors more heavily than MAE does. Lower is better.

MAE (Mean Absolute Error)

Average of absolute residuals. More robust to outliers than RMSE. Easier to interpret: "on average, predictions are off by X units."

MAPE (Mean Absolute Percentage Error)

Average absolute error as a percentage of the actual value. Good for comparing across datasets with different scales. Undefined when an actual value is zero, and unstable when actual values are close to zero.
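The four metrics above, written out as a short Python sketch. The function name and the dictionary keys are illustrative. Residuals follow the sign convention used elsewhere on this page (Actual − Predicted).

```python
import math


def regression_metrics(actual, predicted):
    """Return R², RMSE, MAE and MAPE (%) for paired actual/predicted values."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(r * r for r in residuals)                   # unexplained
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)     # total variation
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        # Raises ZeroDivisionError if any actual value is zero.
        "mape": 100 * sum(abs(r / a) for r, a in zip(residuals, actual)) / n,
    }
```

For actual values 100, 200, 300, 400 and predictions 110, 190, 310, 390, every residual has magnitude 10, so RMSE and MAE are both 10 and R² is 0.992.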

Confidence Interval (CI)

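Range where the true value is expected to fall with a given probability. Here the band is the forecast ±1σ, where σ is the standard deviation of the test-set residuals. This corresponds to roughly 68% coverage only if the residuals are approximately normal and the error does not grow with the forecast horizon.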
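A sketch of how such a band can be computed. Whether the sample or the population standard deviation is used is not stated, so the sample version (`statistics.stdev`) is an assumption here, as is the function name.

```python
from statistics import stdev


def residual_band(actual_test, predicted_test, forecasts, k=1.0):
    """Return (lower, upper) pairs: each forecast ± k * residual std."""
    residuals = [a - p for a, p in zip(actual_test, predicted_test)]
    sigma = stdev(residuals)  # assumption: sample standard deviation
    return [(f - k * sigma, f + k * sigma) for f in forecasts]
```

The same width is applied to every horizon, so the band does not widen for forecasts further into the future.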

Autocorrelation (ACF)

Correlation of a series with its own lagged values. High ACF at lag 1 suggests strong momentum; oscillating ACF suggests cyclical patterns.
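A short sketch of the sample ACF together with the ±1.96/√n significance bounds used in the ACF chart. The function name and the returned tuple layout are illustrative.

```python
import math


def acf(series, max_lag=15):
    """Return [(lag, autocorrelation, is_significant)] for lags 1..max_lag."""
    n = len(series)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the series length")
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)       # lag-0 sum of squares
    bound = 1.96 / math.sqrt(n)           # approximate 95% bound
    result = []
    for k in range(1, max_lag + 1):
        r = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        result.append((k, r, abs(r) > bound))
    return result
```

A steadily trending series gives a large positive value at lag 1, while a strictly alternating series gives a value close to −1.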

Feature Importance

Permutation importance: how much model error increases when each feature is randomly shuffled. Larger = more important. Computed on training data using Random Forest.
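Permutation importance does not depend on the model type, so it can be sketched against any prediction function. Here `predict` is a hypothetical callable taking one feature row, mean squared error is the assumed error measure, and the number of shuffles is an arbitrary choice.

```python
import random


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Return the mean increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)

    def mse(rows):
        return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

    baseline = mse(X)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            shuffled = [row[:j] + [v] + row[j + 1:]
                        for row, v in zip(X, column)]
            increases.append(mse(shuffled) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances
```

A feature the model never uses scores exactly zero, because shuffling it leaves every prediction unchanged.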

Train/Test Split (80/20)

80% of data (chronologically earliest) trains the model; 20% (most recent) evaluates it on unseen data. Splitting by time rather than at random keeps future observations out of training, which prevents look-ahead leakage.
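A minimal sketch of the split, assuming the values are already in chronological order and that the cut point is rounded down; the function name is illustrative.

```python
def temporal_split(values, train_fraction=0.8):
    """Split a chronologically ordered list into (train, test) without shuffling."""
    cut = int(len(values) * train_fraction)  # rounds down
    return values[:cut], values[cut:]
```

With 10 observations this gives 8 for training and the 2 most recent for testing.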

Z-score Normalization

Each value is scaled to (value − mean) / std. This puts all features on the same scale, which matters for Ridge (the L2 penalty depends on feature scale) and for gradient-based fitting. Plain OLS and tree-based models do not strictly require it.
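A sketch of the scaling and its inverse. Fitting the mean and standard deviation on the training portion only is an assumption made here to stay consistent with the no-leakage split; the page does not say which rows the statistics come from. Function names are illustrative.

```python
from statistics import mean, pstdev


def zscore_fit(train):
    """Return (mean, std) estimated from the training values only."""
    return mean(train), pstdev(train)


def zscore_apply(values, mu, sigma):
    """Scale values to (value - mean) / std."""
    return [(v - mu) / sigma for v in values]


def zscore_invert(scaled, mu, sigma):
    """Map scaled values back to the original units."""
    return [s * sigma + mu for s in scaled]
```

Predictions made on the normalized scale are passed through `zscore_invert` before being reported in dataset units.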