The problem
The core question: are daily closing trajectories correlated, which would be evidence of real trends, or effectively random? Answering it required both rigorous statistical testing and a predictive model to quantify how much signal is actually recoverable.
Approach
The workflow runs from data preprocessing through exploratory analysis, clustering, correlation testing, and finally predictive modeling.
- Imputation of missing values in the far_price and near_price columns.
- Min-Max scaling and feature engineering on bid-ask spreads and reference prices.
- K-means clustering, with the number of clusters chosen via the Elbow method, plus t-SNE and hierarchical clustering to explore structure in the data.
- Permutation testing and daily-correlation heatmaps to test the trend-versus-random hypothesis.
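A minimal sketch of the steps above, under stated assumptions. The median imputation strategy, the `bid_price` and `ask_price` column names, and the helper names `preprocess`, `elbow_inertias`, and `permutation_pvalue` are illustrative choices, not details taken from the project. The permutation test here uses Pearson correlation between two series as its statistic, which is one common way to frame a trend-versus-random check.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Impute far_price/near_price, add a spread feature, then Min-Max scale."""
    out = df.copy()
    # Assumption: median imputation; the write-up does not name the strategy used.
    for col in ("far_price", "near_price"):
        out[col] = out[col].fillna(out[col].median())
    # Hypothetical column names for the bid-ask spread feature.
    out["spread"] = out["ask_price"] - out["bid_price"]
    scaled = MinMaxScaler().fit_transform(out)
    return pd.DataFrame(scaled, columns=out.columns, index=out.index)


def elbow_inertias(X, k_values):
    """Return the K-means inertia for each k; the elbow is read off this curve."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in k_values
    ]


def permutation_pvalue(x, y, n_permutations=1000, seed=0):
    """Two-sided permutation test for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]
    # Shuffling y breaks any pairing with x, which gives the null distribution.
    null = np.array(
        [np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(n_permutations)]
    )
    # Add-one correction keeps the p-value strictly positive.
    p_value = (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_permutations + 1)
    return observed, p_value
```

A small p-value from `permutation_pvalue` would suggest the observed correlation is unlikely under random pairing. With 1000 permutations the smallest attainable p-value is roughly 0.001, so the permutation count bounds how strong a claim the test can support.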
Tech deep-dive
Four regression models were compared under 5-fold cross-validation: Linear Regression, Ridge, Lasso, and HistGradientBoosting.
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score
model = HistGradientBoostingRegressor()
scores = cross_val_score(model, X, y, cv=5)  # X: features, y: target; default regressor score is R^2
Outcomes & learnings
HistGradientBoostingRegressor was the clear winner, outperforming the three linear baselines under the same 5-fold cross-validation. The correlation and permutation analyses showed structure in daily closing prices that randomness alone does not fully explain: a modest, exploitable signal rather than pure noise.