How to Evaluate a Trading Model — Reading the Trader Detail View in darwintIQ

A model that looks good at a glance can look very different once you examine it metric by metric.

Knowing how to evaluate a trading model properly is the difference between making an informed decision and being misled by a number that looked impressive in isolation. The Trader Detail View in darwintIQ surfaces a comprehensive set of metrics for any model in the ranked population — but having access to metrics and knowing how to read them together are different skills.

This article walks through how to approach that view systematically, which metrics to prioritise, and where surface-level performance can conceal structural fragility.

Start With Fitness and Robustness Before Anything Else

The two metrics that sit at the top of the evaluation hierarchy are Fitness and Robustness Score. These are not return metrics — they reflect the structural quality of the model's results. A high-fitness model is one whose performance characteristics hold together across the evaluation window; a high robustness score indicates that the results are not driven by a handful of exceptional trades or lucky periods.

The distinction matters because a model with high return and low robustness may simply have had a good run over a small number of trades. If the sample is small or the results are concentrated in a short window, the pattern is less likely to repeat than the headline number suggests.

A useful starting point is to look at fitness and robustness together before looking at return. If both are strong, the return figure becomes more meaningful. If robustness is weak relative to fitness, examine how many trades are in the sample and what the standard deviation of returns looks like.

Metrics That Reveal Fragility

Several metrics in the Trader Detail View are specifically useful for identifying models that look solid but carry hidden risk.

Win rate and risk/reward in combination. A high win rate with a poor average risk/reward — say, 75% winners with a 1:0.5 ratio — is not the same as a 50% win rate with a 1:2 ratio. The expected value tells you what the combination actually produces per trade. Examining win rate in isolation, as many traders do, gives an incomplete picture.
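
The comparison above can be worked through numerically. The following is a minimal sketch, assuming returns are measured in units of the amount risked per trade (R) and that every loser costs exactly one unit; the function name `expected_value` is illustrative and is not taken from darwintIQ.

```python
def expected_value(win_rate: float, reward_per_risk: float) -> float:
    """Expected profit per trade, in units of the amount risked (R).

    win_rate: fraction of trades that win, between 0 and 1.
    reward_per_risk: average win divided by average loss
        (a 1:2 risk/reward ratio means reward_per_risk = 2.0).
    """
    if not 0.0 <= win_rate <= 1.0:
        raise ValueError("win_rate must be between 0 and 1")
    loss_rate = 1.0 - win_rate
    # Winners earn reward_per_risk units; losers cost one unit (assumption).
    return win_rate * reward_per_risk - loss_rate * 1.0


high_win_rate = expected_value(0.75, 0.5)  # 75% winners at 1:0.5
balanced = expected_value(0.50, 2.0)       # 50% winners at 1:2

print(f"75% winners at 1:0.5 -> {high_win_rate:+.3f} R per trade")
print(f"50% winners at 1:2   -> {balanced:+.3f} R per trade")
```

Under these assumptions the 50% model earns four times as much per trade as the 75% model, which is why the win rate alone says little.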

Standard deviation. A model with wide variance in its returns can produce the same average as one with narrow variance — but the experience of trading it will be very different, and the drawdown risk is substantially higher. In the Trader Detail View, standard deviation gives context to both the return and the drawdown figures.
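
To illustrate the point, the sketch below compares two hypothetical series of per-trade returns that share the same average. The numbers are invented for illustration and do not come from any darwintIQ model.

```python
import statistics

# Hypothetical per-trade returns in percent; both series average about 1.0.
steady = [0.8, 1.2, 0.9, 1.1, 1.0, 1.0]
volatile = [6.0, -4.0, 5.0, -3.0, 4.0, -2.0]

steady_mean = statistics.mean(steady)
volatile_mean = statistics.mean(volatile)

# Sample standard deviation (n - 1 in the denominator).
steady_sd = statistics.stdev(steady)
volatile_sd = statistics.stdev(volatile)

print(f"steady:   mean={steady_mean:.2f}  stdev={steady_sd:.2f}  worst={min(steady):+.1f}")
print(f"volatile: mean={volatile_mean:.2f}  stdev={volatile_sd:.2f}  worst={min(volatile):+.1f}")
```

The averages match, but the second series has a far larger spread and includes losing trades several times the size of its average gain, which is where the extra drawdown risk comes from.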

Drawdown and Calmar Ratio. Maximum drawdown alone does not tell you whether that drawdown was reasonable given the return it accompanied. The Calmar ratio divides the return (conventionally annualised) by the maximum drawdown, producing a normalised view. A model with a 10% return and a 5% drawdown has a very different risk profile from one with a 10% return and a 20% drawdown, even though the headline return is identical.
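
The two cases in this paragraph can be reproduced from an equity curve. This is a minimal sketch with simplifying assumptions: it uses the total return over the window rather than an annualised figure, the equity values are invented, and the helper names `max_drawdown` and `calmar_ratio` are illustrative rather than darwintIQ's own.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def calmar_ratio(equity):
    """Return over the window divided by maximum drawdown (not annualised)."""
    total_return = equity[-1] / equity[0] - 1.0
    drawdown = max_drawdown(equity)
    if drawdown == 0.0:
        return float("inf")  # no drawdown observed in the window
    return total_return / drawdown


# Both hypothetical curves end 10% higher than they started.
shallow = [100.0, 104.0, 98.8, 106.0, 110.0]  # 5% maximum drawdown
deep = [100.0, 105.0, 84.0, 100.0, 110.0]     # 20% maximum drawdown

for name, curve in (("shallow", shallow), ("deep", deep)):
    print(f"{name}: drawdown={max_drawdown(curve):.0%}  calmar={calmar_ratio(curve):.2f}")
```

With identical returns, the ratio falls from roughly 2.0 to roughly 0.5 as the drawdown grows from 5% to 20%.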

Exposure. High exposure — a large proportion of time in open positions — amplifies all risks. A model with high exposure and moderate drawdown is carrying more uncompensated risk than one with low exposure and a similar drawdown.
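
How darwintIQ computes exposure is not stated here. One plausible reading, sketched below as an assumption, is the fraction of the evaluation window during which at least one position is open. The function name, the trade intervals, and the use of hours as the time unit are all hypothetical.

```python
def exposure(trades, window_start, window_end):
    """Fraction of the window with at least one open position.

    trades: iterable of (entry_time, exit_time) pairs in the same
        numeric unit as the window bounds. Overlapping trades are
        merged so that shared time is only counted once.
    """
    covered = 0.0
    current_start = current_end = None
    for start, end in sorted(trades):
        # Clip each trade to the evaluation window.
        start, end = max(start, window_start), min(end, window_end)
        if start >= end:
            continue
        if current_end is None or start > current_end:
            # Gap before this trade: close the previous merged interval.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / (window_end - window_start)


# Hypothetical trades, in hours since the start of a 100-hour window.
trades = [(0, 10), (5, 20), (50, 60)]
share = exposure(trades, 0, 100)
print(f"time in market: {share:.0%}")
```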

Reading the Distribution Statistics

The lower section of the Trader Detail View contains distribution statistics: Jensen-Shannon Divergence, Wasserstein Distance, KS Statistic, PSI, and Mutual Information. These metrics assess whether the model's current return distribution still matches its validated historical pattern.

Each captures a slightly different aspect of distribution shift. The KS statistic measures the largest gap between the cumulative distributions of the current returns and the backtest returns, which is the basis for testing whether both come from the same distribution. PSI measures whether the model's input distribution (the market conditions it is operating in) has changed. Wasserstein distance captures how much effort it would take, conceptually, to transform one distribution into the other.
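
For readers who want to see what three of these statistics measure, here is a small sketch using their textbook definitions in plain Python. It is not darwintIQ's implementation: the binning scheme for PSI, the `eps` floor for empty bins, the sample data, and all function names are assumptions made for illustration. In practice, library routines such as `scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance` would normally be preferred.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def wasserstein_1d(a, b):
    """1-D Wasserstein distance for two samples of equal size."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def psi(reference, current, bins=5, eps=1e-6):
    """Population Stability Index over equal-width bins of the reference."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def shares(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clip out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(sample), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical returns: a backtest sample and a live sample shifted down by 1.0.
backtest = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, -0.2, 0.3, 0.8]
shifted = [x - 1.0 for x in backtest]

for name, live in (("unchanged", backtest), ("shifted", shifted)):
    print(f"{name:>9}: KS={ks_statistic(backtest, live):.2f} "
          f"W1={wasserstein_1d(backtest, live):.2f} "
          f"PSI={psi(backtest, live):.2f}")
```

All three values are zero when the live sample matches the backtest sample and rise once the live sample shifts.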

When these statistics are low, the model's current behaviour matches its validated behaviour closely. When they rise, something has changed — either in the market or in how the model is performing within it. A model with excellent fitness but elevated PSI and KS values warrants caution: the surface looks good, but the statistical foundation may be shifting.

What Good Looks Like in Context

There is no universal threshold that makes a model worth following. A strong model in a trend-dominant market will look different from a strong model in a ranging market — different metrics will be elevated, different entry types will be leading.

What good looks like is internal consistency: fitness and robustness aligned, expected value positive, distribution statistics stable, exposure appropriate, and Calmar ratio indicating that the return justifies the drawdown required to produce it. When all of those point in the same direction, the model has genuine structural quality rather than surface-level appeal.

The Stability Score provides one useful summary metric: it measures how consistently the model produces its results over time. A high Stability Score alongside strong fitness is a meaningful combination. A high Stability Score on a model with poor robustness is less reassuring — it means the model is consistently underwhelming rather than variably impressive.

Final Thoughts

Evaluating a trading model is a process of reading a set of metrics together rather than optimising for any single number. The Trader Detail View in darwintIQ provides the full set of metrics required to do this, but the work of reading them in combination, understanding what each reveals, and knowing which combinations signal fragility still requires systematic thinking.

The highest-ranked models in any regime will typically show alignment across the key structural metrics. When one or more metrics are out of step with the others, that divergence is the most important thing to examine before deciding whether to act on a model's signals.