Financial Data Science: a fable

I wrote this as a cautionary tale for Engineering undergraduate students interested in Data Science and Artificial Intelligence (initially given as a guest lecture at Mexico’s ITAM; video link in Spanish).

Data Science Venn Diagram (source)

Data Science is supposed to be the intersection of Computer Science (hacking, understanding databases), Math and Statistics (advanced modelling) and Domain Knowledge (Finance, for this tale). Each discipline traditionally required its own career syllabus, and in real life it is almost impossible to find someone who is proficient in all three fields.

In my experience, I have noticed that some people might be very strong in two out of three:

  • Economists with Econometric specialties can solve Stochastic Equations in their sleep … and use VBA and Excel for Monte Carlo simulations.
  • Computer Scientists can build databases that can handle tick by tick data … and be unable to compute the duration of a Bond.
  • And Mathematicians assume cows are spherical and live in a vacuum.

Instead, the above diagram should be used as a reminder of the skills we need to solve a problem, and of the need to team up to fill the gaps.

Within Data Science, Artificial Intelligence has gained popularity. For a more detailed review read my blog: AI in Finance: Cutting Through the Hype, but for the purposes of this story just keep in mind that currently AI is just a collection of tools, each specialising in a very narrow task. I’ll borrow this graph for a moment:


AI has conquered the game space (AlphaGo), is great at writing like Shakespeare (take a look at “The Unreasonable Effectiveness of Recurrent Neural Networks”; go to the end of this blog for sample code that uses Reddit posts), and can be used to create deepfake lip-syncing videos (open source code or use; see art singing).

Art Singing — link

Can we harness AI’s power for Finance?

Now to the fable. I will (very informally) detail a typical Financial Data Science project — but, as opposed to fairy tales that end happily ever after, I will highlight all the wrong turns that the protagonists usually make (over and over again) and write down the lessons to keep.

Pair trading (see wiki) is one of the oldest quantitative strategies around. Among Financial professionals there is a rule of thumb that publishing a quantitative strategy kills its effectiveness (Does Academic Research Destroy Stock Return Predictability?), but websites like pair trading lab are happy to provide data to their paying customers: more than 10 million pairs analyzed.

1990s quant traders (initially Math/Physics PhDs who learned Finance on the job) gave up trying to forecast the direction of the market while denigrating technical analysis as “they are not able to form this opinion in a scientifically sound way” (read The Evolution of Technical Analysis: Financial Prediction from Babylonian Tablets to Bloomberg Terminals for a counterpoint from Andrew Lo, but also see the origin of the vomiting camel trend). Instead of predicting the market direction (a job now left to Macro Hedge Funds), they hoped to profit from what they thought were temporary statistical divergences.

I know, a linear regression looks too simple …

… but by keeping it simple we can more easily understand the problems.

If you plot the prices of two stocks over time you will notice that they tend to move together:

click here for interactive version (click on the green GME for out of sample data)

If you now draw a scatter plot of the prices of one stock versus the other, you will instinctively try to fit a line (as taught to high school children), a higher order polynomial function, or even go further and try a full-fledged Machine Learning / Supervised Learning / Regression technique.

click here for interactive version (click on ref GME for out of sample data)

The above graphs are interesting, but the magic of the linear model is that it immediately gives out the trade:

click here for interactive version (click on red out of sample)

If the residual is above the mean, the first stock is ‘expensive’ relative to the base stock: you can sell it, buy the base stock (in the ratios defined by the linear regression) and then wait for the spread to mean revert (and vice versa if the price spread is below the mean). Buying and short selling simultaneously is what Hedge Funds do. (Note: the above graph uses the standard deviation of the linear regression residuals; a more appropriate technique would use time series mean reversion analytics, see Mean reversion in Finance: definitions and the companion colab notebook. I do not use them in this story to avoid introducing even more complex implicit assumptions.)
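A minimal sketch of the rule just described, using synthetic prices rather than the stocks in the graphs. The function name `pair_signal`, the entry threshold of two standard deviations and the simulated data are all assumptions made for illustration; they are not the rules from the pair trading books.

```python
import numpy as np

def pair_signal(base, stock, entry_z=2.0):
    """Fit stock = alpha + beta * base by least squares and
    turn the latest residual into a (hypothetical) trade signal."""
    base = np.asarray(base, dtype=float)
    stock = np.asarray(stock, dtype=float)
    beta, alpha = np.polyfit(base, stock, deg=1)  # slope first, then intercept
    residual = stock - (alpha + beta * base)
    # Residual measured in units of the regression's residual standard deviation
    z = residual / residual.std(ddof=2)
    if z[-1] > entry_z:
        action = "sell stock, buy base"   # stock looks expensive versus base
    elif z[-1] < -entry_z:
        action = "buy stock, sell base"   # stock looks cheap versus base
    else:
        action = "no trade"
    return alpha, beta, z, action

# Synthetic example: two prices that move together, plus a forced divergence
rng = np.random.default_rng(0)
base = 50 + np.cumsum(rng.normal(0, 1, 250))
stock = 10 + 1.5 * base + rng.normal(0, 2, 250)
stock[-1] += 20  # the last day drifts far above the fitted line

alpha, beta, z, action = pair_signal(base, stock)
print(f"beta={beta:.2f} last z-score={z[-1]:.1f} -> {action}")
```

The fitted beta is the hedge ratio: for each unit of the stock sold, roughly beta units of the base stock are bought.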

As to when to sell, how much, when to buy — the pair trading books above provide rules.

At this point it is useful to bring the three specialists together and figure out how to profit from this behaviour.

Statistics view: A linear regression model includes some assumptions that we need to test; we can use a more robust error measure to reduce the influence of outliers; we can model mean reversion (see my Trading Mean Reversion blog); we need proper back testing. There are whole books and Python libraries devoted to pair trading statistics (Optimal Mean Reversion Trading, A Python Package for Optimal Mean Reversion Trading).

AI twist: The classical linear regression formulas minimize the squared error of the prediction in one go, but you could use gradient descent and/or iterative methods to estimate the beta and constant.
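To make the two routes concrete, here is a small sketch that estimates the constant and the beta both ways on synthetic data and checks that they agree. The learning rate, the number of steps and the standardisation of the input are choices made for this illustration only.

```python
import numpy as np

def ols_closed_form(x, y):
    """Solve the least squares problem in one go."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [constant, beta]

def ols_gradient_descent(x, y, lr=0.1, steps=500):
    """Minimise the mean squared error iteratively."""
    # Standardise x so that one learning rate suits both parameters
    mu, sigma = x.mean(), x.std()
    xs = (x - mu) / sigma
    a, b = 0.0, 0.0
    for _ in range(steps):
        err = a + b * xs - y
        a -= lr * 2 * err.mean()          # gradient of the MSE w.r.t. a
        b -= lr * 2 * (err * xs).mean()   # gradient of the MSE w.r.t. b
    beta = b / sigma                      # map back to the original scale
    constant = a - beta * mu
    return np.array([constant, beta])

rng = np.random.default_rng(1)
x = 50 + np.cumsum(rng.normal(0, 1, 250))
y = 10 + 1.5 * x + rng.normal(0, 2, 250)

print("closed form     :", ols_closed_form(x, y))
print("gradient descent:", ols_gradient_descent(x, y))
```

Both routes minimise the same squared error, so they land on the same line; the iterative version only becomes interesting when the model no longer has a closed-form solution.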

Computer Science view: Pair Trading Lab claims to have analyzed 10 million US stock pairs. The hacker should be able to connect to heterogeneous databases to get all the price data from thousands of stocks, set up a computer architecture able to make all the calculations, set up a database able to store the raw data plus the results, etc.
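A back-of-the-envelope sketch of why this is a computing problem: the number of pairs grows with the square of the number of stocks. The assumption here is that each unordered pair is counted once; the site may count pairs differently.

```python
from itertools import combinations
from math import comb

def min_universe(target_pairs):
    """Smallest number of stocks whose unordered pairs reach target_pairs."""
    n = 2
    while comb(n, 2) < target_pairs:
        n += 1
    return n

# Tiny illustrative universe: every unordered pair is one regression to fit
tickers = ["KO", "PEP", "GME"]
pairs = list(combinations(tickers, 2))
print(pairs)

# How many stocks are needed before there are 10 million pairs to analyse?
print(min_universe(10_000_000))
```

Under that counting assumption, a universe of a few thousand stocks is already enough to produce millions of regressions.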

Fundamentals view: The finance guy should be able to nip nonsensical trade strategies in the bud (Spurious correlations; remember, correlation is not causation), but above all the domain knowledge expert should be able to identify the scenarios in which the strategy can fail. If you have read Black Swan, you will realize that the domain knowledge guy should be aware of the blind spots of the model derived by the Statistician: look at the implied assumptions and try to figure out ‘what if’ scenarios.

Read the simplified Pepsi vs Coca-Cola wiki example (wikipedia):

Pepsi (PEP) and Coca-Cola (KO) are different companies that create a similar product, soda pop. Historically, the two companies have shared similar dips and highs, depending on the soda pop market. If the price of Coca-Cola were to go up a significant amount while Pepsi stayed the same, a pairs trader would buy Pepsi stock and sell Coca-Cola stock, assuming that the two companies would later return to their historical balance point. If the price of Pepsi rose to close that gap in price, the trader would make money on the Pepsi stock, while if the price of Coca-Cola fell, they would make money on having shorted the Coca-Cola stock.

The reason for the deviated stock to come back to original value is itself an assumption. It is assumed that the pair will have similar business idea as in the past during the holding period of the stock.

How does that assumption lead to a linear model? I have not found a reference, but if you:

  • assume a stock price can be valued as a function of many underlying factors (interest rate r, dividend D, growth g, macro factors mf, etc — see intrinsic value of a stock)
  • assume that for two companies with the same business model the input values will be similar
  • assume a linear approximation of the intrinsic value can be obtained (which requires small variations around the variables — using Taylor’s theorem)

then, without deriving the formulas, you can intuitively sense that the two stock prices will be linearly related for small changes in the variables impacting the equity.
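For readers who want the intuition spelled out, here is one possible sketch of the argument; it is not taken from a reference. The valuation functions f_A and f_B, the expansion point x_0 and the proportionality condition are assumptions introduced here, and the last one is a strong reading of "same business model".

```latex
% Assumed: both prices are smooth functions of the same factor vector
% x = (r, D, g, mf, \dots), expanded to first order around a base point x_0.
\begin{align}
P_A(x) &\approx f_A(x_0) + \nabla f_A(x_0)^{\top} (x - x_0) \\
P_B(x) &\approx f_B(x_0) + \nabla f_B(x_0)^{\top} (x - x_0)
\end{align}
% Assumed: "same business model" means proportional factor sensitivities,
% \nabla f_A(x_0) = \beta \, \nabla f_B(x_0). Substituting gives
\begin{align}
P_A(x) - f_A(x_0) &\approx \beta \left( P_B(x) - f_B(x_0) \right) \\
P_A(x) &\approx \alpha + \beta \, P_B(x),
\qquad \alpha = f_A(x_0) - \beta \, f_B(x_0)
\end{align}
```

The approximation holds only while x stays close to x_0 and the sensitivities stay proportional; once either condition fails, so does the straight line.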

Regime Change/Black Swan: Every time a specific statistics technique is used, assumptions are introduced.

Take a look at what happened in January 2021 to the stock pair I showed before. Open the following links, then click on the ‘o.o.s’ hidden plots (for out of sample data; I hid them to avoid spoiling the impact).

Spoiler about the GME price action: read A refresher on what the $#@! is going on with GME

A linear model introduces four assumptions, among them one called ‘homoscedasticity’, which translated back to real life lingo means that all the errors share the same variance. If there is a variance regime change (if the variance for the new samples is higher), the model stops working. As the wiki page of ‘heteroscedasticity’ says:

“The existence of heteroscedasticity is a major concern in regression analysis and the analysis of variance, as it invalidates statistical tests of significance that assume that the modelling errors all have the same variance.”
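A simple way to see a variance regime change in numbers is to fit the line on the training window only and compare the size of the new errors with the old ones. The sketch below uses synthetic data; the function name `variance_ratio` and the simulated jump in noise are assumptions for illustration, and a plain ratio is not a formal statistical test.

```python
import numpy as np

def variance_ratio(base_in, stock_in, base_out, stock_out):
    """Out-of-sample mean squared residual divided by in-sample residual variance.
    Values far above 1 suggest the errors no longer share the same variance."""
    beta, alpha = np.polyfit(base_in, stock_in, deg=1)  # fit on training data only
    res_in = stock_in - (alpha + beta * base_in)
    res_out = stock_out - (alpha + beta * base_out)
    return np.mean(res_out ** 2) / np.var(res_in, ddof=2)

rng = np.random.default_rng(42)
split = 250
base = 50 + np.cumsum(rng.normal(0, 1, 300))
stock_calm = 5 + 2 * base + rng.normal(0, 1, 300)

# Same history, but the last 50 days receive much noisier shocks
stock_turbulent = stock_calm.copy()
stock_turbulent[split:] += rng.normal(0, 5, 300 - split)

calm = variance_ratio(base[:split], stock_calm[:split],
                      base[split:], stock_calm[split:])
turbulent = variance_ratio(base[:split], stock_turbulent[:split],
                           base[split:], stock_turbulent[split:])
print(f"calm ratio={calm:.2f} turbulent ratio={turbulent:.2f}")
```

Note that the check only works after the new data has arrived: nothing in the training window warns that the variance is about to change.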

Furthermore, from the Finance side we had assumed ‘that the pair will have similar business idea’.

Our simple data science project found a pair of correlated stocks with two different business ideas; that was the first red flag. The second was to assume that no regime change would happen.

Artificial Intelligence models based on the same principles (correlation, regression) will fail in the same ways — if there is a regime change, the model fitted with the training data will be unable to extrapolate. (Why Financial Time Series LSTM prediction does not work)

Assuming stocks are spherical and move in a vacuum, you can program a robo-trader that will outperform all human traders (Giving a worms brain to a robo-trader and Teaching a robot to buy low and sell high)

Using only previous price data does not allow us to identify the regime changes; but there is no rule that tells us that we cannot look around for other sources of clues.

In the late noughties Bollen claimed that Twitter mood predicts the stock market. That gave the (bored) Computer Scientist something to do — now we needed to ‘hack’ into the internet to get all this magic alternative data that would allow us to make millions from stocks.

The code relies on Sentiment Analysis (mapping text to a number) and apparently could predict the returns for the prices of the companies mentioned. You can read how the basics of Sentiment Analysis work in my blog (Demystifying) Sentiment Analysis in Finance (the spoiler: make sure you use the correct sentiment measure; do not use movie reviews on r/WallStreetBets).
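A toy illustration of why the choice of sentiment measure matters. Both word lists and their weights are invented for this example; real lexicons are far larger, and this is not the method used in the paper or in the blog mentioned above.

```python
import re

# Hypothetical lexicons: word -> score. All entries are made up for illustration.
MOVIE_LEXICON = {"great": 1, "love": 1, "boring": -1, "terrible": -1, "insane": -1}
WSB_LEXICON = {"moon": 1, "squeeze": 1, "tendies": 1, "bagholder": -1, "puts": -1}

def sentiment(text, lexicon):
    """Map a text to a number: the average score of the words found in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

post = "GME to the moon, this squeeze is insane"
print("movie-review lexicon:", sentiment(post, MOVIE_LEXICON))
print("forum lexicon       :", sentiment(post, WSB_LEXICON))
```

The same post gets opposite scores depending on which vocabulary is used to read it.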

However, there are other ways around it: in “Mean Reversion II: Pairs Trading Strategies” Deutsche Bank uses Sentiment Analysis to identify regime changes (simplistically, if there is a lot of news about a company, do not engage in pair trading).

Even simpler, we can add the count of mentions of a stock in r/WallStreetBets as an indicator of regime change (or ‘stay away’ indicator)

GME stock price (close) versus count of GME mentions in r/WallStreetBets (full graph)

(you can see the daily counts on my blog Reddit Analysis)
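The mention count idea can be sketched in a few lines. The posts below are invented, and the function names and the threshold of three mentions are arbitrary choices for illustration; a real version would read the posts from the Reddit data and calibrate the threshold.

```python
import re
from collections import Counter

def daily_mentions(posts, ticker):
    """Count, per day, the posts that mention the ticker as a whole word.
    `posts` is an iterable of (date_string, text) tuples."""
    pattern = re.compile(rf"\b{re.escape(ticker)}\b")
    counts = Counter()
    for day, text in posts:
        if pattern.search(text):
            counts[day] += 1
    return counts

def stay_away_days(counts, threshold):
    """Days on which the mention count reaches the (hypothetical) threshold."""
    return sorted(day for day, n in counts.items() if n >= threshold)

# Invented posts, for illustration only
posts = [
    ("2021-01-04", "Thinking about index funds"),
    ("2021-01-04", "GME earnings?"),
    ("2021-01-27", "GME to the moon"),
    ("2021-01-27", "Holding GME"),
    ("2021-01-27", "Why is GME halted?"),
    ("2021-01-27", "GMEX is not a ticker I know"),
]

counts = daily_mentions(posts, "GME")
print(dict(counts))
print(stay_away_days(counts, threshold=3))
```

On flagged days the pair trade is simply switched off, which is the ‘stay away’ use described above rather than an attempt to predict returns.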

Alternative Data is a never-ending game of whack-a-mole: new sources keep appearing (in the 90’s it was internet news, in the noughties it was Twitter, now it is Reddit, Discord, TikTok).

Judea Pearl gives a beautiful description of the origin of scatter plots (read his Book of Why, chapter 2, about Francis Galton) and of mean reversion (how the height of sons returns to the population average over time: the sons of very tall parents tend to be shorter than their parents). Read this section:

“Economists essentially never used path diagrams and continue not to use them to this day, relying instead on numerical equations and matrix algebra. A dire consequence of this is that, because algebraic equations are nondirectional (that is, x= y is the same as y = x), economists had no notational means to distinguish causal from regression equations”

Causality is the reason why the domain knowledge part of Data Science is fundamental: regardless of how advanced the mathematics, how powerful the computing system or how large the data set, a Data Science project developed without an understanding of the underlying Financial dynamics is equivalent to technical analysis.

Current deep learning Artificial Intelligence systems derived from algebraic equations are not able to incorporate causal effects (read this lecture from this MIT class for a strict academic reason, or look at this cat causes motorbike crash gif). Using Pearl’s ladder of causation, Deep Learning robots remain on the first rung (as of 2020; current research [1] [2] is under way to move up the ladder):

Illustration: Maayan Harel

For a (toy) example of how causality can be applied, read Illiquid Wine.


Find here a working LSTM model (link) that writes like a Reddit contributor!

Co-Founder of Lamat, a company specialized in solving high-value problems in finance by applying cutting edge numerical methods.
