### Fake data, part 3: Bypassing the central limit theorem

*May 2017: This is from my other blog that's no longer online. The original comments are no longer available, but you are welcome to add more.*

Now that I have my black swan distribution, and I verified that it fits market returns remarkably well while being mathematically tractable, I want to use it to generate artificial market data.

Generating a series of closing prices is easy.

1. Select a starting price and take its logarithm.
2. Select a mean *μ* and standard deviation *σ* of log returns ln(*P*_{0}/*P*_{1}), where *P*_{0} is the most recent price and *P*_{1} is the previous price.
3. Generate a uniformly distributed random probability *p* between 0 and 1. Plug it into the inverse black swan distribution (using *a*=1.6):

   $$B^{-1}(p;\mu,s) = \mu - 2 s \, \sinh \left[\frac {\tanh^{-1}(1-2p)} {a} \right]$$
4. Add the result to a running total.
5. Go to step 3. Repeat as often as desired.
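The steps above can be sketched in Python. This is an illustration under stated assumptions, not the code behind the original plots: the names `inv_black_swan`, `uniform_open`, and `closing_prices` are made up here, and the shape parameter *s* is passed in directly rather than derived from *σ*.

```python
import math
import random

A = 1.6  # the value of a used in the post


def inv_black_swan(p, mu, s, a=A):
    """Inverse black swan distribution: maps a probability p in (0, 1) to a log return."""
    return mu - 2.0 * s * math.sinh(math.atanh(1.0 - 2.0 * p) / a)


def uniform_open(rng):
    """Uniform draw on the open interval (0, 1); p == 0 would make atanh blow up."""
    p = rng.random()
    while p == 0.0:
        p = rng.random()
    return p


def closing_prices(start_price, mu, s, count, rng=None):
    """Random walk in log-price space; returns `count` closing prices."""
    rng = rng or random.Random()
    log_price = math.log(start_price)      # step 1
    prices = []
    for _ in range(count):                 # step 5: repeat as often as desired
        step = inv_black_swan(uniform_open(rng), mu, s)  # step 3
        log_price += step                  # step 4: running total
        prices.append(math.exp(log_price))
    return prices
```

For example, `closing_prices(100.0, 0.0, 0.01, 250)` would give 250 closes starting near 100; the values `0.0` and `0.01` are placeholders, not fitted parameters.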

This works fine. But what if we want more than single closing prices? How do I get a series of daily price bars, with high, low, and closing values for my fake market? The way to mimic real market behavior is to generate a series of *n* values for each "day" and record the highest, lowest, and final price as the high, low, and close of the day. That is, define *S*_{d} as a daily step and *S*_{i} as an intraday step, so that each daily step is the sum of *n* intraday steps.

The scaling of the intraday mean *μ*_{i} and standard deviation *σ*_{i} (corresponding to the shape parameter *s*_{i}) is necessary to preserve the desired mean *μ*_{d} and standard deviation *σ*_{d} for daily values: for *n* independent steps, that means *μ*_{i} = *μ*_{d}/*n* and *σ*_{i} = *σ*_{d}/√*n*. The corresponding shape parameter *s* comes from the approximation described in the previous part, using *a*=1.6.

But look at what happened! A disaster. The figure shows the resulting distributions for 1, 10, and 100 intraday steps. Look at the big rounded top made up of green dots representing *n*=100. As the number of intraday steps *n* grows, the resulting distribution of daily values approaches a *normal* distribution.
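The high-low-close construction described above can be sketched as follows. It is a rough Python illustration, not the original Visual Basic: the function names are hypothetical, and the intraday parameters `mu_i` and `s_i` are taken as given, so whatever scaling you choose has to be applied before calling it.

```python
import math
import random

A = 1.6  # the value of a used in the post


def inv_black_swan(p, mu, s, a=A):
    """Inverse black swan distribution for a probability p in (0, 1)."""
    return mu - 2.0 * s * math.sinh(math.atanh(1.0 - 2.0 * p) / a)


def uniform_open(rng):
    """Uniform draw on (0, 1), skipping an exact 0 where atanh diverges."""
    p = rng.random()
    while p == 0.0:
        p = rng.random()
    return p


def day_bar(log_open, mu_i, s_i, n, rng):
    """Walk n intraday steps from log_open; return (high, low, close) prices."""
    log_price = log_open
    path = []
    for _ in range(n):
        log_price += inv_black_swan(uniform_open(rng), mu_i, s_i)
        path.append(log_price)
    return math.exp(max(path)), math.exp(min(path)), math.exp(path[-1])


def price_bars(start_price, mu_i, s_i, n, days, rng=None):
    """Chain daily bars: each day starts from the previous day's close."""
    rng = rng or random.Random()
    log_open = math.log(start_price)
    bars = []
    for _ in range(days):
        bar = day_bar(log_open, mu_i, s_i, n, rng)
        bars.append(bar)
        log_open = math.log(bar[2])
    return bars
```

Only the *n* intraday values feed the high and low here, matching the description; whether the previous close should also count toward the day's range is a design choice the post leaves open.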

I'm back where I started! The whole point was to find something *better* than a normal distribution to model the markets. I don't want to end up with a normal distribution, I want to end up with my black swan distribution.

So what happened? The central limit theorem, evidently the "sovereign law" of probability theory, came in and took over. It says that the sum of many independent random variables with a finite mean and variance will be approximately normally distributed, regardless of the underlying distribution we start with.

Although the black swan distribution has infinite kurtosis, it does have a finite mean and variance. This means, if we generate a bunch of tiny black swan steps to build large "daily" steps in a random walk, the large steps will approximate a normal distribution.
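One way to watch this happen numerically is to track the sample excess kurtosis of summed steps, which should drift toward 0 (the normal value) as *n* grows. The sketch below is my own illustration in Python with made-up helper names; it skips the rescaling of *s* because kurtosis does not depend on scale. Since the true kurtosis is infinite, the *n*=1 figure is unstable from run to run, so treat the numbers as indicative only.

```python
import math
import random

A = 1.6  # the value of a used in the post


def inv_black_swan(p, mu, s, a=A):
    """Inverse black swan distribution for a probability p in (0, 1)."""
    return mu - 2.0 * s * math.sinh(math.atanh(1.0 - 2.0 * p) / a)


def uniform_open(rng):
    """Uniform draw on (0, 1), skipping an exact 0 where atanh diverges."""
    p = rng.random()
    while p == 0.0:
        p = rng.random()
    return p


def excess_kurtosis(xs):
    """Sample excess kurtosis: fourth central moment over squared variance, minus 3."""
    m = sum(xs) / len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / len(xs)
    m4 = sum((x - m) ** 4 for x in xs) / len(xs)
    return m4 / (m2 * m2) - 3.0


def summed_steps(n, samples, rng, s=1.0):
    """Each sample is the sum of n independent black swan steps with mean 0."""
    return [
        sum(inv_black_swan(uniform_open(rng), 0.0, s) for _ in range(n))
        for _ in range(samples)
    ]


if __name__ == "__main__":
    rng = random.Random(2017)  # arbitrary seed, for repeatability only
    for n in (1, 10, 100):
        sums = summed_steps(n, 16000, rng)
        print(n, round(excess_kurtosis(sums), 2))
```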

How can I break this law?

After much experimentation, I found a way to get past the central limit theorem. If we perturb *μ*_{d} to vary black-swanly with each daily step (not each intraday step), we get values that still appear distributed according to black swan, and so do their sums!

$$\large \varepsilon = \begin{cases} \frac{B^{-1}(p;0,s_d)}{\sqrt{n}\sqrt{n+1}} = \frac{- 2 s_d \, \sinh \left[\frac {\tanh^{-1}(1-2p)}{a} \right]}{\sqrt{n}\sqrt{n+1}} & n>1 \\ 0 & n=1 \end{cases}$$
Re-generate *ε* **once per day**, and use that value for all intraday steps (if you generate a new *ε* for each intraday step, you end up with a normal distribution again). For each intraday step *S*_{i}, use these values of *μ*_{i} and *s*_{i}:

$$\mu_i = \frac{\mu_d}{n}+\varepsilon, \quad s_i = \frac {s_d}{n}$$

For large *n*, the adjustment to *μ*_{d} basically perturbs the mean each day by a small black swan distribution having a standard deviation of *σ*_{d}/*n*. For *n*>50 or so, one can simply use *n* by itself in the denominator of the expression for *ε*. Again, *p* is a random probability between 0 and 1.
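Put together, the per-day perturbation might look like this in Python. It is a sketch with hypothetical names (`daily_epsilon`, `perturbed_bars`), not the original Visual Basic, and it takes *s*_{d} as an input rather than deriving it from *σ*_{d}.

```python
import math
import random

A = 1.6  # the value of a used in the post


def inv_black_swan(p, mu, s, a=A):
    """Inverse black swan distribution for a probability p in (0, 1)."""
    return mu - 2.0 * s * math.sinh(math.atanh(1.0 - 2.0 * p) / a)


def uniform_open(rng):
    """Uniform draw on (0, 1), skipping an exact 0 where atanh diverges."""
    p = rng.random()
    while p == 0.0:
        p = rng.random()
    return p


def daily_epsilon(p, s_d, n, a=A):
    """Perturbation of the intraday mean; zero when there is a single step per day."""
    if n == 1:
        return 0.0
    return inv_black_swan(p, 0.0, s_d, a) / (math.sqrt(n) * math.sqrt(n + 1))


def perturbed_bars(start_price, mu_d, s_d, n, days, rng=None):
    """Daily (high, low, close) bars built from n intraday steps per day."""
    rng = rng or random.Random()
    log_price = math.log(start_price)
    s_i = s_d / n
    bars = []
    for _ in range(days):
        eps = daily_epsilon(uniform_open(rng), s_d, n)  # drawn once per day
        mu_i = mu_d / n + eps                           # reused for every intraday step
        path = []
        for _ in range(n):
            log_price += inv_black_swan(uniform_open(rng), mu_i, s_i)
            path.append(log_price)
        bars.append((math.exp(max(path)), math.exp(min(path)), math.exp(path[-1])))
    return bars
```

A quick sanity check on the scaling, assuming the daily and intraday draws are independent and that *σ* is proportional to *s* at fixed *a* (which holds because *μ* and *s* only shift and stretch the distribution):

$$\operatorname{Var}(S_d) = n\left(\frac{\sigma_d}{n}\right)^2 + n^2\,\frac{\sigma_d^2}{n(n+1)} = \sigma_d^2\left[1+\frac{1}{n(n+1)}\right]$$

So the daily standard deviation stays close to *σ*_{d} for any *n*>1, with almost all of it coming from the single daily draw of *ε*.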

Here's how it worked out:

That seems to work. I can now generate an artificial series of high-low-close prices that have similar statistical properties to actual markets. Now I must investigate whether any dependencies exist between successive values in the series.

I want to mention that the plots in this article were generated with the Rnd() function in Excel's Visual Basic. The Visual Basic random number generator is fairly crude by modern standards, which may explain the noisy tails of the distributions shown (using 16,000 samples) versus the relatively cleaner tails of the distributions measured from actual market data (using fewer samples). I can't know for sure unless I try a better random number generator, but I'm not inclined to do so now; I'm satisfied with these results.
