Jan 29, 2018

Statistics Question

I took a statistics class in college, but I didn't pay much attention due to never seeing a need for the knowledge. Much of my education is like this: "Why the hell would I ever need this?", only to find out years later that I desperately needed it (yeah, I'm talking to you, high school algebra).

Thimbleweed Park is out on seven different digital stores, and each of them has a completely different way of reporting sales. It's time-consuming and frustrating to download each sales report once a month and hand-massage the data into a usable format.

I spent the last few days writing a Python script to take all the data and move it into a single SQL database so I could quickly query tons of useful information.

SELECT countries.country, SUM(sales.units_sold) AS units_sold
FROM sales JOIN countries
ON sales.country_code = countries.country_code
WHERE sales.units_sold > 0
GROUP BY sales.country_code, countries.country
ORDER BY units_sold DESC

One thing I want to know is an "average" units/day for each store, so I can keep an eye on long-term trends. Sounds easy, except sales and promotions happen all too often, and they produce huge spikes in the data that I don't care about. I just want to see an average for normal, non-sale days, and I don't want to go in and remove the sale ranges by hand, as sometimes they happen without my knowledge.

I'm looking for a way to take an average that removes spikes that are well above the average. This doesn't have to be in SQL, that computation will most likely happen in a spreadsheet, or could happen in the report generation if it's too complex.

Maybe this isn't possible, but I figured I'd ask.

synthmeat

Jan 29, 2018
This will definitely get you started:
https://www.periscopedata.com/blog/outlier-detection-in-sql

LowLevel

Jan 29, 2018
If you are going to use Excel or other spreadsheets, just use the TRIMMEAN() function to get the mean excluding the outliers:

https://support.office.com/en-us/article/TRIMMEAN-function-d90c9878-a119-4746-88fa-63d988f511d3

It might require some tweaking, depending on how many outliers exist in the data sample, which in turn depends on how common the sales/promotions are and how long the period of time you are analyzing is.
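If this ends up on the Python side of the pipeline instead, SciPy has a similar function; here's a minimal sketch with made-up numbers (note that trim_mean cuts the given fraction from each end, so trim_mean(x, 0.1) roughly matches TRIMMEAN(x, 0.2)):

from scipy.stats import trim_mean

daily_units = [40, 42, 38, 41, 39, 450, 43, 40, 37, 44]  # hypothetical daily sales
print(trim_mean(daily_units, 0.1))  # drops the top and bottom 10% before averaging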

This consultancy is not free. Please send Thimbleweed Park merchandise.

Johnicholas Hines

Jan 29, 2018
Seems to me you could compute the median (of a window), flag as outliers any values far from that median, and then compute the average of the non-outliers.
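A minimal pandas sketch of that idea, with made-up numbers and an arbitrary threshold of 50% above/below the local median:

import pandas as pd

units = pd.Series([40, 42, 38, 41, 390, 410, 43, 40, 37, 44])  # hypothetical daily sales

med = units.rolling(7, center=True, min_periods=1).median()  # 7-day windowed median
outlier = (units - med).abs() > 0.5 * med                    # far from the local median
print(units[~outlier].mean())                                # average of the non-outlier days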

Katelyn Gadd

Jan 29, 2018
Histograms are a good approach. They're not too hard to build programmatically, and Excel has them built in. Some SQL databases, like Postgres, have tools for them too.

Histograms are handy for this in general since if you build one appropriately, you can use it to estimate pretty much any percentile (say 5%, 10%) in your whole dataset.
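A quick percentile-based version of this in Python, with made-up numbers and an arbitrary 90% cutoff:

import numpy as np

units = np.array([40, 42, 38, 41, 390, 410, 43, 40, 37, 44])  # hypothetical daily sales
cap = np.percentile(units, 90)     # estimate the 90th percentile
print(units[units <= cap].mean())  # average of the days at or below it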

John

Jan 29, 2018
My initial thought is to do a Fourier transform of the data and lob off the high-frequency components. It would be more robust than just going after outliers.

You could eliminate components with a period shorter than, say, 5 days, leaving you with data that trends for at least a week of sales. That would also mean eliminating the indirect effects of a sale, assuming some purchases still happen after the sale ends but because of it, due to word of mouth or what have you. Basically, any short-term trends.

Then convert it back from the frequency domain to the time domain and you'll be left with just a nice trendline, without spikes, and can go from there for averages. Some tweaking of exactly what data to cut off would be needed to tune it just how you want it.
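A minimal low-pass sketch of this in numpy, on synthetic data with one fake promotion spike (the 5-day cutoff is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
units = 40 + rng.normal(0, 3, 120)  # hypothetical daily sales
units[60:64] += 300                 # a fake promotion spike

spectrum = np.fft.rfft(units)
freqs = np.fft.rfftfreq(len(units), d=1.0)  # cycles per day
spectrum[freqs > 1 / 5] = 0                 # drop periods shorter than ~5 days
smoothed = np.fft.irfft(spectrum, n=len(units))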

Jok

Jan 29, 2018
A histogram, median, percentile, or interquartile representation of your daily sales per month/year will all help you ignore daily outliers.

Ben

Jan 29, 2018
I put some thoughts in this tweet https://twitter.com/spiritek/status/958089004078936064

Ron Gilbert

Jan 29, 2018
One of my requirements was not to track sales and promotions. Not doing so makes the problem hard, and therefore interesting. As I stated in the post, I often don't know when promotions are happening. And sales increases from sales and promotions don't stop the second the promotion is over; there is a tail, and I need to ignore that as well. But mostly, this becomes a simple/uninteresting problem if you hand-remove promotional periods.

Chris

Jan 29, 2018
You could try running a linear regression?
http://daynebatten.com/2015/08/sql-linear-regression/
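The Python equivalent is a one-liner with numpy, though a plain least-squares fit will still get pulled upward by sale spikes (made-up data):

import numpy as np

days = np.arange(120)
units = 40 + 0.05 * days + np.random.default_rng(0).normal(0, 3, 120)  # hypothetical
slope, intercept = np.polyfit(days, units, 1)  # least-squares line: units ~ slope*day + intercept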

Fangz

Jan 29, 2018
A running median is the simplest way to do it.

Alternatively, calculate a running winsorized mean. https://en.wikipedia.org/wiki/Winsorizing
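SciPy has a ready-made winsorize; a minimal sketch with made-up numbers that clamps only the top 10% of values (the sale spikes) down to the 90th-percentile value:

import numpy as np
from scipy.stats.mstats import winsorize

units = np.array([40, 42, 38, 41, 390, 410, 43, 40, 37, 44])  # hypothetical daily sales
print(winsorize(units, limits=(0, 0.1)).mean())  # clamp the top 10%, then average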

seguso

Jan 29, 2018
It sounds like you want to smooth out a curve. Maybe you could use a simple moving average (SMA)?

Al

Jan 29, 2018
I would use a Python rolling median for this. It's even built into pandas' Series type.
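Assuming the daily numbers are already in a Series, it's one line:

import pandas as pd

units = pd.Series([40, 42, 38, 41, 390, 410, 43, 40, 37, 44])  # hypothetical daily sales
print(units.rolling(7, center=True, min_periods=1).median())   # 7-day rolling median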

Alan Brookland

Jan 29, 2018
I'm no statistician, but I think what you're after is a rolling average.

Matthew

Jan 29, 2018
One way to do it would be fitting a bimodal distribution:
https://en.wikipedia.org/wiki/Multimodal_distribution

For each store, you can think of non-sale days and sale days as two distributions, each around its own mean with an associated deviation. Draw a histogram, fit the two curves, and remove the upper one to get stats about the lower one.

It might not work so well though because they're more decay functions than normal distributions, so the better version might be to fit a bimodal log-normal curve:
https://en.wikipedia.org/wiki/Log-normal_distribution

But that might be overkill...
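One off-the-shelf way to do that fit in Python is a two-component Gaussian mixture (fitting np.log(units) instead would give the log-normal variant); everything below is made-up data:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
units = np.concatenate([rng.normal(40, 5, 300),    # hypothetical non-sale days
                        rng.normal(400, 50, 30)])  # hypothetical sale days
gm = GaussianMixture(n_components=2, random_state=0).fit(units.reshape(-1, 1))
low = gm.means_.argmin()  # the component with the smaller mean
print(gm.means_[low, 0])  # estimated non-sale daily average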

Gustavo

Jan 29, 2018
Hi Ron! This might be too simple, but it's what I would do in SQL:

declare @filter decimal(5,2)
set @filter = 1.5

SELECT c.country, SUM(s.units_sold) AS units_sold
FROM sales s JOIN countries c
ON s.country_code = c.country_code
WHERE s.units_sold > 0
GROUP BY s.country_code, c.country
HAVING SUM(s.units_sold) < @filter * (
SELECT AVG(unitsperday) FROM
(SELECT SUM(units_sold) AS unitsperday
FROM sales
WHERE units_sold > 0 AND sales.country_code = s.country_code) AS average)
ORDER BY units_sold DESC

If I'm not mistaken, that query would remove every day that's higher than 1.5 * the daily average for that country.
Hope this helps.

Gustavo

Jan 29, 2018
Obviously I didn't think it through: the subquery that calculates the sum should group by day, and the outer query should also group by day first.
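For what it's worth, the same idea is a few lines of pandas once the per-day, per-country sums exist; a made-up frame and the same arbitrary 1.5 factor:

import pandas as pd

# hypothetical: one row per country per day
df = pd.DataFrame({"country_code": ["US", "US", "US", "DE", "DE", "DE"],
                   "units_sold":   [40, 45, 400, 20, 22, 200]})

avg = df.groupby("country_code")["units_sold"].transform("mean")
kept = df[df["units_sold"] < 1.5 * avg]  # drop days above 1.5x the country average
print(kept.groupby("country_code")["units_sold"].mean())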

Nobstar

Jan 30, 2018
For a quick way of getting a robust value, I would use a median. Take the number of copies sold each day, sort the values ascending, and take the middle value. The reasoning is that a typical value should appear in the middle of the sorted list, while spikes of unusually low or high daily sales end up at the top and bottom. So by taking the middle value, you should get a good, stable estimate of the average sales per day without spikes.
In a spreadsheet, there should already be a median function implemented. In SQL there may also be one; if not, it's easy to implement.

Zak Phoenix McKracken

Jan 30, 2018
My solution is to consider the values that deviate from the average by a certain amount, which you decide.
Basically, for each series:
   AVERAGE + (STANDARD DEVIATION)*K
(K = your chosen amount of deviation from the average; the standard deviation is also known as the root mean square deviation).
The Excel formula would be, for instance:
=AVERAGE(A1:A1138) + ( STDEV(A1:A1138) * 5 )
assuming your range is from A1 to A1138, and your arbitrary deviation parameter is 5.

Zak Phoenix McKracken

Jan 30, 2018
Correction: "My solution is to DISCARD the values that deviate from the average..."

Colin Morris

Jan 30, 2018
Some more information on outliers with strategies like Box Plots and histograms: https://conversionxl.com/blog/outliers/

Zak Phoenix McKracken

Jan 30, 2018
I have just found this Excel function, tested it, and it works:
=HARMEAN(range)
It returns the HARMONIC MEAN.
I tested it in an Excel spreadsheet, comparing the =AVERAGE function on a range of data without peaks against =HARMEAN on the same range of data with peaks, and the results are very close.
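Python's standard library can reproduce that comparison with made-up numbers (keep in mind the harmonic mean down-weights large values but is dragged down hard by small ones):

from statistics import mean, harmonic_mean

units = [40, 42, 38, 41, 390, 43, 40]  # hypothetical, one sale spike
print(mean(units), harmonic_mean(units))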

Hamakei

Jan 30, 2018
Standard deviation. Remove data points that exceed a pre-defined threshold.

Augusto Ruiz

Jan 31, 2018
In order to exclude promotion days per store, I would have a look at those store APIs to check whether they have a promotion API, and store that info in your database. That way the query could exclude promotion days.

Another option, as already mentioned, is to exclude outliers from your sales distribution. In Excel you can use TRIMMEAN, but that will also remove the lowest entries in your distribution. You probably want to exclude only the highest peaks, and I think an easy way to accomplish that is using the average and standard deviation.

So, assuming your sales data is in the A column, you calculate the average of all values:
Make the cell B1 be
=AVERAGE(A1:An)
And cell C1 be:
=STDEV(A1:An)

Cell D1 will be:
=IF(A1-$B$1>$C$1*deviationFactor,"",A1)

deviationFactor being a number that controls how high a peak has to be before it is removed.

Copy this formula down to cells D2:Dn; the row index in column A changes as it is copied, while the absolute references keep pointing at the average in B1 and the standard deviation in C1 (the original data is in A1:An).

You will have the data without the peaks in the D column. So you just need to get the Average of that column.
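The same filter in Python, with made-up numbers and ddof=1 to match Excel's STDEV:

import numpy as np

units = np.array([40, 42, 38, 41, 390, 410, 43, 40, 37, 44])  # hypothetical daily sales
k = 2.0                                               # the deviationFactor
keep = units - units.mean() <= k * units.std(ddof=1)  # drop only the high peaks
print(units[keep].mean())                             # average with the peaks removed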

uriel

Jan 31, 2018
If you're after long-term trends, fitting simple models (poly + exponential, for example) and minimizing the misfit using a standard norm should ignore the spikes due to sales. If you're using Python, check SciPy; it has plenty of ready-to-use functions.
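A minimal SciPy sketch along these lines, using a straight line and a robust (soft_l1) loss so the promotion days barely pull the fit; all numbers are made up:

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
days = np.arange(120.0)
units = 40 + 0.05 * days + rng.normal(0, 3, 120)  # hypothetical baseline trend
units[60:65] += 300                               # a fake promotion spike

def residuals(p):  # p = [intercept, slope]
    return p[0] + p[1] * days - units

fit = least_squares(residuals, x0=[40.0, 0.0], loss="soft_l1", f_scale=10.0)
print(fit.x)  # intercept and slope, largely ignoring the spike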