Looking for the code only?: it’s here.

RFM segmentation (RFM = recency, frequency, monetary) is a very easy and surprisingly useful strategy for segmenting customers. RFM analysis in R is quite easy to do, thanks to the dplyr library.

In this post we will show how can this be done in R using the dplyr package.

library(dplyr)
library(ggplot2)

df <- read.csv("./data/orders.csv")

Let’s do a quick inspection of the data:

head(df)

and check that the data types are correctly read:

summary(df)
##   Customer.ID            Date          Subtotal            State      
##  Min.   :    7   2/25/2015 :   84   Min.   :    0.01          :21352  
##  1st Qu.: 4011   2/24/2015 :   61   1st Qu.:   31.08   ON     :  451  
##  Median : 7554   10/27/2015:   29   Median :   56.19   QC     :  278  
##  Mean   : 7656   10/14/2014:   28   Mean   :   89.20   AB     :   66  
##  3rd Qu.:11176   11/16/2015:   28   3rd Qu.:  103.95   BC     :   64  
##  Max.   :15360   10/17/2013:   27   Max.   :21001.05   SK     :   21  
##                  (Other)   :22151                      (Other):  176  
##            Country     
##  Canada        :21733  
##  United States :  558  
##  Australia     :   29  
##  United Kingdom:   21  
##  Belgium       :    4  
##  Germany       :    4  
##  (Other)       :   59

First, observe that the Date variable is not correctly read. How do we know that? The date format in RStudio is in the form YYYY-MM-DD, which is clearly not what is shown above. So we do a little type coercion here.

df$Date <- as.Date(df$Date, format = "%m/%d/%Y")
summary(df)
##   Customer.ID         Date               Subtotal            State      
##  Min.   :    7   Min.   :2007-06-21   Min.   :    0.01          :21352  
##  1st Qu.: 4011   1st Qu.:2011-05-23   1st Qu.:   31.08   ON     :  451  
##  Median : 7554   Median :2012-12-19   Median :   56.19   QC     :  278  
##  Mean   : 7656   Mean   :2012-11-19   Mean   :   89.20   AB     :   66  
##  3rd Qu.:11176   3rd Qu.:2014-07-04   3rd Qu.:  103.95   BC     :   64  
##  Max.   :15360   Max.   :2015-12-19   Max.   :21001.05   SK     :   21  
##                                                          (Other):  176  
##            Country     
##  Canada        :21733  
##  United States :  558  
##  Australia     :   29  
##  United Kingdom:   21  
##  Belgium       :    4  
##  Germany       :    4  
##  (Other)       :   59

Now, to the point: we need to calculate recency (the time spent since the last purchase), frequency (number of repeat purchases) and monetary (total money spent in our shop). Once we have these three quantities we can create segments of our customers: we want to identify those customers that bought recently (recency is low), are regular customers (high frequency) and have spent good money on us (monetary is high). These customers might be more important from us than, say, customers with low frequency and low monetary. There are also those customers in the middle, with moderately high frequency and monetary, but also very high recency: would there be a way to bring those back? What do they want? Perhaps these customers bought a product we no longer offer.

Let’s calculate the RFM summary of each customer to address these questions:

rfm <- df%>%
  group_by(Customer.ID) %>%
  summarise(last_purchase=max(Date), 
            recency = as.double(difftime("2016-01-01",
                                         as.Date(last_purchase, origin="1970-01-01"),  
                                        units = "day")), 
            frequency = n(), 
            monetary = sum(Subtotal)
            )
  
head(rfm)

We use a group_by statement (which must be always followed by a summarise). We create an auxiliary column, last_purchase to help us calculate recency, by substracting it from a reference date, in this case, January 1, 2016.

On any data science project we must take a scientific mindset: we need to ask concrete questions to our data, formulate hypothesis. Data science projects done “to get insight” are usually fruitless.

So, let’s ask our data: Are customers purchasing more recently more profitable?

rfm %>% 
  ggplot(aes(x=recency, y=monetary, size=monetary, color=monetary)) + 
  geom_point(position = "jitter")+
  theme_bw()

It’s a bit hard to read, but it seems that our more profitable clients where around 1000 days ago. What happened then? That’s a great question for our operations people!

Another interesting question to ask: Do one-time-purchasers spend more or less in average? Let’s find out!

rfm2 <- rfm %>% 
        mutate(one_time = ifelse(frequency==1, "One-timer","More than once"), 
               avg_per_purchase = monetary/frequency) %>%
        group_by(one_time)

Here we use mutate to change the frequency column for an indicator of how much money people have spent in average. This is all stored in an rfm2 dataframe to do more analysis.

Another question we can ask our data: what is the distribution of money spent between one-time purchase and the rest?

rfm2 %>% ggplot(aes(x=one_time, y=monetary))+geom_violin()

Ok, we obviously have an outlier. Let’s get rid of it using filter:

rfm2 %>% 
  filter(monetary<1000)%>%
  ggplot(aes(x=one_time, y=monetary))+geom_violin()+
  xlab("One-time purchasers")+theme_bw()

We see that one-time purchasers tend to spend much less than our recurrent customers. Would it be worth perhaps to focus on developing our existing customers than in acquiring? Our data suggests this might be the case.

As you see, there are many ways in which this analysis can be extended and, if available, combined with other data sources.