A Beginner's Guide to Customer Segmentation

news/2024/5/20 10:15:45 Tags: clustering, java, big data, python, visualization

In this post I’m going to talk about something that’s relatively simple but fundamental to just about any business: Customer Segmentation. At the core of customer segmentation is being able to identify different types of customers and then figure out ways to find more of those individuals so you can… you guessed it, get more customers!

In this post, I’ll detail how you can use K-Means clustering to help with some of the exploratory aspects of customer segmentation. I’ll be walking through the example using Yhat’s own Python IDE, Rodeo, which you can download for Windows, Mac or Linux here. If you’re using a Windows machine, Rodeo ships with Python (via Continuum’s Miniconda). How convenient!

Our Data

The data we’re using comes from John Foreman’s book Data Smart. The dataset contains both information on marketing newsletters/e-mail campaigns (e-mail offers sent) and transaction level data from customers (which offer customers responded to and what they bought).

import pandas as pd

# sheet_name replaces the older sheetname keyword, which current pandas no longer accepts
df_offers = pd.read_excel("./WineKMC.xlsx", sheet_name=0)
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]
df_offers.head()

   offer_id  campaign            varietal  min_qty  discount       origin  past_peak
0         1   January              Malbec       72        56       France      False
1         2   January          Pinot Noir       72        17       France      False
2         3  February           Espumante      144        32       Oregon       True
3         4  February           Champagne       72        48       France       True
4         5  February  Cabernet Sauvignon      144        44  New Zealand       True

And the transaction level data…

  customer_name  offer_id  n
0         Smith         2  1
1         Smith        24  1
2       Johnson        17  1
3       Johnson        24  1
4       Johnson        26  1
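
The code that builds this transactions frame is not shown in the post, yet later snippets refer to `df_transactions`. A minimal sketch of what it could look like follows. The five rows are the ones printed above, typed in by hand so the example runs without the workbook. The `read_excel` call in the comment assumes the transactions sit on the second sheet, which the post does not confirm.

```python
import pandas as pd

# With the workbook on disk, the (assumed) equivalent would be:
#   df_transactions = pd.read_excel("./WineKMC.xlsx", sheet_name=1)
#   df_transactions.columns = ["customer_name", "offer_id"]
# Here the five rows shown above are entered by hand instead.
df_transactions = pd.DataFrame({
    "customer_name": ["Smith", "Smith", "Johnson", "Johnson", "Johnson"],
    "offer_id": [2, 24, 17, 24, 26],
})

# Each row is one response to an offer, so flag it with n = 1
df_transactions["n"] = 1

print(df_transactions.head())
```

The constant `n` column is what the pivot step later turns into the 0/1 indicator for each customer and offer.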

Inside of Rodeo, that’ll look something like…

If you’re new to Rodeo, note that you can move and resize tabs, so if you prefer a side-by-side editor and terminal layout, or you want to make the editor full screen, you can.

You can also copy and save the formatted outputs in your history tab, like the data frames we produced above.

A quick K-Means primer

In order to segment our customers, we need a way to compare them. To do this we’re going to use K-Means clustering. K-Means takes a dataset and finds groups (or clusters) of points that have similar properties. It works by assigning each point to a cluster so that the total squared distance between the points and the center of the cluster they belong to is as small as possible.

Think of the simplest possible example. If I told you to create 3 groups for the points below and draw a star where the middle of each group would be, what would you do?

Probably (or hopefully) something like this…

In K-Means speak, the “x”s are called “centroids” and indicate (you guessed it) the center of a given cluster. I’m not going to go into the ins and outs of what K-Means is actually doing under the hood, but hopefully this illustration gives you a good idea.
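
For readers who do want a peek under the hood, here is a rough sketch of the two steps K-Means alternates between: assign every point to its nearest centroid, then move each centroid to the mean of its points. It is plain Python rather than the scikit-learn implementation used later. The points and the hand-picked starting centroids are made up for illustration, and the function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, initial_centroids, n_iter=10):
    """Alternate the assignment and update steps for a fixed number of rounds."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels


# Three hypothetical, well-separated blobs of three points each
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]

# Starting centroids are picked by hand (one per blob) to keep the run deterministic
centroids, labels = kmeans(points, initial_centroids=[points[0], points[3], points[6]])

print(labels)
print(centroids)
```

Real implementations choose the starting centroids more carefully and stop once the assignments no longer change. A poor starting choice can leave this simple version stuck in a worse grouping.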

Clustering our customers

Okay, so how does clustering apply to our customers? Well, since we’re trying to learn more about how our customers behave, we can use their behavior (whether or not they purchased something based on an offer) as a way to group like-minded customers together. We can then study those groups to look for patterns and trends that can help us formulate future offers.

The first thing we need is a way to compare customers. To do this, we’re going to create a matrix that contains each customer and a 0/1 indicator for whether or not they responded to a given offer. This is easy enough to do in Python:

# join the offers and transactions table
df = pd.merge(df_offers, df_transactions)
# create a "pivot table" which will give us the number of times each customer responded to a given offer
matrix = df.pivot_table(index=['customer_name'], columns=['offer_id'], values='n')
# a little tidying up. fill NA values with 0 and make the index into a column
matrix = matrix.fillna(0).reset_index()
# save a list of the 0/1 columns. we'll use these a bit later
x_cols = matrix.columns[1:]

Now to create the clusters, we’re going to use the KMeans functionality from scikit-learn. I arbitrarily chose 5 clusters. My general rule of thumb is to have at least 7x as many records as I do clusters.
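
The clustering call itself does not appear in the post, although later snippets rely on a fitted `cluster` object and a `cluster` column on `matrix`. A minimal sketch of that step with scikit-learn’s `KMeans` might look like the following. The toy response matrix (customers `c0` to `c7`, offers 1 to 4) is made up for illustration, and it uses 2 clusters instead of 5 because the toy data is so small. The `n_init` and `random_state` values are assumptions added to keep the run repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical 0/1 response matrix: one row per customer, one column per offer
matrix = pd.DataFrame({
    "customer_name": ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"],
    1: [1, 1, 1, 1, 0, 0, 0, 0],
    2: [1, 0, 1, 1, 0, 0, 0, 0],
    3: [0, 0, 0, 0, 1, 1, 0, 1],
    4: [0, 0, 0, 0, 1, 1, 1, 1],
})
x_cols = matrix.columns[1:]

# Fit K-Means on the indicator columns and store each customer's cluster label
cluster = KMeans(n_clusters=2, n_init=10, random_state=0)
matrix["cluster"] = cluster.fit_predict(matrix[x_cols].to_numpy())

# How many customers landed in each cluster
print(matrix["cluster"].value_counts())
```

On the real data the same pattern would apply with `n_clusters=5`. The counts per cluster are a quick check that no cluster is nearly empty.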

Notice that in Rodeo, you can view the histogram in the terminal, history or plots tab. If you’re working on multiple monitors, you can even pop out the plot into its own window.

Visualizing the clusters

A really cool trick that they probably didn’t teach you in school is Principal Component Analysis. There are lots of uses for it, but today we’re going to use it to transform our multi-dimensional dataset into a 2-dimensional dataset. Why, you ask? Well, once it is in 2 dimensions (or simply put, it has 2 columns), it becomes much easier to plot!

Once again, scikit-learn comes to the rescue!

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
# fit once and reuse the projection for both coordinates
coords = pca.fit_transform(matrix[x_cols])
matrix['x'] = coords[:, 0]
matrix['y'] = coords[:, 1]
matrix = matrix.reset_index()

customer_clusters = matrix[['customer_name', 'cluster', 'x', 'y']]
customer_clusters.head()

offer_id customer_name  cluster         x         y
0                Adams        2 -1.007580  0.108215
1                Allen        4  0.287539  0.044715
2             Anderson        1  0.392032  1.038391
3               Bailey        2 -0.699477 -0.022542
4                Baker        3 -0.088183 -0.471695

What we’ve done is take those x_cols columns of 0/1 indicator variables and transform them into a 2-D dataset. We took one column and arbitrarily called it x, then called the other y. Now we can throw each point into a scatterplot. We’ll color code each point based on its cluster so it’s easier to see them.

If you want to get fancy, you can also plot the centers of the clusters. These are stored in the KMeans instance in the cluster_centers_ attribute. Make sure that you also transform the cluster centers into the 2-D projection.

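# assumed step: the plot needs x, y and cluster next to the offer details, so rebuild df
df = pd.merge(df_transactions, customer_clusters)
df = pd.merge(df_offers, df)

# project the K-Means centers with the PCA fitted above
cluster_centers = pca.transform(cluster.cluster_centers_)
cluster_centers = pd.DataFrame(cluster_centers, columns=['x', 'y'])
cluster_centers['cluster'] = range(0, len(cluster_centers))

from ggplot import *

# parentheses let the expression continue across lines
(ggplot(df, aes(x='x', y='y', color='cluster')) +
    geom_point(size=75) +
    geom_point(cluster_centers, size=500) +
    ggtitle("Customers Grouped by Cluster"))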

Digging deeper into the clusters

Let’s dig a little deeper into the clusters. Take cluster 4 for example. If we break out cluster 4 and compare it to the remaining customers, we can start to look for interesting facets that we might be able to exploit.

As a baseline, take a look at the varietal counts for cluster 4 vs. everyone else. It turns out that most of the Cabernet Sauvignon purchases (26 of 32) were made by members of cluster 4. In addition, none of the Espumante offers were purchased by members of cluster 4.

is_4   varietal            count
False  Champagne              45
       Espumante              40
       Prosecco               37
       Pinot Noir             37
       Malbec                 17
       Pinot Grigio           16
       Merlot                  8
       Cabernet Sauvignon      6
       Chardonnay              4
True   Champagne              36
       Cabernet Sauvignon     26
       Malbec                 15
       Merlot                 12
       Chardonnay             11
       Pinot Noir              7
       Prosecco                6
       Pinot Grigio            1
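
The `is_4` flag behind this table is never defined in the code shown. One way it could be built is sketched below, assuming the offers, transactions and cluster assignments are merged into a single frame first. The offer rows and cluster labels are copied from the tables earlier in the post, but the transactions are invented to keep the example small, so the numbers it prints are not the ones above.

```python
import pandas as pd

# Offer rows and cluster labels taken from the tables shown earlier
df_offers = pd.DataFrame({
    "offer_id": [1, 2, 3],
    "varietal": ["Malbec", "Pinot Noir", "Espumante"],
    "min_qty": [72, 72, 144],
    "discount": [56, 17, 32],
})
customer_clusters = pd.DataFrame({
    "customer_name": ["Adams", "Allen", "Bailey"],
    "cluster": [2, 4, 2],
})

# Hypothetical transactions, invented for this sketch
df_transactions = pd.DataFrame({
    "customer_name": ["Adams", "Adams", "Allen", "Bailey"],
    "offer_id": [1, 3, 2, 3],
    "n": 1,
})

# Attach the cluster to every transaction, then attach the offer details
df = pd.merge(df_transactions, customer_clusters)  # joins on customer_name
df = pd.merge(df_offers, df)                       # joins on offer_id

# Flag the cluster of interest and compare it with everyone else
df["is_4"] = df["cluster"] == 4
counts = df.groupby("is_4")["varietal"].value_counts()
means = df.groupby("is_4")[["min_qty", "discount"]].mean()

print(counts)
print(means)
```

Swapping the `4` for another label gives the same comparison for any other cluster.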

You can also segment out numerical features. For instance, look at how the mean of the min_qty field breaks out between cluster 4 and everyone else. It seems like members of cluster 4 like to buy in bulk!

df.groupby("is_4")[['min_qty', 'discount']].mean()

         min_qty   discount
is_4
False  47.685484  59.120968
True   93.394737  60.657895

Send a bulk Cab Sav offer Cluster 4’s way!

Final Thoughts

Translated from: https://www.pybloggers.com/2017/02/beginners-guide-to-customer-segmentation/
