Python Pandas - GroupBy 操作
任何 groupby 操作都涉及对原始对象的以下操作之一。 他们是 −
分割对象
应用函数
组合结果
在许多情况下,我们将数据拆分为多个集合,并对每个子集应用一些功能。 在应用功能中,我们可以执行以下操作 −
聚合 − 计算汇总统计量
变换 − 执行一些特定于组的操作
过滤 − 在某些条件下丢弃数据
现在让我们创建一个 DataFrame 对象并对其执行所有操作 −
#import the pandas library import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print df
它的输出如下 −
Points Rank Team Year 0 876 1 Riders 2014 1 789 2 Riders 2015 2 863 2 Devils 2014 3 673 3 Devils 2015 4 741 3 Kings 2014 5 812 4 kings 2015 6 756 1 Kings 2016 7 788 1 Kings 2017 8 694 2 Riders 2016 9 701 4 Royals 2014 10 804 1 Royals 2015 11 690 2 Riders 2017
将数据分组
Pandas 对象可以拆分为它们的任何对象。 有多种方法可以拆分对象,例如 −
- obj.groupby('key')
- obj.groupby(['key1','key2'])
- obj.groupby(key,axis=1)
现在让我们看看如何将分组对象应用于 DataFrame 对象
示例
# import the pandas library import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print df.groupby('Team')
它的输出如下 −
<pandas.core.groupby.DataFrameGroupBy object at 0x7fa46a977e50>
查看群组
# import the pandas library import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print df.groupby('Team').groups
它的输出如下 −
{'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Devils': Int64Index([2, 3], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings' : Int64Index([5], dtype='int64')}
示例
Group by 多列 −
# import the pandas library import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print df.groupby(['Team','Year']).groups
它的输出如下 −
{('Kings', 2014): Int64Index([4], dtype='int64'), ('Royals', 2014): Int64Index([9], dtype='int64'), ('Riders', 2014): Int64Index([0], dtype='int64'), ('Riders', 2015): Int64Index([1], dtype='int64'), ('Kings', 2016): Int64Index([6], dtype='int64'), ('Riders', 2016): Int64Index([8], dtype='int64'), ('Riders', 2017): Int64Index([11], dtype='int64'), ('Devils', 2014): Int64Index([2], dtype='int64'), ('Devils', 2015): Int64Index([3], dtype='int64'), ('kings', 2015): Int64Index([5], dtype='int64'), ('Royals', 2015): Int64Index([10], dtype='int64'), ('Kings', 2017): Int64Index([7], dtype='int64')}
遍历组
有了 groupby 对象,我们可以像 itertools.obj 一样遍历对象。
# import the pandas library import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') for name,group in grouped: print name print group
它的输出如下 −
2014 Points Rank Team Year 0 876 1 Riders 2014 2 863 2 Devils 2014 4 741 3 Kings 2014 9 701 4 Royals 2014 2015 Points Rank Team Year 1 789 2 Riders 2015 3 673 3 Devils 2015 5 812 4 kings 2015 10 804 1 Royals 2015 2016 Points Rank Team Year 6 756 1 Kings 2016 8 694 2 Riders 2016 2017 Points Rank Team Year 7 788 1 Kings 2017 11 690 2 Riders 2017
默认情况下,groupby 对象的标签名称与组名称相同。
选择一个组
使用 get_group() 方法,我们可以选择单个组。
# import the pandas library import pandas as pd ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') print grouped.get_group(2014)
它的输出如下 −
Points Rank Team Year 0 876 1 Riders 2014 2 863 2 Devils 2014 4 741 3 Kings 2014 9 701 4 Royals 2014
聚合
聚合函数为每个组返回一个聚合值。 一旦创建了 group by 对象,就可以对分组数据执行多个聚合操作。
一个明显的方法是通过聚合或等效的 agg 方法进行聚合 −
# import the pandas library import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Year') print grouped['Points'].agg(np.mean)
它的输出如下 −
Year 2014 795.25 2015 769.50 2016 725.00 2017 739.00 Name: Points, dtype: float64
查看每个组大小的另一种方法是应用 size() 函数 −
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) Attribute Access in Python Pandas grouped = df.groupby('Team') print grouped.agg(np.size)
它的输出如下 −
Points Rank Year Team Devils 2 2 2 Kings 3 3 3 Riders 4 4 4 Royals 2 2 2 kings 1 1 1
一次应用多个聚合函数
使用分组系列,您还可以传递 list 或 函数字典 进行聚合,并生成 DataFrame 作为输出 −
# import the pandas library import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Team') print grouped['Points'].agg([np.sum, np.mean, np.std])
它的输出如下 −
Team sum mean std Devils 1536 768.000000 134.350288 Kings 2285 761.666667 24.006943 Riders 3049 762.250000 88.567771 Royals 1505 752.500000 72.831998 kings 812 812.000000 NaN
转换
对组或列的转换会返回一个对象,该对象的索引大小与正在分组的对象的大小相同。 因此,转换应返回与组块大小相同的结果。
# import the pandas library import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) grouped = df.groupby('Team') score = lambda x: (x - x.mean()) / x.std()*10 print grouped.transform(score)
它的输出如下 −
Points Rank Year 0 12.843272 -15.000000 -11.618950 1 3.020286 5.000000 -3.872983 2 7.071068 -7.071068 -7.071068 3 -7.071068 7.071068 7.071068 4 -8.608621 11.547005 -10.910895 5 NaN NaN NaN 6 -2.360428 -5.773503 2.182179 7 10.969049 -5.773503 8.728716 8 -7.705963 5.000000 3.872983 9 -7.071068 7.071068 -7.071068 10 7.071068 -7.071068 7.071068 11 -8.157595 5.000000 11.618950
过滤
过滤根据定义的标准过滤数据并返回数据子集。 filter() 函数用于过滤数据。
import pandas as pd import numpy as np ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) print df.groupby('Team').filter(lambda x: len(x) >= 3)
它的输出如下 −
Points Rank Team Year 0 876 1 Riders 2014 1 789 2 Riders 2015 4 741 3 Kings 2014 6 756 1 Kings 2016 7 788 1 Kings 2017 8 694 2 Riders 2016 11 690 2 Riders 2017
在上述过滤条件中,我们要求返回参加过 3 次或以上 IPL 的团队。