Pandasのマージ(結合)

スポンサーリンク
スポンサーリンク

縦の結合(行を増やす

df = pd.concat([df_1, df_2, df_3])とすれば結合できます。ただしカラムが一致していないとエラーになります。

データフレームの数はいくつでも対応していますが、現実的には、For文などでループさせていくかなと思います。

df_1 = pd.DataFrame([["2020/05/01 12:34:56",1,2],["2020/05/01 16:34:56",3,4]], columns = ["Date Time", "aaa", "bbb"])
df_1.index = pd.DatetimeIndex(pd.to_datetime(df_1['Date Time']), name='Date Time')
del df_1['Date Time']

df_2 = pd.DataFrame([["2020/05/01 12:34:56",10,20],["2020/05/01 16:34:56",30,40]], columns = ["Date Time", "aaa", "bbb"])
df_2.index = pd.DatetimeIndex(pd.to_datetime(df_2['Date Time']), name='Date Time')
del df_2['Date Time']

df_3 = pd.DataFrame([["2020/05/01 12:34:56",100,200],["2020/05/01 16:34:56",300,400]], columns = ["Date Time", "aaa", "bbb"])
df_3.index = pd.DatetimeIndex(pd.to_datetime(df_3['Date Time']), name='Date Time')
del df_3['Date Time']

print(df_1)
print(df_2)
print(df_3)

#                      aaa  bbb
# Date Time                    
# 2020-05-01 12:34:56    1    2
# 2020-05-01 16:34:56    3    4
#                      aaa  bbb
# Date Time                    
# 2020-05-01 12:34:56   10   20
# 2020-05-01 16:34:56   30   40
#                      aaa  bbb
# Date Time                    
# 2020-05-01 12:34:56  100  200
# 2020-05-01 16:34:56  300  400

df = pd.concat([df_1, df_2, df_3])
print(df)

#                      aaa  bbb
# Date Time                    
# 2020-05-01 12:34:56    1    2
# 2020-05-01 16:34:56    3    4
# 2020-05-01 12:34:56   10   20
# 2020-05-01 16:34:56   30   40
# 2020-05-01 12:34:56  100  200
# 2020-05-01 16:34:56  300  400

横の結合(列を増やす

df = pd.concat([df_1, df_2, df_3], axis=1)axis=1を指定すれば結合できます。

df_1 = pd.DataFrame([["2020/05/01 12:34:56",1,2],["2020/05/01 16:34:56",3,4]], columns = ["Date Time", "aaa", "bbb"])
df_1.index = pd.DatetimeIndex(pd.to_datetime(df_1['Date Time']), name='Date Time')
del df_1['Date Time']

df_2 = pd.DataFrame([["2020/05/01 12:34:56",10,20],["2020/05/01 16:34:56",30,40]], columns = ["Date Time", "aaa", "bbb"])
df_2.index = pd.DatetimeIndex(pd.to_datetime(df_2['Date Time']), name='Date Time')
del df_2['Date Time']

df_3 = pd.DataFrame([["2020/05/01 12:34:56",100,200],["2020/05/01 16:34:56",300,400]], columns = ["Date Time", "aaa", "bbb"])
df_3.index = pd.DatetimeIndex(pd.to_datetime(df_3['Date Time']), name='Date Time')
del df_3['Date Time']

print(df_1)
print(df_2)
print(df_3)

#                      aaa  bbb
# Date Time                    
# 2020-05-01 12:34:56    1    2
# 2020-05-01 16:34:56    3    4
#                      aaa  bbb
# Date Time                    
# 2020-05-01 12:34:56   10   20
# 2020-05-01 16:34:56   30   40
#                      aaa  bbb
# Date Time                    
# 2020-05-01 12:34:56  100  200
# 2020-05-01 16:34:56  300  400

df = pd.concat([df_1, df_2, df_3], axis=1)
print(df)

#                      aaa  bbb  aaa  bbb  aaa  bbb
# Date Time                                        
# 2020-05-01 12:34:56    1    2   10   20  100  200
# 2020-05-01 16:34:56    3    4   30   40  300  400

全てのインデックスで結合

上記の日時型のインデックスが一致していない場合は、以下のようになります。
(df_3の一行目の日時を13時にしています)

df_1 = pd.DataFrame([["2020/05/01 12:34:56",1,2],["2020/05/01 16:34:56",3,4]], columns = ["Date Time", "aaa", "bbb"])
df_1.index = pd.DatetimeIndex(pd.to_datetime(df_1['Date Time']), name='Date Time')
del df_1['Date Time']

df_2 = pd.DataFrame([["2020/05/01 12:34:56",10,20],["2020/05/01 16:34:56",30,40]], columns = ["Date Time", "aaa", "bbb"])
df_2.index = pd.DatetimeIndex(pd.to_datetime(df_2['Date Time']), name='Date Time')
del df_2['Date Time']

df_3 = pd.DataFrame([["2020/05/01 13:34:56",100,200],["2020/05/01 16:34:56",300,400]], columns = ["Date Time", "aaa", "bbb"])
df_3.index = pd.DatetimeIndex(pd.to_datetime(df_3['Date Time']), name='Date Time')
del df_3['Date Time']

print(df_1)
print(df_2)
print(df_3)

#                      aaa  bbb
# Date Time                    
# 2020-05-01 12:34:56    1    2
# 2020-05-01 16:34:56    3    4
#                      aaa  bbb
# Date Time                    
# 2020-05-01 12:34:56   10   20
# 2020-05-01 16:34:56   30   40
#                      aaa  bbb
# Date Time                    
# 2020-05-01 13:34:56  100  200
# 2020-05-01 16:34:56  300  400

df = pd.concat([df_1, df_2, df_3], axis=1)
print(df)

#                      aaa  bbb   aaa   bbb    aaa    bbb
# Date Time                                              
# 2020-05-01 12:34:56  1.0  2.0  10.0  20.0    NaN    NaN
# 2020-05-01 13:34:56  NaN  NaN   NaN   NaN  100.0  200.0
# 2020-05-01 16:34:56  3.0  4.0  30.0  40.0  300.0  400.0

インデックスが一致のみ

上記のようにはせず、インデックスが一致しているものだけで結合する場合は、
pd.concat([df_1, df_2, df_3], axis=1, join='inner')
join='inner'と指定します。

df_1 = pd.DataFrame([["2020/05/01 12:34:56",1,2],["2020/05/01 16:34:56",3,4]], columns = ["Date Time", "aaa", "bbb"])
df_1.index = pd.DatetimeIndex(pd.to_datetime(df_1['Date Time']), name='Date Time')
del df_1['Date Time']

df_2 = pd.DataFrame([["2020/05/01 12:34:56",10,20],["2020/05/01 16:34:56",30,40]], columns = ["Date Time", "aaa", "bbb"])
df_2.index = pd.DatetimeIndex(pd.to_datetime(df_2['Date Time']), name='Date Time')
del df_2['Date Time']

df_3 = pd.DataFrame([["2020/05/01 13:34:56",100,200],["2020/05/01 16:34:56",300,400]], columns = ["Date Time", "aaa", "bbb"])
df_3.index = pd.DatetimeIndex(pd.to_datetime(df_3['Date Time']), name='Date Time')
del df_3['Date Time']

print(df_1)
print(df_2)
print(df_3)

#                      aaa  bbb
# Date Time                    
# 2020-05-01 12:34:56    1    2
# 2020-05-01 16:34:56    3    4
#                      aaa  bbb
# Date Time                    
# 2020-05-01 12:34:56   10   20
# 2020-05-01 16:34:56   30   40
#                      aaa  bbb
# Date Time                    
# 2020-05-01 13:34:56  100  200
# 2020-05-01 16:34:56  300  400

df = pd.concat([df_1, df_2, df_3], axis=1, join='inner')
print(df)

#                      aaa  bbb  aaa  bbb  aaa  bbb
# Date Time                                        
# 2020-05-01 16:34:56    3    4   30   40  300  400

インデックスがない場合(日時のカラムをキーとする

pd.merge(df_1, df_2, how='outer')を使えば、以下のように自動的に日時型をキーとしてマージしてくれます。ただし、日時の順番がおかしくなるので、ソートをする必要があります。

全ての日時

df_1 = pd.DataFrame([["2020/05/01 12:34:56",1,2],["2020/05/01 16:34:56",3,4]], columns = ["Date Time", "aaa", "bbb"])

df_2 = pd.DataFrame([["2020/05/01 12:34:56",10,20],["2020/05/01 16:34:56",30,40]], columns = ["Date Time", "ccc", "ddd"])

df_3 = pd.DataFrame([["2020/05/01 13:34:56",100,200],["2020/05/01 16:34:56",300,400]], columns = ["Date Time", "eee", "fff"])

print(df_1)
print(df_2)
print(df_3)

#              Date Time  aaa  bbb
# 0  2020/05/01 12:34:56    1    2
# 1  2020/05/01 16:34:56    3    4
#              Date Time  ccc  ddd
# 0  2020/05/01 12:34:56   10   20
# 1  2020/05/01 16:34:56   30   40
#              Date Time  eee  fff
# 0  2020/05/01 13:34:56  100  200
# 1  2020/05/01 16:34:56  300  400

df = pd.merge(df_1, df_2, how='outer')
df = pd.merge(df, df_3, how='outer')
print(df)

#              Date Time  aaa  bbb   ccc   ddd    eee    fff
# 0  2020/05/01 12:34:56  1.0  2.0  10.0  20.0    NaN    NaN
# 1  2020/05/01 16:34:56  3.0  4.0  30.0  40.0  300.0  400.0
# 2  2020/05/01 13:34:56  NaN  NaN   NaN   NaN  100.0  200.0

日時が一致のみ

df_1 = pd.DataFrame([["2020/05/01 12:34:56",1,2],["2020/05/01 16:34:56",3,4]], columns = ["Date Time", "aaa", "bbb"])

df_2 = pd.DataFrame([["2020/05/01 12:34:56",10,20],["2020/05/01 16:34:56",30,40]], columns = ["Date Time", "ccc", "ddd"])

df_3 = pd.DataFrame([["2020/05/01 13:34:56",100,200],["2020/05/01 16:34:56",300,400]], columns = ["Date Time", "eee", "fff"])

print(df_1)
print(df_2)
print(df_3)

#              Date Time  aaa  bbb
# 0  2020/05/01 12:34:56    1    2
# 1  2020/05/01 16:34:56    3    4
#              Date Time  ccc  ddd
# 0  2020/05/01 12:34:56   10   20
# 1  2020/05/01 16:34:56   30   40
#              Date Time  eee  fff
# 0  2020/05/01 13:34:56  100  200
# 1  2020/05/01 16:34:56  300  400

df = pd.merge(df_1, df_2)
df = pd.merge(df, df_3)
print(df)

#              Date Time  aaa  bbb  ccc  ddd  eee  fff
# 0  2020/05/01 16:34:56    3    4   30   40  300  400

最後に

結合方法が2つで仕様が異なるので注意しましょう。

  • pd.concatjoin=’inner’(デフォルトはouter
  • pd.mergehow=’inner’(デフォルトはinner
タイトルとURLをコピーしました