Python: Splitting Data (into Training and Test Data)

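In deep learning, you need to split your data into training data and test (evaluation) data.
A search turns up many examples that use NumPy, but pandas handles this well enough and is simple.
This article mainly explains the pandas approach and only touches briefly on the NumPy one.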

Pandas

The example below may look redundant, but it is general-purpose.

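  • If you do not need to specify test_end, the code is cleaner with test_end removed.
  • Basically, you just set train_end to a row count. Besides row counts, a date-time can also be used here (this assumes the DataFrame is indexed by date-time). To split by ratio, set it to something like int(df.shape[0] * 0.7). When using a date-time or a ratio, take care with test_end, for example by removing it.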
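As a minimal sketch of the ratio case, train_end can be derived from the row count. The small frame below is a hypothetical stand-in for sample.csv, and train_ratio is an illustrative name that does not appear in the original code.

```python
import pandas as pd

# Hypothetical 10-row frame standing in for sample.csv
df = pd.DataFrame({
    "Close": [109.590, 109.673, 109.673, 109.673, 109.673,
              109.673, 109.669, 109.669, 109.669, 109.669],
    "target": [1.0] * 10,
})

train_ratio = 0.7                            # assumed share of rows used for training
train_end = int(df.shape[0] * train_ratio)   # multiply (not divide) by the ratio

# Positional slicing keeps the original row order
train = df.iloc[:train_end]
test = df.iloc[train_end:]

print(len(train), len(test))
```

Multiplying keeps train_end within the number of rows, whereas dividing by the ratio would produce a value larger than the row count.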

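Investment data often seems to follow time cycles such as years, so rather than splitting purely by ratio, the code is kept general like this. Of course, with more code the settings alone could determine which of the above cases applies, but I only need a system that works for me, so it is left loose like this.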
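For the time-cycle case, one possible way to split on a date-time boundary is a boolean mask over a DatetimeIndex. This is only a sketch: the frame is a hypothetical stand-in for sample.csv, and split_time is an illustrative value that is not taken from the original.

```python
import pandas as pd

# Hypothetical minute bars standing in for sample.csv, indexed by date-time
idx = pd.date_range("2019-01-01 22:00", periods=10, freq="min", tz="UTC")
df = pd.DataFrame({
    "Close": [109.590, 109.673, 109.673, 109.673, 109.673,
              109.673, 109.669, 109.669, 109.669, 109.669],
    "target": [1.0] * 10,
}, index=idx)

# Assumed boundary: rows before it are training data, the rest are test data
split_time = pd.Timestamp("2019-01-01 22:07", tz="UTC")

train = df[df.index < split_time]
test = df[df.index >= split_time]

X_train = train.drop("target", axis=1).values
y_train = train["target"].values
X_test = test.drop("target", axis=1).values
y_test = test["target"].values

print(len(train), len(test))
```

A mask is used instead of a label slice because label slices on a DatetimeIndex include both endpoints, which would put the boundary row in both sets.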

Sample data is here (sample)

import pandas as pd

# Load the file
filename = "sample.csv"

df = pd.read_csv(filename)
if "Unnamed: 0" in df.columns:
    del df['Unnamed: 0']
    print("del")

print(df)
#                    Date Time     Open     High      Low    Close  target
# 0  2019-01-01 22:00:00+00:00  109.590  109.590  109.590  109.590     1.0
# 1  2019-01-01 22:01:00+00:00  109.673  109.673  109.673  109.673     1.0
# 2  2019-01-01 22:02:00+00:00  109.673  109.673  109.673  109.673     1.0
# 3  2019-01-01 22:03:00+00:00  109.673  109.673  109.673  109.673     1.0
# 4  2019-01-01 22:04:00+00:00  109.673  109.673  109.673  109.673     1.0
# 5  2019-01-01 22:05:00+00:00  109.673  109.673  109.673  109.673     1.0
# 6  2019-01-01 22:06:00+00:00  109.669  109.669  109.669  109.669     1.0
# 7  2019-01-01 22:07:00+00:00  109.669  109.669  109.669  109.669     1.0
# 8  2019-01-01 22:08:00+00:00  109.669  109.669  109.669  109.669     1.0
# 9  2019-01-01 22:09:00+00:00  109.669  109.669  109.669  109.669     1.0

# Set the row ranges for training and test
train_start = 0
train_end   = 7
test_start  = train_end
test_end    = df.shape[0]

# Split the data
X_train     = df[train_start:train_end].drop('target', axis=1).values
X_test      = df[test_start:test_end].drop('target', axis=1).values
y_train     = df['target'][train_start:train_end].values
y_test      = df['target'][test_start:test_end].values

print(X_train)
print(X_test)
print(y_train)
print(y_test)

# [['2019-01-01 22:00:00+00:00' 109.59 109.59 109.59 109.59]
#  ['2019-01-01 22:01:00+00:00' 109.67299999999999 109.67299999999999
#   109.67299999999999 109.67299999999999]
#  ['2019-01-01 22:02:00+00:00' 109.67299999999999 109.67299999999999
#   109.67299999999999 109.67299999999999]
#  ['2019-01-01 22:03:00+00:00' 109.67299999999999 109.67299999999999
#   109.67299999999999 109.67299999999999]
#  ['2019-01-01 22:04:00+00:00' 109.67299999999999 109.67299999999999
#   109.67299999999999 109.67299999999999]
#  ['2019-01-01 22:05:00+00:00' 109.67299999999999 109.67299999999999
#   109.67299999999999 109.67299999999999]
#  ['2019-01-01 22:06:00+00:00' 109.669 109.669 109.669 109.669]]

# [['2019-01-01 22:07:00+00:00' 109.669 109.669 109.669 109.669]
#  ['2019-01-01 22:08:00+00:00' 109.669 109.669 109.669 109.669]
#  ['2019-01-01 22:09:00+00:00' 109.669 109.669 109.669 109.669]]

# [1. 1. 1. 1. 1. 1. 1.]

# [1. 1. 1.]

Numpy

This approach uses the sklearn (scikit-learn) library. It is not part of the standard library, so it has to be installed.

pip install scikit-learn

Import it as follows.
from sklearn.model_selection import train_test_split

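Main parameters

  • X: feature data
  • y: ground-truth labels
  • test_size: the amount of test data (specify 0.3 to use 30% as test data)
  • shuffle: whether to shuffle the data before splitting (the default is True; investment data is usually a time series, so I think it is better to set this to False)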
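The effect of shuffle=False can be sketched with a small array whose values make the row order easy to see. The arrays here are made up for illustration and are not the article's sample data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: row i has label i, so the order is visible in the output
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# shuffle=False keeps the original order: the first 70% of rows become
# training data and the last 30% become test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False
)

print(y_train)  # [0 1 2 3 4 5 6]
print(y_test)   # [7 8 9]
```

With the default shuffle=True the same call would mix rows from the whole range into both sets, which for a time series means the model is trained on rows that come after some of the test rows.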

Output

  • X_train: feature data for training
  • X_test: feature data for testing
  • y_train: ground-truth labels for training
  • y_test: ground-truth labels for testing

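Copying and pasting the following will basically work. The only thing left to decide is what fraction of the data to allocate to training and to test.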

Sample data is here (sample2)

import numpy as np

# Load the file
filename = "sample2.csv"

data = np.loadtxt(filename, delimiter=",", skiprows=1)
print(data)

# Example of data
# [[109.59  109.59  109.59  109.59    1.   ]
#  [109.673 109.673 109.673 109.673   1.   ]
#  [109.673 109.673 109.673 109.673   1.   ]
#  [109.673 109.673 109.673 109.673   1.   ]
#  [109.673 109.673 109.673 109.673   1.   ]
#  [109.673 109.673 109.673 109.673   1.   ]
#  [109.669 109.669 109.669 109.669   1.   ]
#  [109.669 109.669 109.669 109.669   1.   ]
#  [109.669 109.669 109.669 109.669   1.   ]
#  [109.669 109.669 109.669 109.669   1.   ]]

# Split into features and ground-truth labels (for 5-column data as above)
X = data[:,:4]
y = data[:,4:]

# Split into training and test sets
from sklearn.model_selection import train_test_split

(X_train, X_test, y_train, y_test) = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train)
print(X_test)
print(y_train)
print(y_test)

# [[109.669 109.669 109.669 109.669]
#  [109.673 109.673 109.673 109.673]
#  [109.669 109.669 109.669 109.669]
#  [109.669 109.669 109.669 109.669]
#  [109.673 109.673 109.673 109.673]
#  [109.59  109.59  109.59  109.59 ]
#  [109.673 109.673 109.673 109.673]]

# [[109.673 109.673 109.673 109.673]
#  [109.669 109.669 109.669 109.669]
#  [109.673 109.673 109.673 109.673]]

# [[1.]
#  [1.]
#  [1.]
#  [1.]
#  [1.]
#  [1.]
#  [1.]]

# [[1.]
#  [1.]
#  [1.]]
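The labels printed above are two-dimensional (one column per row) because y is taken with the slice data[:,4:]. If a one-dimensional label array is preferred, indexing the column with a single integer is one option. The sketch below uses a shortened, made-up copy of the data, and whether 1-D or 2-D labels are wanted depends on the model, which the original does not specify.

```python
import numpy as np

# Shortened, made-up copy of the 5-column data
data = np.array([
    [109.590, 109.590, 109.590, 109.590, 1.0],
    [109.673, 109.673, 109.673, 109.673, 1.0],
    [109.669, 109.669, 109.669, 109.669, 1.0],
])

y_2d = data[:, 4:]  # slice keeps the column axis: shape (3, 1)
y_1d = data[:, 4]   # integer index drops it: shape (3,)

print(y_2d.shape, y_1d.shape)
```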

 
