Trying out BigQuery ML, which does machine learning with SQL alone (regression edition) #MachineLearning

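Introduction

Google has released a feature for BigQuery called BigQuery ML.
It lets you build machine learning models directly inside BigQuery.

I ran preprocessing, training, evaluation, and prediction using BigQuery alone, so I would like to share what I did.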

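What is BigQuery?

A very large-scale cloud data warehouse that Google offers as a cloud service.
As a cloud service it normally requires registering a credit card,
but these days it can be used for free, without registering a card, as long as usage stays under a certain limit.

In short, the service is "big, cheap, and fast".
・The data capacity is practically unlimited.
  The console shows no storage limit,
  so presumably you can store as much as you like.

・Queries are billed by the amount of data they read.
  Because the service is not billed by uptime,
  it can be used quite cheaply.

・Unlike a typical database it has no conventional indexes,
  but each query is given a large amount of compute resources, so responses are fast.

One caveat: a normal query scans the entire table rather than only the matching rows.
Issuing something like [select * from table] can therefore incur
a substantial charge, depending on the size of the table.

I recommend either physically splitting the data into separate tables or using partitioned tables.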
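
As a rough illustration of the cost point above: BigQuery stores data by column, and on-demand billing counts only the columns a query reads, so naming the columns you need is the simplest saving. The partitioned table in the second half is hypothetical (mydataset.sales and its sold_at TIMESTAMP column are assumptions made here, since the Black Friday data has no date column).

```sql
# Reads every column: billed for the full size of the table.
select * from handson_blackfriday.train;

# Reads two columns only: billed for just those columns.
select User_ID, Purchase from handson_blackfriday.train;

# Hypothetical: copy a table into a date-partitioned one ...
create table mydataset.sales_partitioned
partition by date(sold_at) as
select * from mydataset.sales;

# ... so that a range filter on the partitioning column limits the scan.
select
  count(*) as n
from
  mydataset.sales_partitioned
where
  sold_at >= timestamp('2018-11-23')
  and sold_at < timestamp('2018-11-24');
```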

Dataset used:

A sample dataset of transactions at a retail store, published on Kaggle.
https://www.kaggle.com/mehdidag/black-friday

The columns are as follows (as far as I can tell).

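Column	Description
User_ID	User ID
Product_ID	Product ID
Gender	Gender
Age	Age group
Occupation	Occupation ID
City_Category	Category ID of the city of residence
Stay_In_Current_City_Years	Years lived in the current city
Marital_Status	Marital status
Product_Category_1	Purchased product category 1
Product_Category_2	Purchased product category 2
Product_Category_3	Purchased product category 3
Purchase	Purchase amount

The task here is a regression problem: predicting Purchase (the purchase amount).
I split the data 9:1 into training and evaluation sets and loaded them into BigQuery tables.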
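
The article does not say how the 9:1 split was made. One way to do it without leaving BigQuery is to hash each row into ten buckets. This is only a sketch: handson_blackfriday.raw is a hypothetical table assumed to hold the unsplit Kaggle data, with User_ID loaded as an integer and Product_ID as a string.

```sql
# Buckets 0-8 (roughly 90% of the rows) become the training table.
create or replace table handson_blackfriday.train as
select
  *
from
  handson_blackfriday.raw
where
  abs(mod(farm_fingerprint(concat(cast(User_ID as string), '_', Product_ID)), 10)) < 9;

# Bucket 9 (roughly 10% of the rows) becomes the evaluation table.
create or replace table handson_blackfriday.test as
select
  *
from
  handson_blackfriday.raw
where
  abs(mod(farm_fingerprint(concat(cast(User_ID as string), '_', Product_ID)), 10)) = 9;
```

Hashing a key instead of calling rand() keeps the split reproducible: the same row always lands in the same table.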

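Preprocessing

BigQuery is operated with SQL, and it also provides many built-in functions.
Preprocessing can therefore be done inside BigQuery as well.
The preprocessing I applied is as follows.
  1. Type conversion
    ・The following features are loaded as integers, but the values are category codes rather than quantities,
      so they need to be treated as categories. I converted them to strings with cast().
    ・User_ID, Marital_Status, Occupation, Product_Category_1, Product_Category_2, Product_Category_3

  2. Missing-value handling
    ・I handled missing values explicitly rather than passing NULLs to BigQuery ML,
      using ifnull() to replace NULL with 0.
    ・Product_Category_1, Product_Category_2, Product_Category_3

  3. Log transformation
    ・At the time of writing, linear regression is the only regression model BigQuery ML offers. Linear models generally
      assume roughly normally distributed data, so I replaced the target with its natural logarithm, computed with log().
    ・Purchase

The SQL for this preprocessing is as follows.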

select
 cast(User_ID as String) as User_ID
 ,Product_ID
 ,Gender
 ,Age
 ,cast(Occupation as String) as Occupation
 ,City_Category
 ,Stay_In_Current_City_Years
 ,cast(Marital_Status as String) as Marital_Status
 ,cast(ifnull(Product_Category_1,0) as String) as Product_Category_1
 ,cast(ifnull(Product_Category_2,0) as String) as Product_Category_2
 ,cast(ifnull(Product_Category_3,0) as String) as Product_Category_3
 ,log(Purchase) as Purchase
from
 handson_blackfriday.train # training data table

# Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators?authuser=2&hl=ja
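
To see what these functions do without touching the real table, the following self-contained query applies the same three steps to two made-up rows. The values are invented for illustration and are not taken from the dataset; log_purchase and purchase_restored are names chosen here.

```sql
with sample as (
  select 1000001 as User_ID, cast(null as int64) as Product_Category_2, 8370 as Purchase
  union all
  select 1000002, 6, 15200
)
select
  cast(User_ID as string) as User_ID                                     # integer ID -> categorical string
  ,cast(ifnull(Product_Category_2, 0) as string) as Product_Category_2   # NULL -> '0'
  ,log(Purchase) as log_purchase                                         # natural logarithm
  ,exp(log(Purchase)) as purchase_restored                               # exp() undoes log()
from
  sample
```

Apart from floating-point rounding, purchase_restored should match the original Purchase, which is the property the prediction step relies on later.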

Building the model:

Next, build the model. Model building is also done with SQL.
Wrapping the SQL above in a CREATE MODEL statement gives the following.

CREATE OR REPLACE MODEL                         # create the model, replacing it if it already exists
 `handson_blackfriday.reg_black`                # the model is named reg_black
options(
    model_type='linear_reg',                    # linear regression
    max_iterations=50,                          # at most 50 training iterations
    l2_reg=0.2,                                 # L2 regularization of 0.2
    data_split_col='User_ID',                   # split on User_ID (the split column is not used as an input)
    data_split_method='seq') as                 # use 'seq' as the data split method
select
 cast(User_ID as String) as User_ID
 ,Product_ID
 ,Gender
 ,Age
 ,cast(Occupation as String) as Occupation
 ,City_Category
 ,Stay_In_Current_City_Years
 ,cast(Marital_Status as String) as Marital_Status
 ,cast(ifnull(Product_Category_1,0) as String) as Product_Category_1
 ,cast(ifnull(Product_Category_2,0) as String) as Product_Category_2
 ,cast(ifnull(Product_Category_3,0) as String) as Product_Category_3
 ,log(Purchase) as label
from
 handson_blackfriday.train                            # training data table

# Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create
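
Once the model has been created, the loss at each training iteration can be inspected with ML.TRAINING_INFO. A minimal sketch, assuming the CREATE MODEL statement above completed successfully:

```sql
select
  iteration
  ,loss          # loss on the training split
  ,eval_loss     # loss on the rows held out by the data split
  ,duration_ms   # time spent on the iteration
from
  ML.TRAINING_INFO(MODEL `handson_blackfriday.reg_black`)
order by
  iteration
```

If loss keeps falling while eval_loss rises, that is a hint to lower max_iterations or increase l2_reg.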

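Evaluation:

After building a model, it is important to check its evaluation metrics.
During training you can see the loss, but you cannot specify which evaluation metrics to compute.

It is therefore recommended to check the metrics with ML.Evaluate.
For regression, the following metrics are available.
  ・mean_absolute_error
  ・mean_squared_error
  ・mean_squared_log_error
  ・median_absolute_error
  ・r2_score
  ・explained_variance

The SQL for evaluating the model is as follows.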

select 
 *
from
 ML.Evaluate(MODEL `handson_blackfriday.reg_black`,
  (
   select
    cast(User_ID as String) as User_ID
    ,Product_ID
    ,Gender
    ,Age
    ,cast(Occupation as String) as Occupation
    ,City_Category
    ,Stay_In_Current_City_Years
    ,cast(Marital_Status as String) as Marital_Status
    ,cast(ifnull(Product_Category_1,0) as String) as Product_Category_1
    ,cast(ifnull(Product_Category_2,0) as String) as Product_Category_2
    ,cast(ifnull(Product_Category_3,0) as String) as Product_Category_3
    ,log(Purchase) as label
   from
    handson_blackfriday.test        # evaluation data table
   ))
# Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-evaluate
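
Because the label is log(Purchase), the metrics returned above are on the log scale. If errors in the original price unit are easier to interpret, they can be computed by hand from the output of ML.Predict. This is a sketch rather than part of the original workflow; rmse_price and mae_price are names chosen here.

```sql
select
  sqrt(avg(pow(exp(predicted_label) - exp(label), 2))) as rmse_price  # root mean squared error in price units
  ,avg(abs(exp(predicted_label) - exp(label))) as mae_price           # mean absolute error in price units
from
  ML.Predict(MODEL `handson_blackfriday.reg_black`,
  (
   select
    cast(User_ID as String) as User_ID
    ,Product_ID
    ,Gender
    ,Age
    ,cast(Occupation as String) as Occupation
    ,City_Category
    ,Stay_In_Current_City_Years
    ,cast(Marital_Status as String) as Marital_Status
    ,cast(ifnull(Product_Category_1,0) as String) as Product_Category_1
    ,cast(ifnull(Product_Category_2,0) as String) as Product_Category_2
    ,cast(ifnull(Product_Category_3,0) as String) as Product_Category_3
    ,log(Purchase) as label
   from
    handson_blackfriday.test        # evaluation data table
  ))
```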

Checking the weights:

The following SQL also lets you check the weight assigned to each feature.

SELECT
 *
FROM
 ML.WEIGHTS(MODEL `handson_blackfriday.reg_black`)
# Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-weights
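
For categorical (string) features, ML.WEIGHTS returns the per-category weights in the nested category_weights column rather than in weight. Following the pattern shown in the BigQuery ML documentation, they can be flattened with UNNEST. Gender is used here only as an example feature.

```sql
select
  category
  ,weight
from
  unnest((
    select
      category_weights
    from
      ML.WEIGHTS(MODEL `handson_blackfriday.reg_black`)
    where
      processed_input = 'Gender'))
order by
  weight desc
```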

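Prediction:

Finally, prediction. Passing the trained model to ML.Predict produces predictions.
Here I aggregate the total purchase amount per user ID.

Note that the label is the logarithm of the purchase amount, so exp() is applied
to convert the values back to actual prices.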

select 
 User_ID
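 ,sum(exp(label)) as true_price                 # exp() converts the log label back to the actual price
 ,sum(exp(predicted_label)) as predicted_price  # prediction converted back to the price scale the same way
 ,sum(predicted_label) as predict_label         # predicted label (log scale)
 ,sum(label) as label                           # true label (log scale)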
from
 ML.Predict(MODEL `handson_blackfriday.reg_black`,
  (
   select
    cast(User_ID as String) as User_ID
    ,Product_ID
    ,Gender
    ,Age
    ,cast(Occupation as String) as Occupation
    ,City_Category
    ,Stay_In_Current_City_Years
    ,cast(Marital_Status as String) as Marital_Status
    ,cast(ifnull(Product_Category_1,0) as String) as Product_Category_1
    ,cast(ifnull(Product_Category_2,0) as String) as Product_Category_2
    ,cast(ifnull(Product_Category_3,0) as String) as Product_Category_3
    ,log(Purchase) as label
   from
    handson_blackfriday.test                # evaluation data table
   ))
   group by
    User_ID

# Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-predict

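Conclusion

I confirmed that BigQuery ML can take you through a simple machine learning workflow.
However, linear regression is a simple model with limited expressive power, so it may not be
usable for models where accuracy matters.

It is therefore probably most valuable as a first step: build a baseline model or run quick tests
in BigQuery ML, then build the actual model in Python or R.

* This article reflects my personal views, not those of the organization I belong to.
  If I have misunderstood anything, I would appreciate a comment.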