Introduction

Scraping is an effective way to analyze information published on the web.
This article walks through the whole flow, from scraping a site to storing the results in a database.

Prerequisites

This article collects information that spans multiple pages.

Scraping workflow

Install the library

First, add Mechanize, a gem for scraping, to your Gemfile and run bundle install.
Installing the gem makes the Mechanize class and the methods it provides available.

gem 'mechanize'

$ bundle install

To use the library in a Ruby file, write require 'mechanize' at the top of the file.
To insert the scraped data into the database, we also need a model class.
Create scraping.rb under app/models/ and write the scraping logic in it.

Create the class

scraping.rb

require 'mechanize'

class Scraping

end

Get the site's HTML

First, fetch the HTML of the site you want to scrape with the get method.
Pass the URL as a string argument to get.

scraping.rb

require 'mechanize'

class Scraping
  # Create a Mechanize instance
  agent = Mechanize.new

  # page holds a Mechanize::Page object containing the HTML
  page = agent.get("https://hoge.com/fuga")
end

Build the outline of the algorithm

Because we are collecting data from multiple pages, we start by building the framework for that.

scraping.rb

require 'mechanize'

class Scraping
  agent = Mechanize.new

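  # Array that will collect the links from every page
  links = []

  # There is no next URL (path) yet, so start with an empty string
  next_url = ""

  while true
    # Renamed from page to current_page because there are multiple pages
    current_page = agent.get("https://hoge.com/fuga")

    # Link tag for the next page (placeholder; filled in in the next step)
    next_link = nil

    # Leave the while loop when there is no next-page link tag
    break unless next_link

    # Path of the next page (placeholder; filled in in the next step)
    next_url = nil
  end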
  end
end
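
The control flow of this skeleton can be tried without any network access. The following is a minimal sketch, not the article's implementation: FAKE_PAGES and collect_links are hypothetical names introduced for illustration, and an in-memory hash stands in for the pages Mechanize would fetch. It also adds two guards the skeleton does not have (a visited set and a page limit), on the assumption that a site's pagination could loop back on itself.

```ruby
require 'set'

# Hypothetical stand-in for a paginated site:
# path => links found on that page and the path of the next page (nil on the last page)
FAKE_PAGES = {
  ''        => { links: ['/1', '/2'], next_path: '?page=2' },
  '?page=2' => { links: ['/3'],       next_path: '?page=3' },
  '?page=3' => { links: ['/4'],       next_path: nil }
}.freeze

def collect_links(pages, max_pages: 100)
  links = []
  visited = Set.new
  next_path = ''

  while visited.size < max_pages
    # Stop if the pagination points back to a page we already processed
    break if visited.include?(next_path)
    visited << next_path

    current_page = pages.fetch(next_path)
    links.concat(current_page[:links])

    # Leave the loop when there is no next page
    next_path = current_page[:next_path]
    break unless next_path
  end

  links
end

p collect_links(FAKE_PAGES)
```

With Mechanize, pages.fetch(next_path) would be replaced by the agent.get call, and the hash lookups by the search and at calls introduced in the next section.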

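Get the required tag information

We now fill in the framework in more detail.
Before doing so, make sure you understand the following three points.

The search method looks through the fetched HTML for the elements that match a given selector. Even when only one element matches, the return value is an array-like collection (a Nokogiri::XML::NodeSet).
The at method is similar, but it returns only the first matching element, or nil when nothing matches.
Finally, the get_attribute method returns the value of a tag's attribute.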

So that the scraper can be run from the console, we define it as a class method.

scraping.rb

require 'mechanize'

class Scraping
  def self.fuga_urls
    agent = Mechanize.new
    links = []
    next_url = ""

    while true
      # Append next_url to the base URL
      current_page = agent.get("https://hoge.com/fuga/" + next_url)

      # Collect every matching tag into elements
      # Note: replace .fuga-title with a selector that fits your target site
      elements = current_page.search('.fuga-title a')
      elements.each do |ele|
        links << ele.get_attribute('href')
      end

      # Note: replace .pagination .next a with a selector that fits your target site
      next_link = current_page.at('.pagination .next a')
      break unless next_link
      next_url = next_link.get_attribute('href')
    end

    # Return the collected links so the caller can inspect them
    links
  end
end
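
String concatenation works only as long as every href has the shape you expect. As a hedged alternative, Ruby's standard URI.join resolves an href against a base URL whether the href is a query string, a relative path, an absolute path, or a full URL. In the sketch below, absolute_url is a hypothetical helper and the hrefs are invented examples, not values from a real site.

```ruby
require 'uri'

BASE_URL = 'https://hoge.com/fuga/'

# Hypothetical helper: resolve an href found on a page against the base URL
def absolute_url(href, base = BASE_URL)
  URI.join(base, href).to_s
end

puts absolute_url('?page=2')                       # query string only
puts absolute_url('page/2')                        # relative path
puts absolute_url('/fuga/page/2')                  # absolute path
puts absolute_url('https://hoge.com/fuga/page/3')  # full URL
```

Inside the loop, agent.get(absolute_url(next_url)) could then replace the concatenation. Mechanize::Page also exposes the URL it was fetched from as current_page.uri, which could serve as the base instead of a constant.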

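Create a separate class method that inserts the scraped data into the database

Next, we write the processing that runs after the while loop ends.
We also create a separate class method.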

scraping.rb

require 'mechanize'

class Scraping
  def self.fuga_urls
    agent = Mechanize.new
    links = []
    next_url = ""

    while true
      current_page = agent.get("https://hoge.com/fuga/" + next_url)
      elements = current_page.search('.fuga-title a')
      elements.each do |ele|
        links << ele.get_attribute('href')
      end

      next_link = current_page.at('.pagination .next a')
      break unless next_link
      next_url = next_link.get_attribute('href')
    end

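    # After all links (paths) are collected (once break exits the while loop), run the following
    links.each do |link|
      # Fetch the required information from each detail page
      # link is a path (in this case), so concatenate it with the base URL
      # The processing is split out into a separate class method
      get_info_details('https://hoge.com/fuga' + link)
    end
  end

  def self.get_info_details(link)
    # ... (implemented in the next step, where it is named get_book)
  end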

end

Finally, we write the database insertion inside that separate class method.

scraping.rb

require 'mechanize'

class Scraping
  def self.fuga_urls
    agent = Mechanize.new
    links = []
    next_url = ""

    while true
      current_page = agent.get("https://hoge.com/fuga/" + next_url)
      elements = current_page.search('.fuga-title a')
      elements.each do |ele|
        links << ele.get_attribute('href')
      end

      next_link = current_page.at('.pagination .next a')
      break unless next_link
      next_url = next_link.get_attribute('href')
    end

    links.each do |link|
      get_book('https://hoge.com/fuga' + link)
    end
  end

  def self.get_book(link)
    agent = Mechanize.new
    page = agent.get(link)

    # Here we want to save the title (title), image (image_url), and detail (detail) to the database
    # An if condition is treated as false only when it is nil or false
    # So the if modifier prevents an error when the element does not exist
    title = page.at('.fuga-title').inner_text if page.at('.fuga-title')
    image_url = page.at('.fuga-content img')[:src] if page.at('.fuga-content img')
    detail = page.at('.fuga-content p').inner_text if page.at('.fuga-content p')

    # first_or_initialize builds a new instance when no row in the books table has this title
    book = Book.where(title: title).first_or_initialize
    book.image_url = image_url
    book.detail = detail
    # After assigning the column values, save the instance as a record
    book.save
  end
end
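
Each of the three extraction lines above runs the same page.at query twice. On Ruby 2.3 or later, the safe navigation operator (&.) gives the same nil-tolerant behavior with a single lookup. The sketch below uses hypothetical FakePage and FakeNode stand-ins (they are not part of Mechanize) so that it runs on its own; the selectors and values are made up.

```ruby
# Hypothetical stand-in for an HTML node
FakeNode = Struct.new(:inner_text, :attributes) do
  def get_attribute(name)
    attributes[name]
  end
end

# Hypothetical stand-in for a page: at returns the node for a selector, or nil
class FakePage
  def initialize(nodes)
    @nodes = nodes
  end

  def at(selector)
    @nodes[selector]
  end
end

page = FakePage.new(
  '.fuga-title'       => FakeNode.new('Sample title', {}),
  '.fuga-content img' => FakeNode.new('', { 'src' => '/images/sample.png' })
)

# &. calls the method only when the receiver is not nil
title     = page.at('.fuga-title')&.inner_text
image_url = page.at('.fuga-content img')&.get_attribute('src')
detail    = page.at('.fuga-content p')&.inner_text # no such element, so nil

p title, image_url, detail
```

Whether this reads better than the if modifier is a matter of taste; the behavior for missing elements is the same in both forms.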

That completes the scraper.

Run the scraper from the console

Start the console and run the scraper.

$ rails c
Running via Spring preloader in process 5879
Loading development environment (Rails 5.2.1)
[1] pry(main)> Scraping.fuga_urls
.
.
.

If the data has been saved to the database correctly, it worked.

Final source code

Finally, here is the complete source code.

scraping.rb

require 'mechanize'

class Scraping
  def self.fuga_urls
    agent = Mechanize.new
    links = []
    next_url = ""

    while true
      current_page = agent.get("https://hoge.com/fuga/" + next_url)
      elements = current_page.search('.fuga-title a')
      elements.each do |ele|
        links << ele.get_attribute('href')
      end

      next_link = current_page.at('.pagination .next a')
      break unless next_link
      next_url = next_link.get_attribute('href')
    end

    links.each do |link|
      get_book('https://hoge.com/fuga' + link)
    end
  end

  def self.get_book(link)
    agent = Mechanize.new
    page = agent.get(link)

    title = page.at('.fuga-title').inner_text if page.at('.fuga-title')
    image_url = page.at('.fuga-content img')[:src] if page.at('.fuga-content img')
    detail = page.at('.fuga-content p').inner_text if page.at('.fuga-content p')

    book = Book.where(title: title).first_or_initialize
    book.image_url = image_url
    book.detail = detail
    book.save
  end
end

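Conclusion

This is the basic flow for fetching the information you want,
but I would like to adapt the algorithm to each situation and use the data for analysis.

If this was even a little useful, please give it a like (^^)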
