Python | Scraping websites with BeautifulSoup and urllib

This article shows how to scrape information from websites using BeautifulSoup and urllib in Python 3. The scraping itself is very simple, but parsing the retrieved HTML requires some knowledge of HTML.

Introduction
1. What is scraping?
2. Libraries used
BeautifulSoup4
Scraping a website

Introduction

Note that some websites prohibit data collection by automated retrieval programs (bots), so take care before applying the scraping method introduced here.

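What is scraping?

Scraping means retrieving information published on a website with a program.
The programs that do this are sometimes called web crawlers or web spiders.

Restrictions used to be looser, but, perhaps since the term "scraping" became common, some websites now strictly prohibit it.
Read the website's terms of use and confirm that scraping is allowed.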

Libraries used

This article uses two libraries.

BeautifulSoup: a library that parses the retrieved HTML
urllib: a library that retrieves (scrapes) the content of a website

BeautifulSoup is not part of the standard library, so it must be installed.

pip install beautifulsoup4

Official site

beautifulsoup4: Screen-scraping library

BeautifulSoup4

See the documentation for details.

Beautiful Soup Documentation — Beautiful Soup 4.13.0 documentation

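Searching by tag and other criteria

Searching the HTML

soup.find('tag', class_='class')
- Returns the first result that matches the search criteria.
- Returns None when nothing is found.
soup.find_all('tag', class_='class')
- Returns all results that match the search criteria as a list.
- Returns an empty list when nothing is found.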
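The difference between the two calls can be tried on a small inline document. This is a minimal sketch; the HTML and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<ul>
  <li class="item">apple</li>
  <li class="item">banana</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li", class_="item")          # first matching tag only
all_items = soup.find_all("li", class_="item")       # every matching tag
missing_one = soup.find("li", class_="missing")      # no match: None
missing_all = soup.find_all("li", class_="missing")  # no match: empty list

print(first_item.get_text())
print([li.get_text() for li in all_items])
print(missing_one, len(missing_all))
```

Because find returns None on a miss, check the result before chaining another call onto it; find_all can be looped over safely even when it is empty.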

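Specifying search criteria

soup.find_all('tag', class_='class'): tag AND class. * Note that the keyword is class_ with a trailing underscore (_), because class is a reserved word in Python.
soup.find_all('tag', id='ID'): tag AND id value.
soup.find_all(id='ID'): id value only.
soup.find_all(name='NAME'): name is find_all's own tag-name parameter, so this matches tags named NAME, not the HTML name attribute. * Combining it with a positional tag raises a TypeError. To match the name attribute, use attrs={"name": "NAME"}.
soup.find_all(attrs={"data-XXX": "value"}): use attrs to match custom attributes such as data-*.
soup.find_all(string="some text"): matches the text enclosed by a tag and returns the strings themselves rather than the tags.
soup.find_all(text="some text"): same as string; text is the older name of the argument, so prefer string.
soup.find_all(string=["text 1", "text 2", "text 3"]): pass a list to search for several strings.
soup.find_all(string=re.compile(pattern)): use the re regular-expression library to search by regular expression.
soup.find_all(string="some text", recursive=False): recursive=False searches only the direct children.
soup.find_all(string="some text", limit=2): limit=2 caps the number of results.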
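Several of these criteria can be compared side by side on one small document. This is only a sketch: the HTML, the id "main", and the attribute values are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<div id="main">
  <p class="note" data-kind="tip">first</p>
  <p class="note">second</p>
  <form><input name="q" value="x"></form>
  <section><p class="note">third</p></section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
main = soup.find(id="main")

by_class = soup.find_all("p", class_="note")             # tag AND class
by_custom = soup.find_all(attrs={"data-kind": "tip"})    # custom data-* attribute
by_name_attr = soup.find_all(attrs={"name": "q"})        # HTML name attribute
by_string = soup.find_all(string="second")               # returns strings, not tags
by_regex = soup.find_all(string=re.compile("^(first|third)$"))
limited = soup.find_all("p", limit=2)                    # stop after two results
direct_only = main.find_all("p", recursive=False)        # direct children of #main only

print(len(by_class), by_name_attr[0].name)
print([str(s) for s in by_regex])
print([p.get_text() for p in direct_only])
```

The third paragraph is nested inside a section, so it is found by the ordinary search but skipped when recursive=False is given.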

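Searching with CSS selectors

soup.select(CSS selector)

soup.select("#contents"): selects by ID.
soup.select(".contents"): selects by class.
soup.select("body a"): selects a elements anywhere inside body (descendant combinator).
soup.select("head > title"): selects title elements that are direct children of head (child combinator).
soup.select("#contents ~ .contents"): selects .contents elements that are later siblings of #contents (general sibling combinator).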
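The selectors listed here can be checked against a small page. The sketch below uses a hypothetical document in which the id and the class are both named "contents", to mirror the examples.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<html><head><title>Demo</title></head>
<body>
  <div id="contents"><a href="/a">A</a></div>
  <p class="contents">B</p>
  <div><p class="contents"><a href="/c">C</a></p></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#contents")                 # the element with id="contents"
by_class = soup.select(".contents")              # both class="contents" paragraphs
descendants = soup.select("body a")              # every a inside body, at any depth
child = soup.select("head > title")              # title directly under head
siblings = soup.select("#contents ~ .contents")  # later siblings of #contents

print([a.get_text() for a in descendants])
print([e.get_text() for e in siblings])
```

Only the paragraph "B" is returned by the sibling selector, because the paragraph containing "C" sits one level deeper and is not a sibling of #contents.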

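Getting parent, child, sibling, and adjacent elements

Parent elements

soup.parent: the immediate parent element
soup.parents: all ancestor elements

Child elements

soup.contents: the child elements as a list.
soup.children: the child elements as an iterator.

* An iterator is an object that yields its items one at a time, in order.

Sibling elements

soup.next_sibling: the next element at the same level of the parse tree
soup.previous_sibling: the previous element at the same level of the parse tree

Adjacent elements

soup.next_element: the element that appears immediately after in the document, regardless of level
soup.previous_element: the element that appears immediately before in the document, regardless of level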
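The navigation attributes are easiest to compare when all of them start from the same tag. In this sketch the HTML is hypothetical and deliberately written without whitespace between tags; with indented HTML, the sibling attributes often return the whitespace strings that sit between tags.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; no whitespace between tags, so siblings are tags
html = '<div id="box"><h2>Title</h2><p>one <b>bold</b></p><p>two</p></div>'
soup = BeautifulSoup(html, "html.parser")
first_p = soup.find("p")

parent_name = first_p.parent.name                       # immediate parent
ancestor_names = [t.name for t in first_p.parents]      # up to the document itself
children_as_text = [str(c) for c in first_p.children]   # a string and a <b> tag
next_sib = first_p.next_sibling                         # <p>two</p>
prev_sib = first_p.previous_sibling                     # <h2>Title</h2>
next_elem = first_p.next_element                        # first node inside the <p>
prev_elem = first_p.previous_element                    # the string inside the <h2>

print(parent_name, ancestor_names)
print(children_as_text)
print(repr(str(next_elem)), repr(str(prev_elem)))
```

The sibling attributes stay on the same level of the tree, while next_element and previous_element follow the order in which the document was parsed and can therefore step into or out of a tag.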

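Getting values from an element

Getting the element (tag) name

soup.name

Getting an element's attributes, ID, or class

Attributes such as href, as well as the ID and class, can be retrieved in either of the following ways. The first raises a KeyError if the attribute is missing; the second returns None.

soup.p['attribute']
soup.p.get('attribute')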
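The two ways of reading an attribute behave differently when the attribute is absent. A minimal sketch on a hypothetical link tag:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = '<a id="top" class="btn primary" href="/home">Home</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

tag_name = link.name          # tag name
href = link["href"]           # dictionary-style access
link_id = link.get("id")      # .get() access
classes = link["class"]       # class is multi-valued, so it comes back as a list
missing = link.get("title")   # absent attribute: None

try:
    link["title"]             # absent attribute with [] raises KeyError
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(tag_name, href, link_id, classes, missing, raised_key_error)
```

Use .get() when an attribute may not be present on every tag, for example when looping over links of which only some carry a title.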

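Getting an element's text

soup.strings: all strings inside the element, including those of its descendants, as a generator
soup.get_text(): all text inside the element, joined into a single string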
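The sketch below contrasts the two on a hypothetical fragment. It also uses get_text(strip=True), which is not mentioned above but is part of Beautiful Soup and trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = "<div><p>Hello, <b>world</b>!</p><p> Bye </p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

pieces = list(div.strings)    # one entry per text node, in document order
joined = div.get_text()       # the same text nodes joined into one string
trimmed = soup.find_all("p")[1].get_text(strip=True)  # whitespace removed

print([str(s) for s in pieces])
print(repr(joined))
print(repr(trimmed))
```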

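Scraping a website

Prerequisites (information to retrieve)

This time we collect information from the following two pages.

Natural gas storage release schedule

http://ir.eia.gov/ngs/schedule.html

Natural gas storage (the latest total storage, in the top-right cell of the table)

Weekly Working Gas in Underground Storage

The site's Privacy Statement and Security Policy states, as quoted below, that excessive automated access is prohibited, so a small number of requests appears to be acceptable.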

Automated retrieval program (robot) activity
EIA is committed to providing data promptly and according to established schedules. Therefore any automated retrieval program (commonly referred to as a “robot” or “bot”) that excessively accesses information from EIA’s website is prohibited. Excessive robot activity on EIA’s website can cause delays and interfere with other customers’ timely access to information.

U.S. Energy Information Administration - EIA - Independent Statistics and Analysis


Code

from datetime import datetime
from pytz import timezone

from bs4 import BeautifulSoup
from urllib import request

schedule_url = 'http://ir.eia.gov/ngs/schedule.html'                # release schedule page
storage_url = 'https://www.eia.gov/dnav/ng/ng_stor_wkly_s1_w.htm'   # storage report page

# Release time
now_time    = datetime.now(timezone('America/New_York'))
release_date = []

response = request.urlopen(schedule_url)
soup = BeautifulSoup(response, "html.parser")
response.close()

data_table  = soup.find('table', class_='simpletable')
data_tr     = data_table.find_all('tr')

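for tr in data_tr:
    data_td = tr.find_all('td')
    if len(data_td) != 0:
        # Column 0 holds the date, column 2 the time written with a.m./p.m.
        next_date = data_td[2].text
        next_date = next_date.replace('a.m.', 'AM')
        next_date = next_date.replace('p.m.', 'PM')
        next_date = datetime.strptime(data_td[0].text + ' ' + next_date, '%m/%d/%Y %I:%M %p')
        # Attach New York time with pytz so that daylight saving time is handled
        # (a fixed '-05:00' offset is one hour off while EDT is in effect)
        next_date = timezone('America/New_York').localize(next_date)

        # Keep only releases that are still in the future
        if now_time <= next_date:
            release_date.append(next_date)

# Print the last upcoming release found on the page
print(release_date[-1])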

# Storage level
response = request.urlopen(storage_url)
soup = BeautifulSoup(response, "html.parser")
response.close()

data_table = soup.find('table', class_='data1')
new_storage = data_table.find_all('td', class_='Current2')[0].text

new_storage = int(new_storage.replace(',', ''))

print(new_storage)

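Explanation

Structure

Beginning: imports the libraries and sets the URLs of the pages to retrieve
Middle: retrieves (scrapes) the release time
End: retrieves (scrapes) the storage level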
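The middle part can be tried offline against a stand-in table. The sketch below is an assumption-heavy illustration: the real schedule page's markup is not reproduced in this article, so the table contents, the dates, the middle column, and the function name parse_release_times are all hypothetical. Only the class name simpletable and the column positions (date in column 0, time in column 2) follow the code above. Time zone handling is left out to keep the example short.

```python
from datetime import datetime
from bs4 import BeautifulSoup

# Hypothetical table; only the class name and column positions follow the article's code
html = """
<table class="simpletable">
  <tr><th>Release date</th><th>Week ending</th><th>Time</th></tr>
  <tr><td>05/21/2020</td><td>05/15/2020</td><td>10:30 a.m.</td></tr>
  <tr><td>05/28/2020</td><td>05/22/2020</td><td>10:30 a.m.</td></tr>
  <tr><td>06/04/2020</td><td>05/29/2020</td><td>12:00 p.m.</td></tr>
</table>
"""

def parse_release_times(page_html, now):
    """Return the release datetimes (naive) that are not earlier than now."""
    soup = BeautifulSoup(page_html, "html.parser")
    table = soup.find("table", class_="simpletable")
    upcoming = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:  # the header row contains only <th> cells
            continue
        date_text = cells[0].get_text(strip=True)
        time_text = cells[2].get_text(strip=True)
        # strptime's %p expects AM/PM, so normalise the a.m./p.m. spelling
        time_text = time_text.replace("a.m.", "AM").replace("p.m.", "PM")
        release = datetime.strptime(date_text + " " + time_text, "%m/%d/%Y %I:%M %p")
        if now <= release:
            upcoming.append(release)
    return upcoming

upcoming = parse_release_times(html, now=datetime(2020, 5, 25, 9, 0))
print(upcoming)
```

Passing now as an argument keeps the function deterministic, which makes it easy to check against a saved copy of the page before pointing it at the live site.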

How the scraping works

response = request.urlopen(URL) connects to the website and retrieves the page.
soup = BeautifulSoup(response, "html.parser") parses the retrieved content.
response.close() closes the connection to the website.

That is all: scraping takes just these three steps.
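The three steps can also be wrapped in a small helper. This is a sketch under stated assumptions: the function name fetch_soup and the 10-second timeout are illustrative choices, and a data: URL stands in for a real site so that the example runs without sending any request.

```python
from urllib import request
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=10):
    """Fetch url and return the parsed HTML as a BeautifulSoup object."""
    # The with-statement closes the response even if reading fails,
    # which replaces the explicit response.close() call
    with request.urlopen(url, timeout=timeout) as response:
        html = response.read()
    return BeautifulSoup(html, "html.parser")

# A data: URL (percent-encoded "<title>Hello</title>") instead of a real site
demo_url = "data:text/html,%3Ctitle%3EHello%3C%2Ftitle%3E"
soup = fetch_soup(demo_url)
page_title = soup.title.get_text()
print(page_title)
```

When calling the helper on a real site, keep the number of requests small, in line with the site's policy on automated access.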

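Parsing the retrieved HTML

Parsing the retrieved HTML requires some knowledge of HTML.

Of the two examples above, the simpler one, the storage level, is explained here.

HTML of the storage table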

<table class="data1" width="760" border="0">
          <tbody><tr>
            <th colspan="11" align="left">
              <div id="tlinks" name="tlinks"> 		
                <table class="data2" cellspacing="1" cellpadding="0" border="0">
                <tbody><tr>
                  <td width="15"></td>
                  <td align="center">
                  <a href="./xls/NG_STOR_WKLY_S1_W.xls" class="reg">
                    <img src="img/Excel_Hist3.jpg" alt="Download Series History" width="21" height="19" border="0"></a></td>
                  <td width="2"></td>         
                  <td class="Info" valign="middle" align="left">
                  <a class="crumb" href="./xls/NG_STOR_WKLY_S1_W.xls">Download Series History</a></td>  		  
                  <td width="20"></td>        
                  <td width="21" align="center"><a href="TblDefs/ng_stor_wkly_tbldef2.asp">
                    <img src="img/Notes_Icon1.jpg" alt="Definitions, Sources &amp; Notes" width="21" height="19" border="0"></a></td>
                  <td width="1"></td>
                  <td class="Info" valign="middle" align="left">
                  <a class="crumb" href="TblDefs/ng_stor_wkly_tbldef2.asp">Definitions, Sources &amp; Notes</a></td>
                </tr>
                </tbody></table>			
          </div>			
            </th>
          </tr>
        
          <tr>
  	        <th class="Series">

                           <table class="data2" cellspacing="0" cellpadding="0" border="0" align="left">
              <tbody><tr align="left">	
                <td width="10"><img src="img/spacer_transp.gif" alt="" width="1" height="1" border="0"></td>
                <td class="LabelB">Region</td>
                <td width="20"><img src="img/spacer_transp.gif" alt="" width="1" height="1" border="0"></td>	
              </tr>
              <tr>
                <td colspan="3" height="2"></td>
              </tr>
              </tbody></table>	
             <img src="img/spacer_transp.gif" alt="" width="1" height="1" border="0">
             </th><th class="Series"><button style="font-size: 90%;" disabled="disabled">Graph</button><button style="font-size: 90%;" disabled="disabled">Clear</button></th> 	
             		 
                 <th class="Series5">04/17/20</th>
    <th class="Series5">04/24/20</th>
    <th class="Series5">05/01/20</th>
    <th class="Series5">05/08/20</th>
    <th class="Series5">05/15/20</th>
    <th class="Series5">05/22/20</th>
    <th class="Cross">View<br>History</th>  
</tr>
             		<tr class="DataRow">
            <td class="DataStub" width="228">
                <table class="data2" cellspacing="0" cellpadding="0" border="0">
                    <tbody><tr>
                        <td width="3"></td>
                        <td class="DataStub1"><b>Total Lower 48 States<b></b></b></td>
                    </tr>
                </tbody></table>
            </td>
            <td class="DataB" style="text-align: center;" width="84"><a target="_blank" href="/opendata/qb.php?sdid=NG.NW2_EPG0_SWO_R48_BCF.W"><span class="ico_sourcekey" title="Click to view series in API browser"></span></a><input type="checkbox" class="chartCheckBox"></td><td class="DataB" width="76">2,140</td>
            <td class="DataB" width="76">2,210</td>
            <td class="DataB" width="76">2,319</td>
            <td class="DataB" width="76">2,422</td>
            <td class="DataB" width="76">2,503</td>
            <td class="Current2" width="76">2,612</td>
            <td class="DataHist" width="76"><a href="./hist/nw2_epg0_swo_r48_bcfw.htm" class="Hist">2010-2020</a></td>				
        </tr>
        <tr class="DataRow">
            <td class="DataStub" width="228">
                <table class="data2" cellspacing="0" cellpadding="0" border="0">
                    <tbody><tr>
                        <td width="23"></td>
                        <td class="DataStub1">East</td>
                    </tr>
                </tbody></table>
            </td>
            <td class="DataB" style="text-align: center;" width="84"><a target="_blank" href="/opendata/qb.php?sdid=NG.NW2_EPG0_SWO_R31_BCF.W"><span class="ico_sourcekey" title="Click to view series in API browser"></span></a><input type="checkbox" class="chartCheckBox"></td><td class="DataB" width="76">400</td>
            <td class="DataB" width="76">405</td>
            <td class="DataB" width="76">424</td>
            <td class="DataB" width="76">452</td>
            <td class="DataB" width="76">469</td>
            <td class="Current2" width="76">504</td>
            <td class="DataHist" width="76"><a href="./hist/nw2_epg0_swo_r31_bcfw.htm" class="Hist">2010-2020</a></td>				
        </tr>
        <tr class="DataRow">
            <td class="DataStub" width="228">
                <table class="data2" cellspacing="0" cellpadding="0" border="0">
                    <tbody><tr>
                        <td width="23"></td>
                        <td class="DataStub1">Midwest</td>
                    </tr>
                </tbody></table>
            </td>
            <td class="DataB" style="text-align: center;" width="84"><a target="_blank" href="/opendata/qb.php?sdid=NG.NW2_EPG0_SWO_R32_BCF.W"><span class="ico_sourcekey" title="Click to view series in API browser"></span></a><input type="checkbox" class="chartCheckBox"></td><td class="DataB" width="76">493</td>
            <td class="DataB" width="76">506</td>
            <td class="DataB" width="76">530</td>
            <td class="DataB" width="76">554</td>
            <td class="DataB" width="76">576</td>
            <td class="Current2" width="76">606</td>
            <td class="DataHist" width="76"><a href="./hist/nw2_epg0_swo_r32_bcfw.htm" class="Hist">2010-2020</a></td>				
        </tr>
        <tr class="DataRow">
            <td class="DataStub" width="228">
                <table class="data2" cellspacing="0" cellpadding="0" border="0">
                    <tbody><tr>
                        <td width="23"></td>
                        <td class="DataStub1">Mountain</td>
                    </tr>
                </tbody></table>

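The value we want (2,612 on line 64 of the HTML above) is inside the table tag on line 1.
So we first get the target table with data_table = soup.find('table', class_='data1').
Line 64 reads <td class="Current2" width="76">2,612</td>, and fortunately class="Current2" identifies the target column, so we get the value with new_storage = data_table.find_all('td', class_='Current2')[0].text.
- data_table.find_all returns every matching HTML tag as a list; the value we want is in the first (topmost) row, so we specify [0].
- .text extracts the text portion of the tag <td class="Current2" width="76">2,612</td>.
- The text contains a thousands separator, so the code removes the comma before converting it with int.

That gives us the information we were after.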
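The same lookup can be reproduced without accessing the site by feeding a trimmed copy of the HTML to BeautifulSoup. In this sketch the markup is shortened by hand from the HTML shown earlier (nested tables and most columns are dropped), and the per-region dictionary at the end is an illustrative extension, not part of the original program.

```python
from bs4 import BeautifulSoup

# Trimmed by hand from the storage table HTML; values are taken from that HTML
html = """
<table class="data1">
  <tr class="DataRow">
    <td class="DataStub">Total Lower 48 States</td>
    <td class="DataB">2,503</td>
    <td class="Current2">2,612</td>
  </tr>
  <tr class="DataRow">
    <td class="DataStub">East</td>
    <td class="DataB">469</td>
    <td class="Current2">504</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Same steps as the article: find the table, take the first Current2 cell
data_table = soup.find("table", class_="data1")
current_cells = data_table.find_all("td", class_="Current2")
new_storage = int(current_cells[0].text.replace(",", ""))

# Illustrative extension: pair each region label with its latest value
latest_by_region = {}
for row in data_table.find_all("tr", class_="DataRow"):
    region = row.find("td", class_="DataStub").get_text(strip=True)
    latest = row.find("td", class_="Current2").get_text(strip=True)
    latest_by_region[region] = int(latest.replace(",", ""))

print(new_storage)
print(latest_by_region)
```

Working from a saved snippet like this is a convenient way to adjust selectors without repeatedly requesting the live page.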