Web Scraping With Python: Beautiful Soup

2020. 12. 7. 10:55

** This is not about ML algorithms, but as a short break, this post looks at web scraping, which is widely used for tasks such as data collection.


This post explains what a web scraper is and how to implement one with Python's Beautiful Soup library.
(The example uses data from the Amazon website.)

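The internet is rich with data, and in an age where data is clearly the new oil, web scraping has become more important and practical across a wide range of applications. Web scraping means extracting or collecting ("scraping") information from a website; it is also sometimes called web harvesting or data extraction. Copying text from a website and pasting it into your local system is web scraping too, but it is manual work. Usually, web scraping refers to extracting data automatically with a web crawler. A web crawler is a script that connects to the World Wide Web over the HTTP protocol and fetches data in an automated way.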

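Whether you are a data scientist, an engineer, or anyone who analyzes large datasets, the ability to collect data from the web is a useful skill to have. If you find data on the web and there is no way to download it directly, web scraping with Python lets you extract that data into a useful form that can then be imported and used in various ways.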

Practical applications of web scraping include the following.

Collecting résumés of candidates with a specific skill
Extracting tweets with a specific hashtag from Twitter
Lead generation in marketing
Collecting product details and reviews from e-commerce websites

Beyond these use cases, web scraping is widely used in NLP (Natural Language Processing) to extract text from websites for training deep learning models.

Potential problems with web scraping

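One problem you may face when collecting information from websites is the variety of their structures: each site's template is different and unique, so a scraper has to be adapted to each website.
Another challenge is longevity. Web developers keep their sites up to date, so you cannot rely on a single scraper for long. Even a small change to the site can break the way the scraper fetches data.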

There are several ways to deal with these problems. One is to use continuous integration and continuous delivery (CI/CD) and to maintain the scraper continuously, since websites can be dynamic.
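One small piece of such ongoing maintenance is making a scraper fail loudly when the page layout changes, instead of silently returning nothing. The sketch below is only an illustration: `select_required` is a hypothetical helper, the two HTML fragments are made up, and it assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

def select_required(soup, tag, css_class):
    """Return all matching tags, raising if the expected markup is no longer there."""
    found = soup.find_all(tag, class_=css_class)
    if not found:
        raise ValueError(
            f"no <{tag} class='{css_class}'> found; the page layout may have changed"
        )
    return found

# Hypothetical markup before and after a site redesign
old_page = BeautifulSoup('<div class="item">A</div>', "html.parser")
new_page = BeautifulSoup('<div class="product">A</div>', "html.parser")

print(len(select_required(old_page, "div", "item")))
try:
    select_required(new_page, "div", "item")
except ValueError as err:
    print("scraper needs updating:", err)
```

A check like this can run in a scheduled job so that a layout change is noticed when it happens rather than when the collected data turns out to be empty.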

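Another, more practical approach is to use the APIs that many websites and platforms provide. For example, Facebook and Twitter offer APIs designed specifically for developers who want to experiment with their data, or who want to extract information about all friends and mutual friends and draw a connection graph. The data format returned by an API usually differs from what you get with web scraping: it is typically JSON or XML, whereas standard web scraping deals mostly with HTML.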
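To make the difference in formats concrete, here is a minimal sketch that reads the same record once from a JSON body and once from an HTML fragment. Both payloads are invented for illustration and do not come from any real API or site; the HTML part assumes the `beautifulsoup4` package.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical API response body (JSON): fields are already named and typed
api_body = '{"user": "alice", "followers": 120}'
record_from_api = json.loads(api_body)

# Hypothetical HTML page with the same information: values must be located in the markup
page = (
    '<div class="profile">'
    '<span class="user">alice</span>'
    '<span class="followers">120</span>'
    '</div>'
)
soup = BeautifulSoup(page, "html.parser")
record_from_html = {
    "user": soup.find("span", class_="user").get_text(),
    "followers": int(soup.find("span", class_="followers").get_text()),
}

print(record_from_api)
print(record_from_html)
```

With JSON the structure comes for free, while with HTML the scraper has to know which tags and classes hold each value and convert the text itself.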

Beautiful Soup?

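Beautiful Soup is a pure Python library for extracting structured data from websites; it can parse data from HTML and XML files. It acts as a helper module and lets you interact with HTML in a way similar to, and better than, how you would interact with a web page using other available developer tools.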
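As a minimal sketch of what this looks like in practice, the snippet below feeds a small hand-written HTML string to Beautiful Soup and pulls out structured values. The markup is made up for illustration, and the example assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment used only for illustration
html = """
<ul id="books">
  <li class="book"><span class="title">Book A</span><span class="price">10</span></li>
  <li class="book"><span class="title">Book B</span><span class="price">25</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; find returns the first match (or None)
books = []
for li in soup.find_all("li", class_="book"):
    title = li.find("span", class_="title").get_text(strip=True)
    price = int(li.find("span", class_="price").get_text(strip=True))
    books.append((title, price))

print(books)
```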

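It usually saves developers hours or days of work, because it works with your preferred parser, such as lxml or html5lib, to provide idiomatic Python ways of navigating, searching, and modifying the parse tree.
Another powerful and useful feature of Beautiful Soup is that it automatically converts incoming documents to Unicode and outgoing documents to UTF-8. As a developer you do not have to think about encodings, unless the document does not specify an encoding and Beautiful Soup cannot detect one.
It is also considered faster than other general parsing or scraping techniques.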
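The encoding behaviour can be seen in a small sketch. It assumes the `beautifulsoup4` package, and the input bytes are invented for illustration; `from_encoding` is passed explicitly here so that the example does not depend on encoding detection.

```python
from bs4 import BeautifulSoup

# Bytes encoded as ISO-8859-1 (Latin-1); the byte 0xE9 is "é" in that encoding
raw = "<p>café</p>".encode("iso-8859-1")

# from_encoding tells Beautiful Soup which encoding to assume for the input bytes
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")

text = soup.p.get_text()            # a regular Python (Unicode) string
utf8_bytes = soup.encode("utf-8")   # the document serialized back out as UTF-8 bytes

print(text)
print(utf8_bytes)
```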

Let's install Beautiful Soup and explore its features and capabilities with Python.


!pip3 install beautifulsoup4

Importing necessary libraries

Import the packages needed to collect data from the website, along with seaborn, matplotlib, and bokeh for visualization.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import re
import time
from datetime import datetime
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests

Scraping the Amazon Best Selling Books

We will scrape using a URL of the following form. (If it does not open, go through the link to the parent page.)

https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_'+str(pageNo)+'?ie=UTF8&pg='+str(pageNo)

As you can see, the page argument can be changed to access the data for each page. To get the full dataset you therefore need a loop that visits every page, but first you need to find out the number of pages from the website.
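The URL pattern can be turned into a small helper so the page loop stays readable. This is only a sketch: `build_url` and `BASE` are names introduced here for illustration, and the number of pages is assumed to be known already.

```python
BASE = "https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_"

def build_url(page_no):
    """Build the best-sellers URL for one page, following the pattern shown above."""
    return BASE + str(page_no) + "?ie=UTF8&pg=" + str(page_no)

# no_pages is assumed to be known (for example, read off the site's pagination)
no_pages = 2
urls = [build_url(page) for page in range(1, no_pages + 1)]
for url in urls:
    print(url)
```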

To connect to the URL and fetch the HTML content, the following are needed.

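Define a get_data function that takes the page number as an argument
Define a user-agent, which helps the scraper get past bot detection
Pass the URL to requests.get along with the user-agent header as an argument
Extract the content from the requests.get response
Parse the fetched page and assign it to the soup variable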
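The request-building part of these steps can be previewed without touching the network. The sketch below assumes the `requests` package and uses a shortened header set; it prepares the request that would be sent and inspects it, and it does not actually contact Amazon.

```python
import requests

# A shortened, illustrative header set; the User-Agent imitates a regular browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
url = "https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_1?ie=UTF8&pg=1"

# Prepare the request without sending it, to see exactly what would go out
prepared = requests.Request("GET", url, headers=headers).prepare()

print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

Sending the request with `requests.get(url, headers=headers)` returns a response whose `.content` is then handed to `BeautifulSoup`.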

The next important step is to identify the parent tag under which all the data you need resides. The data to extract is as follows.

Book Name
Author
Rating
Customers Rated
Price

The image below shows where the parent tag is located; when you hover the mouse over the parent tag, all the required elements are highlighted.

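As with the parent tag, you need to find the attributes for the book name, author, rating, customers rated, and price. Go to the web page you want to scrape, select the attribute, right-click, and choose Inspect (Inspect Element). This helps you locate the specific fields you want to extract in the raw HTML of the page, as in the figure below.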

Note that some author names are not registered with Amazon, so an additional find has to be applied for those authors. In the source code below you can see a nested if-else condition for the author name; it extracts the author or publication name.
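The fallback for unregistered authors can be tried in isolation on hand-written fragments. In this sketch, `extract_author` is a hypothetical helper and the three HTML snippets are invented to imitate the linked, plain-text, and missing cases; the class names are the ones used in the scraper code, and `beautifulsoup4` is assumed to be installed.

```python
from bs4 import BeautifulSoup

def extract_author(d):
    """Return the author or publisher text, trying the link first and then the plain span."""
    author = d.find("a", attrs={"class": "a-size-small a-link-child"})
    if author is None:
        # Unregistered authors are not linked; look for the plain-text span instead
        author = d.find("span", attrs={"class": "a-size-small a-color-base"})
    return author.text.strip() if author is not None else "0"

# Hypothetical fragments imitating the three cases
linked = BeautifulSoup('<div><a class="a-size-small a-link-child">Author A</a></div>', "html.parser")
plain = BeautifulSoup('<div><span class="a-size-small a-color-base">Publisher B</span></div>', "html.parser")
missing = BeautifulSoup("<div></div>", "html.parser")

authors = [extract_author(x) for x in (linked, plain, missing)]
print(authors)
```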


no_pages = 2

def get_data(pageNo):  
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

    r = requests.get('https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_'+str(pageNo)+'?ie=UTF8&pg='+str(pageNo), headers=headers)#, proxies=proxies)
    content = r.content
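    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(content, 'html.parser')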
    #print(soup)

    alls = []
    # Each parent <div> with this class holds one book entry
    for d in soup.find_all('div', attrs={'class':'a-section a-spacing-none aok-relative'}):
        #print(d)
        name = d.find('span', attrs={'class':'zg-text-center-align'})
        # The title is the alt text of the cover image; guard against a missing tag
        n = name.find_all('img', alt=True) if name is not None else []
        #print(n[0]['alt'])
        author = d.find('a', attrs={'class':'a-size-small a-link-child'})
        rating = d.find('span', attrs={'class':'a-icon-alt'})
        users_rated = d.find('a', attrs={'class':'a-size-small a-link-normal'})
        price = d.find('span', attrs={'class':'p13n-sc-price'})

        all1=[]

        if n:
            #print(n[0]['alt'])
            all1.append(n[0]['alt'])
        else:
            all1.append("unknown-product")

        if author is not None:
            #print(author.text)
            all1.append(author.text)
        else:
            # Authors not registered with Amazon appear as plain text rather than a link
            author = d.find('span', attrs={'class':'a-size-small a-color-base'})
            if author is not None:
                all1.append(author.text)
            else:
                all1.append('0')

        if rating is not None:
            #print(rating.text)
            all1.append(rating.text)
        else:
            all1.append('-1')

        if users_rated is not None:
            #print(users_rated.text)
            all1.append(users_rated.text)
        else:
            all1.append('0')

        if price is not None:
            #print(price.text)
            all1.append(price.text)
        else:
            all1.append('0')
        alls.append(all1)    
    return alls

The source code below does the following.

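Calls the get_data function inside a loop
The for loop calls get_data for each page number from 1 to no_pages, that is, range(1, no_pages+1)
Because the output is a nested list, flattens it and passes it to a DataFrame
Saves the DataFrame as a CSV file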
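The flattening step can be illustrated on its own. The sketch below uses made-up per-page results and `itertools.chain.from_iterable` from the standard library, which is one common alternative to a nested list comprehension; it is not the code used in this post.

```python
from itertools import chain

# Hypothetical per-page results: each page is a list of rows, each row a list of fields
page1 = [["Book A", "Author A"], ["Book B", "Author B"]]
page2 = [["Book C", "Author C"]]
results = [page1, page2]

# chain.from_iterable concatenates the per-page lists into one flat list of rows
rows = list(chain.from_iterable(results))
print(rows)
```

Each element of `rows` is then one row of the DataFrame, so the list can be passed to `pd.DataFrame` together with the column names.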


results = []
for i in range(1, no_pages+1):
  results.append(get_data(i))
flatten = lambda l: [item for sublist in l for item in sublist]
df = pd.DataFrame(flatten(results),columns=['Book Name','Author','Rating','Customers_Rated', 'Price'])
df.to_csv('amazon_products.csv', index=False, encoding='utf-8')

Reading CSV File

Let's load the CSV file created and saved above. This step is optional; you can skip it and use the DataFrame df directly.


df = pd.read_csv("amazon_products.csv")
df.shape

(100, 5)

The shape of the DataFrame shows that the CSV file has 100 rows and 5 columns.

Let's print the first 61 rows.


df.head(61)

	Book Name	Author	Rating	Customers_Rated	Price
0	The Power of your Subconscious Mind	Joseph Murphy	4.5 out of 5 stars	13,948	₹ 99.00
1	Think and Grow Rich	Napoleon Hill	4.5 out of 5 stars	16,670	₹ 99.00
2	Word Power Made Easy	Norman Lewis	4.4 out of 5 stars	10,708	₹ 130.00
3	Mathematics for Class 12 (Set of 2 Vol.) Exami...	R.D. Sharma	4.5 out of 5 stars	18	₹ 930.00
4	The Girl in Room 105	Chetan Bhagat	4.3 out of 5 stars	5,162	₹ 149.00
...	...	...	...	...	...
56	COMBO PACK OF Guide To JAIIB Legal Aspects Pri...	MEC MILLAN	4.5 out of 5 stars	114	₹ 1,400.00
57	Wren & Martin High School English Grammar and ...	Rao N	4.4 out of 5 stars	1,613	₹ 400.00
58	Objective General Knowledge	Sanjiv Kumar	4.2 out of 5 stars	742	₹ 254.00
59	The Rudest Book Ever	Shwetabh Gangwar	4.6 out of 5 stars	1,177	₹ 194.00
60	Sita: Warrior of Mithila (Ram Chandra Series -...	Amish Tripathi	4.4 out of 5 stars	3,110	₹ 248.00

61 rows × 5 columns

Preprocessing the Rating, Customers_Rated, and Price columns

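Since the ratings are known to be out of 5, we can keep only the rating value and drop the rest.
We can remove the commas (,) from the customers-rated column.
From the price column, we can remove the rupee symbol (₹) and the commas, and split on the dot (.) to keep the whole-number part.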

These three columns are then converted to integer or float.
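Before applying these rules to whole columns, it can help to check them on single strings. The helpers below (`parse_rating`, `parse_count`, `parse_price`) are hypothetical names introduced for this sketch; they apply the same rules in plain Python, using sample values in the format shown in the table above.

```python
def parse_rating(text):
    """'4.5 out of 5 stars' -> 4.5"""
    return float(text.split()[0])

def parse_count(text):
    """'13,948' -> 13948"""
    return int(text.replace(",", ""))

def parse_price(text):
    """'₹ 1,400.00' -> 1400 (the fractional part is dropped)"""
    cleaned = text.replace("₹", "").replace(",", "").strip()
    return int(cleaned.split(".")[0])

rating = parse_rating("4.5 out of 5 stars")
count = parse_count("13,948")
price = parse_price("₹ 1,400.00")
print(rating, count, price)
```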


df['Rating'] = df['Rating'].apply(lambda x: x.split()[0])
df['Rating'] = pd.to_numeric(df['Rating'])
df["Price"] = df["Price"].str.replace('₹', '')
df["Price"] = df["Price"].str.replace(',', '')
df['Price'] = df['Price'].apply(lambda x: x.split('.')[0])
df['Price'] = df['Price'].astype(int)
df["Customers_Rated"] = df["Customers_Rated"].str.replace(',', '')
df['Customers_Rated'] = pd.to_numeric(df['Customers_Rated'])  # errors='ignore' is deprecated in recent pandas

df.head()

| |Book Name |Author |Rating |Customers_Rated |Price |
|:-:|:---------:|:---------:|:---------:|:-----------------:|:-----:|
|0 |The Power of your Subconscious Mind| Joseph Murphy| 4.5| 13948| 99|
|1 |Think and Grow Rich| Napoleon Hill| 4.5| 16670| 99|
|2 |Word Power Made Easy| Norman Lewis| 4.4| 10708| 130|
|3 |Mathematics for Class 12 (Set of 2 Vol.) Exami...| R.D. Sharma| 4.5| 18| 930|
|4 |The Girl in Room 105| Chetan Bhagat| 4.3| 5162| 149|

Let's check the data types of the DataFrame.


df.dtypes

Book Name           object
Author              object
Rating             float64
Customers_Rated      int64
Price                int64
dtype: object

Replace the zero values in the DataFrame with NaN.


df.replace(str(0), np.nan, inplace=True)
df.replace(0, np.nan, inplace=True)

Counting the Number of NaNs in the DataFrame


count_nan = len(df) - df.count()

count_nan

Book Name          0
Author             6
Rating             0
Customers_Rated    0
Price              1
dtype: int64

From the output above, six books have no author name and one book has no price. This information is crucial for an author who wants to sell books and should not be left out.

Let's drop these NaNs.


df = df.dropna()
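The replace, count, and drop sequence can be followed end to end on a toy frame. The data below is invented for illustration and uses the same placeholder convention as the scraper (0 or '0' means "missing"); it assumes `pandas` and `numpy` are installed.

```python
import numpy as np
import pandas as pd

# Toy frame: '0' marks a missing author, 0 marks a missing price
toy = pd.DataFrame({
    "Author": ["Author A", "0", "Author C"],
    "Price": [99, 130, 0],
})

toy = toy.replace("0", np.nan).replace(0, np.nan)
missing_per_column = toy.isna().sum()   # same counts as len(toy) - toy.count()
cleaned = toy.dropna()                  # keeps only rows with no missing values

print(missing_per_column)
print(cleaned)
```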

Authors of the highest-priced books

Let's find out which authors have the highest-priced books, and visualize the top 15.


data = df.sort_values(["Price"], axis=0, ascending=False)[:15]
data

	Book Name	Author	Rating	Customers_Rated	Price
56	COMBO PACK OF Guide To JAIIB Legal Aspects Pri...	MEC MILLAN	4.5	114	1400.0
98	Diseases of Ear, Nose and Throat	P L Dhingra	4.7	118	1285.0
3	Mathematics for Class 12 (Set of 2 Vol.) Exami...	R.D. Sharma	4.5	18	930.0
96	Madhymik Bhautik Vigyan -12 (Part 1-2) (NCERT ...	Kumar-Mittal	5.0	1	765.0
6	My First Library: Boxset of 10 Board Books for...	Wonder House Books	4.5	3116	750.0
38	Indian Polity - For Civil Services and Other S...	M. Laxmikanth	4.6	1210	700.0
42	A Modern Approach to Verbal & Non-Verbal Reaso...	R.S. Aggarwal	4.4	1822	675.0
27	The Intelligent Investor (English) Paperback –...	Benjamin Graham	4.4	6201	650.0
99	Law of CONTRACT & Specific Relief	Dr. Avtar Singh	4.4	23	643.0
49	All In One ENGLISH CORE CBSE Class 12 2019-20	Arihant Experts	4.4	493	599.0
72	The Secret	Rhonda Byrne	4.5	11220	556.0
86	How to Prepare for Quantitative Aptitude for t...	Arun Sharma	4.4	847	537.0
8	Quantitative Aptitude for Competitive Examinat...	R S Aggarwal	4.4	4553	435.0
16	Sapiens: A Brief History of Humankind	Yuval Noah Harari	4.6	14985	434.0
84	Concept of Physics Part-2 (2019-2020 Session) ...	H.C. Verma	4.6	1807	433.0

Let's plot this with Bokeh.


from bokeh.models import ColumnDataSource
from bokeh.transform import dodge
import math
from bokeh.io import curdoc
curdoc().clear()
from bokeh.io import push_notebook, show, output_notebook
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.transform import factor_cmap
from bokeh.models import Legend
output_notebook()

# width/height replace plot_width/plot_height, which were removed in Bokeh 3.0
p = figure(x_range=data.iloc[:,1], width=800, height=550, title="Authors Highest Priced Book", toolbar_location=None, tools="")

p.vbar(x=data.iloc[:,1], top=data.iloc[:,4], width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2

show(p)

From the graph above, you can see that the two highest-priced books are by MEC MILLAN and P L Dhingra.

Top-rated books and authors by customer rating

Let's find out which authors have the top-rated books and which of their books are top rated. While doing so, we keep only books with more than 1,000 customer ratings.


data = df[df['Customers_Rated'] > 1000]
data = data.sort_values(['Rating'],axis=0, ascending=False)[:15]
data

	Book Name	Author	Rating	Customers_Rated	Price
26	Inner Engineering: A Yogi’s Guide to Joy	Sadhguru	4.7	4091	254.0
70	Bhagavad-Gita (Hindi)	A. C. Bhaktivedanta	4.7	1023	150.0
11	The Alchemist	Paulo Coelho	4.7	22182	264.0
47	Harry Potter and the Philosopher's Stone	J.K. Rowling	4.7	7737	234.0
84	Concept of Physics Part-2 (2019-2020 Session) ...	H.C. Verma	4.6	1807	433.0
16	Sapiens: A Brief History of Humankind	Yuval Noah Harari	4.6	14985	434.0
38	Indian Polity - For Civil Services and Other S...	M. Laxmikanth	4.6	1210	700.0
29	Wings of Fire: An Autobiography of Abdul Kalam	Arun Tiwari	4.6	3513	301.0
39	The Theory of Everything	Stephen Hawking	4.6	2004	199.0
25	The Immortals of Meluha (Shiva Trilogy)	Amish	4.6	4538	248.0
23	Life's Amazing Secrets: How to Find Balance an...	Gaur Gopal Das	4.6	3422	213.0
34	Dear Stranger, I Know How You Feel	Ashish Bagrecha	4.6	1130	167.0
17	The Monk Who Sold His Ferrari	Robin Sharma	4.6	5877	137.0
13	How to Win Friends and Influence People	Dale Carnegie	4.6	15377	99.0
59	The Rudest Book Ever	Shwetabh Gangwar	4.6	1177	194.0


# width/height replace plot_width/plot_height, which were removed in Bokeh 3.0
p = figure(x_range=data.iloc[:,0], width=800, height=600, title="Top Rated Books with more than 1000 Customers Rating", toolbar_location=None, tools="")

p.vbar(x=data.iloc[:,0], top=data.iloc[:,2], width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2
show(p)

From the result above, the top three rated books with more than 1,000 customer ratings are Inner Engineering: A Yogi’s Guide to Joy, Bhagavad-Gita (Hindi), and The Alchemist.


# width/height replace plot_width/plot_height, which were removed in Bokeh 3.0
p = figure(x_range=data.iloc[:,1], width=800, height=600, title="Top Rated Books with more than 1000 Customers Rating", toolbar_location=None, tools="")

p.vbar(x=data.iloc[:,1], top=data.iloc[:,2], width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2
show(p)

The graph above shows, in descending order, the 15 authors with the highest-rated books among those with more than 1,000 customer ratings.

Authors and books with the most customer ratings

We have seen the top-rated books and authors above, but it is still more convincing and reliable to decide on the best authors and books based on the number of customers who rated them.

So let's quickly find them.


data = df.sort_values(["Customers_Rated"], axis=0, ascending=False)[:20]
data

	Book Name	Author	Rating	Customers_Rated	Price
11	The Alchemist	Paulo Coelho	4.7	22182	264.0
1	Think and Grow Rich	Napoleon Hill	4.5	16670	99.0
13	How to Win Friends and Influence People	Dale Carnegie	4.6	15377	99.0
16	Sapiens: A Brief History of Humankind	Yuval Noah Harari	4.6	14985	434.0
18	Rich Dad Poor Dad : What The Rich Teach Their ...	Robert T. Kiyosaki	4.5	14591	296.0
10	The Subtle Art of Not Giving a F*ck	Mark Manson	4.4	14418	365.0
0	The Power of your Subconscious Mind	Joseph Murphy	4.5	13948	99.0
48	The Power of Your Subconscious Mind	Joseph Murphy	4.5	13948	99.0
72	The Secret	Rhonda Byrne	4.5	11220	556.0
41	1984	George Orwell	4.5	10829	95.0
2	Word Power Made Easy	Norman Lewis	4.4	10708	130.0
46	Man's Search For Meaning: The classic tribute ...	Viktor E Frankl	4.4	8544	245.0
67	The 7 Habits of Highly Effective People	R. Stephen Covey	4.3	8229	397.0
47	Harry Potter and the Philosopher's Stone	J.K. Rowling	4.7	7737	234.0
40	One Indian Girl	Chetan Bhagat	3.8	7128	113.0
65	Thinking, Fast and Slow (Penguin Press Non-Fic...	Daniel Kahneman	4.4	7087	410.0
27	The Intelligent Investor (English) Paperback –...	Benjamin Graham	4.4	6201	650.0
17	The Monk Who Sold His Ferrari	Robin Sharma	4.6	5877	137.0
53	Ram - Scion of Ikshvaku (Ram Chandra)	Amish Tripathi	4.2	5766	262.0
93	The Richest Man in Babylon	George S. Clason	4.5	5694	129.0


from bokeh.transform import factor_cmap
from bokeh.models import Legend
from bokeh.palettes import Dark2_5 as palette
import itertools
from bokeh.palettes import d3
#colors has a list of colors which can be used in plots
colors = itertools.cycle(palette)

palette = d3['Category20'][20]
index_cmap = factor_cmap('Author', palette=palette,
                         factors=data["Author"])

p = figure(width=700, height=700, title = "Top Authors: Rating vs. Customers Rated")
# legend_field groups the legend by the 'Author' column; it replaces the deprecated 'legend' keyword
p.scatter('Rating','Customers_Rated',source=data,fill_alpha=0.6, fill_color=index_cmap,size=20,legend_field='Author')
p.xaxis.axis_label = 'RATING'
p.yaxis.axis_label = 'CUSTOMERS RATED'
p.legend.location = 'top_left'

show(p)

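The graph above is a scatter plot of authors, with the number of customer ratings plotted against the actual rating. The following conclusions can be drawn from it.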

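The Alchemist by Paulo Coelho comes out as the best-selling book, since it leads on both the rating (4.7) and the number of customers who rated it (22,182).
Ram - Scion of Ikshvaku (Ram Chandra) by Amish Tripathi has a rating of 4.2 from 5,766 customer ratings. The Richest Man in Babylon by George S. Clason has almost the same number of customer ratings (5,694), but its overall rating is 4.5. We can therefore conclude that more customers gave a high rating to The Richest Man in Babylon.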

Attribution, NonCommercial, ShareAlike

Dead & Street