크롤링(Crawling) 기초(1)

1. 크롤링(Crawling)

1.1 크롤링이란?

웹 페이지를 가져와서 데이터를 추출하는것
크롤링하는 소프트웨어를 크롤러(crawler)라고 함

1.2 크롤링 소프트웨어

Beautifulsoup

2. 실습

2.1 간단한 사이트 크롤링하기

https://beans.itcarlow.ie/prices.html
커피 가격을 올리는 사이트, 학습용 사이트 이다

2.2 Beautifulsoup

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'http://beans.itcarlow.ie/prices.html'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

웹페이지의 url 주소는 urllib.request.urlopen으로 접근
웹페이지의 html 언어는 Beautifulsoup으로 파싱함

2.3 HTML tag

print(soup.prettify())

<html>
 <head>
  <title>
   Welcome to the Beans'R'Us Pricing Page
  </title>
  <link href="beansrus.css" rel="stylesheet" type="text/css"/>
 </head>
 <body>
  <h2>
   Welcome to the Beans'R'Us Pricing Page
  </h2>
  <p>
   Current price of coffee beans =
   <strong>
    $5.62
   </strong>
  </p>
  <p>
   Price valid for 15 minutes from Wed Oct 14 02:55:01 2020.
  </p>
 </body>
</html>

prettify()는 bs가 지원하는 명령으로 print와 함께 사용하면 들여쓰기로 표현해줌

2.4 Find

print(soup.find('head').prettify())

<head>
 <title>
  Welcome to the Beans'R'Us Pricing Page
 </title>
 <link href="beansrus.css" rel="stylesheet" type="text/css"/>
</head>

Find 명령은 찾는 Tag 중 첫번째 결과를 보여줌

2.5 Body

print(soup.find('body').prettify())

<body>
 <h2>
  Welcome to the Beans'R'Us Pricing Page
 </h2>
 <p>
  Current price of coffee beans =
  <strong>
   $5.62
  </strong>
 </p>
 <p>
  Price valid for 15 minutes from Wed Oct 14 02:55:01 2020.
 </p>
</body>

Body 태그는 화면에 보이는 내용

2.6 H tag

print(soup.find('h2').prettify())

<h2>
 Welcome to the Beans'R'Us Pricing Page
</h2>

H tag는 제목 역할

2.7 P tag

print(soup.find('p').prettify())

<p>
 Current price of coffee beans =
 <strong>
  $5.62
 </strong>
</p>

P tag는 문단 구분

2.8 Filn_all

print(soup.find_all('p'))
print(soup.find_all('p')[0])
print(soup.find_all('p')[1])

[<p>Current price of coffee beans = <strong>$5.62</strong></p>, <p>Price valid for 15 minutes from Wed Oct 14 02:55:01 2020.</p>]
<p>Current price of coffee beans = <strong>$5.62</strong></p>
<p>Price valid for 15 minutes from Wed Oct 14 02:55:01 2020.</p>

Find_all 메서드는 해당 태그를 모두 찾음

2.9 String과 Get_text()

soup.find('strong')

<strong>$5.62</strong>

strong 태그를 찾음

soup.find('strong').string

'$5.62'

strong 태그의 문자열(string)을 출력

soup.find('strong').get_text()

'$5.62'

get_text를 이용하여 text형식을 출력

3. 크롬 (Chrome)

3.1 크롬 (Chrome)의 개발자 도구

위 예시처럼 html을 파싱해서 태그를 찾을수 있지만, 크롬 개발자 도구룰 통해 조금더 쉽게 알수 있음
Chrome에서 더보기(오른쪽 상단 점 3개) -> 도구 더보기 -> 개발자 도구
Windows : Shift + Ctrl + I
Mac : Option + Command + I

크롤링(Crawling) 기초(1)

1. 크롤링(Crawling)

1.1 크롤링이란?

1.2 크롤링 소프트웨어

2. 실습

2.1 간단한 사이트 크롤링하기

2.2 Beautifulsoup

2.3 HTML tag

2.4 Find

2.5 Body

2.6 H tag

2.7 P tag

2.8 Filn_all

2.9 String과 Get_text()

3. 크롬 (Chrome)

3.1 크롬 (Chrome)의 개발자 도구

3.2 엘레먼츠 (Elements)

4. 요약

4.1 요약

Recent Update

Trending Tags

Contents

Trending Tags

크롤링(Crawling) 기초(1)

1. 크롤링(Crawling)

1.1 크롤링이란?

1.2 크롤링 소프트웨어

2. 실습

2.1 간단한 사이트 크롤링하기

2.2 Beautifulsoup

2.3 HTML tag

2.4 Find

2.5 Body

2.6 H tag

2.7 P tag

2.8 Filn_all

2.9 String과 Get_text()

3. 크롬 (Chrome)

3.1 크롬 (Chrome)의 개발자 도구

3.2 엘레먼츠 (Elements)

4. 요약

4.1 요약

Recent Update

Trending Tags

Contents

Further Reading

크롤링(Crawling) 기초(2)

Beautiful Soup을 이용한 네이버 영화 평점 크롤링

Beautiful Soup을 이용한 시카고 샌드위치 맛집 정보 추출

Trending Tags