01_Python

51_Static Crawling

chuu_travel 2025. 1. 15. 22:43

01. Static crawling modules

from bs4 import BeautifulSoup
from urllib.request import urlopen

BeautifulSoup converts the HTML fetched by the request module (urlopen) into an object that Python code can navigate

url = "https://chuuvelop.tistory.com/"

page = urlopen(url)

soup = BeautifulSoup(page, "lxml") # replace "lxml" with the name of the parser you want

print(soup)

The HTML of the page is printed
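Some servers reject the default urllib User-Agent, so it can help to send a browser-like header with the request. The following is a minimal sketch under that assumption: the User-Agent string and the helper names build_request and fetch_soup are hypothetical, and html.parser is used only so that no extra parser has to be installed.

```python
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Hypothetical User-Agent string, chosen for illustration only.
USER_AGENT = "Mozilla/5.0 (compatible; study-crawler/0.1)"

def build_request(url):
    """Wrap the URL in a Request that carries a custom User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})

def fetch_soup(url, parser="html.parser", timeout=10):
    """Download the page and parse it. urlopen raises URLError/HTTPError on failure."""
    with urlopen(build_request(url), timeout=timeout) as page:
        return BeautifulSoup(page.read(), parser)

# Building the request does not touch the network; fetch_soup(url) would.
req = build_request("https://chuuvelop.tistory.com/")
print(req.full_url, req.get_header("User-agent"))
```

Calling fetch_soup(url) returns the same kind of soup object as the code above, with the connection closed once the page has been read.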

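02. Parsers

A parser is a program that reads raw HTML and turns it into a structured form, so that the data you want can be extracted by pattern or position
- lxml
  - Implemented in C, so it is the fastest
- html5lib
  - The slowest
  - The most lenient: it parses a page the same way a web browser does
- html.parser
  - Speed is between lxml and html5lib
  - Ships with the Python standard library, so nothing extra needs to be installed

※ lxml and html.parser are the most commonly used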
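Because lxml and html5lib are separate installs, a script may want to fall back to html.parser when the preferred parser is missing. A small sketch of that idea follows; the helper name make_soup and the preference order are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html.parser")):
    """Try each parser name in order and return (soup, name of the parser used)."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # Raised by BeautifulSoup when the requested parser is not installed.
            continue
    raise RuntimeError("none of the requested parsers is installed")

soup, parser_used = make_soup("<p>hello</p>")
print(parser_used, soup.p.string)
```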

03. Attribute data

html = """<html> <head><title class="t" id="ti">test site</title></head> <body> <p>test</p> <p>test1</p> <p>test2</p> </body></html>"""

html
- head
  - title
- body
  - p
  - p
  - p

soup = BeautifulSoup(html, "lxml")

tag_title = soup.title

print(tag_title)
print(tag_title.attrs) # all attributes of the tag, as a dictionary
print(tag_title["class"])
print(tag_title["id"])

<title class="t" id="ti">test site</title>
{'class': ['t'], 'id': 'ti'}
['t']
ti

# Raises KeyError if the attribute does not exist
print(tag_title["class1"])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[15], line 2
      1 # Raises KeyError if the attribute does not exist
----> 2 print(tag_title["class1"])

File C:\ProgramData\anaconda3\Lib\site-packages\bs4\element.py:1573, in Tag.__getitem__(self, key)
   1570 def __getitem__(self, key):
   1571     """tag[key] returns the value of the 'key' attribute for the Tag,
   1572     and throws an exception if it's not there."""
-> 1573     return self.attrs[key]

KeyError: 'class1'

tag_title.get("class1", "default_value")

'default_value'

# A Tag's attributes can be read with dictionary-style syntax: [] raises KeyError, .get() returns a default
type(tag_title)

bs4.element.Tag
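The output above shows class coming back as a list while id is a plain string. A short sketch of why that matters when a tag carries several classes; the sample markup is made up for illustration, and html.parser is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a two-valued class attribute.
html = '<a class="btn primary" href="/next" id="go">next</a>'
tag = BeautifulSoup(html, "html.parser").a

classes = tag["class"]        # class is multi-valued, so it is always a list
href = tag.get("href")        # single-valued attributes are plain strings
missing = tag.get("title")    # .get() returns None when no default is given
has_id = tag.has_attr("id")   # existence check without reading the value

print(classes, href, missing, has_id)
# → ['btn', 'primary'] /next None True
```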

04. Accessing tags

soup.<tag name> returns the first tag with that name in the document

tag_title = soup.title
print(tag_title)

<title class="t" id="ti">test site</title>

print(tag_title.text)
print(tag_title.string)

test site
test site

# Difference between text and string
html = """<html> <head><title>test site</title></head> <body> <p><span>test1</span><span>test2</span></p> </body></html>"""

html
- head
  - title
- body
  - p
    - span
    - span

soup = BeautifulSoup(html, "lxml")

tag_p = soup.p

print(tag_p)

<p><span>test1</span><span>test2</span></p>

data_text = tag_p.text
data_string = tag_p.string # only the tag's own single string; None when the tag has more than one child

print("text : ", data_text, type(data_text))
print("string : ", data_string, type(data_string))

text :  test1test2 <class 'str'>
string :  None <class 'NoneType'>

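text
- Returns the text of the tag and of all its descendants, joined together
string
- Returns the text only when the tag holds exactly one string (directly or through a single child); otherwise None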

tag_p.span.string

'test1'
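When .string comes back as None, get_text() and stripped_strings are the usual alternatives. A minimal sketch on the same p tag, with html.parser assumed here only to keep the example dependency-free:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><span>test1</span><span>test2</span></p>", "html.parser")
p = soup.p

joined = p.get_text(" ", strip=True)  # join descendant strings with a space
pieces = list(p.stripped_strings)     # each descendant string, whitespace stripped

print(p.string)   # → None (p has two children)
print(joined)     # → test1 test2
print(pieces)     # → ['test1', 'test2']
```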

Accessing child tags

Use the contents and children attributes to get a tag's children

tag_p_contents = soup.p.contents # returns the children as a list

print(tag_p_contents)

[<span>test1</span>, <span>test2</span>]

tag_p_children = soup.p.children # children is an iterator, so loop over it to read the values

for child in tag_p_children:
    print(child)

<span>test1</span>
<span>test2</span>

print(tag_p_children) # no error, but printing the iterator shows only the iterator object; wrap it in list() to see the children

<list_iterator object at 0x0000021B35BE20B0>
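In real pages the children usually include whitespace strings between tags, not just tags. A sketch of filtering them out follows; the ul markup is a made-up example and html.parser is an assumption to avoid extra installs.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup with newlines between the li tags.
html = "<ul>\n<li>a</li>\n<li>b</li>\n</ul>"
ul = BeautifulSoup(html, "html.parser").ul

all_children = list(ul.children)  # includes the "\n" strings
tag_children = [c for c in ul.children if isinstance(c, Tag)]  # tags only

print(len(all_children), len(tag_children))
# → 5 2
```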

Accessing parent tags

Use parent and parents to get the tags that enclose a tag

tag_span = soup.span
tag_title = soup.title

print(tag_span)
print(tag_title)

<span>test1</span>
<title>test site</title>

span_parent = tag_span.parent
title_parent = tag_title.parent

print(span_parent)
print(title_parent)

<p><span>test1</span><span>test2</span></p>
<head><title>test site</title></head>

# parents is an iterator, so loop over it
span_parents = tag_span.parents

for parent in span_parents:
    print(parent)

<p><span>test1</span><span>test2</span></p>
<body> <p><span>test1</span><span>test2</span></p> </body>
<html> <head><title>test site</title></head> <body> <p><span>test1</span><span>test2</span></p> </body></html>
<html> <head><title>test site</title></head> <body> <p><span>test1</span><span>test2</span></p> </body></html>
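The last two lines of that output look identical because the final parent is the BeautifulSoup document object itself, which prints the same as the html tag. Printing only the names makes the chain easier to read. A small sketch, using html.parser as an assumption:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><span>test1</span></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upward from the span and keep only each ancestor's name.
parent_names = [parent.name for parent in soup.span.parents]

print(parent_names)
# → ['p', 'body', 'html', '[document]']
```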

Accessing sibling tags

Sibling tags: tags at the same level, that is, tags sharing the same parent

tag_span

<span>test1</span>

a = tag_span.next_sibling
b = a.previous_sibling

print(a)
print(b)

<span>test2</span>
<span>test1</span>

print(a.next_sibling)
print(b.previous_sibling)

None
None

html = """<html> <head><title>test site</title></head> <body> <p><a>test1</a><b>test2</b><c>test3</c></p> </body></html>"""

html
- head
  - title
- body
  - p
    - a
    - b
    - c

soup = BeautifulSoup(html, "lxml")

tag_a = soup.a
tag_b = soup.b
tag_c = soup.c

tag_a_nexts = tag_a.next_siblings
tag_b_prevs = tag_b.previous_siblings

for sibling in tag_a_nexts:
    print(sibling)

<b>test2</b>
<c>test3</c>

for sibling in tag_b_prevs:
    print(sibling)

<a>test1</a>
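The examples above have no whitespace between the tags. When there is whitespace, next_sibling returns that string rather than the next tag. A sketch of the difference follows, with made-up markup and html.parser assumed:

```python
from bs4 import BeautifulSoup

# Note the single space between </a> and <b>.
soup = BeautifulSoup("<p><a>test1</a> <b>test2</b></p>", "html.parser")
a = soup.a

raw_next = a.next_sibling              # the whitespace string " "
next_tag = a.find_next_sibling(True)   # True matches any tag, so strings are skipped

print(repr(str(raw_next)), next_tag)
# → ' ' <b>test2</b>
```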

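Accessing next and previous elements

next_element, previous_element
How they differ from sibling access
- Siblings: only nodes at the same level
- Elements: tags as well as the child tags and text inside them, visited in document order

tag_a_nexts = tag_a.next_elements

for i in tag_a_nexts:
    print(i)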

test1
<b>test2</b>
test2
<c>test3</c>
test3
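To see the two kinds of traversal side by side, the sketch below labels each node as a tag or a string. The markup is a shortened version of the example, and html.parser is assumed.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p><a>test1</a><b>test2</b></p>", "html.parser")
a = soup.a

sibling = a.next_sibling   # same level: the <b> tag
element = a.next_element   # document order: the string inside <a>

# Label tags as <name> and strings as their text.
labels = [f"<{e.name}>" if isinstance(e, Tag) else str(e) for e in a.next_elements]

print(sibling, element, labels)
# → <b>test2</b> test1 ['test1', '<b>', 'test2']
```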

Accessing specific elements

find_all()

Returns all matching tags as a list

html = """<html> <head><title>test site</title></head> <body> <p class="a" id="i">test1</p><p class="d" id="d">test2</p><p class="c">test3</p><a>a tag</a> <b>b tag</b></body></html>"""

soup = BeautifulSoup(html, "lxml")

html
- head
  - title
- body
  - p
  - p
  - p
  - a
  - b

# title tags
soup.find_all("title")

[<title>test site</title>]

# p tags
soup.find_all("p")

[<p class="a" id="i">test1</p>,
 <p class="d" id="d">test2</p>,
 <p class="c">test3</p>]

Finding tags by id

# tags whose id is "d"
soup.find_all(id = "d")

[<p class="d" id="d">test2</p>]

# filter by whether an id attribute exists
print(soup.find_all(id = True))

[<p class="a" id="i">test1</p>, <p class="d" id="d">test2</p>]

print(soup.body.find_all(id = False))

[<p class="c">test3</p>, <a>a tag</a>, <b>b tag</b>]

Finding tags by tag name and id

print(soup.find_all("p", id = "d"))
print(soup.find_all("p", id = "c")) # no p tag has id "c", so the result is an empty list

[<p class="d" id="d">test2</p>]
[]

Finding tags by tag name and class

print(soup.find_all("p", class_ = "d"))
print(soup.find_all("p", class_ = "c"))

[<p class="d" id="d">test2</p>]
[<p class="c">test3</p>]

Finding tags by their text

print(soup.find_all("p", string = "test1")) # p tags whose text is exactly "test1"

[<p class="a" id="i">test1</p>]

※ The older text argument gives the same result but emits a DeprecationWarning ("The 'text' argument to find()-type methods is deprecated. Use 'string' instead."), so string is used here
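The string argument also accepts a compiled regular expression, which helps when the text is only partly known. A sketch under that assumption follows; the third p tag's text is changed to "other" here so the pattern has something to exclude, and html.parser is assumed.

```python
import re
from bs4 import BeautifulSoup

html = '<p class="a" id="i">test1</p><p class="d" id="d">test2</p><p class="c">other</p>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all("p", string="test1")                     # exact match only
by_pattern = soup.find_all("p", string=re.compile(r"^test\d$"))  # "test" plus one digit

print(len(exact), [p.get_text() for p in by_pattern])
# → 1 ['test1', 'test2']
```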

Limiting the number of results with limit

print(soup.find_all("p", limit = 2))

[<p class="a" id="i">test1</p>, <p class="d" id="d">test2</p>]

print(soup.find_all("p", limit = 4)) # no error even when limit exceeds the number of matches

[<p class="a" id="i">test1</p>, <p class="d" id="d">test2</p>, <p class="c">test3</p>]

Finding several tag names at once

soup.find_all(["a", "b"])

[<a>a tag</a>, <b>b tag</b>]

Chaining find_all() calls

tag_body = soup.find_all("body")
print(tag_body)

[<body> <p class="a" id="i">test1</p><p class="d" id="d">test2</p><p class="c">test3</p><a>a tag</a> <b>b tag</b></body>]

tag_body.find_all("p")

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[77], line 1
----> 1 tag_body.find_all("p")

File C:\ProgramData\anaconda3\Lib\site-packages\bs4\element.py:2433, in ResultSet.__getattr__(self, key)
   2431 def __getattr__(self, key):
   2432     """Raise a helpful exception to explain a common code fix."""
-> 2433     raise AttributeError(
   2434         "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
   2435     )

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

tag_body[0].find_all("p")

[<p class="a" id="i">test1</p>,
 <p class="d" id="d">test2</p>,
 <p class="c">test3</p>]
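Indexing with [0] works when there is one container. With several, the ResultSet can be looped over and find_all() called on each element. A minimal sketch with made-up markup and html.parser assumed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two containers.
html = "<div><p>a</p><p>b</p></div><div><p>c</p></div>"
soup = BeautifulSoup(html, "html.parser")

divs = soup.find_all("div")  # a ResultSet: behaves like a list of Tag objects

# Call find_all() on each Tag, never on the ResultSet itself.
per_div = [[p.get_text() for p in div.find_all("p")] for div in divs]

print(per_div)
# → [['a', 'b'], ['c']]
```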

find()

Returns only the first matching element, or None when nothing matches
Use it when you expect a single match

soup.find("p")

<p class="a" id="i">test1</p>

print(soup.find("p", class_ = "d")) # p tag whose class is d
print(soup.find("p", id = "i")) # p tag whose id is i
print(soup.find(id = "i")) # any tag whose id is i

<p class="d" id="d">test2</p>
<p class="a" id="i">test1</p>
<p class="a" id="i">test1</p>

# chaining find() calls
soup.find("body").find("p", class_ = "d")

<p class="d" id="d">test2</p>
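Since find() returns None when nothing matches, a chained call such as soup.find("body").find(...) fails with AttributeError if an earlier step finds nothing. One way to guard against that is sketched below; the helper name text_or_default is hypothetical and html.parser is assumed.

```python
from bs4 import BeautifulSoup

def text_or_default(soup, name, default="", **attrs):
    """Return the text of the first match, or `default` when find() returns None."""
    tag = soup.find(name, **attrs)
    return tag.get_text(strip=True) if tag is not None else default

html = '<body><p class="d" id="d">test2</p></body>'
soup = BeautifulSoup(html, "html.parser")

found = text_or_default(soup, "p", class_="d")
absent = text_or_default(soup, "p", default="(none)", id="zzz")

print(found, absent)
# → test2 (none)
```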

select()

Like find_all(), returns every match as a list
Takes a CSS selector: a class is written with a period (.), an id with a hash (#), a child tag with >, and a descendant tag with a space
select_one() returns only the first match

print(soup.select("p")) # all p tags
print(soup.select(".d")) # tags whose class is d
print(soup.select("p.d")) # p tags whose class is d
print(soup.select("#i")) # tags whose id is i
print(soup.select("p#i")) # p tags whose id is i

[<p class="a" id="i">test1</p>, <p class="d" id="d">test2</p>, <p class="c">test3</p>]
[<p class="d" id="d">test2</p>]
[<p class="d" id="d">test2</p>]
[<p class="a" id="i">test1</p>]
[<p class="a" id="i">test1</p>]

html = """<html> <head><title>test site</title></head> <body> <div><p>test1</p><p>test2</p></div><p>test3</p> <a>a tag</a> <b>b tag</b></body></html>"""

soup = BeautifulSoup(html, "lxml")

html
- head
  - title
- body
  - div
    - p
    - p
  - p
  - a
  - b
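With this nested structure the child (>) and descendant (space) selectors give different results. The sketch below assumes the markup matches the tree above, with no class or id attributes, and uses html.parser to avoid an extra install.

```python
from bs4 import BeautifulSoup

# Assumed markup, built to match the tree: body > div > (p, p), then p, a, b.
html = """<html> <head><title>test site</title></head> <body> <div><p>test1</p><p>test2</p></div><p>test3</p> <a>a tag</a> <b>b tag</b></body></html>"""
soup = BeautifulSoup(html, "html.parser")

descendants = [p.get_text() for p in soup.select("body p")]     # any depth under body
children = [p.get_text() for p in soup.select("body > p")]      # direct children only
in_div = [p.get_text() for p in soup.select("div > p")]         # direct children of div
first = soup.select_one("div p").get_text()                     # first match only

print(descendants, children, in_div, first)
# → ['test1', 'test2', 'test3'] ['test3'] ['test1', 'test2'] test1
```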

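05. Crawling permissions

Most sites publish a robots.txt file that states which paths crawlers may access
- Append robots.txt to the site's root URL to view it
  - e.g. http://www.google.com/robots.txt
  - Disallow : paths that crawlers are asked not to access
  - Allow : paths that may be crawled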
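The same check can be done in code with the standard library's urllib.robotparser. The sketch below parses a made-up rule set inline so that it runs without network access; the rules and the example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content.
rules = """User-agent: *
Disallow: /admin/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # for a live site: rp.set_url(".../robots.txt"); rp.read()

blocked = rp.can_fetch("*", "https://example.com/admin/page")
allowed = rp.can_fetch("*", "https://example.com/1")

print(blocked, allowed)
# → False True
```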

Example 1-1. Crawling a Tistory post

url = "https://xxxxx.tistory.com/1"

page = urlopen(url)

soup = BeautifulSoup(page, "lxml")

soup

The HTML of that URL is printed

Collecting the title

soup.select_one("div.hgroup > h1").string

Collecting the post body

soup.select_one("div.contents_style.tt_article_useless_p_margin > p").string
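Both lines above assume the selector matches and that the tag holds a single string. If the paragraph contains inline tags, .string is None, and if the selector misses, select_one() returns None. A more defensive sketch follows. The inline HTML only imitates the class names used above, and the helper name select_text is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure the selectors expect.
html = """
<div class="hgroup"><h1> Sample title </h1></div>
<div class="contents_style tt_article_useless_p_margin">
  <p>First <b>bold</b> paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def select_text(soup, css, default=""):
    """Text of the first match for a CSS selector, or `default` if nothing matches."""
    tag = soup.select_one(css)
    return tag.get_text(" ", strip=True) if tag is not None else default

title = select_text(soup, "div.hgroup > h1")
body = select_text(soup, "div.contents_style.tt_article_useless_p_margin > p")
missing = select_text(soup, "div.no_such_class > p", default="(none)")

print(title, "|", body, "|", missing)
# → Sample title | First bold paragraph | (none)
```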

Example 1-2. Crawling a Tistory post

Crawl the contents of the table -> store them in a list
Crawl the list items (모니터, CPU, ...) -> store them in a list

url = "https://XXXXXX.com/2"

page = urlopen(url)

soup = BeautifulSoup(page, "lxml")

Collecting the table contents

・Collect every cell of the table into a single list

table_text = soup.select("div.tt_article_useless_p_margin td")

table_list = []

for i in table_text:
    table_list.append(i.string)

# the same thing as a list comprehension
#[i.string for i in table_text]

table_list

['상품',
 '색상',
 '가격',
 '셔츠1',
 '빨강',
 '20000',
 '셔츠2',
 '파랑',
 '19000',
 '셔츠3',
 '초록',
 '18000',
 '바지1',
 '검정',
 '50000',
 '바지2',
 '파랑',
 '51000']

result = soup.select("div.tt_article_useless_p_margin li")

[i.string for i in result]

['모니터', 'CPU', '메모리', '그래픽카드', '하드디스크', '키보드', '마우스']
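The flat list loses the row boundaries. Selecting tr first and then the td cells inside each row keeps them. The sketch below uses a small made-up table with English labels, since the real page is not available here, and assumes html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical table with a header row followed by data rows.
html = """<table>
<tr><td>item</td><td>color</td><td>price</td></tr>
<tr><td>shirt1</td><td>red</td><td>20000</td></tr>
<tr><td>shirt2</td><td>blue</td><td>19000</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

# One inner list per table row.
rows = [[td.get_text(strip=True) for td in tr.select("td")] for tr in soup.select("tr")]

# Treat the first row as the header and pair it with each data row.
header, *data = rows
records = [dict(zip(header, row)) for row in data]

print(records[0])
# → {'item': 'shirt1', 'color': 'red', 'price': '20000'}
```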

Crawling shopping-mall product images

url = "https://shoppingmallstuffimage.co.xx/1/"
page = urlopen(url)
soup = BeautifulSoup(page, "html")

#soup.select_one("div.prdImg > a > img")["src"]

# .get() is safer: it returns the default instead of raising KeyError when src is missing
img_url = soup.select_one("div.prdImg > a > img").get("src", "")

img_url

urlopen(img_url)

with open("test.jpg", "wb") as f: # wb: write in binary mode
    f.write(urlopen(img_url).read())

# Download every product image on the page (the ./crawl_img directory must already exist)
img_urls = [i.get("src", "") for i in soup.select("div.prdImg > a > img")]

for idx, item in enumerate(img_urls):
    with open(f"./crawl_img/{idx}.jpg", "wb") as f:
        f.write(urlopen(item).read())
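In practice src values are often relative or protocol-relative (starting with //), and urlopen needs an absolute URL. A sketch of resolving them with urljoin and creating the output directory first follows; the shop.example URLs and the helper names build_targets and download_all are hypothetical.

```python
import os
from urllib.parse import urljoin
from urllib.request import urlopen

def build_targets(page_url, srcs, out_dir):
    """Pair each non-empty src (made absolute against page_url) with a local file path."""
    usable = [s for s in srcs if s]  # skip empty strings returned by .get("src", "")
    return [
        (urljoin(page_url, src), os.path.join(out_dir, f"{idx}.jpg"))
        for idx, src in enumerate(usable)
    ]

def download_all(targets, out_dir):
    """Create the directory if needed, then save each image in binary mode."""
    os.makedirs(out_dir, exist_ok=True)
    for url, path in targets:
        with urlopen(url) as resp, open(path, "wb") as f:
            f.write(resp.read())

# Building the targets does not touch the network; download_all(...) would.
targets = build_targets(
    "https://shop.example/1/",
    ["//cdn.shop.example/a.jpg", "/img/b.jpg", ""],
    "crawl_img",
)
for url, path in targets:
    print(url, "->", path)
```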
