[Web Crawling] Beautiful Soup

데이터 세상 2022. 4. 27. 09:37

Beautiful Soup

A Python library for pulling data out of HTML and XML files.
One of the main libraries used for web scraping.

www.crummy.com/software/BeautifulSoup/bs4/doc/

Beautiful Soup Documentation — Beautiful Soup 4.9.0 documentation


Installing Beautiful Soup

pip install beautifulsoup4
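
To confirm that the installation worked, a quick check along the following lines can help. This is only a sketch: it imports the package under its module name, bs4, prints the installed version, and parses a one-line document.

```python
# Quick sanity check after installation (illustrative only)
import bs4
from bs4 import BeautifulSoup

# The package is installed as beautifulsoup4, but the importable module is bs4
print(bs4.__version__)

# Parse a tiny document to confirm that the built-in parser is usable
soup = BeautifulSoup('<p>ok</p>', 'html.parser')
print(soup.p.text)
```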

Using Beautiful Soup

Fetching an HTML page from a URL

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Download the page, then parse the response with Python's built-in HTML parser
html = urlopen('https://www.pythonscraping.com/pages/page1.html')
bsObj = BeautifulSoup(html, 'html.parser')
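
In practice a request can fail (the page may be missing or the server unreachable), so it may be worth wrapping the fetch in error handling. The following is a minimal sketch, not a definitive implementation: the helper name fetch_soup and its timeout parameter are hypothetical and are not part of Beautiful Soup or urllib.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Hypothetical helper: return a BeautifulSoup object, or None on failure."""
    try:
        # The context manager closes the connection once the body is read
        with urlopen(url, timeout=timeout) as response:
            return BeautifulSoup(response.read(), 'html.parser')
    except HTTPError as err:
        # The server answered, but with an error status such as 404
        print(f'Server returned an error for {url}: {err.code}')
    except URLError as err:
        # The server could not be reached, or the URL scheme is unsupported
        print(f'Could not open {url}: {err.reason}')
    return None
```

Calling fetch_soup('https://www.pythonscraping.com/pages/page1.html') would then return either a parsed document or None, which the caller should check before using the result.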

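find(): getting a specific tag from the HTML

find() returns the first tag that matches. When several tags share the same name and you want the one with a particular attribute, pass that attribute as an additional argument.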

html_str = '''
<html>
    <body>
        <ul class='greet'>
            <li>hello</li>
            <li>bye</li>
            <li>welcome</li>
        </ul>
        <ul class='mind'>
            <li>good</li>
            <li>bad</li>
            <li>love</li>
        </ul>
    </body>
</html>
'''

bsObj = BeautifulSoup(html_str, 'html.parser')
# Select the <ul> whose class attribute is 'mind'
ul_tag = bsObj.find('ul', {'class':'mind'})
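
The behavior of find() described above can be sketched as follows. The shortened HTML string and the variable names are made up for illustration; class_ is the keyword form that Beautiful Soup provides because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

html_doc = '''
<ul class='greet'><li>hello</li><li>bye</li></ul>
<ul class='mind'><li>good</li><li>bad</li></ul>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Without an attribute filter, find() returns the first matching tag
first_ul = soup.find('ul')

# Two equivalent ways to filter by the class attribute
mind_ul = soup.find('ul', {'class': 'mind'})
same_ul = soup.find('ul', class_='mind')

# When nothing matches, find() returns None rather than raising an error
missing = soup.find('ul', {'class': 'no-such-class'})

print(first_ul['class'], mind_ul is same_ul, missing)
```

Because find() can return None, it is safer to check the result before accessing attributes on it.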

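findAll(): getting all matching tags from the HTML

findAll() returns its results as a list (a ResultSet, which is a subclass of list). In Beautiful Soup 4 the same method is also available as find_all(), which is the name used in the current documentation.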

# Collect every <li> inside the selected <ul>
lis = ul_tag.findAll('li')
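
As a small illustration of the list-like result, the sketch below uses find_all() on a made-up HTML string. The limit argument is a real find_all() parameter that caps the number of results; the variable names are only for this example.

```python
from bs4 import BeautifulSoup

html_doc = '''
<ul class='greet'><li>hello</li><li>bye</li><li>welcome</li></ul>
<ul class='mind'><li>good</li><li>bad</li><li>love</li></ul>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Every <li> in the whole document
all_items = soup.find_all('li')

# Only the <li> tags inside the <ul class='mind'> element
mind_items = soup.find('ul', class_='mind').find_all('li')

# limit stops the search after the given number of matches
first_two = soup.find_all('li', limit=2)

# The result supports len(), indexing, and iteration like any list
print(len(all_items), mind_items[0].text, len(first_two))
```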

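text attribute: the text contained inside a tag

# Print the text of each <li> in turn
for li in lis:
    print(li.text)

# .text on the parent returns the text of all its descendants joined together
print(ul_tag.text)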
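
When .text is read from a parent tag, the whitespace between child tags is included in the result. The sketch below compares it with get_text(), which accepts a separator and a strip flag; the HTML string and variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

html_doc = '''
<ul class='mind'>
    <li>good</li>
    <li>bad</li>
    <li>love</li>
</ul>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
ul_tag = soup.find('ul', class_='mind')

# .text keeps the newlines and indentation between the <li> tags
raw_text = ul_tag.text

# get_text() can strip each piece and join the pieces with a separator
clean_text = ul_tag.get_text(' ', strip=True)

# Per-item text, stripped of surrounding whitespace
item_texts = [li.get_text(strip=True) for li in ul_tag.find_all('li')]

print(repr(raw_text))
print(clean_text)
print(item_texts)
```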

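Getting attribute values from a tag

obj['attribute name'] returns the value of that attribute on the tag, for example the href of an a tag. (A child tag, by contrast, is reached with dot notation such as obj.a.)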

html_str2 = '''
<html>
    <body>
        <ul class='ok'>
            <li>
                <a href='https://www.naver.com'>네이버</a>
            </li>
            <li>
                <a href='https://www.daum.net'>다음</a>
            </li>
        </ul>
        <ul class='fsite'>
            <li>
                <a href='https://www.facebook.com'>페이스북</a>
            </li>
            <li>
                <a href='https://www.google.com'>구글</a>
            </li>
        </ul>
    </body>
</html>
'''

bsObj = BeautifulSoup(html_str2, 'html.parser')

Getting the link URLs from a tags

# Find every <a> tag, then read the href attribute of each one
a_tags = bsObj.findAll('a')
link_list = [a_tag['href'] for a_tag in a_tags]
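
Subscript access such as a_tag['href'] raises a KeyError when the attribute is absent, which can happen on real pages. The following sketch shows some alternatives using a made-up HTML string in which the second a tag has no href; the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

html_doc = '''
<ul class='ok'>
    <li><a href='https://www.naver.com'>네이버</a></li>
    <li><a name='no-link'>placeholder</a></li>
</ul>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Dot notation reaches a child tag; subscripting reads one of its attributes
first_a = soup.find('li').a
first_href = first_a['href']

# .attrs exposes all attributes of a tag as a dictionary
first_attrs = first_a.attrs

# .get() returns None instead of raising KeyError for a missing attribute
second_a = soup.find_all('a')[1]
second_href = second_a.get('href')

# href=True keeps only the <a> tags that actually have an href attribute
link_list = [a['href'] for a in soup.find_all('a', href=True)]

print(first_href, second_href, link_list)
```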

Getting the link text from a tags

title_list = [a_tag.text for a_tag in a_tags]
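
The link text and the link URL are often needed together. One possible way to pair them, sketched under the assumption that each link text is unique, is a dictionary built in a single pass; the name site_links is made up for this example.

```python
from bs4 import BeautifulSoup

html_doc = '''
<ul class='ok'>
    <li><a href='https://www.naver.com'>네이버</a></li>
    <li><a href='https://www.daum.net'>다음</a></li>
</ul>
<ul class='fsite'>
    <li><a href='https://www.facebook.com'>페이스북</a></li>
    <li><a href='https://www.google.com'>구글</a></li>
</ul>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Map each link's visible text to its URL (assumes the texts are unique)
site_links = {
    a.get_text(strip=True): a['href']
    for a in soup.find_all('a', href=True)
}

for title, link in site_links.items():
    print(title, link)
```

If the same link text can appear more than once, a list of (text, href) tuples would avoid overwriting earlier entries.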

License: Attribution, NonCommercial, NoDerivatives (CC BY-NC-ND)
