DBILITY

python html parse and image file save

DBILITY 2021. 8. 13. 11:11
  1. requests
    Requests is a simple, yet elegant, HTTP library.
    >>> import requests
    >>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
    >>> r.status_code
    200
    >>> r.headers['content-type']
    'application/json; charset=utf8'
    >>> r.encoding
    'utf-8'
    >>> r.text
    '{"type":"User"...'
    >>> r.json()
    {'disk_usage': 368627, 'private_gists': 484, ...}
    
    # save an image
    imgRequest = requests.get(image_url)
    image = open(file_name, mode='wb')
    image.write(imgRequest.content)
    image.close()
    # or, with a context manager that closes the file automatically
    with open(file_name, 'wb') as image:
        image.write(imgRequest.content)
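The snippet above holds the whole image in memory and writes whatever the server returned, including error pages. A minimal sketch of a more defensive variant follows, assuming streaming is wanted. The helper names `write_chunks` and `download_image`, the 10-second timeout, and the 8192-byte chunk size are illustrative choices, not part of the original post.

```python
def write_chunks(chunks, file_name):
    """Write an iterable of bytes chunks to file_name; return bytes written."""
    total = 0
    with open(file_name, "wb") as image:
        for chunk in chunks:
            total += image.write(chunk)
    return total


def download_image(image_url, file_name, timeout=10):
    """Stream image_url to file_name, raising on HTTP errors (hypothetical helper)."""
    # Third-party import kept local so write_chunks can be used without requests.
    import requests

    with requests.get(image_url, stream=True, timeout=timeout) as response:
        # Raise requests.HTTPError on 4xx/5xx instead of saving an error page.
        response.raise_for_status()
        return write_chunks(response.iter_content(chunk_size=8192), file_name)
```

Calling `download_image(image_url, file_name)` returns the number of bytes written, which can serve as a quick sanity check that the file is not empty.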

  2. BeautifulSoup4
    Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup("<p>Some<b>bad<i>HTML","html.parser")
    >>> print(soup.prettify())
    <html>
     <body>
      <p>
       Some
       <b>
        bad
        <i>
         HTML
        </i>
       </b>
      </p>
     </body>
    </html>
    >>> soup.find(text="bad")
    'bad'
    >>> soup.i
    <i>HTML</i>
    # XML parsing (the "xml" parser requires lxml to be installed)
    >>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
    >>> print(soup.prettify())
    <?xml version="1.0" encoding="utf-8"?>
    <tag1>
     Some
     <tag2/>
     bad
     <tag3>
      XML
     </tag3>
    </tag1>
    # For select() and find_all(), see the manual.
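As a rough illustration of how `select()` and `find_all()` can feed the image-saving step, the sketch below collects absolute image URLs from an HTML string. The function name `extract_image_urls` and the sample markup (including `/static/logo.png`) are made up for this example; only the base URL comes from this post.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_image_urls(html, base_url):
    """Return absolute URLs for every <img> tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # select() takes a CSS selector; "img[src]" skips <img> tags without src.
    return [urljoin(base_url, img["src"]) for img in soup.select("img[src]")]


# Hypothetical page fragment, used only for illustration.
sample_html = """
<div class="figure"><img src="_images/6.1.jpg" alt="figure"></div>
<p><img src="/static/logo.png"><img alt="no source"></p>
"""
base = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/"
urls = extract_image_urls(sample_html, base)

# find_all() reaches the same tags with keyword filters instead of a selector.
soup = BeautifulSoup(sample_html, "html.parser")
same = [urljoin(base, img["src"]) for img in soup.find_all("img", src=True)]
```

`urljoin` resolves relative `src` values against the page URL, so both `_images/6.1.jpg` and `/static/logo.png` become URLs that can be passed straight to a download call.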
  3. urllib
    # save a file
    import urllib.request
    # urllib.request.urlretrieve(image_url, file_name)
    urllib.request.urlretrieve("https://www.crummy.com/software/BeautifulSoup/bs4/doc/_images/6.1.jpg","6.1.jpg")
    Saved result:

6.1.jpg
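The call above hardcodes the local file name. One possible way to derive it from the URL instead, using only the standard library, is sketched below. `file_name_from_url`, `save_url`, and the `download.bin` fallback are hypothetical names chosen for this example.

```python
import os
import urllib.request
from urllib.parse import unquote, urlsplit


def file_name_from_url(url, default="download.bin"):
    """Use the last path segment of url as a local file name."""
    # urlsplit().path drops the query string and fragment before basename().
    name = os.path.basename(unquote(urlsplit(url).path))
    return name or default


def save_url(url, directory="."):
    """Download url into directory and return the path of the saved file."""
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, file_name_from_url(url))
    path, _headers = urllib.request.urlretrieve(url, target)
    return path
```

With the URL from this post, `file_name_from_url` yields `6.1.jpg`, so `save_url(url)` would store the image under the same name as the original example.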

 
