Web Scraping

Scraping Stacks

  • Scrapy
    • Full Framework
    • Support for many tools below
    • Good for big projects
  • Requests/Beautiful Soup/Pandas
    • Good for beginners
    • Easy use with Google Colab
    • Speed can be improved via multithreading libraries such as asyncio
  • Selenium/Pandas
    • Slowest but also extremely versatile

Essential Tools

  • Developer Tools (I use Edge)
    • XPath Testing
    • Copy API requests as Curl (bash)
      • Convert with Curl to Python website
    • View webpage source files/storage
  • Regular Expressions
    • Data extraction when site structure is poor
    • Versatile, but steep learning curve
  • XPaths
    • Harder than Beautiful Soup but easier than Regular Expressions
    • Easily tested in devtools
  • CSS Selectors
    • More readable than XPaths
    • Not always applicable
  • Splash
    • Can be used to render javascript without selenium
    • Great Integration with Scrapy

Deciding What Stack To Use

graph TD; 1(start)-->q1 q1{Is Site Static?}-- No -->q2{Does Site Use <br> Api for Data?} q2-- Yes --> e1(Recreate requests in Colab <br>Notebook and convert responses <br>to pandas datafasme); q2-- No --> q4{What is the scale <br> of the project?} q4-- Larger Scale --> e4(Scrapy Project Using <br>Selenium or Splash) q4-- Smaller Scale --> e5(Selenium Project) q1-- Yes --> q3{What is the scale <br> of the project?} q3-- Larger Scale --> e2(Create Scrapy Project) q3-- Smaller Scale --> e3(Requests and Beautiful Soup <br>in Colab Notebok)

Things to Consider

  • How much time will I be spending on this?
  • Do I want to risk my IP being blocked (Colab great workaround)
  • How much data am I collecting (csv vs Database storage)
  • How often will the site be scraped
  • Legality/Morality of scrape