To index the web properly, the corbia spider uses domain-based templates.

## Domain declaration

For each domain, a custom Python function extracts the relevant information and stores it as JSON before further processing. New domains are registered in the `Domain` class:

```python
import json

from bs4 import BeautifulSoup


class Domain:
    def __init__(self, domain, html, url):
        self.domain = domain
        self.soup = BeautifulSoup(html, 'html.parser')
        self.url = url
        self.scrapers = {
            "www.azlyrics.com" : self.scrape_azlyrics,
            "www.monde-diplomatique.fr" : self.scrape_diplo,
            "www.amnesty.org" : self.scrape_amnesty,
            "www.vindefrance.com" : self.scrape_vdf,
            "www.tasteofcinema.com" : self.scrape_taste_of_cinema,
            "www.blast-info.fr" : self.scrape_blast,
            "www.byredo.com" : self.scrape_byredo
        }
```

Then, the scraping function itself is added to the class as `def scrape_domain(self)`.

## Domain-based scraping function

The function must **ALWAYS** return a JSON string with a `"success"` key set to either `True` or `False`. To make sure the spider doesn't stop working, the scraping function must be wrapped inside a `try` / `except` block:

```python
def scrape_domain(self):
    try:
        data = {
            "success" : True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception as e:
        return json.dumps({"success": False})
```

For each page processed by a domain-based function, the `BeautifulSoup` object is accessible via `self.soup`.

Consider this example html:

```html
Venmo

<div class="article">
    <h2>Venmo truffaut shabby chic organic</h2>
    <section class="text">
        <p>I'm baby wayfarers tote bag gochujang cred food truck VHS quinoa kogi Brooklyn yr vegan etsy.</p>
        <p>Portland squid DSA, raclette flannel pinterest craft beer cloud bread pour-over same.</p>
        <p>Air plant pickled man braid tilde drinking vinegar ascot DIY poke meditation iceland JOMO sustainable. Hell of tbh kombucha +1 listicle.</p>
    </section>
</div>

```

(If you're wondering what this dummy text is, it comes from [hipsum](https://hipsum.co/), a [Lorem ipsum](https://en.wikipedia.org/wiki/Lorem_ipsum)-like generator.)

From this `html`, we would like to retrieve this `json`:

```json
{
    "title" : "Venmo truffaut shabby chic organic",
    "text" : "I'm baby wayfarers tote bag gochujang cred food truck VHS quinoa kogi Brooklyn yr vegan etsy. Portland squid DSA, raclette flannel pinterest craft beer cloud bread pour-over same. Air plant pickled man braid tilde drinking vinegar ascot DIY poke meditation iceland JOMO sustainable. Hell of tbh kombucha +1 listicle.",
    "success" : true
}
```

The matching scraping function:

```python
def scrape_domain(self):
    try:
        content = self.soup.find("div", class_="article")
        title = content.find("h2")
        article_text = content.find("section", class_="text")
        data = {
            "title" : title.text.strip(),
            "text" : article_text.text.strip(),
            "success" : True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception as e:
        print(e)
        return json.dumps({"success": False})
```
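The step that routes a fetched page to the right entry in `self.scrapers` isn't shown above. Here is a minimal, stdlib-only sketch of how that dispatch might look; the `scrape()` entry point and the stub scraper body are assumptions for illustration, not part of the spider (the real class builds a `BeautifulSoup` object instead of storing raw HTML):

```python
import json
from urllib.parse import urlparse


class Domain:
    """Stripped-down sketch: only the dispatch logic, no real parsing."""

    def __init__(self, domain, html, url):
        self.domain = domain
        self.html = html  # the real class stores BeautifulSoup(html, 'html.parser')
        self.url = url
        self.scrapers = {
            "www.azlyrics.com": self.scrape_azlyrics,
        }

    def scrape(self):
        # Hypothetical entry point: look up the scraper registered for
        # this domain and fail soft on unknown domains, so the spider
        # always gets back a JSON string with a "success" key.
        scraper = self.scrapers.get(self.domain)
        if scraper is None:
            return json.dumps({"success": False})
        return scraper()

    def scrape_azlyrics(self):
        # Stub standing in for a real domain-based scraping function.
        try:
            data = {"success": True}
            return json.dumps(data, indent=4, ensure_ascii=False)
        except Exception:
            return json.dumps({"success": False})


url = "https://www.azlyrics.com/lyrics/a/b.html"
page = Domain(urlparse(url).netloc, "<html></html>", url)
print(json.loads(page.scrape())["success"])  # prints True
```

Because unknown domains return `{"success": false}` instead of raising, a page from an unregistered site degrades gracefully rather than crashing the crawl.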