In order to index the web properly, the corbia spider uses domain-based templates.

## Domain declaration

For each domain, a custom Python function extracts the relevant information and stores it as JSON before further processing. New domains are added to the `Domain` class:

```python
import json

from bs4 import BeautifulSoup


class Domain:
    def __init__(self, domain, html, url):
        self.domain = domain
        self.soup = BeautifulSoup(html, 'html.parser')
        self.url = url
        # Maps each supported domain to its dedicated scraping method.
        self.scrapers = {
            "www.azlyrics.com": self.scrape_azlyrics,
            "www.monde-diplomatique.fr": self.scrape_diplo,
            "www.amnesty.org": self.scrape_amnesty,
            "www.vindefrance.com": self.scrape_vdf,
            "www.tasteofcinema.com": self.scrape_taste_of_cinema,
            "www.blast-info.fr": self.scrape_blast,
            "www.byredo.com": self.scrape_byredo
        }
```

Then, the scraping function itself is added to the class as `def scrape_domain(self)`.

## Domain-based scraping function

The function must **ALWAYS** return a JSON string with a `"success"` key set to either `true` or `false`. To make sure the spider doesn't stop working, the scraping function must be wrapped inside a `try` / `except` block:

```python
def scrape_domain(self):
    try:
        data = {
            "success": True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception as e:
        # Never let a scraper crash the spider: report the failure instead.
        return json.dumps({"success": False})
```

For each page processed by a domain-based function, the parsed `BeautifulSoup` object is accessible via `self.soup`.

Consider this example HTML:

```html
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Venmo</title>
</head>
<body>
    <div id="root">
        <div class="top">
            <h2>
                <ul>
                    <li><a href="/01">01</a></li>
                    <li><a href="/02">02</a></li>
                    <li><a href="/03">03</a></li>
                    <li><a href="/04">04</a></li>
                </ul>
            </h2>
        </div>
        <div class="article main">
            <h2 class="text">Venmo truffaut shabby chic organic</h2>
            <section class="text">
                <p>I'm baby wayfarers tote bag gochujang cred food truck VHS quinoa kogi Brooklyn yr vegan etsy.</p>
                <p>Portland squid DSA, raclette flannel pinterest craft beer cloud bread pour-over same.</p>
                <p>Air plant pickled man braid tilde drinking vinegar ascot DIY poke meditation iceland JOMO sustainable. Hell of tbh kombucha +1 listicle.</p>
            </section>
        </div>
        <div class="footer">
            <ul>
                <li><a href="/01">01</a></li>
                <li><a href="/02">02</a></li>
                <li><a href="/03">03</a></li>
                <li><a href="/04">04</a></li>
            </ul>
            <p class="text"></p>
        </div>
    </div>
</body>
</html>
```

(If you're wondering where this dummy text comes from, it's generated by [hipsum](https://hipsum.co/), a [Lorem ipsum](https://en.wikipedia.org/wiki/Lorem_ipsum)-like generator.)

From this HTML, we would like to retrieve this JSON:

```json
{
    "title": "Venmo truffaut shabby chic organic",
    "text": "I'm baby wayfarers tote bag gochujang cred food truck VHS quinoa kogi Brooklyn yr vegan etsy. Portland squid DSA, raclette flannel pinterest craft beer cloud bread pour-over same. Air plant pickled man braid tilde drinking vinegar ascot DIY poke meditation iceland JOMO sustainable. Hell of tbh kombucha +1 listicle.",
    "success": true
}
```

```python
def scrape_domain(self):
    try:
        # The article lives in <div class="article main">; matching the
        # "article" class is enough to select it.
        content = self.soup.find("div", class_="article")
        title = content.find("h2")
        article_text = content.find("section", class_="text")
        data = {
            "title": title.text.strip(),
            # get_text(" ", strip=True) joins the <p> paragraphs with single
            # spaces instead of newlines, matching the expected JSON above.
            "text": article_text.get_text(" ", strip=True),
            "success": True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception as e:
        # Log the error, then report a failed scrape.
        print(e)
        return json.dumps({"success": False})
```
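
To round things out, here is one way the declaration table and the scraping functions could be tied together. The `scrape()` dispatch method below is **not** part of corbia spider as described above; it is a minimal sketch, assuming the spider wants a single entry point that looks up `self.domain` in `self.scrapers` and falls back to the usual `{"success": false}` payload when a domain has no template yet:

```python
def scrape(self):
    # Hypothetical dispatch method on the Domain class: pick the scraper
    # declared for this page's domain, or report a failure when the domain
    # has no template yet.
    scraper = self.scrapers.get(self.domain)
    if scraper is None:
        return json.dumps({"success": False})
    return scraper()
```

With such a method in place, the caller would build `Domain(urlparse(url).netloc, html, url)` for each fetched page, `json.loads()` the returned string, and check `"success"` before any further processing.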