diff --git a/Corbia spider.md b/Corbia spider.md
index 0928e59..e6dc35c 100644
--- a/Corbia spider.md
+++ b/Corbia spider.md
@@ -1,4 +1,49 @@
 In order to index the web properly, corbia spider uses domain based templates.
-For each domain, a custom python function will be used in order to extract the relevant information and store it as JSON before further processing.
\ No newline at end of file
+## Domain declaration
+
+For each domain, a custom Python function extracts the relevant information and stores it as JSON before further processing.
+
+New domains are added to the `Domain` class by registering their scraper in the `scrapers` dictionary:
+
+```python
+import json
+
+from bs4 import BeautifulSoup
+
+
+class Domain:
+
+    def __init__(self, domain, html, url):
+        self.domain = domain
+        self.soup = BeautifulSoup(html, 'html.parser')
+        self.url = url
+        # Map each supported domain to its dedicated scraping method.
+        self.scrapers = {
+            "www.azlyrics.com": self.scrape_azlyrics,
+            "www.monde-diplomatique.fr": self.scrape_diplo,
+            "www.amnesty.org": self.scrape_amnesty,
+            "www.vindefrance.com": self.scrape_vdf,
+            "www.tasteofcinema.com": self.scrape_taste_of_cinema,
+            "www.blast-info.fr": self.scrape_blast,
+            "www.byredo.com": self.scrape_byredo,
+        }
+```
+
+The scraping function itself is then added to the class as `def scrape_domain(self)`.
+
+## Domain-based scraping function
+
+The function must **ALWAYS** return a JSON string with a `"success"` key set to either `True` or `False`.
+
+To make sure the spider doesn't stop working, the scraping function must be wrapped in a `try` / `except` block:
+
+```python
+def scrape_domain(self):
+    try:
+        data = {
+            "success": True
+        }
+        return json.dumps(data, indent=4, ensure_ascii=False)
+    except Exception:
+        return json.dumps({"success": False})
+```
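The diff above registers one scraper method per domain but does not show how the spider picks one. A minimal sketch of how that dispatch could work; the `scrape` dispatcher, the unknown-domain fallback, and the simplified `scrape_azlyrics` body are assumptions for illustration, not part of the diff (BeautifulSoup is omitted here so the sketch stays self-contained):

```python
import json
from urllib.parse import urlparse


class Domain:
    """Sketch: route a fetched page to its domain-specific scraper."""

    def __init__(self, domain, html, url):
        self.domain = domain
        self.html = html
        self.url = url
        # Hypothetical registry, trimmed to one entry for the sketch.
        self.scrapers = {
            "www.azlyrics.com": self.scrape_azlyrics,
        }

    def scrape(self):
        # Hypothetical dispatcher: look up the scraper for this domain
        # and honour the "always return a success key" contract for
        # domains that have no template yet.
        scraper = self.scrapers.get(self.domain)
        if scraper is None:
            return json.dumps({"success": False})
        return scraper()

    def scrape_azlyrics(self):
        # Placeholder body following the documented contract.
        try:
            data = {"success": True, "url": self.url}
            return json.dumps(data, indent=4, ensure_ascii=False)
        except Exception:
            return json.dumps({"success": False})


# Usage: derive the domain from the URL, then dispatch.
url = "https://www.azlyrics.com/lyrics/example.html"
domain = urlparse(url).netloc  # "www.azlyrics.com"
result = json.loads(Domain(domain, "<html></html>", url).scrape())
```

Keeping the lookup in a dictionary means adding a domain is a one-line change plus a new method, with no `if`/`elif` chain to maintain.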
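The `try` / `except` requirement exists so that one broken page cannot stop the crawl. A hedged sketch of the contract in action; the standalone function signature and the `soup.title` extraction step are assumptions used to trigger a realistic failure:

```python
import json


def scrape_domain(soup=None):
    """Sketch of the contract: always return JSON with a "success" key."""
    try:
        # Hypothetical extraction step that may raise
        # (AttributeError here, since soup is None).
        title = soup.title.string
        data = {"success": True, "title": title}
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception:
        # The failure never propagates: the spider keeps crawling
        # and downstream code just sees "success": False.
        return json.dumps({"success": False})


result = json.loads(scrape_domain(None))
# result["success"] is False, and no exception escaped
```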