vault backup: 2024-01-02 12:59:04
Affected files: Corbia spider.md
This commit is contained in: parent 561a835150, commit 3f459c3929

To index the web properly, Corbia spider uses domain-based templates.

## Domain declaration

For each domain, a custom Python function extracts the relevant information and stores it as JSON before further processing.

New domains are added to the `Domain` class:

```python
from bs4 import BeautifulSoup

class Domain:

    def __init__(self, domain, html, url):
        self.domain = domain
        self.soup = BeautifulSoup(html, 'html.parser')
        self.url = url
        # map each supported domain to its dedicated scraper method
        self.scrapers = {
            "www.azlyrics.com": self.scrape_azlyrics,
            "www.monde-diplomatique.fr": self.scrape_diplo,
            "www.amnesty.org": self.scrape_amnesty,
            "www.vindefrance.com": self.scrape_vdf,
            "www.tasteofcinema.com": self.scrape_taste_of_cinema,
            "www.blast-info.fr": self.scrape_blast,
            "www.byredo.com": self.scrape_byredo
        }
```

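The note doesn't show how the `scrapers` mapping is invoked; a minimal sketch of the likely dispatch step, using a trimmed-down `Domain` and a placeholder scraper (the `scrape()` method name and the unknown-domain fallback are assumptions, not from the note):

```python
import json

class Domain:
    """Trimmed-down stand-in: dispatches scraping by domain name.

    The scraper method here is an illustrative placeholder, not one of
    the real site-specific functions from the note."""

    def __init__(self, domain):
        self.domain = domain
        # same shape as the note's mapping, reduced to one entry
        self.scrapers = {
            "www.azlyrics.com": self.scrape_azlyrics,
        }

    def scrape(self):
        # look up the scraper for this domain; fail gracefully if unknown
        scraper = self.scrapers.get(self.domain)
        if scraper is None:
            return json.dumps({"success": False})
        return scraper()

    def scrape_azlyrics(self):
        # placeholder: a real scraper would parse self.soup here
        return json.dumps({"success": True})
```

Looking the method up with `dict.get` keeps the spider running even when it hits a domain with no template.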
Then, the function is added to the class as `def scrape_domain(self)`.

## Domain-based scraping function

The function must **ALWAYS** return JSON with a `"success"` key set to either `True` or `False`.

To make sure the spider doesn't stop working, the scraping function must be wrapped inside a `try` / `except` block:

```python
def scrape_domain(self):
    # assumes `import json` at module level
    try:
        data = {
            "success": True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception:
        # any failure still yields valid JSON so the spider keeps running
        return json.dumps({"success": False})
```
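As a concrete illustration of that contract, a hypothetical scraper written as a standalone function (the `scrape_example` name and the naive title extraction are illustrative only; real scrapers in the note work on `self.soup` with BeautifulSoup):

```python
import json

def scrape_example(html):
    """Illustrative scraper following the note's contract:
    always return JSON with a "success" key."""
    try:
        # crude title extraction; raises ValueError if no <title> exists
        start = html.index("<title>") + len("<title>")
        end = html.index("</title>", start)
        data = {
            "success": True,
            "title": html[start:end].strip(),
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception:
        # any failure still produces valid JSON, so downstream
        # processing can simply check the "success" key
        return json.dumps({"success": False})
```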