vault backup: 2024-01-02 12:59:04
Affected files: Corbia spider.md
This commit is contained in:
parent
561a835150
commit
3f459c3929
To index the web properly, Corbia spider uses domain-based templates.
For each domain, a custom Python function extracts the relevant information and stores it as JSON before further processing.
## Domain declaration
New domains are added to the `Domain` class:
```python
import json  # used by the per-domain scrapers below

from bs4 import BeautifulSoup


class Domain:

    def __init__(self, domain, html, url):
        self.domain = domain
        self.soup = BeautifulSoup(html, 'html.parser')
        self.url = url
        # Maps each supported domain to its dedicated scraping method.
        self.scrapers = {
            "www.azlyrics.com": self.scrape_azlyrics,
            "www.monde-diplomatique.fr": self.scrape_diplo,
            "www.amnesty.org": self.scrape_amnesty,
            "www.vindefrance.com": self.scrape_vdf,
            "www.tasteofcinema.com": self.scrape_taste_of_cinema,
            "www.blast-info.fr": self.scrape_blast,
            "www.byredo.com": self.scrape_byredo
        }
```
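
With the registry in place, dispatching becomes a dictionary lookup. As a minimal sketch (the `scrape()` entry point below is an assumed name, not quoted from this note), the spider could route each page like this:

```python
    # Hypothetical dispatcher method on Domain (name assumed):
    def scrape(self):
        scraper = self.scrapers.get(self.domain)
        if scraper is None:
            # Unsupported domain: reuse the mandated failure payload.
            return json.dumps({"success": False})
        return scraper()
```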
The scraping function itself is then added to the class as `def scrape_domain(self)`.
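
To show how the pieces fit together, here is a hedged end-to-end sketch; `handle_page` and the `scrape()` dispatcher are hypothetical names, and fetching the HTML is assumed to happen elsewhere in the spider:

```python
from urllib.parse import urlparse

def handle_page(url, html):
    # Derive the domain from the URL, wrap the page in Domain,
    # and dispatch to whatever scraper is registered for it.
    domain = urlparse(url).netloc
    page = Domain(domain, html, url)
    return page.scrape()
```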
## Domain-based scraping function
The function must **ALWAYS** return a JSON string with a `"success"` key set to either `True` or `False`.
To make sure the spider doesn't stop working when a single page fails, the scraping function must be wrapped inside a `try` / `except` block:
```python
def scrape_domain(self):
    try:
        # Extract the relevant fields from self.soup here, then
        # report success alongside the extracted data.
        data = {
            "success": True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception:
        # Any scraping error is swallowed so the spider keeps running;
        # the caller only sees the failure flag.
        return json.dumps({"success": False})
```
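
For illustration, a filled-in scraper could look like the sketch below; the `<h1>` / `<article>` selectors and the `"title"` / `"text"` keys are assumptions, not the logic of any real `scrape_*` method:

```python
    def scrape_example(self):
        try:
            # Illustrative selectors only; each real domain needs its own.
            title = self.soup.find("h1").get_text(strip=True)
            body = self.soup.find("article").get_text(" ", strip=True)
            data = {
                "success": True,
                "url": self.url,
                "title": title,
                "text": body
            }
            return json.dumps(data, indent=4, ensure_ascii=False)
        except Exception:
            # A missing tag raises AttributeError, which lands here.
            return json.dumps({"success": False})
```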