vault backup: 2024-01-02 12:59:04
Affected files: Corbia spider.md
This commit is contained in: parent 561a835150, commit 3f459c3929

To index the web properly, Corbia spider uses domain-based templates.

## Domain declaration

For each domain, a custom Python function extracts the relevant information and stores it as JSON before further processing.

New domains are added to the `Domain` class:

```python
from bs4 import BeautifulSoup

class Domain:

    def __init__(self, domain, html, url):
        self.domain = domain
        self.soup = BeautifulSoup(html, 'html.parser')
        self.url = url
        # map each supported domain to its dedicated scraper method
        self.scrapers = {
            "www.azlyrics.com": self.scrape_azlyrics,
            "www.monde-diplomatique.fr": self.scrape_diplo,
            "www.amnesty.org": self.scrape_amnesty,
            "www.vindefrance.com": self.scrape_vdf,
            "www.tasteofcinema.com": self.scrape_taste_of_cinema,
            "www.blast-info.fr": self.scrape_blast,
            "www.byredo.com": self.scrape_byredo
        }
```

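The note doesn't show how the `scrapers` mapping is invoked; a minimal sketch of the likely dispatch step, using a trimmed-down `Domain` and a placeholder scraper (the `scrape()` method name and the unknown-domain fallback are assumptions, not from the note):

```python
import json

class Domain:
    """Trimmed-down stand-in: dispatches scraping by domain name.

    The scraper method here is an illustrative placeholder, not one of
    the real site-specific functions from the note."""

    def __init__(self, domain):
        self.domain = domain
        # same shape as the note's mapping, reduced to one entry
        self.scrapers = {
            "www.azlyrics.com": self.scrape_azlyrics,
        }

    def scrape(self):
        # look up the scraper for this domain; fail gracefully if unknown
        scraper = self.scrapers.get(self.domain)
        if scraper is None:
            return json.dumps({"success": False})
        return scraper()

    def scrape_azlyrics(self):
        # placeholder: a real scraper would parse self.soup here
        return json.dumps({"success": True})
```

Looking the method up with `dict.get` keeps the spider running even when it hits a domain with no template.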
Then, the function is added to the class as `def scrape_domain(self)`.

## Domain-based scraping function

The function must **ALWAYS** return JSON with a `"success"` key set to either `True` or `False`.

To make sure the spider doesn't stop working, the scraping function must be wrapped inside a `try` / `except` block:

```python
def scrape_domain(self):
    # assumes `import json` at module level
    try:
        data = {
            "success": True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception:
        # any failure still yields valid JSON so the spider keeps running
        return json.dumps({"success": False})
```
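As a concrete illustration of that contract, a hypothetical scraper written as a standalone function (the `scrape_example` name and the naive title extraction are illustrative only; real scrapers in the note work on `self.soup` with BeautifulSoup):

```python
import json

def scrape_example(html):
    """Illustrative scraper following the note's contract:
    always return JSON with a "success" key."""
    try:
        # crude title extraction; raises ValueError if no <title> exists
        start = html.index("<title>") + len("<title>")
        end = html.index("</title>", start)
        data = {
            "success": True,
            "title": html[start:end].strip(),
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception:
        # any failure still produces valid JSON, so downstream
        # processing can simply check the "success" key
        return json.dumps({"success": False})
```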