# Corbia spider

To index the web properly, Corbia spider uses domain-based templates.

## Domain declaration

For each domain, a custom Python function extracts the relevant information and stores it as JSON before further processing.
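
The exact fields depend on the domain. As an illustration only, a stored record might look like the sketch below; every key except `success` (whose contract is described later) is an assumption:

```python
# Illustrative record shape; only the "success" key is required (see below).
record = {
    "success": True,
    "url": "https://www.azlyrics.com/lyrics/...",
    "title": "Song title",
    "content": "Extracted text...",
}
```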

New domains are registered in the `Domain` class:

```python
import json

from bs4 import BeautifulSoup


class Domain:
    def __init__(self, domain, html, url):
        self.domain = domain
        self.soup = BeautifulSoup(html, 'html.parser')
        self.url = url
        # Maps each supported domain to its dedicated scraping method.
        self.scrapers = {
            "www.azlyrics.com": self.scrape_azlyrics,
            "www.monde-diplomatique.fr": self.scrape_diplo,
            "www.amnesty.org": self.scrape_amnesty,
            "www.vindefrance.com": self.scrape_vdf,
            "www.tasteofcinema.com": self.scrape_taste_of_cinema,
            "www.blast-info.fr": self.scrape_blast,
            "www.byredo.com": self.scrape_byredo,
        }
```

Then, the matching function is added to the class as a method, following the pattern `def scrape_domain(self)`.
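
How the spider picks a scraper at crawl time is not shown above; a minimal dispatch sketch, assuming a hypothetical `scrape_page` helper and a silent failure for unregistered domains, could look like this:

```python
from urllib.parse import urlparse

def scrape_page(html, url):
    # Derive the domain from the URL and look up its registered scraper.
    domain = urlparse(url).netloc
    page = Domain(domain, html, url)
    scraper = page.scrapers.get(domain)
    if scraper is None:
        # No template for this domain: report failure, keep crawling.
        return json.dumps({"success": False})
    return scraper()
```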

## Domain-based scraping function

The function must ALWAYS return a JSON string with a `"success"` key set to either `true` or `false`.

To make sure the spider doesn't stop working when a page fails to parse, the scraping function must be wrapped inside a try / except block:

```python
def scrape_domain(self):
    try:
        data = {
            # Domain-specific extraction logic goes here.
            "success": True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception as e:
        # Swallow the exception so one bad page cannot stop the crawl,
        # but keep the message for debugging.
        return json.dumps({"success": False, "error": str(e)})
```
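
As a sketch of what a concrete template might look like, here is a hypothetical `scrape_azlyrics` following that contract; the selector and field names are placeholder assumptions, not the actual azlyrics template:

```python
def scrape_azlyrics(self):
    try:
        # Placeholder extraction: a real template would target
        # domain-specific markup rather than the <title> tag.
        title = self.soup.find("title").get_text(strip=True)
        data = {
            "success": True,
            "domain": self.domain,
            "url": self.url,
            "title": title,
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception as e:
        return json.dumps({"success": False, "error": str(e)})
```

Called as `Domain("www.azlyrics.com", html, url).scrape_azlyrics()`, this returns either the extracted record or a failure payload, so the crawl continues either way.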