To index the web properly, corbia spider uses domain-based templates.

## Domain declaration

For each domain, a custom Python function extracts the relevant information and stores it as JSON before further processing.

New domains are added to the `Domain` class:

```python
import json

from bs4 import BeautifulSoup


class Domain:

    def __init__(self, domain, html, url):
        self.domain = domain
        self.soup = BeautifulSoup(html, 'html.parser')
        self.url = url
        # Maps each supported domain to its dedicated scraping method.
        self.scrapers = {
            "www.azlyrics.com": self.scrape_azlyrics,
            "www.monde-diplomatique.fr": self.scrape_diplo,
            "www.amnesty.org": self.scrape_amnesty,
            "www.vindefrance.com": self.scrape_vdf,
            "www.tasteofcinema.com": self.scrape_taste_of_cinema,
            "www.blast-info.fr": self.scrape_blast,
            "www.byredo.com": self.scrape_byredo
        }
```

Then, the function is added to the class as `def scrape_domain(self)`; the spider can then pick the right function by looking up `self.scrapers`, as sketched below.

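A minimal sketch of what such a dispatcher method inside `Domain` could look like; the `scrape` entry-point name and the fallback for domains without a scraper are assumptions, not prescribed here:

```python
    def scrape(self):
        # Hypothetical entry point: look up the scraper registered for
        # this domain and call it. Unknown domains fail gracefully with
        # the same {"success": False} payload the scrapers return.
        scraper = self.scrapers.get(self.domain)
        if scraper is None:
            return json.dumps({"success": False})
        return scraper()
```
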
## Domain-based scraping function

The function must **ALWAYS** return a JSON string with a `"success"` key set to either `true` or `false`.

To make sure the spider doesn't stop working, the scraping function must be wrapped inside a `try` / `except` block:

```python
def scrape_domain(self):
    try:
        data = {
            "success": True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception as e:
        # Swallow the error so one bad page never stops the crawl.
        return json.dumps({"success": False})
```

For each page processed by the domain-based function, the parsed `BeautifulSoup` object is accessible via `self.soup`.

Consider this example HTML:

```html
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Venmo</title>
</head>
<body>
    <div id="root">

        <div class="top">
            <h2>
                <ul>
                    <li><a href="/01">01</a></li>
                    <li><a href="/02">02</a></li>
                    <li><a href="/03">03</a></li>
                    <li><a href="/04">04</a></li>
                </ul>
            </h2>
        </div>

        <div class="article main">
            <h2 class="text">Venmo truffaut shabby chic organic</h2>

            <section class="text">
                <p>I'm baby wayfarers tote bag gochujang cred food truck VHS quinoa kogi Brooklyn yr vegan etsy.</p>
                <p>Portland squid DSA, raclette flannel pinterest craft beer cloud bread pour-over same.</p>
                <p>Air plant pickled man braid tilde drinking vinegar ascot DIY poke meditation iceland JOMO sustainable. Hell of tbh kombucha +1 listicle.</p>
            </section>
        </div>

        <div class="footer">
            <ul>
                <li><a href="/01">01</a></li>
                <li><a href="/02">02</a></li>
                <li><a href="/03">03</a></li>
                <li><a href="/04">04</a></li>
            </ul>
            <p class="text"></p>
        </div>

    </div>
</body>
</html>
```

(If you're wondering what this dummy text is, it comes from [hipsum](https://hipsum.co/), a [Lorem ipsum](https://en.wikipedia.org/wiki/Lorem_ipsum)-like generator.)

From this `html`, we would like to retrieve this `json`:

```json
{
    "title": "Venmo truffaut shabby chic organic",
    "text": "I'm baby wayfarers tote bag gochujang cred food truck VHS quinoa kogi Brooklyn yr vegan etsy. Portland squid DSA, raclette flannel pinterest craft beer cloud bread pour-over same. Air plant pickled man braid tilde drinking vinegar ascot DIY poke meditation iceland JOMO sustainable. Hell of tbh kombucha +1 listicle.",
    "success": true
}
```

A scraping function that produces this output could look like this:

```python
def scrape_domain(self):
    try:
        content = self.soup.find("div", class_="article")
        title = content.find("h2")
        article_text = content.find("section", class_="text")

        data = {
            "title": title.text.strip(),
            # get_text(" ", strip=True) joins the <p> paragraphs with
            # single spaces instead of the raw newlines .text would keep.
            "text": article_text.get_text(" ", strip=True),
            "success": True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception as e:
        print(e)
        return json.dumps({"success": False})
```
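
For reference, a hypothetical end-to-end call from the spider, assuming the page has already been fetched and the `scrape` dispatcher sketched earlier exists (neither the fetch step nor the variable names are prescribed here):

```python
# Hypothetical usage: the spider fetches the page, then hands it off.
html = "<html>...</html>"  # raw HTML as returned by the fetcher
page = Domain("www.blast-info.fr", html, "https://www.blast-info.fr/some-article")
result = page.scrape()  # JSON string that always carries a "success" key
```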