To index the web properly, corbia spider uses domain-based templates.
Domain declaration
For each domain, a custom Python function is used to extract the relevant information and store it as JSON before further processing.
New domains are added to the Domain class:
from bs4 import BeautifulSoup
import json


class Domain:
    def __init__(self, domain, html, url):
        self.domain = domain
        self.soup = BeautifulSoup(html, 'html.parser')
        self.url = url
        # map each supported domain to its dedicated scraping method
        self.scrapers = {
            "www.azlyrics.com" : self.scrape_azlyrics,
            "www.monde-diplomatique.fr" : self.scrape_diplo,
            "www.amnesty.org" : self.scrape_amnesty,
            "www.vindefrance.com" : self.scrape_vdf,
            "www.tasteofcinema.com" : self.scrape_taste_of_cinema,
            "www.blast-info.fr" : self.scrape_blast,
            "www.byredo.com" : self.scrape_byredo
        }
Then, the scraping function itself is added to the class as def scrape_domain(self).
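The dictionary above only registers the methods; how the spider picks one at crawl time isn't shown on this page. A minimal dispatch sketch, assuming a hypothetical scrape entry point on the class:

# hypothetical entry point (not part of the documented class): look up the
# scraper registered for self.domain, or return a failure payload if unknown
def scrape(self):
    scraper = self.scrapers.get(self.domain)
    if scraper is None:
        return json.dumps({"success": False})
    return scraper()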
Domain-based scraping function
The function must ALWAYS return a JSON string with a "success" key set to either True or False.
To make sure the spider doesn't stop working, the scraping function must be wrapped inside a try/except block:
def scrape_domain(self):
    try:
        data = {
            # domain-specific fields extracted from self.soup go here
            "success" : True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception as e:
        return json.dumps({"success": False})
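On the consuming side, the spider can parse the returned string and branch on that flag before any further processing. A minimal sketch, assuming page is a Domain instance for the URL being crawled (the handling steps are placeholders, not the spider's actual pipeline):

import json

result = json.loads(page.scrape_domain())   # `page` is an assumed Domain instance

if result["success"]:
    print("indexing", page.url)             # placeholder for the real indexing step
else:
    print("skipping", page.url)             # the page is skipped, the crawl keeps going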
For each page processed by the domain-based function, the BeautifulSoup object is accessible via self.soup.
Consider this example html:
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>Venmo</title>
  </head>
  <body>
    <div id="root">
      <div class="top">
        <h2>
          <ul>
            <li><a href="/01">01</a></li>
            <li><a href="/02">02</a></li>
            <li><a href="/03">03</a></li>
            <li><a href="/04">04</a></li>
          </ul>
        </h2>
      </div>
      <div class="article main">
        <h2 class="text">Venmo truffaut shabby chic organic</h2>
        <section class="text">
          <p>I'm baby wayfarers tote bag gochujang cred food truck VHS quinoa kogi Brooklyn yr vegan etsy.</p>
          <p>Portland squid DSA, raclette flannel pinterest craft beer cloud bread pour-over same.</p>
          <p>Air plant pickled man braid tilde drinking vinegar ascot DIY poke meditation iceland JOMO sustainable. Hell of tbh kombucha +1 listicle.</p>
        </section>
      </div>
      <div class="footer">
        <ul>
          <li><a href="/01">01</a></li>
          <li><a href="/02">02</a></li>
          <li><a href="/03">03</a></li>
          <li><a href="/04">04</a></li>
        </ul>
        <p class="text"></p>
      </div>
    </div>
  </body>
</html>
(If you're wondering what this dummy text is, it comes from hipsum, a Lorem ipsum-like generator.)
From this html, we would like to retrieve this json:
{
    "title" : "Venmo truffaut shabby chic organic",
    "text" : "I'm baby wayfarers tote bag gochujang cred food truck VHS quinoa kogi Brooklyn yr vegan etsy. Portland squid DSA, raclette flannel pinterest craft beer cloud bread pour-over same. Air plant pickled man braid tilde drinking vinegar ascot DIY poke meditation iceland JOMO sustainable. Hell of tbh kombucha +1 listicle.",
    "success" : true
}
A scrape_domain function that produces this output:
def scrape_domain(self):
    try:
        # class_="article" also matches the multi-valued attribute class="article main"
        content = self.soup.find("div", class_="article")
        title = content.find("h2")
        article_text = content.find("section", class_="text")
        data = {
            "title" : title.text.strip(),
            # join the <p> elements with single spaces instead of raw newlines
            "text": article_text.get_text(" ", strip=True),
            "success" : True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception as e:
        print(e)
        return json.dumps({"success": False})
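To try scrape_domain against the example page without the full Domain class (whose other per-domain methods aren't shown here), one option is a minimal stand-in object. Everything below is a testing sketch, assuming the sample page is saved as example.html:

import json
from types import SimpleNamespace

from bs4 import BeautifulSoup

# read the sample page from disk (the file name is illustrative)
example_html = open("example.html", encoding="utf-8").read()

# minimal stand-in for a Domain instance: scrape_domain only needs self.soup
page = SimpleNamespace(soup=BeautifulSoup(example_html, "html.parser"))

# call the method above as a plain function, passing the stand-in as self
print(scrape_domain(page))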