To index the web properly, corbia spider uses domain-based templates.

## Domain declaration

For each domain, a custom Python function extracts the relevant information and stores it as JSON before further processing.

New domains are added to the `Domain` class:

```python
import json

from bs4 import BeautifulSoup


class Domain:

    def __init__(self, domain, html, url):
        self.domain = domain
        self.soup = BeautifulSoup(html, 'html.parser')
        self.url = url
        # Maps each supported domain to its dedicated scraping method.
        self.scrapers = {
            "www.azlyrics.com": self.scrape_azlyrics,
            "www.monde-diplomatique.fr": self.scrape_diplo,
            "www.amnesty.org": self.scrape_amnesty,
            "www.vindefrance.com": self.scrape_vdf,
            "www.tasteofcinema.com": self.scrape_taste_of_cinema,
            "www.blast-info.fr": self.scrape_blast,
            "www.byredo.com": self.scrape_byredo
        }
```

Then, the function is added to the class as `def scrape_domain(self)`; the spider can then pick the right function by looking up `self.scrapers`, as sketched below.

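A minimal sketch of what such a dispatcher method inside `Domain` could look like; the `scrape` entry-point name and the fallback for domains without a scraper are assumptions, not prescribed here:

```python
    def scrape(self):
        # Hypothetical entry point: look up the scraper registered for
        # this domain and call it. Unknown domains fail gracefully with
        # the same {"success": False} payload the scrapers return.
        scraper = self.scrapers.get(self.domain)
        if scraper is None:
            return json.dumps({"success": False})
        return scraper()
```
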
## Domain-based scraping function

The function must **ALWAYS** return a JSON string with a `"success"` key set to either `true` or `false`.

To make sure the spider doesn't stop working, the scraping function must be wrapped inside a `try` / `except` block:

```python
def scrape_domain(self):
    try:
        data = {
            "success": True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception as e:
        # Swallow the error so one bad page never stops the crawl.
        return json.dumps({"success": False})
```

For each page processed by the domain-based function, the parsed `BeautifulSoup` object is accessible via `self.soup`.

Consider this example HTML:

```html
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Venmo</title>
</head>
<body>
    <div id="root">

        <div class="top">
            <h2>
                <ul>
                    <li><a href="/01">01</a></li>
                    <li><a href="/02">02</a></li>
                    <li><a href="/03">03</a></li>
                    <li><a href="/04">04</a></li>
                </ul>
            </h2>
        </div>

        <div class="article main">
            <h2 class="text">Venmo truffaut shabby chic organic</h2>

            <section class="text">
                <p>I'm baby wayfarers tote bag gochujang cred food truck VHS quinoa kogi Brooklyn yr vegan etsy.</p>
                <p>Portland squid DSA, raclette flannel pinterest craft beer cloud bread pour-over same.</p>
                <p>Air plant pickled man braid tilde drinking vinegar ascot DIY poke meditation iceland JOMO sustainable. Hell of tbh kombucha +1 listicle.</p>
            </section>
        </div>

        <div class="footer">
            <ul>
                <li><a href="/01">01</a></li>
                <li><a href="/02">02</a></li>
                <li><a href="/03">03</a></li>
                <li><a href="/04">04</a></li>
            </ul>
            <p class="text"></p>
        </div>

    </div>
</body>
</html>
```

(If you're wondering what this dummy text is, it comes from [hipsum](https://hipsum.co/), a [Lorem ipsum](https://en.wikipedia.org/wiki/Lorem_ipsum)-like generator.)

From this `html`, we would like to retrieve this `json`:

```json
{
    "title": "Venmo truffaut shabby chic organic",
    "text": "I'm baby wayfarers tote bag gochujang cred food truck VHS quinoa kogi Brooklyn yr vegan etsy. Portland squid DSA, raclette flannel pinterest craft beer cloud bread pour-over same. Air plant pickled man braid tilde drinking vinegar ascot DIY poke meditation iceland JOMO sustainable. Hell of tbh kombucha +1 listicle.",
    "success": true
}
```

A scraping function that produces this output could look like this:

```python
def scrape_domain(self):
    try:
        content = self.soup.find("div", class_="article")
        title = content.find("h2")
        article_text = content.find("section", class_="text")

        data = {
            "title": title.text.strip(),
            # get_text(" ", strip=True) joins the <p> paragraphs with
            # single spaces instead of the raw newlines .text would keep.
            "text": article_text.get_text(" ", strip=True),
            "success": True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception as e:
        print(e)
        return json.dumps({"success": False})
```
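
For reference, a hypothetical end-to-end call from the spider, assuming the page has already been fetched and the `scrape` dispatcher sketched earlier exists (neither the fetch step nor the variable names are prescribed here):

```python
# Hypothetical usage: the spider fetches the page, then hands it off.
html = "<html>...</html>"  # raw HTML as returned by the fetcher
page = Domain("www.blast-info.fr", html, "https://www.blast-info.fr/some-article")
result = page.scrape()  # JSON string that always carries a "success" key
```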