vault backup: 2024-01-02 12:59:04
Affected files: Corbia spider.md
parent 561a835150
commit 3f459c3929
@@ -1,4 +1,49 @@
To index the web properly, Corbia spider uses domain-based templates.

## Domain declaration

For each domain, a custom Python function extracts the relevant information and stores it as JSON before further processing.

New domains are added to the `Domain` class:

```python
from bs4 import BeautifulSoup


class Domain:
    def __init__(self, domain, html, url):
        self.domain = domain
        self.soup = BeautifulSoup(html, 'html.parser')
        self.url = url
        # Maps each supported hostname to its scraper method.
        # Every method referenced here must be defined on this class.
        self.scrapers = {
            "www.azlyrics.com": self.scrape_azlyrics,
            "www.monde-diplomatique.fr": self.scrape_diplo,
            "www.amnesty.org": self.scrape_amnesty,
            "www.vindefrance.com": self.scrape_vdf,
            "www.tasteofcinema.com": self.scrape_taste_of_cinema,
            "www.blast-info.fr": self.scrape_blast,
            "www.byredo.com": self.scrape_byredo
        }
```

Then the scraper function is added to the class as a method, `def scrape_domain(self)`, with `domain` replaced by a short name for the site (as in `scrape_azlyrics`).
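
The two steps above can be sketched end to end. This is a minimal illustration, not the spider's actual code: `DomainSketch`, `scrape_example`, the `scrape` dispatcher, and the `www.example.com` hostname are all hypothetical, and HTML parsing is left out so the sketch only needs the standard library. How the real spider looks up a scraper is not shown in this note; a dictionary lookup on the hostname is assumed here.

```python
import json
from urllib.parse import urlparse


class DomainSketch:
    """Hypothetical, trimmed-down stand-in for `Domain` (no HTML parsing)."""

    def __init__(self, domain, url):
        self.domain = domain
        self.url = url
        # Step 1: register the new hostname next to its scraper method.
        self.scrapers = {
            "www.example.com": self.scrape_example,
        }

    # Step 2: add the scraper itself as a method of the class.
    def scrape_example(self):
        return json.dumps({"success": True, "url": self.url})

    def scrape(self):
        """Hypothetical dispatcher: run the scraper registered for the hostname."""
        scraper = self.scrapers.get(self.domain)
        if scraper is None:
            # Unknown hostname: report failure in the same JSON shape.
            return json.dumps({"success": False})
        return scraper()


url = "https://www.example.com/page"
page = DomainSketch(urlparse(url).netloc, url)
result = json.loads(page.scrape())

# A hostname with no registered scraper falls back to a failure result.
unknown = json.loads(
    DomainSketch("www.unknown.org", "https://www.unknown.org/").scrape()
)
```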

## Domain-based scraping function

The function must **ALWAYS** return a JSON string with a `"success"` key set to either `True` or `False`.
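
Code that consumes a scraper's output can lean on this contract. The sketch below shows one way a caller might check it; `is_successful` is a hypothetical helper, not part of the spider. Note that Python's `True` and `False` are serialized as JSON `true` and `false`, and come back as Python booleans after `json.loads`.

```python
import json


def is_successful(raw):
    """Hypothetical helper: True only if `raw` is JSON with "success" set to true."""
    try:
        result = json.loads(raw)
    except (TypeError, ValueError):
        # Not valid JSON at all, so the contract was broken.
        return False
    return isinstance(result, dict) and result.get("success") is True


ok = is_successful(json.dumps({"success": True, "title": "Example"}))
failed = is_successful(json.dumps({"success": False}))
broken = is_successful("not json")
```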

To make sure the spider doesn't stop working, the body of the scraping function must be wrapped in a `try` / `except` block:

```python
import json


def scrape_domain(self):
    try:
        # Extract the relevant fields from self.soup and add them to `data`.
        data = {
            "success": True
        }
        return json.dumps(data, indent=4, ensure_ascii=False)
    except Exception:
        # Report the failure instead of letting the exception stop the spider.
        return json.dumps({"success": False})
```
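
Since every scraper repeats the same `try` / `except` wrapper, the pattern could also be factored into a decorator. This is an alternative sketch, not something the note prescribes: `safe_scraper`, `scrape_ok`, and `scrape_broken` are hypothetical names, and the decorated functions return a plain dict rather than a JSON string.

```python
import functools
import json


def safe_scraper(func):
    """Hypothetical decorator: turn any exception into {"success": false}."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            data = func(*args, **kwargs)
            # Mark the result as successful before serializing it.
            data["success"] = True
            return json.dumps(data, indent=4, ensure_ascii=False)
        except Exception:
            return json.dumps({"success": False})
    return wrapper


@safe_scraper
def scrape_ok():
    return {"title": "Café"}


@safe_scraper
def scrape_broken():
    element = None  # stands in for a page lookup that found nothing
    return {"title": element.get_text()}  # raises AttributeError


good = json.loads(scrape_ok())
bad = json.loads(scrape_broken())
```

With `ensure_ascii=False`, non-ASCII text such as `Café` is written as-is rather than escaped, which matches the `json.dumps` call in the template above.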