# Crawling

This document is a draft.

## WARNING

Please don't run the crawler unless you intend to actually operate a public-facing
search engine! Use crawl sets from downloads.marginalia.nu instead.

See the documentation in run/ for more information.

Reckless crawling annoys webmasters and makes it harder to run an independent search engine.
Crawling from a domestic IP address is also likely to put you on a greylist
of probable bots. You will solve CAPTCHAs for almost every website you visit
for weeks.

## Prerequisites

You probably want to run a local bind resolver to speed up DNS lookups and reduce the amount of
DNS traffic.

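As a minimal sketch, on a Debian or Ubuntu system (an assumption; package names and resolver
setup vary by distribution):

```bash
# install bind9, which acts as a local caching resolver out of the box
$ sudo apt install bind9

# point the system resolver at the local bind instance
$ echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf
```
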
These processes require a lot of disk space. It's strongly recommended to use a dedicated disk;
it doesn't need to be extremely fast, but it should be a few terabytes in size. It should be mounted
with `noatime` and formatted with a large block size, e.g. 4096 bytes. This will reduce the amount
of disk space used by the crawler.

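A sketch of this setup, assuming an ext4 filesystem and that the dedicated disk is `/dev/sdb1`
(both assumptions; adapt to your hardware):

```bash
# format with a 4096-byte block size
$ sudo mkfs.ext4 -b 4096 /dev/sdb1

# mount at /data with noatime to avoid access-time writes on every read
$ sudo mkdir -p /data
$ sudo mount -o noatime /dev/sdb1 /data
```

Add a matching `noatime` entry to `/etc/fstab` to make the mount persistent across reboots.
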
Make sure you configure the user-agent properly. This will be used to identify the crawler,
and is matched against the robots.txt file. The crawler will not crawl sites that don't allow it.

This can be done by editing the file `${WMSA_HOME}/conf/user-agent`.

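For example, assuming the file simply holds the user-agent string (an assumption; check the
file's existing contents for the expected format), with a hypothetical bot name:

```bash
# "search.example.com-bot" is a placeholder; use a name that identifies your crawler
$ echo "search.example.com-bot" > ${WMSA_HOME}/conf/user-agent
```
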
## Setup

To operate the crawler, you need to set up a filesystem structure.

You need

* a directory for crawl data
* a directory for processed data
* a crawl specification file
* a crawl plan file

Assuming we want to keep our crawl and processed data in
`/data`, we would create the following directories:

```bash
$ mkdir /data/crawl
$ mkdir /data/processed
```

### Specifications

A crawl specification file is a compressed JSON file with each domain name to crawl, as well as
known URLs for each domain. These are created in the `storage -> specifications` view in the operator's gui.

To bootstrap the system, you need a list of known domains. This is just a text file with one domain name per line,
with blank lines and comments starting with `#` ignored.

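For example (the domains below are placeholders):

```
# seed domains; blank lines and comment lines like this one are ignored
www.marginalia.nu

example.com
blog.example.org
```
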
Make it available over HTTP(S) and select `Download a list of domains from a URL` in the `Create New Specification`
form. Make sure to give this specification a good description, as it will follow you around for a while.

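If you don't have a web server handy, one quick way to serve the list (assuming Python 3 is
installed and the list is saved as `domains.txt` in the current directory):

```bash
# serves the current directory over HTTP; the list is then available at
# http://<your-host>:8000/domains.txt
$ python3 -m http.server 8000
```
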
## Crawling

Refresh the specification list in the operator's gui. You should see your new specification in the list.
Click the `[Info]` link next to it and select `[Crawl]` under `Actions`.

Depending on the size of the specification, this may take anywhere from a few minutes to a few weeks.
You can follow the progress in the `Actors` view.

## Converting

Once the crawl is finished, you can convert the data to a format that can be loaded into the database.
This is done by going to the `storage -> crawl` view in the operator's gui, clicking the `[Info]` link
and pressing `[Convert]` under `Actions`.

The rest of the process should be automatic. Follow the progress in the `Actors` view; the actor
`RECONVERT_LOAD` drives the process. The process can be stopped by terminating this actor. Depending on the
state, it may be necessary to restart from the beginning.