This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.
Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
In encyclopedia, add a class "mw-content-text" that the WikiSpecialization class is looking for during pruning to give the articles a more fair treatment.
Also add generator keywords based on the generator type provided, to ensure that these documents show up in appropriate filters.
Further, add a new document flag value 'Sideloaded' to be able to distinguish these entries.
At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly.
This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled.
The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.
This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted. This component is currently not hooked into anything.
The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'.
The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.
* Added new (optional) model file in $WMSA_HOME/data/atags.parquet
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking
* Increase accuracy of the position bits.
* Increase their width to 56.
* Use a rolling position scheme for bits 16-56 to increase the average accuracy.
* Result ranking overhaul
* Optimized queries
* BM25 in the index service's ranking
* Make gui less jank
* Javadocs for ranking parameters.