MarginaliaSearch/code/processes/converting-process/test/nu/marginalia/converting
Viktor Lofgren 895cee7004 (crawler) Improved feed discovery, new domain state db per crawlset
Feed discovery is improved by probing a few likely endpoints when no feed link tag is provided. To store the feed URLs, a SQLite database is added to each crawlset, holding a simple summary of the crawl job, including any feed URLs that have been discovered.

Solves issue #135
2024-12-26 15:05:52 +01:00
..
logic (refac) Remove src/main from all source code paths. 2024-02-23 16:13:40 +01:00
model (feature-extraction) Add new DocumentHeaders class encapsulating Html headers. 2024-11-11 13:26:15 +01:00
processor Add specialization for steam store and GOG 2024-12-11 18:32:45 +01:00
sideload (encyclopedia-sideloader) Add test suite and clean up urlencoding logic 2024-11-26 13:34:15 +01:00
util (setup) Remove OpenNLP tokenization model 2024-11-28 16:03:05 +01:00
ConvertingIntegrationTest.java (model) Remove deprecated fields from CrawledDocument and CrawledDomain 2024-11-20 15:27:05 +01:00
ConvertingIntegrationTestModule.java (refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendents 2024-11-21 16:00:09 +01:00
CrawlingThenConvertingIntegrationTest.java (crawler) Improved feed discovery, new domain state db per crawlset 2024-12-26 15:05:52 +01:00
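
The feed-discovery fallback described in the latest commit message (probing a few likely endpoints when a page declares no feed link tag) might look roughly like the sketch below. This is an illustration only, not the code from commit 895cee7004: the class name `FeedProbeSketch`, the method `probeForFeed`, and the candidate paths are all assumptions, and the real endpoint list is not given in the commit message. The network check is passed in as a predicate so the logic can be exercised without HTTP; writing the result to the per-crawlset SQLite database is left out.

```java
import java.net.URI;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical sketch of a feed-probing fallback; not the MarginaliaSearch implementation. */
public class FeedProbeSketch {

    // Assumed candidate paths. The commit message only says "a few likely endpoints".
    static final List<String> CANDIDATE_PATHS =
            List.of("/feed", "/rss.xml", "/atom.xml", "/index.xml");

    /**
     * Intended to be called only when the crawled page had no feed link tag.
     * Tries each candidate path under the domain root, in order, and returns
     * the first one the supplied check accepts.
     *
     * @param domainRoot    root URL of the domain, e.g. https://example.com/
     * @param looksLikeFeed caller-supplied check (in a crawler, an HTTP probe)
     */
    static Optional<URI> probeForFeed(URI domainRoot, Predicate<URI> looksLikeFeed) {
        for (String path : CANDIDATE_PATHS) {
            URI candidate = domainRoot.resolve(path);
            if (looksLikeFeed.test(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

In a crawler, the predicate would typically issue a request and inspect the status code and content type. In a test, it can be a plain lambda such as `u -> u.getPath().equals("/rss.xml")`.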