MarginaliaSearch/code
Viktor Lofgren ed373eef61 Refactor crawler and add special logic for some platforms
* Break apart CrawlerRetreiver
* Break apart HttpFetcher into an interface and impl for testing sanity
* Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.
2023-06-27 10:57:54 +02:00
api Update readme.md 2023-04-22 16:05:57 +02:00
common Fix serialization bug with CompressedBigString 2023-06-27 10:57:54 +02:00
features-convert Tweaks to pub date heuristics to make it mostly get the 'historyofphilosophy.net' case right. 2023-06-20 14:15:05 +02:00
features-crawl Yet more restructuring. Improved search result ranking. 2023-03-16 21:35:54 +01:00
features-index Use document generator to complement the document selection. 2023-06-22 17:21:33 +02:00
features-search Add a ranking parameter for biasing toward recent or old content. 2023-04-20 16:00:59 +02:00
libraries Optimize SentenceExtractor. 2023-06-19 17:58:19 +02:00
process-models Add bits and keywords for generator classes (docs, forum, wiki). 2023-06-23 21:35:28 +02:00
processes Refactor crawler and add special logic for some platforms 2023-06-27 10:57:54 +02:00
services-core Add search profiles for wiki, forum and docs. 2023-06-24 12:17:35 +02:00
services-satellite Api service response cache (#16) 2023-04-22 15:42:32 +02:00
tools Refactor crawler and add special logic for some platforms 2023-06-27 10:57:54 +02:00
readme.md Fix broken diagram links after doc/ restructuring. 2023-03-25 16:32:10 +01:00

Code

This is a pretty large and diverse project with many moving parts.

Each module contains a short description of what it does and how it relates to other modules. Module names use terms such as "library", "process" or "feature", and each term has a specific meaning; see doc/module-taxonomy.md.

Overview

A map of the most important components and how they relate can be found below.

[Image: map of the most important components and how they relate]

Services

Processes

Processes are batch jobs that deal with data retrieval, processing and loading.

Tools

Features

Features are relatively stand-alone components that serve some part of the domain. They are not domain-independent, but they are isolated.

Libraries and primitives

Libraries are stand-alone code that is independent of the domain logic.

  • common elements for creating a service, a client, etc.
  • libraries containing non-search-specific code.
    • array - library for large memory-mapped areas
    • btree - static btree library
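
To give a rough idea of what "memory-mapped" and "static" mean for the libraries listed above, here is a minimal sketch that uses only the JDK. It is not the API of the array or btree modules: the class `MappedLongArraySketch` and all of its methods are hypothetical names made up for illustration. It maps a file into memory as an array of longs and looks values up with a plain binary search, which stands in here as the simplest lookup over sorted data that never changes; it is not a B-tree. A single mapping of this kind is limited to about 2 GiB, so a truly large area would need more machinery than shown.

```java
import java.io.IOException;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical illustration only; not taken from the project's array or btree libraries. */
final class MappedLongArraySketch {
    private final LongBuffer longs;

    private MappedLongArraySketch(LongBuffer longs) {
        this.longs = longs;
    }

    /** Maps room for `size` longs from the file at `path`, creating the file if it is missing. */
    static MappedLongArraySketch open(Path path, int size) throws IOException {
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            LongBuffer buffer = channel
                    .map(FileChannel.MapMode.READ_WRITE, 0, (long) size * Long.BYTES)
                    .asLongBuffer();
            return new MappedLongArraySketch(buffer);
        }
    }

    long get(int index) { return longs.get(index); }

    void set(int index, long value) { longs.put(index, value); }

    int size() { return longs.capacity(); }

    /** Binary search over values stored in ascending order; returns the index, or -1 if absent. */
    int indexOf(long key) {
        int low = 0;
        int high = size() - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            long value = get(mid);
            if (value < key) {
                low = mid + 1;
            } else if (value > key) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped-longs", ".dat");
        file.toFile().deleteOnExit();

        MappedLongArraySketch array = open(file, 1_000);
        for (int i = 0; i < array.size(); i++) {
            array.set(i, 3L * i); // ascending values: 0, 3, 6, ...
        }
        System.out.println(array.indexOf(2_997L)); // prints 999
        System.out.println(array.indexOf(5L));     // prints -1
    }
}
```

Because the data lives in a mapped file rather than on the Java heap, the operating system decides which pages are resident, which is the usual reason for reaching for this technique with data sets larger than the heap.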