Some refactoring is still needed, but a dummy actor is in place, along with a process that crawls URLs from the livecapture service's RSS endpoints; the crawled data makes it all the way to being indexable.
The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled.
A while back, a new domains listing view was added to the control view that allows direct access to the domains table. This is much preferred, and means the operator can manage domains directly, without specs.
This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
To help offer verbatim matches for external link texts, we assign them positions in the document a bit past where the actual document ends. Integrating this information with the ranking is not performed here.
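As a rough illustration of the position assignment described above, here is a minimal sketch. The class name, method name and both gap sizes are made up for the example and are not taken from the actual code; the only idea carried over is that link text words get positions past the end of the document, so they can be phrase-matched without colliding with the document's own words.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the actual implementation.
final class LinkTextPositions {
    // Assumed gap sizes; the commit does not say what values are really used.
    static final int GAP_AFTER_DOCUMENT = 16;
    static final int GAP_BETWEEN_LINK_TEXTS = 4;

    /**
     * Assigns each word of each external link text a position that lies
     * past the end of the document. Words within one link text get
     * consecutive positions, so a verbatim phrase match is possible;
     * a gap between link texts keeps phrases from spanning two of them.
     */
    static Map<String, List<Integer>> assignPositions(int documentLength,
                                                      List<List<String>> linkTexts) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        int pos = documentLength + GAP_AFTER_DOCUMENT;

        for (List<String> linkText : linkTexts) {
            for (String word : linkText) {
                positions.computeIfAbsent(word, k -> new ArrayList<>()).add(pos++);
            }
            pos += GAP_BETWEEN_LINK_TEXTS;
        }
        return positions;
    }
}
```

With a 100 word document and the link texts "marginalia search" and "search engine", this sketch puts "marginalia" at 116, "search" at 117 and 122, and "engine" at 123.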
Refactoring keyword extraction to extract span information.
Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.
This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object. Separator information is encoded as a bit set instead of an array of integers.
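To give an idea of the bit set encoding of separators, here is a minimal sketch. It assumes there are only two kinds of separator (space and comma), and the class, constants and methods are invented for the example; the real DocumentSentence code may well look different.

```java
import java.util.BitSet;

// Hypothetical sketch, not the actual DocumentSentence code.
final class SeparatorBits {
    // Assumed separator codes, as an int[] representation might have stored them.
    static final int SPACE = 0;
    static final int COMMA = 1;

    /** Packs one separator per word into a bit set: bit i is set if word i is followed by a comma. */
    static BitSet encode(int[] separators) {
        BitSet bits = new BitSet(separators.length);
        for (int i = 0; i < separators.length; i++) {
            if (separators[i] == COMMA) {
                bits.set(i);
            }
        }
        return bits;
    }

    /** Returns true if the word at wordIndex is followed by a comma rather than a space. */
    static boolean isCommaAfter(BitSet bits, int wordIndex) {
        return bits.get(wordIndex);
    }
}
```

Under that two-separator assumption, each word costs one bit rather than a 32 bit integer, which adds up on an object that is instantiated once per sentence.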
The change also cleans up the SentenceExtractor class a fair bit. It no longer extracts ngrams, and a significant number of redundant operations were removed as well. This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
This is not hooked in yet, and the term metadata is still left intact. It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.
While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.