Viktor Lofgren
6e1aa7b391
(search) Make style.css depend on jte file changes
...
Also add a hack to ensure classes generated from java code get included in the stylesheet as intended.
2024-12-07 14:11:22 +01:00
Viktor Lofgren
deab9b9516
(search) Clean up start views for search and site-info
2024-12-07 14:11:22 +01:00
Viktor Lofgren
39d99a906a
(search) Add proper tailwind build and host fontawesome locally
2024-12-07 14:11:22 +01:00
Viktor Lofgren
6f72e6e0d3
(explore) Add lazy loading and alt attributes to images
2024-12-07 14:11:22 +01:00
Viktor Lofgren
d786d79483
(site-info) Add whitespace-nowrap to pubDay span in overview.jte
2024-12-07 14:11:22 +01:00
Viktor Lofgren
01510f6c2e
(serp) Add wayback link to search results
2024-12-07 14:11:22 +01:00
Viktor Lofgren
7ba43e9e3f
(site) Adjust sizing of navbars
2024-12-07 14:11:16 +01:00
Viktor Lofgren
97bfcd1353
(site) Layout changes site-info
2024-12-07 14:11:16 +01:00
Viktor Lofgren
aa3c85c196
(site) Mobile layout fixes
2024-12-07 14:11:16 +01:00
Viktor Lofgren
ee2d5496d0
Revert "(experiment) Modify atags exporter to permit duplicates from different source domains"
...
This reverts commit 5c858a2b94.
2024-12-07 14:01:50 +01:00
Viktor Lofgren
5c858a2b94
(experiment) Modify atags exporter to permit duplicates from different source domains
...
This is an attempt to provide higher resolution term frequency data that will need evaluation when the data is processed.
2024-12-06 14:10:15 +01:00
Viktor Lofgren
fb75a3827d
(site) Adjust coloration of search results
2024-12-05 16:58:00 +01:00
Viktor Lofgren
7d546d0e2a
(site) Make SearchParameters generate relative URLs instead of absolute
2024-12-05 16:47:22 +01:00
Viktor Lofgren
8fcb6ffd7a
(site-info) Increase contrast in search results for forums, wikis
2024-12-05 16:42:16 +01:00
Viktor Lofgren
f97de0c15a
(site-info) Fix layout
2024-12-05 16:33:46 +01:00
Viktor Lofgren
be9e192b78
(site-info) Fix pagination in backlinks and documents views
2024-12-05 16:26:11 +01:00
Viktor Lofgren
75ae1c9526
(site-info) Do not show 'suggest for crawling' when the node affinity is already set to 0
...
This indicates the domain is already slated for crawling.
2024-12-05 16:18:46 +01:00
Viktor Lofgren
33761a0236
(site-info) Make the search box in the site viewer functional
2024-12-05 16:16:29 +01:00
Viktor Lofgren
19b69b1764
(site-info) Only show samples if feed is absent, never both.
2024-12-05 16:05:03 +01:00
Viktor Lofgren
8b804359a9
(serp) Layout fixes for mobile
2024-12-05 15:59:33 +01:00
Viktor Lofgren
f050bf5c4c
(WIP) Initial semi-working transformation to new tailwind UI
...
Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod.
There's also a lot of polish remaining everywhere, dead links, etc.
2024-12-05 14:00:17 +01:00
Viktor Lofgren
fdc3efa250
(setup) Remove OpenNLP tokenization model
...
This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.
2024-11-28 16:03:05 +01:00
Viktor Lofgren
c97c66a41c
(ranking) Reduce the verbatim score multiplier
2024-11-28 13:37:11 +01:00
Viktor Lofgren
7b64377fd6
(ranking) Promote documents with multiple phrase matches with a log-scale bonus
2024-11-28 13:36:56 +01:00
Viktor Lofgren
e11ebf18e5
(span) Correct intersection counting logic, add comprehensive tests
2024-11-28 13:36:25 +01:00
Viktor Lofgren
ba47d72bf4
(ranking) Adjust scores for external link matches
2024-11-27 14:27:23 +01:00
Viktor Lofgren
52bc0272f8
(atag) Add alias domain support and improve domain handling
...
Introduced optional alias domain functionality in EdgeDomain class to handle domain variations such as "www" in the anchor tags code, as there are commonly a number of relevant but glancing misses in the atags data.
2024-11-27 14:26:44 +01:00
Viktor Lofgren
d4bce13a03
(export) Add export actors to precession
...
Adding a tracking message to the export actor means it's possible to run them in a precession.
Adding a new precession actor, and some GUI components for triggering exports.
The change also adds a heartbeat to the export process.
2024-11-26 15:07:03 +01:00
Viktor Lofgren
b9842b57e0
(encyclopedia-sideloader) Add test suite and clean up urlencoding logic
2024-11-26 13:34:15 +01:00
Viktor Lofgren
95776e9bee
(encyclopedia) Fix commit gore resulting in bad SQL query
2024-11-26 12:44:49 +01:00
Viktor Lofgren
077d8dcd11
(result-score) Adjust ranking parameters a tiny bit
2024-11-25 18:30:59 +01:00
Viktor Lofgren
9ec41e27c6
(keyword-extractor) Fix bug where external link keywords weren't generating document spans as intended
2024-11-25 18:30:22 +01:00
Viktor Lofgren
200743c84f
(minor) Remove delombok debris
2024-11-25 18:29:21 +01:00
Viktor Lofgren
6d7998e349
(index) Correct behavior of debug function positionValues(), which was misleadingly incorrect
2024-11-25 18:28:53 +01:00
Viktor Lofgren
7d1ef08a0f
(index) Correct ranking bonus for external linktext appearances
2024-11-25 17:40:15 +01:00
Viktor Lofgren
3ec9c4c5fa
(export) Filter non-HTML documents in exporters
...
Add a check to ensure only documents with "text/html" content type are processed in FeedExporter, AtagExporter, and TermFrequencyExporter. This prevents non-HTML documents from being parsed and helps maintain data consistency and keep the memory usage down.
2024-11-25 15:06:42 +01:00
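The content-type check described in 3ec9c4c5fa might look roughly like the sketch below. The class and method names are hypothetical, and how the exporters actually compare the header (exact match, prefix, case handling) is not stated in the commit; this version assumes any parameters such as `charset` should be ignored.

```java
final class ContentTypeFilter {
    /**
     * Returns true only for documents whose media type is text/html.
     * Parameters after ';' (e.g. "; charset=utf-8") are ignored; this is an
     * assumption, the commit message does not say how the comparison is done.
     */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        String mediaType = contentType.split(";", 2)[0].trim();
        return mediaType.equalsIgnoreCase("text/html");
    }
}
```

An exporter loop would skip any document for which `isHtml` returns false before handing it to the HTML parser.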
Viktor Lofgren
0b6b5dab07
(index) Add score bonuses for single-word anchor tag spans
...
Enhanced scoring logic to add bonuses when the query matches single-word anchor (atag) spans exactly. Implemented this by adding conditions in `IndexResultScoreCalculator.java` and creating a new method `containsRangeExact` in `DocumentSpan.java` to check for exact span matches.
2024-11-25 14:44:41 +01:00
Viktor Lofgren
ff17473105
Fix UTF-8 URL normalization issue in sideloader.
...
Normalize URLs by replacing en-dash with hyphen to prevent encoding errors. This ensures correct handling of a small subset of articles with improperly normalized UTF-8 paths. Added `normalizeUtf8` method to address this issue.
Fixes issue #109.
2024-11-25 14:25:47 +01:00
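A minimal sketch of the en-dash replacement described in ff17473105. Only the method name `normalizeUtf8` comes from the commit message; the enclosing class, the signature, and the body are assumptions, and the real method may handle more characters.

```java
final class UrlPathNormalizer {
    /**
     * Replaces the en-dash (U+2013) with an ASCII hyphen.
     * Assumed behavior based on the commit message, not the actual method body.
     */
    static String normalizeUtf8(String path) {
        return path.replace('\u2013', '-');
    }
}
```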
Viktor Lofgren
dc5f97e737
(index) Add bonus for single-word title matches when the title is also a single word
2024-11-25 13:24:12 +01:00
Viktor Lofgren
d919179ba3
(index) Correct off-by-1 error in DocumentSpan.containsRange
2024-11-25 13:24:03 +01:00
Viktor Lofgren
f09669a5b0
(index) Correct usage of DocumentSpan.length() instead of DocumentSpan.size()
...
The latter counts the number of spans, and is not what you want here.
2024-11-25 13:11:55 +01:00
Viktor Lofgren
b3b0f6fed3
(actor) Add side-load profile to PROC_CONVERTER_SPAWNER.
...
This fell off during the profile split, but is necessary for sideloading.
2024-11-25 12:40:14 +01:00
Viktor Lofgren
88caca60f9
(live-crawl) Flag URLs that don't pass robots.txt as bad so we don't keep fetching robots.txt every day for an empty link list
2024-11-23 17:07:16 +01:00
Viktor Lofgren
923ebbac81
(feeds) Add logic to handle URI fragments in feed items
...
Introduced a method to decide whether to retain URI fragments in feed items based on their uniqueness. Enhanced FeedItem processing to conditionally strip fragments to maintain clean URLs where applicable.
2024-11-23 16:38:56 +01:00
Viktor Lofgren
552b246099
(live-crawl) Improve error handling for errors during robots.txt-retrieval
...
Reduce log-spam and don't treat errors other than 404 as "all is permitted".
2024-11-22 14:15:32 +01:00
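The robots.txt policy in 552b246099 could be sketched as below. All names are hypothetical, and what the live crawler does for non-404 errors is an assumption (here: skip the domain for now); the commit only says such errors are no longer treated as "all is permitted".

```java
final class RobotsPolicy {
    enum Decision { ALLOW_ALL, PARSE_RULES, SKIP_FOR_NOW }

    /** Maps the HTTP status of a robots.txt fetch to a crawl decision. */
    static Decision decide(int statusCode) {
        if (statusCode == 404) {
            // No robots.txt exists, so nothing is disallowed.
            return Decision.ALLOW_ALL;
        }
        if (statusCode >= 200 && statusCode < 300) {
            return Decision.PARSE_RULES;
        }
        // Assumption: any other error (5xx, 403, ...) is not read as permission.
        return Decision.SKIP_FOR_NOW;
    }
}
```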
Viktor Lofgren
80e6d0069c
(live-crawl-actor) Clear index journal before starting live crawl
...
This is to prevent data corruption. This shouldn't be necessary for the regular loader path, but the live crawler is a bit different and needs some paving of the road ahead of it.
2024-11-22 14:04:57 +01:00
Viktor Lofgren
b941604135
(live-crawler) Alter DbDomainIdRegistry to make inserts if an id is missing, as this is apparently a rare scenario we need to deal with.
2024-11-22 13:58:57 +01:00
Viktor Lofgren
52eb5bc84f
(live-crawler) Keep track of bad URLs
...
To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are, on a dice roll, added to a bad URLs table that prevents further attempts at fetching them.
2024-11-22 00:55:46 +01:00
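The dice-roll idea in 52eb5bc84f can be illustrated with the sketch below. It is not the actual implementation: the commit stores bad URLs in a table, whereas this example uses an in-memory set, and the class name, method names, and probability parameter are all hypothetical.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

final class BadUrlTracker {
    private final Set<String> badUrls = new HashSet<>(); // stand-in for the bad URLs table
    private final Random random;
    private final double flagProbability; // hypothetical parameter, value not given in the commit

    BadUrlTracker(Random random, double flagProbability) {
        this.random = random;
        this.flagProbability = flagProbability;
    }

    /** On a failed fetch, flag the URL as bad with the configured probability. */
    void onFetchFailure(String url) {
        if (random.nextDouble() < flagProbability) {
            badUrls.add(url);
        }
    }

    /** URLs flagged as bad are not attempted again. */
    boolean shouldFetch(String url) {
        return !badUrls.contains(url);
    }
}
```

Flagging probabilistically rather than on the first failure means a URL that fails once by chance will usually get another attempt, while one that fails repeatedly ends up flagged sooner or later.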
Viktor Lofgren
4d23fe6261
(feeds) Simplify RSS User-Agent header
...
Removed the redundant "RSS Feed Fetcher" suffix from the User-Agent header in the FeedFetcherService. This will help avoid making the feed fetcher trigger bot mitigation that accepts the regular UA-string.
2024-11-21 16:43:56 +01:00
Viktor Lofgren
14519294d2
Merge branch 'master' into live-search
2024-11-21 16:00:20 +01:00
Viktor Lofgren
51e46ad2b0
(refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendants
...
Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx.
While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them. Some effort was also put toward making the process boot process a bit more uniform. It's still a bit heterogeneous between different processes, but a bit less so for now.
2024-11-21 16:00:09 +01:00
Viktor Lofgren
665c8831a3
(model) Fix resource leak in partially read crawl data streams.
...
Ensuring proper resource management by closing the underlying stream in the `close` method to prevent potential resource leaks.
2024-11-20 19:29:13 +01:00
Viktor Lofgren
47dfbacb00
(conf) Introduce a new concept of node profiles
...
Node profiles decide which actors are started, and which views are available in the control GUI. This helps keep the system organized, and hides real-time clutter from the batch-oriented nodes.
2024-11-20 18:15:22 +01:00
Viktor Lofgren
f94911541a
(live-crawl) Reduce the risk of id collisions with the main indexes
...
This is done by applying a large constant offset to the ordinals for the live crawled documents. The chosen value still permits up to 100k documents to be fetched for a single domain with the live crawler, which is ridiculously large.
2024-11-20 16:01:10 +01:00
Viktor Lofgren
89d8af640d
(live-crawl) Rename the live crawler code module to be more consistent with the other processes
2024-11-20 15:55:15 +01:00
Viktor Lofgren
6e4252cf4c
(live-crawl) Make the actor poll for feeds changes instead of being a one-shot thing.
...
Also changes the live crawl process to store the live crawl data in a fixed directory in the storage base rather than versioned directories.
2024-11-20 15:36:25 +01:00
Viktor Lofgren
79ce4de2ab
(model) Remove deprecated fields from CrawledDocument and CrawledDomain
2024-11-20 15:27:05 +01:00
Viktor Lofgren
d6575dfee4
(live-crawler) Crude first-try process for live crawling #WIP
...
Some refactoring is still needed, but a dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.
2024-11-19 21:00:18 +01:00
Viktor Lofgren
a91ab4c203
(live-crawler) Crude first-try process for live crawling #WIP
...
Some refactoring is still needed, but a dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.
2024-11-19 19:35:01 +01:00
Viktor Lofgren
6a3079a167
(search) Fix missing getter for proto
2024-11-18 21:05:22 +01:00
Viktor Lofgren
c728a1e2f2
(rss) Add endpoint for extracting URLs changed within a timespan.
2024-11-18 14:59:32 +01:00
Viktor Lofgren
d874d76a09
(rss) Add an endpoint that can be used for identifying when RSS data has changed
2024-11-18 14:22:17 +01:00
Viktor Lofgren
41c11be075
(status) Clean up the status page a bit
2024-11-17 20:00:44 +01:00
Viktor Lofgren
163ce19846
(test) Tag status service endpoint tests as flaky
...
These tests have outside dependencies that inherently make them unreliable and unsuitable for CI.
2024-11-17 19:48:01 +01:00
Viktor Lofgren
9eb16cb667
(test) Remove tests from fast suite
...
Adding a new @Tag("flaky") for tests that do not reliably return successes. These may still be valuable during development, but should not run in CI.
Also tagging a few of the slower tests with the old @Tag("slow"), to speed up the run-time.
2024-11-17 19:45:59 +01:00
Viktor Lofgren
af40fa327b
(status-service) Correct measurement pruning to use correct sqlite datetimes, so as not to delete the database
2024-11-17 18:35:34 +01:00
Viktor Lofgren
cf6d28e71e
(status-service) Enable auto-commit
2024-11-17 18:25:15 +01:00
Viktor Lofgren
3791ea1e18
(service) Add a new application service for external liveness monitoring
...
The new service 'status-service' will poll public endpoints periodically, and publish a basic read-only UI with the results, as well as publish the results to prometheus.
2024-11-17 18:01:08 +01:00
Viktor Lofgren
e5db3f11e1
(chore) Clean up some of the uglier delomboking artifacts
2024-11-15 13:57:20 +01:00
Viktor Lofgren
9f47ce8d15
(chore) Remove lombok
...
There are likely some instances of delombok gore with this commit.
2024-11-11 21:14:38 +01:00
Viktor Lofgren
a5b4951f23
(chore) Remove use of deprecated STR.-style string templates
2024-11-11 18:02:28 +01:00
Viktor Lofgren
8b8bf0748f
(feature-extraction) Add new DocumentHeaders class encapsulating Html headers.
...
Also adds a few new html features for CDNs and S3 hosting for use in ranking and query refinement.
2024-11-11 13:26:15 +01:00
Viktor Lofgren
a456ec9599
(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished
2024-11-10 18:30:28 +01:00
Viktor Lofgren
a2bc9a98c0
(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished
2024-11-10 17:45:20 +01:00
Viktor Lofgren
e24a98390c
(feed) Update API to allow specifying clean vs refresh update
...
Move the logic deciding which operation to perform into the actor, updating its state graph to incorporate a counter that runs a clean update once in a blue moon.
2024-11-09 18:43:47 +01:00
Viktor Lofgren
6f858cd627
(feed) Decrease update interval to 24 hours
2024-11-09 18:17:51 +01:00
Viktor Lofgren
a293266ccd
(feed) Wipe the feeds db and start over from system URLs periodically.
2024-11-09 18:17:16 +01:00
Viktor Lofgren
b8e0dc93d7
(search) Correctly show the feeds view when items are present
...
... otherwise show samples. This commit also removes the (Experimental) bit, as this is getting fairly mature.
2024-11-09 17:56:43 +01:00
Viktor Lofgren
d774c39031
(feeds) Reduce log spam
2024-11-09 17:56:43 +01:00
Viktor Lofgren
ab17af99da
(feeds) Refresh the feed db using the previous db, when it is available.
2024-11-09 17:56:43 +01:00
Viktor Lofgren
b0ac3c586f
(feeds) Correct parallelism using SimpleBlockingThreadPool
2024-11-09 17:56:43 +01:00
Viktor Lofgren
139fa85b18
(feeds) Add working heartbeat tracking progress
2024-11-09 17:56:43 +01:00
Viktor Lofgren
bfeb9a4538
(feeds) Retire feedlot the feed bot, move RSS capture into the live-capture service
2024-11-09 17:56:43 +01:00
Viktor Lofgren
76e9053dd0
(setup) Move some file-downloads from setup script to the first boot of the control node of the system
...
We can only do this for files that are not required for unit tests.
As it is illegal to run more than one instance of the control service, this should be fine with regard to race conditions. The boot orchestration will also ensure that no other services will boot up before the downloading is complete.
2024-11-06 15:28:20 +01:00
Viktor Lofgren
dbb8bcdd8e
(crawler) Use a better hashInt implementation in CrawlDataReference
...
Guava's hash functions are slow as hell.
2024-10-15 18:25:55 +02:00
Viktor Lofgren
7305afa0f8
(crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris
2024-10-15 17:27:59 +02:00
Viktor Lofgren
481f999b70
(crawler) Make DomainCrawlFrontier a bit less aggressive with throwing away excess links when it's approaching full.
...
Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.
2024-10-15 14:22:40 +02:00
Viktor Lofgren
4b16022556
(crawler) Correct Spec Provider so that it uses VISITED_URLS rather than KNOWN_URLS when growing domains
2024-10-15 14:21:59 +02:00
Viktor Lofgren
89dd201a7b
(link-parser) Make mailing list blocking optional
2024-10-15 13:48:32 +02:00
Viktor Lofgren
ab486323f2
(converter) Increase the number of links the converter will pick up per document
2024-10-15 13:46:19 +02:00
Viktor Lofgren
6460c11107
(index) Short-circuit rankResults when there are no results
2024-10-14 13:47:35 +02:00
Viktor Lofgren
89f7f3c17c
(query-parser) Fix regression where advice terms weren't parsed properly
2024-10-14 13:46:37 +02:00
Viktor Lofgren
fe800b3af7
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 19:04:49 +02:00
Viktor Lofgren
2a1077ff43
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 17:57:27 +02:00
Viktor Lofgren
01a16ff388
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 17:55:59 +02:00
Viktor Lofgren
eb60ddb729
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 17:49:39 +02:00
Viktor Lofgren
db5faeceee
(download-sample) Break apart actor for better error recovery
...
Change also adds logged events to give more feedback that something is happening.
2024-10-04 13:39:43 +02:00
Viktor Lofgren
45d3e6aa71
(download-sample) Break apart actor for better error recovery
...
Change also adds logged events to give more feedback that something is happening.
2024-10-04 13:19:09 +02:00
Viktor Lofgren
d84a2c183f
(*) Remove the crawl spec abstraction
...
The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled.
Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table. This is much preferred and means the operator can directly manage domains without specs.
This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
2024-10-03 13:41:17 +02:00
Viktor Lofgren
ecb5eedeae
(crawler, EXPERIMENT) Disable content type probing and use Accept header instead
...
There's reason to think this may speed up crawling quite significantly, and the benefits of the probing aren't quite there.
2024-09-30 14:53:01 +02:00
Viktor Lofgren
90a2d4ae38
(index) Fix partial buffer writing in PrioDocIdsTransformer
...
Ensure all data is written to writeChannel by looping until the buffer is fully drained. This prevents potential data loss during the close operation and maintains data integrity.
2024-09-29 17:53:40 +02:00
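The drain loop described in 90a2d4ae38 follows a standard NIO pattern: a single `write` call on a channel may consume only part of the buffer, so the call is repeated until nothing remains. The sketch below shows that pattern in isolation; the class and method names are hypothetical and this is not the code from PrioDocIdsTransformer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

final class ChannelWrites {
    /**
     * Writes all remaining bytes of the buffer to the channel.
     * WritableByteChannel.write may write fewer bytes than remain,
     * so the loop continues until the buffer is fully drained.
     */
    static void writeFully(WritableByteChannel channel, ByteBuffer buffer) throws IOException {
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
    }
}
```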
Viktor Lofgren
2b8ab97ec1
(bit-writer) Do not clear buffer when creating a bit writer
2024-09-29 17:52:43 +02:00
Viktor Lofgren
43ca9c8a12
(sequence) Return Integer.MAX_VALUE for empty position lists.
...
Updated the method to return Integer.MAX_VALUE if any of the position lists are empty, instead of returning 0. This ensures that empty lists are handled consistently and addresses edge cases where an empty list is encountered.
2024-09-29 17:21:17 +02:00
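To illustrate why 0 is a misleading result for empty input in 43ca9c8a12, here is a sketch of a minimum-distance calculation over sorted position lists. The commit does not name or show the method it changes, so the class, the signature, and the algorithm (the smallest window containing one position from each list) are assumptions.

```java
final class MinDistanceSketch {
    /**
     * Smallest (max - min) over all ways of picking one position from each
     * sorted list. Returns Integer.MAX_VALUE when there are no lists or any
     * list is empty, so that "no match" cannot be mistaken for distance 0.
     */
    static int minDistance(int[]... positions) {
        if (positions.length == 0) {
            return Integer.MAX_VALUE;
        }
        for (int[] list : positions) {
            if (list.length == 0) {
                return Integer.MAX_VALUE;
            }
        }

        int[] idx = new int[positions.length]; // current index into each list
        int best = Integer.MAX_VALUE;

        while (true) {
            int minList = 0;
            int min = positions[0][idx[0]];
            int max = min;
            for (int i = 1; i < positions.length; i++) {
                int value = positions[i][idx[i]];
                if (value < min) {
                    min = value;
                    minList = i;
                }
                if (value > max) {
                    max = value;
                }
            }
            best = Math.min(best, max - min);

            // Advance the list holding the smallest value; stop once it is exhausted.
            if (++idx[minList] >= positions[minList].length) {
                return best;
            }
        }
    }
}
```

With a return value of 0, a term that never occurs would look like a perfect adjacency match to any ranking code that rewards small distances.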
Viktor Lofgren
69d99c91dd
(index) Optimize buffer handling in PrioDocIdsTransformer
2024-09-29 17:20:49 +02:00
Viktor Lofgren
a8cc98a0f6
(index) Fix write offset calculation in PrioDocIdsTransformer
...
Adjust the write offset calculation by adding the position of the write buffer. Updated the test to validate the transformation process and ensure correctness of output file positions.
2024-09-29 17:20:29 +02:00
Viktor Lofgren
2ee58f4bc9
(index) Adjust ranking parameters to dial down the importance of tcfProximity and firstPosition
2024-09-29 15:33:12 +02:00
Viktor Lofgren
938431e514
(scrape-feeds-actor) Add deduplication of insertion data
...
To avoid unnecessary db churn, the domains to be added are put in a set instead of a list, ensuring that they are unique.
2024-09-28 14:41:14 +02:00
Viktor Lofgren
b2de3c70fa
(scrape-feeds-actor) Add explicit commit in case it's disabled
2024-09-28 14:36:57 +02:00
Viktor Lofgren
542690d9f6
(search-service) Hide pagination when there is only 1 page of results
2024-09-28 13:48:09 +02:00
Viktor Lofgren
596a7fb4ea
(actor) Disable the feed scraper on all nodes but the first
2024-09-28 12:36:16 +02:00
Viktor Lofgren
c3f726a01f
(actor) Add a feed scraping actor
...
Add a new actor that polls a URL every 6 hours and amends the domain database with any unseen domains, flagging them to be crawled by the next crawl job.
The URLs are specified in data/scrape-urls.txt. If this file is absent, the actor shuts down.
2024-09-28 12:33:29 +02:00
Viktor Lofgren
4538ade156
(live-capture) Add readme to live-capture function
2024-09-28 11:35:46 +02:00
Viktor Lofgren
f4709d8f32
(live-capture) Handle case when screenshot bytes are empty.
...
Add logic to flag the domain as fetched when the pngBytes array is empty. This ensures we won't try to re-fetch this domain again for a while.
2024-09-27 15:53:17 +02:00
Viktor Lofgren
3dda8c228c
(live-capture) Handle failed screenshot fetch in BrowserlessClient
...
Return an empty byte array when screenshot fetch fails, ensuring downstream processes are not impacted by null responses. Additionally, only attempt to upload the screenshot if the byte array is non-empty, preventing invalid data from being stored.
2024-09-27 14:52:05 +02:00
Viktor Lofgren
ccf6b7caf3
(assistant) Refactor scheduling of tasks within SimilarDomainsService
...
Changed the scheduling function to use a single schedule call instead of a fixed delay for the init task. The updateScreenshotInfo method was also moved and slightly refactored for clearer readability and consistency.
2024-09-27 14:43:19 +02:00
Viktor Lofgren
fed33ed64a
(search-service) Update screenshot request handling
...
Always request the main site screenshot to ensure staleness checks and necessary updates. Limit additional screenshot requests for similar and linking domains to avoid overloading with a maximum of 5 requests per view.
2024-09-27 14:27:25 +02:00
Viktor Lofgren
ca27d95ce1
(assistant) Add bounds checks for domain idx
2024-09-27 14:24:04 +02:00
Viktor Lofgren
3566fe296a
(assistant) Add scheduled update job for screenshot information
2024-09-27 14:16:28 +02:00
Viktor Lofgren
c91435e314
(assistant) Don't attempt to respond to similarity and linkedness queries before the data is ready
...
This will reduce the number of exceptions in the assistant logs quite significantly.
2024-09-27 14:08:08 +02:00
Viktor Lofgren
31f30069a4
(live-capture) Dial down logging a bit
2024-09-27 14:00:55 +02:00
Viktor Lofgren
23cce0c78a
Add a new function 'Live Capture' for on-demand screenshot capture
...
The screenshots are requested by the site-service, and triggered via the site-info view.
2024-09-27 13:46:34 +02:00
Viktor Lofgren
1bd29a586c
(service-discovery) Add common base interface to all Grpc services
...
To be able to tell service discovery whether to enable a service on a particular runtime, a common base interface DiscoverableService extends BindableService was added.
2024-09-27 13:46:34 +02:00
Viktor Lofgren
c757d116bf
(misc) Fix Broken Tests
2024-09-27 13:46:34 +02:00
Viktor Lofgren
4565bfe359
(crawler) Make the crawler report crawling progress correctly when stopped and resumed.
2024-09-26 18:30:29 +02:00
Viktor Lofgren
336d6fdd14
(index-client) Fix error when zero results are found
2024-09-25 20:23:13 +02:00
Viktor Lofgren
95cde242ca
(assistant) Fix NPE when IP information is absent
2024-09-25 20:19:17 +02:00
Viktor Lofgren
0d2390fd13
(search-service) Only autofocus on the query when the query is empty
2024-09-25 14:27:03 +02:00
Viktor Lofgren
4a0356e26f
(search-service) Add pagination support to the search GUI
2024-09-25 14:26:49 +02:00
Viktor Lofgren
73f973cc06
(search-query) Add pagination to search query API and the direct query-service interface
2024-09-25 14:20:59 +02:00
Viktor Lofgren
e9e8580913
(converter) Fix NPE bugs in converter due to the reintroduction of CrawledDocument.headers
2024-09-25 12:18:56 +02:00
Viktor Lofgren
8b85a58fea
(search UX) Autofocus on the search form
2024-09-24 15:56:03 +02:00
Viktor Lofgren
40512511af
(crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl
...
This code is still a bit too complex, but it's slowly getting better.
2024-09-24 15:08:22 +02:00
Viktor Lofgren
3dec4b6b34
(index) Fix bug where tcfFirstPosition lit up because one term was in the title and the other was missing from the document
...
This was because firstPosition calculation was not invalidated when positions were missing.
2024-09-24 13:33:37 +02:00
Viktor Lofgren
162fc25ebc
(minor) Fix accidental commit errors
2024-09-23 18:03:09 +02:00
Viktor Lofgren
e9854f194c
(crawler) Refactor
...
* Restructure the code to make a bit more sense
* Store full headers in crawl data
* Fix bug in retry-after header that assumed the timeout was in milliseconds, and then clamped it to a lower bound of 500ms, meaning this was almost always handled wrong
2024-09-23 17:51:07 +02:00
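The Retry-After bug mentioned in e9854f194c comes down to units: the header's numeric form is a delay in seconds, so reading `Retry-After: 5` as 5 ms and then raising it to a 500 ms floor waits far too briefly. A minimal sketch of parsing the numeric form is shown below; the class, method, and fallback parameter are hypothetical, and the HTTP-date form of the header is not handled here (it falls back to the default).

```java
import java.time.Duration;

final class RetryAfter {
    /**
     * Interprets a Retry-After header value as a delay in seconds.
     * Returns the fallback if the header is absent, negative, or not a plain number.
     */
    static Duration parseDelay(String headerValue, Duration fallback) {
        if (headerValue == null) {
            return fallback;
        }
        try {
            long seconds = Long.parseLong(headerValue.trim());
            return seconds >= 0 ? Duration.ofSeconds(seconds) : fallback;
        } catch (NumberFormatException e) {
            // Most likely the HTTP-date form, which this sketch does not parse.
            return fallback;
        }
    }
}
```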
Viktor Lofgren
9c292a4f62
(doc) Fix outdated links in documentation
2024-09-22 13:56:17 +02:00
Viktor Lofgren
edb42836da
(vcs) Fix shared state issues with VarintCodedSequence's iterators.
...
Also cleans up the code a bit.
2024-09-21 16:09:15 +02:00
Viktor Lofgren
1ff88ff0bc
(vcs) Stopgap fix for quoted queries with the same term appearing multiple times
...
There are reentrance issues with VarintCodedSequence, this hides the symptom but these need to be corrected properly.
2024-09-21 14:07:59 +02:00
Viktor Lofgren
28e7c8e5e0
Increase temporal bias weight to give the recent results filter a bit more recency
2024-09-17 18:11:40 +02:00
Viktor Lofgren
8e78286068
Merge branch 'master' into term-positions
2024-09-17 15:20:46 +02:00
Viktor Lofgren
f4eeef145e
(index) Reduce fetch size to improve timeout characteristics
2024-09-17 15:20:41 +02:00
Viktor Lofgren
87aa869338
(index) Correct positions mask to take into account offsets when overlapping
2024-09-17 14:40:37 +02:00
Viktor Lofgren
60ad4786bc
(index) Use MemorySegment.copy for LongArray->LongArray transfers
2024-09-17 13:56:31 +02:00
Viktor Lofgren
a74df7f905
(index) Increase buffer size for PrioDocIdsTransformer
2024-09-17 13:52:52 +02:00
Viktor Lofgren
9f9c6736ab
(index) Use MemorySegment.copy for LongArray->LongArray transfers
2024-09-17 13:49:02 +02:00
Viktor Lofgren
b95646625f
(index) Correct prio index construction with mmap
...
Accidentally snuck in behavior from full index
2024-09-17 13:39:08 +02:00
Viktor Lofgren
6e47eae903
(index) Correct strange close handling of PositionsFileConstructor
2024-09-13 16:34:14 +02:00
Viktor Lofgren
934af0dd4b
(index) Correct units in log message when shrinking the documents file
2024-09-13 16:33:19 +02:00
Viktor Lofgren
a8bec13ed9
(index) Evaluate using mmap reads during index construction in favor of filechannel reads
...
It's likely that this will be faster, as the reads are on average small and sequential, and can't be buffered easily.
2024-09-13 16:14:56 +02:00
Viktor Lofgren
1cf62f5850
(doc) Correct dead links and stale information in the docs
2024-09-13 11:02:13 +02:00
Viktor Lofgren
8047e77757
(doc) Correct dead links and stale information in the docs
2024-09-13 11:01:05 +02:00
Viktor Lofgren
2a92de29ce
(loader) Fix it so that the loader doesn't explode if it sees an invalid URL
2024-09-12 11:36:00 +02:00
Viktor Lofgren
99523ca079
(query-parser) Remove test that is no longer relevant
2024-09-10 10:35:56 +02:00
Viktor Lofgren
35f49bbb60
(coded-sequence) Add equals and hashCode to VCS
2024-09-10 10:33:56 +02:00
Viktor Lofgren
50ec922c2b
(index) Fix broken index tests
...
Also cleaned up the tests to be less fragile to ranking algorithm changes.
2024-09-10 10:23:46 +02:00
Viktor Lofgren
cfbbeaa26e
(ranking) Clean up ranking test code
2024-09-08 15:46:51 +02:00
Viktor Lofgren
a3b0189934
Fix build errors after merge
2024-09-08 10:22:32 +02:00
Viktor Lofgren
8f367d96f8
Merge branch 'master' into term-positions
...
# Conflicts:
# code/index/java/nu/marginalia/index/results/model/ids/TermIdList.java
# code/processes/converting-process/java/nu/marginalia/converting/ConverterMain.java
# code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
# code/processes/crawling-process/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java
# code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/CrawledDomainReader.java
# code/processes/crawling-process/test/nu/marginalia/crawling/HttpFetcherTest.java
# code/processes/crawling-process/test/nu/marginalia/crawling/retreival/CrawlerMockFetcherTest.java
# code/services-application/search-service/java/nu/marginalia/search/svc/SearchQueryIndexService.java
2024-09-08 10:14:43 +02:00
Viktor Lofgren
f78ef36cd4
(slop) Upgrade to 0.0.8, add encodings to string columns.
2024-09-04 15:19:00 +02:00
Viktor Lofgren
dc67c81f99
(summary) Fix a few cases where noscript tags would sometimes be used for document summary
2024-09-04 15:00:40 +02:00
Viktor Lofgren
50ba8fd099
(query-parsing) Correct handling of trailing parentheses
2024-09-03 11:45:14 +02:00
Viktor Lofgren
99b3b00b68
(query-parsing) Merge QueryTokenizer into QueryParser and add escaping of query grammar
2024-09-03 11:35:32 +02:00
Viktor Lofgren
f6d981761d
(query-parsing) Drop search term elements that aren't indexed by the search engine
2024-09-03 11:24:05 +02:00
Viktor Lofgren
8290c19e24
(query-parsing) Drop search term elements that aren't indexed by the search engine
2024-09-03 11:21:01 +02:00
Viktor Lofgren
7a69dff6cf
(search) Correct handling of languages on fandom
2024-09-01 13:46:01 +02:00
Viktor Lofgren
bfb7ed2c99
(search) Translate cursed medium URLs to scribe.rip links via the search application
2024-09-01 13:32:14 +02:00
Viktor Lofgren
e19dc9b13e
(search) Translate cursed fandom URLs to breezewiki links via the search application
2024-09-01 13:23:35 +02:00
Viktor Lofgren
74148c790e
(crawler) Pull additional new domains from node-affinity 0
...
Previously a bit ambiguously defined, node affinity 0 is now indicative that a domain is up for grabs for the next crawler
2024-09-01 13:00:36 +02:00
Viktor Lofgren
3d77456110
(*) Add domain parking service to ip blocklist
2024-09-01 12:53:22 +02:00
Viktor Lofgren
ab6a4b1749
(control) Correct id value for domain addition tool
2024-09-01 12:25:15 +02:00
Viktor Lofgren
aeeb1d0cb7
(control) Add utility for adding domains from an external URL
2024-09-01 12:14:21 +02:00
Viktor Lofgren
185b79f2a5
(converter) Fix bug where sideloaded reddit content was erroneously categorized as wiki-generated.
2024-09-01 11:30:25 +02:00
Viktor Lofgren
8d0f9652c7
(crawler) Correct RSS-sitemap behavior
2024-08-31 11:38:34 +02:00
Viktor Lofgren
5353805cc6
(crawler) Correct RSS-sitemap behavior
2024-08-31 11:37:09 +02:00
Viktor Lofgren
5407da5650
(crawler) Grab favicons as part of root sniff
2024-08-31 11:32:56 +02:00
Viktor Lofgren
b1bfe6f76e
(control) New view for domains
...
Add capability to assign domains, and bulk-add new domains.
2024-08-30 17:06:48 +02:00
Viktor Lofgren
74e25370ca
(control) New view for domains
...
Still a work in progress, but at this point it's possible to use for viewing domains
2024-08-29 15:40:40 +02:00
Viktor Lofgren
bb5d946c26
(index, EXPERIMENTAL) Clean up ranking code
2024-08-29 11:34:23 +02:00
Viktor Lofgren
abab5bdc8a
(index, EXPERIMENTAL) Evaluate using Varint instead of GCS for position data
2024-08-26 14:20:39 +02:00
Viktor Lofgren
30bf845c81
(index) Speed up minDist calculations by excluding large lists
2024-08-26 13:04:15 +02:00
Viktor Lofgren
77efce0673
(paper-doll) Fix compilation
2024-08-26 12:51:29 +02:00
Viktor Lofgren
67a98fb0b0
(coded-sequence) Handle weird legacy HTML that puts everything in a heading
2024-08-26 12:49:15 +02:00
Viktor Lofgren
7d471ec30d
(coded-sequence) Evaluate new minDist implementation
2024-08-26 12:45:11 +02:00
Viktor Lofgren
f3182a9264
(coded-sequence) Evaluate new minDist implementation
2024-08-26 12:02:37 +02:00
Viktor Lofgren
805cb5ad58
(coded-sequence) Correct behavior of findIntersections
2024-08-25 14:54:17 +02:00
Viktor Lofgren
fdf05cedae
(index) Optimize DocumentSpan.countIntersections
2024-08-25 14:12:30 +02:00
Viktor Lofgren
9c5f463775
(index) Optimize DocumentSpan.countIntersections
2024-08-25 13:59:11 +02:00
Viktor Lofgren
893fae6d59
(index) Optimize DocumentSpan.countIntersections
2024-08-25 13:51:43 +02:00
Viktor Lofgren
5660f291af
(index) Optimize DocumentSpan.countIntersections
2024-08-25 13:43:29 +02:00
Viktor Lofgren
efd56efc63
(index) Optimize SequenceOperations.minDistance
2024-08-25 13:28:06 +02:00
Viktor Lofgren
d94373f4b1
(index) Optimize calculatePositionsMask
2024-08-25 13:24:37 +02:00
Viktor Lofgren
0d01a48260
(index) Optimize SequenceOperations
2024-08-25 13:19:37 +02:00
Viktor Lofgren
00ab2684fa
(index) Optimize SequenceOperations
2024-08-25 13:17:38 +02:00
Viktor Lofgren
a5585110a6
(index) Optimize SequenceOperations
2024-08-25 13:16:31 +02:00
Viktor Lofgren
965c89798e
(index) Optimize DocumentSpan
2024-08-25 12:44:33 +02:00
Viktor Lofgren
982b03382b
(index) Optimize DocumentSpan
2024-08-25 12:31:15 +02:00
Viktor Lofgren
24b805472a
(index) Evaluate performance implication of decoding gcs early
2024-08-25 12:23:09 +02:00
Viktor Lofgren
6ce029b317
(index) Remove vestigial parameter
2024-08-25 12:14:12 +02:00
Viktor Lofgren
63e5b0ab18
(index) Correct weightedCounts calculations
2024-08-25 12:06:56 +02:00
Viktor Lofgren
6dda2c2d83
(coded-sequence) Reduce allocations in GCS.values()
2024-08-25 12:06:31 +02:00
Viktor Lofgren
3fb3c0b92e
(index) Optimize ranking calculations
2024-08-25 11:56:11 +02:00
Viktor Lofgren
aa2c960b74
(index) Optimize ranking calculations
2024-08-25 11:53:44 +02:00
Viktor Lofgren
4fbcc02f96
(index) Adjust sensible defaults for ranking parameters
2024-08-25 11:24:16 +02:00
Viktor Lofgren
9aa8f13731
(index) Remove tcfAvgDist ranking parameter
...
This is captured by tcfProximity already
2024-08-25 11:20:19 +02:00
Viktor Lofgren
65bee366dc
(index) Try harmonic mean for avgMinDist
2024-08-25 11:11:52 +02:00
Viktor Lofgren
53700e6667
(index) Try harmonic mean for avgMinDist
2024-08-25 11:08:41 +02:00
Viktor Lofgren
7f498e10b7
(index) Adjust proximity score
2024-08-25 11:01:35 +02:00
Viktor Lofgren
6eb0f13411
(index) Adjust handling of full phrase matches to prioritize full query matches over large partial matches
2024-08-25 10:54:04 +02:00
Viktor Lofgren
773377fe84
(index) Correct handling of full phrase match group
2024-08-25 10:48:34 +02:00
Viktor Lofgren
4372c8c835
(index) Give ranking components more consistent names
2024-08-25 10:44:27 +02:00
Viktor Lofgren
099133bdbc
(index) Fix verbatim match score after moving full phrase group to a separate entity
2024-08-25 10:43:35 +02:00
Viktor Lofgren
b09e2dbeb7
(build) Fix dependency churn from testcontainers
...
Apparently you need to pull in commons-codec now in order to run testcontainers, through spooky action at a distance.
2024-08-25 10:35:48 +02:00
Viktor Lofgren
96bcf03ad5
(index) Address broken tests
...
They are still broken, but less so.
2024-08-25 10:34:36 +02:00
Viktor Lofgren
0999f07320
(search-query) Add new ranking parameters for proximity and verbatim matches
2024-08-25 10:34:12 +02:00
Viktor Lofgren
5d2b455572
(search) Clean up inconsistent usage of MathClient in SearchOperator
...
Also clean up SearchOperator and adjacent code
2024-08-24 10:39:31 +02:00
Viktor Lofgren
ea75ddc0e0
(search) Absorb SearchQueryIndexService into SearchOperator, and clean up SearchOperator
2024-08-22 11:50:52 +02:00
Viktor Lofgren
2db0e446cb
(search) Absorb SearchQueryIndexService into SearchOperator, and clean up SearchOperator
2024-08-22 11:49:29 +02:00
Viktor Lofgren
557bdaa694
(search) Clean up SearchQueryIndexService and surrounding code
2024-08-22 11:45:28 +02:00
Viktor Lofgren
9eb1f120fc
(index) Repair positions bitmask for search result presentation
2024-08-22 11:28:23 +02:00
Viktor Lofgren
266d6e4bea
(slop) Replace SlopPageRef<T> with SlopTable.Ref<T>
2024-08-21 10:13:49 +02:00
Viktor Lofgren
e4c97a91d8
(*) Comment clarity
2024-08-21 10:12:00 +02:00
Viktor Lofgren
b0a874a842
(*) Upgrade slop library -> 0.0.5
2024-08-18 11:05:27 +02:00
Viktor Lofgren
bca40de107
(*) Upgrade slop library
2024-08-18 10:43:41 +02:00
Viktor Lofgren
93652e0937
(qdebug) Accurately display positions when intersecting with spans
2024-08-15 11:55:48 +02:00
Viktor Lofgren
0a383a712d
(qdebug) Accurately display positions when intersecting with spans
2024-08-15 11:44:17 +02:00
Viktor Lofgren
03d5dec24c
(*) Refactor termCoherences and rename them to phrase constraints.
2024-08-15 11:02:19 +02:00
Viktor Lofgren
b2a3cac351
(*) Remove broken imports
2024-08-15 11:01:34 +02:00
Viktor Lofgren
a18edad04c
(index) Remove stopword list from converter
...
We want to index all words in the document. Stopword handling is moved to the index, where we change the semantics to elide inclusion checks in query construction for a very short list of words tentatively hard-coded in SearchTerms.
2024-08-15 09:36:50 +02:00
Viktor Lofgren
92522e8d97
(index) Attenuate bm25 score based on query length
2024-08-15 08:41:38 +02:00
Viktor Lofgren
049d94ce31
(index) Add body position match to qdebug fields
2024-08-15 08:39:37 +02:00
Viktor Lofgren
dbc6a95276
(index) Consume the new 'body' span in index to make it used in ranking
2024-08-15 08:33:43 +02:00
Viktor Lofgren
75b0888032
(slop) Migrate to latest Slop version
2024-08-14 11:44:35 +02:00
Viktor Lofgren
2ad93ad41a
(*) Clean up
2024-08-14 11:43:45 +02:00
Viktor Lofgren
623ee5570f
(slop) Break slop out into its own repository
2024-08-13 09:50:05 +02:00
Viktor Lofgren
fd2bad39f3
(keyword-extraction) Add body field for terms that are not otherwise part of a field
2024-08-13 09:49:26 +02:00
Viktor Lofgren
e6c8a6febe
(index) Add index-side deduplication in selectBestResults
2024-08-10 10:51:59 +02:00
Viktor Lofgren
4ece5f847b
(index) Add more qdebug factors
2024-08-10 10:45:30 +02:00
Viktor Lofgren
e4f04af044
(index) Give BODY matches a verbatim match value
2024-08-10 10:22:19 +02:00
Viktor Lofgren
b730b17f52
(index) Correct handling of firstPosition to avoid division by zero
2024-08-10 10:21:59 +02:00
Viktor Lofgren
98c40958ab
(index) Simplify verbatim match calculation
2024-08-10 09:54:56 +02:00
Viktor Lofgren
41b52f5bcd
(index) Simplify verbatim match calculation
2024-08-10 09:51:03 +02:00
Viktor Lofgren
4264fb9f49
(query-service) Clean up qdebug UI a bit
2024-08-10 09:51:03 +02:00
Viktor Lofgren
016a4c62e1
(index) Bug fixes: chase down and fix mystery results that did not contain all relevant keywords
2024-08-10 09:51:03 +02:00
Viktor Lofgren
2f38c95886
(index) Backport bugfix from term-positions branch
...
The ordering of TermIdsList is assumed to be unchanged by the surrounding code, but the constructor sorts the dang list to be able to do contains() by binary search. This is no bueno.
This is gonna be a merge conflict in the future, but it's too big of a bug to leave for another month.
2024-08-09 21:17:02 +02:00
Viktor Lofgren
df89661ed2
(index) In SearchResultItem, populate combinedId with combinedId and not its ranking-removed documentId cousin
2024-08-09 16:32:32 +02:00
Viktor Lofgren
41da4f422d
(search-query) Always generate the "all"-segmentation
2024-08-09 13:20:00 +02:00
Viktor Lofgren
2e89b55593
(wip) Repair qdebug utility and show new ranking details
2024-08-09 12:57:25 +02:00
Viktor Lofgren
7babdb87d5
(index) Remove intermediate models
2024-08-07 10:10:44 +02:00
Viktor Lofgren
680ad19c7d
(keyword-extraction) Correct behavior when loading spans so that they are not double-loaded, which caused errors
2024-08-06 11:16:56 +02:00
Viktor Lofgren
f01267bc6b
(index) Don't load fwd index offsets into a hash table at start.
...
Loading the offsets into a hash table makes the service take forever to start up. Memory map the data instead and binary search it. This is a bit slower per lookup, but not by much.
2024-08-06 11:16:28 +02:00