I've been struggling to crawl a WordPress site whose poorly built theme produces an endless loop of links with many identical path segments.
What would be the preferred way to implement something like Heritrix's PathologicalPathDecideRule
in Browsertrix? A JS behaviour?
For reference, Heritrix describes PathologicalPathDecideRule as follows:
Rule REJECTs any URI which contains an excessive number of identical,
consecutive path-segments
(e.g. http://example.com/a/a/a/boo.html == 3 '/a' segments)
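
To make clear what I'm after, here is a minimal sketch of the check in Python. It is not Heritrix's code and not an existing Browsertrix feature. The function names and the default threshold of 2 allowed repetitions are my own assumptions for illustration.

```python
from urllib.parse import urlsplit


def max_consecutive_repeats(url: str) -> int:
    """Return the length of the longest run of identical, consecutive path segments."""
    # Split the path on "/" and drop empty segments (leading or doubled slashes).
    segments = [s for s in urlsplit(url).path.split("/") if s]
    longest = 0
    run = 0
    previous = None
    for segment in segments:
        # Extend the current run if the segment repeats, otherwise start a new run.
        run = run + 1 if segment == previous else 1
        longest = max(longest, run)
        previous = segment
    return longest


def is_pathological(url: str, max_repetitions: int = 2) -> bool:
    """Hypothetical reject rule: True if a segment repeats more than max_repetitions times in a row."""
    # max_repetitions=2 is an assumed default, not a value taken from either crawler.
    return max_consecutive_repeats(url) > max_repetitions


if __name__ == "__main__":
    print(is_pathological("http://example.com/a/a/a/boo.html"))
    print(is_pathological("http://example.com/a/b/a/boo.html"))
```

With the assumed threshold of 2, the example URL from the rule description would be rejected because '/a' appears three times in a row, while a URL such as http://example.com/a/b/a/boo.html would still be crawled. What I don't know is where a check like this would best live in Browsertrix.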