I’ve asked a similar question here before with no luck, but it came up once more so I’m asking again.
A site I’m trying to archive hosts all of its image and document content on the CloudFront CDN, using what I believe are “signed” URLs.
These URLs look similar to the following:
The expires value is an epoch timestamp set within 5 minutes of page load. The files are downloadable as long as that time hasn’t passed.
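To make the expiry check concrete, here is a minimal sketch of how the remaining lifetime of such a link could be computed. It assumes the timestamp is carried in a query parameter named `Expires`, which I have not confirmed for this site. The URL, the other parameter values, and the function name `seconds_until_expiry` are all made up for illustration.

```python
from urllib.parse import urlparse, parse_qs
import time


def seconds_until_expiry(url, now=None):
    """Return the seconds left before the URL's expiry time, or None if
    the URL has no `Expires` query parameter (assumed parameter name)."""
    query = parse_qs(urlparse(url).query)
    values = query.get("Expires")
    if not values:
        return None
    # Allow the current time to be injected so the check is reproducible.
    now = time.time() if now is None else now
    return int(values[0]) - now


# Hypothetical signed URL; the host, path, and values are invented.
url = (
    "https://d111111abcdef8.cloudfront.net/files/report.pdf"
    "?Expires=1700000300&Signature=abc&Key-Pair-Id=XYZ"
)

print(seconds_until_expiry(url, now=1700000000))  # 300
```

A zero or negative result means the link has already expired and a download attempt would be refused.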
What appears to be happening during my crawl is that a majority of these file URLs end up far enough down in the queue that, by the time they are processed, the links have expired.
Is there any way I can force Browsertrix to download these PDFs, images, Excel files, etc. first? Or perhaps prioritize certain items in the queue by URL?