Unable to exclude an embedded audio file hosted on Google Drive in Browsertrix-Crawler

PatoPan · October 24, 2023, 5:54pm

I am trying to archive a Blogger site with Browsertrix-Crawler, but the blog has an embedded audio file that appears on every page and post. Because Google Drive provides a different download link each time, the file gets redownloaded for every page and post, which is causing hundreds of duplicates.

I am stuck trying to figure out how to exclude it. So far I have tried the following:

    exclude:
      - .*1UBgLcRFGNatNULprWqn6SWFQu26kCUuO.*
      - .*drive.google.com/uc\?id=1UBgLcRFGNatNULprWqn6SWFQu26kCUuO&export=download
      - .*drive.google.com/uc\?id=1UBgLcRFGNatNULprWqn6SWFQu26kCUuO&export=download
      - doc-04-38-docs.googleusercontent.com.*
      - drive.google.com.*
    blockRules:
      - url: .*1UBgLcRFGNatNULprWqn6SWFQu26kCUuO.*
      - url: .*drive.google.com/uc\?id=1UBgLcRFGNatNULprWqn6SWFQu26kCUuO&export=download
      - url: doc-04-38-docs.googleusercontent.com
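One way to sanity-check the patterns outside the crawler is a minimal Python sketch like the one below. It makes several assumptions: the sample URLs are hypothetical (in particular, the second `googleusercontent.com` host and the `/docs/securesc/example` path are invented to stand in for a download link that changes between requests), Python's `re.search` is only an approximation of how the crawler's JavaScript regexes are applied, and the helper name `matching_patterns` is made up for illustration.

```python
import re

# Patterns copied from the blockRules in the config above.
BLOCK_PATTERNS = [
    r".*1UBgLcRFGNatNULprWqn6SWFQu26kCUuO.*",
    r".*drive.google.com/uc\?id=1UBgLcRFGNatNULprWqn6SWFQu26kCUuO&export=download",
    r"doc-04-38-docs.googleusercontent.com",
]

# Hypothetical request URLs: the first is the embed link from the config,
# the other two are invented examples of per-request download hosts.
SAMPLE_URLS = [
    "https://drive.google.com/uc?id=1UBgLcRFGNatNULprWqn6SWFQu26kCUuO&export=download",
    "https://doc-04-38-docs.googleusercontent.com/docs/securesc/example",
    "https://doc-0c-7k-docs.googleusercontent.com/docs/securesc/example",
]


def matching_patterns(url, patterns=BLOCK_PATTERNS):
    """Return the patterns that match anywhere in the given URL."""
    return [p for p in patterns if re.search(p, url)]


if __name__ == "__main__":
    for url in SAMPLE_URLS:
        hits = matching_patterns(url)
        print(url, "->", hits if hits else "NO MATCH")
```

If the final download host really does vary between requests, a rule that hard-codes `doc-04-38-docs` would miss the other hosts, as the third sample URL shows. A broader pattern such as `docs\.googleusercontent\.com` might be worth testing, though this sketch does not confirm how the crawler itself applies the rules.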

I also created a test blog that reproduces this issue: https://patotester14.blogspot.com

I also tried passing --behaviors autoscroll,autofetch,siteSpecific to disable the autoplay behavior. At first I thought this solved the issue, but it didn’t. The only improvement is that drive.google no longer appears in the logs; it still shows up on replayweb.page.

I need to fix this because it is turning a website that should only be about 100 MB into one that is over 1 GB.