No pages recognized in Upload

Hey all, first off: Browsertrix is a lifesaver. Thank you so much for the tool and for making it open source.

I’m having some trouble with uploads to the Browsertrix service. As background, embedded YouTube videos weren’t being properly archived when I ran a crawl on the Browsertrix service (it kept saying I was a bot and needed to log in). I decided to run a copy of Browsertrix locally for a single page with a YouTube embed. When I did that, the crawl and the replay both worked perfectly. Then I downloaded the WACZ file from my local instance and uploaded it to the Browsertrix service. Once it’s uploaded, though, the overview says there aren’t any pages, and when I go to replay, it doesn’t list any pages or URLs to visit. What’s weird is that if I download the uploaded WACZ again and open it in ReplayWeb.page, it works perfectly fine.

Any ideas what is going on? This is the particular upload: Browsertrix

Hi @wwahammy, I believe you’ve hit this bug having to do with multi-WACZ uploads in Browsertrix: WACZ-files dowloaded from Browsertrix and then uploaded to Browsertrix using "Upload WACZ" contains 0 pages · Issue #2814 · webrecorder/browsertrix · GitHub

We should have a fix for that in the coming weeks, but in the meantime if you go to the Files tab of the crawl and download the individual WACZ file(s) by clicking on their names rather than using the Download button, those WACZs will upload as expected into Browsertrix.
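For anyone who wants to check a file before uploading, here is a minimal sketch of a local test. It assumes a combined download is a ZIP that wraps the individual `.wacz` files and has no `pages/*.jsonl` of its own at the top level; that layout is an assumption, not something confirmed in this thread. The function name `looks_like_multi_wacz` is hypothetical and not part of Browsertrix.

```python
import zipfile


def looks_like_multi_wacz(path_or_file) -> bool:
    """Heuristic: True if the archive wraps other .wacz files and has no page list.

    Assumption (not confirmed by the Browsertrix docs or this thread): a
    single-crawl WACZ stores its page list under pages/*.jsonl, while a
    combined download only nests the individual .wacz files.
    """
    with zipfile.ZipFile(path_or_file) as zf:
        names = zf.namelist()

    # Nested archives are the sign of a wrapper around several WACZ files.
    has_nested_wacz = any(name.lower().endswith(".wacz") for name in names)
    # A regular WACZ is expected to carry its own page list.
    has_page_list = any(
        name.startswith("pages/") and name.endswith(".jsonl") for name in names
    )
    return has_nested_wacz and not has_page_list
```

Calling `looks_like_multi_wacz("my-crawl.wacz")` on a downloaded file (the filename is only an example) returns `True` when the file matches that wrapper layout, in which case downloading the individual files from the Files tab is the safer route.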

We also have changes coming up in the crawler to always crawl YouTube through a dedicated proxy, which should make YouTube crawling more consistent and remove the need for this kind of workaround!


Thanks @tessa-webrecorder, that was indeed the issue.

As a first step toward addressing this, is there any chance you could show a warning in the UI about this problem, or an error if a multi-WACZ is detected? I don’t know how much time you’d want to spend on this since it will be fixed anyway, but it would make for a slightly better experience in the interim.

We can definitely consider flagging this issue somehow until it’s addressed. We’re also working on another change that should be out soonish so that the download buttons will only download a multi-WACZ when it’s actually necessary (i.e. a crawl has multiple WACZ files), and otherwise will just download the crawl’s WACZ file as-is. That won’t solve the issue on its own but should make it less likely to happen in many cases.
