Browsertrix depth

I tested my first Browsertrix crawl last night. For my needs, backing up a year-long course with daily activity plus its underlying links, the ArchiveWeb.page extension is proving a huge undertaking: I would have to click for days.

The test I did last night had some YouTube links that did not archive as video. The actual archive I want to create includes Vimeo and other external links from the main site that I'd like captured in the WACZ. I'm not a coder and am struggling to understand what I read on GitHub. It sounds like I might need to use --scope and/or --headless to include YouTube. There is also an open issue for adding a crawl depth option: Add Crawl Depth Option · Issue #16 · webrecorder/browsertrix-crawler · GitHub

I tried --headless and it did not archive any YouTube links. With --scope I was lost in the URL regexes (I could not make sense of them), and I'm also concerned that if I put YouTube in --scope, the crawler will try to download all of YouTube.

Any help modifying the default:
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --generateWACZ --text --collection test
to include YouTube, Vimeo, or other underlying/external links would be greatly appreciated.
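In case it helps to see my guess concretely: from my reading of the README, a regex scoping flag plus a page limit might keep YouTube from snowballing. This is only a sketch, and my regex may well be wrong; depending on the crawler version, the regex flag is --scope (older releases) or --include (newer ones):

docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --generateWACZ --text --collection test --include "https?://((www\.)?youtube\.com/watch|player\.vimeo\.com/video).*" --limit 100

The idea is that the regex only matches individual watch/embed pages rather than all of youtube.com, and --limit caps the total page count as a safety net.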

Thank you

I’m reading this: GitHub - webrecorder/browsertrix-behaviors: Automated behaviors that run in browser to interact with complex sites automatically. Used by ArchiveWeb.page and Browsertrix Crawler. I guess I have to set up behaviors?

I tried the parameter --behaviors autoscroll,autoplay,autofetch,siteSpecific. My new WACZ was larger, but I didn't find the videos in it (without going to a live link). I will keep trying.
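For the record, the full command was roughly the default from my first post with the behaviors flag appended:

docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --generateWACZ --text --collection test --behaviors autoscroll,autoplay,autofetch,siteSpecific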

I’ve moved on from testing to my actual year-long course backup/archive. The site requires login credentials, so I followed the Creating and Using Browser Profiles instructions here: GitHub - webrecorder/browsertrix-crawler: Run a high-fidelity browser-based crawler in a single Docker container (my exact commands are sketched below). The profile seems to work, but the crawl fails. Does leaving --text on generate the file pages.jsonl? Looking in there, it appears I'm logged in, but at the very end it says:

\nUnsupported Browser\nPN recommends latest Chrome, Firefox, Safari or Edge.\n×"}

Does this mean I inherently have to use the --headless parameter?
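For reference, the profile commands I ran followed the README's example (the login URL is a placeholder since I can't share the site; the create-login-profile step prompts for the username and password interactively):

docker run -v $PWD/crawls/profiles:/output/ -it webrecorder/browsertrix-crawler create-login-profile --url "[LOGIN URL]"

docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --generateWACZ --text --collection course --profile /crawls/profiles/profile.tar.gz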

I get the same result with --headless. I’m pretty sure I’m logged in because I see text in pages.jsonl that would only appear if logged in.

Any advice would be extremely helpful as I would have days of clicking ahead of me with the browser extension if I cannot get this to work. Thank you again for these amazing archiving tools. Much appreciated.

If I have no success with the ArchiveWeb.page extension's "Start With Autopilot" option turned on, but can archive the site by manually clicking on its various links, does that mean browsertrix-crawler inherently will not work on the site I'm trying to archive? Is there any way to tell from the log files why a crawl has no success? On a test site it works; on the site I actually want to archive it does not. I'm pretty sure I'm correctly logged in (if I don't use a profile, pages.jsonl looks completely different than when I do).
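In case it matters, the way I've been comparing pages.jsonl between runs is just pulling out the recorded URLs (the path is from my test collection; jq is simply what I found for reading the JSON lines, and the // empty skips the header line that has no url field):

jq -r '.url // empty' crawls/collections/test/pages/pages.jsonl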

Hi Jack,
Sorry you’re having trouble with all of this! Can you share the site you're trying to archive? It's hard to give specific advice without looking at the site. Are the videos embedded on the page, or do they link out to other sites (YouTube, etc.)? Embedded videos should get archived automatically, but perhaps something isn't working. We can continue over e-mail as well to look at this specific site.

Hi Ilya,

Thanks for your reply. It's a challenge to share the site, as login credentials are required and there is no way to make a dummy user account to test it out. I've just been getting after it with the Chrome extension, and there it seems to work. The videos are on Vimeo, but they are not overtly displayed: there is an image and a down arrow, and clicking the down arrow reveals the video. For each 'section' of the course with a video, there can be 2-4 'options' with radio buttons. Choosing different radio buttons reveals different videos and images. I'm trying to archive all the 'options'. Would Browsertrix Crawler attempt to select all the radio buttons, down arrows, etc.? Perhaps manual crawling is the only way here. I understand it is next to impossible to troubleshoot without being able to look at the specific site. Thanks again for these incredible tools.

Kind regards
Jack