I tested my first Browsertrix crawl last night. For my needs, backing up a one-year course with daily activity and its underlying links, the ArchiveWeb.page extension is proving a huge undertaking: I would have to click for days.
The test crawl I ran last night hit some YouTube links that did not archive as playable video. The actual archive I want to make includes Vimeo and other external links from the main site, and I'd like those captured in the WACZ too. I'm not a coder and am struggling to understand what I read on GitHub. It sounds like I might need to use --scope and/or --headless to include YouTube. There is currently an open issue for a crawl depth option: Add Crawl Depth Option · Issue #16 · webrecorder/browsertrix-crawler · GitHub
I tried --headless and it did not archive any YouTube links. For --scope I was lost in the regex of URLs (I could not understand it). I'm also concerned that if I put youtube into --scope, the crawler will try to download all of YouTube.
Any help modifying the default:
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --generateWACZ --text --collection test
to include YouTube, Vimeo, or other underlying/external links would be greatly appreciated.
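From what I can piece together in the README, my best guess is something like the command below. To be clear, the --scopeType and --include flags and the regex are my assumptions from the docs, not something I've confirmed works, and I don't know whether this combination is right or whether it would pull in too much of YouTube:

```shell
# Guess only: crawl the course site, but also allow YouTube/Vimeo pages
# that the course pages link out to. The --include regex and --scopeType
# value are my assumptions from the README; the syntax may be wrong.
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
  --url [URL] \
  --scopeType prefix \
  --include "https?://(www\.)?(youtube\.com|vimeo\.com)/.*" \
  --generateWACZ --text --collection test
```

If someone can confirm whether --include is the right flag here, and whether a regex like that would stay limited to the videos actually linked from the course rather than spidering all of YouTube, that would help a lot.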