Youtube Videos captured with Browsertrix-Crawler not playable in Pywb

mona · April 15, 2025, 4:52pm

I crawled a blog with embedded YouTube videos, which were then missing from the archive. I therefore extracted all YouTube embed URLs (https://www.youtube.com/embed/) from the logs to capture them again.
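For anyone who wants to repeat the extraction step, here is a minimal sketch of how the embed URLs could be pulled out of log text with a regular expression. The log excerpt and the names `EMBED_RE`, `extract_embed_urls` and `sample_log` are hypothetical and only serve as illustration; real crawler logs will look different.

```python
import re

# Matches YouTube embed URLs such as https://www.youtube.com/embed/<video id>.
# Assumption: video IDs consist of letters, digits, "-" and "_".
EMBED_RE = re.compile(r"https://www\.youtube\.com/embed/[A-Za-z0-9_-]+")


def extract_embed_urls(log_text: str) -> list[str]:
    """Return the unique embed URLs found in log_text, in first-seen order."""
    seen: dict[str, None] = {}
    for match in EMBED_RE.finditer(log_text):
        seen.setdefault(match.group(0), None)
    return list(seen)


# Hypothetical log excerpt, not taken from a real crawl.
sample_log = """
{"message": "Page loaded", "url": "https://example.org/post-1"}
{"message": "Frame", "url": "https://www.youtube.com/embed/n7ky-nuw-us?feature=oembed"}
{"message": "Frame", "url": "https://www.youtube.com/embed/n7ky-nuw-us"}
"""

embed_urls = extract_embed_urls(sample_log)
print("\n".join(embed_urls))
```

Query strings are dropped on purpose, so the same video is only listed once. The resulting list can be written to a file with one URL per line and used as seeds for a second crawl.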

I found that when the YouTube embed URL is given directly as a seed to browsertrix-crawler, the video can be archived. I see that the YouTube behavior is failing right now (Autoplay Behavior: YouTube Embed · Workflow runs · webrecorder/browsertrix-behaviors · GitHub), but I was still able to capture the YouTube pages this way, as a functional replay in ReplayWeb.page shows.

This leads to the replay issue…

Replay issue

The youtube captures are functional in ReplayWeb.page but not in pywb.

| Capture | pywb (version 2.8.0) | ReplayWeb.page (v2.3.4) |
| --- | --- | --- |
| Browsertrix-Crawler capture (1.5.8, with warcio.js 2.4.3) | not playable | playable |
| ArchiveWeb.page capture (0.14.2, using warcio.js 2.4.2) | playable | playable |

I think it has something to do with the resource https://www.youtube.com/youtubei/v1/player?prettyPrint=false. For the tool combinations where the video is playable, the response to this request is 200. For the combination of browsertrix-crawler and pywb, the response is 404.

Browsertrix Capture

pywb replay: https://webarchives.rhizome.org/youtube_embeds_5_1741774579/20250312101726/https://www.youtube.com/embed/n7ky-nuw-us
zipped pywb collection: https://monaulrich.online/web_archives/youtube_embeds_5_1741774579.zip

ArchiveWeb.page Capture

downloaded wacz from AWP: https://monaulrich.online/web_archives/youtube_embeds_5_awp.wacz
pywb replay (reindexed): https://webarchives.rhizome.org/youtube_embeds_5_awp/20250312101726/https://www.youtube.com/embed/n7ky-nuw-us
zipped pywb collection: https://monaulrich.online/web_archives/youtube_embeds_5_awp.zip

Index & WARCs Checks

I have also checked the resource (https://www.youtube.com/youtubei/v1/player?prettyPrint=false) in both WARCs and indexes. The pywb index of the Browsertrix capture does not contain the whole POST request header and payload, whereas the pywb index of the ArchiveWeb.page capture contains it.
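A check like this can be sketched in a few lines of Python. The sketch assumes the usual CDXJ layout (search key, timestamp, JSON block, separated by spaces) and assumes that appended POST data is marked in the search key with `__wb_method=`. Both sample lines are hypothetical and shortened, not copied from the real indexes.

```python
import json


def parse_cdxj_line(line: str) -> tuple[str, str, dict]:
    """Split a CDXJ line into (search key, timestamp, JSON fields)."""
    key, timestamp, raw_json = line.strip().split(" ", 2)
    return key, timestamp, json.loads(raw_json)


def key_has_post_data(key: str) -> bool:
    """True if the search key carries appended POST data.

    Assumption: such keys contain the marker "__wb_method=".
    """
    return "__wb_method=" in key.lower()


# Hypothetical, shortened index lines for the player resource.
with_post = (
    "com,youtube)/youtubei/v1/player?__wb_method=post&prettyprint=false"
    "&videoid=n7ky-nuw-us 20250312101726 "
    '{"url": "https://www.youtube.com/youtubei/v1/player?prettyPrint=false", '
    '"status": "200", "method": "POST"}'
)
without_post = (
    "com,youtube)/youtubei/v1/player?prettyprint=false 20250312101726 "
    '{"url": "https://www.youtube.com/youtubei/v1/player?prettyPrint=false", '
    '"status": "200"}'
)

for line in (with_post, without_post):
    key, timestamp, fields = parse_cdxj_line(line)
    print(timestamp, fields["status"], key_has_post_data(key))
```

Running this over the lines of both indexes that contain `youtubei/v1/player` makes the difference between the two captures visible at a glance.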

…

If I can provide any more details, or if I should check something else, please let me know.
I am also not 100% sure if this is the right track.
Thank you very much in advance!

edsu · April 17, 2025, 10:21am

I wonder if client side replay in pywb will fix this?

github.com/webrecorder/pywb

Add optional client-side playback to pywb

webrecorder:main ← webrecorder:issue-924-client-side-playback

Opened 06:25PM - 12 Mar 25 UTC by tw4l (+169 -4)

## Description

This PR adds optional client-side replay in pywb's framed replay mode, using wabac.js. This is implemented using wabac.js's live proxy mode, similar to the implementation by Alex Osborne's [proof of concept](https://github.com/ato/pywb-wabac) and enabled via the `config.yaml` file. Documentation has also been added.

The wabac.js static worker is included in the pywb static directory and a new route added to serve it. The wabac.js version can be updated using the included `build-wabac.sh` script, which fetches the service worker from the npm CDN and copies it into the static directory with the correct filename (changed in pywb from `sw.js` to `wabacWorker.js`, as we have several service workers).

In addition, I've made a few small housekeeping changes:

- The Python version in the pywb Dockerfile is updated to 3.11 to avoid using an unsupported version of Python
- Similarly, CI now runs on Python 3.9-3.11 to drop older versions that are no longer supported in GH Actions runners

Note that there are currently some unrelated failing tests which will be addressed in separate PRs.

## Motivation and Context

Fixes #924

## To Do Before Merging

- [x] Bump wabac.js to 2.21.4 when it's released to fix issue noticed in testing with redirects
- [ ] Test with a wider range of sites and pywb deployment types

## Types of changes

- [ ] Replay fix (fixes a replay specific issue)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Checklist:

- [X] My change requires a change to the documentation.
- [X] I have updated the documentation accordingly.
- [ ] I have added or updated tests to cover my changes.
- [ ] All new and existing tests passed.

mona · May 12, 2025, 8:50am

Hello, thank you very much for your fast response.

I’ve checked the client-side replay: I installed pywb from source and set `client_side_replay: true` in config.yaml. But the video is still not playable.

I found the issue.
It is the index entry of the player resource. When the POST request body is added to the URL search key (the first part of the index entry), the resource can be found and the video is playable.
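To illustrate the idea of the fix, here is a simplified sketch that appends the request method and body parameters to the search key of an index line. This is not pywb's actual canonicalization: the `__wb_method` parameter name is an assumption, the real request body is JSON rather than a flat set of parameters, and the helper names `append_post_data` and `fix_cdxj_line` as well as the sample line are hypothetical.

```python
from urllib.parse import urlencode


def append_post_data(key: str, method: str, body_params: dict[str, str]) -> str:
    """Return key with the request method and body parameters appended to its query."""
    extra = urlencode({"__wb_method": method, **body_params})
    separator = "&" if "?" in key else "?"
    return f"{key}{separator}{extra}"


def fix_cdxj_line(line: str, method: str, body_params: dict[str, str]) -> str:
    """Rewrite only the search key (first field) of a CDXJ line."""
    key, rest = line.split(" ", 1)
    return f"{append_post_data(key, method, body_params)} {rest}"


# Hypothetical, shortened index line without POST data in the key.
broken_line = (
    "com,youtube)/youtubei/v1/player?prettyprint=false "
    '20250312101726 {"status": "200"}'
)

fixed_line = fix_cdxj_line(broken_line, "POST", {"videoId": "n7ky-nuw-us"})
print(fixed_line)
```

The point is only that a lookup for a POST request includes the body in the key, so an index entry whose key lacks the body cannot match and the replay returns 404.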

player resource in index and warc

- AWP Capture
  - Index Entry: awp_index_entry_player_resource.txt
  - WARC Record: awp_warc_block_player_resource.txt
- Browsertrix Capture Pywb Collection
  - Index Entry: btc_index_entry_player_resource.txt
  - WARC Record: btc_warc_block_player_resource.txt

fixed index entry: btc_index_entry_player_resource_fixed.txt

GitHub Issue: Post Request Body missing in index entry · Issue #941 · webrecorder/pywb · GitHub