Extract page source, not just text, into WARC record

Browsertrix Crawler offers a nice --text final-to-warc option to record the final rendered text of a page as a WARC record, but what I really want is the final HTML source of the page. The HTML carries much richer information that plain text output loses: for example, you might want to run it through Readability to get just the body text, or do something with a JS-rendered table. Mainly I see this as a shortcut for extracting information from dynamically rendered pages that look empty if you only have the HTML that was delivered over the wire (i.e. what’s in the response record in the WARC).

Ideally, this would be a built-in feature similar to text extraction, but I’m wondering if there’s a way to accomplish it in the short term with a behavior. Looking at the docs, it seems like yield { msg: "some text", other: "stuff" } gets logged, so that might work, but is there a way for a behavior to create WARC records or surface data to be stuffed into pages.json?
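To make the idea concrete, here is a rough sketch of the kind of behavior I have in mind. It is untested against the crawler itself. The class shape (static id, isMatch, init, and an async generator run) is my reading of the custom behavior docs and may not match exactly, and the PageSourceBehavior name, the serializeDom helper, and the extra doc parameter are all made up for illustration. It only yields the serialized DOM; whether the crawler would do anything with that beyond logging it is exactly the open question.

```typescript
// Minimal structural type so the sketch also runs outside a browser.
interface DocLike {
  documentElement: { outerHTML: string };
  doctype?: { name: string } | null;
}

// Serialize the DOM as it exists now (after scripts have run),
// rather than the HTML that was delivered over the wire.
function serializeDom(doc: DocLike): string {
  const doctype = doc.doctype ? `<!DOCTYPE ${doc.doctype.name}>\n` : "";
  return doctype + doc.documentElement.outerHTML;
}

// Hypothetical behavior: the class layout is an assumption based on
// the custom behavior docs, not a confirmed API.
class PageSourceBehavior {
  static id = "PageSource";

  // Run on every page.
  static isMatch(): boolean {
    return true;
  }

  static init() {
    return {};
  }

  // The doc parameter is only here so the sketch can be exercised
  // without a browser; in a page it falls back to the global document.
  async *run(_ctx: unknown, doc: DocLike = (globalThis as any).document) {
    yield { msg: "page-source", html: serializeDom(doc) };
  }
}
```

The obvious downside is that a full page's HTML ends up in a log message, which is why I would much prefer a real WARC record.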


Oh, I guess this is kind of similar to the “Capture page modified by userscript” thread (sorry for starting a new one), although I’m hoping for a somewhat cleaner solution than the one they landed on there.

Whoops, I guess I did not search GitHub well enough. It looks like there is/was already some work in this direction that I missed: Pull Request #730 in webrecorder/browsertrix-crawler, “Add WARC resource containing DOM tree after load” by magbb.