Browsertrix Crawler offers a nice --text final-to-warc option that records the final rendered text of a page as a WARC record, but what I really want is the final rendered HTML of the page. The HTML carries much richer information than plain text: you might want to run it through Readability to get just the body text, or do something with a JS-rendered table, and that structure is lost in the text output. Mainly I see this as a shortcut for extracting information from dynamically rendered pages that look empty if you only have the HTML that was delivered over the wire (i.e. what’s in the response record in the WARC).
Ideally this would be a built-in feature alongside text extraction, but I’m wondering if there’s a way to accomplish it in the short term with a behavior. Looking at the docs, it seems that anything a behavior yields, e.g. yield { msg: "some text", other: "stuff" }, gets logged, so that might work. Is there also a way for a behavior to create WARC records, or to surface data that ends up in pages.jsonl?
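To make the question concrete, here is a rough sketch of the kind of behavior I have in mind. The class shape (static id, isMatch, init, and an async generator run) is my reading of the custom behavior docs and may not be exactly right. The names FinalHtmlBehavior and buildFinalHtmlMessage and the msg/html fields are made up for illustration. I also don’t know whether the log is a sensible place for a payload as large as a full page.

```javascript
// Pure helper so the message shape can be checked outside a browser.
// `doc` is anything shaped like a DOM Document.
function buildFinalHtmlMessage(doc) {
  return {
    msg: "final-html", // hypothetical label, not a name the crawler knows about
    html: doc.documentElement.outerHTML, // serialized DOM; does not include the doctype
  };
}

// Hypothetical custom behavior; the class shape is assumed from the docs.
class FinalHtmlBehavior {
  static id = "FinalHtml";

  // Run on every page.
  static isMatch() {
    return true;
  }

  static init() {
    return {};
  }

  async* run(ctx) {
    // Assumes the behavior runs after the page has finished rendering.
    yield buildFinalHtmlMessage(document);
  }
}
```

If the yielded object really does land in the log, the HTML could be recovered from there after the crawl, but that feels like a workaround rather than a proper WARC record.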