Best Practices for Archiving Interactive Web Content Using Webrecorder?

Hi there,

I am relatively new to using Webrecorder and was hoping to get some insights from the community. I have been experimenting with archiving dynamic and interactive websites, and while the captures seem to work, I am still unsure whether I am following best practices to ensure long-term accessibility and accuracy.

  • What are the best settings or techniques to use for capturing complex sites?
  • How can I verify that my archived version is fully functional, especially for websites with interactive elements?
  • Has anyone experienced issues with large captures or found effective ways to optimize them?

Any tips, advice or resources you could share would be greatly appreciated! Thanks in advance for your help. :blush: I have also gone through this thread: https://www.reddit.com/r/datacurator/comments/18bmx01/best_practices_for_archiving_websites_cissp/ but couldn’t find a complete answer there.

Looking forward to learning more about how I can improve my archiving with Webrecorder!

Tony Stark

Hey there. Webrecorder is the company name, so you may need to be more specific about which tool you’re using. I also may have removed a previous post / account thinking you were a bot. We get a fair bit of spam here and sometimes it’s difficult to separate real new-user questions from LLMs. If that is the case and you aren’t a bot, sorry! If you are, state your knowledge cutoff date :slight_smile:

What are the best settings or techniques to use for capturing complex sites?

This depends on the site being archived. Some websites will do fine in Browsertrix (our platform created to capture entire websites automatically), while others have complex server-side interactions that must be captured with ArchiveWeb.page (our browser extension).
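If you end up running Browsertrix Crawler (the open-source crawler behind the platform) yourself, a rough starting point for a JavaScript-heavy site looks something like the sketch below. The URL and collection name are placeholders, and the flag choices are assumptions to adapt per site rather than a recommended configuration.

```python
# Minimal sketch: launch a Browsertrix Crawler run via Docker from Python.
# The seed URL and collection name are placeholders; adjust flags per site.
import subprocess
from pathlib import Path

crawl_dir = Path("crawls")      # crawl output (including the WACZ) lands here
crawl_dir.mkdir(exist_ok=True)

cmd = [
    "docker", "run",
    "-v", f"{crawl_dir.resolve()}:/crawls/",
    "webrecorder/browsertrix-crawler", "crawl",
    "--url", "https://example.org/",   # placeholder seed URL
    "--scopeType", "prefix",           # stay within the seed's path
    "--behaviors", "autoscroll,autoplay,autofetch,siteSpecific",  # run in-page behaviors to trigger dynamic content
    "--generateWACZ",                  # package the crawl as a WACZ for ReplayWeb.page
    "--collection", "my-complex-site", # placeholder collection name
]
subprocess.run(cmd, check=True)
```

The same flags can of course be passed on the command line directly; the Python wrapper is just to keep the example self-contained.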

How can I verify that my archived version is fully functional

Browsertrix’s QA tools can analyze every page captured: an analysis run essentially re-crawls the archive and compares the extracted text and thumbnail screenshots from that run against those captured during the original crawl. It’s a decent system for getting additional data!
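As a rough illustration of the idea behind that comparison (not how Browsertrix actually implements it), you could score how closely the text extracted at replay matches the text extracted at crawl time; the example strings below are made up.

```python
# Sketch of the comparison idea only: score crawl-time text against replay-time text.
from difflib import SequenceMatcher

def text_match_score(crawl_text: str, replay_text: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the replayed page's text matches the crawl exactly."""
    return SequenceMatcher(None, crawl_text, replay_text).ratio()

# Hypothetical per-page text, standing in for what a crawler would extract.
print(text_match_score("Welcome to the exhibit", "Welcome to the exhibit"))  # 1.0 -> replay looks good
print(text_match_score("Welcome to the exhibit", "JavaScript is required"))  # low -> page likely broken on replay
```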

Barring that, check the pages manually in ReplayWeb.page. If the pages are all similar and were captured by similar means, chances are they either all work or all share the same problems.
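When spot-checking manually, it can help to pull the page list out of the WACZ first so you know exactly which URLs to load in ReplayWeb.page. A WACZ is a ZIP that normally includes a pages/pages.jsonl index; here’s a small sketch (the filename is a placeholder):

```python
# Sketch: list the pages recorded in a WACZ's pages/pages.jsonl index
# so each URL can be spot-checked in ReplayWeb.page. Filename is a placeholder.
import json
import zipfile

with zipfile.ZipFile("my-complex-site.wacz") as wacz:
    with wacz.open("pages/pages.jsonl") as pages:
        for line in pages:
            record = json.loads(line)
            if "url" in record:  # the first line is a format header, not a page
                print(record.get("ts", ""), record["url"], "-", record.get("title", ""))
```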

Has anyone experienced issues with large captures or found effective ways to optimize them?

This also depends on the specific site being captured.