I’m a graduate student using ArchiveWeb for my research, and I’m trying to ensure that my data management practices are in line with the university’s ethics requirements for my project.
One of the reasons I’m using ArchiveWeb rather than something like ArchiveBox is that I don’t have a background in computer science or tech, and ArchiveWeb was really intuitive to download and learn how to use. However, I’m aware that it stores data in the browser, and I get the impression that this is not generally considered a secure place to store data (even temporarily; I know I can download files and then delete them from the browser).
My question is this: if I use the ArchiveWeb desktop app to archive and view my data, does this get around this problem? Or are the archived files still stored somewhere that is not my own computer drive before I download them? If ArchiveWeb isn’t likely to be compliant with data security policies, do people have any recommendations for other software I could use?
Happy to hear it was easy to get started archiving things!
ArchiveWeb.page stores everything locally on your machine. No data is uploaded anywhere unless you specifically sign into your Browsertrix account (which it doesn’t sound like you have) and upload your archives to the server.
ArchiveWeb.page stores its data in the browser’s IndexedDB database. The desktop application does the same thing, but instead of using your web browser, the standalone ArchiveWeb.page app uses an application framework called Electron, which is basically Chrome packaged up as a desktop app with its own application menu bars, icons, and all that jazz. From an information security standpoint, the extension and the standalone app are functionally equivalent: both are only as secure as your local machine.
Information in IndexedDB can only be accessed by the website or extension that created it. As MDN explains in the above link:
IndexedDB uses the same-origin principle, which means that it ties the store to the origin of the site that creates it (typically, this is the site domain or subdomain), so it cannot be accessed by any other origin.
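To make the same-origin idea a little more concrete, here is a minimal Python sketch of how an origin comparison works in principle. It is only an illustration, not how a browser actually implements the check; the function names and the example.org URLs are made up for the example.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Default ports, so that https://example.org and https://example.org:443 match.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Tuple[str, Optional[str], Optional[int]]:
    """Return the (scheme, host, port) triple that identifies an origin."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(url_a: str, url_b: str) -> bool:
    """Two URLs share an origin only if scheme, host, and port all match."""
    return origin(url_a) == origin(url_b)


# Same scheme, host, and (default) port: same origin, even with different paths.
print(same_origin("https://example.org/page-one", "https://example.org:443/page-two"))
# A subdomain is a different host, so it counts as a different origin.
print(same_origin("https://example.org/", "https://archive.example.org/"))
```

The path plays no part in the comparison, which is why every page on one site shares a store while a different site (or even a subdomain) cannot read it.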
What IndexedDB isn’t great at is giving you the same organizational ability as your local filesystem! To this end, I would recommend exporting anything you make with ArchiveWeb.page as a WACZ file, a self-contained archive file containing the contents of your archiving session. For particularly sensitive data, you may wish to delete it from ArchiveWeb.page once exported and store the WACZ on an encrypted, air-gapped drive (not connected to a computer or the internet). IndexedDB is also not great at restoring your data if something goes wrong, such as your browser being completely reset or you moving to another machine. For that case, see my guide on restoring this database from a backup for a deeper dive on exactly where your data is stored.
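If you want some reassurance that an exported WACZ is intact before deleting the original from ArchiveWeb.page, something like the following Python sketch may help. It relies on a WACZ being a ZIP file with a datapackage.json at its root and the captured data under archive/. The function name and the file name in the usage note are hypothetical, and this is a basic sanity check rather than a full validation of the archive.

```python
import zipfile


def summarize_wacz(path: str) -> dict:
    """Run a basic integrity check on a WACZ file and list its archive data."""
    with zipfile.ZipFile(path) as zf:
        # testzip() returns the name of the first corrupt member, or None.
        first_bad_member = zf.testzip()
        names = zf.namelist()
    return {
        "intact": first_bad_member is None,
        "has_datapackage": "datapackage.json" in names,
        "archive_files": [
            n for n in names if n.startswith("archive/") and not n.endswith("/")
        ],
    }
```

For example, summarize_wacz("my-export.wacz") should report "intact" and "has_datapackage" as True and list at least one file under archive/ before you remove the copy held in the browser.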
As an additional note, anything you view with https://replayweb.page (our standalone archive viewer website) or the ReplayWeb.page desktop app is also completely private and runs entirely within your browser. No data from your archive gets sent to us and we don’t have analytics on that website.
Anything crawled using Browsertrix — our hosted cloud service — can be viewed by Webrecorder employees but can otherwise be considered private. We will not share our customer data with any outside parties except where absolutely required by law. If you self-host Browsertrix or use the command line Browsertrix Crawler application, we cannot view your archives.
Thank you so much Hank, this is exactly what I wanted to know! I’m definitely going to be exporting WACZ files and deleting them from ArchiveWeb when I’m not using it to archive or view the pages.