Yeah, that seems like a better idea to me. This is the first time I'm using the text extraction, so maybe others have better experience with it. I have other crawls running now and will test these too.
Otherwise (though this would involve a change in the UI), pages.jsonl could be parsed to extract only the .url field, and an additional "Text Search" tab could be added.
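Roughly what I have in mind for the parsing side (just a sketch, assuming each line of pages.jsonl is a JSON object with a "url" field and that the first line is the usual header record without one):

```python
import json

def extract_urls(pages_jsonl_path):
    """Collect only the page URLs from a pages.jsonl file."""
    urls = []
    with open(pages_jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # the header record has no "url", so it gets skipped here
            if "url" in record:
                urls.append(record["url"])
    return urls
```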
Thanks for bringing this up! Yes, this is an issue that we should resolve, as page lists are inevitably going to be big.
In the latest 0.5.0 beta, which uses the latest py-wacz, the approach I'm testing is to store only the seed pages in pages.jsonl, while storing all other pages in extraPages.jsonl. The pages.jsonl is loaded on initial load of the WACZ, while extraPages.jsonl is only loaded for search. I'm not sure if the WACZ file you have uses this setup or not; if you share it, I can take a look, and if not, I can try repackaging it.
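To illustrate the split (this is only a sketch of the idea, not the actual py-wacz code; the "seed" flag on each page record is an assumption here, as are the file paths):

```python
import json

def split_pages(all_pages,
                pages_path="pages/pages.jsonl",
                extra_path="pages/extraPages.jsonl"):
    """Write seed pages to pages.jsonl and everything else to extraPages.jsonl."""
    with open(pages_path, "w", encoding="utf-8") as seeds, \
         open(extra_path, "w", encoding="utf-8") as extras:
        # each file starts with its own header record
        seeds.write(json.dumps(
            {"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}) + "\n")
        extras.write(json.dumps(
            {"format": "json-pages-1.0", "id": "extra-pages", "title": "Extra Pages"}) + "\n")
        for page in all_pages:
            out = seeds if page.get("seed") else extras
            out.write(json.dumps(page) + "\n")
```

The point is that the replayer only has to parse the (small) seed list up front, and the potentially huge extra-pages list is deferred until someone actually searches.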
This probably needs a bit more thought, though, and it's something we should address in the next iteration of the spec.