ReplayWeb.page with a medium-sized WACZ and text extraction

I captured a site with browsertrix-crawler using --generateWACZ --text.
pages.jsonl is 283 MB; the full WACZ is 963 MB.

ReplayWeb.page is stuck at "Initializing Search...", although I can manually enter a URL and the replay works correctly.

The devtools console is not showing anything relevant.
Is there a best practice for using text extraction with WACZ?

Hmm, maybe this suggests that it would be better to store the text separately from pages.jsonl, which is needed for navigation?

Yeah, that seems like a better idea to me. This is the first time I’m using text extraction; maybe others have more experience with it. I have other crawls running now and will test those too.

Otherwise, though this would involve a change in the UI, pages.jsonl could be parsed to extract only .url, with an additional tab for “Text Search”.
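
A rough sketch of that idea in Python, assuming each line of pages.jsonl is a JSON object and that the extracted text lives under a "text" key (I have not checked the exact field names against the spec, and strip_text is a hypothetical helper, not part of any Webrecorder tool):

```python
import json
from typing import Iterable, Iterator, Tuple


def strip_text(lines: Iterable[str], drop: Tuple[str, ...] = ("text",)) -> Iterator[str]:
    """Yield each JSONL record re-serialized without the bulky fields.

    Assumption: the extracted text is stored under the "text" key.
    Records without that key (e.g. a header line) pass through unchanged.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key in drop:
            record.pop(key, None)  # remove the field if present
        yield json.dumps(record)
```

Because it works line by line, this never holds the whole file in memory, which matters for a page list of a few hundred MB.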

Thanks for bringing this up! Yes, this is an issue that we should resolve, as page lists are inevitably going to be big.
In the latest 0.5.0 beta, which uses the latest py-wacz, the approach I’m testing is to store only the seed list pages in pages.jsonl and all other pages in extraPages.jsonl. pages.jsonl is loaded on initial load of the WACZ, while extraPages.jsonl is loaded only for search. I’m not sure whether the WACZ file you have has this setup; if you share it, I can take a look, and if it doesn’t, I can try repackaging it.
This probably needs a bit more thought, though, and is something we should address in the next iteration of the spec.
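
To illustrate the split, here is a minimal Python sketch. It is not the py-wacz implementation: split_pages and the seed_urls argument are hypothetical, and it assumes that any record without a "url" key is a header line, which it simply copies to both outputs.

```python
import json
from typing import Iterable, List, Set, Tuple


def split_pages(lines: Iterable[str], seed_urls: Set[str]) -> Tuple[List[str], List[str]]:
    """Split page-list lines into (seed pages, extra pages).

    Assumption: records without a "url" key are header lines and are
    copied to both lists so each output is usable on its own.
    """
    seeds: List[str] = []
    extras: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if "url" not in record:
            seeds.append(line)
            extras.append(line)
        elif record["url"] in seed_urls:
            seeds.append(line)  # small list, loaded up front
        else:
            extras.append(line)  # large list, loaded only for search
    return seeds, extras
```

The first list would be written to pages.jsonl and the second to extraPages.jsonl.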

Well, yes, I notice now that the WACZ has extraPages.jsonl:

~ docker run -it webrecorder/browsertrix-crawler crawl --version
0.5.0-beta.2
unzip -v bf000.wacz
Archive:  bf000.wacz
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
 2867715  Stored  2867715   0% 2022-02-08 23:31 669257de  indexes/index.cdx.gz
    7009  Defl:N     2491  65% 2022-02-08 23:31 1e0a857c  indexes/index.idx
49968447  Stored 49968447   0% 2022-02-08 23:31 61df1c89  archive/rec-20220208153251742081-a9a63b12a11f.warc.gz
339264847  Stored 339264847   0% 2022-02-08 23:30 23d88ca5  archive/rec-20220208153252357269-a9a63b12a11f.warc.gz
162393538  Stored 162393538   0% 2022-02-08 23:31 86e24bf9  archive/rec-20220208153252468916-a9a63b12a11f.warc.gz
164268254  Stored 164268254   0% 2022-02-08 23:30 9946100f  archive/rec-20220208153252628783-a9a63b12a11f.warc.gz
11129964  Stored 11129964   0% 2022-02-08 23:30 20948f87  archive/rec-20220208153252988261-a9a63b12a11f.warc.gz
50662995  Stored 50662995   0% 2022-02-08 23:31 8c49dc9c  archive/rec-20220208153253016236-a9a63b12a11f.warc.gz
48393437  Stored 48393437   0% 2022-02-08 23:30 1d354b55  archive/rec-20220208153253122275-a9a63b12a11f.warc.gz
53706363  Stored 53706363   0% 2022-02-08 23:31 c2d7615e  archive/rec-20220208153259117946-a9a63b12a11f.warc.gz
   43651  Defl:N    16491  62% 2022-02-08 23:31 3665f80c  pages/pages.jsonl
305686993  Defl:N 80745684  74% 2022-02-08 23:31 3b3c5211  pages/extraPages.jsonl
    2995  Defl:N      913  70% 2022-02-08 23:31 69f049e0  datapackage.json
     117  Defl:N      102  13% 2022-02-08 23:31 3e27fdaf  datapackage-digest.json
--------          -------  ---                            -------
1188396325         963421241  19%                            14 files
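
For anyone who wants to run the same check without unzip, a small Python sketch using the standard zipfile module. page_list_sizes is a hypothetical helper name, and the pages/ prefix is taken from the listing above:

```python
import zipfile
from typing import Dict, Tuple


def page_list_sizes(wacz_path: str) -> Dict[str, Tuple[int, int]]:
    """Return {name: (uncompressed bytes, compressed bytes)} for page lists.

    Only entries under pages/ ending in .jsonl are reported.
    """
    sizes: Dict[str, Tuple[int, int]] = {}
    with zipfile.ZipFile(wacz_path) as wacz:
        for info in wacz.infolist():
            if info.filename.startswith("pages/") and info.filename.endswith(".jsonl"):
                sizes[info.filename] = (info.file_size, info.compress_size)
    return sizes
```

Calling page_list_sizes("bf000.wacz") should report the two page list files with the same lengths that unzip -v shows.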