Sorry, this URL could not be loaded because the size of the file is not accessible

Hi,

I am trying to load this WACZ file via a ReplayWeb.page viewer (Embedding ReplayWeb.page - ReplayWeb.page Docs). The file is publicly accessible (e.g., https://collections.digital.utsc.utoronto.ca/system/files/2024-04/lives-and-legacies_0.wacz). However, I am getting the following message:

“Sorry, this URL could not be loaded because the size of the file is not accessible. Make sure this is a valid URL and you have access to this file.”

Any insight into what may be going on here? Thank you.

I’m not sure that page is publicly accessible. When I go there, I get an Access Denied message.

Sorry, I provided the wrong example. Here is one that is publicly available: https://collections.digital.utsc.utoronto.ca/system/files/2023-10/serai.wacz. I am getting the same message. Thank you.

Great, that one works!

The next thing to check would be CORS errors. Are you trying to embed this on a utoronto.ca domain, or on another site?

I know UTSC has embedded ReplayWeb.page on their own sites in the past, so if this is within the library, maybe talk to a staff member there who was involved in setting that up? If it’s a personal site, you will probably need to host the file yourself or see if UofT provides a CORS-friendly server for things like this. For my personal archive projects, I like using Backblaze’s B2 storage. We don’t have a section in the docs on configuring their CORS settings (it has to be done via the command line), but I could post my config there if it would help!

EDIT: Tested this, looks like CORS is indeed disabled for the server. If you’re creating an embed on that site it should work fine but if you want to embed a WACZ hosted there on a different site it will not.

I am a developer with the UTSC Library.

We are trying to load it from the same server and domain that the file is served from: Serai: Early Modern Encounters | U of T Scarborough Library Digital Collections. Thus, CORS should not be an issue!

Another example here:
https://d10test.digital.utsc.utoronto.ca/system/files/2024-04/digital_history_0.wacz
test | Default (as field formatter for Drupal media file)
https://d10test.digital.utsc.utoronto.ca/test.html (simple html embed of the above file)
https://d10test.digital.utsc.utoronto.ca/test2.html (from the documentation example from Embedding ReplayWeb.page - ReplayWeb.page Docs)

The one difference I noticed in a curl -I request for this file, compared to fully publicly accessible files, is that the response does not include Content-Length.

The viewer was working as expected earlier with similar settings, and we are trying to narrow down what may have changed to cause this issue.

I am a developer with the UTSC Library.

But doctor! I am Pagliacci! :upside_down_face:

Sounds like you know more about your infra than I do! In retrospect, I think a CORS failure would produce its own specific error message, so you’re likely on the right track with the file not returning the content length. When I try to download https://collections.digital.utsc.utoronto.ca/system/files/2023-10/serai.wacz, the download starts but the total file size is unknown. I’m pretty sure this is a server config issue of some sort, though I have no specific insight into why https://memory.digital.utsc.utoronto.ca/sites/default/files/2023-11/utsc_pulse.wacz does work but your new one doesn’t (other than that the subdomains are different). Perhaps something to compare against?

Whatever the issue turns out to be, please let us know if it should be added to the docs!

Yes, good eye! The Content-Length header in the HTTP Response is required by the replay mechanism in order to determine how to perform Range Requests to fetch parts of the WACZ on demand, rather than downloading the entire WACZ file from the server.

The WACZ is a ZIP file, and the ZIP “directory” (a manifest of the contained files and their locations) is located at the end of the file. To read specific files from within the ZIP, ReplayWebPage first needs to read the directory. It does this with a Range Request that reads backwards from the end of the file, which is why it needs to know the Content-Length. Sorry if that’s TMI :slight_smile:
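
To make the "read backwards from the end" step concrete, here is a minimal Python sketch of how a client could turn a known Content-Length into a Range header for the tail of the file and then locate the ZIP directory from those bytes. This is an illustration of the ZIP layout only, not ReplayWeb.page's actual code. The names `tail_range_header` and `parse_eocd` are made up for this example, and ZIP64 archives (used for very large files) are not handled.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"  # end-of-central-directory record marker
EOCD_MIN_SIZE = 22              # fixed part of the record, without comment
MAX_COMMENT = 65535             # the optional trailing comment is at most this long


def tail_range_header(content_length: int) -> str:
    """Build a Range header covering the last bytes of the file.

    Without content_length there is no way to compute the start offset,
    which is why a missing Content-Length header is a problem.
    """
    tail = min(content_length, EOCD_MIN_SIZE + MAX_COMMENT)
    start = content_length - tail
    return f"bytes={start}-{content_length - 1}"


def parse_eocd(tail: bytes) -> dict:
    """Find the end-of-central-directory record in the tail bytes."""
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    fields = struct.unpack("<4sHHHHIIH", tail[pos:pos + EOCD_MIN_SIZE])
    _sig, _disk, _cd_disk, _disk_entries, entries, cd_size, cd_offset, _clen = fields
    return {"entries": entries, "cd_size": cd_size, "cd_offset": cd_offset}


if __name__ == "__main__":
    # Build a tiny ZIP in memory to stand in for a WACZ on a server.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.txt", "hello")
        zf.writestr("b.txt", "world")
    data = buf.getvalue()

    print(tail_range_header(len(data)))
    print(parse_eocd(data))
```

With `cd_offset` and `cd_size` in hand, a second Range Request can fetch just the directory, and further requests can fetch individual entries, so the whole archive never has to be downloaded.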

ReplayWebPage does a HEAD request to the WACZ URL:

$ curl --head https://collections.digital.utsc.utoronto.ca/system/files/2023-10/serai.wacz
HTTP/2 200
cache-control: private
date: Thu, 18 Apr 2024 12:36:27 GMT
content-language: en
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
expires: Sun, 19 Nov 1978 05:00:00 GMT
x-generator: Drupal 10 (https://www.drupal.org)
accept-ranges: bytes
content-security-policy: upgrade-insecure-requests;
access-control-allow-headers: x-requested-with, Content-Type, origin, authorization, accept, client-security-token, X-ISLANDORA-TOKEN, X-Forwarded-For
strict-transport-security: max-age=63072000
last-modified: Mon, 30 Oct 2023 17:49:49 GMT
vary: User-Agent,Origin
content-security-policy: frame-ancestors 'self';
content-type: application/gzip
server: Apache

Sure enough Content-Length is not there. But it does seem to be there for the older server?

$ curl --head  https://memory.digital.utsc.utoronto.ca/sites/default/files/2023-11/utsc_pulse.wacz
HTTP/2 200
content-security-policy: upgrade-insecure-requests;
access-control-allow-headers: x-requested-with, Content-Type, origin, authorization, accept, client-security-token, X-ISLANDORA-TOKEN, X-Forwarded-For
strict-transport-security: max-age=63072000
x-content-type-options: nosniff
last-modified: Thu, 30 Nov 2023 15:50:04 GMT
etag: "4ba323e9-60b609aefb4ff"
accept-ranges: bytes
content-length: 1268982761
cache-control: max-age=31536000
expires: Fri, 18 Apr 2025 13:18:38 GMT
vary: User-Agent,Origin
content-security-policy: frame-ancestors 'self';
content-type: application/x-zip
date: Thu, 18 Apr 2024 13:18:38 GMT
server: Apache
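
The comparison between the two responses above can also be scripted. The following Python sketch parses pasted `curl --head` output and lists anything that would get in the way of range-based loading. The function names and the two checks are assumptions for illustration, based on what this thread describes (a Content-Length header plus byte-range support), not on ReplayWeb.page's own validation logic.

```python
def parse_head_output(raw: str) -> dict:
    """Turn pasted `curl --head` output into a dict of lowercase header names.

    Lines without a colon (such as the "HTTP/2 200" status line) are skipped.
    If a header repeats, the first value is kept.
    """
    headers = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers.setdefault(name.strip().lower(), value.strip())
    return headers


def range_loading_problems(headers: dict) -> list:
    """Return a list of reasons why range-based loading might fail."""
    problems = []
    if "content-length" not in headers:
        problems.append("missing Content-Length")
    if headers.get("accept-ranges", "").lower() != "bytes":
        problems.append("Accept-Ranges is not 'bytes'")
    return problems


if __name__ == "__main__":
    # Abbreviated copies of the two responses shown in this thread.
    new_server = """HTTP/2 200
cache-control: private
accept-ranges: bytes
content-type: application/gzip
server: Apache"""
    old_server = """HTTP/2 200
accept-ranges: bytes
content-length: 1268982761
content-type: application/x-zip
server: Apache"""

    print("new:", range_loading_problems(parse_head_output(new_server)))
    print("old:", range_loading_problems(parse_head_output(old_server)))
```

Run against the two responses, only the newer server gets flagged, which matches the behaviour seen in the viewer.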

I wonder if your new web server (Apache or Nginx?) is configured to try to gzip compress the ZIP file, and Drupal is deciding it cannot determine the Content-Length?

https://www.drupal.org/project/drupal/issues/3396559

Thank you for the input. Will report back.
