Warcio and streams

edsu · December 8, 2020, 10:55pm

Is it possible to use warcio to read WARC data from an HTTP stream? I was hoping this would work:

    url = 'https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-45/segments/1603107863364.0/wat/CC-MAIN-20201019145901-20201019175901-00000.warc.wat.gz'

    resp = requests.get(url, stream=True)

    for record in warcio.ArchiveIterator(resp):
        print(record)

But I get an error:

    AttributeError: 'Response' object has no attribute 'read'

Is there a way to stream data over HTTP to warcio?

anj · December 9, 2020, 9:28pm

Hi Ed,

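You need to give it the raw stream. `ArchiveIterator` wants a file-like object with a `read()` method, which `resp.raw` provides and the `Response` object itself does not: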

    import requests
    import warcio
    
    url = 'https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-45/segments/1603107863364.0/wat/CC-MAIN-20201019145901-20201019175901-00000.warc.wat.gz'
    
    resp = requests.get(url, stream=True)
    
    for record in warcio.ArchiveIterator(resp.raw):
        print(record.rec_headers)

See, for example, this Colab notebook: https://colab.research.google.com/drive/1YPbGCL6TwauFc432Gq3c-aIq6YtwJaiq?usp=sharing

edsu · December 9, 2020, 11:49pm

Aha, thanks @anj, I will give that a try!