Is it possible to use warcio to read WARC data from an HTTP stream? I was hoping the following would work:
url = 'https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-45/segments/1603107863364.0/wat/CC-MAIN-20201019145901-20201019175901-00000.warc.wat.gz'
resp = requests.get(url, stream=True)
for record in warcio.ArchiveIterator(resp):
    print(record)
But I get an error:
AttributeError: 'Response' object has no attribute 'read'
Is there a way to stream data over HTTP to warcio?