Is it possible to use warcio to read WARC data from an HTTP stream? I was hoping the following would work:
import requests
import warcio

url = 'https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-45/segments/1603107863364.0/wat/CC-MAIN-20201019145901-20201019175901-00000.warc.wat.gz'
resp = requests.get(url, stream=True)
for record in warcio.ArchiveIterator(resp):
    print(record)
But I get an error:
AttributeError: 'Response' object has no attribute 'read'
Is there a way to stream data over HTTP to warcio?