warc: Python library to work with ARC and WARC files
ARC is a file format for storing web crawls as sequences of content blocks. It was developed in 1996 by the Internet Archive.
WARC (Web ARChive) is an extension of the ARC file format that attaches more metadata to each record and allows named headers.
This Python library works with files stored in both ARC and WARC formats.
This documentation belongs to the fork of the original warc repository owned by the Knowledge Technology Research Group. Its content is mostly inherited from older forks and the original repository.
Installation
Installing warc is simple with pip (from PyPI):
$ pip install warc-knot
Or you can get the sources by cloning the public git repository:
$ git clone git://github.com/KNOT-FIT-BUT/warc3.git
… and importing the library from the local directory.
Reading a WARC File
Reading a WARC file is much like reading an ordinary file, except that iterating over it yields WARC records instead of lines.
import warc
f = warc.open("test.warc.gz")
for record in f:
    print(record['WARC-Target-URI'], record['Content-Length'])
The open function is shorthand for warc.WARCFile:
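# Open a file by name in binary read mode
f = warc.WARCFile("test.warc.gz", "rb")
# Or wrap an existing file-like object (StringIO must be imported; text holds the WARC data)
f = warc.WARCFile(fileobj=StringIO(text))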
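To make clearer what each yielded record contains, here is a minimal sketch that parses a single uncompressed WARC record by hand with the standard library only. It is not how warc implements reading. The function name parse_record and the sample bytes are hypothetical, and the sketch assumes the usual WARC layout: a version line, header lines, a blank line, then a payload of Content-Length bytes.

```python
import io


def parse_record(stream):
    """Parse one WARC record from a binary stream (illustrative only)."""
    version = stream.readline().strip().decode("utf-8")
    headers = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b"\n", b""):
            break  # a blank line (or end of input) ends the header block
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()
    # The payload length is taken from the Content-Length header.
    payload = stream.read(int(headers.get("Content-Length", 0)))
    return version, headers, payload


# Hypothetical sample record, built in memory instead of read from disk.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 10\r\n"
    b"\r\n"
    b"helloworld"
    b"\r\n\r\n"
)

version, headers, payload = parse_record(io.BytesIO(sample))
print(headers["WARC-Target-URI"], headers["Content-Length"])
```

In practice the library does this work for you; the sketch only shows why a record exposes both named headers and a payload.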
Writing a WARC File
Writing to a WARC file is similar to writing to a regular file:
f = warc.open("test.warc.gz", "w")
f.write_record(warc_record1)
f.write_record(warc_record2)
f.close()
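As a rough illustration of what writing records amounts to at the byte level, the sketch below serializes two records and appends each to a gzip file using only the standard library. It does not use warc and is not its implementation. The helper serialize_record, the file name sketch.warc.gz and the payloads are hypothetical, and headers a real record needs (such as WARC-Record-ID and WARC-Date) are left out for brevity.

```python
import gzip
import os
import tempfile


def serialize_record(headers, payload):
    """Return the bytes of one WARC/1.0 record (illustrative only)."""
    # Content-Length is derived from the payload rather than supplied by hand.
    headers = {**headers, "Content-Length": str(len(payload))}
    lines = ["WARC/1.0"] + ["%s: %s" % item for item in headers.items()]
    head = "\r\n".join(lines).encode("utf-8")
    # A blank line separates headers from payload; two CRLFs end the record.
    return head + b"\r\n\r\n" + payload + b"\r\n\r\n"


path = os.path.join(tempfile.mkdtemp(), "sketch.warc.gz")
records = [
    ({"WARC-Type": "resource"}, b"first payload"),
    ({"WARC-Type": "resource"}, b"second payload"),
]
for headers, payload in records:
    # Opening in "ab" mode appends each record as a separate gzip member.
    with gzip.open(path, "ab") as f:
        f.write(serialize_record(headers, payload))

# Reading the file back decompresses all members in sequence.
with gzip.open(path, "rb") as f:
    data = f.read()
print(data.count(b"WARC/1.0"))
```

Compressing each record as its own gzip member is a common convention for .warc.gz files, since it lets a reader seek to a single record without decompressing the whole file.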
Working with WARC Header
The warc.WARCHeader object contains the WARC headers that precede the payload. It behaves like a dictionary.
>>> h = warc.WARCHeader({
... "WARC-Type": "response",
... "WARC-Date": "2012-02-03T04:05:06Z",
... "WARC-Record-ID": "<urn:uuid:80fb9262-5402-11e1-8206-545200690126>",
... "Content-Length": "42"
... })
>>>
>>> h['WARC-Type']
'response'
>>> h['WARC-Record-ID']
'<urn:uuid:80fb9262-5402-11e1-8206-545200690126>'
>>> h['Content-Length']
'42'
The headers are case-insensitive.
>>> h['warc-type']
'response'
>>> h['WARC-RECORD-ID']
'<urn:uuid:80fb9262-5402-11e1-8206-545200690126>'
The WARCHeader object is a real dictionary.
>>> h.keys()
['warc-type', 'content-length', 'warc-date', 'warc-record-id']
>>> h.values()
['response', '42', '2012-02-03T04:05:06Z', '<urn:uuid:80fb9262-5402-11e1-8206-545200690126>']
>>> h.get("Content-Type", "application/octet-stream")
'application/octet-stream'
The commonly used headers are accessible as attributes as well.
>>> h.type
'response'
>>> h.record_id
'<urn:uuid:80fb9262-5402-11e1-8206-545200690126>'
>>> h.content_length
42
>>> h.date
'2012-02-03T04:05:06Z'
Note that h.content_length is an integer, whereas h['Content-Length'] is a string.
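The behaviour described in this section (case-insensitive lookup, lower-cased keys, and an attribute that returns a typed value) can be approximated with a small dict subclass. The class below, HeaderSketch, is a hypothetical stand-in written for illustration and makes no claim about how WARCHeader is actually implemented.

```python
class HeaderSketch(dict):
    """Dictionary that lower-cases header names (hypothetical stand-in)."""

    def __init__(self, headers):
        super().__init__()
        for name, value in headers.items():
            self[name] = value  # goes through __setitem__ below

    def __setitem__(self, name, value):
        super().__setitem__(name.lower(), value)

    def __getitem__(self, name):
        return super().__getitem__(name.lower())

    def get(self, name, default=None):
        return super().get(name.lower(), default)

    @property
    def content_length(self):
        # The stored value stays a string; the attribute converts it to int.
        return int(self["Content-Length"])


h = HeaderSketch({"WARC-Type": "response", "Content-Length": "42"})
print(h["warc-type"], h["WARC-TYPE"])
print(h.content_length + 1)
print(h.get("Content-Type", "application/octet-stream"))
```

Normalizing names once on write keeps lookups simple, and it is consistent with the lower-cased names that keys() returns in the session above.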
When a new WARCHeader object is created, the WARC-Record-ID, WARC-Date and Content-Type headers can be initialized automatically.
>>> h = warc.WARCHeader({"WARC-Type": "response"}, defaults=True)
>>> h['WARC-Record-ID']
'<urn:uuid:3457ee2c-5e2c-11e1-a8ff-c42c0325ac11>'
>>> h['WARC-Date']
'2012-02-23T14:39:34Z'
>>> h['Content-Type']
'application/http; msgtype=response'
The WARC-Record-ID is set to a UUID, WARC-Date is set to the current datetime and Content-Type is initialized based on the WARC-Type.
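For readers who want to see how such defaults could be derived, here is a minimal standard-library sketch. It is an approximation, not the library's code: the function with_defaults and the table DEFAULT_CONTENT_TYPES are hypothetical, the choice of uuid4 is an assumption (the text above only says "a UUID"), and only the entry for response is confirmed by the output shown above.

```python
import uuid
from datetime import datetime, timezone

# Assumed mapping from WARC-Type to a default Content-Type.
# Only the "response" entry is confirmed by the session above.
DEFAULT_CONTENT_TYPES = {
    "warcinfo": "application/warc-fields",
    "request": "application/http; msgtype=request",
    "response": "application/http; msgtype=response",
    "metadata": "application/warc-fields",
}


def with_defaults(headers):
    """Return a copy of headers with missing defaults filled in."""
    filled = dict(headers)
    # setdefault leaves any value the caller already supplied untouched.
    filled.setdefault("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4())
    now = datetime.now(timezone.utc)
    filled.setdefault("WARC-Date", now.strftime("%Y-%m-%dT%H:%M:%SZ"))
    content_type = DEFAULT_CONTENT_TYPES.get(filled.get("WARC-Type"))
    if content_type is not None:
        filled.setdefault("Content-Type", content_type)
    return filled


h = with_defaults({"WARC-Type": "response"})
print(h["WARC-Record-ID"])
print(h["WARC-Date"])
print(h["Content-Type"])
```

The record ID and date differ on every run, so only their shape is predictable.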
Working with WARCRecord
A WARCRecord can be created by passing a WARCHeader object and a payload, which defaults to None when unspecified.
>>> header = warc.WARCHeader({"WARC-Type": "response"}, defaults=True)
>>> record = warc.WARCRecord(header, "helloworld")
Alternatively, it can be created by passing a dictionary of headers.
>>> record = warc.WARCRecord(payload="helloworld", headers={"WARC-Type": "response"})