Streaming Tar Files in Python
A project that I have been working on required some functionality to do the following: fetch a compressed tar archive from the internet, extract it, do some munging on the files and then dump it out to blob storage. In the interest of efficiency I didn't want to have to download the files, save them to disk and then extract them before beginning to process them. Instead I opted to stream the download, and then decompress, munge and dump on the fly.
The story today won't involve any munging or dumping, sorry folks. Instead we are just looking to fetch a file, decompress it on the fly and iterate over the extracted files.
The first archive I was interested in consuming was a `.tar.gz`, so I cooked up a quick function to do the work.
```python
from pathlib import Path
import tarfile

def iter_tar_gz(tar_bytes):
    tfile = tarfile.open(fileobj=tar_bytes, mode='r|gz')
    for t in tfile:
        if not t.isfile(): continue
        path = Path(t.path)
        f = tfile.extractfile(t)
        yield path, f
```
Pretty good, this generator will produce tuples of the filename and a file object from which the bytes of the file can be read. Now it's a simple case of supplying this function with a stream and we're good.
```python
import requests

archive_url = "https://github.com/torvalds/linux/archive/refs/tags/v6.12-rc6.tar.gz"
r = requests.get(archive_url, stream=True)

for path, f in iter_tar_gz(r.raw):
    do_work(path, f)
```
Everything good here, we can stream the contents of the archive file by file.
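For reference, `do_work` is just a stand-in for whatever processing belongs to the munging step; a trivial placeholder of my own, purely to prove the stream is flowing, could be something like:

```python
def do_work(path, f):
    # path is a pathlib.Path, f is the file-like object returned by extractfile()
    print(path, len(f.read()))
```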
I went back to working on some other things and after some time came across another archive that I wanted to fetch in a similar fashion. This time, however, the archive was bz2-compressed, so I would need to modify my approach.
After checking the tarfile docs I saw that simply setting the mode to `r` would allow my function to transparently decompress lzma, gzip and bzip2.
```python
from pathlib import Path
import requests
import tarfile

def iter_tar_gz(tar_bytes):
    tfile = tarfile.open(fileobj=tar_bytes)
    for t in tfile:
        if not t.isfile(): continue
        path = Path(t.path)
        f = tfile.extractfile(t)
        yield path, f

archive_url = "https://anaconda.org/pytorch/pytorch/2.5.1/download/win-64/pytorch-2.5.1-py3.12_cuda11.8_cudnn9_0.tar.bz2"
r = requests.get(archive_url, stream=True)

for path, f in iter_tar_gz(r.raw):
    do_work(path, f)
```
I'll fix the name at a later date, but the meat of the fix is in. And here comes the problem :(
```
File ~/miniconda3/envs/midi-etl-new/lib/python3.11/_compression.py:29, in BaseStream._check_can_seek(self)
     26     raise io.UnsupportedOperation("Seeking is only supported "
     27                                   "on files open for reading")
     28 if not self.seekable():
---> 29     raise io.UnsupportedOperation("The underlying file object "
     30                                   "does not support seeking")

UnsupportedOperation: The underlying file object does not support seeking
```
(if anyone knows how to properly syntax highlight Python errors in markdown let me know!)
Seek not allowed here? Hmm, that's weird, for a couple of reasons:
- Why does bzip2 need seek?
- Why doesn’t gzip?
- They use the same underlying object, and that is not seekable… so how does gzip support seek?
My first thought was that the issue might be a lack of support for range requests at the archive's host. However, after checking some alternate `.tar.bz2`s, I discovered that this was not the case.
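In case anyone wants to run that check themselves, one quick way to see whether a host honours range requests is to ask for a single byte and look for a 206 Partial Content response; roughly something like this (the helper name is my own, not from the post):

```python
import requests

def supports_range_requests(url: str) -> bool:
    # Ask for just the first byte; hosts that honour Range reply 206 Partial Content.
    r = requests.get(url, headers={"Range": "bytes=0-0"}, stream=True)
    r.close()
    return r.status_code == 206
```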
Next, I had a look at the seeking behavior:
```python
import gzip, bz2
import requests

bz_url = "https://anaconda.org/pytorch/pytorch/2.5.1/download/win-64/pytorch-2.5.1-py3.12_cuda11.8_cudnn9_0.tar.bz2"
gz_url = "https://github.com/torvalds/linux/archive/refs/tags/v6.12-rc6.tar.gz"

def can_seek(f, n):
    try:
        f.seek(n)
        return "can"
    except:
        return "cant"

with (
    requests.get(gz_url, stream=True).raw as gz_,
    gzip.open(gz_) as gz,
    requests.get(bz_url, stream=True).raw as bz_,
    bz2.open(bz_) as bz,
):
    print(f"gz is seekable: {gz.seekable()}")
    print(f"bz is seekable: {bz.seekable()}")
    print(f"gz {can_seek(gz, 10)} seek forward")
    print(f"bz {can_seek(bz, 10)} seek forward")
    print(f"gz {can_seek(gz, 0)} seek backward")
    print(f"bz {can_seek(bz, 0)} seek backward")
```
Which gives:
```
gz is seekable: True
bz is seekable: False
gz can seek forward
bz cant seek forward
gz cant seek backward
bz cant seek backward
```
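The same split shows up without any network involved. If you wrap an in-memory buffer so that it claims not to be seekable (the `NonSeekableBytesIO` class below is my own sketch, not from any library), bz2 raises the exact error from earlier while gzip still reports that it can seek:

```python
import bz2, gzip, io

class NonSeekableBytesIO(io.BytesIO):
    """An in-memory buffer that, like a socket, claims not to support seeking."""
    def seekable(self):
        return False
    def seek(self, *args):
        raise io.UnsupportedOperation("seek")

gz = gzip.GzipFile(fileobj=NonSeekableBytesIO(gzip.compress(b"hello" * 100)))
bz = bz2.BZ2File(NonSeekableBytesIO(bz2.compress(b"hello" * 100)))

print(gz.seekable())  # True, same as over the network stream above
print(bz.seekable())  # False
bz.seek(10)           # raises the same UnsupportedOperation as in the traceback
```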
So that's interesting, gzip seems to have implemented some limited seek functionality. Does that mean my streaming `.tar.bz2` dreams are dead? Let us see!
I did some significant web hunting and couldn't find much in the way of information; there was plenty on streaming decompression, but the streams only ever came from disk, where seek is supported, so it wasn't useful to me.
After much digging I eventually found conda-package-streaming, and on PyPI there is an example that implies they can stream Conda packages, which happen to be `.tar.bz2`s, so maybe I can look there for clues.
I followed the code path and compared for differences. These are the things I found and tested, in order of checking:
- A Session is used rather than requests directly, and the headers are different. here

  ```python
  session = requests.Session()
  session.headers["User-Agent"] = "conda-package-streaming/0.1.0"
  response = session.get(url, stream=True, headers={"Connection": "close"})
  ```

- The stream is decompressed using bz2, so tar doesn't handle the decompression here

  ```python
  reader = bz2.open(fileobj or filename, mode="rb")
  ```

- The tarfile mode and encoding are different here

  ```python
  with tarfile_open(fileobj=fileobj, mode="r|", encoding=encoding) as tar
  ```
And there we see it, it must be the encoding, right?
No, it turns out if you read just a little further into the tarfile docs, you'll see there is a wealth of information on consuming streams, and the magic to achieving that is to set the mode to `r|`. You might also notice this was used in the very first iteration of `iter_tar_gz`!
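So, leaving the encoding argument aside, the crucial part of what conda-package-streaming does boils down to roughly this (my paraphrase of the snippets above, reusing the `archive_url` from the bz2 example, not their exact code):

```python
import bz2
import tarfile
import requests

session = requests.Session()
session.headers["User-Agent"] = "conda-package-streaming/0.1.0"
response = session.get(archive_url, stream=True, headers={"Connection": "close"})

# bz2 handles the decompression, so tarfile only ever sees plain tar bytes...
reader = bz2.open(response.raw, mode="rb")

# ...and "r|" tells tarfile to treat that as a forward-only stream: no seeking.
with tarfile.open(fileobj=reader, mode="r|") as tar:
    for member in tar:
        ...
```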
The docs show you can allow for transparent stream decompression using `r|*`, which will have tarfile detect which compression is in use. So with that and a bit of typing to round things off, here's the final function:
```python
from pathlib import Path
import tarfile
from typing import Iterable, Tuple
import typing

class BinaryFileLike(typing.Protocol):
    def read(self) -> bytes:
        ...

TarFiles = Iterable[Tuple[Path, BinaryFileLike]]

def iter_tar_stream(tar_stream: BinaryFileLike) -> TarFiles:
    tfile = tarfile.open(fileobj=tar_stream, mode='r|*')
    for t in tfile:
        if not t.isfile(): continue
        path = Path(t.path)
        f = tfile.extractfile(t)
        yield path, f
```
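It gets used the same way as before, only now the compression format no longer matters; `do_work` is still whatever processing you need:

```python
import requests

r = requests.get(archive_url, stream=True)
for path, f in iter_tar_stream(r.raw):
    do_work(path, f)
```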
There's nothing like spending several hours looking around the Internet to avoid 5 minutes of reading the docs! On the plus side, I see `conda-package-streaming` setting the user agent and that seems like a good idea to adopt… also that thing going on in the seeking exploration script with the chained `with`s :)
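Adopting the user-agent idea is a one-liner on top of the snippets above (the agent string here is just a placeholder, not anything official):

```python
session = requests.Session()
session.headers["User-Agent"] = "my-tar-streamer/0.1.0"  # placeholder identifier
r = session.get(archive_url, stream=True)
```

And for anyone copying the seek experiment, the parenthesised group of context managers in a single `with` needs Python 3.10 or newer.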