Reading Compressed LZMA Files on-the-fly using a Custom C++ std::streambuf

17 Jun 2019, CPOL, 2 min read
How to read compressed LZMA files on-the-fly using a custom C++ std::streambuf

Introduction

Handling big text files can be quite demanding in terms of disk space and bandwidth if these files are frequently shared over a network, especially when the total amount of data exceeds several terabytes.

LZMA compression offers a remedy: it achieves very good compression ratios for text files (in my current case, it reduces the files to about 25% of their original size). At the same time, decompression is relatively fast, which makes LZMA suitable for on-the-fly decompression, a necessity if the data is accessed frequently.

To access LZMA-compressed files, I use liblzma (XZ Utils, https://tukaani.org/xz/), which ships with most Linux distributions and is also available for Windows.

In this tip, I present a customized std::streambuf which can be used in conjunction with a std::istream to read data from an LZMA-compressed file on-the-fly. The code was tested on Linux with GCC 6.4.0 and liblzma 5.2.2.

Using the Code

Minimum Example

As an example, the following snippet opens the LZMA compressed file "test.dat.xz", passes it to the customized LZMAStreamBuf and writes the decompressed data line by line to STDOUT. If an error occurs while decompressing, the badbit of the istream is set.

C++
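ifstream ifs("test.dat.xz", ios::in | ios::binary);
LZMAStreamBuf lzmaBuf(&ifs);
istream in(&lzmaBuf);

// getline() returns the stream, which evaluates to false on EOF or on error,
// so no spurious empty line is printed after the last line has been read
string sLine;
while(getline(in, sLine))
    cout << sLine << '\n';

// distinguish a decoding error from a regular end of file
if(in.bad())
    cerr << "Error while decompressing test.dat.xz" << endl;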

Customized std::streambuf

The functionality of streams in the standard library is generally extended by creating a customized std::streambuf class. This customized class only provides the raw data, while the std::istream instance takes care of the typical "stream behavior" in C++.

Inheriting from std::streambuf in this example is rather straightforward, since I only provide the associated character sequence (https://en.cppreference.com/w/cpp/io/basic_streambuf) for sequential reading, i.e., the read position cannot be relocated freely (no seeking). Imagine, for example, a simple network socket: the data is gone as soon as it has been read from the source.

To implement this behavior, I override the method underflow, which is called on behalf of std::istream whenever the buffer is exhausted and more data is needed. Once underflow has made new data available, it registers the new buffer contents via the setg(...) method and returns the first available byte of the buffer without consuming it. If no more data is available, it returns traits_type::eof(). All the usual stream functionality is then provided by std::istream alone.

C++
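#include <cassert>
#include <cstdint>
#include <istream>
#include <limits>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <streambuf>

#include <lzma.h> // liblzma (XZ Utils)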

class LZMAStreamBuf : public std::streambuf
{
public:
    LZMAStreamBuf(std::istream* pIn)
        : m_pIn(pIn)
        , m_nBufLen(10000) // buffer size in bytes; free to choose
        , m_lzmaStream(LZMA_STREAM_INIT)
    {
        m_pCompressedBuf.reset(new char[m_nBufLen]);
        m_pDecompressedBuf.reset(new char[m_nBufLen]);

        // Initially indicate that the buffer is empty
        setg(&m_pDecompressedBuf[0], &m_pDecompressedBuf[1], &m_pDecompressedBuf[1]);

        // try to open the decoder (no memory limit, concatenated streams allowed):
        lzma_ret ret = lzma_stream_decoder
               (&m_lzmaStream, std::numeric_limits<uint64_t>::max(), LZMA_CONCATENATED);
        if(ret != LZMA_OK)
            throw std::runtime_error("LZMA decoder could not be opened\n");

        m_lzmaStream.avail_in = 0;        
    }

    virtual ~LZMAStreamBuf()
    {
        // release the memory allocated by the decoder
        lzma_end(&m_lzmaStream);
    }

    int_type underflow() override final
    {
        lzma_action action = LZMA_RUN;
        lzma_ret ret = LZMA_OK;

        // Do nothing if data is still available (sanity check)
        if(this->gptr() < this->egptr())
            return traits_type::to_int_type(*this->gptr());

        while(true)
        {
            m_lzmaStream.next_out = 
                   reinterpret_cast<unsigned char*>(m_pDecompressedBuf.get());
            m_lzmaStream.avail_out = m_nBufLen;

            if(m_lzmaStream.avail_in == 0)
            {
                // Read from the file, maximum m_nBufLen bytes
                m_pIn->read(&m_pCompressedBuf[0], m_nBufLen);

                // check for possible I/O error
                if(m_pIn->bad())
                    throw std::runtime_error
                     ("LZMAStreamBuf: Error while reading the provided input stream!");

                m_lzmaStream.next_in = 
                     reinterpret_cast<unsigned char*>(m_pCompressedBuf.get());
                m_lzmaStream.avail_in = static_cast<size_t>(m_pIn->gcount());
            }

            // check for eof of the compressed file;
            // if yes, forward this information to the LZMA decoder
            if(m_pIn->eof())
                action = LZMA_FINISH;

            // DO the decoding
            ret = lzma_code(&m_lzmaStream, action);

            // check for data
            // NOTE: avail_out gives that amount of data which is available for LZMA to write!
            //         NOT the size of data which has been written for us!
            if(m_lzmaStream.avail_out < m_nBufLen)
            {
                const size_t nDataAvailable = m_nBufLen - m_lzmaStream.avail_out;

                // Let std::streambuf know how much data is available in the buffer now
                setg(&m_pDecompressedBuf[0], &m_pDecompressedBuf[0], 
                                   &m_pDecompressedBuf[0] + nDataAvailable);
                return traits_type::to_int_type(m_pDecompressedBuf[0]);
            }

            if(ret != LZMA_OK)
            {
                if(ret == LZMA_STREAM_END)
                {
                    // This return code is desired if eof of the source file has been reached
                    assert(action == LZMA_FINISH);
                    assert(m_pIn->eof());
                    assert(m_lzmaStream.avail_out == m_nBufLen);
                    return traits_type::eof();
                }

                // an error has occurred while decoding; reset the buffer
                setg(nullptr, nullptr, nullptr);

                // Throwing an exception will set the bad bit of the istream object
                std::stringstream err;
                err << "Error " << ret << " occurred while decoding LZMA file!";
                throw std::runtime_error(err.str());
            }            
        }
    }

private:
    std::istream* m_pIn;
    std::unique_ptr<char[]> m_pCompressedBuf, m_pDecompressedBuf;
    const size_t m_nBufLen;
    lzma_stream m_lzmaStream;
};
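
The underflow/setg contract described above can also be tried out in isolation, without liblzma. The following is only a minimal sketch under that assumption: ChunkedStreamBuf and readLinesChunked are hypothetical names that do not appear in the class above, and the "decoder" simply copies bytes from the source stream in small chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>
#include <vector>

// Hypothetical helper (not part of the article's code): same structure as
// LZMAStreamBuf, but the "decoding" step just copies bytes from the source.
class ChunkedStreamBuf : public std::streambuf
{
public:
    ChunkedStreamBuf(std::istream* pIn, std::size_t nBufLen)
        : m_pIn(pIn)
        , m_buf(nBufLen)
    {
        assert(pIn != nullptr && nBufLen > 0);
        // begin == current == end: the get area is empty, so the first read
        // on the owning istream triggers underflow()
        setg(m_buf.data(), m_buf.data(), m_buf.data());
    }

protected:
    int_type underflow() override
    {
        // data still available: return it without refilling
        if(gptr() < egptr())
            return traits_type::to_int_type(*gptr());

        // refill the buffer from the source, at most m_buf.size() bytes
        m_pIn->read(m_buf.data(), static_cast<std::streamsize>(m_buf.size()));
        const std::streamsize nRead = m_pIn->gcount();
        if(nRead <= 0)
            return traits_type::eof(); // source exhausted

        // publish the new data: begin, current read position, end
        setg(m_buf.data(), m_buf.data(), m_buf.data() + nRead);
        return traits_type::to_int_type(*gptr());
    }

private:
    std::istream* m_pIn;
    std::vector<char> m_buf;
};

// Reads all lines through the custom buffer; 'text' stands in for a file.
std::vector<std::string> readLinesChunked(const std::string& text, std::size_t nBufLen)
{
    std::istringstream source(text);
    ChunkedStreamBuf buf(&source, nBufLen);
    std::istream in(&buf);

    std::vector<std::string> lines;
    std::string sLine;
    while(std::getline(in, sLine))
        lines.push_back(sLine);
    return lines;
}
```

Calling readLinesChunked with a buffer smaller than a line shows that std::istream assembles lines across several underflow calls; the stream buffer itself never needs to know about line boundaries.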

Points of Interest

  • In case of an error, a std::runtime_error is thrown. This exception is generally caught by std::istream, which then sets the badbit of the stream. One should therefore test that condition regularly by calling std::istream::bad(). Alternatively, one can make std::istream forward the exceptions directly to the caller (who must then catch them) by calling std::istream::exceptions() (https://en.cppreference.com/w/cpp/io/basic_ios/exceptions) beforehand.
  • Through the std::streambuf::setg() method (https://en.cppreference.com/w/cpp/io/basic_streambuf/setg), underflow tells the std::streambuf base class where the buffer is. The first argument marks the beginning of the buffer, the second one the current read position and the last one the end of the buffer (one past the last valid byte). In case of an error, all three pointers should be set to nullptr.
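
The two ways of handling errors from the first point can be sketched as follows. This is an illustration only: FailingStreamBuf is a hypothetical stand-in for a decoder that fails, and the two helper functions are made up for this example; only std::getline, bad() and exceptions() are standard library calls.

```cpp
#include <cassert>
#include <istream>
#include <stdexcept>
#include <streambuf>
#include <string>

// Hypothetical stand-in for a failing decoder: underflow() always throws.
class FailingStreamBuf : public std::streambuf
{
protected:
    int_type underflow() override
    {
        throw std::runtime_error("decoding failed");
    }
};

// Default behaviour: the exception is swallowed during input and badbit is set.
bool badbitSetAfterRead()
{
    FailingStreamBuf buf;
    std::istream in(&buf);
    std::string sLine;
    std::getline(in, sLine); // does not throw
    return in.bad();
}

// With exceptions(badbit), the original exception is rethrown to the caller.
std::string messageOfForwardedException()
{
    FailingStreamBuf buf;
    std::istream in(&buf);
    in.exceptions(std::ios::badbit);
    std::string sLine;
    try
    {
        std::getline(in, sLine);
    }
    catch(const std::runtime_error& e)
    {
        assert(in.bad()); // badbit is set before the exception is rethrown
        return e.what();
    }
    return "";
}
```

Note that exceptions() should be called on a stream that is still in a good state; if the requested bits are already set at the time of the call, the call itself throws std::ios_base::failure.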

History

  • 17th June, 2019: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

