Extracting files from a remote ZIP archive

30 Oct 2004
This article presents a technique to access parts of a ZIP archive stored on a Web Server.

Introduction

This article explains a technique for accessing files stored inside a ZIP file without downloading the whole archive. It can be used to access parts of a big ZIP archive, or to download only specific parts of an archive. It works with OpenOffice documents too.

The standard approach to serving parts of an archive through a Web Server is a server-side script, which sends the individual parts uncompressed. The solution presented here works entirely on the client side and uses the Range header of the HTTP protocol to retrieve specific parts of a ZIP archive.

The analysis and decompression of the ZIP archives is performed using the open source SharpZipLib library.

Architecture

We start with a description of the ZIP archive format, which is needed to explain how this technique works. A ZIP archive is organized as shown in the figure below: first come the compressed files, each preceded by a local header, and then a central directory that stores the details of every file again. The central directory speeds up reading the file listing, and it is stored at the end so that the archive can be written to a stream in a single pass. The central directory consists of one record per file, followed by an end marker, the End of Central Directory record.

ZIP File Format - diagram.png

A record in the central directory contains the offset of the corresponding Local Header, which can be used to locate and extract that file. Unfortunately, the Local Header still has to be read, because the central directory record does not give the size of the Local Header.
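As a rough illustration of what one central directory record carries, the following sketch reads the fields that matter for this technique from a byte buffer. The field offsets follow the published ZIP format; the class name `CentralDirectoryRecord` and its members are hypothetical and are not taken from SharpZipLib or from the article's download. File names are decoded as ASCII, which is a simplifying assumption.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of SharpZipLib: one central directory record.
public sealed class CentralDirectoryRecord
{
    public const uint Signature = 0x02014b50; // bytes "PK", 0x01, 0x02
    public const int FixedSize = 46;          // size of the fixed part of a record

    public string Name = "";
    public long CompressedSize;
    public long UncompressedSize;
    public long LocalHeaderOffset; // where the entry's Local Header starts
    public int TotalLength;        // fixed part + name + extra field + comment

    // Parses the record that starts at buf[pos].
    public static CentralDirectoryRecord Parse(byte[] buf, int pos)
    {
        if (buf == null) throw new ArgumentNullException(nameof(buf));
        if (pos < 0 || pos + FixedSize > buf.Length || U32(buf, pos) != Signature)
            throw new FormatException("No central directory record at this position.");

        int nameLength = U16(buf, pos + 28);
        int extraLength = U16(buf, pos + 30);
        int commentLength = U16(buf, pos + 32);
        int total = FixedSize + nameLength + extraLength + commentLength;
        if (pos + total > buf.Length)
            throw new FormatException("Truncated central directory record.");

        return new CentralDirectoryRecord
        {
            // Assumption: ASCII names; real archives may use CP437 or UTF-8.
            Name = Encoding.ASCII.GetString(buf, pos + FixedSize, nameLength),
            CompressedSize = U32(buf, pos + 20),
            UncompressedSize = U32(buf, pos + 24),
            LocalHeaderOffset = U32(buf, pos + 42),
            TotalLength = total
        };
    }

    // ZIP stores all integers in little-endian order.
    private static int U16(byte[] b, int i) => b[i] | (b[i + 1] << 8);

    private static uint U32(byte[] b, int i) =>
        (uint)(b[i] | (b[i + 1] << 8) | (b[i + 2] << 16) | (b[i + 3] << 24));
}
```

Walking the whole directory would then amount to calling `Parse` repeatedly and advancing the position by `TotalLength` each time.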

Additional details on the ZIP file format can be obtained here.

The idea is to use the Range header of the HTTP protocol to extract the listing of the files from the Central Directory. When the client application requests a file in the archive, only the compressed content is transferred and the decompression is performed on the client side. Clearly, this technique works best when the communication uses the Keep-Alive option of HTTP.

As stated in RFC 2616 (HTTP/1.1), the Range request header specifies the first byte position and, optionally, the last byte position of the part to retrieve. A suffix range, written with a leading minus sign as in bytes=-500, instead counts from the end of the data and selects the last 500 bytes.

The Web Server responds to a Range request with a "206 Partial Content" response that specifies the ranges that are sent back to the client. The Content-Range response header reports the returned range. The HTTP protocol also supports multiple ranges in a single request.
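To make the Content-Range value concrete, here is a minimal sketch that splits a value such as `bytes 500-999/8000` into its first byte, last byte and total length. The helper name `ContentRange.TryParse` is an assumption made for illustration; it only handles the plain byte-range form and the `*` placeholder for an unknown total.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration: parses "bytes first-last/total".
public static class ContentRange
{
    public static bool TryParse(string value, out long first, out long last, out long? total)
    {
        first = 0; last = 0; total = null;
        if (string.IsNullOrWhiteSpace(value)) return false;

        string v = value.Trim();
        const string unit = "bytes ";
        if (!v.StartsWith(unit, StringComparison.OrdinalIgnoreCase)) return false;

        // Split "500-999/8000" into the range and the total length.
        string[] rangeAndTotal = v.Substring(unit.Length).Split('/');
        if (rangeAndTotal.Length != 2) return false;

        string[] bounds = rangeAndTotal[0].Split('-');
        if (bounds.Length != 2) return false;

        // NumberStyles.None accepts digits only (no sign, no whitespace).
        if (!long.TryParse(bounds[0], NumberStyles.None, CultureInfo.InvariantCulture, out first)) return false;
        if (!long.TryParse(bounds[1], NumberStyles.None, CultureInfo.InvariantCulture, out last)) return false;
        if (last < first) return false;

        // "*" means the server does not report the total length.
        if (rangeAndTotal[1] == "*") return true;

        if (!long.TryParse(rangeAndTotal[1], NumberStyles.None, CultureInfo.InvariantCulture, out long t)) return false;
        total = t;
        return true;
    }
}
```

A client could use the parsed total to learn the archive size from the very first suffix request, without issuing a separate HEAD request.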

Implementation

The RemoteZipFile class contains the whole implementation of this technique. The most interesting part is the search for the End of Central Directory record of the ZIP archive, because it sits at the end of the file but has a variable length.
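One possible way to do this search, given the last bytes of the archive in memory, is to scan backwards for the record signature and check that the comment length stored in the candidate record reaches exactly to the end of the buffer. This is a sketch under that assumption and is not the code from the article's download; the name `EndOfCentralDirectory.TryFind` is hypothetical.

```csharp
using System;

// Hypothetical helper: locates the End of Central Directory record in the
// tail of a ZIP archive and extracts where the central directory lives.
public static class EndOfCentralDirectory
{
    public const uint Signature = 0x06054b50; // bytes "PK", 0x05, 0x06
    public const int FixedSize = 22;          // record size without the comment

    public static bool TryFind(byte[] tail, out int entryCount,
                               out long directorySize, out long directoryOffset)
    {
        entryCount = 0; directorySize = 0; directoryOffset = 0;
        if (tail == null) return false;

        // Scan backwards: the record is the last structure in the file, but a
        // trailing comment of variable length may follow its fixed part.
        for (int i = tail.Length - FixedSize; i >= 0; i--)
        {
            if (U32(tail, i) != Signature) continue;

            // Reject signature bytes that merely occur inside the comment.
            int commentLength = U16(tail, i + 20);
            if (i + FixedSize + commentLength != tail.Length) continue;

            entryCount = U16(tail, i + 10);      // total number of entries
            directorySize = U32(tail, i + 12);   // size of the central directory
            directoryOffset = U32(tail, i + 16); // its offset from the file start
            return true;
        }
        return false;
    }

    // ZIP stores all integers in little-endian order.
    private static int U16(byte[] b, int i) => b[i] | (b[i + 1] << 8);

    private static uint U32(byte[] b, int i) =>
        (uint)(b[i] | (b[i + 1] << 8) | (b[i + 2] << 16) | (b[i + 3] << 24));
}
```

If the search fails, the archive comment is probably longer than the downloaded tail, and a caller could retry with a larger suffix range.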

To access a web resource, we use the HttpWebRequest class of the .NET Framework, whose AddRange method hides the details of building the Range header. When the range has a negative value, the offset is relative to the end of the resource.
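A minimal sketch of such a request follows. The class and method names (`RangeRequests`, `CreateTailRequest`, `Download`) are hypothetical, and the URL is whatever archive the caller wants to inspect. Note that `WebRequest.Create` is marked obsolete in recent .NET versions, where it compiles with a warning; it is used here because it is the API the article relies on.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helpers illustrating AddRange with a negative value.
public static class RangeRequests
{
    // Builds a request for the last byteCount bytes of the resource.
    public static HttpWebRequest CreateTailRequest(string url, int byteCount)
    {
        if (byteCount <= 0) throw new ArgumentOutOfRangeException(nameof(byteCount));

        var request = (HttpWebRequest)WebRequest.Create(url);
        request.KeepAlive = true;     // reuse the connection for later requests
        request.AddRange(-byteCount); // negative value: offset from the end
        return request;
    }

    // Sends the request and returns the body of the partial response.
    public static byte[] Download(HttpWebRequest request)
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var body = response.GetResponseStream())
        using (var buffer = new MemoryStream())
        {
            // A server without range support answers 200 with the whole file.
            if (response.StatusCode != HttpStatusCode.PartialContent)
                throw new InvalidOperationException("Server did not honour the Range header.");
            body.CopyTo(buffer);
            return buffer.ToArray();
        }
    }
}
```

For example, `RangeRequests.Download(RangeRequests.CreateTailRequest(url, 280))` would fetch the tail of the archive in which the End of Central Directory is searched.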

These are the HTTP requests required for extracting a file:

  1. Request the last 280 bytes of the archive to find the End of Central Directory. From the End of Central Directory, obtain the offset and length of the Central Directory;
  2. Load the Central Directory with the information about all the entries. Find the requested file and obtain the offset of its Local Header in the archive;
  3. Request a block of data starting at the Local Header and sized as the maximum size of the Local Header plus the compressed size; skip the Local Header part of the returned data. Then, serve the decompressed data as if it came directly from the Web Server.

Because the Local Header has a dynamic size, we request 16+64K*2+compressedSize bytes. The 64K*2 term is the maximum size of the dynamic part of the Local Header, which in practice is usually very small. An alternative would be to download only the static part of the Local Header and then perform another request to obtain the compressed data, but this should be avoided because of the overhead of the additional HTTP request.
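The arithmetic of the third request, and the skipping of the Local Header in the returned block, can be sketched as follows. The names `LocalHeader.WorstCaseRange` and `LocalHeader.DataStart` are hypothetical. The fixed part is taken as 30 bytes, the size given in the ZIP specification, rather than the 16 used in the text, so the numbers differ slightly from the article's formula.

```csharp
using System;

// Hypothetical helper: sizes a single request that is guaranteed to cover
// the Local Header and the compressed data, then finds where the data starts.
public static class LocalHeader
{
    public const uint Signature = 0x04034b50;     // bytes "PK", 0x03, 0x04
    public const int FixedSize = 30;              // fixed part of the header
    public const int MaxVariableSize = 2 * 65535; // file name + extra field

    // Computes the inclusive byte range to request for one entry.
    public static void WorstCaseRange(long localHeaderOffset, long compressedSize,
                                      out long first, out long last)
    {
        first = localHeaderOffset;
        // The range may run past the end of the archive; HTTP servers are
        // expected to truncate such a range at the last byte of the resource.
        last = localHeaderOffset + FixedSize + MaxVariableSize + compressedSize - 1;
    }

    // Returns the index in block at which the compressed data begins.
    public static int DataStart(byte[] block)
    {
        if (block == null) throw new ArgumentNullException(nameof(block));
        if (block.Length < FixedSize || U32(block, 0) != Signature)
            throw new FormatException("Block does not start with a Local Header.");

        int nameLength = U16(block, 26);
        int extraLength = U16(block, 28);
        return FixedSize + nameLength + extraLength;
    }

    // ZIP stores all integers in little-endian order.
    private static int U16(byte[] b, int i) => b[i] | (b[i + 1] << 8);

    private static uint U32(byte[] b, int i) =>
        (uint)(b[i] | (b[i + 1] << 8) | (b[i + 2] << 16) | (b[i + 3] << 24));
}
```

The trade-off is the one described above: one request that may transfer up to about 128 KB of unneeded data, against two requests that transfer exactly what is needed.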

Extension to SharpZipLib

This technique is presented as an extension to SharpZipLib. The RemoteZipFile class should be used instead of the ZipFile class: it provides an enumeration of the entries in the archive, and a Stream can be obtained for each entry to read its data.

Usage

C#
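// Load the file listing of the remote archive.
RemoteZipFile zip = new RemoteZipFile();
zip.Load(url);

// Enumerate the entries of the archive.
foreach(ZipEntry ze in zip)
{
    // ...
}

// Obtain a stream that yields the decompressed data of one entry.
ZipEntry ze = ...; 
Stream uncompressedStream = zip.GetInputStream(ze);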

Subclasses of the Stream class

An interesting implementation detail is the definition of two Stream classes that wrap other streams:

  1. NoCloseSubStream is a stream that is attached to another Stream and detaches itself on Close instead of closing it. It can be used when the underlying Stream should not be closed when the user has finished working on it;
  2. PartialInputStream is a stream that presents a part of a stream as a whole stream. It is used when decompressing the data coming from the Web Server, whose size is different from the size returned by the original stream.
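The two ideas can be illustrated with a single small wrapper. The following is a sketch, not the article's NoCloseSubStream or PartialInputStream: the hypothetical `BoundedReadStream` exposes at most a fixed number of bytes of an inner stream and, by default, leaves the inner stream open when it is disposed.

```csharp
using System;
using System.IO;

// Hypothetical read-only wrapper: presents the next `length` bytes of an
// inner stream as a complete stream of its own.
public sealed class BoundedReadStream : Stream
{
    private readonly Stream _inner;
    private readonly long _length;
    private readonly bool _leaveOpen;
    private long _remaining;

    public BoundedReadStream(Stream inner, long length, bool leaveOpen = true)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));
        _length = length;
        _remaining = length;
        _leaveOpen = leaveOpen;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => _length;

    public override long Position
    {
        get => _length - _remaining;
        set => throw new NotSupportedException();
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Report end of stream once the window is exhausted, even if the
        // inner stream still has data.
        if (_remaining <= 0) return 0;
        int read = _inner.Read(buffer, offset, (int)Math.Min(count, _remaining));
        _remaining -= read;
        return read;
    }

    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        // Detach from the inner stream instead of closing it, unless asked to.
        if (disposing && !_leaveOpen) _inner.Dispose();
        base.Dispose(disposing);
    }
}
```

A wrapper of this kind lets a decompressor read exactly the compressed bytes of one entry from a response that contains more data than the entry itself.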

Example Application

In the demo project, there is the RemoteZip application that can be used to test this technique. It accesses a remote zip file and shows all the contained files with the associated information. Each file can be saved on disk, or previewed as text or as an image (as long as .NET recognizes the image data).

This snapshot shows the preview of a text file inside the archive:

Sample Image - snapshot.jpg

In this case, an image inside an OpenOffice Text Document (.sxw) is previewed:

Sample Image - snapshot2.jpg

Conclusion and Future Work

This article presents one use of the HTTP Range option for accessing parts of large resources such as ZIP archives. The technique can be used to retrieve specific files from archives, like metadata from OpenOffice files or JARs, and it is efficient when the whole archive is not required.

It would be interesting to explore the possibilities of multi-range requests in the HTTP protocol. This is an example of a multi-range response from the HTTP specification:

HTTP/1.1 206 Partial Content
  Date: Wed, 15 Nov 1995 06:25:24 GMT
  Last-Modified: Wed, 15 Nov 1995 04:58:08 GMT
  Content-type: multipart/byteranges; boundary=THIS_STRING_SEPARATES

  --THIS_STRING_SEPARATES
  Content-type: application/pdf
  Content-range: bytes 500-999/8000

  ...the first range...
  --THIS_STRING_SEPARATES
  Content-type: application/pdf
  Content-range: bytes 7000-7999/8000

  ...the second range
  --THIS_STRING_SEPARATES--
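A request that asks for several ranges at once can be built by calling AddRange more than once on the same HttpWebRequest. The sketch below only constructs the request; parsing the multipart/byteranges response is left out. The class name `MultiRangeRequests` is hypothetical, and `WebRequest.Create` compiles with an obsolescence warning on recent .NET versions.

```csharp
using System;
using System.Net;

// Hypothetical helper: builds one request that covers several byte ranges.
public static class MultiRangeRequests
{
    public static HttpWebRequest Create(string url, params (int First, int Last)[] ranges)
    {
        if (ranges == null || ranges.Length == 0)
            throw new ArgumentException("At least one range is required.", nameof(ranges));

        var request = (HttpWebRequest)WebRequest.Create(url);
        request.KeepAlive = true;

        // Each call appends another first-last pair to the same Range header.
        foreach (var range in ranges)
            request.AddRange(range.First, range.Last);

        return request;
    }
}
```

For instance, `MultiRangeRequests.Create(url, (500, 999), (7000, 7999))` asks for the same two ranges that appear in the response shown above.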

Currently, the technique is limited to HTTP, but it could also be applied to local and FTP files. The case of FTP files is tricky because FTP uses the REST command to start the transfer from a specific position of the file, and the download then has to be interrupted once the required number of bytes has been transferred.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


Written By
Software Developer (Senior) Scuola Superiore S.Anna
Italy
Assistant Professor in Applied Mechanics working in Virtual Reality, Robotics and having fun with Programming
