Safeer's Techlog: Finding the size of a remote file from its URL

Consider the following cases:

You are about to download a file from the web, and before downloading, you want to know the size of that remote file.

OR

You downloaded a file earlier, but you are not sure whether it was downloaded completely.

From a browser like Firefox or Chrome, you can go to the download URL, and a window will pop up asking whether to save or open the file. The same popup usually mentions the size of the remote file. That is one way of doing it, but in many cases you will want to do this for multiple files, save the information in a report, or use it inside another script or task. In such scenarios it is better to do this from the command line, and the following command shows how.

To do this, we need the curl utility installed; on my machine it is installed as "/usr/bin/curl". We will first look at the command and then at the explanation.

safeer@penguinpower:~$ curl -sI DOWNLOAD_URL | awk '/Content-Length/{ print $2 }'

This will provide you the remote file size in bytes.
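Since one motivation is checking several files and saving the result in a report, the same pipeline can be wrapped in a loop. The following is only a sketch under stated assumptions: the function name size_report is made up for illustration, URLs are assumed to arrive one per line on standard input, and the field name is matched case-insensitively because some servers send it in lowercase. It also strips the carriage return that ends each HTTP header line.

```shell
# Hypothetical helper (not part of curl): read one URL per line on stdin and
# print "SIZE<TAB>URL" for each; SIZE is "unknown" when no Content-Length
# header came back.
size_report() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue    # skip blank lines
        size=$(curl -sI "$url" | tr -d '\r' |
            awk '!found && tolower($1) == "content-length:" { print $2; found = 1 }')
        printf '%s\t%s\n' "${size:-unknown}" "$url"
    done
}
```

It could be used as "size_report < urls.txt > report.tsv", where urls.txt is an assumed file containing the download links.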

The basic idea is to process the HTTP headers associated with the download link, without actually downloading the file.

The HTTP response header for a URL contains useful information such as the return status of your request, server details, content type, content length, etc. In this case, we are interested in the "Content-Length" field, which provides the size of the remote content ( the downloadable file in this case ) in bytes.

Let us examine a typical HTTP header. This header is for the download URL of the Apache HTTPD web server. We use curl with the -s ( silent ) and -I ( head, i.e. fetch the headers only ) flags to obtain this.

safeer@penguinpower:/tmp$ curl -sI http://apache.techartifact.com/mirror/httpd/httpd-2.4.3.tar.bz2
HTTP/1.1 200 OK
Date: Sun, 16 Dec 2012 19:46:14 GMT
Server: Apache/2.2.23 (Unix) mod_ssl/2.2.23 OpenSSL/0.9.8e-fips-rhel5 DAV/2 mod_mono/2.6.3 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_jk/1.2.35 mod_qos/9.74 mod_perl/2.0.6 Perl/v5.8.8
Last-Modified: Mon, 20 Aug 2012 13:22:55 GMT
ETag: "8708060-4591af-4c7b2684fa9c0"
Accept-Ranges: bytes
Content-Length: 4559279
Connection: close
Content-Type: application/x-tar

There are multiple fields in the header as you can see, but our interest is in the Content-Length field, which has a value of 4559279. So as per the header, the size of the httpd package is 4559279 bytes.
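The field lookup described above can be isolated into a small filter. This is a minimal sketch, not the command used in this post: the name content_length_from_headers is hypothetical, and it assumes the header block is supplied on standard input. Compared with a plain pattern match, it compares only the field name ( case-insensitively ) and removes the trailing carriage return that HTTP header lines carry, which would otherwise end up in the printed value.

```shell
# Hypothetical helper: read an HTTP header block on stdin and print the value
# of its first Content-Length field. Prints nothing if the field is absent.
content_length_from_headers() {
    # Header lines end in CR LF, so drop the CR before parsing.
    tr -d '\r' |
        awk '!found && tolower($1) == "content-length:" { print $2; found = 1 }'
}
```

A possible use is "curl -sI DOWNLOAD_URL | content_length_from_headers".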

Let us cross-verify by downloading the file.

safeer@penguinpower:/tmp$ wget -q http://apache.techartifact.com/mirror/httpd/httpd-2.4.3.tar.bz2
safeer@penguinpower:/tmp$ du -b httpd-2.4.3.tar.bz2
4559279 httpd-2.4.3.tar.bz2

Well, the file size is indeed 4559279 bytes.
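The same comparison can be scripted for the second case from the introduction, checking whether an earlier download is complete. The sketch below is illustrative only: size_matches is a made-up name, and the expected size is assumed to be passed in by the caller ( for example, the Content-Length value obtained earlier ). It uses wc -c rather than du -b because the -b option is specific to GNU du.

```shell
# Hypothetical helper: succeed only if FILE is exactly EXPECTED bytes long.
# Usage: size_matches FILE EXPECTED
# Returns 0 on a match, 1 on a mismatch, 2 if FILE is not a regular file.
size_matches() {
    [ -f "$1" ] || return 2
    # wc -c counts bytes; some implementations pad the number with spaces.
    actual=$(wc -c < "$1" | tr -d ' ')
    [ "$actual" -eq "$2" ]
}
```

For the file above, "size_matches httpd-2.4.3.tar.bz2 4559279 && echo complete" would print "complete" only when the sizes agree.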

Now we know how to extract the content length from the HTTP header. But is that all? What if the web server is functioning but the URL is not available, or some other issue prevents the server from returning the file? An error response such as 404 often carries its own Content-Length ( the size of the error page ), so the simple command may print a number that has nothing to do with the file. This may not be a problem when you are watching the terminal while the command runs, but inside a script or report such a value can go unnoticed.

To solve this, we first check the HTTP status code, i.e. the first line of the header, to see whether the response is 200 ( OK ). Only then do we return the content length; otherwise we return a negative value that your script or report can recognise as a failure. That value is the negative of the HTTP response code ( for example, -404 ), so it tells you both that the request failed and why. This helps in designing the fail-safe logic of your script.

safeer@penguinpower:/tmp$ curl -sI http://apache.techartifact.com/mirror/httpd/httpd-2.4.3.tar.bz2 | awk '/^HTTP\//{ if ( $2 == 200 ) { while ( ( getline ) > 0 ) if ( tolower($1) == "content-length:" ) { print $2; break } } else { print "-"$2 } }'

The awk program first looks for the HTTP status line. If the status code is 200 ( OK ), the getline function advances the current record to the next line until the first field of the current line is "Content-Length:", and then the second field of that line, which is the content length in bytes, is printed. Otherwise, the program prints the negative of the HTTP status code.

Note: Our script's logic still needs to handle curl errors, where there is no response from the web server at all ( DNS failure, no network connection, etc. ). Use the exit code of curl ( non-zero on failure ) to detect that. Because the exit status of a pipeline is that of its last command, capture curl's output first or, in bash, check the PIPESTATUS array.
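One way to combine the status check with curl's exit code is sketched below. It is an assumption-laden illustration rather than a finished tool: remote_size is a hypothetical name, the headers are captured into a variable first so that curl's own exit code is not hidden by the pipeline, and a 200 response without a Content-Length field simply prints nothing.

```shell
# Hypothetical wrapper: print the size in bytes for a 200 response, or the
# negated HTTP status otherwise. If curl itself fails (DNS, network, ...),
# print nothing and return curl's non-zero exit code.
remote_size() {
    # Capture first: inside a pipeline curl's exit code would be lost.
    headers=$(curl -sI "$1") || return $?
    printf '%s\n' "$headers" | tr -d '\r' | awk '
        done { next }
        /^HTTP\// { if ($2 != 200) { print "-" $2; done = 1 }; next }
        tolower($1) == "content-length:" { print $2; done = 1 }
    '
}
```

A caller can then tell the three outcomes apart: a non-zero return code means curl failed, a negative number is an HTTP error, and a positive number is the size in bytes.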


Friday, January 4, 2013
