Tuesday, January 8, 2013

Netcat as a file downloader


     The netcat utility is a multi-purpose tool for managing and manipulating TCP/IP traffic.  In this article, we will see how netcat can be used as a file downloader.  This comes in handy when you don't have utilities like wget/fetch/curl installed on your machine.

     Netcat ( "nc" is the name of the binary ) can establish a TCP connection to any server/port combination and send or receive data through the established channel.  To use it as a downloader, our strategy will be:
  • Establish a connection to the HTTP port of the server.
  • Send an HTTP request for the download link over the established connection.
  • Redirect the output of the HTTP response to a file ( which will be the downloaded file ).
     Let us try downloading the Apache httpd package from the URL  http://apache.techartifact.com/mirror/httpd/httpd-2.4.3.tar.gz

     First, let us establish a TCP connection to port 80 of the server apache.techartifact.com.  The command for this is:

/bin/nc apache.techartifact.com 80

     Second, let us construct an HTTP request.  This can be done in two ways - using protocol version HTTP/1.0 or HTTP/1.1.

     A generic HTTP request format consists of:
  • A request line ( further contains the request method - "GET" for download, the request URI - the whole/relative download URL, and the protocol version - HTTP/1.0 or HTTP/1.1 )
  • Multiple lines of HTTP headers ( each HTTP header is a single line containing a header name and a header value separated by a colon and a space )
  • An empty line
  • Message body
     Each of these lines is terminated by a Carriage Return ( \r ) and a Line Feed ( \n ) character.
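     Because the CRLF line endings matter, it helps to build a request with printf, whose escape handling is consistent across shells.  A small sketch ( the path /index.html is just a made-up example ):

```shell
# Build a minimal HTTP/1.0 request; each line ends with \r\n and the
# request is terminated by an extra empty \r\n line.  od -c displays the
# control characters so you can confirm the CR/LF pairs are really there.
printf 'GET /index.html HTTP/1.0\r\n\r\n' | od -c
```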

     Though an HTTP request has many parts, a bare minimum HTTP request requires only the following:
  • A request line ( for both the HTTP/1.0 and 1.1 versions )
  • A Host header ( only for HTTP/1.1; format is - Host: web.server.name )
  • A blank line
     All separated by a CR and LF ( "\r" & "\n" )
  • HTTP/1.0 request for our download URL is:  GET http://apache.techartifact.com/mirror/httpd/httpd-2.4.3.tar.gz HTTP/1.0\r\n\r\n
  • HTTP/1.1 request for our download URL is:  GET http://apache.techartifact.com/mirror/httpd/httpd-2.4.3.tar.gz HTTP/1.1\r\nHost: apache.techartifact.com\r\n\r\n
     When we send this request, the response from the server ( if everything goes well and the file starts downloading ) will be an HTTP response that begins with the line "HTTP/1.1 200 OK", followed by multiple header lines, then a blank line ( containing only "\r" ), and finally the response data ( the actual file to be downloaded ).  So while saving the response to a file, we should strip off the HTTP header part ( all lines between and including "HTTP/1.1 200 OK" and "\r" ).  This can be achieved with a simple sed command.
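     The header-stripping sed can be tried out on a canned response, without touching the network at all ( GNU sed is assumed, since it interprets \r in patterns ):

```shell
# Simulate a successful HTTP response ( status line, one header, blank
# line, body ) and delete everything from the status line through the
# blank "\r" line, leaving only the body on stdout.
printf 'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello world\n' \
  | sed '/^HTTP\/1.. 200 OK\r$/,/^\r$/d'
```

This prints only "hello world", the simulated message body.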


     Let us try downloading the file with HTTP/1.0:

safeer@penguinepower:/tmp$ echo -e "GET http://apache.techartifact.com/mirror/httpd/httpd-2.4.3.tar.gz HTTP/1.0\r\n\r\n"|nc apache.techartifact.com 80|sed '/^HTTP\/1.. 200 OK\r$/,/^\r$/d' > httpd-2.4.3-with-http-1.0.tar.gz

     Now with HTTP/1.1

safeer@penguinepower:/tmp$ echo -e "GET http://apache.techartifact.com/mirror/httpd/httpd-2.4.3.tar.gz HTTP/1.1\r\nHost: apache.techartifact.com\r\n"|nc apache.techartifact.com 80 | sed '/^HTTP\/1.. 200 OK\r$/,/^\r$/d' > httpd-2.4.3-with-http-1.1.tar.gz
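     One caveat with HTTP/1.1: connections are persistent by default, so nc may keep waiting after the transfer until the server times out.  Adding a Connection: close header ( not shown in the command above, but standard HTTP/1.1 ) asks the server to close the socket once the file is sent.  A sketch of such a request, using the relative-path form of the request URI; pipe it into "nc apache.techartifact.com 80" exactly as before:

```shell
# HTTP/1.1 request with Host and Connection: close headers, terminated
# by an empty \r\n line so the server knows the request is complete.
printf 'GET /mirror/httpd/httpd-2.4.3.tar.gz HTTP/1.1\r\nHost: apache.techartifact.com\r\nConnection: close\r\n\r\n'
```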

     Let us also download the file with the wget utility directly:

safeer@penguinepower:/tmp$ wget -q http://apache.techartifact.com/mirror/httpd/httpd-2.4.3.tar.gz -O httpd-2.4.3-with-wget.tar.gz
     Now compare all the files downloaded to ensure they are all the same.

safeer@penguinepower:/tmp$ du -bs httpd-2.4.3-with-*
6137268 httpd-2.4.3-with-http-1.0.tar.gz
6137268 httpd-2.4.3-with-http-1.1.tar.gz
6137268 httpd-2.4.3-with-wget.tar.gz

safeer@penguinepower:/tmp$ md5sum httpd-2.4.3-with-*
538dccd22dd18466fff3ec7948495417  httpd-2.4.3-with-http-1.0.tar.gz
538dccd22dd18466fff3ec7948495417  httpd-2.4.3-with-http-1.1.tar.gz
538dccd22dd18466fff3ec7948495417  httpd-2.4.3-with-wget.tar.gz


Let us ensure the integrity of the downloaded files by comparing their md5 with the value published on the Apache website:

safeer@penguinepower:/tmp$ curl -s http://www.apache.org/dist/httpd/httpd-2.4.3.tar.gz.md5
538dccd22dd18466fff3ec7948495417 *httpd-2.4.3.tar.gz

Everything looks good now.

Note: This command works when downloading from the server on which the file is actually located ( at the port and path given in the URL ).  I haven't tested the case where the actual file is behind a proxy and the download URL redirects you to the correct location ( with an HTTP 302 response ).  That situation will need some more logic.





Friday, January 4, 2013

Finding the size of a remote file from its URL


Consider the following cases,

You are about to download a file from the web, and before downloading, you want to know the size of that remote file.

                           OR

You have downloaded a file earlier, but you are doubtful whether the file was downloaded fully or not.

From a browser like Firefox or Chrome, you can go to the download URL and a window will pop up asking whether to save or open the file.  The same popup will also mention the size of the remote file.  While this is one way of doing it, in many cases you might want to do this for multiple files, save the information in a report, or use it inside another script or task.  In such scenarios it is desirable to do this from the command line, and the following script will show you how.

To do this, we should have the curl utility installed; on my machine it is installed as "/usr/bin/curl".  We will first see the script and then the explanation.

safeer@penguinpower:~$ curl -sI DOWNLOAD_URL |awk '/Content-Length/{ print $2 }'

This will provide you the remote file size in bytes.

The basic idea is to process the HTTP headers associated with the download link, without actually downloading the file.

The HTTP header of a URL contains useful information like the return status of your request, server details, content type, content length, etc.  In this case, we are interested in the "Content-Length" field, which gives the size of the remote content ( the downloadable file in this case ) in bytes.
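The extraction itself can be exercised on a canned header block, without any network access.  Note that curl's header lines end in CRLF, so it is safest to strip the carriage return before printing the number ( header values below are made up ):

```shell
# Pick the Content-Length value out of a header block; tr removes the
# trailing carriage return that would otherwise stick to the number.
printf 'HTTP/1.1 200 OK\r\nContent-Length: 4559279\r\n\r\n' \
  | tr -d '\r' | awk '/Content-Length/{ print $2 }'
```

This prints 4559279, the value of the Content-Length header.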



Let us examine a typical HTTP header.  This header is for the download URL of the Apache HTTPD web server.  We use curl with the -s ( silent ) and -I ( fetch the headers only ) flags to obtain it.

safeer@penguinpower:/tmp$ curl -sI http://apache.techartifact.com/mirror/httpd/httpd-2.4.3.tar.bz2
HTTP/1.1 200 OK
Date: Sun, 16 Dec 2012 19:46:14 GMT
Server: Apache/2.2.23 (Unix) mod_ssl/2.2.23 OpenSSL/0.9.8e-fips-rhel5 DAV/2 mod_mono/2.6.3 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_jk/1.2.35 mod_qos/9.74 mod_perl/2.0.6 Perl/v5.8.8
Last-Modified: Mon, 20 Aug 2012 13:22:55 GMT
ETag: "8708060-4591af-4c7b2684fa9c0"
Accept-Ranges: bytes
Content-Length: 4559279
Connection: close
Content-Type: application/x-tar


There are multiple fields in the header as you can see, but our interest is in the Content-Length field, which has a value of 4559279.  So, as per the header, the size of the httpd package is 4559279 bytes.

Let us cross verify by downloading the file.

safeer@penguinpower:/tmp$ wget -q http://apache.techartifact.com/mirror/httpd/httpd-2.4.3.tar.bz2
safeer@penguinpower:/tmp$ du -b httpd-2.4.3.tar.bz2

4559279 httpd-2.4.3.tar.bz2

Well, the file size is indeed 4559279 bytes.


Now we know how to extract the content length from the HTTP header.  But is that all?  What if the web server is functioning but the URL is not available, or some other issue prevents you from downloading the header?  This may not be a problem when you are actually looking at the terminal while the command runs, but inside a script or report it can be.

To solve this, we first check the HTTP status code, i.e., the first line of the header, to see if the response is 200 ( OK ).  Only then do we return the content length; otherwise we return a negative value so that your script / report can identify the failure.  In this case that value will be the negative of the HTTP response code ( whenever it differs from 200 OK ).  So the response code will tell you whether the lookup failed and why, which helps in designing the fail-safe logic of the script.
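The status-aware awk can likewise be tried on canned headers; here a simulated 404 response makes it print the negated code ( the carriage returns are stripped first, as curl header lines end in CRLF ):

```shell
# On a non-200 status the awk prints the negated HTTP code instead of a size.
printf 'HTTP/1.1 404 Not Found\r\nContent-Length: 196\r\n\r\n' \
  | tr -d '\r' \
  | awk '/^HTTP\/1./{ if ($2 == 200) { do { getline } while ($1 != "Content-Length:"); print $2 } else { print "-"$2 } }'
```

This prints -404; with a 200 status line it would instead walk down to the Content-Length header and print the size.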

safeer@penguinpower:/tmp$ curl -sI http://apache.techartifact.com/mirror/httpd/httpd-2.4.3.tar.bz2 | awk '/^HTTP\/1./{if ( $2 == 200 ) { do { getline } while ( $1 != "Content-Length:" ) ; print $2} else { print "-"$2 } }'
 
The awk script first looks for the HTTP status line; if the code is "200 OK", the getline function advances the current record to the next line until the first field of the current line is "Content-Length:", and then it prints the second field, which is the content length in bytes.  Otherwise, the script prints the negative of the HTTP status code.

Note: In our script's logic, we still need to take care of curl errors where there won't be any response from the web server ( no DNS, no network connection, etc. ).  Use the exit code of curl ( a non-zero value ) to detect that.
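Putting both failure modes together, a small wrapper could look like this ( a sketch; the function name remote_size is made up, and -1 is an arbitrary sentinel for curl-level failures ):

```shell
# Print the remote size in bytes, the negated HTTP code on a non-200
# response, or -1 when curl itself fails ( DNS failure, no route, etc. ).
remote_size() {
  headers=$(curl -sI "$1") || { echo "-1"; return 1; }
  printf '%s\n' "$headers" | tr -d '\r' \
    | awk '/^HTTP\/1./{ if ($2 == 200) { do { getline } while ($1 != "Content-Length:"); print $2 } else { print "-"$2 } }'
}
```

For example, remote_size http://apache.techartifact.com/mirror/httpd/httpd-2.4.3.tar.bz2 should print 4559279, as long as the mirror still serves the file.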