Sunday, November 15, 2009

Capturing complete HTTP requests - Echo Server

Background

I recently needed to capture and inspect a complete HTTP request in preparation for developing a new web service. The main reason was that there were no real requirements for the requested service. It wasn't clear which parameters would be sent in the request, or exactly how the parameters would be named. It also wasn't clear how the parameters would be split between GET and POST parameters, or even additional HTTP request headers. There was also some history of issues with various character encodings, so I needed to be able to capture a byte-accurate copy of the entire request, including the headers and the body.

Initially, I did not have good luck finding an existing tool or solution for this. My first attempt was to just host a basic web server, then to capture the data using Wireshark. Unfortunately, Wireshark primarily works with Ethernet packets. It supports higher-level viewing of many protocols including HTTP. It even includes options for re-assembly of both HTTP headers and bodies, re-assembly of chunked transfer-coded bodies, and decompression of entity bodies. However, while I'm sure there are additional options and methods for getting it to work more like I desired, it just didn't seem like the right tool for this particular job - and that is no fault of Wireshark.

My other early attempt was to use the mod_dumpio module in Apache HTTP Server. The first issue with this was that the output from all requests is simply mixed into the same error log file (along with other debugging output), which would make the data very difficult to extract properly. The second issue was that, at least as far as I can tell, there can only be one error log file per <VirtualHost/>, which would have resulted in an excessive amount of data being captured.

I then started to look at a simple Java HTTP server to capture the desired data. I've written trivial HTTP servers before, but it quickly becomes non-trivial to properly handle and respond to all the possible options and variations - even just to accept a complete request (including body) from a client. Trying to avoid duplicating previous work, I looked at a number of existing web servers including Apache Tomcat, but did not find any that provided the desired logging options.

My solution

I started looking further into Jetty. (I previously used and blogged about Jetty in regards to a test platform for my MarkUtils-Web project.) I found that I could intercept the incoming requests byte-by-byte by extending Jetty's default Connector - SelectChannelConnector, and then overriding the newEndPoint(…) method to return an extended SelectChannelEndPoint. Hooking into the SelectChannelEndPoint's fill(Buffer buffer) method allows for capturing of the complete HTTP request. Kudos to the Jetty developers for not making this difficult or impossible by marking everything as private or otherwise overly-restricted, as compared to an unfortunate practice followed by many other projects and companies!
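The same "tee" idea can be illustrated without Jetty. The sketch below is not the project's code and does not use Jetty's classes: TeeReadableByteChannel and captureAll are hypothetical names, and a plain java.nio ReadableByteChannel stands in for the endpoint's fill(Buffer buffer) method. Each read hands the bytes to the normal consumer and also copies exactly those bytes to a capture channel.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

/** Hypothetical wrapper: every byte read from "source" is also written to "capture". */
class TeeReadableByteChannel implements ReadableByteChannel {
  private final ReadableByteChannel source;
  private final WritableByteChannel capture;

  TeeReadableByteChannel(ReadableByteChannel source, WritableByteChannel capture) {
    this.source = source;
    this.capture = capture;
  }

  @Override
  public int read(ByteBuffer dst) throws IOException {
    int start = dst.position();
    int count = source.read(dst);
    if (count > 0) {
      // View only the bytes just read, leaving dst's own position and limit untouched.
      ByteBuffer justRead = dst.duplicate();
      justRead.position(start);
      justRead.limit(start + count);
      while (justRead.hasRemaining()) {
        capture.write(justRead);
      }
    }
    return count;
  }

  @Override
  public boolean isOpen() {
    return source.isOpen();
  }

  @Override
  public void close() throws IOException {
    try {
      source.close();
    } finally {
      capture.close();
    }
  }

  /** Demo helper: drains "request" through the tee and returns what was captured. */
  static byte[] captureAll(byte[] request) throws IOException {
    ByteArrayOutputStream captured = new ByteArrayOutputStream();
    try (ReadableByteChannel in = new TeeReadableByteChannel(
        Channels.newChannel(new ByteArrayInputStream(request)),
        Channels.newChannel(captured))) {
      ByteBuffer buf = ByteBuffer.allocate(16); // deliberately small to force several reads
      while (in.read(buf) != -1) {
        buf.clear(); // a real server would parse the bytes here; the demo discards them
      }
    }
    return captured.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] request = "GET / HTTP/1.1\r\nHost: localhost:8080\r\n\r\n"
        .getBytes(StandardCharsets.ISO_8859_1);
    System.out.println("captured bytes: " + captureAll(request).length);
  }
}
```

Because the copy is taken at the byte level, before any parsing or character decoding, the capture stays byte-accurate regardless of the request's encoding.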

With only a little extra code, each HTTP request is logged to a chosen directory as a pair of files, grouped by a time-based session ID. The first is a "meta" file that contains details that would not ordinarily be captured as part of the HTTP capture, including the session ID, server date, and remote address, host, and port. The second is the "content" file that contains the actual byte-by-byte capture of the HTTP request. While it is named as a ".txt" file for easy viewing, it is treated as binary and will accurately capture all requests, including those with binary payloads. The format also allows for easily re-playing the request to a server for additional testing, debugging, or analysis.
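A minimal sketch of this file layout follows. It is an illustration under assumptions, not the project's implementation: the CaptureFiles class and its methods are hypothetical, the session ID format (hex timestamp, a dash, a random hex suffix) is only guessed from the sample ID shown in this post, and the request is passed in as a byte array for brevity where a real server would stream it to disk.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Date;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper that writes the "meta" and "content" files for one request. */
class CaptureFiles {

  /** Assumed ID format: hex timestamp, a dash, then a random hex suffix. */
  static String newSessionId() {
    return Long.toHexString(System.currentTimeMillis())
        + "-" + Integer.toHexString(ThreadLocalRandom.current().nextInt());
  }

  static void write(Path logDir, String sessionId, String remoteAddr,
      String remoteHost, int remotePort, byte[] rawRequest) throws IOException {
    String meta = "Session ID: " + sessionId + "\n"
        + "Date: " + new Date() + "\n"
        + "remoteAddr: " + remoteAddr + "\n"
        + "remoteHost: " + remoteHost + "\n"
        + "remotePort: " + remotePort + "\n";
    Files.createDirectories(logDir);
    Files.write(logDir.resolve(sessionId + "-meta.txt"),
        meta.getBytes(StandardCharsets.UTF_8));
    // The content file is written as raw bytes, never decoded to characters.
    Files.write(logDir.resolve(sessionId + "-content.txt"), rawRequest);
  }

  public static void main(String[] args) throws IOException {
    Path logDir = Files.createTempDirectory("httpEcho");
    String id = newSessionId();
    byte[] request = "GET / HTTP/1.1\r\nHost: localhost:8080\r\n\r\n"
        .getBytes(StandardCharsets.ISO_8859_1);
    write(logDir, id, "127.0.0.1", "127.0.0.1", 23349, request);
    System.out.println("wrote session " + id + " to " + logDir);
  }
}
```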

Finally, by implementing and registering an associated Handler, the request is not only captured, but is efficiently echoed back to the client - without ever needing to buffer or store the entire request. This echoed response starts with the contents of the "meta" file, including the session ID that can be used by the client to easily refer to the saved log file back on the server. The contents of a sample echoed response are shown below, and would appear in the body of a viewing web browser:

Session ID: 124f67a451d-6313f5e0
Date: Sun Nov 15 00:14:18 CST 2009
remoteAddr: 127.0.0.1
remoteHost: 127.0.0.1
remotePort: 23349
==========
POST /someUrlPath?someGetKey=someGetValue HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 GTB5 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 25

somePostKey=somePostValue

On the server-side, the first portion (above the '=' divider) is saved as 124f67a451d-6313f5e0-meta.txt, with the last portion (below the divider) saved as 124f67a451d-6313f5e0-content.txt.
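A client that wants to recover the two portions from an echoed response could split it at the divider line, as in the sketch below. EchoResponseSplitter is a hypothetical helper, not part of the project, and it assumes the divider is a line of exactly ten '=' characters (as in the sample above) terminated by either LF or CRLF. It works on bytes so that the request portion stays byte-accurate.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical client-side helper: splits an echoed response into meta and content. */
class EchoResponseSplitter {
  private static final byte[] DIVIDER = "==========".getBytes(StandardCharsets.US_ASCII);

  /** Returns {meta, content}, split at the first divider line. */
  static byte[][] split(byte[] echoed) {
    for (int i = 0; i + DIVIDER.length <= echoed.length; i++) {
      boolean atLineStart = i == 0 || echoed[i - 1] == '\n';
      if (!atLineStart || !matchesAt(echoed, i)) {
        continue;
      }
      // Accept LF or CRLF after the divider; anything else is not a divider line.
      int end = i + DIVIDER.length;
      if (end < echoed.length && echoed[end] == '\r') {
        end++;
      }
      if (end < echoed.length && echoed[end] == '\n') {
        end++;
        return new byte[][] {
          Arrays.copyOfRange(echoed, 0, i),
          Arrays.copyOfRange(echoed, end, echoed.length)
        };
      }
    }
    throw new IllegalArgumentException("no divider line found");
  }

  private static boolean matchesAt(byte[] data, int offset) {
    for (int j = 0; j < DIVIDER.length; j++) {
      if (data[offset + j] != DIVIDER[j]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] echoed = ("Session ID: 124f67a451d-6313f5e0\n==========\n"
        + "GET / HTTP/1.1\r\nHost: localhost:8080\r\n\r\n")
        .getBytes(StandardCharsets.ISO_8859_1);
    byte[][] parts = split(echoed);
    System.out.println("meta bytes: " + parts[0].length
        + ", content bytes: " + parts[1].length);
  }
}
```

Only the first divider line is treated as the split point, since the meta portion always comes first and the captured request could itself contain a similar line.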

The main class for this project is com.ziesemer.httpEchoServer.HttpEchoServer. It is written to be suitable for inclusion into other projects or uses, as visible from the included JUnit test. It also includes a main() method for direct use from the command-line, supporting arguments to control the port to listen on ("--port") and the directory to use to store the log files ("--logDir"). By default, Jetty is configured to listen on port 8080. Add "--allowDynamicPort" to have the process fall back to a dynamically-chosen port if the specified port is already in use.
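The fall-back behavior can be pictured with a plain ServerSocket. This is only a sketch of the general technique, assuming nothing about how HttpEchoServer configures Jetty internally; PortFallback and its open method are hypothetical names.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.ServerSocket;

/** Hypothetical illustration of "try the requested port, else let the OS pick one". */
class PortFallback {

  static ServerSocket open(int port, boolean allowDynamicPort) throws IOException {
    try {
      return new ServerSocket(port);
    } catch (BindException e) {
      if (!allowDynamicPort) {
        throw e; // the caller asked for this exact port only
      }
      return new ServerSocket(0); // port 0 asks the OS for any free port
    }
  }

  public static void main(String[] args) throws IOException {
    try (ServerSocket server = open(8080, true)) {
      System.out.println("listening on port " + server.getLocalPort());
    }
  }
}
```

When a dynamic port is used, the caller has to read the actual port back from the socket (getLocalPort() here), since it is not known in advance.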

Fiddler: Another alternative

Another alternative I later considered was Microsoft's Fiddler, an HTTP Debugging Proxy. While not open-source, it is freeware and extensible. It is also certainly a better match for my requirements than either Wireshark or Apache's mod_dumpio, and arguably even my solution described here. However, Fiddler still requires a server to answer the requests before it can monitor the traffic, and doesn't support echoing the request to the response. Fiddler does have many other features to offer that may prove useful, and is at least worth testing out.

Download

com.ziesemer.httpEchoServer is available on ziesemer.java.net under the GPL license, complete with source code, a compiled .jar, generated JavaDocs, and JUnit tests. Download the com.ziesemer.httpEchoServer-*.zip distribution from here. Please report any bugs or feature requests on the java.net Issue Tracker.

3 comments:

Rogan Dawes said...

Hi Mark,

I see you found Fiddler, but I guess you didn't run across WebScarab, WebScarab-NG, OWASP Proxy (disclaimer, I am the author of all three), Burp Proxy, etc.

WebScarab* and OWASP Proxy are all open source, and written in Java.

Mark A. Ziesemer said...

Rogan - thanks for the comment. There are certainly a number of tools that can operate as proxies, including those that you mentioned. I've used some of the other OWASP tools before with great success. All the tools you mentioned are probably better than Fiddler. However - please correct me if I'm wrong, but as with Fiddler, I don't believe any of these can also act as their own server, nor can they echo the request back in the response.

Granted, I probably had a very specific need here - but now it can be reused the next time myself or someone else needs the same thing again.

Rogan Dawes said...

Well, WebScarab has BeanShell support, so you can intercept the request, and generate your own response containing whatever you want.

OWASP Proxy is more of a library for building other tools that need proxy intercept capability, and can therefore also be used to do whatever you want with it. Just program it accordingly.

I'll be publishing some slides of a talk I'll be giving this weekend to show just how easy it is to use OWASP Proxy.