Unix Web Application Architectures

Samuli Kärkkäinen <skarkkai@woods.iki.fi>
Version 1.0.2, 13 October 2000


Table of Contents

  1. Preface
  2. Introduction
    1. An Example Application
    2. Common Characteristics of Web Applications
  3. Basic Approaches
    1. Application Servers
    2. Code Embedded in HTML
    3. mod_perl
  4. HTTP Server
  5. Performance
    1. Static Content
    2. Dynamic Content
  6. Memory Consumption
    1. Static Content
    2. Dynamic Content
  7. Squid as an HTTP Server Accelerator
  8. Session Management
    1. Cookies
    2. Splitting the State into an ID and Data
    3. Cookies vs. Other Places to Store the Session ID
    4. Expiring the Sessions
    5. Explicit Session State is Good
    6. Session State Mechanism Should be Flexible
  9. Access Control and Authentication
    1. HTTP Authentication
    2. Implementing Authentication Yourself
  10. Using a SQL Database as a Backend
    1. Transactions and Error Recovery
    2. Locking Between Simultaneous Requests
    3. SQL as a Method of Communication
    4. Reports
  11. Request Parameter Validation
    1. Minimize the Number of Function Parameters
  12. Separating Code and Output
    1. HTML Templates
    2. An Example
    3. Advantages
  13. Locking Data Between Requests
  14. Security
  15. Robustness Against Programmer Errors
  16. An Alternate Approach: Abstracted HTML


1 Preface

In this document I discuss various aspects of writing web applications on Unix. By a web application I mean a piece of software that is used with an ordinary browser, without client side Java or other major extensions. The reader is expected to have a basic understanding of the building blocks of web applications, such as HTML, JavaScript, HTTP and CGI. The focus is on problems that emerge when an application gets big enough that the simplest approaches become insufficient. Issues related to building web sites in general, such as maintaining static HTML pages, aren't considered.

The reader shouldn't expect this to be an unbiased or comprehensive discussion of the subject. Rather, it can be seen as an essay about the lessons I have learned while writing custom web applications for fun and for money. Because I use Linux as my platform, only the technologies available on Linux are considered. However, everything said should be applicable to other Unix variants. I mention Apache many times throughout the text as an example of an HTTP server. This is simply because I use it, many others use it, and it's a fine example of a general purpose web server. I'll use the terms "web server" and "HTTP server" pretty much interchangeably.

This document might be useful for someone who writes or is going to write a web application, and wants to get an overview of many possible approaches. All feedback is very welcome.

2 Introduction

2.1 An Example Application

One might call a hit counter a web application, but a program that simple is not of much interest from an architectural point of view. To give a picture of what features the kind of architecture this text discusses should support, let's assume one is building a web based order management application with the following requirements:

2.2 Common Characteristics of Web Applications

Web apps tend to share a number of common characteristics, independent of the application domain. Therefore it makes sense to try to create a framework which handles these common characteristics as automatically, efficiently, and correctly as possible. When someone codes a framework that does these things and makes a product of it, the product is often called an "application server." Indeed, this document can be seen as describing what an application server does, and how to write one. The term used in this document shall be "application framework", however.

Below is a list of features that many web apps have.

How to achieve these and other features will be discussed in more detail in the rest of this document.

Clustering and related features like load balancing and fail-over, while important for some applications, are not discussed, for the simple reason that I have no experience with them.

3 Basic Approaches

I consider the CGI interface the most rudimentary way of creating web apps. However, it's not necessary to start building things completely from scratch on top of raw CGI. In this chapter I will mention a number of packages and technologies which implement, or make it easier to implement, one or more of the features mentioned in the previous chapter.

3.1 Application Servers

The term application server is used in this text to refer to products such as Allaire's Cold Fusion, IBM WebSphere or the open source product Zope. These aim to be more or less all-encompassing solutions that handle all aspects of application development. Some have high end features like fail-over clustering or replication. Many come with a library of easily reusable components, or with entire prebuilt applications that can be customized. Usually these also address issues not related to coding, such as web site management and explicit support for multiple developers and HTML writers. Most have a price tag of at least $1000, and often well over $10,000.

As mentioned above, this document is in a sense about how to write an application server. But why bother, since products like this already exist? For smaller applications the price alone can be an obstacle, except with free products. The learning curve is another: these are full development environments, and it takes weeks or months to learn to use them, and longer still to learn to use them well. If you already know a programming language suitable for writing web apps, such as Perl, Python or Java, this makes a huge difference. Flexibility may be another important factor: having written the application framework yourself, you can fully customize any aspect of it.

All this said, third party web application servers can be an excellent choice for many purposes, and many of the largest sites are built using them. Then again, many are not. In any case, these products are not further explored in this document.

3.2 Code Embedded in HTML

Probably the best known examples of this class of packages are the open source product PHP and Active Server Pages (ASP), the technology used in Microsoft's Internet Information Server. In this approach, the code is put into the HTML files. Before the web server sends the files to the browser, the code in them is processed.

This approach works particularly well if the majority of the web site is static HTML. It's typically easy to start programming in these languages, and their structure is well-suited for writing web applications, since that's the whole purpose of the environment. They come with a comprehensive function library for performing common tasks needed in building web sites and applications.

3.2.1 PHP

In this class of tools I personally have experience only with PHP, but I have been told ASP is essentially similar. I found PHP easy to learn (having a background in languages including C++, Perl and shell), and intuitive to use for its intended purpose.

As an example of PHP, assume an HTML form (the example is taken from http://www.zend.com/zend/art/intro.php):

<FORM METHOD="GET" ACTION="submit.php"> 
What's your name? <INPUT NAME="myname" SIZE=3> 
</FORM>

File submit.php would then contain for example:

<HTML><BODY>

<?php
print "Hello, $myname!";
?>

</BODY></HTML>

You probably get the idea from that. If not, the PHP web site gives more information. PHP offers some kind of solution to most of the points mentioned in section 2.2 above. Some features are implemented in a way that is, in my opinion, not sufficient in all situations, but this can often be worked around by using a third party library that offers an alternate solution.

Most people use PHP as the Apache module mod_php, in which case there is one copy of the PHP engine in each Apache process. Therefore memory consumption can become quite high with a large number of simultaneous requests, even though the code (the text section) is shared between processes. This is discussed in more detail in chapter 6, Memory Consumption.

3.3 mod_perl

The Apache module mod_perl allows running the perl interpreter as part of the Apache processes. This way the perl bytecode is cached in the Apache processes, which speeds up execution a lot. Often mod_perl is used simply to speed up perl CGI scripts. The "native" programming style with mod_perl is the same as with CGI: print statements scattered around the code. However, mod_perl comes with a set of modules and extensions helpful for writing web apps, which makes it more than just a way to speed up perl CGI scripts.

Perl is an efficient and mature language with a lot of users, books, and support services. There exists a very large number of third party modules for almost any purpose. These things make it a good general purpose language, and it's well suited for writing web applications too.

Perl was born as a language for Unix system administration and text manipulation, and as such, it's not focused on web application development the way PHP is. PHP is probably easier to learn for those not already familiar with either language, while mod_perl may be favored over PHP by those who already know perl, or who want to use a language with maximal flexibility.

4 HTTP Server

One choice to make when deciding on the architecture of an application is whether to use an existing web server for HTTP request handling, or to implement an HTTP server as part of the application.

The first HTTP protocol version, now known as HTTP version 0.9, was a very simple one. There was just the GET method with no additional information, and the response was the document body as is. Not much was gained by using a separate web server instead of embedding one in one's application. But this was the situation 5 years ago.

The current HTTP protocol version is 1.1, and the RFC specifying it is over 400 KB of text. It defines eight different methods and nineteen request headers. A good implementation should also work around bugs in client implementations. For example, some browsers don't work properly with keep-alive connections, and that feature shouldn't be used with those buggy browsers (as identified by the User-Agent header). For these reasons, if good HTTP protocol compliance matters, it's a good idea to use a separate HTTP server. An additional benefit of doing so is that part of the application's features are often best implemented by the web server, for example automatic directory listings, HTTP access control, or redirections.

Reasons for not wanting to use a separate web server include:

5 Performance

5.1 Static Content

Web servers are largely marketed with speed claims. It probably sounds very impressive to some people that a web server is able to serve so and so many thousand static files per second. In some extreme cases this can indeed make a difference, but those cases are rare. When serving static files, the performance is almost always sufficient. In a quick test I performed, Apache 1.3.11 on Linux 2.0.38 on a Pentium II 300 MHz, with a mostly default configuration, served over 300 small files a second.

When serving static content, the available bandwidth usually becomes the bottleneck long before the CPU. If each of those small files mentioned above is one kilobyte, then the transfer rate is about 300*1024*8 = 2.5 megabits per second, or about two T1 lines. When serving 25 kilobyte pictures, the server was able to send them at a total speed of about 4 megabytes per second, which is 32 megabits, or roughly two thirds of a T3. Even then only a third of the CPU was used.

Because the problems faced when trying to serve static content at very high speed have to do mainly with the choice and tuning of web server, operating system and hardware, the rest of this chapter will focus on dynamic content.

5.2 Dynamic Content

On most sites, the majority of the serving time is spent executing dynamic code. However, before spending a lot of time and effort on speeding up your dynamic code, take a moment to consider how much latency and throughput you really need. I personally follow a rule of thumb that request latency for common operations should be kept under 200 ms. A delay of that size gets lost in the other delays of fetching a page and rendering it in a browser. For more rarely used operations the latency can of course be far higher. And in 200 ms you can do a lot with current CPUs.

5.2.1 Things Not to Do

Things you cannot do in under 200 ms include, for instance:

5.2.2 FastCGI

In the traditional CGI model the CGI program is an executable file in the filesystem, which starts, executes and then exits once per HTTP request. Starting the CGI program can be slow, especially if it's written in an interpreted language that must first be compiled to bytecode, if it connects to a database, or if it does some other slow initialization operation. Fundamentally, the benefit of having an always running daemon is that things - bytecode, DB connections, precompiled HTML templates, etc. - can be cached in memory between requests.

FastCGI exists to solve this performance problem. In the FastCGI model, the CGI program is a continually running daemon to which a connection is made, over TCP/IP or some other mechanism, once for each HTTP request. FastCGI defines how the CGI parameters (the environment and stdin) are passed from the web server to the CGI daemon over this connection, and how the reply is passed back to the web server. This also makes it possible to run the daemon on a different host or set of hosts than the web server, perhaps with a firewall in between to further improve security.
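
A minimal sketch of such a daemon in perl, using the FCGI module from CPAN, might look as follows. The connect_to_database() helper is hypothetical and stands for whatever slow initialization the application needs; the point is that it runs once, before the accept loop.

#!/usr/bin/perl
use strict;
use FCGI;

# Slow initialization is done only once, when the daemon starts.
# connect_to_database() is a hypothetical helper standing for any
# expensive setup the application needs.
my $dbh = connect_to_database();

my $request = FCGI::Request();

# Each iteration of this loop handles one HTTP request. The cached
# $dbh and the already compiled bytecode are reused every time.
while ($request->Accept() >= 0) {
    print "Content-Type: text/html\r\n\r\n";
    print "<HTML><BODY>Served by persistent process $$</BODY></HTML>\n";
}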

To make it possible to process multiple requests in parallel efficiently, there should probably be multiple daemon processes running continuously. When a new request arrives, one of the idle processes is chosen to process it. If there are no idle processes, more may be started. This technique is called preforking; Apache, for example, works this way.

The fastest possible FastCGI implementation runs as a module of the web server, so that executing the FastCGI code doesn't require executing an external program, or even a context switch. However, the performance is normally perfectly sufficient even if the web server end of FastCGI is implemented as an ordinary CGI program written in C, because a computer such as the one mentioned above is able to fork and execute a simple statically linked program 500 times a second.

5.2.3 Caching

Cache the output of dynamic code. If you think latency significantly below 200 ms is needed, you should probably consider aggressively caching the dynamic pages. This means that once you have generated a page, you keep the entire page in memory or in a quickly accessible file. When the next request for the same page arrives, and you can be sure you would output an identical page, you return the cached page instead. This is especially useful if the data from which the dynamic pages are generated changes rarely, as is often the case.
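
As a sketch of the idea, the handler below consults an in-process cache keyed by the request, and any code that changes the underlying data must call flush_page_cache(). Both generate_page() and flush_page_cache() are made-up names for this example.

my %page_cache;

sub serve_page {
    my ($view, $param_string) = @_;
    my $key = "$view?$param_string";
    # Return the cached copy if we are sure it would be identical.
    return $page_cache{$key} if exists $page_cache{$key};
    my $page = generate_page($view, $param_string);
    $page_cache{$key} = $page;
    return $page;
}

# Any code that modifies the data the pages are generated from must
# call this, or the cache and the real data go out of sync.
sub flush_page_cache { %page_cache = (); }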

Often it's best to divide pages into parts that can be cached separately. For instance, an application might cache the complex but rarely changing main part of some page, while the simple header that is different in each request is always created from scratch. When serving a request, the pieces are merged and sent to the browser. This idea can be generalized to any sub-results, such as commonly occurring database queries.

It may require a lot of thought and serious tradeoffs to end up with an application design that allows maximal caching of pages, but it can also be the best way to speed things up. Trying to add caching once the application is finished is usually asking for trouble. Nothing is easier than getting the cache and the real data out of sync with each other, unless the caching layer is properly designed and the application architecture is "caching friendly".

A different approach to caching dynamic content is presented in chapter 7, Squid as an HTTP Server Accelerator. A lot has been written about this subject.

5.2.4 Other Things

5.2.5 Performance is Rarely a Big Problem

Despite all the talk about performance in relation to web sites, almost nobody writes dynamic web sites entirely in a maximally fast language such as C. Not even for very high traffic sites. This should be proof enough that performance isn't something you should excessively worry about most of the time. Just avoid the really stupid mistakes, the most common of which I listed above, and you should be okay for most applications.

6 Memory Consumption

The memory consumption of a web site depends on several factors: the web server in use, how large a share of the hits are for dynamic pages, the application architecture, the hit rate, and the wall clock time it takes to serve a hit.

6.1 Static Content

The number of requests being processed simultaneously is the product of the hit rate and the time it takes to process one hit. Assume processing a hit for a static file takes 1 ms of CPU time, but 5 seconds of wall clock time (because the client receives the file slowly). If a site gets 100 such hits a second (which is rare), it has on average 500 requests being processed at all times, but uses only 10% of its CPU time.

If the web server requires one process (as opposed to a thread) per active request, memory consumption gets very high with 500 simultaneous requests. Apache is an example of such a server design, although Apache 2.0 is supposed to be able to run in a multithreaded mode, which may make it more suitable for this purpose.

Some other web servers create just one thread for each request, or don't use OS services at all and instead handle the multiplexing on their own. Especially the latter approach can lead to very small memory usage with a very high number of simultaneously active requests. For this reason, web servers that use one process per request aren't a good choice in these situations.

6.2 Dynamic Content

When serving dynamic content, it is useful to separate the code generating the pages from the web server process. The idea is that the dynamic response is generated quickly by a separate process, and then buffered in the web server while it's being returned to the browser over the potentially slow network path. This way the process generating the dynamic content is freed quickly, and doesn't consume memory unnecessarily.

FastCGI has the potential to do this, if either the web server or the FastCGI implementation is able to cache the entire HTTP response, and thus free the daemon immediately after it is done with generating the response.

7 Squid as an HTTP Server Accelerator

Now, Squid is not the only alternative for this job, but it's the only one I'm familiar with, and it is known to work well. HTTP server acceleration means that you have Squid sitting between the clients (usually the internet) and your web server. Squid receives the HTTP request, and if the request is for a static or otherwise cacheable object, returns it immediately. Otherwise, Squid forwards the request to the actual web server that holds the original data and is capable of generating the dynamic content. The web server returns the response to Squid, which caches the response if possible, and then returns it to the client.

Since Squid is a single threaded cache written specifically for this purpose (in addition to being a normal web proxy/cache), it is very efficient in both CPU and memory usage, and serving static data this way works well. Dynamic page serving speed isn't significantly degraded. With some applications, it might also be possible to implement some caching of dynamic data in this manner; for example, Squid could be told to cache for 15 seconds those dynamic pages that don't need to be completely up to date, say a sports results page.
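
Telling Squid to cache such a page doesn't necessarily require anything on the Squid side; the application can mark the response cacheable with standard HTTP headers. A sketch for the sports results page (the $results_html variable is assumed to be generated elsewhere):

use HTTP::Date;   # time2str(), part of the LWP distribution

# Allow Squid (and browsers) to reuse this response for 15 seconds.
print "Content-Type: text/html\r\n";
print "Cache-Control: max-age=15\r\n";
print "Expires: ", time2str(time + 15), "\r\n\r\n";
print $results_html;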

8 Session Management

By a session I mean the series of HTTP requests/replies that one user makes when visiting a site. In the example application, a session would begin when a customer logs on to the system, and end when the customer logs off or simply closes his browser.

Session state is all the relevant information about what the user has already done during that session. In our example application the following session state data might be kept:

Session management is a significant and very fundamental issue with web applications, because HTTP is a completely stateless protocol. Each HTTP request has no relation to any other HTTP request (aside from possibly using the same TCP connection). Therefore it's the job of the application to create this association. In traditional GUI programming, each screen element is represented by a GUI toolkit object that holds all the data and state of the screen element. In web applications, this data must be kept somewhere else. (However, see chapter 16, An Alternate Approach: Abstracted HTML.)

8.1 Cookies

Cookies are the mechanism that HTTP supports for keeping session state. Netscape's Cookie Spec gives the following general description of how cookies work:

A server, when returning an HTTP object to a client, may also send a piece of state information which the client will store. Included in that state object is a description of the range of URLs for which that state is valid. Any future HTTP requests made by the client which fall in that range will include a transmittal of the current value of the state object from the client back to the server. The state object is called a cookie, for no compelling reason.

For the official current specification, see RFC 2109. If the session state is kept in cookies, then each time a new session data item is to be set, it is sent to the browser in a Set-Cookie: HTTP response header. This straightforward scheme has a number of problems, which are discussed next.

8.2 Splitting the State into an ID and Data

Obviously, the total size of the cookies set by an application can become quite large if many state variables are used. Even the size of a single cookie may need to be rather large. This may exceed the maximum cookie size limits specified in the RFC, or those set by the browser implementation. More likely, having a lot of session data sent in every HTTP request will simply slow things down.

Luckily it's not necessary to keep the entire state in the client. Instead, it is sufficient to keep merely a session ID in a cookie. The server then uses the session ID to fetch the actual session data from, for example, a database running on the server. Especially without SSL, it can also be seen as a security improvement in some cases that the actual session data, and any secrets in it, never leave the server.

Here is a concrete example of how the mechanism works when the session ID is kept in a cookie named 'our_app_session' and the session data is kept in an SQL database (a code sketch follows the list):

  1. When a client makes a request with no cookie 'our_app_session' set, the request is assumed to be the first request of a session. A new row (record) is created in the table Sessions to represent the new session. The primary key of the table is a 32 bit integer, and is used as the session ID. Let's say ID 438573948 is chosen.
  2. A cookie 'our_app_session="438573948"' is sent to the client.
  3. When the client makes the next request, it sends this cookie to the server.
  4. The server fetches from the database table Sessions the row with primary key 438573948. Other fields of the record contain the session data.
  5. When a new session data item needs to be set, only the session data record in the database needs to be changed.
  6. When the client logs out, the session record is removed from the database. Note however that most of the time people don't bother to explicitly log off.
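
In code, the core of the mechanism might look like the following sketch, written with CGI.pm and DBI. The table and cookie names follow the description above; note that a real implementation must use unpredictable session IDs rather than plain rand().

use strict;
use CGI;
use DBI;

my $q   = CGI->new;
my $dbh = DBI->connect("dbi:Pg:dbname=ourapp", "appuser", "apppass",
                       { RaiseError => 1 });

# Step 1: no valid cookie means this is the first request of a session.
my $sid = $q->cookie('our_app_session');
unless (defined $sid and $dbh->selectrow_array(
        "SELECT count(*) FROM Sessions WHERE id = ?", undef, $sid)) {
    $sid = int(rand(2**31));   # toy ID generation; see the note above
    $dbh->do("INSERT INTO Sessions (id) VALUES (?)", undef, $sid);
}

# Step 2: (re)send the cookie among the HTTP response headers.
print $q->header(-cookie => $q->cookie(-name  => 'our_app_session',
                                       -value => $sid));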

Most large applications use this kind of a mechanism. From now on, I'll assume this kind of a system is used.

8.3 Cookies vs. Other Places to Store the Session ID

Cookies can be told to keep their value after a browser is closed and restarted. This benefit is unique to cookies, and can't be achieved in any other way.

There are also a number of problems with cookies:

8.3.1 Session ID in the DNS Name via a DNS Wildcard

These problems can be avoided by storing the session ID elsewhere. One mechanism, mentioned on Slashdot, keeps the ID in the DNS name that the browser uses to refer to the server. I have not tried this method, so the details may be a little off, but here goes. If your application is on domain foo.com, make a wildcard DNS entry *.foo.com that points to the web server IP. When creating a new session with ID, say, 987654321, redirect the browser to address 987654321.foo.com. Due to the DNS wildcard, this maps to the web server IP. The server application then looks at the domain name by which it was referred to, and finds out the session ID that way.

This method has the advantage that after the redirection, no extra effort is needed to make the browser remember the ID. On the other hand, care must be taken that the browser doesn't get redirected to the official name of the server, such as www.foo.com. The server may issue such a redirection too, as Apache does for instance when one refers to a directory without a trailing slash. This also binds the application tightly to the DNS of the domain, which complicates installation and adds one more thing that can be configured wrong. I can imagine that in many situations, an organization doesn't want to make such a change to its DNS setup for the sake of a single application. I also feel cautious about the effect of such a scheme on DNS and web caching mechanisms, because of the huge number of unique DNS names referred to in the process of using the application.

8.3.2 HTTP Request Parameter

Another mechanism is to have the client send the session ID in each HTTP request. For example, if the session ID is 987654321 and the HTTP method is GET, the request for "/cgi-bin/some_view?foo=bar" would be "GET /cgi-bin/some_view?session=987654321&foo=bar HTTP/1.1" (followed by other request headers), and analogously with other methods.

This ID must be included in every HTTP request made to the application. This is an added complication over using a cookie or the method described in the previous section. If it can be semi-automated, it's quite doable however: all the links and ACTION attributes of FORM elements must be generated with wrappers that add the session ID field to them, and that's about it. Of course the session will be lost if the user leaves the site and comes back later, or if the user tries to navigate the application by entering URLs manually.
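
A sketch of such a wrapper: all internal links are generated through one helper that appends the session ID, so no link can be written without it. The $::session_id global is assumed to be set by the application framework.

use URI::Escape;

# Build a link to a view, always carrying the session ID along.
sub app_url {
    my ($view, %params) = @_;
    $params{session} = $::session_id;
    my $query = join '&',
        map { uri_escape($_) . '=' . uri_escape($params{$_}) }
        sort keys %params;
    return "/cgi-bin/$view?$query";
}

# Usage: print '<A HREF="', app_url('some_view', foo => 'bar'), '">';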

Having the requests be consistently of the form /cgi-bin/some_view?parameters, or perhaps hierarchically /cgi-bin/view_group/some_view?parameters, has the additional advantage that web server log analyzers such as Analog can be used for creating usage statistics for different views, and in the case of hierarchical view names, for different view groups. The log analyzers are able to ignore the parameters, i.e. everything after and including the question mark.

8.3.3 Request URL Before View Name

The ID can also be put into a part of the URL that comes before the view name. Using this method, a request might be "GET /cgi-bin/987654321/some_view?foo=bar HTTP/1.1". The CGI program finds out the session ID by looking at the PATH_INFO CGI environment variable. If the application then uses only relative URLs for referring to other views, the session ID stays in the URL automatically. For example, if the view given above has a link "other_view?a=b", the browser will generate a link pointing to "/cgi-bin/987654321/other_view?a=b". If views are hierarchically named, this requires some more care, but is still doable. For instance, if the view "/cgi-bin/987654321/admin/first_view" wants to refer to "/cgi-bin/987654321/customer/second_view", it must use the link "../customer/second_view". This means a link to a view must be different depending on where the link occurs, which isn't very nice at all. Therefore it may be best not to use hierarchical view names with this method of keeping track of the session ID. It also requires playing some strange games with the web server configuration to make this work. While I haven't actually tried this mechanism in practice, I believe it can be made to work fine.

8.4 Expiring the Sessions

Usually sessions end due to the user simply closing his browser, pointing it at a different site, or even his browser crashing. There is no way to know when this has happened. Even if the application has a "log out" button, and users are specifically told to use it, they often won't. Hence, it is necessary to simply assume that a session has ended when it has been inactive for long enough. This implies that the session must have a "last accessed" timestamp that is updated every time a request for that session is made, or at least often enough. This may conflict with attempts to execute requests in parallel, as explained in chapter 10, Using a SQL Database as a Backend.

The time after which an unused session is removed shall be called the expiry time. How long this time should be depends heavily on the application. For a banking application where security is the first priority and access control is bound to the session, it might be as low as 15 minutes. For a different application which is used only in a trusted intranet, and which is an essential part of a user's job, the user may want to leave his browser open when going away for a holiday, and resume his session when coming back. In this case the expiry time might be a week or even a month.

If the expiry time is long and the typical session length is short, there may be a large number of inactive sessions in the database, and each of them may be large. This is not a big problem with reasonable indexing. The expiry is best done relatively rarely, because it may take a while.
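
The expiry itself can be a single SQL statement run periodically, from cron for instance. A sketch, assuming a last_access timestamp column and PostgreSQL interval syntax:

# Remove sessions that have been inactive for over a week.
$dbh->do("DELETE FROM Sessions
          WHERE last_access < now() - interval '7 days'");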

8.5 Explicit Session State is Good

As much of the session state as possible should be kept in the session state record. In web applications, all too much of the state tends to be in the links and forms of the currently shown HTML page or pages. This is usually a bad thing, because it's difficult to figure out the state of the program when it's scattered all over a changing set of HTML pages.

A particularly ugly habit is to keep the state in parameters of a chain of HTTP requests. In that situation the HTTP requests get long and complex, and it's very easy to make a typo in them. This also obscures the interface and purpose of individual requests.

To make sure the programmer actually uses the session state mechanism, it should be as easy to use as possible.

8.6 Session State Mechanism Should be Flexible

In my experience, it's hard to plan ahead what state variables will be needed. When changes are made and new features are added to the application, more and more state variables are needed. For this reason it should be easy to create new state variables. This is certainly not the case if, for example, the session record is kept in a database and each state variable is stored in a column of its own in the session table.
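
One flexible alternative is to serialize all the state variables into a single column, so that adding a new variable requires no schema change. A sketch using the standard Storable module; the state column is assumed to be of a binary type such as bytea:

use Storable qw(freeze thaw);

# Write the whole session state hash into one column...
$dbh->do("UPDATE Sessions SET state = ? WHERE id = ?",
         undef, freeze(\%state), $sid);

# ...and read it back on the next request.
my ($blob) = $dbh->selectrow_array(
    "SELECT state FROM Sessions WHERE id = ?", undef, $sid);
%state = %{ thaw($blob) };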

9 Access Control and Authentication

Access control means limiting who or what is allowed to use the application. Authentication, also called identification, means figuring out who is using the application. Often these concepts are interconnected. For example, if access control is based on the information about who is using the application, then access control can't be performed without authentication. On the other hand, controlling access by client IP address can be considered access control without authentication.

9.1 HTTP Authentication

The HTTP protocol includes an optional authentication mechanism, currently specified by RFC 2617. This is the mechanism that makes the browser open a dialog window asking for a username and password. Web servers support this mechanism, for instance by logging the authenticated username into the web server logs and by allowing access control based on the authentication (for example: "this page can only be viewed by user johndoe").

Because web servers support it, HTTP authentication is a very convenient mechanism to use. In particular, it's vastly easier than any other method for controlling access to static data or to several unrelated entities served by the web server.

HTTP authentication doesn't allow much control over the user interface: the dialog that prompts for the username and password can't be significantly customized. It can take some effort to make sure that authentication is prompted for only in the right places, at the right time, and that the action taken when the user doesn't enter correct credentials (username and password) is intuitive. At least in Apache, the credentials are case sensitive, which may confuse users.

Once a user has authenticated himself as one user, it is difficult to let him re-log on as a different user. If the web server is used for performing the access control, the only way to allow logging in as a new user is to make the web server deny access to the current user, which makes the web server re-prompt for the credentials. This can look confusing to the user. It can also be an ugly operation to perform, since it requires fine-tuning the web server access control settings on the fly.

If and only if the requested resource (the URL) requires authentication, the CGI environment includes a variable REMOTE_USER whose value is the username the user has entered. If a user views a page that doesn't require authentication, REMOTE_USER isn't set. In this situation, the application has no way of knowing if a user has ever logged in as any user. Often it would be useful to know this. For example, it may be desirable that an administrator user can view all the pages in the system, but some of the pages have additional features for admins. In the example application, an unauthenticated page that shows the company's product list, one product per line, could include an "edit" button next to each product name if an admin user is viewing the page. This is not possible with HTTP authentication.

9.2 Implementing Authentication Yourself

For reasons mentioned above, many large applications don't use the HTTP authentication, but instead implement authentication themselves. Implementing authentication has many similarities with implementing session functionality. The authentication record can be similarly split into ID and data portions, and those can be stored in similar ways as session record ID and data. The major difference is that the authentication record can't be created automatically, but the user must enter his credentials first.

It is a natural idea to tie authentication to the session. This can be done simply by:

When designing such a scheme, it must be kept in mind that the application may include pages that don't require authentication, but do require having a session. If that is the case, it must be possible to establish a session without the user entering credentials. For maximum flexibility, the system should then:

It can be more work than it seems to get this right. Such a scheme replaces not only HTTP authentication, but also the web server's access control system, such as Apache's mod_auth module. It might be worthwhile to check whether a suitable library already exists before starting to implement this on your own.
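
As a sketch of tying authentication to the session: the session record gets a username column that stays NULL until the user has logged in, and the login form handler fills it in. The Users table and its password_hash column are assumptions of this example.

sub handle_login {
    my ($dbh, $sid, $username, $password) = @_;
    my ($hash) = $dbh->selectrow_array(
        "SELECT password_hash FROM Users WHERE username = ?",
        undef, $username);
    # crypt() returns the stored hash itself when given the right password.
    return 0 unless defined $hash and crypt($password, $hash) eq $hash;
    # Mark this session as authenticated.
    $dbh->do("UPDATE Sessions SET username = ? WHERE id = ?",
             undef, $username, $sid);
    return 1;
}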

10 Using a SQL Database as a Backend

Many web applications serve as a front end for some data store. They allow fetching and/or changing the data in an intuitive manner. From this perspective, the data store can be called the backend of the application. When there is only a small amount of data, it is feasible to keep it in a simple ad hoc structure that is manually locked as needed, if needed.

As the requirements for data manipulation grow, due to a larger amount of data, higher demands for parallelism or reliability, or a need to manipulate the data in complex ways, it makes no sense to re-invent the wheel yourself. Instead, a "real" database should be used. Usually this is a SQL database. Other types of just as real and featureful databases exist, for instance purely object-based databases, but I know little of those and they are currently much less commonly used. An SQL database is in practice always a relational database, and I'll use these terms as synonyms for each other. I assume the reader knows the usual features of SQL databases, and will only consider the features most relevant to building web applications.

10.1 Transactions and Error Recovery

The error recovery strategy of an application can be built around SQL transactions, if all the data of the application is stored in the database (and the database supports transactions with rollback). An HTTP request normally triggers an operation that takes a fairly short time to finish: usually changing or showing some data record. Therefore one can write an application framework that automatically starts a transaction upon reception of an HTTP request. All the operations that change the database are done inside the transaction. If anything fails, the code processing the request returns an error code, and the transaction is rolled back. Because all the persistent data of the application is in the database, the effects of all operations done while processing the request before the error occurred disappear. If all goes well, the transaction is automatically committed at the end of the request processing, before the HTTP response is returned.
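
A sketch of such a framework-level wrapper, using DBI; error_page() is a hypothetical helper that renders the error to the user.

sub process_request {
    my ($dbh, $handler, @args) = @_;
    $dbh->begin_work;
    # The handler returns true on success; any die() is caught here too.
    my $ok = eval { $handler->(@args) };
    if ($@ or not $ok) {
        # Undo every change made while processing this request.
        $dbh->rollback;
        return error_page($@ || "request failed");
    }
    $dbh->commit;
    return 1;
}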

This is a strong guard against programmer errors, and one can be fairly confident that the data store doesn't get corrupted. This can also simplify programming. The programmer doesn't need to concern himself with the order things are done in, as far as error recovery is concerned.

10.2 Locking Between Simultaneous Requests

Handling multiple simultaneous HTTP requests in parallel is tricky, because more than one of them may read or write the same data. Therefore some locking must be used. This can be a very involved problem, and affect the entire design of an application. Using a good database can ease this task considerably by offering well understood and full featured mechanisms for implementing the locking. Databases typically also take care of such things as deadlock detection, which is far from trivial to implement oneself.
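
The basic primitive most SQL databases offer for this is the row-level lock, taken for instance with SELECT ... FOR UPDATE. A sketch; the Accounts table is made up for the example:

# Lock the account's row for the duration of the transaction; other
# transactions touching the same row block until we commit or roll back.
my ($balance) = $dbh->selectrow_array(
    "SELECT balance FROM Accounts WHERE id = ? FOR UPDATE",
    undef, $account_id);
$dbh->do("UPDATE Accounts SET balance = ? WHERE id = ?",
         undef, $balance - $amount, $account_id);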

10.2.1 Locking of Session Records

Let's take an example of the kind of problems faced when handling many requests simultaneously. Assume an application which keeps session state in a database, as suggested earlier. Assume also that the application framework automatically starts a transaction when it starts to handle an HTTP request. Finally, assume the application uses multiple frames. The HTTP requests for all frames of a frameset are sent nearly simultaneously by the browser, and all of the requests belong to the same session. Therefore the session record gets locked by several requests at the same time, and a conflict occurs if more than one of the requests modifies the session record. In this case, perhaps the session data should be manipulated through a separate database connection which doesn't start a transaction automatically. Or maybe it's not necessary to process requests of the same session simultaneously at all.

10.3 SQL as a Method of Communication

Using an SQL database requires creating a well defined database structure with a detailed definition of each data item's type. Constraints, assertions, etc. further help to describe the structure of the data. This can be a powerful means of communication between members of a development team, because many people can be expected to understand descriptions of SQL database structure. It can also be of much assistance to a team of one. The database structure is at all times a definitive, complete reference to all the data in the application. This is a benefit that shouldn't be belittled. With an ad hoc data store, these things would probably be complicated to the extent of being useless.

10.4 Reports

Often the user of an application wants to get unanticipated reports or summaries from the data in the database. It may not make sense to write a pretty interface for rare needs. In that situation it's very convenient to be able to write a suitable SQL query in a few minutes. Practically any ad hoc question about the data can be expressed as an SQL query.

Having the data in an SQL database also makes interfacing with the outside world easy. It's commonly necessary to import data from one system to another. If the source application keeps its data in an SQL database, doing this is easy. Almost any programmer can be expected to be able to write a query that returns the data he wants, even if he's not very familiar with the application in question.

11 Request Parameter Validation

I think of web apps as a group of functions, each of which receives zero or more parameters. Each function implements a different page or operation in the application. Let's say we are using plain CGI, and calling the function delete_user with the parameter username=johndoe, with the session carried in the parameters. The HTTP request could then be http://www.foo.com/cgi-bin/delete_user?s=987654321&username=johndoe. I'll talk about a function call when there is a link to some function, when a function is the target of a form's ACTION attribute, or similar.

There exists no standard mechanism for making sure that each function is called with the correct set of parameters, in the expected format. To use an analogy to traditional programming, there is no standard way of giving a prototype for a function. I consider it important that this checking is done for all functions. Some people may find such a requirement a hindrance, but long term advantages are often acquired by doing more work in the short term.

For example, a function pointed to by the ACTION attribute of a form may receive a large number of parameters, namely the values of the form's INPUT elements. In this situation, it's only a matter of time before a mismatch occurs between what the function expects and the parameters with which it is called. Even when calling a function with few parameters, it is easy to make a typo or to fail to encode the parameters properly.

To ensure that parameter checking is done for all functions, it should be made as easy as possible. Even better, it should be necessary to write a prototype for a function before the function can be used. I personally like to have each function in a file of its own, and for each function file there exists a prototype file that defines what parameters the function receives. The parameters are then checked automatically by the application framework.

Besides checking that those and only those parameters mentioned in the prototype exist in a call, it's also possible to give each parameter a type. Examples of types would be integer, unsigned integer, float, safe-string (a string consisting only of, say, printable 7 bit ASCII letters and numbers) and arbitrary-string.
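
A sketch of what a prototype and the framework's checking code might look like; the type names follow the list above, and the exact definition of safe-string is of course up to the application.

# Prototype for the delete_user function: exactly one parameter.
my %prototype = (username => 'safe-string');

# Returns an error string, or undef if the parameters are acceptable.
sub check_params {
    my ($proto, $params) = @_;
    for my $name (keys %$params) {
        return "unexpected parameter '$name'" unless exists $proto->{$name};
    }
    for my $name (keys %$proto) {
        my $value = $params->{$name};
        return "missing parameter '$name'" unless defined $value;
        my $type = $proto->{$name};
        return "'$name' is not an integer"
            if $type eq 'integer' and $value !~ /^-?\d+$/;
        return "'$name' contains unsafe characters"
            if $type eq 'safe-string' and $value !~ /^[a-zA-Z0-9_.-]*$/;
    }
    return undef;
}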

Truly suspicious input, such as parameters containing the null character, shouldn't be allowed in any input, regardless of the prototype. See Phrack Magazine's article "Perl CGI problems" for what can happen if you don't filter funny characters out of the input of a perl CGI program.

Sometimes not all of the parameters a function is going to receive can be known beforehand. This is the case for example with functions that process a form that has a variable number of input fields. For these situations it must be possible to state in a prototype that the function can receive arbitrary additional parameters, or that the name of each additional parameter must match some regular expression, or that their values must be of a certain type. These situations tend to be fairly rare however, and may be best handled in the function itself, case by case.

A logical parameter may consist of many form elements. For example, the time of day can be entered as an hour and a minute using select boxes. In this situation it would make sense for the application framework to be able to first validate these two parameters individually, and then combine them into a new parameter. This may be more trouble than it's worth, though. Also, when a validity check requires checking relationships between parameters, for example that parameter A must be greater than or equal to parameter B, it is almost certainly best not to try to automate that.

When the parameter validation detects an error, it can be either a programmer error (such as an incorrectly constructed link) or a user error (the user left a mandatory form field empty, say). If it's a programmer error, it's reasonable to simply print an error message detailing what went wrong, and not process the function further. If it's a user error, the function should be told an error has occurred, and perhaps be given an error message string that it can show to the user in a way it sees fit. To allow this, each prototype parameter should contain a field telling whether a validation error in that parameter indicates a programmer error or a user error. In the latter case, the parameter should also be given a descriptive name that can be used for automatically creating an error message that is meaningful to the user.

In addition to serving as the application's internal consistency checks and user input checking mechanism, these checks are a very good security measure.

11.1 Minimize the Number of Function Parameters

Even if function parameters are carefully checked for validity, their number should always be minimized. In particular, no state of any sort should be kept in function parameters, but always in session state variables. No redundant information should be passed around. For example, a function that asks for confirmation for removal of a user might want to show the user's full name instead of just the username. The full name can be derived from the username, so the full name should not be passed in the function call, even if the calling function has it readily available.

The reason why the number of function parameters should be kept as low as possible is that in the architecture that this document recommends, the function calls are the primary interfaces between parts of the applications, and any interfaces in programs should always be kept simple. Complicating the interface means the application as a whole is harder to understand, and there are more dependencies between the functions. Needless to say this is bad.

12 Separating Code and Output

In the straightforward CGI coding style, HTML output is generated by print statements in the code. In the "Code Embedded in HTML" style the code is in the middle of the HTML. In my opinion, more clarity and structure can be achieved by better separation of code and HTML.

That kind of separation can be done by splitting the processing of each HTTP request into two major parts: 1) the operations mostly independent of HTML, that is, the code, and 2) outputting the HTML, using an HTML template.

The code part does everything not directly related to the HTML that will be output. This typically includes complex request-specific parameter checking, reading and writing the database, and more complicated data manipulations. As the last step, the code sets some variables in the template interface, which will affect what the template outputs.

12.1 HTML Templates

An HTML template is a file containing mostly HTML with some sort of code embedded in it. The code may be simple variable substitutions, loops, if clauses, etc. The language used should be easy to learn, and can be quite high level. The idea is that the template is as simple as possible, concentrates only on the layout of the page, and doesn't need to know how the data it displays is retrieved. The template should never modify data, only output it.

12.2 An Example

An example will probably make this concept easier to understand. The code shall be written in perl, and the template shall be an Embperl file. Embperl is a package that allows embedding perl code in HTML files. It can be used much like PHP, but here we'll use it strictly as a way to implement HTML templates.

In this simple example we'll fetch some data about a user from a database and show it on an HTML page. The HTTP request parameters are put into a global hash %fdat by the application framework. The code gets the username of the user whose data is to be shown from the hash, reads the full user data from the database, and adds it to the hash. The template similarly reads the hash to get the user data to be shown. The code might look as follows (the explanatory annotations are written as perl comments):

# Read the user data from the database. $::users is an object of a
# class that wraps the DBI interface and offers for instance the method
# get_rows() that does SQL quoting automatically.
$user_data = $::users->get_rows (username => $fdat{username},
                                 status => 'active');

# Make sure exactly one record was received. Otherwise print an error
# message and abort the request processing.
if (! defined $user_data or scalar @$user_data != 1) {
    ::error "can't get user data: $DBI::errstr";
    return 0;
}

# Insert all the user data into %fdat so that the HTML template can
# show it. The template can show any or all of the data - that doesn't
# concern us.
for my $key (keys %{$user_data->[0]}) {
    $fdat{lc ($key)} = $user_data->[0]->{$key};
}

# Return success code.
return 1;

And the simplified HTML template (the explanatory annotations, shown here as HTML comments, would not be part of the real file):

<HTML>
<BODY BGCOLOR="white">

<!-- The construct [+ <perl code> +] evaluates the code and
     substitutes the construct with the return value of the code. -->
<H1>Data of user [+ $fdat{username} +]</H1>

<P>
<!-- $::USER_TABLE_INSIDE_COLOR is a global variable set in the
     application config. -->
<TABLE BORDER="0" BGCOLOR="[+ $::USER_TABLE_INSIDE_COLOR +]">
<TR>
  <TD>Full name:</TD>
  <TD>[+ $fdat{fullname} +]</TD>
</TR>

<!-- [$ if <condition> $] ... [$ endif $] is an Embperl construct.
     The block between the statements is shown only if the condition
     is true. -->
[$ if $fdat{group} eq 'customer' $]
<TR>
  <TD>Password:</TD>
  <TD>[+ $fdat{password} || '&nbsp;' +]</TD>
</TR>
[$ endif $]

<TR>
  <TD>Account created:</TD>
  <!-- SHOW_DATE_HOUR_MIN() is a helper function defined elsewhere. -->
  <TD>[+ SHOW_DATE_HOUR_MIN($fdat{created}) +]</TD>
</TR>

</TABLE>
</BODY>
</HTML>

12.3 Advantages

The advantages of splitting the work into well defined code and template parts include:

Sometimes it makes sense to further separate the layout and the textual contents of the templates. One person could be responsible for the layout of the pages, and other people, possibly with very little computer skill, could edit the texts of the pages. A simple way to do this is to store the template in one file, and have it include other files that contain the text blocks. The HTML-unaware people would then be allowed to edit only the files containing the text blocks.

13 Locking Data Between Requests

As stated before, HTTP is a very stateless protocol, and the user can and will end his session at any moment, perhaps simply because of a browser crash. Because this can't be detected, care must be taken when locking any data between requests.

Let's take an example of editing order data. A user chooses to edit an order, and is shown the order data in a form. In a traditional application, the order would probably now be locked against modification by other sessions. In web applications this is not a good idea, because the browser of the user may crash while he's editing the form, and therefore the lock will not get released before the session is removed, which might take a week.

Instead, locks should usually allow many users to start editing the same record. When a user is about to submit (confirm) his changes, he should be notified if a different user has changed the data during the editing. If this has occurred, the application could give the user a chance to cancel his changes. Sometimes it's possible to merge the changes made by both users, but that kind of heuristic may be not only difficult to implement correctly but also confusing for users.
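
A common way to implement the check is a version number on the record: the edit form carries the version the user saw as a hidden field, and the update succeeds only if that version is still current. A sketch; the Orders table and its version column are assumptions of the example:

# $version_seen came along in the edit form as a hidden field.
my $rows = $dbh->do(
    "UPDATE Orders SET data = ?, version = version + 1
     WHERE id = ? AND version = ?",
    undef, $new_data, $order_id, $version_seen);
if ($rows == 0) {
    # Somebody else changed the order while the user was editing;
    # give the user a chance to review or cancel his changes.
}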

Another situation that may need to be handled is the following. User A has been shown the data of some product on a read-only page. He chooses to start editing that data by clicking an "edit this product" button. At this point the application should make sure that the data the user had on his read-only screen was not stale, that is, that some other user B has not changed the data between user A receiving the read-only page and clicking the edit button. If this has occurred, user A should be warned about it.

When it's not feasible to allow multiple editing sessions at the same time, for example because the record being edited is large and the editing consists of several stages, the record must be locked when the editing starts, as in traditional apps. If a record is being edited by session A, and another session B attempts to start editing the same record, the user of session B should be told that session A is currently editing the record. The user of session B should then have the choice of not starting to edit the record after all, or of overriding the lock, in case he is certain that session A is no longer active.

Because there is no good locking strategy that would work in all situations, each case must be considered separately, and regrettably often requires a unique solution. Doing this is a lot of work, is difficult to get right, and it's hard to be convinced that all situations have been considered. In reality, it is often possible simply not to lock any records between requests, and let the users know about the implications.

14 Security

In web applications, input validation is one of the most important security issues. Nothing that comes from the web should be trusted. This means not just the data that is supposed to come from the user of the application, such as form fields, but also the data that the user is not expected to change. A malicious attacker will attempt to change that data too. Forced validation of request parameters, as explained in chapter 11, Request Parameter Validation, is a big step in the right direction.

The data sent to the browser should be treated almost as cautiously as data passed to external commands like the shell. This is because the browser _is_ an external program that manipulates the data it receives in complex ways. What you think of as text data when you write the application may in fact get replaced by HTML, or worse, a JavaScript program, by an attacker. If that reaches the user without proper escaping, it may for example mislead the user into submitting confidential data to the attacker when the user in fact thinks he's entering the data into the application.
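
In perl, the standard HTML::Entities module from CPAN handles the escaping; every piece of user-supplied data should pass through it on its way into a page:

use HTML::Entities;

# A '<SCRIPT>...' entered by an attacker comes out as harmless
# '&lt;SCRIPT&gt;...' text.
print "<TD>", encode_entities($fdat{fullname}), "</TD>\n";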

If an SQL database is used, care must be taken when using input from HTML forms in SQL queries. Using an interface that automatically quotes strings and makes sure numeric and other arguments have a valid format is a good idea. The database user the application connects as should have the minimum amount of privileges. Read-only requests, meaning those that don't change the data in the database, should use a database connection that only allows reading data.
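
With DBI this is what placeholders are for: the driver quotes the value, so a username such as "'; DELETE FROM Users --" stays harmless data instead of becoming part of the SQL statement.

my $sth = $dbh->prepare("SELECT * FROM Users WHERE username = ?");
$sth->execute($fdat{username});   # quoted by the driver, not by us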

If the application is a daemon separate from the web server, the application should run with a user ID of its own, preferably in a chroot jail. Further partitioning the application into separate processes with well defined interfaces never hurts.

Cryptography is a tricky field. Don't make the mistake of assuming that using encryption or a cryptographic protocol automatically makes things secure, unless you fully understand what you're doing. Getting strongly random numbers is difficult, and many implementations have been broken because of this. If /dev/random exists on the platform, use it. Consider using SSL, preferably with 128 bit keys. Be aware that HTTP Basic Authentication passes the passwords across the network essentially in plaintext.

The points above are of course just the tip of the iceberg. For more information, see the Secure Programming for Linux and Unix HOWTO, one of the many resources available on this topic.

15 Robustness Against Programmer Errors

The effects of errors in application code should have as small a scope as possible. Web applications have a pleasant quality in this respect: requests are processed quite independently of each other. If the processing of one request fails for any reason, that should not prevent the rest of the application from working. Failures come in many flavors, some of which are:

The CGI model has good error recovery characteristics. Each request is executed in a process of its own, started by the web server. That process crashing won't affect the processing of other requests, as long as resources locked by the process are automatically freed when the process exits. The CGI model, however, is often too inefficient.

When designing a more efficient model for execution of requests, one should attempt to preserve the good characteristics of the CGI model. This normally means using a "pre-forking" method, used for example by Apache. In this model there is one parent process that manages a set of child processes, a number of which are always running. When a request arrives, one child is chosen to process the request in its entirety.

Each child process handles up to a certain number of requests and then exits, or is forced to exit by the parent process. This limits the effect of memory or other resource leaks that persist between requests.

To prevent infinite busy or non-busy loops, as well as rapid resource leaks, during the execution of a single request, the parent process can set operating system resource limits for the children. When the limits set for a child are exceeded, the OS kills the child. The parent detects this, and starts a new child process to replace the dead one. This mechanism also cleanly handles any unexpected crashes of the child processes.
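
In perl, a child can set these limits on itself right after being forked, for example via the BSD::Resource module that wraps setrlimit(2). A sketch; the exact limits are application-specific, and RLIMIT_AS isn't available on every platform:

use BSD::Resource;

# Have the OS kill this child if it uses over 30 s of CPU time...
setrlimit(RLIMIT_CPU, 30, 30) or die "setrlimit: $!";
# ...or grows beyond 64 MB of address space.
setrlimit(RLIMIT_AS, 64 * 2**20, 64 * 2**20) or warn "setrlimit: $!";
# A wall clock limit catches non-busy loops: SIGALRM is fatal by default.
alarm(600);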

16 An Alternate Approach: Abstracted HTML

In this last chapter of the document I'll mention a completely different approach to web application development. I call this style "Abstracted HTML."

In this approach, web programming is made to feel to the programmer as much like traditional GUI programming as possible. HTML elements and constructs are wrapped in persistent objects that know how to render themselves using the browser, and HTTP requests are handled as GUI callbacks. The objects are kept in memory or stored on disk between requests. When a callback for an object arrives, the appropriate method of the object is invoked.

I don't know of any implementations of this approach that would be usable in practice, but http://projects.sault.org/wat/ has some working alpha level code in Python. At that site there is a sample application that asks for a filename, reads that file from the filesystem and shows the contents on the browser screen.

Below is the commented source of that application. Notice that the code is short and easy to understand even for someone who has never done web programming. The question is, does this approach scale to larger programs, or is the web technology too different from traditional GUI programming to make this viable?

#!/usr/bin/python

# Import the Web Application Toolkit module.
import wat

# Create a new class which subclasses wat.Application. All WAT apps must
# derive this class and override the construct method.
class TextViewer(wat.Application):

	# Override the construct method, which returns a widget to the
	# caller that represents how the application will render itself.
	def construct(self):

		# Create a centered dialog object with the title "Text Viewer".
		dialog = wat.Dialog(self, "Text Viewer", align = "center")

		# Create a text box widget (a form input control).
		self.file_input = wat.TextBox(self)
		# Bind the "changed" event to the display_file method. This means
		# when the user hits enter, self.display_file will be called.
		self.file_input.connect("changed", self.display_file)
		# Insert the input control created above into the
		# dialog box at position 0, 0.
		dialog.form.set_cell(0, 0, self.file_input)

		# Create a button on the right of the text box, and bind also
		# its "clicked" event to self.display_file.
		button = wat.Button(self, "Display File")
		button.connect("clicked", self.display_file)
		dialog.form.set_cell(1, 0, button, width = "100%")

		# Insert a Text widget containing <hr> at position 0, 1 to
		# create a horizontal line below the two widgets above.
		dialog.form.set_cell(0, 1, wat.Text(self, "<hr>"), colspan = 2)

		# Store the dialog so that we can access it in other methods, and
		# return it to the caller.
		self.dialog = dialog
		return dialog

	# The display_file method is called when the user clicks the Display
	# File button or hits enter in the input control. This is a
	# user-defined method and was bound to those events in the construct()
	# method. The o parameter is the object which this method was
	# connected to. If the user clicks the button, o will be the Button
	# widget. If the user hits enter in the input control, o will be the
	# TextBox widget. We don't actually use it below, but it must be in
	# the parameter list to match the event callback signature.
	def display_file(self, o):
		# Create a new Text control that will hold the contents of the file.
		contents = wat.Text(self)
		try:
			# self.file_input is our input control. It has an attribute
			# 'value' which holds the text in the input control. Here we
			# open the file named in this value and stuff the lines into
			# the text attribute of the contents text widget.
			for line in open(self.file_input.value).readlines():
				contents.text = contents.text + line + "<br>"
		except:
			# We get here if the file open failed (permission denied, say).
			contents.text = "Unable to open file:" + self.file_input.value

		# Finally insert the contents widget at position 0, 2 in the dialog
		# (below the horizontal line).
		self.dialog.form.set_cell(0, 2, contents, colspan = 2)

# Start the program.
app = TextViewer()
app.run()

Copyright (c) 2000 by Samuli Kärkkäinen <skarkkai@woods.iki.fi>. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, v1.0 or later (the latest version is presently available at http://www.opencontent.org/openpub/).

The original version of this document is available at http://webapparch.sourceforge.net/. Comments, feedback and criticism are welcome.