WCP: a tool for consistent on-line update
of documents in a WWW server

Shalini Yajnik and
600 Mountain Ave., Murray Hill, NJ, U.S.A.
Infosys Technologies, Bangalore and
Indian Institute of Technology
With the growing use of the Web infrastructure
as an information provider, it has become necessary to consider the problem
of accessing documents from a WWW server in a consistent fashion. When
a group of related documents that are very frequently updated are accessed,
it is desirable to provide an access semantics in which either only old
copies of all the documents in the group are provided or only new copies
of the documents are provided. As Web servers are stateless, they do not
maintain information about the sequence of documents that are retrieved.
Hence, this type of semantics is not provided for a group of documents,
and special measures need to be taken to support such semantics. We describe
a tool WCP (Web-Copy) that facilitates the on-line update of documents
such that the above document access semantics is ensured within a persistent
HTTP connection without making any changes to the WWW server program.
Keywords: On-line update; Document groups; Group consistency; WWW servers
The World Wide Web (WWW) has revolutionized the way we think of computing
environments. There is no doubt that this revolution is here to stay
and grow. Though there are many potential ways in which the
Web infrastructure can be deployed, the main use of the Web currently is
to provide a massive information and commerce service. In this application,
a client specifies the document object it wants through a URL, and through
the HTTP protocol the document is then read and the information on the
server is passed to the client. This type of access by the client is generally
strictly read only. Multiple clients can ask for the same document from
a server. The document objects on the server are maintained and updated
by whoever controls the server, say the Webmaster. If the information on
the server needs to be updated, this will have to be done by the server administrator.
If a document object is updated without any consideration for the HTTP
connections accessing the document on behalf of the clients, it can be
shown that the information provided to some clients that are accessing
the document while the update is being performed can be inconsistent (i.e.
the information is neither new nor old). The current update
approaches, which (a) exercise no control, i.e., do not worry about a few clients
getting inconsistent information, or (b) make the service unavailable for
some time while the update is being done, are acceptable in most situations.
The approach of making the information unavailable will not suffice if
the server is updated frequently with new information (e.g. any server
providing real-time information like stock quotes, airline arrival and
departure information, etc.), and the server is heavily in demand. In the
cases where it is important for the client to get consistent information,
e.g. stock quotes, then the approach of not taking any precaution while
updating will not be sufficient. Clearly, for those servers which are heavily
in demand and would like to provide consistent information to the clients
at all times, there is a need to provide mechanisms to perform on-line
update of information such that correct and consistent information is provided
to clients without disrupting the service. Although the need for
consistency is not currently perceived as significant, we believe that as
the demand on some of these servers grows and the information on them becomes
more dynamic, the demand for consistency will also arise. The information
providers will become more interested in providing consistency, if it can
be achieved at a low cost. In this work, our aim is to bring forth the
existence of the problem and to provide a simple solution for it that does
not require any changes to the HTTP protocol, the HTTP server, or the
user agent code.
The problem of consistency of information arises when some logically
single piece of information is provided through a collection of objects.
For example, an article or a book which is a logical piece of information
may be provided as a collection of different chapter documents. To provide
consistency to a group of documents, we define a logical session.
A logical session consists of a number of requests by a single client for
a set of logically related documents. Currently, the HTTP protocol (HTTP/1.0)
is such that in each TCP connection one document
is transferred. The stateless property of the HTTP servers makes
them respond to each client request without relating it to any other request.
However, for placing a request in the context of a session, we need to
have some state information for a group of related requests. Due to the
lack of a state management mechanism  in the currently
implemented HTTP protocols (HTTP/1.0 and HTTP/1.1), we will use persistent
HTTP connections as a substitute for logical sessions, i.e. we will
provide consistency for a group of documents accessed over a single
persistent HTTP connection.
Unlike the HTTP/1.0 protocol, the newer version of the HTTP protocol,
HTTP/1.1, by default allows any number of
document objects to be transferred in one TCP connection, until the connection
is closed either by the server or the client. Although these persistent
connections were meant to reduce TCP overhead, another consideration for
the need for a persistent connection might be one which acknowledges the
fact that different objects in the server are not always independent and
some objects may form a logical group. With this model, the unit of consistency
of information may become a group of document objects, and each persistent
connection can be treated as a read transaction, reading many objects.
As is well known from the database context, when multiple objects are logically
connected, reading and updating objects becomes more complex. Our goal
in this work is to develop schemes for on-line updates of document objects
such that consistent access of these documents within a single persistent
HTTP connection is ensured. We also describe a tool WCP that incorporates
the on-line update schemes.
In the next section, we discuss the system and consistency model that
we consider and describe an analytical model that provides a quantitative
handle on the consistency problem. Section 3
gives a brief overview of the features in the HTTP protocol that are used
by WCP. Section 4 describes in detail our on-line
update schemes and Section 5 describes the
WCP tool. Conclusions are presented in Section 6.

2. System model and consistency
The following characterizes our model of the Web being used as an information
service. Though the Web can also be used to perform update operations on
server data, through forms and CGI programming, we are restricting our
attention to the case when the Web is used as an information provider with
the following characteristics.
Consider the following example which illustrates how uncontrolled updates
can lead to clients getting inconsistent information. Suppose a document
f1 contains embedded documents f2,
f3, f4 and f5. Documents
f1 through f5 form a logical group.
If a client issues an HTTP request for document f1, the
client side browser first fetches document f1. Then it
fetches the embedded documents f2 through f5,
one at a time. Assume that at the same time on the server side, documents
f1, f2, ..., f5 are
updated between the time the file f2 is fetched at the
client and a request for f3 is sent to the server. Now,
when the client side browser fetches document f3, it
will get the new updated version of document f3. This
may have some information which is inconsistent with the old versions of
f1 and f2. In the above scenario, in
order to keep the information given to a client consistent we would like
the client to get either all the old copies of the documents or all the
new copies of the documents in this logical group. This is one example
of a simple way to define a group. This simple example can be solved by
replacing the references to the embedded documents with the correct version
before serving the document to the client. However, there can also be documents
which are not related through being embedded in a document but still may
form a logical group. The information provider may define the dependencies
between documents and hence form a group. Let us now define the notion
of consistency in this general context of a group.
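The interleaving described above can be made concrete with a small simulation. The sketch below is illustrative only: the fetch loop stands in for the browser's sequential requests, and the mid-loop version bump stands in for the uncontrolled server-side update.

```python
def serve_group(docs, versions, update_after):
    """Simulate a client fetching f1..f5 while an uncontrolled update runs.

    `versions` maps each document to its current version. After the client
    has fetched `update_after` documents, every document is switched to the
    new version, modeling an update racing with the sequential reads.
    """
    fetched = {}
    for i, doc in enumerate(docs):
        if i == update_after:  # the update lands mid-session
            for d in versions:
                versions[d] = "new"
        fetched[doc] = versions[doc]
    return fetched

docs = ["f1", "f2", "f3", "f4", "f5"]
versions = {d: "old" for d in docs}
result = serve_group(docs, versions, update_after=2)
# result mixes versions: f1 and f2 are old, f3..f5 are new
```

The client ends up with an inconsistent mixture of old and new copies, which is exactly the outcome the group consistency requirement rules out.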
One server and multiple clients. When the Web is used as an information
service, there is usually one information provider and several clients
accessing that information. That is, there is one writer process, and multiple
readers which read the documents. The reader issues requests to read one
document at a time, even if a group of documents are logically related
or even when documents are embedded within a document. This is a fundamental
difference from the general database model.
Multiple documents in a server may be related and an update may require
change to many of the documents on the server, e.g. a server may contain
files representing many chapters or components of a document.
The server side process serves all the incoming requests using the HTTP
protocol. The HTTP server reads a configuration file at initialization
and this configuration file provides all the parameters to set up the internal
state of the server.
Note that the consistency and system model are not the same as in the database
context. There is only one writer process and there are no update type
transactions that might have to be ``undone''. The database transaction
type of model is not suitable for the Web. Furthermore, the cost of implementing
a database transaction type model is known to be high. One of the reasons
for the popularity of HTTP is that it was designed to be a very light-weight
protocol. Therefore, any solution for ensuring consistency in information
should be such that it does not add substantial overhead to the communication
between the client and the server. Only if the solution has minimal overhead
will it be acceptable to the WWW community. In addition, the solution should
be limited to either not modifying the HTTP server at all, or allowing
limited modifications to the server. Further, no modifications should be
required on the client side, since client side changes require modifications
in the browsers. In addition, the solution should not suggest any changes
to the HTTP protocol. Our solution in this paper does not require any changes
to the server program.
When a client is reading a logical group of related documents, the consistency
requirement is that the client either gets old copies of all the documents
or new copies of all the documents. A logical group of documents is taken
to be one where all the documents in the group are read within one persistent
HTTP connection. This is like the transaction model which requires atomicity.
Note that in the above definition, we couple the object model with a persistent
HTTP connection and define a group of documents to be those documents that
are provided over one single persistent HTTP connection. We found that
this is one definition that is implementable without changing the server
code. Stronger definitions may be possible, but those would require changes
to the server program, since the server would then need to maintain enough
information about a group and keep a persistent connection open
until all the documents in the group are transferred. But the server has
no control over when the client may choose to close a connection. Thus,
the server only provides the guarantee to the client that if the client
does not close the connection before all the documents in the group
are transferred, then the documents will be consistent.
Once the information is updated, then all the new client requests must
get only the new information.
Before we discuss the solution let us better understand how uncontrolled
updates can affect consistency. We developed a simple analytical model
for a Web server where there are highly accessed documents (for example,
stock quotes which may be accessed very frequently) and uncontrolled on-line
updates of these documents is being done. We assume that when a group of
documents is being updated it is done by updating one document at a time
(overwriting each old document with its new version). This approach
will not satisfy the group consistency requirement. The analytical
model computes the probability that at least one client is accessing the
group of documents while the uncontrolled update is in progress. This probability
is referred to as the interference probability. A group of documents
may be a document and its associated embedded documents and we refer to
these documents collectively as a group.
Using actual measurement data for time needed to move a file and read
transfer times for groups of different sizes (our measurements were done
on a SPARCstation 20 running Solaris), we computed the interference probabilities
for different file sizes and read rates, using our model. As expected,
we found that the interference probability increases with file size, and
decreases almost linearly with decrease in read request rate. For example,
for a group size of 100 Kbytes with read requests arriving at a rate of
0.1 request/sec., 2% of the time a read request will interfere with an
update and may lead to inconsistent data being read. For large group sizes
of 5 and 10 Mbytes, it can be seen that the interference probability is
more than 50% if the read request rates are high. This shows that in systems
where constantly changing documents are accessed at a high rate, there
is a very high probability that at least one read of a document in a group,
interferes with an update which may lead to inconsistent data being retrieved.
Therefore, if we need to satisfy the group consistency requirement for
a set of documents, we require some mechanisms for update which can provide
either all new copies of the documents or all old copies of the documents
even if an update is initiated while the client request is being processed.
The details of the model and the results of this modeling can be found
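The analytical model itself is not reproduced here, but a simple approximation conveys the idea: if read requests arrive as a Poisson process, the probability that at least one read overlaps an update window of a given length is 1 - e^(-rate x window). The sketch below uses assumed timings (a 100 Kbyte group taking about 0.1 s to update and 0.1 s to transfer), which under this approximation gives roughly the 2% figure quoted above; it is not the paper's exact model.

```python
import math

def interference_probability(read_rate, update_time, transfer_time):
    """Probability that at least one read overlaps an update.

    Assumes reads arrive as a Poisson process with rate `read_rate`
    (requests/sec). A read overlaps the update if it starts during the
    update or up to `transfer_time` seconds before it, so the vulnerable
    window has length update_time + transfer_time (both in seconds).
    """
    return 1.0 - math.exp(-read_rate * (update_time + transfer_time))

# Assumed timings for a 100 Kbyte group: ~0.1 s to update, ~0.1 s to read.
p = interference_probability(read_rate=0.1, update_time=0.1, transfer_time=0.1)
# p is about 0.02, i.e. roughly a 2% chance of interference
```

The same formula shows why large groups are so exposed: a 10 Mbyte group with a multi-second window and a high read rate quickly pushes the probability past 50%.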
3. HTTP protocol features used by WCP

Before going into the details of the scheme, we provide a short description
of HTTP servers and the redirection facility which is a part of the HTTP
protocol.

HTTP is a light-weight application level protocol for distributed information
systems. It is a request/response protocol. Most servers and browsers currently
support HTTP/1.0 version of the protocol. The newer version HTTP/1.1 is
still under standardization by the IETF. In the HTTP protocol, the client
sends requests to a server. A request contains a request method and the
URL being requested, along with some other information. The server responds
back with a status code and a message containing the data in the URL requested
by the client. If a client wants to make requests for a set of documents/URLs
from the same server, in the HTTP/1.0 version of the protocol it has to
make a separate TCP connection for each URL. However, this is remedied
in HTTP/1.1 version of the protocol. HTTP/1.1 specifies that a connection
is persistent by default unless explicitly closed by either the server
or the client. While the persistent connection is open the client can ask
for any number of documents on the same TCP connection. We use this to
provide a form of transaction semantics for a group of documents.
The HTTP protocol supports a class of status codes called the Redirection
codes which indicate to the client that the client side browser needs to
take another action to fulfill the request. For example, consider a scenario
where a document d has been moved from server s1
to server s2. When a client requests document d
from server s1, s1 returns a status
code which tells the client that the document has moved permanently to
another location and it also includes the URL for the new location on server
s2 in the reply. The client side browser can then make
a request to s2 using the returned URL. All HTTP servers
implement the redirection functionality.
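The exchange just described can be sketched as raw HTTP messages (the hostnames are made up for illustration): the redirection reply carries the new location in a Location header, which the browser then follows.

```python
# Request sent to s1 for the moved document d (hostnames are illustrative).
request = "GET /d HTTP/1.0\r\nHost: s1.example.com\r\n\r\n"

# Redirection reply: the status code says the document moved permanently,
# and the Location header gives the new URL on s2 for the browser to follow.
response = (
    "HTTP/1.0 301 Moved Permanently\r\n"
    "Location: http://s2.example.com/d\r\n"
    "\r\n"
)
headers = response.split("\r\n")
status_line = headers[0]
location = next(h for h in headers if h.startswith("Location:"))
```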
Most HTTP servers use a configuration file to configure themselves
at initialization or at any time when a change in the configuration is
needed. The configuration file may consist of different directives which
configure different dimensions of the HTTP server, e.g. port number for
incoming requests, log file name, timeout for persistent connections.
We used an HTTP server developed internally at Bell Laboratories for our experiments.
This server uses Rewrite File url-pattern replace-pattern rules
in the configuration file to translate url-pattern in the requests
to the document or directory in the replace-pattern on the server.
The server compares any incoming URL request against each url-pattern in
the order in which they appear in the configuration file and replaces the
matching part with replace-pattern and the resulting document
is then returned to the client, e.g. the directive Rewrite File /* /usr/local/public_html/
means that an incoming URL, say http://www.lucent.com/file1.html,
will be evaluated to /usr/local/public_html/file1.html. The server
also provides the redirection facility in the HTTP protocol by adding a
Rewrite Redirect url-pattern replace-pattern directive to the
configuration file. In this case, the replace-pattern is the URL
of a document on another server and the client is sent back the URL in
the replace-pattern. If a change in the configuration of the server
is desired, the configuration file is modified and then a hangup signal
is sent to the server, which then rereads the configuration file and resets
its internal parameters as a part of the signal handler routine.
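The in-order rule matching described above can be sketched as follows. The exact matching semantics of the experimental server are not specified here, so the prefix-glob interpretation of a trailing "*" is an assumption for illustration.

```python
def apply_rewrites(path, rules):
    """Translate an incoming URL path using ordered Rewrite rules.

    `rules` is a list of (kind, url_pattern, replace_pattern) tuples as
    they appear in the configuration file. A trailing "*" in url_pattern
    matches any suffix; the first matching rule wins, mirroring the
    server's in-order scan. Returns (kind, result) or (None, path).
    """
    for kind, pattern, replacement in rules:
        if pattern.endswith("*"):
            prefix = pattern[:-1]
            if path.startswith(prefix):
                return kind, replacement + path[len(prefix):]
        elif path == pattern:
            return kind, replacement
    return None, path

rules = [
    ("Redirect", "/greetings.html", "http://www2.lucent.com/greetings.html"),
    ("File", "/*", "/usr/local/public_html/"),
]
```

With these rules, /greetings.html is redirected to another server while any other path is mapped under /usr/local/public_html/, matching the two directive kinds described above.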
Our solution is geared towards servers which use the process forking
model to service requests. The process forking mechanism works as follows:
Initially, when the Web server is started, it reads the contents of the
configuration file into its data memory. Once this is done, this configuration
information is used by the server. If a change in the configuration is
desired, the configuration file is modified and a hangup signal is sent to
the server, which rereads the file and resets its internal parameters, as
described above.
When a client HTTP connection request is received at the server, it forks
a child process to serve this request, and then goes back to receiving
further client requests.
When a child process is forked, it is provided with a copy of the data
memory which includes the configuration data. Since the child makes a copy
of the data, even if the configuration data gets changed in the main server
process due to a reread of the configuration file, the copy maintained
by the child process is unchanged.
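The copy-on-fork behaviour the scheme relies on can be demonstrated directly. This is a POSIX-only sketch; the `config` dictionary stands in for the server's parsed configuration data.

```python
import os

# Parent's in-memory configuration, as loaded from the configuration file.
config = {"rewrite": "/usr/local/public_html/"}

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: forked with a private copy of the parent's data memory.
    os.close(r)
    # Even if the parent re-reads its configuration now, this copy is frozen.
    os.write(w, config["rewrite"].encode())
    os._exit(0)

os.close(w)
# Parent: simulate a configuration re-read after the fork.
config["rewrite"] = "/usr/local/new_html/"
child_view = os.read(r, 1024).decode()
os.waitpid(pid, 0)
# child_view still holds the pre-fork value, unaffected by the change
```

This is precisely the property the update scheme exploits: children forked before the configuration re-read keep serving from the old configuration for the lifetime of their persistent connection.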
4. Redirection-based document update
To provide group consistency, one conceptually simple approach would be
to copy the original documents onto backups and change the backups. Once
all the backups have been changed, an atomic operation should be used to
inform the server of the updated documents. If we then let the original
copies and the updated copies of the documents in the group
co-exist for some time, and satisfy requests over a persistent HTTP connection
that was initiated before the atomic update by providing the original copies
and satisfy any requests over a persistent HTTP connection that was initiated
after the update by providing the updated copies, then clearly group consistency
will be satisfied in that a read request to a group of documents within
a single persistent HTTP connection will get either the old copies of the
documents or the new copies of the documents. We also need to make sure
that once a request is serviced with the updated copies, all subsequent
requests will be serviced with updated copies.
We use the process fork option and the above mentioned properties to
devise a simple scheme which uses the Rewrite directives of the
experimental Web server and the redirection facility of HTTP to implement
this approach for ensuring consistency within persistent HTTP connections.
The goal is to deliver, within a single persistent HTTP connection, either
all old copies of the documents in the group or updated copies of the documents.
In this scheme, when a group is to be updated, the original documents in
the group are not changed, but new copies of all the documents in the group
are created in a temporary location and these are then updated, thereby
giving the updated documents. The configuration file of the server is changed
to redirect all subsequent requests for any document in the group to the
updated copies. Then, the server is made to reread the configuration file
so that subsequent child processes that are forked to serve client requests
use the changed configuration information and hence redirect the requests
to the updated copies. Note that the new configuration file should have
only a few lines of Redirect directives and hence the reread/restart
at the server will be very quick. In the meantime, child processes that
were forked off before the server was made to reread the configuration
file would continue to serve the old copies of the document until the persistent
HTTP connection is closed. After all the persistent HTTP connections, supported
by child processes forked before the configuration file is reread, are
closed, the old copies are replaced by the updated copies, the redirection
is removed from the configuration file and the server is made to reread
the configuration file. This is done so that the configuration file does
not continually grow when more and more updates are performed. Subsequent
client requests will then get served the new copies of the documents.
Using the above approach, several versions of a document can be maintained
on a server at the same time.
Even in this simple scheme, the details of the update scheme change
depending on the temporary as well as the permanent location of the updated
copies. In the following subsections, we describe the steps to be followed
for the two scenarios outlined below.
Same server: In this scenario, the original copies of the group
of documents to be updated as well as the temporary location of the updated
copies are in the same server. The temporary location of the updated
copies could be in the same directory as the original copies or could be
in a different directory and the same solution works for both.
Fully replicated document tree: In this scenario, the whole document
tree is replicated on more than one server which means that temporary copies
of the documents to be updated are not explicitly created. Already available
replicated copies of the documents on another server are used to perform
the update. However, as full replication can be quite expensive, especially
if only a few documents are to be changed, this scheme is likely to be
of use only when there is already replication of information for performance
reasons. Many servers whose information is highly in demand replicate their
documents on a number of servers. In such a case, one of the replicated
copies of the document can be updated and then the updated document can
be copied to the other copies one copy at a time.
4.1. Same server
The first scenario is where the original copies and the updated copies
are both on the same server. Assume that we need to update the group of
documents D1/f1, D2/f2, D3/f3, ... on a server, where Di
represents the directory path of document fi. Assume that
the updated copies for each of these documents have been created. Let the
updated copies be D'1/f'1, D'2/f'2, D'3/f'3, ..., respectively. The
following steps have to be performed for a consistent update.
As explained earlier, by making use of the fork option of a Web server,
the steps listed below will ensure group consistency. Before the server is sent
the signal to reread the configuration file, a persistent TCP connection
has been established between the clients and the server child processes
with these processes reading the old documents in the group and writing
the contents of the documents to the TCP socket. After the signal is sent
and the server has read the new configuration file, any new HTTP requests
arriving at the server for the documents in the group will be serviced
by newly created child processes that use the new configuration information
and hence provide the updated copies D'i/f'i.
As no new requests will be provided the old documents, after some time
all the ongoing requests will be serviced and there will be no child processes
reading the old documents. At this time, these documents can be written
over, as done in Step 3. By the same argument, in the last step, the documents
can be deleted. Note that the consistency condition is satisfied by the
first two steps. However, if we just leave there, the configuration file
will keep getting longer every time a document is updated, as old unused
copies of the documents will be retained. Steps 3 through 5 avoid this
problem. The steps described below implement the conceptual semantics of
a group update: documents in a group are updated in place with new information.
1. For each document fi, add the following line in the configuration
   file of the server: Rewrite File /Di/fi /D'i/f'i
2. Send a signal to the server to reread the configuration file.
3. For each document fi, when all the persistent HTTP connections
   opened before the signal was sent to the server (in Step 2) are closed,
   copy the new updated documents /D'i/f'i over the old documents /Di/fi.
4. In the configuration file, delete the lines added in Step 1, and send
   another signal to the server to re-read the configuration file.
5. When all the persistent HTTP connections accessing documents in the
   group D'i/f'i are closed, this group of documents can be deleted.
4.2. Fully replicated document tree
In this section, we consider the second scenario where a document tree
is fully replicated on other servers. In such a case, it may be useful
to update one of the replicated copies of the document and then copy the
updated document to the other replicas one copy at a time thereby staggering
the updates at the different replicas. This scheme uses the Rewrite
Redirect directive of the HTTP server, instead of the Rewrite
File directive used by the previous scheme. The Rewrite Redirect
directive allows us to redirect requests to another server unlike the Rewrite
File directive which can provide substitution only at the file level
on the same server.
Assume that a document f in directory D has k copies
one on each of the servers w1, w2,
..., wk. Since these copies already exist, it is assumed
that the relative links in the copies have all been created keeping in
view the fact that the document is on a particular server. We need to update
all the copies of the documents and we perform a staged update as follows:
for i = 1 to k do
1. Consider document /D/f on server wi which is to be updated.
2. Add an entry Rewrite Redirect /D/f http://w(i mod k)+1.lucent.com/D/f
   to the configuration file at server wi. Send a signal to the server wi
   to re-read its configuration file.
3. After all existing persistent HTTP connections at server wi which are
   accessing the group of documents that /D/f belongs to are closed,
   update document /D/f.
4. Delete the line added to the configuration file of server wi in Step 2;
   send a signal to server wi to re-read its configuration file.
In Step 2, we redirect all requests to /D/f at server
wi to its replica on server wi+1.
In Step 3, we update the document at server wi and in
Step 4, we remove the redirection from server wi to server
wi+1 so that subsequent requests for /D/f
at server wi are provided the updated document /D/f
at server wi.
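The staged loop above can be sketched with stand-in replica objects. The `Replica` class is purely illustrative: `add_redirect` and `reread_config` model editing a replica's configuration file and sending it the hangup signal.

```python
class Replica:
    """Minimal stand-in for one replica server (illustrative only)."""
    def __init__(self, name):
        self.name, self.docs, self.redirects = name, {}, {}
    def add_redirect(self, doc, peer):
        self.redirects[doc] = peer.name   # Rewrite Redirect /D/f -> peer
    def remove_redirect(self, doc):
        del self.redirects[doc]
    def reread_config(self):
        pass  # stands in for sending the hangup signal to the server
    def wait_for_drain(self):
        pass  # stands in for waiting out open persistent connections
    def write(self, doc, content):
        self.docs[doc] = content

def staged_replica_update(servers, doc, new_content):
    """Update /D/f one replica at a time, as in the steps above."""
    k = len(servers)
    for i in range(k):
        peer = servers[(i + 1) % k]          # w_((i mod k)+1) in the text
        servers[i].add_redirect(doc, peer)   # Step 2: add the Redirect entry
        servers[i].reread_config()
        servers[i].wait_for_drain()          # Step 3: old connections close
        servers[i].write(doc, new_content)   # safe: new reads go to the peer
        servers[i].remove_redirect(doc)      # Step 4: drop the Redirect entry
        servers[i].reread_config()

replicas = [Replica("www1"), Replica("www2"), Replica("www3")]
staged_replica_update(replicas, "/D/f", "v2")
```

At every point in the loop, each replica either serves its own (old or new) copy or redirects to the next replica, so no client ever sees a partially written document.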
5. WCP: a tool for consistent updates
WCP (Web-Copy) is a tool that implements the redirection based solutions
discussed above. The tool can be used for updating a group of files for
the two scenarios discussed earlier: (a) where the updated copies
are located on the same server; or (b) where the document tree is fully
replicated.

5.1. Same server
In this case, the WCP command should be invoked as follows. Assume that
documents f1, f2, f3,
..., need to be updated. Assume that the updated documents have been created
and placed in f'1, f'2, f'3,
..., respectively. Then, the user can invoke wcp using the command:
wcp f'1!f1 f'2!f2
Each document name is expressed as a Unix file path, with the updated
copy and the original document separated by a "!". The paths could be relative to the
current working directory or they can be absolute paths. For each pair
of documents the wcp utility first evaluates the absolute paths
in the file system. Then it evaluates the path of the original document
relative to the Web directory path. For example, if the HTTP server's home
directory is /usr/local/www/htdocs and the original document is
/usr/local/www/htdocs/greetings.html, then the original document's
path with respect to the Web directory is greetings.html and the
document is accessed by the URL http://serveraddress/greetings.html.
If the updated copy of the document has an absolute path of /usr/local/www/htdocs/temp/newgreetings.html,
then the WCP utility adds the following line to the configuration file:
Rewrite File /greetings.html /usr/local/www/htdocs/temp/newgreetings.html.
After the server receives the signal to reread its configuration file,
for every request that comes in for URL http://serveraddress/greetings.html,
it returns the document /usr/local/www/htdocs/temp/newgreetings.html.
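The path translation just described can be sketched as follows. The function name is illustrative, and the real wcp's handling of corner cases (e.g. symbolic links under the document root) is not described here.

```python
import os

def rewrite_directive(original, updated, doc_root):
    """Build the Rewrite File line added to the configuration file.

    Illustrative sketch: `original` must live under the server's document
    root `doc_root`; both paths may also be given relative to the current
    working directory, as wcp allows, since abspath() resolves them first.
    """
    orig_abs = os.path.abspath(original)
    upd_abs = os.path.abspath(updated)
    web_path = "/" + os.path.relpath(orig_abs, doc_root)
    return f"Rewrite File {web_path} {upd_abs}"

line = rewrite_directive("/usr/local/www/htdocs/greetings.html",
                         "/usr/local/www/htdocs/temp/newgreetings.html",
                         "/usr/local/www/htdocs")
# reproduces the directive shown in the example above
```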
5.2. Fully replicated document tree
In this case, the WCP command should be invoked as shown below. Assume
that documents f1, f2, f3
on server www1 need to be updated and copies of this document are
replicated on server www2. Let the replicated copies of the documents
on www2 be located at f'1, f'2,
f'3 respectively. Then the user can invoke wcp
on www1 using the command:
wcp www2:f'1!f1 www2:f'2!f2
Again, each document name is expressed as a Unix file path. In this
case, the path names of the files on www1 can be specified relative
to the current working directory or can be absolute paths, but the path
names of the files on www2 need to be specified as absolute file
paths.

5.3. Experiments with WCP
We have tested wcp for both the above scenarios using the
experimental server developed at Bell Laboratories which allows the static
definition of persistent HTTP connections through the use of keepalive
parameters in the server configuration file (htd.conf). The syntax for
specifying keepalive parameters is the following:
Keepalive on|off [ timeout [ max.requests ]]
where the on|off flag can be used to turn the Keepalive
specification on or off, timeout is the maximum
pause in seconds that is allowed between the end of one request and the
start of the next beyond which a persistent HTTP connection will be closed
at the server and max.requests is the maximum number of document
requests that will be served over a persistent HTTP connection before it
is closed by the server. For example, if the line Keepalive on 50 5
is included in the configuration file, when the server reads the configuration
file, it will set its internal parameters so that an HTTP connection will
be kept open until no new request arrives for more than 50 seconds or until
5 documents are transferred, whichever happens first.
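A parser for this directive might look as follows. The defaults used when the optional fields are absent are placeholders, since the server's actual default timeout and request limit are not given here.

```python
def parse_keepalive(line, default_timeout=30, default_max=100):
    """Parse a Keepalive directive from htd.conf (sketch).

    Returns (enabled, timeout_seconds, max_requests). The default values
    used when the optional fields are omitted are illustrative guesses,
    not the experimental server's real defaults.
    """
    parts = line.split()
    if not parts or parts[0].lower() != "keepalive":
        raise ValueError("not a Keepalive directive")
    enabled = parts[1].lower() == "on"
    timeout = int(parts[2]) if len(parts) > 2 else default_timeout
    max_requests = int(parts[3]) if len(parts) > 3 else default_max
    return enabled, timeout, max_requests
```

For example, parsing the line Keepalive on 50 5 yields a 50-second idle timeout and a 5-request cap, matching the behaviour described above.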
No changes were made at all to the server program. In the basic experiment
that we set up, we sent a continuous stream of requests for reading a document
with a group of embedded documents in it. The main document together with
the embedded documents now form a group. The keepalive parameter
was specified such that a connection is kept open at the server until the
number of documents that are retrieved over this connection before it is
closed equals the number of embedded documents plus the main document.
When the requests were being serviced, one or more of the documents in
the group were updated. Then, the documents that were retrieved were compared
to the old group and the new group of documents. In all the cases we found
that either a request retrieved the old group of documents or the new group
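The check applied to each retrieved set can be expressed as a small
predicate (illustrative only; in our experiments the comparison was done
on the actual file contents):

```python
def is_consistent(retrieved, old_group, new_group):
    """A retrieval is consistent if every document matches its copy in
    the old group or every document matches its copy in the new group;
    any mixture of old and new copies is an inconsistency."""
    all_old = all(retrieved[name] == old_group[name] for name in retrieved)
    all_new = all(retrieved[name] == new_group[name] for name in retrieved)
    return all_old or all_new
```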
5.4. Extra utility in WCP
In this section, we describe an additional scenario where wcp can
be used. It is well known that in a Web site, most of the document
accesses are to a small number of ``hot'' documents. This means that performance
can be improved by replicating just this small set of documents rather than
the whole document tree. This idea has been used in the design of geographically
distributed document caching schemes.
Consider the temporary replication of these hot documents on another
server to ease the load on the original server. In this case, replication
is not necessarily done for an update but to increase performance by
providing more replicas for a temporary period. As not all documents are
replicated, relative URLs pose a problem and have to be handled. Relative
URLs are those that specify the location of a document relative to the
main document in which they are embedded. For example, assume that a document
file1.html, that is available in server www1 and can
be accessed from it using the URL http://www1.lucent.com/file1.html,
is getting a large number of requests. The document is replicated on server
www2 and some requests to the document are redirected to www2
to balance the load. In file1.html, there may be a link to a document
file2.html whose URL is specified relative to the directory where
file1.html resides. If this document is not a hot document, this
may not be replicated on www2. If a client accesses file1.html
from www2 and then the link to file2.html is selected,
the relative URL will be transformed to the complete URL http://www2.lucent.com/file2.html.
However, file2.html is not replicated on www2 and hence
this access will fail. In this case, such relative URLs have to be handled
by redirecting these requests to the server www1. WCP has the
capability to handle relative URLs and hence can be used in a situation
where documents are temporarily replicated to handle load increases. One
such use can be found in the RobustWeb system that we have developed,
which is described in detail elsewhere. In RobustWeb, a front
end redirection server redirects HTTP requests probabilistically to one
of the back-end servers that has a copy of the document. Documents are
statically replicated and distributed. In this system, if we allow documents
to be dynamically replicated and distributed to handle temporary increases
in the load on a server, then the above mentioned relative URL problem
has to be handled. We are currently exploring this possibility.
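One way to handle such dangling relative links, sketched below under the
assumption that the replica knows which documents it holds (the function
name, server names, and file names are illustrative, not WCP's actual
interface), is to rewrite each relative URL that targets a non-replicated
document into an absolute URL naming the origin server, so that the
client's follow-up request goes back to www1:

```python
from urllib.parse import urljoin

def rewrite_link(link, replicated, origin="http://www1.lucent.com/"):
    """If a relative link targets a document that was not replicated,
    turn it into an absolute URL on the origin server; otherwise the
    replica can serve it and the link is left untouched."""
    if link.startswith(("http://", "https://")):
        return link                  # already absolute
    if link.lstrip("/") in replicated:
        return link                  # a local copy exists on the replica
    return urljoin(origin, link)     # send the client back to www1
```

For the example above, a link to file2.html inside the replicated
file1.html would be rewritten to point at www1, while links to replicated
documents stay relative and resolve against www2.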
6. Conclusions
When a group of related documents is provided on a Web server to be accessed
by clients, it is possible for clients to get inconsistent information
if the files are updated without any control at the server. That is, if
the file updates are not controlled, it is possible that a client may get
some files that are old and some that are new. If the group of files is
related, this situation leads to clients getting inconsistent information.
In this work, we discuss the problem of supporting consistency on the
Web, and provide a simple solution to the problem. We describe a tool called
WCP that can be used to update a group of Web documents such that accesses
to these documents over a persistent HTTP connection are consistent. That
is, a stream of accesses retrieves either all old copies of the documents
or all new copies. The tool can be used on its own and does
not require changes to either the HTTP protocol or the WWW server program.
The solution we have proposed is simple and is geared towards a particular
category of Web servers; it provides group consistency over a single
persistent HTTP connection. We are currently exploring other solutions
that will provide consistency in the context of a logical session rather
than a physical session.
T. Berners-Lee, R. Fielding, and H. Frystyk, Hypertext
Transfer Protocol -- HTTP/1.0, HTTP Working Group Informational document,
RFC 1945, May 1996, http://www.ics.uci.edu/pub/ietf/http/rfc1945.ps.gz
R. Fielding, J. Gettys, J.C. Mogul, H. Frystyk, and T.
Berners-Lee, Hypertext Transfer Protocol -- HTTP/1.1, HTTP Working
Group Proposed Standard, RFC 2068, Jan. 1997, http://www.ics.uci.edu/pub/ietf/http/rfc2068.ps.gz
D.M. Kristol, and L. Montulli, HTTP State Management
Mechanism, HTTP Working Group, Proposed Standard, RFC 2109,
Feb. 1997, http://www.ics.uci.edu/pub/ietf/http/rfc2109.txt
J. Gwertzman and M. Seltzer, The case for geographical
push-caching, in: Proc. of HotOS '95, 1995, http://www.eecs.harvard.edu/~vino/web/hotos.ps
A. Bestavros, Speculative data dissemination and service
to reduce server load, network traffic and service time in distributed
information systems, in: Proc. of the International Conference
on Data Engineering, March 1996, http://www.cs.bu.edu/~best/res/papers/icde96.ps
B. Narendran, S. Rangarajan and S. Yajnik, Data distribution
algorithms for fault-tolerant load balanced web access, in: Proc. of
the IEEE Symposium on Reliable Distributed Systems (SRDS'97), October 1997.
S. Rangarajan, S. Yajnik and P. Jalote, WCP a tool
for consistent on-line update of documents in a WWW server (extended
version of this paper).
Sampath Rangarajan is a member of technical staff in
the Distributed Software Research Department at Lucent Technologies Bell
Laboratories in Murray Hill, NJ. Prior to joining Bell Laboratories, he
was an Assistant Professor in the Department of Electrical and Computer
Engineering at Northeastern University in Boston, MA. He received a Ph.D
in Computer Sciences from the University of Texas at Austin in 1990. His
research interests are in the areas of fault-tolerant distributed computing,
mobile computing and performance analysis.
Shalini Yajnik graduated with a Ph.D. degree from Princeton
University, Princeton, NJ, in 1994 and has since been working as a Member
of Technical Staff in the Distributed Software Research Department in Lucent
Technologies, Bell Laboratories. Her research interests are fault tolerance
in distributed systems, CORBA, and fault tolerance and load balancing in
the Web.
Pankaj Jalote received his Ph.D.
in Computer Science from University of Illinois at Urbana-Champaign in
1985. From 1985 to 1989 he was an Assistant Professor in the Department
of Computer Science at the University of Maryland, College Park.
Since 1989 he has been at Indian Institute of Technology Kanpur, where
he is now a Professor in the Department of Computer Science and Engineering.
Currently he is on a two year Sabbatical with Infosys Technologies Ltd.,
a leading software company in Bangalore, India, as Vice President. His
main areas of interest are software engineering, fault tolerant computing,
and distributed systems. He is the author of two books An Integrated
Approach to Software Engineering (Springer-Verlag, 2nd ed., 1997), and
Fault Tolerance in Distributed Systems (Prentice Hall, New Jersey, 1994).
He is a senior member of IEEE.