[ Home Page ] [ Eiffel Archive ] [ Eiffel Classes and Clusters ]

Arc de Triomphe clipart (2486 bytes)

Comparison Library


Written by Ian Elliott.

comparison.zip (23,321 bytes) - source code


1. Introduction

This library provides classes that compare two sequences of like items and report their differences. A sequence is taken to be an ordered collection of items where each item has an associated positive integer index which defines the item’s unique position in the collection. Successive items of the sequence may be accessed and for each access the index of the item accessed increases by 1. Thus files of lines, characters, bits and other objects, arrays and linear lists of items and even traversals through structured objects such as trees and graphs may be treated as sequences. Two sequences are the same if they contain items of equal value (by is_equal) at every corresponding index, otherwise the sequences are different. The smallest index at which corresponding items differ specifies the position of the first difference. At higher indexes the two sequences may contain subsequences of items in common i.e. they may resynchronise and further on they may differ again, and so on. The aim of the library is to provides means to identify the differences, report them and report where they occur.

2. Implementation

A comparison algorithm is implemented for those sequences which are “rewindable” i.e. sequences for which reading may be re-commenced at a position already read. Such sequences include arrays and “random access” files. The classes are given in the Comparison cluster. A simple “brute force” pattern matching algorithm is used and it is reasonably effective for small files. An implementation comparing large files or files of bits would probably require a more sophisticated approach. A full implementation is provided for line based ASCII files i.e. for files which are sequences of ASCII strings separated by end of line characters. These classes are in the Line_file_comparison cluster. The program fcompare provides a simple command line interface to this implementation.

Notes:

  1. Use is made of the Visual Eiffel libraries Formats and Timedate and the class TEXTFILE from the Pool library. This class treats an ASCII file as an array of lines.
  2. The LINKED_LIST class of the Eiffelbase library is used.
  3. The fcompare program make use of the Command_line command line scanning library which is separately downloadable.

3. Simple Compare Algorithm for Rewindable Sequences

This algorithm applies to rewindable sequences – i.e. sequences where the items are typically read one after the other but an item already read can be re-read and the reading the sequence continued from there.

The two sequences are read an item at a time in concert. The items in each pair are compared. If the items are the same i.e. have the same value by is_equal, the next pair is read. This continues until one of the following occurs:

  1. the two items of the pair differ.
  2. both sequences are exhausted,
  3. only one of the sequences exhausts,

Considering each case in turn:

  1. for each sequence a section i.e. a contiguous sequence of items, has been reached which differs from the other’s section. Following each of these sections the two sequences may have, provided neither one exhausts, items in common i.e. the two sequences may re-synchronize. If so, then from this point the pair-wise comparing can continue as before.

    To seek re-synchronization the section containing the different items is determined for each sequence. A record of the difference sections is maintained. Initially each difference section is given the first item which differs. The next item is read from one of the sequences. If the other difference section contains this item, it indicates the start of resynchronized sections of a length of at least 1. If the item read is not in the other’s difference section then the first sequence’s difference section is increased by this item. The next item of the other sequence is read and the same process is applied to it with respect to the first sequence. This alternating processing continues until a match is found or one of the sequences exhausts. On establishing resynchronization, pair-wise reading of the two sequences continues as before. If a sequence exhausts then searching for re-synchronization continues with the remaining sequence. A report of the difference is made.

    For satisfactory resynchronization to be deemed to have occurred the sections must agree for a length which is usually greater than 1. This is required since accepting agreement of less than this length may generate spurious resynchronizations. In these cases numerous stop-start differences can be produced which obscure the true differences sought.
  2. indicates that the comparison is complete and so processing stops.
  3. a report that a sequence has exhausted is made and processing stops.

4. Further Development

A possible extension is to ignore items of certain value e.g. blank lines in file comparisons. Another extension is to create a Windows version of fcompare using Object Tools’ Display Machine and another includes the ability to compare different file formats such as binary.


[ Home Page ] [ Eiffel Archive ] [ Eiffel Classes and Clusters ]