Indexing
Random Access
Files

in Visual Basic®



by Rick Meyer        Home

Table

of

Contents

1. Why is an index necessary?
2. How does an index work?
3. How is the index stored?
4. Searching an index
5. Adding a record in an index
6. Creating an index
    a. General sorting
    b. Index sorting
7. The Mini-Database demo project
Why is an index necessary you might ask. It is possible, of course, to just loop through all the records in a random access file until the record that you Get matches the search criteria you desire. A simple

If string1 = string2 Then

statement would be all you need. There are several reasons: the primary reason is that as your data file of records builds, you would experience increasing delays in retrieving records. It would become just too slow. Also suppose that if the search fails to return an exact match to your criteria, you want to find those records that most closely match what you are looking for. To do this you will need an index.

You might contend, "Well, why don't you put the records in the correct order* in the first place? Then you don't have to worry about an index at all." The answer is that the act of inserting these records into the ordered position in the random access file each time you added (or edited) a field would take considerable time. It would probably take more time than looping through all of them in the search mentioned above. Also you could only store the records in order for just one field. If you want to be able to search on more than one field, an index is required.

So the reasons are:

  • To locate the record you want very quickly (searching).
  • To locate the best match if an exact match search fails.
  • To index more than one field.
  • To insert or reposition a single record quickly.
  • To iterate through the records in indexed order.
  • * Generally records in random access data files are appended sequentially at the end of the file as they are added.

    How does an index work? In random access files, each record has a number starting at 1. If you have 0 records then your file length will be zero (or there will be no file at all). What an index file does is keep track of all these record numbers, like in an array, but the numbers are sorted. For example, let's say you have a random access file with one string field, and you wish to index it alphabetically:

    Data File
    Rec
    Num
    Value
    1 Rick
    2 Jim
    3 Bill
    4 Jill
    Index File
    Rec
    Num
    Value
    1 3
    2 4
    3 2
    4 1

    You see that the index file is just made up of numbers, and the records in the index file are arranged such that they indicate the record of the data file that should hold this position. For instance, if we want to find the first person alphabetically in our data file we would go to the index file and get the first index record. We find the index value to be 3. So we then Get the 3rd record from the data file, and determine that the first record alphabetically is "Bill." The index record value 3 can be thought of as a pointer to the data file record.

    How is the index stored? Answer: Another random access file with the same number of records as the data file. Random access files in Visual Basic are versatile since they store both string and binary information. In fact, it takes only 2 bytes of file storage space to store the number 32767 (the maximum signed integer). It would look like "01111111 11111111" if you could see all the bits on the disk. If you were storing the same number as ASCII bytes, the actual bits would be: "00110011 00110010 00110111 00110110 00110111" (5 bytes, one for each digit).

    Remember that a random access file should have fixed length records, so we must choose the fixed length binary representations over the variable length ASCII characters. If you open up an index file with notepad, and look at the contents, you would not be able to recognize any numbers because they are stored as binary, not ASCII.

    Longs are generally preferred over integers since the maximum number of records that can be indexed is 2,147,483,647 (with routines for negative number adjustment it would be double that number) at the cost of only 4 bytes per record. (4 bytes = 32 bits ... sound familiar?)

    Note that there will be one index file for each field in the data file that you wish to index. In the Mini-Database demo project (reference at bottom of page), the data file is named MiniDB.db, and the related index files are named MiniDB0.idx, MiniDB1.idx, MiniDB2.idx, etc...

    So far you have the idea of how to move through an index consecutively and show the data in order. But how do you search for a particular item that might be somewhere in the middle of the file. Further, if there is no exact match how do you find the record that is the best match?

    It is done simply by repeatedly dividing the number of records you wish to search in half. Divide and conquer! Pick the record in the middle of your range of records first. If it does not happen to be the record that you want, then a simple > comparison will tell you which half it is in. You then divide that half in half and repeat the procedure. Eventually (but surprisingly quickly) you will arrive at the correct position as this geometric procedure in reverse progresses. The following recType is the UDT (user defined type) for data file records.

    
    Private Type recType
        firstName As String * 16
    End Type
    
    Dim recData As recType
    
    'The index file was opened before entering the function below.
    fx% = Freefile      'Handle for index file
    
    'The data file was opened before entering the function below.
    fdb% = Freefile     'Handle for data file
    
    
    Private Function performSearch$(strWanted$)
        Dim dbPtr&, idxPtr&, strFound$
        DIM lower&, upper&, chkStop&
        
        lower = 0: upper = Lof(fx) \ 4 + 1: chkStop = 0
        
    x1: idxPtr = (lower + upper) \ 2
        If idxPtr = 0 Then idxPtr = 1
        
        If idxPtr <> chkStop Then
            Get fx, idxPtr, dbPtr
    
            Get fdb, dbPtr, recData
    
            strFound = recData.firstName
            
            If strFound <> strWanted Then
                If strFound > strWanted Then
                    upper = idxPtr
                Else
                    lower = idxPtr
                End If
                chkStop = idxPtr
                GoTo x1
            End If
        End If
    
        performSearch = strFound
    End Function
    
    As you see we enter the function with StrWanted as the string we are searching for. The lower is set to 0 instead of 1. The upper is set to one more than the number of records in the index file (Lof returns all the bytes in a file and 4 is the number of bytes in a long intrinsic variable type). This works well because the \ operator truncates and will not allow idxPtr to ever reach the max + 1. There is a special line of code to make sure idxPtr never actually becomes 0.

    The one variable not discussed is chkStop. It is simply assigned the last value for idxPtr that we computed on the previous halving loop. If at any time chkStop turns out to the idxPtr number just computed, then we know it is time to stop (without finding the record we wanted) because the file has been halved down to one record. However if that turns out to be the case, the performSearch function will return the strFound as the closest match.

    Just like in the little tables above in "How does an index work," first we Get our index record with the index pointer just computed, and then use that value to Get the actual record in the data file.

    The flow is a loop using Goto. Although usually not preferred, Goto works elegantly here. Try coding it with Do Loops and you soon discover this. You drop out of the function immediately if strFound = strWanted. But if not the search area is immediately cut in half by the reassignment of the upper or lower value to the middle position which was computed (idxPtr).

    Although it is a surprisingly simple process that will cut down a huge number of records in no time at all, it does have one quirk if you do not get the exact match you are looking for. The problem is that the record returned could be just one record away from the best match. There is commented code for the function to determine this last bit in the Mini-Database (reference at bottom of page) project if you wish to know more.

    You will note in the Mini-Database project that this search function is used for searching and also for finding where to insert a new record in the index. For an insertion, a simple comparison will do to determine whether the new record should be inserted above or below the idxPtr last found. It is only for a search that we have to go through a more complicated task of determining whether the record immediately above or below that last idxPtr is a better match.

    Now let's say you want to add a record to the data file. As explained above you just add it to the end of the file: New record position = Lof(fdb) \ Len(recData) + 1. But how do you index a new record? Once you have determined the insertion point by using the above searching technique, you must shift all the values in the index file up one record from that point, and then Put the data record number (determined with the formula in brown above) in the insertion point. In code it looks like this:
    
    Private Sub fileItemsUp(insertPos&)
        Dim j&, newDataPosition&, recTmp&
    
        newDataPosition = Lof(fdb) \ Len(recData) + 1
    
        For j = newDataPosition To insertPos + 1 Step -1
            Get fx, j - 1, recTmp
            Put fx, j, recTmp
        Next
        
        Put fx, j, newDataPosition
    End Sub
    
    The For ... Next loop works from the top down (Step -1), Gets the record below and Puts it one record up. You have to go from top down because you would be overwriting values prematurely if you went from the insertion point up. When the loop is done you are left with j pointing exactly at the insertion point, so you just put the newDataPosition there (the position of the new record in the datafile is the last record).

    One more thing to think of is instead of adding a record, suppose someone just edited a record so that it was no longer in the correct indexed position. Once again you would have to do a search and determine its new insertion point. You would then have a routine similiar to the one above, but you would also need a Sub to move fileItemsDown in the event that the insertion point (newly computed correct position) is above the old position in the index. See the Mini-Database project code if you want to see this in detail.

    Briefly stated, to create an index you have to Open For Random the data file, consecutively (from 1) Get each record, save the field you wish to index in a string array, store the data record numbers in a parallel array of long numbers, sort the number array based on the string values, and then Put the whole resultant sorted number array to the index file (consecutively from 1).

    Doing this takes time and computer memory resources. So you only want to create an index if the index file has become corrupted or deleted, or if you wish to index a new field that hasn't been indexed before. I shall assume that you know how to Dimension the arrays, Open For Random, Get, and assign the above mentioned information into the arrays. The rest is sorting which is demonstarted here with the bubble sort routine.

    Generally when sorting the items being sorted are moved themselves to different positions in the array in which they are contained. Here is how:
    
    Dim strArray$(10)
    
    'Code to populate the string array has been omitted.
    
    Dim j%, tmp$, flag As Boolean
    
    flag = True
    Do While flag
        flag = False
        For j = 1 to (10 - 1)
            If strArray(j) > strArray(j + 1) Then
                tmp = strArray(j)
                strArray(j) = strArray(j + 1)
                strArray(j + 1) = tmp
                flag = True
            End If
        Next
    Loop
    
    If you study the For ... Next loop you see that first a simple comparison is made to see if the value is greater than the value immediately above it. If it is greater then it is exchanged with that element using a temporary string variable as a middleman. Each consecutive Do ... Loop should bubble the next highest value to its highest position. (If it is less than the string immediately above it then no exchange is made). By judicious positioning of the boolean flag, the routine will end when eventually on one For ... Next pass no strings are exchanged. This means that it is sorted. It is interesting to note that in VB you can compare strings just like numbers - the one caveat being that the comparison is case sensitive, so you may wish to convert all your string array elements to upper (UCase$) or lower (LCase$) case as you assign them.
    As stated, the above code will sort the strings nicely, but that is not what we want. We need to Put numbers to the index file, not strings, so the sort above accomplishes nothing for us. To sort an index is an wonderful leap in programming - but you will have to adjust your mind to another dimension in order to grasp it.

    What happens is that you immediately use the array of long numbers as pointers to (look at) the string array, and as you exchange those pointers in the sort, based on the strings they point to, the number array automatically become ordered as desired:

    
    'Code to populate this array has been omitted.
    Dim strArray$(10)
    
    'Populate with the record numbers of the data file.
    ' So initially lngArray(1) = 1, lngArray(2) = 2, etc ...
    Dim lngArray&(10)
    
    Dim j%, tmp&, flag As Boolean
    
    flag = True
    Do While flag
        flag = False
        For j = 1 to (10 - 1)
            If strArray(lngArray(j)) > strArray(lngArray(j + 1)) Then
                tmp = lngArray(j)
                lngArray(j) = lngArray(j + 1)
                lngArray(j + 1) = tmp
                flag = True
            End If
        Next
    Loop
    
    Perhaps it will help if you consider that strArray() is already indexed by element number, so all you are doing is creating a secondary index.

    Notice that longs are being swapped rather than strings. This immediately speeds up things on your computer. When you are done it is the array of numbers that has been ordered perfectly for your next step which is to simply Put them all (one record of the index is simply one long intrinsic variable type) to the index file.

    Normally most of the time spent by the computer will be file I/O. In situations of a large number of records you need to use a different sort method.* The Mini-Database project uses a Shell sort which really is very fast. You may wish to view the code there.

    *To view various sorting methods in action and make comparison timings see the demos at: Makai's Visual Basic®

    The Mini-Database demo project could be described as a "poor man's database" since it simply writes and retrieves your input data as records to and from a random access file and then establishes other random access files to index the records. If you wish, you may download the Mini-Database Project (12K) to see the complete commented source code and indexing in operation. You may find it a bit advanced if this is your introduction to indexing, so it is not necessary for this discussion. When viewing the code in the Mini-Database, it is important to know that eight fields are being indexed (with eight index files). The examples in this tutorial have been simplified to just one.

    Also see the Open for Random in Visual Basic for a brief refresher.