Notes for Programmers

As indicated earlier, Banker can make both hashed and compressed G banks. In the compressed bank, each series has its own starting date and number of observations and most series have been compressed. Compression involves these steps:

  1. Find and record the number of decimal places in the series.

  2. Slide the decimal to the right until the series is all integers.

  3. Record the first observation as a 4-byte integer.

  4. Record the first differences of the series as 2-byte integers.

  5. If a series cannot accurately be recorded in this compressed form, it is declared to have 255 decimal places (just a flag), and the observations are recorded as four-byte floating point numbers.

  6. Missing observations are marked with a special code.

  7. The data file has the extension “CBK”, and the index file has the extension “CIN”.
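
Steps 2 through 5 above can be sketched in C as follows. This is only an illustration under stated assumptions, not Banker's own code: the names compress_series and round_to_int32 are hypothetical, the number of decimal places is passed in rather than detected, values are rounded to the nearest integer, "cannot accurately be recorded" is taken to mean only that a value or difference is out of range, and the value 32767 is kept free because it is used as a marker in the differences (described later in this section). Missing observations are not handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Round v to the nearest integer; return 0 if the result would not
   fit in a 4-byte signed integer (or if v is not a number). */
static int round_to_int32(double v, int32_t *out)
{
    double r = (v >= 0.0) ? v + 0.5 : v - 0.5;
    if (!(r > -2147483649.0 && r < 2147483648.0))
        return 0;
    *out = (int32_t)r;
    return 1;
}

/* Hypothetical helper: try to compress the n observations in x.
   Returns 1 on success, with the scaled first observation in *first
   and the n-1 first differences in diffs.  Returns 0 if the series
   would have to be stored as 4-byte floats instead. */
int compress_series(const double *x, size_t n, int decimals,
                    int32_t *first, int16_t *diffs)
{
    double scale = 1.0;
    int32_t prev, cur;
    size_t i;
    int k;

    for (k = 0; k < decimals; k++)
        scale *= 10.0;                      /* step 2: slide the decimal */
    if (n == 0 || !round_to_int32(x[0] * scale, &prev))
        return 0;
    *first = prev;                          /* step 3: first observation */
    for (i = 1; i < n; i++) {
        int64_t d;
        if (!round_to_int32(x[i] * scale, &cur))
            return 0;
        d = (int64_t)cur - prev;
        if (d < -32768 || d > 32766)        /* assumed range; 32767 is a marker */
            return 0;                       /* step 5: fall back to floats */
        diffs[i - 1] = (int16_t)d;          /* step 4: first difference */
        prev = cur;
    }
    return 1;
}
```

For instance, the series 1.50, 2.00, 1.75 with two decimal places becomes the first observation 150 followed by the differences 50 and -25.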

The .CIN file contains:

Item    Size in Bytes   C type         Definition
ns      2               int            The number of series.
nc      2               unsigned int   The cumulative number of characters in the series names, counting the null at the end of each series name.
names   nc              char           The series names.

If there were three variables in the bank with names tom, dick, and harry, ns would be 3, and the names vector would be

tom0dick0harry0

where 0 represents a null ('\0' in C), and nc would be 15, the number of characters in the names vector, counting the null characters.
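
A names vector of this kind can be walked one name at a time. The sketch below is an illustration only; the function name nth_name is hypothetical, and it assumes the vector has already been read into memory.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the k-th name (counting
   from 0) in a names vector of nc bytes that holds null-terminated
   names back to back, or NULL if the vector has fewer names. */
const char *nth_name(const char *names, size_t nc, size_t k)
{
    size_t pos = 0;

    while (pos < nc) {
        /* Find the null that ends the name starting at pos. */
        const char *end = memchr(names + pos, '\0', nc - pos);
        if (end == NULL)
            return NULL;                 /* malformed: no terminator */
        if (k == 0)
            return names + pos;
        pos = (size_t)(end - names) + 1; /* skip the name and its null */
        k--;
    }
    return NULL;
}
```

With the 15-byte vector from the example, nth_name(names, 15, 1) points at "dick".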

The .CBK file contains:

Item       Size in Bytes   C type          Definition
title      80              char            The bank title.
ns         2               int             The number of series.
position   4               unsigned long   The byte number at which the “indx” array begins.
series 1   variable                        See below.
...
series n   variable                        See below.
indx       4*ns            unsigned long   An array containing the byte numbers at which the series begin in this file.

To continue the example, suppose that the series “tom” requires 101 bytes, “dick” requires 121 bytes, and “harry” requires 81. The title, ns, and position occupy the first 80 + 2 + 4 = 86 bytes, so “indx” is the vector (86, 187, 308). (Remember that in C a file starts with byte 0.) The “position”, which is the byte number at which the “indx” array begins, will be 389.
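
The arithmetic of this example can be reproduced with a few lines of C. The function name layout_cbk and the constant CBK_HEADER_BYTES are hypothetical names chosen for this sketch.

```c
#include <stddef.h>

/* 80-byte title + 2-byte ns + 4-byte position */
#define CBK_HEADER_BYTES 86UL

/* Hypothetical helper: given the size in bytes of each of ns series,
   fill indx with the byte number at which each series begins and
   return "position", the byte number at which indx itself begins. */
unsigned long layout_cbk(const unsigned long *size, size_t ns,
                         unsigned long *indx)
{
    unsigned long pos = CBK_HEADER_BYTES;
    size_t i;

    for (i = 0; i < ns; i++) {
        indx[i] = pos;      /* this series starts where the last one ended */
        pos += size[i];
    }
    return pos;             /* first byte after the last series */
}
```

Called with the sizes 101, 121, and 81, it yields indx = (86, 187, 308) and returns 389.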

Each compressed series has the format:

Item             Size in Bytes   C type          Definition
BaseYear         1               unsigned char   The year of the first observation, minus 1900.
FreqPeriod       1               unsigned char   16*frequency + period, where frequency is the number of observations per year (1, 4, or 12) and period is the period of the first observation. (For frequencies above 12, set FreqPeriod = 255. This value signals that two integers, containing the frequency and the period, have been inserted after this byte.)
SlashDecplaces   1               unsigned char   16*SlashFactor + number of decimal places. Because SlashFactor normally is 0, this byte usually is just the number of decimal places. If it is 255, the series has not been compressed.
NDif             2               int             The number of differences, equal to the number of observations - 1.
FirstObs         4               long            The first observation, as a four-byte integer.
Differences      2*NDif          int             The first differences of the series, as 2-byte integers.
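
The two packed header bytes can be built and taken apart with integer division and remainder. The helper names below are hypothetical, and the sketch does not handle the FreqPeriod = 255 case used for frequencies above 12.

```c
/* Hypothetical helpers for the packed header bytes of a series. */

/* FreqPeriod = 16*frequency + period */
unsigned char pack_freq_period(unsigned freq, unsigned period)
{
    return (unsigned char)(16u * freq + period);
}

void unpack_freq_period(unsigned char b, unsigned *freq, unsigned *period)
{
    *freq = b / 16u;
    *period = b % 16u;
}

/* A SlashDecplaces byte of 255 flags an uncompressed series. */
int is_uncompressed(unsigned char slash_decplaces)
{
    return slash_decplaces == 255;
}

/* SlashDecplaces = 16*SlashFactor + number of decimal places;
   only meaningful when the byte is not 255. */
void unpack_slash_decplaces(unsigned char b, unsigned *slash,
                            unsigned *places)
{
    *slash = b / 16u;
    *places = b % 16u;
}
```

A quarterly series starting in the third quarter, for example, has FreqPeriod = 16*4 + 3 = 67.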

If a zero occurs in the series, then it is indicated by 32767 in the differences. The following difference applies to the previous non-zero number, not to the zero. This practice was adopted because some banks have series with numerous missing observations that appear as zeroes. Also, some banks consider quarterly series to be monthly series in which only the end-of-quarter months have non-zero values.

Note that SlashFactor normally is 0; the Press program, however, allows the option of dividing by a power of 2 to reduce the magnitudes of a series so that it can be compressed. The SlashFactor is the power (1, 2, 3, etc.) to be used on this series. In Press, the default maximum slash factor is 0, so the occurrence of non-zero slash factors is unusual.
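
The treatment of zeroes is easiest to see when expanding a series. The following is a minimal sketch, assuming a SlashFactor of 0 and a non-zero first observation; the names expand_series and ZERO_MARKER are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ZERO_MARKER 32767   /* a difference of 32767 stands for a zero */

/* Hypothetical decoder: rebuild the ndif+1 observations of a
   compressed series into out.  Assumes SlashFactor is 0. */
void expand_series(int32_t first, const int16_t *diffs, size_t ndif,
                   int decimals, double *out)
{
    double scale = 1.0;
    int64_t level = first;      /* last non-zero value, still scaled */
    size_t i;
    int k;

    for (k = 0; k < decimals; k++)
        scale *= 10.0;
    out[0] = (double)level / scale;
    for (i = 0; i < ndif; i++) {
        if (diffs[i] == ZERO_MARKER) {
            out[i + 1] = 0.0;   /* the level itself is left unchanged */
        } else {
            level += diffs[i];  /* applies to the last non-zero value */
            out[i + 1] = (double)level / scale;
        }
    }
}
```

With a first observation of 150, two decimal places, and differences 50, 32767, -25, the expanded series is 1.5, 2.0, 0.0, 1.75: the final difference is applied to 2.0, not to the zero.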

If it was not possible to compress the series, then the format is:

Item           Size in Bytes   C type
BaseYear       1               unsigned char
FreqPeriod     1               unsigned char
255            1               unsigned char
nobs           2               int
Observations   4*nobs          float

Note that the 255 in the third byte is the signal that the series is not compressed. The next two bytes represent the number of observations, and then the observations follow as 4-byte floating point numbers.
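
Adding up the field sizes in the two formats gives the number of bytes a series occupies in the file. The sketch below is illustrative; the function name series_bytes is hypothetical, and it ignores the extra integers inserted when FreqPeriod is 255.

```c
/* Hypothetical helper: number of bytes occupied by one series.
   For a compressed series, count is NDif; for an uncompressed
   series (third byte equal to 255), count is nobs. */
unsigned long series_bytes(unsigned char slash_decplaces, unsigned count)
{
    if (slash_decplaces == 255)
        return 5UL + 4UL * count;   /* 3 header bytes + 2-byte nobs + floats */
    return 9UL + 2UL * count;       /* 3 header bytes + 2-byte NDif
                                       + 4-byte FirstObs + differences */
}
```

A compressed series of 47 observations (NDif = 46) takes 9 + 92 = 101 bytes, which happens to equal the size used for “tom” in the earlier example. Stored uncompressed, the same 47 observations would take 5 + 188 = 193 bytes.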

The compressed form can represent a series as accurately as can an 18-foot-high graph printed with laser-printer resolution of 300 dots per inch. (All series in the US National Accounts or Industrial Production Indexes compress easily. In the Blue Pages of the Survey of Current Business, however, nearly ten percent fail to compress. In IMF data, the hyperinflation of many third-world countries produces series which fail to compress.)

Hashed banks differ from compressed banks mainly in the organization of their index files. With standard and compressed banks, G7 keeps the names of the series in memory and simply does a linear search for a name each time one is requested. In the hashed banks, the names instead are grouped into bins on the basis of a number calculated from the letters of the name. When a name is requested, G7 calculates the number, locates the bin in which the name has been stored, reads in the names in that bin, and does a linear search over only those names to find the desired series.

The size of compressed banks is limited by the requirement that the total number of characters in the names of all series must be less than 64,000. In practice, that limit typically translates to about six or seven thousand series. Hashed banks, in contrast, can go up to several million series. The data file has the extension “HBK” and the extension for the index is “HIN”.

The precise forms of the hashed bank .HIN and .HBK files are presented below. The “.HIN” file contains:

Item                Size in Bytes        C type          Definition
ns                  4                    long            The number of series in the bank.
nbins               2                    unsigned        The number of bins in the bank.
nsb                 2*nbins              unsigned        An array containing the number of series in each bin.
ncharb              2*nbins              unsigned        An array containing, for each bin, the total number of characters in the series names in that bin (including each terminating '\0').
posbin              4*nbins              unsigned long   The beginning positions in the “.HIN” file of the first bytes of the binname() strings.
binname(0)          ncharb[0]            char            The string binname(i) denotes the concatenation (including the '\0' terminators) of all the series names in the i-th bin.
binposts(0)         4*nsb[0]             long            binposts(i) is an array of beginning positions in the associated “.HBK” file of the series in the i-th bin.
binname(1)          ncharb[1]            char
binposts(1)         4*nsb[1]             long
binname(2)          ncharb[2]            char
binposts(2)         4*nsb[2]             long
...
binname(nbins-1)    ncharb[nbins-1]      char
binposts(nbins-1)   4*nsb[nbins-1]       long

The series are separated into nbins “bins”, where the number contained in each bin is recorded in the nsb array. Of course, the ordering of the series in the binname() and binposts() arrays must be the same.
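
Because the items follow one another in the order listed, the posbin values can be derived from nsb and ncharb. The sketch below assumes that the items are stored back to back with no padding, which the layout above suggests but does not state outright; the function name layout_hin is hypothetical.

```c
/* Hypothetical helper: compute the posbin array from nsb and ncharb,
   assuming the .HIN items are stored back to back with no padding. */
void layout_hin(const unsigned *nsb, const unsigned *ncharb,
                unsigned nbins, unsigned long *posbin)
{
    /* ns (4) + nbins (2) + nsb (2*nbins) + ncharb (2*nbins)
       + posbin (4*nbins) bytes precede binname(0). */
    unsigned long pos = 6UL + 8UL * nbins;
    unsigned i;

    for (i = 0; i < nbins; i++) {
        posbin[i] = pos;
        /* Skip the names of bin i, then its 4-byte positions. */
        pos += (unsigned long)ncharb[i] + 4UL * nsb[i];
    }
}
```

For a bank with two bins holding 3 and 1 series whose names take 14 and 4 characters, this gives posbin = (22, 48).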

Consider an example. Suppose that bin number 3 contains the series “joe”, “dave”, and “bill”. The string binname(3) would be

"joe\0dave\0bill\0"

Suppose that the starting positions in the “.HBK” bank for the three series are 40700008, 490987, and 3378294. The array binposts(3) then would be [40700008, 490987, 3378294], with nsb[3] = 3, and with ncharb[3] = 14. If the beginning position of binname(3) in the “.HIN” file is 4724, then posbin[3] = 4724.
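
The linear search within one bin can be sketched as follows, using the binname and binposts contents from this example. The function name find_in_bin is hypothetical, and the sketch assumes both arrays have already been read into memory.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: search one bin for the series name `wanted`.
   binname holds nchar bytes of null-terminated names, and binposts
   holds the matching .HBK positions in the same order.
   Returns the .HBK position of the series, or -1 if it is absent. */
long find_in_bin(const char *binname, size_t nchar,
                 const long *binposts, size_t nseries,
                 const char *wanted)
{
    size_t pos = 0, k = 0;

    while (pos < nchar && k < nseries) {
        const char *end = memchr(binname + pos, '\0', nchar - pos);
        if (end == NULL)
            break;                          /* malformed bin */
        if (strcmp(binname + pos, wanted) == 0)
            return binposts[k];             /* same index in both arrays */
        pos = (size_t)(end - binname) + 1;  /* next name */
        k++;
    }
    return -1L;
}
```

Searching the example bin for “dave” returns 490987, the second entry of binposts.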

To assign a bin number to a series you must use the following hashing routine. In C, the routine is:

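/* nbins is the number of bins, as read from the .HIN file. */
extern unsigned nbins;

/* Return the bin number for the series name s.  The result depends
   on the width of unsigned on the compiler used. */
unsigned hash(const char *s)
{
   unsigned bill;
   for (bill = 0; *s != '\0'; s++)
      bill = *s + 31 * bill;
   return bill % nbins;
}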

To continue with the example, to determine the bin to which the series “joe” belongs, you would evaluate the function hash("joe").
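
One point deserves care when porting the routine: the sum wraps around at the width of unsigned, so the bin number for a given name depends on that width. The 2-byte "unsigned" items in the tables above suggest a compiler with 16-bit integers, but which width a particular version of G7 uses is not stated here, so treat the choice as an assumption to verify against an existing bank. The sketch below shows both variants; the names hash_native and hash16 are hypothetical, and nbins is passed as a parameter rather than read from a global.

```c
#include <stdint.h>

/* The recurrence from the routine above, using the native width of
   unsigned (32 bits on most current compilers). */
unsigned hash_native(const char *s, unsigned nbins)
{
    unsigned h = 0;
    for (; *s != '\0'; s++)
        h = *s + 31u * h;
    return h % nbins;
}

/* The same recurrence with the sum truncated to 16 bits at every
   step, as it would be where unsigned is a 2-byte type. */
unsigned hash16(const char *s, unsigned nbins)
{
    uint16_t h = 0;
    for (; *s != '\0'; s++)
        h = (uint16_t)(*s + 31u * h);
    return h % nbins;
}
```

For “joe” the untruncated sum is 105408, which wraps to 39872 in 16 bits. With a hypothetical nbins of 1000, hash16 therefore returns 872, while hash_native returns 408 where unsigned is at least 32 bits wide. Short names such as “ab” (sum 3105) give the same bin either way.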

The .HBK file layout is:

Byte          Type   Description
0 - 79        char   Name of bank (terminated with a null)
80 - 81       int    ns, number of series in the bank
82 - 85       long   psn, position in file of index
86 -                 first series, as described below
*(psn+1) -           second series
psn           long   position in file of first byte of first series
psn+4         long   position in file of first byte of second series
… on out to ns series
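
Reading the fixed part of this layout from a buffer might look like the sketch below. The byte order is not stated in this section; the sketch assumes little-endian storage (least significant byte first), which is an assumption to verify against a real bank. The names bank_header and parse_bank_header are hypothetical, and ns is read here as an unsigned 2-byte value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory form of the first 86 bytes of the file. */
struct bank_header {
    char     title[81];  /* 80 bytes from the file plus a guaranteed null */
    unsigned ns;         /* number of series (bytes 80 - 81) */
    uint32_t psn;        /* position in file of the index (bytes 82 - 85) */
};

/* Decode the header from buf, which must hold at least 86 bytes.
   Assumes little-endian byte order. */
void parse_bank_header(const unsigned char *buf, struct bank_header *h)
{
    memcpy(h->title, buf, 80);
    h->title[80] = '\0';
    h->ns  = (unsigned)buf[80] | ((unsigned)buf[81] << 8);
    h->psn = (uint32_t)buf[82]
           | ((uint32_t)buf[83] << 8)
           | ((uint32_t)buf[84] << 16)
           | ((uint32_t)buf[85] << 24);
}
```

Assembling the integers byte by byte, rather than copying them into an int or long, keeps the sketch independent of the sizes and byte order of the machine on which it runs.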

For each series, the format is:

Byte    Content
0       base year
1       frequency*16+period
2       slash*16+maxplaces or 255 if not compressed
3-4     number of observations
5-8     first observation as a long
9 -     differences as integers

If the series is not compressed, then floating-point data begin in byte 5.