Notes for Programmers
As indicated earlier, Banker can make both hashed and compressed G banks. In the compressed bank, each series has its own starting date and number of observations and most series have been compressed. Compression involves these steps:
Find and record the number of decimal places in the series.
Shift the decimal point to the right until every value in the series is an integer.
Record the first observation as a 4-byte integer.
Record the first differences of the series as 2-byte integers.
If a series cannot accurately be recorded in this compressed form, it is declared to have 255 decimal places (just a flag), and the observations are recorded as four-byte floating point numbers.
Missing observations are marked with a special code.
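The steps above can be sketched in C as follows. This is only an illustration, not Banker's actual code: the function name, the search limit MAX_PLACES, the 1e-6 integer tolerance, and the treatment of the first observation are all assumptions. The marker 32767 and the flag 255 are the values given elsewhere in these notes.

```c
#include <stdint.h>

#define ZERO_MARK      32767  /* difference value reserved for a zero observation */
#define NOT_COMPRESSED 255    /* "decimal places" flag for a series stored as floats */
#define MAX_PLACES     6      /* hypothetical search limit, not taken from the notes */

/* Round to the nearest integer without needing the math library. */
static long long nearest(double v)
{
    return (long long)(v < 0.0 ? v - 0.5 : v + 0.5);
}

/* Try to express x[0..n-1] (n >= 1) as a first observation plus 2-byte
   differences.  dif must have room for n-1 entries.  Returns the number of
   decimal places used, or NOT_COMPRESSED if the series does not fit. */
int compress_series(const double *x, int n, int32_t *first, int16_t *dif)
{
    double scale = 1.0;
    int places, i;

    for (places = 0; places <= MAX_PLACES; places++, scale *= 10.0) {
        long long prev = 0;   /* last non-zero scaled value */
        int ok = 1;

        for (i = 0; i < n && ok; i++) {
            double v = x[i] * scale;
            long long r;
            double err;

            if (v > 1e15 || v < -1e15) { ok = 0; break; }  /* far too large */
            r = nearest(v);
            err = v - (double)r;

            if (err > 1e-6 || err < -1e-6) {
                ok = 0;                        /* not all integers at this scale */
            } else if (i == 0) {
                if (r > INT32_MAX || r < INT32_MIN) ok = 0;
                else { *first = (int32_t)r; prev = r; }
            } else if (r == 0) {
                dif[i - 1] = ZERO_MARK;        /* zero: prev is left unchanged */
            } else {
                long long d = r - prev;
                if (d >= ZERO_MARK || d < INT16_MIN) ok = 0;
                else { dif[i - 1] = (int16_t)d; prev = r; }
            }
        }
        if (ok) return places;
    }
    return NOT_COMPRESSED;
}
```

For the series 1.5, 2.25, 0, 2.5 this sketch settles on two decimal places, a first observation of 150, and the differences 75, 32767, 25.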
The data file has the extension “CBK”, and the index file has the extension “CIN”.
The .CIN file contains:
Item | Size in Bytes | C type | Definition |
ns | 2 | int | The number of series. |
nc | 2 | unsigned int | The cumulative number of characters in the series names, counting the null at the end of each series name. |
names | nc | char | The series names. |
If there were three variables in the bank with names tom, dick, and harry, ns would be 3, and the names vector would be
tom0dick0harry0
where 0 represents a null ('\0' in C), and nc would be 15, the number of characters in the names vector, counting the null characters.
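A linear search over the names vector might look like the sketch below. The function name find_series is hypothetical, and the assumption that the i-th name corresponds to the i-th series in the data file is an inference from the example, not something stated here.

```c
#include <string.h>

/* Return the 0-based index of `want` in a names vector of nc bytes
   (null-terminated names stored back to back), or -1 if it is absent. */
int find_series(const char *names, unsigned nc, const char *want)
{
    unsigned pos = 0;
    int i = 0;

    while (pos < nc) {
        if (strcmp(names + pos, want) == 0)
            return i;
        pos += (unsigned)strlen(names + pos) + 1;  /* skip the name and its null */
        i++;
    }
    return -1;
}
```

With the names vector from the example, find_series(names, 15, "dick") returns 1.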
The .CBK file contains:
Item | Size in Bytes | C type | Definition |
title | 80 | char | The bank title. |
ns | 2 | int | The number of series. |
position | 4 | unsigned long | The byte number at which this “indx” array begins. |
series 1 | variable | see below | |
… | … | | |
series n | variable | see below | |
indx | 4*ns | unsigned long | An array containing the byte numbers at which the series begin in this file. |
To continue the example, suppose that the series “tom” requires 101 bytes, “dick” requires 121 bytes, and “harry” requires 81 bytes. The title, ns, and position occupy the first 80 + 2 + 4 = 86 bytes, so the first series begins at byte 86 (remember that in C a file starts with byte 0), and “indx” is the vector (86, 187, 308). The “position”, which is the byte number at which the “indx” array begins, is 308 + 81 = 389.
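The arithmetic in this example can be checked with a few lines of C. The function name build_indx and the constant HEADER_BYTES are hypothetical; the 86-byte header is the sum of the title, ns, and position fields listed above.

```c
#define HEADER_BYTES 86UL   /* title (80) + ns (2) + position (4) */

/* Fill indx[0..ns-1] with the starting byte of each series, given the size
   in bytes of each series, and return the byte at which indx itself begins. */
unsigned long build_indx(const unsigned long *size, int ns, unsigned long *indx)
{
    unsigned long pos = HEADER_BYTES;
    int i;

    for (i = 0; i < ns; i++) {
        indx[i] = pos;      /* series i starts where the previous one ended */
        pos += size[i];
    }
    return pos;             /* the "position" field */
}
```

Called with the sizes 101, 121, and 81, it fills indx with 86, 187, 308 and returns 389.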
Each compressed series has the format:
Item | Size in Bytes | C type | Definition |
BaseYear | 1 | unsigned char | The year of the first observation, minus 1900. |
FreqPeriod | 1 | unsigned char | 16*frequency+period, where frequency is the number of observations per year (1, 4, or 12) and period is the period of the first observation. (For frequencies above 12, FreqPeriod is set to 255. This value signals that two integers, containing the frequency and the period, have been inserted after this byte.) |
SlashDecplaces | 1 | unsigned char | 16*SlashFactor + number of decimal places. Because SlashFactor is normally 0, this is usually just the number of decimal places. If it is 255, the series has not been compressed. |
NDif | 2 | int | The number of differences, equal to the number of observations minus 1. |
FirstObs | 4 | long | The first observation, as a four-byte integer. |
Differences | 2*NDif | int | The first differences of the series, as 2-byte integers. |
If a zero occurs in the series, it is indicated by 32767 in the differences. The difference that follows applies to the previous non-zero number, not to the zero. This practice was adopted because some banks have series with numerous missing observations that appear as zeroes. Also, some banks consider quarterly series to be monthly series in which only the end-of-quarter months have non-zero values.
SlashFactor is normally 0. The Press program, however, allows the option of dividing a series by a power of 2 to reduce its magnitudes so that it can be compressed; SlashFactor is the power (1, 2, 3, etc.) used for the series. In Press, the default maximum slash factor is 0, so non-zero slash factors are unusual.
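The zero convention is easiest to see in a decoder. The sketch below assumes the header fields have already been read into native integers, and it assumes that a non-zero slash factor is undone by multiplying by 2 raised to that power; the notes do not spell out the restoring step, and the function name is hypothetical.

```c
#include <stdint.h>

#define ZERO_MARK 32767   /* difference value that marks a zero observation */

/* Rebuild the ndif + 1 observations of a compressed series into out[]. */
void decode_series(int32_t first, const int16_t *dif, int ndif,
                   int places, int slash, double *out)
{
    double scale = 1.0;       /* 10^places, undoes the decimal shift */
    double mult = 1.0;        /* 2^slash, ASSUMED inverse of Press's division */
    long long level = first;  /* last non-zero value, still as an integer */
    int i;

    for (i = 0; i < places; i++) scale *= 10.0;
    for (i = 0; i < slash; i++)  mult *= 2.0;

    out[0] = (double)level * mult / scale;
    for (i = 0; i < ndif; i++) {
        if (dif[i] == ZERO_MARK) {
            out[i + 1] = 0.0;     /* zero; level is deliberately not changed */
        } else {
            level += dif[i];      /* applies to the previous non-zero value */
            out[i + 1] = (double)level * mult / scale;
        }
    }
}
```

With a first observation of 150, differences 75, 32767, 25, two decimal places, and a slash factor of 0, the decoded series is 1.5, 2.25, 0, 2.5: the final difference of 25 is added to 225, not to the zero.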
If it was not possible to compress the series, then the format is:
Item | Size in Bytes | C type |
BaseYear | 1 | unsigned char |
FreqPeriod | 1 | unsigned char |
255 | 1 | unsigned char |
nobs | 2 | int |
Observations | 4*nobs | float |
Note that the 255 in the third byte is the signal that the series is not compressed. The next two bytes represent the number of observations, and then the observations follow as 4-byte floating point numbers.
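Adding up the field sizes in the two layouts gives the number of bytes a series occupies. The helper names below are hypothetical, and the sketch ignores the two extra integers that follow a FreqPeriod of 255.

```c
/* Bytes used by a compressed series: three 1-byte fields, the 2-byte NDif,
   the 4-byte first observation, and 2 bytes per difference. */
unsigned long compressed_size(unsigned long ndif)
{
    return 1UL + 1UL + 1UL + 2UL + 4UL + 2UL * ndif;
}

/* Bytes used by an uncompressed series: three 1-byte fields, the 2-byte
   nobs, and 4 bytes per floating-point observation. */
unsigned long uncompressed_size(unsigned long nobs)
{
    return 1UL + 1UL + 1UL + 2UL + 4UL * nobs;
}
```

By this count, the 101-byte series “tom” in the earlier example would hold 46 differences, that is, 47 observations, if it is stored in compressed form.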
The compressed form can represent a series as accurately as can an 18-foot-high graph printed with laser-printer resolution of 300 dots per inch. (All series in the US National Accounts or Industrial Production Indexes compress easily. In the Blue Pages of the Survey of Current Business, however, nearly ten percent fail to compress. In IMF data, the hyperinflation of many third-world countries produces series which fail to compress.)
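One way to read the graph comparison, offered as an interpretation rather than something these notes state, is that an 18-foot graph at 300 dots per inch has roughly as many vertical dot positions as a 2-byte integer has distinct values:

```latex
18 \times 12 \times 300 = 64{,}800 \approx 2^{16} = 65{,}536
```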
Hashed banks differ from compressed banks mainly in the organization of their index files. With standard and compressed banks, G7 keeps the names of the series in memory and simply does a linear search for a name each time one is requested. In the hashed banks, the names instead are grouped into bins on the basis of a number calculated from the letters of the name. When a name is requested, G7 calculates the number, locates the bin in which the name has been stored, reads in the names in that bin, and does a linear search over only those names to find the desired series. The size of compressed banks is limited by the requirement that the total number of characters in the names of all series must be less than 64,000. In practice, that limit typically translates to about six or seven thousand series. Hashed banks, in contrast, can go up to several million series. The data file has the extension “HBK” and the extension for the index is “HIN”.
The precise forms of the hashed bank .HIN and .HBK files are presented below. The “.HIN” file contains:
Item | Size in Bytes | C type | Definition |
ns | 4 | long | The number of series in the bank. |
nbins | 2 | unsigned | The number of bins in the bank. |
nsb | 2*nbins | unsigned | An array containing the number of series in each bin. |
ncharb | 2*nbins | unsigned | The total number of characters in the series names (including each terminating '\0') contained in each bin. |
posbin | 4*nbins | unsigned long | The beginning positions in the “.HIN” file of the first bytes of the binname() strings. |
binname(0) | ncharb[0] | char | The string binname(i) denotes the concatenation (including the terminating '\0' characters) of all the series names in the i-th bin. |
binposts(0) | 4*nsb[0] | long | binposts(i) is an array of beginning positions in the associated “.HBK” file of the series in the i-th bin. |
binname(1) | ncharb[1] | char | |
binposts(1) | 4*nsb[1] | long | |
binname(2) | ncharb[2] | char | |
binposts(2) | 4*nsb[2] | long | |
… | | | |
binname(nbins-1) | ncharb[nbins-1] | char | |
binposts(nbins-1) | 4*nsb[nbins-1] | long | |
The series are separated into nbins “bins”, where the number of series in each bin is recorded in the nsb array. The ordering of the series in the binname() and binposts() arrays must be the same.
Consider an example. Suppose that bin 3 contains the series “joe”, “dave”, and “bill”. The string binname(3) would be
"joe\0dave\0bill\0"
Suppose that the starting positions in the “.HBK” bank for the three series are 40700008, 490987, and 3378294. The array binposts(3) then would be [40700008, 490987, 3378294], with nsb[3] = 3, and with ncharb[3] = 14. If the beginning position of binname(3) in the “.HIN” file is 4724, then posbin[3] = 4724.
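Once the right bin has been read into memory, finding a series is a linear search over binname() with a parallel lookup in binposts(). The sketch below assumes both arrays are already in memory as native C values; the function name find_in_bin is hypothetical.

```c
#include <string.h>

/* Search one bin for `want`.  binname holds nsb null-terminated names back
   to back, and binposts holds the matching .HBK positions in the same order.
   Returns the series position, or -1 if the name is not in this bin. */
long find_in_bin(const char *binname, unsigned nsb,
                 const long *binposts, const char *want)
{
    unsigned i;

    for (i = 0; i < nsb; i++) {
        if (strcmp(binname, want) == 0)
            return binposts[i];
        binname += strlen(binname) + 1;   /* step past the name and its '\0' */
    }
    return -1L;
}
```

With the bin from the example, a search for "dave" returns 490987.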
To assign a bin number to a series you must use the following hashing routine. In C, the routine is:
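extern unsigned nbins;   /* number of bins, as stored in the .HIN file */

unsigned hash(char *s)
{
    unsigned bill;
    /* Multiply the running value by 31 and add each character in turn. */
    for (bill = 0; *s != '\0'; s++)
        bill = *s + 31*bill;
    /* Reduce the result to a bin number between 0 and nbins-1. */
    return bill % nbins;
}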
To continue with the example, to determine the bin to which the series “joe” belongs, evaluate hash("joe").
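One portability point deserves care. The tables in these notes give “unsigned” a size of 2 bytes, which suggests that the routine was written for a compiler with 16-bit unsigned arithmetic. If that is so, the running value wraps around at 65536, and a compiler with a wider unsigned type would assign different bins to all but the shortest names. The sketch below forces 16-bit wraparound explicitly. The name hash16, passing nbins as a parameter, and the cast to unsigned char are choices made for this illustration, not part of the original routine.

```c
#include <stdint.h>

/* The hashing routine with the 16-bit wraparound made explicit.
   ASSUMPTION: the original was compiled with a 16-bit unsigned type. */
unsigned hash16(const char *s, unsigned nbins)
{
    uint16_t bill = 0;

    for (; *s != '\0'; s++)
        bill = (uint16_t)((unsigned char)*s + 31u * bill);  /* keep low 16 bits */
    return bill % nbins;
}
```

For “joe” the running value reaches 105408, which wraps to 39872 in 16 bits. With a hypothetical nbins of 1000, hash16("joe", 1000) is 872, whereas the same loop on a 32-bit unsigned would give 408.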
The .HBK file layout is:
Byte | Type | Description |
0 - 79 | char | Name of bank (terminated with a null) |
80 - 81 | int | ns, number of series in the bank |
82 - 85 | long | psn, position in file of index |
86 - | first series, as described below | |
*(psn+1) - | second series, | |
… | … | |
psn | long | position in file of first byte of first series |
psn+4 | long | position in file of first byte of second series |
… | … on out to ns series | |
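Reading the header and the index from a file image might be sketched as follows. The byte order is not stated in these notes; the code assumes little-endian storage, which is a guess. The struct and function names are hypothetical.

```c
#include <string.h>

struct bank_header {
    char title[81];       /* 80 bytes from the file plus a safety null */
    unsigned ns;          /* number of series */
    unsigned long psn;    /* position of the index */
};

/* ASSUMPTION: multi-byte fields are stored least significant byte first. */
static unsigned get16(const unsigned char *p)
{
    return (unsigned)p[0] | ((unsigned)p[1] << 8);
}

static unsigned long get32(const unsigned char *p)
{
    return (unsigned long)p[0] | ((unsigned long)p[1] << 8) |
           ((unsigned long)p[2] << 16) | ((unsigned long)p[3] << 24);
}

/* Decode bytes 0-85 of an in-memory copy of the .HBK file. */
void parse_bank_header(const unsigned char *file, struct bank_header *h)
{
    memcpy(h->title, file, 80);
    h->title[80] = '\0';
    h->ns  = get16(file + 80);
    h->psn = get32(file + 82);
}

/* Starting byte of series k (0-based), read from the index at psn. */
unsigned long series_start(const unsigned char *file,
                           const struct bank_header *h, unsigned k)
{
    return get32(file + h->psn + 4UL * k);
}
```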
For each series, the format is:
Byte | Content |
0 | base year |
1 | frequency*16+period |
2 | slash*16+maxplaces or 255 if not compressed |
3-4 | number of observations |
5-8 | first observation as a long |
9 - | differences as integers |
If the series is not compressed, then floating-point data begin in byte 5.
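The first five bytes of a series can be unpacked as in the sketch below. It assumes little-endian storage for bytes 3-4, takes the base year to be stored as the year minus 1900 as in the compressed-bank table, and does not handle the extended form signalled by a frequency byte of 255. The count in bytes 3-4 is returned uninterpreted, since the two tables in these notes describe it differently (number of differences in one, number of observations in the other). All names are hypothetical.

```c
struct series_header {
    int year;              /* year of the first observation */
    int frequency;         /* observations per year */
    int period;            /* period of the first observation */
    int compressed;        /* 0 if byte 2 is 255 */
    int slash;             /* slash factor (0 if not compressed) */
    int places;            /* decimal places (0 if not compressed) */
    unsigned count;        /* bytes 3-4, uninterpreted */
    unsigned data_offset;  /* where the differences or the floats begin */
};

/* Unpack the header at p.  Returns 0 on success, or -1 for the extended
   frequency form, which this sketch does not handle. */
int parse_series_header(const unsigned char *p, struct series_header *h)
{
    if (p[1] == 255)
        return -1;
    h->year = 1900 + p[0];
    h->frequency = p[1] / 16;
    h->period = p[1] % 16;
    h->compressed = (p[2] != 255);
    h->slash = h->compressed ? p[2] / 16 : 0;
    h->places = h->compressed ? p[2] % 16 : 0;
    h->count = (unsigned)p[3] | ((unsigned)p[4] << 8);  /* ASSUMED little-endian */
    h->data_offset = h->compressed ? 9u : 5u;
    return 0;
}
```

For a compressed series the first observation occupies bytes 5-8, so data_offset points at the differences; for an uncompressed series it points at the first floating-point value.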