Notes for Programmers

As indicated earlier, Banker can make both hashed and compressed G banks. In the compressed bank, each series has its own starting date and number of observations and most series have been compressed. Compression involves these steps:

Find and record the number of decimal places in the series.
Shift the decimal point to the right until every observation is an integer.
Record the first observation as a 4-byte integer.
Record the first differences of the series as 2-byte integers.
If a series cannot accurately be recorded in this compressed form, it is declared to have 255 decimal places (just a flag), and the observations are recorded as four-byte floating point numbers.
Missing observations are marked with a special code.
The data file has the extension “CBK”, and the index file has the extension “CIN”.
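The steps above can be sketched in C as follows. This is a minimal illustration, not Banker's actual code: the names compress_series and to_scaled_int are hypothetical, the number of decimal places is assumed to be known already, and zeros and missing observations are not treated specially. The value 32767 is kept out of the differences because the format reserves it as a marker.

```c
#include <stddef.h>
#include <stdint.h>

#define ZERO_MARKER 32767   /* reserved by the format, so never used as a difference */

/* Hypothetical helper: multiply v by 10^places and round to the nearest
   integer. Sets *ok to 0 if the result does not fit a 4-byte integer. */
static int32_t to_scaled_int(double v, int places, int *ok)
{
    double scaled = v;
    int p;
    for (p = 0; p < places; p++) scaled *= 10.0;
    if (!(scaled <= 2147483646.0 && scaled >= -2147483646.0)) {
        *ok = 0;
        return 0;
    }
    return (int32_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Hypothetical sketch: store the first observation as a 4-byte integer and
   the n-1 first differences as 2-byte integers. Returns 1 on success, or 0
   if the series would have to be stored uncompressed instead. */
int compress_series(const double *x, size_t n, int places,
                    int32_t *first, int16_t *diffs)
{
    int ok = 1;
    int32_t prev, cur;
    size_t i;

    if (n == 0) return 0;
    prev = to_scaled_int(x[0], places, &ok);
    if (!ok) return 0;
    *first = prev;

    for (i = 1; i < n; i++) {
        int64_t d;
        cur = to_scaled_int(x[i], places, &ok);
        if (!ok) return 0;
        d = (int64_t)cur - prev;
        if (d < -32768 || d >= ZERO_MARKER) return 0;  /* too big for 2 bytes */
        diffs[i - 1] = (int16_t)d;
        prev = cur;
    }
    return 1;
}
```

For the series 1.5, 1.7, 1.6 with one decimal place, this sketch stores 15 as the first observation and 2 and -1 as the differences.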

The .CIN file contains:

Item	Size in Bytes	C type	Definition
ns	2	int	The number of series.
nc	2	unsigned int	The cumulative number of characters in the series names, counting the null at the end of each series name.
names	nc	char	The series names.

If there were three variables in the bank with names tom, dick, and harry, ns would be 3, and the names vector would be

tom0dick0harry0

where 0 represents a null ('\0' in C), and nc would be 15, the number of characters in the names vector, counting the null characters.
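A linear search over the names vector might look like the sketch below. The function name find_series is hypothetical, and the code assumes that the names vector has already been read into memory and that every name in it is null-terminated.

```c
#include <string.h>

/* Hypothetical helper: return the 0-based index of the series `want` in the
   packed names vector, or -1 if it is not present.
   names holds nc bytes: ns names, each followed by '\0'. */
int find_series(const char *names, unsigned nc, int ns, const char *want)
{
    unsigned pos = 0;
    int i;

    for (i = 0; i < ns && pos < nc; i++) {
        if (strcmp(names + pos, want) == 0)
            return i;
        pos += (unsigned)strlen(names + pos) + 1;  /* skip the name and its null */
    }
    return -1;
}
```

With the names vector from the example, a search for dick returns 1 and a search for harry returns 2.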

The .CBK file contains:

Item	Size in Bytes	C type	Definition
title	80	char	The bank title.
ns	2	int	The number of series
position	4	unsigned long	The byte number at which this “indx” array begins.
series 1	variable	see below
…		…
series n	variable	see below
indx	4*ns	unsigned long	An array containing the byte numbers at which the series begin in this file.

To continue the example, suppose that the series “tom” requires 101 bytes, “dick” requires 121 bytes, and “harry” requires 81. The title, ns, and position occupy bytes 0 through 85 (remember that in C a file starts with byte 0), so the first series begins at byte 86. Then “indx” is the vector (86, 187, 308), and “position”, the byte number at which the “indx” array begins, is 389.
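The arithmetic in this example can be written as a short routine. The names build_index and CBK_HEADER_BYTES are hypothetical; the sketch assumes only the layout in the table above, in which the series are stored back to back after an 86-byte header.

```c
#include <stddef.h>

#define CBK_HEADER_BYTES 86UL   /* 80 (title) + 2 (ns) + 4 (position) */

/* Hypothetical helper: given the size in bytes of each series, fill indx[]
   with the byte number at which each series begins and return the byte
   number at which the indx array itself begins ("position"). */
unsigned long build_index(const unsigned long *sizes, size_t ns,
                          unsigned long *indx)
{
    unsigned long offset = CBK_HEADER_BYTES;
    size_t i;

    for (i = 0; i < ns; i++) {
        indx[i] = offset;      /* this series starts where the last one ended */
        offset += sizes[i];
    }
    return offset;
}
```

Called with the sizes 101, 121, and 81, it fills indx with 86, 187, and 308 and returns 389.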

Each compressed series has the format:

Item	Size in Bytes	C Type	Definition
BaseYear	1	unsigned char	The year of the first observation, minus 1900.
FreqPeriod	1	unsigned char	16*frequency+period, where frequency is the number of observations per year (1, 4, or 12) and period is the period of the first observation. (For frequencies above 12, set FreqPeriod = 255. This value signals that two integers, containing the frequency and the period, have been inserted after this byte.)
SlashDecplaces	1	unsigned char	16*SlashFactor + Number of Decimal places, where SlashDecplaces usually is just the number of decimal places. If it is 255, the series has not been compressed.
NDif	2	int	The number of differences, equal to the Number of observations - 1.
FirstObs	4	long	The first observation, as a four-byte integer.
Differences	2*NDif	int	The first differences of the series, as 2-byte integers.

If a zero occurs in the series, it is indicated by 32767 in the differences. The difference that follows applies to the previous non-zero number, not to the zero. This practice was adopted because some banks have series with numerous missing observations that appear as zeroes. Also, some banks treat quarterly series as monthly series in which only the end-of-quarter months have non-zero values.

Note that SlashFactor normally is 0. The Press program, however, offers the option of dividing a series by a power of 2 to reduce its magnitudes so that it can be compressed. The SlashFactor is the power (1, 2, 3, etc.) used on this series. In Press, the default maximum slash factor is 0, so non-zero slash factors are unusual.
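The rule for zeros is easiest to see in a decoding routine. The sketch below is an illustration under stated assumptions, not the G7 source: the name expand_series is hypothetical, the SlashFactor is assumed to be 0, and the first observation is assumed to be non-zero.

```c
#include <stddef.h>
#include <stdint.h>

#define ZERO_MARKER 32767   /* a difference of 32767 means "this observation is zero" */

/* Hypothetical sketch: rebuild ndif+1 observations from the first observation
   and the stored differences. out must have room for ndif+1 values. */
void expand_series(int32_t first, const int16_t *diffs, size_t ndif,
                   int places, double *out)
{
    double scale = 1.0;
    int32_t level = first;   /* last non-zero value, still in integer form */
    size_t i;
    int p;

    for (p = 0; p < places; p++) scale *= 10.0;

    out[0] = level / scale;
    for (i = 0; i < ndif; i++) {
        if (diffs[i] == ZERO_MARKER) {
            out[i + 1] = 0.0;            /* zero; level is left unchanged */
        } else {
            level += diffs[i];           /* applies to the last non-zero value */
            out[i + 1] = level / scale;
        }
    }
}
```

With a first observation of 100, no decimal places, and differences 5, 32767, 32767, 3, the decoded series is 100, 105, 0, 0, 108. The final difference of 3 is added to 105, not to the zeros.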

If it was not possible to compress the series, then the format is:

Item	Size in Bytes	C type
BaseYear	1	unsigned char
FreqPeriod	1	unsigned char
255	1	unsigned char
nobs	2	int
Observations	4*nobs	float

Note that the 255 in the third byte is the signal that the series is not compressed. The next two bytes represent the number of observations, and then the observations follow as 4-byte floating point numbers.
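The first three bytes of a series can be unpacked as in the sketch below. The struct and function names are hypothetical. The sketch only reports that FreqPeriod is 255; it does not read the two extra integers that follow in that case, because their size is not given here.

```c
/* Hypothetical container for the decoded header bytes. */
struct series_header {
    int year;          /* calendar year of the first observation */
    int frequency;     /* observations per year, valid if !extended_freq */
    int period;        /* period of the first observation, valid if !extended_freq */
    int extended_freq; /* 1 if FreqPeriod == 255 (frequency above 12) */
    int compressed;    /* 0 if the third byte is 255 */
    int slash;         /* SlashFactor, valid if compressed */
    int places;        /* number of decimal places, valid if compressed */
};

/* Hypothetical helper: decode BaseYear, FreqPeriod, and SlashDecplaces. */
void decode_header(const unsigned char b[3], struct series_header *h)
{
    h->year = 1900 + b[0];

    h->extended_freq = (b[1] == 255);
    h->frequency = h->extended_freq ? 0 : b[1] / 16;
    h->period    = h->extended_freq ? 0 : b[1] % 16;

    h->compressed = (b[2] != 255);
    h->slash  = h->compressed ? b[2] / 16 : 0;
    h->places = h->compressed ? b[2] % 16 : 0;
}
```

For example, the bytes 80, 66, 3 describe a compressed quarterly series that starts in the second quarter of 1980 and has three decimal places, since 66 = 16*4 + 2.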

The compressed form can represent a series as accurately as can an 18-foot-high graph printed with laser-printer resolution of 300 dots per inch: 18 feet at 300 dots per inch is 64,800 dots, close to the 65,536 values that a 2-byte integer can take. (All series in the US National Accounts or Industrial Production Indexes compress easily. In the Blue Pages of the Survey of Current Business, however, nearly ten percent fail to compress. In IMF data, the hyperinflation of many third-world countries produces series which fail to compress.)

Hashed banks differ from compressed banks mainly in the organization of their index files. With standard and compressed banks, G7 keeps the names of the series in memory and simply does a linear search for a name each time one is requested. In the hashed banks, the names instead are grouped into bins on the basis of a number calculated from the letters of the name. When a name is requested, G7 calculates the number, locates the bin in which the name has been stored, reads in the names in that bin, and does a linear search over only those names to find the desired series. The size of compressed banks is limited by the requirement that the total number of characters in the names of all series must be less than 64,000. In practice, that limit typically translates to about six or seven thousand series. Hashed banks, in contrast, can go up to several million series. The data file has the extension “HBK” and the extension for the index is “HIN”.

The precise formats of the hashed bank .HIN and .HBK files are given below. The “.HIN” file contains:

Item	Size in Bytes	C type	Definition
ns	4	long	The number of series in the bank.
nbins	2	unsigned	The number of bins in the bank.
nsb	2*nbins	unsigned	An array to contain the number of series in each bin.
ncharb	2*nbins	unsigned	The cumulative number of characters of the series names (including each terminating '\0') contained in each bin.
posbin	4*nbins	unsigned	The beginning positions in the “.HIN” file of the first bytes of the binname() strings.
binname(0)	ncharb[0]	char	The string binname(i) denotes the concatenation (including the '\0' terminators) of all the series names in the i-th bin.
binposts(0)	4*nsb[0]	long	binposts(i) is an array of beginning positions in the associated “.HBK” file of the series in the i-th bin.
binname(1)	ncharb[1]	char
binposts(1)	4*nsb[1]	long
binname(2)	ncharb[2]	char
binposts(2)	4*nsb[2]	long
…
binname(nbins-1)	ncharb[nbins-1]	char
binposts(nbins-1)	4*nsb[nbins-1]	long

The series are separated into nbins “bins”, where the number contained in each bin is recorded in the nsb array. Of course, the ordering of the series in the binname() and binposts() arrays must be the same.

Consider an example. Suppose that bin 3 contains the series “joe”, “dave”, and “bill”. The string binname(3) would be

"joe\0dave\0bill\0"

Suppose that the starting positions in the “.HBK” bank for the three series are 40700008, 490987, and 3378294. The array binposts(3) then would be [40700008, 490987, 3378294], with nsb[3] = 3, and with ncharb[3] = 14. If the beginning position of binname(3) in the “.HIN” file is 4724, then posbin[3] = 4724.
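Once a bin has been read into memory, finding a series is a linear search over that bin alone. The sketch below illustrates this; the name find_in_bin is hypothetical, and the code assumes that binname(i) and binposts(i) have already been read from the “.HIN” file.

```c
#include <string.h>

/* Hypothetical helper: search one bin for the series `want`.
   binname  : the packed, null-separated names of the bin
   nsb      : the number of series in the bin
   binposts : the .HBK starting positions, in the same order as the names
   Returns the starting position in the .HBK file, or -1 if not found. */
long find_in_bin(const char *binname, unsigned nsb, const long *binposts,
                 const char *want)
{
    const char *p = binname;
    unsigned i;

    for (i = 0; i < nsb; i++) {
        if (strcmp(p, want) == 0)
            return binposts[i];
        p += strlen(p) + 1;      /* move past this name and its null */
    }
    return -1L;
}
```

With the bin from the example, a search for dave returns 490987.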

To assign a bin number to a series you must use the following hashing routine. In C, the routine is:

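/* nbins is the number of bins recorded in the .HIN file. */
extern unsigned nbins;

/* Return the bin number (0 to nbins-1) for the series name s. */
unsigned hash(char *s)
{
   unsigned bill;
   for (bill = 0; *s != '\0'; s++) bill = *s + 31*bill;
   bill = bill % nbins;
   return bill;
}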

To continue with the example, to determine the bin to which the series “joe” belongs, evaluate hash("joe").
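One portability point deserves care. The tables above give “unsigned” a size of 2 bytes, which suggests, although the text does not say so, that the routine was written for a compiler with a 16-bit unsigned type. On such a compiler the running sum wraps around at 65,536, while on a compiler with a 32-bit unsigned type it does not, so the two can place the same name in different bins. The sketch below forces the 16-bit behavior. The name hash16 and the nbins parameter are hypothetical, and whether existing banks were built with 16-bit arithmetic is an assumption that should be checked against a real bank.

```c
#include <stdint.h>

/* Hypothetical variant of the hashing routine that accumulates in 16 bits,
   ASSUMING the original was compiled with a 16-bit unsigned type.
   nbins is passed as a parameter instead of being read from a global. */
unsigned hash16(const char *s, unsigned nbins)
{
    uint16_t h = 0;

    for (; *s != '\0'; s++)
        h = (uint16_t)((unsigned char)*s + 31u * h);  /* wraps modulo 65536 */
    return h % nbins;
}
```

For “joe” the unwrapped sum is 105408. With 1000 bins, 16-bit arithmetic gives bin 872 (105408 wraps to 39872), whereas 32-bit arithmetic would give bin 408.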

The .HBK file layout is:

Byte	Type	Description
0 - 79	char	Name of bank (terminated with a null)
80 - 81	int	ns, number of series in the bank
82 - 85	long	psn, position in file of index
86 -		first series, as described below
(position stored at psn+4) -		second series, and so on
…		…
psn	long	position in file of first byte of first series
psn+4	long	position in file of first byte of second series
…		… on out to ns series
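Reading this layout from a buffer might be sketched as follows. The function and struct names are hypothetical. The sketch assumes that multi-byte integers are stored low byte first (little-endian), which the text does not state, and that the whole file has been read into memory. It stores ns as an unsigned value, which reads the same as the int in the table for any non-negative count.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: integers in the file are little-endian. */
static uint16_t read_u16le(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_u32le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical container for the first 86 bytes of the bank file. */
struct bank_header {
    char title[81];      /* 80 bytes from the file plus a safety null */
    unsigned ns;         /* number of series */
    unsigned long psn;   /* byte number at which the index begins */
};

/* Returns 1 on success, 0 if the buffer is too short to hold a header. */
int parse_bank_header(const unsigned char *buf, size_t len,
                      struct bank_header *h)
{
    if (len < 86) return 0;
    memcpy(h->title, buf, 80);
    h->title[80] = '\0';
    h->ns = read_u16le(buf + 80);
    h->psn = read_u32le(buf + 82);
    return 1;
}

/* Look up the starting byte of series i (0-based) in the index.
   Returns 1 on success, 0 if i or the index lies outside the buffer. */
int series_start(const unsigned char *buf, size_t len,
                 const struct bank_header *h, unsigned i,
                 unsigned long *start)
{
    size_t off = (size_t)h->psn + 4u * (size_t)i;

    if (i >= h->ns || off + 4 > len) return 0;
    *start = read_u32le(buf + off);
    return 1;
}
```

In a bank laid out like the earlier three-series example, series_start for i = 2 would read the value 308 from byte 397, the third entry of an index that begins at byte 389.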

For each series, the format is:

Byte	Content
0	base year
1	frequency*16+period
2	slash*16+maxplaces or 255 if not compressed
3-4	number of differences (number of observations if not compressed)
5-8	first observation as a long
9 -	differences as integers

If the series is not compressed, then floating-point data begin in byte 5.
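From these offsets, the number of bytes that a series occupies follows directly. The sketch below uses the hypothetical name series_bytes. It assumes that the count in bytes 3-4 is the number of differences for a compressed series and the number of observations for an uncompressed one, as in the compressed-bank tables earlier, and that the frequency byte is not 255, so that no extra integers are present.

```c
/* Hypothetical helper: total size in bytes of one stored series.
   third_byte : byte 2 of the series (255 means not compressed)
   count      : the 2-byte count stored in bytes 3-4 */
unsigned long series_bytes(unsigned char third_byte, unsigned count)
{
    if (third_byte == 255)
        return 5UL + 4UL * count;   /* 5 header bytes + 4-byte floats */
    return 9UL + 2UL * count;       /* 5 header bytes + 4-byte first
                                       observation + 2-byte differences */
}
```

A compressed series with 46 differences, for instance, occupies 9 + 2*46 = 101 bytes.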
