Pooled Time Series and Cross-Sectional Data
Economic datasets come in a variety of forms. Cross-sectional, time series, and panel data are the most commonly used kinds. A cross-sectional dataset consists of a sample of individuals, households, firms, cities, states, countries, or any other micro- or macroeconomic unit taken at a given point in time. Sometimes the data on all units do not correspond to precisely the same time period. In a pure cross-sectional analysis, such minor time differences in data collection are ignored. Figure 1 illustrates the relationship between cross-sectional data on the price of houses sold within a two-week period and the houses’ size. The basic model is $y_i = c + x_i\beta + \varepsilon_i$ ($i = 1, 2, \ldots, N$), where $y_i$ is the dependent variable and $x_i$ is a $1 \times K$ vector of explanatory variables. Its stochastic errors $\varepsilon_i$ often satisfy $E(\varepsilon_i \mid x_i) = 0$ but have variances that differ across units (heteroskedasticity), for example $\operatorname{Var}(\varepsilon_i \mid x_i) = \sigma_i^2$. Ordinary least squares (OLS) estimates that ignore this heterogeneity across cross sections are unbiased but inefficient. Efficiency can be attained from generalized least squares (GLS) estimation. In economics, the analysis of cross-sectional data is closely associated with applied microeconomics fields such as labor economics, public finance, industrial organization, urban economics, demography, and health economics. Data on individuals, households, firms, and cities at a given point in time are important for testing microeconomic hypotheses and evaluating economic policies.
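The contrast between OLS and GLS in the cross-sectional model can be sketched for the single-regressor case. The following is a minimal illustration, not a definitive implementation: the house sizes and prices are hypothetical, and the assumption that the error standard deviation is proportional to house size is made purely for the example. Under that assumption, GLS reduces to weighted least squares with weights equal to the inverse error variances.

```python
# Sketch: OLS versus weighted (generalized) least squares for y = c + beta * x + error.
# The data and the variance pattern below are hypothetical, chosen only for illustration.

def weighted_least_squares(x, y, weights):
    """Return (c, beta) minimizing sum of weights[i] * (y[i] - c - beta * x[i]) ** 2."""
    total = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    s_xy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    s_xx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    beta = s_xy / s_xx
    return y_bar - beta * x_bar, beta


def ordinary_least_squares(x, y):
    """OLS is the special case in which every observation gets the same weight."""
    return weighted_least_squares(x, y, [1.0] * len(x))


# Hypothetical house sizes (hundreds of square feet) and prices (thousands of dollars).
size = [10.0, 12.0, 15.0, 18.0, 22.0, 30.0]
price = [105.0, 118.0, 160.0, 170.0, 240.0, 280.0]

# Assumption: the error standard deviation is proportional to size,
# so the GLS weight for each house is 1 / size ** 2.
gls_weights = [1.0 / s ** 2 for s in size]

print("OLS (c, beta):", ordinary_least_squares(size, price))
print("GLS (c, beta):", weighted_least_squares(size, price, gls_weights))
```

In practice the error variances are unknown and must themselves be estimated (feasible GLS), so the weights used here should be read as a stand-in for that step.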
A time series dataset contains information on a variable or a set of variables over time. Examples of time series data include stock prices, money supply, the consumer price index, gross domestic product (GDP), annual homicide rates, and automobile sales figures. Figure 2 illustrates the time series for NASDAQ on July 7, 2006. Since past events influence the future and lags in behavior are prevalent in the social sciences, time is an important dimension in time series datasets. Unlike the arrangement of cross-sectional data, chronology is crucial in time series datasets. Time series observations are hard to analyze mainly because of the interdependency of observations over time. The basic model is $y_t = c + x_t\beta + \varepsilon_t$ ($t = 1, 2, \ldots, T$), where $y_t$ is the dependent variable and $x_t$ is a $1 \times K$ vector of explanatory variables. Its stochastic errors $\varepsilon_t$ satisfy $E(\varepsilon_t \mid x_t) = 0$ but are serially correlated: $\varepsilon_t = \rho\varepsilon_{t-1} + u_t$, where $u_t$ satisfies all classical assumptions.
Most economic data are strongly related to their recent histories. For example, information on GDP from the last quarter allows the researcher to make accurate predictions about the likely range of GDP during the current quarter, because GDP tends to remain fairly stable from one quarter to the next. OLS regression estimates that ignore the time dependence of time series data are inefficient, and their conventional standard errors are unreliable. A transformation of the model is required to produce GLS estimates that are efficient. Several modifications and embellishments to standard econometric techniques have been developed to account for and exploit the dependent nature of economic time series and to address other issues, such as the fact that some economic variables tend to display clear trends over time.
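One common form of the model transformation is quasi-differencing, sketched below for a single regressor and first-order autoregressive errors. The source does not name a specific procedure, so this choice is an assumption: the version shown drops the first observation (in the style of Cochrane-Orcutt), and all function names and data are hypothetical. Subtracting rho times the lagged equation from the current one leaves an equation whose error is the well-behaved innovation, with intercept c(1 - rho) and the original slope.

```python
# Sketch: GLS for y_t = c + beta * x_t + e_t with AR(1) errors e_t = rho * e_{t-1} + u_t.
# Quasi-differencing gives (y_t - rho * y_{t-1}) = c * (1 - rho) + beta * (x_t - rho * x_{t-1}) + u_t.

def ols(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta = s_xy / s_xx
    return y_bar - beta * x_bar, beta


def estimate_rho(residuals):
    """Estimate rho by regressing each residual on its lag (no intercept)."""
    num = sum(residuals[t] * residuals[t - 1] for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals[:-1])
    return num / den


def quasi_difference(series, rho):
    """Return series[t] - rho * series[t - 1]; the first observation is dropped."""
    return [series[t] - rho * series[t - 1] for t in range(1, len(series))]


def gls_ar1(x, y, rho):
    """Return (c, beta) from OLS on the quasi-differenced data, for a given rho."""
    c_star, beta = ols(quasi_difference(x, rho), quasi_difference(y, rho))
    return c_star / (1.0 - rho), beta  # undo the (1 - rho) scaling of the intercept


# Hypothetical series built with c = 1, beta = 2, rho = 0.5 and errors 4, 2, 1, 0.5, 0.25.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [7.0, 7.0, 10.0, 15.5, 23.25]

print("OLS (c, beta):", ols(x, y))
print("GLS with rho = 0.5 (c, beta):", gls_ar1(x, y, 0.5))

# When rho is unknown it can be estimated from the OLS residuals (feasible GLS);
# with only five observations this estimate is very noisy.
c_ols, b_ols = ols(x, y)
print("rho estimated from OLS residuals:",
      estimate_rho([yi - c_ols - b_ols * xi for xi, yi in zip(x, y)]))
```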
Another feature of time series data that can require special attention is the frequency at which the data are collected. In economics, the most common frequencies are daily, weekly, monthly, quarterly, and annual. Stock prices are recorded daily (excluding Saturday and Sunday). The money supply in the U.S. economy is reported weekly. Many macroeconomic series, such as inflation and unemployment rates, are tabulated monthly. The gross domestic product is a quarterly series. Other time series, such as the infant mortality rates in the United States, are available only on an annual basis. Many weekly, monthly, and quarterly economic data display strong seasonal patterns. For example, monthly data on crop yield differ across months simply due to changes in weather conditions. Hence, before analyzing time series data, it is important to deseasonalize the data, that is, to remove the seasonal patterns.
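A very simple way to deseasonalize a series is to subtract the average for each season and add back the overall average, which is equivalent to removing the fitted values of a regression on seasonal dummy variables. The sketch below assumes a purely additive, stable seasonal pattern; the quarterly figures are hypothetical, and official seasonal adjustment procedures are considerably more elaborate.

```python
# Sketch: remove an additive seasonal pattern by subtracting season means.

def deseasonalize(values, period):
    """Subtract each season's mean from its observations and add back the overall mean.

    `period` is the number of seasons per cycle (4 for quarterly, 12 for monthly data).
    """
    overall_mean = sum(values) / len(values)
    season_means = []
    for season in range(period):
        observations = values[season::period]  # every observation from this season
        season_means.append(sum(observations) / len(observations))
    return [
        value - season_means[index % period] + overall_mean
        for index, value in enumerate(values)
    ]


# Hypothetical quarterly sales with a strong fourth-quarter peak (three years of data).
sales = [80.0, 90.0, 95.0, 140.0,
         84.0, 93.0, 99.0, 146.0,
         88.0, 97.0, 104.0, 151.0]

print([round(v, 2) for v in deseasonalize(sales, 4)])
```

After the adjustment each quarter has the same average, so what remains is the year-to-year movement rather than the within-year pattern.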
A panel or longitudinal dataset varies across both time and cross-sectional units, as seen in Figure 3. The figure shows part of a dataset that consists of the value of sales (sal), payroll (pay), capital expenditure (cap), and cost of pollution abatement (abat) for a set of industries followed over a three-year period. The ordering of the data by microunits first and then by time is typical of all longitudinal datasets. In balanced panels, the number of time periods is the same for every cross-sectional unit. Treatment of unbalanced panels requires further analysis.
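The ordering convention and the balanced-panel condition can be made concrete with a small sketch. The records below are hypothetical stand-ins for rows of such a table (industry, year, value of sales); they are not taken from Figure 3, and the function names are illustrative.

```python
# Sketch: store a panel in "long" form, order it by unit and then time,
# and check whether it is balanced. All values are hypothetical.
from collections import defaultdict


def sort_panel(rows):
    """Order rows by cross-sectional unit first and by time period second."""
    return sorted(rows, key=lambda row: (row[0], row[1]))


def is_balanced(rows):
    """Return True if every unit is observed in exactly the same set of periods."""
    periods_by_unit = defaultdict(set)
    for unit, period, *_ in rows:
        periods_by_unit[unit].add(period)
    all_periods = set().union(*periods_by_unit.values())
    return all(periods == all_periods for periods in periods_by_unit.values())


# Each row: (industry, year, sal)
rows = [
    ("steel", 2002, 410.0), ("paper", 2001, 120.0), ("steel", 2001, 400.0),
    ("paper", 2003, 131.0), ("steel", 2003, 395.0), ("paper", 2002, 125.0),
]

for row in sort_panel(rows):
    print(row)
print("balanced:", is_balanced(rows))
print("balanced after dropping one row:", is_balanced(rows[1:]))
```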
The basic model for the $i$th cross section is $y_{it} = c_i + x_{it}\beta + u_{it}$ ($t = 1, 2, \ldots, T$), where $x_{it}$ is a $1 \times K$ vector of explanatory variables that vary across $i$ or $t$ or both, $c_i$ represents cross-sectional heterogeneity, and $u_{it}$ is a stochastic error. The conditional mean of the disturbances is assumed to be zero. In traditional approaches, the model is random effects (RE) when $c_i$ is a random variable, and fixed effects (FE) when $c_i$ is a fixed parameter to be estimated (Balestra and Nerlove 1966). Yair Mundlak (1978) argued that the unobserved effects $c_i$ should be treated as random draws from the population along with $y_{it}$ and $x_{it}$. In modern econometric language, in an RE model, $c_i$ is assumed to be uncorrelated with $x_{it}$, while in an FE model, arbitrary correlation between $c_i$ and $x_{it}$ is allowed.
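The practical consequence of allowing the unit effect to be correlated with the regressors can be illustrated with the within (fixed effects) estimator, which subtracts each unit's own time averages before running OLS and so removes the unit effect entirely. The sketch below uses a single regressor and hypothetical data in which the unit with the larger effect also has larger values of the regressor; pooled OLS, which ignores the unit effects, then overstates the slope.

```python
# Sketch: pooled OLS versus the within (fixed effects) estimator for
# y_it = c_i + beta * x_it + u_it, with one regressor. Data are hypothetical.

def pooled_ols_slope(panel):
    """Slope from OLS on all observations stacked together, ignoring the unit effects."""
    xs = [x for observations in panel.values() for x, _ in observations]
    ys = [y for observations in panel.values() for _, y in observations]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


def within_estimator(panel):
    """Return (beta, effects): slope from unit-demeaned data and the implied c_i."""
    num = 0.0
    den = 0.0
    unit_means = {}
    for unit, observations in panel.items():
        x_bar = sum(x for x, _ in observations) / len(observations)
        y_bar = sum(y for _, y in observations) / len(observations)
        unit_means[unit] = (x_bar, y_bar)
        for x, y in observations:
            num += (x - x_bar) * (y - y_bar)  # demeaning removes c_i
            den += (x - x_bar) ** 2
    beta = num / den
    effects = {unit: y_bar - beta * x_bar for unit, (x_bar, y_bar) in unit_means.items()}
    return beta, effects


# Two units, three periods each, generated from y = c_i + 2 * x with c_A = 0 and c_B = 10.
panel = {
    "A": [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
    "B": [(4.0, 18.0), (5.0, 20.0), (6.0, 22.0)],
}

print("pooled OLS slope:", pooled_ols_slope(panel))
print("within estimator:", within_estimator(panel))
```

An RE estimator would be more efficient if the unit effects really were uncorrelated with the regressor, but it is not sketched here.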
Two well-known panel datasets are the National Longitudinal Survey of Labor Market Experience (NLS) and the Michigan Panel Study of Income Dynamics (PSID). In these datasets, very large cross sections, consisting of thousands of microunits, are followed through time, but the number of time periods is often small. The PSID is a study of roughly six thousand families and fifteen thousand individuals who have been interviewed periodically from 1968 to the present. Another group of intensely studied panel datasets is that from the income tax experiments of 1970, in which thousands of families were followed for eight to thirteen quarters. For most panels, cross-sectional dependence is strong and time dependence is insignificant. Such panel datasets are wide but short, and heterogeneity across units is often the central focus of the analysis. In two-way error components models, the heterogeneity varies across both time and cross sections, that is, $y_{it} = c_{it} + x_{it}\beta + u_{it}$ or $y_{it} = c_i + \delta_t + x_{it}\beta + u_{it}$. The fundamental advantage of a panel dataset over a cross section is that it allows the researcher greater flexibility in modeling differences in behavior across individuals.
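For the additive two-way specification, with a unit effect plus a period effect, one standard way to estimate the slope in a balanced panel is to subtract unit means and period means from each variable and add back the grand mean, which removes both sets of effects. The following is a sketch under that balanced-panel assumption with one regressor; the data and names are hypothetical.

```python
# Sketch: two-way within estimator for y_it = c_i + delta_t + beta * x_it + u_it.
# Assumes a balanced panel stored as a dict keyed by (unit, period).

def two_way_demean(values):
    """Return values minus unit means minus period means plus the grand mean."""
    units = sorted({unit for unit, _ in values})
    periods = sorted({period for _, period in values})
    grand_mean = sum(values.values()) / len(values)
    unit_mean = {i: sum(values[i, t] for t in periods) / len(periods) for i in units}
    period_mean = {t: sum(values[i, t] for i in units) / len(units) for t in periods}
    return {
        (i, t): value - unit_mean[i] - period_mean[t] + grand_mean
        for (i, t), value in values.items()
    }


def two_way_within_slope(x, y):
    """Slope from OLS (no intercept) on the two-way demeaned x and y."""
    x_tilde = two_way_demean(x)
    y_tilde = two_way_demean(y)
    num = sum(x_tilde[key] * y_tilde[key] for key in x_tilde)
    den = sum(value ** 2 for value in x_tilde.values())
    return num / den


# Hypothetical balanced panel: 2 units, 3 periods, true slope 3.
unit_effect = {0: 0.0, 1: 10.0}
period_effect = {0: 0.0, 1: 1.0, 2: -2.0}
x = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 4.0,
     (1, 0): 2.0, (1, 1): 5.0, (1, 2): 3.0}
y = {(i, t): unit_effect[i] + period_effect[t] + 3.0 * x[i, t] for (i, t) in x}

print("two-way within slope:", two_way_within_slope(x, y))
```

The regressor must vary in a way that is not itself the sum of a unit component and a period component; otherwise the demeaned regressor is identically zero and the slope cannot be estimated.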
SEE ALSO National Longitudinal Survey of Youth; Panel Study of Income Dynamics
Balestra, Pietro, and Marc Nerlove. 1966. Pooling Cross Section and Time Series Data in the Estimation of a Dynamic Model: The Demand for Natural Gas. Econometrica 34 (3): 585–612.
Greene, William H. 2003. Econometric Analysis. 5th ed. Upper Saddle River, NJ: Prentice Hall.
Mundlak, Yair. 1978. On the Pooling of Time Series and Cross Section Data. Econometrica 46 (1): 69–85.
Wooldridge, Jeffrey M. 2006. Introductory Econometrics: A Modern Approach. 3rd ed. Mason, OH: Thomson/SouthWestern.