**Date Published:** June 7, 2018

**Publisher:** Public Library of Science

**Author(s):** Hukum Chandra, Kaustav Aditya, U. C. Sud, Stefano Marchetti.

http://doi.org/10.1371/journal.pone.0198502

**Abstract**

**Poverty affects many people, but its ramifications and impacts affect all aspects of society. Information about the incidence of poverty is therefore an important parameter of the population for policy analysis and decision making. In order to provide specific, targeted solutions when addressing poverty disadvantage, small area statistics are needed. Surveys are typically designed and planned to produce reliable estimates of population characteristics of interest mainly for larger geographic areas, such as the national and state levels. Sample sizes are usually not large enough to provide reliable estimates for disaggregated analysis. In many instances, estimates are required for areas of the population for which the survey providing the data was not planned. Then, for areas with small sample sizes, direct survey estimation of population characteristics based only on the data available from the particular area tends to be unreliable. This paper describes an application of the small area estimation (SAE) approach to improve the precision of estimates of poverty incidence at district level in the State of Bihar in India by linking data from the Household Consumer Expenditure Survey 2011–12 of NSSO and the Population Census 2011. The results show that the district level estimates generated by the SAE method are more precise and representative. In contrast, the direct survey estimates based on survey data alone are less stable.**

**Partial Text**

Bihar is the third most populous state in India. According to the 2011 Population Census, the population of the state is 103 million, which is about 8.58 percent of the total population of the country. Poverty is a very complex issue in Bihar and there is an urgent need to devise a focused strategy for poverty eradication. Reliable, good-quality and timely disaggregate level data are essential for effective planning, implementation and monitoring of various Government schemes in Bihar. Spatially disaggregated data are indispensable for identifying the areas most in need and for developing focused, target-oriented intervention programs. The geographic distribution of poverty and wealth is used to make decisions about resource allocation and provides a foundation for the study of inequality and the determinants of economic growth [1–2]. In developing countries, however, the scarcity of reliable quantitative data represents a major challenge to policy-makers and researchers. In India, National Sample Survey Office (NSSO) surveys are the main source of official statistics. A range of invaluable data at state and national level are generated through these surveys. The state level estimates generated by these surveys often mask local level heterogeneity. More importantly, state level estimates do not adequately capture the extent of geographical inequalities, which restricts the scope for evaluating progress locally within and between administrative units. However, the NSSO survey data cannot be used directly to produce reliable disaggregate level (e.g. district or further disaggregate level) estimates because of small sample sizes. In the survey literature, an area (or domain) is regarded as small if the area-specific (or domain-specific) sample is not large enough to support a direct survey estimator of adequate precision, that is, if the direct estimator has an unacceptably large coefficient of variation [3–4].
At the same time, conducting district-specific surveys would be far from trivial: it would be costly as well as time consuming. An alternative solution to this problem is to use small area estimation (SAE) techniques. The SAE approach produces reliable estimates for small areas with small sample sizes by borrowing strength from the data of other areas. SAE techniques are based on model-based survey estimation methods. The idea is to use statistical models to link the variable of interest with auxiliary information, e.g. Census and administrative data, for the small areas to define model-based estimators for these areas. In other words, the SAE method uses indirect small area estimators that make use of the sample data from related areas or domains through linking models, and hence increases the effective sample size in the small areas. Such estimators can provide significantly smaller coefficients of variation than direct estimators, provided the linking models are valid; see [5]. Recently, some researchers have also used satellite imagery and mobile phone network data to predict poverty. Existing high-resolution daytime satellite imagery has been used to predict the spatial distribution of economic well-being across five African countries, namely Nigeria, Tanzania, Uganda, Malawi, and Rwanda [6]. Anonymized data from mobile phone networks, combined with survey data, have also been used to predict the poverty and wealth of individual subscribers, as well as to create high-resolution maps of the geographic distribution of wealth [7].

This Section describes the basic sources of data, i.e. the survey data and the auxiliary data, used to estimate the poverty incidence at district level. The poverty incidence is defined as the proportion of households with income below the poverty line, also referred to as the head count ratio (HCR). The HCR is a poverty indicator that measures the frequency of households below the poverty line. Two types of variables are required for SAE analysis: the variable of interest and the auxiliary variables. In this study, the variable of interest for which small area estimates are required is drawn from the Household Consumer Expenditure Survey 2011–12 of NSSO for rural areas of the State of Bihar in India. The NSSO survey data are not freely downloadable, but they can be obtained from the NSSO, Ministry of Statistics and Programme Implementation, Government of India (http://mospi.nic.in/). The sampling design used in the NSSO data is stratified multi-stage random sampling with districts as strata, villages as first stage units and households as second stage units. A total of 3312 households were surveyed from the 38 districts of Bihar. The district-wise sample size varied from a minimum of 64 to a maximum of 128, with an average of 87 (Table 1). From Table 1, it is evident that district level sample sizes are very small, with a very low average sampling fraction of 0.00025. Therefore, it is difficult to produce reliable direct estimates of the poverty incidence and their standard errors at district level. Hence, the application of the SAE technique is an obvious choice for obtaining the district level estimates of poverty incidence. The SAE technique is expected to provide reliable estimates for districts with small sample sizes [3–5]. The target variable used for the study is whether a household is poor. The poverty line is used to identify whether a given household is poor or not.
A household with monthly per capita consumer expenditure below the state’s poverty line (Rs 778) is categorised as a poor household. The poverty line used in this study is the one for the year 2011–12 given by the Planning Commission, Government of India (see http://planningcommission.nic.in/news/press_pov2307.pdf).
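
The classification rule above can be sketched in a few lines of Python. This is only an illustration: the household records, district labels and the helper name `is_poor` are hypothetical, and the unweighted share computed at the end ignores the survey weights that the actual analysis uses.

```python
from collections import defaultdict

POVERTY_LINE_RS = 778.0  # state poverty line quoted in the text (Rs per month)

def is_poor(mpce_rs, poverty_line=POVERTY_LINE_RS):
    """Return 1 if monthly per capita expenditure is below the line, else 0."""
    return 1 if mpce_rs < poverty_line else 0

# Hypothetical records: (district label, monthly per capita expenditure in Rs).
households = [("A", 650.0), ("A", 910.5), ("A", 777.9), ("B", 778.0), ("B", 1200.0)]

# Group the binary poverty indicators by district.
flags_by_district = defaultdict(list)
for district, mpce in households:
    flags_by_district[district].append(is_poor(mpce))

# Unweighted sample share of poor households per district (illustration only).
sample_share = {d: sum(f) / len(f) for d, f in flags_by_district.items()}
```

A household exactly at the poverty line is treated as non-poor here, following the "below the poverty line" wording.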

In this Section we describe the theoretical framework used to produce small area estimates of the poverty incidence and their measures of precision. The details presented here follow [12–13]. Let us assume a finite population $U$ of size $N$, from which a sample $s$ of size $n$ is drawn with a given survey design. We assume that this population consists of $D$ small areas or small domains (or simply areas or domains) $U_d$ $(d = 1,\ldots,D)$ such that $U=\bigcup_{d=1}^{D}U_d$ and $N=\sum_{d=1}^{D}N_d$. Throughout, we use a subscript $d$ to index the quantities belonging to small area $d$. The subscripts $s$ and $r$ denote quantities related to the sample and non-sample parts of the population, so that $n_d$ and $N_d$ represent the sample and population sizes (i.e., the number of households in the sample and in the population) in district $d$, respectively. Let $s_d$ denote the part of the sample from area $d$, such that $s=\bigcup_{d=1}^{D}s_d$ and $n=\sum_{d=1}^{D}n_d$. Let $y_{di}$ denote the value of the target variable of interest $y$ for unit $i$ in small area $d$. We assume that $y$ is binary and that the target is the estimation of the population counts $y_d=\sum_{i\in U_d}y_{di}$ or the population proportions $P_d=N_d^{-1}\sum_{i\in U_d}y_{di}$ in area $d$. The direct estimator of the proportion of poor households is defined as $\hat{p}_d^{Direct}=\left(\sum_{i\in s_d}w_{di}\right)^{-1}\sum_{i\in s_d}w_{di}y_{di}$, where $w_{di}$ is the survey weight associated with household $i$ in area $d$. Assuming that the joint inclusion terms $1/w_{di,d'j}=0$ for $d\neq d'$ or $i\neq j$, the estimate of the variance of $\hat{p}_d^{Direct}$ is $v(\hat{p}_d^{Direct})=\left(\sum_{i\in s_d}w_{di}\right)^{-2}\left\{\sum_{i\in s_d}w_{di}(w_{di}-1)(y_{di}-\hat{p}_d^{Direct})^2\right\}$; see for example [14]. Let us denote by $y_{sd}$ and $y_{rd}$ the sample and non-sample counts of poor households in area (or district) $d$. The sample count $y_{sd}$ has a Binomial distribution with parameters $n_d$ and $p_d$, denoted by $y_{sd}\sim \mathrm{Bin}(n_d,p_d)$, where $p_d$ is the probability of a poor household in area $d$, often termed the probability of a ‘success’. Similarly, $y_{rd}\sim \mathrm{Bin}(N_d-n_d,p_d)$.
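
The direct estimator and its variance estimate defined above can be sketched in Python as follows. This is a minimal illustration for a single district; the function name `direct_estimate`, the indicator values and the weights are hypothetical, and the coefficient of variation is added only to show how precision would typically be summarised.

```python
import math

def direct_estimate(y, w):
    """Weighted direct estimate of a proportion and its variance estimate.

    y: 0/1 poverty indicators for the sampled households of one district
    w: survey weights of the same households
    """
    sum_w = sum(w)
    # p_hat = (sum of w_i)^(-1) * sum of w_i * y_i
    p_hat = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    # v = (sum of w_i)^(-2) * sum of w_i * (w_i - 1) * (y_i - p_hat)^2
    var_hat = sum(wi * (wi - 1.0) * (yi - p_hat) ** 2 for wi, yi in zip(w, y)) / sum_w ** 2
    return p_hat, var_hat

# Hypothetical district sample: four households with equal weights.
y = [1, 0, 0, 1]
w = [2.0, 2.0, 2.0, 2.0]
p_hat, var_hat = direct_estimate(y, w)

# Percentage coefficient of variation of the direct estimate.
cv_percent = 100.0 * math.sqrt(var_hat) / p_hat
```

With equal weights the variance estimate reduces to $(w-1)/w$ times $\hat{p}(1-\hat{p})/n$, which gives a quick sanity check for the code.
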
Further, $y_{sd}$ and $y_{rd}$ are assumed to be independent Binomial variables with $p_d$ being a common success probability. Here we assume that only aggregated level data are available for the small area modelling. For example, $y_{sd}$ is available from the survey data and $x_d$, the $p$-vector of covariates, is available from secondary data sources (i.e. Census and administrative records etc.) for area $d$. Following [12–13], the model linking the probabilities of success $p_d$ with the covariates $x_d$ is the logistic linear mixed model given by

$$\mathrm{logit}(p_d)=\ln\left\{\frac{p_d}{1-p_d}\right\}=\eta_d=x_d^{T}\beta+u_d, \tag{1}$$

with $p_d=\exp(x_d^{T}\beta+u_d)\{1+\exp(x_d^{T}\beta+u_d)\}^{-1}=\mathrm{expit}(x_d^{T}\beta+u_d)$. Here $\beta$ is the $p$-vector of regression coefficients, often known as fixed effect parameters, and $u_d$ is the area-specific random effect that captures the between-area heterogeneity. We assume that the $u_d$'s are independent and normally distributed with mean zero and variance $\phi$. Here, we observe that equation (1) relates the area (or district) level proportions (direct estimates) from the survey data to the area (or district) level covariates. This type of model is often referred to as an ‘area-level’ model in SAE terminology, see for example [4, 8]. The area level model was originally proposed by Fay and Herriot [8] for the prediction of mean per-capita income (PCI) in small geographical areas (fewer than 500 persons) within counties in the United States. The Fay-Herriot model [8] is a widely used area level model for the estimation of small area quantities. In many small area applications, when the data are non-linear on the original scale, the Fay-Herriot model is fitted on a transformed scale. For example, some function of the small area direct survey estimates is taken to be linearly related to the area aggregates of auxiliary variables. In the small area income and poverty estimation project of the US Census Bureau, namely SAIPE, the Fay-Herriot model is fitted using the logarithm of direct poverty rate estimates [15]. Similarly, in the Chilean poverty estimation methodology, the Fay-Herriot model is fitted to poverty rate estimates transformed using the arcsine transformation [16]. In such cases, model parameters are estimated under the Fay-Herriot model fitted on the transformed scale. This is followed by back transformation to obtain the estimates of the small area quantities on the original scale. However, back transformation leads to biased estimates of small area quantities on the original scale [15, 17]. This approach to poverty estimation, based on the Fay-Herriot method using transformed direct estimates, is often criticised.
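
The back-transformation bias mentioned above can be illustrated with a small numerical sketch. It assumes, purely for illustration, that the quantity on the log scale is exactly normally distributed with known mean and standard deviation; the numbers and function names are hypothetical and are not taken from the SAIPE or Chilean methodology.

```python
import math

def naive_back_transform(mu):
    """Back-transform a log-scale mean by simple exponentiation."""
    return math.exp(mu)

def lognormal_mean(mu, sigma):
    """Exact mean of exp(Z) when Z ~ Normal(mu, sigma^2)."""
    return math.exp(mu + 0.5 * sigma ** 2)

# Hypothetical log-scale mean and standard deviation (assumed values).
mu, sigma = math.log(0.30), 0.25

naive = naive_back_transform(mu)      # exp(mu)
exact = lognormal_mean(mu, sigma)     # exp(mu + sigma^2 / 2)

# Relative bias of the naive back-transformed value on the original scale.
relative_bias = (naive - exact) / exact
```

Under this assumption the naive value is always below the original-scale mean whenever the log-scale variance is positive, which is one concrete way in which back transformation can bias estimates on the original scale.
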
The Fay-Herriot method for SAE is based on an area level linear mixed model, and the approach is applicable to a continuous variable. This model is not applicable to non-normal data. Equation (1), on the other hand, is a special case of a generalized linear mixed model (GLMM) with logit link function and is suitable for modelling discrete data, particularly binary variables. Here,

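$$y_{sd}\mid u_d\sim \mathrm{Binomial}\left(n_d,\ \mathrm{expit}(x_d^{T}\beta+u_d)\right)\quad\text{and}\quad y_{rd}\mid u_d\sim \mathrm{Binomial}\left(N_d-n_d,\ \mathrm{expit}(x_d^{T}\beta+u_d)\right).$$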

This leads to $E(y_{sd}\mid u_d)=n_d\,\mathrm{expit}(x_d^{T}\beta+u_d)$ and $E(y_{rd}\mid u_d)=(N_d-n_d)\,\mathrm{expit}(x_d^{T}\beta+u_d)$. Collecting the area level models given by equation (1), we can write the population level version of the model in the form

$$g(p)=\eta=X\beta+Zu. \tag{2}$$

Here $p=(p_1,\ldots,p_D)^{T}$, $X=(x_1^{T},\ldots,x_D^{T})^{T}$ is a $D\times p$ matrix, $Z$ is a $D\times D$ diagonal matrix and $u=(u_1,\ldots,u_D)^{T}$ is a $D\times 1$ vector of area random effects, which is normally distributed with mean zero and variance $\Omega=\phi I_D$. Here, $I_D$ is the $D\times D$ identity matrix. Note that estimation of the fixed effect parameters $\beta$ and the area-specific random effects $u_d$ uses the data from all small areas. We used an iterative procedure that combines Penalized Quasi-Likelihood (PQL) estimation of $\beta$ and $u$ with restricted maximum likelihood (REML) estimation of $\phi$ to estimate these unknown parameters. A detailed description of the approach can be found in [18–20]. Let us write the total count, i.e. the total number of poor households in district $d$, as $y_d=y_{sd}+y_{rd}$, where $y_{sd}$ (the sample count) is known and $y_{rd}$ (the non-sample count) is unknown. Therefore, a plug-in empirical best predictor (EBP) estimate of the total count in area (or district) $d$, obtained by replacing $y_{rd}$ by its predicted value, is given by

$$\hat{y}_d^{EBP}=y_{sd}+\hat{E}(y_{rd}\mid u_d)=y_{sd}+(N_d-n_d)\left[\mathrm{expit}(x_d^{T}\hat{\beta}+Z_d^{T}\hat{u})\right], \tag{3}$$

where $Z_d^{T}=(0,\ldots,1,\ldots,0)$ is a $1\times D$ vector with 1 in the $d$-th position. An estimate of the proportion in area $d$ is then obtained as $\hat{p}_d^{EBP}=N_d^{-1}\hat{y}_d^{EBP}$. For areas with zero sample size (i.e. non-sampled areas), the conventional approach for estimating area proportions or counts is synthetic estimation, based on a suitable GLMM fitted to the data from the sampled areas [12]. From equation (1), for non-sampled areas, the synthetic type predictor of the total count for area $d$ is $\hat{y}_d^{SYN}=N_d\,\mathrm{expit}(x_{d,out}^{T}\hat{\beta})$, where $x_{d,out}$ denotes the vector of covariates associated with non-sampled area $d$. An alternative to predictor (3) has been proposed by [21]. Unfortunately, this predictor does not have a closed form and can only be computed via numerical approximation. This is generally not straightforward, and so many users tend to favour computation of plug-in empirical predictors like (3). There are several alternative approaches for estimating small area counts. For example, Bayesian approaches for modelling counts, using a negative binomial distribution or a hierarchical Poisson-gamma model, are popular in the disease mapping and ecological regression literature; see, for example, [22–26] and references therein.
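
The plug-in predictor (3) and the synthetic predictor for non-sampled areas can be sketched as follows. The sketch assumes that the estimates of the fixed effects and of the area random effect are already available from a fitted model (the PQL/REML fitting step is not shown); all function names and numerical values are hypothetical.

```python
import math

def expit(eta):
    """Inverse logit: exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def linear_predictor(x, beta_hat):
    """Inner product x^T beta_hat for one area."""
    return sum(xj * bj for xj, bj in zip(x, beta_hat))

def ebp_count(y_s, N, n, x, beta_hat, u_hat):
    """Plug-in EBP of the area count: y_s + (N - n) * expit(x^T beta_hat + u_hat)."""
    return y_s + (N - n) * expit(linear_predictor(x, beta_hat) + u_hat)

def synthetic_count(N, x_out, beta_hat):
    """Synthetic predictor for a non-sampled area: N * expit(x_out^T beta_hat)."""
    return N * expit(linear_predictor(x_out, beta_hat))

# Hypothetical fitted values and area data (assumed, not from the paper).
beta_hat = [-0.5, 0.8]            # intercept and one covariate coefficient
u_hat_d = 0.1                     # predicted random effect of a sampled area
y_hat_ebp = ebp_count(y_s=30, N=1000, n=100, x=[1.0, 0.25],
                      beta_hat=beta_hat, u_hat=u_hat_d)
p_hat_ebp = y_hat_ebp / 1000      # EBP estimate of the area proportion

# Non-sampled area: no sample count and no random effect estimate.
y_hat_syn = synthetic_count(N=800, x_out=[1.0, 0.40], beta_hat=beta_hat)
```

The sample count enters the EBP directly, so the model is used only to predict the non-sampled part of the area, whereas the synthetic predictor relies on the fixed effects alone.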

We now discuss the results (i.e. estimates of the proportion of poor households at district level in the State of Bihar) generated by the model-based small area method (3). In this analysis, we use survey data from the Household Consumer Expenditure Survey 2011–12 of NSSO and the Population Census 2011, and assume a binomial specification for the observed district level sample counts. Model specification for this application was discussed in the previous Section, and resulted in the identification of three PCA-based covariates, labelled X11, X21 and X31.

The theory of SAE methods for the estimation of proportions is well developed; however, their application in the agricultural and social sciences is not yet widespread. In developed countries such as the USA, UK and Australia, SAE has been adopted by national statistical offices as part of their objectives. Although the need for small area statistics has been felt in different agencies and organizations in India, not much initiative has been taken. In India, the Census is usually limited in the scope of its data collection; it focuses mainly on basic social and demographic information, and that only at decennial intervals. On the other hand, NSSO conducts regular surveys on a number of socioeconomic indicators, but their utility is restricted to generating national and state level estimates, not estimates for administrative units below the state, because of the small sample sizes for such units. This paper demonstrates that SAE can be used as a cost-effective and efficient approach for generating fairly accurate disaggregate level estimates of poverty incidence from existing survey data, using auxiliary information from different published data sources. The results clearly indicate the advantage of using the SAE technique to cope with the small sample size problem in producing reliable estimates. Notably, the model-based SAE method brings a gain in efficiency in district level estimates of poverty.
