Integrating Citizen Scientist Data into the Surveillance System for Avian Influenza Virus, Taiwan

The continuing circulation and reassortment with low-pathogenicity avian influenza Gs/Gd (goose/Guangdong/1996)-like avian influenza viruses (AIVs) has caused huge economic losses and raised public health concerns over the zoonotic potential. Virologic surveillance of wild birds has been suggested as part of a global AIV surveillance system. However, underreporting and biased selection of sampling sites has rendered gaining information about the transmission and evolution of highly pathogenic AIV problematic. We explored the use of the Citizen Scientist eBird database to elucidate the dynamic distribution of wild birds in Taiwan and their potential for AIV exchange with domestic poultry. Through the 2-stage analytical framework, we associated nonignorable risk with 10 species of wild birds with >100 significant positive results. We generated a risk map, which served as the guide for highly pathogenic AIV surveillance. Our methodologic blueprint has the potential to be incorporated into the global AIV surveillance system of wild birds.

since the duration exceed this criterion tends to correlate with particular bird sighting activity, such as Taiwan New Year bird count event (2). We didn't restrict the checklists based on the sampling protocol of the observers used since we are trying to capture all bird sighting activities regardless of whether the observers would record all species or target only specific bird species.
After data filtering, we obtained the final dataset used for the analysis, which consisted of 2,366,327 records of total numbers of species, covering 735 species, from 3080 observers.

Wildbird species selection
Before constructing the wild bird distribution map, the initial step is the selection of wild bird species relevant for the introduction of either HPAI or LPAI into the poultry farm. The bird species which show passage or regularly occurring breeding and wintering with preference to areas in Taiwan, and passing once or twice a year, may potentially act as a reservoir for LPAI, and will thus be considered for selection. The final inclusion criteria of bird species was based on either the top 20% abundancy by ranking the counts from the checklists of the observers (3)  The main difficulty in building a universally valid regression is that the presence (or absence) of bird species involves many variables (including land-related factors and environmental variables). It is challenging to obtain a unified explanation about the model structure. To estimate a risk map (occupancy probability), a variable selection procedure, called the elastic net method, were used for screening significant variables, including bird species and environmental factors. The elastic net method is a compromise between ridge regression and Lasso (4,5).

Defining the resolution: 3km×3km grids
Let Taiwan be divided into a set of 3km×3 km grids, each with an area equal to 9 km 2 , with four sides parallel to the Earth's longitudes and latitudes. This partition gives 4,762 grids covering the entire map of Taiwan including the coastline. Let the squares be denoted as Further, let , , be the value of the s-th variable for temporal, terrestrial, and environmental factors.

Modeling the occupancy
For species k, we first estimate the probability of its occurrence based on the presence or absence of other species as explanatory variables. This probability, denoted U and interpreted as a propensity score, is used as the "matching variable" in the following text. To model outbreaks in grids containing a certain number of poultry farms, the presence or absence of species k was used as the primary explanatory variable when the corresponding propensity scores U were matched. Therefore, it is still necessary to estimate the probability of occurrence of each species of bird in each grid based on a logistic autoregressive model to present an overall risk map.

Notations and model description
For grid "i" and for bird species "k", Yi,k is the indicator variable of existence of species "k", and Yi,-k is the indicator of all other species than species k. A natural conclusion is: the existence of species k depends on all the other species; and thus Yi,-k is a (K-1)-dimensional vector. Besides, Ti,k is the vector-valued variable representing all other variables (including the environmental data) except for bird species. Explicit modeling of spatial correlation between Y and the other Ts is implemented through the variable Y-i,k , which is also an indicator variable of observing species k in all adjacent grids using Queen's contiguity-based neighbors (6), that is The ZIP model estimates the probability of bird occupancy in a grid that accommodates both structural zeros (species never appear in the grid) and random zeros:

Page 5 of 10
Estimating occupancy probabilities using autoregressive logistic model The principle of incorporating spatial autocorrelation is to consider the correlation with adjacent grids. Let ( , = 1|Y i,−k , T i,k , Y −i,k ) be the occupancy probability given the "status" of the adjacent grids and the other land-cover and environmental variables. Explicit modeling of spatial correlation between Y and the other T is implemented through the variable Y-i,k which is an indicator of observing species k in all adjacent grids (using Queen's contiguity-based neighbors).
The occupancy probability is estimated through a ZIP model by summing the probabilities of nonzero terms: In (A3), the advantage of using the indicator metric Y i,−k to model occupancy is that it avoids possible biases based on intrinsic properties of the eBird data, which can arise when reporting the number of species observed, but less often happens when only occupancy "status" is adopted. However, reducing bias inevitably leads to a loss of efficiency in statistical estimation. In the event the ZIP model is not suitable for model fitting, the zero-inflated negative binomial (ZINB) model can be used instead (7).

Propensity score and matched-pair design
The propensity score (U) corresponding to an indicator variable Z, which is random but dependent on a set of covariates, is the (estimated) probability of being equal to 1 for Z. To the purpose of adjustment for multiple explanatory variables (denoted by X) in this study, we consider X to include the indicators of observing species other than bird species k, as well as Page 6 of 10 many other variables. After the adjustment (for the propensity score), the risk factors are reassessed for their association with the outcome variable (Y) by matching on the propensity score (8). The remaining question is: why use propensity score matching? First, the variable number of bird species can be large, making it impractical to report the risk of individual bird species oneby-one. Because of this concern, we consider the approach of matched-pair design so that the propensity scores of each species relative to all other species are matched. In addition, through this setting, the adjustment of environmental factors can also be achieved.

How to construct the case-control set
Among the N=1,073 grids where bird observations were reported, there are D=296 grids which contains at least 1 outbreak. We call the grid in D a "case", and for the other C=N-D=777 grids where no outbreak was reported, we call them "controls". In the sequel, we denote SD as the set of "case" grids, and SC as the set of "control" grids. Since every case can have multiple matched controls, it is possible to consider a resampling scheme from the matched control set and compute the McNemar statistic for each resampling. Section.

McNemar's matched-pair association test and Bootstrapping
In the Appendix Table, In these four yes-no cells, only one of them equals to 1, the other three equal to 0. Here a "one" represents "one matched-pair". Taking the summation over d=1,…,D, we obtain the total number of discordant pairs. Further, let α ( ) be the number of pairs that the case grid has species-k but the control-grid does not; Conversely, β ( ) is the number of pairs that the case-grid has no species-k but the control-grid does have. Because the random sampling is implemented on Mc(d), a subset of SC, the bootstrapping suggests that this random sampling can be repeated B times for a large "B" (9).
If, temporarily, the number of poultry farm in the cells are not taken into account (but actually the number itself is a risk factor of the outbreak indicator of that cell), denote α ( ) and β ( ) to be the numbers of discordant pairs with conditions (i) and (ii) stated above for species k, and at the b-th resampling. Let The quantities ( ) and ( ) measure the tendency of positive and negative associations, respectively, using the resampling procedure.