Research Output

2022

Pan R, Ren T, Guo B, Li F, Li G, Wang H. A Note on Distributed Quantile Regression by Pilot Sampling and One-Step Updating. Journal of Business and Economic Statistics [Internet]. 2022;40:1691–1700.

Quantile regression is a method of fundamental importance. How to efficiently conduct quantile regression for a large dataset on a distributed system is of great importance. We show that the popularly used one-shot estimation is statistically inefficient if data are not randomly distributed across different workers. To fix the problem, a novel one-step estimation method is developed with the following nice properties. First, the algorithm is communication efficient; that is, the communication cost demanded is practically acceptable. Second, the resulting estimator is statistically efficient; that is, its asymptotic covariance is the same as that of the global estimator. Third, the estimator is robust against data distribution; that is, its consistency is guaranteed even if data are not randomly distributed across different workers. Numerical experiments are provided to corroborate our findings. A real example is also presented for illustration.

Wang X, Kang Y, Petropoulos F, Li F. The Uncertainty Estimation of Feature-Based Forecast Combinations. Journal of the Operational Research Society [Internet]. 2022;73:979–993.

Forecasting is an indispensable element of operational research (OR) and an important aid to planning. The accurate estimation of the forecast uncertainty facilitates several operations management activities, predominantly in supporting decisions in inventory and supply chain management and effectively setting safety stocks. In this paper, we introduce a feature-based framework that uses the relationship between time series features and interval forecasting performance to provide reliable interval forecasts. We propose an optimal threshold ratio searching algorithm and a new weight determination mechanism for selecting an appropriate subset of models and assigning combination weights for each time series tailored to the observed features. We evaluate our approach using a large set of time series from the M4 competition. Our experiments show that our approach significantly outperforms a wide range of benchmark models in terms of both point forecasts and prediction intervals.

Kang Y (康雁飞), Li F (李丰). Statistical Computing (统计计算). Published online; 2022.

2021

Janeway MG, Zhao X, Rosenthaler M, Zuo Y, Balasubramaniyan K, Poulson M, Neufeld M, Siracuse JJ, Takahashi CE, Allee L, et al. Clinical Diagnostic Phenotypes in Hospitalizations Due to Self-Inflicted Firearm Injury. Journal of Affective Disorders. 2021;278:172–180.

Hospitalized self-inflicted firearm injuries have not been extensively studied, particularly regarding clinical diagnoses at the index admission. The objective of this study was to discover the diagnostic phenotypes (DPs) or clusters of hospitalized self-inflicted firearm injuries. Using Nationwide Inpatient Sample data in the US from 1993 to 2014, we used International Classification of Diseases, Ninth Revision codes to identify self-inflicted firearm injuries among those ≥18 years of age. The 25 most frequent diagnostic codes were used to compute a dissimilarity matrix and the optimal number of clusters. We used hierarchical clustering to identify the main DPs. The overall cohort included 14,072 hospitalizations, with self-inflicted firearm injuries occurring mainly in those between 16 and 45 years of age, black, with co-occurring tobacco and alcohol use, and mental illness. Of the three identified DPs, DP1 was the largest (n=10,110) and included the most common diagnoses, similar to the overall cohort, including major depressive disorders (27.7%), hypertension (16.8%), acute post hemorrhagic anemia (16.7%), tobacco (15.7%) and alcohol use (12.6%). DP2 (n=3,725) was not characterized by any of the top 25 ICD-9 diagnosis codes, and included children and peripartum women. DP3, the smallest phenotype (n=237), had a high prevalence of depression similar to DP1, and was defined by fewer fatal injuries of the chest and abdomen. There were three distinct diagnostic phenotypes in hospitalizations due to self-inflicted firearm injuries. Further research is needed to determine how DPs can be used to tailor clinical care and prevention efforts.

Kang Y, Spiliotis E, Petropoulos F, Athiniotis N, Li F, Assimakopoulos V. Déjà vu: A Data-Centric Forecasting Approach through Time Series Cross-Similarity. Journal of Business Research [Internet]. 2021;132:719–731.

Accurate forecasts are vital for supporting the decisions of modern companies. Forecasters typically select the most appropriate statistical model for each time series. However, statistical models usually presume some data generation process while making strong assumptions about the errors. In this paper, we present a novel data-centric approach, ‘forecasting with cross-similarity’, which tackles model uncertainty in a model-free manner. Existing similarity-based methods focus on identifying similar patterns within the series, i.e., ‘self-similarity’. In contrast, we propose searching for similar patterns from a reference set, i.e., ‘cross-similarity’. Instead of extrapolating, the future paths of the similar series are aggregated to obtain the forecasts of the target series. Building on the cross-learning concept, our approach allows the application of similarity-based forecasting on series with limited lengths. We evaluate the approach using a rich collection of real data and show that it yields competitive accuracy in both point forecasts and prediction intervals.

Zhu X, Li F, Wang H. Least-Square Approximation for a Distributed System. Journal of Computational and Graphical Statistics [Internet]. 2021;30:1004–1018.

In this work, we develop a distributed least-square approximation (DLSA) method that is able to solve a large family of regression problems (e.g., linear regression, logistic regression, and Cox’s model) on a distributed system. By approximating the local objective function using a local quadratic form, we are able to obtain a combined estimator by taking a weighted average of local estimators. The resulting estimator is proved to be statistically as efficient as the global estimator. Moreover, it requires only one round of communication. We further conduct a shrinkage estimation based on the DLSA estimation using an adaptive Lasso approach. The solution can be easily obtained by using the LARS algorithm on the master node. It is theoretically shown that the resulting estimator possesses the oracle property and is selection consistent by using a newly designed distributed Bayesian information criterion. The finite sample performance and computational efficiency are further illustrated by an extensive numerical study and an airline dataset. The airline dataset is 52 GB in size. The entire methodology has been implemented in Python for a de-facto standard Spark system. The proposed DLSA algorithm on the Spark system takes 26 min to obtain a logistic regression estimator, which is more efficient and memory friendly than conventional methods. Supplementary materials for this article are available online.
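As a rough illustration of the one-round weighted-average idea described in this abstract, the sketch below covers only the simplest case: a single-covariate linear regression without intercept, where weighting each local estimate by the local sum of squares reproduces the global least-squares estimate exactly. It is not the authors' Spark implementation; the helper names `local_fit` and `combine`, the simulated data, and the four-worker split are all assumptions made for illustration.

```python
import random


def local_fit(xs, ys):
    """Return (weight, estimate) for a no-intercept least-squares fit on one worker."""
    # The weight is the local sum of squares, i.e. the curvature of the local loss.
    weight = sum(x * x for x in xs)
    estimate = sum(x * y for x, y in zip(xs, ys)) / weight
    return weight, estimate


def combine(local_results):
    """Weighted average of local estimates, as a master node could compute it."""
    total_weight = sum(w for w, _ in local_results)
    return sum(w * b for w, b in local_results) / total_weight


random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]
ys = [1.5 * x + random.gauss(0.0, 1.0) for x in xs]

# Deliberately non-random split: sort by covariate, then deal out 4 blocks of 500.
pairs = sorted(zip(xs, ys))
workers = [pairs[i * 500:(i + 1) * 500] for i in range(4)]

# Each worker sends only its (weight, estimate) pair: one round of communication.
local_results = [
    local_fit([x for x, _ in part], [y for _, y in part]) for part in workers
]
beta_combined = combine(local_results)
_, beta_global = local_fit(xs, ys)
```

In this special case the combined and global estimates agree up to floating-point error even though the data are split non-randomly; for other models the abstract's claim is about asymptotic efficiency rather than exact equality.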

2020

Hao C, Li F, von Rosen D. A Bilinear Reduced Rank Model. In: Fan J, Pan J, editors. Contemporary Experimental Design, Multivariate Analysis and Data Mining. Springer Nature; 2020.

This article considers a bilinear model that includes two different latent effects. The first effect has a direct influence on the response variable, whereas the second latent effect is assumed to first influence other latent variables, which in turn affect the response variable. In this article, latent variables are modelled via rank restrictions on unknown mean parameters and the models which are used are often referred to as reduced rank regression models. This article presents a likelihood-based approach that results in explicit estimators. In our model, the latent variables act as covariates that we know exist, but their direct influence is unknown and will therefore not be considered in detail. One example is if we observe hundreds of weather variables, but we cannot say which or how these variables affect plant growth.

Li X, Kang Y, Li F. Forecasting with Time Series Imaging. Expert Systems with Applications [Internet]. 2020;160:113680.

Feature-based time series representations have attracted substantial attention in a wide range of time series analysis methods. Recently, the use of time series features for forecast model averaging has been an emerging research focus in the forecasting community. Nonetheless, most of the existing approaches depend on the manual choice of an appropriate set of features. Exploiting machine learning methods to extract features from time series automatically becomes crucial in state-of-the-art time series analysis. In this paper, we introduce an automated approach to extract time series features based on time series imaging. We first transform time series into recurrence plots, from which local features can be extracted using computer vision algorithms. The extracted features are used for forecast model averaging. Our experiments show that forecasting based on automatically extracted features, with less human intervention and a more comprehensive view of the raw time series data, yields highly comparable performances with the best methods in the largest forecasting competition dataset (M4) and outperforms the top methods in the Tourism forecasting competition dataset.
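The recurrence plot mentioned in this abstract can be sketched in a few lines. This is a simplified, thresholded version for a scalar series with no time-delay embedding, and it is not taken from the paper; the function name `recurrence_plot`, the threshold `eps`, and the toy series are assumptions for illustration.

```python
def recurrence_plot(series, eps):
    """Binary recurrence matrix: R[i][j] = 1 when |x_i - x_j| <= eps."""
    n = len(series)
    return [
        [1 if abs(series[i] - series[j]) <= eps else 0 for j in range(n)]
        for i in range(n)
    ]


# Toy series that alternates between a low and a high level.
series = [0.0, 1.0, 0.1, 1.1, 0.0]
R = recurrence_plot(series, eps=0.2)

# Render the matrix as text: '#' marks pairs of time points with similar values.
for row in R:
    print("".join("#" if v else "." for v in row))
```

The resulting matrix is an image-like object (symmetric, with a full diagonal), which is what makes it possible to apply computer vision feature extractors to it.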

Kang Y, Hyndman RJ, Li F. GRATIS: GeneRAting TIme Series with Diverse and Controllable Characteristics. Statistical Analysis and Data Mining: The ASA Data Science Journal [Internet]. 2020;13:354–376.

The explosion of time series data in recent years has brought a flourish of new time series analysis methods, for forecasting, clustering, classification and other tasks. The evaluation of these new methods requires either collecting or simulating a diverse set of time series benchmarking data to enable reliable comparisons against alternative approaches. We propose GeneRAting TIme Series with diverse and controllable characteristics, named GRATIS, with the use of mixture autoregressive (MAR) models. We simulate sets of time series using MAR models and investigate the diversity and coverage of the generated time series in a time series feature space. By tuning the parameters of the MAR models, GRATIS is also able to efficiently generate new time series with controllable features. In general, as a costless surrogate to the traditional data collection approach, GRATIS can be used as an evaluation tool for tasks such as time series forecasting and classification. We illustrate the usefulness of our time series generation process through a time series forecasting application.

Kalesan B, Zhao S, Poulson M, Neufeld M, Dechert T, Siracuse JJ, Zuo Y, Li F. Intersections of Firearm Suicide, Drug-Related Mortality, and Economic Dependency in Rural America. Journal of Surgical Research. 2020;256:96–102.

2019

Bailey HM, Zuo Y, Li F, Min J, Vaddiparti K, Prosperi M, Fagan J, Galea S, Kalesan B. Changes in Patterns of Mortality Rates and Years of Life Lost Due to Firearms in the United States, 1999 to 2016: A Joinpoint Analysis. PLOS ONE. 2019;14:e0225223.

Background: Firearm-related death rates and years of potential life lost (YPLL) vary widely between population subgroups and states. However, changes or inflections in temporal trends within subgroups and states are not fully documented. We assessed temporal patterns and inflections in the rates of firearm deaths and %YPLL due to firearms overall and by sex, age, race/ethnicity, intent, and state in the United States between 1999 and 2016. Methods: We extracted age-adjusted firearm mortality and YPLL rates per 100,000, and %YPLL from 1999 to 2016 by using the WONDER (Wide-ranging Online Data for Epidemiologic Research) database. We used Joinpoint Regression to assess temporal trends, the inflection points, and annual percentage change (APC) from 1999 to 2016. Results: National firearm mortality rates were 10.3 and 11.8 per 100,000 in 1999 and 2016, with two distinct segments: a plateau until 2014 followed by an increase of APC = 7.2% (95% CI 3.1, 11.4). YPLL rates were 304.7 and 338.2 in 1999 and 2016, with a steady APC increase in %YPLL of 0.65% (95% CI 0.43, 0.87) from 1999 to an inflection point in 2014, followed by a larger APC in %YPLL of 5.1% (95% CI 0.1, 10.4). The upward trend in firearm mortality and YPLL rates starting in 2014 was observed in subgroups of male, non-Hispanic blacks, Hispanic whites and for firearm assaults. The inflection points for firearm mortality and YPLL rates also varied across states. Conclusions: Within the United States, firearm mortality rates and YPLL remained constant between 1999 and 2014 and have been increasing subsequently. There was, however, an increase in firearm mortality rates in several subgroups and individual states earlier than 2014.

Li F, He Z. Credit Risk Clustering in a Business Group: Which Matters More, Systematic or Idiosyncratic Risk? McMillan D, editor. Cogent Economics & Finance [Internet]. 2019;7:1632528.

Understanding how defaults correlate across firms is a persistent concern in risk management. In this paper, we apply covariate-dependent copula models to assess the dynamic nature of credit risk dependence, which we define as “credit risk clustering”. We also study the driving forces of credit risk clustering in the CEC business group in China. Our empirical analysis shows that credit risk clustering varies over time and exhibits different patterns across firm pairs in a business group. We also investigate the impacts of systematic and idiosyncratic factors on credit risk clustering. We find that the impacts of the money supply and the short-term interest rates are positive, whereas the impacts of exchange rates are negative. The role of the CPI in credit risk clustering is ambiguous. Idiosyncratic factors are vital for predicting credit risk clustering. From a policy perspective, our results not only strengthen the results of previous research but also provide a possible approach to model and predict the extreme co-movement of credit risk in business groups with financial indicators.

2018

Pino EC, Zuo Y, Olivera CMD, Mahalingaiah S, Keiser O, Moore LL, Li F, Vasan RS, Corkey BE, Kalesan B. Cohort Profile: The MULTI sTUdy Diabetes rEsearch (MULTITUDE) Consortium. BMJ Open. 2018;8:e020640.

Purpose: Globally, the age-standardised prevalence of type 2 diabetes mellitus (T2DM) has nearly doubled from 1980 to 2014, rising from 4.7% to 8.5% with an estimated 422 million adults living with the chronic disease. The MULTI sTUdy Diabetes rEsearch (MULTITUDE) consortium was recently established to harmonise data from 17 independent cohort studies and clinical trials and to facilitate a better understanding of the determinants, risk factors and outcomes associated with T2DM. Participants: Participants range in age from 3 to 88 years at baseline, including both individuals with and without T2DM. MULTITUDE is an individual-level pooled database of demographics, comorbidities, relevant medications, clinical laboratory values, cardiac health measures, and T2DM-associated events and outcomes across 45 US states and the District of Columbia. Findings to date: Among the 135 156 ongoing participants included in the consortium, almost 25% (33 421) were diagnosed with T2DM at baseline. The average age of the participants was 54.3, while the average age of participants with diabetes was 64.2. Men (55.3%) and women (44.6%) were almost equally represented across the consortium. Non-whites accounted for 31.6% of the total participants and 40% of those diagnosed with T2DM. Fewer individuals with diabetes reported being regular smokers than their non-diabetic counterparts (40.3% vs 47.4%). Over 85% of those with diabetes were reported as either overweight or obese at baseline, compared with 60.7% of those without T2DM. We observed differences in all-cause mortality, overall and by T2DM status, between cohorts. Future plans: Given the wide variation in demographics and all-cause mortality in the cohorts, the MULTITUDE consortium will be a unique resource for conducting research to determine differences in the incidence and progression of T2DM; the sequence of events or biomarkers prior to T2DM diagnosis; and disease progression from T2DM to disease-related outcomes, complications and premature mortality; and to assess race/ethnicity differences in the above associations.

Li F, Kang Y. Improving Forecasting Performance Using Covariate-Dependent Copula Models. International Journal of Forecasting [Internet]. 2018;34:456–476.

Copulas provide an attractive approach to the construction of multivariate distributions with flexible marginal distributions and different forms of dependences. Of particular importance in many areas is the possibility of forecasting the tail-dependences explicitly. Most of the available approaches are only able to estimate tail-dependences and correlations via nuisance parameters, and cannot be used for either interpretation or forecasting. We propose a general Bayesian approach for modeling and forecasting tail-dependences and correlations as explicit functions of covariates, with the aim of improving the copula forecasting performance. The proposed covariate-dependent copula model also allows for Bayesian variable selection from among the covariates of the marginal models, as well as the copula density. The copulas that we study include the Joe-Clayton copula, the Clayton copula, the Gumbel copula and the Student’s t-copula. Posterior inference is carried out using an efficient MCMC simulation method. Our approach is applied to both simulated data and the S&P 100 and S&P 600 stock indices. The forecasting performance of the proposed approach is compared with those of other modeling strategies based on log predictive scores. A value-at-risk evaluation is also performed for the model comparisons.

2016

Li F (李丰). Distributed Computing for Big Data with Case Studies (大数据分布式计算与案例). 1st ed. China Renmin University Press (中国人民大学出版社); 2016.

2013

Li F. Bayesian Modeling of Conditional Densities [Internet]. 2013.

This thesis develops models and associated Bayesian inference methods for flexible univariate and multivariate conditional density estimation. The models are flexible in the sense that they can capture widely differing shapes of the data. The estimation methods are specifically designed to achieve flexibility while still avoiding overfitting. The models are flexible both for a given covariate value and across covariate space. A key contribution of this thesis is that it provides general approaches to density estimation with highly efficient Markov chain Monte Carlo methods. The methods are illustrated on several challenging non-linear and non-normal datasets. In the first paper, a general model is proposed for flexibly estimating the density of a continuous response variable conditional on a possibly high-dimensional set of covariates. The model is a finite mixture of asymmetric Student-t densities with covariate-dependent mixture weights. The four parameters of the components, the mean, degrees of freedom, scale and skewness, are all modeled as functions of the covariates. The second paper explores how well a smooth mixture of symmetric components can capture skewed data. Simulations and applications on real data show that including covariate-dependent skewness in the components can lead to substantially improved performance on skewed data, often using a much smaller number of components. We also introduce smooth mixtures of gamma and log-normal components to model positively-valued response variables. In the third paper we propose a multivariate Gaussian surface regression model that combines both additive splines and interactive splines, and a highly efficient MCMC algorithm that updates all the multi-dimensional knot locations jointly. We use shrinkage priors to avoid overfitting with different estimated shrinkage factors for the additive and surface part of the model, and also different shrinkage parameters for the different response variables. In the last paper we present a general Bayesian approach for directly modeling dependencies between variables as functions of explanatory variables in a flexible copula context. In particular, the Joe-Clayton copula is extended to have covariate-dependent tail dependence and correlations. Posterior inference is carried out using a novel and efficient simulation method. The appendix of the thesis documents the computational implementation details.

Li F, Villani M. Efficient Bayesian Multivariate Surface Regression. Scandinavian Journal of Statistics [Internet]. 2013;40:706–723.

Methods for choosing a fixed set of knot locations in additive spline models are fairly well established in the statistical literature. The curse of dimensionality makes it nontrivial to extend these methods to nonadditive surface models, especially when there are more than a couple of covariates. We propose a multivariate Gaussian surface regression model that combines both additive splines and interactive splines, and a highly efficient Markov chain Monte Carlo algorithm that updates all the knot locations jointly. We use shrinkage priors to avoid overfitting with different estimated shrinkage factors for the additive and surface part of the model, and also different shrinkage parameters for the different response variables. Simulated data and an application to firm leverage data show that the approach is computationally efficient, and that allowing for freely estimated knot locations can offer a substantial improvement in out-of-sample predictive performance.

2011

Li F, Villani M, Kohn R. Modelling Conditional Densities Using Finite Smooth Mixtures. In: Mixtures: Estimation and Applications. John Wiley & Sons; 2011. pp. 123–144.

Smooth mixtures, i.e. mixture models with covariate-dependent mixing weights, are very useful flexible models for conditional densities. Previous work shows that using too simple mixture components for modeling heteroscedastic and/or heavy tailed data can give a poor fit, even with a large number of components. This paper explores how well a smooth mixture of symmetric components can capture skewed data. Simulations and applications on real data show that including covariate-dependent skewness in the components can lead to substantially improved performance on skewed data, often using a much smaller number of components. Furthermore, variable selection is effective in removing unnecessary covariates in the skewness, which means that there is little loss in allowing for skewness in the components when the data are actually symmetric. We also introduce smooth mixtures of gamma and log-normal components to model positively-valued response variables.

2010

Li F, Villani M, Kohn R. Flexible Modeling of Conditional Distributions Using Smooth Mixtures of Asymmetric Student t Densities. Journal of Statistical Planning and Inference [Internet]. 2010;140:3638–3654.

A general model is proposed for flexibly estimating the density of a continuous response variable conditional on a possibly high-dimensional set of covariates. The model is a finite mixture of asymmetric Student t densities with covariate-dependent mixture weights. The four parameters of the components, the mean, degrees of freedom, scale and skewness, are all modeled as functions of the covariates. Inference is Bayesian and the computation is carried out using Markov chain Monte Carlo simulation. To enable model parsimony, a variable selection prior is used in each set of covariates and among the covariates in the mixing weights. The model is used to analyze the distribution of daily stock market returns, and is shown to forecast the distribution of returns more accurately than other widely used models for financial data.

Feng Li (李丰)

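Associate Professor, Research Fellow, and Doctoral Supervisor, Department of Business Statistics and Econometrics, Guanghua School of Management, Peking University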
