Università Cattolica del Sacro Cuore

Seminars - Events

Thursday, January 25th, 2024
Daniele Durante
Assistant professor of Statistics, Department of Decision Sciences, Bocconi University

Bayesian Nonparametric Stochastic Block Modeling of Criminal Networks

 

Abstract:

Europol recently described criminal networks as a modern version of the mythological Hydra, with a covert and complex structure. Indeed, relational data among criminals are subject to measurement errors and structured missingness patterns, and exhibit a sophisticated combination of an unknown number of core-periphery, assortative and disassortative structures that may encode key architectures of the criminal organization. The coexistence of these noisy block patterns limits the reliability of the community detection algorithms routinely used in criminology, leading to overly simplified and possibly biased reconstructions of organized crime architectures. In this seminar, I will present a number of model-based solutions that aim to fill these gaps by combining stochastic block models with priors for random partitions arising from Bayesian nonparametrics. These include Gibbs-type priors and random partition priors driven by the urn scheme of a hierarchical normalized completely random measure. Product-partition models incorporating criminals' attributes, and zero-inflated Poisson representations accounting for weighted edges and security strategies, will also be discussed. Collapsed Gibbs samplers for posterior computation will be presented, and refined strategies for estimation, prediction, uncertainty quantification and model selection will be outlined. Results are illustrated in an application to an Italian Mafia network, where the proposed models unveil a structure of the criminal organization largely hidden to the state-of-the-art alternatives routinely used in criminology. I will conclude the seminar with ideas on how to learn the evolutionary history of a criminal organization from the relationship data among its members via a novel combination of latent space models for network data and phylogenetic trees.
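
For readers unfamiliar with the building blocks, the following Python sketch simulates a network from a stochastic block model whose partition is drawn from a Chinese restaurant process, one member of the Gibbs-type family mentioned above. It is only a generative toy under assumed hyperparameters (concentration alpha, Beta(a, b) edge probabilities); the collapsed Gibbs samplers and the extensions discussed in the seminar are not reproduced here.

import numpy as np

rng = np.random.default_rng(0)

def crp_partition(n, alpha):
    # Draw a random partition of n nodes from a Chinese restaurant process.
    labels = [0]
    for i in range(1, n):
        counts = np.bincount(labels)
        probs = np.append(counts, alpha).astype(float)
        probs /= probs.sum()
        labels.append(int(rng.choice(len(probs), p=probs)))
    return np.array(labels)

def sbm_adjacency(labels, a=1.0, b=1.0):
    # Given block labels, draw block-pair edge probabilities from a Beta(a, b)
    # prior and then a symmetric binary adjacency matrix.
    n = len(labels)
    K = labels.max() + 1
    theta = rng.beta(a, b, size=(K, K))
    theta = np.triu(theta) + np.triu(theta, 1).T      # symmetric block probabilities
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            A[i, j] = A[j, i] = rng.binomial(1, theta[labels[i], labels[j]])
    return A, theta

labels = crp_partition(n=50, alpha=1.0)
A, theta = sbm_adjacency(labels)
print("number of blocks drawn by the CRP:", labels.max() + 1)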

 

 

Wednesday, December 13th, 2023

Matteo Iacopini 

Lecturer in Statistics, Queen Mary University of London

 

Static and Dynamic BART for Rank-Order Data


Abstract

Ranking lists are often provided at regular time intervals by one or multiple rankers in a range of applications, including sports, marketing, and politics. Most popular methods for rank-order data postulate a linear specification for the latent scores, which determine the observed ranks, and ignore the temporal dependence of the ranking lists. To address these issues, novel nonparametric static (ROBART) and autoregressive (ARROBART) models are introduced, with latent scores defined as nonlinear Bayesian additive regression tree functions of covariates. To make inferences in the dynamic ARROBART model, closed-form filtering, predictive, and smoothing distributions for the latent time-varying scores are derived. These results are applied in a Gibbs sampler with data augmentation for posterior inference. The proposed methods are shown to outperform existing competitors in simulation studies, and the advantages of the dynamic model are demonstrated by forecasts of weekly pollster rankings of NCAA football teams.
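
As a toy illustration of the latent-score construction referred to above, the short Python sketch below generates a ranking by ordering noisy nonlinear functions of covariates; the nonlinearity used here is arbitrary and merely stands in for the BART functions of the actual ROBART/ARROBART models.

import numpy as np

rng = np.random.default_rng(1)
n_items, n_cov = 5, 3
X = rng.normal(size=(n_items, n_cov))

# Illustrative nonlinear latent scores; the models in the talk use BART functions instead.
scores = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=n_items)

# Observed ranking: the item with the largest latent score receives rank 1.
ranks = np.empty(n_items, dtype=int)
ranks[np.argsort(-scores)] = np.arange(1, n_items + 1)
print(ranks)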

 

 

 

Monday, June 5th, 2023

Marzia A. Cremona

Université Laval, Québec, Canada

 

smoothEM: a new approach for the simultaneous assessment of smooth patterns and spikes

 

 

Abstract:

We consider functional data where an underlying smooth curve is contaminated not only by errors, but also by irregular spikes that (a) are themselves of interest, and (b) can negatively affect our ability to characterize the underlying curve. We propose an approach that, combining regularized spline smoothing and an Expectation-Maximization algorithm, allows one both to identify spikes and to estimate the smooth component. Under some assumptions on the error distribution, we prove consistency of the EM estimates. Next, we demonstrate the performance of our proposal on finite samples and its robustness to violations of the assumptions through simulations. Finally, we apply our proposal to data on the annual heatwave index in the US and on weekly electricity consumption in Ireland. In both datasets, we are able to characterize underlying smooth trends and to pinpoint irregular/extreme behaviors.

Work in collaboration with Huy Dang (Penn State University) and Francesca Chiaromonte (Penn State University and Sant’Anna School of Advanced Studies)
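
A rough Python sketch of the alternating idea described in the abstract, under assumed settings: residuals from a weighted spline fit are classified as spike versus noise via an E-step for a two-component Gaussian mixture, and the smooth component is then re-fitted with the spike observations down-weighted. This only illustrates the mechanism, not the smoothEM estimator or its theory; the mixture parameters are held fixed rather than updated.

import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 300)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
y[rng.choice(x.size, 10, replace=False)] += 3.0            # add irregular spikes

pi, sigma, sigma_spike = 0.05, 0.2, 2.0                    # assumed mixture settings, held fixed
w = np.ones_like(y)                                        # initial weights: no spikes
for _ in range(20):
    # Weighted regularized spline fit for the smooth component.
    spline = UnivariateSpline(x, y, w=w, s=x.size * sigma**2)
    r = y - spline(x)
    # E-step: posterior probability that each residual comes from the spike component.
    p_spike = pi * norm.pdf(r, scale=sigma_spike)
    p_noise = (1 - pi) * norm.pdf(r, scale=sigma)
    resp = p_spike / (p_spike + p_noise)
    # Down-weight likely spikes before re-fitting (a full EM would also update pi and the sigmas).
    w = np.clip(1.0 - resp, 1e-3, 1.0)

print("flagged spike locations:", np.where(resp > 0.5)[0])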




Thursday, May 18th, 2023

Matteo SESIA

Department of Data Sciences and Operations, Marshall School of Business - University of Southern California

 

Conformal Inference for Frequency Estimation with Sketched Data

 

Abstract

A flexible model-free method is developed to construct a confidence interval for the frequency of a queried object in a very large data set, based on a much smaller sketch of the data. The approach requires no knowledge of the data distribution or of the details of the sketching algorithm; instead, it constructs provably valid frequentist confidence intervals for random queries using a conformal inference approach. After achieving marginal coverage for random queries under the assumption of data exchangeability, the proposed method is extended to provide stronger inferences accounting for possibly heterogeneous frequencies of different random queries, redundant queries, and distribution shifts. While the presented methods are broadly applicable, this work focuses on use cases involving the count-min sketch algorithm and a non-linear variation thereof, to facilitate comparison to prior work. In particular, the developed methods are compared empirically to frequentist and Bayesian alternatives, through simulations and experiments with data sets of SARS-CoV-2 DNA sequences and classic English literature.
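
The following Python sketch illustrates the setting under simple assumptions: counts are stored in a count-min sketch (which can only over-estimate frequencies), and a one-sided conformal lower bound is calibrated on a sample of queries with known true counts. The sketch size, the calibration scheme, and the variable names are illustrative and do not reproduce the paper's exact procedure.

import numpy as np

rng = np.random.default_rng(3)
W, D = 512, 4                                     # assumed sketch width and depth
seeds = [int(s) for s in rng.integers(0, 2**31, size=D)]
table = np.zeros((D, W), dtype=int)

def buckets(item):
    return [hash((s, int(item))) % W for s in seeds]

def cms_add(item):
    for d, b in enumerate(buckets(item)):
        table[d, b] += 1

def cms_query(item):
    # Classic count-min estimate: it can only over-estimate the true count.
    return min(table[d, b] for d, b in enumerate(buckets(item)))

stream = rng.zipf(1.5, size=20000)                # heavy-tailed toy stream
for it in stream:
    cms_add(it)
true_counts = dict(zip(*np.unique(stream, return_counts=True)))

# Conformal calibration: distribution of over-estimation errors on queries with known counts.
calib = rng.choice(list(true_counts), size=100, replace=False)
errors = np.array([cms_query(q) - true_counts[q] for q in calib])
alpha = 0.1
level = np.ceil((1 - alpha) * (len(errors) + 1)) / len(errors)
q_hat = np.quantile(errors, level)

def lower_bound(item):
    # One-sided lower confidence bound for the frequency of a random query.
    return max(0, cms_query(item) - q_hat)

q = calib[0]
print("estimate:", cms_query(q), "lower bound:", lower_bound(q), "truth:", true_counts[q])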

Seminar slides

 

 

Monday, March 27th, 2023

Augusto FASANO

 

Scalable and accurate variational Bayes for high-dimensional binary regression models and beyond

Abstract:

Bayesian binary probit regression and its extensions to time-dependent observations and multi-class responses are popular tools in binary and categorical data regression due to their interpretability and non-restrictive assumptions. Although the theory is well established in the frequentist literature, these models are still the subject of active research in the Bayesian framework, aimed at overcoming computational issues and inaccuracies in high dimensions, as well as the lack of a closed-form expression for the posterior distribution of the model parameters in many cases. We develop a novel variational approximation for the posterior distribution of the coefficients in high-dimensional probit regression with binary responses and Gaussian priors, resulting in a unified skew-normal (SUN) approximating distribution that converges to the exact posterior as the number of predictors increases. Moreover, we derive closed-form expressions for posterior distributions arising from models that account for correlated binary time series and multi-class responses, developing computational methods that outperform state-of-the-art routines. Finally, we show that such methodological and computational results can be extended to a broad variety of routinely used regression models by leveraging SUN conjugacy.
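
For orientation, the sketch below implements the classical mean-field variational scheme for Bayesian probit regression with a Gaussian prior, based on Albert-Chib data augmentation. It is deliberately the standard baseline, not the partially factorized unified skew-normal (SUN) approximation developed in the talk.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)

tau2 = 10.0                                       # prior: beta ~ N(0, tau2 * I)
V = np.linalg.inv(X.T @ X + np.eye(p) / tau2)     # covariance of q(beta), fixed across iterations
m = np.zeros(p)                                   # mean of q(beta)
s = 2 * y - 1                                     # +1 / -1 coding of the responses
for _ in range(100):
    mu = X @ m
    # Mean of the truncated-normal augmented variables z_i | y_i under the factorized q.
    Ez = mu + s * norm.pdf(mu) / norm.cdf(s * mu)
    # Coordinate-ascent update of q(beta) = N(m, V).
    m = V @ (X.T @ Ez)

print("VB posterior mean :", np.round(m, 2))
print("true coefficients :", np.round(beta_true, 2))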

 

Monday, November 5th, 2022

Gerardo GALLO

Head of the Permanent Population Census Service, Istat

 

Administrative «signs of life»: what they are and how they are used in the permanent population census

 

Abstract

From 1861 to 2001, population censuses underwent remarkable changes. Despite this, the backbone of the population census remained more or less unchanged at least until the beginning of the new millennium. Until 2001, the census survey was carried out across the entire national territory by municipal enumerators who went door to door to dwellings and other housing facilities to count households, institutional households and the persons habitually residing there (i.e. residents), and to record their main characteristics.

The 2011 census retained complete and simultaneous enumeration but used, for the first time in Italy, a starting list of survey units built from the nominal records of the municipal population registers (Liste anagrafiche comunali, Lac) of all Italian municipalities. This was a decisive step towards a centralized statistical register of the resident population, making it possible in the following years to carry out the Italian census on the basis of administrative data, as has long been the case in other European countries.

In 2012, with legislative provision no. 221/2012 converting decree-law no. 179/2012 into law, the permanent population census was launched: it is carried out every year and rests on the combination of sample surveys and statistically processed administrative data.

Within this framework, the seminar illustrates the most recent procedures adopted by researchers at the Italian National Institute of Statistics (Istat) to select the statistical sources suitable for observing individuals' «signs of life» in terms of «habitual residence in Italy». It will also show how these signals enter the overall population counting process, in compliance with the European regulations on censuses. Finally, it will touch on the avenues suggested by Istat for carrying out research based on census data.

 

 

Thursday, November 1st, 2022

Andrea POZZI

Università Cattolica del Sacro Cuore di Brescia
 

Solving Sequential Decision-Making Problems with Reinforcement Learning

 

Abstract

Reinforcement learning is an area of machine learning concerned with how an agent should take actions in an environment so as to maximize a notion of cumulative reward, where the interaction between the agent and the environment is usually represented as a Markov Decision Process. Reinforcement learning, which is studied in computer science, control theory, and statistics, is one of the most promising artificial intelligence techniques of the last decade. In particular, its use in combination with deep neural networks has led to extraordinary results, such as achieving super-human performance in numerous games (see for instance Mnih et al. 2013, “Playing Atari with Deep Reinforcement Learning”). In real-world scenarios, reinforcement learning is mainly used when: i) a model of the environment is known but an analytic solution is not available, or ii) the only way to collect information about the environment is to interact with it directly. Among the many real-world applications of reinforcement learning, the focus here is on: autonomous vehicle control; algo-trading and portfolio management in finance; and text summarization in natural language processing.
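
As a minimal concrete example of the agent-environment loop described above, the following Python sketch runs tabular Q-learning on a toy chain Markov Decision Process; the environment, reward, and hyperparameters are invented for illustration only.

import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions = 5, 2                        # toy chain: action 0 moves left, 1 moves right

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if nxt == n_states - 1 else 0.0  # reward only at the right end of the chain
    return nxt, reward, nxt == n_states - 1

Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1                # learning rate, discount factor, exploration rate
for episode in range(300):
    s, done = 0, False
    while not done:
        if rng.random() < eps:
            a = int(rng.integers(n_actions))      # explore
        else:                                     # exploit, breaking ties at random
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s2, r, done = step(s, a)
        target = r + gamma * (0.0 if done else Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])     # temporal-difference update
        s = s2

print(np.round(Q, 2))                             # "move right" should dominate in every state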

 

Webinar Video

 

 

Friday, July 15th, 2022

Mario BERAHA

Politecnico di Milano

 

Normalized Latent Measure Factor Models

Abstract

We propose a methodology for modeling and comparing probability distributions within a Bayesian nonparametric framework. Building on dependent normalized random measures, we consider a prior distribution for a collection of discrete random measures where each measure is a linear combination, with positive random weights, of a set of latent measures, interpretable as characteristic traits shared by different distributions. The model is non-identified, and a method for post-processing posterior samples to achieve identified inference is developed. This uses Riemannian optimization to solve a non-trivial optimization problem over a Lie group of matrices. The effectiveness of our approach is validated on simulated data and in two applications to real-world data sets: school student test scores and personal incomes in California. Our approach leads to interesting insights into the populations under study and to easily interpretable posterior inference.


Friday, July 15th, 2022

Andrea GILARDI

Università degli Studi di Milano-Bicocca

 

Lattice models for spatial data on linear networks

 

Abstract

In recent years, there has been a surge of interest in the statistical analysis of spatial data lying on or alongside networks. Car crashes, vehicle thefts, and ambulance interventions are just a few of the most typical examples, with the edges of the network representing an abstraction of roads, rivers or railways. In this talk, we discuss two approaches for the analysis of car crashes at the street-network level. In both cases, the analyses are based on a major city (Leeds, UK) in which car crashes of different severity were recorded over several years. In the first project, we introduce a multivariate Bayesian hierarchical model that includes spatially structured and unstructured random effects to capture the spatial nature of the events and the dependencies between the severity levels. We also discuss a novel procedure for testing for the presence of MAUP at the network-lattice level. In the second part of the talk, we present a series of preliminary results that extend the first project by including an external covariate that suffers from spatial measurement error. The suggested methodology is illustrated using estimates of traffic volumes at the road-network level obtained from mobile devices.

 

 

Wednesday, June 8th, 2022

Statistical Bridges Series

Robert MATTHEWS 
Department of Mathematics – Aston University, Birmingham, UK

The Replication Crisis in Research: A Progress Report

Abstract

Evidence for the unreliability of research claims continues to grow, and has led to the emergence of the so-called "replication crisis".  Arguably the most significant cause of such unreliability is the concept of statistical significance. Its ability to undermine reliable inference has been noted in fields from medicine and psychology to finance and computer science. Following an unprecedented call for action from the American Statistical Association, the statistical community has responded with a range of alternatives, from quick fixes to paradigm shifts. I review the impact of these attempts to move beyond statistical significance, and suggest some ways forward. 

 

Webinar Video

 

 

Tuesday, May 31, 2022 

Francesca CHIAROMONTE

Scuola Superiore Sant'Anna and Penn State University
 

Information Matrices and Numbers in Large Supervised Problems

 

Abstract

Contemporary high-throughput data gathering techniques, measuring massive numbers of features simultaneously and/or merging information from multiple sources, lead to high or ultra-high dimensional supervised problems. In this talk I will briefly introduce two classes of statistical methods used to tackle such problems: Sufficient Dimension Reduction (SDR) techniques, which extract a small number of synthetic features to capture information on the outcome variable; and Screening algorithms, which are used to remove features irrelevant to the outcome prior to the use of dimension reduction or feature selection techniques. In particular, I will present recent SDR [1] and Screening [2] approaches based on a Fisher Information framework. This is joint work with Debmalya Nandy (University of Colorado), Weixin Yao (UC Riverside), Runze Li (Penn State University) and Bruce Lindsay (in memoriam).

[1] Weixin Yao, Debmalya Nandy, Bruce G. Lindsay & Francesca Chiaromonte (2019) Covariate Information Matrix for Sufficient Dimension Reduction, Journal of the American Statistical Association, 114:528, 1752-1764, DOI: 10.1080/01621459.2018.1515080

[2] Debmalya Nandy, Francesca Chiaromonte & Runze Li (2021) Covariate Information Number for Feature Screening in Ultrahigh-Dimensional Supervised Problems, Journal of the American Statistical Association, DOI: 10.1080/01621459.2020.1864380.
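
As a generic illustration of the screening idea mentioned in the abstract above (and not of the covariate-information-number methods of [1] and [2]), the Python sketch below ranks features by their absolute marginal correlation with the outcome and retains the top d, in the spirit of sure independence screening.

import numpy as np

rng = np.random.default_rng(6)
n, p, d = 200, 5000, 50                        # ultra-high dimensional toy setting
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3, -2, 2, 1.5, -1.5]               # only the first five features matter
y = X @ beta + rng.normal(size=n)

# Marginal screening: rank features by |corr(X_j, y)| and keep the top d.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
keep = np.argsort(-score)[:d]
print("true features retained:", np.intersect1d(keep, np.arange(5)))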

 

 

Wednesday, April 13, 2022

Francesco DENTI

Department of Statistical Sciences, Università Cattolica del Sacro Cuore

 

Bayesian nonparametric mixtures for novelty detection and partially exchangeable data

 

Abstract

Bayesian mixtures have become an extremely popular statistical tool: they are widely applied to address different tasks such as density estimation, model-based clustering, and novelty detection.

In this talk, we will briefly introduce the fundamentals of this topic with a specific focus on nonparametric mixtures, along with the computational methods used to perform inference.

Then, we will discuss how we can extend the basic models to address the challenges that arise when dealing with more complex data, discussing two examples.

First, we will present a two-stage semiparametric Bayesian model for novelty detection called Brand. Brand is a model-based classifier that robustly learns the characteristics of observed classes from a training set while allowing for the presence of unseen classes in the test set. The classification of the test data into seen and unseen classes - the latter containing novelties and outliers - is obtained via nested mixtures.

Second, we will discuss mixtures for partially exchangeable datasets, where observations are naturally organized into groups. Examples of this setting are microbiome abundance tables, where microbial frequencies are recorded for different subjects, or the well-known Spotify datasets, which provide audio features for multiple songs authored by various artists.

After introducing the nested Dirichlet process, we will discuss a potential shortcoming of its use for distributional clustering and propose a solution, the Common Atoms Model (CAM).

CAM is a Bayesian nonparametric model that estimates a two-layer clustering solution, grouping both observations and groups, while allowing information sharing across the statistical units.

 

Webinar Video

 

 

Wednesday, March 2nd, 2022

Marco FATTORE

Department of Statistics and Quantitative Methods, Università degli Studi di Milano-Bicocca

 

Dimensionality reduction and ranking extraction from multidimensional systems of ordinal indicators

 

Abstract

The problem of constructing composite indices and rankings from multidimensional systems of ordinal indicators is increasingly widespread in socio-economic statistics and in support of evaluation processes and multi-criteria decision/policy making. Nevertheless, the methodological toolkit for analysing multidimensional ordinal data is still limited and largely borrowed from the statistical analysis of quantitative variables. The aim of the seminar is to show how a genuine "ordinal data analysis" can instead be set up by importing the appropriate mathematical structures into statistical methodology, starting from the theory of order relations, a branch of discrete mathematics devoted to the properties of partially ordered and quasi-ordered sets. In particular, taking the measurement of multidimensional poverty, well-being and sustainability as a starting point, the seminar addresses dimensionality reduction and ranking extraction for systems of ordinal and partially ordered data, introducing the most recent methodological developments, illustrating both the algorithms already available and those under development, and discussing their strengths and weaknesses, with particular attention to computational aspects. The seminar concludes with an overview of the theoretical and applied research lines currently being pursued, providing a map of the problems already solved and of those still open in the multidimensional ordinal analysis of socio-economic data.
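
A minimal Python sketch of the order-theoretic starting point, using invented data: each statistical unit is a profile of ordinal indicator levels, one profile dominates another when it is at least as good on every indicator, and the incomparable pairs are what make the resulting structure a partial rather than a total order.

import numpy as np

# Five statistical units evaluated on three ordinal indicators (levels 1-4).
profiles = np.array([
    [3, 2, 4],
    [2, 2, 3],
    [4, 4, 4],
    [1, 3, 2],
    [2, 1, 1],
])

n = len(profiles)
# dominance[i, j] is True when unit i is at least as good as unit j on every indicator.
dominance = np.array([[bool(np.all(profiles[i] >= profiles[j])) for j in range(n)]
                      for i in range(n)])

for i in range(n):
    for j in range(i + 1, n):
        if not dominance[i, j] and not dominance[j, i]:
            print(f"units {i} and {j} are incomparable")   # the source of partial (not total) order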

 

 

Monday, October 11, 2021

Statistical Bridges Series

Eric-Jan Wagenmakers
Department of Psychological Methods, University of Amsterdam, http://ejwagenmakers.com   

From p-values to Bayesian evidence

Abstract
Despite its continuing dominance in empirical research, the p-value suffers from a series of well-known statistical limitations; for instance, it cannot quantify evidence in favor of the null hypothesis, it cannot be monitored until the results are sufficiently compelling, and it tends to reject the null even when the evidence is ambiguous. Here I present a simple set of equations that allow researchers to transform their p-values into an approximate objective Bayes factor for the test of a point null hypothesis against a composite alternative. The transformed quantity is able to quantify evidence in favor of the null hypothesis, may be monitored until it is sufficiently compelling, and does not reject the null when the evidence is ambiguous.
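
As one concrete and well-known example of such a transformation, the Python snippet below computes the upper bound on the Bayes factor against the null implied by a p-value via the -e * p * log(p) calibration of Sellke, Bayarri and Berger (2001); it is offered only for illustration and is not necessarily the set of equations presented in the talk.

import numpy as np

def bayes_factor_bound(p):
    # Upper bound on the Bayes factor in favour of H1 implied by a p-value,
    # following the -e * p * log(p) calibration (valid for p < 1/e).
    p = np.asarray(p, dtype=float)
    return 1.0 / (-np.e * p * np.log(p))

for p in (0.05, 0.01, 0.005, 0.001):
    print(f"p = {p}: maximum Bayes factor against the null = {bayes_factor_bound(p):.1f}")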

Webinar Video

 

Friday, June 11, 2021

Statistical Bridges Series

Elizabeth Ogburn
Johns Hopkins University, Department of Biostatistics - https://www.eogburn.com   

Social network dependence, unmeasured confounding, and the replication crisis

Abstract
In joint work with Youjin Lee, we showed that social network dependence can result in spurious associations, potentially contributing to replication crises across the health and social sciences. Researchers in these fields frequently sample subjects from one or a small number of communities, schools, hospitals, etc., and while many of the limitations of such convenience samples are well-known, the issue of statistical dependence due to social network ties has not previously been addressed. A paradigmatic example of this is the Framingham Heart Study (FHS). Using a statistic that we adapted to measure network dependence, we tested for possible spurious associations due to network dependence in several of the thousands of influential papers published using FHS data. Results suggest that some of the many decades of research on coronary heart disease, other health outcomes, and peer influence using FHS data may suffer from spurious estimates of association and anticonservative uncertainty quantification due to unacknowledged network structure.  In the latter part of the talk I will discuss how the phenomenon of spurious associations due to dependence is related to unmeasured confounding by network structure, akin to confounding by population structure in GWAS studies, and how this relationship sheds light on methods to control for both spurious associations and unmeasured confounding.

Most relevant paper:
Youjin Lee & Elizabeth L. Ogburn (2020): Network Dependence Can Lead to Spurious Associations and Invalid Inference, Journal of the American Statistical Association, DOI: 10.1080/01621459.2020.1782219
https://www.tandfonline.com/doi/pdf/10.1080/01621459.2020.1782219
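
For intuition, the Python sketch below computes a Moran's-I-type statistic over a network adjacency matrix, a quantity closely related to (though not identical with) the network-dependence statistic used in the paper; the simulated network and outcomes are purely illustrative.

import numpy as np

def morans_i(x, A):
    # Moran's I of outcome x over adjacency matrix A: values near 0 suggest no
    # network dependence, positive values suggest similarity between network ties.
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    return (len(x) / A.sum()) * (z @ A @ z) / (z ** 2).sum()

rng = np.random.default_rng(7)
n = 100
A = rng.binomial(1, 0.05, size=(n, n))
A = np.triu(A, 1)
A = A + A.T                                       # symmetric network, no self-loops

x_indep = rng.normal(size=n)                      # outcome unrelated to the network
deg = np.maximum(A.sum(axis=1), 1)
x_dep = x_indep + 0.5 * (A @ x_indep) / deg       # outcome influenced by network neighbours
print("independent outcome      :", round(morans_i(x_indep, A), 3))
print("network-dependent outcome:", round(morans_i(x_dep, A), 3))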

Webinar Video
Interview

 

Wednesday, May 26, 2021
Jianyi Lin
Università Cattolica del Sacro Cuore

Computational complexity and exact algorithms for size-constrained clustering problems

Abstract
Classical clustering analysis in a geometric ambient space can benefit from a priori background information in the form of problem constraints, such as cluster size constraints, introduced to avoid unbalanced clusterings and hence improve the solution's quality. I will present the problem of finding a partition of a given set of n points of the d-dimensional real space into k clusters such that the lp-norm-induced distance of all points from their cluster centroid is globally minimized and each cluster has a prescribed cardinality. Such a general problem is as computationally intractable as its unconstrained counterpart, the k-means problem, which was shown to be NP-hard for a general number of clusters k only in 2008 by S. Dasgupta, although the corresponding heuristic has been widespread since the 1950s. Computational hardness results will be presented for certain variants of size-constrained geometric clustering, while for the cases tractable in polynomial time and space some globally optimal exact algorithms, based on computational and algebraic geometry techniques, will be illustrated.
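
A small Python sketch of one ingredient of such problems, under invented data: with the centroids held fixed, the size-constrained assignment of points to clusters can be solved exactly by replicating each centroid according to its prescribed cardinality and solving a linear assignment problem. This is only the assignment step, not the globally optimal algorithms for the full clustering problem discussed in the talk.

import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(8)
n, d, sizes = 12, 2, [5, 4, 3]                    # prescribed cluster cardinalities
points = rng.normal(size=(n, d))
centroids = rng.normal(size=(len(sizes), d))      # assumed fixed for this sketch

# Replicate each centroid as many times as its prescribed size ("one slot per seat"),
# then solve the point-to-slot assignment exactly under squared Euclidean cost.
slots = np.repeat(np.arange(len(sizes)), sizes)
cost = ((points[:, None, :] - centroids[slots][None, :, :]) ** 2).sum(axis=2)
rows, cols = linear_sum_assignment(cost)
labels = slots[cols]

print(labels, np.bincount(labels))                # cluster cardinalities match `sizes`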

 

Wednesday, May 5, 2021 
Stefano Rizzelli
Università Cattolica del Sacro Cuore - École Polytechnique Fédérale de Lausanne

Data-dependent choice of prior hyperparameters in Bayesian inference: consistency and merging of posterior distributions

Abstract
The Bayesian inferential paradigm prescribes the specification of a prior distribution on the parameters of the statistical model. For complex models, the subjective elicitation of prior hyperparameters can be a delicate and difficult task. This is particularly the case for hyperparameters affecting posterior inference via complexity penalization, shrinkage effects, etc. In the absence of sufficient prior information, a principled specification of a hyper-prior distribution can be difficult too and may complicate computations. It is common practice to resort to a data-driven choice of the prior hyperparameters as a shortcut: this approach is commonly called empirical Bayes (EB). Although not rigorous from a Bayesian standpoint, the traditional folklore of EB analysis is that it provides approximations to genuine Bayesian inference, while enjoying some frequentist asymptotic guarantees. We give a new illustration of EB posterior consistency in a semiparametric estimation problem, involving the analysis of extreme multivariate events. We then move to parametric models and focus on merging in total variation between EB and Bayesian posterior/predictive distributions, almost surely as the sample size increases. We provide new results refining those in Petrone et al. (2014) and illustrate their applications in the context of variable selection.
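
A toy Python example of the empirical Bayes shortcut described above, in the simplest conjugate setting (normal means with a normal prior and known error variance): the prior variance is chosen by maximizing the marginal likelihood of the data and then plugged into the posterior. The semiparametric extreme-value analysis and the merging results of the talk go far beyond this sketch.

import numpy as np

rng = np.random.default_rng(9)
n, sigma, tau_true = 500, 1.0, 2.0
theta = rng.normal(0.0, tau_true, size=n)         # latent means
y = rng.normal(theta, sigma)                      # one observation per mean

# Empirical Bayes: marginally y_i ~ N(0, sigma^2 + tau^2), so the marginal-likelihood
# maximizer for the prior variance is tau2_hat = max(0, mean(y^2) - sigma^2).
tau2_hat = max(0.0, np.mean(y ** 2) - sigma ** 2)

# Plug-in posterior means: shrinkage of each observation towards the prior mean 0.
shrink = tau2_hat / (tau2_hat + sigma ** 2)
theta_eb = shrink * y
print("estimated prior variance:", round(tau2_hat, 2), "(true value 4.0)")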


Archive of seminars and events from previous years: