Finding Multidimensional Simpson's Paradox

ACM SIGKDD Explorations Newsletter, Volume 24 (2): 13 – Dec 5, 2022

Publisher: Association for Computing Machinery
Copyright: © 2022. Copyright is held by the owner/author(s)
ISSN: 1931-0145
eISSN: 1931-0153
DOI: 10.1145/3575637.3575645

Abstract

Finding and analyzing Simpson's paradox, a well-known statistical phenomenon, has found many applications. While the existing literature focuses on only analyzing the causes of identified Simpson's paradox, there is no systematic analysis on Simpson's paradox in multidimensional spaces. In this paper, we develop a simple yet practical approach to automatically identify all Simpson's paradox instances formed by various sub-populations and separator attributes in a multidimensional data set. Moreover, we analyze the distribution of the multidimensional Simpson's paradox instances on three real data sets with respect to dimensionality, size of sub-populations, participation of individual records, redundancy, and more. We obtain a series of interesting observations about a few questions that have never been asked before. The results open doors to a few interesting directions for future study. Moreover, this paper is an outcome from a high-school student summer research internship. It reflects our ongoing effort in promoting data science research to youth and high school students.
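
For readers unfamiliar with the phenomenon, the following is a minimal sketch of what checking a single Simpson's paradox instance might look like. It is not the paper's algorithm, which this page does not describe. It assumes a binary outcome, two groups being compared, and one separator attribute, and it reports a paradox when the direction of the overall comparison is reversed inside every stratum of the separator. The function names (`success_rates`, `is_simpson_instance`, `expand`) and the field names are hypothetical, and the counts are the widely cited kidney-stone treatment figures, used here only as illustration.

```python
from collections import defaultdict


def success_rates(records, group_key, outcome_key):
    """Return {group value: mean outcome} over the given records."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of outcomes, count]
    for r in records:
        t = totals[r[group_key]]
        t[0] += r[outcome_key]
        t[1] += 1
    return {g: s / n for g, (s, n) in totals.items()}


def is_simpson_instance(records, group_key, outcome_key, separator_key, a, b):
    """True if the a-vs-b comparison flips direction in every stratum.

    Assumption: a stratum missing either group, or showing a tie,
    disqualifies the instance. Other definitions are possible.
    """
    overall = success_rates(records, group_key, outcome_key)
    if a not in overall or b not in overall or overall[a] == overall[b]:
        return False
    overall_sign = overall[a] > overall[b]

    # Partition the records by the separator attribute.
    strata = defaultdict(list)
    for r in records:
        strata[r[separator_key]].append(r)

    for part in strata.values():
        rates = success_rates(part, group_key, outcome_key)
        if a not in rates or b not in rates or rates[a] == rates[b]:
            return False
        if (rates[a] > rates[b]) == overall_sign:
            return False  # this stratum agrees with the overall trend
    return True


def expand(treatment, size, successes, total):
    """Hypothetical helper: turn aggregate counts into individual records."""
    hit = {"treatment": treatment, "size": size, "recovered": 1}
    miss = {"treatment": treatment, "size": size, "recovered": 0}
    return [hit] * successes + [miss] * (total - successes)


records = (
    expand("A", "small", 81, 87) + expand("A", "large", 192, 263)
    + expand("B", "small", 234, 270) + expand("B", "large", 55, 80)
)
paradox = is_simpson_instance(records, "treatment", "recovered", "size", "A", "B")
```

Here treatment B has the higher recovery rate overall, while treatment A has the higher rate within both the small and the large stratum, so `paradox` is `True`. A multidimensional search of the kind the abstract mentions would presumably repeat such a check across many candidate sub-populations and separator attributes; how the paper enumerates them is not stated on this page.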

Journal

ACM SIGKDD Explorations Newsletter, Association for Computing Machinery

Published: Dec 5, 2022
