DATA 8: Data Science
What is Data Science?
Data Science is about drawing useful conclusions from large and diverse data sets through exploration, prediction, and inference.
- Exploration involves identifying patterns in information.
- Primary tools for exploration are visualizations and descriptive statistics.
- Prediction involves using information we know to make informed guesses about values we wish we knew.
- The primary tools for prediction are machine learning and optimization.
- Inference involves quantifying our degree of certainty: will the patterns that we found in our data also appear in new observations? How accurate are our predictions?
- The primary tools for inference are statistical tests and models.
Components
- Statistics is a central component of data science because statistics studies how to make robust conclusions based on incomplete information.
- Computing is a central component because programming allows us to apply analysis techniques to the large and diverse data sets that arise in real-world applications: not just numbers, but text, images, videos, and sensor readings.
- Data science is all of these things, but it is more than the sum of its parts because of the applications.
Through understanding a particular domain, data scientists learn to ask appropriate questions about their data and correctly interpret the answers provided by our inferential and computational tools.
Introduction
Data are descriptions of the world around us, collected through observation and stored on computers. Computers enable us to infer properties of the world from these descriptions. Data science is the discipline of drawing conclusions from data using computation. There are three core aspects of effective data analysis: exploration, prediction, and inference. This text develops a consistent approach to all three, introducing statistical ideas and fundamental ideas in computer science concurrently. We focus on a minimal set of core techniques that can be applied to a vast range of real-world applications. A foundation in data science requires not only understanding statistical and computational techniques, but also recognizing how they apply to real scenarios.
For whatever aspect of the world we wish to study—whether it’s the Earth’s weather, the world’s markets, political polls, or the human mind—data we collect typically offer an incomplete description of the subject at hand. A central challenge of data science is to make reliable conclusions using this partial information.
In this endeavor, we will combine two essential tools: computation and randomization. For example, we may want to understand climate change trends using temperature observations. Computers will allow us to use all available information to draw conclusions. Rather than focusing only on the average temperature of a region, we will consider the whole range of temperatures together to construct a more nuanced analysis. Randomness will allow us to consider the many different ways in which incomplete information might be completed. Rather than assuming that temperatures vary in a particular way, we will learn to use randomness as a way to imagine many possible scenarios that are all consistent with the data we observe.
Applying this approach requires learning to program a computer, and so this text interleaves a complete introduction to programming that assumes no prior knowledge. Readers with programming experience will find that we cover several topics in computation that do not appear in a typical introductory computer science curriculum. Data science also requires careful reasoning about numerical quantities, but this text does not assume any background in mathematics or statistics beyond basic algebra. You will find very few equations in this text. Instead, techniques are described to readers in the same language in which they are described to the computers that execute them—a programming language.
Computational Tools
This text uses the Python 3 programming language, along with a standard set of numerical and data visualization tools that are used widely in commercial applications, scientific experiments, and open-source projects. Python has recruited enthusiasts from many professions that use data to draw conclusions. By learning the Python language, you will join a million-person-strong community of software developers and data scientists.
Getting Started. The easiest and recommended way to start writing programs in Python is to log into the companion site for this text, datahub.berkeley.edu. If you have a @berkeley.edu email address, you already have full access to the programming environment hosted on that site. If not, please complete this form to request access.
You are not at all restricted to using this web-based programming environment. A Python program can be executed by any computer, regardless of its manufacturer or operating system, provided that support for the language is installed. If you wish to install the version of Python and its accompanying libraries that will match this text, we recommend the Anaconda distribution that packages together the Python 3 language interpreter, IPython libraries, and the Jupyter notebook environment.
This text includes a complete introduction to all of these computational tools. You will learn to write programs, generate images from data, and work with real-world data sets that are published online.
Statistical Techniques
The discipline of statistics has long addressed the same fundamental challenge as data science: how to draw robust conclusions about the world using incomplete information. One of the most important contributions of statistics is a consistent and precise vocabulary for describing the relationship between observations and conclusions. This text continues in the same tradition, focusing on a set of core inferential problems from statistics: testing hypotheses, estimating confidence, and predicting unknown quantities.
Data science extends the field of statistics by taking full advantage of computing, data visualization, machine learning, optimization, and access to information. The combination of fast computers and the Internet gives anyone the ability to access and analyze vast datasets: millions of news articles, full encyclopedias, databases for any domain, and massive repositories of music, photos, and video.
Applications to real data sets motivate the statistical techniques that we describe throughout the text. Real data often do not follow regular patterns or match standard equations. The interesting variation in real data can be lost by focusing too much attention on simplistic summaries such as average values. Computers enable a family of methods based on resampling that apply to a wide range of different inference problems, take into account all available information, and require few assumptions or conditions. Although these techniques have often been reserved for advanced courses in statistics, their flexibility and simplicity are a natural fit for data science applications.
Why Data Science?
Most important decisions are made with only partial information and uncertain outcomes. However, the degree of uncertainty for many decisions can be reduced sharply by access to large data sets and the computational tools required to analyze them effectively. Data-driven decision making has already transformed a tremendous breadth of industries, including finance, advertising, manufacturing, and real estate. At the same time, a wide range of academic disciplines are evolving rapidly to incorporate large-scale data analysis into their theory and practice.
Studying data science enables individuals to bring these techniques to bear on their work, their scientific endeavors, and their personal decisions. Critical thinking has long been a hallmark of a rigorous education, but critiques are often most effective when supported by data. A critical analysis of any aspect of the world, be it business or social science, involves inductive reasoning; conclusions can rarely be proven outright, only supported by the available evidence. Data science provides the means to make precise, reliable, and quantitative arguments about any set of observations. With unprecedented access to information and computing, critical thinking about any aspect of the world that can be measured would be incomplete without effective inferential techniques.
The world has too many unanswered questions and difficult challenges to leave this critical reasoning to only a few specialists. All educated members of society can build the capacity to reason about data. The tools, techniques, and data sets are all readily available; this text aims to make them accessible to everyone.
Plotting the classics
In this example, we will explore statistics for two classic novels: The Adventures of Huckleberry Finn by Mark Twain, and Little Women by Louisa May Alcott. The text of any book can be read by a computer at great speed. Books published before 1923 are currently in the public domain, meaning that everyone has the right to copy or use the text in any way. Project Gutenberg is a website that publishes public domain books online. Using Python, we can load the text of these books directly from the web.
This example is meant to illustrate some of the broad themes of this text. Don’t worry if the details of the program don’t yet make sense. Instead, focus on interpreting the images generated below. Later sections of the text will describe most of the features of the Python programming language used below.
First, we read the text of both books into lists of chapters, called huck_finn_chapters and little_women_chapters. In Python, a name cannot contain any spaces, and so we will often use an underscore _ to stand in for a space. The = in the lines below gives a name on the left to the result of the computation described on the right. A uniform resource locator, or URL, is an address on the Internet for some content; in this case, the text of a book. The # symbol starts a comment, which is ignored by the computer but helpful for people reading the code.
1 | import urllib.request |
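The cell above is shown only in abbreviated form. A minimal sketch of how the two chapter lists might be built follows; the URLs, the read_url helper, and the split indices are assumptions about how the plain-text files are laid out, not the exact code of the original cell.

```python
import urllib.request

def read_url(url):
    # Download a text file from the web and return it as one string.
    return urllib.request.urlopen(url).read().decode('utf-8')

# Assumed locations of plain-text versions of the two novels.
huck_finn_url = 'https://www.inferentialthinking.com/data/huck_finn.txt'
little_women_url = 'https://www.inferentialthinking.com/data/little_women.txt'

# Split each book into chapters. The split marker and the starting index
# are assumptions about where the table of contents ends in each file.
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]

little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]
```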
While a computer cannot understand the text of a book, it can provide us with some insight into the structure of the text. The name huck_finn_chapters is currently bound to a list of all the chapters in the book. We can place them into a table to see how each chapter begins.
1 | import pandas as pd |
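A sketch of what the abbreviated cell above might contain, assuming the chapter list built in the previous step; the column name Chapters matches the output shown below.

```python
import pandas as pd

# A one-column table whose rows are the chapters of Huckleberry Finn.
pd.DataFrame({'Chapters': huck_finn_chapters})
```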
Chapters | |
---|---|
0 | I.YOU don't know about me without you have rea... |
1 | II.WE went tiptoeing along a path amongst the ... |
2 | III.WELL, I got a good going-over in the morni... |
3 | IV.WELL, three or four months run along, and i... |
4 | V.I had shut the door to. Then I turned aroun... |
... | ... |
38 | XXXIX.IN the morning we went up to the village... |
39 | XL.WE was feeling pretty good after breakfast,... |
40 | XLI.THE doctor was an old man; a very nice, ki... |
41 | XLII.THE old man was uptown again before break... |
42 | THE LASTTHE first time I catched Tom private I... |
Each chapter begins with a chapter number in Roman numerals, followed by the first sentence of the chapter. Project Gutenberg has printed the first word of each chapter in upper case.
Literary Characters
The Adventures of Huckleberry Finn describes a journey that Huck and Jim take along the Mississippi River. Tom Sawyer joins them towards the end as the action heats up. Having loaded the text, we can quickly visualize how many times these characters have each been mentioned at any point in the book.
1 | import numpy as np |
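A sketch of how the cumulative mention counts below might be computed; the table name counts is an assumption introduced here so that the plotting sketch further on can refer to it.

```python
import numpy as np
import pandas as pd

# np.char.count counts how many times a name appears in each chapter;
# np.cumsum adds those counts up as the book progresses.
counts = pd.DataFrame({
    'Jim': np.cumsum(np.char.count(huck_finn_chapters, 'Jim')),
    'Tom': np.cumsum(np.char.count(huck_finn_chapters, 'Tom')),
    'Huck': np.cumsum(np.char.count(huck_finn_chapters, 'Huck')),
    'Chapter': np.arange(1, len(huck_finn_chapters) + 1),
})
counts
```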
Jim | Tom | Huck | Chapter | |
---|---|---|---|---|
0 | 0 | 6 | 3 | 1 |
1 | 16 | 30 | 5 | 2 |
2 | 16 | 35 | 7 | 3 |
3 | 24 | 35 | 8 | 4 |
4 | 24 | 35 | 8 | 5 |
... | ... | ... | ... | ... |
38 | 345 | 177 | 69 | 39 |
39 | 358 | 188 | 72 | 40 |
40 | 358 | 196 | 72 | 41 |
41 | 370 | 226 | 74 | 42 |
42 | 376 | 232 | 78 | 43 |
1 | # Plot the cumulative counts: |
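The plotting cell is also abbreviated. A sketch using plotly (the plotting library imported later in this text) and the counts table from the sketch above:

```python
import plotly.graph_objects as go

# One line per character, showing cumulative mentions by chapter.
fig = go.Figure()
for name in ['Jim', 'Tom', 'Huck']:
    fig.add_trace(go.Scatter(x=counts['Chapter'], y=counts[name], name=name))
fig.update_layout(xaxis_title='Chapter',
                  yaxis_title='Cumulative number of mentions')
fig.show()
```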
In the plot above, the horizontal axis shows chapter numbers and the vertical axis shows how many times each character has been mentioned up to and including that chapter.
You can see that Jim is a central character by the large number of times his name appears. Notice how Tom is hardly mentioned for much of the book until he arrives and joins Huck and Jim, after Chapter 30. His curve and Jim’s rise sharply at that point, as the action involving both of them intensifies. As for Huck, his name hardly appears at all, because he is the narrator.
Little Women is a story of four sisters growing up together during the Civil War. In this book, chapter numbers are spelled out and chapter titles are written in all capital letters.
1 | # The chapters of Little Women, in a table |
Chapters | |
---|---|
0 | ONEPLAYING PILGRIMS"Christmas won't be Christm... |
1 | TWOA MERRY CHRISTMASJo was the first to wake i... |
2 | THREETHE LAURENCE BOY"Jo! Jo! Where are you?... |
3 | FOURBURDENS"Oh, dear, how hard it does seem to... |
4 | FIVEBEING NEIGHBORLY"What in the world are you... |
... | ... |
42 | FORTY-THREESURPRISESJo was alone in the twilig... |
43 | FORTY-FOURMY LORD AND LADY"Please, Madam Mothe... |
44 | FORTY-FIVEDAISY AND DEMII cannot feel that I h... |
45 | FORTY-SIXUNDER THE UMBRELLAWhile Laurie and Am... |
46 | FORTY-SEVENHARVEST TIMEFor a year Jo and her P... |
We can track the mentions of main characters to learn about the plot of this book as well. The protagonist Jo interacts with her sisters Meg, Beth, and Amy regularly, up until Chapter 27 when she moves to New York alone.
1 | # Get data little_women_chapters |
Amy | Beth | Jo | Meg | Laurie | Chapter | |
---|---|---|---|---|---|---|
0 | 23 | 26 | 44 | 26 | 0 | 1 |
1 | 36 | 38 | 65 | 46 | 0 | 2 |
2 | 38 | 40 | 127 | 82 | 16 | 3 |
3 | 52 | 58 | 161 | 99 | 16 | 4 |
4 | 58 | 72 | 216 | 112 | 51 | 5 |
... | ... | ... | ... | ... | ... | ... |
42 | 619 | 459 | 1435 | 673 | 571 | 43 |
43 | 632 | 459 | 1444 | 673 | 581 | 44 |
44 | 633 | 461 | 1450 | 675 | 581 | 45 |
45 | 635 | 462 | 1506 | 679 | 583 | 46 |
46 | 645 | 465 | 1543 | 685 | 596 | 47 |
1 | # Plot the cumulative counts. |
Laurie is a young man who marries one of the girls in the end. See if you can use the plots to guess which one.
Inspiration
Could we use the same technique to count positive and negative mentions of a company in the news, in order to judge how breaking news influences its stock price?
Another Kind of Character
In some situations, the relationships between quantities allow us to make predictions. This text will explore how to make accurate predictions based on incomplete information and develop methods for combining multiple sources of uncertain information to make decisions.
As an example of visualizing information derived from multiple sources, let us first use the computer to get some information that would be tedious to acquire by hand. In the context of novels, the word “character” has a second meaning: a printed symbol such as a letter or number or punctuation symbol. Here, we ask the computer to count the number of characters and the number of periods in each chapter of both Huckleberry Finn and Little Women.
1 | # In each chapter, count the number of all characters; |
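A sketch of how the two tables below might be built, assuming the chapter lists from earlier; len gives the number of characters in a chapter and np.char.count the number of periods.

```python
import numpy as np
import pandas as pd

chars_periods_huck_finn = pd.DataFrame({
    'Huck Finn Chapter Length': [len(chapter) for chapter in huck_finn_chapters],
    'Number of Periods': np.char.count(huck_finn_chapters, '.'),
})

chars_periods_little_women = pd.DataFrame({
    'Little Women Chapter Length': [len(chapter) for chapter in little_women_chapters],
    'Number of Periods': np.char.count(little_women_chapters, '.'),
})
```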
Here are the data for Huckleberry Finn. Each row of the table corresponds to one chapter of the novel and displays the number of characters as well as the number of periods in the chapter. Not surprisingly, chapters with fewer characters also tend to have fewer periods, in general: the shorter the chapter, the fewer sentences there tend to be, and vice versa. The relation is not entirely predictable, however, as sentences are of varying lengths and can involve other punctuation such as question marks.
1 | chars_periods_huck_finn |
Huck Finn Chapter Length | Number of Periods | |
---|---|---|
0 | 6970 | 66 |
1 | 11874 | 117 |
2 | 8460 | 72 |
3 | 6755 | 84 |
4 | 8095 | 91 |
... | ... | ... |
38 | 10763 | 96 |
39 | 11386 | 60 |
40 | 13278 | 77 |
41 | 15565 | 92 |
42 | 21461 | 228 |
Here are the corresponding data for Little Women.
1 | chars_periods_little_women |
Little Women Chapter Length | Number of Periods | |
---|---|---|
0 | 21496 | 189 |
1 | 21941 | 188 |
2 | 20335 | 231 |
3 | 25213 | 195 |
4 | 23115 | 255 |
... | ... | ... |
42 | 32811 | 305 |
43 | 10166 | 95 |
44 | 12390 | 96 |
45 | 27078 | 234 |
46 | 40592 | 392 |
You can see that the chapters of Little Women are in general longer than those of Huckleberry Finn. Let us see if these two simple variables – the length and number of periods in each chapter – can tell us anything more about the two books. One way to do this is to plot both sets of data on the same axes.
In the plot below, there is a dot for each chapter in each book. Blue dots correspond to Huckleberry Finn and gold dots to Little Women. The horizontal axis represents the number of periods and the vertical axis represents the number of characters.
1 | import plotly.graph_objects as go |
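A sketch of the scatter plot code, under the assumption that the two tables from the previous sketch are used; the colors match the description above.

```python
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=chars_periods_huck_finn['Number of Periods'],
    y=chars_periods_huck_finn['Huck Finn Chapter Length'],
    mode='markers', marker_color='darkblue', name='Huckleberry Finn'))
fig.add_trace(go.Scatter(
    x=chars_periods_little_women['Number of Periods'],
    y=chars_periods_little_women['Little Women Chapter Length'],
    mode='markers', marker_color='gold', name='Little Women'))
fig.update_layout(xaxis_title='Number of periods in chapter',
                  yaxis_title='Number of characters in chapter')
fig.show()
```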
The plot shows us that many but not all of the chapters of Little Women are longer than those of Huckleberry Finn, as we had observed by just looking at the numbers. But it also shows us something more. Notice how the blue points are roughly clustered around a straight line, as are the gold points. Moreover, it looks as though both colors of points might be clustered around the same straight line.
Now look at all the chapters that contain about 100 periods. The plot shows that those chapters contain about 10,000 characters to about 15,000 characters, roughly. That’s about 100 to 150 characters per period.
Indeed, it appears from looking at the plot that on average both books tend to have somewhere between 100 and 150 characters between periods, as a very rough estimate. Perhaps these two great 19th century novels were signaling something so very familiar to us now: the 140-character limit of Twitter.
Causality and Experiments
“These problems are, and will probably ever remain, among the inscrutable secrets of nature. They belong to a class of questions radically inaccessible to the human intelligence.” —The Times of London, September 1849, on how cholera is contracted and spread
Does the death penalty have a deterrent effect? Is chocolate good for you? What causes breast cancer?
All of these questions attempt to assign a cause to an effect. A careful examination of data can help shed light on questions like these. In this section you will learn some of the fundamental concepts involved in establishing causality.
Observation is a key to good science. An observational study is one in which scientists make conclusions based on data that they have observed but had no hand in generating. In data science, many such studies involve observations on a group of individuals, a factor of interest called a treatment, and an outcome measured on each individual.
It is easiest to think of the individuals as people. In a study of whether chocolate is good for the health, the individuals would indeed be people, the treatment would be eating chocolate, and the outcome might be a measure of heart disease. But individuals in observational studies need not be people. In a study of whether the death penalty has a deterrent effect, the individuals could be the 50 states of the union. A state law allowing the death penalty would be the treatment, and an outcome could be the state’s murder rate.
The fundamental question is whether the treatment has an effect on the outcome. Any relation between the treatment and the outcome is called an association. If the treatment causes the outcome to occur, then the association is causal. Causality is at the heart of all three questions posed at the start of this section. For example, one of the questions was whether chocolate directly causes improvements in health, not just whether there is a relation between chocolate and health.
The establishment of causality often takes place in two stages. First, an association is observed. Next, a more careful analysis leads to a decision about causality.
Observation and Visualization: John Snow and the Broad Street Pump
One of the most powerful examples of astute observation eventually leading to the establishment of causality dates back more than 150 years. To get your mind into the right timeframe, try to imagine London in the 1850’s. It was the world’s wealthiest city but many of its people were desperately poor. Charles Dickens, then at the height of his fame, was writing about their plight. Disease was rife in the poorer parts of the city, and cholera was among the most feared. It was not yet known that germs cause disease; the leading theory was that “miasmas” were the main culprit. Miasmas manifested themselves as bad smells, and were thought to be invisible poisonous particles arising out of decaying matter. Parts of London did smell very bad, especially in hot weather. To protect themselves against infection, those who could afford to held sweet-smelling things to their noses.
For several years, a doctor by the name of John Snow had been following the devastating waves of cholera that hit England from time to time. The disease arrived suddenly and was almost immediately deadly: people died within a day or two of contracting it, hundreds could die in a week, and the total death toll in a single wave could reach tens of thousands. Snow was skeptical of the miasma theory. He had noticed that while entire households were wiped out by cholera, the people in neighboring houses sometimes remained completely unaffected. As they were breathing the same air—and miasmas—as their neighbors, there was no compelling association between bad smells and the incidence of cholera.
Snow had also noticed that the onset of the disease almost always involved vomiting and diarrhea. He therefore believed that the infection was carried by something people ate or drank, not by the air that they breathed. His prime suspect was water contaminated by sewage.
At the end of August 1854, cholera struck in the overcrowded Soho district of London. As the deaths mounted, Snow recorded them diligently, using a method that went on to become standard in the study of how diseases spread: he drew a map. On a street map of the district, he recorded the location of each death.
Here is Snow’s original map. Each black bar represents one death. When there are multiple deaths at the same address, the bars corresponding to those deaths are stacked on top of each other. The black discs mark the locations of water pumps. The map displays a striking revelation—the deaths are roughly clustered around the Broad Street pump.
Snow studied his map carefully and investigated the apparent anomalies. All of them implicated the Broad Street pump. For example:
- There were deaths in houses that were nearer the Rupert Street pump than the Broad Street pump. Though the Rupert Street pump was closer as the crow flies, it was less convenient to get to because of dead ends and the layout of the streets. The residents in those houses used the Broad Street pump instead.
- There were no deaths in two blocks just east of the pump. That was the location of the Lion Brewery, where the workers drank what they brewed. If they wanted water, the brewery had its own well.
- There were scattered deaths in houses several blocks away from the Broad Street pump. Those were children who drank from the Broad Street pump on their way to school. The pump’s water was known to be cool and refreshing.
The final piece of evidence in support of Snow’s theory was provided by two isolated deaths in the leafy and genteel Hampstead area, quite far from Soho. Snow was puzzled by these until he learned that the deceased were Mrs. Susannah Eley, who had once lived in Broad Street, and her niece. Mrs. Eley had water from the Broad Street pump delivered to her in Hampstead every day. She liked its taste.
Later it was discovered that a cesspit that was just a few feet away from the well of the Broad Street pump had been leaking into the well. Thus the pump’s water was contaminated by sewage from the houses of cholera victims.
Snow used his map to convince local authorities to remove the handle of the Broad Street pump. Though the cholera epidemic was already on the wane when he did so, it is possible that the disabling of the pump prevented many deaths from future waves of the disease.
The removal of the Broad Street pump handle has become the stuff of legend. At the Centers for Disease Control (CDC) in Atlanta, when scientists look for simple answers to questions about epidemics, they sometimes ask each other, “Where is the handle to this pump?”
Snow’s map is one of the earliest and most powerful uses of data visualization. Disease maps of various kinds are now a standard tool for tracking epidemics.
Towards Causality
Though the map gave Snow a strong indication that the cleanliness of the water supply was the key to controlling cholera, he was still a long way from a convincing scientific argument that contaminated water was causing the spread of the disease. To make a more compelling case, he had to use the method of comparison.
Scientists use comparison to identify an association between a treatment and an outcome. They compare the outcomes of a group of individuals who got the treatment (the treatment group) to the outcomes of a group who did not (the control group). For example, researchers today might compare the average murder rate in states that have the death penalty with the average murder rate in states that don’t.
If the results are different, that is evidence for an association. To determine causation, however, even more care is needed.
Snow’s “Grand Experiment”
Encouraged by what he had learned in Soho, Snow completed a more thorough analysis. For some time, he had been gathering data on cholera deaths in an area of London that was served by two water companies. The Lambeth water company drew its water upriver from where sewage was discharged into the River Thames. Its water was relatively clean. But the Southwark and Vauxhall (S&V) company drew its water below the sewage discharge, and thus its supply was contaminated.
The map below shows the areas served by the two companies. Snow homed in on the region where the two service areas overlap.
Snow noticed that there was no systematic difference between the people who were supplied by S&V and those supplied by Lambeth. “Each company supplies both rich and poor, both large houses and small; there is no difference either in the condition or occupation of the persons receiving the water of the different Companies … there is no difference whatever in the houses or the people receiving the supply of the two Water Companies, or in any of the physical conditions with which they are surrounded …”
The only difference was in the water supply, “one group being supplied with water containing the sewage of London, and amongst it, whatever might have come from the cholera patients, the other group having water quite free from impurity.”
Confident that he would be able to arrive at a clear conclusion, Snow summarized his data in the table below.
Supply Area | Number of houses | Cholera deaths | Deaths per 10,000 houses |
---|---|---|---|
S&V | 40,046 | 1,263 | 315 |
Lambeth | 26,107 | 98 | 37 |
Rest of London | 256,423 | 1,422 | 59 |
The numbers pointed accusingly at S&V. The death rate from cholera in the S&V houses was almost ten times the rate in the houses supplied by Lambeth.
Establishing Causality
In the language developed earlier in the section, you can think of the people in the S&V houses as the treatment group, and those in the Lambeth houses as the control group. A crucial element in Snow’s analysis was that the people in the two groups were comparable to each other, apart from the treatment.
In order to establish whether it was the water supply that was causing cholera, Snow had to compare two groups that were similar to each other in all but one aspect—their water supply. Only then would he be able to ascribe the differences in their outcomes to the water supply. If the two groups had been different in some other way as well, it would have been difficult to point the finger at the water supply as the source of the disease. For example, if the treatment group consisted of factory workers and the control group did not, then differences between the outcomes in the two groups could have been due to the water supply, or to factory work, or both. The final picture would have been much more fuzzy.
Snow’s brilliance lay in identifying two groups that would make his comparison clear. He had set out to establish a causal relation between contaminated water and cholera infection, and to a great extent he succeeded, even though the miasmatists ignored and even ridiculed him. Of course, Snow did not understand the detailed mechanism by which humans contract cholera. That discovery was made in 1883, when the German scientist Robert Koch isolated the Vibrio cholerae, the bacterium that enters the human small intestine and causes cholera.
In fact the Vibrio cholerae had been identified in 1854 by Filippo Pacini in Italy, just about when Snow was analyzing his data in London. Because of the dominance of the miasmatists in Italy, Pacini’s discovery languished unknown. But by the end of the 1800’s, the miasma brigade was in retreat. Subsequent history has vindicated Pacini and John Snow. Snow’s methods led to the development of the field of epidemiology, which is the study of the spread of diseases.
Confounding
Let us now return to more modern times, armed with an important lesson that we have learned along the way:
In an observational study, if the treatment and control groups differ in ways other than the treatment, it is difficult to make conclusions about causality.
An underlying difference between the two groups (other than the treatment) is called a confounding factor, because it might confound you (that is, mess you up) when you try to reach a conclusion.
Example: Coffee and lung cancer. Studies in the 1960’s showed that coffee drinkers had higher rates of lung cancer than those who did not drink coffee. Because of this, some people identified coffee as a cause of lung cancer. But coffee does not cause lung cancer. The analysis contained a confounding factor—smoking. In those days, coffee drinkers were also likely to have been smokers, and smoking does cause lung cancer. Coffee drinking was associated with lung cancer, but it did not cause the disease.
Confounding factors are common in observational studies. Good studies take great care to reduce confounding and to account for its effects.
Randomization
An excellent way to avoid confounding is to assign individuals to the treatment and control groups at random, and then administer the treatment to those who were assigned to the treatment group. Randomization keeps the two groups similar apart from the treatment.
If you are able to randomize individuals into the treatment and control groups, you are running a randomized controlled experiment, also known as a randomized controlled trial (RCT). Sometimes, people’s responses in an experiment are influenced by their knowing which group they are in. So you might want to run a blind experiment in which individuals do not know whether they are in the treatment group or the control group. To make this work, you will have to give the control group a placebo, which is something that looks exactly like the treatment but in fact has no effect.
Randomized controlled experiments have long been a gold standard in the medical field, for example in establishing whether a new drug works. They are also becoming more commonly used in other fields such as economics.
Example: Welfare subsidies in Mexico. In Mexican villages in the 1990’s, children in poor families were often not enrolled in school. One of the reasons was that the older children could go to work and thus help support the family. Santiago Levy, a minister in the Mexican Ministry of Finance, set out to investigate whether welfare programs could be used to increase school enrollment and improve health conditions. He conducted an RCT on a set of villages, selecting some of them at random to receive a new welfare program called PROGRESA. The program gave money to poor families if their children went to school regularly and the family used preventive health care. More money was given if the children were in secondary school than in primary school, to compensate for the children’s lost wages, and more money was given for girls attending school than for boys. The remaining villages did not get this treatment, and formed the control group. Because of the randomization, there were no confounding factors and it was possible to establish that PROGRESA increased school enrollment. For boys, the enrollment increased from 73% in the control group to 77% in the PROGRESA group. For girls, the increase was even greater, from 67% in the control group to almost 75% in the PROGRESA group. Due to the success of this experiment, the Mexican government supported the program under the new name OPORTUNIDADES, as an investment in a healthy and well educated population.
In some situations it might not be possible to carry out a randomized controlled experiment, even when the aim is to investigate causality. For example, suppose you want to study the effects of alcohol consumption during pregnancy, and you randomly assign some pregnant women to your “alcohol” group. You should not expect cooperation from them if you present them with a drink. In such situations you will almost invariably be conducting an observational study, not an experiment. Be alert for confounding factors.
Endnote
In the terminology that we have developed, John Snow conducted an observational study, not a randomized experiment. But he called his study a “grand experiment” because, as he wrote, “No fewer than three hundred thousand people … were divided into two groups without their choice, and in most cases, without their knowledge …”
Studies such as Snow’s are sometimes called “natural experiments.” However, true randomization does not simply mean that the treatment and control groups are selected “without their choice.”
The method of randomization can be as simple as tossing a coin. It may also be quite a bit more complex. But every method of randomization consists of a sequence of carefully defined steps that allow chances to be specified mathematically. This has two important consequences.
- It allows us to account—mathematically—for the possibility that randomization produces treatment and control groups that are quite different from each other.
- It allows us to make precise mathematical statements about differences between the treatment and control groups. This in turn helps us make justifiable conclusions about whether the treatment has any effect.
In this course, you will learn how to conduct and analyze your own randomized experiments. That will involve more detail than has been presented in this chapter. For now, just focus on the main idea: to try to establish causality, run a randomized controlled experiment if possible. If you are conducting an observational study, you might be able to establish association but it will be harder to establish causation. Be extremely careful about confounding factors before making conclusions about causality based on an observational study.
Terminology
- observational study
- treatment
- outcome
- association
- causal association
- causality
- comparison
- treatment group
- control group
- epidemiology
- confounding
- randomization
- randomized controlled experiment
- randomized controlled trial (RCT)
- blind
- placebo
Fun facts
John Snow is sometimes called the father of epidemiology, but he was an anesthesiologist by profession. One of his patients was Queen Victoria, who was an early recipient of anesthetics during childbirth.
Florence Nightingale, the originator of modern nursing practices and famous for her work in the Crimean War, was a die-hard miasmatist. She had no time for theories about contagion and germs, and was not one for mincing her words. “There is no end to the absurdities connected with this doctrine,” she said. “Suffice it to say that in the ordinary sense of the word, there is no proof such as would be admitted in any scientific enquiry that there is any such thing as contagion.”
A later RCT established that the conditions on which PROGRESA insisted—children going to school, preventive health care—were not necessary to achieve increased enrollment. Just the financial boost of the welfare payments was sufficient.
Good reads
The Strange Case of the Broad Street Pump: John Snow and the Mystery of Cholera by Sandra Hempel, published by our own University of California Press, reads like a whodunit. It was one of the main sources for this section’s account of John Snow and his work. A word of warning: some of the contents of the book are stomach-churning.
Poor Economics, the best seller by Abhijit Banerjee and Esther Duflo of MIT, is an accessible and lively account of ways to fight global poverty. It includes numerous examples of RCTs, including the PROGRESA example in this section.
Programming in Python
Programming in Python
Programming can dramatically improve our ability to collect and analyze information about the world, which in turn can lead to discoveries through the kind of careful reasoning demonstrated in the previous section. In data science, the purpose of writing a program is to instruct a computer to carry out the steps of an analysis. Computers cannot study the world on their own. People must describe precisely what steps the computer should take in order to collect and analyze data, and those steps are expressed through programs.
Expressions
Programming languages are much simpler than human languages. Nonetheless, there are some rules of grammar to learn in any language, and that is where we will begin. In this text, we will use the Python programming language. Learning the grammar rules is essential, and the same rules used in the most basic programs are also central to more sophisticated programs.
Programs are made up of expressions, which describe to the computer how to combine pieces of data. For example, a multiplication expression consists of a * symbol between two numerical expressions. Expressions, such as 3 * 4, are evaluated by the computer. The value (the result of evaluation) of the last expression in each cell, 12 in this case, is displayed below the cell.
1 | 3 * 4 |
The grammar rules of a programming language are rigid. In Python, the * symbol cannot appear twice in a row. The computer will not try to interpret an expression that differs from its prescribed expression structures. Instead, it will show a SyntaxError. The syntax of a language is its set of grammar rules, and a SyntaxError indicates that an expression structure doesn’t match any of the rules of the language.
1 | 3 * * 4 |
Small changes to an expression can change its meaning entirely. Below, the space between the *’s has been removed. Because ** appears between two numerical expressions, the expression is a well-formed exponentiation expression (the first number raised to the power of the second: 3 times 3 times 3 times 3). The symbols * and ** are called operators, and the values they combine are called operands.
1 | 3 ** 4 |
Common Operators. Data science often involves combining numerical values, and the set of operators in a programming language is designed so that any sort of arithmetic can be expressed. In Python, the following operators are essential. (See the Python documentation for the full list of operators.)
Expression Type | Operator | Example | Value |
---|---|---|---|
Addition | + | 2 + 3 | 5 |
Subtraction | - | 2 - 3 | -1 |
Multiplication | * | 2 * 3 | 6 |
Division | / | 7 / 3 | 2.66667 |
Remainder | % | 7 % 3 | 1 |
Exponentiation | ** | 2 ** 0.5 | 1.41421 |
Python expressions obey the same familiar rules of precedence as in algebra: multiplication and division occur before addition and subtraction. Parentheses can be used to group together smaller expressions within a larger expression.
1 | 1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8 - 9 + 10 |
1 | 1 + 2 * (3 * 4 * 5 / 6) ** 3 + 7 + 8 - 9 + 10 |
This chapter introduces many types of expressions. Learning to program involves trying out everything you learn in combination, investigating the behavior of the computer. What happens if you divide by zero? What happens if you divide twice in a row? You don’t always need to ask an expert (or the Internet); many of these details can be discovered by trying them out yourself.
Names
Names are given to values in Python using an assignment statement. In an assignment, a name is followed by =, which is followed by any expression. The value of the expression to the right of = is assigned to the name. Once a name has a value assigned to it, the value will be substituted for that name in future expressions.
1 | a = 10 |
A previously assigned name can be used in the expression to the right of =.
1 | quarter = 1/4 |
However, only the current value of an expression is assigned to a name. If that value changes later, names that were defined in terms of that value will not change automatically.
1 | quarter = 4 |
Names must start with a letter, but can contain both letters and numbers. A name cannot contain a space; instead, it is common to use an underscore character _ to replace each space. Names are only as useful as you make them; it’s up to the programmer to choose names that are easy to interpret. Typically, more meaningful names can be invented than a and b. For example, to describe the sales tax on a $5 purchase in Berkeley, CA, the following names clarify the meaning of the various quantities involved.
1 | purchase_price = 5 |
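The cell above is abbreviated. A sketch of the full example might look like the following; the tax rates are assumptions chosen only to illustrate descriptive names.

```python
purchase_price = 5
state_tax_rate = 0.075     # assumed rate, for illustration only
county_tax_rate = 0.02     # assumed rate, for illustration only
sales_tax_rate = state_tax_rate + county_tax_rate
sales_tax = purchase_price * sales_tax_rate
sales_tax
```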
Example: Growth Rates
The relationship between two measurements of the same quantity taken at different times is often expressed as a growth rate. For example, the United States federal government employed 2,766,000 people in 2002 and 2,814,000 people in 2012. To compute a growth rate, we must first decide which value to treat as the initial amount. For values over time, the earlier value is a natural choice. Then, we divide the difference between the changed and initial amount by the initial amount.
1 | initial = 2766000 |
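A sketch of the computation described above, using the federal employment figures:

```python
initial = 2766000
changed = 2814000
(changed - initial) / initial   # about 0.0174, i.e. 1.74%
```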
It is also typical to subtract one from the ratio of the two measurements, which yields the same value.
1 | (changed/initial) - 1 |
This value is the growth rate over 10 years. A useful property of growth rates is that they don’t change even if the values are expressed in different units. So, for example, we can express the same relationship between thousands of people in 2002 and 2012.
1 | initial = 2766 |
In 10 years, the number of employees of the US Federal Government has increased by only 1.74%. In that time, the total expenditures of the US Federal Government increased from $2.37 trillion to $3.38 trillion in 2012.
1 | initial = 2.37 |
A 42.6% increase in the federal budget is much larger than the 1.74% increase in federal employees. In fact, the number of federal employees has grown much more slowly than the population of the United States, which increased 9.21% in the same time period from 287.6 million people in 2002 to 314.1 million in 2012.
1 | initial = 287.6 |
A growth rate can be negative, representing a decrease in some value. For example, the number of manufacturing jobs in the US decreased from 15.3 million in 2002 to 11.9 million in 2012, a -22.2% growth rate.
1 | initial = 15.3 |
An annual growth rate is a growth rate of some quantity over a single year. An annual growth rate of 0.035, accumulated each year for 10 years, gives a much larger ten-year growth rate of 0.41 (or 41%).
1 | 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 - 1 |
This same computation can be expressed using names and exponents.
1 | annual_growth_rate = 0.035 |
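A sketch of the same ten-year computation written with a name and an exponent:

```python
annual_growth_rate = 0.035
ten_year_growth_rate = (1 + annual_growth_rate) ** 10 - 1
ten_year_growth_rate   # about 0.41
```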
Likewise, a ten-year growth rate can be used to compute an equivalent annual growth rate. Below, t is the number of years that have passed between measurements. The following computes the annual growth rate of federal expenditures over the last 10 years.
1 | initial = 2.37 |
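A sketch of that computation for federal expenditures:

```python
initial = 2.37    # trillions of dollars in 2002
changed = 3.38    # trillions of dollars in 2012
t = 10            # years between the two measurements
(changed / initial) ** (1 / t) - 1   # about 0.036, i.e. 3.6% per year
```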
The total growth over 10 years is equivalent to a 3.6% increase each year.
In summary, a growth rate g is used to describe the relative size of an initial amount and a changed amount after some amount of time t. To compute changed, apply the growth rate g repeatedly, t times, using exponentiation.
1 | initial * (1 + g) ** t |
To compute g, raise the total growth to the power of 1/t and subtract one.
1 | (changed/initial) ** (1/t) - 1 |
Call Expressions
Call expressions invoke functions, which are named operations. The name of the function appears first, followed by expressions in parentheses.
1 | abs(-12) |
1 | round(5 - 1.3) |
1 | max(2, 2 + 3, 4) |
In this last example, the max function is called on three arguments: 2, 5, and 4. The value of each expression within parentheses is passed to the function, and the function returns the final value of the full call expression. The max function can take any number of arguments and returns the maximum.
A few functions are available by default, such as abs and round, but most functions that are built into the Python language are stored in a collection of functions called a module. An import statement is used to provide access to a module, such as math or operator.
1 | import math |
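The abbreviated cell above presumably calls a function from the math module; since the next cell gives (4 + 5) ** 0.5 as an equivalent expression, a plausible sketch is:

```python
import math
math.sqrt(4 + 5)   # 3.0
```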
An equivalent expression could be expressed using the + and ** operators instead.
1 | (4 + 5) ** 0.5 |
Operators and call expressions can be used together in an expression. The percent difference between two values is used to compare values for which neither one is obviously initial or changed. For example, in 2014 Florida farms produced 2.72 billion eggs while Iowa farms produced 16.25 billion eggs (http://quickstats.nass.usda.gov/). The percent difference is 100 times the absolute value of the difference between the values, divided by their average. In this case, the difference is larger than the average, and so the percent difference is greater than 100.
1 | florida = 2.72 |
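A sketch of the percent difference computation for the egg production figures:

```python
florida = 2.72
iowa = 16.25
100 * abs(florida - iowa) / ((florida + iowa) / 2)   # about 142.6
```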
Learning how different functions behave is an important part of learning a programming language. A Jupyter notebook can assist in remembering the names and effects of different functions. When editing a code cell, press the tab key after typing the beginning of a name to bring up a list of ways to complete that name. For example, press tab after math. to see all of the functions available in the math module. Typing will narrow down the list of options. To learn more about a function, place a ? after its name. For example, typing math.log? will bring up a description of the log function in the math module.
1 | math.log? |
log(x, [base=math.e])
Return the logarithm of x to the given base.
If the base not specified, returns the natural logarithm (base e) of x.
The square brackets in the example call indicate that an argument is optional. That is, log can be called with either one or two arguments.
1 | math.log(16, 2) |
1 | math.log(16)/math.log(2) |
The list of Python’s built-in functions is quite long and includes many functions that are never needed in data science applications. The list of mathematical functions in the math module is similarly long. This text will introduce the most important functions in context, rather than expecting the reader to memorize or understand these lists.
Introduction to Tables
We can now apply Python to analyze data. We will work with data stored in Table structures.
Tables are a fundamental way of representing data sets. A table can be viewed in two ways:
- (NoSQL) a sequence of named columns that each describe a single attribute of all entries in a data set, or
- (SQL) a sequence of rows that each contain all information about a single individual in a data set.
We will study tables in great detail in the next several chapters. For now, we will just introduce a few methods without going into technical details.
The table cones has been imported for us; later we will see how, but here we will just work with it. First, let’s take a look at it.
1 | import pandas as pd |
1 | # Show cones |
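The two abbreviated cells above presumably build and display the cones table. A sketch that reproduces the output below (the flavor 'bubblegem' is kept exactly as it appears in the data):

```python
import pandas as pd

cones = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate',
               'strawberry', 'chocolate', 'bubblegem'],
    'Color': ['pink', 'light brown', 'dark brown',
              'pink', 'dark brown', 'pink'],
    'Price': [3.55, 4.75, 5.25, 5.25, 5.25, 4.75],
})
cones
```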
Flavor | Color | Price | |
---|---|---|---|
0 | strawberry | pink | 3.55 |
1 | chocolate | light brown | 4.75 |
2 | chocolate | dark brown | 5.25 |
3 | strawberry | pink | 5.25 |
4 | chocolate | dark brown | 5.25 |
5 | bubblegem | pink | 4.75 |
The table has six rows. Each row corresponds to one ice cream cone. The ice cream cones are the individuals.
Each cone has three attributes: flavor, color, and price. Each column contains the data on one of these attributes, and so all the entries of any single column are of the same kind. Each column has a label. We will refer to columns by their labels.
A table method is just like a function, but it must operate on a table. So the call looks like
name_of_table.method(arguments)
For example, if you want to see just the first two rows of a table, you can use the table method head.
1 | # Show first two rows |
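A sketch of the abbreviated cell, assuming the pandas head method:

```python
cones.head(2)   # just the first two rows
```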
Flavor | Color | Price | |
---|---|---|---|
0 | strawberry | pink | 3.55 |
1 | chocolate | light brown | 4.75 |
… (4 rows omitted)
You can replace 2 by any number of rows. If you ask for more than six, you will only get six, because cones only has six rows.
Choosing Sets of Columns
The method loc creates a new table consisting of only the specified columns.
1 | # Show Flavor |
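A sketch of the column selection, assuming loc with a list of column labels:

```python
# A new table containing only the Flavor column; cones itself is unchanged.
cones.loc[:, ['Flavor']]
```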
Flavor | |
---|---|
0 | strawberry |
1 | chocolate |
2 | chocolate |
3 | strawberry |
4 | chocolate |
5 | bubblegem |
This leaves the original table unchanged.
1 | # Show cones |
Flavor | Color | Price | |
---|---|---|---|
0 | strawberry | pink | 3.55 |
1 | chocolate | light brown | 4.75 |
2 | chocolate | dark brown | 5.25 |
3 | strawberry | pink | 5.25 |
4 | chocolate | dark brown | 5.25 |
5 | bubblegem | pink | 4.75 |
You can use loc to select more than one column, by separating the column labels with commas.
1 | # Show Flavor, Price |
Flavor | Price | |
---|---|---|
0 | strawberry | 3.55 |
1 | chocolate | 4.75 |
2 | chocolate | 5.25 |
3 | strawberry | 5.25 |
4 | chocolate | 5.25 |
5 | bubblegem | 4.75 |
You can also drop columns you don’t want. The table above can be created by dropping the Color column.
1 | # Drop Color |
Flavor | Price | |
---|---|---|
0 | strawberry | 3.55 |
1 | chocolate | 4.75 |
2 | chocolate | 5.25 |
3 | strawberry | 5.25 |
4 | chocolate | 5.25 |
5 | bubblegem | 4.75 |
You can name this new table and look at it again by just typing its name.
1 | # Redefine cones |
Flavor | Price | |
---|---|---|
0 | strawberry | 3.55 |
1 | chocolate | 4.75 |
2 | chocolate | 5.25 |
3 | strawberry | 5.25 |
4 | chocolate | 5.25 |
5 | bubblegem | 4.75 |
Like loc, the drop method creates a smaller table and leaves the original table unchanged. In order to explore your data, you can create any number of smaller tables by choosing or dropping columns. It will do no harm to your original data table.
Sorting Rows
The sort_values method creates a new table by arranging the rows of the original table in ascending order of the values in the specified column. Here the cones table has been sorted in ascending order of the price of the cones.
1 | # Sort Price |
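A sketch of the sorting cell, assuming the pandas sort_values method:

```python
# A new table whose rows are arranged in ascending order of Price.
cones.sort_values('Price')
```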
Flavor | Color | Price | |
---|---|---|---|
0 | strawberry | pink | 3.55 |
1 | chocolate | light brown | 4.75 |
5 | bubblegem | pink | 4.75 |
2 | chocolate | dark brown | 5.25 |
3 | strawberry | pink | 5.25 |
4 | chocolate | dark brown | 5.25 |
To sort in descending order, you can use an optional argument to sort_values. As the name implies, optional arguments don’t have to be used, but they can be used if you want to change the default behavior of a method.
By default, sort_values sorts in increasing order of the values in the specified column. To sort in decreasing order, use the optional argument ascending=False.
1 | # Sort Price Descending |
Flavor | Color | Price | |
---|---|---|---|
2 | chocolate | dark brown | 5.25 |
3 | strawberry | pink | 5.25 |
4 | chocolate | dark brown | 5.25 |
1 | chocolate | light brown | 4.75 |
5 | bubblegem | pink | 4.75 |
0 | strawberry | pink | 3.55 |
Like loc and drop, the sort_values method leaves the original table unchanged.
Selecting Rows that Satisfy a Condition
The where method creates a new table based on the rows that satisfy a given condition. In this section we will work with a very simple condition, which is that the value in a specified column must be equal to a value that we also specify. In pandas, such a condition is written as a boolean series, for example cones['Flavor'] == 'chocolate'.
The code in the cell below uses where to pick out the chocolate cones; rows that do not match the condition are filled with NaN.
1 | # Making boolean series for a cones |
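A sketch of the abbreviated cell above, building a boolean series and passing it to where:

```python
# True for chocolate cones, False otherwise.
is_chocolate = cones['Flavor'] == 'chocolate'

# where keeps every row but fills non-matching rows with NaN.
cones.where(is_chocolate)
```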
Flavor | Color | Price | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | chocolate | light brown | 4.75 |
2 | chocolate | dark brown | 5.25 |
3 | NaN | NaN | NaN |
4 | chocolate | dark brown | 5.25 |
5 | NaN | NaN | NaN |
Or, equivalently, we can index the table with the boolean series to keep only the matching rows:
1 | # Making boolean series for a cones |
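A sketch of the second version, which uses boolean indexing to drop the non-matching rows entirely:

```python
# Only the rows where the condition is True are kept.
cones[cones['Flavor'] == 'chocolate']
```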
Flavor | Color | Price | |
---|---|---|---|
1 | chocolate | light brown | 4.75 |
2 | chocolate | dark brown | 5.25 |
4 | chocolate | dark brown | 5.25 |
In the first version, where keeps every row but replaces the values in rows that do not satisfy the condition with NaN; in the second, indexing the table with the boolean series drops the non-matching rows entirely. These methods can also be used when the condition that the rows must satisfy is more complicated. In those situations the call will be a little more complicated as well.
It is important to provide the value exactly. For example, if we specify Chocolate instead of chocolate, then where correctly finds no rows where the flavor is Chocolate.
1 | # Making boolean series for a cones |
Flavor | Color | Price |
---|---|---|
1 | # Making boolean series for a cones |
Flavor | Color | Price | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | NaN | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | NaN | NaN |
4 | NaN | NaN | NaN |
5 | NaN | NaN | NaN |
Like all the other table methods in this section, where leaves the original table unchanged.
Example: Salaries in the NBA
“The NBA is the highest paying professional sports league in the world,” reported CNN in March 2016. The table nba contains the salaries of all National Basketball Association players in 2015-2016.
Each row represents one player. The columns are:
Column Label | Description |
---|---|
PLAYER | Player's name |
POSITION | Player's position on team |
TEAM | Team name |
SALARY | Player's salary in 2015-2016, in millions of dollars |
The code for the positions is PG (Point Guard), SG (Shooting Guard), PF (Power Forward), SF (Small Forward), and C (Center). But what follows doesn’t involve details about how basketball is played.
The first row shows that Paul Millsap, Power Forward for the Atlanta Hawks, had a salary of almost $18.7 million in 2015-2016.
1 | import pandas as pd |
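The abbreviated cell above presumably loads the salary data into a DataFrame named nba; the file name in the sketch below is an assumption, and the column layout is taken from the output that follows.

```python
import pandas as pd

# Load the 2015-2016 NBA salary data (file name assumed for illustration).
nba = pd.read_csv('nba_salaries.csv')
nba
```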
RANK | PLAYER | POSITION | TEAM | SALARY ($M) | |
---|---|---|---|---|---|
0 | 1 | Kobe Bryant | SF | Los Angeles Lakers | 25.000000 |
1 | 2 | Joe Johnson | SF | Brooklyn Nets | 24.894863 |
2 | 3 | LeBron James | SF | Cleveland Cavaliers | 22.970500 |
3 | 4 | Carmelo Anthony | SF | New York Knicks | 22.875000 |
4 | 5 | Dwight Howard | C | Houston Rockets | 22.359364 |
... | ... | ... | ... | ... | ... |
412 | 413 | Elliot Williams | SG | Memphis Grizzlies | 0.055722 |
413 | 414 | Phil Pressey | PG | Phoenix Suns | 0.055722 |
414 | 415 | Jordan McRae | SG | Phoenix Suns | 0.049709 |
415 | 416 | Cory Jefferson | PF | Phoenix Suns | 0.049709 |
416 | 417 | Thanasis Antetokounmpo | SF | New York Knicks | 0.030888 |
417 rows × 5 columns
By default, pandas displays only a limited number of rows of a large table, showing the first and last few. You can use head to display more or fewer rows from the start of the table. To display the entire table, set pd.set_option("display.max_rows", None) and then evaluate the table directly.
1 | # Get default max_rows to show |
1 | nba |
Fans of Stephen Curry can find his row by using loc.
1 | # Filter "Stephen Curry" |
RANK | PLAYER | POSITION | TEAM | SALARY ($M) | |
---|---|---|---|---|---|
56 | 57 | Stephen Curry | PG | Golden State Warriors | 11.370786 |
We can also create a new table called warriors consisting of just the data for the Golden State Warriors.
1 | # Filter "Golden State Warriors" |
RANK | PLAYER | POSITION | TEAM | SALARY ($M) | |
---|---|---|---|---|---|
27 | 28 | Klay Thompson | SG | Golden State Warriors | 15.501000 |
33 | 34 | Draymond Green | PF | Golden State Warriors | 14.260870 |
37 | 38 | Andrew Bogut | C | Golden State Warriors | 13.800000 |
55 | 56 | Andre Iguodala | SF | Golden State Warriors | 11.710456 |
56 | 57 | Stephen Curry | PG | Golden State Warriors | 11.370786 |
100 | 101 | Jason Thompson | PF | Golden State Warriors | 7.008475 |
127 | 128 | Shaun Livingston | PG | Golden State Warriors | 5.543725 |
177 | 178 | Harrison Barnes | SF | Golden State Warriors | 3.873398 |
178 | 179 | Marreese Speights | C | Golden State Warriors | 3.815000 |
236 | 237 | Leandro Barbosa | SG | Golden State Warriors | 2.500000 |
267 | 268 | Festus Ezeli | C | Golden State Warriors | 2.008748 |
312 | 313 | Brandon Rush | SF | Golden State Warriors | 1.270964 |
335 | 336 | Kevon Looney | SF | Golden State Warriors | 1.131960 |
402 | 403 | Anderson Varejao | PF | Golden State Warriors | 0.289755 |
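The two selections above can be written as boolean conditions on the columns; a sketch, assuming nba is the table shown:

```python
# The row for Stephen Curry, via boolean selection (loc works the same way)
nba[nba['PLAYER'] == 'Stephen Curry']
nba.loc[nba['PLAYER'] == 'Stephen Curry']

# All players on the Golden State Warriors
warriors = nba[nba['TEAM'] == 'Golden State Warriors']
warriors
```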
The nba table is ordered by rank, with the highest salary first. To look at the other end of the scale, it is useful to sort the data by salary. Remember that by default, the sorting is in increasing order.
1 | # Sort nba based on SALARY, ascending=True |
RANK | PLAYER | POSITION | TEAM | SALARY ($M) | |
---|---|---|---|---|---|
416 | 417 | Thanasis Antetokounmpo | SF | New York Knicks | 0.030888 |
415 | 416 | Cory Jefferson | PF | Phoenix Suns | 0.049709 |
414 | 415 | Jordan McRae | SG | Phoenix Suns | 0.049709 |
411 | 412 | Orlando Johnson | SG | Phoenix Suns | 0.055722 |
413 | 414 | Phil Pressey | PG | Phoenix Suns | 0.055722 |
... | ... | ... | ... | ... | ... |
4 | 5 | Dwight Howard | C | Houston Rockets | 22.359364 |
3 | 4 | Carmelo Anthony | SF | New York Knicks | 22.875000 |
2 | 3 | LeBron James | SF | Cleveland Cavaliers | 22.970500 |
1 | 2 | Joe Johnson | SF | Brooklyn Nets | 24.894863 |
0 | 1 | Kobe Bryant | SF | Los Angeles Lakers | 25.000000 |
417 rows × 5 columns
These figures are somewhat difficult to compare as some of these players changed teams during the season and received salaries from more than one team; only the salary from the last team appears in the table.
The CNN report is about the other end of the salary scale – the players who are among the highest paid in the world. To identify these players we can sort in descending order of salary and look at the top few rows.
1 | # Sort nba based on SALARY, ascending=False |
RANK | PLAYER | POSITION | TEAM | SALARY ($M) | |
---|---|---|---|---|---|
0 | 1 | Kobe Bryant | SF | Los Angeles Lakers | 25.000000 |
1 | 2 | Joe Johnson | SF | Brooklyn Nets | 24.894863 |
2 | 3 | LeBron James | SF | Cleveland Cavaliers | 22.970500 |
3 | 4 | Carmelo Anthony | SF | New York Knicks | 22.875000 |
4 | 5 | Dwight Howard | C | Houston Rockets | 22.359364 |
... | ... | ... | ... | ... | ... |
412 | 413 | Elliot Williams | SG | Memphis Grizzlies | 0.055722 |
413 | 414 | Phil Pressey | PG | Phoenix Suns | 0.055722 |
414 | 415 | Jordan McRae | SG | Phoenix Suns | 0.049709 |
415 | 416 | Cory Jefferson | PF | Phoenix Suns | 0.049709 |
416 | 417 | Thanasis Antetokounmpo | SF | New York Knicks | 0.030888 |
417 rows × 5 columns
Kobe Bryant, since retired, was the highest earning NBA player in 2015-2016.
R.I.P Kobe.
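A short sketch of the two sorts above (the column label 'SALARY ($M)' is taken from the displayed table):

```python
# Lowest salaries first (ascending order is the default)
nba.sort_values('SALARY ($M)')

# Highest salaries first
nba.sort_values('SALARY ($M)', ascending=False)
```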
Data Types
Every value has a type, and the built-in type function returns the type of the result of any expression.
One type we have encountered already is a built-in function. Python indicates that the type is a builtin_function_or_method; the distinction between a function and a method is not important at this stage.
1 | type(abs) |
This chapter will explore many useful types of data.
Numbers
Computers are designed to perform numerical calculations, but there are some important details about working with numbers that every programmer working with quantitative data should know. Python (and most other programming languages) distinguishes between two different types of numbers:
- Integers are called int values in the Python language. They can represent only whole numbers (negative, zero, or positive) that don’t have a fractional component.
- Real numbers are called float values (or floating point values) in the Python language. They can represent whole or fractional numbers, but have some limitations.
The type of a number is evident from the way it is displayed: int values have no decimal point, and float values always have a decimal point.
1 | # Some int values |
1 | 1 + 3 |
1 | -1234567890000000000 |
1 | # Some float values |
1 | 3.0 |
When a float value is combined with an int value using some arithmetic operator, the result is always a float value. In most cases, two integers combine to form another integer, but any number (int or float) divided by another will be a float value. Very large or very small float values are displayed using scientific notation.
1 | 1.5 + 2 |
1 | 3 / 1 |
1 | -12345678900000000000.0 |
The type function can be used to find the type of any number.
1 | type(3) |
1 | type(3 / 1) |
The type of an expression is the type of its final value. So, the type function will never indicate that the type of an expression is a name, because names are always evaluated to their assigned values.
1 | x = 3 |
1 | type(x + 2.5) |
More About Float Values
Float values are very flexible, but they do have limits.
- A float can represent extremely large and extremely small numbers. There are limits, but you will rarely encounter them.
- A float only represents 15 or 16 significant digits for any number; the remaining precision is lost. This limited precision is enough for the vast majority of applications.
- After combining float values with arithmetic, the last few digits may be incorrect. Small rounding errors are often confusing when first encountered.
The first limit can be observed in two ways. If the result of a computation is a very large number, then it is represented as infinite. If the result is a very small number, then it is represented as zero.
1 | 2e306 * 10 |
1 | 2e306 * 100 |
1 | 2e-322 / 10 |
1 | 2e-322 / 100 |
The second limit can be observed by an expression that involves numbers with more than 15 significant digits. These extra digits are discarded before any arithmetic is carried out.
1 | 0.6666666666666666 - 0.6666666666666666123456789 |
The third limit can be observed when taking the difference between two expressions that should be equivalent. For example, the expression 2 ** 0.5 computes the square root of 2, but squaring this value does not exactly recover 2.
1 | 2 ** 0.5 |
1 | (2 ** 0.5) * (2 ** 0.5) |
1 | (2 ** 0.5) * (2 ** 0.5) - 2 |
The final result above is 0.0000000000000004440892098500626, a number that is very close to zero. The correct answer to this arithmetic expression is 0, but a small error in the final significant digit appears very different in scientific notation. This behavior appears in almost all programming languages because it is the result of the standard way that arithmetic is carried out on computers.
Although float values are not always exact, they are certainly reliable and work the same way across all different kinds of computers and programming languages.
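The same limited precision shows up with other simple fractions. A small sketch; math.isclose is shown here as one common way to compare floats within a tolerance rather than exactly:

```python
import math

0.1 + 0.2 == 0.3      # False: the stored sum is 0.30000000000000004
0.1 + 0.2             # 0.30000000000000004

# Compare within a tolerance instead of demanding exact equality
math.isclose(0.1 + 0.2, 0.3)                  # True
math.isclose((2 ** 0.5) * (2 ** 0.5), 2)      # True
```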
Strings
Much of the world’s data is text, and a piece of text represented in a computer is called a string. A string can represent a word, a sentence, or even the contents of every book in a library. Since text can include numbers (like this: 5) or truth values (True), a string can also describe those things.
The meaning of an expression depends both upon its structure and the types of values that are being combined. So, for instance, adding two strings together produces another string. This expression is still an addition expression, but it is combining a different type of value.
1 | "data" + "science" |
Addition is completely literal; it combines these two strings together without regard for their contents. It doesn’t add a space because these are different words; that’s up to the programmer (you) to specify.
1 | "data" + " " + "science" |
Single and double quotes can both be used to create strings: 'hi'
and "hi"
are identical expressions. Double quotes are often preferred because they allow you to include apostrophes inside of strings.
1 | "This won't work with a single-quoted string!" |
Why not? Try it out.
The str
function returns a string representation of any value. Using this function, strings can be constructed that have embedded values.
1 | "That's " + str(1 + 1) + ' ' + str(True) |
String Methods
From an existing string, related strings can be constructed using string methods, which are functions that operate on strings. These methods are called by placing a dot after the string, then calling the function.
For example, the following method generates an uppercased version of a string.
1 | "loud".upper() |
Perhaps the most important method is replace
, which replaces all instances of a substring within the string. The replace method takes two arguments, the text to be replaced and its replacement.
1 | 'hitchhiker'.replace('hi', 'ma') |
String methods can also be invoked using variable names, as long as those names are bound to strings. So, for instance, the following two-step process generates the word “degrade” starting from “train” by first creating “ingrain” and then applying a second replacement.
1 | s = "train" |
Note that the line t = s.replace(‘t’, ‘ing’) doesn’t change the string s, which is still “train”. The method call s.replace(‘t’, ‘ing’) just has a value, which is the string “ingrain”.
1 | s |
This is the first time we’ve seen methods, but methods are not unique to strings. As we will see shortly, other types of objects can have them.
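Here is the two-step replacement written out, showing that the original string is left unchanged and that method calls can be chained:

```python
s = "train"
t = s.replace('t', 'ing')    # 'ingrain'
u = t.replace('in', 'de')    # 'degrade'

s, t, u                      # ('train', 'ingrain', 'degrade'); s is unchanged

# The same result with chained method calls
"train".replace('t', 'ing').replace('in', 'de')
```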
Comparisons
Boolean values most often arise from comparison operators. Python includes a variety of operators that compare values. For example, 3 is larger than 1 + 1.
1 | 3 > 1 + 1 |
The value True indicates that the comparison is valid; Python has confirmed this simple fact about the relationship between 3 and 1+1. The full set of common comparison operators are listed below.
Comparison | Operator | True example | False Example |
---|---|---|---|
Less than | < | 2 < 3 | 2 < 2 |
Greater than | > | 3>2 | 3>3 |
Less than or equal | <= | 2 <= 2 | 3 <= 2 |
Greater or equal | >= | 3 >= 3 | 2 >= 3 |
Equal | == | 3 == 3 | 3 == 2 |
Not equal | != | 3 != 2 | 2 != 2 |
An expression can contain multiple comparisons, and they all must hold in order for the whole expression to be True. For example, we can express that 1+1 is between 1 and 3 using the following expression.
1 | 1 < 1 + 1 < 3 |
The average of two numbers is always between the smaller number and the larger number. We express this relationship for the numbers x and y below. You can try different values of x and y to confirm this relationship.
1 | x = 12 |
Strings can also be compared, and their order is alphabetical. A shorter string is less than a longer string that begins with the shorter string.
1 | "Dog" > "Catastrophe" > "Cat" |
Sequences
Values can be grouped together into collections, which allows programmers to organize those values and refer to all of them with a single name. By grouping values together, we can write code that performs a computation on many pieces of data at once.
Calling the function np.array
on several values places them into an array, which is a kind of sequential collection. Below, we collect four different temperatures into an array called highs. These are the estimated average daily high temperatures over all land on Earth (in degrees Celsius) for the decades surrounding 1850, 1900, 1950, and 2000, respectively, expressed as deviations from the average absolute high temperature between 1951 and 1980, which was 14.48 degrees.
1 | import numpy as np |
Collections allow us to pass multiple values into a function using a single name. For instance, the sum
function computes the sum of all values in a collection, and the len
function computes its length. (That’s the number of values we put in it.) Using them together, we can compute the average of a collection.
1 | sum(highs)/len(highs) |
The complete chart of daily high and low temperatures appears below.
[Charts: mean of daily high temperature and mean of daily low temperature]
Arrays
While there are many kinds of collections in Python, we will work primarily with arrays in this class. We’ve already seen that the np.array
function can be used to create arrays of numbers.
Arrays can also contain strings or other types of values, but a single array can only contain a single kind of data. (It usually doesn’t make sense to group together unlike data anyway.) For example:
1 | import numpy as np |
Returning to the temperature data, we create arrays of average daily high temperatures for the decades surrounding 1850, 1900, 1950, and 2000.
1 | baseline_high = 14.48 |
Arrays can be used in arithmetic expressions to compute over their contents. When an array is combined with a single number, that number is combined with each element of the array. Therefore, we can convert all of these temperatures to Fahrenheit by writing the familiar conversion formula.
1 | (9/5) * highs + 32 |
Arrays also have methods, which are functions that operate on the array values. The mean
of a collection of numbers is its average value: the sum divided by the length. Each pair of parentheses in the examples below is part of a call expression; it’s calling a function with no arguments to perform a computation on the array called highs.
1 | highs.size |
1 | highs.sum() |
1 | highs.mean() |
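A runnable sketch of the array operations above. The baseline of 14.48 degrees comes from the text; the four deviation values are illustrative and may differ slightly from those used in the original notebook.

```python
import numpy as np

# Average daily highs for the decades around 1850, 1900, 1950, and 2000,
# expressed relative to the 1951-1980 baseline of 14.48 degrees Celsius
baseline_high = 14.48
highs = baseline_high + np.array([-0.880, -0.093, 0.105, 0.684])

sum(highs) / len(highs)    # average using the built-in functions
(9 / 5) * highs + 32       # elementwise conversion to Fahrenheit

highs.size                 # number of elements
highs.sum()                # total of all elements
highs.mean()               # average using the array method
```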
Functions on Arrays
The numpy package, abbreviated np in programs, provides Python programmers with convenient and powerful functions for creating and manipulating arrays.
1 | import numpy as np |
For example, the diff function computes the difference between each adjacent pair of elements in an array. The first element of the result is the second element of the original array minus the first.
1 | np.diff(highs) |
1 | np.diff(np.array([4, 3, 2, 1])) |
The full NumPy reference lists these functions exhaustively, but only a small subset is used commonly for data processing applications. These are grouped into different modules within np. Learning this vocabulary is an important part of learning the Python language, so refer back to this list often as you work through examples and problems.
However, you don’t need to memorize these. Use this as a reference.
Each of these functions takes an array as an argument and returns a single value.
Function | Description |
---|---|
np.prod |
Multiply all elements together |
np.sum |
Add all elements together |
np.all |
Test whether all elements are true values (non-zero numbers are true) |
np.any |
Test whether any elements are true values (non-zero numbers are true) |
np.count_nonzero |
Count the number of non-zero elements |
Each of these functions takes an array as an argument and returns an array of values.
Function | Description |
---|---|
np.diff |
Difference between adjacent elements |
np.round |
Round each number to the nearest integer (whole number) |
np.cumprod |
A cumulative product: for each element, multiply all elements so far |
np.cumsum |
A cumulative sum: for each element, add all elements so far |
np.exp |
Exponentiate each element |
np.log |
Take the natural logarithm of each element |
np.sqrt |
Take the square root of each element |
np.sort |
Sort the elements |
Each of these functions takes an array of strings and returns an array.
Function | Description |
---|---|
np.char.lower |
Lowercase each element |
np.char.upper |
Uppercase each element |
np.char.strip |
Remove spaces at the beginning or end of each element |
np.char.isalpha |
Whether each element is only letters (no numbers or symbols) |
np.char.isnumeric |
Whether each element is only numeric (no letters) |
Each of these functions takes both an array of strings and a search string; each returns an array.
Function | Description |
---|---|
np.char.count |
Count the number of times a search string appears among the elements of an array |
np.char.find |
The position within each element that a search string is found first |
np.char.rfind |
The position within each element that a search string is found last |
np.char.startswith |
Whether each element starts with the search string |
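A few of these functions in action on small illustrative arrays:

```python
import numpy as np

a = np.array([3, 1, 4, 1, 5])
np.sum(a)        # 14
np.prod(a)       # 60
np.diff(a)       # array([-2,  3, -3,  4])
np.cumsum(a)     # array([ 3,  4,  8,  9, 14])
np.sort(a)       # array([1, 1, 3, 4, 5])

words = np.array(['Data', ' science ', 'rocks'])
np.char.upper(words)              # array(['DATA', ' SCIENCE ', 'ROCKS'], dtype='<U9')
np.char.strip(words)              # removes the surrounding spaces
np.char.startswith(words, 'Da')   # array([ True, False, False])
```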
Ranges
A range is an array of numbers in increasing or decreasing order, each separated by a regular interval. Ranges are useful in a surprisingly large number of situations, so it’s worthwhile to learn about them.
Ranges are defined using the np.arange function, which takes one, two, or three arguments: a start, a stop, and a step.
If you pass one argument to np.arange, it becomes the stop value, with start=0 and step=1 assumed. Two arguments give the start and stop, with step=1 assumed. Three arguments give the start, stop, and step explicitly.
A range always includes its start value, but does not include its end value. It counts up by step, and it stops before it gets to the end.
np.arange(stop): An array of increasing consecutive integers starting at 0 and stopping before stop.
1 | import numpy as np |
Notice how the array starts at 0 and goes only up to 4, not to the end value of 5.
np.arange(start, stop): An array of consecutive increasing integers from start, stopping before stop.
- The following two examples are the same.
1 | np.arange(3, 9) |
np.arange(start, stop, step): A range with a difference of step between each pair of consecutive values, starting from start and stopping before stop.
- The following two examples are the same.
1 | np.arange(3, 30, 5) |
This array starts at 3, then takes a step of 5 to get to 8, then another step of 5 to get to 13, and so on.
When you specify a step, the start, stop, and step can all be either positive or negative, and may be whole numbers or fractions.
- The following two examples are the same.
1 | np.arange(1.5, -2, -0.5) |
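Putting the three call forms side by side:

```python
import numpy as np

np.arange(5)              # array([0, 1, 2, 3, 4])           stop only
np.arange(3, 9)           # array([3, 4, 5, 6, 7, 8])        start and stop
np.arange(3, 30, 5)       # array([ 3,  8, 13, 18, 23, 28])  start, stop, and step
np.arange(1.5, -2, -0.5)  # counts down from 1.5 to -1.5 in steps of -0.5
```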
Example: Leibniz’s formula for $\pi$
The great German mathematician and philosopher Gottfried Wilhelm Leibniz (1646 - 1716) discovered a wonderful formula for $\pi$ as an infinite sum of simple fractions. The formula is
$$\pi = 4 \cdot \left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \frac{1}{11} + \dots\right)$$
Though some math is needed to establish this, we can use arrays to convince ourselves that the formula works. Let’s calculate the first 5000 terms of Leibniz’s infinite sum and see if it is close to $\pi$.
$$4 \cdot \left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \frac{1}{9} - \frac{1}{11} + \dots - \frac{1}{9999} \right)$$
We will calculate this finite sum by adding all the positive terms first and then subtracting the sum of all the negative terms[^1] :
[^1]: Surprisingly, when we add infinitely many fractions, the order can matter! But our approximation to $\pi$ uses only a large finite number of fractions, so it’s okay to add the terms in any convenient order.
$$4 \cdot \left( \left(1 + \frac{1}{5} + \frac{1}{9} + \dots + \frac{1}{9997} \right) - \left(\frac{1}{3} + \frac{1}{7} + \frac{1}{11} + \dots + \frac{1}{9999} \right) \right)$$
The positive terms in the sum have 1, 5, 9, and so on in the denominators. The array by_four_to_20
contains these numbers up to 17:
1 | by_four_to_20 = np.arange(1, 20, 4) |
To get an accurate approximation to $\pi$, we’ll use the much longer array positive_term_denominators.
1 | positive_term_denominators = np.arange(1, 10000, 4) |
The positive terms we actually want to add together are just 1 over these denominators:
1 | positive_terms = 1 / positive_term_denominators |
The negative terms have 3, 7, 11, and so on in their denominators. This array is just 2 added to positive_term_denominators.
1 | negative_terms = 1 / (positive_term_denominators + 2) |
The overall sum is
1 | 4 * ( sum(positive_terms) - sum(negative_terms) ) |
This is very close to $\pi=3.14159…$. Leibniz’s formula is looking good!
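The whole calculation fits in a few lines; this sketch presumably mirrors the truncated cells above:

```python
import numpy as np

# Denominators of the positive terms: 1, 5, 9, ..., 9997
positive_term_denominators = np.arange(1, 10000, 4)
positive_terms = 1 / positive_term_denominators

# Denominators of the negative terms: 3, 7, 11, ..., 9999
negative_terms = 1 / (positive_term_denominators + 2)

4 * (sum(positive_terms) - sum(negative_terms))   # roughly 3.1414
```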
More on Arrays
It’s often necessary to compute something that involves data from more than one array. If two arrays are of the same size, Python makes it easy to do calculations involving both arrays.
For our first example, we return once more to the temperature data. This time, we create arrays of average daily high and low temperatures for the decades surrounding 1850, 1900, 1950, and 2000.
1 | import numpy as np |
1 | baseline_low = 3.00 |
Suppose we’d like to compute the average daily range of temperatures for each decade. That is, we want to subtract the average daily high in the 1850s from the average daily low in the 1850s, and the same for each other decade.
We could write this laboriously using .item:
1 | np.array([ |
As when we converted an array of temperatures from Celsius to Fahrenheit, Python provides a much cleaner way to write this:
1 | highs - lows |
What we’ve seen in these examples are special cases of a general feature of arrays.
Elementwise arithmetic on pairs of numerical arrays
If an arithmetic operator acts on two arrays of the same size, then the operation is performed on each corresponding pair of elements in the two arrays. The final result is an array.
For example, if array1
and array2
have the same number of elements, then the value of array1 * array2
is an array. Its first element is the first element of array1
times the first element of array2
, its second element is the second element of array1
times the second element of array2
, and so on.
Example: Wallis’ Formula for $\pi$
The number $\pi$ is important in many different areas of math. Centuries before computers were invented, mathematicians worked on finding simple ways to approximate the numerical value of $\pi$. We have already seen Leibniz’s formula for $\pi$. About half a century before Leibniz, the English mathematician John Wallis (1616-1703) also expressed $\pi$ in terms of simple fractions, as an infinite product.
$$\pi = 2 \cdot \left( \frac{2}{1}\cdot\frac{2}{3}\cdot\frac{4}{3}\cdot\frac{4}{5}\cdot\frac{6}{5}\cdot\frac{6}{7}\dots \right)$$
This is a product of “even/odd” fractions. Let’s use arrays to multiply a million of them, and see if the product is close to $\pi$.
Remember that multiplication can be done in any order[^2], so we can readjust our calculation to:
[^2]: As we saw in the example about Leibniz’s formula, when we add infinitely many fractions, the order can matter. The same is true with multiplying fractions, as we are doing here. But our approximation to $\pi$ uses only a large finite number of fractions, so it’s okay to multiply the terms in any convenient order.
$$\pi \approx 2 \cdot \left( \frac{2}{1} \cdot \frac{4}{3} \cdot \frac{6}{5} \cdots \frac{1,000,000}{999,999} \right) \cdot \left( \frac{2}{3} \cdot \frac{4}{5} \cdot \frac{6}{7} \cdots \frac{1,000,000}{1,000,001} \right)$$
We’re now ready to do the calculation. We start by creating an array of even numbers 2, 4, 6, and so on, up to 1,000,000. Then we create two arrays of odd numbers: 1, 3, 5, ..., up to 999,999, and 3, 5, 7, ..., up to 1,000,001.
1 | even = np.arange(2, 1000001, 2) |
Remember that np.prod
multiplies all the elements of an array together. Now we can calculate Wallis’ product, to a good approximation.
1 | 2 * np.prod(even/one_below_even) * np.prod(even/one_above_even) |
That’s $\pi$ correct to five decimal places. Wallis clearly came up with a great formula.
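A runnable sketch of the calculation (the truncated cells above presumably contain the same steps):

```python
import numpy as np

even = np.arange(2, 1000001, 2)   # 2, 4, ..., 1,000,000
one_below_even = even - 1         # 1, 3, ..., 999,999
one_above_even = even + 1         # 3, 5, ..., 1,000,001

# Wallis' product, split into the two groups of fractions
2 * np.prod(even / one_below_even) * np.prod(even / one_above_even)
# roughly 3.14159, i.e. pi to about five decimal places
```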
Tables
Tables are a fundamental object type for representing data sets. A table can be viewed in two ways:
- a sequence of named columns that each describe a single aspect of all entries in a data set, or
- a sequence of rows that each contain all information about a single entry in a data set.
In order to use tables, import the pandas module, conventionally abbreviated pd.
1 | import pandas as pd |
Empty tables can be created using the pd.DataFrame function. An empty table is useful because it can be extended to contain new rows and columns.
1 | pd.DataFrame() |
The pd.DataFrame function constructs a new table with labeled columns. Each column of a table is an array (or list). To add one new column to an existing table, assign an array to df[label], where label is the new column's name.
Below, each example builds a table from scratch.
- The data argument supplies the rows of the table, and columns gives the column labels.
1 | pd.DataFrame(columns=['Number of petals'], data=np.array([8, 34, 5])) |
Number of petals | |
---|---|
0 | 8 |
1 | 34 |
2 | 5 |
To add two (or more) new columns at once, provide a dictionary that maps each column label to its array. All columns must have the same length, or an error will occur.
1 | pd.DataFrame({'Number of petals': np.array([8, 34, 5]), |
Number of petals | Name | |
---|---|---|
0 | 8 | lotus |
1 | 34 | sunflower |
2 | 5 | rose |
We can give this table a name, and then extend the table with another column by assigning an array to a new label.
1 | flowers = pd.DataFrame({'Number of petals': np.array([8, 34, 5]), |
Number of petals | Name | Color | |
---|---|---|---|
0 | 8 | lotus | pink |
1 | 34 | sunflower | yellow |
2 | 5 | rose | red |
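A compact sketch of the construction above, using the same values as the displayed tables:

```python
import numpy as np
import pandas as pd

# Build a table from a dictionary of labeled columns
flowers = pd.DataFrame({
    'Number of petals': np.array([8, 34, 5]),
    'Name': np.array(['lotus', 'sunflower', 'rose']),
})

# Extend it with one more column by assigning an array to a new label
flowers['Color'] = np.array(['pink', 'yellow', 'red'])
flowers
```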
Creating tables in this way involves a lot of typing. If the data have already been entered somewhere, it is usually possible to use Python to read it into a table, instead of typing it all in cell by cell.
Often, tables are created from files that contain comma-separated values. Such files are called CSV files.
Below, we use the pandas function read_csv to read a CSV file that contains some of the data used by Minard in his graphic about Napoleon’s Russian campaign. The data are placed in a table named minard.
1 | # Read CSV |
Join the tables minard_troops and minard_cities on the keys long and lat.
1 | # Join minard_troops and minard_cities |
long | lat | survivors | direction | group | city | |
---|---|---|---|---|---|---|
0 | 24.0 | 54.9 | 340000 | A | 1 | NaN |
1 | 24.5 | 55.0 | 340000 | A | 1 | NaN |
2 | 25.5 | 54.5 | 340000 | A | 1 | NaN |
3 | 26.0 | 54.7 | 320000 | A | 1 | NaN |
4 | 27.0 | 54.8 | 300000 | A | 1 | NaN |
5 | 28.0 | 54.9 | 280000 | A | 1 | NaN |
6 | 28.5 | 55.0 | 240000 | A | 1 | NaN |
7 | 29.0 | 55.1 | 210000 | A | 1 | NaN |
8 | 30.0 | 55.2 | 180000 | A | 1 | NaN |
9 | 30.3 | 55.3 | 175000 | A | 1 | NaN |
10 | 32.0 | 54.8 | 145000 | A | 1 | Smolensk |
11 | 33.2 | 54.9 | 140000 | A | 1 | Dorogobouge |
12 | 34.4 | 55.5 | 127100 | A | 1 | Chjat |
13 | 35.5 | 55.4 | 100000 | A | 1 | NaN |
14 | 36.0 | 55.5 | 100000 | A | 1 | Mojaisk |
15 | 37.6 | 55.8 | 100000 | A | 1 | Moscou |
16 | 37.7 | 55.7 | 100000 | R | 1 | NaN |
17 | 37.5 | 55.7 | 98000 | R | 1 | NaN |
18 | 37.0 | 55.0 | 97000 | R | 1 | NaN |
19 | 36.8 | 55.0 | 96000 | R | 1 | NaN |
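As a sketch of how such a join can be written, assuming the two source tables have been read from files (the file names below are placeholders, not the actual paths used above):

```python
import pandas as pd

# Placeholder file names; substitute the paths to your copies of the data
minard_troops = pd.read_csv('minard_troops.csv')   # long, lat, survivors, direction, group
minard_cities = pd.read_csv('minard_cities.csv')   # long, lat, city

# Left-join the city names onto the troop records by matching coordinates;
# rows with no matching city get NaN, as in the table above
minard = minard_troops.merge(minard_cities, on=['long', 'lat'], how='left')
minard
```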
Filter out the rows that have null values in the city column. The relevant pandas methods include:
pandas.DataFrame.isnull
pandas.DataFrame.notna
pandas.DataFrame.notnull
1 | # Filter the NaN(Null) value in city column |
long | lat | survivors | direction | group | city | |
---|---|---|---|---|---|---|
10 | 32.0 | 54.8 | 145000 | A | 1 | Smolensk |
11 | 33.2 | 54.9 | 140000 | A | 1 | Dorogobouge |
12 | 34.4 | 55.5 | 127100 | A | 1 | Chjat |
14 | 36.0 | 55.5 | 100000 | A | 1 | Mojaisk |
15 | 37.6 | 55.8 | 100000 | A | 1 | Moscou |
21 | 34.3 | 55.2 | 55000 | R | 1 | Wixma |
29 | 26.8 | 54.3 | 12000 | R | 1 | Moiodexno |
30 | 26.4 | 54.4 | 14000 | R | 1 | Smorgoni |
40 | 28.7 | 55.5 | 33000 | A | 2 | Polotzk |
41 | 28.7 | 55.5 | 33000 | R | 2 | Polotzk |
We will use this small table to demonstrate some useful Pandas.DataFrame methods. We will then use those same methods, and develop other methods, on much larger tables of data.
The Columns (Attributes) of the Table
The attribute columns lists the labels of all the columns. With minard we don’t gain much by this, but it can be very useful for tables that are so large that not all columns are visible on the screen.
1 | # The columns of the Table |
We can change a column name using the rename method. This creates a new table and leaves minard unchanged.
1 | # Rename column 'city' to 'city name', inplace=False |
long | lat | survivors | direction | group | city name | |
---|---|---|---|---|---|---|
10 | 32.0 | 54.8 | 145000 | A | 1 | Smolensk |
11 | 33.2 | 54.9 | 140000 | A | 1 | Dorogobouge |
12 | 34.4 | 55.5 | 127100 | A | 1 | Chjat |
14 | 36.0 | 55.5 | 100000 | A | 1 | Mojaisk |
15 | 37.6 | 55.8 | 100000 | A | 1 | Moscou |
21 | 34.3 | 55.2 | 55000 | R | 1 | Wixma |
29 | 26.8 | 54.3 | 12000 | R | 1 | Moiodexno |
30 | 26.4 | 54.4 | 14000 | R | 1 | Smorgoni |
40 | 28.7 | 55.5 | 33000 | A | 2 | Polotzk |
41 | 28.7 | 55.5 | 33000 | R | 2 | Polotzk |
1 | minard |
long | lat | survivors | direction | group | city | |
---|---|---|---|---|---|---|
10 | 32.0 | 54.8 | 145000 | A | 1 | Smolensk |
11 | 33.2 | 54.9 | 140000 | A | 1 | Dorogobouge |
12 | 34.4 | 55.5 | 127100 | A | 1 | Chjat |
14 | 36.0 | 55.5 | 100000 | A | 1 | Mojaisk |
15 | 37.6 | 55.8 | 100000 | A | 1 | Moscou |
21 | 34.3 | 55.2 | 55000 | R | 1 | Wixma |
29 | 26.8 | 54.3 | 12000 | R | 1 | Moiodexno |
30 | 26.4 | 54.4 | 14000 | R | 1 | Smorgoni |
40 | 28.7 | 55.5 | 33000 | A | 2 | Polotzk |
41 | 28.7 | 55.5 | 33000 | R | 2 | Polotzk |
A common pattern is to assign the original name minard to the new table, so that all future uses of minard will refer to the relabeled table. Alternatively, you can pass inplace=True to the rename method to modify the table in place; inplace defaults to False.
1 | # Rename column 'city' to 'city name', inplace=True |
long | lat | survivors | direction | group | city name | |
---|---|---|---|---|---|---|
10 | 32.0 | 54.8 | 145000 | A | 1 | Smolensk |
11 | 33.2 | 54.9 | 140000 | A | 1 | Dorogobouge |
12 | 34.4 | 55.5 | 127100 | A | 1 | Chjat |
14 | 36.0 | 55.5 | 100000 | A | 1 | Mojaisk |
15 | 37.6 | 55.8 | 100000 | A | 1 | Moscou |
21 | 34.3 | 55.2 | 55000 | R | 1 | Wixma |
29 | 26.8 | 54.3 | 12000 | R | 1 | Moiodexno |
30 | 26.4 | 54.4 | 14000 | R | 1 | Smorgoni |
40 | 28.7 | 55.5 | 33000 | A | 2 | Polotzk |
41 | 28.7 | 55.5 | 33000 | R | 2 | Polotzk |
OR
1 | # Rename column 'city' to 'city name' |
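The two equivalent patterns, written out as a sketch:

```python
# Option 1: rename returns a new table; keep it by reassigning the name
minard = minard.rename(columns={'city': 'city name'})

# Option 2: modify the existing table in place
minard.rename(columns={'city': 'city name'}, inplace=True)
```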
The Size of the Table
The expression len(df.columns) gives the number of columns in the table, and len(df) gives the number of rows.
1 | # The number of columns in the table |
1 | # The number of rows in the table |
Accessing the Data in a Column
We can use a column’s label to access the array of data in the column.
1 | minard['survivors'] |
You can also:
1 | minard.survivors |
The 6 columns are indexed 0, 1, 2, 3, 4, and 5. The column survivors
can also be accessed by using its column index.
1 | minard.iloc[:, 2] |
The 10 items in the column are at positions 0, 1, 2, and so on, up to 9. Individual items can be accessed by position using iloc, as with any array.
1 | minard.iloc[0, 2] |
Working with the Data in a Column
Because columns are arrays, we can use array operations on them to discover new information. For example, we can create a new column that contains the percent of all survivors at each city after Smolensk.
1 | # Working with the Data in a Column |
long | lat | survivors | direction | group | city name | percent surviving | |
---|---|---|---|---|---|---|---|
10 | 32.0 | 54.8 | 145000 | A | 1 | Smolensk | 1.000000 |
11 | 33.2 | 54.9 | 140000 | A | 1 | Dorogobouge | 0.965517 |
12 | 34.4 | 55.5 | 127100 | A | 1 | Chjat | 0.876552 |
14 | 36.0 | 55.5 | 100000 | A | 1 | Mojaisk | 0.689655 |
15 | 37.6 | 55.8 | 100000 | A | 1 | Moscou | 0.689655 |
21 | 34.3 | 55.2 | 55000 | R | 1 | Wixma | 0.379310 |
29 | 26.8 | 54.3 | 12000 | R | 1 | Moiodexno | 0.082759 |
30 | 26.4 | 54.4 | 14000 | R | 1 | Smorgoni | 0.096552 |
40 | 28.7 | 55.5 | 33000 | A | 2 | Polotzk | 0.227586 |
41 | 28.7 | 55.5 | 33000 | R | 2 | Polotzk | 0.227586 |
To make the proportions in the new column appear as percents, we can use the df.style.format method with a percent format such as '{:.2%}'. The style.format method accepts format strings or formatting functions, so the same approach works for dates, currencies, plain numbers, and percentages.
1 | # Show minard data types |
1 | # Format column 'percent surviving' |
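A sketch that covers both steps, creating the column and then formatting it for display; it assumes minard is the joined table above and takes the Smolensk count (the first row) as the baseline, which reproduces the values shown:

```python
# Fraction of the initial force (the count at Smolensk, the first row) surviving
initial = minard['survivors'].iloc[0]
minard['percent surviving'] = minard['survivors'] / initial

# Display the new column as percentages without changing the stored values
minard.style.format({'percent surviving': '{:.2%}'})
```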
Choosing Sets of Columns
Selecting a list of column labels, either with square brackets or with the loc accessor, creates a new table that contains only the specified columns.
1 | minard[['long', 'lat']] |
OR
1 | minard.loc[:, ['long', 'lat']] |
long | lat | |
---|---|---|
10 | 32.0 | 54.8 |
11 | 33.2 | 54.9 |
12 | 34.4 | 55.5 |
14 | 36.0 | 55.5 |
15 | 37.6 | 55.8 |
21 | 34.3 | 55.2 |
29 | 26.8 | 54.3 |
30 | 26.4 | 54.4 |
40 | 28.7 | 55.5 |
41 | 28.7 | 55.5 |
The same selection can be made using column positions with iloc instead of loc.
1 | minard.iloc[:, :2] |
The result of using loc with a list of labels is a new table, even when the list contains just one column.
1 | minard.loc[:, ['survivors']] |
survivors | |
---|---|
10 | 145000 |
11 | 140000 |
12 | 127100 |
14 | 100000 |
15 | 100000 |
21 | 55000 |
29 | 12000 |
30 | 14000 |
40 | 33000 |
41 | 33000 |
Notice that the result is a table, unlike the result of selecting a single column with minard['survivors'], which is a Series (a one-dimensional array of values).
1 | # Notice that the result is a table, unlike the result of column, which is an array. |
Another way to create a new table consisting of a set of columns is to drop the columns you don’t want.
You can also drop rows by index, or drop duplicated records.
pandas.DataFrame.drop
pandas.DataFrame.drop_duplicates
pandas.DataFrame.duplicated
1 | minard.drop(columns=['long', 'lat', 'direction']) |
survivors | group | city name | percent surviving | |
---|---|---|---|---|
10 | 145000 | 1 | Smolensk | 1.000000 |
11 | 140000 | 1 | Dorogobouge | 0.965517 |
12 | 127100 | 1 | Chjat | 0.876552 |
14 | 100000 | 1 | Mojaisk | 0.689655 |
15 | 100000 | 1 | Moscou | 0.689655 |
21 | 55000 | 1 | Wixma | 0.379310 |
29 | 12000 | 1 | Moiodexno | 0.082759 |
30 | 14000 | 1 | Smorgoni | 0.096552 |
40 | 33000 | 2 | Polotzk | 0.227586 |
41 | 33000 | 2 | Polotzk | 0.227586 |
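A sketch of the different ways to build a smaller table of columns from minard:

```python
# Each expression returns a new table; minard itself is untouched
minard[['long', 'lat']]                            # select by a list of labels
minard.loc[:, ['long', 'lat']]                     # the same, via loc
minard.iloc[:, :2]                                 # select by column position
minard.drop(columns=['long', 'lat', 'direction'])  # keep everything except these

# Dropping duplicate rows also returns a new table
minard.drop_duplicates()
```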
Neither loc, iloc, nor drop changes the original table. Instead, they create new smaller tables that share the same data. The fact that the original table is preserved is useful! You can generate multiple different tables that only consider certain columns without worrying that one analysis will affect the other.
1 | minard |
long | lat | survivors | direction | group | city name | percent surviving | |
---|---|---|---|---|---|---|---|
10 | 32.0 | 54.8 | 145000 | A | 1 | Smolensk | 1.000000 |
11 | 33.2 | 54.9 | 140000 | A | 1 | Dorogobouge | 0.965517 |
12 | 34.4 | 55.5 | 127100 | A | 1 | Chjat | 0.876552 |
14 | 36.0 | 55.5 | 100000 | A | 1 | Mojaisk | 0.689655 |
15 | 37.6 | 55.8 | 100000 | A | 1 | Moscou | 0.689655 |
21 | 34.3 | 55.2 | 55000 | R | 1 | Wixma | 0.379310 |
29 | 26.8 | 54.3 | 12000 | R | 1 | Moiodexno | 0.082759 |
30 | 26.4 | 54.4 | 14000 | R | 1 | Smorgoni | 0.096552 |
40 | 28.7 | 55.5 | 33000 | A | 2 | Polotzk | 0.227586 |
41 | 28.7 | 55.5 | 33000 | R | 2 | Polotzk | 0.227586 |
All of the methods that we have used above can be applied to any table.
Sorting Rows
“The NBA is the highest paying professional sports league in the world,” reported CNN in March 2016. The table nba_salaries contains the salaries of all National Basketball Association players in 2015-2016.
Each row represents one player. The columns are:
Column Label | Description |
---|---|
PLAYER | Player's name |
POSITION | Player's position on team |
TEAM | Team name |
'15-'16 SALARY | Player's salary in 2015-2016, in millions of dollars |
The code for the positions is PG (Point Guard), SG (Shooting Guard), PF (Power Forward), SF (Small Forward), and C (Center). But what follows doesn’t involve details about how basketball is played.
The first row shows that Paul Millsap, Power Forward for the Atlanta Hawks, had a salary of almost $18.7 million in 2015-2016.
1 | # This table can be found online: https://www.statcrunch.com/app/index.php?dataid=1843341 |
PLAYER | POSITION | TEAM | '15-'16 SALARY | |
---|---|---|---|---|
0 | Paul Millsap | PF | Atlanta Hawks | 18.671659 |
1 | Al Horford | C | Atlanta Hawks | 12.000000 |
2 | Tiago Splitter | C | Atlanta Hawks | 9.756250 |
3 | Jeff Teague | PG | Atlanta Hawks | 8.000000 |
4 | Kyle Korver | SG | Atlanta Hawks | 5.746479 |
... | ... | ... | ... | ... |
412 | Gary Neal | PG | Washington Wizards | 2.139000 |
413 | DeJuan Blair | C | Washington Wizards | 2.000000 |
414 | Kelly Oubre Jr. | SF | Washington Wizards | 1.920240 |
415 | Garrett Temple | SG | Washington Wizards | 1.100602 |
416 | Jarell Eddie | SG | Washington Wizards | 0.561716 |
The table contains 417 rows, one for each player; only 10 of the rows are displayed (the first five and the last five). The head method allows us to specify how many rows to show from the top, with the default (no argument) being five.
1 | # Show first 3 rows |
PLAYER | POSITION | TEAM | '15-'16 SALARY | |
---|---|---|---|---|
0 | Paul Millsap | PF | Atlanta Hawks | 18.671659 |
1 | Al Horford | C | Atlanta Hawks | 12.000000 |
2 | Tiago Splitter | C | Atlanta Hawks | 9.756250 |
Glance through about 20 rows or so, and you will see that the rows are in alphabetical order by team name. It’s also possible to list the same rows in alphabetical order by player name using the sort_values method. The argument to sort_values is a column label or a list of labels.
1 | # Sort nba_salaries based on 'PLAYER' then show first 5 rows |
PLAYER | POSITION | TEAM | '15-'16 SALARY | |
---|---|---|---|---|
68 | Aaron Brooks | PG | Chicago Bulls | 2.250000 |
291 | Aaron Gordon | PF | Orlando Magic | 4.171680 |
59 | Aaron Harrison | SG | Charlotte Hornets | 0.525093 |
235 | Adreian Payne | PF | Minnesota Timberwolves | 1.938840 |
1 | Al Horford | C | Atlanta Hawks | 12.000000 |
To examine the players’ salaries, it would be much more helpful if the data were ordered by salary.
To do this, we will first simplify the label of the column of salaries (just for convenience), and then sort by the new label SALARY
.
This arranges all the rows of the table in increasing order of salary, with the lowest salary appearing first. The output is a new table with the same columns as the original but with the rows rearranged.
1 | # Rename '15-'16 SALARY to 'SALARY' |
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
267 | Thanasis Antetokounmpo | SF | New York Knicks | 0.030888 |
327 | Cory Jefferson | PF | Phoenix Suns | 0.049709 |
326 | Jordan McRae | SG | Phoenix Suns | 0.049709 |
324 | Orlando Johnson | SG | Phoenix Suns | 0.055722 |
325 | Phil Pressey | PG | Phoenix Suns | 0.055722 |
... | ... | ... | ... | ... |
131 | Dwight Howard | C | Houston Rockets | 22.359364 |
255 | Carmelo Anthony | SF | New York Knicks | 22.875000 |
72 | LeBron James | SF | Cleveland Cavaliers | 22.970500 |
29 | Joe Johnson | SF | Brooklyn Nets | 24.894863 |
169 | Kobe Bryant | SF | Los Angeles Lakers | 25.000000 |
These figures are somewhat difficult to compare as some of these players changed teams during the season and received salaries from more than one team; only the salary from the last team appears in the table. Point Guard Phil Pressey, for example, moved from Philadelphia to Phoenix during the year, and might be moving yet again to the Golden State Warriors.
The CNN report is about the other end of the salary scale – the players who are among the highest paid in the world.
To order the rows of the table in decreasing order of salary, we must use sort_values with the option ascending=False.
1 | # Sort nba_salaries by 'SALARY', ascending=False |
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
169 | Kobe Bryant | SF | Los Angeles Lakers | 25.000000 |
29 | Joe Johnson | SF | Brooklyn Nets | 24.894863 |
72 | LeBron James | SF | Cleveland Cavaliers | 22.970500 |
255 | Carmelo Anthony | SF | New York Knicks | 22.875000 |
131 | Dwight Howard | C | Houston Rockets | 22.359364 |
... | ... | ... | ... | ... |
200 | Elliot Williams | SG | Memphis Grizzlies | 0.055722 |
324 | Orlando Johnson | SG | Phoenix Suns | 0.055722 |
327 | Cory Jefferson | PF | Phoenix Suns | 0.049709 |
326 | Jordan McRae | SG | Phoenix Suns | 0.049709 |
267 | Thanasis Antetokounmpo | SF | New York Knicks | 0.030888 |
Kobe Bryant, in his final season with the Lakers, was the highest paid at a salary of $25 million. Notice that the MVP Stephen Curry doesn’t appear among the top 10. He is quite a bit further down the list, as we will see later.
Named Arguments
The ascending=False
portion of this call expression is called a named argument. When a function or method is called, each argument has both a position and a name. Both are evident from the help text of a function or method.
1 | help(nba.sort_values) |
At the very top of this help text, the signature of the sort_values
method appears:
1 | sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key: 'ValueKeyFunc' = None) |
This describes the positions, names, and default values of the arguments to sort_values. When calling this method, you can use either positional arguments or named arguments, so the calls in the example below do exactly the same thing.
1 | # Examples |
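For instance, all of the following calls are equivalent; 'SALARY' fills the positional parameter by, and ascending is supplied by name (a sketch, assuming the renamed nba table from above):

```python
nba.sort_values('SALARY', ascending=False)
nba.sort_values(by='SALARY', ascending=False)
nba.sort_values(ascending=False, by='SALARY')
```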
Example:
Sort nba first by ‘TEAM’, then by ‘SALARY’.
1 | # Sort nba first by 'TEAM', then by 'SALARY' |
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
400 | John Wall | PG | Washington Wizards | 15.851950 |
401 | Nene Hilario | C | Washington Wizards | 13.000000 |
402 | Marcin Gortat | C | Washington Wizards | 11.217391 |
403 | Markieff Morris | PF | Washington Wizards | 8.000000 |
404 | Bradley Beal | SG | Washington Wizards | 5.694674 |
... | ... | ... | ... | ... |
9 | Tim Hardaway Jr. | SG | Atlanta Hawks | 1.304520 |
10 | Walter Tavares | C | Atlanta Hawks | 1.000000 |
11 | Jason Richardson | SG | Atlanta Hawks | 0.947276 |
12 | Lamar Patterson | SG | Atlanta Hawks | 0.525093 |
13 | Terran Petteway | SG | Atlanta Hawks | 0.525093 |
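Given the order of the rows displayed above, the truncated cell presumably sorts both keys in descending order; a sketch, with the alphabetical-team variant alongside:

```python
# Teams Z to A, and highest salary first within each team
nba.sort_values(by=['TEAM', 'SALARY'], ascending=False)

# Teams A to Z, still with the highest salary first within each team
nba.sort_values(by=['TEAM', 'SALARY'], ascending=[True, False])
```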
When an argument is simply True or False, it’s a useful convention to include the argument name so that it’s more obvious what the argument value means.
Selecting Rows
Often, we would like to extract just those rows that correspond to entries with a particular feature. For example, we might want only the rows corresponding to the Warriors, or to players who earned more than $10 million. Or we might just want the top five earners.
Specified Rows
The DataFrame methods take and iloc do just that – they select a specified set of rows. The argument is a row position or an array of positions, and the result is a new table consisting of only those rows.
For example, if we wanted just the first row of nba, we could proceed as follows.
1 | nba |
PLAYER | POSITION | TEAM | SALARY |
---|---|---|---|
Paul Millsap | PF | Atlanta Hawks | 18.6717 |
Al Horford | C | Atlanta Hawks | 12 |
Tiago Splitter | C | Atlanta Hawks | 9.75625 |
Jeff Teague | PG | Atlanta Hawks | 8 |
Kyle Korver | SG | Atlanta Hawks | 5.74648 |
Thabo Sefolosha | SF | Atlanta Hawks | 4 |
Mike Scott | PF | Atlanta Hawks | 3.33333 |
Kent Bazemore | SF | Atlanta Hawks | 2 |
Dennis Schroder | PG | Atlanta Hawks | 1.7634 |
Tim Hardaway Jr. | SG | Atlanta Hawks | 1.30452 |
1 | # Show the 1st row |
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
0 | Paul Millsap | PF | Atlanta Hawks | 18.671659 |
This is a new table with just the single row that we specified.
We could also get the fourth, fifth, and sixth rows by specifying a range of indices as the argument.
1 | # Show rows between 3 to 5 |
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
3 | Jeff Teague | PG | Atlanta Hawks | 8.000000 |
4 | Kyle Korver | SG | Atlanta Hawks | 5.746479 |
5 | Thabo Sefolosha | SF | Atlanta Hawks | 4.000000 |
If we want a table of the top 5 highest paid players, we can first sort the list by salary and then take the first five rows:
1 | # Take the top five high salary players |
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
169 | Kobe Bryant | SF | Los Angeles Lakers | 25.000000 |
29 | Joe Johnson | SF | Brooklyn Nets | 24.894863 |
72 | LeBron James | SF | Cleveland Cavaliers | 22.970500 |
255 | Carmelo Anthony | SF | New York Knicks | 22.875000 |
131 | Dwight Howard | C | Houston Rockets | 22.359364 |
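A sketch of these positional selections:

```python
import numpy as np

nba.iloc[[0]]             # just the first row, as a one-row table
nba.take([0])             # DataFrame.take does the same with positional indices

nba.iloc[3:6]             # the fourth, fifth, and sixth rows
nba.take(np.arange(3, 6))

# Top five salaries: sort in decreasing order, then keep the first five rows
nba.sort_values('SALARY', ascending=False).head(5)
```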
Rows Corresponding to a Specified Feature
More often, we will want to access data in a set of rows that have a certain feature, but whose indices we don’t know ahead of time. For example, we might want data on all the players who made more than $10 million, but we don’t want to spend time counting rows in the sorted table.
Boolean selection does the job for us. Its output is a table with the same columns as the original, but only the rows where the feature occurs.
The condition is built from the column that contains the information about whether or not a row has the feature we want. If the feature is “made more than $10 million”, the column is SALARY.
Placing the condition inside the table's square brackets selects the matching rows. A couple of examples will make the general method of specification easier to understand.
In the first example, we extract the data for all those who earned more than $10 million.
1 | # Filter the rows where 'SALARY' >= 10 |
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
0 | Paul Millsap | PF | Atlanta Hawks | 18.671659 |
1 | Al Horford | C | Atlanta Hawks | 12.000000 |
29 | Joe Johnson | SF | Brooklyn Nets | 24.894863 |
30 | Thaddeus Young | PF | Brooklyn Nets | 11.235955 |
42 | Al Jefferson | C | Charlotte Hornets | 13.500000 |
... | ... | ... | ... | ... |
368 | DeMar DeRozan | SG | Toronto Raptors | 10.050000 |
383 | Gordon Hayward | SF | Utah Jazz | 15.409570 |
400 | John Wall | PG | Washington Wizards | 15.851950 |
401 | Nene Hilario | C | Washington Wizards | 13.000000 |
402 | Marcin Gortat | C | Washington Wizards | 11.217391 |
The use of the condition nba['SALARY'] > 10 ensured that each selected row had a value of SALARY that was greater than 10.
There are 69 rows in the new table, corresponding to the 69 players who made more than 10 million dollars. Arranging these rows in order makes the data easier to analyze. DeMar DeRozan of the Toronto Raptors was the “poorest” of this group, at a salary of just over 10 million dollars.
1 | # Filter the rows where 'SALARY' >= 10, then sort by 'SALARY', ascending=False |
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
169 | Kobe Bryant | SF | Los Angeles Lakers | 25.000000 |
29 | Joe Johnson | SF | Brooklyn Nets | 24.894863 |
72 | LeBron James | SF | Cleveland Cavaliers | 22.970500 |
255 | Carmelo Anthony | SF | New York Knicks | 22.875000 |
131 | Dwight Howard | C | Houston Rockets | 22.359364 |
... | ... | ... | ... | ... |
95 | Wilson Chandler | SF | Denver Nuggets | 10.449438 |
144 | Monta Ellis | SG | Indiana Pacers | 10.300000 |
204 | Luol Deng | SF | Miami Heat | 10.151612 |
298 | Gerald Wallace | SF | Philadelphia 76ers | 10.105855 |
368 | DeMar DeRozan | SG | Toronto Raptors | 10.050000 |
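A sketch of the selection and the follow-up sort:

```python
# Players who made more than $10 million
big_earners = nba[nba['SALARY'] > 10]

# The same rows, arranged from highest to lowest salary
big_earners.sort_values('SALARY', ascending=False)
```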
How much did Stephen Curry make? For the answer, we have to access the row where the value of PLAYER is equal to 'Stephen Curry'. The result is a table consisting of just one row:
1 | # Filter the rows where 'PLAYER' == 'Stephen Curry' |
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
121 | Stephen Curry | PG | Golden State Warriors | 11.370786 |
Curry made just under $11.4 million dollars. That’s a lot of money, but it’s less than half the salary of LeBron James. You’ll find that salary in the “Top 5” table earlier in this section, or you could find it replacing ‘Stephen Curry’ by ‘LeBron James’ in the line of code above.
In the code, boolean selection is used again, but this time with the comparison == instead of >. Thus, for example, you can get a table of all the Warriors:
1 | # Get a table of all the Warriors |
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
117 | Klay Thompson | SG | Golden State Warriors | 15.501000 |
118 | Draymond Green | PF | Golden State Warriors | 14.260870 |
119 | Andrew Bogut | C | Golden State Warriors | 13.800000 |
120 | Andre Iguodala | SF | Golden State Warriors | 11.710456 |
121 | Stephen Curry | PG | Golden State Warriors | 11.370786 |
122 | Jason Thompson | PF | Golden State Warriors | 7.008475 |
123 | Shaun Livingston | PG | Golden State Warriors | 5.543725 |
124 | Harrison Barnes | SF | Golden State Warriors | 3.873398 |
125 | Marreese Speights | C | Golden State Warriors | 3.815000 |
126 | Leandro Barbosa | SG | Golden State Warriors | 2.500000 |
127 | Festus Ezeli | C | Golden State Warriors | 2.008748 |
128 | Brandon Rush | SF | Golden State Warriors | 1.270964 |
129 | Kevon Looney | SF | Golden State Warriors | 1.131960 |
130 | Anderson Varejao | PF | Golden State Warriors | 0.289755 |
Alternatively, you can use substring matching to find every record whose TEAM value contains 'Warriors'.
1 | # Fuzzy search for records in 'TEAM' columns containing 'Warriors' |
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
117 | Klay Thompson | SG | Golden State Warriors | 15.501000 |
118 | Draymond Green | PF | Golden State Warriors | 14.260870 |
119 | Andrew Bogut | C | Golden State Warriors | 13.800000 |
120 | Andre Iguodala | SF | Golden State Warriors | 11.710456 |
121 | Stephen Curry | PG | Golden State Warriors | 11.370786 |
122 | Jason Thompson | PF | Golden State Warriors | 7.008475 |
123 | Shaun Livingston | PG | Golden State Warriors | 5.543725 |
124 | Harrison Barnes | SF | Golden State Warriors | 3.873398 |
125 | Marreese Speights | C | Golden State Warriors | 3.815000 |
126 | Leandro Barbosa | SG | Golden State Warriors | 2.500000 |
127 | Festus Ezeli | C | Golden State Warriors | 2.008748 |
128 | Brandon Rush | SF | Golden State Warriors | 1.270964 |
129 | Kevon Looney | SF | Golden State Warriors | 1.131960 |
130 | Anderson Varejao | PF | Golden State Warriors | 0.289755 |
This portion of the table is already sorted by salary, because the original table listed players sorted by salary within the same team.
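Both selections as a sketch; str.contains performs the substring match:

```python
# Exact match on the full team name
nba[nba['TEAM'] == 'Golden State Warriors']

# Substring match: any team whose name contains 'Warriors'
nba[nba['TEAM'].str.contains('Warriors')]
```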
Multiple Features
You can access rows that have multiple specified features by combining boolean conditions with &. For example, here is a way to extract all the Point Guards whose salaries were over $15 million.
1 | # Filter 'POSITION' == 'PG' & 'SALARY' > 15 |
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
60 | Derrick Rose | PG | Chicago Bulls | 20.093064 |
74 | Kyrie Irving | PG | Cleveland Cavaliers | 16.407501 |
156 | Chris Paul | PG | Los Angeles Clippers | 21.468695 |
269 | Russell Westbrook | PG | Oklahoma City Thunder | 16.744218 |
400 | John Wall | PG | Washington Wizards | 15.851950 |
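Written out, the combined condition looks like this (each comparison goes in its own parentheses):

```python
nba[(nba['POSITION'] == 'PG') & (nba['SALARY'] > 15)]
```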
General Form
By now you will have realized that the general way to create a new table by selecting rows with a given feature is to index the table with the appropriate boolean condition:
1 | # Filter a SALARY range |
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
204 | Luol Deng | SF | Miami Heat | 10.151612 |
298 | Gerald Wallace | SF | Philadelphia 76ers | 10.105855 |
356 | Danny Green | SG | San Antonio Spurs | 10.000000 |
368 | DeMar DeRozan | SG | Toronto Raptors | 10.050000 |
Notice that the table above includes Danny Green, who made $10 million, but not Monta Ellis, who made $10.3 million. As with ranges elsewhere in Python, the condition used here includes the left endpoint but not the right.
If we specify a condition that isn’t satisfied by any row, we get a table with column labels but no rows.
1 | nba[nba['PLAYER']=='Barack Obama'] |
PLAYER | POSITION | TEAM | SALARY |
---|---|---|---|
Example: Population Trends
We are now ready to work with large tables of data. The file below contains “Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States.” Notice that read_csv
can read data directly from a URL.
1 | # As of Jan 2017, this census file is online here: |
SEX | AGE | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | POPESTIMATE2013 | POPESTIMATE2014 | POPESTIMATE2015 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 3944153 | 3944160 | 3951330 | 3963087 | 3926540 | 3931141 | 3949775 | 3978038 |
1 | 0 | 1 | 3978070 | 3978090 | 3957888 | 3966551 | 3977939 | 3942872 | 3949776 | 3968564 |
2 | 0 | 2 | 4096929 | 4096939 | 4090862 | 3971565 | 3980095 | 3992720 | 3959664 | 3966583 |
3 | 0 | 3 | 4119040 | 4119051 | 4111920 | 4102470 | 3983157 | 3992734 | 4007079 | 3974061 |
4 | 0 | 4 | 4063170 | 4063186 | 4077551 | 4122294 | 4112849 | 3994449 | 4005716 | 4020035 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
301 | 2 | 97 | 53582 | 53605 | 54118 | 57159 | 59533 | 61255 | 62779 | 69285 |
302 | 2 | 98 | 36641 | 36675 | 37532 | 40116 | 42857 | 44359 | 46208 | 47272 |
303 | 2 | 99 | 26193 | 26214 | 26074 | 27030 | 29320 | 31112 | 32517 | 34064 |
304 | 2 | 100 | 44202 | 44246 | 45058 | 47556 | 50661 | 53902 | 58008 | 61886 |
305 | 2 | 999 | 156964212 | 156969328 | 157258820 | 158427085 | 159581546 | 160720625 | 161952064 | 163189523 |
Only 10 of the rows of the table are displayed (the first five and the last five). Later we will see how to display the entire table; however, this is typically not useful with large tables.
A description of the table appears online. The SEX column contains numeric codes: 0 stands for the total, 1 for male, and 2 for female. The AGE column contains ages in completed years, but the special value 999 represents the total across all ages. The rest of the columns contain estimates of the US population.
Typically, a public table will contain more information than necessary for a particular investigation or analysis. In this case, let us suppose that we are only interested in the population changes from 2010 to 2014. Let us select the relevant columns.
1 | # Select relevant columns |
SEX | AGE | POPESTIMATE2010 | POPESTIMATE2014 | |
---|---|---|---|---|
0 | 0 | 0 | 3951330 | 3949775 |
1 | 0 | 1 | 3957888 | 3949776 |
2 | 0 | 2 | 4090862 | 3959664 |
3 | 0 | 3 | 4111920 | 4007079 |
4 | 0 | 4 | 4077551 | 4005716 |
... | ... | ... | ... | ... |
301 | 2 | 97 | 54118 | 62779 |
302 | 2 | 98 | 37532 | 46208 |
303 | 2 | 99 | 26074 | 32517 |
304 | 2 | 100 | 45058 | 58008 |
305 | 2 | 999 | 157258820 | 161952064 |
1 | # Simplify the columns name |
SEX | AGE | 2010 | 2014 | |
---|---|---|---|---|
0 | 0 | 0 | 3951330 | 3949775 |
1 | 0 | 1 | 3957888 | 3949776 |
2 | 0 | 2 | 4090862 | 3959664 |
3 | 0 | 3 | 4111920 | 4007079 |
4 | 0 | 4 | 4077551 | 4005716 |
... | ... | ... | ... | ... |
301 | 2 | 97 | 54118 | 62779 |
302 | 2 | 98 | 37532 | 46208 |
303 | 2 | 99 | 26074 | 32517 |
304 | 2 | 100 | 45058 | 58008 |
305 | 2 | 999 | 157258820 | 161952064 |
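A sketch of reading the file and trimming it down to these four columns (the file name below is a placeholder for the census file referenced above):

```python
import pandas as pd

# Placeholder path/URL for the census file referenced above
full_census = pd.read_csv('census_2010_2015.csv')

# Keep only the columns needed for the 2010-2014 comparison
partial_census = full_census[['SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2014']]

# Simplify the column labels
us_pop = partial_census.rename(columns={'POPESTIMATE2010': '2010',
                                        'POPESTIMATE2014': '2014'})
us_pop
```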
We now have a table that is easy to work with. Each column of the table is an array of the same length, and so columns can be combined using arithmetic. Here is the change in population between 2010 and 2014.
1 | # The change in population between 2010 and 2014 |
Let us augment us_pop with a column that contains these changes, both in absolute terms and as percents relative to the value in 2010.
pandas.DataFrame.assign
pandas.DataFrame.insert
Add multiple columns to dataframe in Pandas
pandas.io.formats.style.Styler
1 | # Make a new df census from us_pop, then add two new columns |
SEX | AGE | 2010 | 2014 | Change | Percent Change | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 3951330 | 3949775 | -1555 | -0.04% |
1 | 0 | 1 | 3957888 | 3949776 | -8112 | -0.20% |
2 | 0 | 2 | 4090862 | 3959664 | -131198 | -3.21% |
3 | 0 | 3 | 4111920 | 4007079 | -104841 | -2.55% |
4 | 0 | 4 | 4077551 | 4005716 | -71835 | -1.76% |
5 | 0 | 5 | 4064653 | 4006900 | -57753 | -1.42% |
6 | 0 | 6 | 4073013 | 4135930 | 62917 | 1.54% |
7 | 0 | 7 | 4043046 | 4155326 | 112280 | 2.78% |
8 | 0 | 8 | 4025604 | 4120903 | 95299 | 2.37% |
9 | 0 | 9 | 4125415 | 4108349 | -17066 | -0.41% |
10 | 0 | 10 | 4187062 | 4116942 | -70120 | -1.67% |
11 | 0 | 11 | 4115511 | 4087402 | -28109 | -0.68% |
12 | 0 | 12 | 4113279 | 4070682 | -42597 | -1.04% |
13 | 0 | 13 | 4119666 | 4171030 | 51364 | 1.25% |
14 | 0 | 14 | 4145614 | 4233839 | 88225 | 2.13% |
15 | 0 | 15 | 4231002 | 4164796 | -66206 | -1.56% |
16 | 0 | 16 | 4313252 | 4168559 | -144693 | -3.35% |
17 | 0 | 17 | 4376367 | 4186513 | -189854 | -4.34% |
18 | 0 | 18 | 4491005 | 4227920 | -263085 | -5.86% |
19 | 0 | 19 | 4571411 | 4329038 | -242373 | -5.30% |
20 | 0 | 20 | 4568517 | 4421330 | -147187 | -3.22% |
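A sketch of how the two columns can be added and displayed; the percent formatting via style.format matches the displayed values and leaves the stored numbers untouched:

```python
# us_pop is the table built above, with columns SEX, AGE, 2010, 2014
census = us_pop.copy()
census['Change'] = census['2014'] - census['2010']
census['Percent Change'] = census['Change'] / census['2010']

# Show the last column as a percentage with two decimal places
census.style.format({'Percent Change': '{:.2%}'})
```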
Sorting the data. Let us sort the table in decreasing order of the absolute change in population.
1 | # Sort census by Change, ascending=False |
SEX | AGE | 2010 | 2014 | Change | Percent Change | |
---|---|---|---|---|---|---|
101 | 0 | 999 | 309346863 | 318907401 | 9560538 | 0.030906 |
203 | 1 | 999 | 152088043 | 156955337 | 4867294 | 0.032003 |
305 | 2 | 999 | 157258820 | 161952064 | 4693244 | 0.029844 |
67 | 0 | 67 | 2693707 | 3485241 | 791534 | 0.293846 |
64 | 0 | 64 | 2706055 | 3487559 | 781504 | 0.288798 |
... | ... | ... | ... | ... | ... | ... |
39 | 0 | 39 | 4324463 | 3982507 | -341956 | -0.079075 |
48 | 0 | 48 | 4534663 | 4159738 | -374925 | -0.082680 |
46 | 0 | 46 | 4529716 | 4077151 | -452565 | -0.099910 |
47 | 0 | 47 | 4535473 | 4082883 | -452590 | -0.099789 |
40 | 0 | 40 | 4387480 | 3859395 | -528085 | -0.120362 |
Not surprisingly, the top row of the sorted table is the line that corresponds to the entire population: both sexes and all age groups. From 2010 to 2014, the population of the United States increased by about 9.5 million people, a change of just over 3%.
The next two rows correspond to all the men and all the women respectively. The male population grew more than the female population, both in absolute and percentage terms. Both percent changes were around 3%.
Now take a look at the next few rows. The percent change jumps from about 3% for the overall population to almost 30% for the people in their late sixties and early seventies. This stunning change contributes to what is known as the greying of America.
By far the greatest absolute change, apart from the totals, was among those in the 64-67 age group in 2014. What could explain this large increase? We can explore this question by examining the years in which the relevant groups were born.
Those who were in the 64-67 age group in 2010 were born in the years 1943 to 1946. The attack on Pearl Harbor was in late 1941, and by 1942 U.S. forces were heavily engaged in a massive war that ended in 1945.
Those who were 64 to 67 years old in 2014 were born in the years 1947 to 1950, at the height of the post-WWII baby boom in the United States.
The post-war jump in births is the major reason for the large changes that we have observed.
Example: Trends in Gender
We are now equipped with enough coding skills to examine features and trends in subgroups of the U.S. population. In this example, we will look at the distribution of males and females across age groups. We will continue using the us_pop
table from the previous section.
1 | us_pop |
SEX | AGE | 2010 | 2014 | |
---|---|---|---|---|
0 | 0 | 0 | 3951330 | 3949775 |
1 | 0 | 1 | 3957888 | 3949776 |
2 | 0 | 2 | 4090862 | 3959664 |
3 | 0 | 3 | 4111920 | 4007079 |
4 | 0 | 4 | 4077551 | 4005716 |
... | ... | ... | ... | ... |
301 | 2 | 97 | 54118 | 62779 |
302 | 2 | 98 | 37532 | 46208 |
303 | 2 | 99 | 26074 | 32517 |
304 | 2 | 100 | 45058 | 58008 |
305 | 2 | 999 | 157258820 | 161952064 |
As we know from having examined this dataset earlier, a description of the table appears online. Here is a reminder of what the table contains.
Each row represents an age group. The SEX
column contains numeric codes: 0 stands for the total, 1 for male, and 2 for female. The AGE
column contains ages in completed years, but the special value 999 represents the entire population regardless of age. The rest of the columns contain estimates of the US population.
Understanding AGE = 100
As a preliminary, let’s interpret data in the final age category in the table, where AGE
is 100. The code below extracts the rows for the combined group of men and women (SEX
code 0) for the highest ages.
1 | # Keep the rows with SEX code 0 (both sexes combined) and AGE between 97 and 100 |
SEX | AGE | 2010 | 2014 | |
---|---|---|---|---|
97 | 0 | 97 | 68893 | 83089 |
98 | 0 | 98 | 47037 | 59726 |
99 | 0 | 99 | 32178 | 41468 |
100 | 0 | 100 | 54410 | 71626 |
Not surprisingly, the numbers of people are smaller at higher ages – for example, there are fewer 99-year-olds than 98-year-olds.
It does come as a surprise, though, that the numbers for AGE
100 are quite a bit larger than those for age 99. A closer examination of the documentation shows that it’s because the Census Bureau used 100 as the code for everyone aged 100 or more.
The row with AGE
100 doesn’t just represent 100-year-olds – it also includes those who are older than 100. That is why the numbers in that row are larger than in the row for the 99-year-olds.
Overall Proportions of Males and Females
We will now begin looking at gender ratios in 2014. First, let’s look at all the age groups together. Remember that this means looking at the rows where the “age” is coded 999. The table all_ages contains this information. There are three rows: one for the total of both genders, one for males (SEX
code 1), and one for females (SEX
code 2).
1 | # Drop column '2010', then keep the rows where 'AGE' == 999 |
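A sketch of that cell, assuming us_pop is the table shown above:

```python
# Sketch: keep the 2014 estimates only, then keep the all-ages rows (AGE code 999).
us_pop_2014 = us_pop.drop(columns=['2010'])
all_ages = us_pop_2014[us_pop_2014['AGE'] == 999]
all_ages
```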
SEX | AGE | 2014 | |
---|---|---|---|
101 | 0 | 999 | 318907401 |
203 | 1 | 999 | 156955337 |
305 | 2 | 999 | 161952064 |
Row 0 of all_ages contains the total U.S. population in 2014: just under 319 million people.
Row 1 contains the counts for males and Row 2 for females. Compare these two rows to see that in 2014, there were more females than males in the United States.
The population counts in Row 1 and Row 2 add up to the total population in Row 0.
For comparability with other quantities, we will need to convert these counts to proportions of the total population. Let's access the total for 2014 and name it. Then, we'll show a population table with a proportion column. Consistent with our earlier observation that there were more females than males, about 50.8% of the 2014 population was female and about 49.2% was male.
1 | # Add a column Proportion |
SEX | AGE | 2014 | Proportion | |
---|---|---|---|---|
101 | 0 | 999 | 318907401 | 1.000000 |
203 | 1 | 999 | 156955337 | 0.492166 |
305 | 2 | 999 | 161952064 | 0.507834 |
Proportions of Boys and Girls among Infants
When we look at infants, however, the opposite is true. Let’s define infants to be babies who have not yet completed one year, represented in the row corresponding to AGE
0. Here are their numbers in the population. You can see that male infants outnumbered female infants.
1 | # Filter AGE == 0 |
SEX | AGE | 2014 | |
---|---|---|---|
0 | 0 | 0 | 3949775 |
102 | 1 | 0 | 2020326 |
204 | 2 | 0 | 1929449 |
As before, we can convert these counts to percents out of the total numbers of infants. The resulting table shows that in 2014, just over 51% of infants in the U.S. were male.
1 | # Add a column Proportion |
SEX | AGE | 2014 | Proportion | |
---|---|---|---|---|
0 | 0 | 0 | 3949775 | 1.000000 |
102 | 1 | 0 | 2020326 | 0.511504 |
204 | 2 | 0 | 1929449 | 0.488496 |
In fact, it has long been observed that the proportion of boys among newborns is slightly more than 1/2. The reason for this is not thoroughly understood, and scientists are still working on it.
Female:Male Gender Ratio at Each Age
We have seen that while there are more baby boys than baby girls, there are more females than males overall. So it’s clear that the split between genders must vary across age groups.
To study this variation, we will separate out the data for the females and the males, and eliminate the row where all the ages are aggregated and AGE is coded as 999. The tables females and males contain the data for each of the two genders.
1 | # Get all females data except total age 999 |
SEX | AGE | 2014 | |
---|---|---|---|
204 | 2 | 0 | 1929449 |
205 | 2 | 1 | 1931375 |
206 | 2 | 2 | 1935991 |
207 | 2 | 3 | 1957483 |
208 | 2 | 4 | 1961199 |
... | ... | ... | ... |
300 | 2 | 96 | 89432 |
301 | 2 | 97 | 62779 |
302 | 2 | 98 | 46208 |
303 | 2 | 99 | 32517 |
304 | 2 | 100 | 58008 |
1 | # Get all males data except total age 999 |
SEX | AGE | 2014 | |
---|---|---|---|
102 | 1 | 0 | 2020326 |
103 | 1 | 1 | 2018401 |
104 | 1 | 2 | 2023673 |
105 | 1 | 3 | 2049596 |
106 | 1 | 4 | 2044517 |
... | ... | ... | ... |
198 | 1 | 96 | 31053 |
199 | 1 | 97 | 20310 |
200 | 1 | 98 | 13518 |
201 | 1 | 99 | 8951 |
202 | 1 | 100 | 13618 |
The plan now is to compare the number of women and the number of men at each age in 2014. Array and Pandas methods give us straightforward ways to do this. Both of these tables have one row for each age.
1 | # Get all males age |
1 | # Get all females age |
For any given age, we can get the Female:Male gender ratio by dividing the number of females by the number of males. To do this in one step, we can extract the array of female counts and the corresponding array of male counts, and then simply divide one array by the other. Elementwise division will create an array of gender ratios, one for each age.
1 | # Make a new df ratios |
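A sketch of that cell, assuming the females and males tables built above; using .values sidesteps index alignment, since the two tables carry different row labels:

```python
import pandas as pd

# Sketch: elementwise division of female counts by male counts, age by age.
ratios = pd.DataFrame({
    'AGE': females['AGE'].values,
    '2014 F:M RATIO': females['2014'].values / males['2014'].values
})
ratios
```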
AGE | 2014 F:M RATIO | |
---|---|---|
204 | 0 | 0.955019 |
205 | 1 | 0.956884 |
206 | 2 | 0.956672 |
207 | 3 | 0.955058 |
208 | 4 | 0.959248 |
... | ... | ... |
300 | 96 | 2.879979 |
301 | 97 | 3.091039 |
302 | 98 | 3.418257 |
303 | 99 | 3.632778 |
304 | 100 | 4.259656 |
You can see from the display that the ratios are all around 0.96 for children aged nine or younger. When the Female:Male ratio is less than 1, there are fewer females than males. Thus what we are seeing is that there were fewer girls than boys in each of the age groups 0, 1, 2, and so on through 9. Moreover, in each of these age groups, there were about 96 girls for every 100 boys.
So how can the overall proportion of females in the population be higher than the males?
Something extraordinary happens when we examine the other end of the age range. Here are the Female:Male ratios for people aged more than 75.
1 | # Filter the ratios more than 75 |
AGE | 2014 F:M RATIO | |
---|---|---|
280 | 76 | 1.234867 |
281 | 77 | 1.257965 |
282 | 78 | 1.282442 |
283 | 79 | 1.316273 |
284 | 80 | 1.341383 |
... | ... | ... |
300 | 96 | 2.879979 |
301 | 97 | 3.091039 |
302 | 98 | 3.418257 |
303 | 99 | 3.632778 |
304 | 100 | 4.259656 |
Not only are all of these ratios greater than 1, signifying more women than men in all of these age groups, but many of them are considerably greater than 1.
- At ages 89 and 90 the ratios are close to 2, meaning that there were about twice as many women as men at those ages in 2014.
- At ages 98 and 99, there were about 3.5 to 4 times as many women as men.
If you are wondering how many people there were at these advanced ages, you can use Python to find out:
1 | # Filter males who are between 98 and 100 |
SEX | AGE | 2014 | |
---|---|---|---|
200 | 1 | 98 | 13518 |
201 | 1 | 99 | 8951 |
1 | # Filter females who are between 98 and 100 |
SEX | AGE | 2014 | |
---|---|---|---|
302 | 2 | 98 | 46208 |
303 | 2 | 99 | 32517 |
The graph below shows the gender ratios plotted against age. The blue curve shows the 2014 ratio by age.
The ratios are almost 1 (signifying close to equal numbers of males and females) for ages 0 through 60, but they start shooting up dramatically (more females than males) starting at about age 65.
That females outnumber males in the U.S. is partly due to the marked gender imbalance in favor of women among senior citizens.
1 | import plotly.express as px |
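A sketch of the plotting cell, assuming the ratios table built above:

```python
import plotly.express as px

# Sketch: the Female:Male ratio in 2014, plotted against age.
fig = px.line(ratios, x='AGE', y='2014 F:M RATIO')
fig.show()
```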
Visualization
Visualization
Tables are a powerful way of organizing and visualizing data. However, large tables of numbers can be difficult to interpret, no matter how organized they are. Sometimes it is much easier to interpret graphs than numbers.
In this chapter we will develop some of the fundamental graphical methods of data analysis. Our source of data is the Internet Movie Database, an online database that contains information about movies, television shows, video games, and so on. The site Box Office Mojo provides many summaries of IMDB data, some of which we have adapted. We have also used data summaries from The Numbers, a site with a tagline that says it is “where data and the movie business meet.”
Scatter Plots and Line Graphs
The table actors contains data on Hollywood actors, both male and female. The columns are:
Actors Metadata
Column | Contents |
---|---|
Actor | Name of actor |
Total Gross | Total gross domestic box office receipt, in millions of dollars, of all of the actor's movies |
Number of Movies | The number of movies the actor has been in |
Average per Movie | Total gross divided by number of movies |
#1 Movie | The highest grossing movie the actor has been in |
Gross | Gross domestic box office receipt, in millions of dollars, of the actor's #1 Movie |
In the calculation of the gross receipt, the data tabulators did not include movies where an actor had a cameo role or a speaking role that did not involve much screen time.
The table has 50 rows, corresponding to the 50 top grossing actors. The table is already sorted by Total Gross, so it is easy to see that Harrison Ford is the highest grossing actor. In total, his movies have brought in more money at the domestic box office than the movies of any other actor.
1 | import pandas as pd |
1 | # Read csv actors |
Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross | |
---|---|---|---|---|---|---|
0 | Harrison Ford | 4871.7 | 41 | 118.8 | Star Wars: The Force Awakens | 936.7 |
1 | Samuel L. Jackson | 4772.8 | 69 | 69.2 | The Avengers | 623.4 |
2 | Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9 |
3 | Tom Hanks | 4340.8 | 44 | 98.7 | Toy Story 3 | 415.0 |
4 | Robert Downey, Jr. | 3947.3 | 53 | 74.5 | The Avengers | 623.4 |
... | ... | ... | ... | ... | ... | ... |
45 | Jeremy Renner | 2500.3 | 21 | 119.1 | The Avengers | 623.4 |
46 | Philip Seymour Hoffman | 2463.7 | 40 | 61.6 | Catching Fire | 424.7 |
47 | Sandra Bullock | 2462.6 | 35 | 70.4 | Minions | 336.0 |
48 | Chris Evans | 2457.8 | 23 | 106.9 | The Avengers | 623.4 |
49 | Anne Hathaway | 2416.5 | 25 | 96.7 | The Dark Knight Rises | 448.1 |
Terminology. A variable
is a formal name for what we have been calling a feature
, such as ‘number of movies.’ The term variable
emphasizes that the feature
can have different values for different individuals – the numbers of movies that actors have been in varies across all the actors.
Variables that have numerical values, such as ‘number of movies’ or ‘average gross receipts per movie’ are called quantitative or numerical variables.
Scatter Plots
A scatter plot displays the relation between two numerical variables. You saw an example of a scatter plot in an early section where we looked at the number of periods and number of characters in two classic novels.
The plotly.express method scatter draws a scatter plot consisting of one point for each row of the table. Its x argument is the label of the column to be plotted on the horizontal axis, and its y argument is the label of the column on the vertical axis.
1 | # Display the association of 'Number of Movies' and 'Total Gross' in scatter plot |
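A sketch of that cell, assuming the actors table read above:

```python
import plotly.express as px

# Sketch: one point per actor.
fig = px.scatter(actors, x='Number of Movies', y='Total Gross')
fig.show()
```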
The plot contains 50 points, one point for each actor in the table. You can see that it slopes upwards, in general. The more movies an actor has been in, the more the total gross of all of those movies – in general.
Formally, we say that the plot shows an association between the variables, and that the association is positive: high values of one variable tend to be associated with high values of the other, and low values of one with low values of the other, in general.
Of course there is some variability. Some actors have high numbers of movies but middling total gross receipts. Others have middling numbers of movies but high receipts. That the association is positive is simply a statement about the broad general trend.
Later in the course we will study how to quantify association. For the moment, we will just think about it qualitatively.
Now that we have explored how the number of movies is related to the total gross receipt, let’s turn our attention to how it is related to the average gross receipt per movie.
1 | # Display the association of 'Number of Movies' and 'Average per Movie' in scatter plot |
This is a markedly different picture and shows a negative association. In general, the more movies an actor has been in, the less the average receipt per movie.
Also, one of the points is quite high and off to the left of the plot. It corresponds to one actor who has a low number of movies and high average per movie. This point is an outlier. It lies outside the general range of the data. Indeed, it is quite far from all the other points in the plot.
We will examine the negative association further by looking at points at the right and left ends of the plot.
For the right end, let’s zoom in on the main body of the plot by just looking at the portion that doesn’t have the outlier.
1 | # eliminate the outlier |
The negative association is still clearly visible. Let’s identify the actors corresponding to the points that lie on the right hand side of the plot where the number of movies is large:
1 | # Filter the actors who have been in more than 60 movies |
Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross | |
---|---|---|---|---|---|---|
1 | Samuel L. Jackson | 4772.8 | 69 | 69.2 | The Avengers | 623.4 |
2 | Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9 |
19 | Robert DeNiro | 3081.3 | 79 | 39.0 | Meet the Fockers | 279.3 |
21 | Liam Neeson | 2942.7 | 63 | 46.7 | The Phantom Menace | 474.5 |
The great actor Robert DeNiro has the highest number of movies and the lowest average receipt per movie. Other fine actors are at points that are not very far away, but DeNiro’s is at the extreme end.
To understand the negative association, note that the more movies an actor is in, the more variable those movies might be, in terms of style, genre, and box office draw. For example, an actor might be in some high-grossing action movies or comedies (such as Meet the Fockers), and also in a large number of smaller films that may be excellent but don’t draw large crowds. Thus the actor’s value of average receipts per movie might be relatively low.
To approach this argument from a different direction, let us now take a look at the outlier.
1 | # Find the outlier: the actor with a very high average receipt per movie |
Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross | |
---|---|---|---|---|---|---|
14 | Anthony Daniels | 3162.9 | 7 | 451.8 | Star Wars: The Force Awakens | 936.7 |
As an actor, Anthony Daniels might not have the stature of Robert DeNiro. But his 7 movies had an astonishingly high average receipt of nearly 452 million dollars per movie.
What were these movies? You might know about the droid C-3PO in Star Wars:
That’s Anthony Daniels inside the metallic suit. He plays C-3PO.
Mr. Daniels’ entire filmography (apart from cameos) consists of movies in the high-grossing Star Wars franchise. That explains both his high average receipt and his low number of movies.
Variables such as genre and production budget have an effect on the association between the number of movies and the average receipt per movie. This example is a reminder that studying the association between two variables often involves understanding other related variables as well.
Line Graphs
Line graphs are among the most common visualizations and are often used to study chronological trends and patterns.
The table movies_by_year contains data on movies produced by U.S. studios in each of the years 1980 through 2015. The columns are:
Column | Content |
---|---|
Year | Year |
Total Gross | Total domestic box office gross, in millions of dollars, of all movies released |
Number of Movies | Number of movies released |
#1 Movie | Highest grossing movie |
1 | # Read csv movies_by_year |
Year | Total Gross | Number of Movies | #1 Movie | |
---|---|---|---|---|
0 | 2015 | 11128.5 | 702 | Star Wars: The Force Awakens |
1 | 2014 | 10360.8 | 702 | American Sniper |
2 | 2013 | 10923.6 | 688 | Catching Fire |
3 | 2012 | 10837.4 | 667 | The Avengers |
4 | 2011 | 10174.3 | 602 | Harry Potter / Deathly Hallows (P2) |
... | ... | ... | ... | ... |
31 | 1984 | 4031.0 | 536 | Beverly Hills Cop |
32 | 1983 | 3766.0 | 495 | Return of the Jedi |
33 | 1982 | 3453.0 | 428 | E.T. |
34 | 1981 | 2966.0 | 173 | Raiders / Lost Ark |
35 | 1980 | 2749.0 | 161 | The Empire Strikes Back |
The plotly.express method line
produces a line graph. Its two arguments (x
and y
) are the same as those for scatter: first the column on the horizontal axis, then the column on the vertical. Here is a line graph of the number of movies released each year over the years 1980 through 2015.
1 | # Display the association of 'Year' and 'Number of Movies' in line graph |
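A sketch of that cell, assuming the movies_by_year table read above:

```python
import plotly.express as px

# Sketch: number of movies released in each year, 1980 through 2015.
fig = px.line(movies_by_year, x='Year', y='Number of Movies')
fig.show()
```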
The graph rises sharply and then has a gentle upwards trend, though the numbers vary noticeably from year to year. The sharp rise in the early 1980s is due in part to studios returning to the forefront of movie production after some years of filmmaker-driven movies in the 1970s.
Our focus will be on more recent years. In keeping with the theme of movies, the table of rows corresponding to the years 2000 through 2015 has been assigned to the name century_21.
1 | # Keep the years after 1999 |
1 | # Display the association of 'Year' and 'Number of Movies' in line graph |
The global financial crisis of 2008 has a visible effect – in 2009 there is a sharp drop in the number of movies released.
The dollar figures, however, didn’t suffer much.
1 | # Display the association of 'Year' and 'Total Gross' in line graph |
The total domestic gross receipt was higher in 2009 than in 2008, even though there was a financial crisis and a much smaller number of movies were released.
One reason for this apparent contradiction is that people tend to go to the movies when there is a recession. “In Downturn, Americans Flock to the Movies,” said the New York Times in February 2009. The article quotes Martin Kaplan of the University of Southern California saying, “People want to forget their troubles, and they want to be with other people.” When holidays and expensive treats are unaffordable, movies provide welcome entertainment and relief.
In 2009, another reason for high box office receipts was the movie Avatar and its 3D release. Not only was Avatar the #1 movie of 2009, it is also by some calculations the second highest grossing movie of all time, as we will see later.
1 | # Filter year 2009 |
Year | Total Gross | Number of Movies | #1 Movie | |
---|---|---|---|---|
6 | 2009 | 10595.5 | 521 | Avatar |
Categorical Distributions
Data come in many forms that are not numerical. Data can be pieces of music, or places on a map. They can also be categories into which you can place individuals. Here are some examples of categorical variables.
- The individuals are cartons of ice-cream, and the variable is the flavor in the carton.
- The individuals are professional basketball players, and the variable is the player’s team.
- The individuals are years, and the variable is the genre of the highest grossing movie of the year.
- The individuals are survey respondents, and the variable is the response they choose from among “Not at all satisfied,” “Somewhat satisfied,” and “Very satisfied.”
The table icecream contains data on 30 cartons of ice-cream.
1 | # Make table named icecream |
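A sketch of how such a table could be built, using the counts shown below:

```python
import pandas as pd

# Sketch: a small distribution table of 30 cartons by flavor.
icecream = pd.DataFrame({
    'Flavor': ['Chocolate', 'Strawberry', 'Vanilla'],
    'Number of Cartons': [16, 5, 9]
})
icecream
```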
Flavor | Number of Cartons | |
---|---|---|
0 | Chocolate | 16 |
1 | Strawberry | 5 |
2 | Vanilla | 9 |
The values of the categorical variable “flavor” are chocolate, strawberry, and vanilla. The table shows the number of cartons of each flavor. We call this a distribution table. A distribution shows all the values of a variable, along with the frequency of each one.
Bar Chart
The bar chart is a familiar way of visualizing categorical distributions. It displays a bar for each category. The bars are equally spaced and equally wide. The length of each bar is proportional to the frequency of the corresponding category.
We will draw bar charts with horizontal bars because it's easier to label the bars that way. We can do this with the plotly.express function px.bar, giving it the column label of the frequencies as its x argument and the column label of the categories as its y argument, with orientation='h' so that the bars are drawn horizontally.
1 | # Bar chart |
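A sketch of the bar chart cell, assuming the icecream table above:

```python
import plotly.express as px

# Sketch: horizontal bars, one per flavor, length proportional to frequency.
fig = px.bar(icecream, x='Number of Cartons', y='Flavor', orientation='h')
fig.show()
```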
Features of Categorical Distributions
Apart from purely visual differences, there is an important fundamental distinction between bar charts and the two graphs that we saw in the previous sections. Those were the scatter plot and the line plot, both of which display two numerical variables – the variables on both axes are numerical. In contrast, the bar chart has categories on one axis and numerical frequencies on the other.
This has consequences for the chart. First, the width of each bar and the space between consecutive bars is entirely up to the person who is producing the graph, or to the program being used to produce it. Python made those choices for us. If you were to draw the bar graph by hand, you could make completely different choices and still have a perfectly correct bar graph, provided you drew all the bars with the same width and kept all the spaces the same.
Most importantly, the bars can be drawn in any order. The categories “chocolate,” “vanilla,” and “strawberry” have no universal rank order, unlike for example the numbers 5, 7, and 10.
This means that we can draw a bar chart that is easier to interpret, by rearranging the bars in decreasing order. To do this, we first rearrange the rows of icecream in decreasing order of Number of Cartons, and then draw the bar chart.
1 | # Bar chart |
This bar chart contains exactly the same information as the previous one, but it is a little easier to read. While this is not a huge gain in reading a chart with just three bars, it can be quite significant when the number of categories is large.
Grouping Categorical Data
To construct the table icecream, someone had to look at all 30 cartons of ice-cream and count the number of each flavor. But if our data does not already include frequencies, we have to compute the frequencies before we can draw a bar chart. Here is an example where this is necessary.
The table top consists of U.S.A.’s top grossing movies of all time. The first column contains the title of the movie; Star Wars: The Force Awakens has the top rank, with a box office gross amount of more than 900 million dollars in the United States. The second column contains the name of the studio that produced the movie. The third contains the domestic box office gross in dollars, and the fourth contains the gross amount that would have been earned from ticket sales at 2016 prices. The fifth contains the release year of the movie.
There are 200 movies on the list. Here are the top ten according to unadjusted gross receipts.
1 | # Read table |
Title | Studio | Gross | Gross (Adjusted) | Year | |
---|---|---|---|---|---|
0 | Star Wars: The Force Awakens | Buena Vista (Disney) | 906723418 | 906723400 | 2015 |
1 | Avatar | Fox | 760507625 | 846120800 | 2009 |
2 | Titanic | Paramount | 658672302 | 1178627900 | 1997 |
3 | Jurassic World | Universal | 652270625 | 687728000 | 2015 |
4 | Marvel's The Avengers | Buena Vista (Disney) | 623357910 | 668866600 | 2012 |
... | ... | ... | ... | ... | ... |
195 | The Caine Mutiny | Columbia | 21750000 | 386173500 | 1954 |
196 | The Bells of St. Mary's | RKO | 21333333 | 545882400 | 1945 |
197 | Duel in the Sun | Selz. | 20408163 | 443877500 | 1946 |
198 | Sergeant York | Warner Bros. | 16361885 | 418671800 | 1941 |
199 | The Four Horsemen of the Apocalypse | MPC | 9183673 | 399489800 | 1921 |
The Disney subsidiary Buena Vista shows up frequently in the top ten, as do Fox and Warner Brothers. Which studios will appear most frequently if we look among all 200 rows?
To figure this out, first notice that all we need is a table with the movies and the studios; the other information is unnecessary.
1 | # Select the 'Title' and 'Studio' columns |
Title | Studio | |
---|---|---|
0 | Star Wars: The Force Awakens | Buena Vista (Disney) |
1 | Avatar | Fox |
2 | Titanic | Paramount |
3 | Jurassic World | Universal |
4 | Marvel's The Avengers | Buena Vista (Disney) |
... | ... | ... |
195 | The Caine Mutiny | Columbia |
196 | The Bells of St. Mary's | RKO |
197 | Duel in the Sun | Selz. |
198 | Sergeant York | Warner Bros. |
199 | The Four Horsemen of the Apocalypse | MPC |
The Pandas method groupby allows us to count how frequently each studio appears in the table, by treating each studio as a category and assigning each row to one category. The groupby method takes as its argument the label of the column that contains the categories; combining it with size gives the number of rows in each category. Here the column of counts is called count, but you can rename it if you like. Finally, don't forget to use reset_index to turn the grouped result back into an ordinary table.
pandas.DataFrame.groupby
Group by: split-apply-combine
pandas.DataFrame.size
Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
1 | # Group movies_and_studios by counting 'Studio' |
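A sketch of that cell, assuming movies_and_studios is the two-column table above:

```python
# Sketch: one row per studio, with the number of top-200 movies it produced.
studio_distribution = (movies_and_studios
                       .groupby('Studio')
                       .size()
                       .reset_index(name='count'))
studio_distribution
```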
Studio | count | |
---|---|---|
0 | AVCO | 1 |
1 | Buena Vista (Disney) | 29 |
2 | Columbia | 10 |
3 | Disney | 11 |
4 | Dreamworks | 3 |
... | ... | ... |
19 | TriS | 2 |
20 | UA | 6 |
21 | Universal | 22 |
22 | Warner Bros. | 29 |
23 | Warner Bros. (New Line) | 1 |
Thus groupby
creates a distribution table that shows how the movies are distributed among the categories (studios).
We can now use this table, along with the graphing skills that we acquired above, to draw a bar chart that shows which studios are most frequent among the 200 highest grossing movies.
1 | # Distribution |
Warner Brothers and Buena Vista are the most common studios among the top 200 movies. Warner Brothers produces the Harry Potter movies and Buena Vista produces Star Wars.
Because total gross receipts are being measured in unadjusted dollars, it is not very surprising that the top movies are more frequently from recent years than from bygone decades. In absolute terms, movie tickets cost more now than they used to, and thus gross receipts are higher. This is borne out by a bar chart that shows the distribution of the 200 movies by year of release.
plotly.express.bar
Python Figure Reference: layout.xaxis
Formatting Ticks in Python
Axes in Python
1 | # Distribution |
All of the longest bars correspond to years after 2000. This is consistent with our observation that recent years should be among the most frequent.
Numerical Distributions
Many of the variables that data scientists study are quantitative or numerical. Their values are numbers on which you can perform arithmetic. Examples that we have seen include the number of periods in chapters of a book, the amount of money made by movies, and the age of people in the United States.
The values of a categorical variable can be given numerical codes, but that doesn't make the variable quantitative. In the example in which we studied Census data broken down by age group, the categorical variable SEX
had the numerical codes 1 for ‘Male,’ 2 for ‘Female,’ and 0 for the aggregate of both groups 1 and 2. While 0, 1, and 2 are numbers, in this context it doesn’t make sense to subtract 1 from 2, or take the average of 0, 1, and 2, or perform other arithmetic on the three values. SEX
is a categorical variable even though the values have been given a numerical code.
For our main example, we will return to a dataset that we studied when we were visualizing categorical data. It is the table top, which consists of data from U.S.A.’s top grossing movies of all time. For convenience, here is the description of the table again.
The first column contains the title of the movie. The second column contains the name of the studio that produced the movie. The third contains the domestic box office gross in dollars, and the fourth contains the gross amount that would have been earned from ticket sales at 2016 prices. The fifth contains the release year of the movie.
There are 200 movies on the list. Here are the top ten according to the unadjusted gross receipts in the column Gross
.
1 | top |
Title | Studio | Gross | Gross (Adjusted) | Year | |
---|---|---|---|---|---|
0 | Star Wars: The Force Awakens | Buena Vista (Disney) | 906723418 | 906723400 | 2015 |
1 | Avatar | Fox | 760507625 | 846120800 | 2009 |
2 | Titanic | Paramount | 658672302 | 1178627900 | 1997 |
3 | Jurassic World | Universal | 652270625 | 687728000 | 2015 |
4 | Marvel's The Avengers | Buena Vista (Disney) | 623357910 | 668866600 | 2012 |
... | ... | ... | ... | ... | ... |
195 | The Caine Mutiny | Columbia | 21750000 | 386173500 | 1954 |
196 | The Bells of St. Mary's | RKO | 21333333 | 545882400 | 1945 |
197 | Duel in the Sun | Selz. | 20408163 | 443877500 | 1946 |
198 | Sergeant York | Warner Bros. | 16361885 | 418671800 | 1941 |
199 | The Four Horsemen of the Apocalypse | MPC | 9183673 | 399489800 | 1921 |
Visualizing the Distribution of the Adjusted Receipts
In this section we will draw graphs of the distribution of the numerical variable in the column Gross (Adjusted)
. For simplicity, let’s create a smaller table that has the information that we need. And since three-digit numbers are easier to work with than nine-digit numbers, let’s measure the Adjusted Gross receipts in millions of dollars. Note how round is used to retain only two decimal places.
1 | # Make a new table millions |
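A sketch of that cell, assuming top is the table of 200 movies shown above:

```python
import pandas as pd

# Sketch: adjusted gross in millions of dollars, rounded to two decimal places.
millions = pd.DataFrame({
    'Title': top['Title'],
    'Adjusted Gross': (top['Gross (Adjusted)'] / 1e6).round(2)
})
millions
```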
Title | Adjusted Gross | |
---|---|---|
0 | Star Wars: The Force Awakens | 906.72 |
1 | Avatar | 846.12 |
2 | Titanic | 1178.63 |
3 | Jurassic World | 687.73 |
4 | Marvel's The Avengers | 668.87 |
... | ... | ... |
195 | The Caine Mutiny | 386.17 |
196 | The Bells of St. Mary's | 545.88 |
197 | Duel in the Sun | 443.88 |
198 | Sergeant York | 418.67 |
199 | The Four Horsemen of the Apocalypse | 399.49 |
A Histogram
A histogram of a numerical dataset looks very much like a bar chart, though it has some important differences that we will examine in this section. First, let’s just draw a histogram of the adjusted receipts.
The px.histogram method generates a histogram of the values in a column. Its optional histnorm argument can be set to 'percent', 'probability', 'density', or 'probability density' to choose the scale of the vertical axis. The histogram shows the distribution of the adjusted gross amounts, in millions of 2016 dollars.
1 | # Histogram |
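A sketch of that cell, assuming the millions table above; histnorm='percent' puts the vertical axis on a percent scale:

```python
import plotly.express as px

# Sketch: distribution of adjusted gross receipts, in millions of 2016 dollars.
fig = px.histogram(millions, x='Adjusted Gross', histnorm='percent')
fig.show()
```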
The Horizontal Axis
The amounts have been grouped into contiguous intervals called bins. Although in this dataset no movie grossed an amount that is exactly on the edge between two bins, px.histogram
does have to account for situations where there might have been values at the edges. So px.histogram
has an endpoint convention: bins include the data at their left endpoint, but not the data at their right endpoint.
We will use the notation [a, b)
for the bin that starts at a
and ends at b
but doesn’t include b
.
Sometimes, adjustments have to be made in the first or last bin, to ensure that the smallest and largest values of the variable are included. You saw an example of such an adjustment in the Census data studied earlier, where an age of “100” years actually meant “100 years old or older.”
We can see that there are 15 bins (some bars are so low that they are hard to see), and that they all have the same width. We can also see that none of the movies grossed fewer than 300 million dollars; that is because we are considering only the top grossing movies of all time.
It is a little harder to see exactly where the ends of the bins are situated. For example, it is not easy to pinpoint exactly where the value 500 lies on the horizontal axis. So it is hard to judge exactly where one bar ends and the next begins.
The optional argument nbins can be used with px.histogram to control the number of bins; it must be an integer. The range_x argument constrains the range of the horizontal axis; it is a sequence of two numbers, the left end of the first bin and the right end of the last bin. We will set the bin edges to be 300, 400, 500, and so on, ending with 2000.
1 | # Histogram |
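A sketch of that cell. nbins is only a hint to plotly's binning algorithm; setting the trace's xbins explicitly (an extra step not mentioned in the text) is one way to pin the edges to 300, 400, ..., 2000:

```python
import plotly.express as px

# Sketch: aim for bins of width 100 between 300 and 2000.
fig = px.histogram(millions, x='Adjusted Gross', histnorm='percent',
                   nbins=17, range_x=[300, 2000])
fig.update_traces(xbins=dict(start=300, end=2000, size=100))
fig.show()
```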
The horizontal axis of this figure is easier to read. The labels 200, 400, 600, and so on are centered at the corresponding values. The tallest bar is for movies that grossed between 300 million and 400 million dollars.
A very small number of movies grossed 800 million dollars or more. This results in the figure being “skewed to the right,” or, less formally, having “a long right hand tail.” Distributions of variables like income or rent in large populations also often have this kind of shape.
1 | # Calculate bins |
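A sketch of the counting behind the table below, using pandas' cut with the endpoint convention [a, b):

```python
import numpy as np
import pandas as pd

# Sketch: assign each adjusted gross amount to a bin, then count each bin.
edges = np.arange(300, 2001, 100)                          # 300, 400, ..., 2000
binned = pd.cut(millions['Adjusted Gross'], bins=edges, right=False)
bin_counts = (binned.value_counts()
                    .rename_axis('bins')
                    .reset_index(name='Adjusted Gross Count'))
bin_counts
```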
bins | Adjusted Gross Count | |
---|---|---|
0 | [300, 400) | 81 |
1 | [400, 500) | 52 |
2 | [500, 600) | 28 |
3 | [600, 700) | 16 |
4 | [700, 800) | 7 |
5 | [800, 900) | 5 |
6 | [1100, 1200) | 3 |
7 | [900, 1000) | 3 |
8 | [1200, 1300) | 2 |
9 | [1000, 1100) | 1 |
10 | [1500, 1600) | 1 |
11 | [1700, 1800) | 1 |
12 | [1300, 1400) | 0 |
13 | [1400, 1500) | 0 |
14 | [1600, 1700) | 0 |
15 | [1800, 1900) | 0 |
16 | [1900, 2000) | 0 |
Notice the value 2000 at the end of the last bin. That's not the left endpoint of any bar; it's the right endpoint of the last bar. By the endpoint convention, the data there are not included. So the corresponding count is recorded as 0, and would have been recorded as 0 even if there had been movies that made more than 2,000 million dollars. When a histogram is drawn with a specified set of bins, the graph only considers values that lie in those bins.
The Vertical Axis: Density Scale
The horizontal axis of a histogram is straightforward to read, once we have taken care of details like the ends of the bins. The features of the vertical axis require a little more attention. We will go over them one by one.
Let’s start by examining how to calculate the numbers on the vertical axis. If the calculation seems a little strange, have patience – the rest of the section will explain the reasoning.
Calculation. The height of each bar is the percent of elements that fall into the corresponding bin, relative to the width of the bin.
1 | # Density Scale |
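A sketch of the height calculation, using the facts that there are 200 movies in all and that each bin is 100 (million dollars) wide:

```python
# Sketch: percent of movies in each bin, and percent per unit of bin width.
heights = bin_counts.copy()
heights['Percent'] = heights['Adjusted Gross Count'] / 200 * 100
heights['Height'] = heights['Percent'] / 100
heights
```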
bin | Adjusted Gross Count | Percent | Height | |
---|---|---|---|---|
0 | (300, 400] | 81 | 40.5 | 0.405 |
1 | (400, 500] | 52 | 26.0 | 0.260 |
2 | (500, 600] | 28 | 14.0 | 0.140 |
3 | (600, 700] | 16 | 8.0 | 0.080 |
4 | (700, 800] | 7 | 3.5 | 0.035 |
5 | (800, 900] | 5 | 2.5 | 0.025 |
6 | (900, 1000] | 3 | 1.5 | 0.015 |
7 | (1000, 1100] | 1 | 0.5 | 0.005 |
8 | (1100, 1200] | 3 | 1.5 | 0.015 |
9 | (1200, 1300] | 2 | 1.0 | 0.010 |
10 | (1300, 1400] | 0 | 0.0 | 0.000 |
11 | (1400, 1500] | 0 | 0.0 | 0.000 |
12 | (1500, 1600] | 1 | 0.5 | 0.005 |
13 | (1600, 1700] | 0 | 0.0 | 0.000 |
14 | (1700, 1800] | 1 | 0.5 | 0.005 |
15 | (1800, 1900] | 0 | 0.0 | 0.000 |
16 | (1900, 2000] | 0 | 0.0 | 0.000 |
Go over the numbers on the vertical axis of the histogram above to check that the column Height looks correct.
The calculations will become clear if we just examine the first row of the table.
Remember that there are 200 movies in the dataset. The [300, 400) bin contains 81 movies. That’s 40.5% of all the movies:
$$\mbox{Percent} = \frac{81}{200} \cdot 100 = 40.5$$
The width of the [300, 400) bin is $400 - 300 = 100$. So
$$\mbox{Height} = \frac{40.5}{100} = 0.405$$
The code for calculating the heights used the facts that there are 200 movies in all and that the width of each bin is 100.
Units. The height of the bar is 40.5% divided by 100 million dollars, and so the height is 0.405% per million dollars.
This method of drawing histograms creates a vertical axis that is said to be on the density scale. The height of bar is not the percent of entries in the bin; it is the percent of entries in the bin relative to the amount of space in the bin. That is why the height measures crowdedness or density.
The Histogram: General Principles and Calculation
- The bins are drawn to scale and are contiguous (though some might be empty), because the values on the horizontal axis are numerical.
- The area of each bar is proportional to the number of entries in the bin.
Property 2 is the key to drawing a histogram, and is usually achieved as follows:
$$\mbox{area of bar} = \mbox{percent of entries in bin}$$
The calculation of the heights just uses the fact that the bar is a rectangle:
$$\mbox{area of bar} = \mbox{height of bar} \times \mbox{width of bin}$$
and so
$$\mbox{height of bar} = \frac{\mbox{area of bar}}{\mbox{width of bin}} = \frac{\mbox{percent of entries in bin}}{\mbox{width of bin}}$$
The units of height are “percent per unit on the horizontal axis.”
When drawn using this method, the histogram is said to be drawn on the density scale. On this scale:
- The area of each bar is equal to the percent of data values that are in the corresponding bin.
- The total area of all the bars in the histogram is 100%. Speaking in terms of proportions, we say that the areas of all the bars in a histogram “sum to 1”.
Differences Between Bar Charts and Histograms
- Bar charts display one quantity per category. They are often used to display the distributions of categorical variables. Histograms display the distributions of quantitative variables.
- All the bars in a bar chart have the same width, and there is an equal amount of space between consecutive bars. The bars of a histogram can have different widths, and they are contiguous.
- The lengths (or heights, if the bars are drawn vertically) of the bars in a bar chart are proportional to the value for each category. The heights of bars in a histogram measure densities; the areas of bars in a histogram are proportional to the numbers of entries in the bins.
Overlaid Graphs
In this chapter, we have learned how to visualize data by drawing graphs. A common use of such visualizations is to compare two datasets. In this section, we will see how to overlay plots, that is, draw them in a single graphic on a common pair of axes.
For the overlay to make sense, the graphs that are being overlaid must represent the same variables and be measured in the same units.
To draw overlaid graphs, the plotly.express functions scatter, line, and bar can all be called in a similar way. For scatter and line, one column must serve as the common horizontal axis for all the overlaid graphs. For bar, one column must serve as the common axis, which is the set of categories. The general call looks like:
1 | import plotly.express as px |
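For example, using plotly express's wide-form support, in which a list of column labels is passed to y (here df, common_column, column_1, and column_2 are placeholder names):

```python
import plotly.express as px

# Sketch: one shared column on the common axis, several columns overlaid.
fig = px.scatter(df, x='common_column', y=['column_1', 'column_2'])   # or px.line(...)
fig.show()
```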
Scatter Plots
Francis Galton (1822-1911) was an English polymath who was a pioneer in the analysis of relations between numerical variables.
Galton meticulously collected copious amounts of data, some of which we will analyze in this course. Here is a subset of Galton’s data on heights of parents and their children. Specifically, the population consists of 179 men who were the first-born in their families. The data are their own heights and the heights of their parents. All heights were measured in inches.
1 | # Import dataset galton_subset.csv |
father | mother | son | |
---|---|---|---|
0 | 78.5 | 67.0 | 73.2 |
1 | 75.5 | 66.5 | 73.5 |
2 | 75.0 | 64.0 | 71.0 |
3 | 75.0 | 64.0 | 70.5 |
4 | 75.0 | 58.5 | 72.0 |
... | ... | ... | ... |
174 | 64.0 | 64.0 | 70.5 |
175 | 64.0 | 63.0 | 64.5 |
176 | 64.0 | 60.0 | 66.0 |
177 | 62.0 | 66.0 | 64.0 |
178 | 62.5 | 63.0 | 66.5 |
The scatter method allows us to visualize how the sons’ heights are related to the heights of both their parents. In the graph, the sons’ heights will form the common horizontal axis.
1 | # Overlay scatter plot son-mother on scatter plot son-father |
Notice how we only specified the variable (sons’ heights) on the common horizontal axis. Python drew two scatter plots: one each for the relation between this variable and the other two.
Both the orange and the black scatter plots slope upwards and show a positive association between the sons’ heights and the heights of both their parents. The black (fathers) plot is in general higher than the orange, because the fathers were in general taller than the mothers.
Line Plots
Our next example involves data on children of more recent times. We will return to the Census data table us_pop, created below again for reference. From this, we will extract the counts of all children in each of the age categories 0 through 18 years.
1 | # Read the full Census table |
SEX | AGE | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | POPESTIMATE2013 | POPESTIMATE2014 | POPESTIMATE2015 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 3944153 | 3944160 | 3951330 | 3963087 | 3926540 | 3931141 | 3949775 | 3978038 |
1 | 0 | 1 | 3978070 | 3978090 | 3957888 | 3966551 | 3977939 | 3942872 | 3949776 | 3968564 |
2 | 0 | 2 | 4096929 | 4096939 | 4090862 | 3971565 | 3980095 | 3992720 | 3959664 | 3966583 |
3 | 0 | 3 | 4119040 | 4119051 | 4111920 | 4102470 | 3983157 | 3992734 | 4007079 | 3974061 |
4 | 0 | 4 | 4063170 | 4063186 | 4077551 | 4122294 | 4112849 | 3994449 | 4005716 | 4020035 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
301 | 2 | 97 | 53582 | 53605 | 54118 | 57159 | 59533 | 61255 | 62779 | 69285 |
302 | 2 | 98 | 36641 | 36675 | 37532 | 40116 | 42857 | 44359 | 46208 | 47272 |
303 | 2 | 99 | 26193 | 26214 | 26074 | 27030 | 29320 | 31112 | 32517 | 34064 |
304 | 2 | 100 | 44202 | 44246 | 45058 | 47556 | 50661 | 53902 | 58008 | 61886 |
305 | 2 | 999 | 156964212 | 156969328 | 157258820 | 158427085 | 159581546 | 160720625 | 161952064 | 163189523 |
1 | # Extract four columns from full_census_table. |
AGE | 2010 | 2014 | |
---|---|---|---|
0 | 0 | 3951330 | 3949775 |
1 | 1 | 3957888 | 3949776 |
2 | 2 | 4090862 | 3959664 |
3 | 3 | 4111920 | 4007079 |
4 | 4 | 4077551 | 4005716 |
5 | 5 | 4064653 | 4006900 |
6 | 6 | 4073013 | 4135930 |
7 | 7 | 4043046 | 4155326 |
8 | 8 | 4025604 | 4120903 |
9 | 9 | 4125415 | 4108349 |
10 | 10 | 4187062 | 4116942 |
11 | 11 | 4115511 | 4087402 |
12 | 12 | 4113279 | 4070682 |
13 | 13 | 4119666 | 4171030 |
14 | 14 | 4145614 | 4233839 |
15 | 15 | 4231002 | 4164796 |
16 | 16 | 4313252 | 4168559 |
17 | 17 | 4376367 | 4186513 |
18 | 18 | 4491005 | 4227920 |
1 | # Overlay line chart AGE-2014 on line chart AGE-2010 |
On this scale, it’s important to remember that we only have data at ages 0, 1, 2, and so on; the graphs “join the dots” in between.
The graphs cross each other in a few places: for example, there were more 4-year-olds in 2010 than in 2014, and there were more 14-year-olds in 2014 than in 2010.
Of course, the 14-year-olds in 2014 mostly consist of the 10-year-olds in 2010. To see this, look at the gold graph at AGE 14 and the blue graph at AGE 10. Indeed, you will notice that the entire gold graph (2014) looks like the blue graph (2010) slid over to the right by 4 years. The slide is accompanied by a slight rise due to the net effect of children who entered the country between 2010 and 2014 outnumbering those who left. Fortunately at these ages there is not much loss of life.
Bar Charts
For our final example of this section, we look at distributions of ethnicities of adults and children in California as well as in the entire United States.
The Kaiser Family Foundation has compiled Census data on the distribution of race and ethnicity in the U.S. The Foundation's website provides compilations of data for the entire U.S. population in 2014, as well as for U.S. children who were younger than 18 years old that year.
Here is a table adapted from their data for the United States and California. The columns represent everyone in the U.S.A., everyone in California, children in the U.S.A., and children in California. The body of the table contains proportions in the different categories. Each column shows the distribution of ethnicities in the group of people corresponding to that column. So in each column, the entries add up to 1.
1 | # Read CSV use_ca |
Ethnicity | USA All | CA All | USA Children | CA Children | |
---|---|---|---|---|---|
0 | Black | 0.12 | 0.05 | 0.14 | 0.05 |
1 | Hispanic | 0.18 | 0.38 | 0.24 | 0.50 |
2 | White | 0.62 | 0.39 | 0.52 | 0.29 |
3 | Other | 0.08 | 0.18 | 0.10 | 0.16 |
It is natural to want to compare these distributions. It makes sense to compare the columns directly, because all the entries are proportions and are therefore on the same scale.
The px.bar function allows us to visualize the comparisons by drawing multiple bar charts on the same axes. The call is analogous to those for scatter and line: we have to specify the common axis of categories.
1 | # Draw bar charts |
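A sketch of one way to draw this, assuming the table read above is named usa_ca: reshape it to long form with melt, then draw grouped horizontal bars.

```python
import plotly.express as px

# Sketch: one group of bars per ethnicity, one bar per population column.
long_form = usa_ca.melt(id_vars='Ethnicity', var_name='Group', value_name='Proportion')
fig = px.bar(long_form, x='Proportion', y='Ethnicity',
             color='Group', barmode='group', orientation='h')
fig.show()
```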
While drawing the overlaid bar charts is straightforward, there is a bit too much information on this graph for us to be able to sort out similarities and differences between populations. It seems clear that the distributions of ethnicities for everyone in the U.S. and for children in the U.S. are more similar to each other than any other pair, but it’s much easier to compare the populations one pair at a time.
Let’s start by comparing the entire populations of the U.S.A. and California.
1 | # Draw bar charts. Filter 'USA ALL' and 'CA All' |
The two distributions are quite different. California has higher proportions in the Hispanic and Other categories, and correspondingly lower proportions of Black and White. The differences are largely due to California’s geographical location and patterns of immigration, both historically and in more recent decades. For example, the Other category in California includes a significant proportion of Asians and Pacific Islanders.
As you can see from the graph, almost 40% of the Californian population in 2014 was Hispanic. A comparison with the population of children in the state indicates that the Hispanic proportion is likely to be greater in future years. Among Californian children, 50% are in the Hispanic category.
1 | # Draw bar charts. Filter 'CA ALL' and 'CA Children' |
More complex datasets naturally give rise to varied and interesting visualizations, including overlaid graphs of different kinds. To analyze such data, it helps to have some more skills in data manipulation, so that we can get the data into a form that allows us to use methods like those in this section. In the next chapter we will develop some of these skills.
Functions and Tables
Functions and Tables
We are building up a useful inventory of techniques for identifying patterns and themes in a data set by using functions already available in Python. We will now explore a core feature of the Python programming language: function definition.
We have used functions extensively already in this text, but never defined a function of our own. The purpose of defining a function is to give a name to a computational process that may be applied multiple times. There are many situations in computing that require repeated computation. For example, it is often the case that we want to perform the same manipulation on every value in a column of a table.
Defining a Function
The definition of the double function below simply doubles a number.
1 | # Our first function definition |
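Based on the description and the body shown later in this section, the cell presumably contains something like:

```python
# Sketch: a function that doubles its argument.
def double(x):
    """Double x."""
    return 2 * x
```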
We start any function definition by writing def. Here is a breakdown of the other parts (the syntax) of this small function:
When we run the cell above, no particular number is doubled, and the code inside the body of double is not yet evaluated. In this respect, our function is analogous to a recipe. Each time we follow the instructions in a recipe, we need to start with ingredients. Each time we want to use our function to double a number, we need to specify a number.
We can call double in exactly the same way we have called other functions. Each time we do that, the code in the body is executed, with the value of the argument given the name x
.
1 | double(17) |
1 | double(-0.6/4) |
The two expressions above are both call expressions. In the second one, the value of the expression -0.6/4 is computed and then passed as the argument named x to the double function. Each call expresson results in the body of double being executed, but with a different value of x.
The body of double has only a single line:
1 | return 2*x |
Executing this return statement completes execution of the double function’s body and computes the value of the call expression.
The argument to double can be any expression, as long as its value is a number. For example, it can be a name. The double function does not know or care how its argument is computed or stored; its only job is to execute its own body using the values of the arguments passed to it.
1 | any_name = 42 |
The argument can also be any value that can be doubled. For example, a whole array of numbers can be passed as an argument to double, and the result will be another array.
1 | import numpy as np |
However, names that are defined inside a function, including arguments like double’s x, have only a fleeting existence. They are defined only while the function is being called, and they are only accessible inside the body of the function. We can’t refer to x outside the body of double. The technical terminology is that x has local scope.
Therefore the name x isn’t recognized outside the body of the function, even though we have called double in the cells above.
1 | x |
Docstrings. Though double is relatively easy to understand, many functions perform complicated tasks and are difficult to use without explanation. (You may have discovered this yourself!) Therefore, a well-composed function has a name that evokes its behavior, as well as documentation. In Python, this is called a docstring — a description of its behavior and expectations about its arguments. The docstring can also show example calls to the function, where the call is preceded by >>>.
A docstring can be any string, as long as it is the first thing in a function’s body. Docstrings are typically defined using triple quotation marks at the start and end, which allows a string to span multiple lines. The first line is conventionally a complete but short description of the function, while following lines provide further guidance to future users of the function.
Here is a definition of a function called percent that takes two arguments. The definition includes a docstring.
1 | # A function with more than one argument |
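A sketch of such a definition; the exact wording of the docstring is the author's choice:

```python
# Sketch: convert x to a percent out of total, rounded to two decimal places.
def percent(x, total):
    """Convert x to a percent out of total.

    >>> percent(33, 200)
    16.5
    """
    return round((x / total) * 100, 2)
```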
1 | percent(33, 200) |
Contrast the function percent defined above with the function percents defined below. The latter takes an array as its argument, and converts all the numbers in the array to percents out of the total of the values in the array. The percents are all rounded to two decimal places, this time replacing round by np.round
because the argument is an array and not a number.
1 | def percents(counts): |
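A sketch completing the definition started above, following the description of percents:

```python
import numpy as np

def percents(counts):
    """Convert the counts to percents out of their total, rounded to 2 places."""
    total = counts.sum()
    return np.round((counts / total) * 100, 2)

# percents(np.array([7, 10, 4])) evaluates to array([33.33, 47.62, 19.05])
```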
1 | some_array = np.array([7, 10, 4]) |
It is helpful to understand the steps Python takes to execute a function. To facilitate this, we have put a function definition and a call to that function in the same cell below.
1 | def biggest_difference(array_x): |
Here is what happens when we run that cell:
Multiple Arguments
There can be multiple ways to generalize an expression or block of code, and so a function can take multiple arguments that each determine different aspects of the result. For example, the percents function we defined previously rounded to two decimal places every time. The following two-argument definition allows different calls to round to different amounts.
1 | def percents(counts, decimal_places): |
The flexibility of this new definition comes at a small price: each time the function is called, the number of decimal places must be specified. Default argument values allow a function to be called with a variable number of arguments; any argument that isn’t specified in the call expression is given its default value, which is stated in the first line of the def statement. For example, in this final definition of percents, the optional argument decimal_places is given a default value of 2.
1 | def percents(counts, decimal_places=2): |
Note: Methods
Functions are called by placing argument expressions in parentheses after the function name. Any function that is defined in isolation is called in this way. You have also seen examples of methods, which are like functions but are called using dot notation, such as <df>.sort_values(by=["<col>"])
. The functions that you define will always be called using the function name first, passing in all of the arguments.
Applying a Function to a Column
We have seen many examples of creating new columns of tables by applying functions to existing columns or to other arrays. All of those functions took arrays as their arguments. But frequently we will want to convert the entries in a column by a function that doesn’t take an array as its argument. For example, it might take just one number as its argument, as in the function cut_off_at_100
defined below.
1 | def cut_off_at_100(x): |
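A sketch matching the description below:

```python
# Sketch: return x itself if x <= 100, and 100 otherwise.
def cut_off_at_100(x):
    """The smaller of x and 100."""
    return min(x, 100)
```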
1 | cut_off_at_100(17) |
The function cut_off_at_100
simply returns its argument if the argument is less than or equal to 100. But if the argument is greater than 100, it returns 100.
In our earlier examples using Census data, we saw that the variable AGE had a value 100 that meant “100 years old or older”. Cutting off ages at 100 in this manner is exactly what cut_off_at_100
does.
To use this function on many ages at once, we will have to be able to refer to the function itself, without actually calling it. Analogously, we might show a cake recipe to a chef and ask her to use it to bake 6 cakes. In that scenario, we are not using the recipe to bake any cakes ourselves; our role is merely to refer the chef to the recipe. Similarly, we can ask a table to call cut_off_at_100
on 6 different numbers in a column.
First, we create the table ages with a column for people and one for their ages. For example, person C is 52 years old.
1 | ages = pd.DataFrame({'Person': np.array(['A', 'B', 'C', 'D', 'E', 'F']), |
Person | Age | |
---|---|---|
0 | A | 17 |
1 | B | 117 |
2 | C | 52 |
3 | D | 100 |
4 | E | 6 |
5 | F | 101 |
apply
To cut off each of the ages at 100, we will use a new Pandas method. The apply method calls a function on each element of a column, forming a new array of return values. To indicate which function to call, just name it (without quotation marks or parentheses). The column of input values is still selected by its label, which appears within quotation marks.
1 | # Apply cut_off_at_100 to each value in the Age column |
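A sketch of that cell, using pandas' Series.apply on the Age column of ages:

```python
# Sketch: call cut_off_at_100 on each entry of the Age column.
cut_off_ages = ages['Age'].apply(cut_off_at_100)
cut_off_ages.values   # array([ 17, 100,  52, 100,   6, 100])
```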
What we have done here is apply the function cut_off_at_100
to each value in the Age column of the table ages. The output is the array of corresponding return values of the function. For example, 17 stayed 17, 117 became 100, 52 stayed 52, and so on.
This array, which has the same length as the original Age column of the ages table, can be used as the values in a new column called Cut Off Age alongside the existing Person and Age columns.
1 | # Add a column 'Cut Off Age' |
Person | Age | Cut Off Age | |
---|---|---|---|
0 | A | 17 | 17 |
1 | B | 117 | 100 |
2 | C | 52 | 52 |
3 | D | 100 | 100 |
4 | E | 6 | 6 |
5 | F | 101 | 100 |
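Putting the pieces together, here is a hedged sketch of how the ages table and the new column could be built in pandas (the exact code in the hidden cells may differ):
1 | import numpy as np |
2 | import pandas as pd |
3 |  |
4 | ages = pd.DataFrame({'Person': np.array(['A', 'B', 'C', 'D', 'E', 'F']), |
5 |                      'Age': np.array([17, 117, 52, 100, 6, 101])}) |
6 | # Call cut_off_at_100 on each entry of the Age column |
7 | ages['Cut Off Age'] = ages['Age'].apply(cut_off_at_100) |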
Functions as Values
We’ve seen that Python has many kinds of values. For example, 6 is a number value, “cake” is a text value, pd.DataFrame() is an empty table, and ages is a name for a table value (since we defined it above).
In Python, every function, including cut_off_at_100
, is also a value. It helps to think about recipes again. A recipe for cake is a real thing, distinct from cakes or ingredients, and you can give it a name like “Ani’s cake recipe.” When we defined cut_off_at_100
with a def statement, we actually did two separate things: we created a function that cuts off numbers at 100, and we gave it the name cut_off_at_100
.
We can refer to any function by writing its name, without the parentheses or arguments necessary to actually call it. We did this when we called apply above. When we write a function’s name by itself as the last line in a cell, Python produces a text representation of the function, just like it would print out a number or a string value.
1 | cut_off_at_100 |
Notice that we did not write cut_off_at_100
with quotes (which is just a piece of text), or cut_off_at_100()
(which is a function call, and an invalid one at that). We simply wrote cut_off_at_100
to refer to the function.
Just like we can define new names for other values, we can define new names for functions. For example, suppose we want to refer to our function as cut_off
instead of cut_off_at_100
. We can just write this:
1 | cut_off = cut_off_at_100 |
Now cut_off
is a name for a function. It’s the same function as cut_off_at_100
, so the printed value is exactly the same.
1 | cut_off |
Let us see another application of apply.
Example: Prediction
Data Science is often used to make predictions about the future. If we are trying to predict an outcome for a particular individual – for example, how she will respond to a treatment, or whether he will buy a product – it is natural to base the prediction on the outcomes of other similar individuals.
Charles Darwin’s cousin Sir Francis Galton was a pioneer in using this idea to make predictions based on numerical data. He studied how physical characteristics are passed down from one generation to the next.
The data below are Galton’s carefully collected measurements on the heights of parents and their adult children. Each row corresponds to one adult child. The variables are a numerical code for the family, the heights (in inches) of the father and mother, a “midparent height” which is a weighted average of the heights of the two parents, the number of children in the family, as well as the child’s birth rank (1 = oldest), gender, and height.
1 | # Galton's data on heights of parents and their adult children |
family | father | mother | midparentHeight | children | childNum | gender | childHeight | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 78.5 | 67.0 | 75.43 | 4 | 1 | male | 73.2 |
1 | 1 | 78.5 | 67.0 | 75.43 | 4 | 2 | female | 69.2 |
2 | 1 | 78.5 | 67.0 | 75.43 | 4 | 3 | female | 69.0 |
3 | 1 | 78.5 | 67.0 | 75.43 | 4 | 4 | female | 69.0 |
4 | 2 | 75.5 | 66.5 | 73.66 | 4 | 1 | male | 73.5 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
929 | 203 | 62.0 | 66.0 | 66.64 | 3 | 1 | male | 64.0 |
930 | 203 | 62.0 | 66.0 | 66.64 | 3 | 2 | female | 62.0 |
931 | 203 | 62.0 | 66.0 | 66.64 | 3 | 3 | female | 61.0 |
932 | 204 | 62.5 | 63.0 | 65.27 | 2 | 1 | male | 66.5 |
933 | 204 | 62.5 | 63.0 | 65.27 | 2 | 2 | female | 57.0 |
A primary reason for collecting the data was to be able to predict the adult height of a child born to parents similar to those in the dataset. Let us try to do this, using midparent height as the variable on which to base our prediction. Thus midparent height is our predictor variable.
The table heights consists of just the midparent heights and child’s heights. The scatter plot of the two variables shows a positive association, as we would expect for these variables.
1 | # Extract midparentHeight and childHeight from galton |
MidParent | Child | |
---|---|---|
0 | 75.43 | 73.2 |
1 | 75.43 | 69.2 |
2 | 75.43 | 69.0 |
3 | 75.43 | 69.0 |
4 | 73.66 | 73.5 |
... | ... | ... |
929 | 66.64 | 64.0 |
930 | 66.64 | 62.0 |
931 | 66.64 | 61.0 |
932 | 65.27 | 66.5 |
933 | 65.27 | 57.0 |
1 | # Scatter of Child vs. MidParent |
Now suppose Galton encountered a new couple, similar to those in his dataset, and wondered how tall their child would be. What would be a good way for him to go about predicting the child’s height, given that the midparent height was, say, 68 inches?
One reasonable approach would be to base the prediction on all the points that correspond to a midparent height of around 68 inches. The prediction equals the average child’s height calculated from those points alone.
Let’s pretend we are Galton and execute this plan. For now we will just make a reasonable definition of what “around 68 inches” means, and work with that. Later in the course we will examine the consequences of such choices.
We will take “close” to mean “within half an inch”. The figure below shows all the points corresponding to a midparent height between 67.5 inches and 68.5 inches. These are all the points in the strip between the red lines. Each of these points corresponds to one child; our prediction of the height of the new couple’s child is the average height of all the children in the strip. That’s represented by the gold dot.
Ignore the code, and just focus on understanding the mental process of arriving at that gold dot.
1 | # Add shape to fig |
In order to calculate exactly where the gold dot should be, we first need to identify all the points in the strip. These correspond to the rows where MidParent is between 67.5 inches and 68.5 inches.
1 | # Filter the MidParent that between 67.5 and 68.5 |
MidParent | Child | |
---|---|---|
233 | 68.44 | 62.0 |
396 | 67.94 | 71.2 |
397 | 67.94 | 67.0 |
516 | 68.33 | 62.5 |
517 | 68.23 | 73.0 |
... | ... | ... |
885 | 67.60 | 69.0 |
886 | 67.60 | 68.0 |
887 | 67.60 | 67.7 |
888 | 67.60 | 64.5 |
889 | 67.60 | 60.5 |
The predicted height of a child who has a midparent height of 68 inches is the average height of the children in these rows. That’s 66.24 inches.
1 | # Average child's height of close_to_68 |
We now have a way to predict the height of a child given any value of the midparent height near those in our dataset. We can define a function predict_child
that does this. The body of the function consists of the code in the two cells above, apart from choices of names.
1 | # Define predict_child for predicting any value of the midparent height |
1 | predict_child(68) |
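The body of that cell is hidden; a minimal sketch of predict_child, assuming heights is the two-column DataFrame above, might look like this:
1 | def predict_child(h): |
2 |     """Average height of children whose midparent height is within 0.5 inches of h.""" |
3 |     close = heights[(heights['MidParent'] >= h - 0.5) & (heights['MidParent'] <= h + 0.5)] |
4 |     return close['Child'].mean() |
Calling predict_child(68) on the data above returns roughly 66.24 inches, the average height of the children in the strip.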
How good are these predictions? We can get a sense of this by comparing the predictions with the data that we already have. To do this, we first apply the function predict_child
to the column of Midparent heights, and collect the results in a new column called Prediction.
1 | # Apply predict_child to all the midparent heights |
MidParent | Child | Prediction | |
---|---|---|---|
0 | 75.43 | 73.2 | 70.100000 |
1 | 75.43 | 69.2 | 70.100000 |
2 | 75.43 | 69.0 | 70.100000 |
3 | 75.43 | 69.0 | 70.100000 |
4 | 73.66 | 73.5 | 70.415789 |
... | ... | ... | ... |
929 | 66.64 | 64.0 | 65.156579 |
930 | 66.64 | 62.0 | 65.156579 |
931 | 66.64 | 61.0 | 65.156579 |
932 | 65.27 | 66.5 | 64.229630 |
933 | 65.27 | 57.0 | 64.229630 |
To see where the predictions lie relative to the observed data, we can draw overlaid scatter plots with MidParent as the common horizontal axis.
1 | # Draw the original scatter plot along with the predicted values |
The graph of gold dots is called a graph of averages, because each gold dot is the center of a vertical strip like the one we drew earlier. Each one provides a prediction of a child’s height given the midparent height. For example, the scatter shows that for a midparent height of 72 inches, the predicted height of the child would be somewhere between 68 inches and 69 inches, and indeed predict_child(72) returns 68.5.
Galton’s calculations and visualizations were very similar to ours, except that he didn’t have Python. He drew the graph of averages through the scatter diagram and noticed that it roughly followed a straight line. This straight line is now called the regression line and is one of the most common methods of making predictions. Galton’s friend, the mathematician Karl Pearson, used these analyses to formalize the notion of correlation.
This example, like the one about John Snow’s analysis of cholera deaths, shows how some of the fundamental concepts of modern data science have roots going back more than a century. Galton’s methods such as the one we have used here are precursors to nearest neighbor prediction methods that now have powerful applications in diverse settings. The modern field of machine learning includes the automation of such methods to make predictions based on vast and rapidly evolving datasets.
Classifying by One Variable
Data scientists often need to classify individuals into groups according to shared features, and then identify some characteristics of the groups. For example, in the example using Galton’s data on heights, we saw that it was useful to classify families according to the parents’ midparent heights, and then find the average height of the children in each group.
This section is about classifying individuals into categories that are not numerical. We begin by recalling the basic use of group.
Counting the Number in Each Category
The groupby
method with a single argument counts the number of rows for each category in a column. The result contains one row per unique value in the grouped column.
Here is a small table of data on ice cream cones. The group method can be used to list the distinct flavors and provide the counts of each flavor.
1 | # Create DataFrame |
Flavor | Price | |
---|---|---|
0 | strawberry | 3.55 |
1 | chocolate | 4.75 |
2 | chocolate | 6.55 |
3 | strawberry | 5.25 |
4 | chocolate | 5.25 |
1 | # Draw pivot table, group by size or count |
Flavor | count | |
---|---|---|
0 | chocolate | 3 |
1 | strawberry | 2 |
There are two distinct categories, chocolate and strawberry. The call to group creates a table of counts in each category. The column is called count by default, and contains the number of rows in each category.
Notice that this can all be worked out from just the Flavor column. The Price column has not been used.
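A sketch of how this count could be computed in pandas, assuming the cones table shown above:
1 | import pandas as pd |
2 |  |
3 | cones = pd.DataFrame({'Flavor': ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'], |
4 |                       'Price': [3.55, 4.75, 6.55, 5.25, 5.25]}) |
5 | # One row per distinct flavor, with the number of rows in each group |
6 | cones.groupby('Flavor').size().reset_index(name='count') |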
But what if we wanted the total price of the cones of each different flavor? That’s where the second argument of group comes in.
Finding a Characteristic of Each Category
The optional second argument of groupby
names the function that will be used to aggregate values in other columns for all of those rows. For instance, sum will sum up the prices in all rows that match each category. This result also contains one row per unique value in the grouped column, but it has the same number of columns as the original table.
To find the total price of each flavor, we call groupby
again, with Flavor as its first argument as before. But this time there is a second argument: the function name sum.
1 | # SELECT Flavor, SUM(Price) AS 'Price sum' FROM cones; |
Flavor | Price sum | |
---|---|---|
0 | chocolate | 16.55 |
1 | strawberry | 8.80 |
To create this new table, groupby
has calculated the sum of the Price entries in all the rows corresponding to each distinct flavor. The prices in the three chocolate rows add up to $16.55 (you can assume that price is being measured in dollars).
The prices in the two strawberry rows have a total of $8.80.
The label of the newly created “sum” column is Price sum, which is created by taking the label of the column being summed, and appending the word sum.
Because groupby
finds the sum of all columns other than the one with the categories, there is no need to specify that it has to sum the prices.
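One way the total price per flavor might be computed in pandas, shown as a sketch (renaming the result to Price sum is a display choice made here):
1 | # Sum the prices within each flavor group |
2 | cones.groupby('Flavor')['Price'].sum().reset_index(name='Price sum') |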
To see in more detail what groupby
is doing, notice that you could have figured out the total prices yourself, not only by mental arithmetic but also using code. For example, to find the total price of all the chocolate cones, you could start by creating a new table consisting of only the chocolate cones, and then accessing the column of prices:
1 | # SELECT Price FROM cones WHERE 'Flavor' = 'chocolate'; |
1 | # Sum all the price of chocolate |
This is what group is doing for each distinct value in Flavor.
1 | # SELECT Price FROM cones WHERE Flavor = 'chocolate'; |
Flavor | Array of All the Prices | Sum of the Array | |
---|---|---|---|
0 | chocolate | [4.75, 6.55, 5.25] | 16.55 |
1 | strawberry | [3.55, 5.25] | 8.80 |
You can replace sum by any other functions that work on arrays. For example, you could use max to find the largest price in each category:
1 | # SELECT Flavor, MAX('Price') AS 'Price Max' FROM cones GROUP BY 'Flavor'; |
Flavor | Price Max | |
---|---|---|
0 | chocolate | 6.55 |
1 | strawberry | 5.25 |
Once again, groupby
creates arrays of the prices in each Flavor category. But now it finds the max of each array:
1 | # Create a new DataFrame Price_maxes from grouped_cones. Then add a new column 'Max of the Array' |
Flavor | Array of All the Prices | Max of the Array | |
---|---|---|---|
0 | chocolate | [4.75, 6.55, 5.25] | 6.55 |
1 | strawberry | [3.55, 5.25] | 5.25 |
Indeed, the original call to groupby
with just one argument has the same effect as using len as the function and then cleaning up the table.
1 | # Create a new DataFrame lengths from grouped_cones. Then add a new column 'Length of the Array' |
Flavor | Array of All the Prices | Length of the Array | |
---|---|---|---|
0 | chocolate | [4.75, 6.55, 5.25] | 3 |
1 | strawberry | [3.55, 5.25] | 2 |
Example: NBA Salaries
The table nba contains data on the 2015-2016 players in the National Basketball Association. We have examined these data earlier. Recall that salaries are measured in millions of dollars.
1 | # Read CSV nba_salaries.csv |
PLAYER | POSITION | TEAM | SALARY | |
---|---|---|---|---|
0 | Paul Millsap | PF | Atlanta Hawks | 18.671659 |
1 | Al Horford | C | Atlanta Hawks | 12.000000 |
2 | Tiago Splitter | C | Atlanta Hawks | 9.756250 |
3 | Jeff Teague | PG | Atlanta Hawks | 8.000000 |
4 | Kyle Korver | SG | Atlanta Hawks | 5.746479 |
... | ... | ... | ... | ... |
412 | Gary Neal | PG | Washington Wizards | 2.139000 |
413 | DeJuan Blair | C | Washington Wizards | 2.000000 |
414 | Kelly Oubre Jr. | SF | Washington Wizards | 1.920240 |
415 | Garrett Temple | SG | Washington Wizards | 1.100602 |
416 | Jarell Eddie | SG | Washington Wizards | 0.561716 |
- How much money did each team pay for its players’ salaries?
The only columns involved are TEAM
and SALARY
. We have to group the rows by TEAM
and then sum the salaries of the groups.
1 | # Q: How much money did each team pay for its players’ salaries? |
TEAM | SALARY sum | |
---|---|---|
0 | Atlanta Hawks | 69.573103 |
1 | Boston Celtics | 50.285499 |
2 | Brooklyn Nets | 57.306976 |
3 | Charlotte Hornets | 84.102397 |
4 | Chicago Bulls | 78.820890 |
... | ... | ... |
25 | Sacramento Kings | 68.384890 |
26 | San Antonio Spurs | 84.652074 |
27 | Toronto Raptors | 74.672620 |
28 | Utah Jazz | 52.631878 |
29 | Washington Wizards | 90.047498 |
- How many NBA players were there in each of the five positions?
We have to classify by POSITION
, and count. This can be done with just one argument to group:
1 | # Q: How many NBA players were there in each of the five positions? |
POSITION | count | |
---|---|---|
0 | C | 69 |
1 | PF | 85 |
2 | PG | 85 |
3 | SF | 82 |
4 | SG | 96 |
- What was the average salary of the players at each of the five positions?
This time, we have to group by POSITION
and take the mean of the salaries. For clarity, we will work with a table of just the positions and the salaries.
1 | # Q3: What was the average salary of the players at each of the five positions? |
POSITION | SALARY mean | |
---|---|---|
0 | C | 6.082913 |
1 | PF | 4.951344 |
2 | PG | 5.165487 |
3 | SF | 5.532675 |
4 | SG | 3.988195 |
Center was the most highly paid position, at an average of over 6 million dollars.
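A hedged sketch of the kind of call that could produce the table above, assuming nba is the DataFrame read from nba_salaries.csv:
1 | # Average salary within each position |
2 | nba.groupby('POSITION')['SALARY'].mean().reset_index(name='SALARY mean') |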
Cross-Classifying by More than One Variable
When individuals have multiple features, there are many different ways to classify them. For example, if we have a population of college students for each of whom we have recorded a major and the number of years in college, then the students could be classified by major, or by year, or by a combination of major and year.
The group method also allows us to classify individuals according to multiple variables. This is called cross-classifying.
Two Variables: Counting the Number in Each Paired Category
The table more_cones
records the flavor, color, and price of six ice cream cones.
1 | # Create a DataFrame more_cones |
Flavor | Color | Price | |
---|---|---|---|
0 | strawberry | pink | 3.55 |
1 | chocolate | light brown | 4.75 |
2 | chocolate | dark brown | 5.25 |
3 | strawberry | pink | 5.25 |
4 | chocolate | dark brown | 5.25 |
5 | bubblegum | pink | 4.75 |
We know how to use groupby
to count the number of cones of each flavor:
1 | # SELECT Flavor, COUNT(*) AS count FROM more_cones GROUP BY Flavor; |
Flavor | count | |
---|---|---|
0 | bubblegum | 1 |
1 | chocolate | 3 |
2 | strawberry | 2 |
But now each cone has a color as well. To classify the cones by both flavor and color, we will pass a list of labels as an argument to group. The resulting table has one row for every unique combination of values that appear together in the grouped columns. As before, a single argument (a list, in this case, but an array would work too) gives row counts.
Although there are six cones, there are only four unique combinations of flavor and color. Two of the cones were dark brown chocolate, and two pink strawberry.
1 | # SELECT Flavor, COUNT(*) AS count FROM more_cones GROUP BY Flavor, Color; |
Flavor | Color | count | |
---|---|---|---|
0 | bubblegum | pink | 1 |
1 | chocolate | dark brown | 2 |
2 | chocolate | light brown | 1 |
3 | strawberry | pink | 2 |
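A sketch of the corresponding pandas call, assuming more_cones is the six-row table above:
1 | # One row per (Flavor, Color) combination, with the number of rows in each group |
2 | more_cones.groupby(['Flavor', 'Color']).size().reset_index(name='count') |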
Two Variables: Finding a Characteristic of Each Paired Category
A second argument aggregates all other columns that are not in the list of grouped columns.
1 | # SELECT Flavor, SUM(Price) AS 'Price sum' FROM more_cones GROUP BY Flavor, Color; |
Flavor | Color | Price sum | |
---|---|---|---|
0 | bubblegum | pink | 4.75 |
1 | chocolate | dark brown | 10.50 |
2 | chocolate | light brown | 4.75 |
3 | strawberry | pink | 8.80 |
Three or More Variables. You can use groupby
to classify rows by three or more categorical variables. Just include them all in the list that is the first argument. But cross-classifying by multiple variables can become complex, as the number of distinct combinations of categories can be quite large.
Pivot Tables: Rearranging the Output of group
pandas.DataFrame.pivot
pandas.DataFrame.pivot_table
Reshaping and pivot tables
Many uses of cross-classification involve just two categorical variables, like Flavor
and Color
in the example above. In these cases it is possible to display the results of the classification in a different kind of table, called a pivot table. Pivot tables, also known as contingency tables, make it easier to work with data that have been classified according to two variables.
Recall the use of group to count the number of cones in each paired category of flavor and color:
1 | # SELECT Flavor, COUNT(*) AS count FROM more_cones GROUP BY Flavor, Color; |
Flavor | Color | count | |
---|---|---|---|
0 | bubblegum | pink | 1 |
1 | chocolate | dark brown | 2 |
2 | chocolate | light brown | 1 |
3 | strawberry | pink | 2 |
The same data can be displayed differently using the pandas method pivot_table. Ignore the code for a moment, and just examine the table of outcomes.
1 | # Pivot Table |
Flavor | bubblegum | chocolate | strawberry |
---|---|---|---|
Color | |||
dark brown | 0 | 2 | 0 |
light brown | 0 | 1 | 0 |
pink | 1 | 0 | 2 |
Notice how this table displays all nine possible pairs of Flavor
and Color
, including pairs like “dark brown bubblegum” that don’t exist in our data. Notice also that the count in each pair appears in the body of the table: to find the number of light brown chocolate cones, run your eye along the row light brown until it meets the column chocolate.
The groupby method takes a list of two labels because it is flexible: it could take one or three or more. On the other hand, a pivot table is always built from exactly two column labels, one to determine the columns and one to determine the rows.
pivot_table
The pivot_table method is closely related to the groupby method: it groups together rows that share a combination of values. It differs from group because it organizes the resulting values in a grid. The argument columns names the column whose values will be used to form the new columns of the result, and the argument index names the column whose values will label the rows. The result gives the count of all rows of the original table that share each combination of column and row values.
Like groupby, pivot_table can be used with additional arguments to find characteristics of each paired category. An optional argument called values indicates a column of values that will replace the counts in each cell of the grid. All of these values will not be displayed, however; the aggfunc argument indicates how to aggregate them into a single value to be displayed in the cell.
An example will help clarify this. Here is pivot_table
being used to find the total price of the cones in each cell.
1 | # Pivot Table |
Flavor | bubblegum | chocolate | strawberry |
---|---|---|---|
Color | |||
dark brown | 0.00 | 10.50 | 0.0 |
light brown | 0.00 | 4.75 | 0.0 |
pink | 4.75 | 0.00 | 8.8 |
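A sketch of a pivot_table call that would produce the grid of total prices above (the hidden cell may differ in its details):
1 | more_cones.pivot_table(index='Color', columns='Flavor', |
2 |                        values='Price', aggfunc='sum', fill_value=0) |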
And here is group doing the same thing.
1 | # SELECT Flavor, Color, SUM(Price) AS 'Price sum' FROM more_cones GROUP BY Flavor, Color; |
Flavor | Color | Price sum | |
---|---|---|---|
0 | bubblegum | pink | 4.75 |
1 | chocolate | dark brown | 10.50 |
2 | chocolate | light brown | 4.75 |
3 | strawberry | pink | 8.80 |
Though the numbers in both tables are the same, the table produced by pivot_table is easier to read and lends itself more easily to analysis. The advantage of the pivot table is that it places grouped values into adjacent columns, so that they can be combined and compared.
Example: Education and Income of Californian Adults
The State of California’s Open Data Portal is a rich source of information about the lives of Californians. It is our source of a dataset on educational attainment and personal income among Californians over the years 2008 to 2014. The data are derived from the U.S. Census Current Population Survey.
For each year, the table records the Population Count of Californians in many different combinations of age, gender, educational attainment, and personal income. We will study only the data for the year 2014.
1 | # Read CSV educ_inc.csv |
Year | Age | Gender | Educational Attainment | Personal Income | Population Count | |
---|---|---|---|---|---|---|
885 | 1/1/14 0:00 | 18 to 64 | Female | No high school diploma | H: 75,000 and over | 2058 |
886 | 1/1/14 0:00 | 65 to 80+ | Male | No high school diploma | H: 75,000 and over | 2153 |
894 | 1/1/14 0:00 | 65 to 80+ | Female | No high school diploma | G: 50,000 to 74,999 | 4666 |
895 | 1/1/14 0:00 | 65 to 80+ | Female | High school or equivalent | H: 75,000 and over | 7122 |
896 | 1/1/14 0:00 | 65 to 80+ | Female | No high school diploma | F: 35,000 to 49,999 | 7261 |
... | ... | ... | ... | ... | ... | ... |
1021 | 1/1/14 0:00 | 18 to 64 | Female | High school or equivalent | A: 0 to 4,999 | 670294 |
1022 | 1/1/14 0:00 | 18 to 64 | Male | Bachelor's degree or higher | G: 50,000 to 74,999 | 682425 |
1023 | 1/1/14 0:00 | 18 to 64 | Female | No high school diploma | A: 0 to 4,999 | 723208 |
1024 | 1/1/14 0:00 | 18 to 64 | Female | Bachelor's degree or higher | H: 75,000 and over | 953282 |
1025 | 1/1/14 0:00 | 18 to 64 | Male | Bachelor's degree or higher | H: 75,000 and over | 1628605 |
Each row of the table corresponds to a combination of age, gender, educational level, and income. There are 127 such combinations in all!
As a first step it is a good idea to start with just one or two variables. We will focus on just one pair: educational attainment and personal income.
1 | # Filter 'Educational Attainment', 'Personal Income', 'Population Count' |
Educational Attainment | Personal Income | Population Count | |
---|---|---|---|
885 | No high school diploma | H: 75,000 and over | 2058 |
886 | No high school diploma | H: 75,000 and over | 2153 |
894 | No high school diploma | G: 50,000 to 74,999 | 4666 |
895 | High school or equivalent | H: 75,000 and over | 7122 |
896 | No high school diploma | F: 35,000 to 49,999 | 7261 |
... | ... | ... | ... |
1021 | High school or equivalent | A: 0 to 4,999 | 670294 |
1022 | Bachelor's degree or higher | G: 50,000 to 74,999 | 682425 |
1023 | No high school diploma | A: 0 to 4,999 | 723208 |
1024 | Bachelor's degree or higher | H: 75,000 and over | 953282 |
1025 | Bachelor's degree or higher | H: 75,000 and over | 1628605 |
Let’s start by looking at educational level
alone. The categories of this variable have been subdivided by the different levels of income. So we will group the table by Educational Attainment
and sum the Population Count
in each category.
1 | # Filter 'Educational Attainment', 'Population Count' |
Educational Attainment | Population Count sum | |
---|---|---|
0 | Bachelor's degree or higher | 8525698 |
1 | College, less than 4-yr degree | 7775497 |
2 | High school or equivalent | 6294141 |
3 | No high school diploma | 4258277 |
There are only four categories of educational attainment. The counts are so large that it is more helpful to look at percents. For this, we will use the function percents that we defined in an earlier section. It converts an array of numbers to an array of percents out of the total in the input array.
1 | def percents(array_x): |
We now have the distribution of educational attainment among adult Californians. More than 30% have a Bachelor’s degree or higher, while almost 16% lack a high school diploma.
1 | # Copy educ_totals to educ_distribution. Then add a column to it. |
Educational Attainment | Population Count sum | Population Percent | |
---|---|---|---|
0 | Bachelor's degree or higher | 8525698 | 31.75 |
1 | College, less than 4-yr degree | 7775497 | 28.96 |
2 | High school or equivalent | 6294141 | 23.44 |
3 | No high school diploma | 4258277 | 15.86 |
By using pivot_table
, we can get a contingency table (a table of counts) of adult Californians cross-classified by Educational Attainment
and Personal Income
.
1 | # Pivot Table |
Here you see the power of pivot_table
over other cross-classification methods. Each column of counts is a distribution of personal income at a specific level of educational attainment. Converting the counts to percents allows us to compare the four distributions.
1 | # Convert population of table total from numeric to percent |
Personal Income | Bachelor's degree or higher | College, less than 4-yr degree | High school or equivalent | No high school diploma | |
---|---|---|---|---|---|
0 | A: 0 to 4,999 | 6.75 | 12.67 | 18.46 | 28.29 |
1 | B: 5,000 to 9,999 | 3.82 | 10.43 | 9.95 | 14.02 |
2 | C: 10,000 to 14,999 | 5.31 | 10.27 | 11.00 | 15.61 |
3 | D: 15,000 to 24,999 | 9.07 | 17.30 | 19.90 | 20.56 |
4 | E: 25,000 to 34,999 | 8.14 | 14.04 | 14.76 | 10.91 |
5 | F: 35,000 to 49,999 | 13.17 | 14.31 | 12.44 | 6.12 |
6 | G: 50,000 to 74,999 | 18.70 | 11.37 | 8.35 | 3.11 |
7 | H: 75,000 and over | 35.03 | 9.62 | 5.13 | 1.38 |
At a glance, you can see that over 35% of those with Bachelor’s degrees or higher had incomes of $75,000 and over, whereas fewer than 10% of the people in the other education categories had that level of income.
The bar chart below compares the personal income distributions of adult Californians who have no high school diploma with those who have completed a Bachelor’s degree or higher. The difference in the distributions is striking. There is a clear positive association between educational attainment and personal income.
1 | # Bar chart |
Joining Tables by Columns
Often, data about the same individuals is maintained in more than one table. For example, one university office might have data about each student’s time to completion of degree, while another has data about the student’s tuition and financial aid.
To understand the students’ experience, it may be helpful to put the two datasets together. If the data are in two tables, each with one row per student, then we would want to put the columns together, making sure to match the rows so that each student’s information remains on a single row.
Let us do this in the context of a simple example, and then use the method with a larger dataset.
The table cones is one we have encountered earlier. Now suppose each flavor of ice cream comes with a rating that is in a separate table.
1 | # Create DataFrame cones |
Flavor | Price | |
---|---|---|
0 | strawberry | 3.55 |
1 | vanilla | 4.75 |
2 | chocolate | 6.55 |
3 | strawberry | 5.25 |
4 | chocolate | 5.75 |
1 | # Create DataFrame ratings |
Kind | Stars | |
---|---|---|
0 | strawberry | 2.5 |
1 | chocolate | 3.5 |
2 | vanilla | 4.0 |
Each of the tables has a column that contains ice cream flavors: cones has the column Flavor, and ratings has the column Kind. The entries in these columns can be used to link the two tables.
The method join
creates a new table in which each cone in the cones table is augmented with the Stars information in the ratings table. For each cone in cones, join finds a row in ratings whose Kind matches the cone’s Flavor. We have to tell join to use those columns for matching.
1 | # Join table cones and ratings |
Flavor | Price | Stars | |
---|---|---|---|
0 | strawberry | 3.55 | 2.5 |
1 | vanilla | 4.75 | 4.0 |
2 | chocolate | 6.55 | 3.5 |
3 | strawberry | 5.25 | 2.5 |
4 | chocolate | 5.75 | 3.5 |
Each cone now has not only its price but also the rating of its flavor.
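One way the rated table could be produced in pandas is to index ratings by Kind and then join on the Flavor column; this is a sketch, not necessarily the code in the hidden cell:
1 | # Match each cone's Flavor against the Kind index of ratings |
2 | rated = cones.join(ratings.set_index('Kind'), on='Flavor') |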
In general, a call to join that augments a table (say table1) with information from another table (say table2) looks like this:
1 | DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False) |
The new table rated allows us to work out the price per star, which you can think of as an informal measure of value. Low values are good – they mean that you are paying less for each rating star.
1 | # Add a new column '$/Star' to table rated |
Flavor | Price | Stars | $/Star | |
---|---|---|---|---|
1 | vanilla | 4.75 | 4.0 | 1.188 |
0 | strawberry | 3.55 | 2.5 | 1.420 |
4 | chocolate | 5.75 | 3.5 | 1.643 |
2 | chocolate | 6.55 | 3.5 | 1.871 |
3 | strawberry | 5.25 | 2.5 | 2.100 |
Though strawberry has the lowest rating among the three flavors, the less expensive strawberry cone does well on this measure because it doesn’t cost a lot per star.
Side note. Does the order in which we list the two tables matter? Let’s try it. As you can see, this changes the order in which the columns appear, and can potentially change the order of the rows, but it doesn’t make any fundamental difference.
1 | # Join table cones and ratings |
Kind | Stars | Price | |
---|---|---|---|
0 | strawberry | 2.5 | 3.55 |
0 | strawberry | 2.5 | 5.25 |
1 | chocolate | 3.5 | 6.55 |
1 | chocolate | 3.5 | 5.75 |
2 | vanilla | 4.0 | 4.75 |
Also note that an inner join will only contain information about items that appear in both tables, while a left join keeps every row of the left table and marks missing matches with NaN. Let’s see an example. Suppose there is a table of reviews of some ice cream cones, and we have found the average review for each flavor.
1 | # Create DataFrame reviews |
Flavor | Stars | |
---|---|---|
0 | vanilla | 5 |
1 | chocolate | 3 |
2 | vanilla | 5 |
3 | chocolate | 4 |
1 | # SELECT Flavor, AVG(Stars) AS 'Stars Average' FROM reviews GROUP BY Flavor; |
Flavor | Stars Average | |
---|---|---|
0 | chocolate | 3.5 |
1 | vanilla | 5.0 |
We can join cones and average_review by providing the labels of the columns by which to join.
1 | # SELECT Flavor, Price, 'Stars Average' FROM cones LEFT JOIN average_review USING('Flavor'); |
Flavor | Price | Stars Average | |
---|---|---|---|
0 | strawberry | 3.55 | NaN |
1 | vanilla | 4.75 | 5.0 |
2 | chocolate | 6.55 | 3.5 |
3 | strawberry | 5.25 | NaN |
4 | chocolate | 5.75 | 3.5 |
1 | # SELECT Flavor, Price, 'Stars Average' FROM cones INNER JOIN average_review USING('Flavor'); |
Flavor | Price | Stars Average | |
---|---|---|---|
1 | vanilla | 4.75 | 5.0 |
2 | chocolate | 6.55 | 3.5 |
4 | chocolate | 5.75 | 3.5 |
Notice how the strawberry cones have disappeared. None of the reviews are for strawberry cones, so there is nothing to which the strawberry rows can be joined. This might be a problem, or it might not be - that depends on the analysis we are trying to perform with the joined table.
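The two results above can also be reproduced with merge, shown here as a sketch (merge renumbers the rows, unlike the join used in the cells above):
1 | # Left join: keep every cone, filling missing reviews with NaN |
2 | cones.merge(average_review, on='Flavor', how='left') |
3 | # Inner join: keep only flavors that appear in both tables |
4 | cones.merge(average_review, on='Flavor', how='inner') |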
Bike Sharing in the Bay Area
We end this chapter by using all the methods we have learned to examine a new and large dataset.
The Bay Area Bike Share service published a dataset describing every bicycle rental from September 2014 to August 2015 in their system. There were 354,152 rentals in all. The columns are:
- An ID for the rental
- Duration of the rental, in seconds
- Start date
- Name of the Start Station and code for Start Terminal
- Name of the End Station and code for End Terminal
- A serial number for the bike
- Subscriber type and zip code
1 | # Read CSV trip.csv |
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 913460 | 765 | 8/31/2015 23:26 | Harry Bridges Plaza (Ferry Building) | 50 | 8/31/2015 23:39 | San Francisco Caltrain (Townsend at 4th) | 70 | 288 | Subscriber | 2139 |
1 | 913459 | 1036 | 8/31/2015 23:11 | San Antonio Shopping Center | 31 | 8/31/2015 23:28 | Mountain View City Hall | 27 | 35 | Subscriber | 95032 |
2 | 913455 | 307 | 8/31/2015 23:13 | Post at Kearny | 47 | 8/31/2015 23:18 | 2nd at South Park | 64 | 468 | Subscriber | 94107 |
3 | 913454 | 409 | 8/31/2015 23:10 | San Jose City Hall | 10 | 8/31/2015 23:17 | San Salvador at 1st | 8 | 68 | Subscriber | 95113 |
4 | 913453 | 789 | 8/31/2015 23:09 | Embarcadero at Folsom | 51 | 8/31/2015 23:22 | Embarcadero at Sansome | 60 | 487 | Customer | 9069 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
354147 | 432951 | 619 | 9/1/2014 4:21 | Powell Street BART | 39 | 9/1/2014 4:32 | Townsend at 7th | 65 | 335 | Subscriber | 94118 |
354148 | 432950 | 6712 | 9/1/2014 3:16 | Harry Bridges Plaza (Ferry Building) | 50 | 9/1/2014 5:08 | San Francisco Caltrain (Townsend at 4th) | 70 | 259 | Customer | 44100 |
354149 | 432949 | 538 | 9/1/2014 0:05 | South Van Ness at Market | 66 | 9/1/2014 0:14 | 5th at Howard | 57 | 466 | Customer | 32 |
354150 | 432948 | 568 | 9/1/2014 0:05 | South Van Ness at Market | 66 | 9/1/2014 0:15 | 5th at Howard | 57 | 461 | Customer | 32 |
354151 | 432947 | 569 | 9/1/2014 0:05 | South Van Ness at Market | 66 | 9/1/2014 0:15 | 5th at Howard | 57 | 318 | Customer | 32 |
We’ll focus only on the free trips, which are trips that last less than 1800 seconds (half an hour). There is a charge for longer trips.
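A sketch of the filtering step, assuming the full table was read into a DataFrame under the hypothetical name trips:
1 | # Keep only the rides shorter than 1800 seconds (half an hour) |
2 | commute = trips[trips['Duration'] < 1800] |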
1 | # Filter the free trips |
Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscriber Type | Zip Code | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 913460 | 765 | 8/31/2015 23:26 | Harry Bridges Plaza (Ferry Building) | 50 | 8/31/2015 23:39 | San Francisco Caltrain (Townsend at 4th) | 70 | 288 | Subscriber | 2139 |
1 | 913459 | 1036 | 8/31/2015 23:11 | San Antonio Shopping Center | 31 | 8/31/2015 23:28 | Mountain View City Hall | 27 | 35 | Subscriber | 95032 |
2 | 913455 | 307 | 8/31/2015 23:13 | Post at Kearny | 47 | 8/31/2015 23:18 | 2nd at South Park | 64 | 468 | Subscriber | 94107 |
3 | 913454 | 409 | 8/31/2015 23:10 | San Jose City Hall | 10 | 8/31/2015 23:17 | San Salvador at 1st | 8 | 68 | Subscriber | 95113 |
4 | 913453 | 789 | 8/31/2015 23:09 | Embarcadero at Folsom | 51 | 8/31/2015 23:22 | Embarcadero at Sansome | 60 | 487 | Customer | 9069 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
354146 | 432952 | 240 | 9/1/2014 4:59 | South Van Ness at Market | 66 | 9/1/2014 5:03 | Civic Center BART (7th at Market) | 72 | 292 | Subscriber | 94102 |
354147 | 432951 | 619 | 9/1/2014 4:21 | Powell Street BART | 39 | 9/1/2014 4:32 | Townsend at 7th | 65 | 335 | Subscriber | 94118 |
354149 | 432949 | 538 | 9/1/2014 0:05 | South Van Ness at Market | 66 | 9/1/2014 0:14 | 5th at Howard | 57 | 466 | Customer | 32 |
354150 | 432948 | 568 | 9/1/2014 0:05 | South Van Ness at Market | 66 | 9/1/2014 0:15 | 5th at Howard | 57 | 461 | Customer | 32 |
354151 | 432947 | 569 | 9/1/2014 0:05 | South Van Ness at Market | 66 | 9/1/2014 0:15 | 5th at Howard | 57 | 318 | Customer | 32 |
The histogram below shows that most of the trips took around 10 minutes (600 seconds) or so. Very few took near 30 minutes (1800 seconds), possibly because people try to return the bikes before the cutoff time so as not to have to pay.
1 | # Histogram, percent, bins=10 |
We can get more detail by specifying a larger number of bins. But the overall shape doesn’t change much.
1 | # Histogram, percent, bins=100 |
Exploring the Data with group and pivot
We can use group to identify the most highly used Start Station:
1 | # SELECT 'Start Station', COUNT(*) AS 'count' FROM commute GROUP BY 'Start Station' ORDER BY 'count' DESC; |
Start Station | count | |
---|---|---|
49 | San Francisco Caltrain (Townsend at 4th) | 25858 |
50 | San Francisco Caltrain 2 (330 Townsend) | 21523 |
23 | Harry Bridges Plaza (Ferry Building) | 15543 |
65 | Temporary Transbay Terminal (Howard at Beale) | 14298 |
2 | 2nd at Townsend | 13674 |
... | ... | ... |
31 | Mezes Park | 189 |
41 | Redwood City Medical Center | 139 |
55 | San Mateo County Center | 108 |
42 | Redwood City Public Library | 101 |
20 | Franklin at Maple | 62 |
The largest number of trips started at the Caltrain Station on Townsend and 4th in San Francisco. People take the train into the city, and then use a shared bike to get to their next destination.
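A sketch of how the counts above could be computed and sorted:
1 | starts = commute.groupby('Start Station').size().reset_index(name='count') |
2 | starts.sort_values('count', ascending=False) |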
The groupby
method can also be used to classify the rentals by both Start Station and End Station.
1 | # SELECT 'Start Station', 'End Station', COUNT(*) AS 'count' FROM commute GROUP BY 'Start Station', 'End Station'; |
Start Station | End Station | count | |
---|---|---|---|
0 | 2nd at Folsom | 2nd at Folsom | 54 |
1 | 2nd at Folsom | 2nd at South Park | 295 |
2 | 2nd at Folsom | 2nd at Townsend | 437 |
3 | 2nd at Folsom | 5th at Howard | 113 |
4 | 2nd at Folsom | Beale at Market | 127 |
... | ... | ... | ... |
1624 | Yerba Buena Center of the Arts (3rd @ Howard) | Steuart at Market | 202 |
1625 | Yerba Buena Center of the Arts (3rd @ Howard) | Temporary Transbay Terminal (Howard at Beale) | 113 |
1626 | Yerba Buena Center of the Arts (3rd @ Howard) | Townsend at 7th | 261 |
1627 | Yerba Buena Center of the Arts (3rd @ Howard) | Washington at Kearny | 66 |
1628 | Yerba Buena Center of the Arts (3rd @ Howard) | Yerba Buena Center of the Arts (3rd @ Howard) | 73 |
Fifty-four trips both started and ended at the station on 2nd at Folsom. A much larger number (437) were between 2nd at Folsom and 2nd at Townsend.
The pivot_table
method does the same classification but displays its results in a contingency table that shows all possible combinations of Start and End Stations, even though some of them didn’t correspond to any trips.
There is a train station as well as a Bay Area Rapid Transit (BART) station near Beale at Market, explaining the high number of trips that start and end there.
1 | # Pivot Table |
End Station | 2nd at Folsom | 2nd at South Park | 2nd at Townsend | 5th at Howard | Adobe on Almaden | Arena Green / SAP Center | Beale at Market | Broadway St at Battery St | California Ave Caltrain Station | Castro Street and El Camino Real | ... | South Van Ness at Market | Spear at Folsom | St James Park | Stanford in Redwood City | Steuart at Market | Temporary Transbay Terminal (Howard at Beale) | Townsend at 7th | University and Emerson | Washington at Kearny | Yerba Buena Center of the Arts (3rd @ Howard) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Start Station | |||||||||||||||||||||
2nd at Folsom | 54 | 295 | 437 | 113 | 0 | 0 | 127 | 67 | 0 | 0 | ... | 46 | 327 | 0 | 0 | 128 | 414 | 347 | 0 | 142 | 83 |
2nd at South Park | 190 | 164 | 151 | 177 | 0 | 0 | 79 | 89 | 0 | 0 | ... | 41 | 209 | 0 | 0 | 224 | 437 | 309 | 0 | 142 | 180 |
2nd at Townsend | 554 | 71 | 185 | 148 | 0 | 0 | 183 | 279 | 0 | 0 | ... | 50 | 407 | 0 | 0 | 1644 | 486 | 418 | 0 | 72 | 174 |
5th at Howard | 107 | 180 | 92 | 83 | 0 | 0 | 59 | 119 | 0 | 0 | ... | 102 | 100 | 0 | 0 | 371 | 561 | 312 | 0 | 47 | 90 |
Adobe on Almaden | 0 | 0 | 0 | 0 | 11 | 7 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Temporary Transbay Terminal (Howard at Beale) | 237 | 429 | 784 | 750 | 0 | 0 | 167 | 748 | 0 | 0 | ... | 351 | 99 | 0 | 0 | 204 | 94 | 825 | 0 | 90 | 401 |
Townsend at 7th | 342 | 143 | 417 | 200 | 0 | 0 | 35 | 50 | 0 | 0 | ... | 366 | 336 | 0 | 0 | 276 | 732 | 132 | 0 | 29 | 153 |
University and Emerson | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 57 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 62 | 0 | 0 |
Washington at Kearny | 17 | 63 | 57 | 43 | 0 | 0 | 64 | 79 | 0 | 0 | ... | 25 | 24 | 0 | 0 | 31 | 98 | 53 | 0 | 55 | 36 |
Yerba Buena Center of the Arts (3rd @ Howard) | 31 | 209 | 166 | 267 | 0 | 0 | 45 | 47 | 0 | 0 | ... | 115 | 71 | 0 | 0 | 201 | 113 | 261 | 0 | 66 | 72 |
We can also use pivot_table to find the shortest duration of the rides between Start and End Stations. Here pivot_table has been given Duration as the optional values argument, and min as the function to apply to the values in each cell.
1 | # Pivot Table |
End Station | 2nd at Folsom | 2nd at South Park | 2nd at Townsend | 5th at Howard | Adobe on Almaden | Arena Green / SAP Center | Beale at Market | Broadway St at Battery St | California Ave Caltrain Station | Castro Street and El Camino Real | ... | South Van Ness at Market | Spear at Folsom | St James Park | Stanford in Redwood City | Steuart at Market | Temporary Transbay Terminal (Howard at Beale) | Townsend at 7th | University and Emerson | Washington at Kearny | Yerba Buena Center of the Arts (3rd @ Howard) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Start Station | |||||||||||||||||||||
2nd at Folsom | 61 | 61 | 137 | 215 | 0 | 0 | 219 | 351 | 0 | 0 | ... | 673 | 154 | 0 | 0 | 219 | 112 | 399 | 0 | 266 | 145 |
2nd at South Park | 97 | 60 | 67 | 300 | 0 | 0 | 343 | 424 | 0 | 0 | ... | 801 | 219 | 0 | 0 | 322 | 195 | 324 | 0 | 378 | 212 |
2nd at Townsend | 164 | 77 | 60 | 384 | 0 | 0 | 417 | 499 | 0 | 0 | ... | 727 | 242 | 0 | 0 | 312 | 261 | 319 | 0 | 464 | 299 |
5th at Howard | 268 | 86 | 423 | 68 | 0 | 0 | 387 | 555 | 0 | 0 | ... | 383 | 382 | 0 | 0 | 384 | 279 | 330 | 0 | 269 | 128 |
Adobe on Almaden | 0 | 0 | 0 | 0 | 84 | 305 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 409 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Temporary Transbay Terminal (Howard at Beale) | 149 | 61 | 249 | 265 | 0 | 0 | 94 | 291 | 0 | 0 | ... | 644 | 119 | 0 | 0 | 128 | 60 | 534 | 0 | 248 | 190 |
Townsend at 7th | 448 | 78 | 259 | 357 | 0 | 0 | 619 | 885 | 0 | 0 | ... | 378 | 486 | 0 | 0 | 581 | 542 | 61 | 0 | 642 | 479 |
University and Emerson | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 531 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 93 | 0 | 0 |
Washington at Kearny | 429 | 270 | 610 | 553 | 0 | 0 | 222 | 134 | 0 | 0 | ... | 749 | 439 | 0 | 0 | 296 | 311 | 817 | 0 | 65 | 360 |
Yerba Buena Center of the Arts (3rd @ Howard) | 165 | 96 | 284 | 109 | 0 | 0 | 264 | 411 | 0 | 0 | ... | 479 | 303 | 0 | 0 | 280 | 226 | 432 | 0 | 190 | 60 |
Someone had a very quick trip (219 seconds, or under four minutes) from 2nd at Folsom to Beale at Market, about five blocks away. There are no bike trips between the 2nd Street stations and Adobe on Almaden, because the latter is in a different city.
Drawing Maps
The table stations contains geographical information about each bike station, including latitude, longitude, and a “landmark” which is the name of the city where the station is located.
1 | # Read CSV station.csv |
station_id | name | lat | long | dockcount | landmark | installation | |
---|---|---|---|---|---|---|---|
0 | 2 | San Jose Diridon Caltrain Station | 37.329732 | -121.901782 | 27 | San Jose | 8/6/2013 |
1 | 3 | San Jose Civic Center | 37.330698 | -121.888979 | 15 | San Jose | 8/5/2013 |
2 | 4 | Santa Clara at Almaden | 37.333988 | -121.894902 | 11 | San Jose | 8/6/2013 |
3 | 5 | Adobe on Almaden | 37.331415 | -121.893200 | 19 | San Jose | 8/5/2013 |
4 | 6 | San Pedro Square | 37.336721 | -121.894074 | 15 | San Jose | 8/7/2013 |
... | ... | ... | ... | ... | ... | ... | ... |
65 | 77 | Market at Sansome | 37.789625 | -122.400811 | 27 | San Francisco | 8/25/2013 |
66 | 80 | Santa Clara County Civic Center | 37.352601 | -121.905733 | 15 | San Jose | 12/31/2013 |
67 | 82 | Broadway St at Battery St | 37.798541 | -122.400862 | 15 | San Francisco | 1/22/2014 |
68 | 83 | Mezes Park | 37.491269 | -122.236234 | 15 | Redwood City | 2/20/2014 |
69 | 84 | Ryland Park | 37.342725 | -121.895617 | 15 | San Jose | 4/9/2014 |
We can draw a map of where the stations are located, using px.scatter_mapbox. The function operates on a table; we tell it which columns contain the latitude and the longitude, and optionally which column to use as a label for each point.
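A minimal sketch of such a call, assuming stations is the DataFrame read from station.csv and using the token-free 'open-street-map' base map:
1 | import plotly.express as px |
2 |  |
3 | fig = px.scatter_mapbox(stations, lat='lat', lon='long', hover_name='name', |
4 |                         mapbox_style='open-street-map', zoom=9) |
5 | fig.show() |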
Mapbox Access Token and Base Map Configuration
To plot on Mapbox maps with Plotly you may need a Mapbox account and a public Mapbox Access Token. See Mapbox Map Layers documentation for more information.
After registering a Mapbox account, click New Style.
Click Share at the top right.
Copy your Access Token.
Then open your shell and make sure you are in the working directory of your .py file.
1 | pwd |
Then put your Access Token into the file created by the following command.
1 | touch .mapbox_token |
Test dataset px.data.carshare()
1 | # Test dataset px.data.carshare() |
Back to dataset stations
mapbox_style (str, default 'basic', needs Mapbox API token) – Identifier of the base map style, some of which require a Mapbox API token to be set using plotly.express.set_mapbox_access_token(). Allowed values which do not require a Mapbox API token are 'open-street-map', 'white-bg', 'carto-positron', 'carto-darkmatter', 'stamen-terrain', 'stamen-toner', 'stamen-watercolor'. Allowed values which do require a Mapbox API token are 'basic', 'streets', 'outdoors', 'light', 'dark', 'satellite', 'satellite-streets'.
1 | # Draw map |
The map is created using OpenStreetMap, which is an open online mapping system that you can use just as you would use Google Maps or any other online map. Zoom in to San Francisco to see how the stations are distributed. Click on a marker to see which station it is.
You can also represent points on a map by colored circles. Here is such a map of the San Francisco bike stations
1 | # Filter 'landmark' == 'San Francisco' |
More Informative Maps: An Application of join
The bike stations are located in five different cities in the Bay Area. To distinguish the points by using a different color for each city, let’s start by using group to identify all the cities and assign each one a color.
1 | # SELECT landmark AS city, COUNT(*) AS 'count' FROM cities GROUP BY city; |
city | count | |
---|---|---|
0 | Mountain View | 7 |
1 | Palo Alto | 5 |
2 | Redwood City | 7 |
3 | San Francisco | 35 |
4 | San Jose | 16 |
1 | # Copy cities to colors then add a new columns |
city | count | color | |
---|---|---|---|
0 | Mountain View | 7 | blue |
1 | Palo Alto | 5 | red |
2 | Redwood City | 7 | green |
3 | San Francisco | 35 | orange |
4 | San Jose | 16 | purple |
Now we can join stations and colors by landmark, and then select the columns we need to draw a map.
1 | # JOIN table stations and colors |
lat | long | name | color | |
---|---|---|---|---|
0 | 37.329732 | -121.901782 | San Jose Diridon Caltrain Station | purple |
1 | 37.330698 | -121.888979 | San Jose Civic Center | purple |
2 | 37.333988 | -121.894902 | Santa Clara at Almaden | purple |
3 | 37.331415 | -121.893200 | Adobe on Almaden | purple |
4 | 37.336721 | -121.894074 | San Pedro Square | purple |
... | ... | ... | ... | ... |
62 | 37.794139 | -122.394434 | Steuart at Market | orange |
63 | 37.791300 | -122.399051 | Mechanics Plaza (Market at Battery) | orange |
64 | 37.786305 | -122.404966 | Market at 4th | orange |
65 | 37.789625 | -122.400811 | Market at Sansome | orange |
67 | 37.798541 | -122.400862 | Broadway St at Battery St | orange |
1 | # Draw map |
Now the markers have five different colors for the five different cities.
To see where most of the bike rentals originate, let’s identify the start stations:
1 | # SELECT 'Start Station', COUNT(*) AS 'count' FROM commute GROUP BY 'Start Station' ORDER BY 'count' DESC; |
Start Station | count | |
---|---|---|
49 | San Francisco Caltrain (Townsend at 4th) | 25858 |
50 | San Francisco Caltrain 2 (330 Townsend) | 21523 |
23 | Harry Bridges Plaza (Ferry Building) | 15543 |
65 | Temporary Transbay Terminal (Howard at Beale) | 14298 |
2 | 2nd at Townsend | 13674 |
... | ... | ... |
31 | Mezes Park | 189 |
41 | Redwood City Medical Center | 139 |
55 | San Mateo County Center | 108 |
42 | Redwood City Public Library | 101 |
20 | Franklin at Maple | 62 |
We can include the geographical data needed to map these stations, by first joining starts with stations:
1 | # INNER JOIN stations and starts |
station_id | name | lat | long | dockcount | landmark | installation | count | |
---|---|---|---|---|---|---|---|---|
0 | 2 | San Jose Diridon Caltrain Station | 37.329732 | -121.901782 | 27 | San Jose | 8/6/2013 | 4899 |
1 | 3 | San Jose Civic Center | 37.330698 | -121.888979 | 15 | San Jose | 8/5/2013 | 574 |
2 | 4 | Santa Clara at Almaden | 37.333988 | -121.894902 | 11 | San Jose | 8/6/2013 | 1888 |
3 | 5 | Adobe on Almaden | 37.331415 | -121.893200 | 19 | San Jose | 8/5/2013 | 522 |
4 | 6 | San Pedro Square | 37.336721 | -121.894074 | 15 | San Jose | 8/7/2013 | 1321 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
65 | 77 | Market at Sansome | 37.789625 | -122.400811 | 27 | San Francisco | 8/25/2013 | 11023 |
66 | 80 | Santa Clara County Civic Center | 37.352601 | -121.905733 | 15 | San Jose | 12/31/2013 | 510 |
67 | 82 | Broadway St at Battery St | 37.798541 | -122.400862 | 15 | San Francisco | 1/22/2014 | 7460 |
68 | 83 | Mezes Park | 37.491269 | -122.236234 | 15 | Redwood City | 2/20/2014 | 189 |
69 | 84 | Ryland Park | 37.342725 | -121.895617 | 15 | San Jose | 4/9/2014 | 1077 |
Now we extract just the data needed for drawing our map, adding a color and an area to each station. The area is 0.3 times the count of the number of rentals starting at each station, where the constant 0.3 was chosen so that the circles would appear at an appropriate scale on the map.
1 | # Extract columns 'lat', 'long', 'name' from station_starts to starts_map_data |
lat | long | name | colors | areas | |
---|---|---|---|---|---|
0 | 37.329732 | -121.901782 | San Jose Diridon Caltrain Station | blue | 1469.7 |
1 | 37.330698 | -121.888979 | San Jose Civic Center | blue | 172.2 |
2 | 37.333988 | -121.894902 | Santa Clara at Almaden | blue | 566.4 |
3 | 37.331415 | -121.893200 | Adobe on Almaden | blue | 156.6 |
4 | 37.336721 | -121.894074 | San Pedro Square | blue | 396.3 |
... | ... | ... | ... | ... | ... |
65 | 37.789625 | -122.400811 | Market at Sansome | blue | 3306.9 |
66 | 37.352601 | -121.905733 | Santa Clara County Civic Center | blue | 153.0 |
67 | 37.798541 | -122.400862 | Broadway St at Battery St | blue | 2238.0 |
68 | 37.491269 | -122.236234 | Mezes Park | blue | 56.7 |
69 | 37.342725 | -121.895617 | Ryland Park | blue | 323.1 |
1 | # Draw map |
That huge blob in San Francisco shows that the eastern section of the city is the unrivaled capital of bike rentals in the Bay Area.
Randomness
Randomness
In the previous chapters we developed skills needed to make insightful descriptions of data. Data scientists also have to be able to understand randomness. For example, they have to be able to assign individuals to treatment and control groups at random, and then try to say whether any observed differences in the outcomes of the two groups are simply due to the random assignment or genuinely due to the treatment.
In this chapter, we begin our analysis of randomness. To start off, we will use Python to make choices at random. In numpy there is a sub-module called random
that contains many functions that involve random selection. One of these functions is called choice
. It picks one item at random from an array, and it is equally likely to pick any of the items. The function call is np.random.choice(array_name)
, where array_name
is the name of the array from which to make the choice.
Thus the following code evaluates to treatment with chance 50%, and control with chance 50%.
1 | # The following code evaluates to treatment with chance 50%, and control with chance 50%. |
The big difference between the code above and all the other code we have run thus far is that the code above doesn’t always return the same value. It can return either treatment
or control
, and we don’t know ahead of time which one it will pick. We can repeat the process by providing a second argument, the number of times to repeat the process.
1 | # Randomly choose 'treatment', 'control' from two_groups 10 times |
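A sketch of the two cells, assuming numpy is imported as np:
1 | two_groups = np.array(['treatment', 'control']) |
2 | np.random.choice(two_groups)        # one random choice |
3 | np.random.choice(two_groups, 10)    # an array of 10 random choices |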
A fundamental question about random events is whether or not they occur. For example:
- Did an individual get assigned to the treatment group, or not?
- Is a gambler going to win money, or not?
- Has a poll made an accurate prediction, or not?
Once the event has occurred, you can answer “yes” or “no” to all these questions. In programming, it is conventional to do this by labeling statements as True or False. For example, if an individual did get assigned to the treatment group, then the statement, “The individual was assigned to the treatment group” would be True. If not, it would be False.
Booleans and Comparison
In Python, Boolean values, named for the logician George Boole, represent truth and take only two possible values: True and False. Whether problems involve randomness or not, Boolean values most often arise from comparison operators. Python includes a variety of operators that compare values. For example, 3 is larger than 1 + 1.
1 | 3 > 1 + 1 |
The value True indicates that the comparison is valid; Python has confirmed this simple fact about the relationship between 3 and 1+1. The full set of common comparison operators is listed below.
Comparison | Operator | True example | False example |
---|---|---|---|
Less than | < | 2 < 3 | 2 < 2 |
Greater than | > | 3 > 2 | 3 > 3 |
Less than or equal | <= | 2 <= 2 | 3 <= 2 |
Greater or equal | >= | 3 >= 3 | 2 >= 3 |
Equal | == | 3 == 3 | 3 == 2 |
Not equal | != | 3 != 2 | 2 != 2 |
Notice the two equal signs == in the comparison to determine equality. This is necessary because Python already uses = to mean assignment to a name, as we have seen. It can’t use the same symbol for a different purpose. Thus if you want to check whether 5 is equal to 10/2, then you have to be careful: 5 = 10/2 returns an error message because Python assumes you are trying to assign the value of the expression 10/2 to a name that is the numeral 5. Instead, you must use 5 == 10/2, which evaluates to True.
1 | 5 = 10/2 |
1 | File "<ipython-input-13-e8c755f5e450>", line 1 |
An expression can contain multiple comparisons, and they all must hold in order for the whole expression to be True. For example, we can express that 1+1 is between 1 and 3 using the following expression.
1 | 1 < 1 + 1 < 3 |
The average of two numbers is always between the smaller number and the larger number. We express this relationship for the numbers x and y below. You can try different values of x and y to confirm this relationship.
1 | x = 12 |
Comparing Strings
Strings can also be compared, and their order is alphabetical. A shorter string is less than a longer string that begins with the shorter string.
1 | 'Dog' > 'Catastrophe' > 'Cat' |
Let’s return to random selection. Recall the array two_groups which consists of just two elements, treatment and control. To see whether a randomly assigned individual went to the treatment group, you can use a comparison:
1 | np.random.choice(two_groups) == 'treatment' |
1 | True |
As before, the random choice will not always be the same, so the result of the comparison won’t always be the same either. It will depend on whether treatment or control was chosen. With any cell that involves random selection, it is a good idea to run the cell several times to get a sense of the variability in the result.
Comparing an Array and a Value
Recall that we can perform arithmetic operations on many numbers in an array at once. For example, np.array([0, 5, 2]) * 2 is equivalent to np.array([0, 10, 4]). In similar fashion, if we compare an array and one value, each element of the array is compared to that value, and the comparison evaluates to an array of Booleans.
1 | tosses = np.array(['Tails', 'Heads', 'Tails', 'Heads', 'Heads']) |
The NumPy function count_nonzero evaluates to the number of non-zero (that is, True) elements of the array.
1 | np.count_nonzero(tosses == 'Heads') |
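A sketch of the two cells above, showing the element-wise comparison and the count:

```python
import numpy as np

tosses = np.array(['Tails', 'Heads', 'Tails', 'Heads', 'Heads'])

tosses == 'Heads'                     # array([False, True, False, True, True])
np.count_nonzero(tosses == 'Heads')   # 3
```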
Conditional Statements
In many situations, actions and results depend on a specific set of conditions being satisfied. For example, individuals in randomized controlled trials receive the treatment if they have been assigned to the treatment group. A gambler makes money if she wins her bet.
In this section we will learn how to describe such situations using code. A conditional statement is a multi-line statement that allows Python to choose among different alternatives based on the truth value of an expression. While conditional statements can appear anywhere, they appear most often within the body of a function in order to express alternative behavior depending on argument values.
A conditional statement always begins with an if header, which is a single line followed by an indented body. The body is only executed if the expression directly following if (called the if expression) evaluates to a true value. If the if expression evaluates to a false value, then the body of the if is skipped.
Let us start defining a function that returns the sign of a number.
1 | def sign(x): |
1 | sign(3) |
This function returns the correct sign if the input is a positive number. But if the input is not a positive number, then the if expression evaluates to a false value, and so the return statement is skipped and the function call has no value.
1 | sign(-3) |
So let us refine our function to return Negative if the input is a negative number. We can do this by adding an elif clause, where elif is Python’s shorthand for the phrase “else, if”.
1 | def sign(x): |
Now sign returns the correct answer when the input is -3:
1 | sign(-3) |
What if the input is 0? To deal with this case, we can add another elif clause:
1 | def sign(x): |
1 | sign(0) |
Equivalently, we can replace the final elif clause by an else clause, whose body will be executed only if all the previous comparisons are false; that is, if the input value is equal to 0.
1 | def sign(x): |
1 | sign(0) |
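The cells above show only the first line of each version of sign. One way the final version might look, returning a short description of the sign as the text suggests:

```python
def sign(x):
    if x > 0:
        return 'Positive'
    elif x < 0:
        return 'Negative'
    else:
        return 'Neither positive nor negative'

sign(3), sign(-3), sign(0)
```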
The General Form
A conditional statement can also have multiple clauses with multiple bodies, and only one of those bodies can ever be executed. The general format of a multi-clause conditional statement appears below.
1 | if <if expression>: |
There is always exactly one if clause, but there can be any number of elif clauses. Python will evaluate the if and elif expressions in the headers in order until one is found that is a true value, then execute the corresponding body. The else clause is optional. When an else header is provided, its body is executed only if none of the header expressions of the previous clauses are true. The else clause must always come at the end (or not at all).
Example: Betting on a Die
Suppose I bet on a roll of a fair die. The rules of the game:
- If the die shows 1 spot or 2 spots, I lose a dollar.
- If the die shows 3 spots or 4 spots, I neither lose money nor gain money.
- If the die shows 5 spots or 6 spots, I gain a dollar.
We will now use conditional statements to define a function one_bet that takes the number of spots on the roll and returns my net gain.
1 | def one_bet(x): |
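The cell above is truncated. A sketch of the full definition, consistent with the walkthrough later in this section (the if expression compares to 2, the elif expressions to 4 and 6):

```python
def one_bet(x):
    """Return my net gain if the die shows x spots."""
    if x <= 2:
        return -1
    elif x <= 4:
        return 0
    elif x <= 6:
        return 1
```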
Let’s check that the function does the right thing for each different number of spots.
1 | one_bet(1), one_bet(2), one_bet(3), one_bet(4), one_bet(5), one_bet(6) |
As a review of how conditional statements work, let’s see what one_bet does when the input is 3.
- First it evaluates the if expression, which is 3 <= 2, which is False. So one_bet doesn’t execute the if body.
- Then it evaluates the first elif expression, which is 3 <= 4, which is True. So one_bet executes the first elif body and returns 0.
- Once the body has been executed, the process is complete. The next elif expression is not evaluated.
If for some reason we use an input greater than 6, then the if expression evaluates to False as do both of the elif expressions. So one_bet does not execute the if body nor the two elif bodies, and there is no value when you make the call below.
1 | one_bet(17) |
To play the game based on one roll of a die, you can use np.random.choice to generate the number of spots and then use that as the argument to one_bet. Run the cell a few times to see how the output changes.
1 | one_bet(np.random.choice(np.arange(1, 7))) |
At this point it is natural to want to collect the results of all the bets so that we can analyze them. In the next section we develop a way to do this without running the cell over and over again.
Iteration
It is often the case in programming – especially when dealing with randomness – that we want to repeat a process multiple times. For example, recall the game of betting on one roll of a die with the following rules:
- If the die shows 1 or 2 spots, my net gain is -1 dollar.
- If the die shows 3 or 4 spots, my net gain is 0 dollars.
- If the die shows 5 or 6 spots, my net gain is 1 dollar.
The function bet_on_one_roll takes no argument. Each time it is called, it simulates one roll of a fair die and returns the net gain in dollars.
1 | def bet_on_one_roll(): |
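A sketch of the truncated cell above: the function rolls once using np.random.choice and applies the same betting rules as one_bet.

```python
import numpy as np

def bet_on_one_roll():
    """Simulate one roll of a fair die and return my net gain in dollars."""
    x = np.random.choice(np.arange(1, 7))
    if x <= 2:
        return -1
    elif x <= 4:
        return 0
    elif x <= 6:
        return 1
```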
Playing this game once is easy:
1 | bet_on_one_roll() |
To get a sense of how variable the results are, we have to play the game over and over again. We could run the cell repeatedly, but that’s tedious, and if we wanted to do it a thousand times or a million times, forget it.
A more automated solution is to use a for statement to loop over the contents of a sequence. This is called iteration. A for statement begins with the word for, followed by a name we want to give each item in the sequence, followed by the word in, and ending with an expression that evaluates to a sequence. The indented body of the for statement is executed once for each item in that sequence.
1 | for animal in np.array(['cat', 'dog', 'rabbit']): |
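A minimal version of the loop sketched above, which prints each item in turn:

```python
import numpy as np

for animal in np.array(['cat', 'dog', 'rabbit']):
    print(animal)
```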
It is helpful to write code that exactly replicates a for statement, without using the for statement. This is called unrolling the loop.
A for statement simply replicates the code inside it, but before each iteration, it assigns a new value from the given sequence to the name we chose. For example, here is an unrolled version of the loop above.
1 | animal = np.array(['cat', 'dog', 'rabbit']).item(0) |
Notice that the name animal is arbitrary, just like any name we assign with =.
Here we use a for statement in a more realistic way: we print the results of betting five times on the die as described earlier. This is called simulating the results of five bets. We use the word simulating to remind ourselves that we are not physically rolling dice and exchanging money but using Python to mimic the process.
To repeat a process n times, it is common to use the sequence np.arange(n) in the for statement. It is also common to use a very short name for each item. In our code we will use the name i to remind ourselves that it refers to an item.
1 | for i in np.arange(5): |
In this case, we simply perform exactly the same (random) action several times, so the code in the body of our for statement does not actually refer to i.
Augmenting Arrays
While the for statement above does simulate the results of five bets, the results are simply printed and are not in a form that we can use for computation. An array of results would be more useful. Thus a typical use of a for statement is to create an array of results, by augmenting the array each time.
The NumPy function append helps us do this. The call np.append(array_name, value) evaluates to a new array that is array_name augmented by value. When you use append, keep in mind that all the entries of an array must have the same type.
1 | pets = np.array(['Cat', 'Dog']) |
This keeps the array pets unchanged:
1 | pets |
But often while using for loops it will be convenient to mutate an array – that is, change it – when augmenting it. This is done by assigning the augmented array to the same name as the original.
1 | pets = np.append(pets, 'Another Pet') |
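A sketch of the cells above, showing that np.append returns a new array and that reassignment is what changes pets:

```python
import numpy as np

pets = np.array(['Cat', 'Dog'])

np.append(pets, 'Another Pet')          # a new, longer array; pets itself is unchanged

pets = np.append(pets, 'Another Pet')   # reassigning the name is what mutates pets
pets                                    # array(['Cat', 'Dog', 'Another Pet'], ...)
```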
Example: Betting on 5 Rolls
We can now simulate five bets on the die and collect the results in an array that we will call the collection array. We will start out by creating an empty array for this, and then append the outcome of each bet. Notice that the body of the for loop contains two statements. Both statements are executed for each item in the given sequence.
1 | # Make an empty array outcomes |
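A sketch of the cell above, using the collection-array pattern and assuming bet_on_one_roll is defined as earlier:

```python
import numpy as np

outcomes = np.array([])                 # the collection array, initially empty

for i in np.arange(5):
    outcome_of_bet = bet_on_one_roll()  # simulate one bet
    outcomes = np.append(outcomes, outcome_of_bet)

outcomes
```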
Let us rewrite the cell with the for statement unrolled:
1 | # Make an empty array outcomes |
The contents of the array are likely to be different from the array that we got by running the previous cell, but that is because of randomness in rolling the die. The process for creating the array is exactly the same.
By capturing the results in an array we have given ourselves the ability to use array methods to do computations. For example, we can use np.count_nonzero to count the number of times money changed hands.
1 | np.count_nonzero(outcomes) |
Example: Betting on 300 Rolls
Iteration is a powerful technique. For example, we can see the variation in the results of 300 bets by running exactly the same code for 300 bets instead of five.
1 | # Make an empty array outcomes |
The array outcomes contains the results of all 300 bets.
1 | # Times of the rolling |
To see how often the three different possible results appeared, we can count the values in the array outcomes and draw a bar chart with px.bar.
1 | # Save outcomes as DataFrame |
Outcome | count | |
---|---|---|
0 | -1.0 | 99 |
1 | 0.0 | 99 |
2 | 1.0 | 102 |
1 | # Draw bar chart |
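One way the counting and plotting cells above might look; the specific pandas and plotly.express calls are assumptions, chosen to match the DataFrame output and the px.bar reference in the text:

```python
import pandas as pd
import plotly.express as px

# Count how often each outcome (-1, 0, 1) occurred among the 300 bets.
results = pd.DataFrame({'Outcome': outcomes})
counts = results.groupby('Outcome').size().reset_index(name='count')

fig = px.bar(counts, x='Outcome', y='count')
fig.show()
```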
Not surprisingly, each of the three outcomes -1, 0, and 1 appeared about 100 of the 300 times, give or take. We will examine the “give or take” amounts more closely in later chapters.
Simulation
Simulation is the process of using a computer to mimic a physical experiment. In this class, those experiments will almost invariably involve chance.
We have seen how to simulate the results of tosses of a coin. The steps in that simulation were examples of the steps that will constitute every simulation we do in this course. In this section we will set out those steps and follow them in examples.
Step 1: What to Simulate
Specify the quantity you want to simulate. For example, you might decide that you want to simulate the outcomes of tosses of a coin.
Step 2: Simulating One Value
Figure out how to simulate one value of the quantity you specified in Step 1. In our example, you have to figure out how to simulate the outcome of one toss of a coin. If your quantity is more complicated, you might need several lines of code to come up with one simulated value.
Step 3: Number of Repetitions
Decide how many times you want to simulate the quantity. You will have to repeat Step 2 that many times. In one of our earlier examples we had decided to simulate the outcomes of 1000 tosses of a coin, and so we needed 1000 repetitions of generating the outcome of a single toss.
Step 4: Coding the Simulation
Put it all together in code.
- Create an empty array in which to collect all the simulated values. We will call this the collection array.
- Create a “repetitions sequence,” that is, a sequence whose length is the number of repetitions you specified in Step 3. For n repetitions we will almost always use the sequence np.arange(n).
- Create a for loop. For each element of the repetitions sequence:
  - Simulate one value based on the code you developed in Step 2.
  - Augment the collection array with this simulated value.
That’s it! Once you have carried out the steps above, your simulation is done. The collection array contains all the simulated values.
At this point you can use the collection array as you would any other array. You can visualize the distribution of the simulated values, count how many simulated values fall into a particular category, and so on.
Number of Heads in 100 Tosses
It is natural to expect that in 100 tosses of a coin, there will be 50 heads, give or take a few.
But how many is “a few”? What’s the chance of getting exactly 50 heads? Questions like these matter in data science not only because they are about interesting aspects of randomness, but also because they can be used in analyzing experiments where assignments to treatment and control groups are decided by the toss of a coin.
In this example we will simulate the number of heads in 100 tosses of a coin. The histogram of our results will give us some insight into how many heads are likely.
Let’s get started on the simulation, following the steps above.
Step 1: What to Simulate
The quantity we are going to simulate is the number of heads in 100 tosses.
Step 2: Simulating One Value
We have to figure out how to make one set of 100 tosses and count the number of heads. Let’s start by creating a coin.
1 | # Define all possible values |
In our earlier example we used np.random.choice and a for loop to generate multiple tosses. But sets of coin tosses are needed so often in data science that np.random.choice simulates them for us if we include a second argument that is the number of times to toss.
Here are the results of 10 tosses.
1 | # Toss the coin 10 times |
1 | array(['Tails', 'Tails', 'Heads', 'Heads', 'Heads', 'Heads', 'Heads', |
We can count the number of heads by using np.count_nonzero as before:
1 | # Count the number of Heads: 'Tails' counts as False (0), 'Heads' as True (1) |
Our goal is to simulate the number of heads in 100 tosses, not 10. To do that we can just repeat the same code, replacing 10 by 100.
1 | # Toss the coin 100 times |
1 | 50 |
Step 3: Number of Repetitions
How many repetitions we want is up to us. The more we use, the more reliable our simulations will be, but the longer it will take to run the code. Python is pretty fast at tossing coins. Let’s go for 10,000 repetitions. That means we are going to do the following 10,000 times:
- Toss a coin 100 times and count the number of heads.
That’s a lot of tossing! It’s good that we have Python to do it for us.
Step 4: Coding the Simulation
We are ready to write the code to execute the entire simulation.
1 | # Tossing coin 100 times as one value. And repeat the process 10,000 times. |
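A sketch of the whole simulation in the cell above, following Steps 1 through 4:

```python
import numpy as np

coin = np.array(['Heads', 'Tails'])

heads = np.array([])                       # collection array
for i in np.arange(10000):                 # 10,000 repetitions
    tosses = np.random.choice(coin, 100)   # one repetition: 100 tosses
    heads = np.append(heads, np.count_nonzero(tosses == 'Heads'))
```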
Check that the array heads contains 10,000 entries, one for each repetition of the experiment.
1 | len(heads) |
To get a sense of the variability in the number of heads in 100 tosses, we can collect the results in a table and draw a histogram.
1 | # Make a DataFrame for storing the tossing result. |
Repetition | Number of Heads | |
---|---|---|
0 | 1 | 48.0 |
1 | 2 | 44.0 |
2 | 3 | 51.0 |
3 | 4 | 43.0 |
4 | 5 | 52.0 |
... | ... | ... |
9995 | 9996 | 53.0 |
9996 | 9997 | 53.0 |
9997 | 9998 | 47.0 |
9998 | 9999 | 54.0 |
9999 | 10000 | 58.0 |
1 | # Draw histogram |
Each bin has width 1 and is centered at each value of the number of heads.
Not surprisingly, the histogram looks roughly symmetric around 50 heads. The height of the bar at 50 is about 8% per unit. Since each bin is 1 unit wide, this is the same as saying that about 8% of the repetitions produced exactly 50 heads. That’s not a huge percent, but it’s the largest compared to the percent at every other number of heads.
The histogram also shows that in almost all of the repetitions, the number of heads in 100 tosses was somewhere between 35 and 65. Indeed, the bulk of the repetitions produced numbers of heads in the range 45 to 55.
While in theory it is possible that the number of heads can be anywhere between 0 and 100, the simulation shows that the range of probable values is much smaller.
This is an instance of a more general phenomenon about the variability in coin tossing, as we will see later in the course.
A More Compact Version of the Code
We wrote the code for the simulation to show each of the steps in detail. Here are the same steps written in a more compact form. You can see that the code starts out the same way as before, but then some steps are combined.
1 | # Step 4: Create an empty array in which to collect all the simulated values. We will call this the collection array. |
1 | # Show the results of 10,000 repetition |
Moves in Monopoly
Each move in the game Monopoly is determined by the total number of spots of two rolls of a die. If you play Monopoly, what should you expect to get when you roll the die two times?
We can explore this by simulating the sum of two rolls of a die. We will run the simulation 10,000 times as we did in the previous example. Notice that in this paragraph we have completed Steps 1 and 3 of our simulation process.
Step 2 is the one in which we simulate one pair of rolls and add up the number of spots.
1 | # Define all possible values of one die |
That simulates one value of the sum of two rolls. We are now all set to run the simulation according to the steps that are now familiar.
1 | # Step 4: Create an empty array in which to collect all the simulated values. We will call this the collection array. |
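A sketch of Steps 2 and 4 for this simulation, assuming the die is represented by the integers 1 through 6 and using the illustrative name moves for the collection array:

```python
import numpy as np

die = np.arange(1, 7)

moves = np.array([])                       # collection array
for i in np.arange(10000):
    # One simulated value: the total number of spots on two rolls.
    moves = np.append(moves, np.random.choice(die) + np.random.choice(die))
```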
Here is a histogram of the results.
1 | # Make a DataFrame for storing the rolling result. |
Seven is the most common value, with the frequencies falling off symmetrically on either side.
This matches the theory: two rolls of a die have 6 × 6 = 36 equally likely outcomes.
When the sum is 7, it could be the following combinations:
- 1 + 6
- 2 + 5
- 3 + 4
- 4 + 3
- 5 + 2
- 6 + 1
P(sum = 7) = 6/36 = 1/6 = 16.7%
1 | len(results[results['Sum of Two Rolls']==7.0]) / len(results) |
The proportion of simulated games with 'Sum of Two Rolls' equal to 7 is close to the theoretical probability P(sum = 7).
The Monty Hall Problem
This problem has flummoxed many people over the years, mathematicians included. Let’s see if we can work it out by simulation.
The setting is derived from a television game show called “Let’s Make a Deal”. Monty Hall hosted this show in the 1960’s, and it has since led to a number of spin-offs. An exciting part of the show was that while the contestants had the chance to win great prizes, they might instead end up with “zonks” that were less desirable. This is the basis for what is now known as the Monty Hall problem.
The setting is a game show in which the contestant is faced with three closed doors. Behind one of the doors is a fancy car, and behind each of the other two there is a goat. The contestant doesn’t know where the car is, and has to attempt to find it under the following rules.
- The contestant makes an initial choice, but that door isn’t opened.
- At least one of the other two doors must have a goat behind it. Monty opens one of these doors to reveal a goat.
- There are two doors left, one of which was the contestant’s original choice. One of the doors has the car behind it, and the other one has a goat. The contestant now gets to choose which of the two doors to open.
The contestant has a decision to make. Which door should she choose to open, if she wants the car? Should she stick with her initial choice, or switch to the other door? That is the Monty Hall problem.
The Solution
In any problem involving chances, the assumptions about randomness are important. It’s reasonable to assume that there is a 1/3 chance that the contestant’s initial choice is the door that has the car behind it.
The solution to the problem is quite straightforward under this assumption, though the straightforward solution doesn’t convince everyone. Here it is anyway.
- The chance that the car is behind the originally chosen door is 1/3.
- The car is behind either the originally chosen door or the door that remains. It can’t be anywhere else.
- Therefore, the chance that the car is behind the door that remains is 2/3.
- Therefore, the contestant should switch.
That’s it. End of story.
Not convinced? Then let’s simulate the game and see how the results turn out.
Simulation
The simulation will be more complex than those we have done so far. Let’s break it down.
Step 1: What to Simulate
For each play we will simulate what’s behind all three doors:
- the one the contestant first picks
- the one that Monty opens
- the remaining door
So we will be keeping track of three quantities, not just one.
Step 2: Simulating One Play
The bulk of our work consists of simulating one play of the game. This involves several pieces.
The Goats
We start by setting up an array goats that contains unimaginative names for the two goats.
1 | # Define goats |
To help Monty conduct the game, we are going to have to identify which goat is selected and which one is revealed behind the open door. The function other_goat takes one goat and returns the other.
1 | # Define goat remain |
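The cell above is truncated; given the behavior described (and confirmed in the next cell, where 'watermelon' returns nothing), the function might look like:

```python
def other_goat(x):
    """Given one goat, return the other; any other input returns nothing."""
    if x == 'first goat':
        return 'second goat'
    elif x == 'second goat':
        return 'first goat'
```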
Let’s confirm that the function works.
1 | other_goat('first goat'), other_goat('second goat'), other_goat('watermelon') |
The string ‘watermelon’ is not the name of one of the goats, so when ‘watermelon’ is the input then other_goat does nothing.
The Options
The array hidden_behind_doors contains the set of things that could be behind the doors.
1 | # Define all possible options. |
We are now ready to simulate one play. To do this, we will define a function monty_hall_game that takes no arguments. When the function is called, it plays Monty’s game once and returns a list consisting of:
- the contestant’s guess
- what Monty reveals when he opens a door
- what remains behind the other door
The game starts with the contestant choosing one door at random. In doing so, the contestant makes a random choice from among the car, the first goat, and the second goat.
If the contestant happens to pick one of the goats, then the other goat is revealed and the car is behind the remaining door.
If the contestant happens to pick the car, then Monty reveals one of the goats and the other goat is behind the remaining door.
1 | # Define Monty Hall Game |
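A sketch of the truncated cell above, implementing the three cases described in the text; the names goats and hidden_behind_doors come from the earlier cells:

```python
import numpy as np

goats = np.array(['first goat', 'second goat'])
hidden_behind_doors = np.array(['car', 'first goat', 'second goat'])

def monty_hall_game():
    """Return [contestant's guess, what Monty reveals, what remains behind the other door]."""
    contestant_guess = np.random.choice(hidden_behind_doors)

    if contestant_guess == 'first goat':
        return [contestant_guess, 'second goat', 'car']
    if contestant_guess == 'second goat':
        return [contestant_guess, 'first goat', 'car']
    # The contestant picked the car: Monty reveals one goat at random,
    # and the other goat remains behind the unopened door.
    revealed = np.random.choice(goats)
    return [contestant_guess, revealed, other_goat(revealed)]
```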
Let’s play! Run the cell several times and see how the results change.
1 | monty_hall_game() |
Step 3: Number of Repetitions
To gauge the frequency with which the different results occur, we have to play the game many times and collect the results. Let’s run 10,000 repetitions.
Step 4: Coding the Simulation
It’s time to run the whole simulation.
We will play the game 10,000 times and collect the results in a table. Each row of the table will contain the result of one play.
One way to grow a DataFrame is to assign to a new row label. If df is a DataFrame and new_row is a list containing the entries in a new row, then df.loc[len(df)] = new_row adds the new row to the bottom of df. To add or replace a row at a specific location, run df.loc[<row_num>] = new_row.
Note that this assignment does not create a new DataFrame. It changes df to have one more row than it did before.
First let’s create a table games that has three empty columns. We can do this by just specifying a list of the column labels, as follows.
1 | # Define DataFrame games for storing the game results |
Notice that we have chosen the order of the columns to be the same as the order in which monty_hall_game returns the result of one game.
Now we can add 10,000 rows to games. Each row will represent the result of one play of Monty’s game.
1 | # Play the game 10000 times and record the results in the table games |
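A sketch of the two cells above: create the empty DataFrame, then grow it one row per play. Growing a DataFrame row by row is slow for very large simulations; collecting the rows in a list first and building the DataFrame once is a common alternative.

```python
import numpy as np
import pandas as pd

games = pd.DataFrame(columns=['Guess', 'Revealed', 'Remaining'])

for i in np.arange(10000):
    # Each play returns [guess, revealed, remaining]; add it as a new row.
    games.loc[len(games)] = monty_hall_game()
```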
1 | games |
Guess | Revealed | Remaining | |
---|---|---|---|
0 | car | first goat | second goat |
1 | second goat | first goat | car |
2 | car | first goat | second goat |
3 | car | second goat | first goat |
4 | second goat | first goat | car |
... | ... | ... | ... |
9995 | second goat | first goat | car |
9996 | first goat | second goat | car |
9997 | second goat | first goat | car |
9998 | first goat | second goat | car |
9999 | first goat | second goat | car |
The simulation is done. Notice how short the code is. The majority of the work was done in simulating the outcome of one game.
Visualization
To see whether the contestant should stick with her original choice or switch, let’s see how frequently the car is behind each of her two options.
1 | # SELECT 'Guess', COUNT(*) AS 'Original Door' FROM original_choice GROUP BY 'Guess'; |
Item | Original Door | |
---|---|---|
0 | car | 3276 |
1 | first goat | 3307 |
2 | second goat | 3417 |
1 | # SELECT 'Remaining', COUNT(*) AS 'Remaining Door' FROM original_choice GROUP BY 'Remaining'; |
Item | Remaining Door | 
---|---|---|
0 | car | 6724 |
1 | first goat | 1607 |
2 | second goat | 1669 |
As our earlier solution said, the car is behind the remaining door two-thirds of the time, to a pretty good approximation. The contestant is twice as likely to get the car if she switches than if she sticks with her original choice.
To see this graphically, we can join the two tables above and draw overlaid bar charts.
1 | # SELECT o.Item, o.'Original Door', r.'Remaining Door' FROM original_choice o INNER JOIN remaining_door r USING(Item); |
Item | Original Door | Remaining Door | 
---|---|---|---|
0 | car | 3276 | 6724 |
1 | first goat | 3307 | 1607 |
2 | second goat | 3417 | 1669 |
1 | # Bar chart |
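One way the grouping, joining, and plotting cells above might look; the specific pandas and plotly.express calls are assumptions, chosen to mirror the SQL-style comments:

```python
import plotly.express as px

# How often was each item behind the original door and behind the remaining door?
original_choice = (games.groupby('Guess').size()
                        .reset_index(name='Original Door')
                        .rename(columns={'Guess': 'Item'}))
remaining_door = (games.groupby('Remaining').size()
                       .reset_index(name='Remaining Door')
                       .rename(columns={'Remaining': 'Item'}))

joined = original_choice.merge(remaining_door, on='Item')
fig = px.bar(joined, x='Item', y=['Original Door', 'Remaining Door'], barmode='group')
fig.show()
```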
Notice how the three blue bars are almost equal – the original choice is equally likely to be any of the three available items. But the gold bar corresponding to Car is twice as long as the blue.
The simulation confirms that the contestant is twice as likely to win if she switches.
Finding Probabilities
Over the centuries, there has been considerable philosophical debate about what probabilities are. Some people think that probabilities are relative frequencies; others think they are long run relative frequencies; still others think that probabilities are a subjective measure of their own personal degree of uncertainty.
In this course, most probabilities will be relative frequencies, though many will have subjective interpretations. Regardless, the ways in which probabilities are calculated and combined are consistent across the different interpretations.
By convention, probabilities are numbers between 0 and 1, or, equivalently, 0% and 100%. Impossible events have probability 0. Events that are certain have probability 1.
Math is the main tool for finding probabilities exactly, though computers are useful for this purpose too. Simulation can provide excellent approximations, with high probability. In this section, we will informally develop a few simple rules that govern the calculation of probabilities. In subsequent sections we will return to simulations to approximate probabilities of complex events.
We will use the standard notation $P(event)$ to denote the probability that “event” happens, and we will use the words “chance” and “probability” interchangeably.
When an Event Doesn’t Happen
If the chance that event happens is 40%, then the chance that it doesn’t happen is 60%. This natural calculation can be described in general as follows:
$$P(\mbox{an event doesn’t happen}) = 1 - P(\mbox{the event happens})$$
When All Outcomes are Equally Likely
If you are rolling an ordinary die, a natural assumption is that all six faces are equally likely. Then probabilities of how one roll comes out can be easily calculated as a ratio. For example, the chance that the die shows an even number is
$$\frac{\mbox{number of even faces}}{\mbox{number of all faces}} = \frac{\#\{2, 4, 6\}}{\#\{1, 2, 3, 4, 5, 6\}} = \frac{3}{6}$$
Similarly,
$$P(\mbox{die shows a multiple of 3}) = \frac{\#\{3, 6\}}{\#\{1, 2, 3, 4, 5, 6\}} = \frac{2}{6}$$
In general,
$$P(\mbox{an event happens}) = \frac{\#\{\mbox{outcomes that make the event happen}\}}{\#\{\mbox{all outcomes}\}}$$
provided all the outcomes are equally likely.
Not all random phenomena are as simple as one roll of a die. The two main rules of probability, developed below, allow mathematicians to find probabilities even in complex situations.
When Two Events Must Both Happen
Suppose you have a box that contains three tickets: one red, one blue, and one green. Suppose you draw two tickets at random without replacement; that is, you shuffle the three tickets, draw one, shuffle the remaining two, and draw another from those two. What is the chance you get the green ticket first, followed by the red one?
There are six possible pairs of colors: RB, BR, RG, GR, BG, GB (we’ve abbreviated the names of each color to just its first letter). All of these are equally likely by the sampling scheme, and only one of them (GR) makes the event happen. So
$$P(\mbox{green first, then red}) = \frac{\#\{\mbox{GR}\}}{\#\{\mbox{RB, BR, RG, GR, BG, GB}\}} = \frac{1}{6}$$
But there is another way of arriving at the answer, by thinking about the event in two stages. First, the green ticket has to be drawn. That has chance 1/3, which means that the green ticket is drawn first in about 1/3 of all repetitions of the experiment. But that doesn’t complete the event. Among the 1/3 of repetitions when green is drawn first, the red ticket has to be drawn next. That happens in about 1/2 of those repetitions, and so:
$$P(\mbox{green first, then red}) = \frac{1}{2} \mbox{ of } \frac{1}{3} = \frac{1}{6}$$
This calculation is usually written “in chronological order,” as follows.
$$P(\mbox{green first, then red}) = \frac{1}{3} \times \frac{1}{2} = \frac{1}{6}$$
The factor of 1/2 is called “the conditional chance that the red ticket appears second, given that the green ticket appeared first.”
In general, we have the multiplication rule:
$$P\mbox{(two events both happen)} = P\mbox{(one event happens)} \times P\mbox{(the other event happens, given that the first one happened)}$$
Thus, when there are two conditions – one event must happen, as well as another – the chance is a fraction of a fraction, which is smaller than either of the two component fractions. The more conditions that have to be satisfied, the less likely they are to all be satisfied.
When an Event Can Happen in Two Different Ways
Suppose instead we want the chance that one of the two tickets is green and the other red. This event doesn’t specify the order in which the colors must appear. So they can appear in either order.
A good way to tackle problems like this is to partition the event so that it can happen in exactly one of several different ways. The natural partition of “one green and one red” is: GR, RG.
Each of GR and RG has chance 1/6 by the calculation above. So you can calculate the chance of “one green and one red” by adding them up.
$$P(\mbox{one green and one red}) = P(\mbox{GR}) + P(\mbox{RG}) = \frac{1}{6} + \frac{1}{6} = \frac{2}{6}$$
In general, we have the addition rule:
$$P\mbox{(an event happens)} = P\mbox{(first way it can happen)} + P\mbox{(second way it can happen)}$$
provided the event happens in exactly one of the two ways.
Thus, when an event can happen in one of two different ways, the chance that it happens is a sum of chances, and hence bigger than the chance of either of the individual ways.
The multiplication rule has a natural extension to more than two events, as we will see below. So also the addition rule has a natural extension to events that can happen in one of several different ways.
We end the section with examples that use combinations of all these rules.
At Least One Success
Data scientists often work with random samples from populations. A question that sometimes arises is about the likelihood that a particular individual in the population is selected to be in the sample. To work out the chance, that individual is called a “success,” and the problem is to find the chance that the sample contains a success.
To see how such chances might be calculated, we start with a simpler setting: tossing a coin two times.
If you toss a coin twice, there are four equally likely outcomes: HH, HT, TH, and TT. We have abbreviated “Heads” to H and “Tails” to T. The chance of getting at least one head in two tosses is therefore 3/4.
Another way of coming up with this answer is to work out what happens if you don’t get at least one head: both the tosses have to land tails. So
$$P(\mbox{at least one head in two tosses}) = 1 - P(\mbox{both tails}) = 1 - \frac{1}{4} = \frac{3}{4}$$
Notice also that
$$P(\mbox{both tails}) = \frac{1}{4} = \frac{1}{2} \cdot \frac{1}{2} = \left(\frac{1}{2}\right)^2$$
by the multiplication rule.
These two observations allow us to find the chance of at least one head in any given number of tosses. For example,
$$P(\mbox{at least one head in 17 tosses}) = 1 - P(\mbox{all 17 are tails}) = 1 - \left(\frac{1}{2}\right)^{17}$$
And now we are in a position to find the chance that the face with six spots comes up at least once in rolls of a die.
For example,
$$P(\mbox{a single roll is not 6}) = P(1) + P(2) + P(3) + P(4) + P(5) = \frac{5}{6}$$
Therefore,
$$P(\mbox{at least one 6 in two rolls}) = 1 - P(\mbox{both rolls are not 6}) = 1 - \left(\frac{5}{6}\right)^2$$
and
$$P(\mbox{at least one 6 in 17 rolls}) = 1 - \left(\frac{5}{6}\right)^{17}$$
The table below shows these probabilities as the number of rolls increases from 1 to 50.
1 | # Define rolls |
Rolls | Chance of at least one 6 | |
---|---|---|
0 | 1 | 0.166667 |
1 | 2 | 0.305556 |
2 | 3 | 0.421296 |
3 | 4 | 0.517747 |
4 | 5 | 0.598122 |
... | ... | ... |
45 | 46 | 0.999772 |
46 | 47 | 0.999810 |
47 | 48 | 0.999842 |
48 | 49 | 0.999868 |
49 | 50 | 0.999890 |
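A table like the one above can be computed directly from the formula 1 − (5/6)^k; a sketch:

```python
import numpy as np
import pandas as pd

rolls = np.arange(1, 51)
results = pd.DataFrame({
    'Rolls': rolls,
    'Chance of at least one 6': 1 - (5 / 6) ** rolls
})
results
```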
The chance that a 6 appears at least once rises rapidly as the number of rolls increases.
1 | # Draw scatter plot |
In 50 rolls, you are almost certain to get at least one 6.
1 | # SELECT * FROM results where Rolls=50; |
Rolls | Chance of at least one 6 | |
---|---|---|
49 | 50 | 0.99989 |
Calculations like these can be used to find the chance that a particular individual is selected in a random sample. The exact calculation will depend on the sampling scheme. But what we have observed above can usually be generalized: increasing the size of the random sample increases the chance that an individual is selected.
Sampling and Empirical Distributions
Sampling and Empirical Distributions
An important part of data science consists of making conclusions based on the data in random samples. In order to correctly interpret their results, data scientists have to first understand exactly what random samples are.
In this chapter we will take a more careful look at sampling, with special attention to the properties of large random samples.
Let’s start by drawing some samples. Our examples are based on the top_movies.csv data set.
1 | # Import dataset top_movies |
Title | Studio | Gross | Gross (Adjusted) | Year | |
---|---|---|---|---|---|
0 | Star Wars: The Force Awakens | Buena Vista (Disney) | 906723418 | 906723400 | 2015 |
1 | Avatar | Fox | 760507625 | 846120800 | 2009 |
2 | Titanic | Paramount | 658672302 | 1178627900 | 1997 |
3 | Jurassic World | Universal | 652270625 | 687728000 | 2015 |
4 | Marvel's The Avengers | Buena Vista (Disney) | 623357910 | 668866600 | 2012 |
... | ... | ... | ... | ... | ... |
195 | The Caine Mutiny | Columbia | 21750000 | 386173500 | 1954 |
196 | The Bells of St. Mary's | RKO | 21333333 | 545882400 | 1945 |
197 | Duel in the Sun | Selz. | 20408163 | 443877500 | 1946 |
198 | Sergeant York | Warner Bros. | 16361885 | 418671800 | 1941 |
199 | The Four Horsemen of the Apocalypse | MPC | 9183673 | 399489800 | 1921 |
Sampling Rows of a Table
Each row of a data table represents an individual; in top, each individual is a movie. Sampling individuals can thus be achieved by sampling the rows of a table.
The contents of a row are the values of different variables measured on the same individual. So the contents of the sampled rows form samples of values of each of the variables.
Deterministic Samples
When you simply specify which elements of a set you want to choose, without any chances involved, you create a deterministic sample.
You have done this many times, for example by using Pandas DataFrame.iloc:
1 | # Deterministic Sample |
Title | Studio | Gross | Gross (Adjusted) | Year | |
---|---|---|---|---|---|
3 | Jurassic World | Universal | 652270625 | 687728000 | 2015 |
18 | Spider-Man | Sony | 403706375 | 604517300 | 2002 |
100 | Gone with the Wind | MGM | 198676459 | 1757788200 | 1939 |
You have also used Pandas Series.str.contains:
1 | # Deterministic Sample |
Title | Studio | Gross | Gross (Adjusted) | Year | |
---|---|---|---|---|---|
22 | Harry Potter and the Deathly Hallows Part 2 | Warner Bros. | 381011219 | 417512200 | 2011 |
43 | Harry Potter and the Sorcerer's Stone | Warner Bros. | 317575550 | 486442900 | 2001 |
54 | Harry Potter and the Half-Blood Prince | Warner Bros. | 301959197 | 352098800 | 2009 |
59 | Harry Potter and the Order of the Phoenix | Warner Bros. | 292004738 | 369250200 | 2007 |
62 | Harry Potter and the Goblet of Fire | Warner Bros. | 290013036 | 393024800 | 2005 |
69 | Harry Potter and the Chamber of Secrets | Warner Bros. | 261988482 | 390768100 | 2002 |
76 | Harry Potter and the Prisoner of Azkaban | Warner Bros. | 249541069 | 349598600 | 2004 |
While these are samples, they are not random samples. They don’t involve chance.
Probability Samples
For describing random samples, some terminology will be helpful.
A population is the set of all elements from whom a sample will be drawn.
A probability sample is one for which it is possible to calculate, before the sample is drawn, the chance with which any subset of elements will enter the sample.
In a probability sample, all elements need not have the same chance of being chosen.
A Random Sampling Scheme
For example, suppose you choose two people from a population that consists of three people A, B, and C, according to the following scheme:
- Person A is chosen with probability 1.
- One of Persons B or C is chosen according to the toss of a coin: if the coin lands heads, you choose B, and if it lands tails you choose C.
This is a probability sample of size 2. Here are the chances of entry for all non-empty subsets:
1 | A: 1 |
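The cell above is truncated after the first entry. Working the rest out from the scheme (A always enters; exactly one of B or C enters, each with chance 1/2), the chances that each non-empty subset is entirely included in the sample are:
- A: 1, B: 1/2, C: 1/2
- AB: 1/2, AC: 1/2, BC: 0
- ABC: 0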
Person A has a higher chance of being selected than Persons B or C; indeed, Person A is certain to be selected. Since these differences are known and quantified, they can be taken into account when working with the sample.
A Systematic Sample
Imagine all the elements of the population listed in a sequence. One method of sampling starts by choosing a random position early in the list, and then evenly spaced positions after that. The sample consists of the elements in those positions. Such a sample is called a systematic sample.
Here we will choose a systematic sample of the rows of top. We will start by picking one of the first 10 rows at random, and then we will pick every 10th row after that.
1 | """Choose a random start among rows 0 through 9; |
Title | Studio | Gross | Gross (Adjusted) | Year | |
---|---|---|---|---|---|
4 | Marvel's The Avengers | Buena Vista (Disney) | 623357910 | 668866600 | 2012 |
14 | The Lion King | Buena Vista (Disney) | 422783777 | 775573900 | 1994 |
24 | Star Wars: Episode III - Revenge of the Sith | Fox | 380270577 | 516123900 | 2005 |
34 | The Hunger Games: Mockingjay - Part 1 | Lionsgate | 337135885 | 354324000 | 2014 |
44 | Indiana Jones and the Kingdom of the Crystal S... | Paramount | 317101119 | 384231200 | 2008 |
54 | Harry Potter and the Half-Blood Prince | Warner Bros. | 301959197 | 352098800 | 2009 |
64 | Home Alone | Fox | 285761243 | 589287500 | 1990 |
74 | Night at the Museum | Fox | 250863268 | 322261900 | 2006 |
84 | Beverly Hills Cop | Paramount | 234760478 | 584205200 | 1984 |
94 | Saving Private Ryan | Dreamworks | 216540909 | 397999500 | 1998 |
104 | Snow White and the Seven Dwarfs | Disney | 184925486 | 948300000 | 1937 | 
114 | There's Something About Mary | Fox | 176484651 | 326064000 | 1998 |
124 | The Firm | Paramount | 158348367 | 332761100 | 1993 |
134 | National Lampoon's Animal House | Universal | 141600000 | 521045300 | 1978 |
144 | The Towering Inferno | Fox | 116000000 | 533968300 | 1974 |
154 | 9 to 5 | Fox | 103290500 | 334062200 | 1980 |
164 | Young Frankenstein | Fox | 86273333 | 397131200 | 1974 |
174 | The Ten Commandments | Paramount | 65500000 | 1139700000 | 1956 |
184 | Lawrence of Arabia | Columbia | 44824144 | 481836900 | 1962 |
194 | The Best Years of Our Lives | RKO | 23650000 | 478500000 | 1946 |
Run the cell a few times to see how the output varies.
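A sketch of the sampling cell above, assuming top is the DataFrame of 200 movies:

```python
import numpy as np

# Choose a random start among rows 0 through 9, then take every 10th row after that.
start = np.random.choice(np.arange(10))
systematic_sample = top.iloc[np.arange(start, len(top), 10)]
systematic_sample
```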
This systematic sample is a probability sample. In this scheme, all rows have chance 1/10 of being chosen. For example, Row 23 is chosen if and only if Row 3 is chosen, and the chance of that is 1/10.
But not all subsets have the same chance of being chosen. Because the selected rows are evenly spaced, most subsets of rows have no chance of being chosen. The only subsets that are possible are those that consist of rows all separated by multiples of 10. Any of those subsets is selected with chance 1/10. Other subsets, like the subset containing the first 11 rows of the table, are selected with chance 0.
Random Samples Drawn With or Without Replacement
In this course, we will mostly deal with the two most straightforward methods of sampling.
The first is random sampling with replacement, which (as we have seen earlier) is the default behavior of np.random.choice when it samples from an array.
The other, called a “simple random sample”, is a sample drawn at random without replacement. Sampled individuals are not replaced in the population before the next individual is drawn. This is the kind of sampling that happens when you deal a hand from a deck of cards, for example.
In this chapter, we will use simulation to study the behavior of large samples drawn at random with or without replacement.
Drawing a random sample requires care and precision. It is not haphazard, even though that is a colloquial meaning of the word “random”. If you stand at a street corner and take as your sample the first ten people who pass by, you might think you’re sampling at random because you didn’t choose who walked by. But it’s not a random sample – it’s a sample of convenience. You didn’t know ahead of time the probability of each person entering the sample; perhaps you hadn’t even specified exactly who was in the population.
Empirical Distributions
In data science, the word “empirical” means “observed”. Empirical distributions are distributions of observed data, such as data in random samples.
In this section we will generate data and see what the empirical distribution looks like.
Our setting is a simple experiment: rolling a die multiple times and keeping track of which face appears. The table die contains the numbers of spots on the faces of a die. All the numbers appear exactly once, as we are assuming that the die is fair.
1 | # Define all possible results |
Face | |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
4 | 5 |
5 | 6 |
A Probability Distribution
The histogram below helps us visualize the fact that every face appears with probability 1/6. We say that the histogram shows the distribution of probabilities over all the possible faces. Since all the bars represent the same percent chance, the distribution is called uniform on the integers 1 through 6.
1 | # Histogram |
Variables whose successive values are separated by the same fixed amount, such as the values on rolls of a die (successive values separated by 1), fall into a class of variables that are called discrete. The histogram above is called a discrete histogram. Its bins are specified by the array die_bins and ensure that each bar is centered over the corresponding integer value.
It is important to remember that the die can’t show 1.3 spots, or 5.2 spots – it always shows an integer number of spots. But our visualization spreads the probability of each value over the area of a bar. While this might seem a bit arbitrary at this stage of the course, it will become important later when we overlay smooth curves over discrete histograms.
Before going further, let’s make sure that the numbers on the axes make sense. The probability of each face is 1/6, which is 16.67% when rounded to two decimal places. The width of each bin is 1 unit. So the height of each bar is 16.67% per unit. This agrees with the horizontal and vertical scales of the graph.
Empirical Distributions
The distribution above consists of the theoretical probability of each face. It is not based on data. It can be studied and understood without any dice being rolled.
Empirical distributions, on the other hand, are distributions of observed data. They can be visualized by empirical histograms.
Let us get some data by simulating rolls of a die. This can be done by sampling at random with replacement from the integers 1 through 6. We have used np.random.choice for such simulations before. But now we will introduce a Pandas method for doing this.
The Pandas method is called sample. It draws at random from the rows of a table: its argument is the sample size, and it returns a DataFrame consisting of the rows that were selected. By default Pandas samples without replacement; passing the optional argument replace=True draws with replacement, which is what we need for rolling a die.
Here are the results of 10 rolls of a die.
numpy.random.choice
Stop using numpy.random.seed()
NEP 19 — Random Number Generator Policy
1 | # Numpy np.random.choice |
pandas.DataFrame.sample
pandas.Series.sample
pandas.core.groupby.DataFrameGroupBy.sample
1 | # Pandas pd.DataFrame.sample |
Face | |
---|---|
4 | 5 |
5 | 6 |
1 | 2 |
0 | 1 |
5 | 6 |
4 | 5 |
3 | 4 |
5 | 6 |
5 | 6 |
5 | 6 |
We can use the same method to simulate as many rolls as we like, and then draw empirical histograms of the results. Because we are going to do this repeatedly, we define a function empirical_hist_die that takes the sample size as its argument, rolls a die that many times, and then draws a histogram of the observed results.
1 | # Define Empirical Histogram Function |
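A sketch of such a function, assuming die is the one-column DataFrame above and plotly.express is used for plotting as elsewhere in these notes:

```python
import plotly.express as px

def empirical_hist_die(n):
    # Simulate n rolls by sampling rows of `die` with replacement,
    # then draw a histogram of the observed faces, in percent.
    rolls = die.sample(n, replace=True)
    fig = px.histogram(rolls, x='Face', histnorm='percent', nbins=6)
    fig.show()
```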
Empirical Histograms
Here is an empirical histogram of 10 rolls. It doesn’t look very much like the probability histogram above. Run the cell a few times to see how it varies.
1 | # Call function, sample size=10 |
When the sample size increases, the empirical histogram begins to look more like the histogram of theoretical probabilities.
1 | # Call function, sample size=100 |
1 | # Call function, sample size=1000 |
The Law of Averages
What we have observed above is an instance of a general rule.
If a chance experiment is repeated independently and under identical conditions, then, in the long run, the proportion of times that an event occurs gets closer and closer to the theoretical probability of the event.
For example, in the long run, the proportion of times the face with four spots appears gets closer and closer to 1/6.
Here “independently and under identical conditions” means that every repetition is performed in the same way regardless of the results of all the other repetitions.
Sampling from a Population
The law of averages also holds when the random sample is drawn from individuals in a large population.
As an example, we will study a population of flight delay times. The table united contains data for United Airlines domestic flights departing from San Francisco in the summer of 2015. The data are made publicly available by the Bureau of Transportation Statistics in the United States Department of Transportation.
There are 13,825 rows, each corresponding to a flight. The columns are the date of the flight, the flight number, the destination airport code, and the departure delay time in minutes. Some delay times are negative; those flights left early.
1 | # Import dataset united_summer2015.csv |
Date | Flight Number | Destination | Delay | |
---|---|---|---|---|
0 | 6/1/15 | 73 | HNL | 257 |
1 | 6/1/15 | 217 | EWR | 28 |
2 | 6/1/15 | 237 | STL | -3 |
3 | 6/1/15 | 250 | SAN | 0 |
4 | 6/1/15 | 267 | PHL | 64 |
... | ... | ... | ... | ... |
13820 | 8/31/15 | 1978 | LAS | -4 |
13821 | 8/31/15 | 1993 | IAD | 8 |
13822 | 8/31/15 | 1994 | ORD | 3 |
13823 | 8/31/15 | 2000 | PHX | -1 |
13824 | 8/31/15 | 2013 | EWR | -2 |
One flight departed 16 minutes early, and one was 580 minutes late. The other delay times were almost all between -10 minutes and 200 minutes, as the histogram below shows.
1 | # Minimum Delay |
1 | # Maximum Delay |
1 | # Calculate bins |
bins | Delay Count | Percent % | |
---|---|---|---|
0 | [-10, 0) | 4994 | 36.123 |
1 | [0, 10) | 4059 | 29.360 |
2 | [10, 20) | 1445 | 10.452 |
3 | [20, 30) | 773 | 5.591 |
4 | [30, 40) | 590 | 4.268 |
... | ... | ... | ... |
28 | [250, 260) | 6 | 0.043 |
29 | [260, 270) | 5 | 0.036 |
30 | [290, 300) | 5 | 0.036 |
31 | [230, 240) | 3 | 0.022 |
32 | [280, 290) | 1 | 0.007 |
1 | # Histogram |
For the purposes of this section, it is enough to zoom in on the bulk of the data and ignore the 0.8% of flights that had delays of more than 200 minutes. This restriction is just for visual convenience; the table still retains all the data.
1 | # Percent of flights delayed more than 200 minutes |
1 | # Calculate bins |
bins | Delay Count | Percent % | |
---|---|---|---|
0 | [-10, 0) | 4994 | 36.123 |
1 | [0, 10) | 4059 | 29.360 |
2 | [10, 20) | 1445 | 10.452 |
3 | [20, 30) | 773 | 5.591 |
4 | [30, 40) | 590 | 4.268 |
... | ... | ... | ... |
17 | [150, 160) | 32 | 0.231 |
18 | [160, 170) | 25 | 0.181 |
19 | [170, 180) | 22 | 0.159 |
20 | [180, 190) | 22 | 0.159 |
21 | [190, 200) | 19 | 0.137 |
1 | # Histogram |
The height of the [0, 10) bar is just under 3% per minute, which means that just under 30% of the flights had delays between 0 and 10 minutes. That is confirmed by counting rows:
1 | # Percent of Flight delay in [0, 10) |
Empirical Distribution of the Sample
Let us now think of the 13,825 flights as a population, and draw random samples from it with replacement. It is helpful to package our code into a function. The function empirical_hist_delay takes the sample size as its argument and draws an empirical histogram of the results.
1 | # Define Empirical Histogram Function |
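A sketch of the function, under the same assumptions (united is the flights DataFrame, plotly.express for plotting, and the view restricted to delays below 200 minutes):

```python
import plotly.express as px

def empirical_hist_delay(n):
    # Draw a random sample of n flights with replacement and plot the
    # empirical histogram of their delays, zoomed in on [-10, 200).
    sample = united.sample(n, replace=True)
    fig = px.histogram(sample, x='Delay', histnorm='percent', range_x=[-10, 200])
    fig.show()
```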
As we saw with the dice, as the sample size increases, the empirical histogram of the sample more closely resembles the histogram of the population. Compare these histograms to the population histogram above.
1 | empirical_hist_delay(10) |
1 | empirical_hist_delay(100) |
The most consistently visible discrepancies are among the values that are rare in the population. In our example, those values are in the right hand tail of the distribution. But as the sample size increases, even those values begin to appear in the sample in roughly the correct proportions.
1 | empirical_hist_delay(1000) |
Convergence of the Empirical Histogram of the Sample
What we have observed in this section can be summarized as follows:
For a large random sample, the empirical histogram of the sample resembles the histogram of the population, with high probability.
This justifies the use of large random samples in statistical inference. The idea is that since a large random sample is likely to resemble the population from which it is drawn, quantities computed from the values in the sample are likely to be close to the corresponding quantities in the population.
Empirical Distribution of a Statistic
The Law of Averages implies that with high probability, the empirical distribution of a large random sample will resemble the distribution of the population from which the sample was drawn.
The resemblance is visible in two histograms: the empirical histogram of a large random sample is likely to resemble the histogram of the population.
As a reminder, here is the histogram of the delays of all the flights in united, and an empirical histogram of the delays of a random sample of 1,000 of these flights.
1 | # Import dataset united_summer2015.csv |
Date | Flight Number | Destination | Delay | |
---|---|---|---|---|
0 | 6/1/15 | 73 | HNL | 257 |
1 | 6/1/15 | 217 | EWR | 28 |
2 | 6/1/15 | 237 | STL | -3 |
3 | 6/1/15 | 250 | SAN | 0 |
4 | 6/1/15 | 267 | PHL | 64 |
... | ... | ... | ... | ... |
13820 | 8/31/15 | 1978 | LAS | -4 |
13821 | 8/31/15 | 1993 | IAD | 8 |
13822 | 8/31/15 | 1994 | ORD | 3 |
13823 | 8/31/15 | 2000 | PHX | -1 |
13824 | 8/31/15 | 2013 | EWR | -2 |
1 | # Calculate bins |
bins | Delay Count | Percent % | |
---|---|---|---|
0 | [-10, 0) | 4994 | 36.429 |
1 | [0, 10) | 4059 | 29.608 |
2 | [10, 20) | 1445 | 10.541 |
3 | [20, 30) | 773 | 5.639 |
4 | [30, 40) | 590 | 4.304 |
... | ... | ... | ... |
17 | [150, 160) | 32 | 0.233 |
18 | [160, 170) | 25 | 0.182 |
19 | [170, 180) | 22 | 0.160 |
20 | [180, 190) | 22 | 0.160 |
21 | [190, 200) | 19 | 0.139 |
1 | # Histogram - Population |
1 | empirical_hist_delay(1000) |
The two histograms clearly resemble each other, though they are not identical.
Parameter
Frequently, we are interested in numerical quantities associated with a population.
- In a population of voters, what percent will vote for Candidate A?
- In a population of Facebook users, what is the largest number of Facebook friends that the users have?
- In a population of United flights, what is the median departure delay?
Numerical quantities associated with a population are called parameters. For the population of flights in united, we know the value of the parameter “median delay”:
1 | # Median of Delay |
The NumPy function median returns the median (half-way point) of an array. Among all the flights in united, the median delay was 2 minutes. That is, about 50% of flights in the population had delays of 2 or fewer minutes:
1 | # Percentage of delays less than or equal to 2.0 |
Half of all flights left no more than 2 minutes after their scheduled departure time. That’s a very short delay!
Note. The percent isn’t exactly 50 because of “ties,” that is, flights that had delays of exactly 2 minutes. There were 480 such flights. Ties are quite common in data sets, and we will not worry about them in this course.
1 | # Number of delays equal to 2.0 |
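A sketch of the three computations above (the median, the proportion at or below 2 minutes, and the number of ties), assuming united is the flights DataFrame:

```python
import numpy as np

np.median(united['Delay'])                            # the parameter: 2.0

np.count_nonzero(united['Delay'] <= 2) / len(united)  # proportion with delay <= 2 minutes

np.count_nonzero(united['Delay'] == 2)                # flights delayed exactly 2 minutes
```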
Statistic
In many situations, we will be interested in figuring out the value of an unknown parameter. For this, we will rely on data from a large random sample drawn from the population.
A statistic (note the singular!) is any number computed using the data in a sample. The sample median, therefore, is a statistic.
Remember that sample_1000 contains a random sample of 1000 flights from united. The observed value of the sample median is:
1 | # Define sample_1000 |
1 | # Median of another 1000 sample from united |
Run the cell above a few times to see how the answer varies. Often it is equal to 2, the same value as the population parameter. But sometimes it is different.
Just how different could the statistic have been? One way to answer this is to simulate the statistic many times and note the values. A histogram of those values will tell us about the distribution of the statistic.
Let’s recall the main steps in a simulation.
Simulating a Statistic
We will simulate the sample median using the steps we set up in an earlier chapter when we started studying simulation. You can replace the sample size of 1000 by any other sample size, and the sample median by any other statistic.
Step 1: Decide which statistic to simulate. We have already decided that: we are going to simulate the median of a random sample of size 1000 drawn from the population of flight delays.
Step 2: Write the code to generate one value of the statistic. Draw a random sample of size 1000 and compute the median of the sample. We did this in the code cell above. Here it is again, encapsulated in a function.
1 | # Define function to generate one value of the statistic |
Step 3: Decide how many simulated values to generate. Let’s do 5,000 repetitions.
Step 4: Write the code to generate an array of simulated values. As in all simulations, we start by creating an empty array in which we will collect our results. We will then set up a for loop for generating all the simulated values. The body of the loop will consist of generating one simulated value of the sample median, and appending it to our collection array.
The simulation takes a noticeable amount of time to run. That is because it is performing 5000 repetitions of the process of drawing a sample of size 1000 and computing its median. That’s a lot of sampling and repeating!
1 | # Create Array medians to store repetitions |
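A sketch of Steps 2 through 4 in code; the helper name random_sample_median is illustrative:

```python
import numpy as np

def random_sample_median():
    # One simulated value: the median delay in a random sample of 1000 flights.
    return np.median(united.sample(1000, replace=True)['Delay'])

medians = np.array([])                   # collection array
for i in np.arange(5000):                # 5,000 repetitions
    medians = np.append(medians, random_sample_median())
```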
The simulation is done. All 5,000 simulated sample medians have been collected in the array medians. Now it’s time to visualize the results.
Visualization
Here are the simulated random sample medians displayed in the DataFrame simulated_medians.
1 | # Create a DataFrame |
Sample Median | |
---|---|
0 | 3.0 |
1 | 3.0 |
2 | 2.0 |
3 | 3.0 |
4 | 3.0 |
... | ... |
4995 | 2.0 |
4996 | 3.0 |
4997 | 2.0 |
4998 | 2.0 |
4999 | 3.0 |
We can also visualize the simulated data using a histogram. The histogram is called an empirical histogram of the statistic. It displays the empirical distribution of the statistic. Remember that empirical means observed.
1 | # Histogram |
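One way the DataFrame and its empirical histogram might be produced, reusing the medians array from the sketch above (the bin count is an arbitrary choice):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Collect the simulated medians in a DataFrame and draw the empirical histogram
simulated_medians = pd.DataFrame({'Sample Median': medians})
simulated_medians['Sample Median'].plot.hist(bins=20)
plt.xlabel('Sample Median')
plt.show()
```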
You can see that the sample median is very likely to be about 2, which was the value of the population median. Since samples of 1000 flight delays are likely to resemble the population of delays, it is not surprising that the median delays of those samples should be close to the median delay in the population.
This is an example of how a statistic can provide a good estimate of a parameter.
The Power of Simulation
If we could generate all possible random samples of size 1000, we would know all possible values of the statistic (the sample median), as well as the probabilities of all those values. We could visualize all the values and probabilities in the probability histogram of the statistic.
But in many situations including this one, the number of all possible samples is large enough to exceed the capacity of the computer, and purely mathematical calculations of the probabilities can be intractably difficult.
This is where empirical histograms come in.
We know that by the Law of Averages, the empirical histogram of the statistic is likely to resemble the probability histogram of the statistic, if the sample size is large and if you repeat the random sampling process numerous times.
This means that simulating random processes repeatedly is a way of approximating probability distributions without figuring out the probabilities mathematically or generating all possible random samples. Thus computer simulations become a powerful tool in data science. They can help data scientists understand the properties of random quantities that would be complicated to analyze in other ways.
Testing Hypotheses
Testing Hypotheses
Program: Data8-11.ipynb
Python for Data 24: Hypothesis Testing
Data scientists are often faced with yes-no questions about the world. You have seen some examples of such questions in this course:
- Is chocolate good for you?
- Did water from the Broad Street pump cause cholera?
- Have the demographics in California changed over the past decade?
Whether we can answer questions like these depends on the data we have. Census data about California can settle questions about demographics with hardly any uncertainty about the answer. We know that Broad Street pump water was contaminated by waste from cholera victims, so we can make a pretty good guess about whether it caused cholera.
Whether chocolate or any other treatment is good for you will almost certainly have to be decided by medical experts, but an initial step consists of using data science to analyze data from studies and randomized experiments.
In this chapter, we will try to answer such yes-no questions, basing our conclusions on random samples and empirical distributions.
Assessing Models
In data science, a “model” is a set of assumptions about data. Often, models include assumptions about chance processes used to generate data.
Sometimes, data scientists have to decide whether or not their models are good. In this section we will discuss two examples of making such decisions. In later sections we will use the methods developed here as the building blocks of a general framework for testing hypotheses.
Jury Selection
Amendment VI of the United States Constitution states that, “In all criminal prosecutions, the accused shall enjoy the right to a speedy and public trial, by an impartial jury of the State and district wherein the crime shall have been committed.” One characteristic of an impartial jury is that it should be selected from a jury panel that is representative of the population of the relevant region. The jury panel is the group of people from which jurors are selected.
The question of whether a jury panel is indeed representative of a region’s population has an important legal implication: one could question whether a particular jury is impartial if some group from the population was systematically underrepresented on the jury panel.
Let’s consider a hypothetical county containing two cities: A and B. Let’s say that 26% of all eligible jurors live in A. Imagine there is a trial, and only 8 of the 100 people (8%) selected for the jury panel are from A. The fairness of this discrepancy could certainly be called into question, especially if the accused is from A.
One might assert that the difference between 26% and 8% is small and might result from chance alone, rather than a systemic bias against selecting jurors from A. But is this assertion reasonable? If jury panelists were selected at random from the county’s eligible population, there would not be exactly 26 jurors from A on every 100-person jury panel, but would a count as low as 8 be plausible?
A Model
One view of the data – a model, in other words – is that the panel was selected at random and ended up with a small number of jurors from A just due to chance.
The model specifies the details of a chance process. It says the data are like a random sample from a population in which 26% of the people are from A. We are in a good position to assess this model, because:
- We can simulate data based on the model. That is, we can simulate drawing at random from a population of whom 26% are from A.
- Our simulation will show what a panel would look like if it were selected at random.
- We can then compare the results of the simulation with the composition of an actual jury panel.
- If the results of our simulation are not consistent with the composition of the panel, that will be evidence against the model of random selection, and therefore evidence against the fairness of the trial itself.
Let’s go through the process in detail.
The Statistic
First, we have to choose a statistic to simulate. The statistic has to be able to help us decide between the model and alternative views about the data. The model says the panel was drawn at random. The alternative viewpoint is that the panel was not drawn at random because it contained too few jurors from A. A natural statistic, then, is the number of panelists from A. Small values of the statistic will favor the alternative viewpoint.
Predicting the Statistic Under the Model
If the model were true, how big would the statistic typically be? To answer that, we have to start by working out the details of the simulation.
Generating One Value of the Statistic
First let’s figure out how to simulate one value of the statistic. For this, we have to sample 100 times at random from the population of eligible jurors and count the number of people from A we get.
One way is to set up a table representing the eligible population and use sample as we did in the previous chapter. But there is also a quicker way, using the np.random.multinomial function from NumPy, which is tailored for sampling at random from a categorical distribution. We will use it several times in this chapter.
The np.random.multinomial function takes two required arguments and one optional argument:
- n: the number of experiments (here, the number of draws in the sample).
- pvals: the probabilities of each of the p different outcomes. These should sum to 1 (the last element is assumed to account for the remaining probability, as long as sum(pvals[:-1]) <= 1).
- size: the output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. The default is None, in which case a single sample is returned.
It returns an array containing the number of draws that fell in each category of the random sample. Dividing those counts by the sample size gives the sample proportions in all the different categories.
To see how to use this, remember that according to our model, the panel is selected at random from a population of eligible jurors among whom 26% were from A and 74% were from B. Thus the distribution of the two categories can be represented as the list [0.26, 0.74], which we have assigned to the name eligible_population. Now let’s sample at random 100 times from this distribution, and see what counts and proportions of the two categories we get in our sample.
1 | # Define the eligible proportion |
That was easy! In this random sample, 21 of the 100 panelists are from A and 79 are from B, so the count from A is 21 (a proportion of 0.21).
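A sketch of how this single draw might look in code; the variable names counts, count_A, and proportions are illustrative:

```python
import numpy as np

# Model: 26% of eligible jurors are from A, 74% from B
eligible_population = [0.26, 0.74]

# One random panel of 100: multinomial returns the count in each category
counts = np.random.multinomial(100, eligible_population)
count_A = counts[0]          # number of panelists from A
proportions = counts / 100   # sample proportions in the two categories
count_A, proportions
```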
Running the Simulation
To get a sense of the variability without running the cell over and over, let’s generate 10,000 simulated values of the count.
The code follows the same steps that we have used in every simulation. First, we define a function to simulate one value of the count, using the code we wrote above.
1 | # Define one_simulated_count, which generates one simulated count of panelists from A |
Next, we create an array of 10,000 simulated counts by using a for loop.
1 | # Repeat simulation 10,000 times |
A | |
---|---|
0 | 34.0 |
1 | 24.0 |
2 | 24.0 |
3 | 23.0 |
4 | 28.0 |
... | ... |
9995 | 21.0 |
9996 | 27.0 |
9997 | 33.0 |
9998 | 32.0 |
9999 | 26.0 |
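A DataFrame like the one shown above might be built as follows; this is a sketch reusing eligible_population from the earlier code, and the name simulated_counts is an assumption:

```python
import numpy as np
import pandas as pd

def one_simulated_count():
    """Return the number of panelists from A in one random panel of 100."""
    return np.random.multinomial(100, eligible_population)[0]

# Repeat the simulation 10,000 times and collect the counts in a DataFrame
counts = np.array([one_simulated_count() for _ in range(10000)])
simulated_counts = pd.DataFrame({'A': counts})
```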
The Prediction
To interpret the results of our simulation, we start as usual by visualizing the results by an empirical histogram.
1 | # Histogram |
Comparing the Prediction and the Data
Though the simulated counts are quite varied, very few of them came out to be eight or less. The value eight is far out in the left-hand tail of the histogram. It’s the red dot on the horizontal axis of the histogram.
1 | # Histogram |
The simulation shows that if we select a panel of 100 jurors at random from the eligible population, we are very unlikely to get counts of jurors from A as low as the eight that we observed on the jury panel. This is evidence that the model of random selection of the jurors in the panel is not consistent with the data from the panel.
When the data and a model are inconsistent, the model is hard to justify. After all, the data are real. The model is just a set of assumptions. When assumptions are at odds with reality, we have to question those assumptions.
While it is possible that the panel could have been generated by chance, our simulation demonstrates that it is very unlikely. Reality is very much at odds with the model assumptions, so the most reasonable conclusion is that the assumptions are wrong. This jury panel was not selected by random sampling, but instead by some process with systemic bias, and the difference between 26% and 8% is not so small as to be explained well by chance alone.
This method of assessing models is very general. Here is an example in which we use it to assess a model in a completely different setting.
Mendel’s Pea Flowers
Gregor Mendel (1822-1884) was an Austrian monk who is widely recognized as the founder of the modern field of genetics. Mendel performed careful and large-scale experiments on plants to come up with fundamental laws of genetics.
Many of his experiments were on varieties of pea plants. He formulated sets of assumptions about each variety; these were his models. He then tested the validity of his models by growing the plants and gathering data.
Let’s analyze the data from one such experiment to see if Mendel’s model was good.
In a particular variety, each plant has either purple flowers or white. The color in each plant is unaffected by the colors in other plants. Mendel hypothesized that the plants should bear purple or white flowers at random, in the ratio 3:1.
Mendel’s Model
For every plant, there is a 75% chance that it will have purple flowers, and a 25% chance that the flowers will be white, regardless of the colors in all the other plants.
Approach to Assessment
To go about assessing Mendel’s model, we can simulate plants under the assumptions of the model and see what it predicts. Then we will be able to compare the predictions with the data that Mendel recorded.
The Statistic
Our goal is to see whether or not Mendel’s model is good. We need to simulate a statistic that will help us make this decision.
If the model is good, the percent of purple-flowering plants in the sample should be close to 75%. If the model is not good, the percent of purple-flowering plants will be further away from 75%. It may be higher or lower; the direction doesn’t matter.
The key for us is the distance between 75% and the percent of purple-flowering plants in the sample. Big distances are evidence that the model isn’t good.
Our statistic, therefore, is the distance between the sample percent and 75%:
$$|\mbox{sample percent of purple-flowering plants} - 75|$$
Predicting the Statistic Under the Model
To see how big the distance would be if Mendel’s model were true, we can use np.random.multinomial
to simulate the distance under the assumptions of the model.
First, we have to figure out how many times to sample. To do this, remember that we are going to compare our simulation with Mendel’s plants. So we should simulate the same number of plants that he had.
Mendel grew a lot of plants. There were 929 plants of the variety corresponding to this model. So we have to sample 929 times.
Generating One Value of the Statistic
The steps in the calculation:
- Sample 929 times at random from the distribution specified by the model and find the sample proportion in the purple-flowering category.
- Multiply the proportion by 100 to get a percent.
- Subtract 75 and take the absolute value of the difference.
That’s the statistic: the distance between the sample percent and 75.
We will start by defining a function that takes a proportion and returns the absolute difference between the corresponding percent and 75.
1 | # Define Distance function |
To simulate one value of the distance between the sample percent of purple-flowering plants and 75%, under the assumptions of Mendel’s model, we have to first simulate the proportion of purple-flowering plants among 929 plants under the assumptions of the model, and then calculate the discrepancy from 75%.
1 | # Define sample proportions |
1 | # Run simulation once |
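A minimal sketch of these two cells; model_proportions is an illustrative name, while distance_from_75 matches the function name used later in the text:

```python
import numpy as np

# Mendel's model: 75% purple-flowering, 25% white-flowering
model_proportions = [0.75, 0.25]

def distance_from_75(p):
    """Absolute difference between a proportion (expressed as a percent) and 75."""
    return abs(100 * p - 75)

# Simulate the proportion of purple-flowering plants among 929 plants
# under the model, then measure its distance from 75%
counts = np.random.multinomial(929, model_proportions)
proportion_purple = counts[0] / 929
distance_from_75(proportion_purple)
```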
That’s one simulated value of the distance between the sample percent of purple-flowering plants and 75% as predicted by Mendel’s model.
Running the Simulation
To get a sense of how variable the distance could be, we have to simulate it many more times.
We will generate 10,000 values of the distance. As before, we will first use the code we developed above to define a function that returns one simulated value of the distance under Mendel’s hypothesis.
1 | # Define one value |
Next, we will use a for loop to create 10,000 such simulated distances.
1 | # Simulation |
Distance | |
---|---|
0 | 0.134553 |
1 | 1.103337 |
2 | 1.749193 |
3 | 2.233584 |
4 | 0.941873 |
... | ... |
9995 | 0.618945 |
9996 | 2.664155 |
9997 | 0.834230 |
9998 | 2.717976 |
9999 | 0.941873 |
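A compact sketch of how the 10,000 distances might be collected, reusing model_proportions and distance_from_75 from the sketch above (one_simulated_distance and simulated_distances are illustrative names):

```python
import numpy as np
import pandas as pd

def one_simulated_distance():
    """One simulated value of |sample percent purple - 75| under Mendel's model."""
    proportion_purple = np.random.multinomial(929, model_proportions)[0] / 929
    return distance_from_75(proportion_purple)

# 10,000 simulated distances
distances = np.array([one_simulated_distance() for _ in range(10000)])
simulated_distances = pd.DataFrame({'Distance': distances})
```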
The Prediction
The empirical histogram of the simulated values shows the distribution of the distance as predicted by Mendel’s model.
1 | # Histogram |
Look on the horizontal axis to see the typical values of the distance, as predicted by the model. They are rather small. For example, a high proportion of the distances are in the range 0 to 1, meaning that for a high proportion of the samples, the percent of purple-flowering plants is within 1% of 75%, that is, the sample percent is in the range 74% to 76%.
Comparing the Prediction and the Data
To assess the model, we have to compare this prediction with the data. Mendel recorded the number of purple and white flowering plants. Among the 929 plants that he grew, 705 were purple flowering. That’s just about 75.89%.
1 | 705/929 |
So the observed value of our statistic – the distance between Mendel’s sample percent and 75 – is about 0.89:
1 | observed_statistic = distance_from_75(705/929) |
Just by eye, locate roughly where 0.89 is on the horizontal axis of the histogram. You will see that it is clearly in the heart of the distribution predicted by Mendel’s model.
The cell below redraws the histogram with the observed value plotted on the horizontal axis.
1 | # Histogram |
The observed statistic is like a typical distance predicted by the model. By this measure, the data are consistent with the histogram that we generated under the assumptions of Mendel’s model. This is evidence in favor of the model.
Multiple Categories
We have developed a way of assessing models about chance processes that generate data in two categories. The method extends to models involving data in multiple categories. The process of assessment is the same as before, the only difference being that we have to come up with a new statistic to simulate.
Let’s do this in an example that addresses the same kind of question that was raised in the jury panel example earlier in this chapter. This time, the data are more recent.
Jury Selection in Alameda County
In 2010, the American Civil Liberties Union (ACLU) of Northern California presented a report on jury selection in Alameda County, California. The report concluded that certain ethnic groups are underrepresented among jury panelists in Alameda County, and suggested some reforms of the process by which eligible jurors are assigned to panels. In this section, we will perform our own analysis of the data and examine some questions that arise as a result.
Some details about jury panels and juries will be helpful in interpreting the results of our analysis.
Jury Panels
A jury panel is a group of people chosen to be prospective jurors; the final trial jury is selected from among them. Jury panels can consist of a few dozen people or several thousand, depending on the trial. By law, a jury panel is supposed to be representative of the community in which the trial is taking place. Section 197 of California’s Code of Civil Procedure says, “All persons selected for jury service shall be selected at random, from a source or sources inclusive of a representative cross section of the population of the area served by the court.”
The final jury is selected from the panel by deliberate inclusion or exclusion. The law allows potential jurors to be excused for medical reasons; lawyers on both sides may strike a certain number of potential jurors from the list in what are called “peremptory challenges”; the trial judge might make a selection based on questionnaires filled out by the panel; and so on. But the initial panel is supposed to resemble a random sample of the population of eligible jurors.
Composition of Panels in Alameda County
The focus of the study by the ACLU of Northern California was the ethnic composition of jury panels in Alameda County. The ACLU compiled data on the ethnic composition of the jury panels in 11 felony trials in Alameda County in the years 2009 and 2010. In those panels, the total number of people who reported for jury service was 1,453. The ACLU gathered demographic data on all of these prospective jurors, and compared those data with the composition of all eligible jurors in the county.
The data are tabulated below in a table called jury. For each ethnicity, the first value is the proportion of all eligible juror candidates of that ethnicity. The second value is the proportion of people of that ethnicity among those who appeared for the process of selection into the jury.
1 | # Import dataset |
Ethnicity | Eligible | Panels | |
---|---|---|---|
0 | Asian | 0.15 | 0.26 |
1 | Black | 0.18 | 0.08 |
2 | Latino | 0.12 | 0.08 |
3 | White | 0.54 | 0.54 |
4 | Other | 0.01 | 0.04 |
Some ethnicities are overrepresented and some are underrepresented on the jury panels in the study. A bar chart is helpful for visualizing the differences.
1 | # Bar chart |
Comparison with Panels Selected at Random
What if we select a random sample of 1,453 people from the population of eligible jurors? Will the distribution of their ethnicities look like the distribution of the panels above?
We can answer these questions by using np.random.multinomial and augmenting the jury table with a column of the proportions in our sample.
Technical note. Random samples of prospective jurors would be selected without replacement. However, when the size of a sample is small relative to the size of the population, sampling without replacement resembles sampling with replacement; the proportions in the population don’t change much between draws. The population of eligible jurors in Alameda County is over a million, and compared to that, a sample size of about 1500 is quite small. We will therefore sample with replacement.
In the cell below, we sample at random 1453 times from the distribution of eligible jurors, and display the distribution of the random sample along with the distributions of the eligible jurors and the panel in the data.
1 | # Grab the proportion of Eligible |
Ethnicity | Eligible | Panels | Random Sample | |
---|---|---|---|---|
0 | Asian | 0.15 | 0.26 | 0.143152 |
1 | Black | 0.18 | 0.08 | 0.192017 |
2 | Latino | 0.12 | 0.08 | 0.128699 |
3 | White | 0.54 | 0.54 | 0.525809 |
4 | Other | 0.01 | 0.04 | 0.010323 |
The distribution of the Random Sample is quite close to the distribution of the Eligible population, unlike the distribution of the Panels.
As always, it helps to visualize.
1 | # Bar chart |
The bar chart shows that the distribution of the random sample resembles the eligible population but the distribution of the panels does not.
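A sketch of how the Random Sample column and a grouped bar chart like this might be produced, assuming jury is the pandas DataFrame shown above (jury_with_sample is an illustrative name):

```python
import numpy as np
import matplotlib.pyplot as plt

# Distribution of the eligible jurors, in the same row order as `jury`
eligible_population = jury['Eligible'].values

# One random sample of 1,453 prospective jurors, converted to proportions
sample_proportions = np.random.multinomial(1453, eligible_population) / 1453
jury_with_sample = jury.copy()
jury_with_sample['Random Sample'] = sample_proportions

# Grouped bar chart of the three distributions
jury_with_sample.plot.bar(x='Ethnicity', y=['Eligible', 'Panels', 'Random Sample'])
plt.show()
```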
To assess whether this observation is particular to one random sample or more general, we can simulate multiple panels under the model of random selection and see what the simulations predict. But we won’t be able to look at thousands of bar charts like the one above. We need a statistic that will help us assess whether or not the model of random selection is supported by the data.
A New Statistic: The Distance between Two Distributions
We know how to measure how different two numbers are – if the numbers are $x$ and $y$, the distance between them is $|x - y|$. Now we have to quantify the distance between two distributions. For example, we have to measure the distance between the blue and gold distributions below.
1 | # Bar chart |
For this we will compute a quantity called the total variation distance between two distributions. The calculation is an extension of the calculation of the distance between two numbers.
To compute the total variation distance, we first take the difference between the two proportions in each category.
1 | # Augment the table with a column of differences between proportions |
Ethnicity | Eligible | Panels | Difference | |
---|---|---|---|---|
0 | Asian | 0.15 | 0.26 | 0.11 |
1 | Black | 0.18 | 0.08 | -0.10 |
2 | Latino | 0.12 | 0.08 | -0.04 |
3 | White | 0.54 | 0.54 | 0.00 |
4 | Other | 0.01 | 0.04 | 0.03 |
Take a look at the column Difference and notice that the sum of its entries is 0: the positive entries add up to 0.14 (0.11 + 0.03), exactly canceling the total of the negative entries, which is -0.14 (-0.10 - 0.04).
This is numerical evidence of the fact that in the bar chart, the gold bars exceed the blue bars by exactly as much as the blue bars exceed the gold. The proportions in each of the two columns Panels and Eligible add up to 1, and so the give-and-take between their entries must add up to 0.
To avoid the cancellation, we drop the negative signs and then add all the entries. But this gives us two times the total of the positive entries (equivalently, two times the total of the negative entries, with the sign removed). So we divide the sum by 2.
1 | # Add Absolute Difference |
Ethnicity | Eligible | Panels | Difference | Absolute Difference | |
---|---|---|---|---|---|
0 | Asian | 0.15 | 0.26 | 0.11 | 0.11 |
1 | Black | 0.18 | 0.08 | -0.10 | 0.10 |
2 | Latino | 0.12 | 0.08 | -0.04 | 0.04 |
3 | White | 0.54 | 0.54 | 0.00 | 0.00 |
4 | Other | 0.01 | 0.04 | 0.03 | 0.03 |
1 | jury_with_diffs['Absolute Difference'].sum() / 2 |
This quantity 0.14 is the total variation distance (TVD) between the distribution of ethnicities in the eligible juror population and the distribution in the panels.
We could have obtained the same result by just adding the positive differences. But our method of including all the absolute differences eliminates the need to keep track of which differences are positive and which are not.
Simulating One Value of the Statistic
We will use the total variation distance between distributions as the statistic to simulate. It will help us decide whether the model of random selection is good, because large values of the distance will be evidence against the model.
Keep in mind that the observed value of our statistic is 0.14, calculated above.
Since we are going to be computing total variation distance repeatedly, we will write a function to compute it.
The function total_variation_distance returns the TVD between the distributions in two arrays.
1 | # Define function total_variation_distance that returns the TVD between distributions in two arrays |
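A minimal sketch of such a function; the signature follows the description above, and the body is one possible implementation:

```python
import numpy as np

def total_variation_distance(distribution_1, distribution_2):
    """Total variation distance between two distributions over the same categories."""
    return np.abs(np.asarray(distribution_1) - np.asarray(distribution_2)).sum() / 2
```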
This function will help us calculate our statistic in each repetition of the simulation. But first, let’s check that it gives the right answer when we use it to compute the distance between the blue (eligible) and gold (panels) distributions above.
1 | total_variation_distance(jury.Panels, jury.Eligible) |
This agrees with the value that we computed directly without using the function.
In the cell below we use the function to compute the TVD between the distributions of the eligible jurors and one random sample. This is the code for simulating one value of our statistic. Recall that eligible_population is the array containing the distribution of the eligible jurors.
1 | # Get the TVD the TVD between the distributions of the eligible jurors and one random sample |
Notice that the distance is quite a bit smaller than 0.14, the distance between the distribution of the panels and the eligible jurors.
We are now ready to run a simulation to assess the model of random selection.
Predicting the Statistic Under the Model of Random Selection
The total variation distance between the distributions of the random sample and the eligible jurors is the statistic that we are using to measure the distance between the two distributions. By repeating the process of sampling, we can see how much the statistic varies across different random samples.
The code below simulates the statistic based on a large number of replications of the random sampling process, following our usual sequence of steps for simulation. We first define a function that returns one simulated value of the total variation distance under the hypothesis of random selection. Then we use our function in a for loop to create an array tvds
consisting of 5,000 such distances.
1 | # Define a function that returns one simulated value of the TVD |
1 | # Define tvds for storing simulation results |
TVD | |
---|---|
0 | 0.024845 |
1 | 0.012725 |
2 | 0.017543 |
3 | 0.017096 |
4 | 0.010943 |
... | ... |
4995 | 0.014783 |
4996 | 0.015175 |
4997 | 0.006118 |
4998 | 0.016889 |
4999 | 0.007213 |
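A sketch of how this simulation might be coded, reusing eligible_population and the total_variation_distance function sketched above (one_simulated_tvd and simulated_tvds are illustrative names; tvds is the array named in the text):

```python
import numpy as np
import pandas as pd

def one_simulated_tvd():
    """TVD between the eligible distribution and one random sample of 1,453 jurors."""
    sample_proportions = np.random.multinomial(1453, eligible_population) / 1453
    return total_variation_distance(sample_proportions, eligible_population)

# 5,000 simulated total variation distances
tvds = np.array([one_simulated_tvd() for _ in range(5000)])
simulated_tvds = pd.DataFrame({'TVD': tvds})
```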
The empirical histogram of the simulated distances shows that drawing 1453 jurors at random from the pool of eligible candidates results in a distribution that rarely deviates from the eligible jurors’ race distribution by more than about 0.05.
1 | # Histogram |
Assessing the Model of Random Selection
The panels in the study, however, were not quite so similar to the eligible population. The total variation distance between the panels and the population was 0.14, which is far out in the tail of the histogram above