Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter, 3rd Edition
This book teaches you what data analysis is and how to manipulate, process, clean, and crunch data in Python. It is a guide to the parts of the Python programming language, its data-oriented library ecosystem, and the tools you need to become an effective data analyst. The open-source Python ecosystem for data analysis (or data science) has matured over the years and is supported by a large, active scientific computing and data analysis community. Python is now one of the most important languages for data science, machine learning, and general software development, and in recent years improved open-source libraries such as pandas and scikit-learn have made it a popular choice for data analysis tasks. Here is a summary of the topics covered in this book:
- Use the Jupyter notebook and IPython shell for exploratory computing
- Learn basic and advanced features in NumPy
- Get started with data analysis tools in the pandas library
- Use flexible tools to load, clean, transform, merge, and reshape data
- Create informative visualizations with matplotlib
- Apply the pandas groupby facility to slice, dice, and summarize datasets
- Analyze and manipulate regular and irregular time series data
- Learn how to solve real-world data analysis problems with thorough, detailed examples
Here are the details of the topics covered in this book:
Why Python for Data Analysis?
Python Libraries (NumPy, pandas, matplotlib, IPython and Jupyter, SciPy, scikit-learn, statsmodels)
Installation (Miniconda on Windows, GNU/Linux, and macOS)
Python Language Basics, IPython, and Jupyter Notebook
Language Semantics
Scalar Types
Control Flow
Data Structures and Sequences
Tuple
List
Dictionary
Set
Built-In Sequence Functions
List, Set, and Dictionary Comprehensions
Functions
Namespaces, Scope, and Local Functions
Returning Multiple Values
Functions Are Objects
Anonymous (Lambda) Functions
Generators
Errors and Exception Handling
Bytes and Unicode with Files
NumPy Basics: Arrays and Vectorized Computation
Create ndarray: A multidimensional array object
Data types of ndarrays
Arithmetic with NumPy arrays
Basic indexing and slicing
Boolean indexing
Fancy indexing
Transposing arrays and swapping axes
Pseudorandom number generation
Universal functions: Fast element-wise array functions
Array-Oriented programming with arrays
Expressing conditional logic as array operations
Mathematical and statistical methods
Methods for boolean arrays
Unique and other set logic
File input and output with arrays
Linear algebra
Introduction to pandas data structures
Series
DataFrame
Index Objects
Reindexing
Dropping entries from an axis
Indexing, Selection, and Filtering
Arithmetic and data alignment
Function application and mapping
Sorting and ranking
Axis indexes with duplicate labels
Correlation and covariance
Unique values, value counts, and membership
Data Loading, Storage, and File Formats
Reading and writing data in text format
Reading text files in pieces
Writing data to text format
Working with other delimited formats
JSON data
XML and HTML: web scraping
Binary data formats
Reading Microsoft Excel files
Using HDF5 format
Interacting with Web APIs
Interacting with databases
Data Cleaning and Preparation
Handling missing data
Filtering out missing data
Filling in missing data
Data Transformation
Removing duplicates
Transforming data using a function or mapping
Replacing values
Renaming axis indexes
Discretization and binning
Detecting and filtering outliers
Permutation and random sampling
Computing indicator/dummy variables
Extension data types
String manipulation
Python built-in string object methods
Regular expressions
String functions in pandas
Categorical data
Background and motivation
Categorical extension type in pandas
Computations with categoricals
Categorical methods
Data Wrangling: Join, Combine, and Reshape
Hierarchical indexing
Reordering and sorting levels
Summary statistics by level
Indexing with a DataFrame's columns
Combining and merging datasets
Database-style DataFrame joins
Merging on index
Concatenating along an axis
Combining data with overlap
Reshaping and pivoting
Reshaping with hierarchical indexing
Pivoting "Long" to "Wide" format
Pivoting "Wide" to "Long" format
Plotting and Visualization
Matplotlib API primer
Figures and subplots
Colors, markers, and line styles
Ticks, labels, and legends
Annotations and drawings on a subplot
Saving plots to file
matplotlib configuration
Plotting with pandas and seaborn
Line plots
Bar plots
Histograms and density plots
Scatter or point plots
Facet grids and categorical data
Other Python visualization tools
Data Aggregation and Group Operations
How to think about group operations
Iterating over groups
Selecting a column or subset of columns
Grouping with dictionaries and Series
Grouping with functions
Grouping by index levels
Column-wise and multiple-function application
Returning aggregated data without row indexes
General split-apply-combine
Suppressing the group keys
Quantile and bucket analysis
Example: Filling missing values with group-specific values
Example: Random sampling and permutation
Example: Group weighted average and correlation
Example: Group-wise linear regression
Group transforms and "Unwrapped" groupbys
Pivot tables and cross-tabulation
Cross-tabulations: Crosstab
Time Series
Date and time data types and tools
Converting between string and datetime
Time series basics
Indexing, selection, subsetting
Time series with duplicate indices
Date ranges, frequencies, and shifting
Generating date ranges
Frequencies and date offsets
Shifting (leading and lagging) data
Time zone handling
Time zone localization and conversion
Operations with time zone-aware timestamp objects
Operations between different time zones
Periods and Period arithmetic
Period frequency conversion
Quarterly period frequencies
Converting timestamps to periods (and back)
Creating a PeriodIndex from arrays
Resampling and frequency conversion
Downsampling
Upsampling and interpolation
Resampling with periods
Grouped time resampling
Moving window functions
Exponentially weighted functions
Binary moving window functions
User-defined moving window functions
Modeling Libraries in Python
Interfacing between pandas and model code
Creating model descriptions with Patsy
Data transformations in Patsy formulas
Categorical data and Patsy
Introduction to statsmodels
Estimating linear models
Estimating time series processes
Introduction to scikit-learn
Data Analysis Examples
Counting time zones in pure Python
Counting time zones with pandas
Measuring rating disagreement
Analyzing naming trends
Donation statistics by occupation and employer
Bucketing donation amounts
Donation statistics by state
Advanced NumPy
ndarray object internals
NumPy data type hierarchy
Advanced array manipulation
Reshaping arrays
C versus Fortran Order
Concatenating and splitting arrays
Repeating elements: tile and repeat
Fancy indexing equivalents: take and put
Broadcasting over other axes
Setting array values by broadcasting
Advanced ufunc usage
ufunc instance methods
Writing new ufuncs in Python
Structured and record arrays
Nested data types and multidimensional fields
Why use structured arrays?
Indirect sorts: argsort and lexsort
Alternative sort algorithms
Partially sorting arrays
numpy.searchsorted: Finding elements in a sorted array
Writing fast NumPy functions with Numba
Creating custom numpy.ufunc objects with Numba
Advanced array input and output
Memory-mapped files
HDF5 and other array storage options
The importance of contiguous memory
IPython System
Terminal keyboard shortcuts
Magic commands
The %run command
Executing code from the clipboard
Searching and reusing the command history
Input and output variables
Interacting with the operating system
Shell commands and aliases
Directory bookmark system
Software development tools
Interactive debugger
Timing Code: %time and %timeit
Basic profiling: %prun and %run -p
Profiling a function line by line
Tips for productive code development using IPython
Reloading module dependencies
Code design tips
Advanced IPython features
Profiles and configuration