pandas 0.7.0.dev-5c74d9b documentation

Intro to Data Structures

We’ll start with a quick, non-comprehensive overview of the fundamental data structures in pandas to get you started. The fundamental behavior about data types, indexing, and axis labeling / alignment apply across all of the objects. To get started, import numpy and load pandas into your namespace:

In [224]: import numpy as np

# will use a lot in examples
In [225]: randn = np.random.randn

In [226]: from pandas import *

Here is a basic tenet to keep in mind: data alignment is intrinsic. Link between labels and data will not be broken unless done so explicitly by you.

We’ll give a brief intro to the data structures, then consider all of the broad categories of functionality and methods in separate sections.

Series

Series is a one-dimensional labeled array (technically a subclass of ndarray) capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

>>> s = Series(data, index=index)

Here, data can be many different things:

  • a Python dict
  • an ndarray
  • a scalar value (like 5)

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

From ndarray

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

In [227]: s = Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [228]: s
Out[228]: 
a   -0.2837
b   -1.5371
c    0.1630
d   -0.6481
e   -1.7033

In [229]: s.index
Out[229]: Index([a, b, c, d, e], dtype=object)

In [230]: Series(randn(5))
Out[230]: 
0    0.6543
1   -1.1462
2    1.1444
3    0.1666
4    0.1479

Note

The values in the index must be unique. If they are not, an exception will not be raised immediately, but attempting any operation involving the index will later result in an exception. In other words, the Index object containing the labels “lazily” checks whether the values are unique. The reason for being lazy is nearly all performance-based (there are many instances in computations, like parts of GroupBy, where the index is not used).

From dict

If data is a dict, if index is passed the values in data corresponding to the labels in the index will be pulled out. Otherwise, an index will be constructed from the sorted keys of the dict, if possible.

In [231]: d = {'a' : 0., 'b' : 1., 'c' : 2.}

In [232]: Series(d)
Out[232]: 
a    0
b    1
c    2

In [233]: Series(d, index=['b', 'c', 'd', 'a'])
Out[233]: 
b    1.0
c    2.0
d    NaN
a    0.0

Note

NaN (not a number) is the standard missing data marker used in pandas

From scalar value If data is a scalar value, an index must be provided. The value will be repeated to match the length of index

In [234]: Series(5., index=['a', 'b', 'c', 'd', 'e'])
Out[234]: 
a    5
b    5
c    5
d    5
e    5

Series is ndarray-like

As a subclass of ndarray, Series is a valid argument to most NumPy functions and behaves similarly to a NumPy array. However, things like slicing also slice the index.

In [235]: s[0]
Out[235]: -0.28367872471747613

In [236]: s[:3]
Out[236]: 
a   -0.2837
b   -1.5371
c    0.1630

In [237]: s[s > s.median()]
Out[237]: 
a   -0.2837
c    0.1630

In [238]: s[[4, 3, 1]]
Out[238]: 
e   -1.7033
d   -0.6481
b   -1.5371

In [239]: np.exp(s)
Out[239]: 
a    0.7530
b    0.2150
c    1.1770
d    0.5230
e    0.1821

We will address array-based indexing in a separate section.

Series is dict-like

A Series is alike a fixed-size dict in that you can get and set values by index label:

In [240]: s['a']
Out[240]: -0.28367872471747613

In [241]: s['e'] = 12.

In [242]: s
Out[242]: 
a   -0.2837
b   -1.5371
c    0.1630
d   -0.6481
e    12.000

In [243]: 'e' in s
Out[243]: True

In [244]: 'f' in s
Out[244]: False

If a label is not contained, an exception

>>> s['f']
KeyError: 'f'

>>> s.get('f')
nan

Vectorized operations and label alignment with Series

When doing data analysis, as with raw NumPy arrays looping through Series value-by-value is usually not necessary. Series can be also be passed into most NumPy methods expecting an ndarray.

In [245]: s + s
Out[245]: 
a   -0.5674
b   -3.0741
c    0.3260
d   -1.2963
e    24.000

In [246]: s * 2
Out[246]: 
a   -0.5674
b   -3.0741
c    0.3260
d   -1.2963
e    24.000

In [247]: np.exp(s)
Out[247]: 
a    0.7530085
b    0.2150125
c    1.1770433
d    0.5230225
e    1.628e+05

A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

In [248]: s[1:] + s[:-1]
Out[248]: 
a    NaN
b   -3.074
c    0.326
d   -1.296
e    NaN

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result will be marked as missing (NaN). Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.

Note

In general, we chose to make the default result of operations between differently indexed objects yield the union of the indexes in order to avoid loss of information. Having an index label, though the data is missing, is typically important information as part of a computation. You of course have the option of dropping labels with missing data via the dropna function.

Name attribute

Series can also have a name attribute:

In [249]: s = Series(np.random.randn(5), name='something')

In [250]: s
Out[250]: 
0   -1.33368
1   -0.17108
2    0.05042
3   -0.65029
4   -1.08380
Name: something

In [251]: s.name
Out[251]: 'something'

The Series name will be assigned automatically in many cases, in particular when taking 1D slices of DataFrame as you will see below.

DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

  • Dict of 1D ndarrays, lists, dicts, or Series
  • 2-D numpy.ndarray
  • Structured or record ndarray
  • A Series
  • Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.

From dict of Series or dicts

The result index will be the union of the indexes of the various Series. If there are any nested dicts, these will be first converted to Series. If no columns are passed, the columns will be the sorted list of dict keys.

In [252]: d = {'one' : Series([1., 2., 3.], index=['a', 'b', 'c']),
   .....:      'two' : Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [253]: df = DataFrame(d)

In [254]: df
Out[254]: 
   one  two
a  1.0  1  
b  2.0  2  
c  3.0  3  
d  NaN  4  

In [255]: DataFrame(d, index=['d', 'b', 'a'])
Out[255]: 
   one  two
d  NaN  4  
b  2.0  2  
a  1.0  1  

In [256]: DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
Out[256]: 
   two  three
d  4    NaN  
b  2    NaN  
a  1    NaN  

The row and column labels can be accessed respectively by accessing the index and columns attributes:

Note

When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.

In [257]: df.index
Out[257]: Index([a, b, c, d], dtype=object)

In [258]: df.columns
Out[258]: Index([one, two], dtype=object)

From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must clearly also be the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.

In [259]: d = {'one' : [1., 2., 3., 4.],
   .....:      'two' : [4., 3., 2., 1.]}

In [260]: DataFrame(d)
Out[260]: 
   one  two
0  1    4  
1  2    3  
2  3    2  
3  4    1  

In [261]: DataFrame(d, index=['a', 'b', 'c', 'd'])
Out[261]: 
   one  two
a  1    4  
b  2    3  
c  3    2  
d  4    1  

From structured or record array

This case is handled identically to a dict of arrays.

In [262]: data = np.zeros((2,),dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])

In [263]: data[:] = [(1,2.,'Hello'),(2,3.,"World")]

In [264]: DataFrame(data)
Out[264]: 
   A  B C    
0  1  2 Hello
1  2  3 World

In [265]: DataFrame(data, index=['first', 'second'])
Out[265]: 
        A  B C    
first   1  2 Hello
second  2  3 World

In [266]: DataFrame(data, columns=['C', 'A', 'B'])
Out[266]: 
  C      A  B
0 Hello  1  2
1 World  2  3

Note

DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.

From a list of dicts

In [267]: data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [268]: DataFrame(data2)
Out[268]: 
   a  b   c  
0  1  2   NaN
1  5  10  20.

In [269]: DataFrame(data2, index=['first', 'second'])
Out[269]: 
        a  b   c  
first   1  2   NaN
second  5  10  20.

In [270]: DataFrame(data2, columns=['a', 'b'])
Out[270]: 
   a  b 
0  1  2 
1  5  10

From a Series

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name provided).

Missing Data

Much more will be said on this topic in the Missing data section. To construct a DataFrame with missing data, use np.nan for those values which are missing. Alternatively, you may pass a numpy.MaskedArray as the data argument to the DataFrame constructor, and its masked entries will be considered missing.

Alternate Constructors

DataFrame.from_dict

DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the DataFrame constructor except for the orient parameter which is 'columns' by default, but which can be set to 'index' in order to use the dict keys as row labels.

DataFrame.from_records

DataFrame.from_records takes a list of tuples or an ndarray with structured dtype. Works analogously to the normal DataFrame constructor, except that index maybe be a specific field of the structured dtype to use as the index. For example:

In [271]: data
Out[271]: 
array([(1, 2.0, 'Hello'), (2, 3.0, 'World')], 
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', '|S10')])

In [272]: DataFrame.from_records(data, index='C')
Out[272]: 
       A  B
Hello  1  2
World  2  3

DataFrame.from_items

DataFrame.from_items works analogously to the form of the dict constructor that takes a sequence of (key, value) pairs, where the keys are column (or row, in the case of orient='index') names, and the value are the column values (or row values). This can be useful for constructing a DataFrame with the columns in a particular order without having to pass an explicit list of columns:

In [273]: DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])])
Out[273]: 
   A  B
0  1  4
1  2  5
2  3  6

If you pass orient='index', the keys will be the row labels. But in this case you must also pass the desired column names:

In [274]: DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])],
   .....:                      orient='index', columns=['one', 'two', 'three'])
Out[274]: 
   one  two  three
A  1    2    3    
B  4    5    6    

Column selection, addition, deletion

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:

In [275]: df['one']
Out[275]: 
a    1.0
b    2.0
c    3.0
d    NaN
Name: one

In [276]: df['three'] = df['one'] * df['two']

In [277]: df['flag'] = df['one'] > 2

In [278]: df
Out[278]: 
   one  two  three flag 
a  1.0  1    1.0   False
b  2.0  2    4.0   False
c  3.0  3    9.0   True 
d  NaN  4    NaN   False

Columns can be deleted or popped like with a dict:

In [279]: del df['two']

In [280]: three = df.pop('three')

In [281]: df
Out[281]: 
   one flag 
a  1.0 False
b  2.0 False
c  3.0 True 
d  NaN False

When inserting a scalar value, it will naturally be propagated to fill the column:

In [282]: df['foo'] = 'bar'

In [283]: df
Out[283]: 
   one flag  foo
a  1.0 False bar
b  2.0 False bar
c  3.0 True  bar
d  NaN False bar

When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index:

In [284]: df['one_trunc'] = df['one'][:2]

In [285]: df
Out[285]: 
   one flag  foo  one_trunc
a  1.0 False bar  1.0      
b  2.0 False bar  2.0      
c  3.0 True  bar  NaN      
d  NaN False bar  NaN      

You can insert raw ndarrays but their length must match the length of the DataFrame’s index.

By default, columns get inserted at the end. The insert function is available to insert at a particular location in the columns:

In [286]: df.insert(1, 'bar', df['one'])

In [287]: df
Out[287]: 
   one  bar flag  foo  one_trunc
a  1.0  1.0 False bar  1.0      
b  2.0  2.0 False bar  2.0      
c  3.0  3.0 True  bar  NaN      
d  NaN  NaN False bar  NaN      

Indexing / Selection

The basics of indexing are as follows:

Operation Syntax Result
Select column df[col] Series
Select row by label df.xs(label) or df.ix[label] Series
Select row by location (int) df.ix[loc] Series
Slice rows df[5:10] DataFrame
Select rows by boolean vector df[bool_vec] DataFrame

Row selection, for example, returns a Series whose index is the columns of the DataFrame:

In [288]: df.xs('b')
Out[288]: 
one          2
bar          2
flag         False
foo          bar
one_trunc    2
Name: b

In [289]: df.ix[2]
Out[289]: 
one          3
bar          3
flag         True
foo          bar
one_trunc    NaN
Name: c

Note if a DataFrame contains columns of multiple dtypes, the dtype of the row will be chosen to accommodate all of the data types (dtype=object is the most general).

For a more exhaustive treatment of more sophisticated label-based indexing and slicing, see the section on indexing. We will address the fundamentals of reindexing / conforming to new sets of lables in the section on reindexing.

Data alignment and arithmetic

Data alignment between DataFrame objects automatically align on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels.

In [290]: df = DataFrame(randn(10, 4), columns=['A', 'B', 'C', 'D'])

In [291]: df2 = DataFrame(randn(7, 3), columns=['A', 'B', 'C'])

In [292]: df + df2
Out[292]: 
   A         B        C       D  
0  0.001597  0.83119 -1.1712  NaN
1  0.269174  3.23766  0.2677  NaN
2 -0.512615  0.09014  3.1199  NaN
3  0.683789  1.04174  2.2821  NaN
4  1.188105 -1.80520  1.1663  NaN
5 -1.972470  0.40705 -3.5135  NaN
6 -0.850053  0.30591  0.2369  NaN
7  NaN       NaN      NaN     NaN
8  NaN       NaN      NaN     NaN
9  NaN       NaN      NaN     NaN

When doing an operation between DataFrame and Series, the default behavior is to align the Series index on the DataFrame columns, thus broadcasting row-wise. For example:

In [293]: df - df.ix[0]
Out[293]: 
   A        B        C       D       
0  0.00000  0.00000  0.0000  0.000000
1  0.80753  1.35790  2.4195 -1.338521
2 -0.07047 -0.81408  4.0269 -0.292816
3  0.07315 -0.51884  2.1953 -0.002121
4  1.34186  0.37182  2.5104 -1.543406
5 -0.52284  0.66483  0.9421  0.018494
6  0.10122 -0.06617  1.9432 -0.816732
7  0.74422  0.83385  3.4734 -0.664698
8  0.04516  0.77189  1.4057 -0.965116
9  0.86017  1.67748  1.4621 -1.165267

In the special case of working with time series data, if the Series is a TimeSeries (which it will be automatically if the index contains datetime objects), and the DataFrame index also contains dates, the broadcasting will be column-wise:

In [294]: index = DateRange('1/1/2000', periods=8)

In [295]: df = DataFrame(randn(8, 3), index=index,
   .....:                columns=['A', 'B', 'C'])

In [296]: df
Out[296]: 
            A       B       C      
2000-01-03  0.3610 -0.1919 -0.05832
2000-01-04 -0.6460 -1.0514 -0.71600
2000-01-05  0.6128  0.5007 -1.38043
2000-01-06  0.6236  0.7896  0.81829
2000-01-07  1.5591  0.3352  0.91940
2000-01-10 -1.3812  0.3652 -1.81063
2000-01-11 -0.6734 -1.9676 -0.40118
2000-01-12 -0.5830 -0.9986 -0.62928

In [297]: type(df['A'])
Out[297]: pandas.core.series.TimeSeries

In [298]: df - df['A']
Out[298]: 
            A  B       C      
2000-01-03  0 -0.5529 -0.41935
2000-01-04  0 -0.4053 -0.06995
2000-01-05  0 -0.1121 -1.99327
2000-01-06  0  0.1660  0.19470
2000-01-07  0 -1.2239 -0.63968
2000-01-10  0  1.7464 -0.42945
2000-01-11  0 -1.2942  0.27220
2000-01-12  0 -0.4156 -0.04623

Technical purity aside, this case is so common in practice that supporting the special case is preferable to the alternative of forcing the user to transpose and do column-based alignment like so:

In [299]: (df.T - df['A']).T
Out[299]: 
            A  B       C      
2000-01-03  0 -0.5529 -0.41935
2000-01-04  0 -0.4053 -0.06995
2000-01-05  0 -0.1121 -1.99327
2000-01-06  0  0.1660  0.19470
2000-01-07  0 -1.2239 -0.63968
2000-01-10  0  1.7464 -0.42945
2000-01-11  0 -1.2942  0.27220
2000-01-12  0 -0.4156 -0.04623

For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations.

Operations with scalars are just as you would expect:

In [300]: df * 5 + 2
Out[300]: 
            A       B      C       
2000-01-03  3.8052  1.040  1.708422
2000-01-04 -1.2302 -3.257 -1.579975
2000-01-05  5.0642  4.504 -4.902156
2000-01-06  5.1179  5.948  6.091436
2000-01-07  9.7954  3.676  6.596998
2000-01-10 -4.9059  3.826 -7.053159
2000-01-11 -1.3669 -7.838 -0.005915
2000-01-12 -0.9152 -2.993 -1.146387

In [301]: 1 / df
Out[301]: 
            A       B       C     
2000-01-03  2.7698 -5.2107 -17.148
2000-01-04 -1.5479 -0.9511 -1.3967
2000-01-05  1.6317  1.9971 -0.7244
2000-01-06  1.6036  1.2665  1.2221
2000-01-07  0.6414  2.9834  1.0877
2000-01-10 -0.7240  2.7379 -0.5523
2000-01-11 -1.4850 -0.5082 -2.4926
2000-01-12 -1.7151 -1.0014 -1.5891

In [302]: df ** 4
Out[302]: 
            A        B         C        
2000-01-03  0.01699  0.001356  1.156e-05
2000-01-04  0.17421  1.221882  0.2628088
2000-01-05  0.14105  0.062863  3.6312753
2000-01-06  0.15121  0.388709  0.4483560
2000-01-07  5.90843  0.012623  0.7145249
2000-01-10  3.63921  0.017795  10.747824
2000-01-11  0.20561  14.98751  0.0259042
2000-01-12  0.11556  0.994512  0.1568082

Boolean operators work as well:

In [303]: df1 = DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=bool)

In [304]: df2 = DataFrame({'a' : [0, 1, 1], 'b' : [1, 1, 0] }, dtype=bool)

In [305]: df1 & df2
Out[305]: 
  a     b    
0 False False
1 False True 
2 True  False

In [306]: df1 | df2
Out[306]: 
  a    b   
0 True True
1 True True
2 True True

In [307]: df1 ^ df2
Out[307]: 
  a     b    
0 True  True 
1 True  False
2 False True 

In [308]: -df1
Out[308]: 
  a     b    
0 False True 
1 True  False
2 False False

Transposing

To transpose, access the T attribute (also the transpose function), similar to an ndarray:

# only show the first 5 rows
In [309]: df[:5].T
Out[309]: 
   2000-01-03  2000-01-04  2000-01-05  2000-01-06  2000-01-07
A  0.36103    -0.646       0.6128      0.6236      1.5591    
B -0.19191    -1.051       0.5007      0.7896      0.3352    
C -0.05832    -0.716      -1.3804      0.8183      0.9194    

DataFrame interoperability with NumPy functions

Elementwise NumPy ufuncs (log, exp, sqrt, ...) and various other NumPy functions can be used with no issues on DataFrame, assuming the data within are numeric:

In [310]: np.exp(df)
Out[310]: 
            A       B       C     
2000-01-03  1.4348  0.8254  0.9434
2000-01-04  0.5241  0.3495  0.4887
2000-01-05  1.8457  1.6499  0.2515
2000-01-06  1.8656  2.2025  2.2666
2000-01-07  4.7544  1.3982  2.5078
2000-01-10  0.2513  1.4409  0.1636
2000-01-11  0.5100  0.1398  0.6695
2000-01-12  0.5582  0.3684  0.5330

In [311]: np.asarray(df)
Out[311]: 
array([[ 0.361 , -0.1919, -0.0583],
       [-0.646 , -1.0514, -0.716 ],
       [ 0.6128,  0.5007, -1.3804],
       [ 0.6236,  0.7896,  0.8183],
       [ 1.5591,  0.3352,  0.9194],
       [-1.3812,  0.3652, -1.8106],
       [-0.6734, -1.9676, -0.4012],
       [-0.583 , -0.9986, -0.6293]])

The dot method on DataFrame implements matrix multiplication:

In [312]: df.T.dot(df)
Out[312]: 
   A      B      C    
A  6.444  3.334  4.677
B  3.334  7.131  1.784
C  4.677  1.784  7.772

Similarly, the dot method on Series implements dot product:

In [313]: s1 = Series(np.arange(5,10))

In [314]: s1.dot(s1)
Out[314]: 255

DataFrame is not intended to be a drop-in replacement for ndarray as its indexing semantics are quite different in places from a matrix.

Console display

For very large DataFrame objects, only a summary will be printed to the console (here I am reading a CSV version of the baseball dataset from the plyr R package):

In [315]: baseball = read_csv('data/baseball.csv')

In [316]: print baseball
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 88641 to 89534
Data columns:
id       100  non-null values
year     100  non-null values
stint    100  non-null values
team     100  non-null values
lg       100  non-null values
g        100  non-null values
ab       100  non-null values
r        100  non-null values
h        100  non-null values
X2b      100  non-null values
X3b      100  non-null values
hr       100  non-null values
rbi      100  non-null values
sb       100  non-null values
cs       100  non-null values
bb       100  non-null values
so       100  non-null values
ibb      100  non-null values
hbp      100  non-null values
sh       100  non-null values
sf       100  non-null values
gidp     100  non-null values
dtypes: int64(10), float64(9), object(3)

However, using to_string will return a string representation of the DataFrame in tabular form, though it won’t always fit the console width:

In [317]: print baseball.ix[-20:, :12].to_string()
      id         year  stint team lg  g    ab   r    h    X2b  X3b  hr
88641 womacto01  2006  2     CHN  NL  19   50   6    14   1    0    1 
88643 schilcu01  2006  1     BOS  AL  31   2    0    1    0    0    0 
88645 myersmi01  2006  1     NYA  AL  62   0    0    0    0    0    0 
88649 helliri01  2006  1     MIL  NL  20   3    0    0    0    0    0 
88650 johnsra05  2006  1     NYA  AL  33   6    0    1    0    0    0 
88652 finlest01  2006  1     SFN  NL  139  426  66   105  21   12   6 
88653 gonzalu01  2006  1     ARI  NL  153  586  93   159  52   2    15
88662 seleaa01   2006  1     LAN  NL  28   26   2    5    1    0    0 
89177 francju01  2007  2     ATL  NL  15   40   1    10   3    0    0 
89178 francju01  2007  1     NYN  NL  40   50   7    10   0    0    1 
89330 zaungr01   2007  1     TOR  AL  110  331  43   80   24   1    10
89333 witasja01  2007  1     TBA  AL  3    0    0    0    0    0    0 
89334 williwo02  2007  1     HOU  NL  33   59   3    6    0    0    1 
89335 wickmbo01  2007  2     ARI  NL  8    0    0    0    0    0    0 
89336 wickmbo01  2007  1     ATL  NL  47   0    0    0    0    0    0 
89337 whitero02  2007  1     MIN  AL  38   109  8    19   4    0    4 
89338 whiteri01  2007  1     HOU  NL  20   1    0    0    0    0    0 
89339 wellsda01  2007  2     LAN  NL  7    15   2    4    1    0    0 
89340 wellsda01  2007  1     SDN  NL  22   38   1    4    0    0    0 
89341 weathda01  2007  1     CIN  NL  67   0    0    0    0    0    0 
89343 walketo04  2007  1     OAK  AL  18   48   5    13   1    0    0 
89345 wakefti01  2007  1     BOS  AL  1    2    0    0    0    0    0 
89347 vizquom01  2007  1     SFN  NL  145  513  54   126  18   3    4 
89348 villoro01  2007  1     NYA  AL  6    0    0    0    0    0    0 
89352 valenjo03  2007  1     NYN  NL  51   166  18   40   11   1    3 
89354 trachst01  2007  2     CHN  NL  4    7    0    1    0    0    0 
89355 trachst01  2007  1     BAL  AL  3    5    0    0    0    0    0 
89359 timlimi01  2007  1     BOS  AL  4    0    0    0    0    0    0 
89360 thomeji01  2007  1     CHA  AL  130  432  79   119  19   0    35
89361 thomafr04  2007  1     TOR  AL  155  531  63   147  30   0    26
89363 tavarju01  2007  1     BOS  AL  2    4    0    1    0    0    0 
89365 sweenma01  2007  2     LAN  NL  30   33   2    9    1    0    0 
89366 sweenma01  2007  1     SFN  NL  76   90   18   23   8    0    2 
89367 suppaje01  2007  1     MIL  NL  33   61   4    8    0    0    0 
89368 stinnke01  2007  1     SLN  NL  26   82   7    13   3    0    1 
89370 stantmi02  2007  1     CIN  NL  67   2    0    0    0    0    0 
89371 stairma01  2007  1     TOR  AL  125  357  58   103  28   1    21
89372 sprinru01  2007  1     SLN  NL  72   1    0    0    0    0    0 
89374 sosasa01   2007  1     TEX  AL  114  412  53   104  24   1    21
89375 smoltjo01  2007  1     ATL  NL  30   54   1    5    1    0    0 
89378 sheffga01  2007  1     DET  AL  133  494  107  131  20   1    25
89381 seleaa01   2007  1     NYN  NL  31   4    0    0    0    0    0 
89382 seaneru01  2007  1     LAN  NL  68   1    0    0    0    0    0 
89383 schmija01  2007  1     LAN  NL  6    7    1    1    0    0    1 
89384 schilcu01  2007  1     BOS  AL  1    2    0    1    0    0    0 
89385 sandere02  2007  1     KCA  AL  24   73   12   23   7    0    2 
89388 rogerke01  2007  1     DET  AL  1    2    0    0    0    0    0 
89389 rodriiv01  2007  1     DET  AL  129  502  50   141  31   3    11
89396 ramirma02  2007  1     BOS  AL  133  483  84   143  33   1    20
89398 piazzmi01  2007  1     OAK  AL  83   309  33   85   17   1    8 
89400 perezne01  2007  1     DET  AL  33   64   5    11   3    0    1 
89402 parkch01   2007  1     NYN  NL  1    1    0    0    0    0    0 
89406 oliveda02  2007  1     LAA  AL  5    0    0    0    0    0    0 
89410 myersmi01  2007  1     NYA  AL  6    1    0    0    0    0    0 
89411 mussimi01  2007  1     NYA  AL  2    2    0    0    0    0    0 
89412 moyerja01  2007  1     PHI  NL  33   73   4    9    2    0    0 
89420 mesajo01   2007  1     PHI  NL  38   0    0    0    0    0    0 
89421 martipe02  2007  1     NYN  NL  5    9    1    1    1    0    0 
89425 maddugr01  2007  1     SDN  NL  33   62   2    9    2    0    0 
89426 mabryjo01  2007  1     COL  NL  28   34   4    4    1    0    1 
89429 loftoke01  2007  2     CLE  AL  52   173  24   49   9    3    0 
89430 loftoke01  2007  1     TEX  AL  84   317  62   96   16   3    7 
89431 loaizes01  2007  1     LAN  NL  5    7    0    1    0    0    0 
89438 kleskry01  2007  1     SFN  NL  116  362  51   94   27   3    6 
89439 kentje01   2007  1     LAN  NL  136  494  78   149  36   1    20
89442 jonesto02  2007  1     DET  AL  5    0    0    0    0    0    0 
89445 johnsra05  2007  1     ARI  NL  10   15   0    1    0    0    0 
89450 hoffmtr01  2007  1     SDN  NL  60   0    0    0    0    0    0 
89451 hernaro01  2007  2     LAN  NL  22   0    0    0    0    0    0 
89452 hernaro01  2007  1     CLE  AL  2    0    0    0    0    0    0 
89460 guarded01  2007  1     CIN  NL  15   0    0    0    0    0    0 
89462 griffke02  2007  1     CIN  NL  144  528  78   146  24   1    30
89463 greensh01  2007  1     NYN  NL  130  446  62   130  30   1    10
89464 graffto01  2007  1     MIL  NL  86   231  34   55   8    0    9 
89465 gordoto01  2007  1     PHI  NL  44   0    0    0    0    0    0 
89466 gonzalu01  2007  1     LAN  NL  139  464  70   129  23   2    15
89467 gomezch02  2007  2     CLE  AL  19   53   4    15   2    0    0 
89468 gomezch02  2007  1     BAL  AL  73   169  17   51   10   1    1 
89469 glavito02  2007  1     NYN  NL  33   56   3    12   1    0    0 
89473 floydcl01  2007  1     CHN  NL  108  282  40   80   10   1    9 
89474 finlest01  2007  1     COL  NL  43   94   9    17   3    0    1 
89480 embreal01  2007  1     OAK  AL  4    0    0    0    0    0    0 
89481 edmonji01  2007  1     SLN  NL  117  365  39   92   15   2    12
89482 easleda01  2007  1     NYN  NL  76   193  24   54   6    0    10
89489 delgaca01  2007  1     NYN  NL  139  538  71   139  30   0    24
89493 cormirh01  2007  1     CIN  NL  6    0    0    0    0    0    0 
89494 coninje01  2007  2     NYN  NL  21   41   2    8    2    0    0 
89495 coninje01  2007  1     CIN  NL  80   215  23   57   11   1    6 
89497 clemero02  2007  1     NYA  AL  2    2    0    1    0    0    0 
89498 claytro01  2007  2     BOS  AL  8    6    1    0    0    0    0 
89499 claytro01  2007  1     TOR  AL  69   189  23   48   14   0    1 
89501 cirilje01  2007  2     ARI  NL  28   40   6    8    4    0    0 
89502 cirilje01  2007  1     MIN  AL  50   153  18   40   9    2    2 
89521 bondsba01  2007  1     SFN  NL  126  340  75   94   14   0    28
89523 biggicr01  2007  1     HOU  NL  141  517  68   130  31   3    10
89525 benitar01  2007  2     FLO  NL  34   0    0    0    0    0    0 
89526 benitar01  2007  1     SFN  NL  19   0    0    0    0    0    0 
89530 ausmubr01  2007  1     HOU  NL  117  349  38   82   16   3    3 
89533 aloumo01   2007  1     NYN  NL  87   328  51   112  19   1    13
89534 alomasa02  2007  1     NYN  NL  8    22   1    3    1    0    0 

DataFrame column types

The four main types stored in pandas objects are float, int, boolean, and object. A convenient dtypes attribute return a Series with the data type of each column:

In [318]: baseball.dtypes
Out[318]: 
id       object
year     int64
stint    int64
team     object
lg       object
g        int64
ab       int64
r        int64
h        int64
X2b      int64
X3b      int64
hr       int64
rbi      float64
sb       float64
cs       float64
bb       int64
so       float64
ibb      float64
hbp      float64
sh       float64
sf       float64
gidp     float64

The related method get_dtype_counts will return the number of columns of each type:

In [319]: baseball.get_dtype_counts()
Out[319]: 
int64      10
float64    9
object     3

DataFrame column attribute access and IPython completion

If a DataFrame column label is a valid Python variable name, the column can be accessed like attributes:

In [320]: df = DataFrame({'foo1' : np.random.randn(5),
   .....:                 'foo2' : np.random.randn(5)})

In [321]: df
Out[321]: 
   foo1    foo2  
0 -0.5480 -0.9662
1 -0.8526 -0.3326
2 -0.1263 -1.3273
3  1.7660  1.2258
4 -1.5933 -0.3484

In [322]: df.foo1
Out[322]: 
0   -0.5480
1   -0.8526
2   -0.1263
3    1.7660
4   -1.5933
Name: foo1

The columns are also connected to the IPython completion mechanism so they can be tab-completed:

In [5]: df.fo<TAB>
df.foo1  df.foo2

Panel

Panel is a somewhat less-used, but still important container for 3-dimensional data. The term panel data is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data and, in particular, econometric analysis of panel data. However, for the strict purposes of slicing and dicing a collection of DataFrame objects, you may find the axis names slightly arbitrary:

  • items: axis 0, each item corresponds to a DataFrame contained inside
  • major_axis: axis 1, it is the index (rows) of each of the DataFrames
  • minor_axis: axis 2, it is the columns of each of the DataFrames

Construction of Panels works about like you would expect:

From 3D ndarray with optional axis labels

In [323]: wp = Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
   .....:            major_axis=DateRange('1/1/2000', periods=5),
   .....:            minor_axis=['A', 'B', 'C', 'D'])

In [324]: wp
Out[324]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major) x 4 (minor)
Items: Item1 to Item2
Major axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor axis: A to D

From dict of DataFrame objects

In [325]: data = {'Item1' : DataFrame(randn(4, 3)),
   .....:         'Item2' : DataFrame(randn(4, 2))}

In [326]: Panel(data)
Out[326]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major) x 3 (minor)
Items: Item1 to Item2
Major axis: 0 to 3
Minor axis: 0 to 2

Note that the values in the dict need only be convertible to DataFrame. Thus, they can be any of the other valid inputs to DataFrame as per above.

One helpful factory method is Panel.from_dict, which takes a dictionary of DataFrames as above, and the following named parameters:

Parameter Default Description
intersect False drops elements whose indices do not align
orient items use minor to use DataFrames’ columns as panel items

For example, compare to the construction above:

In [327]: Panel.from_dict(data, orient='minor')
Out[327]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 4 (major) x 2 (minor)
Items: 0 to 2
Major axis: 0 to 3
Minor axis: Item1 to Item2

Orient is especially useful for mixed-type DataFrames.

Note

Unfortunately Panel, being less commonly used than Series and DataFrame, has been slightly neglected feature-wise. A number of methods and options available in DataFrame are not available in Panel. This will get worked on, of course, in future releases. And faster if you join me in working on the codebase.

From DataFrame using to_panel method

This method was introduced in v0.7 to replace LongPanel.to_long, and converts a DataFrame with a two-level index to a Panel.

In [328]: midx = MultiIndex(levels=[['one', 'two'], ['x','y']], labels=[[1,1,0,0],[1,0,1,0]])

In [329]: df = DataFrame({'A' : [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=midx)

In [330]: df.to_panel()
Out[330]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major) x 2 (minor)
Items: A to B
Major axis: one to two
Minor axis: x to y

Item selection / addition / deletion

Similar to DataFrame functioning as a dict of Series, Panel is like a dict of DataFrames:

In [331]: wp['Item1']
Out[331]: 
            A       B        C       D     
2000-01-03  1.4248 -0.87920  0.5448  0.6272
2000-01-04  0.6966  0.10773 -0.7765 -1.2576
2000-01-05 -0.6007 -1.50485  1.0520  0.5866
2000-01-06 -2.6372 -0.01495 -1.2085 -1.2570
2000-01-07 -0.4997  0.42979 -0.2422 -0.7238

In [332]: wp['Item3'] = wp['Item1'] / wp['Item2']

The API for insertion and deletion is the same as for DataFrame. And as with DataFrame, if the item is a valid python identifier, you can access it as an attribute and tab-complete it in IPython.

Indexing / Selection

Operation Syntax Result
Select item wp[item] DataFrame
Get slice at major_axis label wp.major_xs(val) DataFrame
Get slice at minor_axis label wp.minor_xs(val) DataFrame

For example, using the earlier example data, we could do:

In [333]: wp['Item1']
Out[333]: 
            A       B        C       D     
2000-01-03  1.4248 -0.87920  0.5448  0.6272
2000-01-04  0.6966  0.10773 -0.7765 -1.2576
2000-01-05 -0.6007 -1.50485  1.0520  0.5866
2000-01-06 -2.6372 -0.01495 -1.2085 -1.2570
2000-01-07 -0.4997  0.42979 -0.2422 -0.7238

In [334]: wp.major_xs(wp.major_axis[2])
Out[334]: 
   Item1   Item2   Item3 
A -0.6007 -2.1386  0.2809
B -1.5049 -0.5927  2.5392
C  1.0520 -1.0591 -0.9933
D  0.5866  0.1188  4.9368

In [335]: wp.minor_axis
Out[335]: Index([A, B, C, D], dtype=object)

In [336]: wp.minor_xs('C')
Out[336]: 
            Item1   Item2   Item3 
2000-01-03  0.5448 -0.5437 -1.0020
2000-01-04 -0.7765 -1.5663  0.4957
2000-01-05  1.0520 -1.0591 -0.9933
2000-01-06 -1.2085 -0.1291  9.3611
2000-01-07 -0.2422  0.7591 -0.3191