Series and DataFrame¶
Series¶
Series is a one-dimensional labeled array. The axis labels are collectively referred to as the index. Creating a Series by passing a list of values, letting pandas create a default integer index:
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 3, 5, np.nan, 6, 8])
>>> s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
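A Series can also be created with an explicit index rather than the default integer one; for example:
>>> pd.Series([1.0, 3.0, 5.0], index=['a', 'b', 'c'])
a    1.0
b    3.0
c    5.0
dtype: float64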
DataFrame¶
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Creating a DataFrame by passing an array, with a datetime index and labeled columns:
>>> dates = pd.date_range('20130101', periods=6)
>>> dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
>>> df
A B C D
2013-01-01 0.730520 0.088258 0.488905 0.461837
2013-01-02 0.448598 0.697712 0.277767 0.759961
2013-01-03 0.219245 0.920414 0.886056 0.222002
2013-01-04 0.883879 0.439466 0.392876 0.994732
2013-01-05 0.881501 0.283149 0.247825 0.593564
2013-01-06 0.511849 0.077208 0.040160 0.683068
Creating a DataFrame by passing a dict of objects that can be converted to something series-like:
>>> df2 = pd.DataFrame({'A' : 1.,
...                     'C' : [1, 2, 3, 4],
...                     'D' : np.array([3] * 4),
...                     'E' : ['test', 'train', 'test', 'train'],
...                     'F' : 'foo'})
>>> df2
A C D E F
0 1.0 1 3 test foo
1 1.0 2 3 train foo
2 1.0 3 3 test foo
3 1.0 4 3 train foo
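The columns of df2 end up with different dtypes, which can be inspected with dtypes (the integer widths shown here assume a 64-bit platform):
>>> df2.dtypes
A    float64
C      int64
D      int64
E     object
F     object
dtype: object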
Viewing Data¶
Here is how to view the top and bottom rows of the frame:
>>> df.head()
A B C D
2013-01-01 0.730520 0.088258 0.488905 0.461837
2013-01-02 0.448598 0.697712 0.277767 0.759961
2013-01-03 0.219245 0.920414 0.886056 0.222002
2013-01-04 0.883879 0.439466 0.392876 0.994732
2013-01-05 0.881501 0.283149 0.247825 0.593564
>>> df.tail(3)
A B C D
2013-01-04 0.883879 0.439466 0.392876 0.994732
2013-01-05 0.881501 0.283149 0.247825 0.593564
2013-01-06 0.511849 0.077208 0.040160 0.683068
Display the index, columns, and the underlying array data:
>>> df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
>>> df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
>>> df.values
array([[0.73051986, 0.08825841, 0.48890455, 0.46183721],
       [0.44859839, 0.69771234, 0.27776731, 0.75996083],
       [0.2192445 , 0.92041401, 0.88605579, 0.2220016 ],
       [0.88387856, 0.43946559, 0.39287644, 0.994732  ],
       [0.8815008 , 0.28314894, 0.24782537, 0.59356428],
       [0.51184878, 0.07720751, 0.04016027, 0.68306759]])
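Recent versions of pandas also provide DataFrame.to_numpy(), which returns the same NumPy array as df.values; a quick check of its shape:
>>> arr = df.to_numpy()
>>> arr.shape
(6, 4)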
describe() shows a quick statistic summary of the data:
>>> df.describe()
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.612598 0.417701 0.388932 0.619194
std 0.265172 0.338873 0.286724 0.263857
min 0.219245 0.077208 0.040160 0.222002
...
max 0.883879 0.920414 0.886056 0.994732
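Individual statistics can also be computed directly; for example, the per-column mean (the values below match the mean row of the summary above):
>>> df.mean()
A    0.612598
B    0.417701
C    0.388932
D    0.619194
dtype: float64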
Transposing the data:
>>> df.T
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.730520 0.448598 0.219245 0.883879 0.881501 0.511849
B 0.088258 0.697712 0.920414 0.439466 0.283149 0.077208
C 0.488905 0.277767 0.886056 0.392876 0.247825 0.040160
D 0.461837 0.759961 0.222002 0.994732 0.593564 0.683068
Sorting by index:
>>> df.sort_index(ascending=False)
A B C D
2013-01-06 0.511849 0.077208 0.040160 0.683068
2013-01-05 0.881501 0.283149 0.247825 0.593564
2013-01-04 0.883879 0.439466 0.392876 0.994732
2013-01-03 0.219245 0.920414 0.886056 0.222002
2013-01-02 0.448598 0.697712 0.277767 0.759961
2013-01-01 0.730520 0.088258 0.488905 0.461837
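The columns can be sorted the same way by sorting along axis=1; the frame below is simply the one above with its columns reversed:
>>> df.sort_index(axis=1, ascending=False)
D C B A
2013-01-01 0.461837 0.488905 0.088258 0.730520
2013-01-02 0.759961 0.277767 0.697712 0.448598
2013-01-03 0.222002 0.886056 0.920414 0.219245
2013-01-04 0.994732 0.392876 0.439466 0.883879
2013-01-05 0.593564 0.247825 0.283149 0.881501
2013-01-06 0.683068 0.040160 0.077208 0.511849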
Sorting by values:
>>> df.sort_values(by='B')
A B C D
2013-01-06 0.511849 0.077208 0.040160 0.683068
2013-01-01 0.730520 0.088258 0.488905 0.461837
2013-01-05 0.881501 0.283149 0.247825 0.593564
2013-01-04 0.883879 0.439466 0.392876 0.994732
2013-01-02 0.448598 0.697712 0.277767 0.759961
2013-01-03 0.219245 0.920414 0.886056 0.222002
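Passing ascending=False reverses the order; the rows below are the same frame sorted by descending 'B':
>>> df.sort_values(by='B', ascending=False)
A B C D
2013-01-03 0.219245 0.920414 0.886056 0.222002
2013-01-02 0.448598 0.697712 0.277767 0.759961
2013-01-04 0.883879 0.439466 0.392876 0.994732
2013-01-05 0.881501 0.283149 0.247825 0.593564
2013-01-01 0.730520 0.088258 0.488905 0.461837
2013-01-06 0.511849 0.077208 0.040160 0.683068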
Selection¶
Getting¶
Selecting a single column, which yields a Series, equivalent to df.A:
>>> df['A']
2013-01-01    0.730520
2013-01-02    0.448598
2013-01-03    0.219245
2013-01-04    0.883879
2013-01-05    0.881501
2013-01-06    0.511849
Freq: D, Name: A, dtype: float64
Selecting via [], which slices the rows.
>>> df[0:3]
A B C D
2013-01-01 0.730520 0.088258 0.488905 0.461837
2013-01-02 0.448598 0.697712 0.277767 0.759961
2013-01-03 0.219245 0.920414 0.886056 0.222002
>>> df['20130102':'20130104']
A B C D
2013-01-02 0.448598 0.697712 0.277767 0.759961
2013-01-03 0.219245 0.920414 0.886056 0.222002
2013-01-04 0.883879 0.439466 0.392876 0.994732
Selection by Label¶
For getting a cross section using a label:
>>> df.loc[dates[0]]
A    0.730520
B    0.088258
C    0.488905
D    0.461837
Name: 2013-01-01 00:00:00, dtype: float64
Selecting on a multi-axis by label:
>>> df.loc[:,['A','B']]
A B
2013-01-01 0.730520 0.088258
2013-01-02 0.448598 0.697712
2013-01-03 0.219245 0.920414
2013-01-04 0.883879 0.439466
2013-01-05 0.881501 0.283149
2013-01-06 0.511849 0.077208
Showing label slicing, both endpoints are included:
>>> df.loc['20130102':'20130104',['A','B']]
A B
2013-01-02 0.448598 0.697712
2013-01-03 0.219245 0.920414
2013-01-04 0.883879 0.439466
Reduction in the dimensions of the returned object:
>>> df.loc['20130102',['A','B']]
A    0.448598
B    0.697712
Name: 2013-01-02 00:00:00, dtype: float64
For getting a scalar value:
>>> df.at[dates[0],'A']
0.730519863614471
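The same value can be retrieved with .loc; .at is simply a faster access path for a single scalar:
>>> df.loc[dates[0], 'A']
0.730519863614471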
Selection by Position¶
Select via the position of the passed integers:
>>> df.iloc[3]
A    0.883879
B    0.439466
C    0.392876
D    0.994732
Name: 2013-01-04 00:00:00, dtype: float64
By integer slices:
>>> df.iloc[3:5,0:2]
A B
2013-01-04 0.883879 0.439466
2013-01-05 0.881501 0.283149
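Slicing rows only, taking all columns (analogous to NumPy slicing):
>>> df.iloc[1:3, :]
A B C D
2013-01-02 0.448598 0.697712 0.277767 0.759961
2013-01-03 0.219245 0.920414 0.886056 0.222002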
By lists of integer position locations:
>>> df.iloc[[1,2,4],[0,2]]
A C
2013-01-02 0.448598 0.277767
2013-01-03 0.219245 0.886056
2013-01-05 0.881501 0.247825
For getting a value explicitly:
>>> df.iloc[1,1]
0.6977123432245299
For getting fast access to a scalar (equivalent to the prior method):
>>> df.iat[1,1]
0.6977123432245299
Grouping¶
By "group by" we mean a process of splitting the data into groups based on some criteria, applying a function to each group independently, and combining the results into a data structure.
>>> df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
...                           'foo', 'bar', 'foo', 'foo'],
...                    'B' : ['one', 'one', 'two', 'three',
...                           'two', 'two', 'one', 'three'],
...                    'C' : np.random.randn(8),
...                    'D' : np.random.randn(8)})
>>> df
A B C D
0 foo one 0.235064 0.235064
1 bar one -0.419857 -0.419857
2 foo two -0.888507 -0.888507
3 bar three -3.056019 -3.056019
4 foo two -0.476107 -0.476107
5 bar two 1.831309 1.831309
6 foo one -0.800894 -0.800894
7 foo three 0.936860 0.936860
Grouping and then applying the sum() function to the resulting groups.
>>> df.groupby('A').sum()
C D
A
bar -1.644567 -1.644567
foo -0.993584 -0.993584
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.
>>> df.groupby(['A','B']).sum()
C D
A B
bar one -0.419857 -0.419857
    three -3.056019 -3.056019
    two 1.831309 1.831309
foo one -0.565830 -0.565830
    three 0.936860 0.936860
    two -1.364614 -1.364614
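Other aggregations follow the same pattern; for example, size() counts the rows in each group (the counts below follow from the frame above):
>>> df.groupby(['A','B']).size()
A    B
bar  one      1
     three    1
     two      1
foo  one      2
     three    1
     two      2
dtype: int64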
Time resample¶
Performing a resampling operation during frequency conversion (e.g., converting secondly data into 5-minute data):
>>> rng = pd.date_range('1/1/2012', periods=100, freq='S')
>>> ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
>>> ts.resample('5Min').sum()
2012-01-01    22561
Freq: 5T, dtype: int64
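Other reductions work the same way; for example, the mean over the same window is simply the sum above divided by the 100 observations (22561 / 100):
>>> ts.resample('5Min').mean()
2012-01-01    225.61
Freq: 5T, dtype: float64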