Preprocessing Financial Data

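The first step in any data analysis is parsing the raw data: extracting it from the source, then cleaning it and filling in missing values, if any. Data comes in many forms, but Python's packages make reading time series data straightforward.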

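In this post we use some popular Python packages to retrieve and store EOD and intraday data. These libraries aim to keep their APIs simple and make historical data easier to access. We also look at how to read data from conventional, locally stored sources.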

Background Concepts

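Before diving in, let's look at the nature of financial data and introduce the concept of a time series. A time series is a sequence of data points indexed in time order. Financial data such as stock, commodity, and foreign exchange prices observed at equally spaced points in time are examples. Time series are typically recorded at tick, second, minute, hourly, daily, weekly, monthly, quarterly, or yearly frequency, and the observation frequency matters for any analysis built on them. There is not much code specific to time series; we get to it at the end of the post.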
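To make the definition concrete, here is a minimal sketch of a time series in pandas. The prices and dates are made up for illustration; only the structure (values indexed by evenly spaced timestamps) matters.

```python
import pandas as pd

# Hypothetical prices observed once per business day
dates = pd.date_range(start="2023-01-02", periods=5, freq="B")
prices = pd.Series([100.0, 101.5, 101.0, 102.3, 103.1], index=dates, name="price")

# The index is ordered in time and evenly spaced at the chosen frequency
print(prices.index.freqstr)
print(prices.index.is_monotonic_increasing)

# Values are looked up by timestamp rather than by position
print(prices.loc[pd.Timestamp("2023-01-04")])
```

The frequency string "B" stands for business days, so weekends are skipped automatically.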

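With time series and financial data covered, here is what the EOD and intraday data mentioned above are.

End-of-day (EOD) data and intraday data are both types of financial market data; they differ in time granularity and in the information they provide.

End-of-day data is market data captured and recorded at the close of the trading day. It gives a security's opening and closing prices, the day's high and low, and the trading volume. Traders and investors typically use it to analyze market trends and inform decisions for the next trading day.

Intraday data, by contrast, provides market data at shorter intervals within the trading day, such as every minute, every five minutes, or every hour. It offers a finer-grained view of market activity, including price movements, volume, and bid-ask spreads. Day traders and other investors use it to monitor the market closely in real time and react quickly to short-term moves.

In short, end-of-day data is a snapshot of market activity at the close of the trading day, while intraday data records that activity at finer intervals throughout the day.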
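The relationship between the two can be sketched in a few lines: an end-of-day bar is simply the intraday bars of one session collapsed into a single row. The one-minute bars below are invented for illustration and are not real market data.

```python
import pandas as pd

# Hypothetical one-minute bars for a single trading session
idx = pd.date_range("2023-03-01 09:30", periods=6, freq="min")
intraday = pd.DataFrame(
    {
        "Open": [100.0, 100.4, 100.2, 100.9, 101.1, 100.8],
        "High": [100.5, 100.6, 101.0, 101.3, 101.2, 101.0],
        "Low": [99.8, 100.1, 100.0, 100.7, 100.6, 100.5],
        "Close": [100.4, 100.2, 100.9, 101.1, 100.8, 100.9],
        "Volume": [1200, 800, 1500, 900, 700, 1100],
    },
    index=idx,
)

# Collapse the session into one end-of-day bar:
# first open, highest high, lowest low, last close, total volume
eod = intraday.resample("D").agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last", "Volume": "sum"}
)
print(eod)
```

The aggregation rules are the point here: each OHLCV field needs its own rule, so a plain mean or sum over all columns would give the wrong bar.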

Importing Libraries

Here we import the libraries we will use.

# Import data manipulation libraries
import pandas as pd
import numpy as np

# Import yahoo finance library
# If it is not installed, install it first (the same goes for the libraries below); you may need a VPN or proxy. Install with:
# !pip install yfinance
import yfinance as yf

# Import cufflinks for visualization
import cufflinks as cf
cf.set_config_file(offline=True)

# Ignore warnings - optional
import warnings
warnings.filterwarnings('ignore')

# Check the package version
pd.__version__, cf.__version__, yf.__version__

Data Retrieval

We use the yfinance.download function to retrieve data. To see the function's parameters and what it returns, run the following code.

# help(yf.download)
yf.download?

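Below we cover retrieving EOD, intraday, and options data in turn.

Retrieving EOD Data

When retrieving a stock's daily data with yfinance, setting the period parameter to "max" requests the full history from the earliest available date, which varies by stock and data provider. Use "max" when you want as much history as possible for a given stock.

Depending on the use case, period can instead be set to a specific value such as "5d", "1mo", "3mo", or "1y" (five days, one month, three months, one year) to request a shorter window. For the EOD example here we use "5d".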

# Fetch the data by specifying the number of period
df1 = yf.download('SPY', period='5d', progress=False)

# Display the first five rows of the dataframe to check the results. 
df1.head()

Instead of period, we can set a start date and an end date to retrieve data for a specific window.

# Fetch data by specifying the start and end dates
df2 = yf.download('SPY', start='2022-01-01', end='2022-12-31', progress=False)

# Display the first five rows of the dataframe to check the results. 
df2.head()

A quick aside: to display the last five rows of the result, run the following code.

df2.tail()

Here is another example.

# Fetch data for year to date (YTD)
df3 = yf.download('SPY', period='ytd', progress=False)

# Display the last five rows of the dataframe to check the results. 
df3.tail()

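Retrieving Daily Closing Prices for Multiple Securities

The previous section retrieved data for SPY only. What if we want the historical closing prices of several companies for the same trading days? Taking a few Nasdaq-listed companies as an example, we retrieve historical data for Apple, Amazon, Microsoft, Nvidia, and Tesla.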

# Specify stocks
nasdaq_stocks = ['AAPL', 'AMZN', 'MSFT', 'NVDA', 'TSLA']

# this is list of stocks
nasdaq_stocks

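# Fetch data for multiple stocks at once
# auto_adjust=False keeps the 'Adj Close' column on newer yfinance versions that adjust prices by default
df4 = yf.download(nasdaq_stocks, period='ytd', auto_adjust=False, progress=False)['Adj Close']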

# Display dataframe
df4.tail()

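Retrieving Multiple Data Points for Multiple Securities

In the earlier examples we retrieved closing prices for one security and then for several. But the closing price is not the only data point that describes a security.

OHLCV is a common abbreviation in financial markets, especially in trading and investing. It stands for open, high, low, close, and volume: five key data points used to describe the price and trading activity of a financial instrument such as a stock, commodity, or currency.

The open is the instrument's price at the start of the trading session.
The high is the highest price the instrument reached during the session.
The low is the lowest price the instrument reached during the session.
The close is the instrument's price at the end of the session.
The volume is the total number of shares, contracts, or units traded during the session.

OHLCV data is often displayed as a candlestick chart, which gives a visual picture of price movement and trading activity over a given period. By analyzing OHLCV data, traders and investors can better understand an instrument's supply and demand dynamics and make better-informed trading decisions. We will meet candlestick charts later; first, some groundwork.

For the same five stocks as before, we now want their OHLCV data.

# Fetch data for multiple fields using a dict comprehension
# auto_adjust=False keeps the 'Adj Close' column on newer yfinance versions that adjust prices by default
ohlcv = {
    symbol: yf.download(symbol, period='250d', auto_adjust=False, progress=False)
    for symbol in nasdaq_stocks
}
ohlcv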

All the information is there, but what if we only want Nvidia's data out of the whole OHLCV set? See the code below.

# Display NVDA stock data
ohlcv['NVDA']

If we only want Nvidia's adjusted close, see the code below. Its last five rows match the NVDA values we retrieved earlier.

# Display NVDA adjusted close data
ohlcv['NVDA']['Adj Close']

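A quick primer; skip ahead if you already know this.

Adj Close is the adjusted closing price of a financial instrument: the closing price restated to account for corporate actions such as stock splits and dividends.

The adjusted close matters because it keeps prices comparable across corporate actions. For example, when a company pays a dividend, the share price typically falls by roughly the dividend amount on the ex-dividend date. The adjusted series restates earlier closes to account for that drop, so it gives a more accurate picture of the return from holding the stock.

Adj Close is widely used in financial analysis, such as technical analysis and financial modeling, to study an instrument's historical performance and inform investment decisions. It is usually combined with other metrics, such as volume and volatility, to give a fuller picture of how the instrument behaves in the market.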
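To see what the dividend adjustment does, here is a simplified sketch with made-up numbers. It assumes the common back-adjustment rule in which every close before the ex-dividend date is multiplied by 1 minus dividend divided by the previous close; the exact method a data vendor uses may differ.

```python
import pandas as pd

# Hypothetical closes; a 2.00 cash dividend goes ex on the second day
close = pd.Series(
    [100.0, 98.0, 99.0],
    index=pd.to_datetime(["2023-06-01", "2023-06-02", "2023-06-05"]),
)
dividend = 2.0
ex_date = pd.Timestamp("2023-06-02")

# Back-adjustment factor (assumed rule): 1 - dividend / last close before the ex-date
prev_close = close.loc[close.index < ex_date].iloc[-1]
factor = 1.0 - dividend / prev_close

# Scale every close before the ex-date; later closes stay as they are
adj_close = close.copy()
adj_close.loc[adj_close.index < ex_date] *= factor

print(close.pct_change().iloc[1])      # raw series: the dividend looks like a loss
print(adj_close.pct_change().iloc[1])  # adjusted series: no artificial loss
```

The raw series shows a 2% fall on the ex-dividend date even though the holder received 2.00 in cash, while the adjusted series shows a flat day, which is why returns are usually computed from Adj Close.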

Retrieving Intraday Data

# Retrieve intraday data for last five days
df6 = yf.download(tickers='SPY', period='5d', interval='1m', progress=False)

# Display last five rows of the dataframe
df6.tail()

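Retrieving Option Chain Data

An option chain is a listing of all available option contracts for a given financial instrument (such as a stock, commodity, or currency). It is a table showing the strike prices and expiration dates of calls and puts, along with each contract's bid and ask prices and implied volatility.

Option chains give traders and investors important information for trading decisions. By analyzing an option chain, traders can identify potential opportunities, compare the risk and reward of different contracts, and adjust their strategies accordingly.

Key elements of an option chain include:

Strike price: the price at which the option can be exercised.
Expiration date: the date on which the option expires.
Call/Put: whether the option is a call or a put.
Bid/Ask: the prices at which buyers are willing to buy and sellers are willing to sell the option.
Implied volatility: the market's expectation of how much the underlying will fluctuate in the future.

Overall, the option chain is an important tool for option traders and investors because it gives a complete view of the available contracts and helps them make informed trading decisions.

Here we retrieve the option chain for SPY with the expiration date set to 2024-03-15.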

# Get SPY option chain
spy = yf.Ticker('SPY')

# An expiration date in the past raises an error; spy.options lists the available dates
options = spy.option_chain('2024-03-15')
options

# Filter calls for strike above 440
df = options.calls[options.calls['strike']>440]
df.reset_index(drop=True, inplace=True)

# Check the filtered output
df.iloc[:,:7].head()
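The same filtering can be tried without a network connection on a hand-made table. In this sketch the quotes are invented; the strike column matches the one used above, while the bid, ask, and impliedVolatility column names are assumptions about the layout of the calls table.

```python
import pandas as pd

# Hypothetical call quotes (not real market data)
calls = pd.DataFrame(
    {
        "strike": [430.0, 440.0, 450.0, 460.0],
        "bid": [12.10, 6.40, 2.85, 1.05],
        "ask": [12.30, 6.60, 3.05, 1.15],
        "impliedVolatility": [0.21, 0.19, 0.18, 0.185],
    }
)

# Keep strikes above 440, then derive the mid price and the bid-ask spread
selected = calls[calls["strike"] > 440].reset_index(drop=True)
selected = selected.assign(
    mid=(selected["bid"] + selected["ask"]) / 2,
    spread=selected["ask"] - selected["bid"],
)
print(selected)
```

Using assign returns a new frame instead of modifying the filtered slice in place, which sidesteps pandas' warnings about writing to a copy.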

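Retrieving Data from HTML

We can also retrieve data from a web page. The Wikipedia page for the Nasdaq-100 contains several tables, and we want the data in them.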

# read data from wikipedia
nasdaq100 = pd.read_html('https://en.wikipedia.org/wiki/Nasdaq-100')

# fifth table
nasdaq100[4]

If we only need the Ticker column:

# filtering tickers
nasdaq100[4]['Ticker']

If we only need the first ten tickers:

# filter table for tickers
stocklist = list(nasdaq100[4]['Ticker'])
stocklist[:10]

Storing Data

We export the data to files and store them locally. First create a folder on the desktop or anywhere else. For example, I created a folder named data on my desktop, and the exported files are saved there. Make sure the storage path you enter is correct.

# Dataframe to Excel
# Store the fetched data in a separate sheet for each security;
# the context manager saves and closes the file (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('C:/Users/*****/Desktop/data/stocks.xlsx') as writer:
    for symbol in nasdaq_stocks:
        ohlcv[symbol].to_excel(writer, sheet_name=symbol)

# Save ohlcv data for each security in stockname.csv format
for symbol in nasdaq_stocks:
    ohlcv[symbol].to_csv('C:/Users/*****/Desktop/data/' + symbol + '.csv')

print('*** data saved ***')

Loading Data

Sometimes we need to import data from local files. We use the files just saved as an example.

# Reading Excel file
# Reading the fetched data in a spreadsheet
aapl = pd.read_excel('C:/Users/*****/Desktop/data/stocks.xlsx', sheet_name='AAPL',index_col=0, parse_dates=True)

# Display the last five rows of the data frame to check the results
aapl.tail()

# Read CSV file  
aapl = pd.read_csv('C:/Users/*****/Desktop/data/AAPL.csv', index_col=0, parse_dates=True, dayfirst=False) 

# Display the last five rows of the data frame to check the results
aapl.tail()
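If you want to check the save-and-load round trip without touching the desktop folder, the following sketch writes a small made-up frame to a temporary directory and reads it back. The file name sample.csv and the values are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical daily data with a DatetimeIndex named "Date"
frame = pd.DataFrame(
    {"Close": [150.0, 151.2, 149.8]},
    index=pd.date_range("2023-01-02", periods=3, freq="B", name="Date"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.csv"  # temporary stand-in for the data folder
    frame.to_csv(path)
    # index_col=0 and parse_dates=True restore the dates as the index
    loaded = pd.read_csv(path, index_col=0, parse_dates=True)

print(loaded.index)
print(loaded)
```

Without index_col=0 and parse_dates=True the dates would come back as an ordinary text column, and date-based slicing or resampling would no longer work.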

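Data Manipulation

Next come a few basic dataframe operations, which are essential when working with financial data and time series.

# subset selection: keep only the open and close prices
df1.filter(['Open', 'Close'])

# drop columns: leave out volume
df1.drop('Volume', axis=1)

# drop rows: leave out the last day
df1.drop(df1.index[-1], axis=0)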
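One detail worth knowing about these operations is that each returns a new dataframe and leaves the original untouched. A small sketch on made-up data shows this; the frame below is a stand-in for df1.

```python
import pandas as pd

# Hypothetical three-day frame standing in for df1
prices = pd.DataFrame(
    {
        "Open": [10.0, 10.5, 10.2],
        "Close": [10.4, 10.1, 10.6],
        "Volume": [1000, 1500, 1200],
    },
    index=pd.date_range("2023-01-02", periods=3, freq="B"),
)

subset = prices.filter(["Open", "Close"])            # keep two columns
no_volume = prices.drop("Volume", axis=1)            # drop a column
no_last_day = prices.drop(prices.index[-1], axis=0)  # drop the last row

# The original frame is unchanged because each call returned a new object
print(prices.shape, subset.shape, no_volume.shape, no_last_day.shape)
```

To keep a result, assign it to a name as done here; calling the method on its own only displays the output in a notebook.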

Resampling Data

Now we resample the time series to a different frequency.

# Resampling to derive weekly values from daily time series
df_weekly = df4[['AAPL']].resample('W').last()

# Display the last five rows of the data frame to check the output
df_weekly.tail(5)

# Resampling to a specific day of the week: Thursday
df_weekly_thu = df4[['AAPL']].resample('W-THU').ffill()

# Display the last five rows of the data frame to check the output
df_weekly_thu.tail()

# Resampling to derive monthly values from daily time series
# ('M' means month-end; pandas 2.2 and later prefer the alias 'ME')
df_monthly = df4[['AAPL']].resample('M').last()

# Display the last five rows of the data frame to check the output
df_monthly.tail()
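To see exactly which observations a weekly resample picks, here is a small sketch on ten made-up business days. The values are sequential on purpose so the output is easy to trace by eye.

```python
import pandas as pd

# Hypothetical daily closes over two business weeks (Mon 2 Jan to Fri 13 Jan 2023)
days = pd.date_range("2023-01-02", periods=10, freq="B")
daily = pd.Series(range(100, 110), index=days, name="Close", dtype="float64")

# Week-end value: the last observation in each week ending on Sunday
weekly_last = daily.resample("W").last()

# Weekly open/high/low/close bars built from the same daily series
weekly_ohlc = daily.resample("W").ohlc()

print(weekly_last)
print(weekly_ohlc)
```

Note that the weekly labels fall on Sundays even though the last traded day is Friday: the label marks the end of the bin, not the date of the observation.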

Now for something more fun. Download the historical Nasdaq data from Yahoo Finance and save the CSV file in the data folder created earlier.

# Load csv file
#data = pd.read_csv('C:/Users/*****/Desktop/data/Nasdaq.csv', index_col=0, parse_dates=True)

# Load csv file, specify dayfirst
data = pd.read_csv('C:/Users/*****/Desktop/data/Nasdaq.csv', index_col=0, parse_dates=True, dayfirst=True)['2012':'2022']


# Verify the index
data.index

# Verify the data
data['Adj Close'].iplot()

[Figure: line chart of the Nasdaq Adj Close series]

# Load the CSV file
data = pd.read_csv('C:/Users/*****/Desktop/data/Nasdaq.csv', index_col=0, parse_dates=True, dayfirst=True)['2012':'2022']

# Add a new column to calculate "change"
data['Change'] = 100.* np.log(data['Adj Close']).diff().fillna(0)

# Output first five values
data.head()

# Create a copy of Nasdaq dataframe & reset index
df = data.copy()
df.reset_index(inplace=True)

# Assign separate columns for month & year 
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

# Select the index value that corresponds to the maximum or minimum Nasdaq percentage change for each year

# Grouping dataframe to get the max and min values
max_min = {'Change': ['idxmax', 'idxmin']}
df.groupby(['Year']).agg(max_min).T

# Check the results for year 2020
df.loc[[2061, 2062]]

# Check the year wise maximum values
df.loc[df.groupby('Year')['Change'].idxmax()]

# Check the year wise minimum values
df.loc[df.groupby('Year')['Change'].idxmin()]

# Assign a new dataframe with pivoted values
# (the string 'sum' avoids the warning newer pandas raises for aggfunc=np.sum)
newdf = pd.pivot_table(df,
               index='Month',
               columns='Year',
               values='Change',
               aggfunc='sum')

newdf

# Analysing year wise statistics for Nasdaq returns
newdf.describe()
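The grouping and pivoting steps are easier to follow on a table small enough to check by hand. The sketch below uses six made-up daily changes across two years; the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical daily percentage changes spanning two years
sample = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2021-01-04", "2021-01-05", "2021-02-01",
             "2022-01-03", "2022-02-01", "2022-02-02"]
        ),
        "Change": [0.5, -1.2, 2.0, -0.3, 1.1, -2.5],
    }
)
sample["Year"] = sample["Date"].dt.year
sample["Month"] = sample["Date"].dt.month

# idxmax/idxmin return row labels, which loc then uses to pull the full rows
best = sample.loc[sample.groupby("Year")["Change"].idxmax()]
worst = sample.loc[sample.groupby("Year")["Change"].idxmin()]

# Months as rows, years as columns, summed changes as values
table = pd.pivot_table(sample, index="Month", columns="Year",
                       values="Change", aggfunc="sum")
print(best, worst, table, sep="\n\n")
```

Summing within each month is meaningful here because the Change column holds log returns, which add up over time.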

Interactive Visualization of Time Series

Plot Line Chart

df3['Close'][-30:].iplot(kind='line',title='SPY Price')


Plot OHLC Data

df3[-30:].iplot(kind='ohlc',title='SPY Price')


Plot Candlestick

df3[-30:].iplot(kind='candle', title='SPY Price')


Plot Selected Stocks

# Use secondary axis
df4[['AAPL', 'MSFT']].iplot(title='AAPL Vs MSFT', secondary_y='MSFT')


# Use subplots
df4[['AAPL', 'MSFT']].iplot(title='AAPL Vs MSFT Price Movement', subplots=True)


Normalized Plot

df4.normalize().iplot(title='Our Nasdaq-listed Stocks')


Visualizing Return Time Series

# Calculating Log Returns

# Use numpy log function to derive log returns
daily_returns = np.log(df4).diff().dropna()

# Display the first five rows of the data frame to check the output
daily_returns.head(5)

Plot Daily Returns

# Plot Returns
daily_returns[['AAPL','MSFT']].iplot(title='Daily Log Returns')


Plot Annual Returns

# Plot Mean Annual Returns
(daily_returns.mean()*252).iplot(kind='bar')


Plot Rolling Returns

# To calculate 5 days rolling returns, simply sum daily returns for 5 days as log returns are additive
rolling_return = daily_returns.rolling(5).sum().dropna()

# Display the first five rows of the data frame to check the output
rolling_return.head(5)

# Plot Rolling Returns
rolling_return['NVDA'].iplot(title='5-Days Rolling Returns of Nvidia')
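The claim that log returns are additive can be checked directly. In this sketch, with made-up prices, the sum of five daily log returns is compared against the log of the five-day price ratio.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices
close = pd.Series([100.0, 102.0, 101.0, 104.0, 103.0, 106.0, 108.0])

log_ret = np.log(close).diff().dropna()

# Sum of five consecutive daily log returns ...
rolling_5d = log_ret.rolling(5).sum().dropna()

# ... compared with the log of the five-day price ratio
direct_5d = np.log(close / close.shift(5)).dropna()

print(np.allclose(rolling_5d.values, direct_5d.values))

# Converting a log return back to a simple percentage return
simple_5d = np.exp(rolling_5d) - 1
print(simple_5d)
```

This shortcut only holds for log returns; simple returns have to be compounded by multiplication rather than summed.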


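Time Series in Statistics

Statistics is the branch of mathematics concerned with the collection, analysis, interpretation, and organization of data. It has two main branches: descriptive statistics and inferential statistics.

Descriptive statistics summarizes data in a meaningful way and is an essential part of data analysis. Inferential statistics lets us infer trends and draw conclusions from the data.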

# Analysing the daily returns data
daily_returns.describe().T

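Log-Normal Distribution

The normal distribution is the most common and widely used distribution in statistics. It is often called the "bell curve" or "Gaussian curve". Although financial time series look random over short horizons, prices are commonly modeled as log-normally distributed, which is equivalent to assuming that log returns are normally distributed.

Having derived the daily log returns, we now plot their distribution and check whether it looks approximately normal.

# Plot the distribution of daily log returns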
daily_returns.iplot(kind='histogram', title = 'Histogram of Daily Returns', subplots=True)
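Alongside the histogram, two summary numbers give a quick read on normality: skewness and excess kurtosis, both of which are zero for a normal distribution. The sketch below runs them on simulated returns, not market data, so the values come out close to zero by construction.

```python
import numpy as np
import pandas as pd

# Simulated daily log returns drawn from a normal distribution (not market data)
rng = np.random.default_rng(42)
simulated = pd.Series(rng.normal(loc=0.0005, scale=0.01, size=5000))

# For a normal sample both statistics should be close to zero
skewness = simulated.skew()
excess_kurtosis = simulated.kurt()  # pandas reports kurtosis in excess of the normal value

print(round(skewness, 3), round(excess_kurtosis, 3))
```

Calling the same two methods on daily_returns gives one value per stock; real return series typically show excess kurtosis well above zero, meaning fatter tails than the normal model implies.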


Pairwise Correlation

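Pairwise correlation is a statistical measure of the relationship between two variables. It quantifies both the strength and the direction of that relationship.

It is usually expressed as a correlation coefficient ranging from -1 to 1. A coefficient of 1 indicates perfect positive correlation, where the two variables move in the same direction; -1 indicates perfect negative correlation, where they move in opposite directions; 0 indicates no linear correlation between them.

Pairwise correlation is widely used in fields such as finance, economics, and the social sciences to understand relationships between variables and to help forecast future behavior. Note, however, that correlation does not imply causation: other factors may drive the observed relationship.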

# Compute pairwise correlation
daily_returns.corrwith(daily_returns['AMZN'])
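The three reference values of the coefficient are easy to reproduce on constructed data. This sketch builds columns that depend on x in known ways and runs the same corrwith call on them; the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

x = np.linspace(-1.0, 1.0, 101)
frame = pd.DataFrame(
    {
        "x": x,
        "same_direction": 2.0 * x + 1.0,  # perfect positive linear relation
        "opposite": -3.0 * x,             # perfect negative linear relation
        "squared": x ** 2,                # fully determined by x, but not linearly
    }
)

# Correlation of every column with x
corr_with_x = frame.corrwith(frame["x"])
print(corr_with_x.round(6))
```

The squared column is the instructive case: it is completely determined by x, yet its coefficient is essentially zero, because the coefficient only captures linear relationships.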

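Box Plot Analysis

The analysis needs a dataframe in the required format, built by resetting the index and dropping nonessential values from the retrieved Nasdaq data; the newdf pivot table created earlier serves this purpose.

Box Plot

A box plot, also called a box-and-whisker plot, is a graphical tool from descriptive statistics for showing how data is distributed. It displays a five-number summary (minimum, lower quartile, median, upper quartile, maximum), which helps us read off the data's central tendency, dispersion, and skewness.

A box plot is constructed as follows. Sort the data, then split it into four parts that each hold roughly a quarter of the observations; the boundaries between them are the lower quartile, the median, and the upper quartile. The box spans the interquartile range (upper quartile minus lower quartile) and so contains the middle 50% of the data, and a line inside the box marks the median. The whiskers extend from the box to the most extreme data points that lie within 1.5 times the interquartile range of the box edges. Points beyond the whiskers are treated as outliers and are usually drawn as individual points or other symbols.

Box plots can be used to compare the distributions of several datasets and to detect outliers. They are a common descriptive tool, widely used in data analysis, quality control, experimental design, and other fields.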
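The construction described above can be reproduced numerically. This sketch uses made-up values and NumPy's default percentile interpolation; a plotting library may compute quartiles slightly differently, so treat the numbers as illustrative.

```python
import numpy as np

# Hypothetical monthly returns in percent, with one extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: points more than 1.5 * IQR beyond the box edges count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers end at the most extreme points that are still inside the fences
whisker_low, whisker_high = inside.min(), inside.max()
print(q1, median, q3, iqr, whisker_low, whisker_high, outliers)
```

Here the value 50 lies above the upper fence, so it is drawn as an outlier and the upper whisker stops at 9 rather than stretching to the maximum.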

Now let's use a box plot to analyze Nasdaq returns.

# Visualize Nasdaq Box Plot
newdf.iplot(kind='box', 
            title='Nasdaq Return Analysis', 
            yTitle='Returns (%)', 
            legend=False, boxpoints='outliers')
