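Step 1. Fetch the web page
Request the page with http.Get.
Check whether an error was returned. If not, read the page content with ioutil.ReadAll. ReadAll returns a []byte, which is then converted to a string. (Since Go 1.16, io.ReadAll is the preferred equivalent of ioutil.ReadAll.)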
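The fetch step can be sketched as a small helper. This is a minimal sketch, not the code used in this post: fetchHTML, newSampleServer, and samplePage are hypothetical names, the status-code check is an addition, and a local httptest server stands in for the real site so the example runs offline. It uses io.ReadAll, which requires Go 1.16 or later.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// samplePage is hypothetical stand-in content served by the local test server.
const samplePage = `<a href="#">15028015320</a>`

// newSampleServer starts a local HTTP server so the example runs
// without touching the real site.
func newSampleServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, samplePage)
	}))
}

// fetchHTML requests url and returns the response body as a string.
func fetchHTML(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", fmt.Errorf("get %s: %w", url, err)
	}
	defer resp.Body.Close() // release the connection when done

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("get %s: unexpected status %s", url, resp.Status)
	}
	body, err := io.ReadAll(resp.Body) // body is a []byte
	if err != nil {
		return "", fmt.Errorf("read %s: %w", url, err)
	}
	return string(body), nil
}

func main() {
	srv := newSampleServer()
	defer srv.Close()

	html, err := fetchHTML(srv.URL)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(html)
}
```

Returning the error instead of exiting lets the caller decide what to do when a request fails.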
Step 2. Extract the phone numbers with a regular expression
rePhone = `1[3457689]\d{9}`
rePhone2 = `(1[3457689]\d)(\d{4})(\d{4})`
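[3457689] matches one digit from that set, \d matches any digit, and {9} means exactly nine of them.
In (1[3457689]\d)(\d{4})(\d{4}), each (\d{4}) captures four digits as one group. The first (\d{4}) holds the middle four digits of the 11-digit number.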
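To make the capture groups concrete, here is a minimal sketch that applies the grouped pattern to a single string. The helper name splitPhone and the input strings are made up for illustration. The last call shows that the pattern is unanchored, so it also matches an 11-digit run inside a longer digit string.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as rePhone2 in the text.
var phoneRe = regexp.MustCompile(`(1[3457689]\d)(\d{4})(\d{4})`)

// splitPhone returns the first match in s as
// [full match, first 3 digits, middle 4 digits, last 4 digits],
// or nil when s contains no match.
func splitPhone(s string) []string {
	return phoneRe.FindStringSubmatch(s)
}

func main() {
	parts := splitPhone("tel: 15028015320")
	fmt.Println(parts)    // [15028015320 150 2801 5320]
	fmt.Println(parts[2]) // 2801

	// The second digit must come from [3457689], so this does not match.
	fmt.Println(splitPhone("12028015320") == nil) // true

	// No anchors or boundaries: a match can sit inside a longer digit run.
	fmt.Println(splitPhone("9915028015320")[0]) // 15028015320
}
```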
re := regexp.MustCompile(rePhone) compiles the regular expression.
allString := re.FindAllStringSubmatch(html, -1) finds every match and returns a [][]string.
for _, v := range allString {
	fmt.Println(v) // iterating over the [][]string, so each v printed is a []string
}
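The difference between the Find variants is easier to see side by side. The following is a sketch under assumptions: sampleHTML is made-up markup (only the two numbers come from this post's output) and middleDigits is a hypothetical helper that picks one capture group out of every match.

```go
package main

import (
	"fmt"
	"regexp"
)

var phoneRe = regexp.MustCompile(`(1[3457689]\d)(\d{4})(\d{4})`)

// sampleHTML is made-up markup for illustration.
const sampleHTML = `<a href="#">15028015320</a> <a href="#">18303058151</a>`

// middleDigits returns the second capture group of every match in html.
func middleDigits(html string) []string {
	var out []string
	for _, m := range phoneRe.FindAllStringSubmatch(html, -1) {
		out = append(out, m[2]) // m[0] is the full match, m[1] to m[3] are the groups
	}
	return out
}

func main() {
	// FindAllString returns only the full matches.
	fmt.Println(phoneRe.FindAllString(sampleHTML, -1)) // [15028015320 18303058151]

	// FindAllStringSubmatch also returns the capture groups; -1 means no limit.
	fmt.Println(phoneRe.FindAllStringSubmatch(sampleHTML, -1)) // [[15028015320 150 2801 5320] [18303058151 183 0305 8151]]

	// A non-negative second argument caps the number of matches.
	fmt.Println(phoneRe.FindAllStringSubmatch(sampleHTML, 1)) // [[15028015320 150 2801 5320]]

	fmt.Println(middleDigits(sampleHTML)) // [2801 0305]
}
```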
Full code:
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
	"regexp"
)

var (
	//rePhone = `1[3457689]\d{9}`
	rePhone = `(1[3457689]\d)(\d{4})(\d{4})`
)

func main() {
	// Fetch the HTML
	resp, err := http.Get("http://he.tiaohao.com/?dis=3")
	HandleEr(err, "http://he.tiaohao.com/?dis=3")
	defer resp.Body.Close()
	bytes, err := ioutil.ReadAll(resp.Body)
	HandleEr(err, "read response body")
	html := string(bytes)
	//fmt.Println(html)

	// Extract the phone numbers
	re := regexp.MustCompile(rePhone)
	allString := re.FindAllStringSubmatch(html, -1)
	for _, v := range allString {
		fmt.Println(v)
	}
}

// HandleEr prints the error together with its context and exits if err is non-nil.
func HandleEr(err error, when string) {
	if err != nil {
		fmt.Println(when, err)
		os.Exit(1)
	}
}
Output. Each number appears twice because the page contains it in more than one <a> tag.
[15028015320 150 2801 5320]
[15028015320 150 2801 5320]
[15028096212 150 2809 6212]
[15028096212 150 2809 6212]
[15297537067 152 9753 7067]
[15297537067 152 9753 7067]
[18303058151 183 0305 8151]
[18303058151 183 0305 8151]
[18230217989 182 3021 7989]
[18230217989 182 3021 7989]
[18303013250 183 0301 3250]
[18303013250 183 0301 3250]
[15127056031 151 2705 6031]
[15127056031 151 2705 6031]
[18303025237 183 0302 5237]
[18303025237 183 0302 5237]
[17832919386 178 3291 9386]
[17832919386 178 3291 9386]
[15932780389 159 3278 0389]
[15932780389 159 3278 0389]
[15531088721 155 3108 8721]
[15531088721 155 3108 8721]
[15632088092 156 3208 8092]
[15632088092 156 3208 8092]
[17692038852 176 9203 8852]
[17692038852 176 9203 8852]
[15632088637 156 3208 8637]
[15632088637 156 3208 8637]
[15632088252 156 3208 8252]
[15632088252 156 3208 8252]
[13288797517 132 8879 7517]
[13288797517 132 8879 7517]
[15631052887 156 3105 2887]
[15631052887 156 3105 2887]
[13288792157 132 8879 2157]
[13288792157 132 8879 2157]
[15630388371 156 3038 8371]
[15630388371 156 3038 8371]
[17633108872 176 3310 8872]
[17633108872 176 3310 8872]
[17752921501 177 5292 1501]
[17752921501 177 5292 1501]
[17752921506 177 5292 1506]
[17752921506 177 5292 1506]
[17752921630 177 5292 1630]
[17752921630 177 5292 1630]
[17752921632 177 5292 1632]
[17752921632 177 5292 1632]
[17752921732 177 5292 1732]
[17752921732 177 5292 1732]
[17752922162 177 5292 2162]
[17752922162 177 5292 2162]
[17752923013 177 5292 3013]
[17752923013 177 5292 3013]
[17752923036 177 5292 3036]
[17752923036 177 5292 3036]
[17752923276 177 5292 3276]
[17752923276 177 5292 3276]
[17752923295 177 5292 3295]
[17752923295 177 5292 3295]
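Since every number shows up twice in the output above, one option is to drop duplicates while keeping the order of first appearance. This is a minimal sketch and not part of the original program: uniquePhones is a hypothetical helper, and the markup in main is invented to mimic a number repeated across <a> tags.

```go
package main

import (
	"fmt"
	"regexp"
)

// The ungrouped pattern from the text is enough when only full numbers are needed.
var phoneRe = regexp.MustCompile(`1[3457689]\d{9}`)

// uniquePhones returns each phone number found in html once,
// in order of first appearance.
func uniquePhones(html string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, p := range phoneRe.FindAllString(html, -1) {
		if !seen[p] {
			seen[p] = true
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Hypothetical markup: the same number inside two <a> tags.
	html := `<a href="#">15028015320</a><a href="#">15028015320</a><a href="#">15028096212</a>`
	fmt.Println(uniquePhones(html)) // [15028015320 15028096212]
}
```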