Scraping Phone Numbers with Go

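Step 1. Fetch the web page
   Request the page with http.Get.
   Check the returned error. If there is none, read the response body with ioutil.ReadAll, which returns a []byte, and convert it to a string. (Since Go 1.16, io.ReadAll is the preferred equivalent.)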
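As a rough sketch of Step 1, the fetch can be wrapped in a function that returns an error instead of exiting. The names fetchHTML and fetchFromLocalServer, the 10-second timeout, and the status-code check are assumptions added for illustration and are not part of the original program. The sketch uses io.ReadAll and serves a made-up page from a local httptest server so that it runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchHTML downloads url and returns the response body as a string.
// The name, the timeout and the status check are illustrative choices.
func fetchHTML(url string) (string, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return "", fmt.Errorf("get %s: %w", url, err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("get %s: unexpected status %s", url, resp.Status)
	}

	body, err := io.ReadAll(resp.Body) // returns []byte
	if err != nil {
		return "", fmt.Errorf("read %s: %w", url, err)
	}
	return string(body), nil
}

// fetchFromLocalServer serves page from a throwaway local server
// and fetches it back with fetchHTML, so the sketch works offline.
func fetchFromLocalServer(page string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, page)
	}))
	defer srv.Close()
	return fetchHTML(srv.URL)
}

func main() {
	html, err := fetchFromLocalServer(`<a href="#">15028015320</a>`)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(html)
}
```

To fetch a real page, call fetchHTML with its URL directly. Returning the error leaves it to the caller to decide whether to retry, skip the page, or exit.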
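Step 2. Extract the phone numbers with a regular expression
   rePhone = `1[3457689]\d{9}`
   rePhone2 = `(1[3457689]\d)(\d{4})(\d{4})`
   \d matches one digit and {9} repeats it nine times, so rePhone matches an 11-digit number that starts with 1 and has 3, 4, 5, 6, 7, 8 or 9 as its second digit.
   In rePhone2 the parentheses are capture groups: (1[3457689]\d) captures the first 3 digits, the first (\d{4}) captures the middle 4 of the 11 digits, and the second (\d{4}) captures the last 4.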
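To make the difference between the two patterns concrete, here is a small sketch that runs both over the same text. The function names wholeNumbers and numberParts and the sample string are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

const (
	rePhone  = `1[3457689]\d{9}`
	rePhone2 = `(1[3457689]\d)(\d{4})(\d{4})`
)

// wholeNumbers returns every 11-digit match of rePhone in text.
func wholeNumbers(text string) []string {
	return regexp.MustCompile(rePhone).FindAllString(text, -1)
}

// numberParts returns, for every match of rePhone2, the full match
// followed by its three captured groups.
func numberParts(text string) [][]string {
	return regexp.MustCompile(rePhone2).FindAllStringSubmatch(text, -1)
}

func main() {
	// Made-up sample: two valid numbers and one whose second digit is 2.
	sample := "call 15028015320 or 18303058151, not 12345678901"

	fmt.Println(wholeNumbers(sample))
	// [15028015320 18303058151]

	fmt.Println(numberParts(sample))
	// [[15028015320 150 2801 5320] [18303058151 183 0305 8151]]
}
```

The last number is skipped because its second digit, 2, is not in the class [3457689].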
   

    re := regexp.MustCompile(rePhone)               // compile the regular expression
    allString := re.FindAllStringSubmatch(html, -1) // find all matches; returns [][]string
    for _, v := range allString {
        fmt.Println(v)                              // each v is a []string: the full match, then any captured groups
    }
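Each []string produced by the loop can also be taken apart by index. The sketch below assumes the grouped pattern and joins the three groups with dashes. The dashed format and the names phoneRe and dashed are only an illustration, not something the original program does.

```go
package main

import (
	"fmt"
	"regexp"
)

var phoneRe = regexp.MustCompile(`(1[3457689]\d)(\d{4})(\d{4})`)

// dashed rewrites every matched number as prefix-middle-last.
func dashed(text string) []string {
	var out []string
	for _, m := range phoneRe.FindAllStringSubmatch(text, -1) {
		// m[0] is the whole match; m[1], m[2], m[3] are the captured groups.
		out = append(out, m[1]+"-"+m[2]+"-"+m[3])
	}
	return out
}

func main() {
	for _, s := range dashed("15028015320 and 18303058151") {
		fmt.Println(s)
	}
	// 150-2801-5320
	// 183-0305-8151
}
```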

Full code:


package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
	"regexp"
)

var (
	//rePhone = `1[3457689]\d{9}`
	rePhone = `(1[3457689]\d)(\d{4})(\d{4})`
)

func main() {
	// Fetch the HTML.
	url := "http://he.tiaohao.com/?dis=3"
	resp, err := http.Get(url)
	HandleEr(err, url)
	defer resp.Body.Close()
	bytes, err := ioutil.ReadAll(resp.Body)
	HandleEr(err, "read body")
	html := string(bytes)
	//fmt.Println(html)

	// Extract the phone numbers.
	re := regexp.MustCompile(rePhone)
	allString := re.FindAllStringSubmatch(html, -1)
	for _, v := range allString {
		// v[0] is the whole number; v[1], v[2], v[3] are the captured groups.
		fmt.Println(v)
	}
}

// HandleEr prints what was being done together with the error
// and exits if err is non-nil.
func HandleEr(err error, when string) {
	if err != nil {
		fmt.Println(when, err)
		os.Exit(1)
	}
}


Output. Each number appears twice because the page contains it in more than one <a> tag.


[15028015320 150 2801 5320]
[15028015320 150 2801 5320]
[15028096212 150 2809 6212]
[15028096212 150 2809 6212]
[15297537067 152 9753 7067]
[15297537067 152 9753 7067]
[18303058151 183 0305 8151]
[18303058151 183 0305 8151]
[18230217989 182 3021 7989]
[18230217989 182 3021 7989]
[18303013250 183 0301 3250]
[18303013250 183 0301 3250]
[15127056031 151 2705 6031]
[15127056031 151 2705 6031]
[18303025237 183 0302 5237]
[18303025237 183 0302 5237]
[17832919386 178 3291 9386]
[17832919386 178 3291 9386]
[15932780389 159 3278 0389]
[15932780389 159 3278 0389]
[15531088721 155 3108 8721]
[15531088721 155 3108 8721]
[15632088092 156 3208 8092]
[15632088092 156 3208 8092]
[17692038852 176 9203 8852]
[17692038852 176 9203 8852]
[15632088637 156 3208 8637]
[15632088637 156 3208 8637]
[15632088252 156 3208 8252]
[15632088252 156 3208 8252]
[13288797517 132 8879 7517]
[13288797517 132 8879 7517]
[15631052887 156 3105 2887]
[15631052887 156 3105 2887]
[13288792157 132 8879 2157]
[13288792157 132 8879 2157]
[15630388371 156 3038 8371]
[15630388371 156 3038 8371]
[17633108872 176 3310 8872]
[17633108872 176 3310 8872]
[17752921501 177 5292 1501]
[17752921501 177 5292 1501]
[17752921506 177 5292 1506]
[17752921506 177 5292 1506]
[17752921630 177 5292 1630]
[17752921630 177 5292 1630]
[17752921632 177 5292 1632]
[17752921632 177 5292 1632]
[17752921732 177 5292 1732]
[17752921732 177 5292 1732]
[17752922162 177 5292 2162]
[17752922162 177 5292 2162]
[17752923013 177 5292 3013]
[17752923013 177 5292 3013]
[17752923036 177 5292 3036]
[17752923036 177 5292 3036]
[17752923276 177 5292 3276]
[17752923276 177 5292 3276]
[17752923295 177 5292 3295]
[17752923295 177 5292 3295]
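
If the repeated lines are not wanted, one possible approach is to keep a set of numbers already seen and print each only once. This is a sketch and not part of the original program: uniquePhones and phoneRe are hypothetical names, and the HTML fragment is made up to imitate a page that lists each number in two <a> tags.

```go
package main

import (
	"fmt"
	"regexp"
)

var phoneRe = regexp.MustCompile(`1[3457689]\d{9}`)

// uniquePhones returns each matched number once,
// in order of first appearance.
func uniquePhones(html string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, p := range phoneRe.FindAllString(html, -1) {
		if !seen[p] {
			seen[p] = true
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Made-up fragment: every number occurs in two <a> tags.
	html := `<a href="/15028015320">x</a><a>15028015320</a>` +
		`<a href="/15028096212">y</a><a>15028096212</a>`
	fmt.Println(uniquePhones(html))
	// [15028015320 15028096212]
}
```

A map keyed by the number keeps the check cheap, and appending to a slice preserves the order in which numbers first appear on the page.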

Reposted from www.cnblogs.com/akmfwei/p/12634008.html