Notes on scraping web pages with a crawler

1. Extracting the date when the date format varies from page to page

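import java.util.regex.Matcher;
import java.util.regex.Pattern;

// "\\bon\\b" matches the standalone word "on"; swap in a pattern that fits the site's date text
Pattern p = Pattern.compile("\\bon\\b");
Matcher m = p.matcher(str);
if (m.find()) {
    // keep only the matched substring
    str = m.group();
}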
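
Because the date format differs between sites, one way to cope is to try a list of candidate patterns in turn and keep the first one that matches. The following is a minimal sketch of that idea; the class name DateExtractor and the three patterns are assumptions made for illustration, not part of the original notes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateExtractor {
    // Hypothetical candidate formats; extend the list for each site being crawled.
    private static final Pattern[] CANDIDATES = {
        Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b"),            // e.g. 2011-11-15
        Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b"),        // e.g. 15/11/2011
        Pattern.compile("\\b\\d{1,2} [A-Z][a-z]{2,8} \\d{4}\\b")  // e.g. 15 November 2011
    };

    /** Returns the match of the first candidate pattern that hits, or null if none does. */
    static String extract(String text) {
        for (Pattern p : CANDIDATES) {
            Matcher m = p.matcher(text);
            if (m.find()) {
                return m.group();
            }
        }
        return null;
    }
}
```

Note that the patterns are tried in list order, so the more specific formats should come first.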

2. Getting rid of attributes in the article HTML whose names start with a digit (they are renamed to a0, a1, ...)
import java.util.regex.Matcher
import java.util.regex.Pattern

// Matches an attribute name inside a tag that does not start with a letter or underscore
strReg = "(?<=<[^<>]{1,9999})\\s[^a-zA-Z\\s<>\"_]+[^=\\s<>\"]*\\s?="
Pattern pattern = Pattern.compile(strReg)
Matcher matcher = pattern.matcher(str_page_html)
i = 0
while (matcher.find()) {
    replacedStr = matcher.group()
    // rename the offending attribute to a0, a1, ...
    str_page_html = str_page_html.replace(replacedStr, " a" + i.toString() + "=")
    i++
}
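
The same renaming can be done in a single pass with Matcher.appendReplacement, which avoids rescanning the whole page for every match. This is only a sketch in plain Java: the class name AttributeRenamer is an assumption, and the regular expression is the one from the snippet above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeRenamer {
    // Same expression as in the notes: an attribute name that does not
    // start with a letter or underscore, located inside a tag.
    private static final Pattern BAD_ATTR = Pattern.compile(
        "(?<=<[^<>]{1,9999})\\s[^a-zA-Z\\s<>\"_]+[^=\\s<>\"]*\\s?=");

    /** Renames each offending attribute to a0, a1, ... in order of appearance. */
    static String rename(String html) {
        Matcher matcher = BAD_ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        int i = 0;
        while (matcher.find()) {
            // The replacement contains no '$' or '\', so it is taken literally.
            matcher.appendReplacement(out, " a" + i + "=");
            i++;
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

For example, rename("<div 1abc=\"x\" class=\"y\">") leaves class untouched and turns 1abc into a0.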

3. Concatenating strings in XPath

concat(//span[@class='elementDate latestNews_Date'],//span[@class='elementTime latestNews_Time'])
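
To see what the concat() expression above returns, it can be evaluated with the JDK's built-in javax.xml.xpath API. The sketch below is an illustration under assumptions: the class name XPathConcatDemo and the sample fragment are made up, and the input must be well-formed XML (real HTML pages usually need cleaning first). When an argument of concat() is a node set, XPath 1.0 uses the string value of its first node in document order.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathConcatDemo {
    // Hypothetical page fragment; the class names come from the expression above.
    static final String SAMPLE =
        "<div>"
        + "<span class='elementDate latestNews_Date'>15 Nov 2011</span>"
        + "<span class='elementTime latestNews_Time'>10:30</span>"
        + "</div>";

    // The expression from the notes, unchanged.
    static final String PLAIN =
        "concat(//span[@class='elementDate latestNews_Date'],"
        + "//span[@class='elementTime latestNews_Time'])";

    // Variant with a literal space between the two parts (an addition for readability).
    static final String SPACED =
        "concat(//span[@class='elementDate latestNews_Date'],' ',"
        + "//span[@class='elementTime latestNews_Time'])";

    /** Evaluates an XPath 1.0 expression against an XML string and returns the string result. */
    static String evaluate(String expression, String xml) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or XML: " + expression, e);
        }
    }
}
```

With the sample fragment, PLAIN yields "15 Nov 201110:30" and SPACED yields "15 Nov 2011 10:30", so a separator argument is usually worth adding before the result is parsed as a date.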

4. Some XPath syntax: the axes


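     1. child  selects all child nodes of the current node

     2. parent  selects the parent of the current node

     3. descendant  selects all descendants of the current node (children, grandchildren, and so on)

     4. ancestor  selects all ancestors of the current node (parent, grandparent, and so on)

     5. descendant-or-self  selects all descendants of the current node (children, grandchildren, and so on) plus the current node itself

     6. ancestor-or-self  selects all ancestors of the current node (parent, grandparent, and so on) plus the current node itself

     7. preceding-sibling  selects all siblings that come before the current node

     8. following-sibling  selects all siblings that come after the current node

     9. preceding  selects all nodes in the document that end before the current node's start tag (ancestors are therefore not included)

     10. following  selects all nodes in the document that start after the current node's end tag (descendants are therefore not included)

     11. self  selects the current node

     12. attribute  selects all attributes of the current node

     13. namespace  selects all namespace nodes of the current node

     preceding-sibling selects all siblings before the current node, that is, the nodes under the same parent that come before it: its "older brothers" (older brothers with the same father).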
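
A few of the axes above can be tried out with the JDK's javax.xml.xpath API. The following is a small sketch; the class name XPathAxisDemo and the sample list are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathAxisDemo {
    // Hypothetical document: four siblings, the third one is the "current" node.
    static final String SAMPLE =
        "<ul><li>a</li><li>b</li><li id='cur'>c</li><li>d</li></ul>";

    /** Evaluates an XPath 1.0 expression against an XML string and returns the string result. */
    static String evaluate(String expression, String xml) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or XML: " + expression, e);
        }
    }

    // Number of "older brothers" of the current node.
    static final String COUNT_OLDER = "count(//li[@id='cur']/preceding-sibling::li)";
    // On a reverse axis, position 1 is the sibling closest to the current node.
    static final String NEAREST_OLDER = "//li[@id='cur']/preceding-sibling::li[1]";
    // First sibling after the current node.
    static final String FIRST_YOUNGER = "//li[@id='cur']/following-sibling::li";
    // Element name of the parent.
    static final String PARENT_NAME = "name(//li[@id='cur']/parent::*)";
}
```

Against the sample, COUNT_OLDER gives "2", NEAREST_OLDER gives "b", FIRST_YOUNGER gives "d", and PARENT_NAME gives "ul".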

5. Encoding non-English characters in article_url:
import java.net.URLEncoder

// "XXXXXXXXXXX" / "xxxxxxxxxxx" are placeholders for the non-English part of the URL
if (str_article_url.contains("XXXXXXXXXXX")) {
    temp_str_befor = "xxxxxxxxxxx"
    temp_str_after = URLEncoder.encode(temp_str_befor, "UTF-8")
    str_article_url = str_article_url.replace(temp_str_befor, temp_str_after)
}
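
As a concrete illustration of the step above, here is a small plain-Java sketch that percent-encodes one non-ASCII segment of a URL. The class name UrlSegmentEncoder and the example.com address used afterwards are assumptions. Keep in mind that URLEncoder implements form encoding, so a space becomes "+" rather than "%20".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlSegmentEncoder {
    /** Replaces rawSegment inside url with its UTF-8 percent-encoded form. */
    static String encodeSegment(String url, String rawSegment) {
        if (!url.contains(rawSegment)) {
            return url; // nothing to encode
        }
        try {
            String encoded = URLEncoder.encode(rawSegment, "UTF-8");
            return url.replace(rawSegment, encoded);
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }
}
```

For instance, encodeSegment("http://example.com/news/\u65b0\u95fb.html", "\u65b0\u95fb") returns "http://example.com/news/%E6%96%B0%E9%97%BB.html". Only the segment is encoded, so the slashes and the scheme stay intact.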

6. Turning relative times (hours, days, weeks ago) into a formatted date
import java.text.SimpleDateFormat
import java.util.Calendar

SimpleDateFormat sdf = new SimpleDateFormat(date_format.toString())
//str_date_posted = sys.datetime(date_format.toString())
Calendar cal = Calendar.getInstance()
// str is expected to start with a number, e.g. "3 Hours ...".
// add() with a negative amount moves the calendar back;
// set() would overwrite the field with the negative value instead.
if (str.contains("Hour")) {
    hoursNum = str.split("Hour")[0].trim()
    println "hoursNum==" + hoursNum
    cal.add(Calendar.HOUR_OF_DAY, -Integer.parseInt(hoursNum))
    str = sdf.format(cal.getTime())
} else if (str.contains("Day")) {
    daysNum = str.split("Day")[0].trim()
    println "daysNum==" + daysNum
    cal.add(Calendar.DAY_OF_YEAR, -Integer.parseInt(daysNum))
    str = sdf.format(cal.getTime())
} else if (str.contains("Week")) {
    weeksNum = str.split("Week")[0].trim()
    println "weeksNum==" + weeksNum
    cal.add(Calendar.WEEK_OF_YEAR, -Integer.parseInt(weeksNum))
    str = sdf.format(cal.getTime())
}
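
The relative-time handling can also be written as a small helper that takes the reference time as a parameter, which makes it easy to check. This is a sketch under assumptions: the class name RelativeDate, the regular expression, and the choice of a UTC calendar (to keep the arithmetic independent of daylight saving time) are not from the original notes.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RelativeDate {
    // A number followed by one of the three units used in the notes.
    private static final Pattern AGO = Pattern.compile("(\\d+)\\s*(Hour|Day|Week)");

    /** Returns 'now' moved back by the amount found in text, or null if text has no such amount. */
    static Date resolve(String text, Date now) {
        Matcher m = AGO.matcher(text);
        if (!m.find()) {
            return null;
        }
        int amount = Integer.parseInt(m.group(1));
        String unit = m.group(2);
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.setTime(now);
        // add() shifts the field relative to its current value.
        if (unit.equals("Hour")) {
            cal.add(Calendar.HOUR_OF_DAY, -amount);
        } else if (unit.equals("Day")) {
            cal.add(Calendar.DAY_OF_YEAR, -amount);
        } else {
            cal.add(Calendar.WEEK_OF_YEAR, -amount);
        }
        return cal.getTime();
    }
}
```

The returned Date can then be passed to SimpleDateFormat.format as in the snippet above.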

Please excuse the uneven code formatting. To be continued...

Reposted from west-singapore-gmail-com.iteye.com/blog/1253792