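1. Extracting the date (the date format varies a lot)
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Match the standalone word "on" (\b is a word boundary)
Pattern p = Pattern.compile("\\bon\\b");
Matcher m = p.matcher(str);
if (m.find()) {
    // m.group() returns the matched text itself, here the word "on"
    str = m.group()
}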
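Because the date format differs from site to site, one possible approach is to try several candidate patterns in order and keep the first hit. The following is only a minimal sketch: the class name `DateTextExtractor`, the method `extractDateText`, and the three patterns are assumptions made for illustration, not part of the original notes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateTextExtractor {
    // Hypothetical candidate patterns, tried in order; extend them for the sites you crawl.
    private static final Pattern[] CANDIDATES = {
        Pattern.compile("\\b\\d{4}-\\d{1,2}-\\d{1,2}\\b"),            // e.g. 2011-11-15
        Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b"),            // e.g. 15/11/2011
        Pattern.compile("\\b\\d{1,2}\\s+[A-Za-z]{3,9}\\s+\\d{4}\\b")  // e.g. 15 November 2011
    };

    // Returns the first substring that looks like a date, or empty if none matches.
    static Optional<String> extractDateText(String text) {
        for (Pattern p : CANDIDATES) {
            Matcher m = p.matcher(text);
            if (m.find()) {
                return Optional.of(m.group());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // prints "15 November 2011"
        System.out.println(
            extractDateText("Posted on 15 November 2011 by admin").orElse("(no date found)"));
    }
}
```

The extracted text still has to be parsed with a matching date format afterwards; this sketch only isolates the date substring.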
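2. Getting rid of attribute names in the article that start with a digit (they are renamed to a0, a1, ...)
import java.util.regex.Matcher
import java.util.regex.Pattern
// Inside a tag, match " name=" where the attribute name starts with a character
// that is not a letter or underscore (for example a digit)
strReg = "(?<=<[^<>]{1,9999})\\s[^a-zA-Z\\s<>\"_]+[^=\\s<>\"]*\\s?="
Pattern pattern = Pattern.compile(strReg)
Matcher matcher = pattern.matcher(str_page_html)
i = 0
while (matcher.find()) {
    replacedStr = matcher.group()
    // Note: replace() rewrites every occurrence of replacedStr in the page, not only this match
    str_page_html = str_page_html.replace(replacedStr, " a" + i.toString() + "=")
    i++
}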
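The loop above calls `replace()` on the whole page for each match. A single-pass variant can be sketched with `Matcher.appendReplacement`, which rewrites each match exactly once. This is only an illustration under the same regular expression; the class name `AttributeRenamer` and the method `renameBadAttributes` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeRenamer {
    // Same expression as in the notes: inside a tag, an attribute name that
    // does not start with a letter or underscore, followed by "=".
    private static final Pattern BAD_ATTR = Pattern.compile(
        "(?<=<[^<>]{1,9999})\\s[^a-zA-Z\\s<>\"_]+[^=\\s<>\"]*\\s?=");

    // Renames each offending attribute to a0, a1, ... in one pass over the input.
    static String renameBadAttributes(String html) {
        Matcher m = BAD_ATTR.matcher(html);
        StringBuffer out = new StringBuffer();
        int i = 0;
        while (m.find()) {
            m.appendReplacement(out, " a" + i + "=");
            i++;
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(renameBadAttributes("<div 1abc=\"x\" class=\"y\">text</div>"));
    }
}
```

Unlike the loop in the notes, two attributes with the same invalid name get different replacement names here, because each match is handled separately.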
3. Concatenating strings in XPath
concat(//span[@class='elementDate latestNews_Date'],//span[@class='elementTime latestNews_Time'])
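To see what the `concat()` expression returns, it can be evaluated with the standard `javax.xml.xpath` API. The sketch below assumes the page has already been cleaned into well-formed XML (raw HTML often is not); the class name `XPathConcatSketch`, the method `evaluate`, and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathConcatSketch {
    // Parses well-formed XML and evaluates an XPath expression to a string.
    static String evaluate(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample markup with a date span and a time span.
        String xml = "<div>"
            + "<span class=\"elementDate latestNews_Date\">15 Nov 2011</span>"
            + "<span class=\"elementTime latestNews_Time\"> 08:30</span>"
            + "</div>";
        String expr = "concat(//span[@class='elementDate latestNews_Date'],"
            + "//span[@class='elementTime latestNews_Time'])";
        System.out.println(evaluate(xml, expr));
    }
}
```

In XPath 1.0, `concat()` converts each node-set argument to the string value of its first node in document order, so only the first matching span of each class contributes.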
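4. Some XPath syntax (axes)
1. child: selects all children of the current node
2. parent: selects the parent of the current node
3. descendant: selects all descendants of the current node (children, grandchildren, and so on)
4. ancestor: selects all ancestors of the current node (parent, grandparent, and so on)
5. descendant-or-self: selects all descendants of the current node (children, grandchildren, and so on) plus the current node itself
6. ancestor-or-self: selects all ancestors of the current node (parent, grandparent, and so on) plus the current node itself
7. preceding-sibling: selects all siblings before the current node
8. following-sibling: selects all siblings after the current node
9. preceding: selects all nodes in the document that come before the start tag of the current node, excluding its ancestors and any attribute or namespace nodes
10. following: selects all nodes in the document that come after the end tag of the current node, excluding any attribute or namespace nodes
11. self: selects the current node
12. attribute: selects all attributes of the current node
13. namespace: selects all namespace nodes of the current node
preceding-sibling selects all siblings before the current node, that is, the nodes under the same parent that come before it: its "elder brother" nodes (elder brothers with the same parent).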
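The sibling axes described above can be tried out on a small document. This is a minimal sketch using the standard `javax.xml.xpath` API; the class name `XPathAxesSketch`, the method `selectTexts`, and the sample list markup are assumptions for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathAxesSketch {
    // Evaluates the expression as a node-set and returns the text content of each node.
    static List<String> selectTexts(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: the third item is the "current" node.
        String xml = "<ul><li>a</li><li>b</li><li id=\"cur\">c</li><li>d</li></ul>";
        System.out.println(selectTexts(xml, "//li[@id='cur']/preceding-sibling::li"));
        System.out.println(selectTexts(xml, "//li[@id='cur']/following-sibling::li"));
        // [1] on a reverse axis picks the nearest preceding sibling.
        System.out.println(selectTexts(xml, "//li[@id='cur']/preceding-sibling::li[1]"));
    }
}
```

Note that a positional predicate such as `[1]` counts backwards on reverse axes like `preceding-sibling`, so it selects the sibling closest to the current node rather than the first one in the document.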
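5. Encoding non-English characters in article_url:
import java.net.URLEncoder
if (str_article_url.contains("XXXXXXXXXXX")) {
    // The part of the URL that contains the non-English characters
    temp_str_befor = "xxxxxxxxxxx"
    // Percent-encode that part as UTF-8
    temp_str_after = URLEncoder.encode(temp_str_befor, "UTF-8")
    str_article_url = str_article_url.replace(temp_str_befor, temp_str_after)
}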
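When the non-English part of the URL is not known in advance, a more general variant is to percent-encode every run of non-ASCII characters and leave the rest of the URL untouched. The following is a rough sketch of that idea, not the method used in the notes; the class name `UrlNonAsciiEncoder`, the method `encodeNonAscii`, and the example URL are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlNonAsciiEncoder {
    // One or more consecutive characters outside the ASCII range.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]+");

    // Percent-encodes (UTF-8) only the non-ASCII runs; ASCII characters such as
    // ':' and '/' are left as they are.
    static String encodeNonAscii(String url) {
        Matcher m = NON_ASCII.matcher(url);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            try {
                String encoded = URLEncoder.encode(m.group(), "UTF-8");
                m.appendReplacement(out, Matcher.quoteReplacement(encoded));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException("UTF-8 is always supported", e);
            }
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // \u65b0\u95fb are two Chinese characters, written as escapes to avoid
        // source-encoding problems.
        System.out.println(encodeNonAscii("http://example.com/\u65b0\u95fb/1.html"));
    }
}
```

Encoding the whole URL with `URLEncoder.encode` would also escape `:` and `/`, which is why only the non-ASCII runs are passed to it here.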
6. Setting the time from minute, second, hour, day, and week values
import java.text.SimpleDateFormat
SimpleDateFormat sdf = new SimpleDateFormat(date_format.toString())
//str_date_posted = sys.datetime(date_format.toString())
Calendar cal = Calendar.getInstance()
if (str.contains("Hour")) {
    hoursNum = str.split("Hour")[0].trim()
    println "hoursNum==" + hoursNum
    // add() with a negative amount moves the calendar back; set() would overwrite the field instead
    cal.add(Calendar.HOUR_OF_DAY, -Integer.parseInt(hoursNum))
    str = sdf.format(cal.getTime())
}
if (str.contains("Day")) {
    daysNum = str.split("Day")[0].trim()
    println "daysNum==" + daysNum
    cal.add(Calendar.DAY_OF_YEAR, -Integer.parseInt(daysNum))
    str = sdf.format(cal.getTime())
}
if (str.contains("Week")) {
    weeksNum = str.split("Week")[0].trim()
    println "weeksNum==" + weeksNum
    cal.add(Calendar.WEEK_OF_YEAR, -Integer.parseInt(weeksNum))
    str = sdf.format(cal.getTime())
}
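The relative-time handling can also be written as a small self-contained helper that takes the reference time as a parameter, which makes it easy to check. This is only a sketch: the class name `RelativeTimeSketch`, the method `resolve`, the regular expression, and the assumed input shape ("3 Hours ago", "2 Days ago", "1 Week ago") are all illustrative assumptions.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RelativeTimeSketch {
    // A number followed by a unit, e.g. "3 Hours", "2 Days", "1 Week".
    private static final Pattern RELATIVE =
        Pattern.compile("(\\d+)\\s*(Hour|Day|Week)", Pattern.CASE_INSENSITIVE);

    // Converts a relative expression into a formatted timestamp, counting back
    // from 'now'. Text without a recognised expression is returned unchanged.
    static String resolve(String text, Calendar now, String dateFormat) {
        Matcher m = RELATIVE.matcher(text);
        if (!m.find()) {
            return text;
        }
        int amount = Integer.parseInt(m.group(1));
        String unit = m.group(2).toLowerCase(Locale.ROOT);
        Calendar cal = (Calendar) now.clone(); // do not modify the caller's calendar
        switch (unit) {
            case "hour":
                cal.add(Calendar.HOUR_OF_DAY, -amount);
                break;
            case "day":
                cal.add(Calendar.DAY_OF_YEAR, -amount);
                break;
            default: // "week"
                cal.add(Calendar.WEEK_OF_YEAR, -amount);
                break;
        }
        return new SimpleDateFormat(dateFormat, Locale.ENGLISH).format(cal.getTime());
    }

    public static void main(String[] args) {
        // The output depends on the current time, so none is shown here.
        System.out.println(resolve("3 Hours ago", Calendar.getInstance(), "yyyy-MM-dd HH:mm"));
    }
}
```

`Calendar.add` rolls over into the previous day, month, or year when needed, which is what makes it suitable for "N units ago" calculations.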
Please excuse the uneven code formatting. To be continued...
A short summary of notes on scraping web pages with a crawler
Reposted from west-singapore-gmail-com.iteye.com/blog/1253792