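Let's start by looking at the web page below:
Here is the requirement: the project team wants us to scrape the "sub-specialty name" from this page. Straight to the code.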
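import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

/**
 * Gets the sub-specialty name from a course page.
 *
 * @param url the course page URL
 * @return the sub-specialty name, or null if it cannot be extracted
 */
public static String getSonSubjectName(String url) {
    String sonSubjectName = null;
    if (url == null || url.trim().isEmpty()) {
        return sonSubjectName;
    }
    // try-with-resources closes the response and the client even if parsing throws
    try (CloseableHttpClient httpClient = HttpClients.createDefault();
         CloseableHttpResponse response = httpClient.execute(new HttpGet(url))) {
        HttpEntity entity = response.getEntity();
        if (entity == null) {
            return sonSubjectName;
        }
        String content = EntityUtils.toString(entity, "UTF-8"); // fetch the page content
        // Cut out the fragment between the first and second breadcrumb ids
        int firstEndIndex = content.indexOf("navcrumbId-1");
        int secondEndIndex = content.indexOf("navcrumbId-2");
        if (firstEndIndex < 0 || secondEndIndex <= firstEndIndex) {
            return sonSubjectName; // markers not found; the page layout may have changed
        }
        String resultStr = content.substring(firstEndIndex, secondEndIndex);
        Document document = Jsoup.parse(resultStr); // parse the fragment into a document object
        Elements elements1 = document.getElementsByClass("navcrumb-item"); // breadcrumb nodes
        if (elements1.size() > 1) {
            sonSubjectName = elements1.get(1).text();
        }
    } catch (Exception e) {
        logger.error("WyUtil.getSonSubjectName()----error", e);
    }
    return sonSubjectName;
}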
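The riskiest step in the method above is cutting the page between the two breadcrumb markers: if either marker is missing, indexOf returns -1 and substring throws. Here is a minimal sketch of that step on its own, using only the JDK. The class name MarkerSlice, the helper sliceBetween, and the sample HTML string are all made up for illustration; the real page markup may look different.

```java
import java.util.Optional;

public class MarkerSlice {

    /**
     * Returns the text from the first occurrence of startMarker (inclusive)
     * up to the next occurrence of endMarker (exclusive), or an empty
     * Optional if the content is null or either marker is missing.
     */
    public static Optional<String> sliceBetween(String content, String startMarker, String endMarker) {
        if (content == null) {
            return Optional.empty();
        }
        int start = content.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty();
        }
        // Look for the end marker only after the start marker
        int end = content.indexOf(endMarker, start + startMarker.length());
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(content.substring(start, end));
    }

    public static void main(String[] args) {
        // Hypothetical fragment, not copied from the real page
        String html = "<a id=\"navcrumbId-1\">A</a><a id=\"navcrumbId-2\">B</a>";
        System.out.println(sliceBetween(html, "navcrumbId-1", "navcrumbId-2").orElse("(not found)"));
        System.out.println(sliceBetween("<html></html>", "navcrumbId-1", "navcrumbId-2").isPresent());
    }
}
```

Unlike the original two indexOf calls, this sketch searches for the end marker starting after the start marker, so an end marker that happens to appear earlier in the page cannot produce a negative-length slice. Returning Optional instead of null is just one possible choice here.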
Add a main method to test getSonSubjectName:
public static void main(String[] args) {
    System.out.println(getSonSubjectName("https://study.163.com/course/introduction.htm?courseId=1006073263&_trace_c_p_k2_=bca7cf19265c4b66b5e9cdcd63e59bbc"));
}
Run it, and the result is as follows:
Nice!