Note that JRex is no longer maintained; the latest release dates from 2005.
"JRex" is a Java Browser Component with set of API's for Embedding Mozilla GECKO within a Java Application.
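I. Installation
Website: http://jrex.mozdev.org/
1. Unzip jrex_gre.zip into the C:\jrex_gre directory.
2. Copy the files from jrex-bin-log-1.0b1_dom3.zip into the C:\jrex_gre directory.
3. Run run.bat to see a Java browser implemented with JRex. It works quite well.
Note: JAVA_HOME must point to a JRE, not a JDK; otherwise jwt.dll (presumably jawt.dll) will not be found. For example: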
"C:\Program Files\Java\jre1.5.0_06/bin/java"
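II. Programming
Goal: reproduce Firefox's "View Generated Source", that is, dump the DOM as it stands after the page has loaded.
The code is as follows: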
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringWriter;

import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.mozilla.jrex.JRexFactory;
import org.mozilla.jrex.event.progress.ProgressEvent;
import org.mozilla.jrex.navigation.WebNavigation;
import org.mozilla.jrex.navigation.WebNavigationConstants;
import org.mozilla.jrex.ui.JRexCanvas;
import org.mozilla.jrex.window.JRexWindowManager;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class Render implements org.mozilla.jrex.event.progress.ProgressListener {

    // Set from the progress listener and polled in parsePage, hence volatile.
    volatile boolean done = false;

    public boolean parsePage(String url) throws Exception {
        System.setProperty("jrex.browser.usesetupflags", "true");
        System.setProperty("jrex.browser.allow.images", "false"); // do not load images
        System.setProperty("jrex.browser.allow.plugin", "false"); // do not load plugins (e.g. Flash)

        // The JRexCanvas is the main browser component. The WebNavigation
        // object is used to access the DOM.
        JRexCanvas canvas = null;
        WebNavigation navigation = null;

        // Start up JRex/Gecko.
        JRexFactory.getInstance().startEngine();

        // Get a window manager and put the browser in a Swing frame.
        // Based on Dietrich Kappe's code.
        JRexWindowManager winManager = (JRexWindowManager) JRexFactory
                .getInstance().getImplInstance(JRexFactory.WINDOW_MANAGER);
        winManager.create(JRexWindowManager.SINGLE_WINDOW_MODE);
        JPanel panel = new JPanel();
        JFrame frame = new JFrame();
        frame.getContentPane().add(panel);
        winManager.init(panel);

        // Get the JRexCanvas, set Render to handle progress events so
        // we can determine when the page is loaded, and get the
        // WebNavigation object.
        canvas = (JRexCanvas) winManager.getBrowserForParent(panel);
        canvas.addProgressListener(this);
        navigation = canvas.getNavigator();

        // Load and process the page.
        navigation.loadURI(url, WebNavigationConstants.LOAD_FLAGS_NONE, null, null, null);

        // Swing magic.
        frame.setSize(640, 480);
        frame.setVisible(false);

        // Check if the DOM has loaded every two seconds.
        while (!done) {
            Thread.sleep(2000);
        }

        // Get the DOM and serialize it starting from the root element.
        Document doc = navigation.getDocument();
        Element ex = doc.getDocumentElement();
        String html = xmlToString(ex);

        // Write the generated source to a file and echo it to the console.
        File file = new File("d:\\youtube.html");
        FileOutputStream outer = new FileOutputStream(file);
        OutputStreamWriter sw = new OutputStreamWriter(outer, "utf-8");
        sw.write(html);
        sw.close();
        System.out.println(html);
        return true;
    }

    // Serialize a DOM node to a string using the "html" output method.
    public static String xmlToString(Node node) throws Exception {
        Source source = new DOMSource(node);
        StringWriter stringWriter = new StringWriter();
        Result result = new StreamResult(stringWriter);
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(source, result);
        return stringWriter.getBuffer().toString();
    }

    /**
     * onStateChange is invoked several times when DOM loading is complete. Set
     * the done flag the first time.
     */
    public void onStateChange(ProgressEvent event) {
        if (!event.isLoadingDocument()) {
            if (done)
                return;
            done = true;
        }
    }

    public static void main(String[] args) throws Exception {
        //String url = "http://www.youtube.com/watch?v=XOHE2KsmdGg";
        //String url = "http://www.cnn.com";
        String url = "http://www.56.com/u42/v_MzY2NTYxNjc.html";
        //String url = "http://ilovelate.blog.163.com";
        Render p = new Render();
        p.parsePage(url);
        System.exit(0);
    }

    // The remaining ProgressListener callbacks are not needed here.
    public void onLinkStatusChange(ProgressEvent event) {
    }

    public void onLocationChange(ProgressEvent event) {
    }

    public void onProgressChange(ProgressEvent event) {
    }

    public void onSecurityChange(ProgressEvent event) {
    }

    public void onStatusChange(ProgressEvent event) {
    }
}
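Since JRex itself is hard to set up these days, it can help to try the serialization step on its own. The following is a minimal sketch using only the standard JAXP classes; the class name SerializeDemo and the hand-built DOM are illustrative assumptions and have nothing to do with JRex. It applies the same "html" output method as xmlToString to a small document created in memory.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical standalone demo; not part of JRex or of the Render class.
public class SerializeDemo {

    // Same idea as Render.xmlToString: serialize a DOM node with the "html" method.
    public static String toHtml(Node node) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    // Build a tiny DOM by hand, standing in for the one a browser would produce.
    public static Element sampleDom() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element p = doc.createElement("p");
        p.setTextContent("generated by script");
        body.appendChild(p);
        html.appendChild(body);
        doc.appendChild(html);
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toHtml(sampleDom()));
    }
}
```

This sketch needs no JRex settings; it only shows what the serializer does with a DOM once one is available.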
Running Render requires the following VM arguments:
-Djrex.dom.enable=true
-Djrex.gre.path=c:\jrex_gre
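Remember to change the output file in File file = new File("d:\\youtube.html"); to a path that exists on your machine.
Set the environment variables:
JAVA_HOME = C:\Java\jre1.5.0 (the JRE directory, not the JDK directory)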
JREX_GRE_PATH=c:\jrex_gre
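A wrong path in any of these settings tends to surface only as an obscure native-library error, so a small check before starting the engine may save time. The sketch below is an assumption of mine, not something shipped with JRex: the class JRexPreflight and its checkDir helper are hypothetical, and it only verifies that the VM arguments and environment variables named above are set and point to existing directories.

```java
import java.io.File;

// Hypothetical preflight check; not part of JRex.
public class JRexPreflight {

    // Returns null if value names an existing directory, otherwise a description of the problem.
    public static String checkDir(String label, String value) {
        if (value == null || value.trim().isEmpty()) {
            return label + " is not set";
        }
        if (!new File(value).isDirectory()) {
            return label + " is not a directory: " + value;
        }
        return null;
    }

    public static void main(String[] args) {
        // Label / value pairs taken from the settings listed in the text.
        String[][] settings = {
            {"-Djrex.gre.path", System.getProperty("jrex.gre.path")},
            {"JREX_GRE_PATH", System.getenv("JREX_GRE_PATH")},
            {"JAVA_HOME", System.getenv("JAVA_HOME")},
        };
        for (String[] setting : settings) {
            String problem = checkDir(setting[0], setting[1]);
            System.out.println(problem == null ? setting[0] + " OK" : "WARNING: " + problem);
        }
        if (!"true".equals(System.getProperty("jrex.dom.enable"))) {
            System.out.println("WARNING: -Djrex.dom.enable=true is not set");
        }
    }
}
```

Run it with the same VM arguments as Render; it does not check whether JAVA_HOME is a JRE rather than a JDK, which still has to be verified by hand.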
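Limitations and problems
Render is a simple example of using JRex, not a complete one. I use a subclass of Render when mining web pages, and it works well, but all the pages I have tested were well-behaved ones.
I use an event listener to determine whether the page has finished loading. Render's parsePage method checks the done flag every two seconds. If the page cannot be loaded, the flag is never set and the loop never ends.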
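One possible way around the endless loop is to wait with a timeout instead of polling a flag. The sketch below uses java.util.concurrent.CountDownLatch; the class LoadWaiter and its method names are hypothetical and not part of JRex. The assumption is that onStateChange would call markLoaded() where it currently sets done, and parsePage would call awaitLoaded(...) instead of the while loop, giving up when it returns false.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; not part of JRex. Replaces the "while (!done) sleep" loop
// with a wait that gives up after a timeout.
public class LoadWaiter {

    private final CountDownLatch loaded = new CountDownLatch(1);

    // Call from the progress listener when loading has finished.
    // Calling it more than once is harmless.
    public void markLoaded() {
        loaded.countDown();
    }

    // Blocks until markLoaded() has been called or the timeout expires.
    // Returns true if the page was marked as loaded in time.
    public boolean awaitLoaded(long timeoutMillis) throws InterruptedException {
        return loaded.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        // A page that finishes loading: another thread signals completion.
        LoadWaiter ok = new LoadWaiter();
        new Thread(ok::markLoaded).start();
        System.out.println("loaded in time: " + ok.awaitLoaded(2000));

        // A page that never loads: nobody signals, so the wait times out.
        LoadWaiter stuck = new LoadWaiter();
        System.out.println("stuck page loaded: " + stuck.awaitLoaded(200));
    }
}
```

Compared with sleeping in two-second steps, the latch also returns as soon as the signal arrives, so fast pages are not delayed.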
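In addition, when the embedded browser is loaded, the browser window is displayed until loading succeeds. I have not dealt with this, because my mining task does not need a browser window.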