The Secrets of File Upload (Part 1): Building Your Own Tool

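RFC 1867 describes in detail how files are uploaded through web forms, yet the J2EE Servlet specification defines no API for this feature: no interface, no abstract class, let alone a concrete class. Fortunately, the well-known open-source organization Apache hosts the Commons FileUpload project, which solves this rather troublesome problem for J2EE developers. Knowing how to use the Commons FileUpload component to handle form file uploads is one thing. Understanding its strengths and weaknesses is another. Knowing what RFC 1867 specifies for form-based file upload, and implementing the feature yourself, is yet another.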

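Hold on, isn't this a waste of time? Why build the same wheel again when a ready-made one exists? That sounds reasonable, but consider: a bicycle wheel does not go on a car, and a Ford wheel does not fit an Audi. To build the best car you have to make the best-fitting wheel with your own hands, and the same goes for writing software. Besides, anyone who wants to become an excellent programmer will never take pride in merely having used Commons FileUpload. Mastering the principles behind a technology, and innovating on them, is the programmer's true path. This article implements file upload from scratch.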

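Enough preamble. Let's look at what form-based file upload actually is. RFC 1867 specifies how HTTP forms upload files. In short, the form's ENCTYPE must be multipart/form-data, and the form must be submitted with POST. After the browser encodes each file's content, it separates the files and input fields with a string that does not occur in the content itself; this string is called the boundary. To see how the browser encodes files under RFC 1867, we take a form containing one file input and one text input and dump the request that the browser sends to the server: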

-----------------------------168072824752491622650073
Content-Disposition: form-data; name="_file1"; filename="Image023.jpg"
Content-Type: image/jpeg

ˇÿˇ‡  JFIF          ˇ·
çExif  II*       (                           :           I
      ˇÿˇ€ C
…
…

-----------------------------168072824752491622650073
Content-Disposition: form-data; name="_text1.text1"

some text in mulitpart form
-----------------------------168072824752491622650073--

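After browser encoding, the request body is divided by the boundary. For a file, the Content-Disposition header carries the file name, and the Content-Type header that follows carries the file type. For a plain text input, the boundary is followed by neither a file name nor a Content-Type header. At the end of the request body, the boundary string followed by two "-" characters marks the end of the form content. RFC 1867 actually specifies form file upload in more detail than this; if you are interested, see the official document and the copy on Baidu Wenku.
http://www.ietf.org/rfc/rfc1867.txt
http://wenku.baidu.com/view/3438982458fb770bf78a5573.html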
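To make the difference between a file part and a text part concrete, here is a minimal sketch that pulls the name and filename parameters out of a Content-Disposition line like the ones in the dump. The class and method names are hypothetical, and the sketch assumes both parameters are double-quoted, as they are in the dump above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original article's code.
final class ContentDisposition {

    // "name=" must follow a separator so that it does not match inside "filename="
    private static final Pattern NAME =
            Pattern.compile("(?:^|[;\\s])name=\"([^\"]*)\"");
    private static final Pattern FILENAME =
            Pattern.compile("(?:^|[;\\s])filename=\"([^\"]*)\"");

    private static String param(Pattern p, String headerLine) {
        Matcher m = p.matcher(headerLine);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the form field name, or null if the line has none. */
    static String fieldName(String headerLine) {
        return param(NAME, headerLine);
    }

    /** Returns the uploaded file name, or null for a plain text field. */
    static String fileName(String headerLine) {
        return param(FILENAME, headerLine);
    }

    public static void main(String[] args) {
        String file = "Content-Disposition: form-data; name=\"_file1\"; filename=\"Image023.jpg\"";
        String text = "Content-Disposition: form-data; name=\"_text1.text1\"";
        System.out.println(fieldName(file) + " / " + fileName(file));
        System.out.println(fieldName(text) + " / " + fileName(text));
    }
}
```

A null result from fileName is what tells a parser that the part is an ordinary input field rather than a file.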

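With the encoding of an uploaded form roughly understood, the next task looks clear: split the request body on the boundary string and extract the content of the uploaded files. That is indeed the right path, but is it really as simple as it sounds?
Although the boundary can be read from the request header up front, the boundary itself does not tell the developer where a file's content block begins and where it ends.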
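Reading the boundary from the header might look like the sketch below. It assumes the header value has the usual form multipart/form-data; boundary=..., and the class and method names are hypothetical. The parameter splitting is deliberately simple and would not cope with a quoted boundary that itself contains a semicolon.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of the original article's code.
final class BoundaryExtractor {

    /**
     * Returns the boundary parameter of a multipart/form-data Content-Type
     * header as bytes, or null if the header is not multipart/form-data or
     * has no boundary parameter.
     */
    static byte[] extractBoundary(String contentType) {
        if (contentType == null) {
            return null;
        }
        String lower = contentType.toLowerCase(Locale.ROOT);
        if (!lower.startsWith("multipart/form-data")) {
            return null;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("boundary=")) {
                String value = p.substring("boundary=".length()).trim();
                // the value may be wrapped in double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value.getBytes(StandardCharsets.ISO_8859_1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // illustrative header value, not taken from the dump
        byte[] boundary = extractBoundary("multipart/form-data; boundary=abc123");
        System.out.println(new String(boundary, StandardCharsets.ISO_8859_1));
    }
}
```

In a servlet, the input would typically come from request.getContentType(). Returning bytes rather than a String keeps the result ready for a byte-level search.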

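One remark first: the encoded content is binary. Converting binary content to a string and then searching for the boundary string is inefficient, and Java's string handling is not especially fast to begin with, so that road is closed. How, then, can we find a small run of consecutive bytes inside a large block of binary data? There seemed to be no ready-made method. On reflection, string searching can be borrowed: a string is ultimately binary too, so in principle nothing prevents using a string-search algorithm on binary content.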
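Efficiency aside, there is a second reason to stay with bytes, which the small experiment below illustrates: decoding arbitrary binary data as text can be lossy. This sketch assumes UTF-8 decoding and uses the first four bytes of a JPEG file (FF D8 FF E0) as sample data; the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original article's code.
final class BinaryRoundTrip {

    /** True if decoding the bytes as UTF-8 and encoding them again is lossless. */
    static boolean survivesRoundTrip(byte[] data) {
        String asText = new String(data, StandardCharsets.UTF_8);
        return Arrays.equals(data, asText.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] jpegHeader = { (byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0 };
        byte[] ascii = "some text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(survivesRoundTrip(jpegHeader)); // prints false
        System.out.println(survivesRoundTrip(ascii));      // prints true
    }
}
```

The JPEG bytes are not valid UTF-8, so the decoder substitutes replacement characters and the original bytes cannot be recovered from the string.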

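Given the characteristics of the boundary itself (almost no repetition and moderate length), the choice here is the string-search algorithm published by Boyer and Moore in 1977, known as the BM algorithm. Its complexity is commonly cited as O(M+N), and in practice it is very fast because it can skip over parts of the text without examining them. However, BM is usually described in terms of the common letters and symbols of ASCII, whereas here we need to search binary data, so the original algorithm needs a small modification; a separate post will cover that in detail. For now, assume that the modified BM algorithm can search a large block of binary data, from any starting position, for a specific small block of bytes, returning the starting position of that block if it is found and -1 otherwise. The function is expressed as a static method: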

/**
 * Returns the index within the byte array text of the first occurrence of
 * the byte array word, searching from the given start position. Returns -1
 * if word does not occur.
 *
 * @param text
 *            The byte array to be scanned
 * @param word
 *            The target byte sequence to search for
 * @param start
 *            The index in text at which the search begins
 * @return The start index of the first occurrence, or -1 if there is none
 */
public static int indexOf(byte[] text, byte[] word, int start)
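The author's modified BM implementation is not shown here. As a stand-in with the same signature, the following is a minimal sketch of the Boyer-Moore-Horspool simplification (bad-character shift only) over bytes. The class name ByteSearch is hypothetical. The only byte-specific adjustment is masking each byte with 0xFF, because Java bytes are signed and cannot index the 256-entry shift table directly.

```java
import java.util.Arrays;

// Hypothetical stand-in, not the author's modified BM implementation.
final class ByteSearch {

    /**
     * Returns the index of the first occurrence of word in text at or after
     * start, or -1 if there is none (Boyer-Moore-Horspool).
     */
    static int indexOf(byte[] text, byte[] word, int start) {
        if (text == null || word == null || start < 0) {
            return -1;
        }
        int n = text.length;
        int m = word.length;
        if (m == 0) {
            return start <= n ? start : -1;
        }
        if (m > n - start) {
            return -1;
        }
        // shift[b]: how far the window may move when its last byte is b
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[word[i] & 0xFF] = m - 1 - i;
        }
        int pos = start;
        while (pos <= n - m) {
            // compare the window against word from right to left
            int j = m - 1;
            while (j >= 0 && text[pos + j] == word[j]) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            pos += shift[text[pos + m - 1] & 0xFF];
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] text = { 0x00, (byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0 };
        byte[] word = { (byte) 0xFF, (byte) 0xE0 };
        System.out.println(indexOf(text, word, 0)); // prints 3
    }
}
```

Horspool drops the good-suffix rule of the full algorithm, which keeps the code short; for a boundary with little internal repetition the bad-character shift alone already skips most of the input.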

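With this powerful algorithm in hand, the rest becomes simple. Only three things need to be done:
1)    Get the boundary value and find the positions of the boundary in the submitted request
2)    Based on the content that follows the boundary, write the data blocks identified as files to files
3)    Repeat this process until the request input stream ends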
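The three steps can be sketched on a request body that is already fully in memory, which avoids the buffering concerns of a real stream. Everything in this sketch is hypothetical: the class name, the naive indexOf that stands in for the BM search, and the assumption that the boundary value from the header is preceded in the body by a CRLF and a "--" prefix (4 bytes in total).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory sketch, not the article's FileUploadParser.
final class PartScanner {

    /** Naive byte search, a stand-in for the BM search described in the text. */
    static int indexOf(byte[] text, byte[] word, int start) {
        outer:
        for (int i = Math.max(start, 0); i <= text.length - word.length; i++) {
            for (int j = 0; j < word.length; j++) {
                if (text[i + j] != word[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** Describes each part as "file:<length>" or "field:<length>". */
    static List<String> describeParts(byte[] body, byte[] boundary) {
        List<String> parts = new ArrayList<>();
        byte[] blankLine = { '\r', '\n', '\r', '\n' };
        // step 1: locate the first boundary
        int index = indexOf(body, boundary, 0);
        while (index != -1) {
            int headersStart = index + boundary.length;
            int next = indexOf(body, boundary, headersStart);
            if (next == -1) {
                break; // that was the closing boundary
            }
            int blank = indexOf(body, blankLine, headersStart);
            if (blank != -1 && blank < next) {
                // step 2: the part headers decide whether this is a file
                String headers = new String(body, headersStart,
                        blank - headersStart, StandardCharsets.ISO_8859_1);
                int dataStart = blank + blankLine.length;
                // assumed: CRLF and "--" sit between the data and the boundary
                int dataEnd = Math.max(dataStart, next - 4);
                String kind = headers.contains("filename=\"") ? "file" : "field";
                parts.add(kind + ":" + (dataEnd - dataStart));
            }
            // step 3: continue from the next boundary
            index = next;
        }
        return parts;
    }
}
```

A real parser would write the bytes between dataStart and dataEnd to a file instead of only recording their length, and would have to cope with a boundary that is split across two reads of the input stream.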

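To support this process, we define a class that represents an uploaded file, called MultiPartFile. When an instance is created, it opens a file output stream to which the file data blocks read from the upload request's input stream are written; the class also provides a method to close that stream.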

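import java.io.FileOutputStream;
import java.io.IOException;

class MultiPartFile {
	private String name;
	// offsets into the parser's buffer where this file's data starts and ends
	private int start, end;
	// destination stream for the uploaded data
	private FileOutputStream fos;

	public MultiPartFile(String name) throws IOException {
		this.name = name;
		fos = new FileOutputStream(name);
	}

	public String getName() {
		return name;
	}

	// used by FileUploadParser to locate the file data within its buffer
	public int getStart() {
		return start;
	}

	public void setStart(int start) {
		this.start = start;
	}

	public void append(byte[] buff, int off, int len) throws IOException {
		fos.write(buff, off, len);
	}

	public void append(byte[] buff) throws IOException {
		fos.write(buff);
	}

	public void close() throws IOException {
		fos.flush();
		fos.close();
	}
}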

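FileUploadParser is the class that actually analyzes the request data and extracts the uploaded files. Its parse method analyzes the request and returns a list of MultiPartFile objects. The main code of the class is as follows: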

public class FileUploadParser {

	private static final String _ENCTYPE = "multipart/form-data";
	private static final String _FILE_NAME_KEY = "filename";

	private static final byte[] _CTRF = { 0X0D, 0X0A };
	
	private int bufferSize = 0x20000;

	private byte[] boundary;

	private HttpServletRequest request;

	private String dir;
	
	private String encoding;

	public FileUploadParser(HttpServletRequest request, String dir) {
		this.request = request;
		this.dir = dir;
	}

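	public List<MultiPartFile> parse() throws IOException {
		List<MultiPartFile> files = new ArrayList<MultiPartFile>();
		// helper omitted below; presumably checks the enctype and sets boundary
		this.parseEnctype();
		byte[] buffer = new byte[bufferSize];
		int c = 0; // number of bytes read into buffer on this pass
		boolean hasFile = false; // true while the current part is a file
		boolean end = true; // true when no file part is currently open
		while ((c = request.getInputStream().read(buffer)) != -1) {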
			boolean isNewSegment = true;
			int index = 0;
			while ((index = BoyerMoore.indexOf(buffer, boundary, index)) != -1) {
				if (end) {
					MultiPartFile mpf = parseFile(buffer, index);
					if (mpf != null) {
						files.add(mpf);
						index = mpf.getStart();
						end = false;
						hasFile = true;
					} else {
						hasFile = false; 
						end = true;
						index += boundary.length;
					}
				} else if (hasFile) {
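					// A boundary was found while a file is open: write the bytes before it
					// to the last opened file, then close that file. The 4 bytes skipped are
					// presumably the CRLF and the "--" prefix that precede the boundary.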
					MultiPartFile writer = files.get(files.size() - 1);
					if (isNewSegment) {
						writer.append(buffer, 0, index - 4);
					} else {
						int off = writer.getStart();
						writer.append(buffer, off, index - off - 4);
					}
					writer.close();

					// start a new parse action
					MultiPartFile next = parseFile(buffer, index);
					if (next != null) {
						files.add(next);
						index = next.getStart();
						end = false;
						hasFile = true;
					}
					else {
						hasFile = false;
						end = true;
						index += boundary.length;
					}

				}
				isNewSegment = false;

				/*
				 * // create a new MultiPartFile object if found the boundary //
				 * firstly if (files.size() == 0) { MultiPartFile mpf =
				 * parseFile(buffer, index); if (mpf != null) { files.add(mpf);
				 * index = mpf.getStart(); newSegment = false; hasFile = true;
				 * end = false; } else { hasFile = false; continue; // skip next
				 * boundary } } // append the buffer into exists MultiPartFile
				 * object and // then parse next part of content else {
				 * MultiPartFile last = files.get(files.size() - 1); if (hasFile
				 * && newSegment) { last.append(buffer, 0, index - 4);
				 * last.close(); end = true; } else if (hasFile) { int s =
				 * last.getStart(); last.append(buffer, s, index - s - 4);
				 * last.close(); end = true; } else { continue; } newSegment =
				 * false; hasFile = false; index += boundary.length;
				 * 
				 * MultiPartFile next = parseFile(buffer, index); if (next !=
				 * null) { files.add(next); index = next.getStart(); hasFile =
				 * true; } else { hasFile = false; continue; // skip next
				 * boundary } }
				 */

			}
			// not found boundary, append the buffer into file
			if (!end) {
				MultiPartFile writer = files.get(files.size() - 1);
				if (isNewSegment) {
					writer.append(buffer, 0, c);
				} else {
					int off = writer.getStart();
					writer.append(buffer, off, c - off);
				}
			}
		}
		return files;
	}
 ....
// other helper methods omitted
 }

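At this point, parsing and saving the uploaded file content is done, but the story is not over. When the browser sends data to the server it encodes the content, and that encoded content has to be decoded, especially in web applications that must handle Chinese.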
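As a small illustration of the decoding issue, the sketch below turns the raw bytes of a part header into a String using a given encoding. The class and method names are hypothetical, and falling back to ISO-8859-1 when no encoding is known is an assumption of this sketch, not something the article specifies.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original article's code.
final class HeaderDecoder {

    /** Decodes len bytes of raw, starting at off, using the named encoding. */
    static String decode(byte[] raw, int off, int len, String encoding) {
        Charset cs = (encoding == null)
                ? StandardCharsets.ISO_8859_1 // assumed fallback
                : Charset.forName(encoding);
        return new String(raw, off, len, cs);
    }

    public static void main(String[] args) {
        // a Chinese file name, written with Unicode escapes to keep the source ASCII
        String original = "\u4e2d\u6587.jpg";
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(raw, 0, raw.length, "UTF-8").equals(original)); // prints true
        System.out.println(decode(raw, 0, raw.length, null).equals(original));    // prints false
    }
}
```

Decoding with the wrong charset is what produces garbled Chinese file names, so the parser needs to know which encoding the browser used for the page that contained the form.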
