Garbled text when scraping a web page

2011-1-5 15:41:32

String sohuurl = "http://news.sohu.com/20101025/n276363856.shtml";

public static String utfhtmlcontent(String sohuurl) {
    StringBuffer stringBuffer = new StringBuffer();
    URL requestUrl;
    String string = "";
    try {
        requestUrl = new URL(sohuurl);
        HttpURLConnection connection = (HttpURLConnection) requestUrl.openConnection();
        connection.setRequestProperty("User-Agent",
                "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
        connection.connect();
        InputStream is = connection.getInputStream();
        while ((is.read()) != -1) {
            int all = is.available();
            byte[] b = new byte[all];
            is.read(b);
            stringBuffer.append(new String(b, "UTF-8"));
            string = string + new String(b, "UTF-8");
        }
        if (is != null) {
            is.close();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return string;
}
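What I get back is garbled, and decoding it as UTF-8 is garbled too. How do I get the real page source?
After converting, it is still garbled; the only difference is that the garbage looks different. I know the page is GB2312, but I still can't extract the text. I have tried other encodings, such as the UTF-8 shown above, and that doesn't work either.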

千问 · 2011-1-5 15:41:32

Looking at the page source, the charset it declares shows that you can't decode it as UTF-8.
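To illustrate why the wrong charset gives garbage, here is a small sketch that encodes a two-character Chinese string as GBK and decodes it with different charsets. The class name `MojibakeDemo` and the helper `roundTrips` are made up for this example, and it assumes the JDK ships the GBK and GB2312 charsets, as standard desktop and server JDKs do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes text with one charset, decodes the bytes with another,
    // and reports whether the original text survives the trip.
    static boolean roundTrips(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith).equals(text);
    }

    public static void main(String[] args) {
        // Two CJK characters, written as escapes so the source file stays ASCII.
        String text = "\u4e2d\u6587";
        Charset gbk = Charset.forName("GBK");

        // GBK bytes are not valid UTF-8 sequences, so the decoder
        // substitutes replacement characters and the text is lost.
        System.out.println("GBK -> UTF-8:  " + roundTrips(text, gbk, StandardCharsets.UTF_8));

        // Decoding with the charset that produced the bytes restores the text.
        System.out.println("GBK -> GBK:    " + roundTrips(text, gbk, gbk));
        System.out.println("GBK -> GB2312: " + roundTrips(text, gbk, Charset.forName("GB2312")));
    }
}
```

Once the bytes have been decoded with the wrong charset, converting the resulting string again usually cannot repair it, which would explain why "converting" only changes how the garbage looks.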

千问 · 2011-1-5 15:41:32

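The page usually declares its encoding in the HTML itself. I know you're scraping a web page, so first parse the HTML (in any encoding; the ASCII parts won't be garbled), then decode it again according to the charset declared there.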
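A rough sketch of that two-pass idea in Java: decode the start of the page as ISO-8859-1 (every byte maps to exactly one character, so ASCII markup stays readable), look for a `charset=...` declaration, then decode the full byte array with the charset found. The class `CharsetSniffer`, its method names, the regular expression, and the 4096-byte search window are all assumptions made for illustration; real pages may also declare the charset in the HTTP `Content-Type` header, which this sketch ignores.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {
    // Matches charset=gb2312, charset="utf-8", charset = 'gbk', and so on.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset declared near the top of the page, or the fallback.
    static Charset detect(byte[] page, Charset fallback) {
        int len = Math.min(page.length, 4096);
        // First pass: a byte-preserving decode, good enough to read ASCII markup.
        String head = new String(page, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: use the fallback.
            }
        }
        return fallback;
    }

    // Second pass: decode the whole page with the detected charset.
    static String decode(byte[] page, Charset fallback) {
        return new String(page, detect(page, fallback));
    }
}
```

With the raw bytes of the response in a `byte[]`, a call such as `CharsetSniffer.decode(bytes, StandardCharsets.UTF_8)` would return the decoded text, falling back to UTF-8 when no declaration is found.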

千问 · 2011-1-5 15:41:32

Isn't this page GB2312? On a Chinese-language Windows system you can call Encoding.Default.GetString( b ) to get the string you need.
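`Encoding.Default` is a .NET API, while the code in the question is Java. The closest Java counterpart is `Charset.defaultCharset()`, but its value depends on the JVM and operating system settings (from JDK 18 on it is UTF-8 unless configured otherwise), so naming the charset explicitly is usually more predictable. A small sketch, with a made-up class name `DefaultCharsetDemo`:

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Decodes with whatever charset this JVM treats as its default.
    // The result varies from machine to machine.
    static String decodeWithDefault(byte[] b) {
        return new String(b, Charset.defaultCharset());
    }

    // Decodes with an explicitly named charset, independent of the machine.
    static String decodeAsGbk(byte[] b) {
        return new String(b, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        // GBK encoding of two CJK characters (U+4E2D U+6587).
        byte[] b = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("matches with default: " + decodeWithDefault(b).equals("\u4e2d\u6587"));
        System.out.println("matches with GBK:     " + decodeAsGbk(b).equals("\u4e2d\u6587"));
    }
}
```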

千问 · 2011-1-5 15:41:32

Try switching to GBK or GB2312.
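A minimal Java sketch of that suggestion, assuming the page really is served as GBK/GB2312. The names `GbkFetch`, `readAll`, and `fetch` are made up for illustration. It also buffers the whole response before decoding, because the loop in the question has problems of its own: the `is.read()` call in the `while` condition throws away one byte per iteration, `is.available()` does not report the total length of the response, and decoding each chunk separately can split a two-byte character across chunks. Changing only the charset name may therefore not be enough.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;

class GbkFetch {
    // Reads the stream to its end and returns every byte, none skipped.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n); // keep only the bytes actually read
        }
        return out.toByteArray();
    }

    // Downloads the page and decodes the complete body once.
    static String fetch(String url, Charset charset) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("User-Agent",
                "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
        try (InputStream in = conn.getInputStream()) {
            return new String(readAll(in), charset);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java GbkFetch <url>");
            return;
        }
        // GBK is assumed here; it is a superset of GB2312.
        System.out.println(fetch(args[0], Charset.forName("GBK")));
    }
}
```

Passing the URL from the question as the argument should print the page source, provided the server still answers and the console can display Chinese characters.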