Scraping Web Data with RCurl | Nicolas's Blog

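Scraping Web Data with RCurl

October 3, 2013

My boss wants to publish a big paper, and after watching a senior labmate struggle with the data, I took some time to write a simple R script that scrapes data from web pages.

I originally wanted to pull the train timetables straight from 12306, but that site uses a CAPTCHA, which is hard to deal with. So I scrape the data from huochepiao.com (火车票) instead.

The basic code is below. I will come back to database storage, traversal, and similar issues when I have time.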

library(RCurl)  # HTTP requests
library(XML)    # HTML parsing and table extraction
myHttpheader <- c(
  "User-Agent"="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0 ",
  "Accept"="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language"="zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3",
  "Connection"="keep-alive",
  "Accept-Charset"="GB2312,utf-8;q=0.7,*;q=0.7",
  "Cache-Control"="max-age=0"
)

cHandle <- getCurlHandle(httpheader = myHttpheader)
d <- debugGatherer()

from <- paste(paste('%',unlist(iconv('北京',to='GB2312',toRaw=TRUE)),sep=''),collapse='')
to <- paste(paste('%',unlist(iconv('上海',to='GB2312',toRaw=TRUE)),sep=''),collapse='')
# This is the key step: URL-encode the GB2312 bytes of the Chinese station names
url <- paste("http://search.huochepiao.com/chaxun/result.asp?txtChuFa=",
             from,"&txtDaoDa=",to,"&Submit=%d5%be%d5%be%b2%e9%d1%af",sep="")
webpage <- getURL(url, .opts = list(debugfunction=d$update,verbose = TRUE), curl=cHandle, .encoding='UTF-8')
#webpage <- iconv(webpage,"GB2312","UTF-8") # If you convert the encoding here first, htmlParse below no longer needs an encoding argument

data_html <- htmlParse(webpage,asText=TRUE, encoding='GB2312')
data_final <- readHTMLTable(data_html, which = 6, header = TRUE) # read the 6th table on the page
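
The encoding step above can be wrapped in a small helper so that it is easy to build query URLs for several station pairs. This is only a minimal sketch in base R: `gb2312_urlencode` and `build_query_url` are hypothetical names (not part of RCurl), and it assumes the platform's iconv supports GB2312. The helper percent-encodes every byte, including ASCII ones, which is valid but more verbose than strictly necessary.

```r
# Hypothetical helper: percent-encode a string using its GB2312 bytes.
gb2312_urlencode <- function(x) {
  x <- enc2utf8(x)
  bytes <- iconv(x, from = "UTF-8", to = "GB2312", toRaw = TRUE)[[1]]
  if (is.null(bytes)) stop("cannot convert to GB2312: ", x)
  # as.character() on a raw vector yields two-digit lowercase hex strings
  paste0("%", as.character(bytes), collapse = "")
}

# Hypothetical helper: build the query URL used in the post for one station pair.
build_query_url <- function(from, to) {
  paste0("http://search.huochepiao.com/chaxun/result.asp?txtChuFa=",
         gb2312_urlencode(from),
         "&txtDaoDa=", gb2312_urlencode(to),
         "&Submit=%d5%be%d5%be%b2%e9%d1%af")
}

# Station names written as Unicode escapes so the script does not depend
# on the encoding of the source file: 北京 and 上海.
pairs <- data.frame(
  from = c("\u5317\u4eac", "\u4e0a\u6d77"),
  to   = c("\u4e0a\u6d77", "\u5317\u4eac"),
  stringsAsFactors = FALSE
)
urls <- mapply(build_query_url, pairs$from, pairs$to, USE.NAMES = FALSE)
print(urls)
```

Each element of `urls` could then be passed to `getURL` in place of the hand-built `url` above.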

Notes:

Filling in forms

getForm("http://alexa.ip138.com/post/search.asp",zip="518100")
postForm("http://www.shenzhenpost.com.cn/services/postcode/civilcode.asp", "key"="武汉","B1"="查 询",way="add")

You can use the Live HTTP Headers extension for Firefox to inspect the HTTP headers and the POST content.
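
The same information can also be looked at from inside R, since `debugGatherer()` records what libcurl sends and receives. The following is a rough sketch, not a definitive recipe: `inspect_request` is a hypothetical wrapper name, and it assumes RCurl is installed and that the URL you pass is reachable.

```r
# Hypothetical wrapper: fetch a URL and return the headers libcurl exchanged.
inspect_request <- function(url, ...) {
  if (!requireNamespace("RCurl", quietly = TRUE)) {
    stop("the RCurl package is required")
  }
  d <- RCurl::debugGatherer()
  body <- RCurl::getURL(url,
                        .opts = list(debugfunction = d$update, verbose = TRUE),
                        ...)
  info <- d$value()  # named character vector collected during the transfer
  list(
    request_headers  = info[["headerOut"]],  # what was sent to the server
    response_headers = info[["headerIn"]],   # what the server sent back
    body             = body
  )
}
```

Calling, for example, `cat(inspect_request(url)$request_headers)` with the `url` built earlier would print the outgoing request headers, which makes it easier to compare them with what the browser sends.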