Commit 633e0fe

document for avalon
1 parent 18a3af4 commit 633e0fe

5 files changed

Lines changed: 96 additions & 1 deletion

webmagic-avalon.md

Lines changed: 24 additions & 0 deletions
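@@ -0,0 +1,24 @@
WebMagic-Avalon Project Handbook
=======
The WebMagic-Avalon project aims to build a configurable, manageable crawler, together with a platform for sharing configurations and scripts. This should reduce the workload of experienced developers and let **people unfamiliar with Java** use a crawler easily.

## Part1: webmagic-scripts

Goal: make it possible to write crawlers as simple scripts, so that shareable scripts can be provided for common scenarios.
For example, to crawl GitHub repository data, you can write a script like this one (JavaScript):

[https://github.com/code4craft/webmagic/tree/master/webmagic-scripts](https://github.com/code4craft/webmagic/tree/master/webmagic-scripts)

This feature is partly implemented, but the end result is still experimental. Feedback and contributions are welcome.

## Part2: webmagic-pannel

A back end that integrates script loading and crawler management. Planned.

## Part3: webmagic-market

A site for sharing, searching, and downloading scripts. Planned.

## How to contribute

webmagic currently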

webmagic-scripts/README.md

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
webmagic-scripts
======
## Goal:
Make it possible to write crawlers as simple scripts, so that shareable scripts can be provided for common scenarios.

## Example:
For example, to crawl GitHub repository data, you can write a script like this one (JavaScript):

```javascript
var name=xpath("//h1[@class='entry-title public']/strong/a/text()")
var readme=xpath("//div[@id='readme']/tidyText()")
var star=xpath("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()")
var fork=xpath("//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()")
var url=page.getUrl().toString()
if (name!=null){
    println(name)
    println(readme)
    println(star)
    println(url)
}

urls("(https://github\\.com/\\w+/\\w+)")
urls("(https://github\\.com/\\w+)")
```
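
To get a feel for which links the two `urls(...)` patterns in the script would select, the regular expressions can be tried on their own in plain JavaScript. This is only a sketch: the sample links are made up, and it assumes the patterns are applied to full link strings and that the first capture group is what gets followed, neither of which is stated in the README.

```javascript
// The two patterns from the script, compiled as JavaScript regular expressions.
// Inside a string literal, "\\." becomes the regex \. (a literal dot).
const repoPattern = new RegExp("(https://github\\.com/\\w+/\\w+)");
const userPattern = new RegExp("(https://github\\.com/\\w+)");

// Sample links, invented for illustration.
const links = [
  "https://github.com/code4craft/webmagic",
  "https://github.com/code4craft",
  "https://example.com/code4craft/webmagic",
];

// Return the first capture group, or null when the pattern does not match.
function firstGroup(pattern, link) {
  const m = pattern.exec(link);
  return m ? m[1] : null;
}

const matches = links.map((link) => ({
  link,
  repo: firstGroup(repoPattern, link),
  user: firstGroup(userPattern, link),
}));

console.log(matches);
```

Note that `\w` does not match `-` or `.`, so a name such as `my-repo` would only be captured up to the hyphen. A character class like `[\\w.-]+` is one possible way to widen the match if that matters.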
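
Then load the script with webmagic and start it, without the usual steps of downloading dependencies, writing code, and executing it yourself.

If someone has already written the script, you can simply use it as is!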
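
The idea of a host that loads a script and supplies its globals can be sketched in plain JavaScript. The `runScript` function, the fake `page` object, and the canned `xpath` lookup below are all hypothetical stand-ins written for this illustration; they are not WebMagic's loader and do not evaluate real XPath.

```javascript
// Hypothetical stand-in for a script host. It is NOT WebMagic's loader;
// it only mimics the globals the example script relies on.
function runScript(source, page) {
  const printed = [];  // everything the script passes to println
  const patterns = []; // every pattern the script passes to urls

  const env = {
    page,
    // Fake xpath: return a canned value for the expression, or null.
    xpath: (expr) => (expr in page.fields ? page.fields[expr] : null),
    println: (value) => printed.push(String(value)),
    urls: (pattern) => patterns.push(pattern),
  };

  // Expose each env entry to the script as a named parameter.
  const script = new Function(...Object.keys(env), source);
  script(...Object.values(env));
  return { printed, patterns };
}

// A shortened version of the GitHub script, kept as raw text.
const source = String.raw`
var name = xpath("//h1/strong/a/text()")
var url = page.getUrl().toString()
if (name != null) {
  println(name)
  println(url)
}
urls("(https://github\\.com/\\w+/\\w+)")
`;

// A fake page with made-up data.
const page = {
  fields: { "//h1/strong/a/text()": "webmagic" },
  getUrl: () => "https://github.com/code4craft/webmagic",
};

const result = runScript(source, page);
console.log(result.printed);
console.log(result.patterns);
```

The point of the sketch is that the script itself contains no setup code: everything it calls (`xpath`, `println`, `urls`, `page`) is injected by whatever loads it, which is what allows the same script text to be shared and reused.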

## Languages:

JavaScript was chosen because it has a broad user base. Ruby is also supported, because its syntax makes for a more concise DSL:

```ruby
name= xpath "//h1[@class='entry-title public']/strong/a/text()"
readme = xpath "//div[@id='readme']/tidyText()"
star = xpath "//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()"
fork = xpath "//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()"
url=$page.getUrl().toString()

puts name,readme,star,fork,url unless name==nil

urls "(https://github\\.com/\\w+/\\w+)"
urls "(https://github\\.com/\\w+)"
```

This feature is still experimental. Feedback and contributions are welcome.
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
var name=xpath("//h1[@class='entry-title public']/strong/a/text()")
var readme=xpath("//div[@id='readme']/tidyText()")
var star=xpath("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()")
var fork=xpath("//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()")
var url=page.getUrl().toString()
if (name!=null){
    println(name)
    println(readme)
    println(star)
    println(url)
}

urls("(https://github\\.com/\\w+/\\w+)")
urls("(https://github\\.com/\\w+)")

webmagic-scripts/src/main/resources/js/oschina.js

Lines changed: 1 addition & 1 deletion
@@ -8,4 +8,4 @@ var config = {
 }
 title = $("div.BlogTitle h1"),
 content = $("div.BlogContent")
-urls("http://my\\.oschina\\.net/flashsword/blog/\\d+")
+urls("http://my\\.oschina\\.net/flashsword/blog/\\d+")
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
name= xpath "//h1[@class='entry-title public']/strong/a/text()"
readme = xpath "//div[@id='readme']/tidyText()"
star = xpath "//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()"
fork = xpath "//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()"
url=$page.getUrl().toString()

puts name,readme,star,fork,url unless name==nil

urls "(https://github\\.com/\\w+/\\w+)"
urls "(https://github\\.com/\\w+)"
