Configurable Spider #106

Open
@code4craft

Support writing a spider as a config file or a script.

Choices:

1. XML

<spider>
    <site>
        <charset>utf-8</charset>
        <user-agent></user-agent>
        <cookies>
            <cookie domain="" path="" name="" value="">
            </cookie>
        </cookies>
        <heads>
            <head name="" value=""/>
        </heads>
    </site>

    <startUrls>
        <url></url>
    </startUrls>

    <extraction targetUrl="" helpUrl="">
        <field name="title">
            <extractor type="xpath" value="//div[@class='title']"/>
        </field>
        <field name="content">
            <extractor type="xpath" value="//div[@class='content']"/>
        </field>
    </extraction>

</spider>
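
As a rough illustration of how the `<extraction>` part of such a file could be consumed, here is a minimal sketch that reads each `<field>` and its XPath `<extractor>` into a map using only the JDK's DOM parser. The class name `SpiderConfigSketch`, the method `parseFields`, and the trimmed `SAMPLE` string are assumptions made for this example; nothing here is an existing API of the project, and the schema itself is still only a proposal.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the project: reads the proposed XML config.
public class SpiderConfigSketch {

    // Trimmed copy of the XML proposed above; only the extraction part is kept.
    static final String SAMPLE =
        "<spider>"
      + "<extraction targetUrl=\"\" helpUrl=\"\">"
      + "<field name=\"title\"><extractor type=\"xpath\" value=\"//div[@class='title']\"/></field>"
      + "<field name=\"content\"><extractor type=\"xpath\" value=\"//div[@class='content']\"/></field>"
      + "</extraction>"
      + "</spider>";

    // Returns field name -> XPath expression, in document order.
    static Map<String, String> parseFields(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Map<String, String> fields = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("field");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element field = (Element) nodes.item(i);
            // Assumption: one extractor per field, and only type="xpath" is handled here.
            Element extractor = (Element) field.getElementsByTagName("extractor").item(0);
            if (extractor != null && "xpath".equals(extractor.getAttribute("type"))) {
                fields.put(field.getAttribute("name"), extractor.getAttribute("value"));
            }
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        parseFields(SAMPLE).forEach((name, xpath) -> System.out.println(name + " -> " + xpath));
    }
}
```

The resulting map could then drive a generic processor that applies each expression to the page and stores the result under the field name.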

2. JSON

3. YAML

4. JavaScript

var name = xpath("//h1[@class='entry-title public']/strong/a/text()");
var readme = xpath("//div[@id='readme']/tidyText()");
var star = xpath("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()");

5. JRuby

name = xpath "//h1[@class='entry-title public']/strong/a/text()"
readme = xpath "//div[@id='readme']/tidyText()"
star = xpath "//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()"
fork = xpath "//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()"

6. Java

Just write a PageProcessor and load it dynamically…
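
One way the "load it dynamically" part might look is sketched below: the processor class is looked up by name at runtime and instantiated through reflection. The `Processor` interface here is a hypothetical stand-in so the example compiles on its own; it is not the project's real `PageProcessor`, and `DynamicLoadSketch`, `TitleProcessor`, and `load` are likewise names invented for illustration. Loading from an external jar or compiling source on the fly would need more than this.

```java
// Hypothetical sketch of loading a processor implementation by class name.
public class DynamicLoadSketch {

    /** Stand-in for the project's PageProcessor; the real interface differs. */
    public interface Processor {
        String process(String html);
    }

    /** Example implementation: returns the text between <title> and </title>. */
    public static class TitleProcessor implements Processor {
        @Override
        public String process(String html) {
            int start = html.indexOf("<title>");
            int end = html.indexOf("</title>");
            if (start < 0 || end < start) {
                return "";
            }
            return html.substring(start + "<title>".length(), end);
        }
    }

    // Resolves the class by name and creates an instance via its no-arg constructor.
    static Processor load(String className) throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className);
        // asSubclass throws ClassCastException if the class does not implement Processor.
        return cls.asSubclass(Processor.class).getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // In a real setup the class name would come from a config file or the command line.
        Processor p = load(TitleProcessor.class.getName());
        System.out.println(p.process("<html><title>Configurable Spider</title></html>"));
    }
}
```

Because the class name is just a string, it could sit in any of the config formats listed above, which would let the Java option share the same entry point as the others.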

7. Groovy

8. Scala
