Crawler Story

Crawler:

  • What will the crawler crawl?
    • Content types: atom+xml, xhtml, html, kml
    • Operative elements: <link>, <a>
  • What criteria decide which links to traverse?
    • URLs that refer to Pleiades (from httpd logs)
    • Seed with URLs for recognized types surfaced by established Pleiades collaborators
    • Respect robots.txt
    • Will the crawler dump a local copy of what it finds, or extract metadata on the fly?
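
The questions above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a settled design: the full MIME strings, the Pleiades hostname (pleiades.stoa.org), the user-agent name, and all function names are hypothetical choices made for the example. It uses only the Python standard library and does no network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# Assumed full MIME strings for the content types listed in the outline.
CRAWLABLE_TYPES = {
    "application/atom+xml",
    "application/xhtml+xml",
    "text/html",
    "application/vnd.google-earth.kml+xml",
}

PLEIADES_HOST = "pleiades.stoa.org"  # assumption: host that counts as "Pleiades"
USER_AGENT = "pleiades-crawler"      # hypothetical user-agent name


def is_crawlable(content_type_header):
    """True if a Content-Type header names one of the crawlable types."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in CRAWLABLE_TYPES


class LinkExtractor(HTMLParser):
    """Collects (element, rel, absolute URL) for every <link> and <a> with an href."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href:
            absolute = urljoin(self.base_url, href)
            self.links.append((tag, attributes.get("rel"), absolute))


def extract_links(html_text, base_url):
    """Return the links found in an HTML or XHTML document, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


def refers_to_pleiades(url):
    """True if the URL points at the assumed Pleiades host."""
    return urlparse(url).hostname == PLEIADES_HOST


def allowed_by_robots(robots_txt, url, user_agent=USER_AGENT):
    """Check a URL against the text of a robots.txt file that was already fetched."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)
```

Atom and KML documents would need an XML parser rather than HTMLParser; the sketch covers only the html and xhtml cases. Whether the crawler stores the fetched page or only the extracted tuples is left open, as in the last question above.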

Index:

  • DC metadata as harvested
  • Harvesting statistics (date, time, ???; is there a canonical set?)
  • Geodata as harvested
  • Links (what metadata about links?)
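
One possible shape for an index entry holding the four items above is sketched here. Every class and field name is hypothetical, and the link metadata shown (source element and rel value) is only a guess at what might be worth keeping. Harvesting statistics are reduced to a single timestamp because the outline leaves the rest open.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class HarvestedLink:
    """One outgoing link with the little metadata the crawler can observe."""
    href: str
    element: str                # "link" or "a"
    rel: Optional[str] = None   # value of the rel attribute, if any


@dataclass
class IndexRecord:
    """A single harvested resource as it might be stored in the index."""
    url: str
    harvested_at: datetime                        # harvesting date and time
    dc: dict = field(default_factory=dict)        # DC metadata as harvested
    geodata: list = field(default_factory=list)   # geodata as harvested
    links: list = field(default_factory=list)     # HarvestedLink items

    def to_json(self):
        """Serialize the record, writing the timestamp in ISO 8601 form."""
        data = asdict(self)
        data["harvested_at"] = self.harvested_at.isoformat()
        return json.dumps(data, sort_keys=True)
```

Keeping the DC and geodata fields as plain containers stores them "as harvested", without committing to a schema before the open questions above are answered.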

Index services API:

A vocabulary of link types/relationships/verbs?