Crawler Story
Crawler:
- What will the crawler crawl?
- Content types: atom+xml, xhtml, html, kml
- Operant properties: <link>, <a>
- What criteria should determine which links to traverse?
- URLs that refer to Pleiades (from httpd logs)
- Seed with URLs for recognized types surfaced by established Pleiades collaborators
- Respect robots.txt
- Will the crawler dump a local copy of what it finds, or extract metadata on the fly?
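
A minimal sketch of the crawler checks listed above, using only the Python standard library. It collects href values from <link> and <a> elements, filters responses by content type, tests whether a URL refers to Pleiades, and consults robots.txt rules. The MIME strings chosen for the listed content types, the helper names, and the default host (taken from the thesaurus URL in these notes) are all assumptions for illustration, not decisions recorded in this story. The parser shown handles HTML and XHTML only; Atom and KML would likely need an XML parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

# Assumed MIME types for the content types named in the notes.
CRAWLABLE_TYPES = {
    "application/atom+xml",
    "application/xhtml+xml",
    "text/html",
    "application/vnd.google-earth.kml+xml",
}


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href of <link> and <a> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also arrive here.
        if tag in ("a", "link"):
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def is_crawlable_type(content_type_header):
    """True if a Content-Type header names one of the assumed types."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in CRAWLABLE_TYPES


def refers_to_pleiades(url, host="pleiades.stoa.org"):
    """True if the URL points at the (assumed) Pleiades host."""
    return urlparse(url).hostname == host


def allowed_by_robots(robots_txt, user_agent, url):
    """Apply the rules in an already-fetched robots.txt body to a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Taking the robots.txt body as a string keeps the check independent of how pages are fetched, so the same helper works whether the crawler stores local copies or extracts metadata on the fly.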
Index:
- DC metadata as harvested
- Harvesting statistics (date, time, ??? - there should be something canonical?)
- Geodata as harvested
- Links (what metadata about links?)
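
One possible shape for an index entry covering the items above, offered as a sketch only. The class and field names (IndexRecord, HarvestedLink, harvested_at, relation) are hypothetical, the harvesting statistics are reduced to a single timestamp because the notes leave the rest open, and the geodata is kept as an opaque string since no format has been chosen.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class HarvestedLink:
    """A link found in a harvested resource (hypothetical structure)."""
    target: str
    relation: Optional[str] = None  # link type, if one is known


@dataclass
class IndexRecord:
    """One harvested resource as it might be stored in the index."""
    url: str
    harvested_at: datetime                   # date and time of harvest
    dc: dict = field(default_factory=dict)   # DC metadata as harvested
    geodata: Optional[str] = None            # geodata as harvested
    links: list = field(default_factory=list)


def to_json(record):
    """Serialize a record, writing the timestamp in ISO 8601 form."""
    data = asdict(record)
    data["harvested_at"] = record.harvested_at.isoformat()
    return json.dumps(data, sort_keys=True)
```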
Index services API:
A vocabulary of link types/relationships/verbs?
- consider: geoRelations (http://www.mindswap.org/2004/geo/geoOntologies.shtml)
- consider: scholarly ontologies (http://kmi.open.ac.uk/projects/scholonto/resources/Scholonto2.rdfs; cf. http://kmi.open.ac.uk/projects/scholonto/index.html)
- consider: pleiades thesauri (esp. for uncertainty): http://pleiades.stoa.org/thesaurus
