Re: [ProgSoc] Web-site ripper
Patrick Kennedy@nospam.KEYCORP LTD.
05/29/98 07:48 PM
}I am in the process of creating a Perl script which, given a URL, will go
}through and download all pages linked from that URL (with the same base
}URL). Much like BlackWidow for Windows, but hopefully faster. Much faster.
}In fact, the only reason I'm doing this is because BlackWidow is so damn slow.
There are other spiders (which is what they call these beasties, aka web
crawlers) which are faster.
}The questions I have to ask are:
}- Is it possible to view the index of a directory even if it has an
}index.html file inside? i.e. a listing of all the files in that subpath?
}(I am using lynx -s to extract the files atm)
Generally no: if there is a "default" page (such as index.html) in that
directory, the server will always serve that instead of a listing.
If there is no default page, then you can get a listing, provided the
server has "allow directory browsing" (indexes) enabled.
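Assuming the server is Apache (the usual case), both behaviours come down to two directives; this is a hypothetical config fragment, with the path made up:

```apache
# Whether a spider sees a listing depends on these two directives.
<Directory "/home/user/public_html">
    # served instead of a listing whenever one of these files exists
    DirectoryIndex index.html index.htm
    # with Indexes on (and no index file), the server generates a listing
    Options Indexes
</Directory>
```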
}- What tags are required? At the moment I download things following:
} href, src, background.
} What else do browsers usually download? (don't say Java just yet)
Huh? Just download the whole page.
It's late... may not have gotten yer meanin'.
}- With an applet tag, what fields convey the information of where the
}applet is? I don't know much about applets.
ummmmm testing me smarts now.... it's the code attribute (with codebase
giving the directory), isn't it?
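If memory serves, the applet tag names the class file with code and the directory to fetch it from with codebase, so a spider wanting the applet files would chase both (the filename here is made up):

```html
<!-- browser fetches codebase + code, e.g. /applets/Clock.class -->
<applet code="Clock.class" codebase="/applets/" width="100" height="100">
</applet>
```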
}- Any other suggestions, tips, offers of money?
Take a look at how other spiders work... heaps of them have been written,
and many have source code available.
There may even be some Perl ones.
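There definitely are: libwww-perl is the usual starting point, and (assuming it's installed, along with the HTML-Parser distribution it uses) HTML::LinkExtor already knows which tag/attribute pairs fetch things, which answers the "what tags are required" question above:

```perl
#!/usr/bin/perl -w
# HTML::LinkExtor (HTML-Parser distribution, used by libwww-perl) extracts
# link attributes for you: a/href, img/src, body/background, applet/code,
# frame/src, and so on.
use strict;
use HTML::LinkExtor;

my @links;
my $p = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;    # e.g. ('img', src => 'pic.jpg')
    push @links, values %attr;
});
$p->parse('<body background="bg.gif"><a href="page.html">x</a><img src="pic.jpg"></body>');
$p->eof;
print "$_\n" for @links;
```

That saves maintaining your own list of link-carrying attributes by hand.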
}We hope you have enjoyed your stay.
well, it's no nudie bar.... :)
W e b m a s t e r
Keycorp Limited, Sydney Australia
+612 9414 5429 (b) +612 0412 577 362 (m)
You are subscribed to the progsoc mailing list. To unsubscribe, send a
message containing "unsubscribe" to email@example.com.
If you are having trouble, ask firstname.lastname@example.org for help.
This list is archived at <http://www.progsoc.uts.edu.au/lists/progsoc/>