Retrieving documents from the web in .NET
My current project requires me to fetch pages from the web and analyse their contents. I’ll be the first one to admit that I didn’t have a clue how to do this – it’s not exactly your everyday type of operation, after all.
Still, I knew that the XmlReader class allows you to plug in a URL at construction time, so I figured that it must be doing exactly this type of operation. Trawling through MSDN for a while proved fruitless, so instead I decided to just look at the code using Reflector. Bingo!
Actually, it turns out to be very easy: the key class is XmlUrlResolver. Once you’ve got that figured out, it’s all pretty self-explanatory. For example, here’s a chunk of code that will pull whatever document is at the nominated URL and return it as a string:
using System;
using System.IO;
using System.Xml;

/// <summary>
/// Retrieve the content as a string, from the nominated url
/// </summary>
/// <param name="url">Url to retrieve the content from</param>
/// <returns>String value of the content located at the url</returns>
private static string Get(string url) {
    XmlUrlResolver resolver = new XmlUrlResolver();
    // A null base URI means the url string is resolved on its own
    Uri uri = resolver.ResolveUri(null, url);
    // Ask the resolver for the content as a Stream (the role argument is unused)
    using (Stream urlContentStream = (Stream) resolver.GetEntity(uri, String.Empty, typeof(Stream))) {
        using (StreamReader reader = new StreamReader(new BufferedStream(urlContentStream))) {
            // Read the whole response body into a single string
            return reader.ReadToEnd();
        }
    }
}
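To try the method out without depending on a live web server, here is a minimal sketch of a console program that calls it. It assumes, as the .NET documentation for XmlUrlResolver describes, that the resolver handles file:// URLs as well as http:// ones. The FetchDemo class name and the fetch-demo.txt sample file are hypothetical, made up purely for this illustration, and the Get method is repeated in simplified form so the sketch compiles on its own.

```csharp
using System;
using System.IO;
using System.Xml;

static class FetchDemo
{
    // Same approach as the Get method above, repeated so this sketch is self-contained.
    public static string Get(string url)
    {
        XmlUrlResolver resolver = new XmlUrlResolver();
        Uri uri = resolver.ResolveUri(null, url);
        using (Stream content = (Stream)resolver.GetEntity(uri, String.Empty, typeof(Stream)))
        using (StreamReader reader = new StreamReader(content))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Hypothetical sample file, created only for this demonstration.
        string path = Path.Combine(Path.GetTempPath(), "fetch-demo.txt");
        File.WriteAllText(path, "hello from disk");
        try
        {
            // new Uri(path).AbsoluteUri turns the local path into a file:// URL.
            string text = Get(new Uri(path).AbsoluteUri);
            Console.WriteLine(text);
        }
        finally
        {
            // Clean up the sample file whether or not the read succeeded.
            File.Delete(path);
        }
    }
}
```

Swapping the file:// URL for an http:// address should fetch a web page in the same way, although a real program would probably want to catch the exceptions thrown when the server cannot be reached.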