Retrieving documents from the web in .NET

My current project requires me to fetch pages from the web and analyse their contents. I’ll be the first one to admit that I didn’t have a clue how to do this – it’s not exactly your everyday type of operation, after all.

Still, I knew that the XmlReader class lets you pass in a URL when you create a reader, so I figured it must be doing exactly this type of operation under the hood. Trawling through MSDN for a while proved fruitless, so instead I decided to just look at the code using Reflector. Bingo!

Actually, it turns out to be very easy: the key class is XmlUrlResolver. Once you’ve got that figured out, it’s all pretty self-explanatory. For example, here’s a chunk of code that will pull whatever document is at the nominated URL and return it as a string:


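// Namespaces the method needs (these go at the top of the file)
using System;
using System.IO;
using System.Xml;

/// <summary>
/// Retrieve the content as a string, from the nominated url
/// </summary>
/// <param name="url">Url to retrieve the content from</param>
/// <returns>String value of the content located at the url</returns>
private static string Get(string url) {
    XmlUrlResolver resolver = new XmlUrlResolver();
    // With a null base URI, the url is treated as an absolute address
    Uri uri = resolver.ResolveUri(null, url);
    // Ask the resolver for the content as a raw Stream
    using (Stream urlContentStream = (Stream) resolver.GetEntity(uri, String.Empty, typeof(Stream))) {
        // StreamReader buffers internally and decodes as UTF-8 unless a byte order mark says otherwise
        using (StreamReader reader = new StreamReader(urlContentStream)) {
            return reader.ReadToEnd();
        }
    }
}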
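
To try the method out without a network connection, here is a minimal sketch rather than a definitive implementation. It relies on XmlUrlResolver handling file:// addresses as well as http:// ones. The class name UrlFetchSketch, the temporary file and the example.com address are placeholders made up for illustration, and Get is declared public here only so that it can be called from outside the class.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical wrapper class; the name is not from the original snippet.
public static class UrlFetchSketch
{
    // Same approach as the Get method above.
    public static string Get(string url)
    {
        XmlUrlResolver resolver = new XmlUrlResolver();
        Uri uri = resolver.ResolveUri(null, url);
        using (Stream content = (Stream)resolver.GetEntity(uri, null, typeof(Stream)))
        using (StreamReader reader = new StreamReader(content))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Write a throwaway file and address it with a file:// URI,
        // so the example runs without network access.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, "hello from a local file");
            string fileUrl = new Uri(path).AbsoluteUri;
            Console.WriteLine(Get(fileUrl));

            // An http:// address goes through the same call, for example
            // Get("http://example.com/") (placeholder address, needs a connection).
        }
        finally
        {
            // Clean up the temporary file whether or not the fetch succeeded.
            File.Delete(path);
        }
    }
}
```

Because the resolver hands back a plain Stream, the same method works for anything it can resolve. The text comes back decoded as UTF-8 unless the content starts with a byte order mark, so pages in other encodings would need an explicit Encoding passed to the StreamReader.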
Posted in .Net by Gerrod at December 1st, 2006.
