Dotnet Core Web Scraping
Introduction
Web scraping is a popular term for the various methods used to extract metadata or gather valuable information from across the Internet. Generally, this is accomplished with software that simulates a person browsing the web in order to collect specific pieces of information.
There are many reasons you may need a website scraper. One of the biggest reasons I use one is to avoid visiting a site on a regular basis just to check for something, and losing the time spent on that site. For instance, when COVID-19 first hit, I visited the stats page on the Pennsylvania Department of Health website each day. Another instance may be watching for a sale item during Amazon’s Prime Day.
Getting Started
To get started, we’ll want to create an Azure Function. We can do that a few different ways:
- Use the command line (Azure Functions Core Tools)
- Use the Azure extension for Visual Studio Code
- Use the Azure Portal
At this point, use the method that you feel most comfortable with. I tend to use the command line or the Azure extension for Visual Studio Code, as they tend to leave the codebase very clean. I’m writing this function in C# so that I can use some third-party libraries.
In my case, I’ve called my HttpTrigger function ScrapeSite.
Modifying the Function
Once the function is created, it should look like this:
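The generated code isn’t reproduced here, but a newly scaffolded C# HTTP trigger looks roughly like the following (trimmed from the default template; your namespace and exact body will differ):

```csharp
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;

public static class ScrapeSite
{
    [FunctionName("ScrapeSite")]
    public static IActionResult Run(
        [HttpTrigger(AuthorizationLevel.Function, "get", "post", Route = null)] HttpRequest req,
        ILogger log)
    {
        log.LogInformation("C# HTTP trigger function processed a request.");

        // The default template simply echoes a greeting; we'll replace this body shortly.
        string name = req.Query["name"];
        return new OkObjectResult($"Hello, {name}");
    }
}
```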
We’ll bring in the NuGet package for HtmlAgilityPack so we can grab the appropriate area of our page. To do this, we’ll use a command line, navigate to our project, and run:
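The standard install command for that package is:

```
dotnet add package HtmlAgilityPack
```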
In my case, I’m going to connect to Walmart and look at several Xbox products. I’ll be querying the buttons on the page to look at the InnerHtml of each button and ensure that it does not read “Get in-stock alert”. If it does, that means the product is out of stock.
Our first step is to connect to the URL and read the page content. I’ll do this by creating a sealed class that helps deliver the results back to the function.
In this case, I’ll be returning a boolean value as well as the URL that I’m attempting to scrape from. This will allow me to redirect the user to that location when necessary.
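The original class isn’t shown here, but given that description, a minimal sketch might look like this (the class name ProductAvailability appears later in the post; the property names are my assumption):

```csharp
public sealed class ProductAvailability
{
    // True when the product can be purchased, i.e. no "Get in-stock alert" button was found.
    public bool IsAvailable { get; set; }

    // The product page that was scraped, so the caller can be redirected to it.
    public string Url { get; set; }
}
```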
Next, I’m going to add a static class called Scraper. This will handle the majority of the scraping process. The class will take advantage of the HtmlWeb.LoadFromWebAsync() method in the HtmlAgilityPack package. The reason for this is that the built-in HttpClient() lacks the necessary headers to properly call most sites; if we used it instead of this library, most websites would record us as a bot.
After we connect to the URL, we’ll use a selector to grab all buttons and then use a LINQ query to count how many buttons contain the text “Get in-stock alert”. We’ll update the ProductAvailability object and return it back.
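Putting those steps together, a sketch of the Scraper class might look like the following; the method signature and the button selector are assumptions based on the description above, not the post’s original code:

```csharp
using System.Linq;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class Scraper
{
    public static async Task<ProductAvailability> GetProductAvailability(string url)
    {
        // HtmlWeb sends more browser-like requests than a bare HttpClient.
        var web = new HtmlWeb();
        var document = await web.LoadFromWebAsync(url);

        // Grab every button on the page and count how many show the
        // "Get in-stock alert" text; any match means the product is out of stock.
        var outOfStockButtons = document.DocumentNode
            .Descendants("button")
            .Count(b => b.InnerHtml.Contains("Get in-stock alert"));

        return new ProductAvailability
        {
            IsAvailable = outOfStockButtons == 0,
            Url = url
        };
    }
}
```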
Finally, we’ll update our function to call the GetProductAvailability method multiple times:
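The finished code lives in the linked repository; a sketch of the updated function, with placeholder Walmart URLs standing in for the real product pages, might look like this:

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class ScrapeSite
{
    [FunctionName("ScrapeSite")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get", Route = null)] HttpRequest req)
    {
        // Placeholder product URLs; substitute the items you want to watch.
        var urls = new[]
        {
            "https://www.walmart.com/ip/example-xbox-product-1",
            "https://www.walmart.com/ip/example-xbox-product-2"
        };

        // Scrape each product page and collect the availability results.
        var results = await Task.WhenAll(urls.Select(Scraper.GetProductAvailability));

        return new OkObjectResult(results);
    }
}
```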
Results
Now, we can run our function from within Visual Studio Code. To do this, hit the F5 key. This will require that you have the Azure Functions Core Tools installed; if you do not, you’ll be prompted to install them. After they’re installed and you press F5, you’ll be prompted to visit the local URL for your function. If successful, you should see the following results (as of this post) for the above two products:
Conclusion
In this post, we created a new Azure Function, built the function using VS Code, and connected to Walmart.com to obtain product information. If you’re interested in reviewing the finished product, be sure to check out the repository below:
ScrapySharp
Scraping framework containing:
- a web client able to simulate a web browser
- an HtmlAgilityPack extension to select elements using CSS selectors (like jQuery); see the sketch below
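For context, typical ScrapySharp usage looks roughly like this; the URL and selector are placeholders, and the calls shown (ScrapingBrowser, NavigateToPage, CssSelect) are the package’s commonly documented entry points rather than code from this page:

```csharp
using System;
using ScrapySharp.Extensions;  // CssSelect extension methods
using ScrapySharp.Network;     // ScrapingBrowser

class Program
{
    static void Main()
    {
        // ScrapingBrowser behaves like a lightweight web browser (cookies, user agent, etc.).
        var browser = new ScrapingBrowser();

        // Placeholder URL; replace with the page you want to scrape.
        var page = browser.NavigateToPage(new Uri("https://example.com"));

        // Select elements with a CSS selector, jQuery-style.
        foreach (var node in page.Html.CssSelect("a.title"))
        {
            Console.WriteLine(node.InnerText);
        }
    }
}
```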
Release Notes
The alpha appears to be stable, so this is now a stable release.
Dependencies
.NET Standard 2.0
- FSharp.Core (>= 4.5.2)
- HtmlAgilityPack (>= 1.7.4)
- System.Runtime.Caching (>= 4.5.0-preview1-26216-02)
Used By
NuGet packages (16)
Showing the top 5 NuGet packages that depend on ScrapySharp:
- RealmeyeSharp: gets user information on RealmEye; for usage, go to the project URL and look at the example.
- TripleA: an extensible framework for building components for use in test frameworks, targeting system and deployment verification.
- InstantQuick.SharePoint.Provisioning: a CSOM-based provisioning library for SharePoint 2013, 2016, and SharePoint Online.
- Smallcode.Net: fluent HttpWebClient, HTTP parser, and JSON parser.
- WeiXinApi
GitHub repositories (1)
Showing the top 1 popular GitHub repositories that depend on ScrapySharp:
- ferventdesert/Hawk: a visualized crawler & ETL IDE written in C#/WPF
Version History
Version | Downloads | Last updated |
---|---|---|
3.0.0 | 87,367 | 10/2/2018 |
3.0.0-alpha2 | 1,636 | 4/5/2018 |
3.0.0-alpha1 | 547 | 4/5/2018 |
2.6.2 | 101,481 | 7/4/2016 |
2.6.1 | 4,583 | 4/28/2016 |
2.6.0 | 781 | 4/28/2016 |
2.5.1 | 665 | 4/28/2016 |
2.5.0 | 7,479 | 2/22/2016 |
2.4.0 | 736 | 2/19/2016 |
2.4.0-beta1 | 563 | 2/5/2016 |
2.3.0 | 1,047 | 1/29/2016 |
2.2.63 | 17,496 | 11/21/2013 |
2.2.62 | 770 | 11/21/2013 |
2.2.61 | 1,685 | 10/4/2013 |
2.2.60 | 761 | 10/4/2013 |
2.2.59 | 800 | 10/2/2013 |
2.2.57 | 1,176 | 9/26/2013 |
2.2.56 | 1,110 | 9/12/2013 |
2.2.0 | 945 | 9/9/2013 |
2.1.55 | 1,646 | 7/24/2013 |
2.1.54 | 776 | 7/24/2013 |
2.1.53 | 1,538 | 6/20/2013 |
2.0.57-beta | 1,279 | 5/17/2013 |
2.0.56-beta | 718 | 5/6/2013 |
2.0.55-beta | 753 | 4/26/2013 |
2.0.54-beta | 739 | 4/24/2013 |
2.0.53-beta | 1,082 | 4/5/2013 |
2.0.52 | 921 | 6/20/2013 |
2.0.52-beta | 789 | 3/25/2013 |
2.0.51-beta | 745 | 3/22/2013 |
2.0.50-beta | 722 | 3/21/2013 |
2.0.49-beta | 713 | 3/21/2013 |
2.0.48-beta | 706 | 3/20/2013 |
2.0.47 | 805 | 6/20/2013 |
2.0.47-beta | 692 | 3/20/2013 |
2.0.46-beta | 784 | 3/20/2013 |
2.0.45-beta | 707 | 3/20/2013 |
2.0.44-beta | 716 | 3/19/2013 |
2.0.43-beta | 747 | 3/14/2013 |
2.0.42-beta | 866 | 2/13/2013 |
2.0.41-beta | 786 | 2/6/2013 |
2.0.40-beta | 764 | 2/6/2013 |
2.0.39-beta | 739 | 2/6/2013 |
2.0.38-beta | 709 | 2/4/2013 |
2.0.37-beta | 748 | 2/4/2013 |
2.0.36-beta | 735 | 1/22/2013 |
2.0.35-beta | 730 | 1/15/2013 |
2.0.34-beta | 705 | 1/4/2013 |
2.0.33-beta | 720 | 1/4/2013 |
2.0.32-beta | 781 | 1/4/2013 |
2.0.31-beta | 696 | 1/4/2013 |
2.0.30-beta | 765 | 1/4/2013 |
2.0.29-beta | 711 | 1/4/2013 |
2.0.28-beta | 749 | 1/4/2013 |
2.0.27-beta | 720 | 1/2/2013 |
1.5.0 | 3,308 | 12/25/2012 |
1.4.3.1 | 1,034 | 12/11/2012 |
1.4.3 | 2,044 | 8/24/2012 |
1.4.2 | 839 | 8/17/2012 |
1.4.1 | 754 | 8/16/2012 |
1.4.0 | 788 | 8/16/2012 |
1.3.2 | 2,098 | 4/10/2012 |
1.3.0 | 950 | 4/3/2012 |
1.2.2 | 1,010 | 3/5/2012 |
1.2.1 | 923 | 2/20/2012 |
1.2.0 | 903 | 2/16/2012 |
1.1.0 | 1,148 | 12/7/2011 |
1.0.0 | 1,295 | 9/29/2011 |