HTML Scraping with Visual Basic and AsyncRead
Ken Spencer
Download the code for this article: Serving0103.exe (36KB)
Browse the code for this article at Code Center: StockTest

Exchanging information with anonymous parties on the Web is no easy task today. For example, your Web site may have to cull information from another site on a regular basis, like those stock Web sites that pull data from NASDAQ (https://www.nasdaq.com). Or you may have to perform scheduled information updates to another Web site. Some Web sites make it easy for you to exchange information by supplying you with URLs that either allow you to query their data or that accept your data. Other sites provide FTP access to allow you to post or retrieve information.
      One good way to request information from these kinds of sites, especially when you want to automate the process, is to use the AsyncRead feature of Visual Basic®. It is hidden in the ActiveX® control functionality of Visual Basic, but is nonetheless quite usable. AsyncRead allows you to request a URL asynchronously. The request is made, the download proceeds in the background, and when it completes the control fires an event in your code to which you can respond. AsyncRead is nice because you can initiate several requests and then let your code respond to each one individually as it returns. You can change the behavior of AsyncRead by specifying AsyncReadOptions as a parameter. For instance, you can use vbAsyncReadSynchronousDownload to cause it to operate synchronously.
      Alternatively, you could use the Win32® Internet functions (InternetOpen, InternetConnect, and HttpOpenRequest) to make the request.
      Let's take a look at a sample application that uses AsyncRead to retrieve data. The core of this application is an ActiveX control (quote.ctl) that takes a stock symbol (such as MSFT) and retrieves several pieces of information about the stock from the NASDAQ Web site. The control displays the last-traded stock price, the time of the price quote, and the six-month history for the stock. You can retrieve this information from the site itself, but quote.ctl automates the retrieval. Figure 1 shows the application in action.

Figure 1 Using AsyncRead to Retrieve Data

      Some Microsoft buddies of mine wrote the original quote control for Visual Basic 5.0. In those days, Internet Database Connector files were being used for Web sites such as NASDAQ. Making this control work today required a bit of sleuthing to find the correct files and directories. To find the files containing the necessary HTML, I browsed the NASDAQ site until I found the information. Then I used the page address to retrieve the HTML with AsyncRead.

Digging into the Code

      Let's walk through the code for the control so you can see where I found the hooks into the pages and files. Quote.ctl starts by declaring a module-level variable to hold the stock symbol, as shown in the first line of Figure 2. Next come the Property Get and Property Let procedures for the StockSymbol property.
      In the Property Let code in Figure 2, the PropertyChanged statement lets the container know that the property has changed. The next line calls the Timer event procedure for Timer1 directly, so the page for the new symbol is fetched immediately instead of waiting for the next timer interval.
      When the control starts, the UserControl_Initialize event code in Figure 2 fires. This code simply initializes the m_StockSymbol variable to MSFT.
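      Since Figure 2 is not reproduced here, the following is only a rough sketch of what those three pieces might look like, reconstructed from the description above. The parameter name New_StockSymbol is an assumption, and the code assumes the control already contains a Timer control named Timer1 with a Timer1_Timer event procedure.

```vb
Option Explicit

' Module-level storage for the current stock symbol.
Private m_StockSymbol As String

Public Property Get StockSymbol() As String
    StockSymbol = m_StockSymbol
End Property

' New_StockSymbol is an illustrative parameter name, not taken from Figure 2.
Public Property Let StockSymbol(ByVal New_StockSymbol As String)
    m_StockSymbol = New_StockSymbol
    ' Tell the container that the property value changed.
    PropertyChanged "StockSymbol"
    ' Run the timer's event procedure now rather than waiting for the next tick.
    ' Assumes Timer1_Timer is defined elsewhere in the control.
    Timer1_Timer
End Property

Private Sub UserControl_Initialize()
    ' Default symbol until the host sets one.
    m_StockSymbol = "MSFT"
End Sub
```
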
      Users can double-click the control, causing it to start Microsoft Internet Explorer and open a detailed information page for the stock. This is implemented using the UserControl_DblClick event code in Figure 3, which executes the NavigateTo method of the HyperLink object. The HyperLink object is available in the control whether the control is hosted in a Visual Basic form or an HTML page in Internet Explorer.
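      A double-click handler along these lines would do the job. This is a sketch only: Figure 3 is not reproduced here, and the detail-page address below is a stand-in that reuses the quote URL from this article rather than the address the control actually opens. It also assumes the m_StockSymbol variable described earlier.

```vb
Private Sub UserControl_DblClick()
    Dim sDetailURL As String

    ' Stand-in address (an assumption); substitute the real detail page.
    sDetailURL = "https://quotes.nasdaq.com/Quote.dll?mode=stock&symbol=" & _
        m_StockSymbol

    ' Ask the container (or the default browser) to navigate to the page.
    UserControl.Hyperlink.NavigateTo sDetailURL
End Sub
```
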

Making the Request

      Timer1 does most of the real work to request pages and files whenever the Timer event fires (see Figure 4). The first If statement makes sure the control is in user mode before executing any requests. If you do not use this check, requests will be made even when the control is open in design mode.
      The three lines calling the CancelAsyncRead method cancel any current requests generated by the control. Each call to CancelAsyncRead specifies the name of the request to cancel. (The request names are created when the AsyncRead requests are made.)
      Next, error handling is pointed at a special handler (HandleAsyncReadError) for Internet-type errors.
      The three calls to the AsyncRead method make the HTTP requests. To make a request, the sHTMLURL variable is first set to the complete URL to retrieve. For instance, the first request retrieves the quote page by building the URL like this:

  sHTMLURL = "https://quotes.nasdaq.com/Quote.dll?" & _
      "mode=stock&symbol=" & m_StockSymbol & _
      "&symbol=&symbol=&symbol=&symbol=&symbol=" & _
      "&symbol=&symbol=&symbol=&symbol=&quick.x=0&quick.y=0"

 

The first symbol parameter of the query string is set to the stock symbol to retrieve (m_StockSymbol).
      Next, the AsyncRead method is called to retrieve the page:

  AsyncRead sHTMLURL, vbAsyncTypeByteArray, "quote"
  

 

The second parameter to AsyncRead is the vbAsyncTypeByteArray constant. This constant specifies that the returned data should be in a byte array. The last parameter is the name of the request, in this case quote. Specifying a name for the request allows the code to determine programmatically which request has completed.
      The next two calls to AsyncRead work in the same way, except that they download images instead of an HTML page. The URL for each of these requests must build the file name for the image. For instance, to download the company's logo image, the URL is constructed this way:

  sHTMLURL = "https://www.nasdaq.com/logos/" & _
      m_StockSymbol & ".GIF"

 

      At this point, the URL is passed to AsyncRead. The second parameter is now vbAsyncTypePicture, to indicate that an image is being downloaded. The last parameter is set to company:

  AsyncRead sHTMLURL, vbAsyncTypePicture, "company"
  

 

      The last call to AsyncRead downloads the six-month history graph for the requested stock. This request works exactly like the one for the company logo, except that the graphic file name and folder path are different.
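      Putting the steps in this section together, the timer handler might be laid out roughly as follows. This is a sketch based on the description above, not the contents of Figure 4. It assumes the m_StockSymbol variable and a label named lblQuote, the message text is invented for illustration, and the history request is left as a comment because its URL is not given in this article.

```vb
Private Sub Timer1_Timer()
    Dim sHTMLURL As String

    ' Make no requests while the control is open in design mode.
    If Not Ambient.UserMode Then Exit Sub

    ' Cancel any requests still pending from the previous tick.
    CancelAsyncRead "quote"
    CancelAsyncRead "company"
    CancelAsyncRead "history"

    On Error GoTo HandleAsyncReadError

    ' Quote page, returned as a byte array.
    sHTMLURL = "https://quotes.nasdaq.com/Quote.dll?" & _
        "mode=stock&symbol=" & m_StockSymbol & _
        "&symbol=&symbol=&symbol=&symbol=&symbol=" & _
        "&symbol=&symbol=&symbol=&symbol=&quick.x=0&quick.y=0"
    AsyncRead sHTMLURL, vbAsyncTypeByteArray, "quote"

    ' Company logo, returned as a picture.
    sHTMLURL = "https://www.nasdaq.com/logos/" & m_StockSymbol & ".GIF"
    AsyncRead sHTMLURL, vbAsyncTypePicture, "company"

    ' The "history" request follows the same pattern with its own image URL.
    Exit Sub

HandleAsyncReadError:
    ' Illustrative message only.
    lblQuote.Caption = "Request failed"
End Sub
```
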
      What happens when one of the async requests completes? This is the beauty of AsyncRead. The UserControl_AsyncReadComplete event is fired, as shown here:

  Private Sub UserControl_AsyncReadComplete(asyncprop As AsyncProperty)
      If (asyncprop.PropertyName = "company" _
          Or asyncprop.PropertyName = "history") Then
          HandleAsyncPicture asyncprop
      ElseIf (asyncprop.PropertyName = "quote") Then
          HandleAsyncHTML asyncprop
      End If
  End Sub

 

This is where the names you specified for each AsyncRead operation come in handy. The asyncprop object is passed as a parameter to this event procedure. The PropertyName property of this object contains the name you specified for the AsyncRead request.
      The previous code uses an If statement to execute either the HandleAsyncPicture or HandleAsyncHTML functions, depending on whether a graphic file or HTML file is returned.
      The HandleAsyncPicture function is shown in Figure 5. First, the StockPicture variable is defined with a type of Picture. This variable will contain the picture that is returned. Next, the StockPicture variable is set to the Value property of the asyncprop object:

  Set StockPicture = asyncprop.Value
  

 

At this point, StockPicture contains the downloaded image.
      Next, the If statement sets the x and y position for the image based on whether the image is a logo or not. If the image is a logo, the PropertyName property equals company, so the first part of the If executes. If the image is not a logo, it is a six-month history, so the CurrentX property is set to display the image in the center of the control. Finally, the last line of the procedure executes the PaintPicture method, which actually paints the image on the control:

  PaintPicture StockPicture, CurrentX, CurrentY
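
      Taken together, the picture handler might look something like the sketch below. Figure 5 is not reproduced here, so the coordinates are assumptions chosen for illustration: the logo goes in the top-left corner and the history graph is centered horizontally, halfway down the control.

```vb
Private Sub HandleAsyncPicture(asyncprop As AsyncProperty)
    Dim StockPicture As Picture
    Dim PictureWidth As Single

    ' The Value property holds the downloaded image.
    Set StockPicture = asyncprop.Value

    If asyncprop.PropertyName = "company" Then
        ' Logo: top-left corner (assumed position).
        CurrentX = 0
        CurrentY = 0
    Else
        ' History graph: centered horizontally. Picture.Width is measured
        ' in HiMetric units, so convert it to the control's scale first.
        PictureWidth = ScaleX(StockPicture.Width, vbHimetric, ScaleMode)
        CurrentX = (ScaleWidth - PictureWidth) / 2
        CurrentY = ScaleHeight / 2   ' assumed vertical position
    End If

    PaintPicture StockPicture, CurrentX, CurrentY
End Sub
```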
  

 

      The HandleAsyncHTML function shown in Figure 6 is much more complex than the HandleAsyncPicture function. HandleAsyncHTML must take the returned HTML and manipulate it to extract the requested information. Since the page is simply an HTML page, this technique requires brute-force coding to extract the data you need.
      The first part of the function defines several variables. The first four variables start with HTML, and are used in the processing of the HTML code. In particular, the HTMLAsByteArray variable stores the HTML in byte format, exactly as it is returned from AsyncRead. The other HTML variables are used as working variables for HTML. The StockQuoteTime and StockQuoteInfo variables are used to store the time of the quote and the quote value, which are retrieved from the HTML.
      The first major task in handling the HTML is to retrieve it, put it into the byte array, and turn it into a string. The following line puts the byte array into the HTMLAsByteArray variable:

  HTMLAsByteArray = asyncprop.Value
  

 

This line determines the upper bound of the array, that is, the index of its last element:

  HTMLByteCount = UBound(HTMLAsByteArray)
  

 

      Next, the For...Next loop walks the HTMLAsByteArray array one byte at a time, up to HTMLByteCount, converting each byte to a character and appending it to HTMLAsString:

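  ' Start at LBound so the first element is not skipped if the array is zero-based.
  For i = LBound(HTMLAsByteArray) To HTMLByteCount
      HTMLAsString = HTMLAsString & Chr$(HTMLAsByteArray(i))
  Next i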
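
      Appending one character at a time gets slow on large pages, because each concatenation copies the whole string. As an alternative, and not what the control itself does, the built-in StrConv function can convert the byte array in a single call. The sketch below assumes the page is in the system's ANSI code page, and the function name BytesToString is made up for illustration.

```vb
' Converts a byte array of ANSI text to a string in one step.
Private Function BytesToString(ByRef Bytes() As Byte) As String
    BytesToString = StrConv(Bytes, vbUnicode)
End Function
```

      With this helper, the loop could be replaced by HTMLAsString = BytesToString(HTMLAsByteArray).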

 

      The fun part of this application is the brute-force extraction of the data from the HTML. HTML is all about formatting, right? So digging the data out of an HTML stream requires looking through the HTML and figuring out how to uniquely identify the data. I needed to find the price quote and the time of the quote. Digging through the HTML, I found the quote time was prefixed with the words "As of". This allowed me to search for this string in the HTML and then extract the time. Next, I found that directly following the time was a closing font tag (</font>). This allowed me to extract the time with the following code:

  i = InStr(1, HTMLAsString, "As of")
  If i = 0 Then GoTo HandleBadStockSymbol
  HTMLAsStringTemp = Right(HTMLAsString, Len(HTMLAsString) - _
      (i + 4))
  j = InStr(1, HTMLAsStringTemp, "</font>")
  StockQuoteTime = Trim(Left$(HTMLAsStringTemp, j - 1))

 

      To find the stock price, I again had to dig into the HTML. The stock price is preceded by a bold tag and dollar sign (<b>$). Just after the stock price is the terminating bold tag (</b>). Using this bit of information, I extract the stock price with the following code:

  i = InStr(1, HTMLAsString, "<b>$")
  HTMLAsStringTemp = Right$(HTMLAsString, Len(HTMLAsString) - _
      (i + 3))
  j = InStr(1, HTMLAsStringTemp, "</b>")
  StockQuoteInfo = "$ " & Trim(Mid$(HTMLAsStringTemp, 1, j - 1))
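
      Both extractions follow the same pattern: find a start marker, skip past it, and read up to an end marker. One way to factor that out is a small helper like the sketch below. The function name ExtractBetween is hypothetical and is not part of the control. Unlike the inline code, it returns an empty string when either marker is missing instead of relying on the error handler.

```vb
' Returns the trimmed text between StartMarker and EndMarker,
' or an empty string when either marker cannot be found.
Private Function ExtractBetween(ByVal Source As String, _
                                ByVal StartMarker As String, _
                                ByVal EndMarker As String) As String
    Dim StartPos As Long
    Dim EndPos As Long

    StartPos = InStr(1, Source, StartMarker)
    If StartPos = 0 Then Exit Function

    ' Move to the first character after the start marker.
    StartPos = StartPos + Len(StartMarker)

    EndPos = InStr(StartPos, Source, EndMarker)
    If EndPos = 0 Then Exit Function

    ExtractBetween = Trim$(Mid$(Source, StartPos, EndPos - StartPos))
End Function
```

      The quote time would then be ExtractBetween(HTMLAsString, "As of", "</font>") and the price "$ " & ExtractBetween(HTMLAsString, "<b>$", "</b>").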

 

Once the time and price are retrieved, they are placed in the captions of the lblTime and lblQuote label controls.
      Notice that this procedure has several error handlers (HandleBadHTMLError, HandleAsyncReadError, and HandleBadStockSymbol). Changing the error handler on the fly allows the code to return the correct exception message. The error handlers also serve as goto targets when a non-runtime error occurs. This is not the most elegant error handling, but it works. Since this was an update to an existing app, I decided not to rewrite the error handler.
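      The handler-switching pattern can be sketched as follows. This is a simplified outline and not the code in Figure 6: the messages are invented, lblQuote is assumed to be the label mentioned above, and StrConv stands in for the conversion loop to keep the example short.

```vb
Private Sub HandleAsyncHTML(asyncprop As AsyncProperty)
    Dim HTMLAsByteArray() As Byte
    Dim HTMLAsString As String

    ' While reading the downloaded data, treat errors as download failures.
    On Error GoTo HandleAsyncReadError
    HTMLAsByteArray = asyncprop.Value
    HTMLAsString = StrConv(HTMLAsByteArray, vbUnicode)

    ' From here on, a run-time error means the markup was not as expected.
    On Error GoTo HandleBadHTMLError

    ' A non-runtime problem: jump to the label directly.
    If InStr(1, HTMLAsString, "As of") = 0 Then GoTo HandleBadStockSymbol

    ' ... extract the time and price here ...
    Exit Sub

HandleAsyncReadError:
    lblQuote.Caption = "Unable to retrieve quote"    ' illustrative text
    Exit Sub

HandleBadHTMLError:
    lblQuote.Caption = "Unexpected page format"      ' illustrative text
    Exit Sub

HandleBadStockSymbol:
    lblQuote.Caption = "Unknown symbol"              ' illustrative text
End Sub
```
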

Conclusion

      You can use the quote control on a Visual Basic form, a Web page, or anywhere else an ActiveX control can be used. Simply set the StockSymbol property of the control to cause the control to fetch the stock data and automatically update every minute or so.
      Now that you've looked at AsyncRead, let's think about what using it in an application like the quote control really means. Parsing HTML is a brute-force, error-prone method of extracting data. In this example, if NASDAQ changes from using <b>$ at the start of a price quote to <strong>$, my code breaks.
      If the stock information had been in XML format, it would have been easy to retrieve. I could have simply downloaded the XML using AsyncRead and loaded the resulting XML string into the MSXML parser and processed it. As long as the schema stays the same, it works.
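      To give a feel for the difference, here is a sketch of what the XML version might look like. The element names are purely hypothetical, since NASDAQ publishes no such format in this article. The code assumes a response shaped like <quote><time>...</time><price>...</price></quote> and uses late binding, so no project reference to MSXML is needed.

```vb
' Returns the price from a hypothetical <quote><price>..</price></quote>
' document, or an empty string if the XML does not parse or has no price.
Private Function PriceFromXml(ByVal QuoteXml As String) As String
    Dim Doc As Object
    Dim PriceNode As Object

    Set Doc = CreateObject("MSXML2.DOMDocument")
    Doc.async = False

    ' loadXML returns False when the string is not well-formed XML.
    If Not Doc.loadXML(QuoteXml) Then Exit Function

    Set PriceNode = Doc.selectSingleNode("/quote/price")
    If Not PriceNode Is Nothing Then PriceFromXml = PriceNode.Text
End Function
```

      No string searching is involved, and a change in the page's visual formatting would not affect the result.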
      This problem could also have been solved easily if the NASDAQ site exposed its data using a Web Service. The Web Service would take a stock symbol as a parameter and return the quote time and price. That's it. However, until more Web sites start delivering data in XML and as Web Services, brute-force coding will remain a reality.

Send questions and comments for Ken to serving@microsoft.com.

Ken Spencer works for the 32X Tech Corporation (https://www.32X.com), which produces a line of high-quality developer courseware. Ken also spends much of his time consulting or teaching private courses.

From the March 2001 issue of MSDN Magazine