2006.12.01 04:47 PM

Reading HTML Tables with the JET HTML Import Installable ISAM Driver

Update 12/4/2006

Just realized I didn't explain that the HTML Import installable ISAM driver doesn't require that an HTML file's table first be attached to an existing JET database in order to be opened. Sorry. It can also be opened directly using the Extended Properties parameter of the JET OLEDB driver's connection string. So, for instance, the following code works just like the code described below, except that it doesn't attach the HTML table to an existing MDB file.

Dim conn As ADODB.Connection

Set conn = New ADODB.Connection
Call conn.Open("Provider=Microsoft.Jet.OLEDB.4.0;" & _
               "Data Source=http://www.bennetyee.org/ucsd-pages/area.html;" & _
               "Extended Properties=""HTML Import;HDR=YES;IMEX=1""")

Dim rs As ADODB.Recordset

Set rs = New ADODB.Recordset
Call rs.Open("select count(*) from [Area Code Listing, by Number]", conn, adOpenDynamic, adLockReadOnly)
Debug.Print rs.Fields(0).Value
Call rs.Close

Note the use of the connection string's "Data source" parameter to name the HTML file, and the identification of the HTML table in the SQL itself instead of in the TableDef's SourceTableName property.


The recent post HTML Tables at the Daily Dose of Excel reminded me of something I'd learned about and forgotten years ago. The post describes a technique for using Internet Explorer to retrieve information from a web page. It relies on automating IE to open and parse a page at some URL. It then gets what it wants from the page by navigating the resulting IE document object model. The post targets Excel, but there's nothing about the technique that would prevent it from being used in Word, Access, or any environment that supports COM automation. Some of the post's comments also point out that the Excel Worksheet object includes a QueryTables feature allowing you to easily connect to and retrieve one or more tables from an HTML file and paste them into the sheet. (For the record, I prefer the latter approach.)

In any case, what this reminded me of was that the Access JET database engine also includes an installable ISAM driver for directly attaching to and reading the content of table elements from HTML files. The JET HTML Import installable ISAM driver is used by Access when importing or linking HTML-type data using the File | Get External Data menu. It can also be used directly to create linked TableDefs via Data Access Objects (DAO) or ActiveX Data Object (ADO) extensions for DDL and Security (ADOX).

The following Access VBA code uses DAO to create a new TableDef named AreaCodes linked to a table element in an HTML file at an HTTP address.

Dim db As DAO.Database
Dim td As DAO.TableDef

' Assumes we're in Access and adding the linked TableDef to the current database.  You could
' do this external to Access, but you'd have to create a Connection and open the database yourself.

Set db = CurrentDb()

Set td = db.CreateTableDef("AreaCodes")
td.Connect = "HTML Import;HDR=YES;IMEX=1;DATABASE=http://www.bennetyee.org/ucsd-pages/area.html"
td.SourceTableName = "Area Code Listing, by Number"
Call db.TableDefs.Append(td)

Dim rs As DAO.Recordset

Set rs = db.OpenRecordset("select count(*) from AreaCodes")
Debug.Print rs.Fields(0).Value
Call rs.Close

The following ADO/ADOX code does essentially the same thing.

Dim conn  As ADODB.Connection

Set conn = New ADODB.Connection
Call conn.Open("Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\whatever\somefile.mdb")

Dim cat   As ADOX.Catalog
Dim table As ADOX.table

Set cat = New ADOX.Catalog
Set cat.ActiveConnection = conn

Set table = New ADOX.table
With table
  .Name = "AreaCodes"
  Set .ParentCatalog = cat
  .Properties("Jet OLEDB:Create Link").Value = True
  .Properties("Jet OLEDB:Link Provider String").Value = "HTML Import;HDR=YES;IMEX=1;DATABASE=http://www.bennetyee.org/ucsd-pages/area.html"
  .Properties("Jet OLEDB:Remote Table Name").Value = "Area Code Listing, by Number"
End With

Call cat.Tables.Append(table)

Dim rs As ADODB.Recordset

Set rs = New ADODB.Recordset
Call rs.Open("select count(*) from AreaCodes", conn, adOpenDynamic, adLockReadOnly)
Debug.Print rs.Fields(0).Value
Call rs.Close

Call conn.Close

In both cases the results look like this.

The ancient MSDN article Working with Tables in Microsoft Access by David Shank is one of the few resources available on this subject and it includes a lot of useful information. It provides background on all of the JET installable ISAM drivers and explains their various provider string parameters (here's more on the latter, if you're interested). The article also includes a general-purpose version of the ADO/ADOX code above that's good for use with any of the installable ISAM drivers. However, as far as the article's explanation of the HTML Import installable ISAM driver itself goes, it's not perfect.

For instance, in my experience Mr. Shank's explanation of how to specify which table to retrieve (via the ADOX Table property "Jet OLEDB:Remote Table Name" or the DAO TableDef.SourceTableName property) from an HTML file having many table elements in it doesn't always pan out. While it is true that a captioned table can be addressed by name, you aren't likely to find many tables in the wild with a caption element in them. And, if that's the case, Mr. Shank explains that you can "refer to them sequentially as Table1, Table2, and so on", and that "[t]he I-ISAM driver [will] interpret these references as the first unnamed table in the HTML file, the second unnamed table in the HTML file, and so on." However, in my experience, this isn't true. I've never successfully used Table1, Table2, etc. to target a table element in an HTML file having multiple table elements. Which doesn't mean it isn't true in some cases, it's just not as true as the article would seem to suggest.

Another thing Mr. Shank doesn't mention is the fragile nature of the HTML Import installable ISAM driver in the face of many common HTML and table element constructs. For instance, many column spans simply defeat it, and I've experienced problems with tables containing multiple tbody elements or col/colgroup elements. Nested tables, both the inners and outers, aren't accessible (or at least I haven't figured out how to do it), and some badly formed documents will simply derail it. Of course this doesn't mean it's a complete prima donna. In the area codes HTML used above (at least at the time of this writing) the target table resides within an unclosed p element, but it still works. On the other hand, there doesn't appear to be any way to access the Yahoo rate table used in the Daily Dose of Excel post linked above.

Mr. Shank also doesn't mention the limited types of content the driver can successfully connect with over HTTP. For instance, I've never successfully connected to and read a table from an HTTP resource having an extension other than HTM or HTML. Trying to link to a resource having an extension like ASP, ASPX, JSP, etc. always results in run-time error 3027, "Cannot update. Database or object is read-only." In addition, I don't think it's possible to access protected content, even if Windows integrated authentication is enabled on it. Trying to do so always gets me run-time error 3011, "The Microsoft Jet database engine could not find the object '[connection string]'. Make sure the object exists and that you spell its name and the path name correctly." And, as far as all that goes, sometimes I can't even connect with unprotected resources having HTML extensions. For example, trying to connect with the unprotected resource http://www.ewbi.com/ewbi.develop/samples/jet.isam.test.html, a simple one-table HTML file (described below), using the VBA code above gets me run-time error 3652, "Internet login failure." And sometimes it gets me the more desperate sounding run-time error 3654, "Internal internet failure." I suppose your mileage might vary, but overall the driver doesn't seem very forgiving about where and what it connects to.

Once you do get connected, though, or if you simply target HTML files on your local drive, the driver is pretty good about establishing the data type of a table's columns based on its contents. For the installable ISAM drivers other than HTML Import, which is read-only, the IMEX connection parameter informs the driver how to determine the data type of the data's columns. IMEX stands for "import/export" and its values are:

  • 0 is Export mode
  • 1 is Import mode
  • 2 is Linked/Update mode

For many of the JET installable ISAM drivers, particularly the read-write Excel driver, the IMEX value will vary how many rows are read in order to determine the data type of the table's columns. In my experience, though, the IMEX value has no affect on the HTML Import driver. It appears to always read the first 25 rows of data to establish column data types regardless of the IMEX value. And, once a column's data type is established, data in rows beyond the 25th that cannot be cast to the specified type are treated as invalid values. Finally, blank td elements and td elements containing only a space or   entity are treated as NULL values.

The following HTML table illustrates these points. It contains 26 rows, 25 of which contain a number and string (as well as some blanks and   entities), and the 26th contains a string and number. (In case you're interested, I've made this table available as both an HTML and ASP resource on my server for testing the driver's HTTP access.)

<html>
  <head>
    <title>My Table</title>
  </head>
  <body>
    <table>
      <thead>
        <tr><td>col1</td><td>col2</td></tr>
      </thead>
      <tbody>
        <tr><td>1</td><td>a string</td></tr>
        <tr><td>&nbsp;</td><td>&nbsp;</td></tr>
        <tr><td>3</td><td>a string</td></tr>
        <tr><td></td><td></td></tr>
        <tr><td>5</td><td>a string</td></tr>
        <tr><td>6</td><td>a string</td></tr>
        <tr><td>7</td><td>a string</td></tr>
        <tr><td>8</td><td>a string</td></tr>
        <tr><td>9</td><td>a string</td></tr>
        <tr><td>10</td><td>a string</td></tr>
        <tr><td>11</td><td>a string</td></tr>
        <tr><td>12</td><td>a string</td></tr>
        <tr><td>13</td><td>a string</td></tr>
        <tr><td>14</td><td>a string</td></tr>
        <tr><td>15</td><td>a string</td></tr>
        <tr><td>16</td><td>a string</td></tr>
        <tr><td>17</td><td>a string</td></tr>
        <tr><td>18</td><td>a string</td></tr>
        <tr><td>19</td><td>a string</td></tr>
        <tr><td>20</td><td>a string</td></tr>
        <tr><td>21</td><td>a string</td></tr>
        <tr><td>22</td><td>a string</td></tr>
        <tr><td>23</td><td>a string</td></tr>
        <tr><td>24</td><td>a string</td></tr>
        <tr><td>25</td><td>a string</td></tr>
        <tr><td>a string</td><td>26</td></tr>
      </tbody>
    </table>
  </body>
</html>

After creating and attaching a TableDef linked to this table, here's what we get.

Based on the first 25 rows, the first column was defined as a number. As a result, the string value in the 26th row is considered bad data.

If we back that row up one, so the col1 string appears in row 25, here's what we get.

Now the first column is defined as a string.


Comments

Nice site. Very useful contents. I've been looking for information for a long time, and I've found it exactly here. Thank you

Alex Taylorm | 2007.08.17 08:35 AM

ok!

ZhangYuan | 2009.04.10 08:50 PM

Great Post. You find most blog posts littered with rubbish nowadays,
however this has been a very informative post.

johnstevens | 2009.06.22 10:18 AM

What if a page has two html tables? I can only get my code to read the first table. How can the second (or what ever number) table be selected?

Reuben | 2010.12.09 08:29 AM

Reuben, as described in the post, it doesn't appear to be possible to address tables that aren't captioned, other than the first one, notwithstanding David Shanks suggestion to the contrary in his linked article.

ewbi.develops | 2010.12.09 08:47 AM



Post a Comment

 
  (optional)
  (no html)
 
   


TrackBack

TrackBack URL:  http://www.typepad.com/services/trackback/6a00d8341c7bd453ef00d8356ea83369e2

Listed below are links to weblogs that reference Reading HTML Tables with the JET HTML Import Installable ISAM Driver: