Get HTML From A URL
An ND6 customer of ours needed to get all the HTML from a web page. There are probably several ways to do this, but we already had a Java servlet that read data from a web page for another project. So we pulled the important Java out of that servlet and created a Java class (in a script library) that could be called from a LotusScript agent. The customer could then write any LotusScript wrapper around the Java class and, with a few instructions on how to use it, pull the HTML out of any web page.

In this solution, we will read all the HTML out of a web page and return it as a string. This works for web pages where the resulting HTML is small enough to fit into a string variable (up to 2 GB, which is very large in itself). In part 2 (next week) we will explore web pages that are larger than this size (if you need that). First, go to the script library section in your ND6 designer and create a new Java library. The library should be called "GetHTML", and should contain this code:

import java.io.*;
import java.net.*;

public class GetHTML {
   
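   public String getHTML(String urlToRead) {
      URL url; // The URL to read
      HttpURLConnection conn; // The actual connection to the web page
      BufferedReader rd = null; // Used to read results from the web page
      String line; // An individual line of the web page HTML
      StringBuffer result = new StringBuffer(); // Accumulates all the HTML without recopying on every line
      try {
         url = new URL(urlToRead);
         conn = (HttpURLConnection) url.openConnection();
         conn.setRequestMethod("GET");
         rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
         while ((line = rd.readLine()) != null) {
            // readLine() strips the line terminator, so put one back
            result.append(line).append('\n');
         }
      } catch (Exception e) {
         // On any failure, log it and return whatever was read so far
         e.printStackTrace();
      } finally {
         // Always release the reader, even if reading failed part way
         if (rd != null) {
            try {
               rd.close();
            } catch (IOException e) {
               e.printStackTrace();
            }
         }
      }
      return result.toString();
   }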
}


For those of you familiar with Java, this should be fairly straightforward. For those of you unfamiliar with Java, the nice thing is that you don't really have to understand it to use it here. You will end up with a Java class that can be used in LotusScript. It will be the "black box" that does the work for you.
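For readers who do want to look inside the black box, the read loop can be tried on its own without a network connection. The following is a minimal sketch, not part of the GetHTML library: the class name ReadAllSketch and the method readAll are hypothetical, and it reads from any Reader (here a StringReader) instead of an HTTP connection. It assumes you want a newline put back after each line, since readLine() strips line terminators.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper for experimenting with the read loop; not part of GetHTML.
class ReadAllSketch {

    // Reads every line from the given reader and returns them joined,
    // with a '\n' appended after each line.
    static String readAll(Reader source) throws IOException {
        BufferedReader rd = new BufferedReader(source);
        StringBuffer result = new StringBuffer();
        try {
            String line;
            while ((line = rd.readLine()) != null) {
                result.append(line).append('\n');
            }
        } finally {
            // Close the reader whether or not reading succeeded
            rd.close();
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        // A tiny in-memory "web page" standing in for a real HTTP response
        String page = "<html>\n<body>Hello</body>\n</html>";
        System.out.print(readAll(new StringReader(page)));
    }
}
```

In the real library the Reader would be the InputStreamReader wrapped around conn.getInputStream(); everything else in the loop is the same idea.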

Now, when you want to use this Java class, there are a couple of things you'll need to do. You'll want to include the library, just like you normally do with any script library, but you'll also need to tell Domino how to connect to Java. So, in the (Options) part of your LotusScript (agent, form, script library, etc.) you will add these two lines:

Uselsx "*javacon"
Use "GetHTML" ' Java library
(It goes without saying that you will have Option Declare in there, as well, doesn't it?)

The first line is the Java connector that comes with ND6. This allows you to use a Java class in LotusScript. The second line brings in the library, just like any other library.

Now, when you are ready to use the class, you need to create an object (an instance) from it. There are a few steps to go through:

Const myURL = "http://www.breakingpar.com"
Dim js As JavaSession
Dim getHTMLClass As JavaClass
Dim getHTMLObject As JavaObject
Dim html As String

Set js = New JavaSession
Set getHTMLClass = js.GetClass("GetHTML")
Set getHTMLObject = getHTMLClass.CreateObject
html = getHTMLObject.getHTML(myURL)

We have set up a constant for the URL to read. This could just as well be a variable and come from wherever you want. We create some Java variables to interface with the Java class. Including the "*javacon" LSX means that those Java types will appear in type-ahead in the ND6 designer, so they effectively become "built-in" types.

It's a progression to get the class. First, create a Java session. Then, get the class. The Java class name should be the same as the library name to make things easier, but technically it is the name on the "public class" line in the Java library. Then, we create an object from that class. Once the object is created, we can use any public properties or methods of that class. Our example has just one public method, called "getHTML". That method takes the URL as a string parameter and returns a string. So the variable "html" is defined as a string and receives the return value from the method. The Java method will be called and the results put into the "html" variable. At this point, your LotusScript can do whatever it wants with the variable.
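If it helps to picture the session, class, object, method progression from the Java side, it is loosely comparable to looking a class up by name and calling one of its methods through reflection. The sketch below is only an analogy under that assumption, not how the LotusScript-to-Java connector is actually implemented. It uses current Java syntax rather than the Java version shipped with ND6, and the names ReflectionSketch, FakeGetHTML, and callByName are hypothetical. FakeGetHTML is a stand-in that returns a fixed string instead of fetching a page.

```java
import java.lang.reflect.Method;

// Hypothetical illustration of "get class by name, create object, call method".
class ReflectionSketch {

    // Stand-in for the GetHTML class; it does not touch the network.
    public static class FakeGetHTML {
        public String getHTML(String urlToRead) {
            return "<html>" + urlToRead + "</html>";
        }
    }

    // Looks up a class by name, creates an instance, and calls a
    // one-argument String method on it by name.
    static String callByName(String className, String methodName, String arg)
            throws Exception {
        Class<?> cls = Class.forName(className);                 // compare: js.GetClass("GetHTML")
        Object obj = cls.getDeclaredConstructor().newInstance(); // compare: getHTMLClass.CreateObject
        Method m = cls.getMethod(methodName, String.class);      // find the public method by name
        return (String) m.invoke(obj, arg);                      // compare: getHTMLObject.getHTML(myURL)
    }

    public static void main(String[] args) throws Exception {
        String html = callByName(FakeGetHTML.class.getName(),
                                 "getHTML", "http://www.breakingpar.com");
        System.out.println(html);
    }
}
```

The point of the analogy is that the class is found by its Java class name, which is why the name on the "public class" line is the one that matters.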

Next week we'll explore how to enhance this so the only limitation we have is physical memory, for those web pages that have more than 2 GB of text.