9.6 How to make a new shimbun module

‘Shimbun’ is a set of emacs-w3m libraries that lets you read certain web content in Gnus, Wanderlust, or Mew as if it were email messages. This section explains how to make a new ‘shimbun’ module.


9.6.1 Overview

When you make a new ‘shimbun’ module ‘foobar’ for reading the contents of http://www.foobar.net, the first thing to do is to put the following S-expressions at the beginning of the ‘sb-foobar.el’ file:

 
(require 'shimbun)
(luna-define-class shimbun-foobar (shimbun) ())

We will explain what they mean below; for now, you can treat them as incantations. You have to use the same suffix ‘foobar’ in the file name (‘sb-foobar.el’) and in the class name (‘shimbun-foobar’), which is the first argument to the luna-define-class macro.

The major jobs of the ‘shimbun-foobar’ module fall broadly into the following four categories (if you are a Gnus user, read “group” for “folder”):

  1. Getting a page source from http://www.foobar.net in order to gather the articles’ subjects etc. when the MUA opens the ‘foobar’ folder.
  2. Gathering subjects and other necessary information from the page source in order to make the headlines of articles, and returning them as a structured list called headers.
  3. Getting the page source for an article from the web site, for example http://www.foobar.net/053003.html, when the MUA needs to display an article in the ‘foobar’ folder.
  4. Removing cruft, e.g. advertisements, from the page source and formatting the raw article.

The shimbun-headers method does the first job, the shimbun-get-headers method does the second, the shimbun-article method does the third, and the shimbun-make-contents method does the last. The default methods for these jobs are defined in the ‘shimbun.el’ module, except for shimbun-get-headers, which each module has to provide itself.

Open the ‘shimbun.el’ file. You may see unfamiliar definitions like luna-define-generic or luna-define-method there. Hm, they look like defun, don’t they? You may also notice that the former form contains just a doc-string and that the same symbol is defined again by the latter form. What is more, some symbols are declared only by a luna-define-generic form, with no luna-define-method form. What on earth are we seeing? Isn’t the program written in the Emacs Lisp language?

The truth is that the ‘shimbun’ modules use the ‘luna.el’ module provided by FLIM, which enables you to write object-oriented programs in the Emacs Lisp language.

The ‘shimbun.el’ module defines methods rigidly for specific purposes. The shimbun-headers method gets a page source from a certain URL, the shimbun-get-headers method gathers subjects and other information, and so on (see above). They do routine work, so surely they cannot cope with the variety of web content in the world? Eh? Oh, you shouldn’t believe such a heresy!

The ‘shimbun.el’ module only provides the default method functions. Recall the defadvice feature, which offers three ways to modify the behavior of a function: before, around and after. Similarly, each default ‘shimbun’ method function can be modified for a particular purpose with the :before, :around or :after method qualifier (note that the :around qualifier can be omitted). Importantly, the modification takes effect only when the specified ‘shimbun’ module is selected.

As you may have gathered by now, the luna-define-generic form provides only a husk, so to speak; the luna-define-method form defines an actual function, which can differ for each ‘shimbun’ module; and the luna-define-class form at the beginning of the ‘sb-foobar.el’ module declares that module’s ‘shimbun’ class.
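
To make the generic/method split concrete, here is a minimal sketch of the same idea written with Emacs's built-in cl-generic facility rather than ‘luna.el’, so that it can be evaluated without FLIM. It is an analogy only, not the luna API: every name beginning with my-demo- and the example.invalid URL are hypothetical.

```elisp
(require 'cl-lib)
(require 'cl-generic)

;; Hypothetical stand-ins for the `shimbun' base class and a `foobar' subclass.
(cl-defstruct my-demo-base)
(cl-defstruct (my-demo-foobar (:include my-demo-base)))

;; The generic form carries only the interface and a doc-string (the "husk").
(cl-defgeneric my-demo-index-url (site)
  "Return the index URL for SITE.")

;; Default method, playing the role of a default defined in shimbun.el.
(cl-defmethod my-demo-index-url ((_site my-demo-base))
  "http://example.invalid/")

;; Module-specific method: it runs only for `my-demo-foobar' objects and
;; wraps the default, much like an :around modification.
(cl-defmethod my-demo-index-url :around ((_site my-demo-foobar))
  (concat (cl-call-next-method) "foobar/index.html"))
```

Calling (my-demo-index-url (make-my-demo-base)) uses only the default method, while (my-demo-index-url (make-my-demo-foobar)) runs the specialized method, which in turn calls the default through cl-call-next-method.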


9.6.2 Getting web page and header information

First, identify the URL of the web page from which to gather subjects and other information. If the web site uses frames, the target is just one of the framed pages. Second, define the shimbun-index-url method with the luna-define-method form in your ‘sb-foobar.el’ file. Also define the user-customizable variable shimbun-foobar-groups, which we will explain later(11).

 
(defvar shimbun-foobar-url "http://www.foobar.net")

(luna-define-method shimbun-index-url ((shimbun shimbun-foobar))
  shimbun-foobar-url)

(defvar shimbun-foobar-groups '("news"))

Once you have defined the shimbun-index-url method, the shimbun-headers method can get the web page source, since the ‘shimbun.el’ module already provides a default shimbun-headers method. After the shimbun-headers method gets the page source, it calls the shimbun-get-headers method to gather header information. Because the ‘shimbun.el’ module does not provide a shimbun-get-headers method, you have to create it in your ‘sb-foobar.el’ file.

Now look carefully in the page source and create the shimbun-get-headers method in your ‘sb-foobar.el’ file.

Create a regular expression that can gather the header information. The minimum necessary information is the subject, date, author, URL and message-id of each article. The MUA uses them as Subject, Date, From, Xref and Message-ID respectively.

If you want to make an article from a line in a web page source, like:

 
<a href="053003.html">some talks on May 30(posted by Mikio &lt;foo@bar.net&gt;)</a>

use the following regexp:

 
"<a href=\"\\(\\([0-9][0-9][0-9][0-9]\\)[0-9][0-9]\\.html\\)\">\\([^<(]+\\)(posted by \\([^<]+\\))<\/a>"

(match-string 1) gives the value for Xref. The value for Date can be derived from (match-string 2). (match-string 3) gives the Subject and (match-string 4) gives From. You can modify them further to show additional information in the MUA.
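
As a quick check of the regexp against the sample line, the following sketch uses only built-in Emacs functions. The names my-foobar-sample-line, my-foobar-headline-regexp and my-foobar-parse-headline are made up for this illustration and are not part of ‘shimbun’; the regexp is the one above with the redundant backslash before ‘/’ dropped.

```elisp
;; The sample headline line from the page source (illustrative data).
(defconst my-foobar-sample-line
  "<a href=\"053003.html\">some talks on May 30(posted by Mikio &lt;foo@bar.net&gt;)</a>")

;; Group 1: relative URL, group 2: month and day digits,
;; group 3: subject, group 4: author.
(defconst my-foobar-headline-regexp
  "<a href=\"\\(\\([0-9][0-9][0-9][0-9]\\)[0-9][0-9]\\.html\\)\">\\([^<(]+\\)(posted by \\([^<]+\\))</a>")

(defun my-foobar-parse-headline (line)
  "Return (URL MMDD SUBJECT FROM) parsed from LINE, or nil if LINE does not match."
  (when (string-match my-foobar-headline-regexp line)
    (list (match-string 1 line)
          (match-string 2 line)
          (match-string 3 line)
          (match-string 4 line))))
```

Evaluating (my-foobar-parse-headline my-foobar-sample-line) returns the four captured strings in order. Note that the author still contains the HTML entities ‘&lt;’ and ‘&gt;’, which a real module would probably want to decode.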

If the URL of an article is a relative path, as above, use shimbun-expand-url to expand it before putting the information into a header. If the articles do not each have a unique URL (i.e. the URL of the headers and the URL of the articles are the same), you have to make Emacs remember the body of each article while gathering the header information. For more details, see the files ‘sb-palmfan.el’, ‘sb-dennou.el’ and ‘sb-tcup.el’.
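
To show what expanding a relative link means without loading ‘shimbun’, this sketch substitutes Emacs's built-in url-expand-file-name for shimbun-expand-url. It only illustrates the idea and makes no claim about how shimbun-expand-url is implemented; my-foobar-absolute-url is a hypothetical helper.

```elisp
(require 'url-expand)

(defun my-foobar-absolute-url (relative index-url)
  "Resolve RELATIVE against INDEX-URL, the page on which the link was found."
  (url-expand-file-name relative index-url))
```

For example, (my-foobar-absolute-url "053003.html" "http://www.foobar.net/") yields "http://www.foobar.net/053003.html".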

Sometimes you cannot determine the Date from the web page source alone while gathering header information. If so, leave it and just set its value to the null string, "". If you can determine the Date only when you see the contents of an article, you can set it at that time in the shimbun-make-contents method. You may also use a fixed From for a web site (e.g. "webmaster@foobar.net").

Be careful when you build a message-id. Make sure it is unique, otherwise you may not be able to read some articles in the ‘shimbun’(12). Ensure uniqueness by building the message-id from the date information, the domain of the page and/or a part of the URL of the page. Also, use ‘@’ but not ‘:’ in the message-id so that inline images can be displayed. See RFC2387 and RFC822 for more details.
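
One possible way to build such a message-id is sketched below. The layout (article key, then group, then ‘@’ and the domain) is an assumption chosen for illustration, not a format required by ‘shimbun’, and my-foobar-make-message-id is a hypothetical helper.

```elisp
(defun my-foobar-make-message-id (article-key group domain)
  "Return a Message-ID string built from ARTICLE-KEY, GROUP and DOMAIN.
ARTICLE-KEY should identify the article within GROUP, e.g. the date
digits or the file-name part of its URL.  The result contains `@' but
no `:' as long as the arguments contain none."
  (format "<%s.%s@%s>" article-key group domain))
```

For example, (my-foobar-make-message-id "053003" "news" "www.foobar.net") gives "<053003.news@www.foobar.net>".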

Put this information into a header using the shimbun-create-header function of the ‘shimbun.el’ module.

A bare-bones shimbun-get-headers for your ‘sb-foobar.el’ file looks like this:

 
(luna-define-method shimbun-get-headers ((shimbun shimbun-foobar)
                                         &optional range)
  (let ((regexp "....")
        subject from date id url headers)
    ...
    (catch 'stop
      (while (re-search-forward regexp nil t nil)
        ...
        (when (shimbun-search-id shimbun id)
          (throw 'stop nil))
        (push (shimbun-create-header
               0 subject from date id "" 0 0 url)
              headers)))
    headers))
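
The search loop in the skeleton can be tried in isolation. The sketch below reproduces only its control flow (search, stop at the first already-known article, push the rest) with built-in functions; it does not call shimbun-search-id or shimbun-create-header, and the function name, the simplified regexp and the SEEN-URLS argument are all assumptions made for this illustration.

```elisp
(defun my-foobar-collect-headlines (html seen-urls)
  "Collect (URL SUBJECT FROM) entries from the string HTML.
Stop at the first entry whose URL is a member of SEEN-URLS, which
stands in for the check made with `shimbun-search-id' in a real module.
Entries are returned in reverse order of appearance, as with `push'."
  (let ((regexp "<a href=\"\\([0-9]+\\.html\\)\">\\([^<(]+\\)(posted by \\([^<]+\\))</a>")
        entries)
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      (catch 'stop
        (while (re-search-forward regexp nil t)
          (let ((url (match-string 1))
                (subject (match-string 2))
                (from (match-string 3)))
            ;; An already-known article means everything after it is old too.
            (when (member url seen-urls)
              (throw 'stop nil))
            (push (list url subject from) entries)))))
    entries))
```

Stopping at the first known article assumes that the index page lists the newest articles first; a page ordered the other way round would need a different stopping rule.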

Note that you can access the ‘shimbun-foobar’ instance via the local variable shimbun in the method.

Now we will explain the user variable shimbun-foobar-groups.

Assume that http://www.foobar.net has two groups of articles and that the ‘shimbun’ module gathers header information for them from two different web pages. For example, the what’s-new information of the web site is at http://www.foobar.net/whatsnew/index.html, and the archive list of email messages posted to a mailing list is at http://www.foobar.net/ml/index.html. In such a case you may want to access the groups as the ‘shimbun’ folders ‘foobar.whatsnew’ and ‘foobar.ml’. If so, put the following S-expressions in the ‘sb-foobar.el’ file.

 
(defvar shimbun-foobar-url "http://www.foobar.net")

(defvar shimbun-foobar-group-path-alist
  '(("whatsnew" . "/whatsnew/index.html")
    ("ml" . "/ml/index.html")))

(defvar shimbun-foobar-groups
  (mapcar 'car shimbun-foobar-group-path-alist))

(luna-define-method shimbun-index-url ((shimbun shimbun-foobar))
  (concat shimbun-foobar-url
          (cdr (assoc (shimbun-current-group-internal shimbun)
                      shimbun-foobar-group-path-alist))))

You can get the current group with shimbun-current-group-internal. You can also use it in the shimbun-get-headers method (or others) to change the method’s behavior according to the current group.

Each ‘shimbun’ module needs at least one group. There is no special rule for naming a group, but if you cannot think of a good name, use ‘news’ or ‘main’.


9.6.3 Displaying an article

The shimbun-article method defined in the ‘shimbun.el’ module gets the URL from the Xref information of the header, gets the web page source from that URL, and calls shimbun-make-contents in a working buffer containing the source. The major job of shimbun-make-contents is to process that HTML. Imagine that the working buffer holds the web page source of an article. The shimbun-make-contents method defined in the ‘shimbun.el’ module inserts (i) the header information at the top of the buffer, (ii) ‘<html>’, ‘<body>’ etc. right after that information, and (iii) ‘</body>’ and ‘</html>’ at the end of the buffer. The MUA then displays the article as an HTML mail.

Not only HTML articles, but also articles in the ‘text/plain’ format can be generated. See section Making text/plain articles.

If you don’t want to process an article, you don’t have to define shimbun-make-contents in the ‘sb-foobar.el’ module.

If you want to remove the parts of an article’s web page source that precede and follow the actual content, set shimbun-foobar-content-start to a regexp that matches the start of the content and shimbun-foobar-content-end to a regexp that matches its end.

 
(defvar shimbun-foobar-content-start "^<body>$")
(defvar shimbun-foobar-content-end "^<\/body>$")

shimbun-clear-contents, which is called by shimbun-make-contents defined in the ‘shimbun.el’ module, will remove HTML source from point-min to shimbun-foobar-content-start and from shimbun-foobar-content-end to point-max using the regexps. Note that it will not remove any HTML source when either of the regexp searches fails.
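
The trimming behavior described above can be sketched with built-in functions as follows. This is an illustration written from the description, not the code of shimbun-clear-contents itself, and my-foobar-trim-to-content is a hypothetical name.

```elisp
(defun my-foobar-trim-to-content (start-regexp end-regexp)
  "Keep only the text between START-REGEXP and END-REGEXP in the current buffer.
Return non-nil if the buffer was trimmed.  If either regexp fails to
match, leave the buffer contents untouched and return nil."
  (goto-char (point-min))
  (let (start end)
    (when (and (re-search-forward start-regexp nil t)
               (setq start (match-end 0))        ; content begins after the start marker
               (re-search-forward end-regexp nil t)
               (setq end (match-beginning 0)))   ; content ends before the end marker
      ;; Delete the tail first so that START stays a valid position.
      (delete-region end (point-max))
      (delete-region (point-min) start)
      t)))
```

In a temporary buffer holding a small page with ‘<body>’ and ‘</body>’ on lines of their own, calling it with "^<body>$" and "^</body>$" leaves just the text between those two lines.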

If you want to remove further unnecessary parts (e.g. advertisements) more thoroughly, define shimbun-clear-contents in your new ‘sb-foobar.el’ file as follows:

 
(luna-define-method shimbun-clear-contents :around ((shimbun shimbun-foobar)
                                                    header)
  ;; cleaning up
  (while (re-search-forward "..." nil t nil)
    (delete-region (match-beginning 0) (match-end 0)))
  (luna-call-next-method))

For more details see shimbun-make-contents in the ‘sb-ibm-dev.el’ file.

As mentioned in the subsection Getting web page and header information, if the articles do not each have a unique URL, you have to make Emacs remember the body of each article while gathering the header information. In that case you do not have to get a web page from the Xref URL in the ‘shimbun-article’ method. Just take the text that Emacs has remembered and insert it, nicely formatted. For more details, see the definitions of the ‘shimbun-article’ method in ‘sb-palmfan.el’, ‘sb-dennou.el’ or ‘sb-tcup.el’.


9.6.4 Inheriting shimbun module

There are several well-known mailing list managers (or archivers).

If you find the name of one of these mailing list managers in a web page source while analyzing it as described in Getting web page and header information, you are very lucky(13). The modules ‘sb-mailman.el’, ‘sb-mhonarc.el’, ‘sb-fml.el’ and ‘sb-mailarc.el’ already have the shimbun-get-headers method and so on. Once you write the small amount of code that those ‘shimbun’ modules do not define, your new ‘sb-foobar.el’ module works!

If you use the ‘sb-mailman.el’ module, write the following S-expressions at the top of the ‘sb-foobar.el’ file:

 
(require 'sb-mailman)
(luna-define-class shimbun-foobar (shimbun-mailman) ())

These forms mean that the ‘shimbun’ module ‘shimbun-foobar’ inherits from the shimbun-mailman class(14), so the methods defined in the ‘sb-mailman.el’ module are used in ‘shimbun-foobar’ by default. You can override some of the parent’s methods if necessary.

See the ‘sb-pilot-mailsync.el’ file for a sample that uses the ‘sb-mailman.el’ module. You will see how easy it is to create a new ‘shimbun’ module by using such parent modules.

Note that there are localized versions of these mailing list managers; for example, some of them show the Date information in Japanese. The modules ‘sb-mailman.el’, ‘sb-mhonarc.el’, ‘sb-fml.el’ and ‘sb-mailarc.el’ assume that the mailing list manager is not localized.

If you want to read via ‘shimbun’ a web site that uses a localized mailing list manager, you may have to override some methods of the parent module.


9.6.5 Making text/plain articles

Even if the MUA is enhanced with emacs-w3m so that it can display HTML articles, ‘text/plain’ articles might be more convenient in some cases. There are two ways to make the ‘sb-foobar’ module generate ‘text/plain’ articles rather than ‘text/html’ articles.

Whichever way you use, note that ‘text/plain’ articles cannot contain images, links, etc.


9.6.6 Zenkaku to hankaku conversion

“Zenkaku” or “zenkaku character(s)” is a term commonly used to refer to Japanese wide characters, and “hankaku” is the opposite term, used for ordinary ASCII characters. There is a complete set of zenkaku characters corresponding to at least the ASCII character set.

Some Japanese web sites tend to use zenkaku characters a lot, and such articles are not necessarily comfortable to read. If you feel so, you can use this feature, which converts those zenkaku ASCII characters into hankaku. To do that, set the shimbun-foobar-japanese-hankaku variable to t, where foobar is the name of the server to which you subscribe for shimbun articles. That is, you have to set it per server.

If you prefer to convert zenkaku to hankaku only in the body of articles, use the value body instead of t. Conversely, the value header or subject specifies that the conversion is performed only in subjects.
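
As a configuration sketch, the settings described above might look like this in your init file. Here foobar is the placeholder server name used throughout this section, so substitute the name of the server you actually read.

```elisp
;; Convert zenkaku ASCII characters to hankaku in both subjects and bodies
;; for the (placeholder) `foobar' server.
(setq shimbun-foobar-japanese-hankaku t)

;; Alternatively, convert only in article bodies:
;; (setq shimbun-foobar-japanese-hankaku 'body)

;; Or only in subjects:
;; (setq shimbun-foobar-japanese-hankaku 'subject)
```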


9.6.7 Coding convention of Shimbun


This document was generated by TSUCHIYA Masatoshi on January 30, 2019 using texi2html 1.82.