天泣記

2011-01-14 (Fri)

手元に、svn diff のラッパーがある。その最後の部分は

command = ["svn", "diff"]
command.concat ...

puts "% " + shellescape(command)
STDOUT.flush
exec *command

というような感じになっていた。shellescape は配列 (や複数の引数) を受け取って、/bin/sh の流儀でエスケープする関数である。

ふと思い立って、これを、(git のように) 出力が端末ならページャを起動するようにしてみた。(diff した後、コマンドラインを編集して ci するとき、|less を削る必要がなくなる。削り忘れたときにエディタの出力が less に行っておかしくなることも防げる)

古い Ruby でも動くように考えて、結局以下のようになった。

if STDOUT.tty?
  pager = ENV['PAGER'] || 'more'
  io = IO.popen(shellescape("sh", "-c",
                              shellescape("echo", "% " + shellescape(command)) + ";\n" +
                              shellescape("exec", command)))
  STDIN.reopen io
  exec pager
else
  puts "% " + shellescape(command)
  STDOUT.flush
  exec *command
end

shellescape が何重にもなって、ちょっとナニである。

Ruby 1.9 以降に限るとどのくらい簡単になるかな。

fd が child process に漏れる話 (の将来) について、とても詳しく述べられている。

Austin Group Defect Tracker - Add fdwalk system interface

いきなりことごとく close するのがダメな理由の例は message catalog や posix_trace_* が動かなくなることがままあるから。

まぁ、valgrind も動かなかったし、それは理解できる理由ではある。(だから、Ruby ではことごとく close するのではなく、ことごとく FD_CLOEXEC をつける、のだ)

2011-01-21 (Fri)

.NET の正規表現を試してみる。

$ cat match.cs 
using System;
using System.Text.RegularExpressions;

class Regexip {
  static void Main(string[] args)  {
    if (args.Length < 2) {
      Console.WriteLine("match pattern string");
    } else {
      Regex regex = new Regex(args[0]);
      Match m = regex.Match(args[1]);
      if (m.Success) {
        Console.WriteLine("Value = " + m.Value);
        for (int j = 1; j < m.Groups.Count; j++) {
          for (int i = 0; i < m.Groups[j].Captures.Count; i++) {
            Console.WriteLine("Groups[" + j + "].Captures[" + i + "].Value = " + m.Groups[j].Captures[i].Value);
          }
        }
      }
      else {
        Console.WriteLine("fail");
      }
    }
  }
} 

$ /cygdrive/c/windows/Microsoft.NET/Framework/v3.5/csc match.cs 
Microsoft (R) Visual C# 2008 Compiler version 3.5.30729.1
for Microsoft (R) .NET Framework version 3.5
Copyright (C) Microsoft Corporation. All rights reserved.


$ ./match foo foo
Value = foo

で、お目当ての Balancing Group Definition (日本語訳では「グループ定義の均等化」となっているが、良い訳だとは思わない) (?< name1 - name2 > subexpression) を試す。

$ ./match '^(?<A>a)+(?<B-A>b)+(?(A)(?!))(?<C-B>c)+(?(B)(?!))$' aaabbbccc
Value = aaabbbccc
Groups[3].Captures[0].Value = b
Groups[3].Captures[1].Value = bbc
Groups[3].Captures[2].Value = bbbcc

^(?<A>a)+(?<B-A>b)+(?(A)(?!))(?<C-B>c)+(?(B)(?!))$ は {a^n b^n c^n | n>=1} という CFG (文脈自由言語) でない集合にマッチする。

これは CFG ではなく、当然 subexp call では実現できない。

ちゃんと理解するために、.Net の正規表現エンジンっぽいものを Ruby で書いてみよう。

def match(re, str, b=0)
  ary = str.split(//)
  try(re, ary, b, Hash.new([].freeze)) {|e, cap| return [b...e, cap] }
  nil
end

def try(re, ary, pos, cap, &b)
  case re[0]
  when :lit; _, ch = re; try_lit(ch, ary, pos, cap, &b)
  when :cat; _, *rs = re; try_cat(rs, ary, pos, cap, &b)
  when :alt; _, *rs = re; try_alt(rs, ary, pos, cap, &b)
  when :rep; _, r = re; try_rep(r, ary, pos, cap, &b)
  when :cap; _, n, r = re; try_cap(n, r, ary, pos, cap, &b)
  when :condnm; _, n, r1, r2 = re; try_condnm(n, r1, r2, ary, pos, cap, &b)
  when :balance; _, n1, n2, r = re; try_balance(n1, n2, r, ary, pos, cap, &b)
  end
end

def try_lit(ch, ary, pos, cap)
  if pos < ary.length && ary[pos] == ch
    yield pos + 1, cap
  end
end

# r1 r2 ...
def try_cat(rs, ary, pos, cap, &block)
  if rs.empty?
    yield pos, cap
  else
    r, *rest = rs
    try(r, ary, pos, cap) {|pos2, cap2|
      try_cat(rest, ary, pos2, cap2, &block)
    }
  end
end

# r1 | r2 | ...
def try_alt(rs, ary, pos, cap, &block)
  rs.each {|r|
    try(r, ary, pos, cap, &block)
  }
end

# r*
def try_rep(r, ary, pos, cap, &block)
  try(r, ary, pos, cap) {|pos2, cap2|
    if pos < pos2
      try_rep(r, ary, pos2, cap2, &block)
    end
  }
  yield pos, cap
end

# (?<n>r)
def try_cap(n, r, ary, pos, cap, &block)
  try(r, ary, pos, cap) {|pos2, cap2|
    cap3 = cap2.dup
    cap3[n] = cap3[n] + [pos...pos2]
    yield pos2, cap3
  }
end

# (?(n)r1|r2)
# (?(n)r1)      if r2 == nil
def try_condnm(n, r1, r2, ary, pos, cap, &block)
  if !cap[n].empty?
    try(r1, ary, pos, cap) {|pos2, cap2|
      yield pos2, cap2
    }
  elsif r2
    try(r2, ary, pos, cap) {|pos2, cap2|
      yield pos2, cap2
    }
  else
    yield pos, cap
  end
end

# (?<n1-n2>r)
# (?<-n2>r)
def try_balance(n1, n2, r, ary, pos, cap, &block)
  try(r, ary, pos, cap) {|pos2, cap2|
    if !cap2[n2].empty?
      cap3 = cap2.dup
      *cap3[n2], top = cap3[n2]
      cap3[n1] = cap3[n1] + [top.end...pos] if n1
      yield pos2, cap3
    end
  }
end

p match([:cat, [:rep, [:cap, :A, [:lit, "a"]]],
               [:rep, [:balance, :B, :A, [:lit, "b"]]],
               [:condnm, :A, [:alt]],
               [:rep, [:balance, :C, :B, [:lit, "c"]]],
               [:condnm, :B, [:alt]]],
        "aaabbbcccddd")

これを実行すると

[0...9, {:A=>[], :B=>[], :C=>[5...6, 4...7, 3...8]}]

となって、ちゃんと aaabbbcccddd の先頭の aaabbbccc (のインデックス) にマッチしている。

一回以上の繰り返しの + を実装してなくて面倒だったので

^(?<A>a)+(?<B-A>b)+(?(A)(?!))(?<C-B>c)+(?(B)(?!))

じゃなくて

^(?<A>a)*(?<B-A>b)*(?(A)(?!))(?<C-B>c)*(?(B)(?!))

だが。

実装すると細かいところまで理解せざるを得ないが、(?<n1-n2>r) でキャプチャの内容を調べたり変えたりするのは r にマッチした「後」なのだな。

$ ./match '(?<n>foo)(?<m-n>(?(n)a|b))' fooa
Value = fooa
Groups[2].Captures[0].Value = 

$ ./match '(?<n>foo)(?<m-n>(?(n)a|b))' foob
fail

$ ./match '(?<m-n>(?<n>foo))(?(n)a|b)' fooa
fail

$ ./match '(?<m-n>(?<n>foo))(?(n)a|b)' foob
Value = foob
Groups[1].Captures[0].Value = foo

2011-01-22 (Sat)

さて、Balancing Group Definition で表現できて subexp call で表現できないものが存在することはわかった。

では、subexp call で表現できて Balancing Group Definition で表現できないものは存在するか?

Balancing Group Definition はスタックを扱えるとはいっても、スタックの状態による条件判断はスタックが空かどうかしかない。(try_balance 内の if の条件は !cap2[n2].empty? で、スタックの要素の値は判断に寄与しない。) そのあたりが狙い目ではないか。(CFG は PDA (Pushdown automaton) と等価で、PDA はスタック先頭のシンボルを見るので、それができないということは直感的にかなり怪しい。)

S = ""
  | "(" S ")"
  | "<" S ">"

というのはどうだろう。2種類の括弧がネストしているというものである。

XML みたいなのを考えてもいい。

S = ""
  | "<div>" S "</div>"
  | "<span>" S "</span>"

まぁ、どれでも同じなので、正規表現にしたときにエスケープが不要な以下を考えよう。

S = ""
  | "a" S "b"
  | "c" S "d"

そうやって、subexp call を使って正規表現にすると、/\A(?<S>|a\g<S>b|c\g<S>d)\z/ となる。

これを Balancing Group Definition で表現できるか?

スタックを使って実装するなら、最初の [ac]* の部分をスタックに push していって、後の [bd]* の部分を読みながらスタックの先頭が対応する文字であることを確認しつつ pop していく、というのが自然だろう。

最初の [ac]* で push していくのは (?<S>[ac])* でいい。だが、[bd]* で pop していくのは ... pop するだけなら (?<-S>[bd])* でいいが、それでは対応する文字であることを確認できない。

try_balance ではスタックの要素の中身を検査できないので、他の何かを考える必要がる。だが、try_condnm も同様に検査できない。上で実装した範囲には、cap の中身で条件判断しているのはそのふたつだけなので、どうしようもない。

では、実装してないところになにかあるかというと、どうもなさそうである。いちばん近いのは backref で、スタックの先頭要素と対象文字列(の現在位置から)を比較できるが、今回やりたいのはその比較ではないので使えない。(今回は先頭要素が a のとき、対象文字列の現在位置に b があることを確認したい。b じゃなくて a なら backref でうまくいくのだが)

というわけで、subexp call で表現できるが、Balancing Group Definition では表現できないものは存在する、と思う。

Balancing Group Definition はスタックというよりは、むしろ「数を数えられる」と理解したほうがいいかもしれない。

(?<n>r) で n++, (?<-n>r) で n--, (?(n)r1|r2) は if (n != 0) r1 else r2 ということで。ただし n は非負整数。

ということは、a と b の数が等しい文字列にマッチするなんてものできそうだな。

... できた。

[:cat, [:rep, [:alt, [:cap, :A, [:lit, "a"]],
                     [:cap, :B, [:lit, "b"]]]],
       [:rep, [:cat, [:balance, nil, :A, [:cat]],
                     [:balance, nil, :B, [:cat]]]],
       [:condnm, :A, [:alt]],
       [:condnm, :B, [:alt]]]

いや、動かない。ぬぅ。

2番めの rep で、空文字列を繰り返さないといけないので、try_rep の無限再帰防止条件にひっかかっているのか。以下のように変えれば動く。厳密にいえば、これは無限再帰を許してしまう気がするが。

# r*
def try_rep(r, ary, pos, cap, &block)
  try(r, ary, pos, cap) {|pos2, cap2|
    if pos != pos2 || cap != cap2
      try_rep(r, ary, pos2, cap2, &block)
    end
  }
  yield pos, cap
end

変えずに動かすには... 以下のようにすればいいか。

[:cat, [:rep, [:condnm, :A,
               [:alt, [:cap, :A, [:lit, "a"]], 
                      [:balance, nil, :A, [:lit, "b"]]],
               [:condnm, :B,
                [:alt, [:balance, nil, :B, [:lit, "a"]], 
                       [:cap, :B, [:lit, "b"]]],
                [:alt, [:cap, :A, [:lit, "a"]], 
                       [:cap, :B, [:lit, "b"]]]]]],
       [:condnm, :A, [:alt]],
       [:condnm, :B, [:alt]]]

「数えられる」で扱えるのは非負整数なので、途中段階で a が多いときも b が多いときも両方ありうるこの場合、A と B の両方を使っている。

天泣記

2011-01-01 (Sat)

2011-01-09 (Sun)

2011-01-14 (Fri)

2011-01-18 (Tue)

2011-01-21 (Fri)

2011-01-22 (Sat)

2011-01-27 (Thu)