~matubara/ChangeLog / 2007-12

01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

2007-12-26 Wed

■ boost::regex [cxx][programming]

boost::regex は Perl みたいな感じで正規表現を使わせてくれるライブラリ。

#include <iostream>
#include <boost/regex.hpp>

int main() {
  using namespace std;
  using boost::regex;
  using boost::sregex_token_iterator;
  string s("a/b c/d e/f");
  sregex_token_iterator
    i(s.begin(), s.end(), regex("\\s"), -1),
    end;
  for ( ; i != end; ++i ) {
    string w(i->first, i->second);
    cout << w << endl;
    sregex_token_iterator
      j(w.begin(), w.end(), regex("/"), -1);
    for ( ; j != end; ++j ) {
      cout << string(j->first, j->second) << endl;
    }
  }
}

wregex, wsregex_token_iterator を使うと、とりあえず多バイト文字が扱える。

u32regex, u32regex_token_iterator を使うと、Unicode の文字クラスが使える（ICU必要）。
http://www.boost.org/libs/regex/doc/icu_strings.html

[ 固定リンク ]

■ Text algorithms by M. Crochemore and W. Rytter [string][algorithm][book]

<http://web.njit.edu/~rytter/TEACHING/TEXTS/book.html>
少し古い(1994)文字列アルゴリズムの本。

[ 固定リンク ]

2007-12-25 Tue

2005

■ アナグラム [string][ruby]

意外な単語がアナグラムだったりします。

thread
HATRED

reproduce
PROCEDURE

thousand
HANDOUTS

generate
TEENAGER

process
CORPSES
PROCESS

ruby -e'class String; def sort(); self.split(//).sort.join;end;end; dic={}; ARGV.each{|f| File.open(f).each{|x| x.chomp!; y=x.upcase.gsub(/\W/,""); sy = y.sort; dic[sy] ||= {}; dic[sy][y]=1;}}; STDIN.each{|x| puts((dic[x.upcase.gsub(/\W/,"").sort] || {""=>1}).keys)}' /usr/share/dict/{american,british}-english <(ruby -e'STDIN.each{|x| y=Regexp.new("<K>(.*?)</K><K>(.*?)</K>").match(x); if y then; puts y[2]; end}'< /usr/share/dict/jedict.sdic | kakasi -Ha |kakasi -Ka| iconv -f euc-jp -t utf-8)

引数にファイルを取る。
ファイルの形式は一行一単語。
単語はユーザーが入力する文字列と同じ文字セットと仮定する。
単語をソートしたときのハッシュ値を使って単語の同値類を作る。

2番目のシングルクォート以降にかかれている引数ファイルを、
ユーザーの環境に合わせて調整する。

あとでやる：
活用展開済みの形態素解析辞書を使って、
その辞書での最大アナグラムを見つける。

あとできづいた：
Programming Pearls で既出

sign | sort | squash

http://www.cs.bell-labs.com/cm/cs/pearls/s02.pdf

(2007-12-25T11:35:00+0900)
翻訳アナグラム問題 TransAnagram(A,B)
2つの文字列集合A,Bが与えられる。
a \in A, b \in B かつ、a は bのアナグラムであるような対
(a,b)をすべて出力せよ。
（あるいは、アナグラム同値類のうち、A,Bの両方にまたがるものを列挙せよ）

guitar
GUITAR
TAGURI
TAGIRU

chaos
CHAOS
SOCHA

pain
ANPI

tuesday
TUESDAY
YATSUDE
YATSUDE

heuristic
ICHIRETSU
HEURISTIC

hatena
ATHENA

(2007-12-25T11:37:57+0900)
Anagram(A) = TransAnagram(A,A)

(2007-12-25T11:39:20+0900)
TransAnagramの解１
A,Bを混ぜてAnagramを解く。
その出力のうち、A,Bの要素が混在している同値類が出力である。

(2007-12-25T11:45:10+0900)
練習問題
AnagramはP問題か？

(2007-12-25T15:16:56+0900)
See also:

http://www.notwork.org/~gotoken/mag/cmagazine/gokudo/8th/
文字を素数に置き換えて、ハッシュ値の計算を高速化する

http://blog.notdot.net/archives/38-Damn-Cool-Algorithms,-Part-3-Anagram-Trees.html
半順序関係である is-a-subanagram-of に関する列挙のための Anagram Trees

http://www.google.com/search?q=anagram%20algorithm

[ 固定リンク ]

~matubara/ChangeLog / 2007-12 →移動しました

2007-12-26 Wed

■ boost::regex [cxx][programming]

■ Text algorithms by M. Crochemore and W. Rytter [string][algorithm][book]

2007-12-25 Tue

■ アナグラム [string][ruby]

2007-12-21 Fri

■ N-gram Template Library [cxx][nlp]