Converting code using wildcards

kamil_w · 13 February 2012 01:20

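Does anyone have an idea how to turn code like this:

[code]<h1>act 1</h1>
<p>act text</p>
<h1>act 2</h1>
<p>act text</p>
<h2>chapter 1</h2>
<p>chapter text</p>
<h2>chapter 2</h2>
<p>chapter text</p>
<h3>subchapter 1</h3>
<p>subchapter text</p>
<h1>act 3</h1>
<p>act text</p>
<h2>chapter 1</h2>
<p>chapter text</p>
<h1>act 4</h1>
<p>act text</p>[/code]

into this:

[code]<section>
  <title><p>Act 1</p></title>
  <p>act text</p>
</section>
<section>
  <title><p>Act 2</p></title>
  <p>act text</p>
  <section>
    <title><p>chapter 1</p></title>
    <p>chapter text</p>
  </section>
  <section>
    <title><p>chapter 2</p></title>
    <p>chapter text</p>
    <section>
      <title><p>subchapter 1</p></title>
      <p>subchapter text</p>
    </section>
  </section>
</section>
<section>
  <title><p>Act 3</p></title>
  <p>act text</p>
  <section>
    <title><p>chapter 1</p></title>
    <p>chapter text</p>
  </section>
</section>
<section>
  <title><p>Act 4</p></title>
  <p>act text</p>
</section>[/code]

in a way that could be automated by turning it into a macro? My imagination has run out.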

Airborn · 13 February 2012 07:35

The most automatic approach would be to prepare your own XSLT stylesheet.

kamil_w · 13 February 2012 09:02

OK. And what algorithm?

EDIT:

I tried the proposed method, and it seems to me that doing the conversion with regular expressions (wildcards) is more convenient.

matzu · 13 February 2012 16:53

I'll admit I'm also curious how XSLT could be used here. If it really can be done, it will definitely be the best solution.

177 · 13 February 2012 19:52

Those headings in h1, h2, etc. tags are easy to convert with regular expressions.

Perl style:

[code]s/^(\s*)?<h([0-9]+)>(.+?)<\/h\2>/$1<title><p>$3<\/p><\/title>/mg[/code]

Example: http://ideone.com/la3cz
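For readers who do not use Perl, the same substitution can be sketched in Python. This is only an illustration of the idea, not the code behind the link above; the function name `headings_to_titles` and the sample text are made up for the example.

```python
import re

# Same idea as the Perl one-liner: rewrite each <hN>...</hN> at the start of a
# line as <title><p>...</p></title>, keeping the leading whitespace.
HEADING = re.compile(r"^(\s*)<h([0-9]+)>(.+?)</h\2>", re.MULTILINE)


def headings_to_titles(html: str) -> str:
    """Replace heading tags with title markup; other lines are left alone."""
    return HEADING.sub(r"\1<title><p>\3</p></title>", html)


if __name__ == "__main__":
    sample = "<h1>act 1</h1>\n<p>act text</p>\n  <h2>chapter 1</h2>"
    print(headings_to_titles(sample))
```

Like the Perl version, this only renames the headings. It does not open or close any `<section>` elements.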

But so far I have absolutely no idea how to build those sections using regular expressions alone. With some loops, conditions, and so on, it might just work.
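One possible shape for the "loops and conditions" part is a stack of currently open heading levels: a new heading first closes every open section of the same or a deeper level, then opens its own. The Python sketch below assumes one tag per line and `<hN>` headings, as in the question. The name `nest_sections` is hypothetical and nobody in this thread posted this code.

```python
import re

# A line that consists of exactly one heading, e.g. <h2>chapter 1</h2>.
HEADING = re.compile(r"^\s*<h([0-9]+)>(.+?)</h\1>\s*$")


def nest_sections(lines):
    """Wrap flat <hN>/<p> lines in nested <section> elements."""
    out = []
    open_levels = []  # heading levels of the sections that are still open
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match is None:
            out.append(line)  # ordinary content such as <p>...</p>
            continue
        level = int(match.group(1))
        # A heading closes every open section of the same or a deeper level.
        while open_levels and open_levels[-1] >= level:
            open_levels.pop()
            out.append("</section>")
        open_levels.append(level)
        out.append("<section>")
        out.append("<title><p>" + match.group(2) + "</p></title>")
    # Close whatever is still open at the end of the input.
    out.extend("</section>" for _ in open_levels)
    return out


if __name__ == "__main__":
    flat = ["<h1>act 1</h1>", "<p>act text</p>", "<h2>chapter 1</h2>", "<p>chapter text</p>"]
    print("\n".join(nest_sections(flat)))
```

The stack keeps the function a single pass over the input with no recursion. The trade-off is that it emits text rather than building a tree, so escaping and indentation are left to the caller.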

matzu · 13 February 2012 21:41

Here is the promised little program, to tide you over until you find a better solution (requires .NET Framework 4):

http://www.speedyshare.com/file/ad5rS/Parser.exe

Example invocation:

Parser "D:\input.txt"

Tag class

[code]using System.Xml.Linq;

namespace Parser
{
    // One heading (<hN>) with the paragraph that follows it and any nested headings.
    public class Tag
    {
        public Tag()
        {
            Title = string.Empty;
            Text = string.Empty;
            Tags = new Tags();
        }

        public string Title { get; private set; }
        public string Text { get; private set; }
        public Tags Tags { get; private set; }

        // lines[startIndex] is the heading and lines[startIndex + 1] its paragraph;
        // the remaining lines up to endIndex belong to nested tags.
        public void Load(int startIndex, int endIndex, string[] lines)
        {
            Title = TagUtil.GetTagContents(lines[startIndex]);
            Text = TagUtil.GetTagContents(lines[startIndex + 1]);

            Tags.Load(startIndex + 2, endIndex, lines);
        }

        // Builds <section><title><p>Title</p></title><p>Text</p>...nested sections...</section>.
        public XElement Save()
        {
            XElement result = new XElement("section",
                new XElement("title", new XElement("p", Title)),
                new XElement("p", Text));

            if (Tags.Count > 0)
                result.Add(Tags.Save());

            return result;
        }
    }
}[/code]

Tags class

[code]using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

namespace Parser
{
    public class Tags : List<Tag>
    {
        public Tags()
            : base()
        { }

        // Reads the file, drops blank lines, and parses what is left.
        public void Load(string filePath, Encoding fileEncoding)
        {
            List<string> lines = new List<string>();

            using (StreamReader reader = new StreamReader(filePath, fileEncoding))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    line = line.Trim();
                    if (!string.IsNullOrEmpty(line))
                        lines.Add(line);
                }
            }

            Load(0, lines.Count - 1, lines.ToArray());
        }

        // Splits lines[startIndex..endIndex] into tags at the heading level of the first line.
        public void Load(int startIndex, int endIndex, string[] lines)
        {
            if (startIndex > endIndex)
                return;

            // Assumes the first line of the range is a heading; .Value throws otherwise.
            int lowestTagLevel = TagUtil.GetTagLevel(lines[startIndex]).Value;

            int? tagStartIndex = null;
            for (int i = startIndex; i <= endIndex; i++)
            {
                int? tagLevel = TagUtil.GetTagLevel(lines[i]);
                bool addTag = false;

                if (tagLevel.HasValue && tagLevel == lowestTagLevel)
                {
                    if (tagStartIndex == null)
                        tagStartIndex = i;
                    else
                        addTag = true;
                }

                // Close the current tag at the next heading of the same level
                // or at the end of the range.
                if (addTag || (tagStartIndex.HasValue && (i + 1 > endIndex)))
                {
                    int tagEndIndex = endIndex;
                    if (addTag)
                        tagEndIndex = i - 1;

                    Tag tag = new Tag();
                    tag.Load(tagStartIndex.Value, tagEndIndex, lines);
                    Add(tag);

                    tagStartIndex = null;
                    if (addTag)
                        tagStartIndex = i;
                }
            }
        }

        public void Save(string filePath)
        {
            XDocument doc = new XDocument(
                new XDeclaration("1.0", "utf-8", "yes"),
                new XElement("root", Save()));

            doc.Save(filePath);
        }

        public XElement[] Save()
        {
            var elements = from tag in this
                           select tag.Save();
            return elements.ToArray();
        }
    }
}[/code]

TagUtil class

[code]using System.Text.RegularExpressions;

namespace Parser
{
    public static class TagUtil
    {
        private static Regex tagContentsRegex;
        private static Regex tagLevelRegex;

        static TagUtil()
        {
            // Captures the text between an opening tag and the next closing tag.
            tagContentsRegex = new Regex("<.*?>(?<tagContents>.*?)</.*?>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
            // Captures the numeric level of a heading tag, e.g. 2 for <h2>.
            tagLevelRegex = new Regex(@"<h(?<tagLevel>[\d]+)>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
        }

        public static string GetTagContents(string line)
        {
            return tagContentsRegex.Match(line).Groups["tagContents"].Value;
        }

        // Returns null when the line contains no heading tag.
        public static int? GetTagLevel(string line)
        {
            string tagLevel = tagLevelRegex.Match(line).Groups["tagLevel"].Value;

            if (string.IsNullOrEmpty(tagLevel))
                return null;
            return int.Parse(tagLevel);
        }
    }
}[/code]

Class implementing Main

[code]using System;
using System.IO;
using System.Text;

namespace Parser
{
    class Program
    {
        static void Main(string[] args)
        {
            string filePath = string.Empty;
            if (args.Length > 0)
                filePath = args[0];

            if (File.Exists(filePath))
            {
                try
                {
                    Tags tags = new Tags();
                    tags.Load(filePath, Encoding.UTF8);

                    // Write the result next to the input file, with an .xml extension.
                    tags.Save(Path.Combine(
                        Path.GetDirectoryName(filePath),
                        string.Format("{0}{1}", Path.GetFileNameWithoutExtension(filePath), ".xml")
                        ));
                }
                catch (Exception ex)
                {
                    Console.WriteLine("An unexpected exception occurred.\n\nError details:\n\n{0}", ex.Message);
                }
            }
            else
                Console.WriteLine("Invalid file path.");
        }
    }
}[/code]

kamil_w · 13 February 2012 22:22

So far I've been experimenting with wildcards in Word and came up with the following method (I'm giving the method rather than the code, but I hope you'll understand how it works):

matzu · 13 February 2012 22:56

I wrote how it should be used. Exactly like this: Parser.exe "E:\input.txt"

Is there nothing in the error details? And could you send me that input.txt file?

kamil_w · 13 February 2012 23:03

Text from the command line:

Input file:

http://dl.dropbox.com/u/5730855/Dokumenty/input.txt

matzu · 13 February 2012 23:13

I see it now. The problem is caused by line 3:

[code]<h1>act 1</h1>
<p>act text</p>
<p>act text</p>
…[/code]

and by the last line, namely:

matzu · 14 February 2012 00:21

Well, that really is great fun. The thing is, since your input is a ready-made HTML file, I'm not sure it wouldn't be better to use a library that can parse HTML (for .NET there is, for example, http://forum.dobreprogramy.pl/wyrazenia-regularne-wyciaganie-danych-znacznikow-t476551.html). Now that the file is already so "mangled", a lot has to be written by hand and you can't rely on something off the shelf. In any case, here is the updated parser:

http://www.speedyshare.com/file/38wVy/Parser.exe (BTW, do you program in .NET? I'm not sure whether there's any point in my posting the code.)

While I'm at it: the parser will not work correctly if a single line contains several tags, e.g. several paragraphs (<p>). The individual h and p tags must be separated by a newline (in other words, the input must look like the examples you posted). From what you wrote, I'm guessing that this doesn't have to be the case, i.e. one paragraph may span several lines, or a closing and an opening tag may sit right next to each other (</p><p>). Can you tell me whether that is really so? If it is, this parser won't be of use to you. In that case I won't make any further fixes, because it would be better to simply approach the problem differently (as I wrote in the first paragraph).
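To illustrate what "approaching the problem differently" could look like, here is a minimal Python sketch that uses the standard library's `html.parser` instead of line-by-line regular expressions, so it does not matter where the line breaks fall. It is not the .NET library mentioned above. The names `BlockCollector` and `collect_blocks` are hypothetical, and inline markup inside a paragraph is simply dropped.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collect (tag, text) pairs for <hN> and <p> elements."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished (tag, text) pairs, in document order
        self._tag = None   # block tag currently being read, if any
        self._text = []    # text chunks seen inside the current block

    @staticmethod
    def _is_block(tag):
        # HTMLParser reports tag names in lower case.
        return tag == "p" or (len(tag) == 2 and tag[0] == "h" and tag[1].isdigit())

    def handle_starttag(self, tag, attrs):
        if self._is_block(tag):
            self._tag, self._text = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse line breaks and runs of spaces inside the block.
            text = " ".join("".join(self._text).split())
            self.blocks.append((tag, text))
            self._tag = None


def collect_blocks(html):
    """Return the headings and paragraphs of an HTML fragment as (tag, text) pairs."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    print(collect_blocks("<h1>act 1</h1><p>first\nline</p><p>second</p>"))
```

The resulting list of pairs has the same "one block per entry" shape that the line-based parser expects, so the nesting logic can stay unchanged while the input restrictions go away.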

Tag class

[code]using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

namespace Parser
{
    // One heading (<hN>) with its paragraphs and any nested headings.
    public class Tag
    {
        public Tag()
        {
            Title = string.Empty;
            Paragraphs = new List<string>();
            Tags = new Tags();
        }

        public string Title { get; private set; }
        public List<string> Paragraphs { get; private set; }
        public Tags Tags { get; private set; }

        public void Load(int startIndex, int endIndex, string[] lines)
        {
            Title = TagUtil.GetHTagContents(lines[startIndex]);

            // Collect paragraphs until the next heading or the end of the range.
            int i = startIndex + 1;
            for (; i <= endIndex; i++)
            {
                if (TagUtil.GetTagLevel(lines[i]).HasValue)
                    break;

                string paragraph = TagUtil.GetPTagContents(lines[i]);
                if (!string.IsNullOrEmpty(paragraph))
                    Paragraphs.Add(paragraph);
            }

            // Nested tags start at the first heading after the paragraphs;
            // a fixed offset of 2 is wrong when there are zero or several paragraphs.
            if (i <= endIndex)
                Tags.Load(i, endIndex, lines);
        }

        public XElement Save()
        {
            XElement result = new XElement("section",
                new XElement("title", new XElement("p", Title)));

            if (Paragraphs.Count > 0)
                result.Add(
                    from paragraph in Paragraphs
                    select new XElement("p", paragraph));

            if (Tags.Count > 0)
                result.Add(Tags.Save());

            return result;
        }
    }
}[/code]

Tags class

[code]using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

namespace Parser
{
    public class Tags : List<Tag>
    {
        public Tags()
            : base()
        { }

        // Reads the file, drops blank lines, and parses what is left.
        public void Load(string filePath, Encoding fileEncoding)
        {
            List<string> lines = new List<string>();

            using (StreamReader reader = new StreamReader(filePath, fileEncoding))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    line = line.Trim();
                    if (!string.IsNullOrEmpty(line))
                        lines.Add(line);
                }
            }

            Load(0, lines.Count - 1, lines.ToArray());
        }

        // Splits lines[startIndex..endIndex] into tags at the heading level of the first line.
        public void Load(int startIndex, int endIndex, string[] lines)
        {
            // Nothing to do for an empty range (e.g. an empty input file).
            if (startIndex > endIndex)
                return;

            int? lowestTagLevel = TagUtil.GetTagLevel(lines[startIndex]);
            if (!lowestTagLevel.HasValue)
                return;

            int? tagStartIndex = null;
            for (int i = startIndex; i <= endIndex; i++)
            {
                int? tagLevel = TagUtil.GetTagLevel(lines[i]);
                bool addTag = false;

                if (tagLevel.HasValue && tagLevel == lowestTagLevel)
                {
                    if (tagStartIndex == null)
                        tagStartIndex = i;
                    else
                        addTag = true;
                }

                if (addTag || (tagStartIndex.HasValue && (i + 1 > endIndex)))
                {
                    int tagEndIndex = endIndex;
                    if (addTag)
                        tagEndIndex = i - 1;

                    Tag tag = new Tag();
                    tag.Load(tagStartIndex.Value, tagEndIndex, lines);
                    Add(tag);

                    tagStartIndex = null;
                    if (addTag)
                        tagStartIndex = i;
                }
            }

            // A heading on the very last line starts a tag that the loop never closes.
            if (tagStartIndex.HasValue)
            {
                Tag lastTag = new Tag();
                lastTag.Load(tagStartIndex.Value, endIndex, lines);
                Add(lastTag);
            }
        }

        public void Save(string filePath)
        {
            XDocument doc = new XDocument(
                new XDeclaration("1.0", "utf-8", "yes"),
                new XElement("root", Save()));

            doc.Save(filePath);
        }

        public XElement[] Save()
        {
            var elements = from tag in this
                           select tag.Save();
            return elements.ToArray();
        }
    }
}[/code]

TagUtil class

[code]using System.Text.RegularExpressions;

namespace Parser
{
    public static class TagUtil
    {
        private static Regex pTagContentsRegex;
        private static Regex hTagContentsRegex;
        private static Regex tagLevelRegex;

        static TagUtil()
        {
            // {0} is the tag letter; the pattern captures the text between
            // the opening tag and the matching closing tag.
            string tagContentsFormat = "<.*?{0}.*?>(?<tagContents>.*?)<.*?/.*?{0}.*?>";
            string pTagContents = string.Format(tagContentsFormat, "p");
            string hTagContents = string.Format(tagContentsFormat, "h");

            pTagContentsRegex = new Regex(pTagContents, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            hTagContentsRegex = new Regex(hTagContents, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            // Captures the numeric level of a heading tag, e.g. 2 for <h2>.
            tagLevelRegex = new Regex(@"<h(?<tagLevel>[\d]+)>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
        }

        public static string GetPTagContents(string line)
        {
            return pTagContentsRegex.Match(line).Groups["tagContents"].Value;
        }

        public static string GetHTagContents(string line)
        {
            return hTagContentsRegex.Match(line).Groups["tagContents"].Value;
        }

        // Returns null when the line contains no heading tag.
        public static int? GetTagLevel(string line)
        {
            string tagLevel = tagLevelRegex.Match(line).Groups["tagLevel"].Value;

            if (string.IsNullOrEmpty(tagLevel))
                return null;
            return int.Parse(tagLevel);
        }
    }
}[/code]

Program class

unchanged

kamil_w · 14 February 2012 08:41

C# is not foreign to me, but I'm no great ".NET virtuoso" either. The code will come in handy. I can always learn something new.

EDIT:

Sample code from the book:

[code]

Dla Johna i Gail,

którzy dzielili się ze mną mięsem

i miodem

Prolog

Ogon komety przecinał blask jutrzenki niczym czerwona szrama krwawiąca ponad turniami Smoczej Skały, rana zadana różowo-purpurowemu niebu. Maester stał na wystawionym na podmuchy wiatru balkonie, z dala od swych komnat. Tu właśnie przybywały po długim locie kruki. Ptaki splamiły swymi odchodami wznoszące się po obu jego stronach kamienne chimery wysokości dwunastu stóp, przedstawiające piekielnego ogara i wiwernę. Dwie z otaczającego starożytną fortecę tysiąca. Gdy przybył na Smoczą Skałę, niepokoił go widok groteskowych, kamiennych posągów, z biegiem lat przyzwyczaił się jednak do nich. Uważał je teraz za starych przyjaciół. Obserwowali we trójkę niebo, pełni złych przeczuć. Cressen nie wierzył w znaki. Mimo to… choć był już stary, nigdy w życiu nie widział komety, która byłaby choć w połowie tak jasna lub miała taki kolor, straszliwy kolor krwi, płomieni i zmierzchu. Zastanawiał się, czy patrzyły już na taką jego chimery. Były tutaj znacznie dłużej od niego i będą tu nadal stały, gdy jego już od dawna nie będzie. Gdyby kamienne języki umiały mówić…Cóż za głupota. Stał wsparty o mur, w dole fale uderzały z hukiem o brzeg, a pod palcami czuł szorstki, czarny kamień. _Mówiące chimery i wieszcze znaki na niebie. Jestem stojącym nad grobem, do cna zdziecinniałym starcem._Czyżby będąca owocem wieloletnich starań mądrość opuściła go wraz ze zdrowiem i siłą? Był maesterem, uczył się w wielkiej Cytadeli Starego Miasta i tam też otrzymał swój łańcuch. Jak to możliwe, że głowę wypełniały mu przesądy, jakby był ciemnym parobkiem?[/code]

PS: Maybe it really would be a good idea to write such a program in C#. I suspect I wouldn't be the only one to find it useful, because the doc -> ebook converters I've tried are rather inaccurate and, instead of cleaning up the code, usually clutter it even more.