Hello,

Typically, I handle my data-wrangling jobs using Python and Perl. I have a lot of legacy code in both languages, so I will need to find some free time to rewrite it all in Nim.

Sometimes, I need to extract key information from an unstructured data file... and I would like to start using Nim to do that.

Please have a look at this thread: https://opendata.stackexchange.com/questions/92/good-tools-to-parse-repetitive-unstructured-data

In your opinion, how could I use Nim to solve a pattern like the one below? That is, what is the best way to parse a large number of lines of repetitive unstructured data?

Maria Teresa’s Babies Early Enrichment Center/Daycare
825 23rd Street South
Arlington, VA 22202
703-979-BABY (2229)
22.
Maria Teresa Desaba, Owner/Director; Tony Saba, Org. Director.
Website: www.mariateresasbabies.com
Serving children 6 wks to 5yrs full-time.


National Science Foundation Child  Development Center
23.
4201 Wilson Blvd., Suite 180  22203
703-292-4794
Website:  www.brighthorizons.com 112 children, ages 6 wks - 5 yrs.
7:00 a.m.  6:00 p.m. Summer Camp for children 5 - 9 years.
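For concreteness, one might split records like the two above on blank lines and mine each one with a regex. A minimal Nim sketch (the record layout, first-line-is-the-name assumption, and phone pattern are my guesses, not a vetted parser):

```nim
import std/[re, strutils]

# Two of the records above, pasted in as one string for illustration.
let listing = """
Maria Teresa's Babies Early Enrichment Center/Daycare
825 23rd Street South
Arlington, VA 22202
703-979-BABY (2229)

National Science Foundation Child Development Center
4201 Wilson Blvd., Suite 180  22203
703-292-4794
"""

# Assumed phone shape: NNN-NNN- followed by four digits or letters.
let phoneRe = re"\d{3}-\d{3}-[0-9A-Za-z]{4}"

for record in listing.strip.split("\n\n"):
  let fields = record.splitLines
  echo "name:  ", fields[0]          # first line is the name, by assumption
  let phones = record.findAll(phoneRe)
  if phones.len > 0:
    echo "phone: ", phones[0]
```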

2017-10-09 22:03:57
Taking a "big picture" approach:
  • If you are going to automate extracting data, you must know the rules that define the data "fields" you want to extract, or else no tool will do it for you.
  • Most tools have some regular-expression (RE) capability (not just Perl), but is an RE the answer to "how do I identify each field?" If you can delineate between fields without using an RE, that may make your code faster, easier to modify, or more readable (but it also may not).
  • Are these files small enough that you can read the whole file into memory and process a single string of data, or do you need to process the file iteratively (by line, or by chunk of data) to save memory?
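Both approaches are cheap to express in Nim. A sketch of the two styles (the file name is illustrative, and the block writes its own sample so it runs standalone):

```nim
import std/strutils

# Write a tiny sample so the sketch is self-contained; "daycare.txt"
# is just an illustrative name.
writeFile("daycare.txt", """
Maria Teresa's Babies
703-979-BABY (2229)

NSF Child Development Center
703-292-4794
""")

# a) Small file: slurp the whole thing and work on one string.
let records = readFile("daycare.txt").strip.split("\n\n")

# b) Large file: stream line by line, accumulating one record at a time.
var current: seq[string]
for line in lines("daycare.txt"):
  if line.strip.len == 0:
    if current.len > 0:
      # process `current` here, then reset it for the next record
      current.setLen 0
  else:
    current.add line
if current.len > 0:
  discard  # process the final record
```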
If you end up using regular expressions in Nim, cheat by leveraging what has already been done in nimgrep. It is easy to end up with an RE that is slow in Nim, so if speed is an issue, make sure you benchmark it. (The RE isn't slow because it uses PCRE; the reason, I believe, is that it is easy for a newbie to write code with lots of string allocations that slow it down.)
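One concrete habit that avoids the allocation trap: compile the pattern once and reuse it, rather than rebuilding it (or intermediate strings) inside the per-line loop. A sketch, with an invented pattern and proc name:

```nim
import std/re

# Compiled once, up front; writing re"..." inside a per-line loop
# would recompile the pattern on every iteration.
let phoneRe = re"\d{3}-\d{3}-\d{4}"

proc hasPhone(line: string): bool =
  # `contains` only tests for a match; no capture strings are allocated.
  line.contains(phoneRe)
```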
The following probably applies to more structured data, but I'll mention it for completeness:
@Araq posted about how a tool like SQLite can be a good choice if you then need to manipulate the extracted data.
2017-10-10 01:34:23
Thanks for your advice. Cheers
2017-10-12 00:25:30