Staffan Nöteberg helps you really understand how the machinery works under the hood. Learn advanced tools like reluctant quantifiers, lookbehind, and nondeterministic finite automata to write efficient and elegant regexes with ease.

In this illustrated guide, you gain precisely that understanding, even with no prior knowledge of regular expressions.

http://pragprog.com/titles/d-snrem

@staffannoteberg

#regularexpressions #patternmatching #regex

**Steven Sanderson** @spsanderson@rstats.me · Jan 17

Struggling with text processing in Linux?

My latest blog post at https://www.spsanderson.com/steveondata/posts/2025-01-17/ breaks down common challenges and offers practical solutions! Learn how to tackle duplicates and sort data with ease.

#Text #Blog #Technology #Textprocessing #Programming #Linux #Help

Let me know what you think!

**Steven P. Sanderson II, MPH** @stevensanderson@mstdn.social · Jan 17

Struggling with text processing in Linux?

My latest blog post at https://www.spsanderson.com/steveondata/posts/2025-01-17/ breaks down common challenges and offers practical solutions! Learn how to tackle duplicates and sort data with ease.

#Text #Blog #Technology #Textprocessing #Programming #Linux #Help

Let me know what you think!

**Veronica Olsen** @veronica@mastodon.online · Nov 12, 2024 *

Back when I first wrote text processing code in the 90s on my Amiga 1200, I always used the ¤ symbol as a placeholder character for splitting and replacing to exclude things I wanted skipped without affecting character count. It was available on the Norwegian keyboard, and practically never used in text.

Recently I discovered that Unicode has two "Not a character" symbols perfect for the same usage: \uFFFE and \uFFFF.

They can be really useful!
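
A minimal sketch of the idea, assuming Python: spans that should be skipped are overwritten with U+FFFF, one placeholder per character, so a later search ignores them while every offset still lines up with the original string. The helper name `mask_spans` and the backtick example are illustrative assumptions, not code from the post.

```python
import re

MASK = "\uffff"  # Unicode noncharacter, reserved for internal use


def mask_spans(text, pattern):
    """Replace every match of `pattern` with MASK repeated to the same length."""
    return re.sub(pattern, lambda m: MASK * len(m.group()), text)


source = "find cat but not `cat` here"
# Hide anything between backticks so it cannot be matched later.
masked = mask_spans(source, r"`[^`]*`")

# Offsets found in the masked copy are valid in the original string too.
positions = [m.start() for m in re.finditer("cat", masked)]
print(len(masked) == len(source), positions)
```

Because the noncharacters are not meant for interchange, the masked copy is best kept internal and discarded once the offsets have been collected.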

#Code #Python #TextProcessing

Continued thread

**Veronica Olsen** @veronica@mastodon.online · Oct 23, 2024 *

2. Immediately after the split, replace U+FFFF with newline, but keep both versions of the line, and pass the one with the U+FFFF to the text paragraph parser. Everything else (like headings) gets the cleaned one.

3. After paragraph lines with a single break between them (belonging to the same paragraph) have been processed, THEN I replace the U+FFFF characters there.

It seems to work, but it took me like 3-4 hours to crack.
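
Steps 2 and 3 might look roughly like the sketch below. It is an assumption-laden illustration, not the editor's actual code: `split_keep_both` and `render_paragraph` are hypothetical names, and joining paragraph lines with a single space stands in for the "do not preserve single line breaks" option.

```python
HARD_BREAK = "\uffff"  # placeholder that splitlines() does not treat as a line boundary


def split_keep_both(text):
    """Split into lines, returning (marked, cleaned) versions of each line."""
    return [(line, line.replace(HARD_BREAK, "\n")) for line in text.splitlines()]


def render_paragraph(marked_lines):
    """Join soft-wrapped lines with spaces, then turn placeholders into newlines."""
    out = ""
    for line in marked_lines:
        # No joining space right after a hard break, to avoid stray whitespace.
        if out and not out.endswith(HARD_BREAK):
            out += " "
        out += line
    return out.replace(HARD_BREAK, "\n")


text = "# Title\uffffsubtitle\nfirst line\uffffsecond line\nthird line"
pairs = split_keep_both(text)

heading = pairs[0][1]  # headings get the cleaned version
paragraph = render_paragraph([marked for marked, _ in pairs[1:]])
print(repr(heading))
print(repr(paragraph))
```

Delaying the final `replace` until after the lines are joined is what keeps the hard breaks from being confused with the soft line breaks inside the paragraph.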

4/4

#Python #TextProcessing #Unicode

Continued thread

**Veronica Olsen** @veronica@mastodon.online · Oct 23, 2024

I tried using the alternative line and paragraph separators from Unicode, but splitlines accepts them too. Then I discovered these Unicode characters:

U+FFFE <noncharacter-FFFE> not a character.
U+FFFF <noncharacter-FFFF> not a character.

The solution, then, was:

1. Replace all occurrences of [br] with or without a trailing newline, using regex pattern "(?i)(?<!\\)(\[br\]\n?)", with a U+FFFF character.
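
As a rough sketch of step 1 in Python, using the pattern quoted above as a raw string: the negative lookbehind `(?<!\\)` leaves a backslash-escaped `\[br]` untouched, and `(?i)` makes the shortcode case-insensitive. The names `HARD_BREAK` and `mark_hard_breaks` are assumptions made for illustration.

```python
import re

HARD_BREAK = "\uffff"
# Pattern from the post: [br], optionally followed by a newline, unless escaped.
BR_PATTERN = re.compile(r"(?i)(?<!\\)(\[br\]\n?)")


def mark_hard_breaks(text):
    """Replace each unescaped [br] (and one trailing newline) with U+FFFF."""
    return BR_PATTERN.sub(HARD_BREAK, text)


sample = "Line one[br]\nline two \\[br] [BR]end"
marked = mark_hard_breaks(sample)
print(repr(marked))

# Unlike U+2028 and U+2029, the placeholder is not a line boundary for splitlines().
print(len(marked.splitlines()), len("a\u2028b".splitlines()))
```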

3/4

#Python #TextProcessing #Unicode

Continued thread

**Veronica Olsen** @veronica@mastodon.online · Oct 23, 2024

This works fine in principle, but it is incredibly hard to figure out exactly when to make the replacement.

For instance, if I do it too early, the parser will split on the breaks, as I use splitlines() early on. If I do it too late, I get double line breaks in some places.

2/4

#Python #TextProcessing #Unicode

**Veronica Olsen** @veronica@mastodon.online · Oct 23, 2024 *

I've been struggling with solving an issue with my text editor project. The editor is plain text and uses a blank line to separate paragraphs.

The editor has an option to preserve or not preserve single line breaks inside paragraphs when generating the output.

However, some users want to not preserve them, but still want to be able to add hard breaks sometimes. So I've been trying out using [br] as a hard break shortcode.

1/4

#Python #TextProcessing #Unicode

**Grumpy Old Techie** @grumpyoldtechie@hostux.social · Oct 21, 2024

Maybe I’m just growing really old. Today I stumbled across a GitHub repository with a few hundred lines of Python that could be one or two awk, sed or grep one-liners.
Seriously, if you are using Linux or one of the BSDs, learn how to use the standard text utilities that come with the OS.
In modern times, jq should be added to the traditional list.

#text #python #unix

**Veronica Olsen** @veronica@mastodon.online · Oct 18, 2024 *

I started working on a Python class to write MS Office Word documents from already tokenized formatted text. It took me 5 hours to get a working version that can handle most of the formatting I need.

I have already done this with the Open Document format. It took me significantly longer, but I did reuse some of that code for DocX.

That said, it turns out DocX is actually the easier format to generate XML for.

#Python #Code #Documents

**Wesley Moore** @wezm@mastodon.decentralised.social · Sep 24, 2024

Discovered a neat new tool last week: https://github.com/wr7/refold

It's similar to `fmt` and `fold` except that it automatically handles prefixes. Vim/Neovim `gq` can do this out of the box but fails (for me at least) when multiple prefixes are present, such as a Markdown block-quote inside Rust comments. E.g.

```
// > Some quoted text
// > to reflow.
```

`refold` handles this.

GitHub - wr7/refold: A commandline tool for performing text-wrapping.

#Rust #RustLang #TextProcessing

**IT News** @itnewsbot@schleuss.online · Feb 1, 2021

13,000 Regular Expressions Make An Editor’s Life Easier - Being an editor is a job that seems deceptively easy until you are hauled over the coals for letti... - https://hackaday.com/2021/02/01/13000-regular-expressions-make-an-editors-life-easier/ #softwaredevelopment #regularexpressions #textprocessing #softwarehacks #regex

Hackaday: 13,000 Regular Expressions Make An Editor’s Life Easier, by Jenny List

**Doc Edward Morbius** @dredmorbius@mastodon.cloud · Dec 28, 2019

Stupid Awk text-processing tricks: Reframe your record and field delimiters

A longer write-up on the text-processing stuff I've been mucking with for the past few weeks.

Changing your RS (record separator) and FS (field separator) values can be ... tremendously useful.

https://joindiaspora.com/posts/16860494
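
For readers without awk at hand, here is a rough Python analogue of the same reframing, offered as a sketch rather than the write-up's actual approach: blank lines act as the record separator and newlines as the field separator, similar in spirit to awk's paragraph mode. The function `parse_records` and the sample data are made up for illustration.

```python
import re


def parse_records(text, record_sep=r"\n\s*\n", field_sep="\n"):
    """Split text into records on blank lines, then each record into fields."""
    records = re.split(record_sep, text.strip())
    return [rec.split(field_sep) for rec in records if rec]


data = "Alice\n42 Main St\nOslo\n\nBob\n7 Hill Rd\nBergen\n"
records = parse_records(data)

for name, street, city in records:
    print(f"{name}: {street}, {city}")
```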

#awk #gawk #scripting

Continued thread

**Doc Edward Morbius** @dredmorbius@mastodon.cloud · Dec 26, 2019

For added joy: records (classification entries) and components (descriptions, cross-references, filing/usage notes) can span lines, columns, and pages. So getting something usable takes work.

#textProcessing #pdfConversion #pdftotext #sed #awk #perl #python

(See thread.)
