[Question] Sed and regex string manipulation

[email protected]

Thank you very much for taking the time to help me, comments and all!

you need a full featured markdown parser for this.

Do you mean something like pandoc? Someone pointed me to it, and it seems it can convert to GitHub-Flavored Markdown!

Sorry for the very late response! Here is the working bash script another user helped me put together.

However, there are some cases where this script will fail, e.g. if there is an escaped ] character in the link text. You cannot avoid such mistakes using only simple regexps; you need a full featured markdown parser for this.

Thanks for the pointer, I will give it a try to see how it works out with my actual script. If you are curious, here it is:

#!/bin/bash

# Markdown file to fix in place
files="/home/USER/projects/test.md"

# Collect link targets that do not start with "https", then keep only the "#fragment)" part
mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"

# Note: each fragment is reused below as a sed pattern, so fragments containing
# "/" or regex metacharacters may not match as intended
while IFS= read -r line; do
	# Converts 1.2 to 1-2 (a third-level heading such as 1.2.3 needs an additional [0-9] group)
	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
	sed -i "s/$line/${dashlink}/" "$files"

	# Lowercases everything after the # (\L is a GNU sed extension)
	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
	sed -i "s/$dashlink/${lowercaselink}/" "$files"

	# Replaces encoded spaces (%20) with hyphens in the part after the #
	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
	sed -i "s/$lowercaselink/${spacelink}/" "$files"

done <<<"$mdlinks2"

[email protected]

I'll give it another go.

[email protected]

Hello! I will take a look at it, I just haven't had a chance over the last day. Give me a couple days and I will give some feedback. Bear in mind I am not an expert, so I might not have much to offer, but I'll share what I can.

[email protected]

This might work, but I think it is best to not tinker further if you already have a working script (especially one that you understand and can modify further if needed).

perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'
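For readers who don't read Perl, here is a rough Python sketch of what the one-liner above appears to do: find markdown links whose target does not start with http or https, then, in the part after the #, replace %20 and any dot between two digits with a hyphen and lowercase the result. The names INTERNAL_LINK, fix_fragment and fix_line are made up for this sketch, and the target pattern is slightly tightened (it also stops at a closing parenthesis), so treat it as an approximation rather than an exact port.

```python
import re

# [label](target#fragment) where the target does not start with http/https.
# Assumption: unlike the Perl pattern, the target part may not contain ")".
INTERNAL_LINK = re.compile(r'(\[[^\]]+\]\((?!https?)[^#)]*#)([^)]+)(?=\))')


def fix_fragment(fragment: str) -> str:
    """Replace %20 and digit.digit dots with hyphens, then lowercase."""
    fragment = re.sub(r'%20|(?<=\d)\.(?=\d)', '-', fragment)
    return fragment.lower()


def fix_line(line: str) -> str:
    """Rewrite the fragment of every internal link found on one line."""
    return INTERNAL_LINK.sub(
        lambda m: m.group(1) + fix_fragment(m.group(2)), line
    )


print(fix_line('[Link with numbers](readme.md#1.3%20this%20is%20another%20test)'))
# prints: [Link with numbers](readme.md#1-3-this-is-another-test)
```

External links such as [Just a link](https://mylink/%20with%20space.com) are left untouched because of the negative lookahead.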

[email protected]

I did it!! It also handles the case where an external link and an internal link are on the same line:

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Here is my annotated file:

# Begin loop
:l;

# Bisect first link in pattern space into pattern space and append to hold space
# Example: `text [label](file#fragment)'
#   Pattern space: `file#fragment)'
#   Hold space: `text [label]('
# Steps:
#   1. Strategically insert \n
#       1a. If this fails, branch out
#   2. Append to hold space (this creates two \n's. It feels weird for the
#      first iteration, but that's ok)
#   3. Copy hold space to pattern space, remove first \n, then trim off
#      everything past the second \n
#   4. Swap pattern/hold, and trim off everything up to and incl the last \n
s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;
Te;
H;
g; s/\n//; s/\n.*//;
x; s/.*\n//;

# Modify only if it is an internal link
/^https?:/! {
    # Add hyphens
    :h;
    s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;
    th;
    # Make lowercase
    s/(#[^)]*\))/\L\1/;
};

# "conditional" branch so it checks the next conditional again
tl;

# Exit: join pattern space to hold space, then move to pattern space.
# Since the loop uses H instead of h, have to make sure hold space is empty
:e;
H;
z;
x; s/\n//;

[email protected]

Wow! Thank you! I ran a quick test on a test-file.md:

[Just a test](#just-a-test)
[Just a link](https://mylink/%20with%20space.com)
[External link](readme.md#just-a-test)
[Link with numbers](readme.md#1-3-this-is-another-test)
[Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test)

Great job! Thank you very much!!! I'm really impressed by what someone with proper knowledge can do! However, I really do not want to mess around with your regex... That would only invite disaster xD! I will carefully keep your regex and annotated file in my knowledge base; I'm sure at some point I will come back to it and try to break it down as a learning exercise.

Thank you very much!!!

[email protected]

Hello, sorry to ping you. I just gave pandoc a try, but it doesn't work, and I had to dig a bit further on the web to find out why!

Links to headings with spaces are not specified by CommonMark, and each tool implements a different approach... Most replace spaces with hyphens; others use URL encoding (%20). So even though pandoc looks awesome, it doesn't work for my use case (or did I miss something? Feel free to comment).
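To make the two conventions concrete, here is a minimal Python sketch that builds both styles of anchor from the same heading text. The function names hyphen_anchor and encoded_anchor are made up for illustration, and the hyphen variant is deliberately simplified: real tools also differ in how they treat punctuation, dots and duplicate headings.

```python
from urllib.parse import quote


def hyphen_anchor(heading: str) -> str:
    """Hyphen style (simplified): lowercase, spaces become hyphens."""
    return heading.strip().lower().replace(" ", "-")


def encoded_anchor(heading: str) -> str:
    """URL-encoding style: spaces become %20, case is kept."""
    return quote(heading.strip())


print(hyphen_anchor("Just a test"))   # prints: just-a-test
print(encoded_anchor("Just a test"))  # prints: Just%20a%20test
```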

You can give it a try on https://pandoc.org/try/ with commonmark to gfm:

[Just a test](#Just a test)
[Just a link](https://mylink/%20with%20space.com)
[External link](Readme.md#JUST%20a%20test)
[Link with numbers](readme.md#1.3%20this%20is%20another%20test)
[Link with numbers](Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test)

If you prefer a CLI version:

pandoc --from=commonmark_x --to=gfm+gfm_auto_identifiers "/home/user/Documents/test.md" -o "pandoc_test.md"

[email protected]

Thank you! It actually ticks every box for my use case (for my files) and looks pretty rad!

This might work, but I think it is best to not tinker further if you already have a working script (especially one that you understand and can modify further if needed).

I totally agree, but I will keep your regex as a reference. In the near future I will try to break down your regex as a learning exercise, but it looks rather complex!

Another user came up with the following solution:

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Just as a little experiment, if you want to spend some time on it and give me an answer: what do you think? It's another way to achieve the same kind of result, but the two are significantly different. I know there are a thousand ways to achieve the same result, but I'm curious how it looks through an expert's eyes :).

Thanks again for your help and the time you took to write up a complex regex for my use case!

[email protected]

Hey, take your time. Don't worry even if you forget; you did more than enough to help some random person on the web! Two other users came up with plain, bare-bones regex solutions, if you want to have a look. Maybe there's something you can learn from them? (I doubt it xD)

Plain sed regex (https://lemmy.ml/post/25346014/16453351)

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Plain Perl regex (https://lemmy.ml/post/25346014/16453161)

perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'

Nonetheless, I really prefer your solution because, as someone else said, I will have an easier time changing a script I "understand". So thanks again!

[email protected]

Hey, I just did a quick web search and found this. I haven't used this specific tool before. However, I recommend either searching the web for a similar tool or using a ChatGPT-like tool to create a Python script that achieves your end result. Sed and regex are cool and useful, but they're only going to make it more difficult to achieve what you need.

[email protected]

No problem. I think this is a great "final boss" question for learning sed, because it turns out to be deceptively hard!! You have to understand a lot not only about regex, but also about sed, to get it right. I learned a lot about sed just by tackling this problem!

I really do not want to mess around with your regex

It is very delicate for sure, but one part you can safely change is the # Add hyphens part. In the regex you can see (%20|\.). This is a list of "characters" that get converted to hyphens. For example, you could modify it to (%20|\.|\+) and it will convert +s to -s as well!
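The same idea of an editable list of tokens can be sketched in Python, which may be easier to tweak than the sed program. This is only a sketch under assumptions, not a port of the sed command: HYPHEN_TOKENS, LINK and fix_links are names invented here, and, like the sed version, it replaces every listed token found after the #, lowercases that part, and skips targets starting with http: or https:.

```python
import re

# Tokens that become hyphens inside the fragment; extend this tuple as needed.
HYPHEN_TOKENS = ("%20", ".")

# [label](target#fragment), split into prefix, target and fragment.
LINK = re.compile(r'(\[[^\]]*\]\()([^)#]*)#([^)]*)\)')


def fix_links(line: str, tokens=HYPHEN_TOKENS) -> str:
    """Hyphenate and lowercase the fragment of every internal link."""
    token_re = re.compile("|".join(re.escape(t) for t in tokens))

    def repl(m: re.Match) -> str:
        prefix, target, fragment = m.groups()
        if re.match(r'https?:', target):
            return m.group(0)  # external link: leave as is
        return f"{prefix}{target}#{token_re.sub('-', fragment).lower()})"

    return LINK.sub(repl, line)


print(fix_links('[a](https://x.com/#Frag) [b](Readme.md#1.3%20Test)'))
# prints: [a](https://x.com/#Frag) [b](Readme.md#1-3-test)
print(fix_links('[c](#a+B)', tokens=("%20", ".", "+")))
# prints: [c](#a-b)
```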

Still it is not perfect:

If the link spans multiple lines, the regex won't match.
If the link contains escaped characters, like \\\\\[LINK](#LINK) or [LINK\]\\\\](#LINK), the regex will mishandle it.
If the link is inside a code block (```), it will get changed (which may or may not be intended).

But for a sed-only solution this is about as good as it will get I'm afraid.
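Outside of sed, the code-block case is fairly easy to sidestep. Here is a minimal Python sketch that applies a per-line transform only to lines outside fenced code blocks. The names FENCE and transform_outside_fences are invented for this sketch, and the fence detection is a simplification: it toggles on any line starting with ``` or ~~~ and does not follow the full CommonMark rules on fence length or type.

```python
import re

# A line that opens or closes a fenced code block (simplified detection).
FENCE = re.compile(r'^\s*(```|~~~)')


def transform_outside_fences(lines, transform):
    """Apply transform to each line that is not inside a fenced block."""
    in_fence = False
    out = []
    for line in lines:
        if FENCE.match(line):
            in_fence = not in_fence  # fence lines themselves are kept as is
            out.append(line)
        elif in_fence:
            out.append(line)
        else:
            out.append(transform(line))
    return out


doc = ["[A](#Some%20Heading)", "```", "[B](#Some%20Heading)", "```"]
fixed = transform_outside_fences(doc, lambda s: s.replace("%20", "-").lower())
print(fixed)
# prints: ['[a](#some-heading)', '```', '[B](#Some%20Heading)', '```']
```

The lambda here is only a stand-in to show which lines get touched; any line-based link fixer could be passed instead.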

Overall I'm very happy with it. Someday I would like to make a video that goes into depth about sed, since it is tricky to learn just from the docs.

[email protected]

Well, I'm not going to even try understanding the various features used in that sed command. I do know how to use basic loops with labels, but I never bothered with all the buffer manipulation stuff. I'd rather use awk/perl/python for those cases.
