[Question] Sed and regex string manipulation

[email protected]

Oh god! I'm sorry about the missing )! I must have dropped it when copying things from my notes over to post the comment! (≧▽≦)

Despite my error, I'm glad it worked, and even happier that you were able to take what we had worked out and modify it further to fit your other requirements. It's fun helping each other out, and it's also great learning.

I learn by problem solving, so I've got all my notes from working on this in my knowledge base as well!

In the future, feel free to ping me if you need help with other linux/cli/bash things. As I've mentioned before I'm no expert, but happy to help where I can.

[email protected]

No apologies necessary!

[email protected]

Hello, I promise this is the last time I will bother you (I know what you are going to say :P)! If it's not too much, could you give me just a few hints on how I could improve the final script a bit?

#! /bin/bash

files="/home/USER/projects/test.md"

mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"

while IFS= read -r line; do
	#Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9]) 
	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
	sed -i "s/$line/${dashlink}/" "$files"

	#Puts everything to lowercase after a hashtag
	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
	sed -i "s/$dashlink/${lowercaselink}/" "$files"

	#Removes spaces (%20) from markdown links after a hashtag
	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
	sed -i "s/$lowercaselink/${spacelink}/" "$files"

done <<<"$mdlinks2"

This works perfectly and fulfills all my needs (thanks!!)! However, I'm not very fond of the variable string manipulation ($mdlinks2). If you have some tips without spoiling too much, that would be great; otherwise it's okay, it works exactly how I imagined it and ticks all use cases. Also, if you could give some pointers for an overall improvement, or if you see something that could potentially create some strange loop or looks off, feel free to comment in your spare time :).

Another question which has nothing to do with the post and gets a bit off topic... You gave me the push I needed, and I saw the power and usefulness of proper knowledge of sed/bash/Perl. It's time I finally learn a scripting language! I'd like to hear your opinion: what tools would you recommend? Most people would say Python for beginners, but I've heard so many good things about Perl (ExifTool is a good example of how powerful Perl can be), though the syntax scares me a little compared to Python.

Any good book material you have in mind for a beginner?

Thanks again for everything !!!

[email protected]

Hello, sorry for the late response!!! I was busy working it out with another user! However, out of curiosity I gave your sed regex a try, but there seems to be a missing ( somewhere! I tried to fix the issue but your regex is way beyond my capabilities! If you are a sed/regex fanatic and want to give it another try, feel free :). Right now I have a solution, found with another user, that works great. Here's the script in question if you are interested:

#! /bin/bash

files="/home/USER/projects/test.md"

mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"

while IFS= read -r line; do
	#Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9]) 
	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
	sed -i "s/$line/${dashlink}/" "$files"

	#Puts everything to lowercase after a hashtag
	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
	sed -i "s/$dashlink/${lowercaselink}/" "$files"

	#Replace spaces (%20) from markdown links to - after a hashtag
	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
	sed -i "s/$lowercaselink/${spacelink}/" "$files"

done <<<"$mdlinks2"

It's not very elegant but it does the job... While working on it with another very friendly user I came across other things I hadn't thought of, like:

Converting 1.2 to 1-2 (e.g. [Just a placeholder](#1.2%20Just%20a%20link%20to%20header))
Linking to another markdown file (e.g. [Just a placeholder](Another%20File.md#1.2%20Just%20a%20link%20to%20header))
The link to the file before the # needs to keep its original form (e.g. [Just a placeholder](Another%20File.md#1-2-just-a-link-to-header))

Well, I think that bare-bones sed/regex wasn't the right tool, but in a bash script it does exactly what I expect

Thanks for your help and pointers !

[email protected]

Thanks for the pointer, I wasn't aware pandoc was able to do that. It seems it can convert to GitHub-Flavored Markdown!! I have to give it a try. Still, I learned a lot from another user about regex/sed and Perl!

[email protected]

Yeah, bare-bones regex was probably a mistake. However, a friendly user gave me a step-by-step guide on how to achieve my goal:

#! /bin/bash

files="/home/USER/projects/test.md"

mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"

while IFS= read -r line; do
	#Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9]) 
	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
	sed -i "s/$line/${dashlink}/" "$files"

	#Puts everything to lowercase after a hashtag
	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
	sed -i "s/$dashlink/${lowercaselink}/" "$files"

	#Removes spaces (%20) from markdown links after a hashtag
	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
	sed -i "s/$lowercaselink/${spacelink}/" "$files"

done <<<"$mdlinks2"

If you know a better way to achieve similar results, I'm very open to any new lead and to learning something new!

[email protected]

Sorry for the late response... I was busy with another user :S My English is so bad I'm not able to respond to everyone at the same time... Whatever...

I tried your Perl regex substitution and indeed it does what I asked for in my post, so thank you very much for your help! However, I missed a few use cases where your regex breaks... But that's on me, your command works as expected!!!

[Link with numbers](Another%20Markdown%20file.md#1.3%20this%20is%20another%20test.md)

The part before the hashtag needs to keep its original form (even with %20) because it links to a markdown file directly and not to a header (hope that's comprehensible?). It took me a lot of time with another user, and we ended up with a script that does everything:

#! /bin/bash

files="/home/USER/projects/test.md"

mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"

while IFS= read -r line; do
	#Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9]) 
	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
	sed -i "s/$line/${dashlink}/" "$files"

	#Puts everything to lowercase after a hashtag
	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
	sed -i "s/$dashlink/${lowercaselink}/" "$files"

	#Removes spaces (%20) from markdown links after a hashtag
	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
	sed -i "s/$lowercaselink/${spacelink}/" "$files"

done <<<"$mdlinks2"

If you are motivated you can still improve your regex. I'm kinda curious whether it's possible with a one-liner! Thanks again for your help and sorry for the late response!!

[email protected]

Hello, sorry for the very late response!

Your regex is indeed very close as a one-liner, I'm pretty impressed! :0 However, I forgot to mention something in my post (I only thought about it after working on it with another user in the comments...). There are 2 things missing from your beautiful and complex regex:

Numbering with dots also needs a dash in between (actually I think every special character, like a space or a dot, is converted to a dash)

FROM
---------------
[Link with numbers](readme.md#1.3%20this%20is%20another%20test)

TO
---------------
[Link with numbers](readme.md#1-3-this-is-another-test)

The part before the hashtag needs to keep its original form (links to a real file)

FROM
---------------
[Link with numbers](Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test.md)

TO
---------------
[Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test.md)

Sorry for the trouble, I wasn't aware of all the GitHub-Flavored Markdown syntax :/. I got a very cool script from working with another user that works perfectly, but if you want to modify your regex and try to solve the issue in pure regex, feel free. I'm very curious what it could look like (god, regex is so obscure and at the same time it has some beauty in it!)

#! /bin/bash

files="/home/USER/projects/test.md"

mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"

while IFS= read -r line; do
	#Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9]) 
	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
	sed -i "s/$line/${dashlink}/" "$files"

	#Puts everything to lowercase after a hashtag
	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
	sed -i "s/$dashlink/${lowercaselink}/" "$files"

	#Removes spaces (%20) from markdown links after a hashtag
	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
	sed -i "s/$lowercaselink/${spacelink}/" "$files"

done <<<"$mdlinks2"

[email protected]

Thank you very much for taking your time and trying to help me with comments and all !

> you need a full featured markdown parser for this.

Do you mean something like pandoc? Someone pointed me to it and it seems it can convert to GitHub-Flavored Markdown!

Sorry for the very late response!! Here is the working bash script another user helped me put together.

> However there are some cases when this script will fail, e. g. if there is an escaped ] character in the link text. You cannot avoid such mistakes using only simple regexps, you need a full featured markdown parser for this.

Thanks for the pointer, I will give it a try to see how it works out with my actual script. If you are curious, here's the thing:

#! /bin/bash

files="/home/USER/projects/test.md"

mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"

while IFS= read -r line; do
	#Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9]) 
	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
	sed -i "s/$line/${dashlink}/" "$files"

	#Puts everything to lowercase after a hashtag
	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
	sed -i "s/$dashlink/${lowercaselink}/" "$files"

	#Removes spaces (%20) from markdown links after a hashtag
	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
	sed -i "s/$lowercaselink/${spacelink}/" "$files"

done <<<"$mdlinks2"

[email protected]

I'll give another go at it

[email protected]

Hello! I will take a look at it, I just haven't had a chance over the last day. Give me a couple days and I will give some feedback. Bear in mind I am not an expert, so I might not have much to offer, but I'll share what I can.

[email protected]

This might work, but I think it is best to not tinker further if you already have a working script (especially one that you understand and can modify further if needed).

perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'

[email protected]

I did it!! It also handles the case where an external link and internal link are on the same line

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Here is my annotated file

# Begin loop
:l;

# Bisect first link in pattern space into pattern space and append to hold space
# Example: `text [label](file#fragment)'
#   Pattern space: `file#fragment)'
#   Hold space: `text [label]('
# Steps:
#   1. Strategically insert \n
#       1a. If this fails, branch out
#   2. Append to hold space (this creates two \n's. It feels weird for the
#      first iteration, but that's ok)
#   3. Copy hold space to pattern space, remove first \n, then trim off
#      everything past the second \n
#   4. Swap pattern/hold, and trim off everything up to and incl the last \n
s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;
Te;
H;
g; s/\n//; s/\n.*//;
x; s/.*\n//;

# Modify only if it is an internal link
/^https?:/! {
    # Add hyphens
    :h;
    s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;
    th;
    # Make lowercase
    s/(#[^)]*\))/\L\1/;
};

# "conditional" branch so it checks the next conditional again
tl;

# Exit: join pattern space to hold space, then move to pattern space.
# Since the loop uses H instead of h, have to make sure hold space is empty
:e;
H;
z;
x; s/\n//;
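
To watch the loop handle an external and an internal link on the same line, here is a small sketch. It assumes GNU sed (as far as I know `T`, `z` and `\L` are GNU extensions), and both the wrapper name `fix_links` and the sample line are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the one-liner from this comment (GNU sed assumed).
fix_links() {
  sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
}

# One line holding an external link (should stay as-is) followed by an
# internal link (fragment should be hyphenated and lowercased).
printf '%s\n' \
  'See [docs](https://example.com/a#Top%20Part) and [notes](Notes.md#1.2%20My%20Header)' \
  | fix_links
```

The first pass through the loop splits off the https link and skips the modify block; the second pass picks up `[notes](...)` and rewrites only its fragment.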

[email protected]

Wow! Thank you! I did a quick test on a test-file.md

[Just a test](#just-a-test)
[Just a link](https://mylink/%20with%20space.com)
[External link](readme.md#just-a-test)
[Link with numbers](readme.md#1-3-this-is-another-test)
[Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test)

Great job! Thank you very much!!! I'm really impressed by what someone with proper knowledge can do! However, I really do not want to mess around with your regex... That would only be asking for disaster xD! I will carefully keep your regex and annotated file in my knowledge base. I'm sure some time in the future I will come back to it and try to break it down as a learning process.

Thank you very much !!!

[email protected]

Hello, sorry to ping you. I just gave pandoc a try but it doesn't work, and I had to dig a bit further into the web to find out why!

Links to headings with spaces are not specified by CommonMark, and each tool implements a different approach... Most replace spaces with hyphens, others use URL encoding (%20). So even though pandoc looks awesome, it doesn't work for my use case (or did I miss something? Feel free to comment).

You can give it a try on https://pandoc.org/try/ with commonmark to gfm:

[Just a test](#Just a test)
[Just a link](https://mylink/%20with%20space.com)
[External link](Readme.md#JUST%20a%20test)
[Link with numbers](readme.md#1.3%20this%20is%20another%20test)
[Link with numbers](Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test)

If you prefer a CLI version:

pandoc --from=commonmark_x --to=gfm+gfm_auto_identifiers "/home/user/Documents/test.md" -o "pandoc_test.md"

[email protected]

Thank you! It actually ticks every use case (for my files), looks pretty rad!

> This might work, but I think it is best to not tinker further if you already have a working script (especially one that you understand and can modify further if needed).

I totally agree, but I will keep your regex as a reference. In the near future I will try to decompose your regex as a learning process, but it looks rather complex!

Another user came up with the following solution:

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Just as a little experiment, if you want to spend some time and give me an answer: what do you think? It's another way to achieve the same kind of result, but the two are significantly different. I know there are a thousand ways to achieve the same result, but I'm kinda curious how it looks through an expert's eyes :).

Thanks again for your help and the time you took to write up a complex regex for my use case !

[email protected]

Hey, take your time. Don't worry even if you forget, you did more than enough to help some random person on the web! Two other users came up with a plain, bare-bones regex solution, if you want to have a look; maybe there's something you can learn from it? (I doubt it xD).

Plain sed regex (https://lemmy.ml/post/25346014/16453351)

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Plain Perl regex (https://lemmy.ml/post/25346014/16453161)

perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'

Nonetheless, I really prefer your solution because, as someone else said, I will have an easier time changing a script I "understand". So thanks again!

[email protected]

Hey, I just did a quick web search and found this. I haven't used the tool specifically before. However, I recommend either searching the web for a similar tool or using a ChatGPT-like tool to create a Python script that'll achieve your end result. Sed and regex are cool and useful, but they're only going to make it more difficult to achieve what you need.

[email protected]

No problem. I think this is a great "final boss" question for learning sed, because it turns out it is deceptively hard!! You have to understand not only a lot about regex, but about sed to get it right. I learned a lot about sed just by tackling this problem!

> I really do not want to mess around with your regex

It is very delicate for sure, but one part you can safely change is the # Add hyphens part. In the regex you can see (%20|\.). This is a list of "characters" which get converted to hyphens. For example, you could modify it to (%20|\.|\+) and it will convert +s to -s as well!
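
As a sketch of that change (GNU sed assumed; the wrapper name `fix_links_plus` and the sample link are invented for the example), the only difference from the original command is the extra `|\+` in the hyphen group:

```shell
#!/usr/bin/env bash
# Hypothetical variant of the one-liner: the hyphen group is extended
# from (%20|\.) to (%20|\.|\+) so that a literal + also becomes a hyphen.
fix_links_plus() {
  sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.|\+)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
}

# A fragment mixing all three separators: +, %20 and a dot.
printf '%s\n' '[C plus](#Notes+On%20Sed.Basics)' | fix_links_plus
```

Each pass of the inner loop replaces one separator (the rightmost one still left in the fragment), so the order of the alternatives inside the group should not matter.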

Still it is not perfect:

If the link spans multiple lines, the regex won't match
If the link contains escaped characters like \\\\\[LINK](#LINK) or [LINK\]\\\\](#LINK)
If the link is inside a code block ``` it will get changed (which may or may not be intended)
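
For the code block case, one possible workaround is to let sed skip fenced regions before the main script runs. This is only a sketch under assumptions: GNU sed, fences that start at the very beginning of a line with three backticks, and no nested or indented fences. The wrapper name `fix_links_skip_fences` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: the first -e uses an address range that covers
# everything from an opening fence to the next closing fence, and `b`
# with no label jumps to the end of the script, so those lines are
# printed unchanged. The second -e is the original one-liner.
fix_links_skip_fences() {
  sed -E \
    -e '/^`{3}/,/^`{3}/b' \
    -e ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
}

# Build a fence marker without typing three backticks in a row.
tick='`'
fence="$tick$tick$tick"

# The same link appears outside and inside a fenced block; only the
# copy outside the block should be rewritten.
printf '%s\n' \
  '[a](#Some%20Head)' \
  "$fence" \
  '[a](#Some%20Head)' \
  "$fence" \
  '[b](#Other.Head)' \
  | fix_links_skip_fences
```

This does nothing for inline code or indented code blocks, so it only narrows the third point rather than solving it.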

But for a sed-only solution this is about as good as it will get I'm afraid.

Overall I'm very happy with it. Someday I would like to make a video that goes into depth about sed, since it is tricky to learn just from the docs.

[email protected]

Well, I'm not going to even try understanding the various features used in that sed command. I do know how to use basic loops with labels, but I never bothered with all the buffer manipulation stuff. I'd rather use awk/perl/python for those cases.
