[Question] Sed and regex string manipulation
-
[email protected] replied to [email protected] last edited by
Here's a solution with `perl` (assuming the http/https check should apply right after the opening `(` of the link target rather than at the start of a line):
    perl -pe 's/\[[^]]+\]\(\K(?!https?)[^)]+(?=\))/lc $&=~s|%20|-|gr/ge' ip.txt
-
[email protected] replied to [email protected] last edited by
> skip the following substitute command if the line contains an http link in markdown format
Why do you assume there's only one link in the line?
-
[email protected] replied to [email protected] last edited by
I didn't test this, but it will change the whole URL, while changes are only needed in its fragment component (after the first `#`).
-
[email protected] replied to [email protected] last edited by
Obligatory "regex was a mistake" post
-
[email protected] replied to [email protected] last edited by
Hmm, OP mentioned "Only edit what's between parentheses" - I don't see anywhere that the whole URL shouldn't be changed...
-
[email protected] replied to [email protected] last edited by
Paths are constant, only anchors are generated by Forgejo.
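A minimal sketch of touching only the fragment, assuming a single link target is already in hand and that the anchor rules are just "lowercase, `%20` becomes `-`". The function name `slugify_fragment` is hypothetical and not something posted in this thread.

```shell
# Hypothetical helper (not from the thread): rewrite only the fragment of a
# single link target, leaving the file path before the first '#' untouched.
slugify_fragment() {
  target=$1
  case $target in
    *'#'*) ;;                                # has a fragment, keep going
    *) printf '%s\n' "$target"; return 0 ;;  # no fragment, print unchanged
  esac
  path=${target%%#*}       # everything before the first '#'
  fragment=${target#*#}    # everything after the first '#'
  # Assumed anchor rules: %20 becomes '-', letters become lowercase.
  fragment=$(printf '%s' "$fragment" | sed 's/%20/-/g' | tr '[:upper:]' '[:lower:]')
  printf '%s#%s\n' "$path" "$fragment"
}

slugify_fragment 'Another%20markdown%20file.md#Hello%20World'
```

The call at the end prints `Another%20markdown%20file.md#hello-world`: the path keeps its `%20` and capitals, only the anchor changes.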
-
[email protected] replied to [email protected] last edited by
> Why do you assume there's only one link in the line?
They did not want external (http) links to be modified, as that would break them:
- [Example](https://example.com/#Some%20Link)
- [Example](https://example.com/#some-link)
I compromised by assuming it would be unlikely enough to have an external http link AND an internal link within the same line. You could probably still do it; my first thought was `[^h][^t][^t][^p]`, but that would cause issues for `#ttp` and `#A`, so I just gave up. Instead I think you'd want a different approach, like breaking each link onto its own line, doing the same external/internal check before the substitution, and joining the lines afterward.
> Also, you perform substitutions in the whole URL instead of the fragment component
That requirement I missed. I just assumed the filename would be replaced the same way too, lol. Not too hard to fix though.
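A rough sketch of the "one link per line" idea, assuming GNU or BSD `grep` with `-o`. It only covers the splitting and the external/internal check; the substitution and the joining step are left out. The name `list_internal_links` is made up for illustration.

```shell
# Hypothetical helper (not from the thread): print each markdown link found
# on stdin on its own line, then drop the external http/https ones.
list_internal_links() {
  grep -o '\[[^]]*]([^)]*)' | grep -v -E ']\(https?://'
}

printf '%s\n' 'see [A](#Some%20Link) and [B](https://example.com/#Some%20Link)' |
  list_internal_links
```

With the sample line, only `[A](#Some%20Link)` is printed, even though an external link sits on the same line.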
-
[email protected] replied to [email protected] last edited by
Don't reinvent the wheel! https://github.com/jgm/pandoc
-
[email protected] replied to [email protected] last edited by
Not home so I can't try it, but do you need to be so specific as to match the whole markdown syntax?
You might be able to get away with
    s/#(\w+%20)*\w+\.\w{2,3}/\L&/g; /#(\w+%20)*\w+\.\w{2,3}/ s/%20/-/g
basically, matching `#this%20is%20LIKELY%20a%20link.md` as opposed to matching the whole markdown link, lowercasing that entire match, then, on lines matching stuff that looks like that, replacing the %20 with a hyphen (combined into a single sed command). This only fails when an http link falls within the same line as a markdown hyperlink.
-
[email protected] replied to [email protected] last edited by
Hello !!!
Sorry for the very late response, I had something else to do. I will read everything carefully and respond to every post. I also thought about it overnight, and I think that sed and regex weren't the best option here (as others have mentioned).
I think a Python script or bash (as you mentioned a bit later) would be a better way. I'm sorry that I put you through all of this... wrong tool for the job :s.
-
[email protected] replied to [email protected] last edited by
First, thanks again for sharing your knowledge with me; I really appreciate the time and effort you took to write all of this. I know that's a lot of thank-yous, but I'm really grateful: this is very valuable information that I will keep in my knowledge base. It's really time I learn proper bash/Python/Perl scripting with all those tools (grep/sed/regex).
Second, YOU MISSED A DAMNED parenthesis, you fool xD !
    mdlinks="$(grep -Po ']\((?!https).*\)' ~/mkdn)"
It took me some time to figure it out, with a very uninformative error:
    bashscript.sh: line 8: unexpected EOF while looking for matching "'
But as expected, it works!
    From
    -------
    [Just a test](#Just%20a%20test.md)
    [Just a link](https://mylink/%20with%20space.com)
    %20
    To
    -------
    [Just a test](#Just-a-test.md)
    [Just a link](https://mylink/%20with%20space.com)
    %20
Next, to show my appreciation and not take everything for granted or be spoon-fed everything, I tried to find a solution myself for something else. I will try to explain as best I can how I solved it.
    From
    -------
    [Just a test](Another%20markdown%20file.md#Hello%20World)
    To
    -------
    [Just a test](Another%20markdown%20file.md#hello-world)
The part before the hashtag needs to keep its initial form (it links to the original markdown file). So, because just playing around with Perl and regex doesn't end well when done blindly without the proper knowledge, I did some simple string manipulation. It's not very elegant but it does the trick, thanks to your well-written breakdown.
- I printed out the $mdlinks variable just to see what it contains
- Copied and changed your Perl regex to find the first hashtag (#) and saved the result into a new variable ($mdlinks2)
- Fed your $mdlinks variable into my new Perl regex
- Fed my new variable into done? (I'm a bit confused here but okay xD)
    #!/bin/bash
    # Collect every "](target)" whose target does not start with https
    mdlinks="$(grep -Po ']\((?!https).*\)' "/home/dany/newtest.md")"
    echo "$mdlinks"
    # Keep only the part from the first hashtag onward
    mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"
    echo "$mdlinks2"
    while IFS= read -r line; do
      dashlink="$(echo "$line" | sed 's|%20|-|g')"
      sed -i "s/$line/${dashlink}/" "/home/dany/newtest.md"
    done <<<"$mdlinks2"
Yes, not very elegant, but it's the best I could do currently. However, I still got a YES effect.
To answer your question:
> Quick question as I'm working on this, in the new link example, is the BDMV and other capitalized text in this link supposed to be converted to lowercase, or to remain uppercase?
As you can see in my string manipulation above, the part before the # needs to keep its original form (sorry, I wasn't aware of this before working with the original files). I solved it with some string manipulation as shown above.
I'm a bit tired from all this searching and trial and error; tomorrow I will try to wrap everything up and answer your post below! Also, I need to clean up the mess I made in my home directory xD.
Thanks again for your help ! Have a good night/day !
-
[email protected] replied to [email protected] last edited by
Oh god! I'm sorry about the missing `)`! I must have dropped it when copying things from my notes over to post the comment! (≧▽≦)
Despite my error, I'm glad it worked, and even happier that you were able to take what we had worked out and modify it further to fit your other requirements. It's fun helping each other out, and it's also great learning.
I learn by problem solving, so I've got all my notes from working on this in my knowledge base as well!
In the future, feel free to ping me if you need help with other linux/cli/bash things. As I've mentioned before I'm no expert, but happy to help where I can.
-
[email protected] replied to [email protected] last edited by
No apologies necessary!
-
[email protected] replied to [email protected] last edited by
Hello, I promise this is the last time I will bother you (I know what you are going to say :P)! If it's not too much, could you give me just a few hints on how I could improve the final script a bit?
    #!/bin/bash
    files="/home/USER/projects/test.md"
    mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
    mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"
    while IFS= read -r line; do
      # Converts 1.2 to 1-2 (a third-level heading needs an additional [0-9])
      dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      sed -i "s/$line/${dashlink}/" "$files"
      # Puts everything to lowercase after a hashtag
      lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      sed -i "s/$dashlink/${lowercaselink}/" "$files"
      # Removes spaces (%20) from markdown links after a hashtag
      spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      sed -i "s/$lowercaselink/${spacelink}/" "$files"
    done <<<"$mdlinks2"
This works perfectly and fulfills all my needs (thanks!!)! However, I'm not very fond of the variable string manipulation ($mdlinks2); if you have some tips without spoiling too much, that would be great. Otherwise it's okay: it works exactly how I imagined it and ticks all use cases. Also, if you could give some pointers for an overall improvement, or if you see something that could potentially create some strange loop or looks off, feel free to comment in your spare time :).
Another question which has nothing to do with the post and gets a bit off topic... You gave me the push I needed, and I saw the power and usefulness of proper knowledge of sed/bash/Perl. It's time I finally learn a scripting language! I want to hear your opinion: what tools would you recommend? Most people would say Python for beginners, but I've heard so many good things about Perl (ExifTool is a good example of how powerful Perl can be), though the syntax scares me a little compared to Python.
Any good book material you have in mind for a beginner?
Thanks again for everything !!!
-
[email protected] replied to [email protected] last edited by
Hello, sorry for the late response!!! I was busy working it out with another user! However, out of curiosity I gave your sed regex a try, but there seems to be a missing `(` somewhere! I tried to fix the issue, but your regex is way beyond my capabilities! If you are a sed/regex fanatic and want to give it another try, feel free :). Right now I've found a solution with another user that works great; here's the script in question if you are interested:
    #!/bin/bash
    files="/home/USER/projects/test.md"
    mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
    mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"
    while IFS= read -r line; do
      # Converts 1.2 to 1-2 (a third-level heading needs an additional [0-9])
      dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      sed -i "s/$line/${dashlink}/" "$files"
      # Puts everything to lowercase after a hashtag
      lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      sed -i "s/$dashlink/${lowercaselink}/" "$files"
      # Replaces spaces (%20) with - in markdown links after a hashtag
      spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      sed -i "s/$lowercaselink/${spacelink}/" "$files"
    done <<<"$mdlinks2"
It's not very elegant but it does the job... While working on it with another very friendly user I came across other things I hadn't thought of, like:
- Converting 1.2 to 1-2 (e.g. `[Just a placeholder](#1.2%20Just%20a%20link%20to%20header)`)
- Linking to another markdown file (e.g. `[Just a placeholder](Another%20File.md#1.2%20Just%20a%20link%20to%20header)`)
- The link to the file before the # needs to keep its original form (e.g. `[Just a placeholder](Another%20File.md#1-2-just-a-link-to-header)`)
Well, I think that bare-bones sed/regex wasn't the right tool, but in a bash script it does exactly what I'm expecting.
Thanks for your help and pointers!
-
[email protected] replied to [email protected] last edited by
Thanks for the pointer, I wasn't aware pandoc was able to do that. It seems it can convert to GitHub-Flavored Markdown!! I have to give it a try. Still, I learned a lot from another user about regex/sed and Perl!
-
[email protected] replied to [email protected] last edited by
Yeah, bare-bones regex was probably a mistake; however, a friendly user gave me a step-by-step guide on how to achieve my goal:
    #!/bin/bash
    files="/home/USER/projects/test.md"
    mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
    mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"
    while IFS= read -r line; do
      # Converts 1.2 to 1-2 (a third-level heading needs an additional [0-9])
      dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      sed -i "s/$line/${dashlink}/" "$files"
      # Puts everything to lowercase after a hashtag
      lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      sed -i "s/$dashlink/${lowercaselink}/" "$files"
      # Removes spaces (%20) from markdown links after a hashtag
      spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      sed -i "s/$lowercaselink/${spacelink}/" "$files"
    done <<<"$mdlinks2"
If you know a better way to achieve similar results, I'm very open to every new lead and to learning something new!
-
[email protected] replied to [email protected] last edited by
Sorry for the late response... I was busy with another user :S My English is so bad that I'm not able to respond to everyone at the same time... Whatever...
I tried your Perl regex substitution, and it does indeed do what I asked for in my post, so thank you very much for your help! However, I missed a few use cases where your regex breaks... But that's on me; your command works as expected!!!
[Link with numbers](Another%20Markdown%20file.md#1.3%20this%20is%20another%20test.md)
The part before the hashtag needs to keep its original form (even with %20) because it links to a markdown file directly and not a header (hope that's comprehensible?). It took me a lot of time with another user, and we came up with a wrapped-up script that does everything:
    #!/bin/bash
    files="/home/USER/projects/test.md"
    mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
    mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"
    while IFS= read -r line; do
      # Converts 1.2 to 1-2 (a third-level heading needs an additional [0-9])
      dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      sed -i "s/$line/${dashlink}/" "$files"
      # Puts everything to lowercase after a hashtag
      lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      sed -i "s/$dashlink/${lowercaselink}/" "$files"
      # Removes spaces (%20) from markdown links after a hashtag
      spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      sed -i "s/$lowercaselink/${spacelink}/" "$files"
    done <<<"$mdlinks2"
If you are motivated you can still improve your regex if you want; I'm kind of curious whether it's possible with a one-liner! Thanks again for your help and sorry for the late response!!
-
[email protected] replied to [email protected] last edited by
Hello, sorry for the very late response!
Your regex is indeed very close as a one-liner, I'm pretty impressed! :0 However, I forgot to mention something in my post (I only thought about it after working on it with another user in the comments...). There are 2 things missing from your beautiful and complex regex:
- Numbering with dots also needs to have a dash in between (actually, I think every special character, like a space or a dot, is converted to a dash)
    FROM
    ---------------
    [Link with numbers](readme.md#1.3%20this%20is%20another%20test)
    TO
    ---------------
    [Link with numbers](readme.md#1-3-this-is-another-test)
- The part before the hashtag needs to keep its original form (it links to a real file)
    FROM
    ---------------
    [Link with numbers](Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test.md)
    TO
    ---------------
    [Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test.md)
Sorry for the trouble, I wasn't aware of all the GitHub-Flavored Markdown syntax :/. I got a very cool script working with another user that works perfectly, but if you want to modify your regex and try to solve the issue in pure regex, feel free; I'm very curious what it could look like (god, regex is so obscure and at the same time it has some beauty in it!)
    #!/bin/bash
    files="/home/USER/projects/test.md"
    mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
    mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"
    while IFS= read -r line; do
      # Converts 1.2 to 1-2 (a third-level heading needs an additional [0-9])
      dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      sed -i "s/$line/${dashlink}/" "$files"
      # Puts everything to lowercase after a hashtag
      lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      sed -i "s/$dashlink/${lowercaselink}/" "$files"
      # Removes spaces (%20) from markdown links after a hashtag
      spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      sed -i "s/$lowercaselink/${spacelink}/" "$files"
    done <<<"$mdlinks2"
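For comparison, the two requirements listed in this post (dots between digits become dashes, the part before the hashtag stays untouched) might also be handled in a single pass over the file. This is only a sketch using plain `awk` called from the shell rather than a pure regex; the function name `rewrite_fragments` is hypothetical, and it assumes link targets contain no `)` and link text contains no `](`.

```shell
# Hypothetical filter (not from the thread): reads markdown on stdin and
# rewrites only the fragment of non-http links, any number per line.
rewrite_fragments() {
  awk '
  {
    out = ""; rest = $0
    # Walk through every "](target)" on the line, left to right.
    while (match(rest, /\]\([^)]*\)/)) {
      target = substr(rest, RSTART + 2, RLENGTH - 3)
      out = out substr(rest, 1, RSTART + 1)       # text up to and including "]("
      rest = substr(rest, RSTART + RLENGTH - 1)   # closing ")" and what follows
      hash = index(target, "#")
      if (target !~ /^https?:/ && hash > 0) {
        frag = tolower(substr(target, hash + 1))
        gsub(/%20/, "-", frag)
        # Dots between digits become dashes (1.3 -> 1-3); ".md" is kept.
        while (match(frag, /[0-9]\.[0-9]/))
          frag = substr(frag, 1, RSTART) "-" substr(frag, RSTART + 2)
        target = substr(target, 1, hash) frag
      }
      out = out target
    }
    print out rest
  }'
}
```

It could be used as `rewrite_fragments < test.md > test.new.md` (file names are placeholders). Because it never calls `sed -i` with text taken from the file, characters such as `.` or `/` inside a link are not interpreted as part of a pattern.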
-
[email protected] replied to [email protected] last edited by
Thank you very much for taking your time and trying to help me with comments and all!
> you need a full featured markdown parser for this.
Do you mean something like pandoc? Someone pointed me to it, and it seems it can convert to GitHub-Flavored Markdown!
Sorry for the very late response!! Here is the working bash script another user helped me put together.
> However there are some cases when this script will fail, e. g. if there is an escaped ] character in the link text. You cannot avoid such mistakes using only simple regexps, you need a full featured markdown parser for this.
Thanks for the pointer, I will give it a try to see how it works out with my actual script. If you are curious, here's the thing:
    #!/bin/bash
    files="/home/USER/projects/test.md"
    mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
    mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"
    while IFS= read -r line; do
      # Converts 1.2 to 1-2 (a third-level heading needs an additional [0-9])
      dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      sed -i "s/$line/${dashlink}/" "$files"
      # Puts everything to lowercase after a hashtag
      lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      sed -i "s/$dashlink/${lowercaselink}/" "$files"
      # Removes spaces (%20) from markdown links after a hashtag
      spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      sed -i "s/$lowercaselink/${spacelink}/" "$files"
    done <<<"$mdlinks2"