[Question] Sed and regex string manipulation
-
[email protected]replied to [email protected] last edited by
Okay. To address the `%20` and the `https` links, and the placeholder links, I came up with a bash script to handle this.

Because of the variation in the links, I went this route instead of trying to write a single `sed` command that matches only `%20` in anchor markdown links and placeholder links, while ignoring https links and all other text in the document.

To do that, I used `grep`, a `while` loop, `IFS`, and `sed`.

Here's the script:

    #! /bin/bash
    mdlinks="$(grep -Po ']\((?!https).*\)' ~/mkdn"
    while IFS= read -r line; do
      dashlink="$(echo "$line" | sed 's/%20/-/g')"
      sed -i "s/$line/${dashlink}/" /path/to/file
    done <<<"$mdlinks"

I'm not sure how familiar you are with bash scripting, so I'll do the same breakdown:

`#! /bin/bash`
- This tells the shell what interpreter to use for the script. In this case it's bash.

`mdlinks="$(grep -Po ']\((?!https).*\)' /path/to/file"`
- This line uses `grep` to search for markdown link enclosures, excluding https links, outputs only the text that matches, and saves all of that into a variable called `mdlinks`. Each link match will be a new line inside the variable.

The breakdown of the `grep` command is as follows:
- `grep` - invokes the grep command
- `-Po` - two command flags. The `P` tells `grep` to use perl regular expressions. The `o` tells grep to only print the output that matches, rather than the entire line.
- `'` - opens the regex statement
- `]\(` - finds a closing bracket followed by an opening parenthesis
- `(?!https)` - This is a negative look ahead, which is a feature available in perl regex. This tells grep not to match if it finds the `https` string. The parentheses enclose the negative look ahead. The `?!` is what indicates it's a negative look ahead, and the `https` is the string to look for and ignore.
- `'` - closes the regex statement
- `/path/to/file` - the file to search for matches

`while IFS= read -r line; do`
- this starts a `while` loop. `IFS=` sets the Internal Field Separator to an empty value for the `read` command, so that leading and trailing whitespace on each line is kept rather than trimmed. `read` itself takes the input one line at a time, which lets the loop take in the variable containing all of the matched links and work on them one at a time. The `read` command does what it says and reads the input given, in this case our variable `mdlinks`. The `-r` flag tells `read` not to treat the backslash as an escape character and to keep it as a normal part of the input. `line` is the variable that each line will be saved in as they are worked through the loop. The `;` ends the `while` setup, and `do` opens the loop for the commands we want to run using the input saved in `line`.

`dashlink="$(echo "$line" | sed 's/%20/-/g')"`
- This command sequence runs the markdown link saved in the `line` variable into sed to find all instances of `%20` and replace them with a `-`.
  - `dashlink` - the variable we're saving the new link with dashes to.
  - `=` - separates the variable from the input being saved into the variable.
  - `"` - opens this command string for variable expansion.
  - `$` - tells `bash` to do command substitution, meaning that the output of the following commands will be saved to the variable, rather than the actual text of the commands that follow.
  - `(` - opens the command set
  - `echo` - prints the given text or arguments to standard output, in this case the given argument is the variable `$line`
  - `"` - tells `bash` to expand any variables contained within the quote set while ignoring any nonstandard characters like spaces or special shell characters that are saved in the variable.
  - `$line` - the variable containing our active markdown link from the text document
  - `"` - the closing quote ending the argument and the expansion enclosure
  - `|` - This is a pipe, which redirects the standard output of the command on the left into the command on the right. Meaning we're taking the markdown link currently saved in the variable and feeding it into `sed`
  - `sed` - invokes `sed` so we can manipulate our text, and because `sed` is receiving redirected input, and we've specified no flags, the modified text will be printed to standard output.
  - `'s/%20/-/g'` - Our pattern match/substitution, which will find all occurrences of the string `%20` in the markdown link fed into sed and replace them with `-`.
  - `)"` - closes our command sequence for command substitution, and the variable expansion. At this point the text printed to standard output by `sed` is saved to the variable `dashlink`
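
To see the command substitution described above in isolation, it can be run on a single sample link. This is only a minimal sketch: the sample value of `line` is made up for illustration, and the second form (bash pattern substitution, `${line//%20/-}`) is an alternative that is not part of the script above.

```shell
#!/bin/bash
# Sample link text, similar to what grep -Po would capture (made up for illustration).
line='](#Just%20a%20test.md)'

# Same pipeline as in the script: echo the variable into sed and capture the output.
dashlink="$(echo "$line" | sed 's/%20/-/g')"
echo "$dashlink"

# Alternative without sed: bash replaces every %20 inside the variable itself.
dashlink2="${line//%20/-}"
echo "$dashlink2"
```

Both `echo` lines should print the same link with dashes in place of `%20`.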
The next line is `sed -i "s/$line/${dashlink}/" /path/to/file`, which uses `sed` to take the `line` and `dashlink` variables and use them to find the exact original markdown link in the text containing the `%20` sequences, and replace it with the properly formatted markdown link with dashes.
- `sed -i` - invokes `sed` and uses the `-i` flag to edit the file in place.
- `"` - The double quote enclosure allows the expansion of variables in the pattern match/replacement sequence so it searches for the markdown link, and not the literal text string `$line`.
- `s/` - opens our match/modify sequence.
- `$line` - the original markdown link that will be found
- `/` - ends the pattern matching section
- `${dashlink}` - The variable containing the previously modified markdown link that now has dashes. This expands to that properly formatted link which will be written into the text file replacing the malformed link. I don't know why this link has to be enclosed in curly braces while the first one does not.
- `/"` - ends the text modification section and closes the variable expansion.
- `/path/to/file` - the file to be worked on

Finally we have
`done <<<"$mdlinks"`, which ends the while loop and feeds the `mdlinks` variable into it.
- `done` - closes the `while` loop
- `<<<` - This feeds the given argument into the `while` loop for processing
- `"` - expands the variable within while ignoring nonstandard characters
- `$mdlinks` - the variable we're feeding in with all of our links containing `%20`, except for https links.
- `"` - closes the variable expansion.

If you've never written/created your own bash script, here's what you need to do.
- In your home directory, or in the directory you're working in with these files, use a text editor like vim or nano or gedit or kate or whatever plain text editor you want to create a new file. Call the file whatever you want.
- Paste the entirety of the script text into the file. Modify the file paths as needed to work on the file you want. If working with multiple files, you'll need to update the script for each new file path as you finish one and move on to the next.
- Save and exit the file.
- Make the file executable at the terminal with `sudo chmod +x /path/to/script/file`
- To run it:
  - Change directory to the directory that contains the script file (if you're not already there)
  - At the command line use the command `. ./name-of-script-file`
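
One thing to watch for with the `sed -i "s/$line/${dashlink}/"` step: `$line` is inserted into the sed command as a regular expression, so characters in a link such as `.`, `*`, `[` or `/` are interpreted by `sed` rather than matched literally (a `/` in a link would also end the pattern early). The sketch below, which assumes GNU `sed`, escapes both sides before substituting. The helper names `escape_sed_pattern`, `escape_sed_replacement` and `replace_literal` are hypothetical and not part of the script above. As a side note, `${dashlink}` and `$dashlink` expand identically in that line; the braces only matter when the name is directly followed by characters that could be read as part of it.

```shell
#!/bin/bash
# Hypothetical helpers, assuming GNU sed (for -i without a backup suffix).

# Escape characters that are special in a basic-regex sed pattern
# (plus the "/" delimiter) so the text is matched literally.
escape_sed_pattern() {
  printf '%s\n' "$1" | sed -e 's|[\\/$*.^[]|\\&|g'
}

# Escape characters that are special on the replacement side: \ / and &.
escape_sed_replacement() {
  printf '%s\n' "$1" | sed -e 's|[\\/&]|\\&|g'
}

# replace_literal FILE OLD NEW
# Replace the first literal occurrence of OLD on each line of FILE with NEW.
replace_literal() {
  local file="$1" old new
  old="$(escape_sed_pattern "$2")"
  new="$(escape_sed_replacement "$3")"
  sed -i "s/$old/$new/" "$file"
}
```

Inside the loop, the second `sed` call could then become something like `replace_literal /path/to/file "$line" "$dashlink"`.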
-
[email protected]replied to [email protected] last edited by
Here's a solution with `perl` (assuming you don't want to change http/https after the start of `(` instead of start of a line):

    perl -pe 's/\[[^]]+\]\(\K(?!https?)[^)]+(?=\))/lc $&=~s|%20|-|gr/ge' ip.txt
-
[email protected]replied to [email protected] last edited by
> skip the following substitute command if the line contains an http link in markdown format

Why do you assume there's only one link in the line?
-
[email protected]replied to [email protected] last edited by
I didn't test this, but it will change the whole URL while changes are only needed in its fragment component (after the first `#`).
-
[email protected]replied to [email protected] last edited by
Obligatory regex was a mistake post
-
[email protected]replied to [email protected] last edited by
Hmm, OP mentioned "Only edit what's between parentheses" - don't see anywhere that whole URL shouldn't be changed...
-
[email protected]replied to [email protected] last edited by
Paths are constant, only anchors are generated by forgejo.
-
[email protected]replied to [email protected] last edited by
> Why do you assume there's only one link in the line?

They did not want external (http) links to be modified, as that would break them:

[Example](https://example.com/#Some%20Link)
[Example](https://example.com/#some-link)

I compromised by thinking that it might be unlikely enough to have an external http link AND internal link within the same line. You could probably still do it; my first thought was `[^h][^t][^t][^p]`, but that would cause issues for `#ttp` and `#A`, so I just gave up. Instead I think you'd want a different approach, like breaking each link onto its own line, doing the same external/internal check before the substitution, and joining the lines afterward.

> Also, you perform substitutions in the whole URL instead of the fragment component

That requirement I missed. I just assumed the filename would be replaced the same way too Lol. Not too hard to fix tho
-
[email protected]replied to [email protected] last edited by
Don't reinvent the wheel! https://github.com/jgm/pandoc
-
[email protected]replied to [email protected] last edited by
Not home so I can't try it, but do you need to be so specific to match the whole markdown syntax?
You might be able to get away with

    s/#(\w+%20)*\w+\.\w{2,3}/\L&/g; /#(\w+%20)*\w+\.\w{2,3}/ s/%20/-/g

Basically: matching `#this%20is%20LIKELY%20a%20link.md` (as opposed to matching the whole markdown link), lowercasing that entire match, then, on lines matching stuff that looks like that, replacing the `%20` with a hyphen (combined into a single sed command). This only fails when an http link falls within the same line as a markdown hyperlink.
-
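
The sed expression in the previous post uses `( )`, `+` and `{2,3}` unescaped, so it presumably needs extended regular expressions (`-E`, or `-r` on older GNU sed); `\w` and `\L` are GNU extensions as well. A sketch of the full invocation under those assumptions, with a made-up sample line and a hypothetical function name `fix_anchor_links`:

```shell
#!/bin/bash
# Hypothetical wrapper; assumes GNU sed. Reads stdin, writes stdout.
fix_anchor_links() {
  # 1st command: lowercase anything that looks like "#words%20words.ext".
  # 2nd command: on lines containing such a match, turn every %20 into "-".
  sed -E 's/#(\w+%20)*\w+\.\w{2,3}/\L&/g; /#(\w+%20)*\w+\.\w{2,3}/ s/%20/-/g'
}

printf '%s\n' 'See [Doc](#this%20is%20LIKELY%20a%20link.md)' | fix_anchor_links
```

The text outside the matched anchor keeps its capitalization; only the anchor itself is lowercased.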
[email protected]replied to [email protected] last edited by
Hello !!!
Sorry for the very late response, I had something else to do. I will read everything carefully and respond to every post.
I also thought about it overnight and I think that sed and regex weren't the best option here (as others have mentioned).
I think a python script or bash (as you mentioned a bit later) would be a better way. I'm sorry that I put you through all of this... wrong tool for the job :s.
-
[email protected]replied to [email protected] last edited by
First, thanks again for sharing your knowledge with me, I really appreciate the time/effort you took to write all of this. I know those are a lot of thank yous, but I'm really grateful for all of this, this is very valuable information I will keep in my knowledge base. It's really time I learn proper bash/python/Perl scripting with all those tools (grep/sed/regex).
Second, YOU MISSED A DAMNED parenthesis you fool xD !

    mdlinks="$(grep -Po ']\((?!https).*\)' ~/mkdn)"

Took me some time to figure it out with a very uninformative error:

    bashscript.sh: line 8: unexpected EOF while looking for matching "'

but as expected it works !

    From
    -------
    [Just a test](#Just%20a%20test.md)
    [Just a link](https://mylink/%20with%20space.com)
    %20

    To
    -------
    [Just a test](#Just-a-test.md)
    [Just a link](https://mylink/%20with%20space.com)
    %20

Next, to show you my appreciation and not take everything for granted and be spoon-fed, I tried to find a solution myself for something else. I will try to explain as best I can how I solved it.

    From
    -------
    [Just a test](Another%20markdown%20file.md#Hello%20World)

    To
    -------
    [Just a test](Another%20markdown%20file.md#hello-world)

The part before the hashtag needs to keep its initial form (it links to the original markdown file). So, because just playing around with Perl and regex doesn't end well when done blindly without the proper knowledge, I did some simple string manipulation. It's not very elegant but does the trick, thanks to your well-written breakdown.
- I printed out the $mdlinks variable just to see what it prints out
- Copied and changed your Perl/regex to find the first hashtag (#) and save it into a new variable ($mdlinks2)
- Fed your $mdlinks variable into my new Perl/regex
- Fed my new variable into done? (I'm a bit confused here but okay xD)

    #! /bin/bash
    mdlinks="$(grep -Po ']\((?!https).*\)' "/home/dany/newtest.md")"
    echo $mdlinks
    mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"
    echo $mdlinks2
    while IFS= read -r line; do
      dashlink="$(echo "$line" | sed 's|%20|-|g')"
      sed -i "s/$line/${dashlink}/" "/home/dany/newtest.md"
    done <<<"$mdlinks2"

Yes, not very elegant, but it's the best I could do currently.
However, I still got a YES effect.
To answer your question:

> Quick question as I'm working on this, in the new link example, is the BDMV and other capitalized text in this link supposed to be converted to lowercase, or to remain uppercase?

As you can see in my string manipulation above, the part before the # needs to keep its original form. (Sorry, I wasn't aware of this before working with the original files.) I solved it with some string manipulation as shown above.
I'm a bit tired from all this searching/trial & error; tomorrow I will try to wrap everything up and answer your post below! Also, I need to clean up the mess I made in my home directory xD.
Thanks again for your help ! Have a good night/day !
-
[email protected]replied to [email protected] last edited by
Oh god! I'm sorry about the missing `)`! I must have dropped it when copying things from my notes over to post the comment! (≧▽≦)

Despite my error, I'm glad it worked, and even happier that you were able to take what we had worked out and modify it further to fit your other requirements. It's fun helping each other out, and it's also great learning.
I learn by problem solving, so I've got all my notes from working on this in my knowledge base as well!
In the future, feel free to ping me if you need help with other linux/cli/bash things. As I've mentioned before I'm no expert, but happy to help where I can.
-
[email protected]replied to [email protected] last edited by
No apologies necessary!
-
[email protected]replied to [email protected] last edited by
Hello
I promise this is the last time I will bother you (I know what you are going to say :P) ! If it's not too much, could you give me just a few hints on how I could improve the final script a bit?

    #! /bin/bash
    files="/home/USER/projects/test.md"
    mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
    mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"
    while IFS= read -r line; do
      #Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9])
      dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      sed -i "s/$line/${dashlink}/" "$files"
      #Puts everything to lowercase after a hashtag
      lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      sed -i "s/$dashlink/${lowercaselink}/" "$files"
      #Removes spaces (%20) from markdown links after a hashtag
      spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      sed -i "s/$lowercaselink/${spacelink}/" "$files"
    done <<<"$mdlinks2"

This works perfectly and fulfills all my needs (thanks !!) ! However, I'm not very fond of the variable string manipulation ($mdlinks2); if you have some tips without spoiling too much, that would be great. Otherwise it's okay, it works exactly how I imagined it and ticks all use cases. Also, if you could give some pointers for an overall improvement, or if you see something that could potentially create some strange loop or looks off, feel free to comment in your spare time :).
Another question which has nothing to do with the post and gets a bit off topic... You gave me the right push I needed and I saw the power and usefulness of proper knowledge of sed/bash/Perl. It's time I finally learn a scripting language ! I want to hear your opinion on what tools you would recommend. Most people would say Python for beginners, but I've heard so many good things about Perl (Exiftool is a good example of how powerful Perl can be), though the syntax scares me a little bit compared to Python.
Any good book material you have in mind for a beginner?
Thanks again for everything !!!
-
[email protected]replied to [email protected] last edited by
Hello
Sorry for the late response !!! I was busy working it out with another user ! However, out of curiosity I gave your sed regex a try, but there seems to be a missing `(` somewhere ! I tried to fix the issue but your regex is way over my capabilities ! If you are a sed/regex fanatic and want to give it another try, feel free :). Right now I found a solution with another user that works great. Here's the script in question if you are interested:

    #! /bin/bash
    files="/home/USER/projects/test.md"
    mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
    mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"
    while IFS= read -r line; do
      #Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9])
      dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      sed -i "s/$line/${dashlink}/" "$files"
      #Puts everything to lowercase after a hashtag
      lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      sed -i "s/$dashlink/${lowercaselink}/" "$files"
      #Replace spaces (%20) from markdown links to - after a hashtag
      spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      sed -i "s/$lowercaselink/${spacelink}/" "$files"
    done <<<"$mdlinks2"

It's not very elegant but it does the job... While working on it with another very friendly user I came across other things I hadn't thought of, like:
- Converting 1.2 to 1-2 (e.g. `[Just a placeholder](#1.2%20Just%20a%20link%20to%20header)`)
- Linking to another markdown file (e.g. `[Just a placeholder](Another%20File.md#1.2%20Just%20a%20link%20to%20header)`)
- The link to the file before the # needs to keep its original form (e.g. `[Just a placeholder](Another%20File.md#1-2-just-a-link-to-header)`)

Well, I think that bare-bones sed/regex wasn't the right tool, but in a bash script it does exactly what I'm expecting.
Thanks for your help and pointers !
-
[email protected]replied to [email protected] last edited by
Thanks for the pointer, I wasn't aware pandoc was able to do that.
It seems it can convert to GitHub-Flavored Markdown !! I have to give it a try.
Still, I learned a lot from another user about regex/sed and Perl!
-
[email protected]replied to [email protected] last edited by
Yeah, bare-bones regex was probably a mistake; however, a friendly user gave me a step-by-step guide on how to achieve my goal:

    #! /bin/bash
    files="/home/USER/projects/test.md"
    mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
    mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"
    while IFS= read -r line; do
      #Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9])
      dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      sed -i "s/$line/${dashlink}/" "$files"
      #Puts everything to lowercase after a hashtag
      lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      sed -i "s/$dashlink/${lowercaselink}/" "$files"
      #Removes spaces (%20) from markdown links after a hashtag
      spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      sed -i "s/$lowercaselink/${spacelink}/" "$files"
    done <<<"$mdlinks2"

If you know a better way to achieve similar results, I'm very open to any new lead and to learning something new !
-
[email protected]replied to [email protected] last edited by
Sorry for the late response... I was busy with another user :S My English is so bad I'm not able to respond to everyone at the same time... Whatever...
I tried your Perl regex substitution and it effectively does what I asked for in my post, so thank you very much for your help ! However, I missed a few use cases where your regex breaks... But that's on me, your command works as expected !!!

[Link with numbers](Another%20Markdown%20file.md#1.3%20this%20is%20another%20test.md)

The part before the hashtag needs to keep its original form (even with %20) because it links to a markdown file directly and not a header (hope that's comprehensible?). It took me a lot of time with another user and we came to a wrapped-up script that does everything:

    #! /bin/bash
    files="/home/USER/projects/test.md"
    mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
    mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"
    while IFS= read -r line; do
      #Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9])
      dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      sed -i "s/$line/${dashlink}/" "$files"
      #Puts everything to lowercase after a hashtag
      lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      sed -i "s/$dashlink/${lowercaselink}/" "$files"
      #Removes spaces (%20) from markdown links after a hashtag
      spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      sed -i "s/$lowercaselink/${spacelink}/" "$files"
    done <<<"$mdlinks2"

If you are motivated you can still improve your regex if you want. I'm kinda curious if it's possible with a one-liner ! Thanks again for your help and sorry for the late response !!
-
[email protected]replied to [email protected] last edited by
Hello
Sorry for the very late response !
Your regex is effectively very close as a one-liner, I'm pretty impressed ! :0 However, I forgot to mention something in my post (I only thought about it after working on it with another user in the comments...). There are 2 things missing from your beautiful and complex regex:
- Numbering with dots also needs to have a dash in between (actually I think every special character like a space or a dot is converted to a dash)

    FROM
    ---------------
    [Link with numbers](readme.md#1.3%20this%20is%20another%20test)

    TO
    ---------------
    [Link with numbers](readme.md#1-3-this-is-another-test)

- The part before the hashtag needs to keep its original form (links to a real file)

    FROM
    ---------------
    [Link with numbers](Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test.md)

    TO
    ---------------
    [Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test.md)

Sorry for the trouble, I wasn't aware of all the GitHub-Flavored Markdown syntax :/. I got a very cool working script that works perfectly with another user, but if you want to modify your regex and try to solve the issue in pure regex, feel free. I'm very curious how it could look (god, regex is so obscure and at the same time it has some beauty in it !)

    #! /bin/bash
    files="/home/USER/projects/test.md"
    mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
    mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"
    while IFS= read -r line; do
      #Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9])
      dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      sed -i "s/$line/${dashlink}/" "$files"
      #Puts everything to lowercase after a hashtag
      lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      sed -i "s/$dashlink/${lowercaselink}/" "$files"
      #Removes spaces (%20) from markdown links after a hashtag
      spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      sed -i "s/$lowercaselink/${spacelink}/" "$files"
    done <<<"$mdlinks2"
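
For the two extra requirements in this post (dots in numbering become dashes, and the part before `#` stays untouched), one possible building block is a small function that rewrites a single link target. This is only a sketch under a few assumptions: the function name `fix_link_target` is hypothetical, GNU `sed` is assumed for `\L`, and only the rules mentioned in this thread are handled (lowercase, `%20` to `-`, and `.` between digits to `-`).

```shell
#!/bin/bash
# Hypothetical helper: rewrite only the fragment (after the first "#") of one link target.
fix_link_target() {
  local target="$1" path fragment
  case "$target" in
    http://*|https://*) printf '%s\n' "$target"; return ;;  # leave external links alone
    *'#'*) ;;                                               # has a fragment: continue below
    *) printf '%s\n' "$target"; return ;;                   # no fragment: nothing to do
  esac
  path="${target%%#*}"      # everything before the first "#"
  fragment="${target#*#}"   # everything after the first "#"
  # %20 -> "-", then digit.digit -> digit-digit (looped so 1.2.3 is fully handled),
  # then lowercase the whole fragment.
  fragment="$(printf '%s\n' "$fragment" | sed -E '
    s/%20/-/g
    :dots
    s/([0-9])\.([0-9])/\1-\2/
    t dots
    s/.*/\L&/')"
  printf '%s#%s\n' "$path" "$fragment"
}

fix_link_target 'Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test.md'
```

The file part before `#` is passed through unchanged, so a `%20` in a real filename survives. Writing the result back into the markdown file would still need a per-link replacement step like the one in the script.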