[Question] Sed and regex string manipulation
-
[email protected]replied to [email protected] last edited by
Annotated, it works like this:
    # Use a loop to iteratively replace the %20 with -, since doing s/%20/-/g would replace too much.
    # We loop until it can't substitute any more.

    # label for looping
    :loop;

    # skip the following substitute command if the line contains an http link in markdown format
    /\[[^]]*\](http/!
    # capture each part of the link, and join it together with -
    s/\(\[[^]]*\]\)\(([^)]*\)%20\([^)]*)\)/\1\2-\3/g;

    # if the substitution made a change, loop again, otherwise break
    t loop;

    # convert the inside of the link to lowercase if the line doesn't contain an http link
    /\[[^]]*\](http/!
    # this is outside the loop rather than in the s command above, because if the link
    # doesn't contain %20 at all then it won't convert to lowercase
    s/\(\[[^]]*\]\)\(([^)]*)\)/\1\L\2/g
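A minimal runnable sketch of the annotated program, assuming GNU sed (the \L case conversion is a GNU extension). The function name fix_links and the sample lines are made up for illustration. Each part of the program is passed as its own -e expression so the comments can live outside the sed script:

```shell
#!/bin/bash
# fix_links is a hypothetical wrapper name; GNU sed is assumed because of \L.
# Expression 1: label for looping.
# Expression 2: on lines without an http link, replace one %20 per link with -.
# Expression 3: loop again if the substitution changed something.
# Expression 4: on lines without an http link, lowercase the link target.
fix_links() {
  sed -e ':loop' \
      -e '/\[[^]]*\](http/! s/\(\[[^]]*\]\)\(([^)]*\)%20\([^)]*)\)/\1\2-\3/g' \
      -e 't loop' \
      -e '/\[[^]]*\](http/! s/\(\[[^]]*\]\)\(([^)]*)\)/\1\L\2/g'
}

# Sample input (made up): one internal link and one external link.
printf '%s\n' \
  '[Just a test](#Just%20a%20Test.md)' \
  '[Example](https://example.com/#Some%20Link)' | fix_links
```

The internal link should come out with dashes and in lowercase, while the line with the external link is passed through unchanged. Note that this lowercases the whole link target, not only the part after the #.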
-
[email protected]replied to [email protected] last edited by
Okay. To address the %20 and the https links, and the placeholder links, I came up with a bash script to handle this.

Because of the variation in the links, I went with a script instead of trying to write a single sed command that matches only %20 in anchor markdown links and placeholder links, while ignoring https links and all other text in the document.

To do that, I used grep, a while loop, IFS, and sed.

Here's the script:
    #! /bin/bash
    mdlinks="$(grep -Po ']\((?!https).*\)' ~/mkdn"
    while IFS= read -r line; do
      dashlink="$(echo "$line" | sed 's/%20/-/g')"
      sed -i "s/$line/${dashlink}/" /path/to/file
    done <<<"$mdlinks"
I'm not sure how familiar you are with bash scripting, so I'll do the same breakdown:

#! /bin/bash - This tells the shell what interpreter to use for the script. In this case it's bash.

mdlinks="$(grep -Po ']\((?!https).*\)' /path/to/file" - This line uses grep to search for markdown link enclosures, excluding https links, outputs only the text that matches, and saves all of that into a variable called mdlinks. Each link match will be a new line inside the variable.

The breakdown of the grep command is as follows:
- grep - invokes the grep command
- -Po - two command flags. The P tells grep to use Perl regular expressions. The o tells grep to print only the text that matches, rather than the entire line.
- ' - opens the regex statement
- ]\( - finds a closing bracket followed by an opening parenthesis
- (?!https) - This is a negative lookahead, which is a feature available in Perl regex. It tells grep not to match if it finds the https string at this position. The parentheses enclose the negative lookahead. The ?! is what indicates it's a negative lookahead, and https is the string to look for and ignore.
- .*\) - matches the rest of the link, up to the last closing parenthesis on the line
- ' - closes the regex statement
- /path/to/file - the file to search for matches

while IFS= read -r line; do - This starts a while loop that reads its input one line at a time. IFS is the Internal Field Separator; setting it to an empty value (IFS=) for the read command keeps read from trimming leading and trailing whitespace from each line. The read command does what it says and reads the input given, in this case our variable mdlinks, one line per iteration, so the matched links are worked on one at a time. The -r flag tells read not to treat the backslash as an escape character and to keep it as a normal part of the input. line is the variable that each line is saved in as the loop works through them. The ; ends the while setup, and do opens the loop for the commands we want to run using the input saved in line.

dashlink="$(echo "$line" | sed 's/%20/-/g')" - This command sequence runs the markdown link saved in the line variable through sed to find all instances of %20 and replace them with a -.
- dashlink - the variable we're saving the new link with dashes to.
- = - separates the variable name from the value being saved into it.
- " - opens the quoted string that will hold the result of the command substitution.
- $ - together with the ( that follows, tells bash to do command substitution, meaning that the output of the following commands will be saved to the variable, rather than the actual text of the commands.
- ( - opens the command set
- echo - prints the given text or arguments to standard output; in this case the given argument is the variable $line
- " - tells bash to expand any variables contained within the quotes while keeping spaces and special shell characters saved in the variable from being split or interpreted by the shell.
- $line - the variable containing our active markdown link from the text document
- " - the closing quote ending the argument and the expansion enclosure
- | - This is a pipe, which sends the standard output of the command on the left to the standard input of the command on the right. Meaning we're taking the markdown link currently saved in the variable and feeding it into sed
- sed - invokes sed so we can manipulate our text, and because sed is receiving piped input, and we've specified no flags, the modified text will be printed to standard output.
- 's/%20/-/g' - Our pattern match/substitution, which will find all occurrences of the string %20 in the markdown link fed into sed and replace them with -.
- )" - closes our command sequence for command substitution, and the quoted string. At this point the text printed to standard output by sed is saved to the variable dashlink
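To see what the grep, while/read, and sed pieces described above produce, here is a small sketch that runs them against a throwaway sample file and only prints the before/after pairs. The sample content and the converted variable are made up for illustration, and GNU grep with -P support is assumed:

```shell
#!/bin/bash
# Sketch only: works on a temporary sample file, nothing is edited in place.
sample="$(mktemp)"
printf '%s\n' \
  '[Just a test](#Just%20a%20test.md)' \
  '[Just a link](https://mylink/%20with%20space.com)' \
  '%20' > "$sample"

# -P enables Perl-style regex (needed for the (?!https) lookahead),
# -o prints only the matching part of each line.
mdlinks="$(grep -Po ']\((?!https).*\)' "$sample")"

converted=""
while IFS= read -r line; do
  dashlink="$(echo "$line" | sed 's/%20/-/g')"
  echo "$line -> $dashlink"
  converted+="$dashlink"$'\n'   # collected only so the result can be inspected afterwards
done <<<"$mdlinks"

rm -f "$sample"
```

Only the internal link is picked up by grep; the https link and the bare %20 never reach the loop.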
The next line is sed -i "s/$line/${dashlink}/" /path/to/file, which uses sed to take the line and dashlink variables and use them to find the exact original markdown link in the text containing the %20 sequences, and replace it with the properly formatted markdown link with dashes.
- sed -i - invokes sed and uses the -i flag to edit the file in place.
- " - The double quote enclosure allows the expansion of variables in the pattern match/replacement sequence so it searches for the markdown link, and not the literal text string $line.
- s/ - opens our match/modify sequence.
- $line - the original markdown link that will be found
- / - ends the pattern matching section
- ${dashlink} - The variable containing the previously modified markdown link that now has dashes. This expands to that properly formatted link, which will be written into the text file, replacing the malformed link. I don't know why this link has to be enclosed in curly braces while the first one does not.
- /" - ends the text modification section and closes the variable expansion.
- /path/to/file - the file to be worked on

Finally we have done <<<"$mdlinks", which ends the while loop and feeds the mdlinks variable into it.
- done - closes the while loop
- <<< - This feeds the given argument into the while loop for processing
- " - expands the variable within while keeping spaces and special characters from being split or interpreted by the shell
- $mdlinks - the variable we're feeding in with all of our links containing %20, except for https links.
- " - closes the variable expansion.

If you've never written/created your own bash script, here's what you need to do.
- In your home directory, or in the directory you're working in with these files, use a plain text editor such as vim, nano, gedit, or kate to create a new file. Call the file whatever you want.
- Paste the entirety of the script text into the file. Modify the file paths as needed to point at the file you want to work on. If you're working on multiple files, you'll need to update the script with each new file path as you finish one and move on to the next.
- Save and exit the file.
- Make the file executable at the terminal with
  sudo chmod +x /path/to/script/file
- To run it:
  - Change directory to the directory that contains the script file (if you're not already there)
  - At the command line use the command
    . ./name-of-script-file
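One possible way to avoid editing the script for every new file is to take the file paths as arguments. The following is only a sketch under that assumption: the function name fix_file is hypothetical, and GNU grep (-P) and GNU sed (-i) are assumed.

```shell
#!/bin/bash
# Hypothetical variant of the script above: apply the %20 -> - replacement
# to every file passed on the command line instead of a hard-coded path.
fix_file() {
  local file=$1 mdlinks line dashlink
  # grep exits non-zero when nothing matches; in that case there is nothing to do.
  mdlinks="$(grep -Po ']\((?!https).*\)' "$file")" || return 0
  while IFS= read -r line; do
    dashlink="$(echo "$line" | sed 's/%20/-/g')"
    # The matched link text is used as a sed pattern here, so links that
    # contain a / (or other characters special to sed) would need escaping first.
    sed -i "s/$line/${dashlink}/" "$file"
  done <<<"$mdlinks"
}

for file in "$@"; do
  fix_file "$file"
done
```

Saved under a name of your choice and made executable, it would be called with one or more markdown files as arguments.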
-
[email protected]replied to [email protected] last edited by
Here's a solution with perl (assuming you don't want to change http/https after the start of ( instead of start of a line):

    perl -pe 's/\[[^]]+\]\(\K(?!https?)[^)]+(?=\))/lc $&=~s|%20|-|gr/ge' ip.txt
-
[email protected]replied to [email protected] last edited by
skip the following substitute command if the line contains an http link in markdown format
Why do you assume there's only one link in the line?
-
[email protected]replied to [email protected] last edited by
I didn't test this, but it will change the whole URL, while changes are only needed in its fragment component (after the first #).
-
[email protected]replied to [email protected] last edited by
Obligatory regex was a mistake post
-
[email protected]replied to [email protected] last edited by
Hmm, OP mentioned "Only edit what’s between parentheses" - don't see anywhere that whole URL shouldn't be changed...
-
[email protected]replied to [email protected] last edited by
Paths are constant, only anchors are generated by forgejo.
-
[email protected]replied to [email protected] last edited by
Why do you assume there's only one link in the line?
They did not want external (http) links to be modified as that would break it:
- [Example](https://example.com/#Some%20Link)
- [Example](https://example.com/#some-link)
I compromised by thinking that it might be unlikely enough to have an external http link AND an internal link within the same line. You could probably still do it; my first thought was [^h][^t][^t][^p], but that would cause issues for #ttp and #A, so I just gave up. Instead I think you'd want a different approach, like breaking each link onto its own line, doing the same external/internal check before the substitution, and joining the lines afterward.

Also, you perform substitutions in the whole URL instead of the fragment component

That requirement I missed. I just assumed the filename would be replaced the same way too, lol. Not too hard to fix tho.
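As a rough illustration of the fragment-only requirement, here is a sketch that works on a single link target using bash parameter expansion (bash 4 or later is assumed for the lowercase expansion). The function name fix_target is hypothetical, and it expects just the text between the parentheses:

```shell
#!/bin/bash
# fix_target is a hypothetical helper: it takes the text between ( and ) of a
# markdown link and rewrites only the fragment (the part after the first #).
fix_target() {
  local target=$1 path fragment
  # Leave external links and targets without a fragment untouched.
  if [[ $target == http* || $target != *'#'* ]]; then
    printf '%s\n' "$target"
    return
  fi
  path=${target%%#*}                 # everything before the first #
  fragment=${target#*#}              # everything after the first #
  fragment=${fragment//'%20'/-}      # %20 -> -
  printf '%s#%s\n' "$path" "${fragment,,}"   # ,, lowercases the fragment
}

fix_target 'Another%20markdown%20file.md#Hello%20World'
fix_target 'https://example.com/#Some%20Link'
```

The file name in front of the # keeps its %20 sequences and capitalization, and the external link is printed back unchanged.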
-
-
[email protected]replied to [email protected] last edited by
Don't reinvent the wheel! https://github.com/jgm/pandoc
-
[email protected]replied to [email protected] last edited by
Not home so I can't try it, but do you need to be so specific to match the whole markdown syntax?
You might be able to get away with

    s/#(\w+%20)*\w+\.\w{2,3}/\L&/g; /#(\w+%20)*\w+\.\w{2,3}/ s/%20/-/g

Basically: match #this%20is%20LIKELY%20a%20link.md (as opposed to matching the whole markdown link), lowercase that entire match, then, on lines matching stuff that looks like that, replace the %20 with a hyphen (combined into a single sed command). This only fails when an http link falls within the same line as a markdown hyperlink.
-
[email protected]replied to [email protected] last edited by
Hello !!!
Sorry for the very late response, I had something else to do. I will read everything carefully and respond to every post. I also thought about it overnight, and I think that sed and regex weren't the best option here (as others have mentioned).
I think a python script or bash (as you mentioned a bit later) would be a better way. I'm sorry that I put you through all of this... wrong tool for the job :s.
-
[email protected]replied to [email protected] last edited by
First, thanks again for sharing your knowledge with me. I really appreciate the time/effort you took to write all of this. I know that's a lot of thank-yous, but I'm really grateful for all of this; this is very valuable information I will keep in my knowledge base. It's really time I learn proper bash/python/Perl scripting with all those tools (grep/sed/regex).
Second, YOU MISSED A DAMNED parenthesis you fool xD !

    mdlinks="$(grep -Po ']\((?!https).*\)' ~/mkdn)"

It took me some time to figure it out, with a very uninformative error:

    bashscript.sh: line 8: unexpected EOF while looking for matching `"'

But as expected, it works!

    From
    -------
    [Just a test](#Just%20a%20test.md)
    [Just a link](https://mylink/%20with%20space.com)
    %20

    To
    -------
    [Just a test](#Just-a-test.md)
    [Just a link](https://mylink/%20with%20space.com)
    %20
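One way to catch this kind of unbalanced quote or parenthesis before running a script is bash -n, which parses a script without executing it. The sketch below writes the broken and the corrected line into two temporary files and checks both; the helper name check_syntax and the file contents beyond the mdlinks line are made up for the demonstration.

```shell
#!/bin/bash
broken="$(mktemp)"
fixed="$(mktemp)"

# The quoted 'EOF' delimiter keeps the shell from interpreting the heredoc body.
cat > "$broken" <<'EOF'
#!/bin/bash
mdlinks="$(grep -Po ']\((?!https).*\)' ~/mkdn"
echo "$mdlinks"
EOF

cat > "$fixed" <<'EOF'
#!/bin/bash
mdlinks="$(grep -Po ']\((?!https).*\)' ~/mkdn)"
echo "$mdlinks"
EOF

# bash -n only parses the file; a non-zero exit status means a syntax error.
check_syntax() {
  if bash -n "$1" 2>/dev/null; then echo "ok"; else echo "syntax error"; fi
}

broken_result="$(check_syntax "$broken")"
fixed_result="$(check_syntax "$fixed")"
echo "broken: $broken_result"
echo "fixed:  $fixed_result"

rm -f "$broken" "$fixed"
```

Dropping the 2>/dev/null shows the parser's own message, which at least points at the file and a line number.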
Next, to show you my appreciation, and not to take everything for granted and be spoon-fed everything, I tried to find a solution myself for something else. I will try to explain as best I can how I solved it.
    From
    -------
    [Just a test](Another%20markdown%20file.md#Hello%20World)

    To
    -------
    [Just a test](Another%20markdown%20file.md#hello-world)
The part before the hashtag needs to keep its initial form (it links to the original markdown file). So, instead of just playing around with Perl and regex (which doesn't end well when done blindly without the proper knowledge), I did some simple string manipulation. It's not very elegant but does the trick, thanks to your well-written breakdown.
- I printed out the $mdlinks variable just to see what it prints out
- Copied and changed your Perl regex to find the first hashtag (#) and save it into a new variable ($mdlinks2)
- Fed your $mdlinks variable into my new Perl regex
- Fed my new variable into done? (I'm a bit confused here but okay xD)
    #! /bin/bash
    # collect every non-https link target
    mdlinks="$(grep -Po ']\((?!https).*\)' "/home/dany/newtest.md")"
    echo "$mdlinks"
    # keep only the part from the first # onward
    mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"
    echo "$mdlinks2"
    while IFS= read -r line; do
      dashlink="$(echo "$line" | sed 's|%20|-|g')"
      sed -i "s/$line/${dashlink}/" "/home/dany/newtest.md"
    done <<<"$mdlinks2"
Yes, it's not very elegant, but it's the best I could do currently. However, I still got a YES effect.
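The To example above also shows the anchor in lowercase, which the %20 replacement alone does not produce. One possible way to add that, sketched here under the assumption that GNU sed is available (\L is a GNU extension), is a second substitution in the same sed call. The helper name to_anchor is made up:

```shell
#!/bin/bash
# to_anchor is a hypothetical helper: it turns the "#..." tail of a link
# into the dashed, lowercase form.
to_anchor() {
  # First replace every %20 with -, then lowercase the whole string.
  echo "$1" | sed 's|%20|-|g; s|.*|\L&|'
}

to_anchor '#Hello%20World)'
```

In the loop, the result of such a helper could take the place of the plain %20 replacement when building dashlink.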
To answer your question:
Quick question as I’m working on this, in the new link example, is the BDMV and other capitalized text in this link supposed to be converted to lowercase, or to remain uppercase?
As you can see in my string manipulation above, the part before the # needs to keep its original form (sorry, I wasn't aware of this before working with the original files). I solved it with some string manipulation as shown above.
I'm a bit tired from all this searching and trial and error. Tomorrow I will try to wrap everything up and answer your post below! Also, I need to clean up the mess I made in my home directory xD.
Thanks again for your help ! Have a good night/day !
-
[email protected]replied to [email protected] last edited by
Oh god! I'm sorry about the missing )! I must have dropped it when copying things from my notes over to post the comment! (≧▽≦)

Despite my error, I'm glad it worked, and even happier that you were able to take what we had worked out and modify it further to fit your other requirements. It's fun helping each other out, and it's also great learning.
I learn by problem solving, so I've got all my notes from working on this in my knowledge base as well!
In the future, feel free to ping me if you need help with other linux/cli/bash things. As I've mentioned before I'm no expert, but happy to help where I can.
-
[email protected]replied to [email protected] last edited by
No apologies necessary!